Large files on a custom Debian installer

Update: I have since written about a better, simpler way of doing this. Check it out at Third party files on a custom Debian Live installer.

Continuing my work on a custom Debian distribution, there was a third-party package I wanted to include in it. Unfortunately, it was proprietary software, it wasn’t included in Debian non-free, and it was a pretty large download (about 1GB).

Debian Live doesn’t seem to have a way to handle a case like this, or at least not one without significant drawbacks.

Initial approach

A first approach would be to download the package and store it alongside the build scripts, ready to be installed during the build process. The problem is that I keep my build scripts under source control on GitHub. Adding a 1GB file to a git repository is generally not a good idea, even less so when the rest of the repo weighs under 1MB (including all commits in its history). There’s also the matter of whether the software license would allow uploading it to GitHub or a similar service, which legally could constitute unauthorised distribution.

A better option would be to keep only a reference (a URL) to this package and download it during the build process. Here the problem is that the package would have to be downloaded every time we run a build: a large download that significantly slows down an already slow process and makes it more cumbersome to iterate quickly.

Debian Live maintains a cache directory where it keeps the Debian packages it downloads during a build. It should be possible to download our large package there and avoid re-downloading it if it already exists… except that the cache directory is not accessible from the chroot jail where custom scripts are run. It’s from one of these scripts that I would run the downloaded file (if it is an installer) or otherwise put it in the correct location, so access from them is necessary. Well, bummer.

I also tried keeping a separate cache directory next to the chroot hook scripts (that would be config/hooks/cache), but those scripts don’t appear to be run from that location, and again that file hierarchy doesn’t appear to be accessible.

My solution

In the end I went for something a bit more involved. I altered the build scripts to add the following:

- Before the build, a downloader script fetches the large files into a cache directory (skipping anything already downloaded) and copies the smaller local files there too.
- A small HTTP server is started, serving that cache directory on a local port.
- During the build, a chroot hook downloads the files it needs from that local server and installs them.
- Once the build finishes, the server is shut down.

In other words, although the chroot jail doesn’t allow us to copy files from the external filesystem, it doesn’t stop us from accessing files that are served over HTTP from that external filesystem.

The scripts

This is what my build script looked like after adding this feature:

#!/usr/bin/env sh

set -e

BASE_DIR=$(dirname "$0")/..
SCRIPTS_DIR="$BASE_DIR/scripts"
CACHE_DIR="$BASE_DIR/cache"
CONFIG_DIR="$BASE_DIR/config"
DOWNLOADS_DIR="$CACHE_DIR/downloads"
LOCAL_FILES_DIR="$CONFIG_DIR/files"
PID_FILE_PATH=$(mktemp)
mkdir -p "$DOWNLOADS_DIR"

DOWNLOADER_PATH="$SCRIPTS_DIR/downloader"
SERVER_PATH="$SCRIPTS_DIR/server"

DOWNLOADS_LIST_PATH="$CONFIG_DIR/downloads.list"

# Fetch/copy the listed files into the cache, then serve that cache over HTTP
# so that chroot hooks can reach it during the build.
"$DOWNLOADER_PATH" -d "$DOWNLOADS_DIR" -l "$LOCAL_FILES_DIR" "$DOWNLOADS_LIST_PATH"
"$SERVER_PATH" -P "$PID_FILE_PATH" "$DOWNLOADS_DIR"

lb build noauto "${@}" 2>&1 | tee build.log

# The build is done: shut down the HTTP server.
PID=$(cat "$PID_FILE_PATH")
kill "$PID"

For the final implementation, I divided the files I wanted to copy into two categories: large ones I needed to download from the Internet, and smaller ones that I was OK with adding to the repository. This second group can include file checksums, customised config files, and probably other things.

There is a “downloader” script, referenced as $DOWNLOADER_PATH, that reads a list of files ($DOWNLOADS_LIST_PATH). For each entry, if it looks like a URL, the file is downloaded from the Internet; other entries are expected to be files residing at $LOCAL_FILES_DIR. All files end up in $DOWNLOADS_DIR, renamed as indicated in each entry of the list.

A server script ($SERVER_PATH) is then run, and we take note of its pid. This way, we can cleanly shut it down after the build is done.

This is the downloader script:

#!/usr/bin/env sh

# -d: directory to place the files in; -l: directory holding the local files.
while getopts ':d:l:' opt; do
  case $opt in
    d)
      DOWNLOADS_DIR="$OPTARG"
      ;;
    l)
      LOCAL_FILES_DIR="$OPTARG"
      ;;
  esac
done

shift $((OPTIND-1))
DOWNLOADS_LIST_PATH="$1"

# Each line of the list is "<name> <url-or-local-file>".
while read -r NAME URL; do
  DEST_PATH="$DOWNLOADS_DIR/$NAME"
  if echo "$URL" | grep -qE '^https?://' ; then
    wget -c --output-document "$DEST_PATH" "$URL"
  else
    cp "$LOCAL_FILES_DIR/$URL" "$DEST_PATH"
  fi
done < "$DOWNLOADS_LIST_PATH"

As mentioned before, it’s fed a list of files ($DOWNLOADS_LIST_PATH) whose entries may refer to URLs or to local files (under $LOCAL_FILES_DIR). Everything ends up in $DOWNLOADS_DIR. Thanks to wget’s -c option, interrupted downloads are resumed and files that are already complete are not re-downloaded.

This is an example of a file list:

big-file https://example.com/big-file
small-file small_file_kept_locally

For each entry, the first word is the name that the downloaded/copied file will receive in the cache directory. The second word is either the URL to retrieve it from, or the name of a file under $LOCAL_FILES_DIR.
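Since the local files can include checksums (as mentioned earlier), one thing the build script could do after running the downloader is verify the large download before kicking off the build. This is only a sketch, not part of the original scripts, and the big-file.sha256 name is hypothetical:

# Hypothetical verification step for the build script, run right after the downloader.
# Assumes config/files/big-file.sha256 contains a line like "<sha256>  big-file".
CHECKSUM_PATH=$(readlink -f "$LOCAL_FILES_DIR/big-file.sha256")
(cd "$DOWNLOADS_DIR" && sha256sum -c "$CHECKSUM_PATH")

With set -e already in place in the build script, a failed check would abort the build.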

This script runs the HTTP server:

#!/usr/bin/env sh

# -P: path of the file where the server's PID will be written.
while getopts ':P:' opt; do
  case $opt in
    P)
      PID_FILE_PATH="$OPTARG"
      ;;
  esac
done

shift $((OPTIND-1))
WORKING_DIR="$1"
cd "$WORKING_DIR"

# Serve the working directory in the background and record the server's PID.
python -m SimpleHTTPServer 12345 &
echo $! > "$PID_FILE_PATH"

Nothing much to see here, apart from the option used to specify a PID file, which will be used to shut down the server cleanly at the end.
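One caveat: the server is started in the background, so in principle the build could begin before it is actually listening. The original scripts don’t guard against this, but a small wait loop in the build script, between starting the server and calling lb build, would do it; a minimal sketch:

# Poll the local HTTP server (for up to ~10 seconds) before starting the build.
for _ in $(seq 1 10); do
  if wget -q --spider http://localhost:12345/; then
    break
  fi
  sleep 1
done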

Finally, there’s the chroot hook that retrieves the files “over the fence” of the chroot jail and actually installs them. It will be something like this:

#!/bin/sh

set -e

WORKING_DIR=$(mktemp -d /tmp/tmp.XXXXXXXXXXXXX)

# Fetch the files from the HTTP server running outside the chroot.
cd "$WORKING_DIR"
wget http://localhost:12345/big-file
wget http://localhost:12345/small-file

# Do something with the downloaded big-file and small-file
# ...

# Finally, clean up after ourselves
rm -rf "$WORKING_DIR"
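For reference, a chroot hook like this one is just an executable script placed under config/hooks/ (the exact subdirectory and suffix, .chroot or .hook.chroot, vary across live-build versions). The file name below is only illustrative:

# Hypothetical placement; adjust the directory and suffix to your live-build version.
cp install-big-file.sh config/hooks/install-big-file.chroot
chmod +x config/hooks/install-big-file.chroot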