Jan 9, 2021

How to use Docker like an expert

My complete rundown on (almost) everything you need to know about Docker.

Docker is a really easy-to-use system for packaging applications and running them. It is easy to get started, but it takes a lot of knowledge to use Docker the "right" way. Here is a big dump of container-related information that I have picked up over the years.

I have also created an example repository that shows some of the techniques I describe here--specifically multi-stage builds and local Apt and pip caching servers.

Some terminology to remember

This document uses a lot of jargon. Here are some simple definitions.

The basics

A Dockerfile is a file containing a list of commands used to build an image. A Dockerfile is just one of many ways to create an image.
An image is a blueprint to create containers. An image is a snapshot of the filesystem needed to run the process(es) within the container.
Docker images are built and stored in layers--typically corresponding to Dockerfile commands. Docker builds typically perform layer caching. Docker typically only rebuilds a layer if the command to create it has changed--or if a layer above it has been rebuilt.
A Dockerfile is typically paired with a build context, which is a directory of local files that might be used to create the image. By default, Docker assumes the build context is the directory containing the Dockerfile--minus any files specified in the directory's .dockerignore file.
Containers are processes or groups of processes running under various security and isolation features built into the operating system kernel. The concept of a container long predates Docker. The Unix chroot command has offered container-like features since its development in 1979. Docker itself is now just one of multiple projects implementing standards developed by the Open Container Initiative.

Storage

Files created during each layer of an image build are persisted in the final image. If a file is created in one layer and then deleted in a layer below it, the file won't be visible in the container's filesystem, but it still exists in the image because the image stores each layer separately. Layers cannot modify layers above them.
When a container is started from an image, a new temporary layer is created. Data written to this container layer may be written to disk--but only temporarily. The data does not persist after the container is shut down.
A tmpfs mount can be mounted to a directory within the container. Just like writes to the container layer, data in a tmpfs mount does not persist after the container is stopped. Data in a tmpfs mount is (typically) written to memory--not the disk.
A bind mount or a volume can be used to persist data after the container has shut down, or to share data with the host or other containers.
- A bind-mount is a mapping from any directory in the container filesystem to any directory on the host filesystem.
- A volume is a mapping from any directory in the container filesystem to a Docker-managed storage location. The default Docker volume driver maps a directory in the container to a Docker-managed directory on the host filesystem, but there are numerous Docker storage plugins that map volumes to other storage solutions like Ceph, Network File System (NFS), or Amazon S3-style object storage.
It is not possible to use a tmpfs mount, bind mount, or volume when building an image. All of the files you need to build an image should be in the build context, accessible over the network, or created by a command. If you want to connect multiple disparate build contexts, you can build a Docker image for every build context and then unify them with a multi-stage build.
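
As a rough sketch, here is how the three runtime mount types described above might look in a docker-compose.yml. The service name, image name, volume name, and all paths are placeholders that I made up for illustration.

```yaml
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    # tmpfs mount: typically held in memory and discarded when the
    # container stops.
    tmpfs:
      - /tmp/scratch
    volumes:
      # Bind mount: a directory on the host mapped into the container.
      # The source directory is assumed to already exist on the host.
      - type: bind
        source: ./config
        target: /etc/your-app
        read_only: true
      # Named volume: Docker-managed storage that persists after the
      # container has shut down.
      - type: volume
        source: your-app-data
        target: /var/lib/your-app

volumes:
  your-app-data:
```

These mounts only apply when the container runs. They are not available to the commands in the Dockerfile while the image is being built.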

Networking

Docker comes with several networking drivers:

none -- Disables all network interfaces for the container other than the container's loopback interface. A loopback interface is a virtual network interface that only communicates with itself.
host -- The container shares the host's networking namespace. The container doesn't end up with its own IP address and it has access to all of the host machine's network interfaces.
overlay -- A distributed network that spans multiple Docker hosts--commonly used with Docker Swarm so containers on different machines can transparently communicate with each other.
bridge -- An isolated network on a single Docker host. Containers can communicate with other containers on the same bridge network, but are isolated from containers connected to other networks.
macvlan -- A network mode that assigns virtual MAC addresses for containers--to be used with applications that expect to interact with a physical network interface.

Each of these drivers--except for none and host--creates virtual networks that containers connect to by name, although in some cases Docker may create a default network.

Networks created by these drivers can be used when running containers and/or when building images. For example, we might operate a local package caching server listening on the build host's loopback interface. If we want package managers in the Dockerfile to access the caching server, we can build with the host network driver. When we finish building the image and are using it to spawn multiple containers, they can communicate with each other using a bridge or overlay network, but then they will lose access to the host's loopback interface.

Here is an example docker-compose.yml file for using the host network to build an image, and then a named bridge network when running the container:

  
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    restart: always
    build:
      context: ./hello
      network: host
    networks:
      - my-bridge-network

networks:
  # There are additional config parameters you can apply
  # for your bridge network.
  my-bridge-network:

There also exist a few Docker networking plugins that add additional functionality.

The Docker daemon

Whether it is building images or running containers, the Docker client communicates with the long-running Docker daemon on the host machine. Typically, the Docker client communicates with the daemon via a special file on the host known as the Docker daemon socket. By default, the Docker daemon socket is located at /var/run/docker.sock on the host machine, but it may be elsewhere if the system is using a different container runtime such as containerd. The Docker daemon can also listen to a TCP socket and communicate via HTTP or HTTPS.

What this guide does not cover

This guide does not cover:

image build improvements like Docker BuildKit or rootless containers.
other container tools like Buildah, Podman, and Skopeo.
container orchestration platforms like Kubernetes, Docker Swarm, or HashiCorp Nomad.
Kubernetes distributions like k3s, microk8s, Rancher, or OpenShift.
non-Linux container operating systems, such as Microsoft Windows containers.
container networking with IPv6.

Manage your settings with Docker Compose

You can build Docker containers directly by passing various parameters to the docker command, but the number of parameters can add up. You may find it easier to describe how to build your image and run the resulting container with a docker-compose.yml file.

You can simultaneously describe multiple Docker images, containers, networks, and volumes that depend on each other.
You can compose configurations by telling Docker Compose to merge multiple YAML files. Redundancy within YAML files can be reduced with features like YAML anchors and aliases.
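
Here is a small sketch of the anchors-and-aliases idea. The service and image names are placeholders, and x-common is a name I picked for illustration. Top-level keys that start with x- are treated as extension fields, which makes them a convenient place to hang an anchor.

```yaml
version: "3.9"

# Shared settings defined once and anchored as "common".
x-common: &common
  restart: always
  networks:
    - my-bridge-network

services:
  web:
    # Merge the anchored mapping into this service, then add to it.
    <<: *common
    image: your-web-image-here
  worker:
    <<: *common
    image: your-worker-image-here

networks:
  my-bridge-network:
```

Merging multiple files works differently: you pass each file with its own -f flag, as in docker-compose -f docker-compose.yml -f docker-compose.override.yml up, and later files override or extend earlier ones.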

While Docker Compose is typically used for local development, a few modifications to your docker-compose.yml may allow you to deploy it to:

Docker Swarm
Kubernetes (with Kompose)
AWS Elastic Container Service (with Compose CLI)
Azure Container Instances (with Compose CLI)

There are also alternatives to Docker Compose. Tilt is a tool that makes it easy to run Kubernetes configs locally.

Use an init system in your container

Each Docker container has its own isolated process table. The first process spawned in each container is given the process ID (PID) of 1.

Typically in Unix-like operating systems, the process operating as PID 1 is the init system and has some special responsibilities. PID 1:

is the ancestor of all other processes.
keeps track of its children and is responsible for reaping defunct or "zombie" processes.
has a special responsibility to handle or forward Unix signals to child processes.

The host operating system typically has a specialized init system as the host's PID 1--like launchd on Mac OS X or systemd on many Linux distributions.

As a result, most software is written with the expectation that it doesn't need to accept the responsibilities of being PID 1. When such software is run in a container, it may not properly handle signals or reap zombies. This may not be a problem if your container only starts one process, but a more solid fix is to install a lightweight init system in your container. Below are the Dockerfile commands used to install the tini init system in your image.

# Download a statically-linked binary of the tini init system from GitHub.
ENV TINI_VERSION=v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-static /tini
RUN chmod +x /tini

# Set the init system to be the container entrypoint.
ENTRYPOINT ["/tini", "--"]

# Now we run this command, and the resulting processes will be managed by tini.
CMD ["run-my-command"]

Recent versions of Docker and other runtimes can automatically wrap your container with an init system if you ask, but for the sake of compatibility, it is better to install an init system inside your image as shown above.
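
For comparison, this is roughly what asking the runtime for an init process looks like in a docker-compose.yml. It is a minimal sketch with placeholder service and image names, and it assumes a Compose file format recent enough to support the init key.

```yaml
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    # Ask Docker to run an init process as PID 1 inside the container.
    # The image's own command then runs as a child of that init process.
    init: true
```

This has the same effect as passing --init to docker run. It only helps on runtimes that honor the setting, which is why baking an init system into the image is the more portable choice.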

Run your program as a non-root user

By default, your Docker image is built as the root user and the resulting container is run as the root user. This can be a significant security concern, as the root user in the container is the same as the root user on the host.

Docker allows you to pick a different UID (user) and GID (group) for building the image and/or running the container. These UID/GID numbers do not have to correspond to named users on either the host or the guest.
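
As a minimal sketch, picking a numeric UID and GID at run time could look like this in a docker-compose.yml. The service and image names are placeholders, and 10000 is just an example value.

```yaml
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    # Run the container's process as UID 10000 and GID 10000.
    # Neither number needs a matching entry in /etc/passwd or /etc/group.
    user: "10000:10000"
```

The quotes keep YAML from interpreting the value as anything other than a string.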

However, it is a little clearer to create a non-root user in your image and then run subsequent commands as that user.

In this Dockerfile example, we create a user and group named anubis with a UID and GID fixed at 10000. UIDs below 500 are typically reserved on the host for system services, and human users on the host may have UIDs starting from 1000. For better security and isolation, we want to avoid having our container UID accidentally match a user on the container's host machine.

ARG CONTAINER_UID=10000
ARG CONTAINER_GID=10000
ARG CONTAINER_USERNAME=anubis
ARG CONTAINER_GROUPNAME=anubis
RUN groupadd --gid=${CONTAINER_GID} --force ${CONTAINER_GROUPNAME} \
    && useradd --create-home --home-dir=/home/${CONTAINER_USERNAME} --shell=/bin/bash --uid=${CONTAINER_UID} --gid=${CONTAINER_GID} ${CONTAINER_USERNAME}

You can switch away from root to this user during the build process with the Dockerfile USER command, but you should only do this once you are done installing packages. That looks like:

# Create the user1 user. See above for the full syntax.
RUN useradd ... user1 ...
# Create the user2 user. Ditto.
RUN useradd ... user2 ...

# We're still running as the root user
RUN cmd1

# Now we switch to user1
USER user1

# Now this command is running as user1 at build-time.
RUN cmd2

# Now switch to user2
USER user2

# When the image is used to start a container, cmd2 will run as user2.
CMD ["cmd2"]

However, switching away from the root user while building the image is not genuinely rootless. By default, access to the Docker daemon implies root privileges. Podman and other tools are working to create genuinely rootless containers, which would be able to make stronger security guarantees.

One area where you may run into trouble is when bind-mounting directories from the host into the container. Files and directories created by the host will be owned by the creating host user's UID and GID. Files and directories created by the guest will be owned by the creating container user's UID and GID. One way to handle this is with a tool called fixuid (see "Matching the container UID to the host" below).

Use local caches for package installation

Docker images are stored as layers, which typically correspond to individual Dockerfile commands. If you try to build the same image twice without changing any of the files in the build context, the second build will use the cached layers.

However, if you make changes that affect a given layer, then that layer and all the layers "below" it must be rebuilt. If your Dockerfile makes calls to package managers like Apt or pip, you may end up repeatedly downloading the same packages if you repeatedly change the relevant layers. You can make these builds run a lot faster by installing local package caching servers and configuring your container build process to use them.

My example repository shows the Dockerfile and the docker-compose.yml configurations to build and run caching servers for Apt and pip.

Here is how to configure Apt and pip clients in your Dockerfiles to use the servers in my example repository:

Apt

The program apt-cacher-ng acts as a caching proxy that downloads packages to a directory the first time you install packages and then uses its downloaded cache for any subsequent reinstallations.

Traditionally, Apt repositories were served via HTTP instead of HTTPS. The downloaded files were not encrypted in transit, but Apt verifies downloaded files with GPG.

Nowadays, many Apt repositories are only served over HTTPS, and the extra layer of encryption makes it a little harder for apt-cacher-ng to transparently proxy Apt and cache Apt downloads.

To help apt-cacher-ng properly cache repositories via HTTPS, you need to run these commands in your container to change the repository URLs in your sources.list files to run through your apt-cacher-ng proxy. You will need to specify your proxy's URL as a build-time argument called LOCAL_APT_CACHE_URL.

# Insert an Apt proxy in this container, but only if this build-time
# environment variable is defined.
ARG LOCAL_APT_CACHE_URL=
RUN if [ ! -z ${LOCAL_APT_CACHE_URL} ]; \
    then \
    # This first `find` command replaces HTTP Apt repository URLs to run
    # through our apt-cacher-ng proxy.
    find /etc/apt/sources.list /etc/apt/sources.list.d/ \
    -type f -exec sed -Ei '\#'${LOCAL_APT_CACHE_URL}'#!s!http://!'${LOCAL_APT_CACHE_URL}'/!g' {} \; \
    # This second `find` command replaces HTTPS Apt repository URLs
    # to run through our apt-cacher-ng proxy. If we don't do this,
    # then apt-cacher-ng will not be able to cache packages from
    # HTTPS repositories.
    && find /etc/apt/sources.list /etc/apt/sources.list.d/ \
    -type f -exec sed -Ei '\#'${LOCAL_APT_CACHE_URL}'#!s!https://!'${LOCAL_APT_CACHE_URL}'/HTTPS///!g' {} \; ;\
    fi

Then, if your sources.list file referenced a package repository at https://example.com and your LOCAL_APT_CACHE_URL is set to http://localhost:3000, then the repository is rewritten as http://localhost:3000/HTTPS///example.com. If you have apt-cacher-ng running, it will receive requests for new packages and then create a new HTTPS connection to https://example.com.

If you add additional Apt repositories after this find-and-replace job, you will need to run the find-and-replace again to ensure that apt-cacher-ng caches those repositories too.

When you are done installing Apt packages through the proxy, you can remove it with this command:

# Reverse our Apt caching proxy so that Docker images that inherit from this
# one receive a normal configuration.
RUN if [ ! -z ${LOCAL_APT_CACHE_URL} ]; \
    then \
    # Reverse the HTTPS URLs first.
    find /etc/apt/sources.list /etc/apt/sources.list.d/ \
    -type f -exec sed -Ei 's!'${LOCAL_APT_CACHE_URL}/HTTPS///'!https://!g' {} \; \
    # and then the HTTP URLs.
    && find /etc/apt/sources.list /etc/apt/sources.list.d/ \
    -type f -exec sed -Ei 's!'${LOCAL_APT_CACHE_URL}/'!http://!g' {} \; ;\
    fi

pip

Locally caching pip packages is a little easier than Apt packages. We just need to change pip's configuration to use the package proxy. You will need to specify a pip proxy URL with the build-time environment variable LOCAL_PYPI_CACHE_URL.

# Insert a PyPI proxy in this container, but only if this build-time
# variable is defined.
ARG LOCAL_PYPI_CACHE_URL=
ARG PYPI_CACHE_PROXY_CONFIG_FILE=/etc/xdg/pip/pip.conf
RUN if [ ! -z ${LOCAL_PYPI_CACHE_URL} ]; \
    then \
    mkdir -pv $(dirname ${PYPI_CACHE_PROXY_CONFIG_FILE}) \
    && echo "[global]" > ${PYPI_CACHE_PROXY_CONFIG_FILE} \
    && echo "index-url = ${LOCAL_PYPI_CACHE_URL}/root/pypi/+simple/" >> ${PYPI_CACHE_PROXY_CONFIG_FILE}; \
    fi

When we are done installing Python packages, we can reverse this proxying configuration by deleting the file.

# Reverse our pip caching proxy so that Docker images that inherit from this
# one receive a normal configuration.
RUN if [ ! -z ${LOCAL_PYPI_CACHE_URL} ]; \
    then \
    rm -f ${PYPI_CACHE_PROXY_CONFIG_FILE}; \
    fi
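
As a minimal sketch of how the two build arguments above might be supplied, here is a docker-compose.yml fragment. The service name, image name, and build context are placeholders. The Apt URL reuses the http://localhost:3000 example from earlier, and the PyPI URL is an address I made up, so substitute wherever your caching servers actually listen.

```yaml
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    build:
      context: ./hello
      # Build on the host network so that package managers in the
      # Dockerfile can reach caches listening on the host's loopback
      # interface.
      network: host
      args:
        # Assumed address of a local apt-cacher-ng instance.
        LOCAL_APT_CACHE_URL: http://localhost:3000
        # Assumed address of a local PyPI caching server.
        LOCAL_PYPI_CACHE_URL: http://localhost:3141
```

If you leave either argument out, the corresponding ARG keeps its empty default and the Dockerfile skips the proxy setup for that package manager.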

Use multi-stage builds

For languages like Python that are typically interpreted, you typically deploy the image with the same interpreter and libraries that you develop with.

However, for languages like C, C++, Go, or Rust that are typically compiled to a binary, you need to develop with a compiler, but only need to deploy the resulting binary.

Docker has a feature called multi-stage builds that makes it possible to use one stage of an image to compile source code into a binary, but then only select your (compiled) files of choice for the final stage.

# This is the first stage of our `Dockerfile`. Nothing in this
# stage ends up in the final image unless we deliberately copy it.
# This image has the Go compiler, which we use to create a binary
# named `hello`.
FROM golang:1.15.6-alpine3.12 AS builder
COPY hello.go /app/
WORKDIR /app
RUN CGO_ENABLED=0 go build hello.go


# This is the final stage of our `Dockerfile`. We copy
# the compiled `hello` binary and set it as our command.
FROM alpine:3.12
COPY --from=builder /app/hello /bin/hello
CMD ["/bin/hello"]

Faster builds with `.dockerignore`

Docker images are built by specifying a filesystem directory as the build context. All of the files in the build context are compiled into a tarball and sent to the Docker daemon--including files that are not added to the image by the Dockerfile.

To make your Docker builds run much faster, you can exclude unnecessary files from your Docker build context using a .dockerignore file. The syntax is similar to a .gitignore file but with two main differences:

.gitignore files can be placed anywhere in a project tree. A .dockerignore file can only be placed at the root of the build context.
.dockerignore glob patterns need to be prefixed with **/ in order to make them recursive.

Mount secrets as files or environment variables

There are two types of secrets to consider:

build-time secrets that are used when building the container
runtime secrets that are used when running the container

Build-time secrets

It is tricky to use secrets when building a Docker image because you may end up leaking the secret into the resulting image. The safest thing is to get the secret via the network and avoid writing it to the image's layered filesystem.

If you find yourself needing secrets at build-time, consider enabling builds with Docker BuildKit--which has built-in support for build-time secrets.

Runtime secrets

You can try passing runtime secrets to your container in three ways:

Bind-mounted files or volumes. You can write a secret value to a file on the host and bind-mount the file into the container. Processes in the container can then read the file and use the secret as needed.
Environment variables. Setting environment variables when starting a container is a common way to specify secret values. These environment variables are visible via the ps command or the /proc virtual filesystem, although only to the host root user and to the host user that owns the container process. Environment variables may also be leaked to other Docker containers via the --link option. Logging and error-reporting libraries in the container process may also report environment variables when an error is encountered. Do not set secret environment variables with the ENV command in the Dockerfile, because then the secret will be leaked into the resulting Docker image.
A network endpoint. Credentials can also be received via a network service outside of the container that trusts it. For example, if you deploy containers to Amazon Web Services (AWS), they can receive temporary credentials for accessing AWS services via the internal EC2 metadata HTTP endpoint. For on-premises deployment, HashiCorp Vault is a service that offers access to secrets via an HTTP API, but your container must have a way to authenticate itself to your Vault server.

Docker Swarm provides secrets via bind-mounted files. Kubernetes can provide secrets via bind-mounted files or environment variables.
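
For local use, Docker Compose can provide a secret as a mounted file too. This is a sketch under a few assumptions: the service and image names are placeholders, db_password is a secret name I made up, and ./db_password.txt is a hypothetical file on the host that holds the value.

```yaml
version: "3.9"

services:
  your-service-name:
    image: your-image-name-here
    secrets:
      # Made available inside the container as /run/secrets/db_password.
      - db_password

secrets:
  db_password:
    # Hypothetical host file containing the secret value.
    # Keep this file out of version control and out of the build context.
    file: ./db_password.txt
```

The process in the container reads the secret from the file instead of from its environment, which avoids the leaks described above for environment variables.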

Advanced topics

Matching the container UID to the host

If you are using Docker to build local development tools, you might run into file permissions issues because your container's UID does not match the UID that owns files on your host filesystem. A tool called fixuid can help fix this issue, although the maintainers state that it should not be used in production services.

Docker-In-Docker (DIND)

You may want to build a Docker image for a program that itself manages Docker containers. This might be a container image analysis program like Dive, a continuous integration server like Jenkins that is configured to start containers, or you may be developing the next version of Docker from inside a container.

You can allow processes within a container to manipulate the host's Docker runtime by bind-mounting the Docker socket. By default, this is usually located on the host's filesystem at /var/run/docker.sock but may be different if you have customized your Docker settings or are using an alternative container runtime like containerd.

Bind-mounting the Docker socket has numerous security-related downsides. Any process with access to the Docker socket effectively has root permissions on the host, even if the process is running as a non-root user.
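
With that warning in mind, this is roughly what bind-mounting the socket looks like in a docker-compose.yml. The service and image names are placeholders, and the sketch assumes the socket is at its default location on the host.

```yaml
version: "3.9"

services:
  your-ci-service:
    image: your-ci-image-here
    volumes:
      # Give processes in this container control of the host's Docker
      # daemon. Treat this as granting root on the host.
      - /var/run/docker.sock:/var/run/docker.sock
```

A Docker client inside the container that talks to /var/run/docker.sock will then start sibling containers on the host rather than nested containers.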

Jérôme Petazzoni has written at length on Docker-In-Docker and the practice of bind-mounting the Docker socket.

Using Docker to run graphical applications

Jessie Frazelle has written a guide on how to build and run Docker containers for graphical desktop applications. She has also prepackaged dozens of open-source Linux desktop applications as Docker containers.