Awesome HTTP Load Balancing on Docker with Traefik
— Architecture, Cloud, HashiCorp, Docker — 6 min read
Why Traefik?
Traefik is the up-and-coming 'Edge Router / Proxy' for all things cloud. Full disclosure, I like it.
The cut-back feature set compared to products like F5, which I have used throughout my career, is refreshing - those products still have their place, and they can do some very cool stuff.
I'm a strong believer in avoiding technical debt when I'm building out my infrastructure and applications. Traefik is a production-scale tool, while still being nimble enough to run in the smallest of deployments. There are very few reasons why you shouldn't consider incorporating it into your stack now - especially if you are self-hosting and don't have something like AWS ELB available to you, or if Traefik has a killer feature you need!
This Tutorial
Traefik supports multiple orchestrators (Docker, Kubernetes and ECS, to name a few); however, for this tutorial I am going to cover Docker Swarm. Check out Why Single Node Swarm to see why I think you should be using Swarm over vanilla Docker, even locally.
In case you didn't get the message above, I am going to cover a scale-ready configuration for Traefik here: HA clustered support, so that as you add nodes, your Traefik service scales with you. Technical debt kept to a minimum!
What do I want to achieve?
- I run a VPS that hosts multiple websites, built on multiple technologies. Thankfully they are all now containerised (it wasn't pretty).
- I want TLS certificates by default on every URL.
- I want to future-proof this setup in case I need to scale.
Traefik hits all of these points. Let's dive in...
Our HA, Docker Swarm ready solution has a few moving parts:
Traefik Container
This container will be scheduled as a global service in our Swarm. This means that for every host in our Docker Swarm cluster, one instance of Traefik will be deployed. This will be our traffic workhorse, and in most scenarios it should have more than sufficient throughput. If a given Traefik instance is getting saturated, you have probably reached the point where you should scale horizontally (more hosts, not more CPUs).
Consul Container
Consul is used in this scenario as a configuration store. Traefik needs a repository for config data and certificates which is accessible from all nodes in the cluster. Consul has other features - it overlaps a lot with Swarm's service mesh - but these don't need to be configured for this use case. This will be deployed in an HA manner to ensure resilience to node failure.
Traefik Init Container
This is used to 'seed' our Traefik config into the Consul cluster when first starting up. Once it has done its job, it shuts down, and our scaled Traefik containers take over.
Overall Traefik Cluster Design
Overall Docker Swarm Traefik HA Cluster (diagram)
Our cluster comprises three Docker Swarm nodes, the recommended minimum for the Raft consensus algorithm to provide resilience to node failure. There are no prerequisites for this deployment other than working cluster communications between nodes (as detailed in the Swarm setup guide - https://docs.docker.com/engine/swarm/swarm-tutorial/).
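If you are building the three nodes by hand rather than via the Terraform manifest mentioned below, the Swarm bootstrap is roughly as follows (a sketch; the IP is a placeholder for your first node's address):

```
# on the first node, initialise the swarm
$ docker swarm init --advertise-addr 10.10.0.10

# print the join command (including token) for additional managers
$ docker swarm join-token manager

# run the printed command on each of the other two nodes, e.g.
$ docker swarm join --token <token> 10.10.0.10:2377
```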
The Terraform and Compose files for this tutorial are located on my GitLab - https://gitlab.com/fluffy-clouds-and-lines/traefik-on-docker-swarm.git
The solution shall deploy 4 services:
- A Traefik node on each manager host,
- A Consul node on each manager host (maximum 3),
- 'whoami' to facilitate testing,
- A standalone Traefik config generator.
Inside Traefik
Traefik has two main configuration areas: frontends and backends. Frontends respond to requests from the outside world; this is where, if used, TLS is applied to requests. Backends group one or more containers to serve a frontend.
In this deployment, frontends are created dynamically as a result of the docker labels declared on the whoami service (the backend group).
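Once the stack is running (we deploy it later in this post), you can inspect the frontends and backends Traefik has generated from those labels via its REST API (a sketch; this assumes the management port 8080 is reachable, and the exact provider path can vary):

```
$ curl -s http://<public ip>:8080/api/providers/docker
```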
Networking
Traefik and Consul shall expose their management UI ports in host mode. This means they will be exposed on each host they are deployed on; only hosts that have these services deployed shall have 8080 and 8500 available.
In addition, Traefik shall expose the front-end ports of 80 and 443 on the ingress service mesh. This means that every Docker node will listen on ports 80 and 443, with traffic routed internally by Docker to a Traefik instance, though not necessarily one running on the receiving node. This is to facilitate failover in the case of Traefik container failure.
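In Compose terms, the difference is simply the mode set on each published port; a sketch of the pattern used later in this post (ingress is the default when no mode is given):

```yaml
ports:
  # routed over the ingress mesh: reachable via any node in the Swarm
  - target: 80
    published: 80
    mode: ingress
  # bound directly on the node running the container, and only there
  - target: 8080
    published: 8080
    mode: host
```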
Deploying the Traefik Cluster
To deploy our stack we are going to use a Docker Compose file to define our services. Once deployed, the Traefik instances will poll Docker (specifically the Docker socket) for changes to deployed containers. Whenever a new container with the appropriate metadata is started, Traefik will begin routing traffic to it.
Environment Setup
A 3 node Docker Swarm deployment is required. If you don't have this, a Terraform manifest to deploy to AWS is available in this tutorial's repo.
You will need a DNS record that points to your Swarm cluster for the test service. I have chosen whoami2.docker.lab.infra.tomwerner.me.uk, but this will need changing to a domain name you own. For maximum availability this should resolve to all of your Swarm nodes (though in theory only one is required, since the Traefik frontend sits on the ingress mesh). I have achieved this by creating duplicate A records, one resolving to each node's IP.
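As a sketch, the resulting records look something like this (hypothetical node IPs from the documentation range):

```
whoami2.docker.lab.infra.tomwerner.me.uk.  300  IN  A  203.0.113.10
whoami2.docker.lab.infra.tomwerner.me.uk.  300  IN  A  203.0.113.11
whoami2.docker.lab.infra.tomwerner.me.uk.  300  IN  A  203.0.113.12
```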
The Compose File
The full Compose file is available in my GitLab repo, but breaking down the deployment:
```yaml
consul:
  image: consul
  command: agent -server -bootstrap-expect=3 -ui -client 0.0.0.0 -retry-join tasks.consul
  volumes:
    - consul-data:/consul/data
  environment:
    - CONSUL_LOCAL_CONFIG={"datacenter":"eu_west2","server":true}
    - CONSUL_BIND_INTERFACE=eth0
    - CONSUL_CLIENT_INTERFACE=eth0
  ports:
    - target: 8500
      published: 8500
      mode: host
  deploy:
    replicas: 3
    placement:
      constraints:
        - node.role == manager
    restart_policy:
      condition: on-failure
  networks:
    - traefik
```
Our Consul deployment is fairly out of the box. It ensures that:
- Consul is started in server mode,
- 3 servers are required before the cluster will bootstrap and hold a leader election,
- UI traffic can be accepted from all sources,
- the remaining nodes can be contacted via the alias tasks.consul. This is a bit of a 'trick': normally this would be a list of IP addresses, but thanks to Docker's internal round-robin DNS resolution, this alias returns the addresses of all 3 nodes in our scaled Consul cluster.
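You can observe this round-robin resolution for yourself once the stack is up, by attaching a throwaway container to the overlay network (this assumes the stack is deployed under the name traefik, which makes the network traefik_traefik):

```
$ docker run --rm --network traefik_traefik alpine nslookup tasks.consul
```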
```yaml
traefik_init:
  image: traefik:1.7
  command:
    - "storeconfig"
    - "--loglevel=debug"
    - "--api"
    - "--entrypoints=Name:http Address::80 Redirect.EntryPoint:https"
    - "--entrypoints=Name:https Address::443 TLS TLS.SniStrict:true TLS.MinVersion:VersionTLS12"
    - "--defaultentrypoints=http,https"
    - "--acme"
    - "--acme.storage=traefik/acme/account"
    - "--acme.entryPoint=https"
    - "--acme.httpChallenge.entryPoint=http"
    - "--acme.onHostRule=true"
    - "--acme.onDemand=false"
    - "--acme.email=certs@fluffycloudsandlines.blog"
    - "--docker"
    - "--docker.swarmMode"
    - "--docker.domain=docker.lab.infra.tomwerner.me.uk"
    - "--docker.watch"
    - "--consul"
    - "--consul.endpoint=consul:8500"
    - "--consul.prefix=traefik"
    - "--rest"
  networks:
    - traefik
  deploy:
    placement:
      constraints:
        - node.role == manager
    restart_policy:
      condition: on-failure
  depends_on:
    - consul
```
This is our 'bootstrap' config for Traefik. This service runs once at deployment time and then exits. It defines the configuration for the whole deployment: enabling Let's Encrypt certificates, enabling the Docker provider, and declaring Consul as our key-value store.
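To confirm the seeding worked, one option is to list the keys Traefik wrote under its prefix using Consul's HTTP API (again a sketch, assuming the traefik_traefik network name):

```
$ docker run --rm --network traefik_traefik curlimages/curl -s "http://consul:8500/v1/kv/traefik/?keys"
```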
```yaml
traefik:
  image: traefik:1.7
  depends_on:
    - traefik_init
    - consul
  command:
    - "--consul"
    - "--consul.endpoint=consul:8500"
    - "--consul.prefix=traefik"
  networks:
    - traefik
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  ports:
    - target: 80
      published: 80
    - target: 443
      published: 443
    - target: 8080
      published: 8080
      mode: host
  deploy:
    placement:
      constraints:
        - node.role == manager
    mode: global
    update_config:
      parallelism: 1
      delay: 10s
    restart_policy:
      condition: on-failure
```
The Traefik service is again fairly out of the box. All configuration is pulled from Consul (which is seeded by the traefik_init service). This container is deployed on all manager nodes in this solution - the constraint is in place because Traefik requires access to a manager's Docker socket to operate. Consider the decoupling options at the end of this post if you wish to run Traefik on worker nodes.
```yaml
whoami0:
  image: containous/whoami
  networks:
    - traefik
  deploy:
    replicas: 6
    labels:
      traefik.enable: "true"
      traefik.frontend.rule: 'Host: whoami2.docker.lab.infra.tomwerner.me.uk'
      traefik.port: 80
      traefik.docker.network: 'traefik_traefik'
```
Our last service: whoami. This allows us to test the deployment. The labels defined here are used by Traefik to configure itself. Host: whoami2.docker.lab.infra.tomwerner.me.uk should be changed to match a valid DNS hostname for your environment.
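Because these are Swarm service labels, they can also be changed at runtime and Traefik will pick the change up from the Docker socket; for example, to point the frontend at a different (hypothetical) hostname:

```
$ docker service update \
    --label-add "traefik.frontend.rule=Host: whoami3.docker.lab.infra.tomwerner.me.uk" \
    traefik_whoami0
```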
```yaml
networks:
  traefik:
    driver: overlay
    attachable: true

volumes:
  consul-data:
    driver: local
```
Finally, our network and storage definitions. The Traefik network is an overlay network, allowing it to span the Swarm cluster. It is also attachable, so that standalone containers can be attached to it for debugging (the default is false). The network will have a /24 subnet allocated from the default Swarm address pool (10.0.0.0/8). In my experience, the Ubuntu DNS resolver failed to resolve external addresses when the Swarm range overlapped with the external 'real world' network of the VPC. To specify a smaller pool for Swarm (in this case a subset of my VPC network), I used docker swarm init --default-addr-pool 10.10.0.0/18 to start my swarm.
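If you want to confirm which subnet was actually allocated, you can inspect the network once the stack is deployed:

```
$ docker network inspect traefik_traefik --format '{{json .IPAM.Config}}'
```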
Our Consul data is persisted, as not all of it is easily recreated (e.g. our issued certificates).
Check our deployment
Assuming you have a working cluster already, deploy the stack after checking out the repo:
```
$ docker stack deploy -c traefik_noprism.yml traefik
```
Verify the deployment:
```
$ docker service ls
NAME                   MODE        REPLICAS  IMAGE
traefik_consul         replicated  3/3       consul:latest
traefik_traefik        global      3/3       traefik:1.7
traefik_traefik_init   replicated  0/1       traefik:1.7
traefik_whoami0        replicated  6/6       containous/whoami:latest
```
The Consul cluster election can take a minute or so; monitor the Traefik and Consul logs to observe its progress. Navigating to the Traefik UI and seeing both the Consul and Docker provider tabs is a good indicator that the deployment has worked.
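The relevant logs can be followed with:

```
$ docker service logs -f traefik_consul
$ docker service logs -f traefik_traefik
```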
Test the deployment
Browse to any of your node's publicly accessible addresses to view the Traefik UI;
http://<public ip>:8080/
Under the Docker tab, the details of our test service should appear.
Browse to your service hostname (in my case whoami2.docker.lab.infra.tomwerner.me.uk). This should resolve to one of your Swarm nodes and be handed off to a Traefik node. Refresh a few times and watch the requests hit your chosen node under the Health tab (or open up the management UIs of all the nodes).
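You can also exercise the load balancing from the command line; whoami echoes the hostname of the container that served each request (add -k while the Let's Encrypt certificate is still being issued):

```
$ for i in 1 2 3; do curl -s https://whoami2.docker.lab.infra.tomwerner.me.uk/ | grep Hostname; done
```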
That's it for the tutorial!
Additional Steps
All tutorials are a basis to start from. A couple of things to consider taking further:
- Applying ACLs to restrict UI access to Consul (source IP and authentication),
- Restricting access to the Traefik management UI, either by disabling it or by firewall (e.g. AWS VPC security group access from a bastion only),
- Docker Socket security...
If you're concerned about exposing an internet-facing container that is bound to the Docker socket (a potential security issue), you may want to consider:
- Deploying traefik-prism as per my previous post,
- Using TCP to communicate with the Docker API rather than the local socket,
- Traefik Enterprise Edition, if you are in a commercial environment (it supports splitting the configuration and routing roles).
As ever, comments and questions are welcomed below.