This tutorial demonstrates how to use Linux load-balancing and high-availability to distribute services amongst Docker containers in a swarm, allowing more fine-grained control. The LVS topology chosen is direct routing (DR), with the swarm nodes answering requests themselves without needing to go back through the LVS director.
Here is an overview of a typical layout where caddy stands in front of a Docker swarm with services defined on multiple machines:
              | connection
              v
        +-----+-----+
        |   caddy   |
        +-----+-----+
              |
    +---------+---------+
    |         |         |
......................................
    |         |         |
+---+---+ +---+---+ +---+---+
|  s_1  | |  s_2  | |  s_3  |  ... Docker swarm
+-------+ +-------+ +-------+
......................................
The corresponding caddy configuration is something along the lines of:
*.domain.duckdns.org, domain.duckdns.org {
    ...
    handle @service_1 {
        reverse_proxy docker.lan:8000
    }
    handle @service_2 {
        reverse_proxy docker.lan:8001
    }
    ...
}
where services such as service_1 are reverse-proxied into the swarm at docker.lan on the different ports that they listen on.
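The @service_1 and @service_2 matchers are elided in the snippet above; as an assumption about how the services are distinguished, they could be host matchers defined inside the same site block, one subdomain per service:

@service_1 host service_1.domain.duckdns.org
@service_2 host service_2.domain.duckdns.org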
Now, the docker address docker.lan can be defined with some internal DNS server using A records for each node in the swarm:
docker.lan.    A    192.168.1.81
docker.lan.    A    192.168.1.82
docker.lan.    A    192.168.1.83
such that upon every DNS lookup of the docker.lan host, one of the three IP addresses is returned in round-robin fashion.
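The round-robin behaviour can be checked from any machine that uses the internal DNS server, for instance with dig:

# successive queries should return the three A records in rotating order
dig +short docker.lan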
One of the problems with this setup is that DNS responses end up cached, such that accesses into the Docker swarm occur predominantly over one of the IP addresses instead of being spread out across the entire set of IP addresses.
If one changes the caddy configuration to:
*.domain.duckdns.org, domain.duckdns.org {
    ...
    handle @service_1 {
        reverse_proxy 192.168.1.81:8000
    }
    handle @service_2 {
        reverse_proxy 192.168.1.81:8001
    }
    ...
}
where all accesses go to 192.168.1.81, the problem remains the same, even if Docker ensures that the request will internally be distributed to the appropriate node within the Docker swarm.
A different topology will be the following:
              | connection
              v
        +-----+-----+
        |   caddy   |
        +-----+-----+
              |
        +-----+-----+
        |   IPVS    |
        +-----+-----+
              |
    +---------+---------+
    |         |         |
......................................
    |         |         |
+---+---+ +---+---+ +---+---+
|  s_1  | |  s_2  | |  s_3  |  ... Docker swarm
+-------+ +-------+ +-------+
......................................
where Linux IPVS functions as a load-balancer meant to spread out traffic among the nodes s_1 through to s_3.
The following configuration will be implemented:
                                | connection
                                v
                          +-----+-----+
                          |   caddy   |
                          +-----+-----+
                                |
                          +-----+-----+
                          |   IPVS    | VIP: 192.168.1.100
                          +-----+-----+
                                |
    +---------------------------+--------------------------+
    |                           |                          |
...............................................................................
    |                           |                          |
    | RIP: 192.168.1.101        | RIP: 192.168.1.102       | RIP: 192.168.1.103
    | VIP (lo:0): 192.168.1.100 | VIP (lo:0): 192.168.1.100| VIP (lo:0): 192.168.1.100
    |                           |                          |
+---+---+                   +---+---+                  +---+---+
|  s_1  |                   |  s_2  |                  |  s_3  |  ...
+-------+                   +-------+                  +-------+
...............................................................................
where:

- 192.168.1.100 is a virtual IP created on the director,
- 192.168.1.101, 192.168.1.102 and 192.168.1.103 are the real IP addresses of the nodes within the Docker swarm,
- 192.168.1.100 is additionally set on every swarm node using an interface alias (lo:0 in the diagram above).
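The virtual IP on the director itself is not persisted for this test either; it can be created as an address on the director's LAN interface, where eth0 below is only an assumed interface name:

# add the VIP on the director; replace eth0 with the director's actual LAN interface
ip addr add 192.168.1.100/32 dev eth0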
Just for testing, the setup will be created using command line tools without any sort of persistence.
ipvsadm needs to be installed first; on Debian, the command would be:
apt-get install ipvsadm
following which, the next commands:
ipvsadm -A -f 80 -s lc
ipvsadm -a -f 80 -r 192.168.1.101 -g
ipvsadm -a -f 80 -r 192.168.1.102 -g
ipvsadm -a -f 80 -r 192.168.1.103 -g
will:

- create a virtual service identified by firewall mark 80, scheduled with the least-connection algorithm (-s lc),
- add the three swarm nodes as real servers for that service, using direct routing (-g) so that the nodes answer clients directly instead of replying back through the director.
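Because the virtual service is keyed on firewall mark 80 rather than on an address and port, packets destined for the VIP also have to be marked before IPVS sees them. A minimal sketch of the marking rule, assuming the mark is set in the mangle table, would be:

# mark all traffic addressed to the VIP with 80 so that IPVS picks it up
iptables -t mangle -A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80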
On the director, the iptables marking line can be made permanent, such that the rule is restored on reboot, by installing the iptables-persistent package (which pulls in netfilter-persistent). Following the example, the line that must be added to /etc/iptables/rules.v4 is the following:
-A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80
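Note that rules.v4 is in iptables-save format, so the line has to sit inside the block of the table it belongs to; a minimal sketch of the relevant section, assuming the mark is set in the mangle table as above, would look like:

*mangle
:PREROUTING ACCEPT [0:0]
-A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80
COMMIT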
Finally, the IPVS configuration can be made persistent by using ipvsadm-save to dump the rules and ipvsadm-restore to restore them.
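For instance, the running table can be dumped to a file and loaded back after a reboot; the path below is the one the Debian init script is commonly configured to read, which is a packaging assumption worth verifying locally:

# dump the current IPVS table in numeric form and restore it later
ipvsadm-save -n > /etc/ipvsadm.rules
ipvsadm-restore < /etc/ipvsadm.rules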
On each node in the swarm, the virtual IP has to be assigned to a virtual interface to make sure that the node will accept and answer requests addressed to the virtual IP. The loopback interface can be used:
ifconfig lo:0 192.168.1.100 netmask 255.255.255.255 -arp up
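Since Linux will by default answer ARP requests for any local address on any interface, LVS-DR setups commonly also restrict ARP handling so that the swarm nodes do not advertise the VIP themselves and all VIP traffic keeps flowing through the director:

# typical LVS-DR companion settings: ignore ARP requests for addresses
# held on lo and avoid announcing them from other interfaces
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2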
In order to check that the different nodes are being accessed, the ipvsadm command can be issued, perhaps with the -lcn flags set in order to list established and pending connections.
The command:
ipvsadm -lcn
should show something along the lines of:
IPVS connection entries
pro expire state       source            virtual             destination
TCP 00:22  SYN_RECV    x.x.x.x:31847     192.168.1.100:1883  192.168.1.101:1883
TCP 14:48  ESTABLISHED y.y.y.y:63330     192.168.1.100:1883  192.168.1.101:1883
TCP 00:53  SYN_RECV    z.z.z.z:19167     192.168.1.100:1883  192.168.1.102:1883
where you can see various source connections routed through the virtual IP address 192.168.1.100 to the Docker swarm destination nodes 192.168.1.101 and 192.168.1.102.
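Besides the connection entries, the service table itself can be listed in order to confirm that all three real servers are registered and to watch the per-server counters:

# list the virtual services and their real servers numerically
ipvsadm -ln
# the same listing together with packet and byte statistics
ipvsadm -ln --stats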
On Debian, the interface configurations can be created in /etc/network/interfaces or defined in a separate file in /etc/network/interfaces.d.
The interface configuration defining the virtual IP address on the loopback interface, which must be present on all Docker swarm nodes, is the following:
auto lo:0
iface lo:0 inet static
    address 192.168.1.100
    netmask 255.255.255.255
keepalived can be used to achieve high-availability by checking that the Docker swarm nodes are still reachable and, when one of them is not, redirecting traffic to the other Docker swarm nodes.
A very simple configuration based on the firewall marking example provided by keepalived can be seen here:
global_defs {
    router_id io
}

virtual_server fwmark 80 {
    delay_loop 6
    lb_algo lc
    lb_kind DR

    real_server 192.168.1.101 0 {
        weight 1
        MISC_CHECK {
            misc_path '"/usr/bin/ping -c 3 192.168.1.101"'
            misc_timeout 5
            warmup 5
        }
    }

    real_server 192.168.1.102 0 {
        weight 1
        MISC_CHECK {
            misc_path '"/usr/bin/ping -c 3 192.168.1.102"'
            misc_timeout 5
            warmup 5
        }
    }

    real_server 192.168.1.103 0 {
        weight 1
        MISC_CHECK {
            misc_path '"/usr/bin/ping -c 3 192.168.1.103"'
            misc_timeout 5
            warmup 5
        }
    }
}
The configuration checks each real server every six seconds (delay_loop 6) by sending three ICMP echo requests; if a node does not respond, it is taken out of the pool and subsequent connections are redirected to the remaining nodes that are still alive.
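As a rough sketch of how this is wired up on the director, the configuration goes into keepalived's default location and the daemon is restarted; the real servers that pass the check should then show up in the IPVS table:

# install the configuration and restart the daemon
cp keepalived.conf /etc/keepalived/keepalived.conf
systemctl restart keepalived
# healthy real servers should appear under the fwmark 80 virtual service
ipvsadm -ln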