About

This tutorial demonstrates how to use Linux load balancing and high availability to distribute services among the Docker containers in a swarm, allowing more fine-grained control over how traffic reaches the nodes. The LVS topology chosen is direct routing, with the swarm nodes answering requests themselves instead of the replies having to pass back through the LVS director.

Topology Change

Here is an overview of a typical layout where caddy stands in front of a Docker swarm with services defined on multiple machines:

                  | connection
                  v
            +-----+-----+
            |   caddy   |
            +-----+-----+
                  |
        +---------+---------+
        |         |         |
......................................
        |         |         | 
    +---+---+ +---+---+ +---+---+
    |  s_1  | |  s_2  | |  s_3  | ... Docker swarm
    +-------+ +-------+ +-------+
......................................

The corresponding caddy configuration is something along the lines of:

*.domain.duckdns.org, domain.duckdns.org {
    ...
    handle @service_1 {
        reverse_proxy docker.lan:8000
    }

    handle @service_2 {
        reverse_proxy docker.lan:8001
    }
    ...
}

where services such as service_1 are reverse-proxied into the swarm at docker.lan on the different ports that they listen on.

Now, the docker.lan address can be defined on an internal DNS server using one A record per node in the swarm:

docker.lan.  A   192.168.1.81
docker.lan.  A   192.168.1.82
docker.lan.  A   192.168.1.83

such that every DNS lookup of docker.lan returns one of the three IP addresses in round-robin fashion.

One of the problems with this setup is that DNS responses end up being cached, so that accesses into the Docker swarm occur predominantly over one of the IP addresses instead of being spread out across the entire set.

If one changes the caddy configuration to:

*.domain.duckdns.org, domain.duckdns.org {
    ...
    handle @service_1 {
        reverse_proxy 192.168.1.81:8000
    }

    handle @service_2 {
        reverse_proxy 192.168.1.81:8001
    }
    ...
}

where all accesses go to 192.168.1.81, the problem remains the same, even if Docker ensures that the request will internally be distributed to the appropriate node within the Docker swarm.

An alternative topology is the following:

                  | connection
                  v
            +-----+-----+
            |   caddy   |
            +-----+-----+
                  |
            +-----+-----+
            |    IPVS   |
            +-----+-----+
                  |
        +---------+---------+
        |         |         |
......................................
        |         |         | 
    +---+---+ +---+---+ +---+---+
    |  s_1  | |  s_2  | |  s_3  | ... Docker swarm
    +-------+ +-------+ +-------+
......................................

where Linux IPVS functions as a load-balancer meant to spread out traffic among the nodes s_1 through to s_3.

Assumptions

The following configuration will be implemented:

                                                        | connection
                                                        v
                                                  +-----+-----+
                                                  |   caddy   |
                                                  +-----+-----+
                                                        |
                                                  +-----+-----+
                                                  |    IPVS   | VIP: 192.168.1.100
                                                  +-----+-----+
                                                        |
                            +---------------------------+--------------------------+
                            |                           |                          |
                ...............................................................................
                            |                           |                           |
         RIP: 192.168.1.101 |        RIP: 192.168.1.102 |        RIP: 192.168.1.103 |
  VIP (lo:0): 192.168.1.100 | VIP (lo:0): 192.168.1.100 | VIP (lo:0): 192.168.1.100 |
                            |                           |                           |
                        +---+---+                   +---+---+                   +---+---+
                        |  s_1  |                   |  s_2  |                   |  s_3  | ...
                        +-------+                   +-------+                   +-------+
                ...............................................................................

where:

  - VIP is the virtual IP address (192.168.1.100) that caddy forwards requests to and that the IPVS director answers for,
  - RIP is the real IP address of each Docker swarm node,
  - lo:0 is a loopback alias on every node that carries the VIP, so that the nodes accept traffic addressed to it and can reply directly without going back through the director.

Setup

Just for testing, the setup will be created using command line tools without any sort of persistence.

Director

ipvsadm needs to be installed first; on Debian, the command would be:

apt-get install ipvsadm

after which the following commands:

ipvsadm -A -f 80 -s lc
ipvsadm -a -f 80 -r 192.168.1.101 -g
ipvsadm -a -f 80 -r 192.168.1.102 -g
ipvsadm -a -f 80 -r 192.168.1.103 -g

will:

  1. set up a virtual service identified by firewall mark 80, scheduled with the "least connections" algorithm (-s lc); the iptables rule that actually applies the mark is shown right after this list,
  2. add the three real server IP addresses to the virtual service using direct routing mode (-g).
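
Since the virtual service is identified by firewall mark 80, packets addressed to the VIP (192.168.1.100) have to carry that mark for IPVS to pick them up. A typical way to apply it (this assumes the mark is set in the mangle table, which is where packet marking is normally done) is:

iptables -t mangle -A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80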

On the director, the iptables marking rule can be made permanent, so that it is restored on reboot, by installing the netfilter-persistent package. Following the example, the line to add to /etc/iptables/rules.v4 is the following:

-A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80
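
Note that /etc/iptables/rules.v4 uses the iptables-restore format, where rules live inside table sections; a minimal fragment covering just this rule, assuming the mangle table as above, would look like:

*mangle
:PREROUTING ACCEPT [0:0]
-A PREROUTING -d 192.168.1.100 -j MARK --set-mark 80
COMMIT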

Finally, the IPVS configuration itself can be made persistent by dumping it with ipvsadm-save and loading it back with ipvsadm-restore.
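
A minimal sketch of how this could be wired up (the file path /etc/ipvsadm.rules is an arbitrary choice, not something mandated by the package):

# dump the current IPVS table in numeric form
ipvsadm-save -n > /etc/ipvsadm.rules
# load it back, for instance from a boot-time script
ipvsadm-restore < /etc/ipvsadm.rules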

Nodes

On each node in the swarm, the virtual IP has to be assigned to a virtual interface so that the node will accept and answer requests addressed to the virtual IP. The loopback interface can be used:

ifconfig lo:0 192.168.1.100 netmask 255.255.255.255 -arp up
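
With direct routing, the nodes should usually also be prevented from answering ARP requests for the VIP on their regular interfaces, otherwise they compete with the director for the address. A common sketch, using the standard ARP-related sysctls (whether the defaults need changing depends on the environment):

sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2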

Testing

To check that the different nodes are actually being hit, ipvsadm can be invoked with the -lcn flags, which list the connection table numerically, including established and pending connections.

The command:

ipvsadm -lcn 

should show something along the lines of:

IPVS connection entries
pro expire state       source             virtual            destination
TCP 00:22  SYN_RECV    x.x.x.x:31847   192.168.1.100:1883   192.168.1.101:1883
TCP 14:48  ESTABLISHED y.y.y.y:63330   192.168.1.100:1883   192.168.1.101:1883
TCP 00:53  SYN_RECV    z.z.z.z:19167   192.168.1.100:1883   192.168.1.102:1883

where you can see various source connections routed through the virtual IP address 192.168.1.100 to the Docker swarm destination nodes 192.168.1.101 and 192.168.1.102.

Making the Setup Persistent

On Debian, the interface configurations can be created in /etc/network/interfaces or defined in a separate file in /etc/network/interfaces.d.

The following interface configuration, which places the virtual IP address on a loopback alias, has to be present on all Docker swarm nodes:

auto lo:0
iface lo:0 inet static
    address 192.168.1.100
    netmask 255.255.255.255
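
If the ARP-related sysctls from the node setup are used, they can be persisted as well, for instance via a drop-in file (the file name below is an arbitrary choice):

# /etc/sysctl.d/99-lvs-dr.conf
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2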

Implementing High-Availability

keepalived can be used to achieve high availability: it periodically checks that the Docker swarm nodes are still reachable and, when a node stops responding, removes it from the IPVS table so that traffic is redirected to the remaining nodes. Note that keepalived builds the virtual server itself from its configuration, so the manual ipvsadm commands above are not needed once it is in charge.

A very simple configuration, based on the firewall-mark sample shipped with keepalived, looks as follows:

global_defs {
  router_id io
}

virtual_server fwmark 80 {
  delay_loop 6
  lb_algo lc
  lb_kind DR

  real_server 192.168.1.101 {
    weight 1
    MISC_CHECK {
      misc_path "/usr/bin/ping -c 3 192.168.1.101"
      misc_timeout 5
      warmup 5
    }
  }

  real_server 192.168.1.102 {
    weight 1
    MISC_CHECK {
      misc_path "/usr/bin/ping -c 3 192.168.1.102"
      misc_timeout 5
      warmup 5
    }
  }

  real_server 192.168.1.103 {
    weight 1
    MISC_CHECK {
      misc_path "/usr/bin/ping -c 3 192.168.1.103"
      misc_timeout 5
      warmup 5
    }
  }
}

The configuration pings each real server every six seconds (delay_loop 6); if a node stops responding, it is removed from the virtual server and subsequent connections are redirected to the nodes that are still alive.
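
Since ICMP reachability does not say much about whether the services themselves are up, a TCP_CHECK against one of the published service ports could be used instead of the MISC_CHECK; a sketch for the first node, assuming service_1 from the caddy example is published on port 8000:

  real_server 192.168.1.101 {
    weight 1
    TCP_CHECK {
      # assumes the service publishes port 8000 on every swarm node
      connect_port 8000
      connect_timeout 5
    }
  }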