About

While a centralized storage device, such as a consumer or enterprise-level network-attached storage (NAS) unit, is the current industry-standard way to run a data center, advances in distributed computing have made a split architecture increasingly attractive: one that is distributed, has no single point of failure and is therefore resilient to system failures.

Much of the hot data, frequently read from and written to disk, tends to consist of configuration files and metadata rather than storage per se, such that it makes sense to have not only long-term storage (for instance, LTO tape devices) but also short-to-medium-term storage solutions that commit the "actual data" to long-term storage only when needed.

Writing this has been somewhat painful due to outdated documentation and a minefield of bugs spread throughout the setup and utilities, bugs that should have been fixed by now.

Illustrative Scenario

One of the main scenarios, as an example, is an environment running a Servarr stack, including a bunch of torrents. In such an environment the actual data, whether TV shows or movies, is accessed much less frequently than the Servarr stack configuration and housekeeping files (i.e. databases), which are written to very often. It therefore does not make much sense to hand a large storage solution to such a stack; rather, the large processed files should be slow-committed only after their processing has completed.

To be precise, say Sonarr downloads a TV show episode; it then has to download the show's cover to update the metadata, after which the actual file gets passed to something like Unmanic, which uses ffmpeg to convert the file to a different desired quality, and perhaps the file is even passed through an AI tool such as Whisper to extract subtitles. Only after all of that can the file be committed to slow storage where, as you would imagine, it will more than likely be accessed just once (because it is just an episode). For the duration of all the former, as processes start and stop, all the actual disk I/O consists of housekeeping, metadata and other data that does not need to encumber a large storage device but can just as well run off a quick NVMe drive to speed up operations. Only when the file is fully processed is it pushed to slow storage.

Example Scenario

For this scenario, we'll create a "Docker dipole", that is, a swarm containing only two masters that store "metadata" or "application data" on a local disk which is then block-level mirrored onto the other node in the swarm, such that if one master goes down, the other master will have an up-to-date live copy of the data and will be able to run the services on its own.

We think this scenario is perhaps the best way to run a Docker swarm given that it eliminates any single point of failure, even in terms of being severed from a NAS that exists on the network.

The network established between the two machines can be of any nature; it could just as well be Fibre Channel or some other fiber-optic technology without being connected to an actual Ethernet or TCP/IP network. Note that as blocks are written on one device they have to be propagated to the other device across the network, which implies high traffic.

Even though the Logical Volume Manager (LVM) is going to be used to partition a disk, for what it is worth, the storage in both docker1 and docker2 is identical and a Verbatim SSD was used due to the low price, large capacity and high speed. However, LVM was used and a logical volume was replicated instead of the whole drive for flexibility. One of the problems is that some of the technologies being used (DRBD and OCFS2) have refused to implement unique WWNs or UUIDs to refer to block devices or partitions, and the developers hint at setting up udev properly to obtain consistent naming; with LVM, however, the logical volume name will always be consistent and the path to the block device will always be stable (i.e. /dev/docker/swarm).

For reference, Debian is used as the distribution but the procedure and the principles remain the same such that the steps should be applicable with local changes to other Linux distributions.

Logical Volume Manager

LVM is used on both nodes to set up a partition that will be replicated. This is fairly easy to do and LVM is only a very thin layer, such that it will not incur any serious performance penalty once the distributed replicated storage is added.

The Verbatim SSD disk was converted to a physical LVM volume:

pvcreate /dev/sdb

and then a volume group was created from the physical volume:

vgcreate docker /dev/sdb

and then finally a logical volume named swarm was created with a size of 256GiB:

lvcreate -L256G docker -n swarm

At this point the volume appears as a block device via the device mapper at /dev/docker/swarm. The procedure is then reproduced symmetrically on the other node. After the two logical volumes have been set up, the next part consists in setting up block replication and then the filesystem that will work on top of it.

The Distributed Replicated Block Device

The Distributed Replicated Block Device (DRBD) offers a block-level replication solution which seems most desirable when data has to be instantaneously synchronized between two machines. You can imagine this as a RAID-1 mirror, but instead of using drives in a RAID, network computer nodes are used. When a write takes place, that write is replicated onto all other nodes in the cluster. Block-level replication also allows a device to be completely mirrored such that 1-to-1 copies are maintained across the network, instead of, say, using a file synchronizer like syncthing, Resilio Sync, rclone or rsync that would incur a delay when files are being synchronized.
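As a toy sketch of what synchronous (protocol C) replication means, the following shell snippet mirrors every write to two plain files standing in for the local and remote block devices, and only acknowledges a write once both copies are on "disk" (the file names and the mirrored_write function are purely illustrative, not part of DRBD):

```shell
# Toy illustration of synchronous (protocol C) mirroring: a write is
# acknowledged only after BOTH replicas have been written. The two
# temporary files stand in for the local and the peer block device.
LOCAL_DEV=$(mktemp)   # stand-in for the local backing disk
PEER_DEV=$(mktemp)    # stand-in for the peer's backing disk

mirrored_write() {
    # write the payload to the local replica first...
    printf '%s\n' "$1" >> "$LOCAL_DEV" || return 1
    # ...then to the peer replica; only then acknowledge the write
    printf '%s\n' "$1" >> "$PEER_DEV" || return 1
    echo "write acknowledged: $1"
}

mirrored_write "block 1"
mirrored_write "block 2"

# at this point both replicas hold identical data
cmp -s "$LOCAL_DEV" "$PEER_DEV" && echo "replicas identical"
```

The point of the sketch is the ordering: the caller never sees success until both replicas carry the data, which is exactly the guarantee protocol C provides over the network.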

On Debian, DRBD can be installed by installing the drbd-utils package:

aptitude install drbd-utils

which will install all the essentials necessary to run DRBD.

Next, as one would expect, a configuration file has to be created in /etc/drbd.d/ in order to specify the drives and nodes that will participate in the distributed block-level replication.

For exemplification, we are going to set up a 1-to-1 mirror between two disks inside two networked servers. These disks will hold "metadata" or "application data" for a Docker swarm (hence the naming) that will ensure that if one server goes down, the other server will have a live copy of the data (a master-master configuration). Here is the /etc/drbd.d/swarm.res file that accomplishes the former:

resource swarm {
   device       /dev/drbd0;
   meta-disk    internal;

   net {
      protocol  C;
      allow-two-primaries;
      after-sb-0pri discard-zero-changes;
      after-sb-1pri discard-secondary;
      after-sb-2pri disconnect;
   }

   startup {
      become-primary-on both;
   }

   on docker1 {
      address   192.168.1.1:7790;
      disk /dev/docker/swarm;
   }

   on docker2 {
      address   192.168.1.2:7790;
      disk /dev/docker/swarm;
   }
}

The file /etc/drbd.d/swarm.res must now be transferred to the other machine and placed at the same place on the filesystem.

After that, the drbd service is restarted at the same time on both machines via the command:

systemctl restart drbd.service 

When both services have restarted on both nodes, the swarm resource is brought up and on one machine the node is forcibly made a primary (note that both nodes will eventually be primaries / masters, but for the sake of the initial replication just one node is forced to primary):

drbdadm primary --force swarm

Now the process of replication should have started and it can be consulted by polling /proc/drbd and checking the progress (this is very similar to monitoring RAID device replication). For instance one could open a terminal on both machines and run:

watch -n 1 cat /proc/drbd

in order to conveniently watch replication taking place in real time.

After the process has completed, the secondary node is also made primary by issuing on the second node:

drbdadm primary swarm

Now, consulting /proc/drbd, both nodes should be seen as Primary/Primary.
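A small helper can pull the relevant fields out of /proc/drbd; the function below is a sketch that extracts the connection, role and disk states from the 8.x-style status line, demonstrated here against a sample captured status (on a real node the function would simply be pointed at /proc/drbd):

```shell
# Sketch: report the DRBD connection (cs:), role (ro:) and disk (ds:)
# state fields from a status file. Pass /proc/drbd on a real node; a
# sample status line is used here because /proc/drbd only exists where
# the DRBD kernel module is loaded.
drbd_state() {
    grep -o 'cs:[^ ]*\|ro:[^ ]*\|ds:[^ ]*' "$1"
}

# sample /proc/drbd resource line as produced by DRBD 8.x (illustrative)
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
EOF

drbd_state "$SAMPLE"
```

On a healthy dual-primary setup the output should contain ro:Primary/Primary and ds:UpToDate/UpToDate.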

Just using DRBD is insufficient: even though /dev/drbd0 is now a virtual block device that replicates to both nodes, formatting /dev/drbd0 with conventional, non-distributed filesystems such as xfs, ext and so on will work but will invariably end in file corruption.

The next part now consists in setting up the Oracle Cluster File System (OCFS2) on /dev/drbd0.

The Oracle Cluster File System (OCFS2)

One of the problems of concurrent access to files, involving both reads and writes, is obviously locking. Conventional filesystems such as xfs, ext and others are not meant to be accessed via different hardware paths, such that they do not implement any kind of cluster-wide locking (in contrast to network filesystems, which have to maintain a global locking database).
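To make the locking notion concrete, the classic mutual-exclusion idiom below works only because both contenders observe the same local filesystem state; two kernels independently mounting ext4 or xfs on a dual-primary DRBD device each see only their own cached view, so no equivalent exclusion can hold across nodes (the lock directory name is arbitrary):

```shell
# Minimal mutual exclusion via atomic mkdir: the second attempt fails
# because both contenders see the SAME filesystem state. Two nodes
# mounting a non-cluster filesystem over dual-primary DRBD each see
# only their own cached state, so no such guarantee exists cross-node.
LOCKDIR=$(mktemp -d)/demo.lock

if mkdir "$LOCKDIR" 2>/dev/null; then
    echo "first contender: lock acquired"
fi

if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "second contender: lock already held"
fi

rmdir "$LOCKDIR"   # release the lock
```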

Here is where the Oracle Cluster File System (OCFS2) comes into play: being a clustered filesystem, it assumes concurrent access to files for both reading and writing, while maintaining a coherent locking database to ensure that no races take place as applications access the files.

Setting up OCFS2 is not that difficult, although Debian contains a few bugs that require leaving some configurable parameters at their defaults, as well as naming some files with well-defined names, even if the parameters and files could otherwise have been adjusted.

First, install ocfs2-tools:

aptitude install ocfs2-tools

and then create a file at /etc/ocfs2/cluster.conf with the following contents (note that the file must really be named cluster.conf, due to Debian setup-script bugs):

cluster:
        node_count = 2
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 1
        name = docker1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 2
        name = docker2
        cluster = ocfs2

Note that OCFS2 is very sensitive to extra spaces in the file, so be sure to eliminate any trailing spaces or purposeless newlines in /etc/ocfs2/cluster.conf.
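Since stray trailing whitespace is hard to spot by eye, a small sed pass can scrub it; the snippet below demonstrates the idea on a temporary file (on a real node the same sed expression would be pointed at /etc/ocfs2/cluster.conf, after taking a backup):

```shell
# Strip trailing whitespace from every line of an OCFS2-style config.
# Demonstrated on a temporary file; on a real node, point sed at
# /etc/ocfs2/cluster.conf instead (backing the file up first).
CONF=$(mktemp)
printf 'cluster:   \n        name = ocfs2\t\n' > "$CONF"

# delete any run of whitespace at end-of-line, in place
sed -i 's/[[:space:]]*$//' "$CONF"

if ! grep -q '[[:space:]]$' "$CONF"; then
    echo "no trailing whitespace left"
fi
```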

The cluster is named ocfs2 and even though it could have been named swarm to be in line with the rest of the tutorial, the Debian setup tools for some reason have this special name hard-coded, such that it is best to leave it as it is.

Next, run the setup for ocfs2-tools:

dpkg-reconfigure ocfs2-tools

and just hit Enter through the entire setup. Note again that, due to bugs, the ocfs2 cluster name, also referenced in /etc/ocfs2/cluster.conf, should be left as is.

Finally, restart OCFS2 and O2CB:

systemctl restart ocfs2
systemctl restart o2cb

In case o2cb.service fails to restart, make sure to check the naming of the files and the cluster as mentioned above, as well as any extraneous space characters in /etc/ocfs2/cluster.conf, because the Debian SystemD scripts that start O2CB are not very verbose.

Otherwise, OCFS2 and O2CB should now be running, so it is time to format the DRBD device using OCFS2 as the filesystem and the cluster name as the filesystem label:

mkfs.ocfs2 -L "ocfs2" /dev/drbd0

where:

  • ocfs2 is the cluster name from /etc/ocfs2/cluster.conf.

The /dev/drbd0 block device is now ready to be mounted and used just like any other storage device. For a simple test that everything is working, mount /dev/drbd0 on both machines to local folders and then create files via touch on one or the other machine and check that the files are replicated to the other machine.

Note that in case the cluster name changes, the label on the DRBD device has to be changed to match or else mounting the OCFS2 filesystem will result in something along the lines of the following error:

tunefs.ocfs2: Cluster name is invalid while opening device "/dev/drbd0"

indicating that the filesystem label might not match the cluster name.

To add one more bug to the pile, the tunefs.ocfs2 utility is supposed to be able to change the label of the filesystem with -L, but apparently -L does not change the label and --cloned-volume must be used instead.

For instance, the command:

tunefs.ocfs2 --cloned-volume=ocfs2 /dev/drbd0

seems to successfully change the filesystem label for /dev/drbd0 to ocfs2. After that, the three services drbd.service, ocfs2.service and o2cb.service have to be restarted.

Starting Everything on Boot

Typically, distributed filesystems and DRBD are meant to be started by additional "clusterware" software that monitors and starts all the dependent services. However, the current setup is rather thin, the way it is supposed to be, such that adding Pacemaker or DRBD Reactor would just over-complicate the setup and add extra moving parts that can each fail on their own.

To work around the issue, the following SystemD mount and automount files are used to mount the distributed filesystem on demand. First, because the filesystem depends on the network, the following command should be used to make SystemD wait until the network is online:

systemctl enable systemd-networkd-wait-online.service

After that, the two files mnt-swarm.mount and mnt-swarm.automount are added to /etc/systemd/system/, with mnt-swarm.automount enabled, in order to ensure that accessing the /mnt/swarm path will make the system automatically mount everything.

Here is /etc/systemd/system/mnt-swarm.mount:

[Unit]
Description=Mounts DRBD
After=network.target network-online.target drbd.service o2cb.service ocfs2.service
Requires=network.target network-online.target drbd.service o2cb.service ocfs2.service

[Mount]
Where=/mnt/swarm
What=/dev/drbd0
Type=ocfs2
Options=_netdev,x-systemd.automount,x-systemd.device-timeout=60s

and the corresponding /etc/systemd/system/mnt-swarm.automount:

[Unit]
Description=Mounts DRBD Automatically

[Automount]
Where=/mnt/swarm

[Install]
WantedBy=multi-user.target

After copying the files, issue:

systemctl enable mnt-swarm.automount

in order to automatically start the necessary daemons and mount everything on demand.

Note that it should have been possible to just add an /etc/fstab entry to mount the OCFS2 filesystem, but bear in mind that DRBD has to be started first, then O2CB, then OCFS2, and only after that can the filesystem be mounted. Using SystemD gives access to the After= and Requires= directives within the [Unit] section, which make SystemD start all dependent services before attempting to mount the filesystem.
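For comparison, the closest /etc/fstab equivalent would have to encode the same service dependencies through systemd mount options; an untested sketch relying on the x-systemd.requires= option might read:

```
# /etc/fstab sketch (illustrative): the same mount expressed through
# systemd options, with x-systemd.requires= pulling in the services
# before the mount is attempted
/dev/drbd0  /mnt/swarm  ocfs2  _netdev,x-systemd.automount,x-systemd.requires=drbd.service,x-systemd.requires=o2cb.service,x-systemd.requires=ocfs2.service  0  0
```

The dedicated unit files remain the clearer choice, since the ordering is spelled out explicitly rather than packed into a single options string.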


linux/settting_up_network_block_storage_replication_with_high_availability_and_failover.txt · Last modified: 2025/05/09 15:21 by office

© 2025 Wizardry and Steamworks