Even though libvirt offers many ways to perform virtual machine backups, the main problem with libvirt backups, and with backups in general, is that a backup does not conceptually imply fast restore and deployment. In fact, a "good backup device" would be a slow, tape-drive-like machine that performs incremental backups and tombstones them in case they are needed in the far future. Where libvirt is concerned, backups made by copying files, either via the traditional disk dumper tool (dd) or by transferring QCOW images, take a very long time to create and roughly the same amount of time to restore.
Restoring a backup implies downtime during which services are unavailable until the backup is fully restored and the virtual machine is booted up again. One alternative is to run servers in parallel for redundancy, so that when one machine fails the other can take over without any perceptible downtime, while a backup always exists and is ready to be deployed. Taking snapshots and restoring them on the same machine is trivial; the extra difficulty is to transparently maintain a parallel server across the network that will be ready to go in case a virtual machine fails.
One solution is to use "Distributed Replicated Block Device" (DRBD) on Linux that acts on the block layer and can make sure that block-level mirrors of the current hard-drive are maintained on a network. There are several drawbacks using DRBD that are implied:
Another solution is to leverage libvirt's ability to create external disk-only snapshots, such that writes are redirected to an overlay file while the base image stays quiescent, and then use the Network Block Device (NBD) to remotely attach to the virtual machine's block device and clone it. This can be performed periodically; in between, the machines that copy over the virtual machine can hibernate.
A diagram of the setup is as follows:
The host runs multiple virtual machines labeled dom_1, dom_2, etc. and exports the virtual machine block devices (or QCOW file-based virtual machines) over the network via an NBD server. A client clone_1 connects via an NBD client to the block device of dom_1, at which point clone_1 copies over the entire block device using the disk dumper tool (dd) or similar.
For this process to be transparent, the virtual machines must not be shut down; instead, the block commit feature of libvirt will be used. It is assumed that clone_1 will shut down its own operations on the block device or file that it uses to run the virtual machine.
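For reference, the snapshot and block commit cycle that will later be automated by the NBD server scripts looks roughly as follows when run by hand (the domain name mydomain, the disk hda and the snapshot path are just examples matching the configuration further below):

# redirect writes of disk hda to an external overlay file (disk-only snapshot)
virsh snapshot-create-as --domain mydomain mydomain-nbd-snap \
    --diskspec hda,file=/var/lib/libvirt/snapshots/mydomain-nbd-snap \
    --disk-only --atomic --no-metadata
# ... the base image can now be copied safely while the domain keeps running ...
# merge the overlay back into the base image and pivot the domain back to it
virsh blockcommit mydomain hda --active --pivot
# remove the now-unused overlay file
rm /var/lib/libvirt/snapshots/mydomain-nbd-snap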
A protocol diagram would be as follows:
where the following operations will take place in order:
* the client connects to the NBD server and requests the export corresponding to a virtual machine,
* before granting access, the NBD server creates a disk-only snapshot of the virtual machine via the prerun script, such that writes are redirected to an overlay file,
* the client copies the now-quiescent block device over the network,
* the client disconnects, at which point the NBD server commits the overlay back into the base image via the postrun script and removes the snapshot file,
with the following remarks:
* in case the snapshot cannot be created by the prerun script, the NBD server does not allow the connection, in order to prevent copying a block device that is currently in use.

The setup is remarkably simple for what it does. The only requirements are that the server runs an NBD server and that the clients install the NBD client.
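On a Debian-based system (an assumption - package names may differ on other distributions), this amounts to installing two packages:

# on the host running the virtual machines
apt-get install nbd-server
# on every machine that will clone block devices over the network
apt-get install nbd-client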
NBD relies on the nbd kernel module, which is not loaded automatically, so it has to be added to the list of modules loaded at boot. In order to do that, edit /etc/modules and add nbd on a line at the end of the file. Next, load the NBD module manually via modprobe nbd.
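A minimal sketch of the commands, assuming a Debian-like system where /etc/modules is read at boot:

# load the nbd kernel module immediately
modprobe nbd
# make sure the module is loaded again after a reboot
echo nbd >> /etc/modules
# verify that the module is present
lsmod | grep nbd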
Next, set up the server by editing /etc/nbd-server/config and adjusting the configuration to export a libvirt block device (this can also be a file):
[generic]
# If you want to run everything as root rather than the nbd user, you
# may either say "root" in the two following lines, or remove them
# altogether. Do not remove the [generic] section, however.
# user = nbd
# group = nbd
includedir = /etc/nbd-server/conf.d

# What follows are export definitions. You may create as much of them as
# you want, but the section header has to be unique.

[mydomain]
exportname = /dev/mapper/vms-mydomain
readonly = true
prerun = /usr/bin/virsh snapshot-create-as --domain "mydomain" "mydomain-nbd-snap" --diskspec "hda",file="/var/lib/libvirt/snapshots/mydomain-nbd-snap" --disk-only --atomic --no-metadata --quiesce
postrun = /bin/sh -c '/usr/bin/virsh blockcommit "mydomain" "hda" --active --pivot && rm /var/lib/libvirt/snapshots/mydomain-nbd-snap'
where:
* the user and group lines referencing nbd have been commented out such that nbd-server may run with root permissions in order to access the block devices,
* mydomain is a libvirt domain,
* [mydomain] is the export section, where mydomain is the name of the export,
* exportname = /dev/mapper/vms-mydomain is the path to the libvirt virtual machine block device - this can be an LVM LV,
* mydomain-nbd-snap is a snapshot file that will be created under /var/lib/libvirt/snapshots and will be used by the virtual machine to store changes whilst the block device is copied over the network,
* hda is the device name, local to the virtual machine, of the disk to be snapshotted,
* the prerun script will be executed whenever an NBD client connects, but before yielding the block device to the client; in case the exit status of the command passed to prerun is non-zero, the NBD server will not grant access to the client - in this case, if libvirt fails to create a snapshot, then the whole process is aborted,
* the postrun script is executed after a client issues a disconnect from the NBD server; in this case, two operations are performed sequentially, iff. the operations also succeed sequentially:
  * the changes accumulated in the snapshot file are block-committed back into the base image and the domain pivots back to its original block device,
  * the snapshot file is removed.

An additional remark is that after the client disconnects, iff. the snapshot file has not been removed (either because the block commit failed or because removing the file failed), then when a client connects again the snapshot file will not be overwritten and the prerun command will fail, such that the client will not be granted access.
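Should that happen, the overlay has to be merged back and the file removed by hand before the export becomes usable again; one way to recover manually mirrors what the postrun script does, using the names from the example configuration above:

# merge the overlay back into the base image and pivot the domain back to it
virsh blockcommit mydomain hda --active --pivot
# check that the domain points at its original block device again
virsh domblklist mydomain
# remove the stale snapshot overlay
rm /var/lib/libvirt/snapshots/mydomain-nbd-snap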
In order to clone the remote machine's block device, the client may perform the following operations in order:
# Connect to the server and request to map the mydomain export to /dev/nbd0.
# The server's prerun script will be executed at this point on the server.
nbd-client server.tld -N mydomain /dev/nbd0

# Transfer over the block device.
dd if=/dev/nbd0 of=...

# Disconnect from the NBD server.
# The server's postrun script will be executed at this point on the server.
nbd-client -d /dev/nbd0
where:
* server.tld is the machine hosting the virtual machines and the NBD server,
* mydomain is an NBD export,
* /dev/nbd0 is the local NBD device to which the remote block device will be mapped.

The operations can be performed whenever the client feels like updating its block device from the server, such that the commands may be run periodically via cron at established times.
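As an illustration of running the copy periodically, the commands above can be wrapped in a small script and triggered from cron; the following is only a sketch, where the script path, the destination device and the schedule are assumptions:

#!/bin/sh
# /usr/local/sbin/clone-mydomain.sh (hypothetical path)
# map the remote export locally; the server's prerun script creates the snapshot
nbd-client server.tld -N mydomain /dev/nbd0
# copy the block device onto a local volume (assumed destination)
dd if=/dev/nbd0 of=/dev/mapper/vms-mydomain bs=1M
# disconnect; the server's postrun script commits and removes the snapshot
nbd-client -d /dev/nbd0

which could then be scheduled via a crontab entry such as 0 3 * * * /usr/local/sbin/clone-mydomain.sh in order to refresh the local copy every night at 03:00.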