Ganeti is software designed to facilitate the management of virtual machines (KVM or Xen). It lets you move virtual machine instances from one node to another, create an instance with DRBD replication on another node, live-migrate instances between nodes, and so on.

Tutorial

Listing virtual machines (instances)

This will show the running guests, known as "instances":

gnt-instance list

Accessing serial console

Our instances provide a serial console, starting in GRUB. To access it, run:

gnt-instance console test01.torproject.org

To exit, use ^] -- that is, Control-<Closing Bracket>.

How-to

Glossary

In Ganeti, we use the following terms:

  • node: a physical machine
  • instance: a virtual machine
  • master: the node on which we issue Ganeti commands and which supervises all the other nodes

Nodes are interconnected through a private network that is used to communicate commands and synchronise disks (with DRBD). Instances are normally assigned two nodes: a primary and a secondary: the primary is where the virtual machine actually runs and the secondary acts as a hot failover.

See also the more extensive glossary in the Ganeti documentation.
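
To see those concepts in practice, these read-only commands, run on the master, show which node holds the master role and the state of all nodes:

gnt-cluster getmaster   # prints the name of the current master node
gnt-node list           # lists nodes with their memory and disk usage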

Adding a new instance

This command creates a new guest, or "instance" in Ganeti's vocabulary, with a 10G root, 512M of swap, 20G spare on SSD, 800G on HDD, 8GB of RAM and 2 CPU cores:

gnt-instance add \
  -o debootstrap+trixie \
  -t drbd --no-wait-for-sync \
  --net 0:ip=pool,network=gnt-fsn13-02 \
  --no-ip-check \
  --no-name-check \
  --disk 0:size=10G \
  --disk 1:size=20G \
  --disk 2:size=800G,vg=vg_ganeti_hdd \
  --backend-parameters memory=8g,vcpus=2 \
  test-01.torproject.org

What that does

This configures the following:

  • redundant disks in a DRBD mirror
  • two additional partitions: one on the default VG (SSD), one on another (HDD). A 512MB swapfile is created in /swapfile. TODO: configure disk 2 and 3 automatically in installer. (/var and /srv?)
  • 8GB of RAM with 2 virtual CPUs
  • an IP allocated from the public gnt-fsn pool: gnt-instance add will print the IPv4 address it picked to stdout. The IPv6 address can be found in /var/log/ganeti/os/ on the primary node of the instance, see below.
  • with the test-01.torproject.org hostname

Next steps

To find the root password, ssh host key fingerprints, and the IPv6 address, run this on the node where the instance was created, for example:

egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)

We copy root's authorized keys into the new instance, so you should be able to log in with your token. You will be required to change the root password immediately. Pick something nice and document it in tor-passwords.

Also set reverse DNS for both IPv4 and IPv6 in Hetzner's Robot (check under Servers -> vSwitch -> IPs) or in our own reverse zone files (if delegated).

Then follow new-machine.

Known issues

  • allocator failures: you may need to use the --node parameter to pick which machines you want the instance to end up on, otherwise Ganeti will choose for you (and may fail). Use, for example, --node fsn-node-01:fsn-node-02 to use node-01 as primary and node-02 as secondary. The allocator can sometimes fail if it is upset about something in the cluster, for example:

     Can's find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2
    

    This situation is covered by ticket 33785. If this problem occurs, it might be worth rebalancing the cluster.

    The following dashboards can help you choose the least busy nodes to use:

  • ping failure: there is a bug in ganeti-instance-debootstrap which misconfigures ping (among other things), see bug 31781. It's currently patched in our version of the Debian package, but that patch might disappear if Debian upgrades the package without shipping our patch. Note that this was fixed in Debian bullseye and later.

Other examples

Dallas cluster

This is a typical server creation in the gnt-dal cluster:

gnt-instance add \
  -o debootstrap+trixie \
  -t drbd --no-wait-for-sync \
  --net 0:ip=pool,network=gnt-dal-01 \
  --no-ip-check \
  --no-name-check \
  --disk 0:size=10G \
  --disk 1:size=20G \
  --backend-parameters memory=8g,vcpus=2 \
  test-01.torproject.org

Do not forget to follow the next steps, above.

No DRBD, test machine

A simple test machine, with only 10G of disk, 1G of RAM, and 1 CPU, without DRBD, in the FSN cluster:

gnt-instance add \
      -o debootstrap+trixie \
      -t plain --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --backend-parameters memory=1g,vcpus=1 \
      test-01.torproject.org

Do not forget to follow the next steps, above.

Don't be afraid to create plain machines: they can be easily converted to drbd (with gnt-instance modify -t drbd) and the nodes' disks are already in RAID-1. What you lose is:

  • High availability during node reboots
  • Faster disaster recovery in case of a node failure

What you gain is:

  • Improved performance
  • Less (2x!) disk usage

iSCSI integration

To create a VM with iSCSI backing, a disk must first be created on the SAN, then adopted in a VM, which then needs to be reinstalled on top of that disk. This is typically how large disks were provisioned in the (now defunct) gnt-chi cluster, in the Cymru POP.

The following instructions assume you are on a node with an iSCSI initiator properly set up, and with the SAN cluster management tools installed. They also assume you are familiar with the SMcli tool; see the storage servers documentation for an introduction.

  1. create a dedicated disk group and virtual disk on the SAN, assign it to the host group and propagate the multipath config across the cluster nodes:

    /usr/local/sbin/tpo-create-san-disks --san chi-node-03 --name test-01 --capacity 500
    
  2. confirm that multipath works; it should look something like this:

    root@chi-node-01:~# multipath -ll
    test-01 (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi
    size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw
    |-+- policy='round-robin 0' prio=6 status=active
    | |- 11:0:0:4 sdi 8:128 active ready running
    | |- 12:0:0:4 sdj 8:144 active ready running
    | `- 9:0:0:4  sdh 8:112 active ready running
    `-+- policy='round-robin 0' prio=1 status=enabled
      |- 10:0:0:4 sdk 8:160 active ghost running
      |- 7:0:0:4  sdl 8:176 active ghost running
      `- 8:0:0:4  sdm 8:192 active ghost running
    root@chi-node-01:~#
    
  3. adopt the disk in Ganeti:

    gnt-instance add \
          -n chi-node-01.torproject.org \
          -o debootstrap+trixie \
          -t blockdev --no-wait-for-sync \
          --net 0:ip=pool,network=gnt-chi-01 \
          --no-ip-check \
          --no-name-check \
          --disk 0:adopt=/dev/disk/by-id/dm-name-test-01 \
          --backend-parameters memory=8g,vcpus=2 \
          test-01.torproject.org
    

    NOTE: the actual node must be manually picked because the hail allocator doesn't seem to know about block devices.

    NOTE: mixing DRBD and iSCSI volumes on a single instance is not supported.

  4. at this point, the VM probably doesn't boot, because for some reason gnt-instance-debootstrap doesn't fire when disks are adopted. So you need to reinstall the machine, which involves stopping it first:

    gnt-instance shutdown --timeout=0 test-01
    gnt-instance reinstall test-01
    

    HACK one: the current installer fails on weird partitioning errors, see upstream bug 13. We applied this patch as a workaround to avoid failures when the installer attempts to partition the virtual disk.

From here on, follow the next steps above.

TODO: This would ideally be automated by an external storage provider, see the storage reference for more information.

Troubleshooting

If a Ganeti instance install fails, it will show the end of the install log, for example:

Thu Aug 26 14:11:09 2021  - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org
Thu Aug 26 14:11:09 2021  - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Thu Aug 26 14:11:09 2021  - INFO: Chose IP 38.229.82.29 from network gnt-chi-01
Thu Aug 26 14:11:10 2021 * creating instance disks...
Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config
Thu Aug 26 14:12:58 2021 adding disks to cluster config
Thu Aug 26 14:13:00 2021 * checking mirrors status
Thu Aug 26 14:13:01 2021  - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated)
Thu Aug 26 14:13:01 2021  - INFO: - device disk/2:  0.60% done, 55m 26s remaining (estimated)
Thu Aug 26 14:13:01 2021 * checking mirrors status
Thu Aug 26 14:13:02 2021  - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated)
Thu Aug 26 14:13:02 2021  - INFO: - device disk/2:  0.60% done, 52m 13s remaining (estimated)
Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS
Thu Aug 26 14:13:03 2021 * running the instance OS create scripts...
Thu Aug 26 14:16:31 2021 * resuming disk sync
Failure: command execution error:
Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file:
Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ...
Setting up openssh-server (1:7.9p1-10+deb10u2) ...
Creating SSH2 RSA key; this may take some time ...
2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519)
Created symlink /etc/systemd/system/sshd.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
invoke-rc.d: could not determine current runlevel
Setting up ssh (1:7.9p1-10+deb10u2) ...
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for libc-bin (2.28-10) ...
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100
Using disk /dev/drbd4 as swap...
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4
Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: )
root@chi-node-01:~#

Here, the failure that tripped up the install is:

Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)

But the actual error is higher up; we need to look at the logs on the server, in this case in chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log, where we can find the real problem:

Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: Couldn't identify type of root file system for fsck hook
/etc/kernel/postinst.d/zz-update-grub:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?).
run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1
dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure):
 installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1

In this case, oddly enough, even though Ganeti thought the install had failed, the machine can actually start:

gnt-instance start tb-pkgstage-01.torproject.org

... and after a while, we can even get a console:

gnt-instance console tb-pkgstage-01.torproject.org

In that case, the procedure can just continue from here: reset the root password and make sure you finish the install:

apt install linux-image-amd64

In the above case, the sources-list post-install hook was buggy: it wasn't mounting /dev and friends before launching the upgrades, which was causing issues when a kernel upgrade was queued.

And if you are debugging an installer and by mistake end up with half-open filesystems and stray DRBD devices, do take a look at the LVM and DRBD documentation.

Modifying an instance

CPU, memory changes

It's possible to change the IP, CPU, or memory allocation of an instance using the gnt-instance modify command:

gnt-instance modify -B vcpus=4,memory=8g test1.torproject.org
gnt-instance reboot test1.torproject.org

Note that the --hotplug-if-possible setting might make the reboot unnecessary. Test and update this section to remove this note or the reboot entry. Ganeti 3.1 makes hotplugging default.
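
For example, this variant should apply the change without a reboot when the hypervisor supports it (untested here, hence the note above):

gnt-instance modify --hotplug-if-possible -B vcpus=4,memory=8g test1.torproject.org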

Note that this can be more easily done with a Fabric task which will handle wall warnings, delays, silences and so on, using the standard reboot procedures:

fab -H idle-fsn-01.torproject.org ganeti.modify vcpus=4,memory=8g

If you get a cryptic failure (TODO: add sample output) about policy being violated while you're not actually violating the stated policy, it's possible this VM was already violating the policy and the changes you proposed are okay.

In that case (and only in that case!) it's okay to bypass the policy with --ignore-ipolicy. Otherwise, discuss this with a fellow sysadmin, and see if that VM really needs that many resources or if the policies need to be changed.
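
In that case, the override looks like this (same change as above, with the policy check bypassed):

gnt-instance modify --ignore-ipolicy -B vcpus=4,memory=8g test1.torproject.org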

IP address change

IP address changes require a full stop of the instance and manual changes to the /etc/network/interfaces* files:

gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
gnt-instance stop test1.torproject.org

The renumbering can be done with Fabric, with:

./ganeti -H test1.torproject.org renumber-instance --ganeti-node $PRIMARY_NODE

Note that the $PRIMARY_NODE must be passed here, not the "master"!

Alternatively, it can also be done by hand:

gnt-instance start test1.torproject.org
gnt-instance console test1.torproject.org
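
To find the primary node to pass as --ganeti-node above, a quick query works (pnode is a standard gnt-instance list output field):

gnt-instance list -o name,pnode test1.torproject.org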

Resizing disks

The gnt-instance grow-disk command can be used to change the size of the underlying device:

gnt-instance grow-disk --absolute test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org

The number 0, in this context, indicates the first disk of the instance. The amount specified is the final disk size (because of the --absolute flag). In the above example, the final disk size will be 16GB. To add space to the existing disk, remove the --absolute flag:

gnt-instance grow-disk test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org

In the above example, 16GB will be ADDED to the disk. Be careful with resizes, because it's not possible to revert such a change: grow-disk does not support shrinking disks. The only way to revert the change is by exporting / importing the instance.
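
Since a grow cannot be undone, it can be worth double-checking the instance's current disk allocation before growing, for example (disk_usage is the total space the instance occupies on its nodes):

gnt-instance list -o name,disk_template,disk_usage test1.torproject.org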

Note that the reboot above will impose a downtime. See upstream bug 28 about improving that; Ganeti 3.1 has support for reboot-less resizes.

Then the filesystem needs to be resized inside the VM:

ssh root@test1.torproject.org 

Resizing under LVM

Use pvs to display information about the physical volumes:

root@cupani:~# pvs
PV         VG        Fmt  Attr PSize   PFree   
/dev/sdc   vg_test   lvm2 a--  <8.00g  1020.00m

Resize the physical volume to take up the new space:

pvresize /dev/sdc

Use lvs to display information about logical volumes:

# lvs
LV            VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
var-opt       vg_test-01     -wi-ao---- <10.00g                                                    
test-backup vg_test-01_hdd   -wi-ao---- <20.00g            

Use lvextend to add space to the volume:

lvextend -l '+100%FREE' vg_test-01/var-opt

Finally resize the filesystem:

resize2fs /dev/vg_test-01/var-opt

See also the LVM howto, particularly if the lvextend step fails with:

  Unable to resize logical volumes of cache type.

Resizing without LVM, no partitions

If there's no LVM inside the VM (a more common configuration nowadays), the above procedure will obviously not work. If this is a secondary disk (e.g. /dev/sdc), there is a good chance the filesystem was created directly on the device and that you do not need to repartition the drive. This is an example of a good configuration if we want to resize sdc:

root@bacula-director-01:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0      2:0    1    4K  0 disk 
sda      8:0    0   10G  0 disk 
└─sda1   8:1    0   10G  0 part /
sdb      8:16   0    2G  0 disk [SWAP]
sdc      8:32   0  250G  0 disk /srv

Note that if we needed to resize sda, we'd have to follow the other procedure, in the next section.

If we check the free disk space on the device we will notice it has not changed yet:

# df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        196G  160G   27G  86% /srv

The resize is then simply:

# resize2fs /dev/sdc
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sdc is mounted on /srv; on-line resizing required
old_desc_blocks = 25, new_desc_blocks = 32
The filesystem on /dev/sdc is now 65536000 (4k) blocks long.

Note that for XFS filesystems, the above command is simply:

xfs_growfs /dev/sdc

Read on for the most complicated scenario.

Resizing without LVM, with partitions

If the filesystem to resize is not directly on the device, you will need to resize the partition manually, which can be done using sfdisk. In the following example we have a sda1 partition that we want to extend from 20G to 40G to fill up the free space on /dev/sda. Here is what the partition layout looks like before the resize:

# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]

We use sfdisk to resize the partition to take up all available space, in this case with the following magic:

echo ", +" | sfdisk -N 1 --no-act /dev/sda

Note the --no-act here: the above is just a preview to make sure you will do the right thing. Remove that flag to actually make the change:

echo ", +" | sfdisk -N 1 --no-reread /dev/sda

TODO: next time, test with --force instead of --no-reread to see if we still need a reboot.

Here's a working example:

# echo ", +" | sfdisk -N 1 --no-reread /dev/sda
Disk /dev/sda: 40 GiB, 42949672960 bytes, 83886080 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Old situation:

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 41943039 41940992  20G 83 Linux

/dev/sda1: 
New situation:
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 83886079 83884032  40G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.

Note that the kernel still uses the old partition table:

# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]

So we need to reboot:

reboot

Note: a previous version of this guide was using fdisk instead, but that guide was destroying and recreating the partition, which seemed too error-prone. The above procedure is more annoying (because of the reboot) but should be less dangerous.

Now we check the partitions again:

# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0      2:0    1   4K  0 disk 
sda      8:0    0  40G  0 disk 
└─sda1   8:1    0  40G  0 part /
sdb      8:16   0   4G  0 disk [SWAP]

If we check the free space on the device, we will notice it has not changed yet:

# df -h  /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   16G  2.8G  86% /

We need to resize it:

# resize2fs /dev/sda1
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sda1 is mounted on /; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 3
The filesystem on /dev/sda1 is now 10485504 (4k) blocks long.

The resize is now complete.

Resizing an iSCSI LUN

All the above procedures detail the normal use case where disks are hosted as "plain" files or with the DRBD backend. However, some instances (most notably in the, now defunct, gnt-chi cluster) have their storage backed by an iSCSI SAN.

Growing a disk hosted on a SAN like the Dell PowerVault MD3200i involves several steps beginning with resizing the LUN itself. In the example below, we're going to grow the disk associated with the tb-build-03 instance.

It should be noted that the instance was set up in a peculiar way: it has one LUN per partition, instead of one big LUN partitioned correctly. The instructions below therefore mention a LUN named tb-build-03-srv, but normally there should be a single LUN named after the hostname of the machine, in this case simply tb-build-03.

First, we identify how much space is available on the virtual disks' diskGroup:

# SMcli -n chi-san-01 -c "show allVirtualDisks summary;"

STANDARD VIRTUAL DISKS SUMMARY
Number of standard virtual disks: 5

Name                Thin Provisioned     Status     Capacity     Accessible by       Source
tb-build-03-srv     No                   Optimal    700.000 GB   Host Group gnt-chi  Disk Group 5

This shows that tb-build-03-srv is hosted on Disk Group "5":

# SMcli -n chi-san-01 -c "show diskGroup [5];"

DETAILS

   Name:              5

      Status:         Optimal
      Capacity:       1,852.026 GB
      Current owner:  RAID Controller Module in slot 1

      Data Service (DS) Attributes

         RAID level:                    5
         Physical Disk media type:      Physical Disk
         Physical Disk interface type:  Serial Attached SCSI (SAS)
         Enclosure loss protection:     No
         Secure Capable:                No
         Secure:                        No


      Total Virtual Disks:          1
         Standard virtual disks:    1
         Repository virtual disks:  0
         Free Capacity:             1,152.026 GB

      Associated physical disks - present (in piece order)
      Total physical disks present: 3

         Enclosure     Slot
         0             6
         1             11
         0             7

Free Capacity indicates about 1.5 TB of free space available, so we can go ahead with the actual resize:

# SMcli -n chi-san-01 -p $PASSWORD -c "set virtualdisk [\"tb-build-03-srv\"] addCapacity=100GB;"

Next, we need to make all nodes in the cluster rescan the iSCSI LUNs and have multipathd resize the device node. This is accomplished by running this command on the master node (e.g. chi-node-01):

# gnt-cluster command "iscsiadm -m node --rescan; multipathd -v3 -k\"resize map tb-build-03-srv\""

The success of this step can be validated by looking at the output of lsblk: the device nodes associated with the LUN should now display the new size. The output should be identical across the cluster nodes.
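
A quick way to run that check across the whole cluster, assuming the multipath map is named tb-build-03-srv as above:

# gnt-cluster command "lsblk /dev/mapper/tb-build-03-srv"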

In order for ganeti/qemu to make this extra space available to the instance, a reboot must be performed from outside the instance.

Then the normal resize procedure can happen inside the virtual machine, see resizing under LVM, resizing without LVM, no partitions, or Resizing without LVM, with partitions, depending on the situation.

Removing an iSCSI LUN

Use this procedure to remove a virtual disk from one of the iSCSI SANs.

First, we'll need to gather some information about the disk to remove:

  • Which SAN is hosting the disk

  • What LUN is assigned to the disk

  • The WWID of both the SAN and the virtual disk

    /usr/local/sbin/tpo-show-san-disks
    SMcli -n chi-san-03 -S -quick -c "show storageArray summary;" | grep "Storage array world-wide identifier"
    cat /etc/multipath/conf.d/test-01.conf

Second, remove the multipath config and reload:

gnt-cluster command rm /etc/multipath/conf.d/test-01.conf
gnt-cluster command "multipath -r ; multipath -w {disk-wwid} ; multipath -r"

Then, remove the iSCSI device nodes. Running iscsiadm --rescan does not remove LUNs which have been deleted from the SAN.

Be very careful with this command: it will delete device nodes without prejudice and cause data corruption if they are still in use!

gnt-cluster command "find /dev/disk/by-path/ -name \*{san-wwid}-lun-{lun} -exec readlink {} \; | cut -d/ -f3 | while read -d $'\n' n; do echo 1 > /sys/block/\$n/device/delete; done"

Finally, the disk group can be deleted from the SAN (all the virtual disks it contains will be deleted):

SMcli -n chi-san-03 -p $SAN_PASSWORD -S -quick -c "delete diskGroup [<disk-group-number>];"

Adding disks

A disk can be added to an instance with the modify command as well. This, for example, will add a 100GB disk to the test-01 instance on the vg_ganeti_hdd volume group, which is made of "slow" rotating disks:

gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd --no-wait-for-sync test-01.torproject.org
gnt-instance reboot test-01.torproject.org

Changing disk type

Say you have a test instance that was created with a plain disk template, but you actually want it in production with a drbd disk template. Switching to drbd is easy:

gnt-instance shutdown test-01
gnt-instance modify -t drbd test-01
gnt-instance start test-01

The second command will use the allocator to find a secondary node. If that fails, you can assign a node manually with -n.

You can also switch back to plain to make the instance non-redundant, although you should only do that in rare cases where you don't need the high availability provided by DRBD. Make sure the service admins on the machine are aware of the consequences of the change, which are essentially a longer recovery time in case of server failure, and lower availability because node reboots also affect the instance.

Essentially, plain instances are only for:

  • large disks (e.g. multi-terabyte) for which the 4x (2x for RAID-1, 2x for DRBD) disk usage is too much
  • large IOPS requirements (e.g. lots of writes) for which the wear on the drives is too much

See also the upstream procedure and design document.

Removing or detaching a disk

If you need to destroy a volume from an instance, you can use the remove flag to the gnt-instance modify command. First, you must identify the disk's UUID using gnt-instance info, then:

gnt-instance modify --disk <uuid>:remove test-01

If you just want to detach the disk without destroying its data, use the detach keyword instead:

gnt-instance modify --disk <uuid>:detach test-01

Once a disk is detached, it will show up as an "orphan" disk in gnt-cluster verify until it's actually removed. On the secondary, this can be done with lvremove. But on the primary, it's trickier because the DRBD device might still be layered on top of it, see Deleting a device after it was manually detached for those instructions.
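
On the secondary, the cleanup looks something like this (a sketch only: the <uuid>.diskN_data and _meta names vary per disk, so check what gnt-cluster verify and lvs report before removing anything):

lvs -o +tags | grep test-01     # confirm which LVs belong to the detached disk
lvremove vg_ganeti/<uuid>.diskN_data vg_ganeti/<uuid>.diskN_meta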

Adding a network interface on the rfc1918 vlan

We have a VLAN on which some VMs without public addresses sit. Its VLAN ID is 4002 and it's backed by the Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic". Note that traffic on this VLAN travels in the clear between nodes.

To add an instance to this VLAN, give it a second network interface using:

gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org

Destroying an instance

This totally deletes the instance, including all mirrors and everything; be very careful with it:

gnt-instance remove test01.torproject.org

Getting information

Information about an instance can be found in the rather verbose gnt-instance info output:

root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
- Instance name: tb-build-02.torproject.org
  UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
  Serial number: 5
  Creation time: 2020-12-15 14:06:41
  Modification time: 2020-12-15 14:07:31
  State: configured to be up, actual state is up
  Nodes: 
    - primary: fsn-node-03.torproject.org
      group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
    - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
  Operating system: debootstrap+buster

A quick command to show the primary/secondary for a given instance:

gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes

A similar command shows the primary and secondary for all instances, along with extra information (like the CPU count, memory and disk usage):

gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort

It can be useful to run this in a loop to see changes:

watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'

Disk operations (DRBD)

Instances should be set up using the DRBD backend, in which case you should probably take a look at the DRBD documentation if you have problems with it. Ganeti handles most of the logic there, so that should generally not be necessary.
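
If you do need to look at DRBD directly, the state of all DRBD devices on a node can be checked with one of these (depending on the DRBD version installed):

cat /proc/drbd     # summary from the kernel module (DRBD 8.x)
drbdadm status     # per-resource status (newer DRBD releases)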

Identifying volumes of an instance

As noted above, ganeti handles most of the complexity around managing DRBD and LVM volumes. Sometimes though it might be interesting to know which volume is associated with which instance, especially for confirming an operation before deleting a stray device.

Ganeti keeps that information handy. On the cluster master you can extract information about all volumes on all nodes:

gnt-node volumes

If you're already connected to one node, you can check which LVM volumes correspond to which instance:

lvs -o+tags

Evaluating cluster capacity

This will list instances repeatedly, but also show their assigned memory, and compare it with the node's capacity:

gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
echo &&
gnt-node list

The latter does not show disk usage for secondary volume groups (see upstream issue 1379); for a complete picture of disk usage, use:

gnt-node list-storage

The gnt-cluster verify command will also check to see if there's enough space on secondaries to account for the failure of a node. Healthy output looks like this:

root@fsn-node-01:~# gnt-cluster verify
Submitted jobs 48030, 48031
Waiting for job 48030 ...
Fri Jan 17 20:05:42 2020 * Verifying cluster config
Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
Waiting for job 48031 ...
Fri Jan 17 20:05:42 2020 * Verifying group 'default'
Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
Fri Jan 17 20:05:45 2020 * Verifying node status
Fri Jan 17 20:05:45 2020 * Verifying instance status
Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
Fri Jan 17 20:05:45 2020 * Other Notes
Fri Jan 17 20:05:45 2020 * Hooks Results

A sick node would have said something like this instead:

Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail

See the Ganeti manual for a more extensive example.

Also note the hspace -L command, which can tell you how many instances can be created in a given cluster. It uses the "standard" instance template defined in the cluster (which we haven't configured yet).

Moving instances and failover

Ganeti is smart about assigning instances to nodes. There's also a command (hbal) to automatically rebalance the cluster (see below). If for some reason hbal doesn’t do what you want or you need to move things around for other reasons, here are a few commands that might be handy.

Make an instance switch to using its secondary:

gnt-instance migrate test1.torproject.org

Make all instances on a node switch to their secondaries:

gnt-node migrate fsn-node-02.torproject.org

The migrate commands do a "live" migration, which should avoid any downtime. It might be preferable to actually shut down the machine for some reason (for example, if we want to reboot because of a security upgrade), or we might not be able to live-migrate because the node is down. In this case, we do a failover:

gnt-instance failover test1.torproject.org

The gnt-node evacuate command can also be used to "empty" a given node altogether, in case of an emergency:

gnt-node evacuate -I . fsn-node-02.torproject.org

Similarly, the gnt-node failover command can be used to hard-recover from a completely crashed node:

gnt-node failover fsn-node-02.torproject.org

Note that you might need the --ignore-consistency flag if the node is unresponsive.
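
For example, if fsn-node-02 were completely unreachable:

gnt-node failover --ignore-consistency fsn-node-02.torproject.org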

Importing external libvirt instances

Assumptions:

  • INSTANCE: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. chiwui.torproject.org)

  • SPARE_NODE: a ganeti node with free space (e.g. fsn-node-03.torproject.org) where the INSTANCE will be migrated

  • MASTER_NODE: the master ganeti node (e.g. fsn-node-01.torproject.org)

  • KVM_HOST: the machine which we migrate the INSTANCE from

  • the INSTANCE has only root and swap partitions

  • the SPARE_NODE has space in /srv/ to host all the virtual machines to import, to check, use:

     fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' <f | awk '{s+=$1} END {print s}'
    

    You will very likely need to create a /srv big enough for this, for example:

     lvcreate -L 300G vg_ganeti -n srv-tmp &&
     mkfs /dev/vg_ganeti/srv-tmp &&
     mount /dev/vg_ganeti/srv-tmp /srv
    

Import procedure:

  1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating cluster capacity" above, when in doubt) and find on which KVM HOST the INSTANCE lives

  2. copy the disks, without downtime:

    ./ganeti -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST
    
  3. copy the disks again, this time suspending the machine:

    ./ganeti -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt
    
  4. renumber the host:

    ./ganeti -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
    
  5. test services by changing your /etc/hosts, possibly warning service admins:

    Subject: $INSTANCE IP address change planned for Ganeti migration

    I will soon migrate this virtual machine to the new Ganeti cluster. This will involve an IP address change which might affect the service.

    Please let me know if there are any problems you can think of. In particular, do let me know if any internal (inside the server) or external (outside the server) services hardcode the IP address of the virtual machine.

    A test instance has been setup. You can test the service by adding the following to your /etc/hosts:

    116.202.120.182 $INSTANCE
    2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE
    
  6. destroy test instance:

    gnt-instance remove $INSTANCE
    
  7. lower TTLs to 5 minutes. this procedure varies a lot according to the service, but generally if all DNS entries are CNAMEs pointing to the main machine domain name, the TTL can be lowered by adding a dnsTTL entry in the LDAP entry for this host. For example, this sets the TTL to 5 minutes:

    dnsTTL: 300
    

    Then to make the changes immediate, you need the following commands:

    ssh root@alberti.torproject.org sudo -u sshdist ud-generate &&
    ssh root@nevii.torproject.org ud-replicate
    

    Warning: if you migrate one of the hosts ud-ldap depends on, this can fail and not only the TTL will not update, but it might also fail to update the IP address in the below procedure. See ticket 33766 for details.

  8. shutdown original instance and redo migration as in step 3 and 4:

    fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' &&
    ./ganeti -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt &&
    ./ganeti -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
    
  9. final test procedure

    TODO: establish host-level test procedure and run it here.

  10. switch to DRBD, still on the Ganeti MASTER NODE:

    gnt-instance stop $INSTANCE &&
    gnt-instance modify -t drbd $INSTANCE &&
    gnt-instance failover -f $INSTANCE &&
    gnt-instance start $INSTANCE
    

    The above can sometimes fail if the allocator is upset about something in the cluster, for example:

    Can's find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2
    

    This situation is covered by ticket 33785. To work around the allocator, you can specify a secondary node directly:

    gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
    gnt-instance failover -f $INSTANCE &&
    gnt-instance start $INSTANCE
    

    TODO: move into fabric, maybe in a libvirt-import-live or post-libvirt-import job that would also do the renumbering below

  11. change IP address in the following locations:

    • LDAP (ipHostNumber field, but also change the physicalHost and l fields!). Also drop the dnsTTL attribute while you're at it.

    • Puppet (grep in tor-puppet source, run puppet agent -t; ud-replicate on pauli)

    • DNS (grep in tor-dns source, puppet agent -t; ud-replicate on nevii)

    • reverse DNS (upstream web UI, e.g. Hetzner Robot)

    • grep for the host's IP address on itself:

       grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
       grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv
      
    • grep for the host's IP on all hosts:

       cumin-all-puppet
       cumin-all 'grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'
      

    TODO: move those jobs into fabric

  12. retire old instance (only a tiny part of retire-a-host):

    fab -H $INSTANCE retire.retire-instance --parent-host $KVM_HOST
    
  13. update the Nextcloud spreadsheet to remove the machine from the KVM host

  14. warn users about the migration, for example:

To: tor-project@lists.torproject.org
Subject: cupani AKA git-rw IP address changed

The main git server, cupani, is the machine you connect to when you push or pull git repositories over ssh to git-rw.torproject.org. That machine has been migrated to the new Ganeti cluster.

This required an IP address change from:

78.47.38.228 2a01:4f8:211:6e8:0:823:4:1

to:

116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2

DNS has been updated and preliminary tests show that everything is mostly working. You will get a warning about the IP address change when connecting over SSH, which will go away after the first connection.

Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.

That is normal. The SSH fingerprints of the host did not change.

Please do report any other anomaly using the normal channels:

https://gitlab.torproject.org/tpo/tpa/team/-/wikis/support

The service was unavailable for about an hour during the migration.

Importing external libvirt instances, manual

This procedure is now easier to accomplish with the Fabric tools written especially for this purpose. Use the above procedure instead. This is kept for historical reference.

Assumptions:

  • INSTANCE: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. chiwui.torproject.org)
  • SPARE_NODE: a ganeti node with free space (e.g. fsn-node-03.torproject.org) where the INSTANCE will be migrated
  • MASTER_NODE: the master ganeti node (e.g. fsn-node-01.torproject.org)
  • KVM_HOST: the machine which we migrate the INSTANCE from
  • the INSTANCE has only root and swap partitions

Import procedure:

  1. pick a viable SPARE NODE to import the instance (see "evaluating cluster capacity" above, when in doubt), log in to the three servers, and set the proper environment everywhere, for example:

    MASTER_NODE=fsn-node-01.torproject.org
    SPARE_NODE=fsn-node-03.torproject.org
    KVM_HOST=kvm1.torproject.org
    INSTANCE=test.torproject.org
    
  2. establish VM specs, on the KVM HOST:

    • disk space in GiB:

      for disk in /srv/vmstore/$INSTANCE/*; do
          printf "$disk: "
          echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
      done
      
    • number of CPU cores:

      sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml
      
    • memory, assuming from KiB to GiB:

      echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l
      

      TODO: make sure the memory line is in KiB and that the number makes sense.

    • on the INSTANCE, find the swap device UUID so we can recreate it later:

      blkid -t TYPE=swap -s UUID -o value
      
  3. setup a copy channel, on the SPARE NODE:

    ssh-agent bash
    ssh-add /etc/ssh/ssh_host_ed25519_key
    cat /etc/ssh/ssh_host_ed25519_key.pub
    

    on the KVM HOST:

    echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root
    
  4. copy the .qcow file(s) over, from the KVM HOST to the SPARE NODE:

    rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
    rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true
    

    Note: it's possible there is not enough room in /srv: in the base Ganeti installs, everything is in the same root partition (/) which will fill up if the instance is (say) over ~30GiB. In that case, create a filesystem in /srv:

    (mkdir /root/srv && mv /srv/* /root/srv true) || true &&
    lvcreate -L 200G vg_ganeti -n srv &&
    mkfs /dev/vg_ganeti/srv &&
    echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
    mount /srv &&
    ( mv /root/srv/* ; rmdir /root/srv )
    

    This partition can be reclaimed once the VM migrations are completed, as it needlessly takes up space on the node.

  5. on the SPARE NODE, create and initialize a logical volume with the predetermined size:

    lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
    mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
    lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
    qemu-img convert /srv/$INSTANCE-root  -O raw /dev/vg_ganeti/$INSTANCE-root
    lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd
    qemu-img convert /srv/$INSTANCE-lvm  -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
    

    Note how we assume two disks above, but the instance might have a different configuration that would require changing the above. The above, common, configuration is to have an LVM disk separate from the "root" disk, the former being on a HDD, but the HDD is sometimes completely omitted and sizes can differ.

    Sometimes it might be worth using pv to get progress on long transfers:

    qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
    pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k
    

    TODO: ideally, the above procedure (and many steps below as well) would be automatically deduced from the disk listing established in the first step.

  6. on the MASTER NODE, create the instance, adopting the LV:

    gnt-instance add -t plain \
        -n fsn-node-03 \
        --disk 0:adopt=$INSTANCE-root \
        --disk 1:adopt=$INSTANCE-swap \
        --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
        --backend-parameters memory=2g,vcpus=2 \
        --net 0:ip=pool,network=gnt-fsn \
        --no-name-check \
        --no-ip-check \
        -o debootstrap+default \
        $INSTANCE
    
  7. cross your fingers and watch the party:

    gnt-instance console $INSTANCE
    
  8. IP address change on new instance:

    edit /etc/hosts and /etc/network/interfaces by hand and add the IPv4 and IPv6 addresses. The IPv4 configuration can be found in:

      gnt-instance show $INSTANCE
    

    The IPv6 address can be guessed by concatenating 2a01:4f8:fff0:4f:: and the IPv6 link-local address without the fe80:: prefix. For example, a link-local address of fe80::266:37ff:fe65:870f/64 should yield the following configuration:

      iface eth0 inet6 static
          accept_ra 0
          address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
          gateway 2a01:4f8:fff0:4f::1
    

    TODO: reuse gnt-debian-interfaces from the ganeti puppet module script here?

  9. functional tests: change your /etc/hosts to point to the new server and see if everything still kind of works

  10. shutdown original instance

  11. resync and reconvert image, on the Ganeti MASTER NODE:

    gnt-instance stop $INSTANCE
    

    on the Ganeti node:

    rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
    qemu-img convert /srv/$INSTANCE-root  -O raw /dev/vg_ganeti/$INSTANCE-root &&
    rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
    qemu-img convert /srv/$INSTANCE-lvm  -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
    
  12. switch to DRBD, still on the Ganeti MASTER NODE:

    gnt-instance modify -t drbd $INSTANCE
    gnt-instance failover $INSTANCE
    gnt-instance startup $INSTANCE
    
  13. redo IP address change in /etc/network/interfaces and /etc/hosts

  14. final functional test

  15. change IP address in the following locations:

    • LDAP (ipHostNumber field, but also change the physicalHost and l fields!)
    • Puppet (grep in tor-puppet source, run puppet agent -t; ud-replicate on pauli)
    • DNS (grep in tor-dns source, puppet agent -t; ud-replicate on nevii)
    • reverse DNS (upstream web UI, e.g. Hetzner Robot)
  16. decommission old instance (retire-a-host)

Troubleshooting

  • if boot takes a long time and you see a message like this on the console:

     [  *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)
    

    ... which is generally followed by:

     [DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
     [DEPEND] Dependency failed for Swap.
    

    it means the swap device UUID wasn't set up properly and does not match the one provided in /etc/fstab. That is probably because you missed the mkswap --uuid step documented above.

References

  • Upstream docs have the canonical incantation:

     gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME
    
  • DSA docs also use disk adoption and have a procedure to migrate to DRBD

  • Riseup docs suggest creating a VM without installing, shutting down and then syncing

Ganeti supports importing and exporting from the Open Virtualization Format (OVF), but unfortunately it doesn't seem libvirt supports exporting to OVF. There's a virt-convert tool which can import OVF, but not the reverse. The libguestfs library also has a converter but it also doesn't support exporting to OVF or anything Ganeti can load directly.

So people have written their own conversion tools or their own conversion procedure.

Ganeti also supports file-backed instances but "adoption" is specifically designed for logical volumes, so it doesn't work for our use case.

Rebooting

Those hosts need special care, as we can accomplish zero-downtime reboots on those machines. The reboot script in fabric-tasks takes care of the special steps involved (which is basically to empty a node before rebooting it).

Such a reboot should be run interactively.

Full fleet reboot

This process is long and rather disruptive. Notifications should be posted on IRC, in #tor-project, as instances are rebooted.

A full fleet reboot can take about 2 hours, if all goes well. You'll however need to keep your eyes on the process, since fabric will sometimes reach a host before its LUKS crypto has been unlocked by Mandos, and will sit there waiting for you to press enter before trying again.

This command will reboot the entire Ganeti fleet, including the hosted VMs, use this when (for example) you have kernel upgrades to deploy everywhere:

fab -H $(echo fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org | sed 's/ /,/g') fleet.reboot-host --no-ganeti-migrate

In parallel, you can probably also run:

fab -H $(echo dal-node-0{1,2,3}.torproject.org | sed 's/ /,/g') fleet.reboot-host --no-ganeti-migrate

Watch out for nodes that hold redundant mirrors however.

Cancelling reboots

Note that you can cancel a node reboot with --kind cancel. For example, if you are currently rebooting node fsn-node-05, you can hit control-c and do:

fab -H fsn-node-05.torproject.org fleet.reboot-host --kind=cancel

... to cancel the reboot of the node and its instances. This can be done when the following message is showing:

waiting 10 minutes for reboot to complete at ...

... as long as there's still time left of course.

Node-only reboot

In certain cases (Open vSwitch restarts, for example), only the nodes need a reboot, not the instances. In that case, you want to reboot the nodes, but before that, migrate the instances off the node and then migrate them back when done. This incantation should do so:

fab -H $(echo fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org | sed 's/ /,/g') fleet.reboot-host --reason 'Open vSwitch upgrade'

This should cause no user-visible disruption.

See also the above note about canceling reboots.

Instance-only restarts

An alternative procedure should be used if only the ganeti.service requires a restart. This happens when a Qemu dependency has been upgraded, for example libxml or OpenSSL.

This will only migrate the VMs without rebooting the hosts:

fab -H $(echo fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org | sed 's/ /,/g') \
   fleet.reboot-host --kind=cancel --reason 'qemu flagged in needrestart'

This should cause no user-visible disruption, as it migrates all the VMs around and back.

That should reset the Qemu processes across the cluster and refresh the libraries Qemu depends on.

If you actually need to restart the instances in place (and not migrate them), you need to use the --skip-ganeti-empty flag instead:

fab -H $(echo dal-node-0{1,2,3}.torproject.org | sed 's/ /,/g') \
    fleet.reboot-host --skip-ganeti-empty --kind=cancel --reason 'qemu flagged in needrestart'

Rebalancing a cluster

After a reboot or a downtime, all instances might end up on the same machine. This is normally handled by the reboot script, but it might be desirable to do this by hand if there was a crash or another special condition.

This can be easily corrected with this command, which will spread instances around the cluster to balance it:

hbal -L -C -v -p

The above will show the proposed solution, with the state of the cluster before and after (-p), and the commands to get there (-C). To actually execute the commands, you can copy-paste them; an alternative is to pass the -X argument, which tells hbal to issue the commands itself:

hbal -L -C -v -p -X

This will automatically move the instances around and rebalance the cluster. Here's an example run on a small cluster:

root@fsn-node-01:~# gnt-instance list
Instance                          Hypervisor OS                 Primary_node               Status  Memory
loghost01.torproject.org          kvm        debootstrap+buster fsn-node-02.torproject.org running   2.0G
onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-02.torproject.org running  12.0G
static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
root@fsn-node-01:~# hbal -L -X
Loaded 2 nodes, 5 instances
Group size 2 nodes, 5 instances
Selected node group: default
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 8.45007519
Trying to minimize the CV...
    1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   4.98124611 a=f
    2. loghost01          fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   1.78271883 a=f
Cluster score improved from 8.45007519 to 1.78271883
Solution length=2
Got job IDs 16345
Got job IDs 16346
root@fsn-node-01:~# gnt-instance list
Instance                          Hypervisor OS                 Primary_node               Status  Memory
loghost01.torproject.org          kvm        debootstrap+buster fsn-node-01.torproject.org running   2.0G
onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-01.torproject.org running  12.0G
static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G

In the above example, you should notice that the web-fsn instances both ended up on the same node. That's because the balancer did not know that they should be distributed. A special configuration was done, below, to avoid that problem in the future. But as a workaround, instances can also be moved by hand and the cluster re-balanced.

Also notice that -X does not show the job output; use ganeti-watch-jobs for that, in another terminal. See the job inspection section for more details.

Redundant instances distribution

Some instances are redundant across the cluster and should not end up on the same node. A good example is the web-fsn-01 and web-fsn-02 pair, which in theory serve similar traffic. If they end up on the same node, it might flood the network on that machine, or at least defeat the purpose of having redundant machines.

The way to ensure they get distributed properly by the balancing algorithm is to "tag" them. For the web nodes, for example, this was performed on the master:

gnt-cluster add-tags htools:iextags:service
gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn
gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn

This tells Ganeti that web-fsn is an "exclusion tag" and the optimizer will not try to schedule instances with those tags on the same node.

To see which tags are present, use:

# gnt-cluster list-tags
htools:iextags:service

You can also find which objects are assigned a tag with:

# gnt-cluster search-tags service
/cluster htools:iextags:service
/instances/web-fsn-01.torproject.org service:web-fsn
/instances/web-fsn-02.torproject.org service:web-fsn

IMPORTANT: a previous version of this article mistakenly indicated that a new cluster-level tag had to be created for each service. That method did not work. The hbal manpage explicitly mentions that the cluster-level tag is a prefix that can be used to create multiple such tags. This configuration also happens to be simpler and easier to use...

HDD migration restrictions

Cluster balancing works well until there are inconsistencies between how nodes are configured. In our case, some nodes have HDDs (Hard Disk Drives, AKA spinning rust) and others do not. Therefore, it's not possible to move an instance from a node with a disk allocated on the HDD to a node that does not have such a disk.

Yet somehow the allocator is not smart enough to tell, and you will get the following error when doing an automatic rebalancing:

one of the migrate failed and stopped the cluster balance: Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd

In this case, it is trying to migrate the gitlab-02 server from fsn-node-01 (which has an HDD) to fsn-node-07 (which hasn't), which naturally fails. This is a known limitation of the Ganeti code. There has been a draft design document for multiple storage unit support since 2015, but it has never been implemented, and multiple issues have been reported upstream on the subject.

Unfortunately, there are no known workarounds for this, at least not that fix the hbal command. It is possible to exclude the faulty migration from the pool of possible moves, however, for example in the above case:

hbal -L -v -C -P --exclude-instances gitlab-02.torproject.org

It's also possible to use the --no-disk-moves option to avoid disk move operations altogether.

Both workarounds obviously do not correctly balance the cluster... Note that we have also tried to use htools:migration tags to work around that issue, but those do not work for secondary instances. For this we would need to set up node groups instead.

A good trick is to look at the solution proposed by hbal:

Trying to minimize the CV...
    1. tbb-nightlies-master fsn-node-01:fsn-node-02 => fsn-node-04:fsn-node-02   6.12095251 a=f r:fsn-node-04 f
    2. bacula-director-01   fsn-node-01:fsn-node-03 => fsn-node-03:fsn-node-01   4.56735007 a=f
    3. staticiforme         fsn-node-02:fsn-node-04 => fsn-node-02:fsn-node-01   3.99398707 a=r:fsn-node-01
    4. cache01              fsn-node-07:fsn-node-05 => fsn-node-07:fsn-node-01   3.55940346 a=r:fsn-node-01
    5. vineale              fsn-node-05:fsn-node-06 => fsn-node-05:fsn-node-01   3.18480313 a=r:fsn-node-01
    6. pauli                fsn-node-06:fsn-node-07 => fsn-node-06:fsn-node-01   2.84263128 a=r:fsn-node-01
    7. neriniflorum         fsn-node-05:fsn-node-02 => fsn-node-05:fsn-node-01   2.59000393 a=r:fsn-node-01
    8. static-master-fsn    fsn-node-01:fsn-node-02 => fsn-node-02:fsn-node-01   2.47345604 a=f
    9. polyanthum           fsn-node-02:fsn-node-07 => fsn-node-07:fsn-node-02   2.47257956 a=f
   10. forrestii            fsn-node-07:fsn-node-06 => fsn-node-06:fsn-node-07   2.45119245 a=f
Cluster score improved from 8.92360196 to 2.45119245

Look at the last column. The a= field shows what "action" will be taken: f is a failover (or "migrate"), and r: is a replace-disks, with the new secondary after the colon (:). In the above case, the proposed solution is correct: no secondary node falls in the range of nodes that lack HDDs (fsn-node-0[5-7]). If one of the disk replaces hits one of the nodes without an HDD, that is when you use --exclude-instances to find a better solution. A typical exclude is:

hbal -L -v -C -P --exclude-instances=bacula-director-01,tbb-nightlies-master,winklerianum,woronowii,rouyi,loghost01,materculae,gayi,weissii

Another option is to specifically look for instances that do not have an HDD and migrate only those. In my situation, gnt-cluster verify was complaining that fsn-node-02 was full, so I looked for all the instances on that node and found the ones which didn't have an HDD:

gnt-instance list -o  pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
  | sort | grep 'fsn-node-02' | awk '{print $3}' | \
  while read instance ; do
    printf "checking $instance: "
    if gnt-instance info $instance | grep -q hdd ; then
      echo "HAS HDD"
    else
      echo "NO HDD"
    fi
  done

Then you can manually run migrate -f (to fail over to the secondary) and replace-disks -n (to pick another secondary) on the instances that can be moved out of the first four machines (which have HDDs) to the last three (which do not); an example of both commands is sketched below. Look at the memory usage in gnt-node list to pick the best node.
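
As a sketch, with a hypothetical instance and target secondary, those two commands look like:

gnt-instance migrate -f test-01.torproject.org
gnt-instance replace-disks -n fsn-node-06.torproject.org test-01.torproject.org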

In general, if a given node in the first four is overloaded, a good trick is to look for one that can be failed over, with, for example:

gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'

... or, for a particular node (say fsn-node-04):

gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'

The instances listed there would be ones that can be migrated to their secondary to give fsn-node-04 some breathing room.

Adding and removing addresses on instances

Say you created an instance but forgot to assign an extra IP. You can still do so with:

gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
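
Conversely, a NIC can be removed with the same --net option and a remove action. A minimal sketch, assuming the address to drop is on the last NIC (-1 refers to the last index):

gnt-instance modify --net -1:remove test01.torproject.org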

Job inspection

Sometimes it can be useful to look at the active jobs. It might be, for example, that another user has queued a bunch of jobs in another terminal which you do not have access to, or some automated process did. Ganeti has this concept of "jobs" which can provide information about those.

The command gnt-job list will show the entire job history, and gnt-job list --running will show running jobs. gnt-job watch can be used to watch a specific job.
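
For example (the job ID here is hypothetical, taken from the list output):

gnt-job list --running
gnt-job info 12345
gnt-job watch 12345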

We have a wrapper called ganeti-watch-jobs which automatically shows the output of whatever job is currently running and exits when all jobs complete. This is particularly useful while rebalancing the cluster as hbal -X does not show the job output...

Open vSwitch crash course and debugging

Open vSwitch is used in the gnt-fsn cluster to connect the multiple machines with each other through Hetzner's "vswitch" system.

You will typically not need to deal with Open vSwitch, as Ganeti takes care of configuring the network on instance creation and migration. But if you believe there might be a problem with it, the quick checks below and the upstream Open vSwitch documentation are good starting points.
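
A quick first check of the Open vSwitch state on a node can be done with the standard ovs-vsctl tool (a sketch; br0 is the bridge name used in our configuration):

ovs-vsctl show
ovs-vsctl list-ports br0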

Accessing the QEMU control ports

There is a magic warp zone on the node where an instance is running:

nc -U /var/run/ganeti/kvm-hypervisor/ctrl/$INSTANCE.monitor

This drops you in the QEMU monitor which can do all sorts of things including adding/removing devices, save/restore the VM state, pause/resume the VM, do screenshots, etc.

There are many sockets in the ctrl directory, including:

  • .serial: the instance's serial port
  • .monitor: the QEMU monitor control port
  • .qmp: the same, but with a JSON interface that I can't figure out (the -qmp argument to qemu); see the sketch after this list
  • .kvmd: same as the above?
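
For the record, the .qmp socket speaks the QMP protocol: a line-based JSON dialogue in which QEMU sends a greeting and expects a qmp_capabilities command before accepting anything else. A minimal sketch of a session (query-status is just an example command):

nc -U /var/run/ganeti/kvm-hypervisor/ctrl/$INSTANCE.qmp
{"execute": "qmp_capabilities"}
{"execute": "query-status"}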

Instance backup and migration

The export/import mechanism can be used to export and import VMs one at a time. This can be used, for example, to migrate a VM between clusters or backup a VM before a critical change.

Note that this procedure is still a work in progress. A simulation was performed in tpo/tpa/team#40917; the proper procedure might vary from this significantly. In particular, there are some optimizations possible through things like zerofree and compression...

Also note that this migration has a lot of manual steps and is better accomplished using the move-instance command, documented in the Cross-cluster migrations section.

Here is the procedure to export a single VM, copy it to another cluster, and import it:

  1. find nodes to host the exported VM on the source cluster and the target cluster; they need enough disk space in /var/lib/ganeti/export to keep a copy of a snapshot of the VM:

    df -h /var/lib/ganeti/export
    

    Typically, you'd make a logical volume to fit more data in there:

    lvcreate -n export vg_ganeti -L200g &&
    mkfs -t ext4 /dev/vg_ganeti/export &&
    mkdir -p /var/lib/ganeti/export &&
    mount /dev/vg_ganeti/export /var/lib/ganeti/export
    

    Make sure you do that on both ends of the migration.

  2. have the right kernel modules loaded, which might require a reboot of the source node:

    modprobe dm_snapshot
    
  3. on the master of the source Ganeti cluster, export the VM to the source node. Use --noshutdown if you cannot afford downtime on the VM and are ready to lose data accumulated after the snapshot:

    gnt-backup export -n chi-node-01.torproject.org test-01.torproject.org
    gnt-instance stop test-01.torproject.org
    

    WARNING: this step currently does not work if there's a second disk (or swap device? to be confirmed), see this upstream issue for details. For now we're deploying the "nocloud" export/import mechanisms through Puppet to work around that problem, which means the whole disk is copied (as opposed to only the used parts)

  4. copy the VM snapshot from the source node to a node in the target cluster:

    mkdir -p /var/lib/ganeti/export
    rsync -ASHaxX --info=progress2 root@chi-node-01.torproject.org:/var/lib/ganeti/export/test-01.torproject.org/ /var/lib/ganeti/export/test-01.torproject.org/
    

    Note that this assumes the target cluster has root access on the source cluster. One way to make that happen is by creating a new SSH key:

    ssh-keygen -P "" -C 'sync key from dal-node-01'
    

    And dump that public key in /etc/ssh/userkeys/root.more on the source cluster.

  5. on the master of the target Ganeti cluster, import the VM:

    gnt-backup import -n dal-node-01:dal-node-02 --src-node=dal-node-01 --src-dir=/var/lib/ganeti/export/test-01.torproject.org --no-ip-check --no-name-check --net 0:ip=pool,network=gnt-dal-01 -t drbd --no-wait-for-sync test-01.torproject.org
    
  6. enter the restored server console to change the IP address:

    gnt-instance console test-01.torproject.org
    
  7. if everything looks good, change the IP in LDAP

  8. destroy the old VM

Cross-cluster migrations

If an entire cluster needs to be evacuated, the move-instance command can be used to automatically propagate instances between clusters.

Notes about issues and patches applied to move-instance script

Some serious configuration needs to be accomplished before the move-instance command can be used.

Also note that this procedure depends on a patched version of move-instance, which was changed after the 3.0 Ganeti release, see this comment for details. We also have patches on top of that which fix various issues we have found during the gnt-chi to gnt-dal migration, see this comment for a discussion.

On 2023-03-16, @anarcat uploaded a patched version of Ganeti to our internal repositories (on db.torproject.org) with a debdiff documented in this comment and featuring the following three patches.

An extra optimisation was reported as issue 1702 and patched on dal-node-01 and fsn-node-01 manually (see PR 1703, merged, not released).

move-instance configuration

Note that the script currently migrates only one VM at a time, because of the --net argument, a limitation which could eventually be waived.

Before you can launch an instance migration, use the following procedure to prepare the cluster. In this example, we migrate from the gnt-fsn cluster to gnt-dal.

  1. Run gnt-cluster verify on both clusters.

    (This is now handled by Puppet.) Ensure a move-instance user has been deployed to /var/lib/ganeti/rapi/users and that the cluster domain secret is identical across all nodes of both source and destination clusters.

  2. extract the public certificate from the RAPI certificate bundle on the source cluster:

    ssh fsn-node-01.torproject.org sed -n '/BEGIN CERT/,$p' /var/lib/ganeti/rapi.pem
    
  3. paste that in a certificate file on the target cluster:

    ssh dal-node-01.torproject.org tee gnt-fsn.crt
    
  4. enter the RAPI passwords from /var/lib/ganeti/rapi/users on both clusters in two files on the target cluster, for example:

    cat > gnt-fsn.password
    cat > gnt-dal.password
    
  5. disable Puppet on all ganeti nodes, as we'll be messing with files it manages:

    ssh fsn-node-01.torproject.org gnt-cluster command "puppet agent --disable 'firewall opened for cross-cluster migration'"
    ssh dal-node-01.torproject.org gnt-cluster command "puppet agent --disable 'firewall opened for cross-cluster migration'"
    
  6. open up the firewall on all destination nodes to all nodes from the source:

    for n in fsn-node-0{1..8}; do nodeip=$(dig +short ${n}.torproject.org); gnt-cluster command "iptables-legacy -I ganeti-cluster -j ACCEPT -s ${nodeip}/32"; done
    

Actual VM migration

Once the above configuration is completed, the following procedure will move one VM, in this example the fictitious test-01.torproject.org VM from the gnt-fsn to the gnt-dal cluster:

  1. stop the VM, on the source cluster:

    gnt-instance stop test-01
    

    Note that this is necessary only if you are worried changes will happen on the source node and not be reproduced on the target cluster. If the service is fully redundant and ephemeral (e.g. a DNS secondary), the VM can be kept running.

  2. move the VM to the new cluster:

    /usr/lib/ganeti/tools/move-instance  \
        fsn-node-01.torproject.org \
        dal-node-01.torproject.org \
        test-01.torproject.org \
        --src-ca-file=gnt-fsn.crt \
        --dest-ca-file=/var/lib/ganeti/rapi.pem \
        --src-username=move-instance \
        --src-password-file=gnt-fsn.password \
        --dest-username=move-instance \
        --dest-password-file=gnt-dal.password \
        --src-rapi-port=5080 \
        --dest-rapi-port=5080 \
        --net 0:ip=pool,network=gnt-dal-01,mode=,link= \
        --keep-source-instance \
        --dest-disk-template=drbd \
        --compress=lzop \
        --verbose
    

    Note that for the --compress option to work the compression tool needs to be configured for clusters on both sides. See ganeti cluster configuration. This configuration was already done for the fsn and dal clusters.

  3. change the IP address inside the VM:

    fabric-tasks$ fab -H test-01.torproject.org ganeti.renumber-instance dal-node-02.torproject.org
    

    Note how we use the name of the Ganeti node where the VM resides, not the master.

    Also note that this will give you a bunch of instructions on how to complete the renumbering. Do not follow those steps yet! Wait for confirmation that the new VM works before changing DNS so we have a chance to catch problems.

  4. test the new VM

  5. reconfigure the grub-pc package to account for the new disk IDs:

    dpkg-reconfigure grub-pc
    

    Once this is done, reboot the instance to test that grub-pc did the right thing and the instance comes back online correctly.

  6. if satisfied, change DNS to new VM in LDAP, and everywhere else the above renumber-instance command suggests looking.

  7. schedule destruction of the old VM (7 days)

    fabric-tasks$ fab -H test-01.torproject.org ganeti.retire --master-host=fsn-node-01.torproject.org 
    
  8. If you're all done with instance migrations, remove the password and certificate files that were created in the previous section.

Troubleshooting

The above procedure was tested on a test VM migrating from gnt-chi to gnt-dal (tpo/tpa/team#40972). In that process, many hurdles were overcome. If the above procedure is followed again and somehow fails, this section documents workarounds for the issues we have encountered so far.

Debugging and logs

If the above procedure doesn't work, try again with --debug instead of --verbose; you might see extra error messages. The import/export logs can also be found in /var/log/ganeti/os/ on the node where the import or export happened.
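
For example, to spot the most recent import/export log on the relevant node (plain shell, nothing Ganeti-specific):

ls -ltr /var/log/ganeti/os/ | tail -3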

Missing patches

This error:

TypeError: '>' not supported between instances of 'NoneType' and 'int'

... is upstream bug 1696 fixed in master with PR 1697. An alternative is to add those flags to the move-instance command:

--opportunistic-tries=1 --iallocator=hail

This error:

ganeti.errors.OpPrereqError: ('If network is given, no mode or link is allowed to be passed', 'wrong_input')

... is also documented in upstream bug 1696 and fixed with PR 1698.

This mysterious failure:

Disk 0 failed to receive data: Exited with status 1 (recent output: socat: W ioctl(9, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Inappropriate ioctl for device\n0+0 records in\n0+0 records out\n0 bytes copied, 12.2305 s, 0.0 kB/s)

... is probably due to a certificate verification bug in Ganeti's import-export daemon. It can be confirmed in the logs in /var/log/ganeti/os on the relevant node. The actual confirmation log line is:

Disk 0 failed to send data: Exited with status 1 (recent output: socat: E certificate is valid but its commonName does not match hostname "ganeti.example.com")

That is upstream bug 1681 that should have been fixed in PR 1699.

Not enough space on the volume group

If the export fails on the source cluster with:

WARNING: Could not snapshot disk/2 on node chi-node-10.torproject.org: Error while executing backend function: Not enough free space: required 20480, available 15364.0

That is because the volume group doesn't have enough room to make a snapshot. In this case, there was a 300GB swap partition on the node (!) that could easily be removed, but an alternative would be to evacuate other instances off of the node (even as secondaries) to free up some space.
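
To see how much room is left in the affected volume group on that node, the standard LVM commands are enough (a sketch; the volume group name is assumed to be vg_ganeti here):

vgs vg_ganeti
lvs vg_ganeti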

Snapshot failure

If the procedure fails with:

ganeti.errors.OpExecError: Not all disks could be snapshotted, and you did not allow the instance to remain offline for a longer time through the --long-sleep option; 
aborting

... try again with the VM stopped.

Connectivity issues

If the procedure fails during the data transfer with:

pycurl.error: (7, 'Failed to connect to chi-node-01.torproject.org port 5080: Connection refused')

or:

Disk 0 failed to send data: Exited with status 1 (recent output: dd: 0 bytes copied, 0.996381 s, 0.0 kB/s\ndd: 0 bytes copied, 5.99901 s, 0.0 kB/s\nsocat: E SSL_connect(): Connection refused)

... make sure the firewalls are open. Note that Puppet or other things might clear out the temporary firewall rules established in the preparation step.

DNS issues

This error:

ganeti.errors.OpPrereqError: ('The given name (metrics-psqlts-01.torproject.org.2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa) does not resolve: Name or service not known', 'resolver_error')

... means the reverse DNS on the instance has not been properly configured. In this case, the fix was to add a trailing dot to the PTR record:

--- a/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
+++ b/2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa
@@ -55,7 +55,7 @@ b.c.b.7.0.c.e.f.f.f.8.3.6.6.4.0 IN PTR ci-runner-x86-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fe3c:f0a7
 7.a.0.f.c.3.e.f.f.f.8.3.6.6.4.0 IN PTR dangerzone-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fe97:24ac
-c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org
+c.a.4.2.7.9.e.f.f.f.8.3.6.6.4.0 IN PTR metrics-psqlts-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fed4:51a1
 1.a.1.5.4.d.e.f.f.f.8.3.6.6.4.0 IN PTR onion-test-01.torproject.org.
 ; 2604:8800:5000:82:466:38ff:fea3:7c78

Capacity issues

If the procedure fails with:

ganeti.errors.OpPrereqError: ('Instance allocation to group 64c116fc-1ab2-4f6d-ba91-89c65875f888 (default) violates policy: memory-size value 307200 is not in range [128, 65536]', 'wrong_input')

It's because the VM is smaller or bigger than the cluster configuration allows. You need to change the --ipolicy-bounds-specs setting in the cluster; see, for example, the gnt-dal cluster initialization instructions.

If the procedure fails with:

ganeti.errors.OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group default (preferred): No valid allocation solutions, failure reasons: FailMem: 6", 'insufficient_resources')

... you may be able to work around the problem by specifying a destination node by hand; add this to the move-instance command, for example:

--dest-primary-node=dal-node-02.torproject.org \
--dest-secondary-node=dal-node-03.torproject.org

The error:

ganeti.errors.OpPrereqError: Disk template 'blockdev' is not enabled in cluster. Enabled disk templates are: drbd,plain

... means that you should pass a supported --dest-disk-template argument to the move-instance command.

Rerunning failed migrations

This error obviously means the instance already exists in the cluster:

ganeti.errors.OpPrereqError: ("Instance 'rdsys-frontend-01.torproject.org' is already in the cluster", 'already_exists')

... maybe you're retrying a failed move? In that case, delete the target instance (yes, really make sure you delete the target, not the source!!!):

gnt-instance remove --shutdown-timeout=0 test-01.torproject.org

Other issues

This error is harmless and can be ignored:

WARNING: Failed to run rename script for dal-rescue-01.torproject.org on node dal-node-02.torproject.org: OS rename script failed (exited with exit code 1), last lines in the log file:\nCannot rename from dal-rescue-01.torproject.org to dal-rescue-01.torproject.org:\nInstance has a different hostname (dal-rescue-01)

It's probably a flaw in the ganeti-instance-debootstrap backend that doesn't properly renumber the instance. We have our own renumbering procedure in Fabric instead, but that could be merged inside ganeti-instance-debootstrap eventually.

Tracing executed commands

Finally, to trace which commands are executed (which can be challenging in Ganeti), the execsnoop.bt command (from the bpftrace package) is invaluable. Make sure debugfs is mounted first and the package installed:

mount -t debugfs debugfs /sys/kernel/debug
apt install bpftrace

Then simply run:

execsnoop.bt

This will show every execve(2) system call executed on the system. Filtering is probably a good idea; in my case I was doing:

execsnoop.bt | grep socat

The execsnoop command (from the libbpf-tools package) may also work but it truncates the command after 128 characters (Debian 1033013, upstream 740).

This was used to troubleshoot the certificate issues with socat in upstream bug 1681.

Pager playbook

I/O overload

In case of excessive I/O, it might be worth looking into which machine is the cause. The DRBD page explains how to map a DRBD device to a VM. You can also find which logical volume is backing an instance (and vice versa) with this command:

lvs -o+tags

This will list all logical volumes and their associated tags. If you already know which logical volume you're looking for, you can address it directly:

root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
  LV Tags
  originstname+bacula-director-01.torproject.org
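
Conversely, to go from an instance to its backing logical volumes, you can select by tag using LVM's @tag syntax and the tag format shown above (a sketch):

lvs @originstname+bacula-director-01.torproject.org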

Node failure

Ganeti clusters are designed to be self-healing. As long as only one machine disappears, the cluster should be able to recover by failing instances over to other nodes. This is currently done manually, however.

WARNING: the following procedure should be considered a LAST RESORT. In the vast majority of cases, it is simpler and less risky to just restart the node using a remote power cycle to restore service than to risk the split brain scenario this procedure can cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from comes back on its own without you being able to stop it, it may bring its DRBD disks and virtual machines back up, and you may end up in a split brain scenario. Normally, the node asks the master which VMs to start, so it should be safe to fail over from a node that is NOT the master, but make sure the rest of the cluster is healthy before going ahead with this procedure.

If, say, fsn-node-07 completely fails and you need to restore service to the virtual machines running on that server, you can failover to the secondaries. Before you do, however, you need to be completely confident it is not still running in parallel, which could lead to a "split brain" scenario. For that, just cut the power to the machine using out of band management (e.g. on Hetzner, power down the machine through the Hetzner Robot, on Cymru, use the iDRAC to cut the power to the main board).

Once the machine is powered down, instruct Ganeti to stop using it altogether:

gnt-node modify --offline=yes fsn-node-07

Then, once the machine is offline and Ganeti also agrees, switch all the instances on that node to their secondaries:

gnt-node failover fsn-node-07.torproject.org

It's possible that you need --ignore-consistency, but this has caused trouble in the past (see 40229). In any case, it is not used at the WMF, for example: they explicitly say they never needed the flag.

Note that it will still try to connect to the failed node to shut down the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed server is repaired and restarts, it will contact the master to ask for instances to start. Since the instances have been migrated to other machines, none will be started and there should not be any inconsistencies.

Once the machine is up and running and you are confident you do not have a split brain scenario, you can re-add the machine to the cluster with:

gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster because you now have an empty node which could be reused (hopefully). It might, obviously, be worth exploring the root cause of the failure before re-adding the machine to the cluster.
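
Rebalancing is the same hbal invocation used elsewhere in this page; a typical run (with -X to actually execute the moves) would be:

hbal -L -C -P -v -X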

Recoveries could eventually be automated if such situations occur more often, by scheduling a harep cron job, which isn't enabled in Debian by default. See also the autorepair section of the admin manual.

Master node failure

A master node failure is a special case, as you may not have access to the node to run Ganeti commands. The Ganeti wiki master failover procedure has good documentation on this, but we also include scenarios specific to our use cases, to make sure this is also available offline.

There are two different scenarios that might require a master failover:

  1. the master is expected to fail or go down for maintenance (looming HDD failure, planned maintenance) and we want to retain availability

  2. the master has completely failed (motherboard fried, power failure, etc)

The key difference between scenario 1 and 2 here is that in scenario 1, the master is still available.

Scenario 1: preventive maintenance

This is the best case scenario, as the master is still available. In that case, it should simply be a matter of doing the master-failover command and marking the old master as offline.

On the machine you want to elect as the new master:

gnt-cluster master-failover
gnt-node modify --offline yes OLDMASTER.torproject.org

When the old master is available again, re-add it to the cluster with:

gnt-node add --readd OLDMASTER.torproject.org

Note that it should be safe to boot the old master normally, as long as it doesn't think it's the master before reboot. That is because it's the master which tells nodes which VMs to start on boot. You can check that by running this on the OLDMASTER:

gnt-cluster getmaster

It should return the NEW master.

Here's an example of a routine failover performed on fsn-node-01, the nominal master of the gnt-fsn cluster, failing over to a new master (we picked fsn-node-02 here) in preparation for a disk replacement:

root@fsn-node-02:~# gnt-cluster master-failover
root@fsn-node-02:~# gnt-cluster getmaster
fsn-node-02.torproject.org
root@fsn-node-02:~# gnt-node modify --offline yes fsn-node-01.torproject.org
Tue Jun 21 14:30:56 2022 Failed to stop KVM daemon on node 'fsn-node-01.torproject.org': Node is marked offline
Modified node fsn-node-01.torproject.org
 - master_candidate -> False
 - offline -> True

And indeed, fsn-node-01 now thinks it's not the master anymore:

root@fsn-node-01:~# gnt-cluster getmaster
fsn-node-02.torproject.org

And this is how the node was recovered, after a reboot, on the new master:

root@fsn-node-02:~# gnt-node add --readd fsn-node-01.torproject.org
2022-06-21 16:43:52,666: The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies.
Tue Jun 21 16:43:54 2022  - INFO: Readding a node, the offline/drained flags were reset
Tue Jun 21 16:43:54 2022  - INFO: Node will be a master candidate

And to promote it back, on the old master:

root@fsn-node-01:~# gnt-cluster master-failover
root@fsn-node-01:~# 

And both nodes agree on who the master is:

root@fsn-node-01:~# gnt-cluster getmaster
fsn-node-01.torproject.org

root@fsn-node-02:~# gnt-cluster getmaster
fsn-node-01.torproject.org

Now is a good time to verify the cluster too:

gnt-cluster verify

That's pretty much it! See tpo/tpa/team#40805 for the rest of that incident.

Scenario 2: complete master node failure

In this scenario, the master node is completely unavailable. In this case, the Ganeti wiki master failover procedure should be followed pretty much to the letter.

WARNING: if you follow this procedure and skip step 1, you will probably end up with a split brain scenario (recovery documented below). So make absolutely sure the old master is REALLY unavailable before moving ahead with this.

The procedure is, at the time of writing (WARNING: UNTESTED):

  1. Make sure that the original failed master won't start again while a new master is present, preferably by physically shutting down the node.

  2. To upgrade one of the master candidates to the master, issue the following command on the machine you intend to be the new master:

    gnt-cluster master-failover
    
  3. Offline the old master so the new master doesn't try to communicate with it. Issue the following command:

    gnt-node modify --offline yes oldmaster
    
  4. If there were any DRBD instances on the old master node, they can be failed over by issuing the following commands:

    gnt-node evacuate -s oldmaster
    gnt-node evacuate -p oldmaster
    
  5. Any plain instances on the old master need to be recreated again.

If the old master becomes available again, re-add it to the cluster with:

gnt-node add --readd OLDMASTER.torproject.org

The above procedure is UNTESTED. See also the Riseup master failover procedure for further ideas.

Split brain recovery

A split brain occurred during a partial failure, failover, then unexpected recovery of fsn-node-07 (issue 40229). It might occur in other scenarios, but this section documents that specific one. Hopefully the recovery will be similar in other scenarios.

The split brain was the result of an operator running this command to failover the instances running on the node:

gnt-node failover --ignore-consistency fsn-node-07.torproject.org

The symptom of the split brain is that the VM is running on two machines. You will see that in gnt-cluster verify:

Thu Apr 22 01:28:04 2021 * Verifying node status
Thu Apr 22 01:28:04 2021   - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021   - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021   - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021   - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021   - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021   - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org

In the above, the verification finds an instance running on an unexpected server (the old primary). Disks will be in a similar "degraded" state, according to gnt-cluster verify:

Thu Apr 22 01:28:04 2021 * Verifying instance status
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'

We can also see that symptom on an individual instance:

root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org
- Instance name: onionbalance-01.torproject.org
[...]
  Disks: 
    - disk/0: drbd, size 10.0G
      access mode: rw
      nodeA: fsn-node-05.torproject.org, minor=29
      nodeB: fsn-node-07.torproject.org, minor=26
      port: 11031
      on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED*
      on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED*
[...]

The first (optional) thing to do in a split brain scenario is to stop the damage done by the running instances: stop all the instances running in parallel, on both the previous and new primaries:

gnt-instance stop $INSTANCES

Then, on fsn-node-07, use kill(1) to shut down the qemu processes running the VMs directly. Now the instances should all be shut down, and no further changes that could later be lost will be made on the VMs.
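
A sketch of that last step (the instance name is only an example; the KVM process has the instance name on its command line, so it is easy to find):

pgrep -a -f onionbalance-01.torproject.org
kill <PID from the output above>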

(This step is optional because you can also skip straight to the hard decision below, while leaving the instances running. But that adds pressure to you, and we don't want to do that to your poor brain right now.)

That will leave you time to make a more important decision: which node will be authoritative (which will keep running as primary) and which one will "lose" (and will have its instances destroyed)? There's no easy right or wrong answer for this: it's a judgement call. In any case, there might already have been data loss: as long as both nodes were available and the VMs were running on both, data written during the split brain on the "losing" node will be lost when we destroy its state.

If you have picked the previous primary as the "new" primary, you will need to first revert the failover and flip the instances back to the previous primary:

for instance in $INSTANCES; do
    gnt-instance failover $instance
done

When that is done, or if you have picked the "new" primary (the one the instances were originally failed over to) as the official one: you need to fix the disks' state. For this, flip to a "plain" disk (i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring the disk, and reallocate a new disk in the right place. Assuming all instances are stopped, this should do it:

for instance in $INSTANCES ; do
  gnt-instance modify -t plain $instance
  gnt-instance modify -t drbd --no-wait-for-sync $instance
  gnt-instance start $instance
  gnt-instance console $instance
done

Then the instances should be back up on a single node and the split brain scenario resolved. Note that the other side of the DRBD mirror is destroyed in the procedure; that is the step that drops the data which was sent to the wrong side of the "split brain".

Once everything is back to normal, it might be a good idea to rebalance the cluster.

References:

  • the -t plain hack comes from this post on the Ganeti list
  • this procedure suggests using replace-disks -n which also works, but requires us to pick the secondary by hand each time, which is annoying
  • this procedure has instructions on how to recover at the DRBD level directly, but we have not needed those instructions so far

Bridge configuration failures

If you get the following error while trying to bring up the bridge:

root@chi-node-02:~# ifup br0
add bridge failed: Package not installed
run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
ifup: failed to bring up br0

... it might be that the bridge scripts cannot load the required kernel module, because kernel module loading has been disabled. Reboot with the /etc/no_modules_disabled file present:

touch /etc/no_modules_disabled
reboot

It might be that the machine took too long to boot because it's not in mandos and the operator took too long to enter the LUKS passphrase. Re-enable the machine with this command on mandos:

mandos-ctl --enable chi-node-02.torproject

Cleaning up orphan disks

Sometimes gnt-cluster verify will give this warning, particularly after a failed rebalance:

* Verifying orphan volumes
   - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
   - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
   - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
   - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown

This can happen when an instance was partially migrated to a node (in this case fsn-node-06) but the migration failed because (for example) there was no HDD on the target node. The fix here is simply to remove the logical volumes on the target node:

ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data

Cleaning up ghost disks

Under certain circumstances, you might end up with "ghost" disks, for example:

Tue Oct  4 13:24:07 2022   - ERROR: cluster : ghost disk 'ed225e68-83af-40f7-8d8c-cf7e46adad54' in temporary DRBD map

It's unclear how this happens, but in this specific case it is believed the problem occurred because a disk failed to be added to an instance being resized.

It's possible this is a situation similar to the one above, in which case you must first find where the ghost disk is, with something like:

gnt-cluster command 'lvs --noheadings' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

If this finds a device, you can remove it as normal:

ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/ed225e68-83af-40f7-8d8c-cf7e46adad54.disk1_data

... but in this case, the DRBD map is not associated with a logical volume. You can also check the dmsetup output for a match as well:

gnt-cluster command 'dmsetup ls' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

According to this discussion, it's possible that restarting ganeti on all nodes might clear out the issue:

gnt-cluster command 'service ganeti restart'

If all the "ghost" disks mentioned are not actually found anywhere in the cluster, either in the device mapper or logical volumes, it might just be stray data leftover in the data file.

So it looks like the proper way to do this is to remove the temporary file where this data is stored:

gnt-cluster command  'grep ed225e68-83af-40f7-8d8c-cf7e46adad54 /var/lib/ganeti/tempres.data'
ssh ... service ganeti stop
ssh ... rm /var/lib/ganeti/tempres.data
ssh ... service ganeti start
gnt-cluster verify

That solution was proposed in this discussion. Anarcat toured the Ganeti source code and found that the ComputeDRBDMap function, in the Haskell codebase, basically just sucks the data out of that tempres.data JSON file, and dumps it into the Python side of things. Then the Python code looks for those disks in its internal disk list and compares. It is therefore pretty unlikely that the warning would happen while the disks are still actually around.

Fixing inconsistent disks

Sometimes gnt-cluster verify will give this error:

WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok'

... or worse:

ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)>

The fix for both is to run:

gnt-instance activate-disks materculae.torproject.org

This will make sure disks are correctly setup for the instance.

If you have a lot of those warnings, pipe the output into this filter, for example:

gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' |
  sed 's/.*instance//;s/:.*//' |
  sort -u |
  while read instance; do
    gnt-instance activate-disks $instance
  done

If you see an error like this:

DRBD CRITICAL: Device 28 WFConnection UpToDate, Device 3 WFConnection UpToDate, Device 31 WFConnection UpToDate, Device 4 WFConnection UpToDate

In this case, it's warning that the node has devices 3, 4, 28, and 31 in WFConnection state, which is incorrect. This might not be detected by Ganeti and therefore requires some hand-holding. This is documented in the resyncing disks section of our DRBD documentation. Like in the above scenario, the solution is basically to run activate-disks on the affected instances.

Not enough memory for failovers

Another error that gnt-cluster verify can give you is, for example:

- ERROR: node fsn-node-04.torproject.org: not enough memory to accommodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available)

The solution is to rebalance the cluster.

Can't assemble device after creation

It's possible that Ganeti fails to create an instance with this error:

Thu Jan 14 20:01:00 2021  - WARNING: Device creation failed
Failure: command execution error:
Can't create block device <DRBD8(hosts=d1b54252-dd81-479b-a9dc-2ab1568659fa/0-3aa32c9d-c0a7-44bb-832d-851710d04765/0, port=11005, backend=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=10240m)> on node chi-node-03.torproject.org for instance build-x86-13.torproject.org: Can't assemble device after creation, unusual event: drbd0: timeout while configuring network

In this case, the problem was that chi-node-03 had an incorrect secondary_ip set. The immediate fix was to correctly set the secondary address of the node:

gnt-node modify --secondary-ip=172.30.130.3 chi-node-03.torproject.org

Then gnt-cluster verify was complaining about the leftover DRBD device:

   - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use

For this, see DRBD: deleting a stray device.

SSH key verification failures

Ganeti uses SSH to launch arbitrary commands (as root!) on other nodes. It does this using a funky command, from node-daemon.log:

ssh -oEscapeChar=none -oHashKnownHosts=no \
  -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
  -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
  -oConnectTimeout=10 -oHostKeyAlias=chignt.torproject.org \
  -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 \
  root@chi-node-03.torproject.org

This has caused us some problems in the Ganeti buster to bullseye upgrade, possibly because of changes in host verification routines in OpenSSH. The problem was documented in issue 1608 upstream and tpo/tpa/team#40383.

A workaround is to synchronize Ganeti's known_hosts file:

grep 'chi-node-0[0-9]' /etc/ssh/ssh_known_hosts | grep -v 'initramfs' | grep ssh-rsa | sed 's/[^ ]* /chignt.torproject.org /' >> /var/lib/ganeti/known_hosts

Note that the above assumes a cluster with fewer than 10 nodes (because of the chi-node-0[0-9] pattern).

Other troubleshooting

The walkthrough also has a few recipes to resolve common problems.

See also the common issues page in the Ganeti wiki.

Look into the logs on the relevant nodes (particularly /var/log/ganeti/node-daemon.log, which shows all commands run by Ganeti) when you have problems.

Mass migrating instances to a new cluster

If an entire cluster needs to be evacuated, the move-instance command can be used to automatically propagate instances between clusters. It currently migrates only one VM at a time (because of the --net argument, a limitation which could eventually be waived), but should be easier to do than the export/import procedure above.

See the detailed cross-cluster migration instructions.

Reboot procedures

NOTE: this procedure is out of date since the Icinga retirement, see tpo/tpa/prometheus-alerts#16 for a rewrite.

If you get this email in Nagios:

Subject: ** PROBLEM Service Alert: chi-node-01/needrestart is WARNING **

... and in the detailed results, you see:

WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none
Services:
- ganeti.service

You can try to make needrestart fix Ganeti by hand:

root@chi-node-01:~# needrestart
Scanning processes...
Scanning candidates...
Scanning processor microcode...
Scanning linux images...

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

Restarting services...
 systemctl restart ganeti.service

No containers need to be restarted.

No user sessions are running outdated binaries.
root@chi-node-01:~#

... but it's actually likely this didn't fix anything. A rerun will yield the same result.

That is likely because the virtual machines, running inside a qemu process, need a restart. This can be fixed by rebooting the entire host, if it needs a reboot, or, if it doesn't, just migrating the VMs around.

See the Ganeti reboot procedures for how to proceed from here on. This is likely a case of an Instance-only restart.

Slow disk sync after rebooting/Broken migrate-back

After rebooting a node with high-traffic instances, the node's disks may take several minutes to sync. While the disks are syncing, the reboot script's --ganeti-migrate-back option can fail:

Wed Aug 10 21:48:22 2022 Migrating instance onionbalance-02.torproject.org
Wed Aug 10 21:48:22 2022 * checking disk consistency between source and target
Wed Aug 10 21:48:23 2022  - WARNING: Can't find disk on node chi-node-08.torproject.org
Failure: command execution error:
Disk 0 is degraded or not fully synchronized on target node, aborting migration
unexpected exception during reboot: [<UnexpectedExit: cmd='gnt-instance migrate -f onionbalance-02.torproject.org' exited=1>] Encountered a bad command exit code!

Command: 'gnt-instance migrate -f onionbalance-02.torproject.org'

When this happens, gnt-cluster verify may show a large number of errors for node status and instance status:

Wed Aug 10 21:49:37 2022 * Verifying node status
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 0 of disk 1e713d4e-344c-4c39-9286-cb47bcaa8da3 (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 1 of disk 1948dcb7-b281-4ad3-a2e4-cdaf3fa159a0 (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 2 of disk 25986a9f-3c32-4f11-b546-71d432b1848f (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 3 of disk 7f3a5ef1-b522-4726-96cf-010d57436dd5 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 4 of disk bfd77fb0-b8ec-44dc-97ad-fd65d6c45850 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 5 of disk c1828d0a-87c5-49db-8abb-ee00ccabcb73 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 8 of disk 1f3f4f1e-0dfa-4443-aabf-0f3b4c7d2dc4 (attached in instance 'onionbalance-02.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 9 of disk bbd5b2e9-8dbb-42f4-9c10-ef0df7f59b85 (attached in instance 'onionbalance-02.torproject.org') is not active
Wed Aug 10 21:49:37 2022 * Verifying instance status
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/0 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/1 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/2 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/3-3aa32c9d-c0a7-44bb-832d-851710d04765/8, port=11040, backend=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/4-3aa32c9d-c0a7-44bb-832d-851710d04765/11, port=11041, backend=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/5-3aa32c9d-c0a7-44bb-832d-851710d04765/12, port=11042, backend=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_data, visible as /dev/, size=20480m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=20480m)>
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/0 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/1 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/2 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/3-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/0, port=11035, backend=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/4-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/1, port=11036, backend=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/5-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/2, port=11037, backend=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_data, visible as /dev/, size=51200m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=51200m)>
Wed Aug 10 21:49:37 2022   - WARNING: instance onionbalance-02.torproject.org: disk/0 on chi-node-09.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance onionbalance-02.torproject.org: disk/1 on chi-node-09.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/8-86e465ce-60df-4a6f-be17-c6abb33eaf88/4, port=11022, backend=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/9-86e465ce-60df-4a6f-be17-c6abb33eaf88/5, port=11021, backend=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>

This is usually a false alarm, and the warnings and errors will disappear in a few minutes when the disk finishes syncing. Re-check gnt-cluster verify every few minutes, and manually migrate the instances back when the errors disappear.
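
To follow the resync progress directly on the node, DRBD exposes its sync status in /proc/drbd, so something like this works (a sketch):

watch -n10 cat /proc/drbd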

If such an error persists, consider telling Ganeti to "re-seat" the disks (so to speak) with, for example:

gnt-instance activate-disks onionbalance-02.torproject.org

Failed disk on node

If a disk fails on a node, we should get it replaced as soon as possible. Here are the steps one can follow to achieve that:

  1. Open an incident-type issue in gitlab in the TPA/Team project. Set its priority to High.
  2. empty the node of its instances. In the fabric-tasks repository: ./ganeti -H $cluster-node-$number.torproject.org empty-node
    • Take note in the issue of which instances were migrated by this operation.
  3. Open a support ticket with Hetzner and then, once the machine is back online with the new disk, replace it in the appropriate RAID arrays. See the RAID documentation page.
  4. Finally, bring back the instances on the node with the list of instances noted down at step 2. Still in fabric-tasks: fab -H $cluster_master ganeti.migrate-instances -i instance1 -i instance2

Disaster recovery

If things get completely out of hand and the cluster becomes too unreliable for service but we still have access to all data on the instance volumes, the only solution is to rebuild another one elsewhere. Since Ganeti 2.2, there is a move-instance command to move instances between clusters that can be used for that purpose. See the mass migration procedure above, which can also be used to migrate only a subset of the instances since the script operates one instance at a time.

The mass migration procedure was used to migrate all virtual machines from Cymru (gnt-chi) to Quintex (gnt-dal) in 2023 (see issue tpo/tpa/team#40972), and worked relatively well. In 2024, the gitlab-02 VM was migrated from Hetzner (gnt-fsn) to Quintex which required more fine-tuning (like zero'ing disks and compression) because it was such a large VM (see tpo/tpa/team#41431).

Note that you can also use the export/import mechanism (see the instance backup and migration section above), but now that move-instance is well tested, we recommend using that script instead.

If Ganeti is completely destroyed and its APIs don't work anymore, the last resort is to restore all virtual machines from backup. Hopefully, this should not happen except in the case of a catastrophic data loss bug in Ganeti or DRBD.

Reference

Installation

Ganeti is typically installed as part of the bare-bones machine installation process, during the "post-install configuration" procedure, once the machine is fully installed and configured.

Typically, we add a new node to an existing cluster. Below are cluster-specific procedures to add a new node to each existing cluster, alongside the configuration of the cluster as it was done at the time (and how it could be used to rebuild a cluster from scratch).

Make sure you use the procedure specific to the cluster you are working on.

Note that this is not about installing virtual machines (VMs) inside a Ganeti cluster: for that you want to look at the new instance procedure.

New gnt-fsn node

  1. To create a new box, follow new-machine-hetzner-robot but change the following settings:

    • Server: PX62-NVMe
    • Location: FSN1
    • Operating system: Rescue
    • Additional drives: 2x10TB HDD (update: starting from fsn-node-05, we are not ordering additional drives to save on costs, see ticket 33083 for rationale)
    • Add in the comment form that the server needs to be in the same datacenter as the other machines (FSN1-DC13, but double-check)
  2. follow the new-machine post-install configuration

  3. Add the server to the two vSwitch systems in Hetzner Robot web UI

  4. install openvswitch and allow modules to be loaded:

    touch /etc/no_modules_disabled
    reboot
    apt install openvswitch-switch
    
  5. Allocate a private IP address in the 30.172.in-addr.arpa zone (and the torproject.org zone) for the node, in the admin/dns/domains.git repository

  6. copy over the /etc/network/interfaces from another ganeti node, changing the address and gateway fields to match the local entry.

  7. knock on wood, cross your fingers, pet a cat, help your local book store, and reboot:

     reboot
    
  8. Prepare the node in Puppet by adding the roles::ganeti::fsn class to it

  9. Re-disable module loading:

    rm /etc/no_modules_disabled
    
  10. run puppet across the ganeti cluster to ensure ipsec tunnels are up:

    cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'
    
  11. reboot again:

    reboot
    
  12. Then the node is ready to be added to the cluster, by running this on the master node:

    gnt-node add \
     --secondary-ip 172.30.135.2 \
     --no-ssh-key-check \
     --no-node-setup \
     fsn-node-02.torproject.org
    

    If this is an entirely new cluster, you need a different procedure, see the cluster initialization procedure instead.

  13. make sure everything is great in the cluster:

    gnt-cluster verify
    

    If that takes a long time and eventually fails with errors like:

    ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out\r\n
    

    ... that is because the service/ipsec tunnels between the nodes are failing. Make sure Puppet has run across the cluster (step 10 above) and see service/ipsec for further diagnostics. For example, the above would be fixed with:

    ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
    ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"
    

gnt-fsn cluster initialization

This procedure replaces the gnt-node add step in the initial setup of the first Ganeti node when the gnt-fsn cluster was setup:

gnt-cluster init \
    --master-netdev vlan-gntbe \
    --vg-name vg_ganeti \
    --secondary-ip 172.30.135.1 \
    --enabled-hypervisors kvm \
    --nic-parameters mode=openvswitch,link=br0,vlan=4000 \
    --mac-prefix 00:66:37 \
    --no-ssh-init \
    --no-etc-hosts \
    fsngnt.torproject.org

The above assumes that fsngnt is already in DNS. See the MAC address prefix selection section for information on how the --mac-prefix argument was selected.

Then the following extra configuration was performed:

gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000 -global isa-fdc.fdtypeA=none'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
gnt-cluster modify -H kvm:migration_caps=postcopy-ram
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify --uid-pool 4000-4019
gnt-cluster modify --compression-tools=gzip,gzip-fast,gzip-slow,lzop

The network configuration (below) must also be performed for the address blocks reserved in the cluster.

Cluster limits were changed to raise the disk usage to 2TiB:

gnt-cluster modify --ipolicy-bounds-specs \
max:cpu-count=16,disk-count=16,disk-size=2097152,\
memory-size=32768,nic-count=8,spindle-use=12\
/min:cpu-count=1,disk-count=1,disk-size=512,\
memory-size=128,nic-count=1,spindle-use=1
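
To double-check the bounds that actually took effect, the instance policy can be inspected on the master with gnt-cluster info (it is printed along with the rest of the cluster parameters); recent Ganeti releases also have gnt-cluster show-ispecs-cmd, which should print a command line equivalent to the current specs:

gnt-cluster info
gnt-cluster show-ispecs-cmd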

New gnt-dal node

  1. To create a new box, follow the quintex tutorial

  2. follow the new-machine post-install configuration

  3. Allocate a private IP address for the node in the 30.172.in-addr.arpa zone and torproject.org zone, in the admin/dns/domains.git repository

  4. add the private IP address to the eth1 interface, for example in /etc/network/interfaces.d/eth1:

    auto eth1
    iface eth1 inet static
        address 172.30.131.101/24
    

    Again, this IP must be allocated in the reverse DNS zone file (30.172.in-addr.arpa) and the torproject.org zone file in the dns/domains.git repository.

  5. enable the interface:

    ifup eth1
    
  6. setup a bridge on the public interface, replacing the eth0 blocks with something like:

    auto eth0
    iface eth0 inet manual
    
    auto br0
    iface br0 inet static
        address 204.8.99.101/24
        gateway 204.8.99.254
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
    
    # IPv6 configuration
    iface br0 inet6 static
        accept_ra 0
        address 2620:7:6002:0:3eec:efff:fed5:6b2a/64
        gateway 2620:7:6002::1
    
  7. allow modules to be loaded, cross your fingers that you didn't screw up the network configuration above, and reboot:

    touch /etc/no_modules_disabled
    reboot
    
  8. configure the node in Puppet by adding it to the roles::ganeti::dal class, and run Puppet on the new node:

    puppet agent -t
    
  9. re-disable module loading:

     rm /etc/no_modules_disabled
    
  10. run puppet across the Ganeti cluster so firewalls are correctly configured:

     cumin -p 0 'C:roles::ganeti::dal' 'puppet agent -t'
    
  11. partition the extra disks, SSD:

    for disk in /dev/sd[abcdef]; do
         parted -s $disk mklabel gpt;
         parted -s $disk -a optimal mkpart primary 0% 100%;
    done &&
    mdadm --create --verbose --level=10 --metadata=1.2 \
          --raid-devices=6 \
          /dev/md2 \
          /dev/sda1 \
          /dev/sdb1 \
          /dev/sdc1 \
          /dev/sdd1 \
          /dev/sde1 \
          /dev/sdf1 &&
    dd if=/dev/random bs=64 count=128 of=/etc/luks/crypt_dev_md2 &&
    chmod 0 /etc/luks/crypt_dev_md2 &&
    cryptsetup luksFormat --key-file=/etc/luks/crypt_dev_md2 /dev/md2 &&
    cryptsetup luksOpen --key-file=/etc/luks/crypt_dev_md2 /dev/md2 crypt_dev_md2 &&
    pvcreate /dev/mapper/crypt_dev_md2 &&
    vgcreate vg_ganeti /dev/mapper/crypt_dev_md2 &&
    echo crypt_dev_md2 UUID=$(lsblk -n -o UUID /dev/md2 | head -1) /etc/luks/crypt_dev_md2 luks,discard >> /etc/crypttab &&
    update-initramfs -u
    
NVMe:

     for disk in /dev/nvme[23]n1; do
         parted -s $disk mklabel gpt;
         parted -s $disk -a optimal mkpart primary 0% 100%;
     done &&
     mdadm --create --verbose --level=1 --metadata=1.2 \
           --raid-devices=2 \
           /dev/md3 \
           /dev/nvme2n1p1 \
           /dev/nvme3n1p1 &&
     dd if=/dev/random bs=64 count=128 of=/etc/luks/crypt_dev_md3 &&
     chmod 0 /etc/luks/crypt_dev_md3 &&
     cryptsetup luksFormat --key-file=/etc/luks/crypt_dev_md3 /dev/md3 &&
     cryptsetup luksOpen --key-file=/etc/luks/crypt_dev_md3 /dev/md3 crypt_dev_md3 &&
     pvcreate /dev/mapper/crypt_dev_md3 &&
     vgcreate vg_ganeti_nvme /dev/mapper/crypt_dev_md3 &&
     echo crypt_dev_md3 UUID=$(lsblk -n -o UUID /dev/md3 | head -1) /etc/luks/crypt_dev_md3 luks,discard >> /etc/crypttab &&
     update-initramfs -u

Normally, this would have been done in the `setup-storage` configuration, but we were in a rush. Note that we create partitions because we're worried replacement drives might not have exactly the same size as the ones we have. The above gives us a 1.4MB buffer at the end of the drive, and avoids having to hard-code disk sizes in bytes.
  12. Reboot to test the LUKS configuration:

    reboot
    
  13. Then the node is ready to be added to the cluster, by running this on the master node:

    gnt-node add \
     --secondary-ip 172.30.131.103 \
     --no-ssh-key-check \
     --no-node-setup \
     dal-node-03.torproject.org
    
If this is an entirely new cluster, you need a different procedure, see [the gnt-dal cluster initialization procedure](#gnt-dal-cluster-initialization) instead.
  14. make sure everything is great in the cluster:

    gnt-cluster verify
    

If the last step fails with SSH errors, you may need to re-synchronise the SSH known_hosts file, see SSH key verification failures.

gnt-dal cluster initialization

This procedure replaces the gnt-node add step in the initial setup of the first Ganeti node when the gnt-dal cluster was setup.

Initialize the ganeti cluster:

gnt-cluster init \
    --master-netdev eth1 \
    --nic-parameters link=br0 \
    --vg-name vg_ganeti \
    --secondary-ip 172.30.131.101 \
    --enabled-hypervisors kvm \
    --mac-prefix 06:66:39 \
    --no-ssh-init \
    --no-etc-hosts \
    dalgnt.torproject.org

The above assumes that dalgnt is already in DNS. See the MAC address prefix selection section for information on how the --mac-prefix argument was selected.

Then the following extra configuration was performed:

gnt-cluster modify --reserved-lvs vg_system/root,vg_system/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000 -global isa-fdc.fdtypeA=none'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
gnt-cluster modify -H kvm:migration_caps=postcopy-ram
gnt-cluster modify -H kvm:cpu_type=host
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify -D drbd:net-custom='--verify-alg sha1 --max-buffers 8k'
gnt-cluster modify --uid-pool 4000-4019
gnt-cluster modify --compression-tools=gzip,gzip-fast,gzip-slow,lzop

The upper limit for CPU count and memory size changed with:

gnt-cluster modify --ipolicy-bounds-specs \
max:cpu-count=32,disk-count=16,disk-size=2097152,\
memory-size=307200,nic-count=8,spindle-use=12\
/min:cpu-count=1,disk-count=1,disk-size=512,\
memory-size=128,nic-count=1,spindle-use=1

NOTE: watch out for whitespace here. The original source for this command had too much whitespace, which fails with:

Failure: unknown/wrong parameter name 'Missing value for key '' in option --ipolicy-bounds-specs'

The network configuration (below) must also be performed for the address blocks reserved in the cluster. This is the actual initial configuration performed:

gnt-network add --network 204.8.99.128/25 --gateway 204.8.99.254 --network6 2620:7:6002::/64 --gateway6 2620:7:6002::1 gnt-dal-01
gnt-network connect --nic-parameters=link=br0 gnt-dal-01 default

Note that we reserve the first /25 (204.8.99.0/25) for future use. The above only uses the second half of the network in case we need the rest of the network for other operations. A new network will need to be added if we run out of IPs in the second half.

No IP was reserved as the gateway is already automatically reserved by Ganeti. The node's public addresses are in the other /25 and also do not need to be reserved in this allocation.

Network configuration

IP allocation is managed by Ganeti through the gnt-network(8) system. Say we have 192.0.2.0/24 reserved for the cluster, with the host IP 192.0.2.100 and the gateway on 192.0.2.1. You will create this network with:

gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network

If there's also IPv6, it would look something like this:

gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network

Note: the actual name of the network (example-network above) should follow the convention established in doc/naming-scheme.

Then we associate the new network to the default node group:

gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default

The arguments to --nic-parameters come from the values configured in the cluster, above. The current values can be found with gnt-cluster info.

For example, the second ganeti network block was assigned with the following commands:

gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default

IP addresses can be reserved with the --reserved-ips argument to the modify command, for example:

gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01

Note that the gateway and node IP addresses are automatically reserved; the --reserved-ips mechanism is for hosts outside of the cluster.

The network name must follow the naming convention.
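
To see which networks exist, how much of each pool is used, and which addresses are currently reserved, the gnt-network list and info commands are handy; for example, with the network names used above:

# list all defined networks
gnt-network list

# detailed view of a single network, including reservations and instance assignments
gnt-network info gnt-fsn13-02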

Upgrades

Ganeti upgrades need to be handled specially. They are hit and miss: sometimes they're trivial, sometimes they fail.

Nodes should be upgraded one by one. Before upgrading the node, the node should be emptied as we're going to reboot it a couple of times, which would otherwise trigger outages in the hosted VMs. Then the package is updated (either through backports or a major update), and finally the node is checked, instances are migrated back, and we move to the next node to progressively update the entire cluster.

So, the checklist is:

  1. Checking and emptying node
  2. Backports upgrade
  3. Major upgrade
  4. Post-upgrade procedures

Here's each of those steps in detail.

Checking and emptying node

First, verify the cluster to make sure things are okay before going ahead, as you'll rely on that to make sure things worked after the upgrade:

gnt-cluster verify

Take note of (or, ideally, fix!) warnings you see here.

Then, empty the node, say you're upgrading fsn-node-05:

fab ganeti.empty-node -H fsn-node-05.torproject.org

Do take note of the instances that were migrated! You'll need this later to migrate the instances back.

Once the node is empty, the Ganeti package needs to be updated. This can be done through backports (safer) or by doing the normal major upgrade procedure (riskier).

Backports upgrade

Typically, we try to upgrade the packages to backports before upgrading the entire box to the newer release, if there's a backport available. That can be done with:

apt install -y ganeti/bookworm-backports

If you're extremely confident in the upgrade, this can be done on an entire cluster with:

cumin 'C:roles::ganeti::dal' "apt install -y ganeti/bookworm-backports"

Major upgrade

Then the Debian major upgrade procedure (for example, bookworm) is followed. When that procedure is completed (technically, on step 8), perform the post-upgrade procedures below.

Post-upgrade procedures

Make sure configuration file changes are deployed; for example, /etc/default/ganeti was modified in bullseye. This can be checked with:

clean_conflicts

If you've done a batch upgrade, you'll need to go over the output of the upgrade procedure and check the files one by one, effectively reproducing what clean_conflicts does above:

cumin 'C:roles::ganeti::chi' 'diff -u /etc/default/ganeti.dpkg-dist /etc/default/ganeti'

And applied with:

cumin 'C:roles::ganeti::chi' 'mv /etc/default/ganeti.dpkg-dist /etc/default/ganeti'

Major upgrades may also require running the gnt-cluster upgrade command; the release notes will let you know. In general, this should be safe to run regardless:

gnt-cluster upgrade

Once the upgrade has completed, verify the cluster on the Ganeti master:

gnt-cluster verify

If the node is in good shape, the instances should be migrated back to the upgraded node. Note that you need to specify the Ganeti master node here as the -H argument, not the node you just upgraded. Here we assume that only two instances were migrated in the empty-node step:

fab -H fsn-node-01.torproject.org ganeti.migrate-instances -i idle-fsn-01.torproject.org -i test-01.torproject.org

After the first successful upgrade, pick as the next node to upgrade one that is the secondary of an instance whose primary is the first upgraded node.

Then, after the second upgrade, test live migrations between the two upgraded nodes and fix any issues that arise (eg. tpo/tpa/team#41917) before proceeding with more upgrades.
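
A minimal way to exercise that is a round-trip live migration of an expendable instance running on one of the two upgraded nodes (idle-fsn-01.torproject.org is just an example); run this on the master:

# migrate to the secondary node and back; both runs should complete without errors
gnt-instance migrate idle-fsn-01.torproject.org
gnt-instance migrate idle-fsn-01.torproject.org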

Important caveats

  • as long as the entire cluster is not upgraded, live migrations will fail with a strange error message, for example:

     Could not pre-migrate instance static-gitlab-shim.torproject.org: Failed to accept instance: Failed to start instance static-gitlab-shim.torproject.org: exited with exit code 1 (qemu-system-x86_64: -enable-kvm: unsupported machine type
     Use -machine help to list supported machines
     )
    

    note that you can generally migrate to the newer nodes, just not back to the old ones. But in practice, it's safer to just avoid live migrations between Ganeti releases: state doesn't carry well across major QEMU and KVM versions, and you might also find that the entire VM does migrate, but is hung. For example, this is the console after a failed migration:

     root@chi-node-01:~# gnt-instance console static-gitlab-shim.torproject.org
     Instance static-gitlab-shim.torproject.org is paused, unpausing
    

    i.e. it's hung; the qemu process had to be killed on the node to recover from that failed migration.

    A workaround for this issue is to use failover instead of migrate, which involves a shutdown. Another workaround might be to upgrade QEMU to backports.

  • gnt-cluster verify might warn about incompatible DRBD versions. If it's only a minor version difference, it shouldn't matter and the warning can be ignored.

Past upgrades

SLA

As long as the cluster is not over capacity, it should be able to survive the loss of a node in the cluster unattended.

Virtual machines with a justified need can be provisioned within a few business days without problems.

New nodes can be provisioned within a week or two, depending on budget and hardware availability.

Design and architecture

Our first Ganeti cluster (gnt-fsn) is made of multiple machines hosted with Hetzner Robot, Hetzner's dedicated server hosting service. All machines use the same hardware to avoid problems with live migration. That is currently a customized build of the PX62-NVMe line.

Network layout

Machines are interconnected over a vSwitch, a "virtual layer 2 network" probably implemented using Software-defined Networking (SDN) on top of Hetzner's network. The details of that implementation do not matter much to us, since we do not trust the network and run an IPsec layer on top of the vswitch. We communicate with the vSwitch through Open vSwitch (OVS), which is (currently manually) configured on each node of the cluster.

There are two distinct IPsec networks:

  • gnt-fsn-public: the public network, which maps to the fsn-gnt-inet-vlan vSwitch at Hetzner, the vlan-gntinet OVS network, and the gnt-fsn network pool in Ganeti. it provides public IP addresses and routing across the network. instances get IP allocated in this network.

  • gnt-fsn-be: the private ganeti network which maps to the fsn-gnt-backend-vlan vSwitch at Hetzner and the vlan-gntbe OVS network. it has no matching gnt-network component and IP addresses are allocated manually in the 172.30.135.0/24 network through DNS. it provides internal routing for Ganeti commands and DRBD storage mirroring.
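
To inspect how a given node is actually wired into those networks, the Open vSwitch configuration can be dumped locally on the node; a quick read-only look:

# show the OVS bridges, their ports and VLAN tags as configured on this node
ovs-vsctl show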

MAC address prefix selection

The MAC address prefix for the gnt-fsn cluster (00:66:37:...) seems to have been picked arbitrarily. While it does not conflict with a known existing prefix, it could eventually be issued to a manufacturer and reused, possibly leading to a MAC address clash. The closest is currently Huawei:

$ grep ^0066 /var/lib/ieee-data/oui.txt
00664B     (base 16)		HUAWEI TECHNOLOGIES CO.,LTD

Such a clash is fairly improbable, because that new manufacturer would need to show up on the local network as well. Still, new clusters SHOULD use a different MAC address prefix in a locally administered address (LAA) space, which "are distinguished by setting the second-least-significant bit of the first octet of the address". In other words, the MAC address must have 2, 6, A or E as its second hex digit, i.e. it must look like one of these:

x2 - xx - xx - xx - xx - xx
x6 - xx - xx - xx - xx - xx
xA - xx - xx - xx - xx - xx
xE - xx - xx - xx - xx - xx

We used 06:66:38 in the (now defunct) gnt-chi cluster for that reason. We picked the 06:66 prefix to resemble the existing 00:66 prefix used in gnt-fsn but varied the last quad (from :37 to :38) to make them slightly more different-looking.

Obviously, it's unlikely the MAC addresses will be compared across clusters in the short term. But the two networks could technically end up bridged at layer 2 if some exotic VPN setup gets established between them in the future, so it's good to have some difference.
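
A quick sanity check for a candidate prefix is to test the locally-administered bit (the second-least-significant bit of the first octet) directly in the shell; for the 06:66:38 prefix used in gnt-chi, for example:

# prints a non-zero value when the locally-administered bit is set in the first octet
echo $(( 0x06 & 0x02 ))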

Hardware variations

We considered experimenting with the new AX line (AX51-NVMe) but in the past DSA had problems live-migrating (it wouldn't immediately fail but there were "issues" after). So we might need to failover instead of migrate between those parts of the cluster. There are also doubts that the Linux kernel supports those shiny new processors at all: similar processors had trouble booting before Linux 5.5, for example, so it might be worth waiting a little before switching to that new platform, even if it's cheaper. See the CPU emulation section below for a larger discussion.

CPU emulation

Note that we might want to tweak the cpu_type parameter. By default, it emulates a lot of processing that can be delegated to the host CPU instead. If we use kvm:cpu_type=host, then each node will tailor the emulation system to the CPU on the node. But that might make the live migration more brittle: VMs or processes can crash after a live migrate because of a slightly different configuration (microcode, CPU, kernel and QEMU versions all play a role). So we need to find the lowest common denominator in CPU families. The list of available families supported by QEMU varies between releases, but is visible with:

# qemu-system-x86_64 -cpu help
Available CPUs:
x86 486
x86 Broadwell             Intel Core Processor (Broadwell)
[...]
x86 Skylake-Client        Intel Core Processor (Skylake)
x86 Skylake-Client-IBRS   Intel Core Processor (Skylake, IBRS)
x86 Skylake-Server        Intel Xeon Processor (Skylake)
x86 Skylake-Server-IBRS   Intel Xeon Processor (Skylake, IBRS)
[...]

The current PX62 line is based on the Coffee Lake Intel micro-architecture. The closest matching family would be Skylake-Server or Skylake-Server-IBRS, according to wikichip. Note that newer QEMU releases (4.2, currently in unstable) have more supported features.
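
If we ever did want to pin the cluster to such a common denominator rather than host, the change would look something like the command below; note that Skylake-Server-IBRS is only the family suggested by the discussion above, not a value we have validated in production:

gnt-cluster modify -H kvm:cpu_type=Skylake-Server-IBRS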

In that context, of course, supporting different CPU manufacturers (say AMD vs Intel) is impractical: they will have totally different families that are not compatible with each other. This will break live migration, which can trigger crashes and problems in the migrated virtual machines.

If there are problems live-migrating between machines, it is still possible to "failover" (gnt-instance failover instead of migrate), which shuts off the machine, fails over disks, and starts it on the other side. That's not such a big problem: we often need to reboot the guests when we reboot the hosts anyways. But it does complicate our work. Of course, it's also possible that live migrations work fine if no cpu_type at all is specified in the cluster, but that needs to be verified.
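
Concretely, a failover of a hypothetical instance looks like this (it prompts for confirmation since it involves shutting the instance down):

gnt-instance failover test-01.torproject.org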

Nodes could also be grouped to limit (automated) live migration to a subset of nodes.

Update: kvm:cpu_type=host was enabled in the gnt-dal cluster.


Installer

The ganeti-instance-debootstrap package is used to install instances. It is configured through Puppet with the shared ganeti module, which deploys a few hooks to automate the install as much as possible. The installer will:

  1. setup grub to respond on the serial console
  2. setup and log a random root password
  3. make sure SSH is installed and log the public keys and fingerprints
  4. create a 512MB file-backed swap volume at /swapfile, or a swap partition if it finds one labeled swap
  5. setup basic static networking through /etc/network/interfaces.d

We have custom configurations on top of that to:

  1. add a few base packages
  2. do our own custom SSH configuration
  3. fix the hostname to be a FQDN
  4. add a line to /etc/hosts
  5. add a tmpfs

There is work underway to refactor and automate the install better, see ticket 31239 for details.

Services

TODO: document a bit how the different Ganeti services interface with each other

Storage

TODO: document how DRBD works in general, and how it's setup here in particular.

See also the DRBD documentation.

The Cymru PoP has an iSCSI cluster for large filesystem storage. Ideally, this would be automated inside Ganeti.

For now, iSCSI volumes are manually created and passed to new virtual machines.

Queues

TODO: document gnt-job

Interfaces

TODO: document the RAPI and ssh commandline

Authentication

TODO: X509 certs and SSH

Implementation

Ganeti is implemented in a mix of Python and Haskell, in a mature codebase.

Ganeti relies heavily on DRBD for live migrations.

Issues

There is no issue tracker specifically for this project; file or search for issues in the team issue tracker with the ~Ganeti label.

Upstream Ganeti has of course its own issue tracker on GitHub.

Users

TPA are the main direct operators of the services, but most if not all TPI teams use its services either directly or indirectly.

Upstream

Ganeti used to be a Google project until it was abandoned and spun off to a separate, standalone free software community. Right now it is maintained by a mixed collection of organisations and non-profits.

Monitoring and metrics

Anarcat implemented a Prometheus metrics exporter that writes stats in the node exporter "textfile" collector. The source code is available in tor-puppet.git, as profile/files/ganeti/tpa-ganeti-prometheus-metrics.py. Those metrics are in turn displayed in the Ganeti Health Grafana dashboard.

The WMF worked on a proper Ganeti exporter we should probably switch to, once it is packaged in Debian.

Tests

To test if a cluster is working properly, the verify command can be run:

gnt-cluster verify

Creating a VM and migrating it between machines is also a good test.
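
As a sketch of that smoke test, assuming a throwaway instance was created first (see the new instance procedure) under the placeholder name test-01.torproject.org:

# bounce the test instance between its two nodes, then clean it up
gnt-instance migrate test-01.torproject.org
gnt-instance failover test-01.torproject.org
gnt-instance remove test-01.torproject.org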

Logs

Ganeti logs a significant amount of information in /var/log/ganeti/. Those logs are of particular interest:

  • node-daemon.log: all low-level commands and HTTP requests on the node daemon, includes, for example, LVM and DRBD commands
  • os/*$hostname*.log: installation log for machine $hostname, this also includes VM migration logs for the move-instance or gnt-instance export commands

Backups

There are no backups of virtual machines directly from Ganeti: each machine is expected to perform its own backups. The Ganeti configuration should be backed up as normal by our backup systems.

Other documentation

Discussion

The Ganeti cluster has served us well over the years. This section aims at discussing the current limitations and possible future.

Overview

Ganeti works well for our purposes, which is hosting generic virtual machines. It's less efficient at managing mixed-usage or specialized setups like large file storage or high-performance databases, because of cross-machine contamination and storage overhead.

Security and risk assessment

No in-depth security review or risk assessment has been done on the Ganeti clusters recently. It is believed the cryptography and design of the Ganeti cluster are sound. There's a concern with server host key reuse and, in general, there's some confusion over what goes over TLS and what goes over SSH.

Deleting VMs is too easy in Ganeti: you just need one confirmation, and a VM is completely wiped, so there's always a risk of accidental removal.

Technical debt and next steps

The ganeti-instance-debootstrap installer is slow and almost abandoned upstream. It required significant patching to get cross-cluster migrations working.

There are concerns that the DRBD and memory redundancy required by the Ganeti allocators lead to resource waste, that is to be investigated in tpo/tpa/team#40799.

Proposed Solution

No recent proposal was made for the Ganeti clusters, although the Cymru migration is somewhat relevant.

Other alternatives

Proxmox is probably the biggest contender here. OpenStack is also marginally similar.

Old libvirt cluster retirement

The project of creating a Ganeti cluster for Tor appeared in the summer of 2019. The machines were delivered by Hetzner in July 2019 and set up by weasel by the end of the month.

Goals

The goal was to replace the aging group of KVM servers (kvm[1-5], AKA textile, unifolium, macrum, kvm4 and kvm5).

Must have

  • arbitrary virtual machine provisioning
  • redundant setup
  • automated VM installation
  • replacement of existing infrastructure

Nice to have

  • fully configured in Puppet
  • full high availability with automatic failover
  • extra capacity for new projects

Non-Goals

  • Docker or "container" provisioning - we consider this out of scope for now
  • self-provisioning by end-users: TPA remains in control of provisioning

Approvals required

A budget was proposed by weasel in May 2019 and approved by Vegas in June. An extension to the budget was approved in January 2020 by Vegas.

Proposed Solution

Setup a Ganeti cluster of two machines with a Hetzner vSwitch backend.

Cost

The design based on the PX62 line has the following monthly cost structure:

  • per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
  • IPv4 space: 35.29EUR (/27)
  • IPv6 space: 8.40EUR (/64)
  • bandwidth cost: 1EUR/TB (currently 38EUR)

At three servers, that adds up to around 435 EUR/month. Up to date costs are available in the Tor VM hosts.xlsx spreadsheet.

Alternatives considered

Note that the instance install is possible also through FAI, see the Ganeti wiki for examples.

There are GUIs for Ganeti that we are not using, but could, if we want to grant more users access:

  • Ganeti Web manager is a "Django based web frontend for managing Ganeti virtualization clusters. Since Ganeti only provides a command-line interface, Ganeti Web Manager’s goal is to provide a user friendly web interface to Ganeti via Ganeti’s Remote API. On top of Ganeti it provides a permission system for managing access to clusters and virtual machines, an in browser VNC console, and vm state and resource visualizations"
  • Synnefo is a "complete open source cloud stack written in Python that provides Compute, Network, Image, Volume and Storage services, similar to the ones offered by AWS. Synnefo manages multiple Ganeti clusters at the backend for handling of low-level VM operations and uses Archipelago to unify cloud storage. To boost 3rd-party compatibility, Synnefo exposes the OpenStack APIs to users."