Decommissioning a host
Note that this procedure is relevant only to TPA hosts. For Tails hosts, follow the Tails server decommission procedure, which should eventually be merged here.
Retirement checklist to copy-paste in retirement tickets:
- announcement
- retire the host in fabric
- remove from LDAP with `ldapvi`
- power-grep
- remove from tor-passwords
- remove from DNSwl
- remove from docs
  - wiki pages
  - nextcloud server list if not a VM
  - if an entire service is taken offline with the machine, remove the service page and links to it
- remove from racks
- remove from reverse DNS
- notify accounting if needed
The detailed procedure:
- long before (weeks or months) the machine is retired, make sure users are aware it will go away and of its replacement services
- retire the host from its parent, backups and Puppet. Before launching the retirement you will need to know:

  - for a Ganeti instance, the Ganeti parent (primary) host
  - the backup storage server: if the machine is in the fsn cluster, `backup-storage-01.torproject.org`, otherwise `bungei.torproject.org`

  For example:

  ```
  fab -H $INSTANCE retire.retire-all --parent-host=$PARENT_HOST --backup-host=$BACKUP_HOST
  ```

  Copy the output of the script into the retirement ticket. Adjust the delays for more sensitive hosts with:

  ```
  --retirement-delay-vm=30 --retirement-delay-backups=90
  ```

  The above waits 30 days before destroying the disks and 90 days before destroying the backups; the defaults are 7 days for disks and 30 days for backups.

  TODO: `$PARENT_HOST` should be some Ganeti node (e.g. `fsn-node-01.torproject.org`) but could be auto-detected...

  TODO: the backup storage host could be auto-detected

  TODO: cover physical machines
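  Until that auto-detection exists, one way to look up a Ganeti instance's primary node is to ask the cluster master directly (this assumes you have access to the Ganeti master; the `-o name,pnode` output fields come from the `gnt-instance` documentation, not from this procedure):

  ```
  # on the Ganeti master: print the instance and its primary (parent) node
  gnt-instance list -o name,pnode $INSTANCE
  ```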
- remove from LDAP with `ldapvi` (STEP 6 above), copy-paste it in the ticket

- do one huge power-grep and find over all our source code, for example with unifolium that was:

  ```
  grep -nHr --exclude-dir .git -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
  find -iname unifolium\*
  ```

  TODO: extract those values from LDAP (e.g. purpose) and run the grep in Fabric
- remove from tor-passwords (TODO: put in fabric). magic command (not great):

  ```
  pass rm root/unifolium.torproject.org
  # look for traces of the host elsewhere
  for f in ~/.password-store/*/*; do
      if gpg -d < $f 2>/dev/null | \
          grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
      then
          echo match found in $f
      fi
  done
  ```

- remove from DNSwl
- remove the machine from this wiki (if present in documentation), the Nextcloud spreadsheet (if it is not in Ganeti), and, if it's an entire service, the services page
- if it's a physical machine or a virtual host we don't control, schedule removal from racks or hosts with upstream
- remove from reverse DNS
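  To verify the records are really gone, a quick check (reusing the unifolium example addresses from the power-grep step above) could be:

  ```
  # both lookups should eventually return nothing once reverse DNS is cleaned up
  dig -x 148.251.180.115 +short
  dig -x 2a01:4f8:211:6e8::2 +short
  ```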
- if retiring the machine took out a recurring expense (e.g. physical machines, cloud hosting), contact accounting to tell them about the expected change
Wiping disks
To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. We do this with the
`nwipe`(1) command, which should be installed before anything else:

```
apt install nwipe vmtouch
```
Run in a screen:
```
screen
```
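Before taking anything offline, it can help to double-check which RAID arrays and member disks are present; the usual tools (nothing specific to this procedure) are enough:

```
# list RAID arrays and their member devices
cat /proc/mdstat
# overview of disks, partitions and mount points
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```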
If there's a RAID array, first wipe one of the disks by taking it offline and writing garbage:
```
mdadm --fail /dev/md0 /dev/sdb1 &&
mdadm --remove /dev/md0 /dev/sdb1 &&
mdadm --fail /dev/md1 /dev/sdb2 &&
mdadm --remove /dev/md1 /dev/sdb2 &&
: etc, for the other RAID elements in /proc/mdstat &&
nwipe --autonuke --method=random --verify=off /dev/sdb
```
This will take a long time. Note that it will start a GUI which is useful because it will give you timing estimates, which the command-line version does not provide.
WARNING: this procedure doesn't cover the case where the disk is an SSD. See this paper for details on how classic data scrubbing software might not work for SSDs. For now we use this:
```
nwipe --autonuke --method=random --rounds=2 --verify=off /dev/nvme1n1
```
TODO: consider hdparm and the "secure erase" procedure for SSDs:
```
hdparm --user-master u --security-set-pass Eins /dev/sdc
time hdparm --user-master u --security-erase Eins /dev/sdc
```
See also the stressant documentation about this.
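Note that ATA secure erase typically refuses to run while the drive reports itself as "frozen"; a quick pre-check (standard `hdparm` usage, offered here as a hint rather than as part of the established procedure) is:

```
# look for "not frozen" in the security section; a frozen drive may need
# a suspend/resume or power cycle before secure erase will work
hdparm -I /dev/sdc | grep -i frozen
```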
When you return:
- start a `screen` session with a static `busybox` as your `SHELL` that will survive disk wiping:

  ```
  # make sure /tmp is on a tmpfs first!
  cp -av /root /tmp/root &&
  mount -o bind /tmp/root /root &&
  cp /bin/busybox /tmp/root/sh &&
  export SHELL=/tmp/root/sh &&
  exec screen -s $SHELL
  ```

- lock down busybox and screen in memory:

  ```
  vmtouch -dl /usr/bin/screen /bin/busybox /tmp/root/sh /usr/sbin/nwipe
  ```

  TODO: the above aims at making busybox survive the destruction, so that it's cached in RAM. It's unclear if that actually works, because typically SSH is also busted and needs a lot more to bootstrap, so we can't log back in if we lose the console. Ideally, we'd run this in a serial console that would have more reliable access... See also vmtouch.
- kill all processes but the SSH daemon, your SSH connection and shell. This will vary from machine to machine, but a good way is to list all processes with `systemctl status` and `systemctl stop` the services one by one. Hint: multiple services can be passed to the same `stop` command, for example:

  ```
  systemctl stop \
      acpid \
      acpid.path \
      acpid.socket \
      apache2 \
      atd \
      bacula-fd \
      bind9 \
      cron \
      dbus \
      dbus.socket \
      fail2ban \
      ganeti \
      haveged \
      irqbalance \
      ipsec \
      iscsid \
      libvirtd \
      lvm2-lvmetad.service \
      lvm2-lvmetad.socket \
      mdmonitor \
      multipathd.service \
      multipathd.socket \
      ntp \
      openvswitch-switch \
      postfix \
      prometheus-bind-exporter \
      prometheus-node-exporter \
      smartd \
      strongswan \
      syslog-ng.service \
      systemd-journald \
      systemd-journald-audit.socket \
      systemd-journald-dev-log.socket \
      systemd-journald.socket \
      systemd-logind.service \
      systemd-udevd \
      systemd-udevd-control.socket \
      systemd-udevd-kernel.socket \
      timers.target \
      ulogd2 \
      unbound \
      virtlogd \
      virtlogd.socket
  ```
- disable swap:

  ```
  swapoff -a
  ```
- un-mount everything that can be unmounted (except `/proc`):

  ```
  umount -a
  ```
- remount everything else read-only:

  ```
  mount -o remount,ro /
  ```
- sync disks:

  ```
  sync
  ```
- wipe the remaining disk and shut down:

  ```
  # hit control-a control-g to enable the bell in screen
  wipefs -af /dev/noop3 && wipefs -af /dev/noop && \
    nwipe --autonuke --method=random --rounds=2 --verify=off /dev/noop ; \
    printf "SHUTTING DOWN FOREVER IN ONE MINUTE\a\n" ; \
    sleep 60 ; \
    echo o > /proc/sysrq-trigger ; \
    sleep 60 ; \
    echo b > /proc/sysrq-trigger
  ```

  Note: as a safety precaution, the device in the above command has been replaced by `noop`; substitute the actual device (say `sda`) instead.
A few tricks that might work in the shell in case of an emergency, when nothing else works:

- `cat PATH` can be expressed as `mapfile -c 1 -C "printf %s" < PATH` in bash
- `echo *` can be used as a rough approximation of `ls`
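For instance, during the final wipe those tricks can be used to keep an eye on the system even when external binaries are no longer readable. A small sketch, assuming a bash shell is still running (the `mapfile` trick is bash-specific; in a plain busybox shell only the globbing trick applies):

```
# read a file without cat, using the bash builtin trick above
mapfile -c 1 -C "printf %s" < /proc/mdstat
# rough directory listing without ls
echo /proc/[0-9]*
```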
Deprecated manual procedure
Warning: this procedure is difficult to follow and error-prone; a new procedure was established in Fabric, above, and this deprecated one should really just be avoided entirely.
- long before (weeks or months) the machine is retired, make sure users are aware it will go away and of its replacement services
- if applicable, stop the VM in advance:

  - If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the primary service on the machine
  - If the machine is on Ganeti: `gnt-instance stop $host`

- On KVM hosts, undefine the VM:

  ```
  virsh undefine $host
  ```
- wipe host data, possibly with a delay:

  - On some KVM hosts, remove the LVM logical volumes:

    ```
    echo 'lvremove -y vgname/lvname' | at now + 7 days
    ```

    Use `lvs` to list the logical volumes on the machine.

  - Other KVM hosts use file-backed storage:

    ```
    echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
    ```

  - On Ganeti hosts, remove the actual instance with a delay, from the Ganeti master:

    ```
    echo "gnt-instance remove $host" | at now + 7 days
    ```

  - for a normal machine or a machine we do not own the parent host for, wipe the disks using the method described below
- remove it from LDAP: the host entry and any `@<host>` group memberships there might be, as well as any `sudo` passwords users might have configured for that host
- if it has any associated records in `tor-dns/domains` or `auto-dns`, or upstream's reverse DNS thing, remove it from there too, e.g. `grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1`... and check upstream reverse DNS
- on the puppet server (`pauli`):

  ```
  read host ; puppet node clean $host.torproject.org && puppet node deactivate $host.torproject.org
  ```

  TODO: that procedure is incomplete, use the `retire.revoke-puppet` job in fabric instead.
- grep the `tor-puppet` repository for the host (and maybe its IP addresses) and clean up; also look for files with the hostname in their name
- clean host from `tor-passwords`
- remove any certs and backup keys from the `letsencrypt-domains.git` and `letsencrypt-domains/backup-keys.git` repositories that are no longer relevant:

  ```
  git -C letsencrypt-domains grep -e $host -e storm.torproject.org
  # remove entries found above
  git -C letsencrypt-domains commit
  git -C letsencrypt-domains push
  find letsencrypt-domains/backup-keys -name "$host.torproject.org" -o -name 'storm.torproject.org*' -delete
  git -C letsencrypt-domains/backup-keys commit
  git -C letsencrypt-domains/backup-keys push
  ```

  Also clean up the relevant files on the letsencrypt master (currently `nevii`), for example:

  ```
  ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
  ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
  ```
- if the machine is handling mail, remove it from dnswl.org (password in tor-passwords, `hosts-extra-info`) - consider that it can take a long time (weeks? months?) to be able to "re-add" an IP address in that service, so if that IP can eventually be reused, it might be better to keep it there in the short term
- schedule a removal of the host's backups, on the backup server (currently `bungei`):

  ```
  cd /srv/backups/bacula/
  mv $host.torproject.org $host.torproject.org-OLD
  echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
  ```
- remove the machine from this wiki (if present in documentation), the Nextcloud spreadsheet (if it is not in Ganeti), and, if it's an entire service, the services page
- if it's a physical machine or a virtual host we don't control, schedule removal from racks or hosts with upstream
- after a 30-day delay, retire the host from the Bacula catalog: on the director (currently `bacula-director-01`), run `bconsole`, then:

  ```
  delete client=$INSTANCE-fd
  ```

  for example:

  ```
  delete client=archeotrichon.torproject.org-fd
  ```
- after a 30-day delay, remove the PostgreSQL backups on the storage server (currently `/srv/backups/pg` on `bungei`), if relevant
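  The layout under `/srv/backups/pg` is not documented here; assuming per-host directories (an assumption to verify on the storage server before deleting anything), the removal could mirror the Bacula cleanup above:

  ```
  # hypothetical path: check the actual directory name under /srv/backups/pg first
  echo "rm -rf /srv/backups/pg/$host.torproject.org" | at now + 30 days
  ```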