PostgreSQL is an advanced database server that is robust and fast, although possibly less well-known and popular than its eternal rival in the free software world, MySQL.

Tutorial

These are quick reminders for common, easy tasks in a cluster.

Connecting

Our PostgreSQL setup is fairly standard, so connecting to the database works like on any other Debian machine:

sudo -u postgres psql

This drops you in a psql shell where you can issue SQL queries and so on.
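
Once connected, a quick sanity check (generic SQL, nothing specific to our setup) confirms which database and user you ended up with:

SELECT current_database(), current_user, version();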

Creating a user and a database

This procedure will create a user and a database named tor-foo:

sudo -u postgres createuser -D -E -P -R -S tor-foo
sudo -u postgres createdb tor-foo

For read-only permissions:

sudo -u postgres psql -d tor-foo -c 'GRANT SELECT ON ALL TABLES IN SCHEMA public TO "tor-foo";
  GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO "tor-foo";
  GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO "tor-foo";'

For read-write:

sudo -u postgres psql -d tor-foo -c 'GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO "tor-foo";
  GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO "tor-foo";
  GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO "tor-foo";'
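
Note that GRANT ... ON ALL TABLES IN SCHEMA only covers objects that already exist. If more tables will be created later, something like the following sets matching default privileges for future objects; this is a sketch, assuming the objects are created by the postgres user and that you are connected to the tor-foo database:

ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
  GRANT SELECT ON TABLES TO "tor-foo";
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
  GRANT SELECT, USAGE ON SEQUENCES TO "tor-foo";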

How-to

Checking permissions

It's surprisingly hard to figure out the privileges of a given user in PostgreSQL. First, privileges are context-sensitive (per database), and second, there are all sorts of places where they can be recorded.

The simplest way is to use the documented \du command to list roles, but that only shows role attributes and group memberships. To go beyond that (e.g. specific GRANTs), you need something more. This, for example, will show table-level grants, given that you're connected to the right database already:

SELECT *
  FROM information_schema.role_table_grants 
 WHERE grantee='USERNAME';

But it won't show access like table ownerships. For that you need:

SELECT *
  FROM pg_tables 
 WHERE tableowner = 'USERNAME';

But that won't show things like "functions" and so on.
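
Function grants, in turn, can be listed with a similar information_schema view (again, run this while connected to the right database):

SELECT routine_schema, routine_name, privilege_type
  FROM information_schema.role_routine_grants
 WHERE grantee='USERNAME';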

This mouthful of SQL might be more exhaustive:

-- Cluster permissions not "on" anything else
SELECT
  'cluster' AS on,
  NULL AS name_1,
  NULL AS name_2,
  NULL AS name_3,
  unnest(
    CASE WHEN rolcanlogin THEN ARRAY['LOGIN'] ELSE ARRAY[]::text[] END
    || CASE WHEN rolsuper THEN ARRAY['SUPERUSER'] ELSE ARRAY[]::text[] END
    || CASE WHEN rolcreaterole THEN ARRAY['CREATE ROLE'] ELSE ARRAY[]::text[] END
    || CASE WHEN rolcreatedb THEN ARRAY['CREATE DATABASE'] ELSE ARRAY[]::text[] END
  ) AS privilege_type
FROM pg_roles
WHERE oid = quote_ident(:'rolename')::regrole

UNION ALL

-- Direct role memberships
SELECT 'role' AS on, groups.rolname AS name_1, NULL AS name_2, NULL AS name_3, 'MEMBER' AS privilege_type
FROM pg_auth_members mg
INNER JOIN pg_roles groups ON groups.oid = mg.roleid
INNER JOIN pg_roles members ON members.oid = mg.member
WHERE members.rolname = :'rolename'

-- Direct ACL or ownerships
UNION ALL (
  -- ACL or owned-by dependencies of the role - global or in the currently connected database
  WITH owned_or_acl AS (
    SELECT
      refobjid,  -- The referenced object: the role in this case
      classid,   -- The pg_class oid that the dependent object is in
      objid,     -- The oid of the dependent object in the table specified by classid
      deptype,   -- The dependency type: o==is owner, and might have acl, a==has acl and not owner
      objsubid   -- The 1-indexed column index for table column permissions. 0 otherwise.
    FROM pg_shdepend
    WHERE refobjid = quote_ident(:'rolename')::regrole
    AND refclassid='pg_catalog.pg_authid'::regclass
    AND deptype IN ('a', 'o')
    AND (dbid = 0 OR dbid = (SELECT oid FROM pg_database WHERE datname = current_database()))
  ),

  relkind_mapping(relkind, type) AS (
    VALUES 
      ('r', 'table'),
      ('v', 'view'),
      ('m', 'materialized view'),
      ('f', 'foreign table'),
      ('p', 'partitioned table'),
      ('S', 'sequence')
  ),

  prokind_mapping(prokind, type) AS (
    VALUES 
      ('f', 'function'),
      ('p', 'procedure'),
      ('a', 'aggregate function'),
      ('w', 'window function')
  ),

  typtype_mapping(typtype, type) AS (
    VALUES
      ('b', 'base type'),
      ('c', 'composite type'),
      ('e', 'enum type'),
      ('p', 'pseudo type'),
      ('r', 'range type'),
      ('m', 'multirange type'),
      ('d', 'domain')
  )

  -- Database ownership
  SELECT 'database' AS on, datname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_database d
  INNER JOIN owned_or_acl a ON a.objid = d.oid 
  WHERE classid = 'pg_database'::regclass AND deptype = 'o'

  UNION ALL

  -- Database privileges
  SELECT 'database' AS on, datname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_database d
  INNER JOIN owned_or_acl a ON a.objid = d.oid 
  CROSS JOIN aclexplode(COALESCE(d.datacl, acldefault('d', d.datdba)))
  WHERE classid = 'pg_database'::regclass AND grantee = refobjid

  UNION ALL

  -- Schema ownership
  SELECT 'schema' AS on, nspname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_namespace n
  INNER JOIN owned_or_acl a ON a.objid = n.oid 
  WHERE classid = 'pg_namespace'::regclass AND deptype = 'o'

  UNION ALL

  -- Schema privileges
  SELECT 'schema' AS on, nspname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_namespace n
  INNER JOIN owned_or_acl a ON a.objid = n.oid
  CROSS JOIN aclexplode(COALESCE(n.nspacl, acldefault('n', n.nspowner)))
  WHERE classid = 'pg_namespace'::regclass AND grantee = refobjid

  UNION ALL

  -- Table(-like) ownership
  SELECT r.type AS on, nspname AS name_1, relname AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_class c
  INNER JOIN pg_namespace n ON n.oid = c.relnamespace
  INNER JOIN owned_or_acl a ON a.objid = c.oid 
  INNER JOIN relkind_mapping r ON r.relkind = c.relkind
  WHERE classid = 'pg_class'::regclass AND deptype = 'o' AND objsubid = 0

  UNION ALL

  -- Table(-like) privileges
  SELECT r.type AS on, nspname AS name_1, relname AS name_2, NULL AS name_3, privilege_type
  FROM pg_class c
  INNER JOIN pg_namespace n ON n.oid = c.relnamespace
  INNER JOIN owned_or_acl a ON a.objid = c.oid
  CROSS JOIN aclexplode(COALESCE(c.relacl, acldefault('r', c.relowner)))
  INNER JOIN relkind_mapping r ON r.relkind = c.relkind
  WHERE classid = 'pg_class'::regclass AND grantee = refobjid AND objsubid = 0

  UNION ALL

  -- Column privileges
  SELECT 'table column', nspname AS name_1, relname AS name_2, attname AS name_3, privilege_type
  FROM pg_attribute t
  INNER JOIN pg_class c ON c.oid = t.attrelid
  INNER JOIN pg_namespace n ON n.oid = c.relnamespace
  INNER JOIN owned_or_acl a ON a.objid = t.attrelid
  CROSS JOIN aclexplode(COALESCE(t.attacl, acldefault('c', c.relowner)))
  WHERE classid = 'pg_class'::regclass AND grantee = refobjid AND objsubid != 0

  UNION ALL

  -- Function and procedure ownership
  SELECT m.type AS on, nspname AS name_1, proname AS name_2, p.oid::text AS name_3, 'OWNER' AS privilege_type
  FROM pg_proc p
  INNER JOIN pg_namespace n ON n.oid = p.pronamespace
  INNER JOIN owned_or_acl a ON a.objid = p.oid 
  INNER JOIN prokind_mapping m ON m.prokind = p.prokind
  WHERE classid = 'pg_proc'::regclass AND deptype = 'o'

  UNION ALL

  -- Function and procedure privileges
  SELECT m.type AS on, nspname AS name_1, proname AS name_2, p.oid::text AS name_3, privilege_type
  FROM pg_proc p
  INNER JOIN pg_namespace n ON n.oid = p.pronamespace
  INNER JOIN owned_or_acl a ON a.objid = p.oid
  CROSS JOIN aclexplode(COALESCE(p.proacl, acldefault('f', p.proowner)))
  INNER JOIN prokind_mapping m ON m.prokind = p.prokind
  WHERE classid = 'pg_proc'::regclass AND grantee = refobjid

  UNION ALL

  -- Large object ownership
  SELECT 'large object' AS on, l.oid::text AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_largeobject_metadata l
  INNER JOIN owned_or_acl a ON a.objid = l.oid 
  WHERE classid = 'pg_largeobject'::regclass AND deptype = 'o'

  UNION ALL

  -- Large object privileges
  SELECT 'large object' AS on, l.oid::text AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_largeobject_metadata l
  INNER JOIN owned_or_acl a ON a.objid = l.oid
  CROSS JOIN aclexplode(COALESCE(l.lomacl, acldefault('L', l.lomowner)))
  WHERE classid = 'pg_largeobject'::regclass AND grantee = refobjid

  UNION ALL

  -- Type ownership
  SELECT m.type, nspname AS name_1, typname AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_type t
  INNER JOIN pg_namespace n ON n.oid = t.typnamespace
  INNER JOIN owned_or_acl a ON a.objid = t.oid 
  INNER JOIN typtype_mapping m ON m.typtype = t.typtype
  WHERE classid = 'pg_type'::regclass AND deptype = 'o'

  UNION ALL

  -- Type privileges
  SELECT m.type, nspname AS name_1, typname AS name_2, NULL AS name_3, privilege_type
  FROM pg_type t
  INNER JOIN pg_namespace n ON n.oid = t.typnamespace
  INNER JOIN owned_or_acl a ON a.objid = t.oid
  CROSS JOIN aclexplode(COALESCE(t.typacl, acldefault('T', t.typowner)))
  INNER JOIN typtype_mapping m ON m.typtype = t.typtype
  WHERE classid = 'pg_type'::regclass AND grantee = refobjid

  UNION ALL

  -- Language ownership
  SELECT 'language' AS on, l.lanname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_language l
  INNER JOIN owned_or_acl a ON a.objid = l.oid 
  WHERE classid = 'pg_language'::regclass AND deptype = 'o'

  UNION ALL

  -- Language privileges
  SELECT 'language' AS on, l.lanname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_language l
  INNER JOIN owned_or_acl a ON a.objid = l.oid
  CROSS JOIN aclexplode(COALESCE(l.lanacl, acldefault('l', l.lanowner)))
  WHERE classid = 'pg_language'::regclass AND grantee = refobjid

  UNION ALL

  -- Tablespace ownership
  SELECT 'tablespace' AS on, t.spcname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_tablespace t
  INNER JOIN owned_or_acl a ON a.objid = t.oid 
  WHERE classid = 'pg_tablespace'::regclass AND deptype = 'o'

  UNION ALL

  -- Tablespace privileges
  SELECT 'tablespace' AS on, t.spcname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_tablespace t
  INNER JOIN owned_or_acl a ON a.objid = t.oid
  CROSS JOIN aclexplode(COALESCE(t.spcacl, acldefault('t', t.spcowner)))
  WHERE classid = 'pg_tablespace'::regclass AND grantee = refobjid

  UNION ALL

  -- Foreign data wrapper ownership
  SELECT 'foreign-data wrapper' AS on, f.fdwname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_foreign_data_wrapper f
  INNER JOIN owned_or_acl a ON a.objid = f.oid 
  WHERE classid = 'pg_foreign_data_wrapper'::regclass AND deptype = 'o'

  UNION ALL

  -- Foreign data wrapper privileges
  SELECT 'foreign-data wrapper' AS on, f.fdwname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_foreign_data_wrapper f
  INNER JOIN owned_or_acl a ON a.objid = f.oid
  CROSS JOIN aclexplode(COALESCE(f.fdwacl, acldefault('F', f.fdwowner)))
  WHERE classid = 'pg_foreign_data_wrapper'::regclass AND grantee = refobjid

  UNION ALL

  -- Foreign server ownership
  SELECT 'foreign server' AS on, f.srvname AS name_1, NULL AS name_2, NULL AS name_3, 'OWNER' AS privilege_type
  FROM pg_foreign_server f
  INNER JOIN owned_or_acl a ON a.objid = f.oid 
  WHERE classid = 'pg_foreign_server'::regclass AND deptype = 'o'

  UNION ALL

  -- Foreign server privileges
  SELECT 'foreign server' AS on, f.srvname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_foreign_server f
  INNER JOIN owned_or_acl a ON a.objid = f.oid
  CROSS JOIN aclexplode(COALESCE(f.srvacl, acldefault('S', f.srvowner)))
  WHERE classid = 'pg_foreign_server'::regclass AND grantee = refobjid

  UNION ALL

  -- Parameter privileges
  SELECT 'parameter' AS on, p.parname AS name_1, NULL AS name_2, NULL AS name_3, privilege_type
  FROM pg_parameter_acl p
  INNER JOIN owned_or_acl a ON a.objid = p.oid
  CROSS JOIN aclexplode(p.paracl)
  WHERE classid = 'pg_parameter_acl'::regclass AND grantee = refobjid
);

Replace :'rolename' with the user, or pass it on the command line with:

psql -f show-grants-for-role.sql -v rolename=YOUR_ROLE

source.

Show running queries

If the server seems slow, it's possible to inspect running queries with this query:

SELECT datid,datname,pid,query_start,now()-query_start as age,state,query FROM pg_stat_activity;

If a backend is waiting, it might be worth looking at the wait_event and wait_event_type columns as well. We're looking for deadlocks here.
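
To see which backends are blocked and what is blocking them, something like this should work on PostgreSQL 9.6 and later (pg_blocking_pids() is a standard function; the query itself is just a sketch):

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       query
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0;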

Killing a slow query

This will kill all queries to database_name:

SELECT 
    pg_terminate_backend(pid) 
FROM 
    pg_stat_activity 
WHERE 
    -- don't kill my own connection!
    pid <> pg_backend_pid()
    -- don't kill the connections to other databases
    AND datname = 'database_name'
    ;

A more selective approach is to list running queries (above) and then kill only one PID, say:

SELECT 
    pg_terminate_backend(pid) 
FROM 
    pg_stat_activity 
WHERE 
    -- don't kill my own connection!
    pid = 1234;

Diagnosing performance issues

Some ideas from the #postgresql channel on Libera:

  • look at query_start and state in pg_stat_activity, and, if a backend is waiting, at wait_event and wait_event_type, possibly looking for locks there. This is done by the query above, in Show running queries

  • enable pg_stat_statements to see where the time is going, and then dig into the queries/functions found there (see the example query below), possibly with auto_explain and auto_explain.log_nested_statements=on
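
For example, once pg_stat_statements is loaded (it needs to be in shared_preload_libraries and created with CREATE EXTENSION pg_stat_statements), a query along these lines shows the most expensive statements. This is a sketch: on PostgreSQL 13 and later the columns are total_exec_time and mean_exec_time, older versions call them total_time and mean_time:

SELECT round(total_exec_time::numeric / 1000, 1) AS total_s,
       calls,
       round(mean_exec_time::numeric, 2)         AS mean_ms,
       left(query, 80)                           AS query
  FROM pg_stat_statements
 ORDER BY total_exec_time DESC
 LIMIT 10;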

In general, we have a few Grafana dashboards specific to PostgreSQL (see logs and metrics, below) that might help trace performance issues. System-level statistics (disk, CPU, memory usage) can also help pinpoint where the bottleneck is, so the basic node-level Grafana dashboards are useful there too.

Consider tuning the whole database with pgtune.

Find what is taking up space

This will show all databases with their sizes and description:

\l+

This will report size and count information for all "relations", which includes indexes:

SELECT relname AS objectname
     , relkind AS objecttype
     , reltuples AS "#entries"
     , pg_size_pretty(relpages::bigint*8*1024) AS size
     FROM pg_class
     WHERE relpages >= 8
     ORDER BY relpages DESC;

Note that it can be difficult to track the total size of a table with the above, because it doesn't add the index sizes to their table; indexes are typically small but can grow quite significantly.
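
For a single table, the size functions give the full picture, indexes and TOAST included (my_table is just a placeholder here):

SELECT pg_size_pretty(pg_relation_size('public.my_table'))       AS heap,
       pg_size_pretty(pg_indexes_size('public.my_table'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('public.my_table')) AS total;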

This will report similar information, but aggregated per table:

SELECT table_name
    , row_estimate
    , pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(table_bytes) AS TABLE
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
  FROM (
  SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r'
  ) a
) a ORDER BY total_bytes DESC LIMIT 10;

Same with databases:

SELECT d.datname AS Name,  pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
        ELSE 'No Access'
    END AS SIZE
FROM pg_catalog.pg_database d
    ORDER BY
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_database_size(d.datname)
        ELSE NULL
    END DESC -- nulls first
    LIMIT 20;

Source: PostgreSQL wiki. See also the upstream manual.

Checking for wasted space

PostgreSQL is particular as a database in the sense that it never actually returns free space to the operating system unless explicitly asked to. Modern PostgreSQL releases (8.1+) have an "auto-vacuum" daemon which takes care of cleaning up after DELETE and related operations to reclaim that disk space, but this only marks those regions of the database as reusable: it doesn't actually return those blocks to the operating system.

Because databases typically either stay the same size or grow over their lifetime, this typically does not matter: the next INSERT will use that space and no space is actually wasted.

But sometimes that disk space can grow too large. How do we check if our database is wasting space? There are many ways...

check_postgresql

There is a monitoring plugin called check_postgresql, which we don't actually use, that features a bloat check which can run regularly. This could be ported to Prometheus or, perhaps better, something in the PostgreSQL exporter could check for bloat.

Running bloat query by hand

The above script might be annoying to deploy for an ad-hoc situation. You can just run the query by hand instead:

SELECT
  current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
  ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::float/otta END)::numeric,1) AS tbloat,
  CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
  iname, /*ituples::bigint, ipages::bigint, iotta,*/
  ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::float/iotta END)::numeric,1) AS ibloat,
  CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM (
  SELECT
    schemaname, tablename, cc.reltuples, cc.relpages, bs,
    CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
  FROM (
    SELECT
      ma,bs,schemaname,tablename,
      (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
      (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
    FROM (
      SELECT
        schemaname, tablename, hdr, ma, bs,
        SUM((1-null_frac)*avg_width) AS datawidth,
        MAX(null_frac) AS maxfracsum,
        hdr+(
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
        ) AS nullhdr
      FROM pg_stats s, (
        SELECT
          (SELECT current_setting('block_size')::numeric) AS bs,
          CASE WHEN substring(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
          CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
        FROM (SELECT version() AS v) AS foo
      ) AS constants
      GROUP BY 1,2,3,4,5
    ) AS foo
  ) AS rs
  JOIN pg_class cc ON cc.relname = rs.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
) AS sml
ORDER BY wastedbytes DESC;

Another way

It is rumored, however, that this is not very accurate. A better option seems to be this ... more complicated query:

-- change to the max number of field per index if not default.
\set index_max_keys 32
-- (readonly) IndexTupleData size
\set index_tuple_hdr 2
-- (readonly) ItemIdData size
\set item_pointer 4
-- (readonly) IndexAttributeBitMapData size
\set index_attribute_bm (:index_max_keys + 8 - 1) / 8

SELECT current_database(), nspname, c.relname AS table_name, index_name, bs*(sub.relpages)::bigint AS totalbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint END                                    AS wastedbytes,
  CASE WHEN sub.relpages <= otta THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END AS realbloat
FROM (
  SELECT bs, nspname, table_oid, index_name, relpages, coalesce(
    ceil((reltuples*(:item_pointer+nulldatahdrwidth))/(bs-pagehdr::float)) +
      CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
    ) AS otta
  FROM (
    SELECT maxalign, bs, nspname, relname AS index_name, reltuples, relpages, relam, table_oid,
      ( index_tuple_hdr_bm +
          maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
            WHEN index_tuple_hdr_bm%maxalign = 0 THEN maxalign
            ELSE index_tuple_hdr_bm%maxalign
          END
        + nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
            WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
            ELSE nulldatawidth::integer%maxalign
          END
      )::numeric AS nulldatahdrwidth, pagehdr
    FROM (
      SELECT
        i.nspname, i.relname, i.reltuples, i.relpages, i.relam, s.starelid, a.attrelid AS table_oid,
        current_setting('block_size')::numeric AS bs,
        /* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
        CASE
          WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
          ELSE 4
        END AS maxalign,
        /* per page header, fixed size: 20 for 7.X, 24 for others */
        CASE WHEN substring(current_setting('server_version') FROM '#"[0-9]+#"%' FOR '#')::integer > 7
          THEN 24
          ELSE 20
        END AS pagehdr,
        /* per tuple header: add index_attribute_bm if some cols are null-able */
        CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
          THEN :index_tuple_hdr
          ELSE :index_tuple_hdr + :index_attribute_bm
        END AS index_tuple_hdr_bm,
        /* data len: subtract the space saved by null values, using the null fraction from stats */
        sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
      FROM pg_attribute AS a
        JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
        JOIN (
          SELECT nspname, relname, reltuples, relpages, indrelid, relam, regexp_split_to_table(indkey::text, ' ')::smallint AS attnum
          FROM pg_index
            JOIN pg_class ON pg_class.oid=pg_index.indexrelid
            JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
        ) AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
      WHERE a.attnum > 0
      GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
    ) AS s1
  ) AS s2
    LEFT JOIN pg_am am ON s2.relam = am.oid
) as sub
JOIN pg_class c ON c.oid=sub.table_oid
ORDER BY wastedbytes;

It was modified to sort the output by wastedbytes.

Grouped output

One disadvantage of the above query is that tables and indexes are displayed separately. How do we know which belongs to which? It also makes it less obvious what the big tables are, and which ones are important.

This one comes from the pgx_scripts GitHub repo, and is a 130+ line SQL query:

-- new table bloat query
-- still needs work; is often off by +/- 20%
WITH constants AS (
    -- define some constants for sizes of things
    -- for reference down the query and easy maintenance
    SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 8 AS ma
),
no_stats AS (
    -- screen out tables that have attributes
    -- which don't have stats, such as JSON
    SELECT table_schema, table_name, 
        n_live_tup::numeric as est_rows,
        pg_table_size(relid)::numeric as table_size
    FROM information_schema.columns
        JOIN pg_stat_user_tables as psut
           ON table_schema = psut.schemaname
           AND table_name = psut.relname
        LEFT OUTER JOIN pg_stats
        ON table_schema = pg_stats.schemaname
            AND table_name = pg_stats.tablename
            AND column_name = attname 
    WHERE attname IS NULL
        AND table_schema NOT IN ('pg_catalog', 'information_schema')
    GROUP BY table_schema, table_name, relid, n_live_tup
),
null_headers AS (
    -- calculate null header sizes
    -- omitting tables which dont have complete stats
    -- and attributes which aren't visible
    SELECT
        hdr+1+(sum(case when null_frac <> 0 THEN 1 else 0 END)/8) as nullhdr,
        SUM((1-null_frac)*avg_width) as datawidth,
        MAX(null_frac) as maxfracsum,
        schemaname,
        tablename,
        hdr, ma, bs
    FROM pg_stats CROSS JOIN constants
        LEFT OUTER JOIN no_stats
            ON schemaname = no_stats.table_schema
            AND tablename = no_stats.table_name
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
        AND no_stats.table_name IS NULL
        AND EXISTS ( SELECT 1
            FROM information_schema.columns
                WHERE schemaname = columns.table_schema
                    AND tablename = columns.table_name )
    GROUP BY schemaname, tablename, hdr, ma, bs
),
data_headers AS (
    -- estimate header and row size
    SELECT
        ma, bs, hdr, schemaname, tablename,
        (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
        (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
    FROM null_headers
),
table_estimates AS (
    -- make estimates of how large the table should be
    -- based on row and page size
    SELECT schemaname, tablename, bs,
        reltuples::numeric as est_rows, relpages * bs as table_bytes,
    CEIL((reltuples*
            (datahdr + nullhdr2 + 4 + ma -
                (CASE WHEN datahdr%ma=0
                    THEN ma ELSE datahdr%ma END)
                )/(bs-20))) * bs AS expected_bytes,
        reltoastrelid
    FROM data_headers
        JOIN pg_class ON tablename = relname
        JOIN pg_namespace ON relnamespace = pg_namespace.oid
            AND schemaname = nspname
    WHERE pg_class.relkind = 'r'
),
estimates_with_toast AS (
    -- add in estimated TOAST table sizes
    -- estimate based on 4 toast tuples per page because we dont have 
    -- anything better.  also append the no_data tables
    SELECT schemaname, tablename, 
        TRUE as can_estimate,
        est_rows,
        table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
        expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
    FROM table_estimates LEFT OUTER JOIN pg_class as toast
        ON table_estimates.reltoastrelid = toast.oid
            AND toast.relkind = 't'
),
table_estimates_plus AS (
-- add some extra metadata to the table data
-- and calculations to be reused
-- including whether we can't estimate it
-- or whether we think it might be compressed
    SELECT current_database() as databasename,
            schemaname, tablename, can_estimate, 
            est_rows,
            CASE WHEN table_bytes > 0
                THEN table_bytes::NUMERIC
                ELSE NULL::NUMERIC END
                AS table_bytes,
            CASE WHEN expected_bytes > 0 
                THEN expected_bytes::NUMERIC
                ELSE NULL::NUMERIC END
                    AS expected_bytes,
            CASE WHEN expected_bytes > 0 AND table_bytes > 0
                AND expected_bytes <= table_bytes
                THEN (table_bytes - expected_bytes)::NUMERIC
                ELSE 0::NUMERIC END AS bloat_bytes
    FROM estimates_with_toast
    UNION ALL
    SELECT current_database() as databasename, 
        table_schema, table_name, FALSE, 
        est_rows, table_size,
        NULL::NUMERIC, NULL::NUMERIC
    FROM no_stats
),
bloat_data AS (
    -- do final math calculations and formatting
    select current_database() as databasename,
        schemaname, tablename, can_estimate, 
        table_bytes, round(table_bytes/(1024^2)::NUMERIC,3) as table_mb,
        expected_bytes, round(expected_bytes/(1024^2)::NUMERIC,3) as expected_mb,
        round(bloat_bytes*100/table_bytes) as pct_bloat,
        round(bloat_bytes/(1024::NUMERIC^2),2) as mb_bloat,
        table_bytes, expected_bytes, est_rows
    FROM table_estimates_plus
)
-- filter output for bloated tables
SELECT databasename, schemaname, tablename,
    can_estimate,
    est_rows,
    pct_bloat, mb_bloat,
    table_mb
FROM bloat_data
-- this where clause defines which tables actually appear
-- in the bloat chart
-- example below filters for tables which are either 50%
-- bloated and more than 20mb in size, or more than 25%
-- bloated and more than 4GB in size
WHERE ( pct_bloat >= 50 AND mb_bloat >= 10 )
    OR ( pct_bloat >= 25 AND mb_bloat >= 1000 )
ORDER BY mb_bloat DESC;

It will show only tables which have significant bloat, as defined by the WHERE clause in the last few lines above. This makes the output much more readable.

There's also this other query we haven't evaluated.

Recovering disk space

In some cases, you do need to reclaim actual operating system disk space from the PostgreSQL server (see above to check whether you do). This can happen, for example, if you have removed years of old data from a database.

VACUUM FULL

Typically this is done with the VACUUM FULL command (instead of plain VACUUM, which the auto-vacuum does, see this discussion for details). This will actually rewrite all the tables to make sure only live data is stored on disk. It's roughly the equivalent of a dump/restore, except it is faster.
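
For example, to rewrite a single table (my_table is a placeholder; remember this takes an exclusive lock on the table for the duration):

VACUUM (FULL, VERBOSE) my_table;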

pg_repack

For very large changes (say, dozens of terabytes), however, VACUUM FULL (and even plain VACUUM) can be prohibitively slow (think days). And while VACUUM doesn't require an exclusive lock on the tables it's working on, VACUUM FULL does, which implies a significant outage.

An alternative to that method is the pg_repack extension, which is packaged in Debian. In Debian 10 buster, the following procedure was used on bacula-director-01 to purge old data about removed Bacula clients that hadn't been cleaned up in years:

apt install postgresql-11-repack

Then install the extension on the database, as the postgres user (sudo -u postgres -i), this needs to be done only once:

psql -c "CREATE EXTENSION pg_repack" -d bacula

Then, for each table:

pg_repack -d bacula --table media

It is a good idea to start with a small table we can afford to lose, just in case something goes wrong. That job took about 2 hours on a very large table (150GB, the file table). The entire Bacula database went from using 161GB to 91GB after that cleanup, see this ticket for details.

When done, drop the pg_repack extension:

DROP EXTENSION pg_repack;

Also note that, after the repack, VACUUM performance improved significantly, going from hours (if not days) to minutes.

Note that pg_squeeze is another alternative to pg_repack, but it isn't available in Debian.

WAL is growing: dangling replication slot

As noted below, we currently don't (yet) generally use PostgreSQL replication. However, some tools can use a replication slot to extract backups, as is the case for barman.

If disk usage is growing linearly and you find out that the pg_wal directory is the biggest item, take a look at whether there is a replication slot that's left dangling and keeping PostgreSQL from being able to clear out its WAL:

SELECT slot_name,
   pg_wal_lsn_diff(
      pg_current_wal_lsn(),
      restart_lsn
   ) AS bytes_behind,
   active,
   wal_status
FROM pg_replication_slots
WHERE wal_status <> 'lost'
ORDER BY restart_lsn;

If there is one entry listed there, especially if the value in the column bytes_behind is high, then you might have found the source of the issue.

First off, verify that the replication slot is really not used by anything anymore. That means checking what other tools are running on the host, whether the name of the replication slot evokes something familiar, and checking in with service admins about the slot if necessary.

If you know that you can remove the replication slot safely, then you can do so with:

select pg_drop_replication_slot('barman');

After that, you'll need to wait for the next checkpoint to happen. By default this is 15 minutes, but some hosts may set a different checkpoint interval. Once the checkpoint is reached, you should see the disk usage go down on the machine.
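
If you would rather not wait, a superuser can force a checkpoint by hand; WAL segments that are no longer needed should then be removed or recycled:

CHECKPOINT;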

See this page for information on other cases where the WAL can start growing.

Monitoring the VACUUM processes

In PostgreSQL, the VACUUM command "reclaims storage occupied by dead tuples". To quote the excellent PostgreSQL documentation:

In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.

By default, the autovacuum launcher is enabled in PostgreSQL (and in our deployments), which should automatically take care of this problem.

This will show that the autovacuum daemon is running:

# ps aux | grep [v]acuum
postgres   534  0.5  4.7 454920 388012 ?       Ds   05:31   3:08 postgres: 11/main: autovacuum worker   bacula
postgres 17259  0.0  0.1 331376 10984 ?        Ss   Nov12   0:10 postgres: 11/main: autovacuum launcher   

In the above, the launcher is running, and we can see a worker has been started to vacuum the bacula database.

If you don't see the launcher, check that it's enabled:

bacula=# SELECT name, setting FROM pg_settings WHERE name='autovacuum' or name='track_counts';
 autovacuum   | on
 track_counts | on

Both need to be on for the autovacuum workers to operate. It's possible that some tables might have autovacuum disabled, however, see:

SELECT reloptions FROM pg_class WHERE relname='my_table';

In the above scenario, the autovacuum worker bacula process had been running for hours, which was concerning. One way to diagnose is to figure out how much data there is to vacuum.

This query will show the table with the most dead tuples that need to be cleaned up by the VACUUM process:

SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;

In our case, there were tens of millions of rows to clean:

bacula=# SELECT relname, n_dead_tup FROM pg_stat_user_tables where n_dead_tup > 0 order by n_dead_tup DESC LIMIT 1;
 file    |  183278595

That is almost 200 million tuples to clean up!

We can see details of the vacuum operation with this funky query, taken from this amazing blog post:

SELECT
p.pid,
now() - a.xact_start AS duration,
coalesce(wait_event_type ||'.'|| wait_event, 'f') AS waiting,
CASE
WHEN a.query ~*'^autovacuum.*to prevent wraparound' THEN 'wraparound'
WHEN a.query ~*'^vacuum' THEN 'user'
ELSE 'regular'
END AS mode,
p.datname AS database,
p.relid::regclass AS table,
p.phase,
pg_size_pretty(p.heap_blks_total * current_setting('block_size')::int) AS table_size,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(p.heap_blks_scanned * current_setting('block_size')::int) AS scanned,
pg_size_pretty(p.heap_blks_vacuumed * current_setting('block_size')::int) AS vacuumed,
round(100.0 * p.heap_blks_scanned / p.heap_blks_total, 1) AS scanned_pct,
round(100.0 * p.heap_blks_vacuumed / p.heap_blks_total, 1) AS vacuumed_pct,
p.index_vacuum_count,
round(100.0 * p.num_dead_tuples / p.max_dead_tuples,1) AS dead_pct
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a using (pid)
ORDER BY now() - a.xact_start DESC

For example, the above vacuum on the Bacula director is in this state at the time of writing:

bacula=# \x
Expanded display is on.
bacula=# SELECT [...]
-[ RECORD 1 ]------+----------------
pid                | 534
duration           | 10:55:24.413986
waiting            | f
mode               | regular
database           | bacula
table              | file
phase              | scanning heap
table_size         | 55 GB
total_size         | 103 GB
scanned            | 29 GB
vacuumed           | 16 GB
scanned_pct        | 52.2
vacuumed_pct       | 29.3
index_vacuum_count | 1
dead_pct           | 93.8

This is a lot of information, but basically the worker with PID 534 has been running for 10h55m on the bacula database. It is in the scanning heap phase, second out of 8 phases of the vacuuming process. It's working on the file table, which has 55GB of data on the "heap" and a total size of 103 GB (including indexes). It has scanned 29 GB of data (52%) and vacuumed 16GB of that (29%). The dead_pct indicates that the maintenance_work_mem buffer is 94% full, which suggests raising that buffer could improve performance. I am not sure what the waiting and index_vacuum_count fields are for.
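
To check the relevant setting (autovacuum workers use autovacuum_work_mem if it is set, and fall back to maintenance_work_mem otherwise):

SHOW maintenance_work_mem;
SHOW autovacuum_work_mem;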

Naturally, the pg_stat_progress_vacuum query above is mostly interesting for very large VACUUM operations; they typically do not take this long. This one was especially slow because we suddenly removed almost half of the old clients in the Bacula database, see ticket 40525 for more information.

One more trick: this will show last VACUUM dates on tables:

SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE last_vacuum IS NOT NULL or last_autovacuum IS NOT NULL ORDER BY relname;

Some of the ideas above were found on this datadog post.

Finally, note that the Debian 10 ("buster") version of PostgreSQL (11) does not support progress reporting for "FULL" VACUUM; that feature was introduced in PostgreSQL 12. Debian 11 ("bullseye") has PostgreSQL 13, but VACUUM FULL progress there is reported in the pg_stat_progress_cluster view, so the query above might not show it even there.
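
On PostgreSQL 12 and later, VACUUM FULL (and CLUSTER) progress can be watched with something like this instead, a sketch based on the documented pg_stat_progress_cluster view:

SELECT pid,
       datname,
       relid::regclass AS table_name,
       command,
       phase,
       heap_blks_scanned,
       heap_blks_total
  FROM pg_stat_progress_cluster;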

Running a backup manually

In pgBackRest, there is a systemd unit for each full or diff backup, so this is as simple as:

systemctl start pgbackrest-backup-full@materculae.service

You'd normally do a "diff" backup though:

systemctl start pgbackrest-backup-diff@materculae.service

You can follow the logs with:

journalctl -u pgbackrest-backup-diff@materculae -f

And check progress with:

watch -d sudo -u pgbackrest-materculae pgbackrest --stanza=materculae.torproject.org info

Checking backup health

The backup configuration can be tested on a client with:

sudo -u postgres pgbackrest --stanza=`hostname -f` check

For example, this was done to test weather-01:

root@weather-01:~# sudo -u postgres pgbackrest --stanza=weather-01.torproject.org check

You should be able to see information about that backup with the info command on the client:

sudo -u postgres pgbackrest --stanza=`hostname -f` info

For example:

root@weather-01:~# sudo -u postgres pgbackrest --stanza=`hostname -f` info
stanza: weather-01.torproject.org
    status: ok
    cipher: none

    db (current)
        wal archive min/max (15): 000000010000001F00000004/00000001000000210000002F

        full backup: 20241118-202245F
            timestamp start/stop: 2024-11-18 20:22:45 / 2024-11-18 20:28:43
            wal start/stop: 000000010000001F00000009 / 000000010000001F00000009
            database size: 40.3MB, database backup size: 40.3MB
            repo1: backup set size: 7.6MB, backup size: 7.6MB

This will run the check command on all configured backups:

for stanza in $( ls /var/lib/pgbackrest/backup ); do
    hostname=$(basename $stanza .torproject.org)
    sudo -u pgbackrest-$hostname pgbackrest  --stanza=$stanza check
done

This can be used to check the status of all backups in batch:

for stanza in $( ls /var/lib/pgbackrest/backup ); do
    hostname=$(basename $stanza .torproject.org)
    sudo -u pgbackrest-$hostname pgbackrest  --stanza=$stanza info | tail -12
done

It's essentially the same as the first, but with info instead of check.

See also the upstream FAQ.

Backup recovery

pgBackRest is our new PostgreSQL backup system. It features a restore command and detailed restore procedures, which include instructions on how to restore a specific database in a cluster and how to do point-in-time recovery, that is, go back to a specific time in the past.

pgBackRest uses a variation of the official recovery procedure, which can also be referred to for more information.

Simple latest version restore

The procedure here assumes you are restoring to the latest version in the backups, overwriting the current server. It assumes PostgreSQL is installed, if not, see the installation procedure.

  1. go to the right cluster version's directory:

    cd /var/lib/postgresql/15/
    
  2. stop the server:

    service postgresql stop
    
  3. move or remove all files from the old cluster. To move them aside:

    mv main main.old && sudo -u postgres mkdir --mode 700 main
    

    or to remove all files:

    find main -mindepth 1 -delete
    

    You should typically move files aside unless you don't have enough room to restore while keeping the bad data in place.

  4. Run the restore command:

    sudo -u postgres pgbackrest --stanza=`hostname -f` restore
    

    Backup progress can be found in the log files, in:

    /var/log/pgbackrest/`hostname -f`-restore.log
    

    It takes a couple of minutes to start, but eventually you should see lines like:

    2024-12-05 19:22:52.582 P01 DETAIL: restore file /var/lib/postgresql/15/main/base/16402/852859.4 (1GB, 11.39%) checksum 8a17b30a73a1d1ea9c8566bd264eb89d9ed3f35c
    

    The percentage there (11.39% above) is how far in the restore you are. Note that this number, like all progress bars, lies. In particular, we've seen in the wild a long tail of 8KB files that seem to never finish:

    2024-12-05 19:34:53.754 P01 DETAIL: restore file /var/lib/postgresql/15/main/base/16400/14044 (8KB, 100.00%) checksum b7a66985a1293b00b6402bfb650fa22c924fd893
    

    It will finish eventually.

  5. Start the restored server:

    sudo service postgresql start
    
  6. You're not done yet. The server will now replay log files from archives. Monitor the progress in /var/log/postgresql/postgresql-15-main.log; you will see the following once recovery is complete:

    database system is ready to accept connections
    

    Here's an example of a recovery:

    starting PostgreSQL 15.10 (Debian 15.10-0+deb12u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
    listening on IPv4 address "0.0.0.0", port 5432
    listening on IPv6 address "::", port 5432
    listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
    database system was interrupted; last known up at 2024-12-05 16:28:52 UTC
    starting archive recovery
    starting backup recovery with redo LSN 12B/410000C8, checkpoint LSN 12B/41000100, on timeline ID 2
    restored log file "000000020000012B00000041" from archive
    redo starts at 12B/410000C8
    restored log file "000000020000012B00000042" from archive
    completed backup recovery with redo LSN 12B/410000C8 and end LSN 12B/410B3930
    consistent recovery state reached at 12B/410B3930
    database system is ready to accept read-only connections
    restored log file "000000020000012B00000043" from archive
    restored log file "000000020000012B00000044" from archive
    redo in progress, elapsed time: 10.63 s, current LSN: 12B/43087E50
    restored log file "000000020000012B00000045" from archive
    redo done at 12B/452747D8 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 19.77 s
    last completed transaction was at log time 2024-12-05 19:20:38.375101+00
    restored log file "000000020000012B00000045" from archive
    selected new timeline ID: 3
    archive recovery complete
    checkpoint starting: end-of-recovery immediate wait
    checkpoint complete: wrote 840 buffers (5.1%); 0 WAL file(s) added, 0 removed, 5 recycled; write=0.123 s, sync=0.009 s, total=0.153 s; sync files=71, longest=0.004 s, average=0.001 s; distance=81919 kB, estimate=81919 kB
    database system is ready to accept connections
    

    Note that the date and LOG parts of the log entries were removed to make it easier to read.

This procedure also assumes that the pgbackrest command is functional. This should normally be the case on an existing server, but if pgBackRest is misconfigured or the server is lost or too damaged, you might not be able to perform a restore with the normal procedure.

In that case, you should treat the situation as a bare-bones recovery, below.

Restoring on a new server

The normal restore procedure assumes the server is properly configured for backups (technically with a proper "stanza").

If that's not the case, for example if you're recovering the database to a new server, you first need to do a proper PostgreSQL installation which should setup the backups properly.

The only twist is that you will need to tweak the stanza names to match the server you are restoring from and will also likely need to add extra SSH keys.

TODO: document exact procedure, should be pretty similar to the bare bones recovery below

Bare bones restore

This assumes the host is configured with Puppet. If this is a real catastrophe (e.g. the Puppet server is down!), you might not have that luxury. In that case, you need to manually configure pgBackRest, with the following exceptions:

  • 2.b: user and SSH keys are probably already present on server
  • 4.b: server won't be able to connect to client
  • 5: don't configure the pgbackrest server, it's already done
  • stop at step seven:
    • 7: don't create the stanza on the server, already present
    • 8: no need to configure backups on the client, we're restoring
    • 9: the check command will fail if the server is stopped
    • 10: server configuration talks to the old server
    • 11: we're doing a restore, not a backup

Essentially, once you have a new machine to restore on, you will:

  1. Install required software:

    apt install sudo pgbackrest postgresql
    
  2. Create SSH keys on the new VM:

    sudo -u postgres ssh-keygen
    
  3. Add that public key to the repository server, in /etc/ssh/userkeys/pgbackrest-weather-01:

    echo 'no-agent-forwarding,no-X11-forwarding,no-port-forwarding,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }"  ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJrOnnOpX0cyzQ/lqvNLQt2mcJUziiJ0MdubSf/c1+2g postgres@test-01' \
      > /etc/ssh/userkeys/pgbackrest-weather-01
    
  4. Configure the repository on the client, in /etc/pgbackrest.conf:

    [weather-01.torproject.org]
    lock-path = /var/lock/pgbackrest/weather-01.torproject.org
    pg1-host = weather-01.torproject.org
    pg1-path = /var/lib/postgresql/15/main
    log-path = /var/log/pgbackrest/weather-01.torproject.org
    repo1-path = /var/lib/pgbackrest

  5. Restore with:

    sudo -u postgres pgbackrest --stanza=weather-01.torproject.org restore
    

Once this is done, make sure to reconfigure the machine with Puppet properly so that it's again hooked up with the backup system.

Note that if the machine has been gone long enough, it's possible the user and configuration are gone from the server as well, in which case you'll need to recreate those too (step 2.b in the manual procedure).

Restoring without pgBackRest

This is likely not the procedure you want, and should be used only in extreme cases where pgBackRest is completely failing to restore from backups.

This procedure assumes you have already a server installed with enough disk space to hold the data to recover. We assume you are restoring the server testdb-01, which is hardcoded in this procedure.

  1. First, disable Puppet so you have control on when PostgreSQL is running:

    puppet agent --disable 'keeping control of postgresql startup -- anarcat 2019-10-09'
    
  2. Then install the right PostgreSQL version and stop the server:

    apt install postgresql-13
    service postgresql stop
    

    Make sure you run the SAME MAJOR VERSION of PostgreSQL as the backup! You cannot restore across versions. This might mean installing from backports or an older version of Debian.

  3. On that new PostgreSQL server, show the postgres server public key, creating it if missing:

    ( [ -f ~postgres/.ssh/id_rsa.pub ] || sudo -u postgres ssh-keygen )&&
    cat ~postgres/.ssh/*.pub
    
  4. Then on the backup server, allow the user to access backups of the old server:

    echo "restrict $HOSTKEY" > /etc/ssh/userkeys/pgbackrest-testdb-01.more
    

    The $HOSTKEY is the public key found on the postgres server above.

    NOTE: the above will not work if the key is already present in /etc/ssh/userkeys/torbackup, as the key will override the one in .more. Edit the key in there instead in that case.

  5. Then you need to find the right backup set to restore from. Each backup set has a timestamp in its name, so just sorting them by name should be enough to find the latest one.

    Copy the base backup in place, as the postgres user:

    sudo -u postgres -i
    rsync -a pgbackrest-testdb-01@$BACKUPSERVER:/srv/backups/pg/backup/testdb-01.torproject.org/20250604-170509F/pg_data /var/lib/postgresql/13/main/
    
  6. Make sure the pg_wal directory doesn't contain any files.

    rm -rf -- /var/lib/postgresql/13/main/pg_wal/*
    

    Note: this directory was called pg_xlog in earlier PostgreSQL versions (e.g. 9.6).

  7. Tell the database it is okay to restore from backups:

    touch /var/lib/postgresql/13/main/recovery.signal
    
  8. At this point, you're ready to start the database based on that restored backup. But you will probably also want to restore WAL files to get the latest changes.

  9. Add a configuration parameter in /etc/postgresql/13/main/postgresql.conf that will tell PostgreSQL where to find the WAL files. At least the restore_command needs to be specified. Something like this may work:

    restore_command = '/usr/bin/ssh $OLDSERVER cat /srv/backups/pg/backup/anonticket-01.torproject.org/13-1/%f'
    

    You can specify a specific recovery point in postgresql.conf, see the upstream documentation for more information. This, for example, will recover meronense from backups of the main cluster up to October 1st, and then start accepting connections (promote; other options are pause, to stay in standby and accept more logs, or shutdown, to stop the server):

    restore_command = '/usr/local/bin/pg-receive-file-from-backup meronense main.WAL.%f %p'
    recovery_target_time = '2022-10-01T00:00:00+0000'
    recovery_target_action = 'promote'
    
  10. Then start the server and look at the logs to follow the recovery process:

    service postgresql start
    tail -f /var/log/postgresql/*
    

    You should see something like this in /var/log/postgresql/postgresql-13-main.log:

    2019-10-09 21:17:47.335 UTC [9632] LOG:  database system was interrupted; last known up at 2019-10-04 08:12:28 UTC
    2019-10-09 21:17:47.517 UTC [9632] LOG:  starting archive recovery
    2019-10-09 21:17:47.524 UTC [9633] [unknown]@[unknown] LOG:  incomplete startup packet
    2019-10-09 21:17:48.032 UTC [9639] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:48.538 UTC [9642] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:49.046 UTC [9645] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:49.354 UTC [9632] LOG:  restored log file "00000001000005B200000074" from archive
    2019-10-09 21:17:49.552 UTC [9648] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:50.058 UTC [9651] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:50.565 UTC [9654] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:50.836 UTC [9632] LOG:  redo starts at 5B2/74000028
    2019-10-09 21:17:51.071 UTC [9659] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:17:51.577 UTC [9665] postgres@postgres FATAL:  the database system is starting up
    2019-10-09 21:20:35.790 UTC [9632] LOG:  restored log file "00000001000005B20000009F" from archive
    2019-10-09 21:20:37.745 UTC [9632] LOG:  restored log file "00000001000005B2000000A0" from archive
    2019-10-09 21:20:39.648 UTC [9632] LOG:  restored log file "00000001000005B2000000A1" from archive
    2019-10-09 21:20:41.738 UTC [9632] LOG:  restored log file "00000001000005B2000000A2" from archive
    2019-10-09 21:20:43.773 UTC [9632] LOG:  restored log file "00000001000005B2000000A3" from archive
    

    ... and so on. Note that you do see some of those notices in the normal syslog/journald logs, but, critically, not the following recovery one.

    Then the recovery will complete with something like this, again in /var/log/postgresql/postgresql-13-main.log:

    2019-10-10 01:30:55.460 UTC [16953] LOG:  redo done at 5B8/9C5BE738
    2019-10-10 01:30:55.460 UTC [16953] LOG:  last completed transaction was at log time 2019-10-10 01:04:23.238233+00
    2019-10-10 01:31:03.536 UTC [16953] LOG:  restored log file "00000001000005B80000009C" from archive
    2019-10-10 01:31:06.458 UTC [16953] LOG:  selected new timeline ID: 2
    2019-10-10 01:31:17.485 UTC [16953] LOG:  archive recovery complete
    2019-10-10 01:32:11.975 UTC [16953] LOG:  MultiXact member wraparound protections are now enabled
    2019-10-10 01:32:12.438 UTC [16950] LOG:  database system is ready to accept connections
    2019-10-10 01:32:12.439 UTC [26501] LOG:  autovacuum launcher started
    

    The server is now ready for use.

  11. Remove the temporary SSH access on the backup server, either by removing the .more key file or restoring the previous key configuration:

    rm /etc/ssh/userkeys/torbackup.more

  12. re-enable Puppet:

    puppet agent -t

Troubleshooting restore failures

could not locate required checkpoint record

If you find the following error in the logs:

FATAL:  could not locate required checkpoint record

It's because postgres cannot find the WAL logs to restore from. There could be many causes for this, but the ones I stumbled upon were:

  • wrong permissions on the archive (put the WAL files in ~postgres, not ~root)
  • wrong path or pattern for restore_command (double-check the path and make sure to include the right prefix, e.g. main.WAL)

missing "archive recovery complete" message

Note: those instructions were copied from the legacy backup system documentation. They are, however, believed to be relevant to certain failure modes of PostgreSQL recovery in general, but should be carefully reviewed.

A block like this should show up in the /var/log/postgresql/postgresql-13-main.log file:

2019-10-10 01:30:55.460 UTC [16953] LOG:  redo done at 5B8/9C5BE738
2019-10-10 01:30:55.460 UTC [16953] LOG:  last completed transaction was at log time 2019-10-10 01:04:23.238233+00
2019-10-10 01:31:03.536 UTC [16953] LOG:  restored log file "00000001000005B80000009C" from archive
2019-10-10 01:31:06.458 UTC [16953] LOG:  selected new timeline ID: 2
2019-10-10 01:31:17.485 UTC [16953] LOG:  archive recovery complete
2019-10-10 01:32:11.975 UTC [16953] LOG:  MultiXact member wraparound protections are now enabled
2019-10-10 01:32:12.438 UTC [16950] LOG:  database system is ready to accept connections
2019-10-10 01:32:12.439 UTC [26501] LOG:  autovacuum launcher started

The key entry here is archive recovery complete.

If it does not show up in the logs, the server might just be still recovering a WAL file, or it might be paused.

You can confirm what the server is doing by looking at the processes, for example, this is still recovering a WAL file:

root@meronense-backup-01:~# systemctl status postgresql@13-main.service
● postgresql@13-main.service - PostgreSQL Cluster 13-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2022-10-27 15:06:40 UTC; 1min 0s ago
    Process: 67835 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 13-main start (code=exited, status=0/SUCCESS)
   Main PID: 67840 (postgres)
      Tasks: 5 (limit: 9510)
     Memory: 50.0M
        CPU: 626ms
     CGroup: /system.slice/system-postgresql.slice/postgresql@13-main.service
             ├─67840 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main -c config_file=/etc/postgresql/13/main/postgresql.conf
             ├─67842 postgres: 13/main: startup recovering 0000000100000600000000F5
             ├─67851 postgres: 13/main: checkpointer
             ├─67853 postgres: 13/main: background writer
             └─67855 postgres: 13/main: stats collector

... because there's a process doing:

67842 postgres: 13/main: startup recovering 0000000100000600000000F5

In that case, it was stuck in "pause" mode, as the logs indicated:

2022-10-27 15:08:54.882 UTC [67933] LOG:  starting PostgreSQL 13.8 (Debian 13.8-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-27 15:08:54.882 UTC [67933] LOG:  listening on IPv6 address "::1", port 5432
2022-10-27 15:08:54.882 UTC [67933] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2022-10-27 15:08:54.998 UTC [67933] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-27 15:08:55.236 UTC [67939] LOG:  database system was shut down in recovery at 2022-10-27 15:08:54 UTC
2022-10-27 15:08:55.911 UTC [67939] LOG:  starting point-in-time recovery to 2022-10-01 00:00:00+00
2022-10-27 15:08:56.764 UTC [67939] LOG:  restored log file "0000000100000600000000F4" from archive
2022-10-27 15:08:57.316 UTC [67939] LOG:  redo starts at 600/F4000028
2022-10-27 15:08:58.497 UTC [67939] LOG:  restored log file "0000000100000600000000F5" from archive
2022-10-27 15:08:59.119 UTC [67939] LOG:  consistent recovery state reached at 600/F50051F0
2022-10-27 15:08:59.119 UTC [67933] LOG:  database system is ready to accept read only connections
2022-10-27 15:08:59.120 UTC [67939] LOG:  recovery stopping before commit of transaction 12884886, time 2022-10-01 08:40:35.735422+00
2022-10-27 15:08:59.120 UTC [67939] LOG:  pausing at the end of recovery
2022-10-27 15:08:59.120 UTC [67939] HINT:  Execute pg_wal_replay_resume() to promote.

pg_wal_replay_resume() is not actually the right function to call here, however. It would put the server back into recovery mode, where it would start fetching WAL files again. That's useful for replicated setups, but this is not such a case.

In the above scenario, a recovery_target_time was added but without a recovery_target_action, which led the server to be paused instead of resuming normal operation.

The correct way to recover here is to call pg_promote():

sudo -u postgres psql -c 'SELECT pg_promote();'
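
To confirm the server has actually left recovery mode, pg_is_in_recovery() should now return f (false):

sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'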

Deleting backups

If, for some reason, you need to purge an old backup (e.g. some PII made it there that should not have), you can manually expire backups with the expire --set command.

This, for example, will delete a specific backup regardless of retention policies:

sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01.torproject.org expire --set 20241205-162349F_20241207-162351D
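
If you are unsure of the backup set label, the info command (covered in more detail below) lists all backup sets for a stanza:

sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01.torproject.org info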

Logs for this operation will show up in a log file such as /var/log/pgbackrest/weather-01.torproject.org/weather-01.torproject.org-expire.log.

You can also expire incremental backups associated only with the oldest full backup with:

host=weather-01
cd /srv/backups/pg/backup/$host.torproject.org
for set in $(ls -d *F | sort | head -1)*I ; do
    sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set;
done

Remove --dry-run when you're confident this will work.

To remove all incremental backups:

host=weather-01
cd /srv/backups/pg/backup/$host.torproject.org
for set in *I ; do
    sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set;
done

To remove all incremental backups from all hosts:

cd /srv/backups/pg/backup &&
ls | sed 's/\..*//'| while read host; do
  cd $host.torproject.org &&
  echo $host &&
  for set in *I ; do
      [ -d $set ] && sudo -u pgbackrest-$host pgbackrest --stanza=$host.torproject.org --dry-run expire --set $set
  done
  cd ..
done

Pager playbook

OOM (Out Of Memory)

PostgreSQL has run out of memory on us a few times (tpo/tpa/team#40814, tpo/tpa/team#40482, tpo/tpa/team#40815). You can confirm the problem by looking at the node exporter graphs; for example, this link will show you the last 4 months of memory usage on materculae:

https://grafana.torproject.org/d/Z7T7Cfemz/node-exporter-full?orgId=1&var-job=node&var-node=materculae.torproject.org&var-port=9100&from=now-4M&to=now&viewPanel=78&refresh=1m

The blue "dots" (if any) show the number of times the OOM-killer was called. If there are no dots, it wasn't called, obviously. You can see examples of graphs like this in the history of tpo/tpa/team#40815.

If you are not sure PostgreSQL is responsible, you should be able to confirm by looking at the per-process memory graphs established in July 2022. Here's, for example, a graph of the per-process memory usage on materculae for the past 60 days:

https://grafana.torproject.org/d/LbhyBYq7k/per-process-memory-usage?orgId=1&var-instance=materculae.torproject.org&var-process=java&var-process=postgres&var-min_size=0&from=now-60d&to=now

... or a similar graph for processes with more than 2GB of usage:

https://grafana.torproject.org/d/LbhyBYq7k/per-process-memory-usage?orgId=1&var-instance=materculae.torproject.org&var-process=java&var-process=postgres&var-min_size=2000000&from=now-7d&to=now

This was especially prominent after the Debian bullseye upgrades, where there is a problem with the JIT compiler enabled in PostgreSQL 13 (Debian bug 1019503, upstream thread). So the first thing to do if a server misbehaves is to disable the JIT:

sudo -u postgres psql -c 'SET jit TO OFF;'

This is specifically what fixed a recurring OOM on Materculae in September 2022 (tpo/tpa/team#40815).
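
Note that SET in a psql session only affects that session; as a sketch, making the change persistent across sessions (assuming a cluster-wide change is what's wanted) would look like:

sudo -u postgres psql -c 'ALTER SYSTEM SET jit = off;'
sudo -u postgres psql -c 'SELECT pg_reload_conf();'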

If that fails, another strategy is to avoid involving the OOM killer altogether. By default, the Linux kernel overcommits memory, which means it allows processes to allocate more memory than is available on the system. Problems occur when that memory is actually used, at which point the OOM killer intervenes and kills processes, using "heuristics" to hopefully pick the right one.

The PostgreSQL manual actually recommends disabling that feature with:

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=90

To make this permanent, add the setting in /etc/sysctl.d/:

echo vm.overcommit_memory=2 > /etc/sysctl.d/no-overcommit.conf
echo vm.overcommit_ratio=90 >> /etc/sysctl.d/no-overcommit.conf
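
Then load the new file without rebooting:

sysctl --system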

This will keep the kernel from over-allocating memory, limiting the total memory usage to the swap size plus 90% of the main memory (default is 50%). Note that the comments about the oom_score_adj do not apply to the Debian package as it already sets a proper score for the PostgreSQL server.

Concretely, avoiding overcommit will make the caller fail when it tries to allocate memory. This can still lead to PostgreSQL crashing, but at least it will give a more useful stack trace that will show what was happening during that allocation.

Another thing to look into is possible bad behavior on the client side. A client could abuse memory by issuing many PREPARE statements and never executing them; cursors declared WITH HOLD are apparently another thing to watch for.

Finally, PostgreSQL itself can be tweaked, see this part of the upstream documentation, again:

In some cases, it may help to lower memory-related configuration parameters, particularly shared_buffers, work_mem, and hash_mem_multiplier. In other cases, the problem may be caused by allowing too many connections to the database server itself. In many cases, it may be better to reduce max_connections and instead make use of external connection-pooling software.

Exporter failures

If you get a PgExporterScrapeErrors alert like:

PostgreSQL exporter failure on weather-01.torproject.org

It's because the PostgreSQL exporter cannot talk to the database server.

First, look at the exporter logs, which should show the error, for example in our case:

root@weather-01:~# journalctl -u prometheus-postgres-exporter.service -n 3  | cat
Sep 24 15:04:20 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:20.670Z caller=collector.go:196 level=error msg="collector failed" name=bgwriter duration_seconds=0.002675663 err="pq: Peer authentication failed for user \"prometheus\""
Sep 24 15:04:20 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:20.673Z caller=collector.go:196 level=error msg="collector failed" name=database duration_seconds=0.005719853 err="pq: Peer authentication failed for user \"prometheus\""
Sep 24 15:04:21 weather-01 prometheus-postgres-exporter[453]: ts=2024-09-24T15:04:21.670Z caller=postgres_exporter.go:714 level=error err="Error opening connection to database (user=prometheus%20host=/var/run/postgresql%20database=postgres%20sslmode=disable): pq: Peer authentication failed for user \"prometheus\""

Then you can turn to the PostgreSQL server logs to see the other side of that error:

root@weather-01:~# tail -3 /var/log/postgresql/postgresql-15-main.log
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres LOG:  no match in usermap "torweather" for user "prometheus" authenticated as "prometheus"
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres FATAL:  Peer authentication failed for user "prometheus"
2024-09-24 15:05:20.672 UTC [116289] prometheus@postgres DETAIL:  Connection matched pg_hba.conf line 11: "local	all	all		ident	map=torweather"

In this case, it is a misconfiguration of the authentication layer. The fix was to correct the pg_hba.conf file to avoid overriding the configuration for the prometheus user in the username map, see tor-puppet.git@123d79c19 (restrict the weather pg_ident map to the right user, 2024-09-24).
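
For illustration only (the real file is managed by Puppet, and the torweather names here are assumptions), a pg_hba.conf layout that avoids this kind of problem limits the username map to the service user, so that other local users such as prometheus fall through to plain peer authentication:

# TYPE  DATABASE  USER        METHOD  OPTIONS
local   all       torweather  peer    map=torweather
local   all       all         peer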

A more typical scenario, however, is that the database server is down; make sure it is running correctly with:

systemctl status postgresql@15-main.service

Archiver failure

A PgArchiverFailed alert looks like:

Increased PostgreSQL archiver failure rate on test.example.com

It means the archive_command (from postgresql.conf) has been failing for too long. A failure or two (say, when the backup server is rebooting) is normal, but the alert is specifically designed to fire only after failures persist for a longer period of time.

This means the "point in time recovery" backups have stopped working, and changes since the failures started are not mirrored on the backup server.

Check the server log file (currently /var/log/postgresql/postgresql-15-main.log) for errors. The most typical scenario here is that the backup server is down, or there's a configuration problem in the archive_command.

Here's a pgBackRest failure, for example:

2025-02-25 23:06:22.117 UTC [648720] DETAIL:  The failed archive command was: pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B
ERROR: [103]: unable to find a valid repository:
       repo1: [FileOpenError] raised from remote-0 ssh protocol on 'backup-storage-01.torproject.org': unable to get info for path/file '/var/lock/pgbackrest/weather-01.torproject.org/weather-01.torproject.org.stop': [13] Permission denied
2025-02-25 23:06:25.287 UTC [648720] LOG:  archive command failed with exit code 103
2025-02-25 23:06:25.287 UTC [648720] DETAIL:  The failed archive command was: pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B
2025-02-25 23:06:25.287 UTC [648720] WARNING:  archiving write-ahead log file "00000001000000280000009B" failed too many times, will try again later

You can try running the archive command by hand; for pgBackRest servers, this would be:

cd /var/lib/postgresql/15/main/
sudo -u postgres pgbackrest --stanza=weather-01.torproject.org archive-push pg_wal/00000001000000280000009B

There used to be an issue where a reboot of the repository server would lead to the lock directory being missing, and therefore errors in the archiver. This was fixed in tpo/tpa/team#42058.

A more typical reason for those failures is a discrepancy between the pgBackRest version on the server and client, a known issue with pgBackRest:

status: error (other)
        [ProtocolError] expected value '2.x' for greeting key 'version' but got '2.y'
        HINT: is the same version of pgBackRest installed on the local and remote host?

The solution is to harmonize those versions across the fleet, see the upgrades section for details.

Once the archiver is fixed, you can force a write with:

sudo -u postgres psql -c CHECKPOINT

Watch the log file for failures, the alert should be fixed within a couple of minutes.

Archiver lag

A PgArchiverAge alert looks something like:

PostgreSQL archiver lagging on test.torproject.org

It means the archive_command (from postgresql.conf) has been struggling to keep up with changes in the database. Check the server log file (currently /var/log/postgresql/postgresql-15-main.log) for errors, otherwise look at the backup server for disk saturation.

Once the archiver is fixed, you can force a write with:

sudo -u postgres psql -c CHECKPOINT

Watch the log file for failures, the alert should be fixed within a couple of minutes.

If this keeps occurring, settings could be changed in PostgreSQL to trigger checkpoints more frequently, for example by lowering the max_wal_size or checkpoint_timeout settings. Normally, a daily job does a CHECKPOINT; you can check whether it's running with:

systemctl status pg-checkpoint.timer pg-checkpoint.service
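
To check the current values of those settings:

sudo -u postgres psql -c 'SHOW max_wal_size;' -c 'SHOW checkpoint_timeout;'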

Resetting archiver statistics

This is not usually a solution that one should use for archive errors.

But if you have intentionally disabled PostgreSQL archiving and still end up with the PgArchiverAge alert even though no archiving is being done, then to clear the alert you'll want to reset the archiver statistics.

To do this, connect to the database with the administrator account and then run one query, as follows:

# sudo -u postgres psql
[...]
postgres=# select pg_stat_reset_shared('archiver');
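
You can inspect the archiver statistics (and confirm the reset took effect) through the pg_stat_archiver view:

sudo -u postgres psql -c 'SELECT * FROM pg_stat_archiver;'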

Connection saturation

A PgConnectionsSaturation alert looks like:

PostgreSQL connection count near saturation on test.torproject.org

It means the number of connected clients is close to the maximum number of allowed clients, which leaves the server unlikely to cope with any further increase in demand.

A few ideas:

  • look into the Diagnosing performance issue section
  • look at the long term trend, by plotting the pg_stat_activity_count metric over time
  • consider bumping the max_connections setting (in postgresql.conf) if this is a long term trend
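
To get a quick snapshot of what is currently holding connections, a query along these lines (grouping clients by user and state) can help:

sudo -u postgres psql -c "SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state ORDER BY count(*) DESC;"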

Stale backups

The PgBackRestStaleBackups alert looks like:

PostgreSQL backups are stale on weather-01.torproject.org

This implies that scheduled (normally, daily) backups are not running on that host.

The metric behind that alert (pgbackrest_backup_since_last_completion_seconds) is generated by the pgbackrest_exporter (see backups monitoring), based on the output of the pgbackrest command.

You can inspect the general health of this stanza with this command on the repository server (currently backup-storage-01):

sudo -u pgbackrest-weather-01 pgbackrest check --stanza=weather-01.torproject.org

This command takes a dozen seconds to complete; that is normal. It should return without any output. Otherwise, it will tell you what is preventing the repository server from reaching the client.

If that works, next up is to check the last backups with the info command:

sudo -u pgbackrest-weather-01 pgbackrest info --stanza=weather-01.torproject.org 

This should show something like:

root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest  --stanza=weather-01.torproject.org info | head -12
stanza: weather-01.torproject.org
    status: ok
    cipher: none

    db (current)
        wal archive min/max (15): 000000010000001F00000004/000000010000002100000047

        full backup: 20241118-202245F
            timestamp start/stop: 2024-11-18 20:22:45 / 2024-11-18 20:28:43
            wal start/stop: 000000010000001F00000009 / 000000010000001F00000009
            database size: 40.3MB, database backup size: 40.3MB
            repo1: backup set size: 7.6MB, backup size: 7.6MB

The oldest backups are shown first, and here we're showing only the first one (head -12); let's look at the last one:

root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest  --stanza=weather-01.torproject.org info | tail -6
        diff backup: 20241209-183838F_20241211-001900D
            timestamp start/stop: 2024-12-11 00:19:00 / 2024-12-11 00:19:20
            wal start/stop: 000000010000002100000032 / 000000010000002100000033
            database size: 40.7MB, database backup size: 10.3MB
            repo1: backup set size: 7.7MB, backup size: 3.5MB
            backup reference list: 20241209-183838F

If the backups are not running, check the systemd timer to see if it's properly enabled and running:

systemctl status pgbackrest-backup-incr@weather-01.timer

You can see the state of all pgBackRest timers with:

systemctl list-timers | grep -e NEXT -e pgbackrest

In this case, the backup is fresh enough. But if the last backup is too old, you can try running a backup manually through the systemd unit, to see if you can reproduce the issue. For example, an incr backup:

systemctl start pgbackrest-backup-incr@weather-01

See the Running a backup manually instructions for details.

Note that the pgbackrest_exporter only pulls metrics from pgBackRest once per --collect.interval which defaults to 600 seconds (10 minutes), so it might take unexpectedly long for an alert to resolve.

It used to be that we would rely solely on OnCalendar and RandomizedDelaySec (for example, OnCalendar=weekly and RandomizedDelaySec=7d for diff backups) to spread that load, but that introduced issues when provisioning new servers or rebooting the repository server, see tpo/tpa/team#42043. We consider this to be a bug in systemd itself, and worked around it by setting the randomization in Puppet (see puppet-control@227ddb642).

Backup checksum errors

The PgBackRestBackupErrors alert looks like:

pgBackRest stanza weather-01.torproject.org page checksum errors

It means that the backup (in the above example, for weather-01 stanza) contains one or more page checksum errors.

To display the list of errors, you need to manually run a command like:

sudo -u pgbackrest-HOSTNAME pgbackrest info --stanza FQDN --set backup_name --repo repo_key

For example:

sudo -u pgbackrest-weather-01 pgbackrest info --stanza weather-01.torproject.org --set 20241209-183838F_20241211-001900D

This will, presumably, give you more information about the checksum errors. It's unclear how those can be resolved; we have never encountered such errors so far.

Backups misconfigurations

The backups monitoring system can raise alerts for a number of conditions. At the time of writing, those are:

Alert name                 | Metric                      | Explanation
PgBackRestExporterFailure  | pgbackrest_exporter_status  | exporter can't talk to pgBackRest
PgBackRestRepositoryError  | pgbackrest_repo_status      | misconfigured repository
PgBackRestStanzaError      | pgbackrest_stanza_status    | misconfigured stanza

We have never encountered those errors so far, so it is currently unclear how to handle them. The exporter README file also has explanations of what the metrics mean.

It is likely that the exporter will log more detailed error messages in its logs, which should be visible with:

journalctl -u prometheus-pgbackrest-exporter.service -e

In all cases, another idea is to check backup health. This will confirm (or not) that stanzas are properly configured, and outline misconfigured stanzas or errors in the global repository configuration.

The status code 99 means "other". This generally means some external issue is keeping things from running correctly, for example permission errors that prevent the exporter from reading the backup directories.

Disk is full or nearly full

It's possible that pgBackRest backups are taking up all disk space on the backup server. This will generate an alert like this on IRC:

17:40:07 -ALERTOR1:#tor-alerts- DiskWillFillSoon [firing] Disk /srv/backups/pg on backup-storage-01.torproject.org is almost full

The first step is to inspect the directory with:

ncdu -x /srv/backups/pg

The goal of this is to figure out if there's a specific host that's using more disk space than usual, or if there's a specific kind of backups that's using more disk space. The files in backup/, for example, are full/diff/incr backups, while the files in archive/ are the WAL logs.

You can see the relative size of the different backup types with:

for f in  F D I ; do printf "$f: " ;  du -ssch *$f | grep total ; done

For example:

root@backup-storage-01:/srv/backups/pg/backup/rude.torproject.org# for f in  F D I ; do printf "$f: " ;  du -ssch *$f | grep total ; done
F: 9.6G total
D: 13G  total
I: 65G  total

In the above incident #41982, disk space was used overwhelmingly by incr backups, which were disabled to work around the problem. This, however, means WAL files will take up more space, so a balance must be found.

If a specific host is using more disk space, it's possible there's an explosion in disk use on the originating server, which can be investigated with the team responsible for the service.

It might be possible to recover disk space by deleting or expiring backups as well.

In any case, depending on how long it will take for the disk to fill up, the best strategy might be to resize the logical volume.
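
As a rough sketch (assuming the volume group still has free extents; the device name here is illustrative), growing the logical volume and its filesystem in one step would look like:

lvextend --resizefs --size +50G /dev/vg01/backups-pg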

Disaster recovery

If a PostgreSQL server is destroyed completely or in part, we need to restore from backups, using the backup recovery procedure.

This requires Puppet to be up and running. If the Puppet infrastructure is damaged, a manual recovery procedure is required, see Bare bones restore.

Reference

Installation

The profile::postgresql Puppet class should be used to deploy and manage PostgreSQL databases on nodes. It takes care of installation, configuration and setting up the required role and permissions for backups.

Once the class is deployed, run the Puppet agent on both the server and the storage server, then make a full backup. See also the backups section for a discussion of backups configuration.

You will probably want to bind-mount /var/lib/postgresql to /srv/postgresql, unless you are certain you have enough room in /var for the database:

systemctl stop postgresql &&
echo /srv/postgresql /var/lib/postgresql none bind 0 0 >> /etc/fstab &&
mv /var/lib/postgresql /srv/ &&
mkdir /var/lib/postgresql &&
mount /var/lib/postgresql &&
systemctl start postgresql
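
You can confirm the bind mount is in place with:

findmnt /var/lib/postgresql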

This assumes /srv is already formatted and properly mounted, of course, but that should have been taken care of as part of the new machine procedure.

Manual installation

To test PostgreSQL on a server not managed by Puppet, you can probably get away with installing PostgreSQL by hand from the Debian packages with:

apt install postgresql

Do NOT do this on a production server managed by TPA, as you'll be missing critical pieces of infrastructure, namely backups and monitoring.

Prometheus PostgreSQL exporter deployment

Prometheus metrics collection is configured automatically when the Puppet class profile::postgresql is deployed on the node.

Manual deployment

NOTE: This is now done automatically by the Puppet profile. Those instructions are kept for historical reference only.

First, include the following line in pg_hba.conf:

local   all             prometheus                              peer

Then run the following SQL queries as the postgres user (for example after sudo -u postgres psql). First, create the monitoring user to match the above:

-- To use IF statements, hence to be able to check if the user exists before
-- attempting creation, we need to switch to procedural SQL (PL/pgSQL)
-- instead of standard SQL.
-- More: https://www.postgresql.org/docs/9.3/plpgsql-overview.html
-- To preserve compatibility with <9.0, DO blocks are not used; instead,
-- a function is created and dropped.
CREATE OR REPLACE FUNCTION __tmp_create_user() returns void as $$
BEGIN
  IF NOT EXISTS (
          SELECT                       -- SELECT list can stay empty for this
          FROM   pg_catalog.pg_user
          WHERE  usename = 'prometheus') THEN
    CREATE USER prometheus;
  END IF;
END;
$$ language plpgsql;

SELECT __tmp_create_user();
DROP FUNCTION __tmp_create_user();

This sets the user's default search path and allows it to connect to the postgres database:

ALTER USER prometheus SET SEARCH_PATH TO postgres_exporter,pg_catalog;
GRANT CONNECT ON DATABASE postgres TO prometheus;

... and grant the required accesses to do the probes:

GRANT pg_monitor to prometheus;

Note the procedure was modified from the upstream procedure to use the prometheus user (instead of postgres_exporter), and to remove the hardcoded password (since we rely on the "peer" authentication method).

A previous version of this documentation mistakenly recommended creating views and other complex objects that were only required in PostgreSQL < 10, and were never actually necessary. Those can be cleaned up with the following:

DROP SCHEMA postgres_exporter CASCADE;
DROP FUNCTION get_pg_stat_replication;
DROP FUNCTION get_pg_stat_statements;
DROP FUNCTION get_pg_stat_activity;

... and it wouldn't hurt then to rerun the above install procedure to grant the correct rights to the prometheus user.

Then restart the exporter to be sure everything still works:

systemctl restart prometheus-postgres-exporter.service

Upgrades

PostgreSQL upgrades are a delicate operation that typically require downtime if there's no (logical) replication.

This section generally documents the normal (pgBackRest) procedure. The legacy backup system has been retired and so has its documentation.

Preparation

Before starting the fleet upgrade, read the release notes for the relevant release (e.g. 17.0) to see if there are any specific changes needed at the application level, for service owners. In general, the procedure below uses pg_upgrade, so that part is already covered.

Also note that the PostgreSQL server might need a fleet-wide pgBackRest upgrade, as an old pgBackRest might not be compatible with the newer PostgreSQL server or, worse, a new pgbackrest might not be compatible with the one from the previous stable. During the Debian 12 to 13 (bookworm to trixie) upgrade, both of those were a problem and the pgbackrest package was updated across the fleet, using the apt.postgresql.org repository.

The upstream backports repository can be enabled in the profile::postgresql::backports class. It's actually included by default in the profile::postgresql but enabled only on older releases. This can be tweaked from Hiera.

Procedure

This is the procedure for pgBackRest-backed servers.

  1. Make a full backup of the old cluster or make sure a recent one is present:

    fab -H testdb-01.torproject.org postgresql.backup --no-wait
    
  2. Make sure the pgBackRest versions on the client and server are compatible. (See note about fleet-wide upgrades above.)

  3. Simulate the cluster upgrade:

    fab -H testdb-01.torproject.org --dry postgresql.upgrade
    

    Look at the version numbers and make sure you're upgrading and dropping the right clusters.

    This assumes the newer PostgreSQL packages are already available and installed, but that the upgrade wasn't performed. The normal "major upgrade" procedures bring you to that state; otherwise, the https://apt.postgresql.org sources need to be installed on the server.

  4. Run the cluster upgrade:

    fab -H testdb-01.torproject.org postgresql.upgrade
    

    At this point, the old cluster is still present, but runs on a different port, and the upgraded cluster is ready for service.

  5. Verify service health

    Test the service which depends on the database, see if you can read and write to the database.

  6. Verify the backup health

    Check that WAL files are still sent to the backup server. If the archiver is not working properly, Prometheus will send a PgArchiverFailed alert after an hour, for example. Such errors should be visible in tail -f /var/log/postgresql/p*.log but will silently resolve themselves once the archiver recovers. You can check the metrics in Prometheus to see if they're being probed correctly with:

    fab prometheus.query-to-series --expression 'pgbackrest_backup_info{alias="testdb-01.torproject.org"}'
    

Note that the upgrade procedure takes care of destroying the old cluster, after 7 days by default, with the at(1) command. Make sure you check everything is alright before that delay!

SLA

No service level is defined for this service.

Design and architecture

We use PostgreSQL for a handful of services. Each service has its own PostgreSQL server installed, with no high availability or replication, currently, although we use the "write-ahead log" to keep a binary dump of databases on the backup server.

It should be noted for people unfamiliar with PostgreSQL that it (or at least the Debian package) can manage multiple "clusters" of distinct databases with overlapping namespaces, running on different ports. To quote the upstream documentation:

PostgreSQL is a relational database management system (RDBMS). That means it is a system for managing data stored in relations. Relation is essentially a mathematical term for table. [...]

Each table is a named collection of rows. Each row of a given table has the same set of named columns, and each column is of a specific data type. [...]

Tables are grouped into databases, and a collection of databases managed by a single PostgreSQL server instance constitutes a database cluster.

See also the PostgreSQL architecture fundamentals.

TODO Services

TODO Storage

TODO Queues

TODO Interfaces

TODO Authentication

TODO Implementation

Issues

There is no issue tracker specifically for this project; file or search for issues in the team issue tracker with the ~PostgreSQL label.

Maintainer

PostgreSQL services are part of the core services maintained by TPA. The postgres Puppet module and associated backup synchronisation code was written by Peter Palfrader.

TODO: update wrt pgbackrest and new profile, mention lavamind

TODO Users

TODO Upstream

The PostgreSQL project itself is a major free software database project, which calls itself "The World's Most Advanced Open Source Relational Database", with regular releases and a healthy community.

Monitoring and metrics

Prometheus monitors the PostgreSQL servers through the PostgreSQL exporter deployed by Puppet through the profile::prometheus::postgres_exporter class.

The Grafana server has a handful of dashboards in various working states.

Note that there is a program called pgstatsmon which can provide very detailed information about the state of a PostgreSQL database, see this blog post for details.

Backups monitoring

PostgreSQL backups are monitored through the pgbackrest_exporter, which pulls metrics from the pgbackrest binary on the storage server periodically, and exposes them through a web interface.

The collected metrics can be seen on this Grafana dashboard (grafana.com source).

Alertmanager has a set of alerts that look for out of date backups, see the pager playbook for a reference.

TODO Tests

Logs

PostgreSQL keeps log files in /var/log/postgresql/, one per "cluster". Since it logs failed queries, logs may contain PII in the form of SQL queries. The log rotation policy is the one set by the Debian package and keeps logs for 10 weeks.

The backup system keeps logs of its periodic full/diff backups in systemd's journal files. To consult the logs for the full backups on rude, for example, see:

journalctl -b -u pgbackrest-backup-full@rude.service

Backups

The new backup system is based on pgBackRest. It works by SSH'ing between the client and the server and running pgbackrest commands, which encapsulate all functionality, including backup and restore.

Backups are retained for 30 days, although the source of truth for this is not here but in Hiera, in tor-puppet.git's hiera/common/postgresql.yaml, in the pgbackrest::config:global:repo1-retention-full value. Expiration is performed when backups are run, from the systemd timers. See also the upstream documentation on retention.

pgBackRest distinguishes three backup types; here are our schedules for them:

type | frequency | note
full | 30 days   | all database cluster files will be copied and there will be no dependencies on previous backups
diff | 7 days    | like an incremental backup, but always based on the last full backup
incr | 24h       | incremental from the last successful backup

Backups are scheduled using systemd timers exported from each node, based on a template per backup type, so there's a matrix of pgbackrest-backup-{full,diff,incr}@.{service,timer} files on the repository server, e.g.

root@backup-storage-01:~# ls /etc/systemd/system | grep @\\.
pgbackrest-backup-diff@.service
pgbackrest-backup-diff@.timer
pgbackrest-backup-full@.service
pgbackrest-backup-full@.timer
pgbackrest-backup-incr@.service
pgbackrest-backup-incr@.timer

Each backed-up server has its own instances of those units, created as symlinks to the templates; for example, for weather-01:

root@backup-storage-01:~# ls -l /etc/systemd/system | grep weather-01
lrwxrwxrwx 1 root root   31 Dec  5 02:02 pgbackrest-backup-diff@weather-01.service -> pgbackrest-backup-diff@.service
lrwxrwxrwx 1 root root   49 Dec  4 21:51 pgbackrest-backup-diff@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-diff@.timer
lrwxrwxrwx 1 root root   31 Dec  5 02:02 pgbackrest-backup-full@weather-01.service -> pgbackrest-backup-full@.service
lrwxrwxrwx 1 root root   49 Dec  4 21:51 pgbackrest-backup-full@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-full@.timer
lrwxrwxrwx 1 root root   31 Dec 16 18:32 pgbackrest-backup-incr@weather-01.service -> pgbackrest-backup-incr@.service
lrwxrwxrwx 1 root root   49 Dec 16 18:32 pgbackrest-backup-incr@weather-01.timer -> /etc/systemd/system/pgbackrest-backup-incr@.timer

Retention is configured at the "full" level, with the repo1-retention-full setting.

Puppet setup

PostgreSQL servers are automatically configured to use pgBackRest to backup to a central server (called repository), as soon as the profile::postgresql is included, if profile::postgresql::pgbackrest is true.

Note that the instructions here also apply if you're converting a legacy host to pgBackRest.

This takes a few Puppet runs to converge: at first, the catalog on the repository side will fail because of missing SSH keys on the client.

By default, the backup-storage-01.torproject.org server is used as a repository, but this can be overridden in Hiera with the profile::postgresql::pgbackrest_repository parameter. This is normally automatically configured by hoster, however, so you shouldn't need to change anything.

Manual configuration

Those instructions are for disaster recovery scenarios, when a manual configuration of pgBackRest is required. This typically happens when Puppet is down; for example, if the PuppetDB server was destroyed and needs to be recovered, it wouldn't be possible to deploy the backup system with Puppet.

Otherwise those instructions should generally not be used, as they are normally covered by the profile::postgresql class.

Here, we followed the dedicated repository host installation instructions. Below, we treat the "client" (weather-01) as the server that's actually running PostgreSQL in production and the "server" (backup-storage-01) as the backup server that's receiving the backups.

  1. Install package on both the client and the server:

    apt install pgbackrest
    

    Note: this creates a postgresql user instead of pgbackrest.

  2. Create an SSH key on the client:

    sudo -u postgres ssh-keygen
    

    Create a user and SSH key on the server:

    adduser --system pgbackrest-weather-01
    sudo -u pgbackrest-weather-01 ssh-keygen
    
  3. Those keys were exchanged to the other host by adding them to /etc/ssh/userkeys/$USERNAME (named after the connecting user) with the prefix:

    restrict,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }"
    

    For example, on the server:

    echo 'restrict,command="/usr/bin/pgbackrest ${SSH_ORIGINAL_COMMAND#* }"  ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJrOnnOpX0cyzQ/lqvNLQt2mcJUziiJ0MdubSf/c1+2g postgres@test-01' \
      > /etc/ssh/userkeys/pgbackrest-weather-01
    

    On the client, the key should be in /etc/ssh/userkeys/postgres.

  4. Test the cross-connect with:

    root@weather-01:~# sudo -u postgres ssh pgbackrest-weather-01@backup-storage-01.torproject.org
    

    This should display the pgbackrest usage. Also test from the server to the client:

    root@backup-storage-01:~# sudo -u pgbackrest-weather-01 ssh postgres@weather-01.torproject.org
    
  5. Configure the client on the server, in /etc/pgbackrest/conf.d/weather-01.torproject.org.conf:

[weather-01.torproject.org]
lock-path = /var/lock/pgbackrest/weather-01.torproject.org
pg1-host = weather-01.torproject.org
pg1-path = /var/lib/postgresql/15/main
log-path = /var/log/pgbackrest/weather-01.torproject.org
repo1-path = /var/lib/pgbackrest
  6. Configure the server on the client, in /etc/pgbackrest/conf.d/server.conf:
[global]
log-level-file = detail
repo1-path = /var/lib/pgbackrest
repo1-host = backup-storage-01.torproject.org
repo1-host-user = pgbackrest-weather-01

[weather-01.torproject.org]
pg1-path = /var/lib/postgresql/15/main
  7. Create the "stanza" on the server:

    sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01 stanza-create
    
  8. Modify the PostgreSQL configuration on the client to archive to pgBackRest, in /etc/postgresql/15/main/postgresql.conf:

archive_command = 'pgbackrest --stanza=main archive-push %p'
wal_level = replica
  9. Test the configuration, on the client:

    root@weather-01:~# sudo -u postgres pgbackrest --stanza=weather-01 check
    

    Note that this will wait for an archive to be successfully sent to the server. It will wait a full minute before failing with a helpful error message, like:

    ERROR: [082]: WAL segment 000000010000001F00000004 was not archived before the 60000ms timeout
    HINT: check the archive_command to ensure that all options are correct (especially --stanza).
    HINT: check the PostgreSQL server log for errors.
    HINT: run the 'start' command if the stanza was previously stopped.
    

    In my case, the --stanza in the postgresql.conf file was incorrect.

  10. Test the configuration, on the server:

    root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01 check

  11. Perform a first backup, from the server:

    root@backup-storage-01:~# sudo -u pgbackrest-weather-01 pgbackrest --stanza=weather-01 backup

    The warning (WARN: no prior backup exists, incr backup has been changed to full) is expected.

    The first full backup completed in 6 minutes on weather-01.

Other documentation

See also:

pgBackRest

Discussion

Overview

Technical debt that needs to eventually be addressed:

  • the pgbackrest_exporter currently runs as root since it needs to be able to read from backup directories under all of the backup users. We want to implement a better method for the exporter to get access to the files without running as root.

  • pgBackRest runs over SSH, while it seems TLS offers better performance and isolation, see this comment and others

  • the pgbackrest Puppet module has effectively been forked to support automated multiple servers backup, and should be merged back upstream

  • PITR restores (e.g. "go back in time") are not well documented, but should be relatively easy to perform in pgBackRest

Goals

Must have

Nice to have

Non-Goals

Approvals required

Proposed Solution

Cost

Alternatives considered

Backup systems

We used to have a legacy system inherited from DSA without any other upstream, with code living here and there in various git repositories.

In late 2024 and early 2025, it was replaced with pgBackRest as part of TPA-RFC-65. It's not perfect: upstream documentation is, as is often the case, not quite complete, but it's pretty good. Performance is excellent, it's much simpler and more contained, it's well packaged in Debian, and well supported upstream. It seems to be pretty much the standard PG backup tool at this point.

This section documents various alternative backup systems, including the legacy backup system.

Barman

Barman presumably makes "taking an online hot backup of PostgreSQL" "as easy as ordering a good espresso coffee". It seems well maintained (last release 3.2.0 on 20 October 2022, 7 days ago), and has a healthy community (45 contributors, 7 with more than 1000 SLOC, 5 pending PRs, 83 open issues).

It is still seeing active development and new features, with a few sponsors and professional support from the company owning the copyright (EnterpriseDB).

It's in Debian, and well maintained there (only a day between the 3.2.0 release and the upload to unstable). It's licensed under the GPLv3.

The documentation is a little confusing; it's a single HTML page or a PDF on the release page. The main command and configuration files each have a manual page, and so do some sub-commands, but not all.

Quote from the about page:

Features & Goals

  • Full hot physical backup of a PostgreSQL server
  • Point-In-Time-Recovery (PITR)
  • Management of multiple PostgreSQL servers
  • Remote backup via rsync/SSH or pg_basebackup (including a 9.2+ standby)
  • Support for both local and remote (via SSH) recovery
  • Support for both WAL archiving and streaming
  • Support for synchronous WAL streaming (“zero data loss”, RPO=0)
  • Incremental backup and recovery
  • Parallel backup and recovery
  • Hub of WAL files for enhanced integration with standby servers
  • Management of retention policies for backups and WAL files
  • Server status and information
  • Compression of WAL files (bzip2, gzip or custom)
  • Management of base backups and WAL files through a catalogue
  • A simple INI configuration file
  • Totally written in Python
  • Relocation of PGDATA and tablespaces at recovery time
  • General and disk usage information of backups
  • Server diagnostics for backup
  • Integration with standard archiving tools (e.g. tar)
  • Pre/Post backup hook scripts
  • Local storage of metadata

Missing features:

  • streaming replication support
  • S3 support

The design is actually eerily similar to the existing setup: it uses pg_basebackup to make a full backup, then the archive_command to stream WAL logs, at least in one configuration. It actually supports another configuration which provides zero data loss in case of an outage, as setups depending on archive_command actually can result in data loss, because PostgreSQL commits the WAL file only in 16MB chunks. See the discussion in the Barman WAL archive for more information on those two modes.

In any case, the architecture is compatible with our current setup and it looked like a good candidate. The WAL file compression is particularly interesting, but all the other extra features and the community, regular releases, and Debian packaging make it a prime candidate for replacing our bespoke scripts.

In September 2024, Barman was tested in tpo/tpa/team#40950, but it did not go well and Barman was ultimately abandoned. Debugging was difficult, documentation was confusing, and it just didn't actually work. See this comment for details.

pg_rman

pg_rman is a "Backup and restore management tool for PostgreSQL". It seems relatively well maintained, with a release in late 2021 (1.3.14, less than a year ago), and the last commit in September (about a month ago). It has a smaller community than Barman, with 13 contributors and only 3 with more than a thousand SLOC, 10 pending PRs, and 12 open issues.

It's unclear where one would get support for this tool. There doesn't seem to be commercial support or sponsors.

It doesn't appear to be in Debian. It is licensed under an unusual BSD-like license requiring attribution to the NIPPON TELEGRAPH AND TELEPHONE CORPORATION.

Documentation is a single manpage.

It's not exactly clear how this software operates. It seems like it's a tool to make PITR backups but only locally.

Probably not a good enough candidate.

repmgr

repmgr is a tool for "managing replication and failover in a cluster of PostgreSQL servers. It enhances PostgreSQL's built-in hot-standby capabilities with tools to set up standby servers, monitor replication, and perform administrative tasks such as failover or manual switchover operations".

It does not seem, in itself, to be a backup manager, but could be abused to be one. It could be interesting to operate hot-standby backup servers, if we'd wish to go in that direction.

It is developed by the same company as Barman, EnterpriseDB. It is packaged in Debian.

No other investigation was performed on the program, because its design was seen as compatible with our current design, but also because EnterpriseDB also maintains Barman. And, surely, they wouldn't have two backup systems, would they?

omniptr

omniptr is another such tool I found. Its README is really lacking in details, but it looks like it does something similar to what we do: it hooks into the archive_command to send logs... somewhere.

I couldn't actually figure out its architecture or configuration from a quick read of the documentation, which is not a good sign. There's a bunch of .pod files in a doc directory, but it's kind of a mess in there.

It does not seem to be packaged in Debian, and doesn't seem very active. The last release (2.0.0) is almost 5 years old (November 2017). It doesn't have a large developer community (only 8 developers, none of them with more than a thousand lines of code, although omniptr is small).

It's written in Perl, with a license similar to the PostgreSQL license.

I do not believe it is a suitable replacement for our backup system.

pgBackRest TLS server

pgBackRest has a server command that runs a TLS-enabled service on both the PostgreSQL server and the repository. The PostgreSQL server then uses TLS instead of SSH pipes to push WAL files to the repository, and the repository pulls backups over TLS from the servers.

We haven't picked that option because it requires running pgbackrest server everywhere. We prefer to rely on SSH instead.

Using SSH also allows us to use multiple, distinct users for each backup server which reduces lateral movement between backed up hosts.

Legacy DSA backup system

We were previously using a bespoke backup system shared with DSA. It was built with a couple of shell and Perl scripts deployed with Puppet.

It used upstream's Continuous Archiving and Point-in-Time Recovery (PITR) which relies on PostgreSQL's "write-ahead log" (WAL) to write regular "transaction logs" of the cluster to the backup host. (Think of transaction logs as incremental backups.) This was configured in postgresql.conf, using a configuration like this:

track_counts = yes
archive_mode = on
wal_level = archive
max_wal_senders = 3
archive_timeout = 6h
archive_command = '/usr/local/bin/pg-backup-file main WAL %p'

The latter was a site-specific script which reads a config file in /etc/dsa/pg-backup-file.conf where the backup host is specified (e.g. torbackup@bungei.torproject.org). That command passes the WAL logs onto the backup server, over SSH. A WAL file is shipped immediately when it is full (16MB of data by default) but no later than 6 hours (varies, see archive_timeout on each host) after it was first written to. On the backup server, the command is set to debbackup-ssh-wrap in the authorized_keys file and takes the store-file pg argument to write the file to the right location.

WAL files are written to /srv/backups/pg/$HOSTNAME, where $HOSTNAME is the short hostname (without .torproject.org). WAL files are prefixed with main.WAL. (where main is the cluster name), followed by a long unique string, e.g. main.WAL.00000001000000A40000007F.

For that system to work, we also needed full backups to happen on a regular basis. That was done straight from the backup server (again bungei), which connects to the various PostgreSQL servers and runs pg_basebackup to get a complete snapshot of the cluster. This happens weekly (every 7 to 10 days) via postgres-make-base-backups, a wrapper (based on a Puppet concat::fragment template) that calls postgres-make-one-base-backup for each PostgreSQL server.

The base files are written to the same directory as WAL file and are named using the template:

$CLUSTER.BASE.$SERVER_FQDN-$DATE-$ID-$CLIENT_FQDN-$CLUSTER-$VERSION-backup.tar.gz

... for example:

main.BASE.bungei.torproject.org-20190804-214510-troodi.torproject.org-main-13-backup.tar.gz

All of this works because SSH public keys and PostgreSQL credentials are passed around between servers. That is handled in the Puppet postgresql module for the most part, but some bits might still be configured manually on some servers.

Backups were checked for freshness in Nagios using the dsa-check-backuppg plugin with its configuration stored in /etc/dsa/postgresql-backup/dsa-check-backuppg.conf.d/, per cluster. The Nagios plugin also took care of expiring backups when they are healthy.

The actual retention period was defined in the /etc/nagios/dsa-check-backuppg.conf configuration file on the storage server:

retention: 1814400

That number, in seconds, was 21 days.

Running backups was a weird affair; this was the command to run a backup for meronense:

sudo -u torbackup postgres-make-one-base-backup $(grep ^meronense.torproject.org $(which postgres-make-base-backups ))

Indeed, the postgres-make-base-backups file was generated by Puppet based on Concat exported resources (!) and had its configuration inline (as opposed to a separate configuration file).

This system was finally and completely retired in June 2025. Most of the code was ripped out of Puppet then, in ad6e74e31 (rip out legacy backup code (tpo/tpa/team#40950), 2025-06-04). Large chunks of documentation about the legacy system were also removed from this page in 67d6000d (postgresql: purge legacy documentation (tpo/tpa/team#40950), 2025-06-17).

Replication

We don't do high availability right now, but if we would, we might want to consider pg_easy_replicate.