Grafana is a graphing engine and dashboard management tool that processes data from multiple data sources. We use it to trend various metrics collected from servers by Prometheus.

Grafana is installed alongside Prometheus, on the same server. These are the known instances:

https://grafana.torproject.org/ - internal server
https://grafana2.torproject.org/ - external server

See also the Prometheus monitored services to understand the difference between the internal and external servers.

Tutorial

Important dashboards

Typically, working Grafana dashboards are "starred". Since we have many such dashboards now, here's a curated list of the most important dashboards you might need to look at:

Overview - first panel to show up on login, can filter basic stats (bandwidth, memory, load, etc) per server role (currently "class" field)
Per-node server stats - basic server stats (CPU, disk, memory usage), with drill down options
Node comparison dashboard - similar to the above, but can display multiple servers in columns, useful for cluster overview and drawing correlations between servers
Postfix - to monitor mailings, see monitoring mailings, in the CRM documentation

Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have their own dashboards, and many dashboards are still work in progress.

The above list doesn't cover the "external" Grafana server (grafana2) which has its own distinct set of dashboards.

Basic authentication

Access to Grafana is granted through the "web password" of your LDAP account.

If you have an LDAP account and need access to the web interface for this service (or if you need to reset your password to something you know):

log in to https://db.torproject.org/
set your new password in the row titled "Change web password:". Enter it once in each of the two fields of that row, then save the changes with the "Update..." button at the bottom of the form
- if you're only updating the web password, you don't need to change or enter values in the other fields
- note that this "web password" does not need to be the same as your LDAP or email passwords. It is usually considered better to use different passwords to limit the impact of a leak (this is where your password manager comes in handy!)
wait for your password to propagate. This normally takes about 15 minutes. If your password has still not been set after 1 or 2 hours, contact TPA to look into what's happening. After the delay, you should be able to log in with your new "web password"
if this is your first time logging in to Grafana, you may need additional access in order to view or edit some graphs. Check in with TPA to obtain the required access for your user
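
Once the password has propagated, the login can also be checked outside of a browser. The following is a minimal sketch, assuming the front page of the internal server answers to plain HTTP basic authentication; the function names and the credentials are placeholders, and the request is only built here, not sent:

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic `Authorization` header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_login_check(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Grafana front page."""
    request = urllib.request.Request(base_url.rstrip("/") + "/")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


if __name__ == "__main__":
    # "alice" and "example-web-password" are placeholders, not real credentials.
    req = build_login_check("https://grafana.torproject.org/", "alice", "example-web-password")
    # Sending the request needs network access, so it is left commented out.
    # A 401 response would suggest the password has not propagated yet.
    # with urllib.request.urlopen(req, timeout=10) as response:
    #     print(response.status)
    print(req.full_url)
```

Avoid putting a real password on the command line or in shell history when adapting this.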

Granting access to folders

Individual access to folders is determined at the "Team" level. First, a user needs to be added to a Team, then the folder needs to be modified to grant access to the team.

To grant access to a folder:

head to the folder in the dashboards list
select the "Folder actions" button on the top-right
select "Manage permissions"
wait a while for Grafana to finish loading
select "Add a permission"
choose the "team" item in the left drop-down, pick the appropriate permission (View, Edit, or Admin; typically Edit or Admin, as View is available by default), then hit Save

You typically need "admin" access to the entire Grafana instance to manage those things, which requires the "fallback" admin password, stored in Trocla and TPA's password manager. See the authentication section for details.
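
The same change can in principle be made through Grafana's folder permissions HTTP API (POST /api/folders/<uid>/permissions), which takes numeric permission levels (1 for View, 2 for Edit, 4 for Admin). That call replaces the folder's whole permission list, so existing entries need to be sent back along with the new one. The sketch below only builds the request body; the helper name and the team ID are made up for illustration, and nothing here is part of the documented TPA procedure:

```python
import json

# Numeric levels used by Grafana's folder permissions API.
PERMISSION_LEVELS = {"View": 1, "Edit": 2, "Admin": 4}


def folder_permissions_payload(existing_items, team_id, level):
    """Return a request body granting `level` to `team_id`.

    `existing_items` is the current permission list of the folder. It is
    carried over because the API replaces all permissions at once.
    """
    if level not in PERMISSION_LEVELS:
        raise ValueError(f"unknown permission level: {level!r}")
    # Drop any previous entry for this team, keep everything else.
    kept = [item for item in existing_items if item.get("teamId") != team_id]
    kept.append({"teamId": team_id, "permission": PERMISSION_LEVELS[level]})
    return {"items": kept}


if __name__ == "__main__":
    # Hypothetical current state: viewers can view, team 7 can view.
    current = [{"role": "Viewer", "permission": 1}, {"teamId": 7, "permission": 1}]
    print(json.dumps(folder_permissions_payload(current, 7, "Edit"), indent=2))
```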

How-to

Updating a dashboard

As mentioned in the installation section below, the Grafana dashboards are maintained by Puppet. So while new dashboards can be created and edited in the Grafana web interface, changes to provisioned dashboards will be lost when Puppet ships a new version of the dashboard.

You therefore need to make sure you update the Dashboard in git before leaving. New dashboards not in git should be safe, but please do also commit them to git so we have a proper versioned history of their deployment. It's also the right way to make sure they are usable across other instances of Grafana. Finally, they are also easier to share and collaborate on that way.
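
As a rough illustration of what "updating the dashboard in git" involves, the sketch below takes the JSON returned by Grafana's GET /api/dashboards/uid/<uid> endpoint (an object with "meta" and "dashboard" keys) and produces stable text suitable for committing. Clearing the numeric "id" and sorting keys are assumptions made here to keep diffs small; the grafana-dashboards repository has the authoritative instructions, and the function name is hypothetical:

```python
import json


def dashboard_for_git(api_response: dict) -> str:
    """Return the dashboard model as stable, diff-friendly JSON text."""
    # Copy so the caller's data is left untouched.
    dashboard = dict(api_response["dashboard"])
    # The numeric id is specific to one Grafana instance; the uid is kept.
    dashboard["id"] = None
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Made-up response standing in for a real API reply.
    response = {
        "meta": {"slug": "example"},
        "dashboard": {"id": 42, "uid": "abc123", "title": "Example", "tags": ["provisioned"]},
    }
    print(dashboard_for_git(response), end="")
```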

Folders and tags

Provisioned dashboards should be tagged with the provisioned label, and filed in the appropriate folder:

meta: self-monitoring, mostly metrics on Prometheus and Grafana themselves
network: network monitoring, bandwidth management
services: service-specific dashboards, for example database, web server, applications like GitLab, etc
system: system-level metrics, like disk, memory, CPU usage

Non-provisioned dashboards should be filed in one of those folders:

broken: dashboards found to be completely broken and useless, might be deleted in the future
deprecated: functionality overlapping with another dashboard, to be deleted in the future
inprogress: currently being built, could be partly operational, must absolutely NOT be deleted

The General folder is special and holds the "home" dashboard, which is, on grafana1, the "TPO overview" dashboard. It should not be used by other dashboards.

See the grafana-dashboards repository for instructions on how to export dashboards into git.
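
The filing rules above can be expressed as a small check, for example to review a list of dashboards before a cleanup. This is only a sketch of the convention as described on this page; the function name and the sample titles are invented, and the home dashboard itself is not meant to be passed to it:

```python
# Folders reserved for dashboards carrying the "provisioned" tag.
PROVISIONED_FOLDERS = {"meta", "network", "services", "system"}
# Folder holding only the "home" dashboard.
RESERVED_FOLDER = "General"


def filing_problems(title, folder, tags):
    """Return a list of human-readable problems, empty if filing looks fine."""
    problems = []
    provisioned = "provisioned" in tags
    if folder == RESERVED_FOLDER:
        problems.append(f"{title}: the General folder is reserved for the home dashboard")
    elif provisioned and folder not in PROVISIONED_FOLDERS:
        problems.append(f"{title}: provisioned dashboard filed in {folder!r}")
    elif not provisioned and folder in PROVISIONED_FOLDERS:
        problems.append(f"{title}: {folder!r} is meant for provisioned dashboards")
    return problems


if __name__ == "__main__":
    # Hypothetical dashboards, used only to exercise the check.
    print(filing_problems("Node stats", "system", ["provisioned"]))
    print(filing_problems("Scratch graphs", "system", []))
```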

In general, Grafana is not a high availability service and shouldn't "page" you. It is, however, quite useful in emergencies or diagnostics situations. To diagnose server-level issues, head to the per-node server stats dashboard, which shows basic server stats (CPU, disk, memory usage), with drill down options. If that's not enough, look at the list of important dashboards.

Disaster recovery

In theory, if the Grafana server dies in a fire, it should be possible to rebuild it from scratch in Puppet, see the installation procedure.

In practice, it's possible (currently most likely) that some data like important dashboards, users and groups (teams) might not have been saved into git, in which case restoring /var/lib/grafana/grafana.db from backups might bring them back. Restoring this file should take only a handful of seconds since it's small.
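
Since grafana.db is an SQLite database, a restored copy can be sanity-checked before Grafana is started on it. The sketch below is one possible check, not an established TPA procedure: it opens the file read-only and runs SQLite's built-in integrity check. The function name is hypothetical:

```python
import sqlite3


def sqlite_looks_healthy(path: str) -> bool:
    """Return True if `path` opens as SQLite and passes integrity_check."""
    try:
        # Open read-only so the check can never modify the restored file.
        connection = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    except sqlite3.Error:
        return False
    try:
        rows = connection.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a database at all.
        return False
    finally:
        connection.close()
    return rows == [("ok",)]


if __name__ == "__main__":
    print(sqlite_looks_healthy("/var/lib/grafana/grafana.db"))
```

A passing check only says the file is structurally sound; it does not say whether the dashboards, users and teams in it are recent.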

Reference

Installation

Puppet deployment

Grafana was installed with Puppet using the upstream Debian package, following a debate regarding the merits of Debian packages versus Docker containers when neither is trusted; see this comment for a summary.

Some manual configuration was performed after the install: the admin password was reset on first install and stored in tor-passwords.git, in hosts-extra-info. Everything else is configured in Puppet.

Grafana dashboards, in particular, are managed through the grafana-dashboards repository. The README.md file there contains more instructions on how to add and update dashboards. In general, dashboards must not be modified directly through the web interface, at least not without being exported back into the repository.

SLA

There is no SLA established for this service.

Design

Grafana is a single-binary daemon written in Go with a frontend written in TypeScript. It stores its configuration in an INI file (/etc/grafana/grafana.ini, managed by Puppet). It doesn't keep metrics itself and instead delegates time series storage to "data sources"; we currently use Prometheus for that.
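
Because the configuration is a plain INI file, it can be inspected with standard tooling. The sketch below reads a setting with Python's configparser; the sample text is a made-up excerpt rather than the Puppet-managed file, and the helper name is hypothetical. Interpolation is disabled on the assumption that real values may contain literal percent signs:

```python
import configparser

# Hypothetical excerpt, NOT the Puppet-managed /etc/grafana/grafana.ini.
SAMPLE_INI = """\
; comments in grafana.ini start with a semicolon
[server]
http_port = 3000

[database]
type = sqlite3
"""


def read_setting(ini_text, section, key, default=None):
    """Return one setting from INI text, or `default` if it is absent."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get(section, key, fallback=default)


if __name__ == "__main__":
    print(read_setting(SAMPLE_INI, "database", "type"))
    print(read_setting(SAMPLE_INI, "smtp", "host", default="(not set)"))
```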

It is mostly driven by a web browser interface making heavy use of JavaScript. Dashboards are stored in JSON files deployed by Puppet.

It supports alerting, but we do not use that feature, instead relying on Prometheus for alerts.

Authentication is delegated to the webserver proxy (currently Apache).

Authentication

The web interface is protected by HTTP basic authentication backed by userdir-ldap. Users with access to LDAP can set a webPassword password which gets propagated to the server.

There is a "fallback" user that can be used in case the other system fails. Its username is hardcoded to admin, and its password is stored in Trocla (profile::prometheus::server::password_fallback) and in the password manager (under services/prometheus.torproject.org).

See the basic authentication section for more information for users.

Note that only the admin account has full access to everything. The password is also stored in TPA's password manager under services/prometheus.torproject.org.

Note that we used to have only a static password here; this was changed in June 2024 (tpo/tpa/team#41636).

Access control is given to a "team". Each user is assigned to a team and a team is given access to folders.

We have not used the "Organization" because, according to this blog post, "orgs" fully isolate everything between orgs: data sources, plugins, dashboards, everything is isolated and you can't share stuff between groups. It's effectively a multi-tenancy solution.

We might have given a team access to the entire "org" (say "edit all dashboards" here) but unfortunately that can't be done: we need to grant access on a per-folder basis.

Issues

There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the ~Grafana label.

Issues with Grafana itself may be browsed or filed on GitHub.

Maintainer, users, and upstream

This service was deployed by anarcat and hiro. The internal server is used by TPA and the external server can be used by any other teams, but is particularly used by the anti-censorship and metrics teams.

Upstream is Grafana Labs, a startup with a few products alongside Grafana.

Monitoring and testing

Grafana itself is monitored by Prometheus and produces graphs for its own metrics.

The test procedure is basically to log in to the service and load a few dashboards.
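
A lighter check than loading dashboards is Grafana's /api/health endpoint, which answers with a small JSON document whose "database" field is "ok" when the instance can reach its database. The sketch below only interprets a response body; fetching it from our servers would still have to go through the Apache authentication, and the function name is hypothetical:

```python
import json


def is_healthy(body: str) -> bool:
    """Return True if `body` is a Grafana health reply reporting database ok."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # For example an HTML error page returned by the web server proxy.
        return False
    return isinstance(data, dict) and data.get("database") == "ok"


if __name__ == "__main__":
    # Made-up reply in the shape returned by /api/health.
    sample = '{"commit": "abcdef", "database": "ok", "version": "0.0.0"}'
    print(is_healthy(sample))
```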

Logs and metrics

Grafana doesn't hold metrics itself, and delegates this task to an external data source. We use Prometheus for that purpose, but other backends could be used as well.

Grafana logs incoming requests in /var/log/grafana/grafana.log. This log may contain private information like IP addresses and request times.
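
When an excerpt of that log needs to be shared, for example in an issue, addresses can be masked first. The following is a minimal sketch that only handles IPv4 addresses; IPv6 addresses are left alone, and anything else shaped like four dotted numbers would be masked too. The sample line and its field names are invented and do not claim to match the actual log format:

```python
import re

# Four dot-separated groups of 1 to 3 digits; deliberately loose.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_ipv4(line: str) -> str:
    """Replace anything that looks like an IPv4 address with a marker."""
    return IPV4.sub("[redacted]", line)


if __name__ == "__main__":
    # Hypothetical log line using a documentation-range address.
    sample = 'msg="Request Completed" remote_addr=192.0.2.10 status=200'
    print(redact_ipv4(sample))
```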

Backups

No special backup procedure has been established for Grafana, considering the service can be rebuilt from scratch.

Discussion

Overview

The Grafana project was quickly thrown together in 2019 to replace the Munin service, which had "died in a fire". Prometheus was first set up to collect metrics and Grafana was picked as a frontend because Prometheus didn't seem sufficient to produce good graphs. There was no elaborate discussion or evaluation of alternatives done at the time.

There hasn't been a significant security audit of the service, but given that authentication is managed by Apache with a limited set of users, it should be fairly safe.

Note that it is assumed the dashboard and Prometheus are public on the internal server. The external server is considered private and shouldn't be publicly accessible.

There are lots of dashboards in the interface, which should probably be cleaned up and renamed. Some are not in Git and might be lost in a reinstall. Some dashboards do not work very well.
