Grafana is a graphing engine and dashboard management tool that processes data from multiple data sources. We use it to trend various metrics collected from servers by Prometheus.
Grafana is installed alongside Prometheus, on the same server. These are the known instances:
- https://grafana.torproject.org/ - internal server
- https://grafana2.torproject.org/ - external server
See also the Prometheus monitored services to understand the difference between the internal and external servers.
Tutorial
Important dashboards
Typically, working Grafana dashboards are "starred". Since we have many such dashboards now, here's a curated list of the most important dashboards you might need to look at:
- Overview - first panel to show up on login, can filter basic stats (bandwidth, memory, load, etc) per server role (currently "class" field)
- Per-node server stats - basic server stats (CPU, disk, memory usage), with drill down options
- Node comparison dashboard - similar to the above, but can display multiple servers in columns, useful for cluster overview and drawing correlations between servers
- Postfix - to monitor mailings, see monitoring mailings, in the CRM documentation
Other services (e.g. Apache, Bind, PostgreSQL, GitLab) also have their own dashboards, and many dashboards are still work in progress.
The above list doesn't cover the "external" Grafana server
(grafana2), which has its own distinct set of dashboards.
Basic authentication
Access to Grafana is now granted through one of the passwords attached to LDAP accounts, the "web password".
If you have an LDAP account and need access to the web interface for this service (or if you need to reset your password to something you know):
- log in to https://db.torproject.org/
- set your new password in the row titled "Change web password:" -- you'll need
to enter it once in each of the two fields of that row and then save the
changes with the "Update..." button at the bottom of the form
- if you're only updating the web password, you don't need to change or enter values in the other fields
- note that this "web password" does not need to be the same as your LDAP or email passwords. It is usually considered better to have differing passwords to limit the impact of a leak (this is where your password manager comes in handy!)
- wait for your password to propagate. Normally this takes about 15 minutes. If your password has still not been set after 1 or 2 hours, contact TPA so they can look into what's happening. After the delay you should be able to log in with your new "web password"
- if you logged in to grafana for the first time, you may need to obtain some additional access in order to view and/or edit some graphs. Check in with TPA to obtain the required access for your user
Granting access to folders
Individual access to folders is determined at the "Team" level. First, a user needs to be added to a Team, then the folder needs to be modified to grant access to the team.
To grant access to a folder:
- head to the folder in the dashboards list
- select the "Folder actions" button on the top-right
- select "Manage permissions"
- wait a while for Grafana to finish loading
- select "Add a permission"
- choose the "team" item in the left drop-down, then the appropriate permission (View, Edit or Admin; typically Edit or Admin, as View is available by default), then hit Save
You typically need "admin" access to the entire Grafana instance to
manage those things, which requires the "fallback" admin password,
stored in Trocla and TPA's password manager. See the authentication
section for details.
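The same steps could in principle be scripted against Grafana's HTTP API instead of the web interface. The following is only a sketch under stated assumptions, not a procedure TPA uses: it builds the requests without sending them, and the team id, user id and folder UID are hypothetical placeholders. Authentication of the API calls is not shown. One caveat to keep in mind: the folder permissions endpoint replaces the folder's whole permission list, so any existing entries that should survive must be included in the request.

```python
import json

# Numeric permission levels used by Grafana's folder permissions API.
PERMISSIONS = {"View": 1, "Edit": 2, "Admin": 4}


def add_team_member_request(team_id, user_id):
    """Return (method, path, JSON body) adding a user to a team."""
    body = json.dumps({"userId": user_id})
    return ("POST", f"/api/teams/{team_id}/members", body)


def folder_permission_request(folder_uid, grants):
    """Return (method, path, JSON body) setting team permissions on a folder.

    `grants` maps a team id to a permission name ("View", "Edit", "Admin").
    The resulting list replaces whatever permissions the folder had before.
    """
    items = [
        {"teamId": team_id, "permission": PERMISSIONS[name]}
        for team_id, name in sorted(grants.items())
    ]
    body = json.dumps({"items": items})
    return ("POST", f"/api/folders/{folder_uid}/permissions", body)
```

For example, `folder_permission_request("abc123", {7: "Edit"})` describes a request giving the (hypothetical) team 7 edit access to the folder with UID `abc123`.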
How-to
Updating a dashboard
As mentioned in the installation section below, the Grafana dashboards are maintained by Puppet. So while new dashboards can be created and edited in the Grafana web interface, changes to provisioned dashboards will be lost when Puppet ships a new version of the dashboard.
You therefore need to make sure you update the Dashboard in git before leaving. New dashboards not in git should be safe, but please do also commit them to git so we have a proper versioned history of their deployment. It's also the right way to make sure they are usable across other instances of Grafana. Finally, they are also easier to share and collaborate on that way.
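To illustrate what "updating the dashboard in git" involves on the JSON side, here is a minimal sketch. It is not the procedure documented in the grafana-dashboards repository, which remains the reference. It assumes you already fetched the dashboard through Grafana's `GET /api/dashboards/uid/<uid>` endpoint, whose response wraps the dashboard model in a `dashboard` key next to a `meta` key; the function name is made up for this example.

```python
import json


def dashboard_for_git(api_response):
    """Turn a GET /api/dashboards/uid/<uid> response into stable JSON text."""
    # Copy the model so the caller's data is left untouched; "meta" is dropped.
    dashboard = dict(api_response["dashboard"])
    # The numeric id belongs to one Grafana database. The uid is what
    # identifies the dashboard across instances, so the id is cleared.
    dashboard["id"] = None
    # Sorted keys and fixed indentation keep diffs between versions small.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"
```

Writing the returned text to a file in the repository gives a diff that only shows real changes to the dashboard, rather than reordered keys.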
Folders and tags
Dashboards provisioned by Grafana should be tagged with the
provisioned label, and filed in the appropriate folder:
- meta: self-monitoring, mostly metrics on Prometheus and Grafana themselves
- network: network monitoring, bandwidth management
- services: service-specific dashboards, for example database, web server, applications like GitLab, etc
- system: system-level metrics, like disk, memory, CPU usage
Non-provisioned dashboards should be filed in one of those folders:
- broken: dashboards found to be completely broken and useless, might be deleted in the future
- deprecated: functionality overlapping with another dashboard, to be deleted in the future
- inprogress: currently being built, could be partly operational, must absolutely NOT be deleted
The General folder is special and holds the "home" dashboard, which
is, on grafana1, the "TPO overview" dashboard. It should not be
used by other dashboards.
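The filing convention for provisioned dashboards is simple enough to check mechanically. The sketch below is hypothetical tooling, not something that exists in the repository: it assumes the dashboard is available as a parsed JSON model (where tags live in the `tags` list) and that the folder name is passed in separately.

```python
# Folders reserved for provisioned dashboards, as listed above.
PROVISIONED_FOLDERS = {"meta", "network", "services", "system"}


def filing_problems(dashboard, folder):
    """List convention violations for a dashboard meant to be provisioned."""
    problems = []
    if "provisioned" not in dashboard.get("tags", []):
        problems.append("missing 'provisioned' tag")
    if folder not in PROVISIONED_FOLDERS:
        problems.append(f"unexpected folder {folder!r}")
    return problems
```

An empty list means the dashboard follows the convention; anything else names what to fix before committing.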
See the grafana-dashboards repository for instructions on how to export dashboards into git.
Pager playbook
In general, Grafana is not a high availability service and shouldn't "page" you. It is, however, quite useful in emergencies or diagnostic situations. To diagnose server-level issues, head to the per-node server stats dashboard, which shows basic server stats (CPU, disk, memory usage), with drill down options. If that's not enough, look at the list of important dashboards.
Disaster recovery
In theory, if the Grafana server dies in a fire, it should be possible to rebuild it from scratch in Puppet; see the installation procedure.
In practice, it's possible (currently most likely) that some data like important
dashboards, users and groups (teams) might not have been saved into git, in
which case restoring /var/lib/grafana/grafana.db from backups might bring them
back. Restoring this file should take only a handful of seconds since it's small.
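Before putting a restored file back in place, it can be worth checking that it is a sane database. The sketch below assumes that grafana.db is a SQLite database, as it is in Grafana's default configuration, and that the restored copy sits at a path of your choosing. It opens the file read-only, runs SQLite's own integrity check, and lists the tables without assuming anything about Grafana's schema.

```python
import pathlib
import sqlite3


def inspect_sqlite_db(path):
    """Run SQLite's integrity check and list the tables in a database file."""
    # A file: URI with mode=ro opens the database read-only, so inspecting
    # a restored copy cannot modify it.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
    finally:
        conn.close()
    return {"integrity": integrity, "tables": tables}
```

An `integrity` value of `"ok"` and a non-empty table list suggest the backup is usable; it does not prove that the dashboards, users and teams you expect are present.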
Reference
Installation
Puppet deployment
Grafana was installed with Puppet using the upstream Debian package, following a debate regarding the merits of Debian packages versus Docker containers when neither is trusted; see this comment for a summary.
Some manual configuration was performed after the install: the admin
password was reset on first install and stored in tor-passwords.git, in
hosts-extra-info. Everything else is configured in Puppet.
Grafana dashboards, in particular, are managed through the grafana-dashboards
repository. The README.md file there contains more instructions
on how to add and update dashboards. In general, dashboards must not
be modified directly through the web interface, at least not without
being exported back into the repository.
SLA
There is no SLA established for this service.
Design
Grafana is a single-binary daemon written in Go with a frontend
written in TypeScript. It stores its configuration in an INI file (in
/etc/grafana/grafana.ini, managed by Puppet). It doesn't keep
metrics itself and instead delegates time series storage to "data
sources", which we currently use Prometheus for.
It is mostly driven by a web browser interface making heavy use of JavaScript. Dashboards are stored in JSON files deployed by Puppet.
It supports alerting, but we do not use that feature, instead relying on Prometheus for alerts.
Authentication is delegated to the webserver proxy (currently Apache).
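As an illustration of how the INI configuration and the proxy delegation fit together, the sketch below reads a grafana.ini and reports the auth proxy settings. It assumes the delegation is done through Grafana's `[auth.proxy]` section, which is the upstream mechanism for trusting a username header set by a reverse proxy; the actual settings deployed by Puppet are not shown in this document and may differ.

```python
import configparser


def auth_proxy_settings(ini_text):
    """Extract the [auth.proxy] settings from the text of a grafana.ini."""
    # Interpolation is disabled because values may legitimately contain "%".
    parser = configparser.ConfigParser(interpolation=None)
    # grafana.ini may set keys before the first [section]; configparser
    # rejects that, so those keys are wrapped in a throwaway section.
    parser.read_string("[_toplevel]\n" + ini_text)
    if not parser.has_section("auth.proxy"):
        return {"enabled": False, "header_name": None}
    section = parser["auth.proxy"]
    return {
        "enabled": section.getboolean("enabled", fallback=False),
        "header_name": section.get("header_name", fallback=None),
    }
```

Called with the contents of /etc/grafana/grafana.ini, this tells you whether Grafana trusts the proxy for authentication and which HTTP header carries the username.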
Authentication
The web interface is protected by HTTP basic authentication backed by
userdir-ldap. Users with access to LDAP can set a webPassword password which
gets propagated to the server.
There is a "fallback" user that can be used in case the other system
fails. Its username is hardcoded to admin, and its password is stored in
Trocla (profile::prometheus::server::password_fallback) and in the
password manager (under services/prometheus.torproject.org).
See the basic authentication section for more information for users.
Note that only the admin account has full access to everything. The
password is also stored in TPA's password manager under
services/prometheus.torproject.org.
Note that we used to have only a static password here; this was changed in June 2024 (tpo/tpa/team#41636).
Access control is given to a "team". Each user is assigned to a team and a team is given access to folders.
We have not used the "Organization" because, according to this blog post, "orgs" fully isolate everything between orgs: data sources, plugins, dashboards, everything is isolated and you can't share stuff between groups. It's effectively a multi-tenancy solution.
We might have given a team access to the entire "org" (say "edit all dashboards" here) but unfortunately that can't be done: we need to grant access on a per-folder basis.
Issues
There is no issue tracker specifically for this project. File or search for issues in the team issue tracker with the ~Grafana label.
Issues with Grafana itself may be browsed or filed on GitHub.
Maintainer, users, and upstream
This service was deployed by anarcat and hiro. The internal server is used by TPA and the external server can be used by any other teams, but is particularly used by the anti-censorship and metrics teams.
Upstream is Grafana Labs, a startup with a few products alongside Grafana.
Monitoring and testing
Grafana itself is monitored by Prometheus and produces graphs for its own metrics.
The test procedure is basically to log in to the service and load a few dashboards.
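Part of that manual check could be approximated from a script. The sketch below is an assumption-laden illustration rather than an established test: it builds an HTTP basic authentication header and a request to Grafana's `/api/search` endpoint, which lists dashboards. The host name in the usage note is a placeholder, and the request is built but not sent.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of an HTTP basic authentication header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_search_request(base_url, username, password):
    """Build (without sending) a request listing dashboards via /api/search."""
    url = base_url.rstrip("/") + "/api/search?type=dash-db"
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request
```

Passing the result of `build_search_request("https://grafana.example.org", user, web_password)` to `urllib.request.urlopen` with a timeout should return a JSON list of dashboards if both the login and Grafana itself work. It does not replace actually loading dashboards in a browser, since rendering happens client-side.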
Logs and metrics
Grafana doesn't hold metrics itself and delegates this task to external data sources. We use Prometheus for that purpose, but other backends could be used as well.
Grafana logs incoming requests in /var/log/grafana/grafana.log, which
may contain private information like IP addresses and request times.
Backups
No special backup procedure has been established for Grafana, considering the service can be rebuilt from scratch.
Other documentation
Discussion
Overview
The Grafana project was quickly thrown together in 2019 to replace the Munin service, which had "died in a fire". Prometheus was first set up to collect metrics and Grafana was picked as a frontend because Prometheus didn't seem sufficient to produce good graphs. There was no elaborate discussion or evaluation of alternatives done at the time.
There hasn't been a significant security audit of the service, but given that authentication is managed by Apache with a limited set of users, it should be fairly safe.
Note that it is assumed the dashboard and Prometheus are public on the internal server. The external server is considered private and shouldn't be publicly accessible.
There are lots of dashboards in the interface, which should probably be cleaned up and renamed. Some are not in Git and might be lost in a reinstall. Some dashboards do not work very well.
Goals
N/A. No ongoing migration or major project.
Must have
Nice to have
Non-Goals
Approvals required
Proposed Solution
N/A.
Cost
N/A.
Alternatives considered
No extensive evaluation of alternatives was performed when Grafana was deployed.