SLOs for everyone with Sloth
One of our long-term, ongoing goals as the SRE team at HelloFresh is to drive the adoption of SRE practices across the whole engineering department. One of the manifestations of this adoption is the implementation of SLOs across the department.
Before explaining the task at hand in more detail, some background on our systems will help make sense of the choices we made later on.
Prometheus
As discussed here and more loosely here, our Prometheus setup for each cluster consists of one central Prometheus installation and then one Prometheus installation per namespace.
The central instance holds metrics about everything that is specific to Kubernetes: pod metrics, deployment metrics, etc.
The per-namespace Prometheus holds Istio mesh metrics for the pods of that namespace, as well as metrics exposed by the namespace's workloads themselves, which for convenience we call “custom” metrics.
An important piece of information for later is that our Prometheus installations are configured to retain only 6 hours' worth of data. We use Thanos for everything longer than that.
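Purely to illustrate that constraint (the exact mechanism in our clusters may differ), a short local retention combined with a Thanos sidecar looks roughly like this when Prometheus is managed via the Prometheus Operator:

```yaml
# Illustrative sketch only: short local retention with Thanos for long-term
# storage. The resource name and the object storage secret are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tribe-prometheus            # hypothetical name
spec:
  retention: 6h                     # only 6 hours of data are kept locally
  thanos:
    objectStorageConfig:
      name: thanos-objstore-secret  # hypothetical Secret holding objstore.yml
      key: objstore.yml
```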
Helm usage in HelloFresh
We use Helm for all of our Kubernetes workloads, specifically via an in-house-built Helm plugin. Our plugin has two primary roles: first, it helps teams generate Helm values files preconfigured with sane defaults for newly created applications; second, every time a release is triggered, the respective application's values files are rendered against a shared set of templates, effectively generating a Helm chart.
This way we can add functionality in one place, the generated Helm chart, and each application gets access to that functionality in its next release cycle (e.g. Ambassador Mappings, Ingress, Istio, monitoring, tracing, etc.). A helpful by-product of this setup is that we can auto-enable (or migrate) features for everyone according to different criteria, without asking all the teams to explicitly integrate the new functionality.
SLOs v1
Up until now, we maintained an internally built Kubernetes operator called the Service Level Alerts Operator. The intention behind this project was to allow teams to easily define SLIs and SLOs based on HTTP request errors and latency. Under the hood, the operator would translate the following values.yaml section into multi-window, multi-burn-rate alerting PrometheusRules based on Istio request metrics. All the service owners had to provide was their targets.
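That section has changed a bit over time, and the sketch below is illustrative rather than the operator's exact contract (the field names and the latency threshold are hypothetical), but it conveys the idea:

```yaml
# Illustrative sketch of the v1 interface: service owners only declared their
# targets, and the operator generated the Istio-based burn-rate alerts.
# Field names and values here are hypothetical.
serviceLevelAlerts:
  enabled: true
  availability:
    target: 99.9      # % of HTTP requests that must not fail (non-5xx)
  latency:
    target: 99.0      # % of HTTP requests that must finish within the threshold
    threshold: 250ms  # hypothetical latency threshold
```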
Goals for this project
While we were happy with the first phase of SLO adoption, it was now time to move forward with phase two. For this we aimed for the following:
- Let teams define SLOs on custom SLIs (i.e. any metrics they want, not just Istio metrics)
- Since we rely on Prometheus metrics, our solution needed to be compatible with our Prometheus setup
- Ideally, we wanted reusable components: in the same way we abstracted SLOs on HTTP requests, we wanted to build reusable plugins for common types of applications (e.g. Kafka consumers)
- Our solution needed to support both a rolling 14d and a fixed calendar-month error budget window (these are the two windows agreed upon for our services at HelloFresh)
- Most importantly, everything had to be done in an easy and intuitive way (i.e. service owners shouldn't have to wrestle with complex PromQL)
Sloth to the rescue
Nothing is set in stone, so we were really open to all ideas, from developing our own solution to using something already out there. Truth be told, we were really keen on developing our own solution, just to feel cooler about it, but then we stumbled upon Sloth.
Luckily for us, Sloth seems to tick all of the boxes. It is a really well-thought-out Kubernetes operator that implements the technical aspects of SLOs described in the SRE Workbook.
Sloth setup
Our Sloth setup is a mirror image of our Prometheus setup, so we end up with one central Sloth installation and one Sloth installation per tribe. Each has been configured to look for SLO definitions in its respective namespaces.
Sloth plugins
Sloth comes with out-of-the-box support for plugins, which are reusable, templated Prometheus queries. It even has a repo of common plugins, which can be found here. This was really important for us, since having reusable components was a prerequisite for our solution.
It is really easy to write custom Sloth plugins. For example, we wrote a custom plugin, which we call istio/RED, that defines an SLO on two HTTP SLIs (request errors and latency).
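A full reproduction of the plugin is out of scope here, but the sketch below shows the general shape, following Sloth's SLI plugin contract (a Go file exporting SLIPluginVersion, SLIPluginID and an SLIPlugin function). The plugin ID, the service option and the queries are illustrative, and only the request-error half of the SLI is shown:

```go
// Illustrative sketch of an Istio-based SLI plugin; not our production code.
package plugin

import (
	"context"
	"fmt"
)

const (
	SLIPluginVersion = "prometheus/v1"
	SLIPluginID      = "hellofresh/istio/red" // hypothetical plugin ID
)

// SLIPlugin returns the error-ratio query for a service's Istio request metrics.
func SLIPlugin(_ context.Context, _, _, options map[string]string) (string, error) {
	service, ok := options["service"]
	if !ok || service == "" {
		return "", fmt.Errorf("'service' option is required")
	}

	// Ratio of 5xx responses over all requests; a latency SLI would add
	// histogram-bucket terms to the numerator in a similar way.
	return fmt.Sprintf(`
sum(rate(istio_requests_total{destination_service_name="%[1]s",response_code=~"5.."}[{{ .window }}]))
/
sum(rate(istio_requests_total{destination_service_name="%[1]s"}[{{ .window }}]))
`, service), nil
}
```

Teams then reference the plugin by its ID from their SLO definitions and pass the service name as a plugin option.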
Side note: because we use both the common Sloth plugin repository and our own, we had to tweak the provided Helm chart and maintain our own fork, since (at the time of writing) the default Helm chart does not support multiple plugin repositories. In a nutshell, the difference is mainly in deployment.yaml.
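Roughly, and as a hedged sketch rather than a verbatim diff, the idea is one git-sync container per plugin repository and one --sli-plugins-path flag per cloned path. Image versions, repository URLs and volume names below are placeholders:

```yaml
# Illustrative deployment.yaml excerpt: clone each plugin repo into its own
# directory and point Sloth at every directory. Not a verbatim copy of our chart.
spec:
  template:
    spec:
      containers:
        - name: sloth
          args:
            - kubernetes-controller
            - --sli-plugins-path=/plugins/common      # upstream common plugins
            - --sli-plugins-path=/plugins/hellofresh  # our own plugins (hypothetical path)
          volumeMounts:
            - name: sli-plugins
              mountPath: /plugins
        - name: git-sync-common-plugins
          image: k8s.gcr.io/git-sync/git-sync:v3.6.2  # placeholder version
          args:
            - --repo=https://github.com/slok/sloth-common-sli-plugins
            - --root=/plugins/common
          volumeMounts:
            - name: sli-plugins
              mountPath: /plugins
        - name: git-sync-hellofresh-plugins
          image: k8s.gcr.io/git-sync/git-sync:v3.6.2  # placeholder version
          args:
            - --repo=https://github.com/hellofresh/sloth-sli-plugins  # hypothetical repo
            - --root=/plugins/hellofresh
          volumeMounts:
            - name: sli-plugins
              mountPath: /plugins
      volumes:
        - name: sli-plugins
          emptyDir: {}
```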
Custom alert windows
As mentioned earlier, we use two kinds of alert windows at HelloFresh: a rolling 14d window and a fixed calendar-month one (mainly used for reporting). However, we rely on the 14d window for alerting.
Sloth is great here since it supports custom alert windows. In our case, taking into consideration that we have only 6h of data in each Prometheus, we ended up with a custom window configuration.
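A hedged sketch of what such a window catalog can look like is below, using Sloth's AlertWindows window-catalog format. The 14d SLO period and capping the longest windows at 6h reflect the constraints described above; the specific error budget percentages are illustrative rather than our production tuning:

```yaml
# Illustrative window catalog for a rolling 14d SLO period, with every alert
# window capped at 6h to match our local Prometheus retention.
# Error budget percentages are placeholders.
apiVersion: sloth.slok.dev/v1
kind: AlertWindows
spec:
  sloPeriod: 14d
  page:
    quick:
      errorBudgetPercent: 2
      shortWindow: 5m
      longWindow: 1h
    slow:
      errorBudgetPercent: 5
      shortWindow: 30m
      longWindow: 6h
  ticket:
    quick:
      errorBudgetPercent: 10
      shortWindow: 30m
      longWindow: 2h
    slow:
      errorBudgetPercent: 10
      shortWindow: 1h
      longWindow: 6h
```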
Integrating Sloth with helm-chart-gen
As mentioned above, one of the most important goals was for this entire solution to be easy and intuitive enough for teams to use. The interface we decided upon was the following:
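The snippet below is an illustrative sketch rather than the exact schema our Helm plugin generates; the field and metric names are hypothetical, but the shape matches the description that follows:

```yaml
# Illustrative sketch of the SLO section in a service's values.yaml.
# Field and metric names are hypothetical.
slos:
  - name: http-availability
    kind: http                # uses the istio/RED plugin under the hood
    objective: 99.9
  - name: consumer-errors
    kind: application         # "custom" application metrics, deployed next to the tribe's Prometheus
    objective: 99.5
    sli:
      error_query: sum(rate(orders_consumer_messages_failed_total[{{.window}}]))
      total_query: sum(rate(orders_consumer_messages_processed_total[{{.window}}]))
  # kind: system works the same way as application, but the generated rules are
  # deployed next to the central Prometheus (Kubernetes and system metrics).
```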
As you can see, we have different kinds of SLOs:
- kind: application — It is meant to be used with application-specific metrics and will be deployed in the same namespace as the respective tribe's Prometheus
- kind: system — It is meant to be used with Kubernetes and general system metrics and will be deployed in the same namespace as our central Prometheus
- kind: http — This one uses our previously mentioned custom plugin for HTTP requests
This seems like a reasonable compromise between usability and customizability: people can use arbitrary queries (still in an intuitive way, since all that is needed is an error_query and a total_query) as well as out-of-the-box functionality with reusable plugins.
Grafana Dashboards
Grafana dashboards are always a way of showing off all of your work (of course, they also help monitor your applications).
Sloth provides two dashboards that work out of the box, here and here. We have slightly tweaked those and come up with the following:
A high-level dashboard per tribe:
A detailed dashboard per service:
Detailed view of a specific SLO:
Note: In reality, our price-service, showcased in the detailed graphs, is one of our most reliable services; we have just been messing with it in the test environment (where those screenshots were taken) in order to validate the integrity of Sloth.
Final thoughts
In the end, the Sloth integration is straightforward and hassle-free. There is only one catch to be aware of: Sloth creates a lot of Prometheus recording rules, and your Prometheus instances may need more CPU and memory depending on the number of rules created.
The setup described above has worked really well for our intended goal. We can measure the adoption of SRE practices (namely SLO adoption) and we’re anticipating that it’ll skyrocket.
It has to be said, though, that the adoption of SRE practices is far from a solely technical problem; it is a change of mindset in how service reliability is measured, and it requires buy-in from many different stakeholders. However, making the technical aspect of it easy and intuitive can only help toward that end goal!