The journey to improving our logging pipeline

Chi Toan · Published in HelloTech · Apr 20, 2020 · 9 min read


The Site Reliability Engineering Team at HelloFresh has a clear mission: provide proper visibility for everyone within HelloTech. As our business grows, the team must constantly look for ways to improve how we manage the hundreds of thousands of log events that come in every second. To keep up with the business needs and the scalability they require, we chose to re-architect and optimise our logging pipeline. We’ve written about our monitoring setup before; this article focuses on the re-architecture, the monitoring we implemented, and some challenges along the way.

Current architecture

At HelloFresh, our logging pipeline is built from three main components: FluentD (collection) and GrayLog (visualisation), with all data stored in ElasticSearch. The current architecture was adapted to fit the requirements of all the teams (squads / tribes) at HelloFresh.

FluentD running model

This picture shows that each K8s node runs its own FluentD pod (deployed as a DaemonSet). FluentD runs as a separate container, and every container on the node writes its logs to /var/log/containers/*.log on the host. FluentD tails those files and retrieves the message from each line. With hundreds of K8s nodes we also run hundreds of FluentD pods, each responsible for scraping, parsing and pushing the logs of all the applications running on its node.
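For context, a minimal sketch of such a tail source on the collector side could look like the following; the path, tag and parser are assumptions and depend on your container runtime:

    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json   # or a cri/regexp parser, depending on the container runtime
      </parse>
    </source>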

In this post, I’ll discuss the steps we took in order to solve the problems outlined below.

  1. First, I’ll discuss the Unified Logging Architecture that we implemented.
  2. Second, like any other critical system at HelloFresh, our logging pipeline needs thorough monitoring. I will discuss the monitoring approach we took for the new architecture.
  3. And finally, I’ll talk about the improvements we made to the logging system in response to the abnormal indicators we observed.

UNIFIED LOGGING — New Architecture

The current architecture has several drawbacks:

  • We cannot apply global rules (e.g. throttling) for particular applications on each FluentD container, because any given node may or may not be running that application.
  • We want to send these logs to multiple targets, such as GrayLog or S3, in the near future.
  • We have no visibility into how much log volume each tribe / squad is producing.

Several years ago, Treasure Data proposed the concept of a log aggregator, and the core value of the aggregator layer needs little explanation. The job of the aggregator node is to combine multiple smaller streams into a single data stream that is easier to process and ingest into the store, where it is persisted for future consumption. We kept that in mind when planning the new architecture.

Therefore, the biggest change in the new architecture was implementing the Unified Logging Layer (a FluentD aggregator) and adding more reliability and scalability to the data transport. With all log data running through the Unified Logging Layer, it effectively filters, buffers and routes incoming data. This solved the aforementioned problems and gave us the following benefits:

  • Reliability: We use a file buffer to store all incoming logs for a period of time before pushing them to output targets such as S3 or GrayLog. From the Grafana dashboard we can easily break down how much data is lost between ingestion and forwarding, regardless of whether events are sent to the targets together or individually.
  • Flexibility: The aggregator layer now acts mainly as a router, so applying new rules to specific application log groups is no longer a big deal, and sending events to the same target or to different ones becomes much simpler.
  • High availability: The FluentD aggregators now run in either active/passive or active/active mode. After several rounds of chaos testing for instability, we are confident that all logs keep routing through this layer at all times; the backup node automatically takes over in failure cases.

With those conclusions in mind, we implemented the architecture shown in the image below.
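As a rough sketch, the collector-to-aggregator forwarding for an active/passive pair can be expressed with FluentD’s built-in forward output; the hostnames, ports and buffer path below are hypothetical:

    # Collector side: forward everything to the aggregator layer.
    <match **>
      @type forward
      <server>
        host fluentd-aggregator-active.logging.svc
        port 24224
      </server>
      <server>
        host fluentd-aggregator-passive.logging.svc
        port 24224
        standby true   # only used when the active server is unreachable
      </server>
      <buffer>
        @type file
        path /var/log/fluentd-buffer/forward
        flush_interval 5s
        retry_forever true
      </buffer>
    </match>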

MONITORING FOR NEW ARCHITECTURE

We spent hours working on the Helm chart to spin up the new aggregator service, as part of our standard K8s policy. We now have two FluentD aggregator pods running side by side (one active and one passive). Like any other critical system, we need to gather golden metrics for observability, so we defined four critical metrics that this aggregator must expose:

  • Total logs sent to the aggregator per minute (incoming)
  • Total logs processed and sent to GrayLog per minute (outgoing)
  • Buffer size per minute
  • Number of chunks pending in the queue on the aggregator side

Fortunately, all these metrics are easily collected with the FluentD Prometheus plugin. From them, we built a set of Grafana graphs.
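For illustration, the kind of configuration that exposes these metrics with fluent-plugin-prometheus looks roughly like this; the tags and labels are assumptions, not our exact setup:

    # Expose FluentD internals for Prometheus.
    <source>
      @type prometheus                  # serves /metrics (port 24231 by default)
    </source>
    <source>
      @type prometheus_monitor          # buffer size, queue length, retry counts
    </source>
    <source>
      @type prometheus_output_monitor   # per-output emit / retry / error counters
    </source>

    # Count incoming records per tag (the "incoming" panel on the dashboard).
    <filter **>
      @type prometheus
      <metric>
        name fluentd_input_status_num_records_total
        type counter
        desc The total number of incoming records
        <labels>
          tag ${tag}
        </labels>
      </metric>
    </filter>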

Please note that if your FluentD instance runs in multi-worker mode, the Prometheus port is bound once per worker. For example, if you run 4 workers and expose metrics on port 24231, ports 24231 → 24234 will be bound automatically, and the metrics exposed on each port only cover that worker. Use the /aggregated_metrics endpoint to scrape metrics from all workers at once.
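A minimal Prometheus scrape configuration using that endpoint could look like this; the target address is hypothetical, and in Kubernetes you would normally rely on service discovery instead of static targets:

    scrape_configs:
      - job_name: fluentd-aggregator
        metrics_path: /aggregated_metrics   # aggregates metrics from all workers
        static_configs:
          - targets:
              - fluentd-aggregator.logging.svc:24231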

And what about the collector side?

As mentioned above, each K8s node runs one FluentD pod, but these are not configured as aggregators; they are called collectors. From our point of view, all the heavy logic should move from the collector to the aggregator, leaving the collector with only one critical mission: tail the logs and transfer them to the aggregators. Our collector also enriches every log message with K8s metadata. With this approach in mind, we decided to monitor the collector side with one critical metric: connectivity between collector and aggregator.

There is no specific metric from the Prometheus plugin that captures this directly. The plugin does, however, provide one potentially useful metric called output_status_retry_count. Even though we cannot assume this metric relates entirely to the connectivity between collector and aggregator, it at least gives us fast feedback when abnormal situations occur.

In the end we merged all our graphs into a single Grafana dashboard that shows the behaviour of both the collector and the aggregator. Besides the Grafana dashboard, Prometheus alerting is also involved, paging us whenever the logging pipeline is unhealthy. Before defining strict rules in the Alertmanager system, we evaluated the health of the pipeline mainly through buffer size, retry rate and error rate, and several hypotheses were raised before the alerting rules were set.

The final alerts strictly followed from those hypotheses; a sketch of what such rules can look like is shown after the definition below.

Alert Manager definition
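To give an idea of the shape of those rules, here is an illustrative sketch based on fluent-plugin-prometheus metrics; the thresholds, durations and labels are hypothetical, not our production values:

    groups:
      - name: logging-pipeline
        rules:
          - alert: FluentdHighRetryRate
            expr: increase(fluentd_output_status_retry_count[10m]) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "FluentD {{ $labels.pod }} is retrying output flushes"
          - alert: FluentdBufferGrowing
            expr: fluentd_output_status_buffer_total_bytes > 512 * 1024 * 1024
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "FluentD buffer on {{ $labels.pod }} keeps growing"
          - alert: FluentdOutputErrors
            expr: increase(fluentd_output_status_num_errors[10m]) > 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "FluentD {{ $labels.pod }} is failing to flush to its output"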

IMPROVEMENTS TO THE NEW ARCHITECTURE — FluentBit to the Rescue

Once we had the Grafana dashboard in place, we spent several hours observing the system. Several issues forced us to rethink the FluentD collectors. See the image below:

We used to see a high buffer / failure rate on each collector whenever log traffic was high. Every FluentD instance seemed stuck between collecting new messages and processing the old ones. We were aware that it might be related to the buffer configuration, which we tried to optimise, but simply increasing the parameters during peak traffic is inefficient and not scalable: we cannot foresee when a peak will happen again, and we don’t want to permanently trade off resources for those situations.

In an ideal world, the logging platform should be resilient enough to deliver millions of logs without failing. We decided the buffer problem was the important one to solve now in order to scale later, and we came up with several ideas:

  1. Scale up / scale out the destination when writing data is slow.
  2. Increase flush_thread_count when the write latency is low. Launching multiple flush threads can hide the latency.
  3. Improve the network settings. If the network is unstable, the number of retries increases and makes buffer flushes slow.
  4. Optimise the buffer chunk limit or flush_interval for the destination. For example, Kafka and MongoDB have different characteristics for data ingestion.

Option 4 means tuning buffer parameters, which we don’t recommend: we would still need numbers that adapt perfectly to every case, which is impossible. Options 1 and 2 are feasible, but they would take too much time to build into our current automation and would also interfere with the HPA mechanism on Kubernetes. Option 3 is something we threw out due to the complexity of the current K8s network.
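To make option 4 concrete, this is the kind of <buffer> section (inside an output’s <match> block) that such tuning touches; all values here are illustrative:

    <buffer>
      @type file                    # we use a file buffer
      path /var/log/fluentd-buffer/graylog
      chunk_limit_size 8MB          # buffer chunk limit
      flush_interval 10s            # how often chunks are flushed
      flush_thread_count 4          # option 2: flush chunks in parallel
      retry_max_interval 30s        # cap the back-off between retries
      overflow_action block         # apply back-pressure instead of raising errors
    </buffer>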

Then we asked the following question: does the log shipper itself have any self-protection mechanism to prevent high buffers? That sounds a lot like a message queue!

After a few days, the team came up with several ideas and an implementation plan:

  1. Use message queues as the primary buffer system, so that messages are collected and published into the queues; the aggregator pods can then subscribe to each queue and consume the messages at their own pace.
  2. Whether we use message queues or not, we still need to treat FluentD as an open gateway. Having proper metrics at this layer to see how fast the data grows is a must; based on the data collected we can set the buffer parameters properly (we are using a file buffer).
  3. Another solution is to use FluentBit. Reading the documentation about mem_buf_limit under backpressure gave us a better understanding of the data processing flow, and it seemed to be exactly the answer to the question above; see the sketch after this list.
Replacing FluentD with FluentBit at the collector layer is not a difficult technical challenge, but there are a few trade-offs:

  1. There is no active/standby forwarding mode like FluentD has. So, if we use FluentBit at the collector layer, we have to forward all the logs to the FluentD aggregator pods in a load-balanced manner.
  2. FluentBit is lightweight, so it doesn’t have as many plugins as FluentD. This means we need to be careful in the future if we ever have to deal with a specific input source that is not supported.
  3. FluentBit enriches messages with K8s metadata out of the box, via its kubernetes filter.

Below is the architecture with the FluentBit solution.

Basically, we just replaced FluentD on the collector side with FluentBit and configured it to send log messages continuously to the aggregator layer (FluentD).
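A collector pipeline sketch in FluentBit’s classic configuration syntax, with Kubernetes metadata enrichment and a forward output towards the aggregators; the hostname is hypothetical, and for balancing across several aggregator pods the forward output can point at an Upstream configuration instead of a single Host/Port:

    [FILTER]
        Name           kubernetes
        Match          kube.*
        Merge_Log      On
        Keep_Log       Off

    [OUTPUT]
        Name           forward
        Match          *
        Host           fluentd-aggregator.logging.svc
        Port           24224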

With this relatively minor change on the collector side, we achieved the following:

  1. No more high buffers and no symptoms of network instability in Grafana
  2. The ability to collect and visualise metrics for each layer we need (FluentBit — FluentD — GrayLog — ElasticSearch)
  3. An architecture that is ready to be enhanced with throttling rules and message-size calculation rules
  4. All configuration is now managed with Helm charts, which improves the resilience of our infrastructure

Key takeaways

At HelloFresh engineering we are encouraged to ask questions as often as possible, to iterate, and to learn to be patient with every unexpected turn. I’ve learned to be ready to change as often as needed.

Monitoring systems are key to assessing the quality of both critical and non-critical services. No single setup will be the right solution forever, so keep improving based on the questions you ask every day.

Using CNCF projects in your infrastructure may give your architecture flexibility, but it also comes with plenty of hidden problems, which will require you to work proactively with the community.

In part 2 of this article, we will talk about how to optimise these components further and about “must-have” rules to strengthen your logging platform.

Stay tuned!
