Live Data Driven Testing: Extracting a Microservice from a Monolith

Farok Ojil
Published in HelloTech
May 16, 2018 · 7 min read


With the rising trend of microservices, replacing old legacy systems with small, fast and robust microservices is becoming a major undertaking for most teams. The new service could be a simple CRUD service with a small amount of business logic, or it could be a service at the heart of the system that encapsulates a tremendous amount of business logic. This article is about the latter type of microservice and how to use a Live Data Driven approach to reach production with zero bugs.

The Problem

To meet our stakeholders’ requirement of offering customers a robust, fast and customizable checkout experience, we decided it was time to extract the checkout logic from our monolith into a set of microservices that provide frontend clients with convenient APIs. Achieving this within the monolith would not only be hard, it would also turn the codebase into a mess nobody could comprehend or reason about. One of the first services we decided to extract was the price-service.

The price-service is responsible for calculating the price breakdown of a bag of items based on their prices, delivery fee, taxes, discounts, etc., which is displayed to the customer and charged to their account. Since the service will be called by the frontend client at an early stage of the checkout funnel, extracting it from the monolith would be of great value. Another major factor was the lack of unit/integration tests in this part of the system, which made developing new pricing features very prone to catastrophic bugs. The nature of such a service also means bugs surface with high latency: a miscalculation would be detected by someone in the finance or BI department, possibly after months of reviewing the numbers, only to find that the calculation was off by a cent and that the differences had accumulated into millions in losses. Therefore, having a solid set of tests was an essential requirement.

Getting to work

The first task was to understand how the monolith was doing the job and to isolate the parts of the code that would be replaced by the service. We then planned how to integrate the new service with frontend clients and with the existing monolith code, which still expected the price breakdown to be calculated for each new order. Next, we applied a code freeze on that part of the monolith and started building the service.

Going live

After 3 months of coding, the service was mature enough to run the exact same calculation as the monolith. Unit/integration tests are vital and were written for all the critical paths, but getting full coverage is a time-consuming process that could take more time than writing the code itself.

Let’s say your function has 5 possible run paths and you are interested in having 100% coverage: 1 test gives you 20% coverage, and 5 tests get you to 100%. Now consider component test coverage. In this example a request passes through 5 different components (functions), each with 5 possible run paths, which works out to 5^5 = 3,125 independent run paths. After writing the code we realized that the number of possibilities was approaching infinity: we are not only talking about the different execution paths a calculation request could take, but also about the values that lead to different rounding strategies, and accumulating those differences results in an exponential number of required tests. Of course we could have taken 3–6 more months to identify every single path and write a test for it, but that was not an option given the increasing demand from our stakeholders to deliver the service. Eventually, we decided to roll it out in 3 phases:

1. Dry-run phase

The service is now live with metrics and logs, but not yet consumed by any client. The monolith calls the service on every calculation request it receives. We also wrapped the whole integration code in the monolith with extra safety measures, so that any errors it encountered would not disturb the monolith from doing its job; at the same time this was very useful for detecting bugs in the integration code itself.

We also surrounded the whole code with a feature toggle as a final line of defence. This phase enabled us to test performance against live load and fine-tune the instance type before really going live. Another advantage was having live metrics for weeks before going live, which is very handy for setting accurate alerts in advance. We also managed to detect all kinds of bugs through a technique we will get into in a moment.
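To make the dry-run setup more concrete, here is a minimal sketch of what the wrapped integration could look like. Our monolith is not written in Python, and none of these names come from the real codebase; legacy_calc, service_client, toggle and publish_comparison are hypothetical stand-ins for the actual collaborators.

```python
import logging

logger = logging.getLogger("checkout")

def calculate_price_breakdown(order, legacy_calc, service_client, toggle, publish_comparison):
    """Dry-run wrapper: the monolith result stays authoritative, the new service is only shadowed."""
    monolith_result = legacy_calc(order)

    if toggle.is_enabled("price-service-dry-run"):
        try:
            # Call the new service with the same input; its result is not used yet.
            service_result = service_client.calculate(order)
            # Hand both results to a publisher so they can be compared out of band.
            publish_comparison(order, monolith_result, service_result)
        except Exception:
            # Final safety net: problems in the new integration must never break checkout.
            logger.exception("price-service dry-run call failed")

    return monolith_result
```

Flipping the toggle off disables the shadow call entirely, which is what makes this phase safe to run against live traffic.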

2. Live with feature-toggle

After fixing plenty of bugs, we were confident about going live with the service. We modified the integration code in the monolith to start consuming the result of the service instead of calculating it itself, and we changed the feature toggle to choose between the service result and the monolith’s own calculation, so we could revert to safety in case of unexpected critical incidents.
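In this phase the toggle decides which result is actually used, so reverting is a single switch. Again, a hedged Python sketch with hypothetical names rather than the real integration code:

```python
def calculate_price_breakdown(order, legacy_calc, service_client, toggle):
    """Phase 2: the service result is used when the toggle is on; the legacy path remains the fallback."""
    if toggle.is_enabled("use-price-service"):
        return service_client.calculate(order)
    # Turning the toggle off reverts instantly to the old calculation.
    return legacy_calc(order)
```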

3. The point of no return

This is the easiest and most satisfying part, since it’s just about deleting code. We replaced the whole calculation code in the monolith with a single call to the service, deleting it outright rather than deprecating it, so nobody gets confused by leftover code.

Live Data Driven Testing

Both the service and the monolith perform the exact same data transformation. During the dry-run phase, the monolith called the service on every calculation request after it had run the same calculation itself, then published a message through RabbitMQ with the serialized result of both calculations. We then built a small script that consumed these messages from the queue and compared the final results: it logged “true” if they were identical; otherwise it logged a message containing the differences between the two objects, the request payload needed to reproduce the calculation against the service, and both of the calculated objects.
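A comparison consumer along these lines could look like the following sketch. This is not our actual script: the queue name, the message shape, and the use of pika and deepdiff are assumptions made for illustration.

```python
import json
import logging

import pika  # RabbitMQ client; the connection details below are placeholders
from deepdiff import DeepDiff  # convenient structural diff between the two results

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("price-comparison")

def on_message(channel, method, properties, body):
    message = json.loads(body)
    monolith_result = message["monolith_result"]
    service_result = message["service_result"]

    diff = DeepDiff(monolith_result, service_result, ignore_order=True)
    if not diff:
        logger.info("true")  # the two calculations are identical
    else:
        # Log everything needed to reproduce and debug the mismatch.
        logger.warning(
            "mismatch: diff=%s request=%s monolith=%s service=%s",
            diff, message["request_payload"], monolith_result, service_result,
        )

    channel.basic_ack(delivery_tag=method.delivery_tag)

if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="price-comparison", durable=True)
    channel.basic_consume(queue="price-comparison", on_message_callback=on_message)
    channel.start_consuming()
```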

Having extra metrics telling us the number of mismatches was very useful for keeping track of fixed bugs and the amount of work left. This phase was not only important but a life saver, since the number of mismatches was astronomical! It enabled us to detect not only our own bugs, but also bugs in all of the downstream services called by our service, and boy, there were plenty of them. Fixing those bugs wasn’t always a trivial task; some of them were caused by nasty hacks hidden deep inside the monolith that nobody knew about. One more practice was to add unit/integration tests for every bug we detected, so we gradually approached a more complete test coverage. Since the service depends on 6 downstream services, integration tests were essential, and WireMock’s recording feature came in handy, saving a lot of manual work.
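Coming back to the mismatch metrics mentioned above, the comparison script could emit a simple counter per outcome; the StatsD client, address and metric names below are assumptions, not part of our actual setup.

```python
import statsd

metrics = statsd.StatsClient("localhost", 8125, prefix="price_comparison")

def record_outcome(diff):
    # One counter per outcome makes the number of remaining mismatches easy to graph and alert on.
    metrics.incr("match" if not diff else "mismatch")
```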

The cost of fixing bugs

When a defect is found late, the code may require redesign and reimplementation. This iterative cycle is costly in time and resources, and the cost can go well beyond that depending on the nature of the application. A study from the IBM Systems Sciences Institute shows that it is 100 times more expensive to fix a defect that is found after the software has been released.

Combining that with the cost of fixing a single miscalculation (refunds, customer care, etc.), we can clearly justify the extra effort of such an approach.

Also keep in mind: if you have dependencies or downstream calls that could cause mismatches and you think you would be able to fix them in X days, multiply that number by 3, unless you are also responsible for those dependencies (then it should be X * 2). That is because the person fixing those bugs won’t have the full scope of the problem and will end up fixing only that narrow case.

Ready-made tools

This pattern has recently become more common. Twitter released an open-source tool called Diffy that works as a proxy: it receives calls from live instances, forwards them to testing instances, and then compares and logs any differences in the responses. It also includes a nice feature that automatically excludes noisy mismatches, but it doesn’t support adding manual exclusion rules, which unfortunately was a requirement for us, since we wanted full control over noise exclusion. GitHub’s Scientist is also a great tool that’s worth looking into.

Key takeaways

Eventually, we had a successful launch with 0 bugs and happy stakeholders! The service proved its value, playing a key role at the heart of the system. If you are planning a project that contains a large amount of complex business logic, I would definitely recommend this technique.

One last thing: estimations are vital for planning. If you are running on a tight schedule and planning to use such a technique, it’s a good idea to make sure stakeholders are fully aware of the plan. It may cost a little more time upfront, and nobody wants to hear this, but the savings down the line from having fewer or no bugs certainly pay off.

Looking for a new job opportunity? Become a part of our team! We are constantly on the lookout for great talent. Help us in our mission of delivering fresh ingredients to your door and change the way people eat forever.
