Growing Pains
Transitioning from a startup to a fully grown company hurts, a lot. This is an old story that I myself have lived many times. There is no way around it: the mindsets are very different, and even if you make all the right architectural decisions to get an MVP out fast and prove your business, at that stage it makes no sense to spend time planning or building tools for the future, solving problems that you don't yet have.
I'm an experienced guy (you could say, just, old!) and I have always loved working in startups and new companies. I've spent my entire life working online; every step of the way, that was where I wanted to be. This gave me the opportunity to witness the same thing many times: suddenly everything has worked out, business is going well, and then it hits you: you now have a whole new set of problems.
Well, growing pains can be treated, but know that it's not going to be easy to grow a company from 10 developers to 100 or 200. Sometimes more people don't increase speed or output at all.
Our code, barely a year old, is now legacy and unknown to most of our developers (everybody is new here…), the corners that were cut are now breaking stuff every day, and it's very hard for our new people to find those problems. There isn't a machine big enough on Amazon to handle our database, suddenly cloud costs are skyrocketing, and we don't even know if we need all this power, or sometimes even what a particular server is for. In the past we just added more instances if the site was slow; now that doesn't work anymore. By the way, I'm not even talking about HelloFresh yet; I've seen this happen in nearly every startup out there!
Even before I joined, HelloFresh did a very smart thing: it invested in DevOps and infrastructure automation. That was one of the responsibilities of the team I joined, the Platform team. Thus it was my job as PM of the team to keep moving our dev tools and infrastructure forward, to make sure all the other teams had the tools and infrastructure their projects needed without over-provisioning them. In short, I needed to try to keep our costs in check.
When I joined, the Platform team had their hands full: they were trying to build monitoring, logging, infrastructure automation, CI, queue services… and literally every week we had to deal with something critical failing.
At the same time, the company was trying to move to a microservices architecture, so that the growing number of teams could work more independently.
I joined HelloFresh in February, and as with every new job, I spent my first month just trying to figure things out. How many services do we have? What are these 100 EC2 instances doing? Why do we have so many CI tools? Why is our site so slow? Of course, none of those questions had easy answers. Every day I learned something new, and for every system we checked we found some technical debt that needed to be addressed.
At this point you might think I was disappointed or frustrated. Not at all: I was thrilled. I'm experienced enough to have expected much, much worse, and given HelloFresh's history and how fast it grew, I was expecting to find a bunch of skeletons buried around. The most important thing I figured out, and the thing that thrilled me, was that I was surrounded by very smart people, all good at their jobs and all really wanting to fix things. I could see that HelloFresh had been built obeying the first rule of a successful startup: "hire good people". We were heading in the right direction; we just needed to keep moving.
The war room
We were having too many critical breakages, so we needed a plan to quickly fix most of them. We needed a task force that would involve most of the senior people in HelloFresh IT, and we needed to be sure all teams contributed some resources to it.
For each system failure we had, we held a proper postmortem meeting and tried to learn where the bottlenecks were and what kind of technical debt we had to address.
At this point everyone on the business side understood that we needed this task force. We sold it to them as critical for the company, something everybody needed to help with. So, drawing at least one developer from each IT team (to be sure all services were covered), we built our task force, held some planning meetings, and drafted a plan of action. "Everybody is team platform now" quickly became our motto, and we repurposed a meeting room to become our "war room".
We already had a new infrastructure design, fully automated (using Ansible, if you want to know the little secrets); it was clear that we just needed to move all our legacy applications there and throw away our old EC2 instances. With everything managed via automation, we could easily adjust each service's size and resources. Honestly, at the time this looked like an impossibly large task, given all the services we would need to migrate, but by pulling in resources from other teams as required we were able to finish much more quickly than expected.
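To give a flavor of what that automation looks like, here is a minimal sketch of an Ansible playbook in the same spirit; the inventory group, roles, and paths below are illustrative assumptions, not our actual setup.

```yaml
# Minimal sketch of a service playbook. Group and role names are
# hypothetical; per-service sizing and config live in vars so they
# can be reviewed and changed like any other code.
- name: Provision an application server
  hosts: app_servers          # hypothetical inventory group
  become: true
  vars:
    php_version: "7.0"
  roles:
    - common                  # base packages, users, monitoring agent
    - nginx
    - php-fpm
  tasks:
    - name: Deploy application configuration
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
      notify: restart php-fpm
  handlers:
    - name: restart php-fpm
      service:
        name: php7.0-fpm
        state: restarted
```

Once every service is described this way, spinning up a replacement instance, resizing one, or retiring an old EC2 box is a code change plus a playbook run rather than a manual ritual.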
Our MySQL databases were too big, but we figured out that we could easily split them by country and rethink each RDS instance's size and capacity without having to rewrite all our applications, which was a nice quick win. More, smaller instances mean higher reliability and lower cost. Plus, we are moving all our new services to an event-driven architecture and away from MySQL anyway, so in this case we just wanted to patch up the problem and move on.
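Conceptually, the split is simple: every request already knows which country it serves, so each country can map to its own, smaller database. A hedged PHP sketch of the idea, with made-up host names:

```php
<?php
// Hypothetical sketch of per-country connection routing. Host names
// and countries are made up; the point is that each country maps to
// its own (smaller) RDS instance without rewriting the application.
$dsnByCountry = [
    'DE' => 'mysql:host=db-de.example.internal;dbname=app',
    'US' => 'mysql:host=db-us.example.internal;dbname=app',
    'AU' => 'mysql:host=db-au.example.internal;dbname=app',
];

function connectForCountry(string $country, array $dsnByCountry): PDO
{
    if (!isset($dsnByCountry[$country])) {
        throw new InvalidArgumentException("Unknown country: $country");
    }
    return new PDO($dsnByCountry[$country], getenv('DB_USER'), getenv('DB_PASS'));
}

$db = connectForCountry('DE', $dsnByCountry);
```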
We had to profile all our critical services. We made a list of our most resource-intensive transactions, so we could find and fix bottlenecks and dedicate some performance-minded developers to carefully optimizing or refactoring at least the most expensive ones. Soon, we learned that just a few of our services were consuming most of our CPUs. By picking the most expensive application and taking the time to migrate it to the more performant PHP7, we were able to solve this problem, too.
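This post doesn't detail the profiling tooling; in practice you would reach for a PHP profiler such as Xdebug or Blackfire. But even a crude timing wrapper illustrates the idea of measuring transactions so you can rank them by cost. The transaction name and logging target below are hypothetical:

```php
<?php
// Toy illustration of transaction timing; a real setup would use a
// proper profiler and ship measurements to a metrics backend.
function timeTransaction(string $name, callable $fn)
{
    $start = microtime(true);
    try {
        return $fn();
    } finally {
        $elapsedMs = (microtime(true) - $start) * 1000;
        // error_log is just a stand-in for a real metrics pipeline.
        error_log(sprintf('txn=%s duration_ms=%.2f', $name, $elapsedMs));
    }
}

$result = timeTransaction('checkout.place_order', function () {
    // ... the expensive work being measured ...
    return true;
});
```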
One thing that's often neglected during rapid development on any MVP is proper use of database secondaries (read from replicas, write to the primary). Making our services aware of them and configuring our MongoDB and RabbitMQ to scale was another win, and we'll certainly be making better use of those replicas in our new event-driven services.
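The read/write split itself is a small amount of code. A minimal PHP sketch, assuming one primary and one replica with hypothetical host names (a real setup would pool several replicas and handle failover):

```php
<?php
// Sketch of read/write splitting: writes go to the primary, reads
// go to a replica. Host names here are hypothetical.
class SplitConnection
{
    /** @var PDO */
    private $primary;
    /** @var PDO */
    private $replica;

    public function __construct()
    {
        $user = getenv('DB_USER');
        $pass = getenv('DB_PASS');
        $this->primary = new PDO('mysql:host=db-primary.example.internal;dbname=app', $user, $pass);
        $this->replica = new PDO('mysql:host=db-replica.example.internal;dbname=app', $user, $pass);
    }

    // SELECTs can usually tolerate slight replication lag,
    // so route them to the replica and take load off the primary.
    public function read(string $sql, array $params = []): array
    {
        $stmt = $this->replica->prepare($sql);
        $stmt->execute($params);
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }

    // All writes must hit the primary.
    public function write(string $sql, array $params = []): int
    {
        $stmt = $this->primary->prepare($sql);
        $stmt->execute($params);
        return $stmt->rowCount();
    }
}
```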
Finally, we deployed a proper CDN for all our sites and made sure to cache everything possible. We know some of our legacy applications don't make proper use of it (you know... PHP sessions are a pain), but, as I said, we just need to keep moving in the right direction and we will eventually get there.
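The PHP-sessions pain, for the curious: session_start() sends no-cache headers by default, which tells the CDN not to cache the response at all. A sketch of the workaround, where pageNeedsSession() and the TTL are hypothetical:

```php
<?php
// session_start() defaults to the "nocache" cache limiter, emitting
// Cache-Control: no-store, no-cache, must-revalidate. Pages that
// don't actually need a session should skip it and send cacheable
// headers so the CDN can serve them.
if (pageNeedsSession()) {          // hypothetical helper
    session_start();               // response becomes uncacheable
} else {
    header('Cache-Control: public, max-age=300'); // example: 5-minute TTL
}
```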
The endgame
At the end of this effort, besides all the increases in reliability and automation, we also got a 19% decrease in our Amazon bill in August. A growing company that can lower its infrastructure costs should be celebrated.
We now have many more EC2 instances, many more than the 100 we had when I joined, and the bill is still lower. With all our infrastructure automated, it became much easier to add proper monitoring and to test and change instance sizes and types.
There is still a lot for the Platform team to do, but now all the extra hands can go back to their own teams. Personally, I feel this worked very well as a bonding exercise. I'm really happy to have been there.
As for our front-end efforts to keep speeding up our site, check out Javascript con Carne.
Looking for a new job opportunity? Become a part of our team! We are constantly on the lookout for great talent. Help us in our mission of delivering fresh ingredients to your door and make home cooking accessible to everyone.