Our first unplanned downtime in two years
In the second week of May 2019, we had our first unplanned downtime since launching in April 2017. It lasted 2 hours a day, 3 days in a row. We’re very much aware that the stability of our service plays a huge role in whether you can trust our product, and that’s why we want to be as transparent as possible. In this blog post, we want to share our most important learnings and what we’ve done to provide you with an even better service in the future.
To better understand the whole story, here is a quick intro about our infrastructure. (If you’re not interested in the technical side, skip the next two paragraphs.)
To better maintain and scale our infrastructure, we don’t have one huge service but many small ones called micro-services that can easily be deployed on a server via Docker. These micro-services are managed and scaled with Kubernetes. We run multiple instances of every micro-service for redundancy and performance reasons.
On Tuesday, the 7th of May, at 3:50 pm CEST, our monitoring services alerted us that all instances of our most important micro-service, the time-tracking-service, were unreachable. This meant that while you were still able to register, log in, manage integrations, etc., you were not able to do the most important thing: track your time.
After investigating the issue, we quickly noticed that the time-tracking-service had huge queues of threads (something like small task-runners) waiting to get a database connection. We tried to solve the problem with the usual procedures: restarting the services, tuning the settings and adding more instances. After several attempts, this solved the issue on the first day and decreased our heart rate a little.
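To illustrate why slow queries turn into long queues of waiting threads, here is a minimal simulation in Python. It is only a sketch under made-up assumptions, not the actual service code: `simulate_pool` is a hypothetical helper, and the pool size, query durations and request rate are invented numbers chosen to make the effect visible.

```python
import heapq


def simulate_pool(pool_size, query_ms, arrivals_ms):
    """Return how long (in ms) each request waits for a free connection.

    pool_size   -- number of connections in the pool (assumed value)
    query_ms    -- how long one request holds its connection
    arrivals_ms -- sorted arrival times of the requests, in ms
    """
    # Min-heap holding the time at which each connection becomes free.
    free_at = [0] * pool_size
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals_ms:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait if no connection is free yet
        waits.append(start - arrival)
        heapq.heappush(free_at, start + query_ms)
    return waits


# 50 requests, one every 100 ms, sharing a pool of 2 connections.
arrivals = list(range(0, 5000, 100))
fast_waits = simulate_pool(pool_size=2, query_ms=100, arrivals_ms=arrivals)
slow_waits = simulate_pool(pool_size=2, query_ms=500, arrivals_ms=arrivals)

print("longest wait with fast queries:", max(fast_waits), "ms")
print("longest wait with slow queries:", max(slow_waits), "ms")
```

With the fast queries nobody waits, because the pool can serve requests faster than they arrive. With the slow queries the pool falls behind, every new request waits longer than the one before, and in this toy setup the last one waits 7.2 seconds. Restarting or adding instances can drain such a queue for a while, but it comes back as long as the queries themselves stay slow.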
While we were looking for the root cause of the downtime on Wednesday, time tracking went down again at nearly the same time of day, which increased our heart rate by quite a lot.
At this point we had two hypotheses. As both incidents happened at nearly the same time of day, we thought the cause was either that most of our US customers become active at that time while our European customers are still active, or that a few specific customers become active then.
We started to run intensive load tests on our testing environment. At first they didn’t provide any significant insights, but at some point we were finally able to replicate the issue. A deeper investigation revealed that, simply put, the database’s hard drive was too slow to keep up with the load.
As a solution, we provisioned a faster SSD on both the testing and production environments, which fixed the issue.
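For readers curious what a load test looks like in principle, the idea can be sketched in a few lines of Python. This is a simplified illustration, not the tooling we used: `run_step`, `ramp` and `fake_request` are hypothetical names, and `fake_request` only sleeps for a millisecond where a real test would send a request to the testing environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_step(target, concurrency, requests_per_worker):
    """Call target() from `concurrency` threads; return all latencies in seconds."""

    def worker():
        latencies = []
        for _ in range(requests_per_worker):
            started = time.perf_counter()
            target()
            latencies.append(time.perf_counter() - started)
        return latencies

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        return [latency for future in futures for latency in future.result()]


def ramp(target, steps, requests_per_worker=5):
    """Run one step per concurrency level and record the median latency."""
    report = {}
    for concurrency in steps:
        latencies = run_step(target, concurrency, requests_per_worker)
        report[concurrency] = median(latencies)
    return report


def fake_request():
    # Hypothetical stand-in for one request; a real test would call the service.
    time.sleep(0.001)


if __name__ == "__main__":
    for concurrency, latency in ramp(fake_request, steps=[1, 4, 16]).items():
        print(f"{concurrency:>3} workers: median {latency * 1000:.1f} ms")
```

Increasing the load step by step and watching where the latency starts to climb is what makes a bottleneck visible. Against a real system, that climb is the signal to look at the resource behind it, which in our case was the database disk.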
The issue was hard to discover because we had disabled additional debugging information on our production system. This is considered best practice for speed and security reasons but has some trade-offs in situations such as these. We’ve learned a couple of things (the hard way) and tuned our monitoring system, especially regarding the database as it turned out we had a few blind spots there.
As a result of the incidents, we’ve improved the monitoring system for our database and documented how to enable additional debugging on the production system.
Although we have run load tests at Timeular since the beginning, we’ve now increased our efforts in this area so that we know the limits of our infrastructure early on and can further increase its stability.
We know that many of you requested an offline tracking mode and we are aware of the advantages it would have, especially in these situations. We can assure you the offline feature has gained a higher priority now.
Besides learning and improving a lot of things, the thing that surprised me most was the supportive messages we received from you, our users, during this intense time. The incredible support you’ve shown definitely helped us stay positive late at night while searching for the root of the problem. So this is my chance to thank all of you for the kind words and your faith in us.
Manuel Z. (CTO)