Rewriting an existing Web Service in Rust
If you keep up to date on trends in software engineering, you have probably heard of the Rust programming language before.
Several members of our dev team developed a bit of a taste for this language, so it was almost inevitable that at some point, we would rewrite a core piece of our software in Rust. 😉
In this post, we will look at the motivation and process behind such a rewrite, why it might have been the first of many, and what we learned from the whole experience.
What we will not cover is an introduction to Rust itself beyond the trade-offs that are relevant for us.
There are many good resources available, including an Exercism course, Tour of Rust, the Rust subreddit and the official Rust page. This will not be a technical post, but rather a post about the process of introducing new technology to your stack in a safe and sustainable way.
Let’s dive in.
Is a rewrite a good idea?
It’s fun to play around with technology and to learn new things.
However, when it comes to a production system dealing with real-world customer data, things get a lot more serious. There needs to be a bigger reason than just “fun” to invest the effort into rewriting an existing, working system.
Good reasons include that the platform an existing system was built on is no longer maintained, the people maintaining it have left, or that the underlying technology is not something you want to use/support anymore.
When a code base has gone through many changes over a long time without getting much maintenance love, leading to code where the cost of change explodes, it might also make sense to start fresh.
This is especially true for microservices-based systems in a small startup. At the outset of a service, the problem it should solve might not have been completely clear yet. So, we can say that, under certain circumstances, rewriting an existing (micro)service is a reasonable thing to do.
Choice of technology
The next question is whether the rewrite should happen in the same technology or a different one. The obvious advantage of the first approach is that you can reuse some of the code.
If we’re talking about rewriting an existing service in a technology that is new to the organization and hasn’t “proven” itself in production yet, there needs to be an even better reason to do it. So let’s see what our motivation was for doing exactly that: rewriting an existing, working microservice in Rust.
There were several factors that played a role in deciding whether or not to rewrite the service in question in Rust. These factors are partly technical in nature, but also include personal preferences of the people directly involved within the dev team. We initially wrote the service in Node.js. For this purely IO-based service, this wasn’t a bad choice when we first implemented it. We were aware of some scalability limits we’d hit at some point with Node.js (CPU-bound tasks), but weren’t close to hitting them yet. However, apart from this service, we never really warmed to Node.js as a backend technology.
After a while, this service was the last production piece based on Node.js in our cluster. We also don’t plan to write new services in Node.js – not because Node is bad, but due to our personal preferences and experience. So a rewrite in another technology would have removed Node.js from our tech stack, simplifying it.
On the other hand, rewriting something with a new technology immediately neutralizes this advantage, since it adds something new to the tech stack. 😉 If you expect the new technology to be used more widely, there is still a benefit though. At the end of the day, there was not (yet) a convincing technical or economic reason to rewrite this service. The majority of the motivation to do this came from the personal preference of the people involved.
Rust, by design, moves concerns such as safety and performance to an earlier point during development. With its strict compile-time checks and rigid type system, you’ll catch many errors at compile time. This is great for safety and robustness, but comes at the initial cost of development speed, especially when you’re not that comfortable with Rust yet.
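To make "catching errors at compile time" a bit more concrete, here is a minimal sketch. The `SyncState` type and both functions are hypothetical and made up for illustration; they are not taken from the service described in this post.

```rust
/// Hypothetical domain type, invented for this example.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyncState {
    Pending,
    Synced,
    Failed { retries: u32 },
}

/// The `match` has to cover every variant. If someone later adds a new
/// variant to `SyncState`, this function stops compiling until the new
/// case is handled explicitly.
pub fn should_retry(state: SyncState, max_retries: u32) -> bool {
    match state {
        SyncState::Pending | SyncState::Synced => false,
        SyncState::Failed { retries } => retries < max_retries,
    }
}

/// Parsing returns an `Option`, so the "invalid input" case is part of the
/// type and callers cannot silently ignore it.
pub fn parse_retries(raw: &str) -> Option<u32> {
    raw.trim().parse().ok()
}
```

Calling `should_retry(SyncState::Failed { retries: 1 }, 3)` yields `true`, and `parse_retries("two")` yields `None` instead of throwing at runtime. The price is that you have to think about these cases up front, which is where the initial slowdown comes from.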
Over time, when you get more familiar with Rust and how to design systems in it, the compiler becomes a powerful ally. It will catch issues at the time when they are cheap to fix. In the long run, the additional initial time commitment will likely pay off, since you spend less time finding and fixing bugs later on. The biggest trade-off when it comes to Rust is its infamously steep learning curve. There also isn’t really a way around it.
There are, at this point, already several tools and courses to help with learning Rust. The learning experience and effort will also vary to a high degree depending on the background of the learner. However, there is a learning curve and it is steeper and takes more time to overcome than in other modern languages such as Go or Kotlin.
The biggest barrier in terms of learning seems to be the borrow checker and the mental model around ownership of memory. Especially when coming from a garbage-collected language, and not having had to deal with memory management for a while, getting an intuition for this might take some time. The upside is that you will come out with a much better understanding of the things you struggled with at first.
This also translates to other languages and ecosystems – even garbage-collected ones.
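For readers who haven’t run into the borrow checker yet, the following small sketch shows the three basic ways of passing data around that the ownership model distinguishes. All names are hypothetical and only serve as illustration.

```rust
/// Takes ownership of the vector: the caller cannot use it afterwards.
pub fn into_total(entries: Vec<u32>) -> u32 {
    entries.into_iter().sum()
}

/// Borrows the data immutably: the caller keeps ownership.
pub fn total(entries: &[u32]) -> u32 {
    entries.iter().sum()
}

/// Borrows the vector mutably: while this borrow is alive, no other
/// reference to the vector may exist.
pub fn add_entry(entries: &mut Vec<u32>, minutes: u32) {
    entries.push(minutes);
}

pub fn demo() -> u32 {
    let mut entries = vec![30, 45];
    add_entry(&mut entries, 15);

    let borrowed = total(&entries); // `entries` is still usable afterwards
    let moved = into_total(entries); // ownership moves into the function

    // total(&entries); // would not compile: `entries` was moved above

    borrowed + moved
}
```

The commented-out line is the kind of mistake that a garbage-collected language would happily accept. In Rust it is rejected at compile time, and that rejection is what takes some getting used to.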
First and foremost, and this is true for any technological innovation and change, it’s important to get commitment from the whole team. You need to guarantee that, if you leave, your colleagues won’t be stuck with a code base using technology no one is comfortable with.
In our case, multiple people had been playing around with Rust and we even had several internal coding dojos and hackathons, where we did some Rust. If the team commits and is on board, the next step is to allocate some time. This is usually where stuff gets difficult, since time is, in most cases, a scarce and highly contended resource.
If there aren’t any immediate technological or economic reasons to do the rewrite, it’s even more difficult to convince management to invest time in such an enterprise. And for good reason.
At Timeular, we have two ways of working around this problem. The first is our Developer Focus Fridays. Every second Friday, developers at Timeular can choose what they want to work on completely on their own. If that means they want to experiment with a new language – great! Check out that new machine learning framework? Sure!
There are basically no limitations on what you can do. The idea is to keep motivation up and to also foster some technical innovation and learning internally.
This is a good chunk of time you can allocate for projects such as this.
There is another model that worked in our case for features that might be interesting, but that we couldn’t justify spending time on at that point. In this model, the idea is that if people are interested in implementing something, maybe because they find it technically challenging, they can work on it in their free time.
The company commits to integrating and supporting any outcome once it is mature, provided it adds value and is supported by the team.
In such a model it’s very, very important to set clear boundaries. You must absolutely avoid outsourcing normal dev work to people’s free time. As with the Developer Focus Fridays, it is imperative to avoid moving necessary maintenance, bug fixing, or feature work out of the normal development cycle. Such a misuse of these tools will inevitably lead to bad incentives such as “it’s fine, hack it together – you can fix bugs and make it pretty in your free time”. Avoid this at all costs.
So in this scenario, it needs to be 100% voluntary and if nothing comes of it, no harm done. However, if something valuable starts to emerge, the company commits to invest the time to take the last steps to production, integrate it and maintain it from then onwards.
This model isn’t perfect, since it relies on high self-motivation and lots of free time. However, it is a way to bring about change you otherwise couldn’t, and as long as there are clear boundaries, I believe it’s worthwhile for a company to try this.
It has certainly worked for us so far. Within Timeular’s culture, overworking is a non-starter. This is reflected in our 50 days vacation policy, among other things. Because of that, the issue of exploiting people’s free time wasn’t something we had to worry about.
It’s still important to keep an eye on it though, even in a situation such as ours.
We talked about the motivation behind a rewrite, the trade-offs, commitment and how to find the time to do it.
The only thing left is to look at implementation and runtime implications.
The biggest lesson here is to check the ecosystem in terms of library support before committing too hard.
In the case of Rust, when we began playing around with the idea of a Rust web service, the very important async/await feature had not been stabilized yet. Before that happened, we wouldn’t have gone to production with anything.
Not necessarily because async/await was paramount to the success of the rewrite, but because it was clear that the web ecosystem would stabilize a lot more after it landed. This assumption proved to be correct in hindsight. Also, searching for libraries and seeing if they’re actively maintained isn’t enough. You need to actually use them and get to know their trade-offs.
Only if you’re sure everything you’ll need is either there, or you’ll be able to build it yourself, should you move on, at least if the goal is a production-grade service at the end.
Personally, I like to go step by step when doing a full rewrite: building isolated modules, testing and documenting them in isolation, and then moving on, until only the step of wiring everything together is left. This approach isn’t perfect, and it also makes a difference whether you move top-down or bottom-up. The idea is to progress along the existing technical concepts. The domain logic doesn’t (and shouldn’t) change much in a rewrite.
If the existing service has a good suite of integration tests, optimally one that runs end-to-end, that can also be very helpful. You can reuse these tests to validate the implementation of the rewrite, even before porting them to the new language. Another thing I try to do when rewriting services is to document as much as I can on the go. There will always have to be an additional documentation pass at the end, but adding basic docs during implementation saves time, since you’re already deep into the context of the module.
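The idea of validating a rewrite against the behavior of the old service can be sketched roughly like this. This is a simplified, in-process illustration under loud assumptions: `RecordedCase`, `new_handler` and the request format are all hypothetical, and a real suite would talk to the running service end-to-end rather than call a function.

```rust
/// A request/response pair recorded from the existing service.
/// Hypothetical shape, chosen only to keep the example small.
pub struct RecordedCase {
    pub name: &'static str,
    pub request: &'static str,
    pub expected_response: &'static str,
}

/// Stand-in for one isolated module of the rewrite (hypothetical logic:
/// sum up comma-separated minutes).
pub fn new_handler(request: &str) -> String {
    let total: Result<u32, _> = request
        .split(',')
        .map(|part| part.trim().parse::<u32>())
        .sum();
    match total {
        Ok(minutes) => format!("total={}", minutes),
        Err(_) => "error=invalid_input".to_string(),
    }
}

/// Runs every recorded case through the handler and returns the names of
/// the cases where the new output differs from the old service's output.
pub fn find_mismatches(
    cases: &[RecordedCase],
    handler: impl Fn(&str) -> String,
) -> Vec<&'static str> {
    cases
        .iter()
        .filter(|case| handler(case.request) != case.expected_response)
        .map(|case| case.name)
        .collect()
}

/// Example recordings; in practice these would come from the old test suite.
pub fn recorded_cases() -> Vec<RecordedCase> {
    vec![
        RecordedCase { name: "simple", request: "30,45,15", expected_response: "total=90" },
        RecordedCase { name: "whitespace", request: " 10 , 20", expected_response: "total=30" },
        RecordedCase { name: "invalid", request: "30,abc", expected_response: "error=invalid_input" },
    ]
}
```

An empty result from `find_mismatches(&recorded_cases(), new_handler)` means the module behaves like the old service for every recorded case. Because the check only compares inputs and outputs, it doesn’t care which language either side is written in.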
Once you finish the implementation with documentation and automated tests available, you’ll experience firsthand why people say the last 20% take 80% of the time. 😉
Since you’re replacing existing infrastructure and real-world user data is involved, you need to make sure that the new thing is stable and works correctly. You also need to make sure that transitioning from the old service to the new one works smoothly. In our case, this meant doing extensive QA on multiple layers and also running the service through an extensive load test.
All of our services run inside Docker containers within a Kubernetes cluster. So getting the Rust web service ready to be included in our cluster was no problem at all. Once we were confident in the correctness and stability of the rewrite, we deployed it. Having invested so much time making sure it was bulletproof, that part wasn’t very exciting, which is a good thing.
In terms of runtime stability and performance, Rust has delivered on its promises so far.
We had a couple of minor bugs – very few for rewriting a >10k LOC code base. They were all pure logic errors and couldn’t have been caught by the compiler.
Memory and CPU usage are better than those of the replaced Node.js service, and so far we haven’t had any instability issues. It’s safe to say that our runtime experience has been flawless up to this point (7 months and counting…).
Since our first experiment of using a Rust web service in production was very successful, this rewrite will likely not stay the only one for long. At the time of writing this, we have four Rust services running in production. The first three services were rewrites in the spirit of the one mentioned in this post. The fourth service is a new one, however. Since the first steps went well and we haven’t encountered any problems at runtime or in maintenance so far, we decided to base an entirely new, greenfield project on Rust.
This was another big step in terms of trusting Rust and its ecosystem. We will continue to monitor the maintenance effort and stability of our Rust services.
If everything continues as it has so far, Rust will very likely become one of our primary technologies in the backend.
To conclude this summary of our experience rewriting an existing service in Rust, let’s re-iterate the most important points.
First, we made certain that there was team commitment to do the work and to maintain the service once it’s there – this is the most important point.
Then, another non-optional thing is to check that the ecosystem of the technology you’re switching to sufficiently supports everything you need.
Once the commitment is there and everything is prepared to do the rewrite, there are some practices we found helpful.
If it’s difficult to get a time commitment within the company, get creative! Internal programs to foster motivation and learning help prevent burnout and will increase long-term productivity.
How open your management will be to this will depend on the company you work at, but it’s always worth a try. In terms of implementation, tackling isolated modules one after another makes it easier to keep an overview and to keep everything working. Also, you can write tests and documentation at the time of writing the code. This experiment worked great for us.
We’re excited to see what benefits and challenges await us on our path with Rust as a first-class technology.