Are you working in a microservice and/or serverless environment? Is debugging your application a nightmare? Do you think that an APM service will cost you thousands of dollars? If you are in any (or all) of these situations, I think you will find this blog post handy!
A couple of months back, at Sipios, we started a new project to help French companies face the economic consequences of the Coronavirus crisis. French companies could apply for a loan provided by the French “regions”, as long as the total amount requested by all companies stayed under the limit.
Traditionally, applying for a loan is really painful and you have to fill out a lot of paperwork. In this very hard time for companies, we had to create a seamless application that any customer could use and enjoy. That’s why performance mattered: we did not want to create any additional frustration for our users.
At first, everything went well and the code was running fast. But after a couple of weeks, we started to see latencies of several seconds when computing the total amount requested by the companies. It even started to cause errors because of timeouts.
Looking at our code, it was not doing anything very complex. I have reproduced the situation with the following example:
The implementation seems nice and clean, right? So why is it taking so long?
Since we had set up an APM (Application Performance Management) tool that traces our requests, we were able to see instantly what was going on under the hood:
Query before optimization:
We can see that we were running hundreds of queries against the database. Even if each of them is really small, every query adds its own overhead, and together they have a big impact on performance. This is a typical N+1 queries problem. For more information on N+1 queries and how to eliminate them, I invite you to read Yann Briançon’s great article: https://www.sipios.com/blog-tech/eliminate-hibernate-n-plus-1-queries.
Now that we know what is going on, it is easy to improve our performance by combining our requests into a single query:
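To make the difference concrete, here is a minimal sketch of the two approaches. It is not the project's actual code: `FakeLoanDatabase`, `TotalAmountDemo`, and the SQL in the comments are hypothetical names, and the "database" is an in-memory map that only counts round trips.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a database; it only counts round trips.
final class FakeLoanDatabase {
    private final Map<Long, List<BigDecimal>> loansByCompany;
    int queryCount = 0;

    FakeLoanDatabase(Map<Long, List<BigDecimal>> loansByCompany) {
        this.loansByCompany = loansByCompany;
    }

    // One query, e.g. SELECT id FROM company
    List<Long> findCompanyIds() {
        queryCount++;
        return List.copyOf(loansByCompany.keySet());
    }

    // One query per company, e.g. SELECT amount FROM loan WHERE company_id = ?
    List<BigDecimal> findLoanAmounts(long companyId) {
        queryCount++;
        return loansByCompany.getOrDefault(companyId, List.of());
    }

    // One query in total, e.g. SELECT SUM(amount) FROM loan
    BigDecimal sumAllLoanAmounts() {
        queryCount++;
        return loansByCompany.values().stream()
                .flatMap(List::stream)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}

final class TotalAmountDemo {
    // N+1 pattern: 1 query for the companies, then 1 more query per company.
    static BigDecimal totalWithNPlusOne(FakeLoanDatabase db) {
        BigDecimal total = BigDecimal.ZERO;
        for (long id : db.findCompanyIds()) {
            for (BigDecimal amount : db.findLoanAmounts(id)) {
                total = total.add(amount);
            }
        }
        return total;
    }

    // Single aggregate query: the database does the summing in one round trip.
    static BigDecimal totalWithSingleQuery(FakeLoanDatabase db) {
        return db.sumAllLoanAmounts();
    }
}
```

With N companies, the first method issues N+1 queries while the second always issues one, which is exactly the pattern an APM trace makes visible.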
And if we look at our performance, we can see that the request is now roughly 140 times faster (our 23-second request is down to 0.165 seconds).
Query after optimization:
Now that we know why it is incredibly helpful to have an APM solution, let’s see how it can be implemented!
APM solutions are often very expensive. Yet, with Elastic APM, you have a cheap but effective solution at hand. It is open-source, so you can host your Elastic infrastructure on your own machines and be compliant with your data management policy.
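To collect and analyze your traces, the Elastic APM solution is split into a few components: the APM agents that instrument your applications, the APM Server that receives the data they send, Elasticsearch that stores it, and Kibana that lets you visualize it.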
Let’s see how to implement this kind of solution with two Spring microservices!
You can find a configuration to launch the Elastic Infrastructure with docker-compose on the following GitHub repository: https://github.com/clementdessoude/elastic-apm-demo.
Here is the code snippet:
If a process spans several microservices, you will want a single trace covering the whole process, with spans for each of the microservices involved.
To propagate the trace between the different microservices, a header (called traceparent) is added to the requests made between our components. This header is interpreted by the Elastic APM agent to build the history of the process and to link the spans of this process to a single trace.
Elastic follows the W3C recommendation to do so.
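To give an idea of what this header contains, here is a minimal sketch that parses a version `00` traceparent value as described in the W3C Trace Context recommendation (`version-traceid-parentid-flags`, in lowercase hex). The `TraceParent` class and its method names are hypothetical; in practice the Elastic APM agent does this work for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only handles version "00" of the W3C traceparent header.
final class TraceParent {
    // 00 - 32 hex chars (trace id) - 16 hex chars (parent span id) - 2 hex chars (flags)
    private static final Pattern FORMAT = Pattern.compile(
            "^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String parentId;
    final String flags;

    private TraceParent(String traceId, String parentId, String flags) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version 00 traceparent: " + header);
        }
        // The recommendation treats all-zero ids as invalid.
        if (m.group(1).matches("0+") || m.group(2).matches("0+")) {
            throw new IllegalArgumentException("All-zero trace or parent id: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3));
    }

    // The lowest bit of the flags marks the trace as sampled.
    boolean isSampled() {
        return (Integer.parseInt(flags, 16) & 1) == 1;
    }

    // Header value for a downstream call: same trace id, the caller's span id as parent.
    String forChildCall(String callerSpanId) {
        if (!callerSpanId.matches("[0-9a-f]{16}")) {
            throw new IllegalArgumentException("Span id must be 16 lowercase hex chars");
        }
        return "00-" + traceId + "-" + callerSpanId + "-" + flags;
    }
}
```

The key point is that the trace id stays the same across every hop, while the parent id changes at each call; this is what lets the spans be stitched back into one trace.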
As this header is not added by default in every library, you will often have to add it yourself. The following example is how I configured it with Spring Webflux. Even if this example is specific to Java, the logic is the same in every language.
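As a framework-independent sketch of the same idea, the snippet below attaches the header to an outgoing request built with the JDK's `java.net.http` API (Java 11+) rather than Spring Webflux. `TracedRequests` and `withTraceParent` are hypothetical names, and the header value is passed in as a plain string; in a real service it would be obtained from the APM agent's public API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds an outgoing GET request carrying the trace context.
final class TracedRequests {
    static HttpRequest withTraceParent(URI uri, String traceParent) {
        return HttpRequest.newBuilder(uri)
                // The downstream agent reads this header to join the existing trace.
                .header("traceparent", traceParent)
                .GET()
                .build();
    }
}
```

Whatever the HTTP client, the logic is the same: read the current trace context, then copy it into the `traceparent` header of every request sent to another service.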
Logs are another way to get observability into your system and understand what is going on. However, if you run in a microservice environment, with multiple instances of your application, it can be difficult to know which logs belong to a single request. But there is a simple solution: enabling log correlation!
If you enable log correlation, the Elastic APM agent will inject the trace id and the span id in the Mapped Diagnostic Context (MDC). Every log in the context of a request will have the same trace id. Thus, it will be possible to filter on this parameter and see the logs of a single request.
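The mechanism can be pictured with the following sketch. It is a deliberately simplified, hypothetical stand-in (`MiniMdc`, `CorrelatedLog`, and the `trace.id` key are assumptions for illustration), not the actual MDC of a logging framework nor the agent's implementation, and it is meant for a single-threaded demo.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Simplified stand-in for a logging framework's Mapped Diagnostic Context:
// a per-thread map of key/value pairs attached to every log line.
final class MiniMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static String get(String key) { return CONTEXT.get().get(key); }
    static void clear() { CONTEXT.get().clear(); }
}

final class CorrelatedLog {
    // Collected log lines; a real setup would ship them to a log store instead.
    static final List<String> LINES = new ArrayList<>();

    // Every line is prefixed with the trace id found in the current context.
    static void info(String message) {
        LINES.add("trace.id=" + MiniMdc.get("trace.id") + " " + message);
    }

    // Filtering on the trace id returns the logs of a single request.
    static List<String> linesForTrace(String traceId) {
        String prefix = "trace.id=" + traceId + " ";
        return LINES.stream()
                .filter(line -> line.startsWith(prefix))
                .collect(Collectors.toList());
    }
}
```

With log correlation enabled, the agent plays the role of the code calling `MiniMdc.put` at the start of each request, so the application code only has to log as usual.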
Since Elastic APM v7.4, you can even switch from the APM console to the logs in one click!
To enable log correlation, you can set the ELASTIC_APM_ENABLE_LOG_CORRELATION environment variable to true when starting the application instrumented by the Elastic APM agent.
Elastic APM is awesome, and completes the range of tools provided by Elastic. It is particularly useful if you already have an ELK stack for visualizing your logs.
Nevertheless, it is still a young tool and comes with a few drawbacks. One of the biggest is the lack of out-of-the-box integration with some libraries, such as Spring Webflux or Kafka. You can still add your own implementation, but it will cost you some time. You can find some documentation on how to do it and configure your own transactions and spans with the Java APM Agent right here: https://www.elastic.co/guide/en/apm/agent/java/current/public-api.html#api-start-transaction
Alternatives like New Relic or Datadog are easier to configure for your environment, but they are expensive, particularly when you are dealing with lots of hosts, whereas Elastic APM is free apart from maintenance costs.
After this article, I hope you are convinced that an APM solution can greatly improve the observability of your systems and help you detect bugs and latencies almost instantly.
With the Elastic suite, you have an easy way to set up an effective, cheap, self-managed solution.
So I hope you are eager to implement one in your projects!
Feel free to comment. I will be happy to answer your questions if you have any.