How to manage architectural debt in a context of hypergrowth
Architectural debt is a fundamental concept in modern software development. As a company grows rapidly, it becomes increasingly difficult to keep its architecture healthy and sustainable. Cyril Beslay, a solution architect at ManoMano, recently gave a talk at Devoxx on managing architectural debt in a hypergrowth context. The talk highlighted why managing architectural debt matters and shared insights into the system that was put in place. In this article, we share the key learnings from Cyril Beslay's talk and draw parallels with our lean strategy for managing architectural debt effectively.
What is architectural debt?
To understand the principle of architectural debt, one must first understand the principle of technical debt. In software development, technical debt is the implied cost of the future rework required when you choose an easy but limited solution instead of a better approach that would take more time.
Analogy with financial debt
In his talk, Cyril makes a direct analogy between technical debt and financial debt. If you want to buy a house, you can imagine two solutions to make this project a success:
- save up until you have the amount of money you need for the purchase
- take a loan to buy the house immediately
In this example, the loan will allow us to reach our goal more quickly, but with a future cost that we will have to pay back: the interest.
Architectural debt is more complex
Architectural debt is a subset of technical debt.
Although technical debt is fairly simple to understand overall, it is not easy to identify and measure at the scale of an architecture, particularly in a context of hypergrowth where the code changes quickly. Typical examples are a poor choice of framework or tool, or a migration that was started but never completed. There are currently no really mature tools to help us quantify and monitor this kind of debt, and static code analysis tools are often ineffective on architectural issues.
So if we want to act effectively on architectural debt, we need to agree on a simpler definition that allows us to quantify it. This is exactly what Cyril proposed in his talk with the following definition:
Architecture debt is the gap between the current system and the architecture principles or guidelines of the company.
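Cyril's concrete example of such a guideline: all API calls to external services at ManoMano must implement the Circuit Breaker design pattern. Under this definition, every call that does not implement it counts as architectural debt.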
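To make the Circuit Breaker guideline more tangible, here is a minimal sketch of the pattern in Python. It is not ManoMano's implementation: the class name `CircuitBreaker`, the `max_failures` and `reset_timeout` parameters and the `CircuitOpenError` exception are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker (illustration only)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the struggling service.
                raise CircuitOpenError("circuit open, call rejected")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

A client would wrap each outgoing request, for example `breaker.call(fetch_stock, product_id)`, where `fetch_stock` stands for any function that calls an external service. In production you would more likely rely on an existing resilience library than on a hand-written class like this one.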
At Sipios, the notion of deviation from the standard is deeply rooted in the company culture as the signal that a problem exists. When we encounter such deviations, we analyse them in order to solve them and learn from them. To do so, we use several analysis and problem-solving methods, such as PISCAR or daily problem solving.
Let’s see an example of daily problem solving on an architectural debt problem
Without knowing it, the speaker went through each of our problem-solving steps on several concrete examples to identify the main causes of the problem and propose countermeasures.
The problem
- Formulated as a deviation from the ideal situation - this deviation affects the team's performance, for example on quality or deadlines
- Tied to a concrete part - otherwise the problem is ill-defined, which leads to generic, imprecise actions
An example: the call to the XXX dependency does not implement the Circuit Breaker design pattern.
These checkpoints allow us to ensure that we are not trying to solve a problem that is too big to act on.
The probable cause
- A piece of knowledge or a skill that we are missing - otherwise we are too quick to take corrective action when we still don't understand what is going on
- Confirmed in the field - otherwise time is wasted on unnecessary actions while the problem remains
An example: developer YYY does not know the Circuit Breaker design pattern.
Here, we regularly try to link the cause to one of the 4Ms (man, method, machine, material), as detailed in the book Learning To Scale by Régis Medina. We favour causes linked to Man or Method because they lead to more effective actions and increase our learning.
The countermeasure
- A quick action to test on the same day or in the next few days to mitigate the impact on the customer
- An idea to avoid the problem coming back
An example: developer YYY pair-programs with ZZZ to implement the Circuit Breaker design pattern on route XXX. The idea could be to place the clients for external services in modules that follow a naming convention, and to add a linter rule that checks that the correct design pattern is implemented.
The objective of the countermeasure is to solve our problem. The idea is a first step towards longer-term solutions that will be worth prioritising if the problem occurs again.
The check
- The consequences of the countermeasure - otherwise there is a risk of moving on while the problem is not solved
- Unintended consequences - if you only check whether the problem is solved, you may not realise that you have created other problems
An example: developer YYY uses the Circuit Breaker pattern on ticket N without exceeding the complexity allocated to the ticket (exceeding it would be an unintended consequence in this case).
Once the countermeasures have been taken, it is important to check that our problem has been solved and that there are no side effects. It is also an opportunity to step back from the problem and measure what you have learned.
What did we learn?
Architectural debt is created continuously, because a decision that is sound today may conflict with tomorrow's needs. We need to understand why architectural decisions are made so that we can act on them better later.
Architectural debt reduction strategy
It's easy to blame the previous team when you discover an architectural debt, especially when you don't understand the choices that were made.
💡 But sometimes we create architectural debt in order to be innovative. Let's look at an example: a proof of concept with a short time-to-market is developed to validate a business hypothesis with a few early adopters. The concept is validated, but we built it using shortcuts to speed up production.
- Advantage: we used a small budget to test an idea and open a new business perspective.
- Disadvantage: we now need a strategy to redesign the architecture to align with standards.
Context is essential to understand the decisions that have been made. It is preferable to record all these decisions: understanding the choices means understanding the debt, which in turn makes it possible to plan its reduction. The Architecture Decision Record (ADR) is a powerful tool for recording all decisions (accepted and rejected) and keeping track of them over time.
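To give an idea of what an ADR captures, here is a minimal sketch of a record as a Python data structure. The fields follow the commonly used layout of title, status, context, decision and consequences; the class name `DecisionRecord`, the `rejected_alternatives` field and the rendering format are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecisionRecord:
    """Hypothetical in-memory representation of one ADR."""

    number: int
    title: str
    status: str        # e.g. "proposed", "accepted", "rejected", "superseded"
    context: str       # the forces at play when the decision was made
    decision: str      # what was decided
    consequences: str  # what becomes easier or harder, including known debt
    rejected_alternatives: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, ready to commit next to the code."""
        lines = [
            f"ADR {self.number:04d}: {self.title}",
            f"Status: {self.status}",
            f"Context: {self.context}",
            f"Decision: {self.decision}",
            f"Consequences: {self.consequences}",
        ]
        for alternative in self.rejected_alternatives:
            lines.append(f"Rejected alternative: {alternative}")
        return "\n".join(lines)
```

Writing the known shortcuts into the consequences field is what later lets a new team tell deliberate debt from accidental debt.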
Take time to design the required developments. Include this in the team's job description. By doing this, you force yourself to question the current architecture and check how the new development will fit in or change.
Whenever you find architectural debt, you should create a ticket in your system to make an inventory of what needs to be reworked. This will reveal the tip of the iceberg and help you prioritise.
Control the creation of new debt
When you start to reduce architectural debt, the first thing to do is to check that new developments will not make the situation worse. Teams are still building their product and making choices that impact the current architecture.
They need guidelines to help them make decisions: they need standards. A standard is a set of key points and examples that will allow developers to deliver perfect parts. These standards are also what will allow teams to challenge the choices that have been made and to strengthen the links between teams.
Then, to steer choices towards sustainable solutions, use tools such as a tech radar. The alternatives considered in your ADRs should be the best solutions on the market and should therefore have been evaluated by your peers. This will increase your confidence in your architecture and help curb the creation of new debt.
Finally, be creative and implement automatic analysis solutions. If you stay close to the delivery chain, you will be very responsive to change and teams will keep control over the debt they create. New opportunities for control will open up, for example gamification to encourage teams to generate as little debt as possible.
Measure the actual debt to plan reducing it
How can we know the extent of our architectural debt when we start looking into it? This is one of the interesting points of Cyril's talk: talk to those who know.
Through your analysis, you will get a macro picture of your debt, but you will not see the hidden part of the iceberg. One of the practices implemented at ManoMano is the team leader interview. Through a questionnaire, team leaders reveal the debt that they know about in their area and that had escaped your attention.
You can then set up a scoring system to apply to each entry in your inventory. The objective of scoring is to map your architectural debt in order to make choices about its reduction. You can consider several dimensions:
- Is this something that will be decommissioned in the coming weeks or months?
- Is this critical to the business?
- Will this have an impact on nearby developments?
You can then lay out your architectural debt as a mosaic and use it as a visual management tool to assess your performance and act on what you have learned.
Plan-Do-Check-Act: other lean-inspired solutions and tools to manage debt
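The scoring described above could be sketched as follows. The three boolean fields mirror the three questions in the list; the weights and the rule that anything about to be decommissioned scores zero are assumptions made for illustration, not the scoring used at ManoMano.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DebtItem:
    """One entry of the debt inventory (hypothetical structure)."""

    name: str
    decommissioned_soon: bool   # will it disappear in the coming weeks or months?
    business_critical: bool     # is it critical to the business?
    impacts_nearby_work: bool   # will it affect nearby developments?


def score(item: DebtItem) -> int:
    """Higher score means the debt should be reduced sooner."""
    if item.decommissioned_soon:
        return 0  # assumed rule: no point reworking what is about to go away
    total = 1     # every remaining debt item has a baseline cost
    if item.business_critical:
        total += 2  # assumed weight
    if item.impacts_nearby_work:
        total += 1  # assumed weight
    return total


def prioritise(items: List[DebtItem]) -> List[DebtItem]:
    """Return the inventory sorted from highest to lowest score."""
    return sorted(items, key=score, reverse=True)
```

The absolute values matter less than the fact that every entry of the inventory is scored with the same rules, which is what makes the entries comparable.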
This analysis helps us better understand the principle of architectural debt, but it also shows that the culture of the teams must be geared towards producing quality. Through his feedback from ManoMano, Cyril showed how he was able to quantify the debt and which countermeasures were taken to reduce it. At Sipios, we share this culture of producing quality, and we draw on lean strategy and the Toyota Production System (TPS) to innovate and to improve the skills of our teams. To go further, you can use these resources, which complement the ideas listed in this article:
- You can reduce your debt through continuous improvement or batch improvement. This is one of the goals of Kaizen (bringing value closer to the customer) and can be a lever in your architectural debt reduction strategy. This is well described by Clément in his article How We Reduced a Request Time by 133 with Tracing and Elastic APM.
- If you want to go further in understanding what an ADR is, you can follow the recommendations of AWS on their page or watch the talk given by Antoine Grenard at Human Talks Paris (in French).
- Dantotsu: you may be interested in radical quality as a way to keep control over the debt. A detailed explanation was given by Woody, CTO of Sipios, and Flavian, at the FlowCon and Devoxx conferences.