16 December 2021

Five Popular Tools to Increase the Resilience of Your IT Infrastructure

Is your website slow to load, is your backend often down, or is your database slow to respond? Fortunately, problems like these can be resolved by implementing technical tools that make your IT infrastructure more resilient.

AUTHOR

Rùna Jacobsen

Manager

CO-AUTHOR

Sebastian Schwarze

Senior Consultant


In an IT infrastructure, a lot can go wrong. A database table may be missing, an application may run out of memory, or an open connection may be cut. These are challenges we have become accustomed to dealing with manually on a day-to-day basis.

In large companies, the IT infrastructure can quickly become very extensive, and there is a risk of being overwhelmed by errors and performance problems. It is therefore important to consider resilience in the infrastructure as a whole and in each system to prevent them from becoming fragile and unreliable.

In this article, we introduce you to five popular technical tools to make your IT infrastructure more resilient.


1. Caching

Problem: Your website is taking a long time to load due to many slow calls.

Solution: Build a cache

Caching is a widely used technique to resolve performance problems at specific points in an IT infrastructure. Basically, it consists of temporarily storing a response to a query so that it can be reused later. If a query is particularly heavy, it usually spends resources on retrieving data from various sources as well as on computational logic, and these resources can be saved by caching the response.

The technique is often used for heavy queries for information that is rarely updated - for example, a product catalogue. However, the cache needs memory and there is a risk that you do not always have the latest data. On the other hand, it increases speed and saves resources that would otherwise be spent on retrieving and computing the response. So, caching responses can, among other things, improve your website's TTI (time-to-interactive).
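As a rough illustration, the idea can be sketched in a few lines of Python. This is a minimal sketch, not a production cache: the class name TTLCache, the get_or_compute method and the ttl_seconds parameter are hypothetical names chosen for this example, and the expiry time is an assumed way of limiting how stale the stored data can become.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keeps each response for ttl_seconds so it can be reused (hypothetical helper)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (stored at, value)

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # cache hit: reuse the stored response
        value = compute()  # cache miss or expired entry: run the heavy query
        self._entries[key] = (now, value)
        return value
```

A call such as cache.get_or_compute("catalogue", load_catalogue), where load_catalogue stands for your own slow query, would only run the query when no fresh response is stored. A shorter ttl_seconds gives fresher data at the cost of more heavy queries.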


2. Circuit breakers

Problem: Your database sometimes runs out of hardware resources and becomes slow to respond. The connected systems keep sending queries to the database until it crashes completely.

Solution: Integrate circuit breakers against the database in the connected systems. This isolates the database in case of overload.

A circuit breaker - or a 'fuse' - is a mechanism that helps a sending system temporarily disconnect from a hard-pressed receiving system. The mechanism allows the receiving system to get some breathing space.

A typical example of the usefulness of circuit breakers is when a receiving system becomes too slow to respond or responds with many errors within a short period of time. Then the 'fuse' in the sending system will 'blow', so that the receiving system gets a few minutes' break. This gives the receiving system time to complete its work so that normal communication can resume.
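The 'fuse' behaviour can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the names CircuitBreaker, CircuitOpenError, max_failures and reset_after_seconds are hypothetical, the breaker counts consecutive errors only (it does not measure slow responses), and after the break it lets a single trial call through before deciding whether to resume.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the fuse has blown and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the timing can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("receiving system is being given a break")
            # The break is over: allow one trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # the fuse blows
            raise
        self.failures = 0  # a successful call resets the count
        return result
```

The sending system would wrap each database call, for example breaker.call(run_query) with run_query being your own function, and treat CircuitOpenError as a signal to fall back or retry later instead of sending more load.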

3. Rate limiters

Problem: An external system sends too many requests for information to your internal registry. This makes the entire system slow.

Solution: Integrate a rate limiter in the registry that tells the remote system when it is sending too many requests.

A common type of cyber attack is an attempt to overload IT systems with a large number of requests (known as denial-of-service attacks). In cases like these, you can protect your receiving system against overload by using rate limiters. This simple measure limits the rate of requests that a system will accept. If a sender reaches the limit, the system rejects further calls from that sender for a period of time.
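One simple way to express such a limit is a fixed time window per sender, sketched below. The names FixedWindowRateLimiter, max_requests and window_seconds are hypothetical, and the fixed-window strategy is only one of several possible approaches; it is used here because it is short to write down.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allows at most max_requests per sender in each window (hypothetical helper)."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the timing can be tested
        self._windows: Dict[str, Tuple[float, int]] = {}  # sender -> (window start, count)

    def allow(self, sender: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(sender, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the old window has ended, start a new one
        if count >= self.max_requests:
            self._windows[sender] = (start, count)
            return False  # limit reached: reject until the window ends
        self._windows[sender] = (start, count + 1)
        return True
```

When allow returns False, an HTTP-based system would typically answer with status code 429 (Too Many Requests), which tells the sender that it is sending too many requests.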


4. Redundancy and load-balancing

Problem: Your website sometimes cannot get in touch with the backend because it is often down.

Solution: Run multiple instances of the backend and hide these behind a load balancer.

A common architectural design principle is to provide multiple copies of the same component. This makes the system more resilient because the system continues to work when individual components fail. Components can be, for example, hardware resources – such as RAM or internet bandwidth – but also database copies or critical services. In other words, you can upscale your systems to handle fluctuations in supply and demand.

The related concept of load-balancing deals with how you distribute work across the components available. A classic example of how redundancy and load-balancing go hand in hand is in microservice architectures, where you often have multiple copies of the same microservice running. You would then typically set up a load balancer, which sends requests to the different copies of the same microservice. It distributes the work evenly and relieves copies that are lagging behind. If a copy goes down, the load balancer will redirect traffic to a copy that is still up. This way, the failure of a single copy cannot be felt on the outside of the load balancer.
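The redirecting behaviour can be sketched with a simple round-robin strategy. This is an illustration only: RoundRobinBalancer, mark_down, mark_up and pick are hypothetical names, the instances are plain strings, and a real load balancer would discover failed instances through health checks rather than being told about them.

```python
from typing import Dict, List


class RoundRobinBalancer:
    """Hands out instances in turn and skips those marked as down (hypothetical helper)."""

    def __init__(self, instances: List[str]):
        self.instances = list(instances)
        self.healthy: Dict[str, bool] = {name: True for name in self.instances}
        self._next = 0  # index of the instance to try next

    def mark_down(self, instance: str) -> None:
        self.healthy[instance] = False

    def mark_up(self, instance: str) -> None:
        self.healthy[instance] = True

    def pick(self) -> str:
        # Try each instance at most once, starting from the next one in turn.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances available")
```

With three copies of a backend, pick returns them in turn; once one copy is marked as down, requests keep flowing to the remaining two without the caller noticing.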

5. Chaos engineering

Problem: You are not sure if your infrastructure is resilient enough to withstand unexpected failures or disruptions.

Solution: Test resilience by deliberately introducing failures into the infrastructure.


In 2011, Netflix introduced a Chaos Monkey script into their production environment. The script is quite simple: At random time intervals, it turns off a random service in production. The idea is that the infrastructure should be resilient enough to handle this outage.

Netflix's idea later became the basis for the concept of chaos engineering, which refers to deliberately introducing failures and disruptions into one's IT systems to test resilience. Typically, scripts are run that, for example, close down services or connections. However, you should refrain from doing this in your production environment and use a test environment instead - unless you are as confident as Netflix is.
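A single step of such a script could look roughly like the sketch below. It is not Netflix's Chaos Monkey: the function chaos_step and the idea of passing in one stop function per service are assumptions made for this example, and in a real test environment each stop function would call whatever tooling you use to shut a service down.

```python
import random
from typing import Callable, Dict, Optional


def chaos_step(services: Dict[str, Callable[[], None]],
               rng: Optional[random.Random] = None) -> str:
    """Stops one randomly chosen service and returns its name (hypothetical helper)."""
    rng = rng or random.Random()
    victim = rng.choice(sorted(services))  # sorted for a stable order before choosing
    services[victim]()  # call the stop function registered for that service
    return victim
```

A scheduler would call chaos_step at random intervals and then check that the rest of the system still responds as expected; the returned name makes it possible to log which service was turned off.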


There are various other tools that can increase the resilience of your infrastructure. Whatever the challenge, there is probably a software design that can solve the problem.

However, one of the big questions is often how to implement these useful tools, as existing systems and infrastructures do not always have the capacity to support them. It may be a good idea to introduce the tools via an integration platform, such as an Enterprise Service Bus, which handles inter-system communication. In any case, it is important to increase resilience and thereby future-proof one's IT infrastructure, and for many companies one or more of the tools mentioned is the way forward.