Building Resilient Systems: The Art of Stability in Enterprise Software

Introduction

When we build commercial software, it has to work correctly and stay stable under real-world conditions. Just as a fresh graduate faces unexpected challenges in the real world, software must be ready for problems nobody planned for. This article explores what stability means in computer systems and offers some insights into building resilient software.

The Need for Cynical Software

Enterprise software must adopt a cynical approach, anticipating and preparing for potential failures. Cynical software:

Expects bad things to happen and is never surprised when they do.
Doesn’t even trust itself, so it puts up internal barriers to protect itself from failures.
Refuses to get too intimate with other systems, because it could get hurt :P

The Cost of Instability

Poor stability in software systems can lead to:

Tangible losses, such as lost customers and lost revenue.
Intangible losses, including damage to brand reputation.

What Are Stability and Resilience?

A resilient system maintains functionality even when faced with:

Transient impulses (sudden shocks)
Persistent stresses (prolonged pressures)
Component failures

Resilience is not just about the application server staying up and running; it is about users still being able to get their work done.

Understanding Impulses and Stresses:

Impulses: Rapid shocks to the system (e.g., flash sales, exam result checks)
Stresses: Prolonged forces applied to the system (e.g., resource exhaustion)

What Are Failure Modes?

Sudden impulses and excessive strain can both trigger catastrophic failure. The original trigger, the way the failure spreads to the rest of the system, and the resulting damage are collectively called a failure mode.
Common triggers include improper exception handling, blocked threads, and connection pool exhaustion.
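
To make connection pool exhaustion concrete, here is a minimal sketch of a fixed-size pool whose callers give up after a timeout instead of blocking forever. The names BoundedPool and PoolExhausted are hypothetical, and the "connections" are plain strings standing in for real ones.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Hypothetical fixed-size pool; the 'connections' are plain objects here."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)

    def acquire(self, timeout_s):
        # Queue.get(timeout=...) raises queue.Empty instead of blocking forever,
        # so a caller thread cannot stay stuck waiting for a connection.
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolExhausted(f"no connection free after {timeout_s}s") from None

    def release(self, conn):
        self._free.put(conn)


pool = BoundedPool(["conn-1", "conn-2"])
first = pool.acquire(timeout_s=0.1)
second = pool.acquire(timeout_s=0.1)

try:
    pool.acquire(timeout_s=0.1)  # pool is empty, so this gives up after 0.1s
    exhausted = False
except PoolExhausted:
    exhausted = True

pool.release(first)
reused = pool.acquire(timeout_s=0.1)  # succeeds again once a connection is returned
```

Without the timeout, the third caller would wait indefinitely, which is how an exhausted pool turns into blocked threads further up the stack.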

Designing for Failure:

No matter what, our system will have a variety of failure modes.

Accept that failures and bugs in the system are inevitable.
Create "safe" failure modes to contain damage (a simple example: catch exceptions around non-critical calls so that one failed call does not bring down the whole request)
Protect critical system features
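
As a rough illustration of a "safe" failure mode, the sketch below lets a non-critical feature fail quietly while the critical one still works. The storefront scenario and the functions fetch_cart, fetch_recommendations, and render_page are all hypothetical, made up for this example.

```python
import logging

logger = logging.getLogger("storefront")


def fetch_recommendations(user_id):
    """Hypothetical non-critical call; it always fails here to simulate an outage."""
    raise ConnectionError("recommendation service unavailable")


def fetch_cart(user_id):
    """Hypothetical critical call; assumed to be healthy in this sketch."""
    return ["book", "pen"]


def render_page(user_id):
    # Critical feature: no fallback, errors are allowed to propagate.
    cart = fetch_cart(user_id)
    try:
        recommendations = fetch_recommendations(user_id)
    except (ConnectionError, TimeoutError) as exc:
        # Safe failure mode: log the problem and degrade to an empty list
        # instead of failing the whole page.
        logger.warning("recommendations unavailable: %s", exc)
        recommendations = []
    return {"cart": cart, "recommendations": recommendations}


page = render_page(user_id=42)
```

The try/except is deliberately narrow: it wraps only the non-critical call and catches only the errors it knows how to degrade from, so genuine bugs elsewhere still surface.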

The Chain of Failure:

Underneath every system outage, there is a chain of events. When one thing goes wrong, it can cause other problems.
A failure at one point or layer actually increases the probability of other failures. For example, if the database gets slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
Tight coupling and complexity can accelerate the failure propagation to other systems.
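
A small calculation shows why coupling matters. Every probability below is a made-up illustrative value, not a measurement; the point is only how the joint probability changes once one failure makes the other more likely.

```python
# All probabilities are hypothetical values chosen for illustration only.
p_db_slow = 0.01            # chance the database is slow in a given hour
p_oom_normal = 0.001        # chance of an app-server out-of-memory error when the database is healthy
p_oom_given_db_slow = 0.30  # chance of the same error when the database is slow

# If the two events were independent, the joint probability would be a plain product.
p_both_if_independent = p_db_slow * p_oom_normal

# With coupled layers, the conditional probability applies instead.
p_both_coupled = p_db_slow * p_oom_given_db_slow

ratio = p_both_coupled / p_both_if_independent
print(f"independent: {p_both_if_independent:.6f}")
print(f"coupled:     {p_both_coupled:.6f}")
print(f"coupled is about {ratio:.0f}x more likely")
```

Under these assumed numbers, treating the layers as independent would badly underestimate how often both failures occur together.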

Preparing for Failure:

When designing systems, consider various scenarios:

I/O Connection issues
Response time anomalies
DB Disconnections
Unresponsive endpoints

And ask, “What are all the ways this can go wrong?”
Think about the different types of impulse and stress that can be applied:

What if I can’t make the initial connection?
What if it takes ten minutes to make the connection?
What if I can make the connection and then it gets disconnected?
What if I can make the connection and I just can’t get any response from the other end?
What if it takes two minutes to respond to my query?
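
One way to answer these questions in code is to put an explicit time limit on every step. The sketch below uses Python's standard socket module; the function name query, its status strings, and the default timeout values are assumptions chosen for illustration.

```python
import socket


def query(host, port, payload, connect_timeout_s=3.0, read_timeout_s=5.0):
    """Send payload and return (status, data) without waiting indefinitely."""
    try:
        # Bounds the initial connection attempt.
        sock = socket.create_connection((host, port), timeout=connect_timeout_s)
    except socket.timeout:
        return ("connect_timeout", b"")
    except OSError:
        return ("connect_failed", b"")  # for example, connection refused

    with sock:
        # Bounds each individual send/recv call (not the total exchange).
        sock.settimeout(read_timeout_s)
        try:
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            return ("read_timeout", b"")  # connected, but no response in time
        except OSError:
            return ("disconnected", b"")  # connection dropped mid-exchange
        if not data:
            return ("disconnected", b"")  # peer closed without replying
        return ("ok", data)
```

Each status maps to one of the questions above, so the caller can decide what to do in each case rather than hanging on a call that may never return.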

Stability Patterns and Anti-Patterns

Recurring patterns of failure are called anti-patterns.

Below is a list of anti-patterns that we will discuss in detail in upcoming articles.

Integration Points
Chain Reactions
Cascading Failures
Blocked Threads
Attacks of Self-Denial
Slow Responses
Unbounded Result Sets

Common solutions to these patterns of failure are called stability patterns.

Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Test Harness

Stability patterns are design and architecture patterns that counter the anti-patterns.

We should design our architecture to counter failure modes
Our aim is to contain damage and preserve partial functionality
These patterns do not prevent failures; they limit the damage so that a failure does not become a total crash
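
To give a flavour of one stability pattern before the detailed articles, here is a minimal circuit breaker sketch. It is a simplified, single-threaded illustration, not a production implementation; the class name, the parameter names, and the default threshold and timeout are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock       # injectable so the timing can be tested
        self._failures = 0        # consecutive failures seen so far
        self._opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast instead of calling the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call as breaker.call(some_function, args). While the circuit is open, callers get an immediate CircuitOpenError, which contains the damage without preventing the underlying failure.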

Conclusion

Building stable and resilient enterprise software requires a proactive approach to anticipating and managing failures. By adopting a cynical mindset, understanding failure modes, and implementing stability patterns, developers can create systems that withstand the unpredictable nature of real-world operations.

Stay tuned and follow me for more articles about architecture stability! I'll be writing in detail about stability patterns and Anti-patterns in the coming weeks.

Thanks, peace.

References

Release It! Design and Deploy Production-Ready Software, by Michael T. Nygard: https://www.amazon.com/Release-Design-Deploy-Production-Ready-Software/dp/1680502395
Photo by Tima Miroshnichenko: https://www.pexels.com/photo/close-up-view-of-system-hacking-in-a-monitor-5380664/
