
When we build commercial software, it is essential that it works well and stays stable. Just as a fresh graduate faces various challenges in the real world, software must be ready for unexpected problems. This article explores the concept of stability in computer systems and provides insights into building resilient software.
Enterprise software must adopt a cynical approach, anticipating and preparing for potential failures. This mindset involves:
Expecting bad things to happen and never being surprised when they do.
Not trusting even itself, so it puts up internal barriers to protect itself from failures.
Not getting too intimate with other systems, because it could get hurt :P
Poor stability in software systems can lead to:
Tangible loss, such as lost customers and loss of money.
Intangible loss, including damage to brand reputation.
A resilient system maintains functionality even when faced with:
Transient impulses (sudden shocks)
Persistent stresses (prolonged pressures)
Component failures
Resilience is not just about the application server staying up and running; it is about users still being able to get their work done.
Impulses: Rapid shocks to the system (e.g., flash sales, exam result checks)
Stresses: Prolonged forces applied to the system (e.g., resource exhaustion)
Both sudden impulses and excessive stress can trigger catastrophic failure. The original trigger, the way the failure spreads to the rest of the system, and the resulting damage are collectively called a failure mode.
Some common triggers are improper exception handling, blocked threads, and connection pool exhaustion.
No matter what, our system will have a variety of failure modes.
We have to first accept the inevitability of failures and bugs in the system.
Create "safe" failure modes to contain damage (a simple example: wrap risky calls in try/catch so that an error stays local)
Protect critical system features
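As a minimal sketch of a "safe" failure mode, assume a page that shows a critical order total next to optional recommendations. The function names (`fetch_recommendations`, `render_order_page`) and the simulated outage are hypothetical, chosen only for illustration. The failure of the optional feature is caught at its boundary, so the critical feature keeps working.

```python
import logging

logger = logging.getLogger("storefront")


def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical non-critical call; it always fails here to simulate an outage."""
    raise ConnectionError("recommendation service unavailable")


def render_order_page(user_id: str, order_total: float) -> dict:
    """Build the page; the critical part (order total) must survive optional failures."""
    try:
        recommendations = fetch_recommendations(user_id)
    except (ConnectionError, TimeoutError) as exc:
        # Contain the failure at the boundary of the optional feature:
        # log it and degrade to an empty list instead of failing the whole page.
        logger.warning("recommendations unavailable: %s", exc)
        recommendations = []
    return {"order_total": order_total, "recommendations": recommendations}


page = render_order_page("user-42", 99.5)
print(page)
```

Only the exception types the optional call is expected to raise are caught, so unrelated bugs still surface instead of being silently swallowed.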
Underneath every system outage, there is a chain of events. When one thing goes wrong, it can cause other problems.
A failure at one point or layer increases the probability of other failures. For example, if the database gets slow, the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
Tight coupling and complexity can accelerate the propagation of failures to other systems.
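A rough back-of-the-envelope sketch shows why coupled layers do not fail independently. It assumes a hypothetical application server with a fixed pool of 20 worker threads, where each request holds one thread for the duration of its database call. Under that assumption, Little's Law bounds throughput at pool size divided by latency. The numbers are illustrative, not measurements.

```python
def max_throughput(pool_size: int, latency_seconds: float) -> float:
    """Upper bound on requests/second when each request holds one worker
    for `latency_seconds` (Little's Law: concurrency = throughput * latency)."""
    return pool_size / latency_seconds


POOL_SIZE = 20  # assumed number of worker threads

healthy = max_throughput(POOL_SIZE, 0.05)   # database answers in 50 ms
degraded = max_throughput(POOL_SIZE, 2.0)   # database slows down to 2 s

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
```

If incoming traffic stays the same while the bound drops, the excess requests pile up in the application layer. That is one plausible way a slow database turns into memory pressure on the application servers.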
When designing systems, consider various scenarios:
I/O Connection issues
Response time anomalies
DB Disconnections
Unresponsive endpoints
And ask, “What are all the ways this can go wrong?”
Think about the different types of impulse and stress that can be applied:
What if I can’t make the initial connection?
What if it takes ten minutes to make the connection?
What if I can make the connection and then it gets disconnected?
What if I can make the connection and I just can’t get any response from the other end?
What if it takes two minutes to respond to my query?
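The questions above can be turned into explicit, bounded outcomes. The following is a minimal sketch using Python's standard `socket` module. The function name `query`, the default timeout values, and the result strings are assumptions made for illustration. Each question maps to a branch that returns quickly instead of waiting indefinitely.

```python
import socket


def query(host: str, port: int, payload: bytes,
          connect_timeout: float = 3.0, read_timeout: float = 5.0) -> str:
    """Send `payload` and classify what happened instead of waiting forever."""
    try:
        # Bounded wait for the initial connection.
        conn = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError as exc:  # refused, unreachable, or connect timed out
        return f"connect failed: {exc.__class__.__name__}"

    with conn:
        conn.settimeout(read_timeout)  # bounded wait for each send/recv call
        try:
            conn.sendall(payload)
            reply = conn.recv(4096)
        except socket.timeout:
            return "no response within read timeout"
        except OSError as exc:  # e.g. connection reset mid-conversation
            return f"disconnected: {exc.__class__.__name__}"

    if not reply:
        return "peer closed the connection without replying"
    return f"ok: {len(reply)} bytes"
```

A caller could then decide per outcome whether to retry, fall back, or report the error, for example `status = query("127.0.0.1", 5432, b"ping", read_timeout=2.0)`, where the host and port are placeholders.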
Recurring patterns of failure are called anti-patterns.
Below is the list of anti-patterns which we will discuss in detail in upcoming articles.
Integration Points
Chain Reactions
Cascading Failures
Blocked Threads
Attacks of Self-Denial
Slow Responses
Unbounded Result Sets
Some common solutions to the patterns of failure are called Stability patterns.
Use Timeouts
Circuit Breaker
Bulkheads
Steady State
Fail Fast
Test Harness
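To give a feel for one item in this list, here is a minimal, single-threaded sketch of a circuit breaker. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for illustration. It is not a production implementation and has no locking for concurrent callers. After a number of consecutive failures the breaker "opens" and rejects calls immediately, and after a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_price, item_id)`, where `fetch_price` stands for any hypothetical remote call. While the circuit is open, callers get `CircuitOpenError` immediately instead of tying up a thread on a dependency that is known to be failing.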
Stability patterns are design and architecture patterns that defeat the anti-patterns.
We should design our architecture to counter failure modes
Our aim is to contain damage and preserve partial functionality
These patterns do not prevent failures; they minimise the damage and preserve partial functionality instead of allowing total crashes.
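Containing damage while keeping the rest of the system usable can be sketched with a simple bulkhead: each feature gets its own fixed number of concurrent slots, so an overloaded feature cannot consume capacity that others need. The class below is a hypothetical illustration built on `threading.BoundedSemaphore` from the standard library. The names and limits are assumptions.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than queueing callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            # Always free the slot, even if func raised.
            self._slots.release()


# Separate compartments: a flood of report requests cannot starve checkout.
reports = Bulkhead("reports", max_concurrent=2)
checkout = Bulkhead("checkout", max_concurrent=10)
```

Rejecting the extra report requests is a partial loss of functionality, but checkout keeps its own slots and stays available.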
Building stable and resilient enterprise software requires a proactive approach to anticipating and managing failures. By adopting a cynical mindset, understanding failure modes, and implementing stability patterns, developers can create systems that withstand the unpredictable nature of real-world operations.
Stay tuned and follow me for more articles about architecture stability! I'll be writing in detail about stability patterns and Anti-patterns in the coming weeks.
Thanks, peace.
Release It! (Michael T. Nygard) -> https://www.amazon.com/Release-Design-Deploy-Production-Ready-Software/dp/1680502395
Photo by Tima Miroshnichenko: https://www.pexels.com/photo/close-up-view-of-system-hacking-in-a-monitor-5380664/