IT Stability Health Radar

How do you keep IT applications stable in production? Instead of providing one more operations readiness checklist, I created an IT Stability Health Radar. It helps agile projects choose the next stability improvement based on the current situation. All non-functional requirements in this health radar are prioritized by level, so that you and your Product Owner can plan them properly.

Companies often have operations readiness checklists that describe what must be done before customers can use an IT system in production. These long lists of non-functional requirements are not easy to use and fulfill in agile working environments. Product owners and business stakeholders have a hard time accepting these requirements if they all arrive at the same time and might block a big part of the team for several weeks.

In this article I focus on non-functional requirements related to IT stability. I group and order them so that you learn where to start and which stability requirements to plan next. I decided not to build an operations or production readiness health radar, because that would include many more requirement domains such as security and legal. Security obviously has an impact on IT stability, but since it is such a big field, I decided to keep it out of scope (for now).

The non-functional requirements of my IT Stability Health Radar are grouped into 4 categories:

High Availability focuses on redundancy and the elimination of single points of failure.
Performance is about how your application handles many parallel requests, so that it does not fail during load peaks.
Resilient Code is about how developers should implement the application so that it stays resilient when unexpected events occur.
Operations focuses on the change process and on monitoring with alerting.

Each category has 5 levels. Level 1 contains the critical foundation of a stable IT system. Level 5 is very advanced and contains the requirements that a new system should tackle last. Depending on the maturity of your application and the expected stability, at some point you should address all requirements, including level 5.

Requirements within a category might depend on each other. For example, you cannot fully address the level 2 requirement (horizontal) scalability if your application cannot run redundantly on more than 1 server (the level 1 high availability requirement). Requirements in different categories do not depend on each other. Still, I advise developing non-functional requirements level by level, because each requirement's level is based on estimates of effort and stability impact.

Category: High Availability

Level 1

Redundancy
If your application is in production, it should run on more than one server to ensure high availability.

Level 2

Scalability
In case of occasional high load, your application must scale automatically to handle load peaks successfully.

Level 3

Empty level in this category.

Level 4

Multi Availability Zones
If your application is highly available (redundant servers), it should be deployed in multiple availability zones.

Level 5

Chaos monkey
Test the resilience of your application with chaos engineering. At least in your test environment, you could activate "chaos monkeys" to check how your application copes with failures.

Single point of failure
If your application is highly available, you should reduce its single points of failure. For example, if you have 2 servers behind 1 load balancer, the load balancer is your single point of failure.

Category: Performance

Level 1

Empty level in this category.

Level 2

Performance tests
Test whether your application can handle the expected load. You must know the limits of your application, that is, which load peaks it can handle.

Throttling (incoming)
Your application cannot scale "infinitely", so you should set a rate limit for incoming requests and calls.
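One common way to implement such a rate limit is a token bucket. The following is only a minimal sketch using plain Java; the class name TokenBucket, its parameters, and the injected clock are assumptions made for illustration, not part of any framework.

```java
import java.util.function.LongSupplier;

/** Minimal token-bucket rate limiter (illustrative sketch, hypothetical names). */
final class TokenBucket {
    private final long capacity;          // maximum burst size
    private final double refillPerNano;   // tokens added per nanosecond
    private final LongSupplier nanoClock; // injected clock, e.g. System::nanoTime
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity;           // start with a full bucket
        this.lastRefill = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Refill according to the elapsed time, but never above the capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code you would typically pass System::nanoTime as the clock and answer rejected requests with an appropriate error, for example HTTP status 429.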

Level 3

Caching
If high load is to be expected, caching should be added to your application. Think about proper caching strategies for your application and use cases. Find more about caching with Spring here.
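To make the idea of a caching strategy more concrete, here is a minimal sketch of a cache with a time-to-live (TTL) in plain Java. It is not the Spring caching mechanism; the class TtlCache and all of its members are hypothetical names chosen for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Minimal TTL cache (illustrative sketch, hypothetical names). */
final class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis; // injected clock, e.g. System::currentTimeMillis

    TtlCache(long ttlMillis, LongSupplier clockMillis) {
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached value, or loads and caches it if missing or expired. */
    V get(K key, Function<? super K, ? extends V> loader) {
        long now = clockMillis.getAsLong();
        Entry<V> entry = entries.compute(key, (k, cached) -> {
            if (cached != null && cached.expiresAtMillis > now) {
                return cached; // still fresh, reuse it
            }
            return new Entry<V>(loader.apply(k), now + ttlMillis);
        });
        return entry.value;
    }
}
```

The TTL is the main design decision here: a short TTL keeps data fresh, a long TTL takes more load off the backend. This sketch never evicts entries that are not requested again, which a real cache would have to handle.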

Resource limits
If your application can scale vertically, set resource limits. Limit the CPU and RAM consumption of containers (vertical scaling) to ensure that horizontal scaling is also possible.

Level 4

Performant database design
If your system uses a database, verify that the database design does not cause poor performance.

Level 5

Automatic Performance tests
Performance tests should be executed regularly by your CI/CD pipeline. Focus first on the critical use cases of your application. Once your automated performance tests are stable and do not produce false alerts, you can start adding other use cases.

Throttling (outgoing)
If your application sends requests to other systems, you could protect those systems by throttling outgoing requests.

If you work on requirements in the performance category, you should know the good practices for performance tuning.

Category: Resilient Code

Level 1

Reconnect
If your application loses the connection to another system, database, API, etc., it must not crash and must be able to reconnect. This event should be logged.

Error handling
In case of errors, your application must catch and handle them properly. Depending on the error type (business or technical) and the context, this could mean logging, retrying, or converting the error into your own error message.

Tests
Your application must be tested. You should automate your tests from the start, but the key message here is: untested software cannot be considered stable. If you do not test, your users will, and they will perceive the bugs they find as instabilities.

Level 2

Automatic regression test
Regression tests should be automated and executed by your CI/CD pipeline. For E2E regression tests, focus first on the critical use cases of your application.

Level 3

Empty level in this category.

Level 4

Retry
In case of an unexpected error while calling a third party, your system should retry. Retry logic must not flood other systems, so proper backoff settings must be in place. A retry should only be added if it makes sense in the context of the use case.
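A retry with exponential backoff could look roughly like the following sketch in plain Java. The helper Retry.withBackoff and its parameters are hypothetical; in a real project you might prefer an existing resilience library over hand-written retry code.

```java
import java.util.concurrent.Callable;

/** Minimal retry helper with exponential backoff (illustrative sketch, hypothetical names). */
final class Retry {
    private Retry() {}

    static <T> T withBackoff(Callable<T> call, int maxAttempts,
                             long initialDelayMillis, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Give up after the last attempt, and never retry an interruption.
                if (attempt >= maxAttempts || e instanceof InterruptedException) {
                    throw e;
                }
                Thread.sleep(delay);
                // Double the waiting time after each failure, up to the configured maximum.
                delay = Math.min(maxDelayMillis, delay * 2);
            }
        }
    }
}
```

This sketch retries on every exception. Whether that makes sense depends on the use case: a business error, such as a rejected order, usually should not be retried at all.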

Stable IO design
If your application accesses files or other data streams, proper handling must be implemented.
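What "proper handling" means depends on your application. As one possible interpretation, the sketch below reads a text file in plain Java and makes sure that the stream is always closed, that the amount of data read is bounded, and that a missing file does not cause an error. The class SafeFileReader and the decision to return an empty list for a missing file are assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Minimal defensive file reading (illustrative sketch, hypothetical names). */
final class SafeFileReader {
    private SafeFileReader() {}

    /** Reads at most maxLines lines; returns an empty list if the file does not exist. */
    static List<String> readLines(Path file, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails halfway.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            // The upper bound protects memory against unexpectedly large files.
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (NoSuchFileException e) {
            // Assumption for this sketch: a missing file is treated as "no data".
            return List.of();
        }
        return lines;
    }
}
```

Other IO errors are deliberately passed on to the caller, so that they can be logged and handled depending on the context.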

Level 5

Empty level in this category.

Category: Operations

Level 1

Deployment process
If your application is in production and your team operates it (DevOps team), you must have a ticket-based and well-defined deployment and change process.

Deployment verification
If your team changes production systems, you must be able to test or verify the success of your change.

More about stable deployments

Level 2

Alerting: Logs
Monitoring must cover critical log entries. Trigger an alert if unknown errors or well-known critical entries that require operations involvement are logged.

Alerting: Application health
Monitoring must cover the health status endpoint of your application. Trigger an alert if the health status is negative. Your application requires an API endpoint that returns its health status. The health status shows that your application is up and running. It might also report insights into your application, such as the connection status to the database or third-party APIs.
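Frameworks usually provide such an endpoint out of the box. To show the principle, here is a minimal sketch based only on the HTTP server that ships with the JDK. The class HealthEndpoint, the path /health, and the JSON layout are assumptions made for this example; check names are not JSON-escaped, so they must be simple identifiers.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

/** Minimal health status endpoint (illustrative sketch, hypothetical names). */
final class HealthEndpoint {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    HealthEndpoint addCheck(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs all checks; a check that throws counts as failed. */
    Map<String, Boolean> evaluate() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        checks.forEach((name, check) -> {
            boolean ok;
            try {
                ok = check.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            results.put(name, ok);
        });
        return results;
    }

    /** Renders the overall status and the result of each check as JSON. */
    static String toJson(Map<String, Boolean> results) {
        boolean up = !results.containsValue(false);
        String details = results.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + (e.getValue() ? "UP" : "DOWN") + "\"")
                .collect(Collectors.joining(","));
        return "{\"status\":\"" + (up ? "UP" : "DOWN") + "\",\"checks\":{" + details + "}}";
    }

    /** Serves GET /health with status 200 if all checks pass, 503 otherwise. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            Map<String, Boolean> results = evaluate();
            byte[] body = toJson(results).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(results.containsValue(false) ? 503 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Returning a non-2xx status code for a negative health status keeps the monitoring side simple, because it only needs to look at the HTTP status.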

Level 3

Alerting: Resources
Monitoring must cover CPU, memory, and storage consumption. Trigger an alert if consumption reaches a critical level.

Alerting: Certificates
Monitoring must cover the expiry date of (SSL) certificates. Trigger an alert before a certificate expires, and renew it in time.
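The core of such a check is a simple date comparison. The following sketch in plain Java assumes that you already have the certificate at hand as an X509Certificate; the class CertificateExpiryCheck and the warning period are hypothetical and only illustrate the idea.

```java
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;

/** Minimal certificate expiry check (illustrative sketch, hypothetical names). */
final class CertificateExpiryCheck {
    private CertificateExpiryCheck() {}

    /** Returns true if notAfter lies within the warning period, or is already in the past. */
    static boolean expiresWithin(Instant notAfter, Instant now, Duration warningPeriod) {
        return !now.plus(warningPeriod).isBefore(notAfter);
    }

    /** Convenience overload that reads the expiry date from the certificate itself. */
    static boolean expiresWithin(X509Certificate certificate, Instant now, Duration warningPeriod) {
        return expiresWithin(certificate.getNotAfter().toInstant(), now, warningPeriod);
    }
}
```

Choose the warning period so that it covers the time your renewal process needs, including approvals and deployment.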

Canary Deployment
You should use canary deployments as a progressive rollout of your application. The canary deployment splits traffic between an already-deployed version and a new version. The new version is rolled out to a subset of users before it is fully rolled out to all users. This way, problems in the new version have limited impact thanks to the traffic split, and rollback to the already-deployed version is quick and easy.
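The traffic split is usually done by your load balancer, service mesh, or deployment platform. To illustrate the principle only, the sketch below decides per user, in plain Java, whether a request goes to the new version. The class CanaryRouter and the hash-based assignment are assumptions made for this example.

```java
/** Minimal canary traffic split by user id (illustrative sketch, hypothetical names). */
final class CanaryRouter {
    private final int canaryPercent;

    CanaryRouter(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    /** Returns true if this user should be served by the new version. */
    boolean routeToCanary(String userId) {
        // floorMod maps every hash code, including negative ones, to a bucket from 0 to 99.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < canaryPercent;
    }
}
```

Because the decision depends only on the user id, a user always sees the same version during the rollout. Setting the percentage to 0 acts as a rollback, setting it to 100 completes the rollout.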

Operations manual
If your application is in production and your team operates it (DevOps team), you must have an operations manual. This manual describes operation processes (such as changes, deployments, database updates and backups, etc.) and the incident handling process. It also lists the stakeholders to be contacted in case of incidents or other issues.

Level 4

Alerting: Timeouts
Monitoring must cover the number of timeouts. Trigger an alert if too many timeouts occur within a defined time range.
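Monitoring tools usually let you configure such a rule directly. The logic behind it is a sliding time window, which the following plain Java sketch illustrates. The class TimeoutAlert, the window size, and the threshold are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sliding-window timeout counter (illustrative sketch, hypothetical names). */
final class TimeoutAlert {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis;
    private final int threshold;

    TimeoutAlert(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /** Records a timeout and returns true if an alert should be triggered. */
    synchronized boolean recordTimeout(long nowMillis) {
        timestamps.addLast(nowMillis);
        // Drop all timeouts that have fallen out of the time window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() > threshold;
    }
}
```

For example, new TimeoutAlert(60_000, 5) would signal an alert as soon as more than 5 timeouts have been recorded within the last minute.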

Alerting: Latency (incoming)
Monitoring should cover latency for incoming requests to your application. Trigger an alert if your application takes too long to handle incoming requests.

Alerting: Latency (outgoing)
Monitoring should cover latency for outgoing requests from your application. An alert could be triggered if other systems take too long to answer outgoing requests. In that case, solving the problem might not be in your team's hands.

Level 5

Alerting: Load
Monitoring must cover the current load. Trigger an alert if the current load is higher than the expected load within a defined time range.

How to use the IT Stability Health Radar?

If you work on an existing IT system and want to improve its stability, first check which level you are currently at. Start with the level 1 non-functional requirements in each category and check whether your system fulfils them. If not, I recommend starting with these requirements. Once you fulfil all requirements in one level, check the next level.

If you start developing a new system, check which stability level is required. For a prototype or a small MVP, you might not need much. But it is always a good idea to do things the right way from the beginning, so keep the requirements and levels in mind when you start development.
