IT Stability Health Radar
How do you keep IT applications stable in production? Instead of providing yet another operations readiness checklist, I created an IT Stability Health Radar. It helps agile projects choose the next stability improvement based on their current situation. All non-functional requirements in the radar are prioritized by level, so that you and your Product Owner can plan them properly.
The radar covers four categories:
- High Availability focuses on redundancy and the elimination of single points of failure.
- Performance is about how your application handles many parallel requests, so that it does not fail during load peaks.
- Resilient Code is about how developers should implement the application so that it stays resilient when unexpected events occur.
- Operations focuses on the change process and on monitoring with alerting.
Category: High Availability
Level 1
Redundancy
If your application is in production, it should run on more than one server to ensure high availability.
Level 2
Scalability
If load is occasionally high, your application must scale automatically to handle load peaks successfully.
Level 3
Level 4
Multi Availability Zones
If your application is highly available (redundant servers), it should be deployed across multiple availability zones.
Level 5
Chaos monkey
Test the resilience of your application with chaos engineering. You could activate "chaos monkeys", at least in your test environment, to see how your application copes with failures.
Single point of failure
If your application is highly available, you should reduce its single points of failure. For example, if you have 2 servers behind 1 load balancer, the load balancer is your single point of failure.
Category: Performance
Level 1
Level 2
Performance tests
Test whether your application can handle the expected load. You must know the limits of your application, that is, which load peaks it can handle.
Throttling (incoming)
Your application cannot scale "infinitely", so you should set a rate limit for incoming requests / calls.
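One common way to implement such a rate limit is a token bucket. The following is a minimal sketch in plain Java, not a recommendation for a specific library. The class name `TokenBucket`, its parameters, and the injectable clock are assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Minimal token-bucket rate limiter (hypothetical helper, not a library class). */
final class TokenBucket {
    private final long capacity;         // maximum burst size in requests
    private final double refillPerNano;  // tokens added per nanosecond
    private final LongSupplier clock;    // nanosecond time source, injectable for tests
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;          // start with a full bucket
        this.lastRefill = clock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Refill according to the elapsed time, never above the capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be `System::nanoTime`, and a rejected request could be answered with HTTP status 429 (Too Many Requests).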
Level 3
Caching
If high load is expected, add caching to your application. Think about proper caching strategies for your application and use cases. Find more about caching with Spring here.
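To make one caching strategy concrete, here is a small sketch of a size-bounded cache with least-recently-used eviction in plain Java. It does not use Spring's cache abstraction; the class `LruCache` and its `get(key, loader)` method are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Size-bounded cache that evicts the least recently used entry (illustrative only). */
final class LruCache<K, V> {
    private final Map<K, V> entries;

    LruCache(int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, or loads and caches it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);   // the expensive call, e.g. a database query
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Whether LRU eviction, a time-to-live, or explicit invalidation fits best depends on how often the underlying data changes in the respective use case.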
Resource limits
If your application can scale vertically, set resource limits. Limit the CPU and RAM consumption of containers (vertical scaling) to ensure that horizontal scaling is also possible.
Level 4
Performant database design
If your system uses a database, verify that the database design does not cause poor performance.
Level 5
Automatic Performance tests
Performance tests should be executed regularly by your CI/CD pipeline. Focus first on the critical use cases of your application. Once your automated performance tests are stable and do not produce false alerts, you can start adding other use cases.
Throttling (outgoing)
If your application sends requests to other systems, you could protect those systems by throttling outgoing requests.
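One simple form of outgoing throttling is a cap on the number of concurrent calls to a downstream system. The sketch below uses a `java.util.concurrent.Semaphore` for that. It limits concurrency rather than requests per second, and the class name `OutgoingLimiter` is an assumption made for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Caps the number of concurrent calls to another system (hypothetical helper). */
final class OutgoingLimiter {
    private final Semaphore permits;

    OutgoingLimiter(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free, otherwise fails fast without calling. */
    <T> T call(Supplier<T> remoteCall) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("too many concurrent outgoing calls");
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release();  // always hand the permit back, even on errors
        }
    }
}
```

Failing fast is only one option. Depending on the use case, queueing the request or waiting for a permit with a timeout may be the better choice.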
If you work on requirements in the performance category, you should know good practices for performance tuning.
Category: Resilient Code
Level 1
Reconnect
If your application loses the connection to another system, database, API, etc., it must not crash and must be able to reconnect. This event should be logged.
Error handling
Your application must catch errors and handle them properly. Depending on the error type (business, technical) and context, this could mean logging, retrying, or converting the error into your own error message.
Tests
Your application must be tested. You should automate your tests from the start, but the key message here is: untested software cannot be considered stable. If you do not test, your users will, and they will perceive the bugs they find as instabilities.
Level 2
Automatic regression test
Regression tests should be automated and executed by your CI/CD pipeline. For E2E regression tests, focus first on the critical use cases of your application.
Level 3
Level 4
Retry
If an unexpected error occurs while calling a third party, your system should retry. Retry logic must not flood other systems, so proper backoff settings must be in place. Add a retry only if it makes sense in the context of the use case.
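A retry with exponential backoff could look roughly like the sketch below. The class `Retry`, its method names, and the doubling strategy with an upper bound are assumptions for illustration. The sleep function is passed in so that the logic can be tested without waiting.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Retry helper with capped exponential backoff (illustrative sketch). */
final class Retry {
    private Retry() {}

    /** Delay before the given retry (1-based): base * 2^(retry - 1), capped at maxMillis. */
    static long backoffMillis(int retry, long baseMillis, long maxMillis) {
        // The shift is bounded so that small millisecond values cannot overflow.
        long delay = baseMillis << Math.min(retry - 1, 20);
        return Math.min(delay, maxMillis);
    }

    /** Calls the supplier up to maxAttempts times and rethrows the last error. */
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long baseMillis,
                             long maxMillis, LongConsumer sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait longer after each failure so the other system is not flooded.
                    sleeper.accept(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

A real implementation would usually retry only on errors that are worth retrying (for example timeouts, but not validation errors) and often adds random jitter to the delay.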
Level 5
Category: Operations
Level 1
Deployment process
If your application is in production and your team operates it (DevOps team), you must have a ticket-based, well-defined deployment/change process.
If your team changes production systems, you must be able to test or verify the success of your change.
Level 2
Alerting: Logs
Monitoring must cover critical log entries. Trigger an alert if unknown errors or well-known critical entries that require operations involvement are logged.
Alerting: Application health
Monitoring must cover the health status endpoint of your application. Trigger an alert if the health status is negative. Your application requires an API endpoint that returns its health status. The health status shows that your application is up and running. It might also report insights into your application, such as the connection status to the database or third-party APIs.
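The logic behind such a health status can be sketched as a set of named checks that are aggregated into one overall result. The class `HealthCheck` and the check names are hypothetical, and the HTTP wiring is left out. Frameworks such as Spring Boot Actuator provide a ready-made health endpoint, so this sketch only illustrates the idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Aggregates named checks into one health status (illustrative sketch). */
final class HealthCheck {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    HealthCheck register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs all checks; a check that throws counts as DOWN. */
    Map<String, String> details() {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false;
            }
            result.put(entry.getKey(), up ? "UP" : "DOWN");
        }
        return result;
    }

    /** The application is healthy only if every check reports UP. */
    boolean isUp() {
        return !details().containsValue("DOWN");
    }

    /** HTTP status a health endpoint could return: 200 when healthy, 503 otherwise. */
    int httpStatus() {
        return isUp() ? 200 : 503;
    }
}
```

Monitoring can then poll the endpoint and alert on any non-200 response, while the details help operations see which dependency is affected.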
Level 3
Alerting: Resources
Monitoring must cover CPU, memory, and storage consumption. Trigger an alert if consumption reaches a critical level.
Monitoring must cover the expiry date of (SSL) certificates. Trigger an alert before a certificate expires and renew it in time.
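The certificate check comes down to a date comparison with a warning period. Here is a minimal sketch, assuming the expiry date has already been read from the certificate (in Java, for example, via `X509Certificate.getNotAfter().toInstant()`). The class name and the warning period are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Decides whether a certificate expiry alert is due (illustrative sketch). */
final class CertificateExpiryCheck {
    private CertificateExpiryCheck() {}

    /**
     * Returns true once the current time has reached the warning period
     * before the expiry date, and stays true after the certificate has expired.
     */
    static boolean shouldAlert(Instant notAfter, Instant now, Duration warnBefore) {
        Instant alertFrom = notAfter.minus(warnBefore);
        return !now.isBefore(alertFrom);
    }
}
```

The warning period should be long enough to cover the renewal process, including any approvals it needs.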
Canary Deployment
You should use canary deployments as a progressive rollout of your application. A canary deployment splits traffic between an already-deployed version and a new version. The new version is rolled out to a subset of users before it is rolled out to all users. This way, problems in the new version have limited impact thanks to the traffic split, and rollback to the already-deployed version is quick and easy.
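The traffic split itself is usually handled by a load balancer or the deployment platform. Purely to illustrate the idea, the sketch below assigns each user to a stable bucket and routes a configurable percentage of the buckets to the new version. The class `CanaryRouter` and the hash-based bucketing are assumptions, not a description of a specific tool.

```java
/** Routes a fixed percentage of users to the canary version (illustrative sketch). */
final class CanaryRouter {
    private final int canaryPercent;

    CanaryRouter(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    /** The same user always lands in the same bucket, so the experience stays consistent. */
    boolean routeToCanary(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);  // bucket in 0..99
        return bucket < canaryPercent;
    }
}
```

Raising the percentage step by step only adds users to the canary group, because a user routed to the new version at 10 percent is also routed there at 50 percent. Setting it to 0 acts as the rollback.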
If your application is in production and your team operates it (DevOps team), you must have an operations manual. This manual describes operations processes (such as changes, deployments, database updates and backups) and the incident handling process. It also lists the stakeholders to contact in case of incidents or other issues.
Level 4
Alerting: Timeouts
Monitoring must cover the number of timeouts. Trigger an alert if too many timeouts occur within a defined time range.
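"Too many timeouts in a defined time range" can be expressed as a sliding-window counter. Monitoring tools normally provide this as an alert rule, so the following sketch only shows the underlying logic. The class `TimeoutAlert`, the window length, and the threshold are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window counter for timeouts (illustrative sketch). */
final class TimeoutAlert {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis;
    private final int threshold;

    TimeoutAlert(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    /**
     * Records a timeout at the given time and returns true if more than
     * `threshold` timeouts fall within the current window.
     */
    synchronized boolean recordTimeout(long nowMillis) {
        timestamps.addLast(nowMillis);
        // Drop entries that have fallen out of the window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() > threshold;
    }
}
```

With a window of one minute and a threshold of 20, for example, the 21st timeout within a minute would trigger the alert.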
Monitoring should cover the latency of incoming requests to your application. Trigger an alert if your application takes too long to handle incoming requests.
Alerting: Latency (outgoing)
Monitoring should cover the latency of outgoing requests from your application. An alert could be triggered if other systems take too long to answer outgoing requests. In that case, solving the problem might not be in your team's hands.
Level 5
Alerting: Load
Monitoring must cover the current load. Trigger an alert if the current load is higher than the expected load within a defined time range.