Azure Chaos Studio: The Art of Predicting Disruption in the Cloud

At the beginning of November 2021, Microsoft made many announcements at Ignite, and one that caught my attention was Azure Chaos Studio. Azure Chaos Studio is a managed service for improving resilience by injecting faults into your Azure applications.

Chaos engineering first became relevant at Internet companies that were pioneering large-scale distributed systems. For example, in 2010, Netflix adopted the Chaos methodology as it moved from physical infrastructure to the AWS cloud, using it to make its services more resilient. The goal was to avoid downtime by making sure that the failure of an individual component in the cloud architecture would not compromise the availability of the entire system.

Inject Failure to Prevent Failure

Chaos engineering uses failure injection to proactively test how an application or system responds under stress, so you can identify and fix weaknesses before they turn into costly outages. It is essential to highlight that failure injection is not random or purposeless. Instead, it is a well-defined and formalized scientific method of experimentation that provides several benefits to distributed systems and microservices. Failure injection is categorized into five levels: resource; network and dependencies; application, process, and service; infrastructure; and people.

The experiments can be applied to four kinds of knowledge: things you are aware of and understand, things you are aware of but don’t fully understand, things you understand but are not aware of, and things you are neither aware of nor fully understand. There are no limits to Chaos experiments. The type of tests you run depends on the architecture of your distributed system and your business goals. The best-known Chaos tests simulate the failure of a micro-component, turn a server off to see how a dependency reacts, simulate a high CPU load, introduce latency between services, emulate I/O errors, and produce sudden traffic spikes.
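To make two of these tests concrete (latency between services and the failure of a dependency), here is a minimal Python sketch. It is not Azure Chaos Studio code and does not use any Azure API. The names with_chaos, get_price, get_price_with_fallback, and run_experiment are hypothetical, and the "service" is an in-process function standing in for a real downstream call. The hypothesis being tested is that the caller degrades to a fallback value instead of crashing when the dependency fails.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call is delayed and may fail (hypothetical helper)."""
    rng = rng if rng is not None else random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency between "services"
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return func(*args, **kwargs)

    return wrapper


def get_price(item):
    """Stand-in for a call to a downstream service (assumption)."""
    return {"book": 12.5}.get(item, 0.0)


def get_price_with_fallback(fetch, item, default=-1.0):
    """Behaviour under test: return a default if the dependency fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default


def run_experiment(calls=100, failure_rate=0.3, seed=42):
    """Count how many calls succeeded and how many used the fallback."""
    flaky = with_chaos(get_price, failure_rate=failure_rate, rng=random.Random(seed))
    outcomes = {"ok": 0, "fallback": 0}
    for _ in range(calls):
        price = get_price_with_fallback(flaky, "book")
        outcomes["ok" if price == 12.5 else "fallback"] += 1
    return outcomes


if __name__ == "__main__":
    print(run_experiment())
```

If every call lands in either "ok" or "fallback" and no exception escapes, the hypothesis holds for this fault. In a real experiment, the same idea applies to actual infrastructure, with monitoring in place of the counters.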

Controlled Chaos

The Chaos principles should be applied continuously to help you expose issues early. Ideally, you would run experiments when deploying new code, adding dependencies, observing changes in usage patterns, and mitigating problems. According to the 2021 State of Chaos Engineering report, the most common outcomes of Chaos engineering are increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs shipped to product, and fewer outages. In addition, teams who frequently run Chaos engineering experiments are more likely to have availability above 99.9%.
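To put the 99.9% figure in perspective, the short Python sketch below converts an availability percentage into a yearly downtime budget and shows how a lower MTTR raises availability. The function names are hypothetical, and the second formula assumes the common steady-state approximation availability = MTBF / (MTBF + MTTR), where MTBF (mean time between failures) is a term not taken from the report.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in the period at a given availability."""
    return period_hours * (1 - availability_pct / 100)


def availability_pct(mtbf_hours, mttr_hours):
    """Steady-state availability, assuming MTBF / (MTBF + MTTR)."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # 99.9% availability leaves roughly 8.76 hours of downtime per year.
    print(round(downtime_budget_hours(99.9), 2))
    # Halving MTTR from 2 hours to 1 hour with the same MTBF improves availability.
    print(availability_pct(999, 2), availability_pct(999, 1))
```

Under that approximation, shortening the time to detect and resolve an incident improves availability even when failures are just as frequent.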

Nowadays, large tech companies such as LinkedIn, Meta (formerly Facebook), Google, Microsoft, and Amazon, as well as more traditional industries like banking and finance, are practicing Chaos engineering to better understand their distributed systems and microservice architectures and to ensure the reliability of every new feature.

As a Microsoft Azure Expert MSP, we have Cloud Solution Architects and Engineers with the knowledge and experience to help you reduce disruptions by improving the reliability and availability of your Azure applications through Chaos engineering experiments. Let’s get in touch, dive into the Chaos engineering conversation, and see how it can help you evolve your cloud deployments.

Leandro Rocha

Leandro is a lifelong learner with over 20 years of experience in the IT field, with expertise in various IT operations. He is passionate about cloud technologies, and over the past several years, he has been helping organizations migrate to and adopt cloud services.