First conceived by Netflix, chaos testing is the practice of intentionally trying to harm an application in production.
What is Chaos Testing?
Chaos testing, or chaos engineering, is a highly disciplined approach to testing a system’s integrity by proactively simulating and identifying failures in a given environment before they lead to unplanned downtime or a negative user experience. DevOps and IT teams that use chaos engineering need to set up monitoring tools and actively run chaos tests in a production environment. This way, teams can see realistic simulations of how their application or service responds to different pressures and stresses.
Chaos engineering is made up of five main principles:
- Ensure your system works and define a steady state. Define a “steady state,” or control, as a measurable system output that indicates normal working behavior (for example, an error rate well below one percent).
- Hypothesize the system’s steady state will hold. Once you have determined a steady state, hypothesize that it will continue in both the control and the experimental conditions.
- Ensure minimal impact to your users. During chaos testing, the goal is to actively try to break or disrupt the system, but it’s important to do so in a way that minimizes the blast radius and any negative impact to your users. Your team will be responsible for ensuring all tests are focused on specific areas and should be ready for incident response as needed.
- Introduce chaos. Once you are confident that your system is working, your team is prepared, and the blast radius is contained, you can start running your chaos testing applications. You’ll want to introduce different variables with the intention of simulating real-world scenarios, including everything from a server crash to malfunctioning hardware and severed network connections. Where possible, start in a non-production environment so you can monitor how your service or application would react to these events without directly affecting the live version and active users.
- Monitor and repeat. With chaos engineering, the key is to test consistently, introducing chaos to pinpoint any weaknesses within your system. Each experiment tries to disprove your hypothesis from step two; every time it succeeds, you have found a weakness to fix, building a more reliable system in the process.
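The five principles above can be sketched as a small experiment loop. This is a minimal illustration, not a real chaos tool: `simulated_service`, `measure_error_rate`, and `run_experiment` are hypothetical names, the service is a random stand-in, and the one percent threshold is taken from the steady-state example above.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_held: bool


def simulated_service(fail_probability: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for one request; returns True on success."""
    return rng.random() >= fail_probability


def measure_error_rate(fail_probability: float, requests: int,
                       rng: random.Random) -> float:
    """Send `requests` simulated requests and return the fraction that failed."""
    failures = sum(
        not simulated_service(fail_probability, rng) for _ in range(requests)
    )
    return failures / requests


def run_experiment(injected_failure: float, threshold: float = 0.01,
                   requests: int = 10_000, seed: int = 42) -> ExperimentResult:
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Principle 1: confirm the steady state before doing anything else.
    baseline = measure_error_rate(0.001, requests, rng)
    if baseline >= threshold:
        raise RuntimeError("System is not in steady state; do not inject chaos.")

    # Principles 3 and 4: inject the fault, but only on a tenth of the
    # traffic, to keep the blast radius small.
    chaos = measure_error_rate(0.001 + injected_failure, requests // 10, rng)

    # Principles 2 and 5: the hypothesis is that the steady state still holds.
    return ExperimentResult(baseline, chaos, hypothesis_held=chaos < threshold)


if __name__ == "__main__":
    print(run_experiment(injected_failure=0.2))
```

A result with `hypothesis_held=False` is the useful outcome here: it marks a weakness to fix before the experiment is repeated.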
What is Chaos Monkey and How Does it Work?
When Netflix started chaos testing their system during their move to AWS, they created different “chaos monkeys” to meet the need for continuous and consistent testing. These chaos monkeys were deployed into a system to introduce specific issues (network delays, instance failures, missing data segments, etc.) and simulate different real-world scenarios.
Each chaos monkey had its own name and job, including:
- Latency Monkey: Induces artificial delays
- Conformity and Security Monkeys: Hunt and kill instances that don’t adhere to best practices
- Janitor Monkey: Cleans up and removes unused resources
- Chaos Gorilla: Simulates an entire Amazon availability zone outage
Collectively, these and other chaos monkeys are now known as the Simian Army.
The Advantages and Disadvantages of Chaos Testing
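To make the idea of inducing artificial delays concrete, here is a minimal sketch of latency injection as a Python decorator. It is not Netflix's implementation; `inject_latency` and its parameters are hypothetical names chosen for illustration, and the `sleep` argument exists only so the delay can be swapped out in tests.

```python
import functools
import random
import time


def inject_latency(delay_seconds: float, probability: float,
                   rng=None, sleep=time.sleep):
    """Wrap a function so a fraction of calls are delayed before running."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so probability=1.0 always
            # delays and probability=0.0 never does.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.05, probability=0.1)
def fetch_profile(user_id: int) -> dict:
    """Hypothetical service call used only to demonstrate the decorator."""
    return {"id": user_id}


if __name__ == "__main__":
    print(fetch_profile(7))
```

Keeping `probability` low is one simple way to limit the blast radius, since only a fraction of calls see the delay.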
Chaos engineering is gaining popularity with some of the industry’s largest IT and DevOps teams. However, it’s not the right choice for every team and situation.
The advantages of chaos testing are:
- IT and DevOps teams can more quickly identify and resolve issues that other testing might not capture
- Unplanned downtime and outages are far less likely to occur, thanks to proactive and constant testing
- System integrity is strengthened
- It works well for large, complex systems (e.g., cloud-based applications and services) and for scaling up
However, chaos testing may not be right for:
- Smaller systems or desktop software
- Applications and services that are not mission-critical to the success of the business
- Application environments that don’t require 24×7 uptime via customer SLAs
- Systems in which failures are acceptable if resolved by the end of the day
How Does Chaos Testing Work in DevOps?
Chaos engineering fits well within a DevOps structure. Typically, chaos engineering falls on the shoulders of a DevOps engineer such as the Experience Assurance (XA) professional. This person is in charge of defining the different testing scenarios, executing the tests, and tracking the results. They are also responsible for ensuring minimal impact to the customer.
While testing, there’s a very fine line that the DevOps engineer must walk. On one side, there’s testing the system’s integrity by introducing chaos and trying to get it to crash (which is why it is safest to start in a non-production environment). On the other, there’s conducting unplanned or undisciplined tests that actually cause the system to crash and affect user experience.
How to Get Started with Chaos Testing
Chaos engineering has proven to be extremely effective at improving the integrity of very large and complex systems, offering benefits such as faster incident response times, less unplanned downtime, and greater flexibility when scaling up and out.
If you would like to learn more about chaos engineering and how you can begin implementing it within your organization, please do not hesitate to contact us online or start your 14-day free trial today.