Notes & TILs
Chaos Engineering
The practice of experimenting on a system to build confidence in its ability to withstand harsh conditions. This is done by deliberately breaking parts of the system in controlled experiments and observing how it responds.
Although chaos engineering is mostly associated with large-scale distributed systems, its principles can also be applied to smaller systems to make them more resilient.
These experiments follow a few steps:
  1. Define the steady state of the system: a measurable output that represents normal or expected behaviour (e.g. latency, throughput).
  2. Segregate environments for experimenting:
    • Normal (control) group: the main production app
    • Experimental group
  3. Introduce variables that reflect real-world events (server crash, hard-drive malfunction, large payloads, limited network connectivity).
The harder it is to disrupt the steady state, the more confidence we have in the behaviour of our system.
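
The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not how any real tool works: the latency model, the 200 ms p95 threshold, and every function name (`simulate_latencies`, `p95`, `run_experiment`) are hypothetical. The steady state is "p95 latency stays under the threshold", the control group runs without faults, and the experimental group has a slow dependency injected into a fraction of requests.

```python
import random


def simulate_latencies(n_requests, fault_rate, rng):
    """Return simulated latencies in ms; a fraction of requests hit an injected fault."""
    latencies = []
    for _ in range(n_requests):
        latency = rng.gauss(100, 10)  # assumed baseline: ~100 ms per request
        if rng.random() < fault_rate:
            latency += 400  # injected variable: a slow dependency adds 400 ms
        latencies.append(max(latency, 0.0))
    return latencies


def p95(values):
    """95th percentile using a simple nearest-rank style index."""
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]


def run_experiment(fault_rate, threshold_ms=200.0, n_requests=1000, seed=42):
    """Compare a control group with an experimental group against the steady state."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    control = simulate_latencies(n_requests, 0.0, rng)
    experimental = simulate_latencies(n_requests, fault_rate, rng)
    experimental_p95 = p95(experimental)
    return {
        "control_p95": p95(control),
        "experimental_p95": experimental_p95,
        "steady_state_held": experimental_p95 <= threshold_ms,
    }


if __name__ == "__main__":
    for rate in (0.0, 0.5):
        print(rate, run_experiment(rate))
```

In this model, a fault that hits half of the requests pushes p95 well past the threshold, so the steady state is broken and the experiment has found a weakness worth fixing.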

Advanced principles:

  1. Run experiments in prod 💣: Systems behave differently depending on the environment, and since an experimental group may not see exactly the same usage as production, it becomes necessary to experiment in production.
  2. Automate experiments to run continuously: Running experiments manually is time-consuming and costly.
  3. Minimize blast radius: Experimenting in production can cause unnecessary customer pain, so the impact of each experiment should be kept small and contained.
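
A rough sketch of how automation and a limited blast radius can fit together, assuming hypothetical `inject`, `restore`, and `read_error_rate` hooks supplied by the caller (none of these come from a real tool). Each round targets only a small percentage of instances, always rolls the fault back, and stops the run as soon as the observed error rate crosses an abort threshold. The 5% and 2% defaults are arbitrary.

```python
import random


def pick_targets(instances, max_percent, rng):
    """Pick at most max_percent of the instances (but at least one) for fault injection."""
    limit = max(1, len(instances) * max_percent // 100)
    return rng.sample(instances, limit)


def run_rounds(instances, inject, restore, read_error_rate, rounds,
               max_percent=5, abort_above=0.02, seed=0):
    """Run repeated experiments, stopping early if customers start to feel pain."""
    rng = random.Random(seed)
    history = []
    for round_number in range(rounds):
        targets = pick_targets(instances, max_percent, rng)
        inject(targets)  # hypothetical hook: break only the chosen instances
        try:
            observed = read_error_rate()  # hypothetical hook: e.g. a metrics query
        finally:
            restore(targets)  # always roll back, even if the metrics read fails
        aborted = observed > abort_above
        history.append({
            "round": round_number,
            "targets": len(targets),
            "error_rate": observed,
            "aborted": aborted,
        })
        if aborted:
            break  # blast radius exceeded: stop the whole run
    return history
```

In practice a scheduler (cron, a CI job, or the chaos tool itself) would call something like `run_rounds` on a timer; the early abort is what keeps a continuously running experiment from turning into an outage.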

Tools

  • Netflix's chaosmonkey is a resiliency tool that helps applications tolerate random instance failures.
  • pumba is a chaos testing, network emulation and stress testing tool for containers.
  • Chaos Mesh is a chaos engineering platform under the CNCF.

