Produced by Hangzhou ApeCloud Co., Ltd
Test Engineer: Huang Zhangshu
Test Manager: Zhang Mingjing
Product Owner: Wang Ruijun
ChaosMesh is a chaos engineering experimentation platform targeting Kubernetes environments, aimed at helping users test system stability and fault tolerance by simulating various failure scenarios.
Chaos engineering is an advanced system testing methodology that intentionally injects various faults into distributed systems, such as network errors, disk failures, CPU load, etc., to verify the system's behavior and recovery capabilities when encountering failures. This proactive testing approach helps improve system reliability and resilience.
ChaosMesh provides a Kubernetes-based fault injection toolkit covering network, kernel, disk, container, and other failure scenarios, capable of simulating the kinds of abnormal situations that occur in real environments. Users define the fault type, scope, duration, and other parameters through YAML configuration files or a Web UI, and ChaosMesh then executes the corresponding fault operations on the target Pods or nodes.
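For illustration, a minimal experiment definition of this kind might look like the following sketch of a Pod Failure fault. The resource kind and fields are standard Chaos Mesh APIs, but the metadata name, namespace, and label selector are placeholders rather than values taken from this test.

```yaml
# Hypothetical Chaos Mesh PodChaos definition: make one matching Pod
# unavailable (pod-failure) for two minutes. Names and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure        # make the selected Pod unavailable for the duration
  mode: one                  # pick one Pod from the selected set
  duration: "2m"             # how long the fault lasts
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
```

Such a definition would typically be applied with kubectl apply, after which ChaosMesh schedules the fault and reverts it automatically when the duration expires.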
During the chaos experiment, ChaosMesh continuously monitors the system status and records various metrics and log information during the system recovery process. After the experiment is completed, the collected data can be analyzed to evaluate the system's resilience and fault tolerance capabilities, thereby optimizing system design and fixing potential vulnerabilities. Furthermore, ChaosMesh supports extending new fault scenarios to meet customized requirements in different scenarios.
The emergence of ChaosMesh has lowered the barrier to chaos experiments and can help development and operation teams more efficiently and systematically test and improve the reliability of distributed systems. Through the various chaos experiment scenarios provided by ChaosMesh, it becomes easier to identify the weaknesses in the system and enhance application reliability.
In summary, as an open-source chaos engineering platform, ChaosMesh brings better system reliability assurance for Kubernetes applications. In the increasingly complex cloud-native environment, continuously conducting chaos experiments is crucial for ensuring system high availability.
Simulating a database instance failover scenario by deleting a Pod involves the following principles and significance:
In a Kubernetes cluster, stateful database components are typically deployed as StatefulSets, with each Pod corresponding to a database instance. When a running database Pod is deleted, Kubernetes automatically recreates a new Pod using the same persistent data volume. The newly created Pod needs to perform database recovery operations based on the existing data and rejoin the database cluster. This process simulates a database instance failover.
By triggering a failover through Pod deletion, you can check whether the database cluster recovers automatically when an instance fails, and whether the new instance properly rejoins the cluster and continues to serve requests.
During the failover process, you can observe various metrics of the database cluster, such as failure detection time, new instance rebuilding time, data recovery time, etc., and evaluate whether the failover performance meets the requirements.
Real-world failover scenarios may expose some failures and edge cases, such as data inconsistency, split-brain, etc. Simulation helps to discover and fix these potential issues.
Applications need to have the ability to gracefully handle database failovers. Various failover exceptions discovered during experiments can be used to improve the design and implementation of applications. In summary, simulating database failover by leveraging Kubernetes Pods is an efficient and controlled chaos engineering practice that helps to improve the reliability and resilience of distributed systems.
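As a sketch of how the Pod deletion described above could be expressed declaratively rather than through a manual kubectl delete, a pod-kill PodChaos experiment might look like the following; the metadata and label selector are illustrative placeholders, not values from the actual test.

```yaml
# Hypothetical pod-kill experiment: delete one selected Pod; the controller
# (StatefulSet or the KubeBlocks-managed workload) recreates it afterwards.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill           # delete the selected Pod
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
```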
When testing the database failover scenario, we kill the PID 1 process in the main container of the Pod and observe whether the database instance managed by KubeBlocks fails over normally. The principle and significance are as follows:
In Linux, every running process has a unique process ID (PID), and inside a container PID 1 is the first process started in that container (its entrypoint). The database server either runs as PID 1 itself or as a child process forked from PID 1; when PID 1 is killed, the container exits and all of its child processes, including the database process, are terminated with it. Kubernetes detects the exit of the Pod's main container and restarts the container or rebuilds the Pod, and KubeBlocks then brings up a new database instance, performs data recovery based on the existing data, and completes the failover.
In production environments, processes may crash because of code errors, resource exhaustion, and similar causes; killing the PID 1 process simulates this extreme situation.
Deleting a Pod to trigger failover is the common case, while killing the PID 1 process is an abnormal edge case; covering both tests the robustness of the failover mechanism more comprehensively.
When the main process exits abnormally, the failure must be detected quickly and the Pod rebuilt to keep the service available; this exercises the automatic recovery capability of KubeBlocks.
Extreme exceptional situations may also expose latent vulnerabilities and defects, such as resource leaks and deadlocks, helping to discover and fix problems in advance. In summary, killing the PID 1 process is an extreme way of inducing a database failover that subjects KubeBlocks' high availability and reliability mechanisms to more stringent verification, thereby enhancing the robustness of the system.
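One way to approximate killing PID 1 with ChaosMesh is a container-kill PodChaos, which terminates the main process of the named container; the actual test may instead exec into the container and send the signal directly. The container name and labels below are placeholders.

```yaml
# Hypothetical container-kill experiment: terminate the main database container
# of one selected Pod. Container name and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-example
  namespace: chaos-testing
spec:
  action: container-kill     # kill the named container's main process
  mode: one
  containerNames:
    - postgresql             # placeholder: the Pod's main database container
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
```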
Testing the Failover operation of database instance Pods under OOM (Out of Memory) situations in Kubernetes is crucial, as it can verify the high availability and fault tolerance capabilities of the database cluster.
OOM refers to the situation where memory is exhausted and processes within the Pod can no longer allocate the memory they need. When a container exceeds its memory limit or the node itself runs out of memory, the Linux kernel's OOM killer selects one or more processes based on their OOM score and terminates them to free memory and keep the system stable. Raising a process's oom_score_adj to the maximum value of 1000 makes it the preferred victim, so an OOM kill can be triggered deterministically for that process to simulate a memory shortage scenario.
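One way to drive a Pod toward OOM with ChaosMesh is a StressChaos experiment with a memory stressor, sketched below under the assumption that the target Pod can be selected by a role label; the label keys, allocation size, and names are placeholders, and the actual test may instead adjust oom_score_adj and exhaust memory directly.

```yaml
# Hypothetical memory-stress experiment: allocate memory inside the selected
# Pod until it approaches or exceeds its limit. Names and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
      kubeblocks.io/role: primary                 # assumed role label for the Primary Pod
  stressors:
    memory:
      workers: 1             # one allocator process
      size: "512MB"          # amount of memory to allocate; sized to exceed the container limit
  duration: "2m"
```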
In the Failover test, a specific replica Pod of the database instance is selected as the OOM target. A database cluster typically consists of a Primary node, which handles write operations, and one or more Secondary nodes, which handle replication and read operations. When the Primary Pod encounters OOM, it is therefore necessary to verify whether the cluster executes the Failover correctly: promoting one of the Secondary Pods to become the new Primary, and adjusting the roles and data replication of the remaining Pods so that availability and data consistency are preserved.
During the testing process, the test program will trigger OOM and closely monitor the entire Failover process. Once the Failover is detected as completed, it will verify the role labels of each Pod to confirm whether the new Primary node and Secondary nodes have been properly switched. Additionally, it will check whether the new Primary node can provide write services normally, and whether the Secondary nodes can correctly replicate data and provide read services.
Through this approach, the test system can comprehensively test the fault tolerance capability of the database instance under extreme memory pressure, ensuring that even in the event of node failure, the entire cluster can quickly recover and continue to provide services. This is crucial for ensuring the high availability and data consistency of the database, especially when running critical business in production environments.
In summary, the role of the OOM mechanism in Failover testing is to simulate extreme resource pressure situations, verify whether the database instance can correctly execute high availability policies in the event of failure, and evaluate the fault tolerance capabilities and reliability of the entire cluster through the verification of Failover results.
ChaosMesh's CPU Stress fault experiment aims to simulate scenarios where the system encounters CPU resource stress. Its working principle and implementation method are as follows:
ChaosMesh injects a process that consumes a large amount of CPU resources into the target Pod, thereby reducing the available CPU resources of the Pod and triggering the system's capacity planning or auto-scaling mechanisms.
ChaosMesh's chaos-daemon injects a stress process (stress-ng) into the target Pod's container, running inside the container's namespaces and cgroup so that it competes for the Pod's CPU quota. The stress-ng workers continuously consume CPU with CPU-bound workloads such as prime number calculation.
For the CPU Stress experiment, users can set parameters in the configuration file such as the number of stress workers, the target CPU load per worker, the experiment duration, and the selector that chooses the target Pods.
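A minimal StressChaos sketch for the CPU case, with placeholder names and selectors, might look like this:

```yaml
# Hypothetical CPU-stress experiment: run stress workers at high load in every
# matched Pod for two minutes. Names and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-example
  namespace: chaos-testing
spec:
  mode: all                  # stress every Pod matched by the selector
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
  stressors:
    cpu:
      workers: 2             # number of stress worker processes
      load: 90               # target CPU load per worker, in percent
  duration: "2m"             # experiment duration
```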
During the experiment, ChaosMesh continuously monitors the CPU usage of the target Pods. After the experiment ends, ChaosMesh automatically stops the injected stress process and releases the occupied CPU resources.
Through the CPU Stress fault experiment, users can evaluate the system's response to CPU resource stress scenarios, such as scaling or priority preemption, thereby optimizing resource scheduling and improving system robustness.
ChaosMesh's Network Delay fault experiment aims to simulate network latency scenarios. Its working principle and implementation method are as follows:
ChaosMesh configures the corresponding network rules on the node where the target Pod resides, artificially introducing network latency, thereby simulating the situation where network communication between the Pod and other resources experiences delay.
ChaosMesh uses the NetEm (Network Emulator) module in the Linux kernel: the chaos-daemon running on the node where the target Pod resides enters the Pod's network namespace and runs the tc (traffic control) command to attach a NetEm delay queue to the Pod's network interface.
For the Network Delay experiment, users can set key parameters in the configuration file such as the latency to introduce, the jitter and correlation of that latency, the direction of the affected traffic, the target Pods, and the experiment duration.
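A hedged NetworkChaos sketch for injecting delay, with placeholder names and selectors:

```yaml
# Hypothetical network-delay experiment: add latency with jitter to traffic of
# one selected Pod for two minutes. Names and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
  delay:
    latency: "100ms"         # base delay added to each packet
    jitter: "10ms"           # random variation around the base delay
    correlation: "50"        # correlation with the previous packet's delay, in percent
  direction: to              # direction of the affected traffic (to, from, or both)
  duration: "2m"
```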
During the experiment, ChaosMesh continuously monitors the network status. After the experiment ends, ChaosMesh automatically cleans up the previously injected network rules and restores the normal network.
Through the Network Delay experiment, users can evaluate the stability and availability of distributed systems under network delay conditions, thereby optimizing network policies and improving system robustness.
ChaosMesh's Time Chaos feature aims to simulate time offset scenarios. Its working principle and implementation method are as follows:
ChaosMesh modifies the target process's perception of time, causing it to mistakenly believe that the system time has shifted, thereby simulating a time error scenario. Such time errors may cause abnormal process behavior, such as timer triggering errors, inaccurate scheduled task execution times, etc.
ChaosMesh works through the time-related interfaces provided by Linux (primarily clock_gettime and related calls). When a Time Chaos experiment starts, ChaosMesh attaches to the target process, intercepts its time queries, and returns artificially shifted time values, thereby deceiving the process.
Specifically, ChaosMesh attaches to the target process using the ptrace system call, rewrites the time-query code and return values in the process's memory, and injects the desired time offset. In the experiment configuration, users can specify the offset to apply, the clock IDs to affect (for example CLOCK_REALTIME), and the target containers.
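A TimeChaos sketch with placeholder names and selectors:

```yaml
# Hypothetical time-offset experiment: shift the perceived real-time clock of
# one selected Pod back by ten minutes for two minutes. Names and labels are placeholders.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-offset-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app.kubernetes.io/instance: my-db-cluster   # placeholder label
  timeOffset: "-10m"         # offset applied to the perceived time
  clockIds:
    - CLOCK_REALTIME         # which clocks to offset
  duration: "2m"
```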
During the experiment, ChaosMesh continuously modifies the target process's perception of time. After the experiment ends, ChaosMesh automatically removes the injected time modifications and restores normal time.
Through the Time Chaos experiment, users can evaluate the robustness of distributed systems under clock error conditions, identify potential time-related bugs, optimize time handling strategies, and improve system reliability.
Engine | FailoverOps | State | Props | Description
PostgreSQL (Topology = replication; Replicas = 2) | Pod Failure | PASSED | HA=Pod Failure; Durations=2m; ComponentName=postgresql | Simulates Pods failing for a period of time, testing the application's resilience to slowness or unavailability of some replicas caused by the failure.
 | Evicting Pod | PASSED | HA=Evicting Pod; ComponentName=postgresql | Simulates Pods being evicted, such as during a node drain, testing the application's resilience to the unavailability of some replicas caused by eviction.
 | Connection Stress | PASSED | HA=Connection Stress; ComponentName=postgresql | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
 | Network Delay | PASSED | HA=Network Delay; Durations=2m; ComponentName=postgresql | Simulates a network delay fault, testing the application's resilience to slowness or unavailability of some replicas caused by network latency.
 | OOM | PASSED | HA=OOM; Durations=2m; ComponentName=postgresql | Simulates Pods running out of memory, testing the application's resilience to slowness or unavailability of some replicas under high memory load.
 | Full CPU | PASSED | HA=Full CPU; Durations=2m; ComponentName=postgresql | Simulates Pods running at full CPU, testing the application's resilience to slowness or unavailability of some replicas under high CPU load.
 | Time Offset | PASSED | HA=Time Offset; Durations=2m; ComponentName=postgresql | Simulates a time offset, testing the application's resilience to slowness or unavailability of some replicas caused by clock skew.
 | Kill 1 | PASSED | HA=Kill 1; ComponentName=postgresql | Kills PID 1 in the main container, testing the application's resilience to the unavailability of some replicas after abnormal process termination.
Engine | FailoverOps | State | Props | Description
Redis (Topology = replication; Replicas = 2) | Evicting Pod | PASSED | HA=Evicting Pod; ComponentName=redis | Simulates Pods being evicted, such as during a node drain, testing the application's resilience to the unavailability of some replicas caused by eviction.
 | Kill 1 | PASSED | HA=Kill 1; ComponentName=redis | Kills PID 1 in the main container, testing the application's resilience to the unavailability of some replicas after abnormal process termination.
 | OOM | PASSED | HA=OOM; Durations=2m; ComponentName=redis | Simulates Pods running out of memory, testing the application's resilience to slowness or unavailability of some replicas under high memory load.
 | Pod Failure | PASSED | HA=Pod Failure; Durations=2m; ComponentName=redis | Simulates Pods failing for a period of time, testing the application's resilience to slowness or unavailability of some replicas caused by the failure.
 | Full CPU Failover | PASSED | HA=Full CPU Failover; Durations=2m; ComponentName=redis | Simulates Pods running at full CPU, testing the application's resilience to slowness or unavailability of some replicas under high CPU load.
 | Network Delay Failover | PASSED | HA=Network Delay Failover; Durations=2m; ComponentName=redis | Simulates a network delay fault, testing the application's resilience to slowness or unavailability of some replicas caused by network latency.
 | Connection Stress | PASSED | HA=Connection Stress; ComponentName=redis | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
 | Time Offset | PASSED | HA=Time Offset; Durations=2m; ComponentName=redis | Simulates a time offset, testing the application's resilience to slowness or unavailability of some replicas caused by clock skew.
Engine | FailoverOps | State | Props | Description
Kafka (Replicas = 3) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=kafka-combine | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
Qdrant (Topology = cluster; Replicas = 2) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=qdrant | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
MySQL (Topology = replication; Replicas = 2) | Time Offset | PASSED | HA=Time Offset; Durations=2m; ComponentName=mysql | Simulates a time offset, testing the application's resilience to slowness or unavailability of some replicas caused by clock skew.
 | Pod Failure | PASSED | HA=Pod Failure; Durations=2m; ComponentName=mysql | Simulates Pods failing for a period of time, testing the application's resilience to slowness or unavailability of some replicas caused by the failure.
 | OOM | PASSED | HA=OOM; Durations=2m; ComponentName=mysql | Simulates Pods running out of memory, testing the application's resilience to slowness or unavailability of some replicas under high memory load.
 | Evicting Pod | PASSED | HA=Evicting Pod; ComponentName=mysql | Simulates Pods being evicted, such as during a node drain, testing the application's resilience to the unavailability of some replicas caused by eviction.
 | Kill 1 | PASSED | HA=Kill 1; ComponentName=mysql | Kills PID 1 in the main container, testing the application's resilience to the unavailability of some replicas after abnormal process termination.
 | Connection Stress | PASSED | HA=Connection Stress; ComponentName=mysql | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
 | Network Delay | PASSED | HA=Network Delay; Durations=2m; ComponentName=mysql | Simulates a network delay fault, testing the application's resilience to slowness or unavailability of some replicas caused by network latency.
 | Full CPU | PASSED | HA=Full CPU; Durations=2m; ComponentName=mysql | Simulates Pods running at full CPU, testing the application's resilience to slowness or unavailability of some replicas under high CPU load.
Engine | FailoverOps | State | Props | Description
ClickHouse (Topology = cluster; Replicas = 2) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=clickhouse | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
Elasticsearch (Topology = m-d-i-t; Replicas = 2) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=master | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
OceanBase Ent (Topology = replication; Replicas = 2) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=oceanbase | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
StarRocks Ent (Topology = shared-nothing; Replicas = 2) | Connection Stress | PASSED | HA=Connection Stress; ComponentName=be | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
DamengDB (Topology = realtime-replication; Replicas = 3) | Network Delay | PASSED | HA=Network Delay; Durations=2m; ComponentName=dmdb | Simulates a network delay fault, testing the application's resilience to slowness or unavailability of some replicas caused by network latency.
 | Time Offset | PASSED | HA=Time Offset; Durations=2m; ComponentName=dmdb | Simulates a time offset, testing the application's resilience to slowness or unavailability of some replicas caused by clock skew.
 | Evicting Pod | PASSED | HA=Evicting Pod; ComponentName=dmdb | Simulates Pods being evicted, such as during a node drain, testing the application's resilience to the unavailability of some replicas caused by eviction.
 | Full CPU | PASSED | HA=Full CPU; Durations=2m; ComponentName=dmdb | Simulates Pods running at full CPU, testing the application's resilience to slowness or unavailability of some replicas under high CPU load.
 | OOM | PASSED | HA=OOM; Durations=2m; ComponentName=dmdb | Simulates Pods running out of memory, testing the application's resilience to slowness or unavailability of some replicas under high memory load.
 | Kill 1 | PASSED | HA=Kill 1; ComponentName=dmdb | Kills PID 1 in the main container, testing the application's resilience to the unavailability of some replicas after abnormal process termination.
 | Pod Failure | PASSED | HA=Pod Failure; Durations=2m; ComponentName=dmdb | Simulates Pods failing for a period of time, testing the application's resilience to slowness or unavailability of some replicas caused by the failure.
 | Connection Stress | PASSED | HA=Connection Stress; ComponentName=dmdb | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
Engine | FailoverOps | State | Props | Description
Kingbase (Topology = kingbase-cluster; Replicas = 3) | Network Delay | PASSED | HA=Network Delay; Durations=2m; ComponentName=kingbase | Simulates a network delay fault, testing the application's resilience to slowness or unavailability of some replicas caused by network latency.
 | Pod Failure | PASSED | HA=Pod Failure; Durations=2m; ComponentName=kingbase | Simulates Pods failing for a period of time, testing the application's resilience to slowness or unavailability of some replicas caused by the failure.
 | Connection Stress | PASSED | HA=Connection Stress; ComponentName=kingbase | Simulates a high connection load, testing the application's resilience to slowness or unavailability of some replicas under connection stress.
 | Evicting Pod | PASSED | HA=Evicting Pod; ComponentName=kingbase | Simulates Pods being evicted, such as during a node drain, testing the application's resilience to the unavailability of some replicas caused by eviction.
 | Full CPU | PASSED | HA=Full CPU; Durations=2m; ComponentName=kingbase | Simulates Pods running at full CPU, testing the application's resilience to slowness or unavailability of some replicas under high CPU load.
 | Time Offset | PASSED | HA=Time Offset; Durations=2m; ComponentName=kingbase | Simulates a time offset, testing the application's resilience to slowness or unavailability of some replicas caused by clock skew.
 | Kill 1 | PASSED | HA=Kill 1; ComponentName=kingbase | Kills PID 1 in the main container, testing the application's resilience to the unavailability of some replicas after abnormal process termination.
 | OOM | PASSED | HA=OOM; Durations=2m; ComponentName=kingbase | Simulates Pods running out of memory, testing the application's resilience to slowness or unavailability of some replicas under high memory load.
Test Period: Nov 13, 2024 - Nov 28, 2024