Senior cloud quality engineering specialist, SAP, Israel, Rehovot
USING CHAOS ENGINEERING FOR BUILDING RESILIENT AND SUSTAINABLE CLOUD SERVICES
ABSTRACT
Chaos engineering helps organizations build more robust and resilient cloud services by systematically testing their systems' response to failure scenarios.
Chaos is intentionally introduced into cloud infrastructure to simulate real-world failures, enabling an assessment of the system's resilience and response. This research explores the integration of comparative analysis, synergetic frameworks, and advanced modeling techniques. Additionally, it examines the fundamental principles of chaos testing and addresses the challenges inherent in automating these processes. The study presents novel contributions, including a structured framework for chaos engineering procedures across four layers of system security, alongside the development of mathematical models for chaos engineering. The findings are applicable to testing situational chaos and evaluating its outcomes. Future research may focus on refining models and advancing situational forecasting.
АННОТАЦИЯ
Хаос-инжиниринг – актуальная методология для стабильной и безопасной работы программного обеспечения. Используются случайно внедряемый в облачную инфраструктуру хаос (шум) для имитации реального сбоя и оценки реакции системы. Рассмотрены применение методов сравнительного анализа, синергетики и моделирования, ключевые принципы тестирования хаоса, проблемы автоматизации тестирования. Новые результаты работы - структура этапов и процедур инженерии хаоса для четырехуровневой безопасности системы, математические модели инженерии хаоса. Результаты работы применимы для тестирования ситуационного хаоса и оценки его результатов. Возможна работа по совершенствованию моделей и ситуационному прогнозированию.
Keywords: chaos engineering, resilience, testing, program, sustainability, reliability, cloud services, modeling
Ключевые слова: хаос инжиниринг, отказоустойчивость, тестирование, программа, устойчивость, надежность, облачные услуги, моделирование
Controlled chaos experiments involve the study of open systems characterized by nonlinear dynamics and bifurcations [1-2].
Resilience, stability, and security of software (SW) have emerged as critically significant issues. One modern approach to enhancing the resilience and reliability of software is Chaos Engineering. Chaos engineering has become widely adopted due to its systematic methodology for identifying vulnerabilities and failure points in software by implementing and managing controlled chaos and resilience testing.
Netflix’s cloud team systematically introduced chaos testing in 2010 with Chaos Monkey, a tool that randomly disables components of cloud infrastructure to simulate real-world failures and evaluate the system’s resilience. The effectiveness of this approach led Netflix to extend it into a comprehensive suite of methodologies and tools known as the Simian Army. Other notable tools in this domain include Gremlin, which provides capabilities for network failure simulation, and Chaos Toolkit, an open-source tool for defining and running chaos experiments.
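The idea of randomly disabling components can be illustrated with a small simulation. This is a minimal sketch and not Chaos Monkey's actual implementation: the function names `pick_victims` and `service_available`, the fleet of ten nodes, and the availability threshold are all hypothetical and chosen only for illustration.

```python
import random


def pick_victims(instances, kill_fraction, rng):
    """Randomly choose a fraction of instances to disable (hypothetical helper)."""
    if not 0.0 <= kill_fraction <= 1.0:
        raise ValueError("kill_fraction must be in [0, 1]")
    count = int(len(instances) * kill_fraction)
    # sorted() gives a stable sequence so a seeded rng is reproducible
    return set(rng.sample(sorted(instances), count))


def service_available(instances, disabled, min_healthy):
    """Assumed availability rule: enough healthy instances must remain."""
    healthy = [name for name in instances if name not in disabled]
    return len(healthy) >= min_healthy


rng = random.Random(42)  # fixed seed so the experiment can be repeated
fleet = {f"node-{n}" for n in range(10)}
victims = pick_victims(fleet, kill_fraction=0.3, rng=rng)
still_up = service_available(fleet, victims, min_healthy=5)
print(sorted(victims), still_up)
```

Seeding the random generator is a deliberate choice here: a failure scenario that exposed a weakness can then be replayed exactly.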
Chaos engineering as a discipline was primarily developed by practitioners within the software engineering and systems reliability fields. Key contributors include Patrick Debois, known for his work on DevOps, and John Allspaw and Jez Humble, who have influenced resilience engineering practices. Additionally, Nicolas (Nikki) R. Smith and Casey Rosenthal have advanced chaos engineering practices and tools.
Key principles of chaos testing are as follows:
- Establishing the baseline (risk-free) state of the system and developing its digital profile (including performance metrics, response times, error rates, etc.);
- Formulating a stability hypothesis regarding the system's expected response to disruptions, such as server outages;
- Introducing controlled chaos and failures into the system;
- Monitoring and analyzing the system's response to these disruptions, including identifying the causes of the chaos;
- Iteratively enhancing the system's resilience based on the insights derived from the analysis.
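The principles above can be sketched as a single experiment cycle. This is a minimal illustration under stated assumptions: the `Profile` fields, the tolerance values in `hypothesis_holds`, and the `ToyService` stand-in are hypothetical, and a real experiment would take its measurements from monitoring data rather than from hard-coded samples.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Profile:
    """Simplified digital profile of the system (assumed metrics)."""
    mean_latency_ms: float
    error_rate: float


def build_profile(samples):
    """Build a profile from (latency_ms, ok) request samples."""
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    return Profile(mean(latencies), errors / len(samples))


def hypothesis_holds(baseline, observed, max_latency_ratio=1.5, max_error_rate=0.05):
    """Stability hypothesis: latency and errors stay within assumed tolerances."""
    return (observed.mean_latency_ms <= baseline.mean_latency_ms * max_latency_ratio
            and observed.error_rate <= max_error_rate)


def run_experiment(measure, inject_fault, remove_fault):
    """Baseline -> inject fault -> observe -> always restore -> evaluate."""
    baseline = build_profile(measure())
    inject_fault()
    try:
        observed = build_profile(measure())
    finally:
        remove_fault()  # restore the system even if measurement fails
    return baseline, observed, hypothesis_holds(baseline, observed)


class ToyService:
    """Hypothetical stand-in for a real service under test."""

    def __init__(self):
        self.degraded = False

    def measure(self):
        if self.degraded:
            return [(180.0, True), (220.0, True), (200.0, False), (200.0, True)]
        return [(100.0, True)] * 4

    def inject(self):
        self.degraded = True

    def restore(self):
        self.degraded = False


svc = ToyService()
baseline, observed, ok = run_experiment(svc.measure, svc.inject, svc.restore)
print(baseline, observed, ok)
```

A rejected hypothesis (here `ok` is false) is the useful outcome: it marks the point where the final, iterative improvement step begins.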
If the service is actively used, demonstrates sufficient stability, and the project is ready for deployment on a new machine (as a release candidate), the primary focus should shift to identifying and addressing risks and vulnerabilities [3]. The project must be verified and tested primarily for risks and vulnerabilities in the following areas:
- Operational – such as unexpected failures and delays
- Network – including DDoS attacks and data center issues
- Infrastructure – covering organizational and security-related factors
- User experience – addressing human factors and other related issues
Chaos engineering can help mitigate risks associated with potential issues (bugs) by preparing the system to handle such scenarios effectively. This process involves generating and testing these scenarios in a controlled environment, including both test and production settings, where real clients interact with actual data [4].
The implementation of chaos engineering follows a procedure similar to that used in system risk analysis. The proposed stages and structure of this procedure (fig. 1) are aligned with the SWIFT (Structured What-If Technique) methodology.
Figure 1. Structure of the stages and procedures for chaos engineering based on the SWIFT methodology
The experimental process begins with minor, controlled changes that are progressively scaled in complexity. Through load testing, the system's resilience is evaluated under critical stress conditions, facilitating the identification of vulnerabilities and enabling the determination of its maximum performance thresholds.
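The progressive scaling described here can be sketched as a search for the first failure level at which the steady state breaks. This is an illustrative sketch only: `find_breaking_point`, the toy capacity model, and all numeric values (ten nodes, 100 requests per node, a demand of 600) are assumptions introduced for the example.

```python
def find_breaking_point(check_steady_state, levels):
    """Escalate the fault level step by step.

    Returns (last_safe_level, first_failing_level); the second item is
    None if the system tolerated every level that was tried.
    """
    last_safe = None
    for level in sorted(levels):
        if check_steady_state(level):
            last_safe = level
        else:
            return last_safe, level  # stop escalating at the first failure
    return last_safe, None


def toy_steady_state(failed_nodes, nodes=10, capacity_per_node=100, demand=600):
    """Assumed model: remaining capacity must still cover the demand."""
    healthy_nodes = nodes - failed_nodes
    return healthy_nodes * capacity_per_node >= demand


# Start with one failed node and scale up to six.
result = find_breaking_point(toy_steady_state, range(1, 7))
print(result)
```

Stopping at the first failing level keeps the blast radius small, which matters most when experiments run against production traffic.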
The automation of chaos engineering facilitates regular experimentation with minimal resource investment. This process employs specialized tools and scripts to inject chaos and analyze the system's response. Alongside Netflix's Simian Army and Chaos Monkey, available tools include Amazon's AWS Fault Injection Simulator and comparable solutions from other providers.
Numerous tools are available to facilitate the automation of chaos engineering. Among them, SaaS solutions are particularly effective for generating and analyzing samples across diverse data streams [5].
Automation in cloud infrastructure aims to identify and mitigate configuration and administrative risks, including organizational and human factors. This also encompasses risks associated with traditional threats such as DDoS attacks, intrusions, and control hijacking. The system continuously anticipates risks, adapts its responses to client requests and operations, and addresses issues such as inaccuracies in Access Control Policies (ACP) and excessive privileges granted to certain users [6].
Chaos testing is a powerful tool for enhancing the resilience and reliability of complex, large-scale software systems. It enables the identification and mitigation of vulnerabilities while preparing systems to handle unexpected failures effectively.
Despite the significant risks posed by adverse situations in cloud structures for businesses, the adoption and demand for cloud computing continue to grow in these environments. Many companies aim to evolve their infrastructure in this direction. This trend is further driven by a decline in IT personnel competencies, particularly among system administrators, as well as by inefficient work organization.
The legal framework for cloud processes is lagging behind, with key concepts such as "cloud technologies," "cloud computing," "electronic budget," and "cloud guarantees" remaining vague. However, categories such as "server time," "access speed," "data volume," "access fees," "resource pool," "computing power," "elasticity," and "costs" are being successfully defined and specified.
These categories are specified either prior to or during ("on-the-fly") interactions between the customer and the cloud service provider, which ensures compliance with regulations while maintaining customer appeal and provider profitability.
Payment operates on a "pay-as-you-go" principle [7], allowing businesses to pay only for what they use, with the saved resources directed toward developing core business processes. This model is particularly beneficial for small and medium-sized businesses as well as startups. However, large enterprises also recognize strategic value in cloud technologies, such as enhancing guaranteed security (through redundancy, archiving, and authentication) or improving the relevance of marketing analytics.
In organizations that implement a multi-tier security policy, such as credit institutions, a four-tier security system is frequently employed, comprising:
1) priorities and tolerances that ensure anti-insider protection
2) requirements and standards that ensure transactional (operational) protection
3) strict geo-temporal requirements for updating corporate data, its volume, storage, modification, archiving, and backup
4) automation, virtualization and intellectualization of information and network security
At the automation level, security decisions are made across all “layers”:
1) physical (server)
2) software (client)
3) virtual (virtual machine)
4) intellectual (decision making)
At each level, it is imperative to instill in the client a sense of information security. It is more difficult to do this at the cloud service level. To ensure robust data security and mitigate unauthorized access, large organizations commonly implement the following mechanisms:
- Password Recovery: Procedures for resetting and recovering access credentials in the event of loss or compromise.
- Cryptographic Data Protection: Employing cryptographic techniques such as digital signatures, SSL protocols, and other encryption algorithms to safeguard data confidentiality and integrity.
- Multi-factor Authentication: Implementing additional authentication factors beyond passwords, such as biometric data or one-time codes, to enhance security.
- Distributed Data Storage and Recovery: Utilizing data centers (DCs) for data backup and recovery, including the ability to restore historical versions of data.
- Security Auditing and IT Auditing: Conducting regular assessments of security systems to identify vulnerabilities and ensure compliance with security standards.
There is insufficient rigor and a lack of effective tools, such as criteria and evaluation procedures, for assessing the reliability of information [8-9]. While representative statistical samples allow for objective and quantitative evaluations of trustworthiness, trust itself remains a subjective concept, influenced by the controllability of parameters. The following levels (or models) of trust can be considered:
- Full (sufficient) trust
- Confident (necessary) trust
- Acceptable (minimal) trust
The assessment of user trust directly impacts the audience of a cloud service, the value of its data, and the potential damage from risk events.
In information systems, users often avoid updating critical applications and operating systems as long as previous versions remain functional and cause no inconvenience. This behavior is influenced by a natural mistrust of potential backdoors or hidden virtual mechanisms that could enable developers to enforce restrictions or allow attackers to hijack control.
Effective measures to bolster the resilience of cloud interactions and fortify the security architecture of ICT infrastructure remain a priority. This includes implementing solutions at foundational levels, such as BIOS configurations and proxy boot modules. Innovations that enhance system manageability (e.g., granting prioritized access to administrators), controllability (e.g., monitoring user behavior), and ensuring overall system stability are vital for maintaining secure and reliable cloud operations.
By 2030, analysts, including those from Strategy Partners [10], anticipate that domestic products will account for 71% of the Russian software market. Revenue from these developments is projected to surpass 211 billion rubles, while the total IT market, including software, services, and hardware, is expected to expand to 7 trillion rubles. These forecasts underscore the significant growth potential of the domestic IT sector and its increasing role in the national economy.
The dynamics of the IT market are illustrated in Figure 2 (the figure created by the author is based on the data from [10]).
Figure 2. Dynamics of the growth of the Russian IT market until 2030
Sales by Russian software developers and integrators have increased by 28%, covering projects such as:
- Engineering and industrial support programs (CAD, PLM, CAE, etc.);
- Ensuring compatibility and integration of foreign software with Russian processors (e.g., "Baikal," "Elbrus"), including the use of at least one browser registered in the Russian Federation's Software Registry;
- Other services, such as DBMS, EDS, office, communication, email, presentation software, file managers, antivirus protection, and content viewing and editing tools, including the Wink video content service by Rostelecom.
Currently, a systemic and synergistic approach to risk analysis for data security and data centers is lacking, though this domain is gradually developing.
A modeling approach for forecasting risk resilience in cloud computing should be developed alongside chaos testing. This is particularly relevant for situations involving uncertainty ("white noise"), disorganization, vulnerability analysis, and the simulation of intrusion scenarios, especially those involving clusters of attackers using multi-wave network attacks like DDoS. It is crucial to forecast not only the occurrence of attacks but also their intensity and distribution. Cisco, for instance, has enhanced its efforts in DevNet certification and accreditation to support these objectives [11].
The use of artificial intelligence (AI) and neural network-based intrusion detection systems (NNBID), combined with deep machine learning and social engineering techniques, can play a significant role in enhancing cybersecurity defenses. These technologies are particularly beneficial for detecting and responding to threats in complex and unpredictable environments, as seen in chaos engineering practices. AI systems can identify patterns and anomalies in network behavior, helping to proactively mitigate attacks such as brute force password attempts, gateway intrusions, and even social media-based exploits.
However, while these methods offer substantial improvements, they are not without limitations. AI and machine learning models depend heavily on the quality of the data they are trained on and may struggle with evolving tactics used by attackers. Social engineering attacks, for example, often exploit human behavior, which cannot always be accurately predicted or countered by AI alone. Furthermore, chaos engineering tests system resilience under failure conditions, which can benefit from AI integration, but successful implementation requires continuous adaptation and model refinement.
While AI and neural network-based solutions are powerful assets in combating cyber threats, their true potential is realized when seamlessly integrated into a comprehensive, multi-layered cybersecurity strategy.
Figure 3 illustrates the architecture of the company's Artificial Intelligence System (AIS), specifically engineered to detect and analyze malicious activity within the corporate network. The system leverages advanced machine learning algorithms, neural network models, and data correlation techniques to monitor, identify, and respond to suspicious patterns and behaviors, ensuring robust network security.
Figure 3. AIS in the corporate security ecosystem
The following models are proposed for application in the context of software chaos testing.
Example 1. Consider the number of vulnerabilities (bugs) present in the system at time t. From the errors currently measured in practice in the program, the maximum number of remaining errors should be identified.
This value will guide the chaos testing process, particularly in resource allocation, such as the number of modules, testers, failures, etc.
Due to the continuity of the maximum number of errors
can be determined. The testing model is expressed as:
The solution to the equation is:
Substituting we obtain:
The only stable solution is:
Chaos can be introduced dynamically by adjusting the parameters
As a potential criterion for "absence of errors," the system's potential function is given by
Example 2. Consider a model based on the Verhulst-Volterra model:
For constant parameters, the solution is:
Using this expression, the error distribution can be evaluated as:
.
This approach enables the estimation of risks by assessing the acceptable level of security damage that the system can tolerate without compromising its critical functions.
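As an illustration of the kind of model Example 2 refers to, the sketch below uses the textbook Verhulst (logistic) equation dx/dt = r·x·(1 - x/K). This form, the reading of x as the cumulative number of detected errors and K as the total number of errors, and all parameter values are assumptions for the example and are not taken from the article's own model.

```python
import math


def logistic_solution(t, x0, r, k):
    """Closed-form solution of dx/dt = r*x*(1 - x/k) with x(0) = x0."""
    growth = math.exp(r * t)
    return k * x0 * growth / (k + x0 * (growth - 1.0))


def logistic_euler(t_end, x0, r, k, steps=10000):
    """Numerical check of the same equation with the explicit Euler method."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * r * x * (1.0 - x / k)
    return x


def remaining_errors(t, x0, r, k):
    """Errors not yet detected at time t under the assumed interpretation."""
    return k - logistic_solution(t, x0, r, k)


# Hypothetical parameters: 5 errors known at the start, 100 in total.
x0, r, k = 5.0, 0.8, 100.0
for t in (0.0, 2.5, 5.0, 10.0):
    print(t, round(logistic_solution(t, x0, r, k), 2),
          round(remaining_errors(t, x0, r, k), 2))
```

Comparing the closed form against a numerical integration is a simple way to catch a mistake in either one before the model is used for risk estimates.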
Uncertainties are a source of chaos; however, the deterministic chaos introduced by chaos engineering can reveal the consequences of weak corporate security [12]. If the theoretical (hypothetical) distribution function is known, the measure of uncertainty can be defined by Shannon’s entropy:
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i,$$
where $p_i$ is the probability of the $i$-th outcome under that distribution.
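A minimal sketch of how this measure could be computed for a discrete distribution of outcomes is given below. The base-2 logarithm (entropy in bits), the function name `shannon_entropy`, and the sample distributions are assumptions made for the example.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits: the negated sum of p * log2(p) over outcomes.

    Outcomes with zero probability contribute nothing to the sum.
    """
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical outcome distributions for a tested situation.
uniform = [0.25, 0.25, 0.25, 0.25]   # nothing is known: maximum uncertainty
skewed = [0.97, 0.01, 0.01, 0.01]    # one outcome is almost certain
print(shannon_entropy(uniform), shannon_entropy(skewed))
```

The uniform case yields the largest entropy for four outcomes, while the skewed case is close to zero, which matches the reading of entropy as a lack of information about the situation being tested.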
Uncertainty reflects the lack of information about the situation being tested. In the era of digital infrastructures, chaos testing ensures the continuous operation of applications and customer satisfaction with cloud service capabilities, including models like IaaS, SaaS, PaaS, and BaaS. It enables proactive testing of system responses to stress resilience and the implementation of fault-tolerant solutions. This approach is crucial in maintaining the robustness of cloud services, as it helps to identify vulnerabilities and prepare for potential disruptions, ensuring that systems remain functional under extreme conditions.
References:
- Prigogine I., Stengers I. Order Out of Chaos: Man's New Dialogue with Nature / Verso, 1986. -432 p.
- Ramonet I. Geopolitics of chaos / Translation from French by Yegorov I. / M.: Teis, 2001. -128 p.
- Berdugin A. Chaos Engineering for Cloud Resiliency (Haos inzhiniring dlya obespecheniya otkazoustojchivosti v oblake) / URL: https://www.itsec.ru/articles/haos-inzhiniring-dlya-obespecheniya-otkazoustojchivosti-v-oblake (access date: 20.05.2024).
- Gummesson E. Total relationship marketing: From the 4Ps-Product, Price, Promotion, Place-of traditional marketing management to the 30Rs-The thirty relationships-of the new marketing paradigm. Butterworth-Heinemann. – Oxford. 2004. -257 p.
- Yudin I.A., Zhigalov I.E. Possibility of using SAS solutions in decision support systems (Vozmozhnost' ispol'zovaniya resheniy SAS v sistemakh podderzhki prinyatiya resheniy) // Modern science: current problems of theory and practice. Series "Natural and technical sciences" (Sovremennaya nauka: aktual'nyye problemy teorii i praktiki. Ser. «Yestestvennyye i tekhnicheskiye nauki»). 2021. No.03. P. 237-240. DOI: 10.37882/2223-2966.2021.03.40
- IBM Security. 2019 cost of a data breach report / URL: https://www.ibm.com/downloads/cas/RDEOK07R (access date: 20.05.2024).
- Sheinman V. Comparative system analysis of the cloud services pricing model "Pay As You Go" (Sravnitel'nyy sistemnyy analiz modeli tsenoobrazovaniya oblachnykh uslug «Pay As You Go») // Science-sphere (Nauka-sfera). 2024. No.6(1). P. 352-358. DOI: 10.5281/zenodo.11656041
- Eluwa A.C. Cloud Computing Security, Protecting University Information // Open Access Library Journal. 2023. No.10. PP.1-10. DOI: 10.4236/oalib.1110925
- Mlgheit J., Houssein E., Zayed H. Security Model for Preserving Privacy over Encrypted Cloud Computing // Journal of Computer and Communications. 2017. No.5. PP.149-165. DOI: 10.4236/jcc.2017.56009
- Russian IT market to grow more than 2-fold by 2030 / Vedomosti (11.09.2023). URL: https://www.vedomosti.ru/technology/articles/2023/09/11/994374-rossiiskii-it-rinok-virastet-bolee-chem-v-2-raza (access date 20.06.2024)
- Cisco certification: new direction DevNet. URL: https://edu-cisco.org/cisco-certifications/ (access date: 01.07.2024).
- Kaziev V.M., Kazieva B.V., Kaziev K.V. Fundamentals of legal informatics and informatization of legal systems (Osnovy pravovoj informatiki i informatizatzii pravovyh sistem) (2nd ed.). -M:, INFRA-M. ser. “University textbook”, -2017. -336 p.