AUDITING ML SYSTEMS BEYOND ACCURACY: A SCENARIO-BASED BEHAVIORAL TESTING APPROACH

Ait A.M.
Cite as:
Ait A.M. AUDITING ML SYSTEMS BEYOND ACCURACY: A SCENARIO-BASED BEHAVIORAL TESTING APPROACH // Universum: технические науки: electronic scientific journal. 2026. 4(145). URL: https://7universum.com/ru/tech/archive/item/22430 (accessed: 07.05.2026).
DOI - 10.32743/UniTech.2026.145.4.22430
Received: 02.04.2026
Accepted for publication: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

Machine learning (ML) systems drive consequential decisions in credit scoring, healthcare, and hiring, yet existing IT governance frameworks were not designed for them. This paper examines whether five widely adopted frameworks — COBIT, ITIL, ISO/IEC 27001, ISO/IEC 42001, and NIST AI RMF — address the behavioral risks that ML systems introduce. A scenario-based behavioral testing approach was developed and validated on the UCI German Credit Dataset using three classifiers. All five frameworks lack concrete testing procedures for behavioral stability. Accuracy on borderline applicants dropped 14–17 percentage points below normal-condition accuracy. Flip rates of 3.0–3.3% and statistically significant data drift (D=0.130, p=0.013) remained invisible to accuracy-based audit checks.


 

Keywords: ML auditing, behavior-based audit, IT audit frameworks, data drift, scenario testing.


 

Introduction

Banks use ML to approve or deny credit applications; hospitals use it to rank patients by urgency; employers use it to filter job applicants [8]. Around 88% of organizations have deployed ML in at least one core business function [19, p. 496]. Unlike conventional software, ML systems derive their decision logic from historical data — logic that is not directly inspectable and that escapes controls designed for deterministic systems. IT auditors currently rely on five main frameworks: COBIT (IT governance), ITIL (service operations), ISO/IEC 27001 (information security), ISO/IEC 42001 — the first international AI management standard, published in 2023 [9] — and NIST AI RMF, which provides a taxonomy of AI-related risks [20]. All five check whether documented processes are followed. None requires testing how an ML model actually behaves in production [14, 18].

The practical scale of this problem is well documented. The error rate on borderline credit applicants can be nearly twice the overall error rate [4, p. 72], and aggregate accuracy conceals subgroup failures by design [3]. Data quality failures and monitoring gaps are the most common sources of ML system issues [2]. The EU AI Act, in force since August 2024, requires conformity assessments for high-risk AI, including credit scoring [6]. SR 11-7 requires model outputs to stay consistent under plausible input variation [7]. Neither requirement has been translated into a concrete audit procedure, creating a growing compliance exposure [10, 11, p. 634]. Research confirms that demonstrating correct behavior across conditions is more meaningful than documenting compliance — yet no empirically validated, governance-aligned testing procedure exists [21, p. 73].

Three bodies of literature converge on this gap: IT governance standards [15] provide process structure but not behavioral testing; explainability research [22, 12] cannot substitute for systematic testing because XAI explanations are unstable across retraining cycles [4, p. 71]; and the algorithmic accountability literature [16, 1, 17] establishes the correct conceptual foundation — testing observable outputs — but lacks empirical validation on real data. This paper makes three contributions: (1) a structured gap analysis of five frameworks identifies where each falls short; (2) a scenario-based approach with a model-agnostic edge-case definition avoids the circular-logic problem of prior proposals [16, 1]; (3) the approach is validated on the UCI German Credit Dataset with sensitivity analysis [5, 13, p. 124].

Materials and Methods

Framework Gap Analysis. Five frameworks were analyzed against four ML-specific audit requirements: (1) testing output consistency under small input variation; (2) evaluation of decision quality on uncertain cases; (3) monitoring of input feature distributions over time; and (4) lifecycle controls for model retraining and post-deployment re-evaluation. A gap was recorded wherever no applicable procedure existed. COBIT 2019's DSS01 domain frames all quality concerns in terms of service availability with no control for model output stability; ISO/IEC 42001 references performance monitoring (Clause 9.1) but does not define what this means operationally for a classifier; the NIST AI RMF defers test design to the implementing organization; SR 11-7 and the OCC handbook [7, 15] — the most operationally specific financial-sector documents — were designed for parametric models and do not address ML-specific technical debt [18, 2].

Dataset and Models. The experiment used the UCI Statlog German Credit Dataset [5] — 1,000 applicants described by 20 features and a binary credit risk label (70% good, 30% bad) — split 70/30 into training (700) and test (300) sets, stratified by class [13, p. 125]. Three model families were tested: Random Forest (100 estimators), Logistic Regression (standardized inputs), and Decision Tree (max depth 5). All experiments ran on Python 3.11 with scikit-learn 1.4, NumPy 1.26, pandas 2.2, and SciPy 1.12, with the random seed fixed at 42.
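This setup can be reproduced in a few lines. The sketch below is illustrative rather than the study's exact code: the fetch_openml("credit-g") source (a public OpenML mirror of the UCI dataset) and the one-hot encoding of categorical features are assumptions, since the text fixes the library versions and the seed but not the loading and preprocessing steps.

```python
# Illustrative setup sketch -- assumptions: OpenML "credit-g" mirror of the
# UCI Statlog German Credit data and one-hot encoding of categoricals.
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# 1,000 applicants, 20 features, binary good/bad label (70%/30%).
data = fetch_openml("credit-g", version=1, as_frame=True, parser="auto")
X = pd.get_dummies(data.data)            # one-hot encode categorical features
y = (data.target == "good").astype(int)  # 1 = good credit risk

# Stratified 70/30 split: 700 training and 300 test records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=SEED),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=SEED)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=SEED),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: normal-scenario accuracy = {model.score(X_test, y_test):.3f}")
```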

Behavioral Scenario Testing. Three scenario types were applied. Normal: all 300 test records evaluated as-is, reproducing existing framework-based audit output. Edge: borderline cases identified by a majority-vote criterion — a record was classified as edge if at least two of three models assigned an approval probability in [0.40, 0.60]; this avoids the circular dependency of prior proposals [16, 1]. Sixty-six records (22%) met this criterion. Stress: all 300 records re-evaluated after small perturbations to credit amount (±5%) and loan duration (±10%), reflecting documented data-entry variation [18, 2]. The flip rate (share of decisions changed under perturbation) is the key metric; SR 11-7 requires output consistency under plausible input changes [7].
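A minimal sketch of the edge and stress scenarios follows, assuming the fitted models and test split from the setup sketch above; the numeric column names credit_amount and duration follow the OpenML mirror and are assumptions.

```python
# Edge and stress scenarios, assuming `models`, X_test, y_test from the
# setup sketch; column names "credit_amount" and "duration" are assumptions.
import numpy as np

# Edge: a record is borderline if at least two of the three models assign
# it an approval probability inside [0.40, 0.60] (majority vote).
probs = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in models.values()])
edge = ((probs >= 0.40) & (probs <= 0.60)).sum(axis=1) >= 2
print(f"edge records: {edge.sum()} of {len(X_test)}")

for name, m in models.items():
    print(f"{name}: edge accuracy = {m.score(X_test[edge], y_test[edge]):.3f}")

# Stress: re-evaluate after small perturbations to credit amount (+/-5%)
# and loan duration (+/-10%); the flip rate is the share of changed decisions.
rng = np.random.default_rng(42)
X_stress = X_test.copy()
X_stress["credit_amount"] = X_stress["credit_amount"] * (
    1 + rng.uniform(-0.05, 0.05, len(X_stress)))
X_stress["duration"] = X_stress["duration"] * (
    1 + rng.uniform(-0.10, 0.10, len(X_stress)))

for name, m in models.items():
    flip_rate = (m.predict(X_test) != m.predict(X_stress)).mean()
    print(f"{name}: flip rate = {flip_rate:.2%}")
```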

Data Drift and Sensitivity Analysis. Two drifted test populations were generated: moderate (+20% credit amount, σ=3% noise, +15% employment) and strong (+40%/+30%), reflecting European household credit inflation trends [6]. Labels were unchanged. Distributional shift was assessed with a two-sample KS test. Sensitivity analysis was run across three edge thresholds ([0.45,0.55], [0.40,0.60], [0.35,0.65]) and three perturbation levels (low ±2%/±5%, main ±5%/±10%, high ±10%/±20%). Four metrics: accuracy ((TP+TN)/N), approval rate, flip rate, and drift impact (Δacc).
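The drift construction and its two-level check can be sketched as follows, under the same assumptions as above; only the credit-amount component is shown, since the employment shift depends on a feature encoding the text does not specify.

```python
# Drift scenarios and the two-level check, assuming `models`, X_test, y_test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def make_drift(X, shift, noise_sigma=0.03):
    """Scale credit amount by (1 + shift) and add multiplicative noise."""
    X_d = X.copy()
    X_d["credit_amount"] = (X_d["credit_amount"] * (1 + shift)
                            * (1 + rng.normal(0, noise_sigma, len(X_d))))
    return X_d

X_moderate = make_drift(X_test, 0.20)   # +20% credit amount
X_strong = make_drift(X_test, 0.40)     # +40% credit amount

# Feature-level check: a two-sample KS test flags the shift directly.
res = ks_2samp(X_test["credit_amount"], X_moderate["credit_amount"])
print(f"KS on credit amount (moderate): D={res.statistic:.3f}, p={res.pvalue:.3f}")

# Accuracy-level check: drift impact = accuracy change (labels unchanged).
for name, m in models.items():
    delta = m.score(X_moderate, y_test) - m.score(X_test, y_test)
    print(f"{name}: drift impact (moderate) = {delta:+.3f}")
```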

Results and Discussion

Framework Gap Analysis. Four structural gaps appeared consistently across all five frameworks: no procedure for detecting behavioral change after retraining; no requirement to evaluate decision quality on uncertain cases; no mandate for input distribution monitoring; no threshold to trigger model re-evaluation. ISO/IEC 42001 comes closest — requiring impact assessments and lifecycle planning — but its controls are organizational, not operational. An organization fully compliant with all five frameworks could deploy a model on a drifted population with no governance mechanism to detect the problem.

Scenario-Based Testing Results. Table 1 and Figure 1 show accuracy across all three scenarios. All three models degraded substantially on edge cases. Decision Tree fell to 54.5% — below the 70% majority-class baseline, meaning it performed worse on borderline applicants than a trivial 'always approve' rule. Random Forest and Logistic Regression dropped 16.6 and 14.1 percentage points respectively [21, p. 74]. This degradation was consistent across all threshold configurations in sensitivity analysis: even at the narrowest threshold (n=28), every model remained below normal-condition accuracy.

Table 1.

Multi-Model Comparison Across Audit Scenarios

| Model | Normal Acc. | Edge Acc. | Drop (pp) | Stress Acc. | Flip Rate | Appr. Rate |
|---|---|---|---|---|---|---|
| Random Forest | 0.757 | 0.591 | 16.6 | 0.760 | 3.00% | 81.7% |
| Logistic Regression | 0.777 | 0.636 | 14.1 | 0.777 | 0.00% | 76.3% |
| Decision Tree | 0.697 | 0.545 | 15.2 | 0.710 | 3.33% | 59.7% |

Drop (pp) = Normal − Edge accuracy in percentage points. Appr. Rate = share of applicants approved.

 

Figure 1. Accuracy by model and scenario

Edge accuracy falls below the 70% majority-class baseline for all three models. Normal and Stress results are comparable, confirming that standard audit metrics miss the edge-case degradation.

 

Logistic Regression produced a 0.00% flip rate across all perturbation magnitudes: its linear decision boundary does not cross under the tested perturbations. Random Forest and Decision Tree both exceeded 3.0%, meaning 9–10 applicants in every 300 received a different credit decision based solely on minor data variation. At 5,000 applications per month, this translates to 150–170 inconsistent outcomes — a material SR 11-7 finding [7]. Approval rates also differed sharply: Random Forest approved 81.7% versus 59.7% for Decision Tree, a 22 pp gap no existing framework would flag.

Data Drift Results. Table 2 and Figure 2 show accuracy under original and drifted populations. Accuracy changes are negligible even under strong drift (at most −2.0 pp for Decision Tree). Yet a KS test on credit amount found the moderate drift statistically significant (D=0.130, p=0.013). The proposed distribution-monitoring step would flag this for governance review; an accuracy-only audit would not. Sensitivity analysis confirmed robustness: RF and DT flip rates remained non-zero at every perturbation level (reaching 5.33% and 4.00% at high perturbation).

Table 2.

Model Accuracy Under Simulated Data Drift

| Model | Original Acc. | Moderate Drift | ΔM | Strong Drift | ΔS |
|---|---|---|---|---|---|
| Random Forest | 0.757 | 0.757 | 0.000 | 0.753 | −0.003 |
| Logistic Regression | 0.777 | 0.777 | 0.000 | 0.780 | +0.003 |
| Decision Tree | 0.697 | 0.683 | −0.013 | 0.677 | −0.020 |

ΔM / ΔS = accuracy change under moderate / strong drift. KS test (credit amount): D=0.130, p=0.013 under moderate drift.

 

Figure 2. Credit amount distributions under original, moderate (+20%), and strong (+40%) drift.

Dashed lines show distribution means. The shift is statistically significant yet invisible to accuracy-based monitoring across all models.

 

Discussion. Under any of the five frameworks, all three models would pass a compliance audit. The behavioral approach exposed findings compliance auditing cannot produce: accuracy gaps of 14–17 pp on the most consequential applicants, flip rates translating to hundreds of inconsistent decisions per month, a 22 pp spread in approval rates, and a confirmed distributional shift invisible to accuracy monitoring. These results give empirical weight to Raji et al.'s [16] argument that accountability requires testing observable outputs, and to Bucker et al.'s [4, p. 71] finding that XAI is insufficient for ML auditability. Table 3 compares the proposed approach against existing alternatives.

Table 3.

Comparison of ML Audit Approaches

| Approach | No Model Internals | Pass/Fail Finding | Behavioral Stability | Model-Agnostic | Regulatory Reference |
|---|---|---|---|---|---|
| Compliance audit (COBIT/ITIL) | Yes | Yes | No | Yes | Yes |
| XAI review (SHAP/LIME) | Partial* | No | No | No | No |
| Ad-hoc monitoring (PSI/KS) | Yes | No | Partial | Yes | No |
| Proposed approach | Yes | Yes | Yes | Yes | Partial** |

* LIME is black-box; tree-specific SHAP requires model internals. ** SR 11-7 output consistency (flip rate) and EU AI Act ongoing behavioral monitoring [6] are partially addressed.

 

The proposed approach is the only one that covers, at least partially, all five audit properties. An auditor applies it in four steps: (1) identify borderline applicants using the majority-vote criterion (requires only predicted probabilities); (2) document any edge accuracy gap exceeding 10 pp; (3) report any non-zero flip rate as a consistency finding under SR 11-7; (4) trigger re-validation if a KS test on key features is significant at p < 0.05. None of these steps requires model internals — they can be performed with access only to the prediction API, as the sketch below illustrates. Limitations: a single dataset and three classic model families; gradient boosting and neural networks were not tested.
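A compact sketch of this four-step check, assuming only black-box access to each model's predicted approval probabilities; the function and argument names are illustrative, and the thresholds are those stated above.

```python
# Four-step behavioral audit sketch -- black-box access only; names and
# signatures are illustrative, not the paper's reference implementation.
import numpy as np
from scipy.stats import ks_2samp

def behavioral_audit(prob_fns, X, X_perturbed, y, feat_ref, feat_cur,
                     band=(0.40, 0.60)):
    findings = []
    y = np.asarray(y)

    # Step 1: borderline records by majority vote over model probabilities.
    probs = np.column_stack([f(X) for f in prob_fns])
    edge = ((probs >= band[0]) & (probs <= band[1])).sum(axis=1) >= 2

    for i, f in enumerate(prob_fns):
        pred = f(X) >= 0.5
        # Step 2: document any edge accuracy gap exceeding 10 pp.
        if edge.any():
            gap = (pred == y).mean() - (pred[edge] == y[edge]).mean()
            if gap > 0.10:
                findings.append(f"model {i}: edge accuracy gap {gap * 100:.1f} pp")
        # Step 3: any non-zero flip rate is an SR 11-7 consistency finding.
        flips = (pred != (f(X_perturbed) >= 0.5)).mean()
        if flips > 0:
            findings.append(f"model {i}: flip rate {flips:.2%}")

    # Step 4: a significant KS test on a key feature triggers re-validation.
    res = ks_2samp(feat_ref, feat_cur)
    if res.pvalue < 0.05:
        findings.append(f"drift: D={res.statistic:.3f}, p={res.pvalue:.3f}")
    return findings

# Example call with objects from the earlier sketches:
# behavioral_audit([lambda Z, m=m: m.predict_proba(Z)[:, 1]
#                   for m in models.values()],
#                  X_test, X_stress, y_test,
#                  X_test["credit_amount"], X_moderate["credit_amount"])
```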

Conclusion

This paper demonstrated that standard IT governance frameworks — COBIT, ITIL, ISO/IEC 27001, ISO/IEC 42001, and NIST AI RMF — do not detect the behavioral risks that ML systems carry. Three findings stand out. First, all five frameworks lack concrete testing procedures for behavioral stability, borderline-case quality, and input distribution change. Second, accuracy on borderline applicants fell 14–17 percentage points below normal-condition accuracy across all three models and all tested configurations — invisible to any compliance audit. Third, a statistically significant distributional shift (p=0.013) was detectable at the feature level but produced negligible accuracy change, confirming that accuracy is a lagging and unreliable audit indicator.

The scenario-based testing procedure is executable, model-agnostic, and requires no model internals. A flip rate of 3% means one in thirty-three applicants receives a different credit decision depending on how their data was recorded — a finding interpretable by non-technical auditors. Embedding this procedure as a required annual control within ISO/IEC 42001 Clause 9.1 or SR 11-7's ongoing validation requirements would give institutions a concrete compliance path that does not currently exist. Open questions include: what flip rate threshold should trigger a formal finding under the EU AI Act, and can the approach be extended to the gradient boosting and neural network models that dominate production scoring [17]?

 

References:

  1. Adler P. et al. Auditing Black-Box Models for Indirect Influence // Proc. IEEE ICDM. — 2016. DOI: 10.1109/ICDM.2016.0011
  2. Amershi S. et al. Software Engineering for Machine Learning: A Case Study // Proc. ICSE-SEIP. — 2019. DOI: 10.1109/ICSE-SEIP.2019.00042
  3. Barocas S., Hardt M., Narayanan A. Fairness and Machine Learning. — fairmlbook.org, 2019.
  4. Bucker M. et al. Transparency, Auditability, and Explainability of ML Models in Credit Scoring // J. Oper. Res. Soc. — 2022. — Vol. 73, No. 1. — P. 70–90. DOI: 10.1080/01605682.2021.1922098
  5. Dua D., Graff C. UCI ML Repository: Statlog (German Credit Data). — Univ. of California, Irvine, 2019. Available: https://archive.ics.uci.edu/dataset/144
  6. European Union. Regulation (EU) on Artificial Intelligence (AI Act). — 2024. Available: https://artificialintelligenceact.eu
  7. Federal Reserve. Supervisory Guidance on Model Risk Management (SR 11-7). — 2011. Available: https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm
  8. Financial Stability Board. Artificial Intelligence and Machine Learning in Financial Services. FSB Report. — 2017. Available: https://www.fsb.org/wp-content/uploads/P011117.pdf
  9. ISO/IEC. ISO/IEC 42001:2023 — Artificial Intelligence — Management System. — International Organization for Standardization, Geneva, 2023.
  10. Koshiyama A. et al. Towards Algorithm Auditing // Royal Soc. Open Science. — 2024. — Vol. 11, No. 5. — P. 230859. DOI: 10.1098/rsos.230859
  11. Kroll J. A. et al. Accountable Algorithms // U. Penn. Law Review. — 2017. — Vol. 165, No. 3. — P. 633–705.
  12. Langer M. et al. Explainability Auditing for Intelligent Systems. — arXiv:2108.07711, 2021. Available: https://arxiv.org/abs/2108.07711
  13. Lessmann S. et al. Benchmarking Credit Scoring Algorithms // Eur. J. Oper. Res. — 2015. — Vol. 247, No. 1. — P. 124–136. DOI: 10.1016/j.ejor.2015.05.030
  14. Mokander J. Auditing of AI: Legal, Ethical and Technical Approaches // Digital Society. — 2023. — Vol. 2. — P. 49. DOI: 10.1007/s44206-023-00074-y
  15. OCC. Comptroller's Handbook: Model Risk Management. — 2021. Available: https://www.occ.gov
  16. Raji I. D. et al. Closing the AI Accountability Gap // Proc. ACM FAccT. — 2020. DOI: 10.1145/3351095.3372873
  17. Sandvig C. et al. Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms // Proc. ICA Annual Conf. — 2014.
  18. Sculley D. et al. Hidden Technical Debt in Machine Learning Systems // Proc. NeurIPS. — 2015. — Vol. 28.
  19. Suhadolnik N., Da Silva S. Machine Learning for Enhanced Credit Risk Assessment // J. Risk Financial Manag. — 2023. — Vol. 16, No. 12. — P. 496. DOI: 10.3390/jrfm16120496
  20. Tabassi E. Artificial Intelligence Risk Management Framework (AI RMF 1.0). — NIST, 2023. DOI: 10.6028/NIST.AI.100-1
  21. Van den Heuvel E. Evolution of IT Auditing in a Nutshell // Maandblad voor Accountancy en Bedrijfseconomie. — 2025. — Vol. 99, No. 2. — P. 73–83. DOI: 10.5117/mab.99.140994
  22. Zhang C. A., Rezaee Z. Explainable AI (XAI) in Auditing // Int. J. Accounting Inf. Syst. — 2022. — Vol. 45. — P. 100559. DOI: 10.1016/j.accinf.2022.100559
Information about the author

Master's Student, School of IT and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty

