AUDITING ML SYSTEMS BEYOND ACCURACY: A SCENARIO-BASED BEHAVIORAL TESTING APPROACH

Ait A.M.
Cite as:
Ait A.M. AUDITING ML SYSTEMS BEYOND ACCURACY: A SCENARIO-BASED BEHAVIORAL TESTING APPROACH // Universum: технические науки: electronic scientific journal. 2026. 4(145). URL: https://7universum.com/ru/tech/archive/item/22430 (accessed: 07.05.2026).
DOI - 10.32743/UniTech.2026.145.4.22430
Received: 02.04.2026
Accepted for publication: 14.04.2026
Published: 28.04.2026

 

ABSTRACT

Machine learning (ML) systems drive consequential decisions in credit scoring, healthcare, and hiring, yet existing IT governance frameworks were not designed for them. This paper examines whether five widely adopted frameworks — COBIT, ITIL, ISO/IEC 27001, ISO/IEC 42001, and NIST AI RMF — address the behavioral risks that ML systems introduce. A scenario-based behavioral testing approach was developed and validated on the UCI German Credit Dataset using three classifiers. All five frameworks lack concrete testing procedures for behavioral stability. Accuracy on borderline applicants dropped 14–17 percentage points below normal-condition accuracy. Flip rates of 3.0–3.3% and statistically significant data drift (D=0.130, p=0.013) remained invisible to accuracy-based audit checks.


 

Keywords: ML auditing, behavior-based audit, IT audit frameworks, data drift, scenario testing.


 

Introduction

Banks use ML to approve or deny credit applications; hospitals use it to rank patients by urgency; employers use it to filter job applicants [8]. Around 88% of organizations have deployed ML in at least one core business function [19, p. 496]. Unlike conventional software, ML systems derive their decision logic from historical data — logic that is not directly inspectable and that escapes controls designed for deterministic systems. IT auditors currently rely on five main frameworks: COBIT (IT governance), ITIL (service operations), ISO/IEC 27001 (information security), ISO/IEC 42001 — the first international AI management standard, published in 2023 [9] — and NIST AI RMF, which provides a taxonomy of AI-related risks [20]. All five check whether documented processes are followed. None requires testing how an ML model actually behaves in production [14, 18].

The practical scale of this problem is well documented. The error rate on borderline credit applicants can be nearly twice the overall error rate [4, p. 72], and aggregate accuracy conceals subgroup failures by design [3]. Data quality failures and monitoring gaps are the most common sources of ML system issues [2]. The EU AI Act, in force since August 2024, requires conformity assessments for high-risk AI, including credit scoring [6]. SR 11-7 requires model outputs to stay consistent under plausible input variation [7]. Neither requirement has been translated into a concrete audit procedure, creating a growing compliance exposure [10, 11, p. 634]. Research confirms that demonstrating correct behavior across conditions is more meaningful than documenting compliance — yet no empirically validated, governance-aligned testing procedure exists [21, p. 73].

Three bodies of literature converge on this gap: IT governance standards [15] provide process structure but not behavioral testing; explainability research [22, 12] cannot substitute for systematic testing because XAI explanations are unstable across retraining cycles [4, p. 71]; and the algorithmic accountability literature [16, 1, 17] establishes the correct conceptual foundation — testing observable outputs — but lacks empirical validation on real data. This paper makes three contributions: (1) a structured gap analysis of five frameworks identifies where each falls short; (2) a scenario-based approach with a model-agnostic edge-case definition avoids the circular-logic problem of prior proposals [16, 1]; (3) the approach is validated on the UCI German Credit Dataset with sensitivity analysis [5, 13, p. 124].

Materials and Methods

Framework Gap Analysis. Five frameworks were analyzed against four ML-specific audit requirements: (1) testing output consistency under small input variation; (2) evaluation of decision quality on uncertain cases; (3) monitoring of input feature distributions over time; and (4) lifecycle controls for model retraining and post-deployment re-evaluation. A gap was recorded wherever no applicable procedure existed. COBIT 2019's DSS01 domain frames all quality concerns in terms of service availability with no control for model output stability; ISO/IEC 42001 references performance monitoring (Clause 9.1) but does not define what this means operationally for a classifier; the NIST AI RMF defers test design to the implementing organization; SR 11-7 and the OCC handbook [7, 15] — the most operationally specific financial-sector documents — were designed for parametric models and do not address ML-specific technical debt [18, 2].

Dataset and Models. The experiment used the UCI Statlog German Credit Dataset [5] — 1,000 applicants described by 20 features and a binary credit risk label (70% good, 30% bad) — split 70/30 into training (700) and test (300) sets, stratified by class [13, p. 125]. Three model families were tested: Random Forest (100 estimators), Logistic Regression (standardized inputs), and Decision Tree (max depth 5). All experiments ran on Python 3.11 with scikit-learn 1.4, NumPy 1.26, pandas 2.2, and SciPy 1.12, with the random seed fixed at 42.
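This setup can be reproduced in a few lines. The sketch below is illustrative rather than the study's exact code: the fetch_openml("credit-g") source (a public OpenML mirror of the UCI dataset) and the one-hot encoding of categorical features are assumptions, since the text fixes the library versions and the seed but not the loading and preprocessing steps.

```python
# Illustrative setup sketch -- assumptions: OpenML "credit-g" mirror of the
# UCI Statlog German Credit data and one-hot encoding of categoricals.
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# 1,000 applicants, 20 features, binary good/bad label (70%/30%).
data = fetch_openml("credit-g", version=1, as_frame=True, parser="auto")
X = pd.get_dummies(data.data)            # one-hot encode categorical features
y = (data.target == "good").astype(int)  # 1 = good credit risk

# Stratified 70/30 split: 700 training and 300 test records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=SEED)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=SEED),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=SEED)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=SEED),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: normal-scenario accuracy = {model.score(X_test, y_test):.3f}")
```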

Behavioral Scenario Testing. Three scenario types were applied. Normal: all 300 test records evaluated as-is, reproducing existing framework-based audit output. Edge: borderline cases identified by a majority-vote criterion — a record was classified as edge if at least two of three models assigned an approval probability in [0.40, 0.60]; this avoids the circular dependency of prior proposals [16, 1]. Sixty-six records (22%) met this criterion. Stress: all 300 records re-evaluated after small perturbations to credit amount (±5%) and loan duration (±10%), reflecting documented data-entry variation [18, 2]. The flip rate (share of decisions changed under perturbation) is the key metric; SR 11-7 requires output consistency under plausible input changes [7].
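A minimal sketch of the edge and stress scenarios follows, assuming the fitted models and test split from the setup sketch above; the numeric column names credit_amount and duration follow the OpenML mirror and are assumptions.

```python
# Edge and stress scenarios, assuming `models`, X_test, y_test from the
# setup sketch; column names "credit_amount" and "duration" are assumptions.
import numpy as np

# Edge: a record is borderline if at least two of the three models assign
# it an approval probability inside [0.40, 0.60] (majority vote).
probs = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in models.values()])
edge = ((probs >= 0.40) & (probs <= 0.60)).sum(axis=1) >= 2
print(f"edge records: {edge.sum()} of {len(X_test)}")

for name, m in models.items():
    print(f"{name}: edge accuracy = {m.score(X_test[edge], y_test[edge]):.3f}")

# Stress: re-evaluate after small perturbations to credit amount (+/-5%)
# and loan duration (+/-10%); the flip rate is the share of changed decisions.
rng = np.random.default_rng(42)
X_stress = X_test.copy()
X_stress["credit_amount"] = X_stress["credit_amount"] * (
    1 + rng.uniform(-0.05, 0.05, len(X_stress)))
X_stress["duration"] = X_stress["duration"] * (
    1 + rng.uniform(-0.10, 0.10, len(X_stress)))

for name, m in models.items():
    flip_rate = (m.predict(X_test) != m.predict(X_stress)).mean()
    print(f"{name}: flip rate = {flip_rate:.2%}")
```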

Data Drift and Sensitivity Analysis. Two drifted test populations were generated: moderate (+20% credit amount, σ=3% noise, +15% employment) and strong (+40%/+30%), reflecting European household credit inflation trends [6]. Labels were unchanged. Distributional shift was assessed with a two-sample KS test. Sensitivity analysis was run across three edge thresholds ([0.45,0.55], [0.40,0.60], [0.35,0.65]) and three perturbation levels (low ±2%/±5%, main ±5%/±10%, high ±10%/±20%). Four metrics: accuracy ((TP+TN)/N), approval rate, flip rate, and drift impact (Δacc).
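The drift construction and its two-level check can be sketched as follows, under the same assumptions as above; only the credit-amount component is shown, since the employment shift depends on a feature encoding the text does not specify.

```python
# Drift scenarios and the two-level check, assuming `models`, X_test, y_test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def make_drift(X, shift, noise_sigma=0.03):
    """Scale credit amount by (1 + shift) and add multiplicative noise."""
    X_d = X.copy()
    X_d["credit_amount"] = (X_d["credit_amount"] * (1 + shift)
                            * (1 + rng.normal(0, noise_sigma, len(X_d))))
    return X_d

X_moderate = make_drift(X_test, 0.20)   # +20% credit amount
X_strong = make_drift(X_test, 0.40)     # +40% credit amount

# Feature-level check: a two-sample KS test flags the shift directly.
res = ks_2samp(X_test["credit_amount"], X_moderate["credit_amount"])
print(f"KS on credit amount (moderate): D={res.statistic:.3f}, p={res.pvalue:.3f}")

# Accuracy-level check: drift impact = accuracy change (labels unchanged).
for name, m in models.items():
    delta = m.score(X_moderate, y_test) - m.score(X_test, y_test)
    print(f"{name}: drift impact (moderate) = {delta:+.3f}")
```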

Results and Discussion

Framework Gap Analysis. Four structural gaps appeared consistently across all five frameworks: no procedure for detecting behavioral change after retraining; no requirement to evaluate decision quality on uncertain cases; no mandate for input distribution monitoring; no threshold to trigger model re-evaluation. ISO/IEC 42001 comes closest — requiring impact assessments and lifecycle planning — but its controls are organizational, not operational. An organization fully compliant with all five frameworks could deploy a model on a drifted population with no governance mechanism to detect the problem.

Scenario-Based Testing Results. Table 1 and Figure 1 show accuracy across all three scenarios. All three models degraded substantially on edge cases. Decision Tree fell to 54.5% — below the 70% majority-class baseline, meaning it performed worse on borderline applicants than a trivial 'always approve' rule. Random Forest and Logistic Regression dropped 16.6 and 14.1 percentage points respectively [21, p. 74]. This degradation was consistent across all threshold configurations in sensitivity analysis: even at the narrowest threshold (n=28), every model remained below normal-condition accuracy.

Table 1.

Multi-Model Comparison Across Audit Scenarios

| Model | Normal Acc. | Edge Acc. | Drop (pp) | Stress Acc. | Flip Rate | Appr. Rate |
|---|---|---|---|---|---|---|
| Random Forest | 0.757 | 0.591 | 16.6 | 0.760 | 3.00% | 81.7% |
| Logistic Regression | 0.777 | 0.636 | 14.1 | 0.777 | 0.00% | 76.3% |
| Decision Tree | 0.697 | 0.545 | 15.2 | 0.710 | 3.33% | 59.7% |

Drop (pp) = Normal − Edge accuracy in percentage points. Appr. Rate = share of applicants approved.

 

Figure 1. Accuracy by model and scenario

Edge accuracy falls below the 70% majority-class baseline for all three models. Normal and Stress results are comparable, confirming that standard audit metrics miss the edge-case degradation.

 

Logistic Regression produced a 0.00% flip rate across all perturbation magnitudes: its linear decision boundary does not cross under the tested perturbations. Random Forest and Decision Tree both exceeded 3.0%, meaning 9–10 applicants in every 300 received a different credit decision based solely on minor data variation. At 5,000 applications per month, this translates to 150–170 inconsistent outcomes — a material SR 11-7 finding [7]. Approval rates also differed sharply: Random Forest approved 81.7% versus 59.7% for Decision Tree, a 22 pp gap no existing framework would flag.

Data Drift Results. Table 2 and Figure 2 show accuracy under original and drifted populations. Accuracy changes are negligible even under strong drift (at most −2.0 pp for Decision Tree). Yet a KS test on credit amount found the moderate drift statistically significant (D=0.130, p=0.013). The proposed distribution-monitoring step would flag this for governance review; an accuracy-only audit would not. Sensitivity analysis confirmed robustness: RF and DT flip rates remained non-zero at every perturbation level (reaching 5.33% and 4.00% at high perturbation).

Table 2.

Model Accuracy Under Simulated Data Drift

| Model | Original Acc. | Moderate Drift | ΔM | Strong Drift | ΔS |
|---|---|---|---|---|---|
| Random Forest | 0.757 | 0.757 | 0.000 | 0.753 | −0.003 |
| Logistic Regression | 0.777 | 0.777 | 0.000 | 0.780 | +0.003 |
| Decision Tree | 0.697 | 0.683 | −0.013 | 0.677 | −0.020 |

ΔM / ΔS = accuracy change under moderate / strong drift. KS test (credit amount): D=0.130, p=0.013 under moderate drift.

 

Figure 2. Credit amount distributions under original, moderate (+20%), and strong (+40%) drift.

Dashed lines show distribution means. The shift is statistically significant yet invisible to accuracy-based monitoring across all models.

 

Discussion. Under any of the five frameworks, all three models would pass a compliance audit. The behavioral approach exposed findings compliance auditing cannot produce: accuracy gaps of 14–17 pp on the most consequential applicants, flip rates translating to hundreds of inconsistent decisions per month, a 22 pp spread in approval rates, and a confirmed distributional shift invisible to accuracy monitoring. These results give empirical weight to Raji et al.'s [16] argument that accountability requires testing observable outputs, and to Bucker et al.'s [4, p. 71] finding that XAI is insufficient for ML auditability. Table 3 compares the proposed approach against existing alternatives.

Table 3.

Comparison of ML Audit Approaches

| Approach | No Model Internals | Pass/Fail Finding | Behavioral Stability | Model-Agnostic | Regulatory Reference |
|---|---|---|---|---|---|
| Compliance audit (COBIT/ITIL) | Yes | Yes | No | Yes | Yes |
| XAI review (SHAP/LIME) | Partial* | No | No | No | No |
| Ad-hoc monitoring (PSI/KS) | Yes | No | Partial | Yes | No |
| Proposed approach | Yes | Yes | Yes | Yes | Partial** |

* LIME is black-box; tree-specific SHAP requires model internals. ** SR 11-7 output consistency (flip rate) and EU AI Act ongoing behavioral monitoring [6] are partially addressed.

 

The proposed approach is the only one that covers, at least partially, all five audit properties. An auditor applies it in four steps: (1) identify borderline applicants using the majority-vote criterion (requires only predicted probabilities); (2) document any edge accuracy gap exceeding 10 pp; (3) report any non-zero flip rate as a consistency finding under SR 11-7; (4) trigger re-validation if a KS test on key features is significant at p < 0.05. None of these steps requires model internals — they can be performed with access only to the prediction API, as the sketch below illustrates. Limitations: a single dataset and three classic model families; gradient boosting and neural networks were not tested.
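A compact sketch of this four-step check, assuming only black-box access to each model's predicted approval probabilities; the function and argument names are illustrative, and the thresholds are those stated above.

```python
# Four-step behavioral audit sketch -- black-box access only; names and
# signatures are illustrative, not the paper's reference implementation.
import numpy as np
from scipy.stats import ks_2samp

def behavioral_audit(prob_fns, X, X_perturbed, y, feat_ref, feat_cur,
                     band=(0.40, 0.60)):
    findings = []
    y = np.asarray(y)

    # Step 1: borderline records by majority vote over model probabilities.
    probs = np.column_stack([f(X) for f in prob_fns])
    edge = ((probs >= band[0]) & (probs <= band[1])).sum(axis=1) >= 2

    for i, f in enumerate(prob_fns):
        pred = f(X) >= 0.5
        # Step 2: document any edge accuracy gap exceeding 10 pp.
        if edge.any():
            gap = (pred == y).mean() - (pred[edge] == y[edge]).mean()
            if gap > 0.10:
                findings.append(f"model {i}: edge accuracy gap {gap * 100:.1f} pp")
        # Step 3: any non-zero flip rate is an SR 11-7 consistency finding.
        flips = (pred != (f(X_perturbed) >= 0.5)).mean()
        if flips > 0:
            findings.append(f"model {i}: flip rate {flips:.2%}")

    # Step 4: a significant KS test on a key feature triggers re-validation.
    res = ks_2samp(feat_ref, feat_cur)
    if res.pvalue < 0.05:
        findings.append(f"drift: D={res.statistic:.3f}, p={res.pvalue:.3f}")
    return findings

# Example call with objects from the earlier sketches:
# behavioral_audit([lambda Z, m=m: m.predict_proba(Z)[:, 1]
#                   for m in models.values()],
#                  X_test, X_stress, y_test,
#                  X_test["credit_amount"], X_moderate["credit_amount"])
```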

Conclusion

This paper demonstrated that standard IT governance frameworks — COBIT, ITIL, ISO/IEC 27001, ISO/IEC 42001, and NIST AI RMF — do not detect the behavioral risks that ML systems carry. Three findings stand out. First, all five frameworks lack concrete testing procedures for behavioral stability, borderline-case quality, and input distribution change. Second, accuracy on borderline applicants fell 14–17 percentage points below normal-condition accuracy across all three models and all tested configurations — invisible to any compliance audit. Third, a statistically significant distributional shift (p=0.013) was detectable at the feature level but produced negligible accuracy change, confirming that accuracy is a lagging and unreliable audit indicator.

The scenario-based testing procedure is executable, model-agnostic, and requires no model internals. A flip rate of 3% means one in thirty-three applicants receives a different credit decision depending on how their data was recorded — a finding interpretable by non-technical auditors. Embedding this procedure as a required annual control within ISO/IEC 42001 Clause 9.1 or SR 11-7's ongoing validation requirements would give institutions a concrete compliance path that does not currently exist. Open questions include: what flip rate threshold should trigger a formal finding under the EU AI Act, and can the approach be extended to the gradient boosting and neural network models that dominate production scoring [17]?

 

References:

  1. Adler P. et al. Auditing Black-Box Models for Indirect Influence // Proc. IEEE ICDM. — 2016. DOI: 10.1109/ICDM.2016.0011
  2. Amershi S. et al. Software Engineering for Machine Learning: A Case Study // Proc. ICSE-SEIP. — 2019. DOI: 10.1109/ICSE-SEIP.2019.00042
  3. Barocas S., Hardt M., Narayanan A. Fairness and Machine Learning. — fairmlbook.org, 2019.
  4. Bucker M. et al. Transparency, Auditability, and Explainability of ML Models in Credit Scoring // J. Oper. Res. Soc. — 2022. — Vol. 73, No. 1. — P. 70–90. DOI: 10.1080/01605682.2021.1922098
  5. Dua D., Graff C. UCI ML Repository: Statlog (German Credit Data). — Univ. of California, Irvine, 2019. Available: https://archive.ics.uci.edu/dataset/144
  6. European Union. Regulation (EU) on Artificial Intelligence (AI Act). — 2024. Available: https://artificialintelligenceact.eu
  7. Federal Reserve. Supervisory Guidance on Model Risk Management (SR 11-7). — 2011. Available: https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm
  8. Financial Stability Board. Artificial Intelligence and Machine Learning in Financial Services. FSB Report. — 2017. Available: https://www.fsb.org/wp-content/uploads/P011117.pdf
  9. ISO/IEC. ISO/IEC 42001:2023 — Artificial Intelligence — Management System. — International Organization for Standardization, Geneva, 2023.
  10. Koshiyama A. et al. Towards Algorithm Auditing // Royal Soc. Open Science. — 2024. — Vol. 11, No. 5. — P. 230859. DOI: 10.1098/rsos.230859
  11. Kroll J. A. et al. Accountable Algorithms // U. Penn. Law Review. — 2017. — Vol. 165, No. 3. — P. 633–705.
  12. Langer M. et al. Explainability Auditing for Intelligent Systems. — arXiv:2108.07711, 2021. Available: https://arxiv.org/abs/2108.07711
  13. Lessmann S. et al. Benchmarking Credit Scoring Algorithms // Eur. J. Oper. Res. — 2015. — Vol. 247, No. 1. — P. 124–136. DOI: 10.1016/j.ejor.2015.05.030
  14. Mokander J. Auditing of AI: Legal, Ethical and Technical Approaches // Digital Society. — 2023. — Vol. 2. — P. 49. DOI: 10.1007/s44206-023-00074-y
  15. OCC. Comptroller's Handbook: Model Risk Management. — 2021. Available: https://www.occ.gov
  16. Raji I. D. et al. Closing the AI Accountability Gap // Proc. ACM FAccT. — 2020. DOI: 10.1145/3351095.3372873
  17. Sandvig C. et al. Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms // Proc. ICA Annual Conf. — 2014.
  18. Sculley D. et al. Hidden Technical Debt in Machine Learning Systems // Proc. NeurIPS. — 2015. — Vol. 28.
  19. Suhadolnik N., Da Silva S. Machine Learning for Enhanced Credit Risk Assessment // J. Risk Financial Manag. — 2023. — Vol. 16, No. 12. — P. 496. DOI: 10.3390/jrfm16120496
  20. Tabassi E. Artificial Intelligence Risk Management Framework (AI RMF 1.0). — NIST, 2023. DOI: 10.6028/NIST.AI.100-1
  21. Van den Heuvel E. Evolution of IT Auditing in a Nutshell // Maandblad voor Accountancy en Bedrijfseconomie. — 2025. — Vol. 99, No. 2. — P. 73–83. DOI: 10.5117/mab.99.140994
  22. Zhang C. A., Rezaee Z. Explainable AI (XAI) in Auditing // Int. J. Accounting Inf. Syst. — 2022. — Vol. 45. — P. 100559. DOI: 10.1016/j.accinf.2022.100559
Information about the author

Master's Student, School of IT and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty

