ROBUST EVALUATION OF MACHINE LEARNING MODELS FOR CREDIT CARD FRAUD DETECTION UNDER SEVERE CLASS IMBALANCE

Cite as:
Zhetpisbay N.N., Kuatbayeva A.A. ROBUST EVALUATION OF MACHINE LEARNING MODELS FOR CREDIT CARD FRAUD DETECTION UNDER SEVERE CLASS IMBALANCE // Universum: технические науки : electronic scientific journal. 2026. No. 2(143). URL: https://7universum.com/ru/tech/archive/item/22067 (accessed: 07.03.2026).
DOI: 10.32743/UniTech.2026.143.2.22067

 

ABSTRACT

Credit card fraud detection remains a challenging problem due to the extreme class imbalance of transaction data and the financial impact of undetected fraudulent activity. This study evaluates the performance and stability of several machine learning models under severe imbalance conditions using a dataset of 284,807 transactions, including 492 fraud cases. The models under investigation include Logistic Regression, Random Forest, XGBoost, resampling-based techniques, and a stacking ensemble. Model performance was assessed using stratified cross-validation and precision–recall oriented metrics. The results show that tree-based ensemble models demonstrate the most stable performance, achieving PR-AUC values above 0.84 with low variance across folds. While resampling methods increase recall, they also substantially raise the number of false positive detections. The stacking architecture does not consistently outperform standalone tree-based models in terms of F1-score and stability. The findings suggest that moderately complex ensemble methods provide a reliable trade-off between detection quality and operational cost in highly imbalanced fraud detection tasks.


 

Keywords: Credit card fraud detection, class imbalance, precision–recall curve, Random Forest, XGBoost, stacking ensemble, machine learning.


 

Introduction

The rapid development of digital payment technologies has significantly increased the volume and speed of financial transactions worldwide. At the same time, the expansion of online commerce and electronic banking has intensified the problem of fraudulent financial operations. Credit card fraud detection has therefore become one of the key tasks in financial risk management systems.

Fraud detection differs from standard classification problems due to several specific characteristics. First, fraudulent transactions usually represent only a very small fraction of the total dataset. In publicly available benchmarks, the proportion of fraud cases is often below 1%, which leads to severe class imbalance. Under such conditions, traditional performance metrics such as accuracy may be misleading, as a model can achieve high accuracy while failing to detect minority fraud cases. For highly skewed datasets, precision–recall based evaluation is considered more informative than ROC-based assessment [3].

Second, fraud patterns evolve over time. Attackers continuously modify their strategies in response to detection mechanisms. As a result, fraud detection models must generalize beyond previously observed cases and remain robust under distribution shifts. In addition, the operational environment imposes constraints on acceptable false positive rates, since incorrectly flagged legitimate transactions may negatively affect customer experience and increase investigation costs.

Machine learning approaches have been widely applied to credit card fraud detection. Linear models such as Logistic Regression offer interpretability and computational efficiency. Tree-based ensemble methods, including Random Forest and gradient boosting, have demonstrated strong performance in structured transaction data due to their ability to model nonlinear relationships [1], [6]. Random Forest, in particular, has shown consistent results in fraud-related tasks [1].

To address class imbalance, resampling techniques such as SMOTE have been proposed to artificially increase the representation of minority samples [4]. More recently, hybrid approaches and stacking ensembles have been introduced in an attempt to combine the strengths of multiple base learners [8]. However, the effectiveness of such complex architectures depends strongly on evaluation methodology and the characteristics of the dataset.

Despite the growing number of studies, the question remains whether increasing model complexity systematically leads to improved detection quality under severe imbalance. Many published works report results based on a single train–test split, which may not fully reflect model stability. Given the limited number of fraud cases, performance estimates may vary significantly across different data partitions.

The objective of this study is to conduct a controlled comparative analysis of several widely used machine learning models for credit card fraud detection and to evaluate their robustness using precision–recall oriented metrics and stratified cross-validation. The research focuses not on proposing a new architecture, but on assessing the stability and practical applicability of existing approaches under extreme class imbalance conditions.

Materials and methods


The experimental study was conducted using a publicly available credit card transaction dataset containing 284,807 observations. The dataset was originally released by the Machine Learning Group of the Université Libre de Bruxelles (ULB) in collaboration with Worldline and is publicly available via the Kaggle repository [1], [9]. The dataset is accessible at:

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud.

Among them, 492 transactions are labeled as fraudulent, which corresponds to approximately 0.17% of the total data. This results in an imbalance ratio of nearly 577:1 between legitimate and fraudulent transactions, representing a highly skewed classification problem.

The dataset includes 28 anonymized numerical features (V1–V28) obtained through principal component transformation in order to preserve confidentiality of sensitive transaction information. In addition, two original attributes are provided: transaction time and transaction amount. Although the principal components are already scaled, additional feature engineering was performed to incorporate potentially informative temporal and monetary characteristics. The variable Hour was derived from the transaction timestamp to capture possible daily patterns of fraudulent activity. The variable LogAmount was obtained using a logarithmic transformation log(1 + amount) in order to reduce right-skewness and stabilize variance. After preprocessing, the modeling dataset consisted of 30 explanatory variables and one binary target variable.
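The two engineered variables can be sketched as follows. This is a minimal illustration assuming the Kaggle column names `Time` (seconds elapsed since the first transaction) and `Amount`; the study's actual preprocessing code is not shown here.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the Hour and LogAmount variables described in the text."""
    out = df.copy()
    # Time is elapsed seconds; fold it into an hour-of-day index (0-23).
    out["Hour"] = (out["Time"] // 3600) % 24
    # log(1 + amount) compresses the right-skewed Amount distribution.
    out["LogAmount"] = np.log1p(out["Amount"])
    return out

demo = pd.DataFrame({"Time": [0.0, 90000.0], "Amount": [0.0, 149.62]})
demo = add_engineered_features(demo)
```

Together with the 28 principal components, these two derived variables yield the 30 explanatory variables used for modeling.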

The objective of the study was not to introduce a new algorithm but to evaluate the practical robustness of commonly used machine learning models under extreme imbalance conditions. Therefore, models were selected to represent increasing levels of complexity and structural diversity. Logistic Regression (LR) was used as a baseline linear classifier. It estimates the probability of fraud using a sigmoid transformation and provides interpretable coefficients, serving as a reference point for evaluating nonlinear ensemble models. Random Forest (RF), a bagging-based ensemble method, constructs multiple decision trees on bootstrap samples and aggregates their predictions, reducing variance and improving generalization stability [1]. XGBoost (XGB) is a gradient boosting algorithm that sequentially builds decision trees to minimize prediction error and incorporates regularization, making it suitable for large-scale tabular datasets [6]. To investigate the effect of artificial class balancing, Logistic Regression and XGBoost were combined with the SMOTETomek resampling technique [4], which increases minority class representation while attempting to remove overlapping samples. In addition, a two-level stacking architecture was implemented, where XGBoost served as the base learner and Logistic Regression was used as the meta-classifier [8].

Given the severe imbalance ratio, special measures were taken to ensure fair training and evaluation. Resampling was applied exclusively to training folds within cross-validation in order to prevent data leakage. Synthetic minority samples were generated using SMOTE, while Tomek links were removed to reduce class overlap [4]. Class weighting was also applied in Logistic Regression to assign higher penalty to fraud misclassification. No resampling was performed on validation or test subsets, ensuring unbiased performance estimation.

Two complementary validation procedures were employed: a single stratified 80/20 train–test split used to provide direct performance comparison, and stratified 3-fold cross-validation used to assess stability and variance across data partitions. Stratification ensured preservation of the original fraud ratio in each fold, and three folds were selected to maintain a sufficient number of fraud cases per validation subset.
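The leakage-safe protocol — resample inside each training fold only, then validate on untouched data — can be sketched as follows on synthetic data. Naive random oversampling stands in for SMOTETomek here, and the sample sizes, seeds, and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
# Synthetic stand-in for the transaction data: ~2% minority class.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.98, 0.02], random_state=42)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Resample ONLY the training fold (naive random oversampling as a
    # stand-in for SMOTETomek); the validation fold stays untouched.
    minority = np.where(y_tr == 1)[0]
    n_extra = len(y_tr) - 2 * len(minority)   # copies needed for a 1:1 balance
    extra = rng.choice(minority, size=n_extra, replace=True)
    X_bal = np.vstack([X_tr, X_tr[extra]])
    y_bal = np.concatenate([y_tr, y_tr[extra]])
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores.append(average_precision_score(
        y[val_idx], model.predict_proba(X[val_idx])[:, 1]))

mean_pr_auc = float(np.mean(scores))
```

The key point is that `val_idx` rows never pass through the resampler, so validation PR-AUC estimates remain unbiased by synthetic samples.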

Model performance was evaluated using precision, recall, F1-score, ROC-AUC, and PR-AUC. Given the severe class imbalance in the dataset, accuracy was not considered an informative metric and was therefore not reported. In rare-event classification problems, precision–recall based evaluation provides more meaningful insight than ROC-based assessment, as it focuses directly on minority class detection [3].

Precision measures the proportion of correctly identified fraudulent transactions among all transactions predicted as fraud:

Precision = TP / (TP + FP),

where TP denotes true positives and FP denotes false positives.

Recall (sensitivity) represents the proportion of correctly detected fraud cases among all actual fraudulent transactions:

Recall = TP / (TP + FN),

where FN denotes false negatives.

The F1-score is defined as the harmonic mean of precision and recall:

F1 = 2 · Precision · Recall / (Precision + Recall).

The ROC-AUC metric represents the area under the Receiver Operating Characteristic curve and evaluates the ranking quality of predicted probabilities across all possible thresholds. However, in highly imbalanced settings, ROC-AUC may overestimate performance due to the dominance of the majority class.

Therefore, the primary evaluation metric in this study is PR-AUC, defined as the area under the precision–recall curve:

PR-AUC = ∫₀¹ P(r) dr,

where P(r) denotes precision expressed as a function of recall r.

PR-AUC captures the trade-off between fraud detection capability and false positive control and is particularly suitable for highly skewed transaction datasets.
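Concretely, the metrics above map onto standard scikit-learn calls. The toy labels and scores below are illustrative; `average_precision_score` is scikit-learn's usual estimator of the area under the precision–recall curve.

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy validation labels and predicted fraud probabilities (illustrative).
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.20, 0.15, 0.30, 0.70, 0.05, 0.90, 0.80, 0.40, 0.95]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]    # default 0.5 threshold

precision = precision_score(y_true, y_pred)          # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)             # TP / (TP + FN)
f1        = f1_score(y_true, y_pred)                 # harmonic mean of P and R
roc_auc   = roc_auc_score(y_true, y_score)           # threshold-free ranking
pr_auc    = average_precision_score(y_true, y_score) # PR-AUC (average precision)
```

Note that the threshold-free metrics (ROC-AUC, PR-AUC) consume the raw scores, while precision, recall, and F1 depend on the chosen decision threshold.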

To improve comparability, decision thresholds were optimized individually for each model based on the maximum F1-score obtained from validation data. All experiments were implemented in Python using widely adopted machine learning libraries. Random seeds were fixed to ensure reproducibility. Identical preprocessing steps and validation splits were applied to all models to guarantee a fair and consistent comparison framework.
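Per-model threshold selection by maximal validation F1 can be sketched as follows; the helper name and the toy data are illustrative assumptions, not the study's code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Pick the decision threshold that maximizes F1 on validation scores."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall carry one more entry than thresholds; drop the sentinel.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Illustrative validation scores; a threshold of 0.65 separates the classes
# perfectly here (predictions use score >= threshold).
y_true  = [0, 0, 0, 1, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]
thr, best_f1 = best_f1_threshold(y_true, y_score)
```

Tuning the threshold on validation data (never on the test set) keeps the comparison fair while letting each model operate at its own best F1 point.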

Results and Discussion

The evaluation was first conducted using a stratified 80/20 train–test split. The test subset preserved the original class distribution and contained approximately 98 fraudulent transactions. The results demonstrate clear structural differences between model categories.

Table 1.

Performance of evaluated models on stratified 80/20 split

| Model | ROC-AUC | PR-AUC | Precision | Recall | F1 | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| Random Forest | 0.9618 | 0.8642 | 0.9425 | 0.8367 | 0.8865 | 82 | 5 | 56859 | 16 |
| XGBoost | 0.9760 | 0.8635 | 0.9080 | 0.8061 | 0.8541 | 79 | 8 | 56856 | 19 |
| Stacking | 0.9839 | 0.8200 | 0.5903 | 0.8673 | 0.7025 | 85 | 59 | 56805 | 13 |
| XGB + SMOTETomek | 0.9839 | 0.8200 | 0.5089 | 0.8776 | 0.6442 | 86 | 83 | 56781 | 12 |
| Logistic Regression | 0.9605 | 0.7379 | 0.7327 | 0.7551 | 0.7437 | 74 | 27 | 56837 | 24 |
| LR + SMOTETomek | 0.9761 | 0.7377 | 0.6434 | 0.8469 | 0.7313 | 83 | 46 | 56818 | 15 |

 

Table 1 demonstrates that Random Forest achieved the highest F1-score and maintained the lowest number of false positive detections. Although stacking achieved the highest ROC-AUC, its precision–recall performance was inferior compared to standalone tree-based models.

Among all evaluated approaches, Random Forest achieved the highest F1-score (0.886) while maintaining very high precision (0.943) and low false positive count (5). The PR-AUC value reached 0.864, indicating a strong balance between fraud detection capability and false alarm control. These results confirm the robustness of bagging-based ensembles under extreme imbalance conditions.

XGBoost achieved comparable PR-AUC (0.864) and slightly higher ROC-AUC (0.976). However, its recall (0.806) and F1-score (0.854) were marginally lower than those of Random Forest. While boosting improves error correction, it appears to introduce slightly higher sensitivity to minority distribution compared to bagging.

Logistic Regression demonstrated noticeably weaker performance. Although its recall reached 0.755, the PR-AUC value (0.738) and F1-score (0.744) indicate limited ability to model nonlinear transaction patterns. This confirms that linear decision boundaries are insufficient for capturing complex fraud relationships.

When resampling techniques were applied, recall increased substantially. For example, XGBoost combined with SMOTETomek achieved recall above 0.87. However, this improvement was accompanied by a dramatic increase in false positives (83), resulting in reduced precision (0.509) and lower F1-score (0.644). Similar behavior was observed for Logistic Regression with resampling. These findings suggest that synthetic oversampling shifts the decision boundary toward aggressive fraud detection at the expense of operational stability.

The stacking ensemble achieved the highest ROC-AUC (0.984), but this did not translate into superior precision–recall performance. While recall reached 0.867, precision decreased to 0.590, and the number of false positives increased to 59. Consequently, the F1-score (0.702) remained significantly below that of standalone tree-based models. This result illustrates that high ROC-AUC does not necessarily imply strong performance under severe class imbalance.

To assess robustness, stratified 3-fold cross-validation was performed. Random Forest exhibited the most stable behavior across folds, with PR-AUC mean of 0.853 and very low F1-score variance (0.007). XGBoost maintained competitive mean PR-AUC (0.850), although with slightly higher variability (std = 0.026 for F1). These findings indicate consistent generalization of tree-based ensembles.

In contrast, resampling-based models showed substantial instability. Although their mean PR-AUC values remained moderate (0.75–0.79), the average F1-score dropped dramatically (approximately 0.22–0.26), reflecting inconsistent threshold behavior across folds.

The stacking architecture demonstrated the most unstable behavior. Under the default threshold of 0.5, the model failed to detect minority instances in at least one validation fold, resulting in a mean F1-score of 0.000. This indicates extreme sensitivity to minority allocation within training partitions and highlights a potential limitation of multi-level ensemble architectures when the minority class size is very small.

Table 2.

Cross-validation performance (mean ± std)

| Model | PR-AUC (mean ± std) | F1 (mean ± std) |
|---|---|---|
| Random Forest | 0.8527 ± 0.0238 | 0.8429 ± 0.0072 |
| XGBoost | 0.8495 ± 0.0328 | 0.8153 ± 0.0263 |
| Logistic Regression | 0.7561 ± 0.0212 | 0.7186 ± 0.0536 |
| XGB + SMOTETomek | 0.7857 ± 0.0367 | 0.2177 ± 0.0251 |
| LR + SMOTETomek | 0.7527 ± 0.0101 | 0.2572 ± 0.0181 |
| Stacking | 0.7857 ± 0.0367 | 0.0000 ± 0.0000 |

 

Cross-validation results presented in Table 2 confirm that Random Forest exhibits the most stable performance across folds, while stacking and aggressive resampling demonstrate substantial instability under default threshold conditions.

Under severe imbalance conditions, the precision–recall trade-off becomes the most relevant evaluation perspective. Tree-based ensemble models demonstrated smoother precision decay as recall increased, maintaining acceptable false positive rates. In contrast, resampling and stacking approaches tended to increase recall through aggressive boundary shifts, leading to rapid precision deterioration.

Overall, the results demonstrate that increasing architectural complexity does not guarantee improved fraud detection performance. While boosting and stacking may improve ROC-based metrics, moderately complex bagging ensembles provide superior stability and operationally balanced performance under extreme class imbalance. To further illustrate the precision–recall trade-off between models, Figure 1 presents the precision–recall curves obtained on the single stratified split.

 

Figure 1. Precision–Recall curves for evaluated fraud detection models

 

Conclusion

The present study investigated whether increasing model complexity leads to improved fraud detection performance under severe class imbalance conditions. A controlled comparative analysis was conducted using linear, bagging-based, boosting-based, resampling-enhanced, and stacking architectures.

The results demonstrate that tree-based ensemble methods, particularly Random Forest, provide the most stable and operationally balanced performance. Random Forest achieved the highest F1-score in single split evaluation and maintained the lowest variance across cross-validation folds. XGBoost also demonstrated strong predictive capability, although with slightly higher variability.

Resampling techniques increased recall but introduced substantial instability and a sharp rise in false positive detections. The stacking architecture exhibited the highest sensitivity to minority class distribution and, under cross-validation, failed to consistently detect fraud instances at the default threshold.

These findings indicate that increasing architectural complexity does not guarantee improved generalization in highly imbalanced fraud detection tasks. Instead, model robustness and stability across data partitions appear to be more critical than peak ROC-based performance.

In practical financial systems, moderately complex ensemble methods that maintain stable precision–recall trade-offs and controlled false positive rates may be preferable to more complex hybrid architectures.

 

References:

  1. Dal Pozzolo A., Caelen O., Johnson R., Bontempi G. Calibrating probability with undersampling for unbalanced classification. // IEEE Symposium Series on Computational Intelligence. – 2015. – P. 159–166.
  2. Dal Pozzolo A., Boracchi G., Caelen O., Alippi C., Bontempi G. Credit card fraud detection and concept-drift adaptation with delayed supervised information. // IEEE International Joint Conference on Neural Networks. – 2015. – P. 1–8.
  3. Saito T., Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. // PLOS ONE. – 2015. – Vol. 10 (3). – e0118432.
  4. Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: Synthetic Minority Over-sampling Technique. // Journal of Artificial Intelligence Research. – 2002. – Vol. 16. – P. 321–357.
  5. Haixiang G., Yijing L., Shang J., Mingyun G., Yuanyue H., Bing G. Learning from class-imbalanced data: Review of methods and applications. // Expert Systems with Applications. – 2017. – Vol. 73. – P. 220–239.
  6. Chen T., Guestrin C. XGBoost: A scalable tree boosting system. // Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. – 2016. – P. 785–794.
  7. Bahnsen A. C., Villegas S., Aouada D., Ottersten B. Fraud detection by stacking cost-sensitive decision trees. // IEEE International Conference on Data Science and Advanced Analytics. – 2015. – P. 1–10.
  8. Carcillo F., Dal Pozzolo A., Bontempi G., Le Borgne Y. A., Caelen O. SCARFF: A scalable framework for streaming credit card fraud detection with Spark. // Information Fusion. – 2018. – Vol. 41. – P. 182–194.
  9. Machine Learning Group (ULB) & Worldline. (2015). Credit Card Fraud Detection Dataset. Kaggle Repository. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
Information about the authors

Master’s Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty


PhD in Computer Science, Assistant Professor, School of AI and Data Science, Astana IT University, Kazakhstan, Astana

