HYBRID ENSEMBLE APPROACH FOR SUSPICIOUS FINANCIAL TRANSACTION DETECTION USING GRAPH EMBEDDINGS

ГИБРИДНЫЙ АНСАМБЛЕВЫЙ МЕТОД ОБНАРУЖЕНИЯ ПОДОЗРИТЕЛЬНЫХ ФИНАНСОВЫХ ТРАНЗАКЦИЙ С ИСПОЛЬЗОВАНИЕМ ГРАФ-ЭМБЕДДИНГОВ

Rakymbekova Y.

5(146)

10. Informatics, computer engineering and control

Cite as:

Rakymbekova Y. HYBRID ENSEMBLE APPROACH FOR SUSPICIOUS FINANCIAL TRANSACTION DETECTION USING GRAPH EMBEDDINGS // Universum: технические науки : electronic scientific journal. 2026. 5(146). URL: https://7universum.com/ru/tech/archive/item/22813 (accessed: 30.07.2026).

DOI - 10.32743/UniTech.2026.146.5.22813

Received: 14.05.2026

Accepted for publication: 19.05.2026

Published: 28.05.2026

UDC 004.852

ABSTRACT

Automated detection of suspicious financial transactions within national financial monitoring systems is constrained by extreme class imbalance: in the dataset examined, fewer than 0.2% of records carry a positive label, a 1:563 ratio of suspicious to normal operations. This study develops a hybrid ensemble framework trained on 175,683 real transactions from the Kazakhstani FM1 financial monitoring register for January–December 2025. Five heterogeneous model families are combined: CatBoost, LightGBM, TabNet, Feature Tokenizer Transformer, and Isolation Forest. A directed weighted interbank graph (5,475 nodes, 10,721 edges) is constructed, and 64-dimensional Node2Vec embeddings are extracted from it and concatenated with the tabular features. After removing three datetime columns that cannot be represented as numeric inputs, the final training matrix contains 299 columns (23 categorical, 276 numeric). Predictions are unified through Platt calibration and Bayesian weight optimisation via Optuna over 500 trials. On the chronological test set the ensemble achieves PR-AUC 0.582 (95% CI [0.412, 0.735]) and a ranking-based Recall@5% of 0.984, with 37 of 38 suspicious transactions ranked in the daily top 5% of volume. Walk-forward monthly retraining over five months yields mean PR-AUC 0.535 ± 0.103, with inter-month variation reflecting the small count of monthly positives (11–45) rather than model degradation. An ablation study shows that adding Node2Vec embeddings raises PR-AUC from 0.005 to 0.277, a 54-fold gain over the configuration with temporal features and rolling aggregates alone.

АННОТАЦИЯ

Автоматизированное обнаружение подозрительных финансовых операций в системах национального финансового мониторинга осложняется экстремальным дисбалансом классов: в исследуемом наборе данных менее 0,2% записей имеют положительную метку, что соответствует соотношению 1:563. В работе предложен гибридный ансамблевый метод, обученный на 175 683 реальных транзакциях казахстанского реестра финансового мониторинга FM1 за январь–декабрь 2025 года. Объединены пять разнородных семейств моделей: CatBoost, LightGBM, TabNet, Feature Tokenizer Transformer и Isolation Forest. На основе обучающей выборки построен ориентированный взвешенный граф межбанковских транзакций (5 475 узлов, 10 721 рёбер), из которого извлечены 64-мерные Node2Vec-эмбеддинги. После исключения трёх столбцов с датами итоговая матрица признаков содержит 299 столбцов. Предсказания объединяются через калибровку Платта и байесовскую оптимизацию весов посредством Optuna за 500 испытаний. На тестовой выборке ансамбль достигает PR-AUC 0,582 (95% ДИ [0,412; 0,735]); по ранговой метрике Recall@5% обнаруживается 37 из 38 подозрительных транзакций при просмотре аналитиком 5% дневного объёма. Скользящее ежемесячное переобучение за пять месяцев даёт среднее PR-AUC 0,535 ± 0,103; межмесячная вариация отражает малое число позитивных примеров (11–45 в месяц), а не нестабильность модели. Граф-эмбеддинги обеспечивают 54-кратный прирост PR-AUC по сравнению со скользящими агрегатами без сетевых признаков.

Keywords: financial transaction monitoring, anomaly detection, Node2Vec, ensemble learning, gradient boosting, class imbalance, PR-AUC.

Ключевые слова: финансовый мониторинг транзакций, обнаружение аномалий, Node2Vec, ансамблевое обучение, градиентный бустинг, дисбаланс классов, PR-AUC.

Introduction

Financial monitoring systems in Kazakhstan are governed by the Law on Countering the Legalisation of Illegally Obtained Income and Financing of Terrorism, which requires banks and payment institutions to submit electronic reports for every cross-border or high-value domestic transfer. The financial intelligence unit assigns each message one of three regulatory status codes: completed (code 1), ongoing (code 2), or suspended pending investigation (code 3). The volume of submissions reaches 150,000–200,000 messages per month, making manual review impractical [8].

Code-3 messages, which define the target class, represent 0.174% of all records. This extreme class imbalance (1:563) renders standard classifiers unreliable. The Precision-Recall AUC (PR-AUC) is a more informative metric than ROC-AUC under such conditions, as it directly captures performance on the minority class and is not inflated by the abundance of true negatives [7]. Existing AML detection approaches in Kazakhstan have relied primarily on rule-based threshold systems [15, 16]. The contribution of relational information from interbank transaction networks has not been systematically evaluated on Kazakhstani data.
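The gap between the two metrics under this level of imbalance can be illustrated with a small sketch. The ranking below is synthetic (2 positives among 1,128 records, mirroring the 1:563 ratio) and is not drawn from the FM1 register; average precision is used as the PR-AUC estimator, which is an assumption, since the text does not name the estimator.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@rank, taken at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic ranking: the two positives sit at ranks 10 and 20 out of 1,128.
n_total, positive_ranks = 1128, {10, 20}
labels = [1 if r in positive_ranks else 0 for r in range(1, n_total + 1)]
scores = [1.0 - r / n_total for r in range(1, n_total + 1)]  # strictly decreasing in rank

print(round(roc_auc(labels, scores), 3), round(average_precision(labels, scores), 3))
# prints: 0.988 0.1
```

A ranking that places both positives within the top 2% of records looks nearly perfect by ROC-AUC, yet its average precision is only 0.1, because the 18 negatives ranked above the positives dominate precision when positives are this rare.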

The aim of this study is to develop and evaluate a hybrid ensemble framework for suspicious transaction detection that: (1) integrates graph-based relational features via Node2Vec embeddings of the interbank transaction graph; (2) combines five heterogeneous model families to reduce prediction variance; and (3) demonstrates validity through walk-forward temporal validation and ablation analysis.

Materials and Methods

The dataset was exported from the FM1 financial monitoring register and covers January–December 2025. After removing 43 fully-null columns, 37 columns remained. The suspension flag (MESS_OPER_STATUS_CODE = 3) defines the binary target y. Among 175,683 records, 305 carry y = 1 (0.174%). The imbalance ratio of 1:563 quoted throughout corresponds to the training partition (218 positives among 122,978 records). Data were partitioned chronologically (Table 1).

Table 1.

Dataset partitioning

Partition	Records	Positives	Positive rate
Training (70%)	122,978	218	0.177%
Validation (15%)	26,352	49	0.186%
Test (15%)	26,353	38	0.144%
Total	175,683	305	0.174%
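The partition sizes and the imbalance ratio can be checked with a few lines of arithmetic. The sketch below assumes that records are already sorted by time and that cut points are rounded down; the rounding rule is not stated in the text, although it does reproduce the counts in Table 1. The function names are illustrative.

```python
def chronological_bounds(n_records, train_frac=0.70, val_frac=0.15):
    """Return (train_end, val_end) cut points for records sorted by time.

    Rounding down is an assumption; the source does not state the rule.
    """
    train_end = int(n_records * train_frac)
    val_end = int(n_records * (train_frac + val_frac))
    return train_end, val_end


def imbalance_ratio(n_records, n_positives):
    """Number of negative records per positive record."""
    return (n_records - n_positives) / n_positives


train_end, val_end = chronological_bounds(175_683)
sizes = (train_end, val_end - train_end, 175_683 - val_end)
print(sizes)                                      # prints: (122978, 26352, 26353)
print(round(imbalance_ratio(122_978, 218), 1))    # prints: 563.1
```

The training-partition ratio of 563.1 negatives per positive is the value that reappears as the positive-class weight in the cost-sensitive estimators.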

The feature engineering pipeline has three components. The first comprises six temporal and amount features: hour-of-day, day-of-week, weekend flag, calendar month, log-transformed tenge amount, and raw tenge amount. The second comprises rolling aggregates (count, sum, mean, and maximum of tenge amount) over 1-day, 7-day, and 30-day windows for five grouping keys (sender bank, recipient bank, bank pair, country pair, and CFM code), which yields 4 × 3 × 5 = 60 features; two binary mismatch flags additionally capture cross-bank and cross-country discrepancies. The third, the graph embedding component, runs Node2Vec [10] with 64 dimensions, walk length 30, and 10 walks per node on the 5,475-node directed interbank graph (train partition only), producing 192 features per transaction. Three datetime columns (MESS_DATE, DATE_EXPORTED, _actual_date_imported) are excluded before model training as they cannot be represented as numeric inputs. The resulting feature matrix contains 299 columns (23 categorical, 276 numeric).
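The walk-sampling stage of Node2Vec can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the return and in-out parameters p and q are not reported and are set to 1.0 here, the toy graph and bank names are hypothetical, and the subsequent skip-gram training that turns walks into 64-dimensional vectors is omitted.

```python
import random


def node2vec_walk(graph, start, walk_length, p, q, rng):
    """Sample one second-order biased walk on a directed weighted graph.

    graph maps node -> {successor: edge_weight}.
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        successors = graph.get(current, {})
        if not successors:  # dead end in a directed graph: stop early
            break
        candidates = list(successors)
        if len(walk) == 1:
            weights = [successors[c] for c in candidates]
        else:
            previous = walk[-2]
            weights = []
            for c in candidates:
                if c == previous:                    # return to the previous node
                    bias = 1.0 / p
                elif c in graph.get(previous, {}):   # stays adjacent to the previous node
                    bias = 1.0
                else:                                # moves further away
                    bias = 1.0 / q
                weights.append(successors[c] * bias)
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, num_walks=10, walk_length=30, p=1.0, q=1.0, seed=0):
    """num_walks walks from every node; p = q = 1.0 is an assumption."""
    rng = random.Random(seed)
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(graph, node, walk_length, p, q, rng))
    return walks


# Hypothetical four-bank graph; weights stand in for aggregated transfer volume.
toy_graph = {
    "bank_A": {"bank_B": 3.0, "bank_C": 1.0},
    "bank_B": {"bank_C": 1.0},
    "bank_C": {"bank_A": 2.0, "bank_D": 1.0},
    "bank_D": {"bank_A": 1.0},
}
walks = generate_walks(toy_graph)
print(len(walks), len(walks[0]))  # prints: 40 30
```

The walks would then be passed to a skip-gram model as sentences to obtain one 64-dimensional vector per node. The figure of 192 features per transaction equals three such vectors (3 × 64); how the three vectors are composed for each transaction is not stated in the text.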

Eight estimators were trained with cost-sensitive class weights (imbalance ratio 563:1): Logistic Regression (linear baseline), Random Forest (300 trees), XGBoost [6] (500 rounds, scale_pos_weight = 563), LightGBM (2,000 rounds, early stopping), CatBoost [13] (2,000 iterations, 22 categorical features, weights [1.0, 563.1]), TabNet [3] (embedding dim 32, 3 steps), FT-Transformer [9] (3 layers, 8 heads, Focal Loss γ = 2), and Isolation Forest [11] (contamination = 0.0018).

XGBoost, CatBoost, and LightGBM all belong to the gradient boosting family, so including all three in the ensemble would add redundancy rather than model diversity. XGBoost was the one excluded because, in this experimental setup, it was trained exclusively on the standardised numeric feature matrix, whereas LightGBM and the other ensemble members operated on the full 299-column representation, including the 23 categorical columns in their native format. Combining models built on different feature representations would make the input space of the ensemble inconsistent. XGBoost was therefore retained as a standalone baseline against which to benchmark the ensemble gain over the best single numeric-only model.

Ensemble fusion follows three steps: (1) each of the five ensemble models is Platt-calibrated on the validation set; (2) Optuna [2] optimises the five blending weights over 500 trials, maximising validation PR-AUC; (3) the calibrated test probabilities are combined as a weighted sum. The optimal weights are CatBoost 0.063, LightGBM 0.001, TabNet 0.676, Isolation Forest 0.154, and FT-Transformer 0.107.
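Steps (1) and (3) of the fusion can be sketched as below. This is a simplified illustration rather than the authors' implementation: Platt scaling is fitted by plain gradient descent without Platt's target smoothing, the validation scores and calibrated test probabilities are hypothetical, and the Optuna weight search of step (2) is not reproduced; the reported weights are plugged in directly and renormalised, since their rounded values sum to 1.001.

```python
import math


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = 1.0 / (1.0 + math.exp(-(a * s + b))) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_apply(score, a, b):
    """Map a raw model score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def blend(calibrated, weights):
    """Weighted sum of per-model calibrated probabilities, weights renormalised."""
    total = sum(weights.values())
    n = len(next(iter(calibrated.values())))
    return [sum(weights[m] * calibrated[m][i] for m in weights) / total
            for i in range(n)]


# Blending weights as reported in the text.
REPORTED_WEIGHTS = {"CatBoost": 0.063, "LightGBM": 0.001, "TabNet": 0.676,
                    "IsolationForest": 0.154, "FTTransformer": 0.107}

# Step 1 on hypothetical validation scores of a single model.
val_scores = [0.10, 0.20, 0.30, 0.80, 0.90]
val_labels = [0, 0, 0, 1, 1]
a, b = fit_platt(val_scores, val_labels)
calibrated_low, calibrated_high = platt_apply(0.10, a, b), platt_apply(0.90, a, b)

# Step 3 on hypothetical calibrated test probabilities for two transactions.
calibrated_test = {
    "CatBoost": [0.02, 0.60],
    "LightGBM": [0.01, 0.30],
    "TabNet": [0.03, 0.70],
    "IsolationForest": [0.05, 0.10],
    "FTTransformer": [0.02, 0.65],
}
ensemble_scores = blend(calibrated_test, REPORTED_WEIGHTS)
print([round(s, 3) for s in ensemble_scores])
```

Because TabNet carries about two thirds of the total weight, the blended score of each transaction stays close to the TabNet probability, with the other four models acting as corrections.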

Results and Discussion

Two evaluation regimes are used and must be carefully distinguished. Recall@K% is a ranking-based metric computed independently for each calendar day: all transactions are ranked by model score, the top K% are selected, and recall is the fraction of that day's positives within the selected set. This metric does not rely on a fixed global threshold and captures ranking quality. The confusion matrix, in contrast, applies a fixed global threshold (0.084, selected to maximise F1 on the validation set) to the entire test set. These two regimes are complementary, not contradictory: Recall@5% = 0.984 means that 37 of 38 positives are ranked in the daily top 5%, while TP = 20 in the confusion matrix reflects how many positives lie above the fixed global threshold. These values measure different operational scenarios and should be interpreted separately.
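The difference between the two regimes can be made concrete with a short sketch. Two points are assumptions: the per-day selection size is taken as the ceiling of K% of that day's volume (consistent with 2–6 records for 34–120 transactions), and, because the text does not state how per-day recalls are aggregated, both the mean of per-day recalls and the pooled recall are returned. The data are synthetic.

```python
import math
from collections import defaultdict


def daily_recall_at_k(days, scores, labels, k_percent=5.0):
    """Per-day Recall@K%: rank each day's records, keep the top K%, count positives kept.

    Returns (macro, micro): the mean of per-day recalls and the pooled recall.
    """
    by_day = defaultdict(list)
    for d, s, y in zip(days, scores, labels):
        by_day[d].append((s, y))
    per_day, found, total = [], 0, 0
    for records in by_day.values():
        n_pos = sum(y for _, y in records)
        if n_pos == 0:
            continue  # recall is undefined on days without positives
        top_n = math.ceil(len(records) * k_percent / 100.0)
        top = sorted(records, key=lambda r: -r[0])[:top_n]
        hits = sum(y for _, y in top)
        per_day.append(hits / n_pos)
        found += hits
        total += n_pos
    return sum(per_day) / len(per_day), found / total


def threshold_recall(scores, labels, threshold):
    """Recall under one fixed global threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return tp / sum(labels)


# Two synthetic days of 40 records, one positive each. On day d1 the positive
# scores only 0.05 in absolute terms but is still the top-ranked record that day.
days, scores, labels = [], [], []
for day, pos_score in (("d1", 0.05), ("d2", 0.90)):
    for i in range(1, 40):
        days.append(day)
        scores.append(0.001 * i)
        labels.append(0)
    days.append(day)
    scores.append(pos_score)
    labels.append(1)

print(daily_recall_at_k(days, scores, labels))   # prints: (1.0, 1.0)
print(threshold_recall(scores, labels, 0.084))   # prints: 0.5
```

The positive on day d1 is recovered by the daily ranking but missed by the global threshold of 0.084, which is the mechanism behind the gap between the two figures. The two aggregations can also differ: pooled over all days, 37 of 38 positives corresponds to 0.974, whereas a mean of per-day recalls weights each day equally.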

Table 2 compares the hybrid ensemble with seven of the individual estimators on the test set. The 38 test positives span 11 calendar days (34–120 transactions/day), so Recall@5% corresponds to reviewing 2–6 records per day.

Table 2.

Test-set performance of the ensemble and individual estimators

Model	PR-AUC	ROC-AUC	Recall@5%	Recall@10%
Hybrid Ensemble	0.582	0.977	0.984	0.984
XGBoost (baseline)	0.561	0.989	0.989	0.989
FT-Transformer	0.548	0.990	0.975	0.980
Random Forest	0.264	0.881	0.817	0.875
TabNet	0.137	0.957	0.790	0.926
Logistic Regression	0.030	0.954	0.741	0.926
LightGBM	0.015	0.841	0.638	0.768
Isolation Forest	0.001	0.418	0.000	0.000

The hybrid ensemble achieves the highest PR-AUC, 0.582. XGBoost, the strongest individual model (PR-AUC 0.561), is marginally ahead of the ensemble on Recall@5% (0.989 vs. 0.984) and on ROC-AUC (0.989 vs. 0.977), while the ensemble leads on PR-AUC by 0.021. Because XGBoost was excluded from the ensemble (see Materials and Methods), this comparison indicates that the ensemble design meets its intended goal of matching or exceeding the best standalone baseline on the primary metric while combining complementary error patterns from different model families, although the margin is small. Isolation Forest scored zero recall at both reported cut-offs (5% and 10%), indicating that unsupervised scoring alone is insufficient at 1:563 imbalance.

Table 3.

Bootstrap 95% confidence intervals for PR-AUC (n = 1,000 resamples)

Model	PR-AUC	95% CI
Hybrid Ensemble	0.583	[0.412, 0.735]
XGBoost	0.562	[0.393, 0.705]
FT-Transformer	0.551	[0.378, 0.699]
Random Forest	0.271	[0.127, 0.432]
Logistic Regression	0.035	[0.019, 0.063]
Isolation Forest	0.001	[0.001, 0.002]

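The three top models share overlapping confidence intervals (Table 3) and are not statistically separable at the available test-set size of 38 positives. Logistic Regression and Isolation Forest are clearly separated from the top group. Random Forest is lower as well, although the upper bound of its interval (0.432) slightly overlaps the lower bounds of the top three (0.378–0.412).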
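A percentile bootstrap of the kind summarised in Table 3 can be sketched as follows. Several details are assumptions, since the text gives only the number of resamples: records are resampled with replacement without stratification, resamples containing no positives are redrawn, and average precision stands in for PR-AUC. The data are synthetic.

```python
import random


def average_precision(labels, scores):
    """Mean of precision@rank, taken at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def bootstrap_ci(labels, scores, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a ranking metric."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if sum(sample_labels) == 0:
            continue  # metric undefined without positives: redraw
        stats.append(metric(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round(alpha / 2 * n_resamples)]
    upper = stats[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic test set: 8 positives among 400 records, positives shifted upwards.
data_rng = random.Random(42)
labels = [1] * 8 + [0] * 392
scores = [data_rng.random() * 0.5 + (0.4 if y else 0.0) for y in labels]

point = average_precision(labels, scores)
lower, upper = bootstrap_ci(labels, scores, average_precision)
print(round(point, 3), (round(lower, 3), round(upper, 3)))
```

With so few positives, each resample can gain or lose a noticeable share of them, which is why intervals of the width seen in Table 3 are expected at 38 test positives.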

Walk-forward validation (Table 4) retrains CatBoost monthly on an expanding window. The mean PR-AUC of 0.535 ± 0.103 reveals substantial inter-month variation (range 0.402–0.645). According to Table 4, the lowest value, 0.402, falls in October (32 positives), whereas November, the month with the fewest positives (11 cases), scores 0.518. With only 11–45 positives per test month, each monthly PR-AUC estimate rests on very few cases, so the high standard deviation (0.103) is more plausibly a consequence of data sparsity than of genuine model instability. ROC-AUC, which is less sensitive to class imbalance, does not fall below 0.983 in any month, indicating consistently reliable transaction ranking. These results do not support a claim of strong temporal stability in absolute PR-AUC terms; rather, they indicate that the model remains operational in a monthly retraining regime and does not exhibit systematic degradation as the training window expands.

Table 4.

Monthly walk-forward validation results (CatBoost)

Test month	Train size	Positives	PR-AUC	ROC-AUC
2025-08	98,531	45	0.479	0.991
2025-09	113,301	35	0.629	0.987
2025-10	128,344	32	0.402	0.983
2025-11	142,820	11	0.518	0.999
2025-12	157,637	30	0.645	0.993
Mean ± Std	—	—	0.535 ± 0.103	0.990 ± 0.006
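The expanding-window scheme and the summary row of Table 4 can be sketched as follows. The month labels in the toy example are hypothetical; the PR-AUC values are those of Table 4, and the sample standard deviation (n − 1 denominator) is what reproduces the reported ± 0.103.

```python
from statistics import mean, stdev


def walk_forward_splits(months, first_test_month):
    """Yield (test_month, train_indices, test_indices) with an expanding window.

    months holds one 'YYYY-MM' string per record; training uses every record
    from months strictly before the test month.
    """
    for test_month in sorted(set(months)):
        if test_month < first_test_month:
            continue
        train = [i for i, m in enumerate(months) if m < test_month]
        test = [i for i, m in enumerate(months) if m == test_month]
        yield test_month, train, test


# Toy example with hypothetical record counts per month.
months = ["2025-06"] * 3 + ["2025-07"] * 2 + ["2025-08"] * 4 + ["2025-09"] * 1
for test_month, train_idx, test_idx in walk_forward_splits(months, "2025-08"):
    print(test_month, len(train_idx), len(test_idx))
# prints:
# 2025-08 5 4
# 2025-09 9 1

# Summary statistics for the monthly PR-AUC values listed in Table 4.
monthly_pr_auc = [0.479, 0.629, 0.402, 0.518, 0.645]
print(round(mean(monthly_pr_auc), 3), round(stdev(monthly_pr_auc), 3))
# prints: 0.535 0.103
```

Each test month is scored by a model that has seen only earlier months, so the training size grows from one row of Table 4 to the next.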

The ablation study (Table 5) trains CatBoost on nested feature subsets of the 299-column matrix. Adding rolling statistics to the six base columns lowers PR-AUC from 0.011 to 0.005, likely due to noise from 60 additional columns relative to only 218 training positives. Node2Vec embeddings reverse this: PR-AUC rises to 0.277, a 54-fold gain. Adding the two mismatch flags lowers PR-AUC to 0.123 while ROC-AUC rises to 0.953, and the full 299-column matrix reaches PR-AUC 0.451.

Table 5.

Ablation study: marginal contribution of feature groups

Feature configuration	N features	PR-AUC	ROC-AUC
Temporal + amount only	6	0.011	0.837
+ Rolling aggregates	68	0.005	0.803
+ Node2Vec embeddings	260	0.277	0.919
+ Mismatch flags	262	0.123	0.953
All columns (299)	299	0.451	0.995

TreeSHAP analysis on CatBoost identifies delay_hours (mean |SHAP| = 1.387), SELLER_BANK_CITY (0.922), SELLER_BANK_NAME (0.488), and OPER_SUSP (0.376) as top features. Graph embedding dimensions occupy 17 of the top 20 positions. Removing delay_hours leaves PR-AUC unchanged at 0.499, which suggests that this feature is not a source of target leakage.

At the global threshold of 0.084 (threshold-based regime), the confusion matrix is TP = 20, FN = 18, FP = 5, TN = 26,310, giving a threshold-based recall of 20/38 = 0.526. As explained above, this does not contradict the ranking-based Recall@5% = 0.984, which is computed per day over a ranked list rather than against a fixed global threshold. All 38 positives carry flag_bank_country_mismatch = 1. Missed cases show higher amounts (308 M vs. 173 M KZT) and shorter delays (7.1 h vs. 9.7 h) than detected ones, suggesting that large, rapid cross-border transfers form an underrepresented typology.
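The threshold-based figures follow directly from the four confusion-matrix counts. The short sketch below recomputes recall and, for completeness, derives precision, F1, and the false-positive rate, which are not quoted in the text but follow from the same counts.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard metrics derived from binary confusion-matrix counts."""
    return {
        "n": tp + fn + fp + tn,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


m = confusion_metrics(tp=20, fn=18, fp=5, tn=26_310)
print(m["n"], round(m["recall"], 3), m["precision"], round(m["f1"], 3))
# prints: 26353 0.526 0.8 0.635
```

The counts sum to 26,353, the size of the test partition in Table 1. At this threshold, 20 of the 25 flagged transactions are true positives, so the fixed-threshold regime trades recall for a short, high-precision alert list.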

Conclusion

This study evaluated eight machine learning estimators on 175,683 Kazakhstani financial monitoring records under 1:563 class imbalance. A hybrid ensemble of five architecturally diverse models (CatBoost, LightGBM, TabNet, FT-Transformer, and Isolation Forest), unified via Platt calibration and Optuna-optimised weights, achieved PR-AUC 0.582 and ranking-based Recall@5% 0.984 on the chronological test set. XGBoost, the strongest standalone model (PR-AUC 0.561), was retained as an independent benchmark and excluded from the ensemble. The ensemble exceeds it on the primary metric, PR-AUC, whereas XGBoost is marginally higher on ROC-AUC (0.989 vs. 0.977) and Recall@5% (0.989 vs. 0.984), and the bootstrap confidence intervals of the two models overlap. Node2Vec graph embeddings of the directed interbank network provided the largest single feature contribution (54-fold PR-AUC improvement). Walk-forward validation confirms operational viability in a monthly retraining regime, though inter-month PR-AUC variation (0.402–0.645) driven by data sparsity prevents a claim of strong absolute stability.

Limitations include wide confidence intervals due to the small test positive count (38) and the computational cost of full graph re-embedding at each retraining step. Future work will target incremental graph update procedures, heterogeneous graph extensions incorporating country and CFM sector nodes, and multi-year corpus collection to reduce confidence interval width.

References:

1. Abdallah M., Maarof M. A., Zainal A. Fraud detection system: A survey // Journal of Network and Computer Applications. 2016. Vol. 68. P. 90–113.
2. Akiba T., Sano S., Yanase T., Ohta T., Koyama M. Optuna: A next-generation hyperparameter optimization framework // Proceedings of the 25th ACM SIGKDD. Anchorage, AK, 2019. P. 2623–2631.
3. Arik S. O., Pfister T. TabNet: Attentive interpretable tabular learning // Proceedings of the AAAI Conference on Artificial Intelligence. 2021. Vol. 35. P. 6679–6687.
4. Bhattacharyya S., Jha S., Tharakunnel K., Westland J. C. Data mining for credit card fraud: A comparative study // Decision Support Systems. 2011. Vol. 50, № 3. P. 602–613.
5. Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: Synthetic minority over-sampling technique // Journal of Artificial Intelligence Research. 2002. Vol. 16. P. 321–357.
6. Chen T., Guestrin C. XGBoost: A scalable tree boosting system // Proceedings of the 22nd ACM SIGKDD. San Francisco, CA, 2016. P. 785–794.
7. Dal Pozzolo A., Caelen O., Johnson R. A., Bontempi G. Calibrating probability with undersampling for unbalanced classification // Proceedings of the IEEE SSCI. Cape Town, 2015. P. 1–8.
8. Financial Action Task Force (FATF). Guidance on Digital Identity. Paris: FATF, 2020. 79 p.
9. Gorishniy Yu., Rubachev I., Khrulkov V., Babenko A. Revisiting deep learning models for tabular data // Advances in Neural Information Processing Systems. 2021. Vol. 34. P. 18932–18943.
10. Grover A., Leskovec J. Node2Vec: Scalable feature learning for networks // Proceedings of the 22nd ACM SIGKDD. San Francisco, CA, 2016. P. 855–864.
11. Liu F. T., Ting K. M., Zhou Z.-H. Isolation forest // Proceedings of the 8th IEEE ICDM. Pisa, 2008. P. 413–422.
12. Lundberg S. M., Lee S.-I. A unified approach to interpreting model predictions // Advances in Neural Information Processing Systems. 2017. Vol. 30. P. 4765–4774.
13. Prokhorenkova L., Gusev G., Vorobev A., Dorogush A. V., Gulin A. CatBoost: Unbiased boosting with categorical features // Advances in Neural Information Processing Systems. 2018. Vol. 31. P. 6638–6648.
14. Voznika F., Viana L. Fraud detection using ensemble classifiers // Proceedings of the BRICS Congress on Computational Intelligence. Ipojuca, 2013. P. 674–680.
15. Wang D., Lin J., Cui P., Jia Q., Wang Z., Fang Y., Yu Q., Luo W. A semi-supervised graph attentive network for financial fraud detection // Proceedings of the IEEE ICDM. Beijing, 2019. P. 598–607.
16. Eurasian Group on Combating Money Laundering and Financing of Terrorism (EAG). Typologies and Case Studies on Money Laundering. Moscow: EAG Secretariat, 2023. 112 p.
17. Komitet finansovogo monitoringa Respubliki Kazakhstan. Annual Report on AML/CFT Activities [in Russian]. Astana, 2024. 48 p.
18. Abenov E., Dzhaksybekov G. Machine learning approaches for financial compliance monitoring in Central Asian banking systems // Vestnik KazNU. Series: Mathematics, Mechanics, Informatics. 2023. Vol. 118, № 3. P. 42–51.