Master's Student, Department of Computer Science, Kazakh-British Technical University (KBTU), Almaty, Kazakhstan
COMPARATIVE OVERVIEW OF MACHINE LEARNING MODELS FOR CREDIT SCORING
ABSTRACT
The objective of this paper is to assess the performance of various machine learning algorithms in developing a model for bank credit scoring. Through comparative analysis, the study aims to identify the most effective algorithm for predicting creditworthiness, thereby enhancing the accuracy and reliability of credit risk assessment in banking. The research evaluates models such as Logistic Regression, Decision Trees, Random Forest, XGBoost, and Neural Networks, trained and tested on publicly available credit scoring datasets. Evaluation metrics including AUC-ROC, F1-score, and precision-recall are used to benchmark model performance. The findings reveal that ensemble-based models like XGBoost offer superior predictive accuracy, while traditional models maintain higher interpretability. The study also discusses trade-offs between performance and explainability, highlighting the importance of selecting models based on the specific regulatory and operational needs of financial institutions. This research contributes practical insights into the application of AI in financial decision-making and provides recommendations for the deployment of machine learning in credit scoring systems.
Keywords: credit scoring, machine learning, model comparison, predictive analytics, financial technology, explainable AI
Introduction
In today’s data-centric financial environment, credit scoring serves as a cornerstone for evaluating the creditworthiness of individuals and businesses. For banks and lending institutions, making accurate and reliable decisions regarding loan approvals is crucial, as it directly impacts financial stability and profitability. Traditionally, these decisions relied on expert judgment and simple statistical models, but with the increasing volume and complexity of credit-related data, the adoption of machine learning (ML) techniques has become essential for enhancing predictive accuracy.
The relevance of credit scoring lies in its ability to reduce credit risk and optimize lending operations. However, the choice of a machine learning algorithm for building a credit scoring model is far from trivial. Numerous methods—ranging from Logistic Regression and Decision Trees to more advanced algorithms like Random Forests, XGBoost, and Neural Networks—are widely used. Each model offers a trade-off between performance, interpretability, scalability, and ease of implementation.
The scientific literature offers mixed findings on the superiority of models. As highlighted in [1], while advanced methods such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Multivariate Adaptive Regression Splines (MARS) show slight gains in accuracy, simpler models like scorecards and decision trees are preferred in industry settings for their interpretability and regulatory compliance. Moreover, deep learning techniques have seen limited application in credit scoring due to concerns about explainability and transparency [2, 3]. Among modern algorithms, XGBoost has shown consistent superiority in classification accuracy and is increasingly favored in performance-critical environments.
Furthermore, ensemble learning approaches, which aggregate predictions from multiple base models, have demonstrated notable improvements in predictive performance [4]. These models use weighted or unweighted voting schemes to arrive at final decisions and often outperform individual algorithms.
The features used in credit scoring are generally divided into two broad categories:
– Application data, such as age, employment status, and declared income provided at the time of loan request;
– Behavioral data, including credit history, repayment patterns, account usage, and transactional behavior [5].
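As a minimal illustration, the two categories can be sketched as a single hypothetical applicant record; all field names below are illustrative assumptions, not columns from an actual bank schema:

```python
# Hypothetical applicant record illustrating the two feature categories.
# Every field name here is an illustrative assumption, not a real schema.
applicant = {
    # Application data: supplied on the loan request form
    "age": 34,
    "employment_status": "employed",
    "declared_income": 1200.0,
    # Behavioral data: derived from credit history and account activity
    "months_since_last_delinquency": 18,
    "avg_card_utilization": 0.35,
    "on_time_payments_last_12m": 11,
}

APPLICATION_FIELDS = {"age", "employment_status", "declared_income"}
behavioral_fields = set(applicant) - APPLICATION_FIELDS
print(sorted(behavioral_fields))
```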
Figure 1. Methodology overview
Given these challenges and the absence of a universally accepted modeling approach, this study aims to conduct a comparative analysis of popular machine learning algorithms used in credit scoring. By implementing and evaluating a range of models on publicly available datasets, the study seeks to identify the most effective techniques for credit risk classification, balancing predictive power with interpretability and practical deployment considerations.
To evaluate the performance of models, common metrics include ROC-AUC, GINI index, and the Kolmogorov–Smirnov (KS) statistic [6]. However, due to institutional data confidentiality, many researchers are forced to use synthetic or anonymized datasets, which limit the generalizability of their findings and introduce variability in model performance [7].
Figure 2. Chilean Dataset
Materials and methods
Our methodology includes all the steps performed in building a machine learning model.
A. Dataset
The sample for our experiment consists of data provided by a Chilean bank. The data include both application data and behavioral data from the client's credit history at the time of issuance, as well as scores from other models. Since the dataset lacks dates, temporal analysis is not possible.
B. Pre-processing
Before building the model, the data must be prepared for training. The pre-processing stage includes encoding categorical variables and handling missing values. Typically, missing values are filled with the mean or median for numerical data and the mode (most frequent class) for categorical data. Most algorithms can work only with numerical values, so categorical variables must be transformed. There are two main types of categorical encoding: Label Encoding and One-Hot Encoding. In Label Encoding, each category is replaced by a unique number. In One-Hot Encoding, a separate column is created for each category, indicating whether the observation belongs to that category (1) or not (0). There is also Ordinal Encoding, a special case of Label Encoding in which category IDs are assigned in ascending order of the category's influence on the target variable. In our experiment, we apply Label Encoding as the more universal type of encoding.
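A minimal sketch of these pre-processing steps in pandas; the column names and values are hypothetical, since the actual dataset schema is not reproduced here:

```python
# Minimal pre-processing sketch: impute missing values, then label-encode.
# Column names ("income", "employment") are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "income": [1200.0, None, 900.0, 1500.0],
    "employment": ["salaried", "self-employed", None, "salaried"],
})

# 1) Missing values: median for numeric, mode for categorical
df["income"] = df["income"].fillna(df["income"].median())
df["employment"] = df["employment"].fillna(df["employment"].mode()[0])

# 2) Label Encoding: map each category to a unique integer code
df["employment"] = df["employment"].astype("category").cat.codes

print(df)
```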
Figure 3. Correlation Heatmap
C. Feature Selection
Not all variables in the original dataset will be equally useful for the model. There are many indicators of a variable's informativeness; one of them is Feature Importance, which is computed from fitted decision-tree models. In our experiment, we use the Feature Importance values produced by the Random Forest algorithm. The importance of a variable can be assessed from two sides: by the variable's contribution to the final probability, and by how many observations fall under the decision rules inside the tree that involve that variable. Variables with relatively low importance are removed from the dataset. The correlation coefficient indicates the relationship between changes in two variables. If the correlation between two variables is too high, one of them should be removed, keeping the one with the higher correlation with the target variable.
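The selection procedure described above can be sketched on synthetic data as follows; the correlation threshold (0.9) and the data itself are illustrative assumptions, not values from the experiment:

```python
# Sketch of the feature-selection procedure on synthetic data:
# drop one of each highly correlated pair (keeping the feature more
# correlated with the target), then rank survivors by RF importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = pd.DataFrame({
    "informative": signal,
    "near_duplicate": signal + rng.normal(scale=0.05, size=n),  # ~0.99 corr
    "noise": rng.normal(size=n),
})
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

# 1) Pairwise correlation filter, keeping the feature closer to the target
corr = X.corr().abs()
target_corr = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
to_drop = set()
cols = list(X.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.9:
            to_drop.add(a if target_corr[a] < target_corr[b] else b)
X = X.drop(columns=to_drop)

# 2) Rank remaining features by Random Forest Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```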
D. Evaluating Results
When evaluating credit scoring models, the most commonly used metrics are the Gini coefficient, the Kolmogorov–Smirnov (KS) statistic, F1-score, Precision, and Recall. Precision and Recall characterize the correctness and completeness of positive classifications. Based on these results, conclusions are drawn about the quality of the models in a comparative analysis.
Evaluating the performance of these models requires metrics that capture various aspects of model effectiveness, including discrimination, calibration, and overall predictive power. While the Gini coefficient, AUC-ROC, the Kolmogorov–Smirnov statistic, and accuracy are indeed the most commonly used, several related metrics are also relevant for evaluating credit risk models:
- Gini Coefficient: The Gini coefficient measures the inequality in the distribution of predicted probabilities and is linearly related to AUC-ROC (Gini = 2·AUC − 1). It ranges from 0 to 1, where higher values indicate better discrimination between positive and negative outcomes.
- K-S Statistic: The Kolmogorov–Smirnov (K-S) statistic measures the maximum difference between the cumulative distributions of predicted probabilities for positive and negative outcomes. It provides a single measure of the model's ability to discriminate between good and bad credits.
- F1 Score: The F1 score is the harmonic mean of precision and recall. It balances both false positives and false negatives and is useful when the class distribution is imbalanced.
- Precision and Recall: Precision measures the proportion of true positives among all predicted positives, while recall measures the proportion of true positives among all actual positives. These metrics are especially relevant when the costs of false positives and false negatives are different.
- Profit/Loss Metrics: In the context of credit risk modeling, it’s often important to consider the financial impact of model decisions. Profit and loss metrics, such as net profit, expected loss, or return on investment (ROI), can provide a more business-oriented perspective on model performance.
- Kappa Statistic: The Kappa statistic measures the agreement between the model's predictions and the actual outcomes, correcting for the agreement that would be expected by chance alone. It is particularly useful when evaluating models in situations with imbalanced class distributions.
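The two discrimination metrics emphasized here, Gini and K-S, can be sketched directly from their definitions in plain Python; the toy labels and scores below are illustrative, not results from the experiment:

```python
# Plain-Python sketch of the Gini and K-S discrimination metrics,
# implemented from their definitions (toy data, not experiment results).
def auc_roc(y_true, scores):
    """AUC via the rank (Mann-Whitney) formulation: the probability that
    a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient via its linear relation to AUC: Gini = 2*AUC - 1."""
    return 2.0 * auc_roc(y_true, scores) - 1.0

def ks_statistic(y_true, scores):
    """Maximum gap between the score CDFs of positives and negatives."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Toy example: higher score = higher predicted probability of default
y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
s = [0.1, 0.2, 0.25, 0.3, 0.35, 0.6, 0.7, 0.4, 0.8, 0.9]
print(round(gini(y, s), 3), round(ks_statistic(y, s), 3))  # → 0.84 0.8
```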
Figure 4. Feature Importance
Results and discussion
Interpreting the results of the credit scoring models involves analyzing the performance metrics for each algorithm and drawing conclusions about their effectiveness in predicting credit risk.
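The comparison itself can be sketched as follows; synthetic, imbalanced data stands in for the non-public bank dataset, and hyperparameters are scikit-learn defaults rather than the tuned settings of the experiment:

```python
# Hedged sketch of the model comparison: train the same five model
# families on one synthetic split and report Gini (2*AUC - 1) and F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~10% positives), standing in for bank data
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = {
        "gini": 2 * roc_auc_score(y_te, proba) - 1,
        "f1": f1_score(y_te, model.predict(X_te)),
    }

for name, m in results.items():
    print(f"{name:22s} Gini={m['gini']:.2f} F1={m['f1']:.2f}")
```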
Table 1.
Model performance metrics

| ML Algorithm        | Gini | KS   | F1   | Precision | Recall |
|---------------------|------|------|------|-----------|--------|
| Logistic regression | 0.54 | 0.42 | 0.97 | 0.95      | 0.99   |
| Random forest       | 0.58 | 0.44 | 0.97 | 0.95      | 0.99   |
| Decision tree       | 0.22 | 0.22 | 0.96 | 0.95      | 0.97   |
| Naive Bayes         | 0.55 | 0.43 | 0.97 | 0.95      | 0.98   |
| K-Nearest Neighbors | 0.21 | 0.22 | 0.97 | 0.95      | 0.99   |
Starting with the Gini index, we observe significant variations among the models. The Random Forest model achieves the highest Gini score of 0.58, indicating superior discriminatory power in distinguishing between good and bad credit risks. Logistic Regression and Naive Bayes also perform reasonably well, with Gini scores above 0.54. However, the Decision Tree and K-Nearest Neighbors models exhibit notably lower Gini scores, suggesting less effective discrimination between positive and negative outcomes.
Next, considering the Kolmogorov–Smirnov (KS) statistics, which measure the maximum difference between the cumulative distributions of predicted probabilities for positive and negative outcomes, we again see variations across the models. The Random Forest and Logistic Regression models achieve relatively high KS scores, indicating strong discriminatory power in separating positive and negative outcomes. However, the Decision Tree and K-Nearest Neighbors models show lower KS scores, implying less effective discrimination.
Conclusion
In conclusion, while all models demonstrate strong performance in predicting credit risk, there are variations in their discriminatory power and effectiveness in separating positive and negative outcomes. The Random Forest model stands out as the top performer, achieving the highest Gini and KS scores among the models evaluated. Logistic Regression and Naive Bayes also perform well, while the Decision Tree and K-Nearest Neighbors models exhibit relatively lower discriminatory power. Overall, the results suggest that Random Forest, Logistic Regression, and Naive Bayes are promising candidates for credit risk prediction, with Random Forest being the preferred choice due to its superior performance across multiple metrics.
References:
- B. W. Yap, S. H. Ong, and N. H. M. Husain, "Using data mining to improve assessment of credit worthiness via credit scoring models," Expert Systems with Applications, vol. 38, pp. 13274–13283, Sep. 2011.
- A. Markov, Z. Seleznyova, and V. Lapshin, "Credit scoring methods: Latest trends and points to consider," pp. 180–201, Nov. 2022.
- S. Gunnarsson, S. Vanden Broucke, B. Baesens, M. Óskarsdóttir, and W. Lemahieu, "Deep learning for credit scoring: Do or don't?" European Journal of Operational Research, vol. 295, pp. 292–305, Nov. 2021.
- A. Chopra and P. Bhilare, "Application of ensemble models in credit scoring models," Business Perspectives and Research, vol. 6, pp. 129–141, Jul. 2018.
- J. N. Crook, R. Hamilton, and L. C. Thomas, "A comparison of a credit scoring model with a credit performance model," The Service Industries Journal, vol. 12, pp. 558–579, Oct. 1992.