Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
COMPARATIVE EVALUATION OF MACHINE LEARNING METHODS TO PREDICT CARDIOVASCULAR DISEASES
ABSTRACT
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, responsible for around 18 million deaths each year. Early diagnosis is key, and machine learning (ML) offers valuable support in improving prediction and clinical decisions. This study compares three ML models - Logistic Regression, Random Forest, and XGBoost - using a real-world dataset of 70,000 patient records with 12 health indicators [1]. After preprocessing and feature selection, the models were evaluated on accuracy, precision, recall, F1-score, and ROC-AUC. Random Forest performed best with 88% accuracy, closely followed by XGBoost. The findings highlight the effectiveness of ensemble models and the need for interpretability in medical applications.
АННОТАЦИЯ
Сердечно-сосудистые заболевания (ССЗ) остаются ведущей причиной смертности в мире, обусловливая около 18 миллионов случаев ежегодно. Ранняя диагностика имеет решающее значение, при этом методы машинного обучения (МО) способствуют повышению точности прогнозирования и поддержке клинических решений. В работе представлен сравнительный анализ трёх моделей - логистической регрессии, случайного леса и XGBoost - на основе реального набора данных, включающего 70 000 записей пациентов и 12 показателей здоровья [1]. Оценка моделей проводилась с использованием метрик accuracy, precision, recall, F1 и ROC-AUC. Наилучшие результаты продемонстрировал случайный лес (accuracy - 88%), незначительно опередив XGBoost. Полученные результаты подтверждают эффективность ансамблевых методов и подчеркивают важность интерпретируемости моделей в медицинских приложениях.
Keywords: Cardiovascular Diseases (CVD), Machine Learning, Predictive Modeling, Logistic Regression, Random Forest, XGBoost, Classification, Medical Data Analysis, Feature Selection, ROC-AUC.
Ключевые слова: Сердечно-сосудистые заболевания (ССЗ), Машинное обучение, Прогностическое моделирование, Логистическая регрессия, Случайный лес, XGBoost, Классификация, Анализ медицинских данных, Отбор признаков, ROC-AUC.
Introduction
Cardiovascular diseases account for about 32% of all global deaths and are largely associated with modifiable risk factors such as hypertension, smoking, poor diet, and physical inactivity [2]. Early identification of at-risk individuals is essential for prevention and effective intervention.
Traditional statistical models, such as the Framingham Risk Score, rely on a limited number of variables and often fail to capture complex relationships between risk factors. Machine learning techniques provide a more flexible alternative, enabling the analysis of large datasets and detection of nonlinear interactions.
Previous studies have demonstrated the effectiveness of ML approaches in cardiovascular prediction [3], [4]. Comparative analyses show that ensemble models, particularly Random Forest, often outperform simpler models like Logistic Regression [5]. More advanced approaches, including IoT-based systems and deep learning, further enhance prediction capabilities but introduce challenges related to interpretability [6], [7]. Therefore, evaluating the trade-off between accuracy and transparency remains essential [9].
This study aims to compare the performance of Logistic Regression, Random Forest, and XGBoost to identify the most effective model for CVD prediction.
Materials and methods
2.1 Data Source and Preprocessing
The dataset used in this study contains 70,000 patient records with demographic, clinical, and lifestyle features [1]. Key variables include age, gender, blood pressure, cholesterol, glucose levels, smoking status, alcohol intake, and physical activity.
Data preprocessing involved:
- removing missing and inconsistent records
- eliminating outliers (e.g., unrealistic blood pressure values)
- normalizing numerical features
- encoding categorical variables
Additionally, Body Mass Index (BMI) was calculated to enhance the representation of obesity-related risk. The dataset was balanced, with approximately equal distribution of CVD and non-CVD cases.
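The preprocessing steps above can be illustrated with a minimal sketch. The column names (height in cm, weight in kg, ap_hi and ap_lo in mmHg) are assumed from the public Kaggle dataset layout, the five-row sample is invented for illustration, and the plausibility bounds for blood pressure are assumptions rather than the thresholds used in this study.

```python
import pandas as pd

# Hypothetical five-row sample; not taken from the study's data.
raw = pd.DataFrame({
    "height": [168, 156, 165, 170, 160],        # cm
    "weight": [62.0, 85.0, 64.0, None, 70.0],   # kg, one missing value
    "ap_hi": [110, 140, 130, 120, 16020],       # systolic, last one is a data-entry error
    "ap_lo": [80, 90, 70, 80, 100],             # diastolic
    "cholesterol": [1, 3, 3, 1, 2],             # ordinal category codes
    "cardio": [0, 1, 1, 0, 1],                  # target label
})


def drop_invalid(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with missing values or implausible blood pressure."""
    out = df.dropna()
    # Assumed bounds for illustration only.
    plausible = (
        out["ap_hi"].between(70, 250)
        & out["ap_lo"].between(40, 150)
        & (out["ap_hi"] > out["ap_lo"])
    )
    return out[plausible].reset_index(drop=True)


def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """BMI = weight in kg divided by squared height in metres."""
    out = df.copy()
    out["bmi"] = out["weight"] / (out["height"] / 100) ** 2
    return out


def scale_and_encode(df, numeric_cols, categorical_cols):
    """Z-score the numeric columns and one-hot encode the categorical ones."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return pd.get_dummies(out, columns=categorical_cols)


cleaned = add_bmi(drop_invalid(raw))
prepared = scale_and_encode(
    cleaned, ["height", "weight", "ap_hi", "ap_lo", "bmi"], ["cholesterol"]
)
print(cleaned[["height", "weight", "bmi"]].round(2))
```

In a full pipeline the scaling statistics would normally be estimated on the training split only and then applied to the test split, to avoid leaking test information into the model.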
2.2 Feature Selection
Feature selection was performed using Recursive Feature Elimination (RFE) and SHAP analysis. The most significant predictors included age, systolic and diastolic blood pressure, cholesterol, and BMI.
These findings are consistent with prior research highlighting the importance of these risk factors in cardiovascular disease prediction [8], [9]. Feature selection improved model efficiency and interpretability.
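As an illustration of the RFE step, the following sketch runs scikit-learn's RFE on synthetic data. The base estimator (Logistic Regression), the number of retained features, and the data itself are assumptions chosen for the example and do not reproduce the configuration used in this study; the SHAP analysis is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 health indicators (not the real dataset).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, n_redundant=2, random_state=0
)
X = StandardScaler().fit_transform(X)

# RFE repeatedly fits the estimator and drops the weakest feature
# (smallest absolute coefficient) until the requested number remains.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,  # assumed value for illustration
    step=1,
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("elimination ranking (1 = kept):", selector.ranking_.tolist())
```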
2.3 Models and Tools
Three machine learning models were implemented:
- Logistic Regression (LR): simple and interpretable baseline model
- Random Forest (RF): ensemble method capable of capturing nonlinear relationships
- XGBoost (XGB): gradient boosting algorithm optimized for performance
Models were implemented in Python using scikit-learn and XGBoost libraries. Hyperparameters were tuned using cross-validation to maximize predictive performance.
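Cross-validated hyperparameter tuning of this kind can be sketched with scikit-learn's GridSearchCV. The parameter grids, the three-fold setting, the ROC-AUC scoring choice, and the synthetic data below are assumptions for illustration and are not the search space used in this study. XGBoost is omitted to keep the sketch dependency-free; its XGBClassifier follows the scikit-learn estimator interface and could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the study's dataset).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Hypothetical search spaces, kept small so the example runs quickly.
candidates = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "RF": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [4, 8]},
    ),
}

best = {}
for name, (model, grid) in candidates.items():
    # Every grid point is scored by mean ROC-AUC over the CV folds.
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    best[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```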
2.4 Training and Evaluation
The dataset was split into 80% training and 20% testing sets using stratified sampling. Models were evaluated using the following metrics:
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
ROC curves and confusion matrices were generated to visualize model performance.
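The evaluation protocol described above can be sketched as follows. The data is synthetic and the Random Forest settings are assumptions, so the printed numbers bear no relation to the results reported in this study; the sketch only shows how a stratified 80/20 split and the five metrics are obtained with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the study's dataset).
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for threshold metrics
y_score = model.predict_proba(X_test)[:, 1]  # positive-class probability for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}

# For binary labels the flattened matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

Note that ROC-AUC is computed from predicted probabilities, whereas the other four metrics depend on the decision threshold (0.5 by default for predict).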
Results and discussion
The performance of the models is presented in Table 1.
Table 1. Performance of ML Models for CVD prediction
[Table image: image001.png]
Random Forest achieved the highest accuracy (88%) and F1-score (0.86), followed closely by XGBoost. Logistic Regression showed lower performance, with accuracy around 79%.
Both ensemble models outperformed the linear model by a clear margin, most likely because they can capture nonlinear relationships and interactions between features. Random Forest showed the best balance between precision and recall, making it particularly effective for identifying CVD cases.
ROC curve analysis (Figure 1) further confirms that Random Forest provides superior classification performance, achieving the highest AUC value. XGBoost also performed strongly, while Logistic Regression showed lower discriminative ability.
Figure 1. ROC Curve for ML Models
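A ROC comparison of this kind can be computed as in the sketch below, which uses synthetic data and default-like model settings as assumptions and therefore does not reproduce the curves of this study. Plotting is omitted; the (fpr, tpr) arrays are what would be passed to a plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the study's dataset).
X, y = make_classification(n_samples=1000, n_features=12, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

curves = {}
for name, model in models.items():
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    # Each threshold on the score yields one (FPR, TPR) point of the curve.
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr, scores)
    # AUC is the area under that curve, obtained by the trapezoidal rule.
    print(f"{name}: AUC = {auc(fpr, tpr):.3f}")
```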
Feature importance analysis (Figure 2) revealed that age, systolic blood pressure, cholesterol, and BMI are the most influential predictors. These results align with established medical knowledge, supporting the validity of the models.
Figure 2. Feature Importance (Random Forest)
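Impurity-based feature importances of a Random Forest can be extracted as in the following sketch. The data is randomly generated, the feature names are borrowed only as labels, and the rule that generates the outcome is a hypothetical one, so the resulting ranking illustrates the mechanism and says nothing about real cardiovascular risk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Labels only: the columns below are random numbers, not patient data.
feature_names = ["age", "ap_hi", "ap_lo", "cholesterol", "bmi", "noise"]
X = rng.normal(size=(600, len(feature_names)))

# Hypothetical generating rule: the outcome depends on the first two columns.
signal = 1.5 * X[:, 0] + 1.0 * X[:, 1]
y = (signal + rng.normal(scale=0.5, size=600) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature,
# normalised so that the values sum to 1.
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, value in ranked:
    print(f"{name:12s} {value:.3f}")
```

Impurity-based importances can be biased toward high-cardinality features, which is one reason to complement them with permutation importance or SHAP values.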
Conclusion
This study demonstrates that machine learning methods can effectively predict cardiovascular disease using routine health data. Ensemble models, particularly Random Forest, clearly outperformed Logistic Regression on this dataset, achieving higher accuracy, F1-score, and ROC-AUC.
The results highlight the importance of nonlinear modeling and feature interactions in medical prediction tasks. However, limitations remain, including reduced interpretability of complex models and potential issues with generalization.
Future work should focus on improving model explainability, incorporating additional data sources, and validating models in real clinical environments. Advanced approaches such as deep learning and IoT-based systems also offer promising directions for further research [6], [7].
References:
1. Ulianova S. Cardiovascular disease dataset // Kaggle. – 2019. – [Electronic resource]. – Available at: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed: 03.2025).
2. World Health Organization. Cardiovascular diseases (CVDs) // WHO. – 2021. – [Electronic resource]. – Available at: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed: 03.2025).
3. Haider et al. A study of data mining approaches for heart disease prediction // International Journal of Advanced Research in Computer Science. – 2017.
4. Arora et al. Application of machine learning in predicting heart disease // Journal of Medical Informatics. – 2018.
5. Kumar et al. Comparison of machine learning algorithms for cardiovascular risk prediction // BioMedical Engineering Online. – 2018.
6. Islam et al. Real-time cardiovascular risk assessment using IoT and machine learning // Healthcare Technology Letters. – 2023.
7. Li et al. Deep learning for cardiovascular disease risk assessment: challenges and opportunities // IEEE Transactions on Biomedical Engineering. – 2022.
8. Louridi et al. Feature selection techniques for cardiovascular disease prediction // Applied Computing and Informatics. – 2019.
9. Dhar et al. A hybrid machine learning model for cardiovascular disease prediction // Expert Systems with Applications. – 2018.
10. Krishnan et al. Addressing data imbalance in machine learning for healthcare // Journal of Healthcare Informatics. – 2019.