PREDICTING PROBABILITY OF STROKE IN A PATIENT USING MACHINE LEARNING ALGORITHMS

Abutayev Zh., Kabdrakhova S.

28.06.2025

6(135)

10. Computer science, computer engineering, and control

Cite as:

Abutayev Zh., Kabdrakhova S. PREDICTING PROBABILITY OF STROKE IN A PATIENT USING MACHINE LEARNING ALGORITHMS // Universum: технические науки: electronic scientific journal. 2025. 6(135). URL: https://7universum.com/ru/tech/archive/item/20234 (accessed: 07.01.2026).

DOI - 10.32743/UniTech.2025.135.6.20234

ABSTRACT

Stroke is the second leading cause of death globally. Timely medical intervention within the critical window of 4–6 hours can significantly improve survival rates and reduce long-term disability. This paper investigates the use of machine learning algorithms, such as Logistic Regression, Random Forest, and Support Vector Machine (SVM), to predict the probability of stroke in patients. Using a real-world dataset and methods like SMOTE for class balancing, the study compares model performance using evaluation metrics including accuracy, recall, precision, and F1-score. The results highlight the importance of recall in medical applications, where missing a positive case may lead to severe consequences. The study also emphasizes the trade-off between model complexity and interpretability and proposes future work focused on explainable AI techniques.

Keywords: stroke, machine learning, mobile application, logistic regression, random forest, SMOTE, medical data.

1. Introduction

Stroke remains one of the leading causes of death and disability worldwide. According to the World Health Organization (WHO), approximately 15 million people suffer a stroke each year globally, with nearly 5 million deaths and another 5 million left permanently disabled. This estimate is supported by WHO's noncommunicable diseases statistics [World Health Organization, "Stroke, Cerebrovascular accident," WHO Fact Sheets, 2023. Available: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death]. The global burden of stroke continues to rise due to aging populations and increasing prevalence of risk factors such as hypertension, diabetes, and obesity. In Kazakhstan, stroke is also a significant public health issue, accounting for a substantial proportion of hospitalizations and long-term disability cases. According to national health statistics, over 40,000 new stroke cases are recorded annually, highlighting the urgent need for effective prediction and prevention strategies.

Table 1.

Estimated Stroke Incidence and Mortality

Region	Annual Stroke Cases	Stroke-related Deaths
Worldwide	15 million	5 million
Kazakhstan	40 000	14 000

Given the clinical and economic burden of stroke, accurate and early prediction of stroke risk has become a pressing area of research. Numerous studies have been conducted to explore risk factors and predictive models. For example, Mendis et al. [1] provided a global overview of stroke incidence and emphasized the importance of timely diagnosis and intervention. Hung et al. [11] analyzed electronic medical claim databases using machine learning to predict stroke occurrence in large-scale populations. Monteiro et al. [6] used clinical variables to forecast post-stroke functional outcomes, demonstrating the utility of predictive models in improving rehabilitation planning.

This study proposes a machine learning-based approach to predict the probability of stroke using structured patient data. Our contribution includes the application and comparison of several algorithms—Logistic Regression, Random Forest, and Support Vector Machine (SVM)—to identify the most effective model. Furthermore, the study addresses class imbalance through the use of SMOTE and explores feature importance to enhance interpretability. Unlike prior work that often relies on unbalanced data or limited evaluation metrics, this research emphasizes recall and model transparency, which are crucial for real-world clinical application.

Machine learning (ML) has demonstrated promising results in various areas of healthcare. Esteva et al. [3] achieved dermatologist-level accuracy in skin cancer classification using deep learning. Rajpurkar et al. [4] developed CheXNet, a convolutional neural network capable of diagnosing pneumonia from chest X-rays with high accuracy. Miotto et al. [5] applied unsupervised deep learning techniques to predict future diagnoses from electronic health records (EHR), demonstrating the power of data-driven prediction in clinical settings.

These advances highlight the potential of ML techniques for disease detection and prognosis. In the context of stroke, ML models offer an opportunity to improve early detection by learning complex patterns within multidimensional datasets. However, many existing models lack generalizability, rely on limited input features, or fail to account for class imbalance. This study builds on existing research and aims to address these limitations through rigorous model comparison, proper handling of imbalanced data, and emphasis on performance metrics most relevant for clinical impact.

2. Materials and Methods

2.1 Dataset

The dataset used in this study was sourced in 2022 and contains 5,110 observations across 12 variables. These include both demographic attributes (e.g., age, gender, marital status) and health-related attributes (e.g., hypertension, heart disease, average glucose level, BMI, smoking status). The target variable, “stroke,” is highly imbalanced: only 249 records (approximately 4.9% of the dataset) are positive stroke cases. BMI is the only variable with missing values (see Table 2). The 201 records with a missing BMI were excluded, resulting in a cleaned dataset of 4,909 samples. This class imbalance and the presence of health-related risk factors make the dataset suitable for evaluating the effectiveness of machine learning techniques for stroke prediction in real-world conditions.

Table 2.

Dataset

No.	Column name	Data type	Non-null count
1	Id	Int64	5110
2	Gender	Object	5110
3	Age	Float64	5110
4	Hypertension	Int64	5110
5	Heart disease	Int64	5110
6	Ever married	Object	5110
7	Work type	Object	5110
8	Residence type	Object	5110
9	Avg glucose level	Float64	5110
10	BMI	Float64	4909
11	Smoking status	Object	5110
12	Stroke	Int64	5110
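The exclusion of records with a missing BMI can be illustrated with a minimal pandas sketch. The column names and all values below are hypothetical stand-ins, not the study's data; only the operation (dropping rows where BMI is missing, then checking the share of positive cases) follows the description above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "age": [70.0, 64.0, 81.0, 50.0, 33.0, 55.0],
    "avg_glucose_level": [210.0, 190.0, 110.0, 170.0, 85.0, 95.0],
    "bmi": [35.0, np.nan, 31.0, 33.0, np.nan, 27.0],
    "stroke": [1, 1, 0, 0, 0, 0],
})

# Count missing values per column (only "bmi" has any in this toy frame).
missing_per_column = df.isna().sum()

# Drop records whose BMI is missing, mirroring the exclusion described above.
clean = df.dropna(subset=["bmi"]).reset_index(drop=True)

# Share of positive cases after cleaning.
positive_rate = clean["stroke"].mean()

print(missing_per_column["bmi"], len(clean), positive_rate)
```

On the real data the same two checks would be expected to reproduce the 201 missing BMI values and the 4,909 remaining samples reported above.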

2.2 Data Preprocessing

Prior to model training, several preprocessing steps were undertaken to prepare the dataset for analysis. All categorical variables, including gender, marital status, work type, residence type, and smoking status, were transformed using one-hot encoding to enable compatibility with machine learning algorithms that require numerical input. Continuous variables such as age, average glucose level, and BMI were standardized to ensure consistent scaling across features, which can help improve the convergence and stability of certain models.

The dataset was then randomly partitioned into training and testing subsets using an 80:20 split. This ensured that model evaluation could be conducted on previously unseen data to better reflect real-world generalization performance.
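The encoding, scaling, and splitting steps described above could look roughly like the following scikit-learn sketch. The toy records and column names are hypothetical. Stratifying the split, fixing `random_state`, and fitting the scaler on the training portion only are assumptions made for this illustration rather than details stated in the paper.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy records; real column names and values may differ.
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 10,
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "Unknown"] * 5,
    "age": [30.0 + 2 * i for i in range(20)],
    "avg_glucose_level": [80.0 + 5 * i for i in range(20)],
    "bmi": [20.0 + 0.5 * i for i in range(20)],
    "stroke": [0] * 16 + [1] * 4,
})
categorical = ["gender", "smoking_status"]
numeric = ["age", "avg_glucose_level", "bmi"]

X = df[categorical + numeric]
y = df["stroke"]

# 80:20 split; stratify and random_state are assumptions, not from the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode categorical columns and standardize continuous ones.
# sparse_threshold=0 forces a dense array so columns are easy to inspect.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numeric),
    ],
    sparse_threshold=0,
)

# Fit on the training portion only, then reuse the fitted transform on the test set.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training portion alone keeps test-set statistics out of the scaling parameters, which is one common way to keep the held-out evaluation honest.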

As Figure 1 shows, the classes are heavily imbalanced: only about 5% of the instances are positive stroke cases, so a balancing technique was required. To address this, the Synthetic Minority Over-sampling Technique (SMOTE) [13] was applied to the training set. SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances and their nearest minority-class neighbors, thereby balancing the classes and reducing the risk of model bias towards the majority class. This step was essential for improving the model's ability to detect stroke cases, as measured by recall.

Figure 1. Visualization of stroke class imbalance in the dataset
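The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in the study (in practice a library such as imbalanced-learn would normally be used). The function name `smote_like_oversample`, its parameters, and the sample points are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority point
    pick = rng.integers(0, k, size=n_new)   # one of its k nearest neighbors
    neighbor = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    # Each synthetic point lies on the segment between a point and its neighbor.
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])

# Hypothetical 2-D minority samples (e.g., standardized age and glucose level).
minority = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.6], [1.2, 2.0]])
synthetic = smote_like_oversample(minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by the minority class. Applying the oversampling to the training set only, as described above, keeps synthetic records out of the test set.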

To better understand the distribution of key numerical features, we visualized the age, average glucose level, and BMI distributions for stroke and non-stroke cases using kernel density estimation plots (see Figure 2). As illustrated, stroke patients tend to be older and have higher average glucose levels. While BMI distributions also differ slightly, the distinction is less pronounced compared to the other two features. These insights support the relevance of these variables in predicting stroke risk and motivate their inclusion in the model.

Figure 2. Distribution comparison of Age, Avg. Glucose Levels, and BMI for stroke (darker curve) vs. non-stroke (lighter curve) cases

2.3 Model Training

To evaluate the effectiveness of different machine learning methods in stroke prediction, three classification algorithms were selected: Logistic Regression, Random Forest, and Support Vector Machine (SVM). These models were chosen for their established use in medical prediction tasks, varying levels of complexity, and interpretability.

Logistic Regression was used as a baseline model due to its simplicity and ability to provide easily interpretable coefficients, making it a suitable choice for healthcare settings where model transparency is essential. Support Vector Machine (SVM) with a linear kernel was selected for its effectiveness in high-dimensional spaces and its robustness against overfitting when properly regularized.

Random Forest, an ensemble method based on decision trees, was included for its ability to model non-linear relationships and handle both numerical and categorical data effectively. In addition to evaluating the default Random Forest configuration, we developed a tuned version of the model to assess the impact of hyperparameter optimization on predictive performance. Specifically, we set the number of decision trees (n_estimators) to 300 to improve stability, limited the depth of each tree (max_depth) to 10 to avoid overfitting, and applied class balancing through the class_weight='balanced' parameter to compensate for data imbalance. All models were trained on the SMOTE-balanced dataset described in Section 2.2.

Hyperparameter tuning was performed manually based on iterative experimentation and validation results on the training set. The goal was to find an optimal trade-off between performance metrics such as accuracy, recall, and F1-score, especially considering the imbalanced nature of the dataset.
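A minimal sketch of how the four configurations described in this section might be set up with scikit-learn is given below. The synthetic data from `make_classification`, the `random_state` values, and `max_iter=1000` are assumptions for illustration; only the tuned Random Forest settings (`n_estimators=300`, `max_depth=10`, `class_weight='balanced'`) and the linear SVM kernel come from the text. The SMOTE step is omitted here for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data (about 5% positives) standing in for the real set.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm_linear": SVC(kernel="linear"),
    "random_forest_default": RandomForestClassifier(random_state=0),
    # Tuned settings quoted in the text above.
    "random_forest_tuned": RandomForestClassifier(
        n_estimators=300, max_depth=10, class_weight="balanced", random_state=0
    ),
}

# Fit each model and record recall on the held-out positives.
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    recalls[name] = recall_score(y_test, predictions, zero_division=0)
    print(f"{name}: recall={recalls[name]:.2f}")
```

The numbers printed by this sketch describe the synthetic data only and should not be read as a reproduction of the study's results.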

2.4 Evaluation Metrics

To comprehensively assess the performance of each machine learning model, a combination of evaluation metrics was employed: accuracy, precision, recall, F1-score, and the area under the Receiver Operating Characteristic curve (AUC-ROC). These metrics were selected based on their relevance to binary classification problems, particularly in cases involving class imbalance, such as stroke prediction.

Accuracy measures the proportion of correctly predicted observations over the total number of observations. However, in imbalanced datasets, accuracy can be misleading, as a model may achieve high accuracy by predominantly predicting the majority class. Therefore, additional metrics were used to provide a more nuanced understanding of model behavior.

Precision quantifies the proportion of true positive predictions among all positive predictions made by the model, reflecting its ability to avoid false positives. Recall, or sensitivity, represents the proportion of actual stroke cases that were correctly identified, and is particularly critical in medical contexts where missing a positive case can lead to serious consequences. The F1-score, defined as the harmonic mean of precision and recall, balances both concerns and is useful when dealing with uneven class distributions. AUC-ROC evaluates the trade-off between true positive and false positive rates across different thresholds, offering insight into model discrimination capability.

Together, these metrics enable a thorough evaluation of each model’s effectiveness, with a focus on identifying stroke cases accurately and reliably.
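The definitions above can be made concrete with a small helper that computes the threshold-based metrics from confusion-matrix counts (AUC-ROC is left out because it needs ranked scores rather than counts). The function name and both sets of counts are hypothetical and chosen only to illustrate why accuracy alone can mislead on imbalanced data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model makes no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test set: 1,000 patients, 50 of whom had a stroke.
# A model that always predicts "no stroke" finds none of them.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)

# A hypothetical model that flags 20 of the 50 strokes and raises 80 false alarms.
screening_model = classification_metrics(tp=20, fp=80, fn=30, tn=870)

print(always_negative)
print(screening_model)
```

In this illustration the always-negative model reaches 95% accuracy with 0% recall, while the second model has lower accuracy (89%) but 40% recall, which is the kind of trade-off the recall metric is meant to expose.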

3. Results

The performance evaluation of the developed models revealed several important findings. Among all algorithms tested, the tuned Random Forest classifier achieved the highest overall accuracy at approximately 94%. However, despite this impressive score, its ability to correctly identify stroke cases as measured by recall dropped drastically to only 2%. This suggests that the model, while effective at recognizing non-stroke cases, failed to detect the majority of actual stroke instances.

In contrast, the original (non-tuned) Random Forest model demonstrated a lower overall accuracy of around 88%, but significantly higher recall for stroke predictions, reaching approximately 24%. This makes it a more practical option in clinical settings, where missing stroke cases can have severe consequences. A similar trade-off was observed in the other models, such as Logistic Regression and SVM, which showed balanced yet moderate performance across all metrics.

Further analysis of the Random Forest feature importance scores indicated that age was the most influential predictor, followed by average glucose level and body mass index (BMI). This finding aligns with known medical knowledge about stroke risk factors and reinforces the relevance of these variables in predictive modeling.
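A feature ranking of this kind can be obtained from a fitted Random Forest roughly as follows. The paper does not state which importance measure was used, so the impurity-based `feature_importances_` attribute is an assumption here, and the synthetic data (in which only the hypothetical "age" column drives the label) is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Hypothetical standardized features; only "age" drives the synthetic label.
feature_names = ["age", "avg_glucose_level", "bmi"]
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor continuous features with many distinct values, so a ranking like this is best read alongside domain knowledge, as the paragraph above does.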

These results highlight a key challenge in medical machine learning: optimizing for overall performance metrics like accuracy may come at the cost of identifying high-risk patients. Therefore, it is crucial to consider recall and other sensitivity measures when developing predictive models for health-related applications, especially in cases involving rare but critical conditions like stroke.

Table 3.

Results of classification algorithms

These results show that the tuned Random Forest model, despite its higher accuracy, substantially underperforms in recall for stroke patients: it classifies most records correctly overall but misses most of the actual stroke cases. In medical prediction, correctly identifying stroke cases matters more than correctly identifying non-stroke cases, because a missed stroke can have a significant impact on a patient's life and health.

4. Conclusions

This study explored the use of machine learning algorithms for predicting the probability of stroke based on patient data. Three models—Logistic Regression, Random Forest, and Support Vector Machine (SVM)—were implemented and evaluated using a real-world clinical dataset. The findings indicate that while the tuned Random Forest model achieved the highest overall accuracy, it performed poorly in terms of recall, which is critical in medical applications. On the other hand, models with slightly lower accuracy, such as the original Random Forest and Logistic Regression, provided a better balance between sensitivity and specificity, making them more appropriate for use in clinical decision support systems.

The importance of recall in stroke prediction cannot be overstated, as missing a high-risk case may result in delayed treatment and severe health outcomes. The study also highlighted key predictive features such as age, average glucose level, and BMI, which align with established clinical risk factors.

Future work will aim to enhance the interpretability of these models through explainable AI techniques, as well as improve efficiency and adaptability by optimizing algorithm parameters. Overall, the continued development of accurate and interpretable models has the potential to significantly improve early stroke detection, enabling timely intervention and ultimately improving patient outcomes.

References:

1. Tianyu Shi, Huiyan Jiang, and Bin Zheng. C2MA-Net: Cross-modal cross-attention network for acute ischemic stroke lesion segmentation based on CT perfusion scans. IEEE Transactions on Biomedical Engineering, 69(1):108–118, 2022.
2. Wai-Fai Tung, Fu-Hsing Wu, Po-Chou Chan, Hsuan-Hung Lin, Yung-Fu Chen, and Chih-Sheng Lin. Designing AI models for predicting ischemic stroke using administrative healthcare database. In 2020 International Symposium on Computer, Consumer and Control (IS3C), pages 49–52, 2020.
3. Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
4. Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
5. Riccardo Miotto, Fei Li, Brian A. Kidd, and Joel T. Dudley. Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6:26094, 2016.
6. R. Punitha Lakshmi, Melingi Sunil Babu, and V. Vijayalakshmi. Voxel based lesion segmentation through SVM classifier for effective brain stroke detection. In 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pages 1064–1067, 2017.
7. A. K. Subudhi, S. S. Jena, M. Mohanty, and S. K. Sabut. Computational intelligence approach for predicting ischemic stroke using brain MRI. In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pages 1707–1712, 2018.
8. Chidozie Shamrock Nwosu, Soumyabrata Dev, Peru Bhardwaj, Bharadwaj Veeravalli, and Deepu John. Predicting stroke from electronic health records. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 5704–5707, 2019.
9. I-Min Chiu, Wun-Huei Zeng, and Chun-Hung Richard Lin. Using multiclass machine learning model to improve outcome prediction of acute ischemic stroke patients after reperfusion therapy. In 2020 International Computer Symposium (ICS), pages 225–231, 2020.
10. Md. Azizul Hakim, Md. Zahid Hasan, Md. Mahabur Alam, Md. Mehadi Hasan, and Mohammad Nurul Huda. An efficient modified bagging method for early prediction of brain stroke. In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), pages 1–4, 2019.
11. Chen-Ying Hung, Wei-Chen Chen, Po-Tsun Lai, Ching-Heng Lin, and Chi-Chun Lee. Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 3110–3113, 2017.
12. Minhaz Uddin Emon, Maria Sultana Keya, Tamara Islam Meghla, Md. Mahfujur Rahman, M. Shamim Al Mamun, and M. Shamim Kaiser. Performance analysis of machine learning approaches in stroke prediction. In 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), pages 1464–1469, 2020.
13. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.