Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
PERSONALIZED ALLERGY PROFILING USING UNSUPERVISED METHODS AND EXPLAINABLE AI
ABSTRACT
Asthma, rhinitis, and food allergies are common allergic diseases driven by complex immunological reactions involving immunoglobulin E (IgE). Using data from NHANES 2005–06, this study presents an explainable pipeline for individual allergy profiling. After preprocessing and standardization of IgE biomarker and symptom data, principal component analysis (PCA) was used for dimensionality reduction, followed by K-means clustering to detect distinct allergic endotypes. Three clinically significant clusters were identified: pet/food sensitization, seasonal/pollen sensitization, and mixed sensitization with low symptoms. A Random Forest classifier trained on 80% of the data predicted cluster membership on the held-out 20% with 97% accuracy. According to SHapley Additive exPlanations (SHAP), the most influential features for endotype assignment were allergen-specific IgE levels, especially for peanut, ragweed, and birch, whereas questionnaire-based symptom factors had little impact. The proposed PCA–KMeans–SHAP framework shows how unsupervised clustering and interpretable machine learning can be combined to find hidden allergy subgroups, which supports more accurate diagnostic and treatment planning.
АННОТАЦИЯ
Астма, ринит и пищевые аллергии являются примерами распространённых аллергических заболеваний, в основе которых лежат сложные иммунологические реакции с участием иммуноглобулина E (IgE). Используя данные исследования NHANES 2005–06, данная работа представляет комплексный и интерпретируемый методологический конвейер для построения индивидуальных аллергологических профилей. После предварительной обработки и стандартизации данных об IgE-биомаркерах и симптомах, для снижения размерности и последующего выявления отдельных аллергических эндотипов был использован метод PCA с последующей кластеризацией K-средних. В результате были выделены три клинически значимых кластера: сенсибилизация к животным/пищевым аллергенам, сезонная/пыльцевая сенсибилизация и смешанная сенсибилизация со слабо выраженными симптомами. Модель Random Forest, обученная на всём наборе данных, продемонстрировала высокую точность (97%) в прогнозировании принадлежности к кластерам. Анализ важности признаков с помощью метода SHAP показал, что ключевую роль для определения эндотипа играют уровни специфических IgE к определённым аллергенам, в то время как факторы на основе опросников о симптомах имели незначительное влияние. Предложенный PCA–KMeans–SHAP подход демонстрирует потенциал комбинации методов неконтролируемой кластеризации и интерпретируемого машинного обучения для выявления скрытых подгрупп аллергических заболеваний, что способствует разработке более точных диагностических и терапевтических стратегий.
Keywords: Allergy profiling, Unsupervised learning, IgE biomarkers, K-means clustering, PCA, SHAP, Random Forest
Ключевые слова: Аллергопрофилирование, Неконтролируемое обучение, IgE-биомаркеры, Кластеризация K-means, PCA, SHAP, Random Forest
Introduction
Approximately 20% of the global population suffers from Immunoglobulin E (IgE)-mediated allergic diseases, which range from mild rhinitis to potentially fatal anaphylaxis [1; 2]. Despite diagnostic advances, treatment often follows a one-size-fits-all paradigm, ignoring patient-specific endotypes and affecting long-term outcomes [2; 3]. The 2005–06 National Health and Nutrition Examination Survey (NHANES) provides a solid serologic and demographic foundation for data-driven endotype discovery [4].
Unsupervised learning methods such as Principal Component Analysis (PCA) and K-means clustering are effective for identifying hidden patient subgroups in high-dimensional IgE and symptom data. Clustering of multi-allergen IgE titers has revealed distinct sensitization patterns (e.g., food vs. aeroallergen) [5–8], and PCA in pediatric cohorts has shown that the first two axes can capture over 70% of variation, separating food- and pollen-driven sensitization [6]. Combining hierarchical clustering with PCA in allergic rhinitis has identified subgroups with significantly varying seasonal symptoms and treatment responses [9]. Similar approaches in pediatric asthma have identified reproducible inflammatory endotypes (e.g., Th2-high vs. Th2-low) linked to differential corticosteroid response [10].
Supervised learning enhances this by predicting clinical outcomes and interpreting key drivers. Random Forest models trained on IgE and clinical data have achieved high precision (~85%) in predicting oral food challenge results [11]. Integrating Shapley Additive Explanations (SHAP) into such models provides global and local interpretability, revealing that features like seasonal sneezing frequency, pet avoidance, and total IgE significantly drive endotype assignment [7; 8]. Deep neural networks applied to clinical notes have also demonstrated high sensitivity and specificity for allergic reaction detection [12].
Advances in high-throughput molecular diagnostics, such as allergen microarrays, allow for component-resolved IgE profiling [14]. Integrating machine learning with these workflows yields immunological signatures that predict therapy response and identify biological endotypes, such as barrier dysfunction versus immune dysregulation in atopic dermatitis [15; 16]. Broader guidelines for AI in allergology emphasize robust validation, transparency, and clinical workflow integration [17; 18]. Future approaches for integrative endotyping are suggested by extensions of interpretable machine-learning frameworks to microbiome–allergy relationships, which show how versatile SHAP is in multimodal biological contexts [19]. All these observations together provide a strong methodological basis for the current PCA–K-Means–Random Forest–SHAP pipeline.
The purpose of this work is to identify distinct allergic endotypes and develop an explainable, personalized allergy profiling framework using unsupervised learning and feature interpretation.
The object of the study is a data-driven pipeline for allergy endotype discovery based on IgE biomarker and symptom data.
The subject of the study is the analysis of the structure and implementation of this pipeline using Principal Component Analysis (PCA), K‑means clustering, Random Forest classification, and SHAP explanations to uncover hidden patient subgroups and determine the most significant features driving cluster assignment.
To achieve this goal, the following tasks were completed:
1) A comprehensive review of the field was conducted, covering allergic endotypes, unsupervised learning applications in allergy, and model interpretability in medical AI.
2) A conceptual and methodological framework was designed, integrating dimensionality reduction, cluster analysis, supervised validation, and feature importance explanation.
3) The framework was applied to real-world NHANES data, revealing clinically significant clusters and ranking the diagnostic contribution of IgE biomarkers versus symptom reports.
Materials and methods
The analytical workflow is structured into three sequential stages: data preprocessing, unsupervised endotype discovery, and supervised interpretation. Each stage is methodologically grounded in established practices from both allergy research and machine learning, forming a coherent pipeline designed to move from raw data to clinically interpretable insights. The overall process is visualized in Figure 1.
Figure 1. Methodology
The raw data used in this study are publicly available from the National Health and Nutrition Examination Survey (NHANES), conducted by the Centers for Disease Control and Prevention (CDC). The NHANES 2005–2006 allergy data (Tables 1 and 2) comprise laboratory measurements of total and allergen-specific immunoglobulin E (IgE) levels alongside self-reported questionnaire responses on allergy and asthma history. The IgE data include continuous measurements in kU/L for total IgE and six common allergens, with values below the 0.25 kU/L detection limit recorded as 0.25. The questionnaire data contain categorical responses indicating the presence or absence of doctor-diagnosed conditions and recent symptoms. This combined dataset enables the examination of both biological markers and clinical manifestations of allergic disease in a nationally representative sample.
Table 1.
NHANES Immunoglobulin E Measurements
Table 2.
NHANES Allergy Questionnaire Responses
Data Preprocessing
NHANES 2005–06 files (AL_IGE_D, AGQ_D, MCQ_D) were linked via the participant identifier SEQN [4; 13]. Records with missing values were excluded, resulting in a cohort of 589 participants (Table 3). To address the right-skewness of the IgE measurements, a log-transformation was applied, followed by standardization to zero mean and unit variance,
$$z = \frac{x' - \mu}{\sigma},$$
where $x'$ is the log-transformed value and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding feature, ensuring equal contribution of all features to downstream analyses.
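The linkage, complete-case filtering, and scaling steps described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the toy tables and every column name except SEQN are hypothetical, and the natural logarithm is assumed because the exact form of the log-transformation is not stated in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the three NHANES files.
# Only SEQN is a real column name; the rest are illustrative.
ige = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5],
                    "total_ige": [12.0, 250.0, 0.25, 48.0, 95.0]})
agq = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5],
                    "sneezing_12mo": [1, 0, 1, np.nan, 0]})
mcq = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5],
                    "asthma_dx": [0, 1, 0, 1, 1]})

# Link the tables by participant identifier and keep complete cases only.
merged = ige.merge(agq, on="SEQN").merge(mcq, on="SEQN").dropna()

# Log-transform the right-skewed IgE column (natural log is an assumption).
features = merged.drop(columns="SEQN").astype(float)
features["total_ige"] = np.log(features["total_ige"])

# Standardize every feature to zero mean and unit variance.
z = (features - features.mean()) / features.std(ddof=0)
print(z.round(3))
```

Because values below the detection limit are stored as a positive constant, the plain logarithm is defined for every record in this sketch; a `log1p` variant would be an equally plausible choice.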
Table 3.
Sample of preprocessed NHANES Allergy data
The analytic dataset used in this study (n=589) is available at: https://github.com/danagul0901/Allergy-profiling/blob/main/processed_data.csv
Unsupervised Endotype Discovery
The goal of this stage was to identify latent subgroups within the allergic population without prior labeling, allowing the data itself to reveal natural partitions. The high dimensionality of the feature space (multiple IgE and symptom variables) can obscure underlying patterns due to noise and multicollinearity. Therefore, the first step was dimensionality reduction via Principal Component Analysis (PCA). PCA is a linear transformation technique that identifies orthogonal axes (principal components) in the data that capture the maximum variance. Mathematically, it involves the eigen-decomposition of the covariance matrix $C$ of the standardized data, solving the equation
$$C v_k = \lambda_k v_k$$
for eigenvectors $v_k$ (the principal components) and eigenvalues $\lambda_k$ (the variance explained by each component) [5]. The analysis revealed that the first two principal components collectively accounted for approximately 72% of the total variance in the dataset. This high explanatory power meant that projecting the data onto these two components created a low-dimensional representation that preserved the majority of the structural information, effectively creating a simplified, informative map of the participants' allergic profiles.
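The eigen-decomposition view of PCA described above can be illustrated with NumPy alone. This is a sketch on synthetic data (the matrix `X` is randomly generated and hypothetical), so the explained-variance figure it prints has no relation to the 72% reported for the NHANES cohort.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized data: 200 samples, 5 correlated features.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.3 * rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data and its eigen-decomposition.
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance per component, and the 2-D projection.
explained_ratio = eigvals / eigvals.sum()
scores = X @ eigvecs[:, :2]
print("variance explained by PC1 + PC2:", explained_ratio[:2].sum())
```

`np.linalg.eigh` is used rather than `eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.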
This two-dimensional projection served as the input for cluster analysis. The K-Means algorithm was employed to partition the data points into a pre-specified number (K) of clusters. K-Means operates by iteratively assigning each data point $x_i$ to the cluster with the nearest centroid $\mu_k$ and then recalculating centroids as the mean of all points in the cluster [6]. The algorithm seeks to minimize the total within-cluster variance, formally expressed as the objective function
$$J = \sum_{k=1}^{K} \sum_{x_i \in \mathcal{C}_k} \lVert x_i - \mu_k \rVert^2,$$
where $\mathcal{C}_k$ denotes the set of points assigned to cluster $k$.
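The assign-then-update iteration and the within-cluster objective can be written out directly. The following is a plain Lloyd's-algorithm sketch on synthetic two-dimensional points; the function name `kmeans`, the blob centers, and the random initialization are illustrative assumptions, and a library implementation would normally be preferred in practice.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns labels, centroids, and the objective J."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares for the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    J = float(d2[np.arange(len(X)), labels].sum())
    return labels, centroids, J

# Hypothetical 2-D points drawn around three made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, -3.0], [13.0, 1.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in centers])

labels, centroids, J = kmeans(X, k=3)
print("objective J:", round(J, 2))
```

A single random initialization can settle in a local minimum, which is why library implementations usually restart from several initializations and keep the lowest J.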
Determining the optimal number of clusters K is a critical model selection step. This was guided by two complementary methods: the elbow method, which plots the within-cluster sum of squares against K and looks for a point of diminishing returns (the "elbow"), and silhouette analysis, which quantifies how well each point fits within its own cluster compared to neighboring clusters [5; 9]. Both methods robustly indicated that K = 3 was the optimal choice for this dataset.
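The two selection criteria can be compared side by side with scikit-learn. This sketch assumes scikit-learn is available and uses synthetic points with three planted groups, so the fact that it selects K = 3 here is a property of the toy data rather than evidence about the NHANES cohort.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data with three planted groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, -3.0], [13.0, 1.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(60, 2)) for c in centers])

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                        # within-cluster sum of squares (elbow curve)
    sil = silhouette_score(X, km.labels_)     # mean silhouette coefficient
    results[k] = (wcss, sil)
    print(f"K={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")

# Pick the K with the highest mean silhouette; the elbow is read from the WCSS column.
best_k = max(results, key=lambda k: results[k][1])
print("best K by silhouette:", best_k)
```

The elbow is usually judged by eye from the WCSS curve, whereas the silhouette score gives a single number per K, which is why the two are reported together.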
Supervised Classification and Interpretability
To validate the distinctiveness of the clusters identified in the unsupervised stage and to create a predictive model, a supervised learning approach was implemented. A Random Forest classifier, an ensemble method comprising multiple decision trees, was selected for its robustness and high predictive performance [8; 11]. The model was configured with a specified number of trees. To enhance model generalizability and reduce overfitting, a feature bagging strategy was employed, in which every split within a tree considered only a random subset of the features. The preprocessed dataset was divided into a training subset (80%) and a held-out test subset (20%).
The performance of the trained Random Forest classifier was assessed on the unseen test set. Standard classification metrics, including accuracy, precision, recall, and the F1-score, were calculated to provide a comprehensive evaluation of the model's predictive capability across all identified clusters.
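A train/test evaluation of this kind can be sketched with scikit-learn as below. The data are synthetic, the cluster labels come from K-means run on that synthetic data, and the hyperparameters (`n_estimators=200`, `max_features="sqrt"`, the stratified split, the random seeds) are assumptions chosen for illustration; the text does not state the values used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: three groups of different sizes in six features.
rng = np.random.default_rng(0)
sizes, shifts = [200, 80, 20], [0.0, 3.0, 7.0]
X = np.vstack([rng.normal(loc=s, size=(n, 6)) for n, s in zip(sizes, shifts)])

# Cluster labels from the unsupervised stage serve as classification targets.
y = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 80/20 split, stratified so that the small cluster appears in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# max_features="sqrt" makes each split consider a random subset of features.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(classification_report(y_test, y_pred))
```

Macro-averaged F1 weights every cluster equally, so with one very small cluster it can sit noticeably below overall accuracy even when only a few test points are misclassified.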
To move beyond predictive accuracy and understand the model's decision-making process, the SHapley Additive exPlanations (SHAP) framework was applied. SHAP is a game-theoretic approach that attributes the prediction for an individual instance to the contribution of each input feature. The SHAP value $\phi_i$ for a feature $i$ is calculated by considering its marginal contribution across all possible combinations of other features [7; 20]. The formula,
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr],$$
where $F$ is the set of all features and $f(S)$ is the model output when only the features in $S$ are used, computes a weighted average of the difference in model output when the feature is included versus excluded from subset $S$. This was computed for every feature and every prediction in the test set. The mean absolute SHAP value for each feature, stratified by cluster, was then computed to identify the primary drivers of each endotype.
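To make the weighted-average definition concrete, the Shapley value can be computed exactly by enumerating every subset for a tiny model. The sketch below does not use the SHAP library (which relies on faster tree-specific algorithms); the function `shapley_values`, the value function `toy_model`, and its coefficients are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset S of the other features."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                with_i = value_fn(set(S) | {i})
                without_i = value_fn(set(S))
                total += weight * (with_i - without_i)
        phi[i] = total
    return phi

def toy_model(present):
    """Hypothetical model output given the set of features that are 'present'."""
    score = 0.0
    if "ragweed" in present:
        score += 0.5
    if "peanut" in present:
        score += 0.3
    if "ragweed" in present and "birch" in present:
        score += 0.2   # interaction term, shared between the two features
    return score

phi = shapley_values(toy_model, ["ragweed", "peanut", "birch"])
print({name: round(value, 3) for name, value in phi.items()})
# → {'ragweed': 0.6, 'peanut': 0.3, 'birch': 0.1}
```

The values sum to the difference between the full-model output and the empty-set output (1.0 here), and the 0.2 interaction is split equally between the two features that produce it. Exact enumeration scales as 2^|F|, so it is practical only for a handful of features.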
This comprehensive methodology, from careful data curation through unsupervised discovery to supervised validation and interpretation, provides a transparent and reproducible framework for allergy endotyping that leverages the strengths of both classical statistics and modern machine learning.
All Python code for data processing, analysis, and figure generation is available at: https://github.com/danagul0901/Allergy-profiling/blob/main/Allergy_profiling.ipynb
Results and discussion
Figure 2 shows the projection of all participants onto the first two principal components, with points colored by K-Means cluster label.
Figure 2. PCA scatter plot of standardized features, colored by K-Means cluster assignment (k=3)
Three well-separated groups emerge:
- Cluster 1 (teal) comprises the majority of participants, centered around PCA1 1.5 and PCA2 0.5.
- Cluster 0 (purple) occupies intermediate PCA1 values (approximately 4–8) and negative PCA2 scores.
- Cluster 2 (yellow) contains a small set of extreme outliers with high PCA1 (greater than 12) and variable PCA2.
The clear spatial separation indicates that the combination of IgE and symptom features is captured effectively by the first two principal components, and that K-Means clustering reliably identifies three distinct sensitization endotypes.
Random Forest Classification Performance
A Random Forest classifier was trained on 80% of the data to predict the three cluster labels and evaluated on the remaining 20%. The overall test accuracy was 97.46%, and the macro-averaged F1 score was 0.91. Table 4 presents the detailed classification report:
Table 4.
Random Forest test performance by cluster
Explainability via SHAP
SHAP values were computed for each feature on the test set to quantify their contribution to the model’s outputs. Figure 3 displays the mean absolute SHAP value for each feature and cluster. Key observations include:
- Ragweed IgE exerted the largest influence (mean |ϕ| up to 0.11), followed by Peanut IgE and Birch IgE.
- Dog IgE, Cat IgE, and Milk IgE contributed moderately.
- Questionnaire-based variables such as sneezing in the past 12 months, seasonal indicators, and eczema had negligible impact (mean |ϕ| < 0.001).
Figure 3. Mean absolute SHAP values for each feature, stratified by cluster
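The per-feature, per-cluster summary shown in Figure 3 is an average of absolute SHAP values over test samples. A minimal NumPy sketch of that aggregation follows, assuming the SHAP values are held in an array of shape (samples, features, classes); the array here is random stand-in data with made-up scales and feature names, and the layout returned by a real explainer may differ by library version.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature names and made-up magnitudes, for illustration only.
feature_names = ["ragweed_ige", "peanut_ige", "birch_ige", "sneezing_12mo"]
scales = np.array([0.10, 0.06, 0.05, 0.0005])

# Stand-in SHAP array with layout (n_samples, n_features, n_classes).
shap_values = rng.normal(size=(100, 4, 3)) * scales[None, :, None]

# Mean absolute SHAP value per feature and class (one bar segment each).
mean_abs = np.abs(shap_values).mean(axis=0)      # shape (n_features, n_classes)

# Rank features by their total contribution across classes.
overall = mean_abs.sum(axis=1)
ranking = [feature_names[j] for j in np.argsort(overall)[::-1]]

for name, row in zip(feature_names, mean_abs):
    print(f"{name:>14}: " + "  ".join(f"{v:.4f}" for v in row))
print("ranking:", ranking)
```

Taking the absolute value before averaging matters: signed SHAP values for a feature can cancel across samples and hide a feature that strongly pushes predictions in both directions.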
The clear separation of three clusters in PCA space, combined with the high accuracy of the Random Forest classifier, confirms the presence of distinct and robust sensitization endotypes. The SHAP analysis reveals that these subgroups are primarily defined by serological IgE levels (especially to ragweed, peanut, and birch), while self-reported symptoms contribute minimally. This decouples immunological sensitization from clinical manifestation, suggesting that biomarker-based endotyping could offer a more objective basis for patient stratification than symptom profiles alone. However, the reliance on K-Means and a single cohort are limitations; future validation in independent populations is needed. Overall, these data-driven endotypes provide a foundation for more precise, mechanism-informed approaches to allergy research and management.
Conclusion
This study presents a reliable process for allergy profiling using NHANES 2005–06 data. PCA reduced dimensionality while preserving over 70% of variation from IgE and symptom data, enabling the identification of three clinically distinct endotypes via K-means clustering. A Random Forest classifier confirmed these endotypes with 97% accuracy. SHAP analysis revealed that cluster assignment was primarily driven by serologic IgE markers (birch, peanut, ragweed), with symptoms and demographics contributing minimally. This supports the potential for targeted diagnostic panels. Limitations include the cross-sectional design and small cluster size for rare patterns. Future work should incorporate longitudinal data, deeper phenotyping, and automated tools to translate this pipeline into clinical practice. This data-driven approach offers a transparent framework for advancing precision allergy diagnosis and care.
References:
1. Rasool R., Gull A., Yetoo D.M., [et al.]. Polysensitization to Aeroallergens in Patients with Nasobronchial Allergy in Kashmir Valley // International Journal of Educational Science and Research (IJESR). 2017. Vol. 7, no. 5. P. 7–16.
2. Mersha T.B., Afanador Y., Johansson E., [et al.]. Resolving Clinical Phenotypes into Endotypes in Allergy: Molecular and Omics Approaches // Clinical Reviews in Allergy & Immunology. 2021. Apr. Vol. 60, no. 2. P. 200–219.
3. Agache I., Akdis C.A. Endotypes of allergic diseases and asthma: An important step in building blocks for the future of precision medicine // Allergology International. 2016. July. Vol. 65, no. 3. P. 243–252.
4. Salo P.M., Arbes S.J. Jr., Jaramillo R., [et al.]. Prevalence of allergic sensitization in the United States: results from NHANES 2005–2006 // Journal of Allergy and Clinical Immunology. 2014. Vol. 134, no. 2. P. 350–359.
5. Zhao L., Fang J., Ji Y., [et al.]. K-means cluster analysis of characteristic patterns of allergen in different ages: Real life study // Clinical and Translational Allergy. 2023. July. Vol. 13, no. 7. e12281.
6. Yamamoto-Hanada K., Borres M.P., Åberg M.K., [et al.]. IgE responses to multiple allergen components among school-aged children in a general population birth cohort in Tokyo // World Allergy Organization Journal. 2020. Feb. Vol. 13, no. 2. P. 100105.
7. Ponce-Bobadilla A.V., Schmitt V., Maier C.S., Mensing S., Stodtmann S. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development // Clinical and Translational Science. 2024. Nov. Vol. 17, no. 11. e70056.
8. Deb D., Smith R.M. Application of Random Forest and SHAP Tree Explainer in Exploring Spatial (In)Justice to Aid Urban Planning // ISPRS International Journal of Geo-Information. 2021. Vol. 10, no. 9. P. 629.
9. Malizia V., Cilluffo G., Fasola S., [et al.]. Endotyping allergic rhinitis in children: A machine learning approach // Pediatric Allergy and Immunology. 2022. Jan. Vol. 33, Suppl 27. P. 18–21.
10. Salvermoser M., Zeber K., Boeck A., Klücker E., Schaub B. Childhood asthma: Novel endotyping by cytokines, validated through sensitization profiles and clinical characteristics // Clinical & Experimental Allergy. 2021. May. Vol. 51, no. 5. P. 654–665. Epub 2021 Mar 14.
11. Zhang J., Lee D., Jungles K., [et al.]. Prediction of oral food challenge outcomes via ensemble learning // Informatics in Medicine Unlocked. 2023. Vol. 36. P. 101142.
12. Yang J., Wang L., Phadke N.A., [et al.]. Development and Validation of a Deep Learning Model for Detection of Allergic Reactions Using Safety Event Reports Across Hospitals // JAMA Network Open. 2020. Vol. 3, no. 11. e2022836.
13. Salo P.M., Calatroni A., Gergen P.J., [et al.]. Allergy-related outcomes in relation to serum IgE: results from the National Health and Nutrition Examination Survey 2005–2006 // Journal of Allergy and Clinical Immunology. 2011. May. Vol. 127, no. 5. P. 1226–1235.e7.
14. Matricardi P.M., Hage M. van, Custovic A., [et al.]. Molecular allergy diagnosis enabling personalized medicine // Journal of Allergy and Clinical Immunology. 2025. PMID: 39855360.
15. Breugel M. van, Fehrmann R.S.N., Bügel M., [et al.]. Current state and prospects of artificial intelligence in allergy // Allergy. 2023. Vol. 78, no. 10. P. 2623–2643. Epub 2023 Aug 16. PMID: 37584170.
16. Fyhrquist N., Yang Y., Karisola P., Alenius H. Endotypes of atopic dermatitis // Journal of Allergy and Clinical Immunology. 2025. In press.
17. MacMath D., Chen M., Khoury P. Artificial Intelligence: Exploring the Future of Innovation in Allergy Immunology // Current Allergy and Asthma Reports. 2023. June. Vol. 23, no. 6. P. 351–362.
18. Lisik D., Basna R., Dinh T., [et al.]. Artificial intelligence in pediatric allergy research // European Journal of Pediatrics. 2025. Vol. 184. P. 98.
19. Ma J., Fang Y., Li S., [et al.]. Interpretable machine learning algorithms reveal gut microbiome features associated with atopic dermatitis // Frontiers in Immunology. 2025. May. Vol. 16. P. 1528046.
20. Lundberg S.M., Lee S.-I. A unified approach to interpreting model predictions // Advances in Neural Information Processing Systems. Vol. 30. 2017. P. 4765–4774.