Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
FORECASTING PM2.5 CONCENTRATIONS IN ALMATY USING XGBOOST AND GRU: A COMPARATIVE STUDY
ABSTRACT
Air pollution is a major public health concern in Almaty, Central Asia's largest city, where mountain basin topography traps pollutants and causes persistent winter smog. No machine-learning-based PM2.5 forecasting study has been conducted for this region to date. This paper compares XGBoost and the Gated Recurrent Unit (GRU) for hourly PM2.5 forecasting using data from three Almaty monitoring stations, including meteorological variables and lagged PM2.5 values. XGBoost consistently outperformed GRU across all stations, achieving a best R² of 0.87 and RMSE of 0.0157 mg/m³ at Alm-013. This study establishes a benchmark for data-driven air quality forecasting in Central Asia.
ANNOTATION
Air pollution poses a serious threat to public health, especially in cities with unfavorable terrain. Almaty, the largest city in Central Asia, suffers from prolonged winter smog due to its location in a mountain basin. The aim of this study is a comparative analysis of two approaches to hourly PM2.5 forecasting: XGBoost gradient boosting and the GRU recurrent neural network. Both models were trained and tested on 2024 data from three Almaty air quality monitoring stations, using meteorological parameters and lagged PM2.5 values. The results show that XGBoost consistently outperforms GRU across all stations and metrics, achieving the best R² = 0.87 and RMSE = 0.0157 mg/m³ at station Alm-013. This study is the first application of machine learning to PM2.5 forecasting in Central Asia.
Keywords: PM2.5, air quality forecasting, XGBoost, GRU, Almaty, machine learning, deep learning.
Introduction
Air pollution is a major global problem, impacting both the environment and public health. According to the World Health Organization, roughly 4.2 million people die prematurely each year due to outdoor air pollution, with the majority of these deaths occurring in low- and middle-income countries [19]. PM2.5, fine particulate matter with an aerodynamic diameter of 2.5 micrometers or less, is especially concerning. These tiny particles can be inhaled deep into the lungs and enter the bloodstream, potentially causing both respiratory and cardiovascular problems.
Almaty, Kazakhstan's largest city with a population of roughly two million, is particularly vulnerable to poor air quality. Situated in a mountain basin at the northern foothills of the Tien Shan range, the city's topography traps pollutants during temperature inversion events. In winter, vehicle emissions, residential heating, and stagnant atmospheric conditions combine to produce prolonged smog episodes in which PM2.5 concentrations frequently exceed WHO guideline levels by several times. Reliable PM2.5 forecasts are therefore critical for issuing public health advisories, guiding urban planning, and informing environmental policy.
Machine learning (ML) and deep learning (DL) have become important methods for predicting air quality, often outperforming traditional statistical and physics-based models [12]. Tree-based ensemble methods, including XGBoost, demonstrate strong performance on tabular environmental data [1, 2]. In contrast, recurrent architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are particularly adept at modeling temporal dependencies within pollutant time series [14, 18]. Despite these advances, direct comparisons between tree-based and recurrent approaches on identical datasets remain scarce, which complicates assessing which paradigm suits particular circumstances.
While machine learning has been employed in air quality research across urban centers in China, India, Europe, and Southeast Asia, no PM2.5 forecasting investigations have focused on Almaty or any other city within Kazakhstan and the wider Central Asian area. This absence is significant, considering Almaty's distinct geographical, climatic, and emission profiles.
Data-driven air quality forecasting has grown rapidly over the past decade. Among traditional ML methods, Cao et al. [3] combined ARIMA with Empirical Mode Decomposition to improve multi-station forecasts, while Al-Eidi et al. [1] showed that Decision Tree regression outperforms other models in a smart city setting. Recurrent neural networks have become a dominant paradigm: Naz et al. [18] compared LSTM, GRU, and ARIMA in Belfast (R² = 0.856), Yang et al. [20] designed a Seasonal GRU for pollution forecasting in Taiwan, and Zhang et al. [21] proposed Deep-AIR, a CNN-LSTM model for fine-grained estimation in Hong Kong and Beijing. Chiang and Horng [7] combined autoencoders, dilated CNNs, and GRU for PM2.5 forecasting across 76 Taiwanese stations.
More advanced architectures have also been explored. Dey et al. [9] introduced CombineDeepNet using BiLSTM and BiGRU (R² = 0.96), while in a separate study [10] they proposed a Gaussian-mixture Variational Autoencoder that reduced RMSE by at least 31% compared to standard recurrent models. Kalajdjieski et al. [13] used adversarial networks to handle missing sensor data, and Lakshmi and Krishnamoorthy [16] achieved R² = 0.999 with a BiConvLSTM model in Delhi. Graph-based approaches include Iskandaryan et al. [11], who combined attention, GRUs, and graph convolutional networks in Madrid, and Kumar et al. [15], who paired a GCN with Transformer-GRU for multi-city forecasting in India. Cao et al. [4] applied a PatchTST-Enhanced model in Hebei Province, and Mazinani et al. [17] showed that quantized CNN-BiGRU models can reduce size by 66% with minimal accuracy loss. Kristiani et al. [14] combined XGBoost-based feature selection with LSTM in Taichung.
Despite this breadth, existing studies span China, India, Europe, and Southeast Asia, yet no PM2.5 forecasting study has targeted Central Asia. Chennareddy et al. [6] applied ConvLSTM in Kolkata (R² = 0.901), and Berkani et al. [2] showed that LightGBM and CatBoost outperform other models in Morocco. XGBoost and GRU each show strong results individually, but direct comparisons on the same dataset are rare. This paper fills that gap with a head-to-head evaluation at three Almaty stations, the first such study for the Central Asian region.
Materials and methods
Almaty is the largest city in Kazakhstan and Central Asia, home to approximately two million people. It lies at the northern foothills of the Trans-Ili Alatau range, part of the Tien Shan mountain system, with elevations ranging from roughly 600 m in the north to over 1,000 m in the south. The basin's shape, combined with a continental climate of cold winters and hot summers, frequently produces temperature inversions that trap pollutants near the ground. The problem is most acute in winter (November–February), when heating emissions rise and weaker atmospheric mixing prolongs smog episodes.
Figure 1. Locations of the three air quality monitoring stations in Almaty
Figure 1 shows the locations of the selected monitoring stations within Almaty. The dataset, covering the year 2024, was obtained from the Almaty air quality monitoring network [22]. It includes about 1.83 million data points, each recorded every twenty minutes at various monitoring stations. Measured parameters include PM2.5, air temperature (AT), wind speed (WS), wind direction (WD), barometric pressure (BP), and several other pollutants not used here. We selected three stations with the highest volume of PM2.5 measurements: Alm-016, Alm-013, and Alm-018. Table 1 summarizes their characteristics.
Table 1.
Summary of Selected Monitoring Stations
| Station | District | Coordinates | PM2.5 Records |
|---------|-----------|-------------------|---------------|
| Alm-016 | Almaly | 43.24°N, 76.95°E | 29,067 |
| Alm-013 | Bostandyk | 43.22°N, 76.90°E | 28,946 |
| Alm-018 | Nauryzbay | 43.30°N, 76.84°E | 28,499 |
The raw data contained several quality issues: wind speed readings as high as 959 m/s, air temperature values up to 500°C, barometric pressure readings of zero, negative PM2.5 concentrations, and approximately 7.4% of records with missing parameter names.
The preprocessing pipeline was carried out as follows:
1) Feature selection - for each station, only PM2.5 and four meteorological variables (AT, WS, WD, BP) were retained.
2) Outlier removal - physically impossible values were excluded: WS > 50 m/s, AT outside [−50, 60]°C, BP < 500 hPa or BP = 0, and negative PM2.5 concentrations.
3) Temporal alignment - the filtered data was pivoted into a time-indexed table per station and resampled from 20-minute to hourly intervals via mean aggregation.
4) Missing value imputation - forward-fill interpolation (limit of 3 consecutive gaps) was applied; remaining incomplete rows were dropped.
5) Lag feature engineering - three lag features, PM2.5(t−1), PM2.5(t−2), PM2.5(t−3), were created, encoding PM2.5 concentrations one, two, and three hours before the target time step.
6) Train–test split - the dataset was divided into training and testing sets, maintaining the original order with an 80/20 split. After preprocessing, each station had about 5,200 hourly records.
7) Normalization - MinMaxScaler mapped all features and the target to the [0, 1] range. The scaler was fitted on the training set only.
The final feature set comprised seven variables: AT, WS, WD, BP, PM2.5(t−1), PM2.5(t−2), and PM2.5(t−3). The target variable was PM2.5(t).
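Steps 2–5 of the pipeline above can be sketched with pandas as follows. This is a minimal illustration on synthetic 20-minute data; the column names, value ranges, and random input are assumptions for demonstration, not the station data itself.

```python
import numpy as np
import pandas as pd

# Synthetic 20-minute records for one station; columns mirror the retained variables.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "PM2.5": rng.uniform(0.005, 0.15, n),   # mg/m3
    "AT": rng.uniform(-20, 35, n),          # air temperature, deg C
    "WS": rng.uniform(0, 12, n),            # wind speed, m/s
    "WD": rng.uniform(0, 360, n),           # wind direction, deg
    "BP": rng.uniform(860, 900, n),         # barometric pressure, hPa
}, index=pd.date_range("2024-01-01", periods=n, freq="20min"))

# 2) Outlier removal: mask physically impossible values.
df.loc[df["WS"] > 50, "WS"] = np.nan
df.loc[~df["AT"].between(-50, 60), "AT"] = np.nan
df.loc[(df["BP"] < 500) | (df["BP"] == 0), "BP"] = np.nan
df.loc[df["PM2.5"] < 0, "PM2.5"] = np.nan

# 3) Resample from 20-minute to hourly intervals via mean aggregation.
hourly = df.resample("60min").mean()

# 4) Forward-fill at most 3 consecutive gaps, then drop incomplete rows.
hourly = hourly.ffill(limit=3).dropna()

# 5) Lag features: PM2.5 one, two, and three hours before the target.
for k in (1, 2, 3):
    hourly[f"PM2.5_lag{k}"] = hourly["PM2.5"].shift(k)
hourly = hourly.dropna()

print(hourly.shape)  # 8 columns: 7 features plus the PM2.5 target
```

The chronological 80/20 split and MinMaxScaler fitting (steps 6–7) would then be applied to this hourly frame, with the scaler fitted on the training portion only.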
XGBoost, or eXtreme Gradient Boosting, is a scalable method for gradient-boosted decision trees [5]. This algorithm builds a group of weak learners one after the other, where each new tree is built to correct the errors of the trees that came before it. XGBoost has shown considerable success in regression tasks involving tabular data, including the prediction of PM2.5 levels. The model processes a seven-dimensional feature vector, ultimately producing a single PM2.5 prediction.
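The boosted-tree setup can be sketched as below. Since the xgboost package may not be installed everywhere, this sketch uses scikit-learn's GradientBoostingRegressor as a stand-in with the same additive-trees principle; the hyperparameters and synthetic data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in: 7 features, ordered as (AT, WS, WD, BP, lag1, lag2, lag3).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 7))
# Target dominated by the lag features, mimicking strong PM2.5 autocorrelation.
y = 0.6 * X[:, 4] + 0.25 * X[:, 5] + 0.1 * X[:, 6] + 0.05 * rng.normal(size=1000)

# Chronological 80/20 split, preserving temporal order (no shuffling).
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Each successive tree fits the residual errors of the ensemble so far.
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=4)
model.fit(X_tr, y_tr)
test_r2 = r2_score(y_te, model.predict(X_te))
print(round(test_r2, 3))
```

Swapping in `xgboost.XGBRegressor` is a one-line change once the library is available, as both expose the scikit-learn estimator interface.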
The Gated Recurrent Unit (GRU), a type of recurrent neural network, is designed to model temporal relationships in sequential data [8]. It uses two gating mechanisms, the reset gate and the update gate, to control how information flows and to reduce the vanishing gradient problem. The GRU model used in this study was designed to handle a sliding window of 24 consecutive hourly observations, each described by seven input features, yielding an input tensor of shape (24, 7). The model's structure included a single GRU layer with 64 hidden units, followed by a Dense layer with a single output neuron. The Adam optimizer, along with the mean squared error loss function, was used to train the model. To reduce overfitting, early stopping was used, with a patience setting of five epochs, and a 10% validation split was employed.
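The GRU itself requires a deep learning framework, but the sliding-window input construction described above can be shown framework-free. The sketch below, in NumPy with hypothetical synthetic data, builds the (samples, 24, 7) tensor the network consumes.

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, window: int = 24):
    """Stack sliding windows of `window` consecutive hourly rows.

    features: (T, 7) array of scaled hourly feature rows.
    target:   (T,)  array of scaled PM2.5 values.
    Returns X of shape (T - window, window, 7) and y of shape (T - window,),
    so each sample pairs 24 hours of history with the next hour's PM2.5.
    """
    X = np.stack([features[i:i + window] for i in range(len(features) - window)])
    y = target[window:]
    return X, y

# Hypothetical scaled data: 100 hours, 7 features each.
rng = np.random.default_rng(1)
feats = rng.uniform(0, 1, size=(100, 7))
pm25 = rng.uniform(0, 1, size=100)

X, y = make_windows(feats, pm25)
print(X.shape, y.shape)  # (76, 24, 7) (76,)
```

A Keras model matching the description in the text would then be roughly `Sequential([GRU(64, input_shape=(24, 7)), Dense(1)])`, compiled with Adam and MSE loss and trained with early stopping (patience 5) on a 10% validation split.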
Three standard regression metrics were used to compare the models: Root Mean Square Error (RMSE):

RMSE = √((1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²),

Mean Absolute Error (MAE):

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − ŷᵢ|,

and the Coefficient of Determination (R²):

R² = 1 − Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²,

where ŷᵢ is the predicted value for observation i, yᵢ is the actual value for observation i, ȳ is the mean of all actual values, and n is the total number of observations. Lower RMSE and MAE values indicate better performance, while R² values closer to 1.0 indicate a better fit.
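These three metrics follow directly from their definitions; a minimal NumPy implementation with a small worked example (values chosen for illustration only):

```python
import numpy as np

def rmse(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return float(1 - ss_res / ss_tot)

# Toy example: four hourly PM2.5 observations (mg/m3) and predictions.
y_true = [0.02, 0.05, 0.10, 0.04]
y_pred = [0.03, 0.04, 0.09, 0.05]
print(rmse(y_true, y_pred), mae(y_true, y_pred), round(r2(y_true, y_pred), 4))
```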
Results and discussion
Table 2 presents the full performance comparison. XGBoost outperforms GRU on every metric at every station. The best results are achieved by XGBoost at Alm-013: R² = 0.8705, RMSE = 0.0157 mg/m³, and MAE = 0.0107 mg/m³.
Table 2.
Performance Comparison of XGBoost and GRU
(units: mg/m³ for RMSE and MAE)
| Station | Model | RMSE | MAE | R² |
|---------|---------|--------|--------|--------|
| Alm-016 | XGBoost | 0.0174 | 0.0123 | 0.8133 |
| Alm-016 | GRU | 0.0207 | 0.0145 | 0.7362 |
| Alm-013 | XGBoost | 0.0157 | 0.0107 | 0.8705 |
| Alm-013 | GRU | 0.0201 | 0.0140 | 0.7857 |
| Alm-018 | XGBoost | 0.0082 | 0.0042 | 0.7298 |
| Alm-018 | GRU | 0.0100 | 0.0063 | 0.5961 |
Alm-013 yielded the highest R² values in both models (0.8705 and 0.7857), indicating the most consistent and predictable PM2.5 patterns among the three sites. This is likely because the station sits in the Bostandyk district, a central urban area with comparatively stable emission sources and therefore more regular temporal patterns.
Alm-016, in the Almaly district, shows moderate performance (XGBoost R² = 0.8133). It has the largest raw PM2.5 record count (29,067), reflecting consistent monitoring in a densely populated urban setting with diverse pollution sources.
Alm-018 is the most difficult station to predict, with the lowest R² values in both models (0.7298 and 0.5961). Located in the peripheral Nauryzbay district, it likely experiences a narrower PM2.5 concentration range, which makes small fluctuations harder to capture. Its lower RMSE (0.0082 for XGBoost) reflects the smaller absolute magnitude of concentrations rather than superior predictive precision.
Figure 2. Scatter plot of actual vs. predicted PM2.5 concentrations at station Alm-016 for XGBoost and GRU models
Figure 3. Scatter plot of actual vs. predicted PM2.5 concentrations at station Alm-013 for XGBoost and GRU models
Figure 4. Scatter plot of actual vs. predicted PM2.5 concentrations at station Alm-018 for XGBoost and GRU models
Figures 2–4 compare observed and predicted PM2.5 concentrations at all stations for both models. XGBoost predictions cluster more tightly around the line of perfect agreement (y = x), while GRU predictions scatter more widely, particularly at high concentrations where accurate forecasts matter most for public health notifications. Both models tend to underestimate extreme PM2.5 events, a common limitation arising from the rarity of peak episodes in the training data.
Figure 5. Bar chart comparison of RMSE, MAE, and R² for XGBoost and GRU across all three stations
Figure 5 illustrates XGBoost's consistent advantage. The largest gap appears at Alm-018, where XGBoost achieved R² = 0.7298 versus 0.5961 for GRU, a difference of more than 13 percentage points.
The consistent advantage of XGBoost can be explained by several factors. The lagged PM2.5 features (t−1, t−2, t−3) carry substantial autoregressive predictive power, which tree-based models exploit readily: XGBoost derives threshold-based decision rules directly from these lagged values, efficiently leveraging the strong autocorrelation in the hourly PM2.5 series. The GRU can also access these features within its 24-hour window, but the added sequential processing yields no proportional benefit when the most predictive information lies in the most recent lags.
With roughly 4,000 training samples per station, the dataset is small by deep learning standards. Tree-based ensembles are effective at dividing the feature space in these situations. In contrast, neural networks usually need larger datasets to learn useful representations. The seven-dimensional input is well-suited for XGBoost's split-based learning, but it does not offer much sequential structure for the GRU to use. A more complex feature set or a longer history could be better for the recurrent model.
Our best XGBoost R² of 0.87 is comparable to the 0.856 achieved by Naz et al. [18] with LSTM in Belfast. Chennareddy et al. [6] achieved an accuracy of 0.901 in Kolkata using a BiConvLSTM model with a more complex architecture and a larger dataset. Studies with attention mechanisms, such as Lakshmi and Krishnamoorthy [16] (R² = 0.999), employ substantially more complex models and multi-site data, suggesting room for improvement in Almaty through advanced architectures.
Almaty's geography creates distinctive pollution dynamics. The city's basin shape governs how pollutants disperse, with temperature inversions and valley effects shaping concentration patterns. Stagnant winter air, combined with increased heating emissions, likely drives the high-concentration events where the models are least accurate. Incorporating seasonal indicators and terrain effects could improve these models.
Conclusion
This study evaluated XGBoost and GRU models for forecasting hourly PM2.5 concentrations at three monitoring stations in Almaty, Kazakhstan. Using meteorological data and historical PM2.5 records, XGBoost outperformed GRU at every station and on every metric. The best results were obtained at station Alm-013, with R² = 0.8705 and RMSE = 0.0157 mg/m³.
These findings confirm that gradient-boosted ensembles are effective for predicting PM2.5 concentrations. This is especially true when there is not much training data and strong autoregressive lag features are available. Compared to recurrent neural networks, tree-based models require less data, fewer hyperparameters, and shorter training times, making them practical for operational deployment in monitoring networks with moderate data availability.
This study, the first to use machine learning to forecast PM2.5 levels in Almaty and Central Asia, lays the groundwork for future research. However, some limitations should be noted. The analysis is limited to the year 2024, with sparse monitoring data in July. Additionally, this study does not include spatial modeling between the different monitoring stations, and it only considers weather data as external factors.
Future research should involve extending the dataset over several years to capture seasonal and yearly changes. It should also use spatial models, like graph neural networks, to take advantage of the relationships between different stations. In addition, including more predictors, such as traffic data, satellite observations, and indicators for the heating season, is recommended. Finally, it is important to evaluate hybrid models that combine predictions from tree-based methods and deep learning techniques.
References:
- Al-Eidi S., Amsaad F., Darwish O., Tashtoush Y., Alqahtani A., Niveshitha N. Comparative Analysis Study for Air Quality Prediction in Smart Cities Using Regression Techniques // IEEE Access. – 2023. – Vol. 11. – P. 115140–115149.
- Berkani S., Gryech I., Ghogho M., Guermah B., Kobbane A. Data Driven Forecasting Models for Urban Air Pollution: MoreAir Case Study // IEEE Access. – 2023. – Vol. 11. – P. 133131–133142.
- Cao Y., Zhang D., Ding S., Zhong W., Yan C. A Hybrid Air Quality Prediction Model Based on Empirical Mode Decomposition // 2024. – Vol. 29, No. 1.
- Cao W., Zhang R., Cao W. Multi-Site Air Quality Index Forecasting Based on Spatiotemporal Distribution and PatchTST-Enhanced: Evidence From Hebei Province in China // IEEE Access. – 2024. – Vol. 12. – P. 132038–132055.
- Chen T., Guestrin C. XGBoost: A Scalable Tree Boosting System // Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. – 2016. – P. 785–794.
- Chennareddy S., Saha S., Das A., Kayal T. PM2.5 Concentration Forecasting in the Kolkata Region With Spatiotemporal Sliding Window Approaches // IEEE Access. – 2024. – Vol. 12. – P. 82333–82353.
- Chiang P.W., Horng S.J. Hybrid Time-Series Framework for Daily-Based PM2.5 Forecasting // IEEE Access. – 2021. – Vol. 9. – P. 104162–104176.
- Cho K., van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation // arXiv preprint arXiv:1406.1078. – 2014.
- Dey P., Dev S., Schoen Phelan B. CombineDeepNet: A Deep Network for Multistep Prediction of Near-Surface PM2.5 Concentration // IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. – 2024. – Vol. 17. – P. 788–807.
- Dey P., Dev S., Schoen Phelan B. Predicting Multivariate Air Pollution: A Gaussian-Mixture Nested Factorial Variational Autoencoder Approach // IEEE Geoscience and Remote Sensing Letters. – 2024. – Vol. 21.
- Iskandaryan D., Ramos F., Trilles S. Graph Neural Network for Air Quality Prediction: A Case Study in Madrid // IEEE Access. – 2023. – Vol. 11. – P. 2729–2742.
- Jai Kumaran G., Mohan S. A Review on Air Pollution Prediction Using Artificial Intelligence // IEEE Access. – 2026.
- Kalajdjieski J., Trivodaliev K., Mirceva G., Kalajdziski S., Gievska S. A Complete Air Pollution Monitoring and Prediction Framework // IEEE Access. – 2023. – Vol. 11. – P. 88730–88744.
- Kristiani E., Kuo T.Y., Yang C.T., Pai K.C., Huang C.Y., Nguyen K.L.P. PM2.5 Forecasting Model Using a Combination of Deep Learning and Statistical Feature Selection // IEEE Access. – 2021. – Vol. 9. – P. 68573–68582.
- Kumar S., Kour V., Raj A., Tapung T., Mishra S., Misra R., Singh T.N. Optimizing Air Pollution Forecasting Models Through Knowledge Distillation: A Novel GCN and TRANS_GRU Methodology for Indian Cities // IEEE Access. – 2025. – Vol. 13. – P. 40237–40257.
- Lakshmi S., Krishnamoorthy A. Effective Multi-Step PM2.5 and PM10 Air Quality Forecasting Using Bidirectional ConvLSTM Encoder-Decoder with STA Mechanism // IEEE Access. – 2024. – Vol. 12. – P. 179628–179647.
- Mazinani A., Antonucci D., Pau D.P., Davoli L., Ferrari G. Air Quality Prediction via Embedded ML/DL and Quantized Models // IEEE Access. – 2025. – Vol. 13. – P. 154203–154218.
- Naz F., McCann C., Fahim M., Cao T.V., Hunter R., Viet N.T., Nguyen L.D., Duong T.Q. Comparative Analysis of Deep Learning and Statistical Models for Air Pollutants Prediction in Urban Areas // IEEE Access. – 2023. – Vol. 11. – P. 64016–64025.
- World Health Organization. Ambient (outdoor) air pollution [Electronic resource]. – 2024. – URL: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health/.
- Yang C.H., Chen P.H., Yang C.S., Chuang L.Y. Analysis and Forecasting of Air Pollution on Nitrogen Dioxide and Sulfur Dioxide Using Deep Learning // IEEE Access. – 2024. – Vol. 12. – P. 165236–165252.
- Zhang Q., Han Y., Li V.O.K., Lam J.C.K. Deep-AIR: A Hybrid CNN-LSTM Framework for Fine-Grained Air Pollution Estimation and Forecast in Metropolitan Cities // IEEE Access. – 2022. – Vol. 10. – P. 55818–55841.
- Smart Almaty. Air Quality Sensor Measurements API [Electronic resource]. – 2024. – URL: https://admin.smartalmaty.kz/api/v1/ecology/eco_air_sensors_measurements/.