Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
PREDICTING UNEMPLOYMENT USING SEARCH, MOBILITY AND VACANCY BIG DATA
UDC 004.942
ABSTRACT
This study examines the feasibility of forecasting Kazakhstan's unemployment rate using Google Trends search query data within machine learning and classical time series frameworks. Official unemployment statistics are published quarterly as a single national aggregate, smoothing regional and sectoral heterogeneity and introducing publication lags that limit their utility for real-time labor market monitoring. Internet search data reflect household behavioral responses to changing economic conditions, offering a high-frequency leading indicator. Using quarterly data from 2017–2024, a feature set is constructed from three Google Trends indices related to job-seeking behavior (job/work, vacancies, find a job), selected based on stationarity after first-differencing. These serve as predictors in Random Forest and XGBoost models, benchmarked against SARIMA. Performance on a hold-out test set (Q1 2023–Q4 2024) using RMSE, MAE, and R² indicates SARIMA superiority (RMSE=0.125) over XGBoost (0.167) and Random Forest (0.180). Negative R² values reflect low series volatility and limited training sample (20 observations). Search features yielded no measurable predictive gains. Findings delineate boundary conditions for search-based indicators in emerging economies.
Keywords: unemployment forecasting, Google Trends, machine learning, Kazakhstan, nowcasting, labor market.
Introduction
Kazakhstan’s economy entered 2025 with one of the lowest unemployment rates in the country’s modern history: the rate stood at 4.6% in the fourth quarter of 2024. After a short-term increase during the COVID-19 pandemic, when unemployment rose from 4.8% to 5.0% in the second quarter of 2020, the indicator declined steadily over 17 consecutive quarters. Compared with the pre-pandemic period (the fourth quarter of 2019), the unemployment rate had decreased by 0.2 percentage points by the end of 2024. In absolute terms, however, the number of unemployed people increased by 7 thousand over this period, reaching 448 thousand by the end of 2024. In 2022 the number of unemployed had reached 460 thousand, after which it declined rather rapidly.
An important feature of official unemployment statistics in Kazakhstan is that the published indicators represent an average value across urban and rural areas. Such aggregation may smooth regional and sectoral differences in the labor market and reduce the sensitivity of the indicator to short-term economic shocks. For example, downturns in specific sectors, such as agriculture, are not always reflected in the dynamics of the official unemployment rate in rural areas.
Beyond the urban-rural divide, Kazakhstan’s labor market is characterized by significant structural heterogeneity. The country’s vast geography, uneven distribution of economic activity across regions, and the coexistence of resource-intensive export sectors alongside subsistence-oriented rural employment create a labor market that is difficult to capture through a single aggregate indicator. The official unemployment rate, derived from labor force surveys conducted on a quarterly basis, reflects conditions as they existed weeks or months prior to publication.
This publication lag, combined with the smoothing effect of aggregation, means that policymakers and analysts are often operating with an incomplete and outdated picture of labor market conditions. The limitations of traditional statistical methods have become increasingly apparent in the context of rapid economic change. Researchers and statistical agencies in a number of countries have begun integrating non-traditional data — including internet search activity, mobile device location data, and online job vacancy postings — into labor market monitoring frameworks, with promising results. An increase in the frequency of search queries related to job search, job loss, and social support may precede changes in official unemployment statistics. By evaluating the predictive power of each data source individually and in combination, this study seeks to contribute both to the applied literature on nowcasting labor markets in emerging economies and to the practical development of an early warning system for Kazakhstan’s labor market.
Literature Review
The application of predictive analytics in economic forecasting has gained significant attention in recent years, particularly with advancements in machine learning and big data technologies. For instance, Punia and Shankar (2022) developed a deep learning-based ensemble model specifically aimed at demand forecasting [1]. Their approach integrates structured and unstructured data, improving accuracy through covariate modeling and real-time demand sensing. Similarly, Fameliti and Skintzi (2024) focused on stock market volatility prediction during the COVID-19 pandemic [2]. Their work demonstrated how uncertainty indices for G7 stock markets could be modeled using HAR-RV models and combination frameworks, achieving better forecasting performance during periods of heightened economic turbulence.
The integration of big data in predicting economic indicators has also shown promising results. Liu and Tang (2022) proposed a big data-based method utilizing neural networks to forecast government economic situations, achieving better accuracy and efficiency [3]. Another approach employing big data techniques was introduced by Al-Azzawi et al. (2023), who used Bayesian belief networks to predict economic indicators, with the advantage of incorporating domain knowledge to improve prediction accuracy [4]. Furthermore, Aoki et al. (2023) developed a fully data-driven methodology relying on search engine query data to approximate economic variables in real time [5]. This model notably outperformed human-selected models during crises, such as the COVID-19 pandemic, due to its ability to rapidly adapt to changing conditions and provide timely predictions.
Machine learning models have also proved useful in forecasting inflation, with various researchers comparing statistical learning techniques to traditional methods. Botha et al. (2023) demonstrated how statistical learning models could be applied to forecast South African inflation during economic crises, when nonlinear relationships between variables are prevalent [6]. Likewise, Medeiros et al. (2021) emphasized the superiority of random forest models in forecasting U.S. inflation [7]. Their approach leveraged large datasets to capture nonlinear interactions between macroeconomic variables, resulting in improved predictive performance compared to conventional models.
Social big data has also been explored as a potential source for economic forecasting. Yamada et al. (2018) proposed a methodology for estimating Japanese economic indicators based on word frequencies from blogs [8]. By analyzing social media data, the model enabled real-time predictions with reduced time lags compared to official government announcements, highlighting the value of unconventional data sources for economic monitoring.
Despite these advancements, significant challenges remain in handling noisy and unstable data. Huang (2021) addressed such issues by focusing on FPGA-based platforms for exchange rate prediction, which aimed to improve signal-to-noise ratios and stability of predictions in financial data mining [9]. These efforts underline the ongoing struggle to manage the volatility of financial markets while aiming for more reliable forecasting systems. To facilitate empirical analysis and reproducibility of related research, McCracken and Ng (2016) introduced the FRED-MD database, a comprehensive monthly macroeconomic database designed for big data research [10]. This tool has become an essential resource for researchers working on macroeconomic forecasting, providing a standardized framework for comparison and validation of predictive models.
Research Methodology
This study uses four categories of monthly data covering the period from January 2017 to December 2024, resulting in 96 observations per series. The literature on Google Trends, mobility data, and online vacancy postings makes a consistent case for the value of high-frequency alternative data sources in labor market monitoring. The Google Trends data are illustrated in Fig. 1.
Figure 1. Google Trends
Combining official statistics with three types of alternative high-frequency indicators enables a comprehensive analysis of labor market dynamics in Kazakhstan. The dependent variable is the monthly unemployment rate for Kazakhstan, obtained from the Bureau of National Statistics (BNS) of the Agency for Strategic Planning and Reforms of the Republic of Kazakhstan. The indicator is based on the Labor Force Survey (LFS) conducted according to International Labour Organization (ILO) methodology and represents the share of unemployed persons in the economically active population. Although published monthly, the data are survey-based, reflect prior-period conditions, and may be revised. The series covers both urban and rural populations and is reported at the national level.
Search query data are sourced from Google Trends, which provides a normalized index of relative search volume for specified keywords over time and geography. The index ranges from 0 to 100, where 100 represents peak search interest within the selected period. For Kazakhstan, queries were collected in both Russian and Kazakh, reflecting the country’s main internet languages. Keywords cover job search behavior (e.g., “работа” (job/work), “вакансии” (vacancies), “найти работу” (find a job)), unemployment benefits (e.g., “пособие по безработице” (unemployment benefit), “центр занятости” (employment center)), and labor market transitions such as dismissal and redundancy (e.g., “сокращение” (redundancy), “увольнение” (dismissal)). Weekly data were aggregated into monthly averages to match the frequency of the dependent variable. Since Google Trends values are relative and may vary depending on extraction time, all series were collected in a single session and consistently normalized.
Job vacancy data were obtained from hh.kz, the leading online recruitment platform in Kazakhstan and part of the HeadHunter Group. Monthly counts of active job postings were used as a proxy for labor demand and employer hiring intentions in near real time. Raw vacancy counts were log-transformed to reduce skewness and stabilize variance, and seasonal adjustment was applied to account for predictable hiring cycles linked to fiscal and academic calendars.
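The frequency alignment and log transformation described above can be sketched as follows. This is a minimal illustration assuming pandas and NumPy; the function names and all numeric values are hypothetical and are not taken from the study's data, and seasonal adjustment is not shown.

```python
import numpy as np
import pandas as pd


def weekly_to_monthly(weekly: pd.Series) -> pd.Series:
    """Average a weekly index within each calendar month (month-start labels)."""
    return weekly.resample("MS").mean()


def log_transform(counts: pd.Series) -> pd.Series:
    """Natural log of strictly positive counts, to reduce right skew."""
    if (counts <= 0).any():
        raise ValueError("counts must be strictly positive")
    return np.log(counts)


# Hypothetical weekly Google Trends values on the 0-100 scale (illustration only).
weeks = pd.date_range("2024-01-07", periods=8, freq="W")  # eight consecutive Sundays
trend = pd.Series([40, 44, 48, 52, 60, 62, 64, 66], index=weeks, dtype=float)
monthly_trend = weekly_to_monthly(trend)

# Hypothetical monthly counts of active vacancies (illustration only).
vacancies = pd.Series([1000.0, 1500.0], index=monthly_trend.index)
log_vacancies = log_transform(vacancies)

print(monthly_trend)
print(log_vacancies)
```

In this sketch each week is assigned to the month of its timestamp, so a week that straddles two months counts entirely toward one of them; a different convention would change the monthly averages slightly.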
Prior to modeling, all variables underwent a standardized preprocessing procedure. Stationarity was tested using the Augmented Dickey-Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests. Non-stationary series were differenced as necessary, and the order of integration was recorded for model specification. All continuous variables were then standardized to zero mean and unit variance to ensure comparability across different scales and to improve the numerical stability of machine learning algorithms.
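The differencing and standardization steps can be illustrated with a short NumPy sketch. The helper names and the input values are hypothetical. Estimating the mean and standard deviation on one sample and reusing them on another is an assumption made here for illustration; the text does not state on which sample the scaling statistics were computed.

```python
import numpy as np


def difference(x, order=1):
    """Return the series differenced `order` times."""
    return np.diff(np.asarray(x, dtype=float), n=order)


def fit_scaler(x):
    """Estimate the mean and (population) standard deviation of a sample."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std()
    if std == 0:
        raise ValueError("cannot standardize a constant series")
    return mean, std


def apply_scaler(x, mean, std):
    """Rescale a series with previously estimated statistics."""
    return (np.asarray(x, dtype=float) - mean) / std


# Hypothetical trending series (illustration only).
raw = [10.0, 12.0, 15.0, 19.0, 24.0]
diffs = difference(raw)            # first differences: 2, 3, 4, 5
mean, std = fit_scaler(diffs)
z = apply_scaler(diffs, mean, std)

print(diffs, z)
```

Keeping `fit_scaler` and `apply_scaler` separate makes it possible to rescale later observations with statistics estimated earlier, which avoids using future information when scaling.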
The forecasting framework is based on two machine learning models: Random Forest and XGBoost (Extreme Gradient Boosting). A Seasonal Autoregressive Integrated Moving Average (SARIMA) model is used as a benchmark. The SARIMA(p, d, q)(P, D, Q)₁₂ specification is selected using autocorrelation and partial autocorrelation diagnostics, along with Akaike Information Criterion (AIC) minimization. In addition, a SARIMAX extension including exogenous variables is estimated to compare linear and nonlinear approaches using the same feature set.
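For reference, the multiplicative seasonal ARIMA model with seasonal period 12 is conventionally written with the backshift operator $B$ (where $B y_t = y_{t-1}$) as shown below. This is the textbook form, given as a notational aid; the selected orders are not reported in the text, and the intercept $c$ is optional.

```latex
\Phi_P(B^{12})\,\phi_p(B)\,(1-B)^{d}\,(1-B^{12})^{D}\,y_t
  = c + \theta_q(B)\,\Theta_Q(B^{12})\,\varepsilon_t ,
\qquad
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Here $\phi_p$ and $\theta_q$ are the non-seasonal autoregressive and moving-average polynomials of orders $p$ and $q$, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, $\varepsilon_t$ is white noise, $k$ is the number of estimated parameters, and $\hat{L}$ is the maximized likelihood. Among candidate specifications, the one with the lowest AIC is preferred.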
The dataset is split into a training sample (January 2017–December 2022, 72 observations) and a test sample (January 2023–December 2024, 24 observations). This ensures that evaluation is performed exclusively on unseen data, allowing for unbiased assessment of forecasting performance.
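A chronological split of this kind can be sketched with pandas as below. The function name is hypothetical and the series holds placeholder values rather than actual unemployment figures; only the dates follow the split described above.

```python
import pandas as pd


def chronological_split(series: pd.Series, test_start: pd.Timestamp):
    """Split a time-indexed series so that all test dates follow all training dates."""
    train = series[series.index < test_start]
    test = series[series.index >= test_start]
    return train, test


# Monthly index for January 2017 to December 2024; values are placeholders.
months = pd.date_range("2017-01-01", "2024-12-01", freq="MS")
y = pd.Series(range(len(months)), index=months, dtype=float)

train, test = chronological_split(y, pd.Timestamp("2023-01-01"))
print(len(train), len(test))
```

Splitting by date rather than by random sampling keeps every test observation strictly later than the training data, which is what makes the evaluation out-of-sample in the time series sense.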
Hyperparameter tuning for machine learning models is conducted only within the training set using five-fold cross-validation to avoid information leakage. Final models are re-estimated on the full training set using optimal parameters before generating predictions for the test period.
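One way to build five validation folds inside the training set without letting later observations inform earlier ones is an expanding-window scheme, sketched below in plain Python. The text does not specify whether the folds were time-ordered, so this scheme, the function name, and the default fold sizes are assumptions made for illustration.

```python
def expanding_window_folds(n_obs, n_folds=5, min_train=None):
    """Yield (train_idx, val_idx) pairs where training always precedes validation."""
    if min_train is None:
        min_train = n_obs // (n_folds + 1)   # size of the first training window
    val_size = (n_obs - min_train) // n_folds
    if min_train < 1 or val_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * val_size
        # The last fold absorbs any remainder so that every observation is used.
        val_end = n_obs if k == n_folds - 1 else train_end + val_size
        yield list(range(train_end)), list(range(train_end, val_end))


# Example with a 72-observation training set.
folds = list(expanding_window_folds(72, n_folds=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx[0], val_idx[-1])
```

Each candidate hyperparameter setting would be scored on the validation part of every fold and the scores averaged; the test period is never touched during this step.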
Model performance is evaluated using three metrics. Root Mean Squared Error (RMSE) measures sensitivity to large errors and is particularly useful for detecting failure during structural changes. Mean Absolute Error (MAE) provides a robust measure of average forecast accuracy that is less sensitive to outliers. The coefficient of determination (R²) measures the proportion of variance explained in the test sample, offering an intuitive measure of fit. RMSE is used as the primary evaluation metric, consistent with standard practice in nowcasting literature.
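The three metrics can be written out directly, as in the NumPy sketch below. The actual and predicted values are hypothetical and chosen only to show how a small but systematic over-prediction on a nearly flat series produces a negative R².

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(e)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot on the evaluation sample."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical values in percent (illustration only).
actual = [4.8, 4.7, 4.7, 4.6]
predicted = [4.9, 4.9, 4.8, 4.8]

print(rmse(actual, predicted), mae(actual, predicted), r2(actual, predicted))
```

With these illustrative numbers the errors are at most 0.2 percentage points, yet R² is about -4, because the squared errors are five times the total variation of the actual values around their own mean.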
Results
The final dataset consists of 28 usable quarterly observations spanning Q2 2017 to Q4 2024, after first-differencing and lag construction. The dependent variable—Kazakhstan’s official unemployment rate—exhibits low variability over the sample period, ranging from 4.6% to 5.2%, with a mean of approximately 4.85% and a standard deviation of 0.14. This narrow range reflects a structurally stable labor market, with a modest increase during the COVID-19 pandemic in 2020–2021 followed by a gradual decline in subsequent years.
Of the five Google Trends series initially considered, three became stationary after first-differencing at the 5% significance level according to the Augmented Dickey-Fuller (ADF) test: “работа” (job/work, p = 0.000), “вакансии” (vacancies, p = 0.026), and “найти работу” (find a job, p = 0.000). Two series—“безработица” (unemployment) and “центр занятости” (employment center)—remained non-stationary and were excluded from the final feature set. The retained variables are primarily job-seeking queries rather than unemployment-awareness terms, which aligns with the theoretical mechanism in the literature: active job-search behavior tends to precede changes in official labor market statistics, whereas general unemployment-related searches are less predictive. The stationarity results are summarized in Table 1.
Table 1.
Stationarity testing of Google Trends search queries using the Augmented Dickey–Fuller (ADF) test
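The logic of the unit-root test can be illustrated with a plain Dickey-Fuller regression written in NumPy. This is a simplified sketch and not the procedure used in the study: the augmented test adds lagged differences to the regression, and a library routine such as `adfuller` in statsmodels would normally be used. The function name, the simulated data, and the approximate asymptotic 5% critical value of -2.86 (constant, no trend) are assumptions for illustration; with around 25 observations the critical value is closer to -3.0.

```python
import numpy as np


def dickey_fuller_t(y):
    """t-statistic of gamma in the regression  dy_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(dy.size), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (dy.size - X.shape[1])   # residual variance
    se_gamma = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return float(beta[1] / se_gamma)


CRIT_5PCT = -2.86  # approximate asymptotic value; small samples need a stricter one

# Simulated random walk (non-stationary in levels) and its first difference.
rng = np.random.default_rng(0)
level = np.cumsum(rng.normal(size=200))
t_level = dickey_fuller_t(level)
t_diff = dickey_fuller_t(np.diff(level))

print("levels:", t_level, "reject unit root:", t_level < CRIT_5PCT)
print("first difference:", t_diff, "reject unit root:", t_diff < CRIT_5PCT)
```

A statistic below the critical value rejects the unit-root hypothesis, which is the sense in which a series is said to become stationary after first-differencing.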
All models were evaluated on a hold-out test set covering Q1 2023 to Q4 2024, consisting of eight quarterly observations; actual and predicted values are shown in Fig. 2. Forecast accuracy was assessed using RMSE, MAE, and R² against the actual unemployment rate in levels, and the error comparison is shown in Fig. 3.
Figure 2. Actual vs Predicted Unemployment Rate
The results indicate that the SARIMA model outperformed both machine learning approaches across all evaluation metrics. SARIMA achieved an RMSE of 0.1254 and an MAE of 0.1104 percentage points. In comparison, XGBoost recorded an RMSE of 0.1669 and an MAE of 0.1563, while Random Forest performed slightly worse with an RMSE of 0.1799 and an MAE of 0.1678. All models produced negative R² values, indicating that none of them outperformed a constant forecast equal to the mean of the evaluation period. SARIMA remained the best-performing model with an R² of −3.37, while Random Forest exhibited the weakest performance with an R² of −8.01.
Figure 3. Forecast Error Comparison
The finding in Table 2 that SARIMA outperforms both machine learning models, combined with uniformly negative R² values, requires careful interpretation rather than being read as a modeling failure.
Table 2.
Performance comparison of forecasting models
Several structural factors likely explain these findings.
First, the sample size is extremely limited. After differencing and lag construction, the effective training set contains only 20 observations, which is insufficient for machine learning models such as Random Forest and XGBoost that typically require large datasets to generalize effectively. Although regularization techniques (e.g., shallow trees and penalization) were applied, these measures cannot fully compensate for the scarcity of data. In contrast, SARIMA is a low-parameter statistical model specifically designed for small-sample time series and is therefore better suited to this setting.
Second, the unemployment rate in Kazakhstan shows very low volatility throughout the test period. During 2023–2024, values ranged only between 4.6% and 4.8%, a total variation of just 0.2 percentage points across eight quarters. In such a low-variance environment, even small systematic prediction errors produce a residual sum of squares larger than the total variation of the target, which pushes R² below zero for any model. All models exhibit a consistent upward bias, with predicted values clustering around 4.74%–4.89% compared to actual values of 4.6%–4.8%. This bias is likely driven by training data from the pandemic period (2020–2021), when unemployment levels were higher, causing models to overestimate post-pandemic equilibrium levels.
Third, and most importantly for this study, Google Trends variables did not provide meaningful incremental predictive value beyond the SARIMA benchmark; the correlations between the search indices and unemployment are shown in the heatmap in Fig. 4. This is a substantive empirical result rather than a methodological shortcoming. It suggests that in the Kazakhstan context, characterized by a stable labor market, limited cyclical variation, and small sample sizes, search behavior does not contain sufficient additional signal to improve forecasting accuracy at quarterly frequency. This contrasts with findings from higher-frequency studies in more volatile economies, where search data often capture short-term fluctuations more effectively.
Figure 4. Correlation heatmap (Trends vs unemployment)
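The low-volatility explanation can be checked with simple arithmetic on the figures reported above. Since R² = 1 − RMSE² / Var, where Var is the mean squared deviation of the test values from their own mean, the reported RMSE and R² for SARIMA imply the dispersion of the test series. The sketch below assumes that R² was computed in this standard way on the test sample; the function names are hypothetical, and the derived values are approximations subject to rounding in the reported figures.

```python
import math


def implied_test_variance(rmse, r2):
    """Solve R2 = 1 - RMSE^2 / Var for Var."""
    return rmse ** 2 / (1.0 - r2)


def implied_r2(rmse, variance):
    """R2 implied by an RMSE and the variance of the evaluation sample."""
    return 1.0 - rmse ** 2 / variance


# Reported SARIMA figures: RMSE = 0.1254, R2 = -3.37.
var_test = implied_test_variance(0.1254, -3.37)
std_test = math.sqrt(var_test)          # roughly 0.06 percentage points

# Consistency check against the reported Random Forest figures (RMSE = 0.1799).
r2_rf = implied_r2(0.1799, var_test)    # about -7.99, close to the reported -8.01

print(round(std_test, 3), round(r2_rf, 2))
```

A test-period standard deviation of roughly 0.06 percentage points means that any model with an RMSE above that level has a negative R², even when its absolute errors are small relative to the level of the series.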
Several limitations should be acknowledged. First, the analysis is constrained to quarterly frequency due to the unavailability of monthly unemployment data in Kazakhstan, which significantly reduces the number of observations and limits statistical power. Second, complementary high-frequency datasets such as Google Mobility indicators were not available for the country. Third, historical vacancy data from hh.kz could not be reliably obtained due to API and access limitations. Fourth, Google Trends provides normalized relative indices rather than absolute search volumes, which introduces measurement noise and may weaken the observed relationships. Finally, the study relies on a single national-level unemployment indicator, which may obscure regional and sectoral heterogeneity that could be informative for forecasting.
Conclusion
This study examined the feasibility of forecasting Kazakhstan’s unemployment rate using Google Trends search indices within both machine learning and traditional time series frameworks. The findings show that, at a quarterly frequency and with a limited sample size, a parsimonious SARIMA model outperforms both Random Forest and XGBoost models augmented with search-based predictors. None of the tested models achieved a positive R² on the hold-out sample, which reflects both the extremely low volatility of Kazakhstan’s unemployment rate during 2023–2024 and the structural limitations imposed by a small dataset.
Several directions for future research can be identified. The most important is the construction of a monthly unemployment series for Kazakhstan, either through methodological refinement of the Bureau of National Statistics Labor Force Survey or by developing a composite indicator based on administrative data such as registered unemployment, unemployment benefit claims, and payroll tax records. A higher-frequency dataset would significantly increase the number of observations available for modeling and better align with the temporal advantages of Google Trends data.
References:
1. Punia S., Shankar S. Predictive analytics for demand forecasting: A deep learning-based decision support system // Knowledge-Based Systems. 2022. Vol. 258.
2. Fameliti S. P., Skintzi V. D. Uncertainty indices and stock market volatility predictability during the global pandemic: Evidence from G7 countries // Applied Economics. 2024. Vol. 56.
3. Liu Y., Tang A. Prediction method of government economic situation based on big data analysis // Digital Government: Research and Practice. 2022. Vol. 3.
4. Al-Azzawi A., Mora F. T., Lim C., Shang Y. Artificial intelligence methodology based on Bayesian belief networks for economic indicators prediction // International Journal of Advanced Computer Science and Applications. 2023. Vol. 14.
5. Aoki G., Ataka K., Doi T., Tsubouchi K. Data-driven estimation of economic indicators using search data // Journal of Finance and Data Science. 2023. Vol. 9.
6. Botha B., Burger R., Kotzé K., Rankin N., Steenkamp D. Big data forecasting of South African inflation // Empirical Economics. 2023. Vol. 65.
7. Medeiros M. C., Vasconcelos G. F., Veiga A., Zilberman E. Forecasting inflation in a data-rich environment // Journal of Business & Economic Statistics. 2021. Vol. 39.
8. Yamada K., Takayasu H., Takayasu M. Estimation of economic indicators from social big data // Entropy. 2018. Vol. 20.
9. Huang P. Big data application in exchange rate prediction based on FPGA // Microprocessors and Microsystems. 2021. Vol. 80.
10. McCracken M. W., Ng S. FRED-MD: A monthly database for macroeconomic research // Journal of Business & Economic Statistics. 2016. Vol. 34.