Master's Student, School of Information Technology and Engineering, Kazakhstan-British Technical University, Almaty, Kazakhstan
COMPARATIVE ANALYSIS OF MACHINE LEARNING MODELS FOR HOURLY RESIDENTIAL ENERGY CONSUMPTION FORECASTING USING LONDON SMART METER DATA
UDC 004.942
ABSTRACT
Accurate short-term electricity forecasting is essential for smart grids and efficient power system operation. This study compares five machine learning models to forecast hourly residential energy consumption using the London Smart Meter dataset combined with weather data. The input variables include hour of the day, day of the week, month of the year, temperature, humidity, wind speed, pressure, visibility, and apparent temperature. We considered five models: Linear Regression, Decision Tree Regressor, Random Forest Regressor, XGBoost Regressor, and MLPRegressor. The model performance was evaluated based on R² and RMSE measures. Random Forest achieved the best performance with R² = 0.824 and RMSE = 308.79, followed by XGBoost. The findings demonstrate that ensemble tree-based models perform better than neural networks when dealing with structured tabular data.
Keywords: energy forecasting, smart meter data, machine learning, Random Forest, XGBoost, load prediction, smart grid.
Introduction
Electricity demand forecasting is an essential task for modern power grids. For energy producers and providers, accurate forecasts are crucial to ensure that generation matches consumption, to minimize operational costs, and to avoid overload conditions. Inaccurate forecasts can lead to excess capacity, unstable network operation, and inefficient energy distribution [1].
Household energy forecasting is a difficult problem because numerous factors influence household electricity demand, including climate, season, time of day, day of the week, and occupant behavior. Traditional statistical methods such as linear regression or autoregressive models are frequently used; however, they often fail to capture the non-linear relationships between these variables [2].
Advances in smart metering now provide access to high-quality datasets of household consumption. Machine learning models trained on smart meter data can detect hidden patterns in user behavior and produce effective predictions [3].
Many studies have examined the use of machine learning techniques for electricity forecasting. For instance, Random Forest and Gradient Boosting models have been tested frequently and have shown good results due to their ability to model complex nonlinear interactions [4]. Artificial neural networks have also gained significant attention in recent years and often deliver high accuracy, provided that enough data and proper parameter tuning are available [5]. More recent studies have explored deep learning approaches and shown the effectiveness of Long Short-Term Memory networks and transformers in prediction problems [6]. However, advanced machine learning methods do not always outperform classical approaches, especially on structured tabular data.
Among publicly available household consumption datasets, one widely used resource is the London Smart Meter dataset. It contains extensive information about household electricity consumption and can be used for academic research [7]. When combined with weather data, it can be used to estimate the effect of weather conditions on household electricity consumption.
In this paper, five machine learning models are compared: Linear Regression, Decision Tree Regressor, Random Forest Regressor, XGBoost Regressor, and MLPRegressor. The main contribution of this paper is to demonstrate that tree-based ensemble methods remain stronger than neural networks for structured hourly smart meter prediction tasks.
Materials and methods
In this study, we used the London Smart Meter dataset together with hourly weather data. Electricity consumption was originally recorded at 30-minute intervals; to obtain an hourly series, we aggregated each pair of consecutive half-hourly readings into one hourly value.
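The half-hourly to hourly conversion described above can be sketched with pandas as follows. This is a minimal illustration, not the study's actual code: it assumes the two readings are summed (suggested by the target name energy_sum), and the function name `to_hourly` and the sample values are hypothetical.

```python
import pandas as pd

def to_hourly(half_hourly: pd.Series) -> pd.Series:
    """Sum each pair of 30-minute readings into one hourly total.

    Assumes `half_hourly` is indexed by timestamp (a DatetimeIndex).
    """
    # min_count=2 leaves an hour as NaN unless both half-hour readings exist,
    # so incomplete hours are not silently under-counted.
    return half_hourly.resample("60min").sum(min_count=2)

# Hypothetical readings for one household (values are illustrative only)
index = pd.date_range("2013-01-01 00:00", periods=4, freq="30min")
readings = pd.Series([0.25, 0.5, 0.75, 1.0], index=index)
hourly = to_hourly(readings)
print(hourly)
```

Using `min_count=2` is a design choice for this sketch; an hour with a missing half-hour reading then stays missing and can be handled explicitly in the cleaning step.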
The target variable was defined as energy_sum = hourly household electricity consumption.
First, the dataset was cleaned and prepared, including the handling of missing values. We then merged the electricity and weather data on their timestamps, so that each row represents one hourly observation.
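A timestamp-based merge of this kind might look like the sketch below. The column names `timestamp` and `temperature`, the inner join, and the sample rows are assumptions made for illustration; the paper does not specify how unmatched hours or missing values were treated.

```python
import pandas as pd

def merge_energy_weather(energy: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Join hourly consumption with hourly weather on a shared timestamp column."""
    merged = energy.merge(
        weather,
        on="timestamp",          # hypothetical shared column name
        how="inner",             # keep only hours present in both tables
        validate="many_to_one",  # raise if weather has duplicate timestamps
    )
    return merged.sort_values("timestamp").reset_index(drop=True)

# Illustrative rows only; not taken from the dataset
energy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-01-01 00:00", "2013-01-01 01:00", "2013-01-01 02:00"]
    ),
    "energy_sum": [0.75, 1.75, 0.5],
})
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2013-01-01 00:00", "2013-01-01 01:00"]),
    "temperature": [4.5, 4.0],
})
merged = merge_energy_weather(energy, weather)
print(merged)
```

With an inner join, the 02:00 consumption row is dropped because no weather record exists for that hour; a left join followed by imputation would be an alternative.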
Two categories of input features were considered:
- Time-based features: hour of the day, day of the week, and month. These features reflect cyclic patterns such as morning and evening peaks, weekend effects, and seasonality.
- Weather features: temperature, humidity, wind speed, pressure, visibility, and apparent temperature. These features matter because weather conditions affect heating and cooling as well as daily activities.
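The time-based features listed above can be derived directly from the timestamp. The sketch below assumes a column named `timestamp` and uses hypothetical output column names; the integer encoding (Monday = 0) follows pandas conventions and may differ from the encoding used in the study.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hour, day-of-week, and month columns derived from `timestamp`."""
    out = df.copy()
    ts = pd.to_datetime(out["timestamp"])
    out["hour"] = ts.dt.hour              # 0-23
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month            # 1-12
    return out

# One illustrative observation: a Saturday evening in January
sample = pd.DataFrame({"timestamp": ["2013-01-05 18:00"], "energy_sum": [1.2]})
features = add_time_features(sample)
print(features[["hour", "day_of_week", "month"]])
```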
All models were trained on the same set of features. For evaluation, we split the dataset into training and test subsets, fitted each model on the training subset, and measured its performance on the previously unseen test subset.
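The evaluation protocol can be sketched with scikit-learn as shown below. Several details are assumptions, since the paper does not state them: the 80/20 random split, the hyperparameters, and the synthetic data that stands in for the real features. XGBoost is omitted to keep the sketch free of extra dependencies; an `xgboost.XGBRegressor` could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

def compare_models(X, y, test_size=0.2, seed=0):
    """Fit every model on one training split and score it on one held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed  # assumed split, not from the paper
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=seed),
        "MLPRegressor": MLPRegressor(max_iter=300, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {
            "r2": float(r2_score(y_test, pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        }
    return scores

# Synthetic stand-in for (hour, temperature) features; not the London data
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=400)
temp = rng.normal(10, 5, size=400)
X = np.column_stack([hour, temp])
y = 1.0 + 0.5 * np.sin(2 * np.pi * hour / 24) - 0.02 * temp + rng.normal(0, 0.05, 400)

results = compare_models(X, y)
for name, s in results.items():
    print(f"{name}: R2={s['r2']:.3f}, RMSE={s['rmse']:.3f}")
```

Because the series is temporal, a chronological split (training on earlier hours, testing on later ones) may be a more realistic alternative to the random split assumed here.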
Two evaluation metrics were used:
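- R² score, which measures the proportion of variance in the target that the model explains. Higher values indicate better performance.
- RMSE (root mean squared error), which measures the typical magnitude of the prediction error in the units of the target. Lower values indicate better performance.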
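For reference, both metrics can be computed from their standard definitions, R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error). The sketch below uses plain Python and toy values (not data from the study) to show that R² equals 1 for perfect predictions, 0 for a model that always predicts the mean, and becomes negative for a model that does worse than the mean.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y_true, y_true)                # 1.0: no residual error
baseline = r2_score(y_true, [2.5] * 4)            # 0.0: always predicts the mean
inverted = r2_score(y_true, [4.0, 3.0, 2.0, 1.0]) # -3.0: worse than the mean
print(perfect, baseline, inverted, rmse(y_true, [4.0, 3.0, 2.0, 1.0]))
```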
Results and discussion
The forecasting results of all tested models are shown in Table 1.
Table 1.
Comparison of model performance
| Model | R² Score | RMSE |
|---|---|---|
| Random Forest | 0.824 | 308.79 |
| XGBoost | 0.818 | 314.41 |
| Decision Tree | 0.742 | 373.95 |
| MLPRegressor | 0.614 | 457.38 |
| Linear Regression | -3.489 | 1560.34 |
The Random Forest model performed best, achieving both the highest R² and the lowest RMSE. XGBoost performed almost identically and ranked second. The single Decision Tree produced relatively strong results but fell short of the two ensemble models. The MLPRegressor performed worse than expected and clearly underperformed Random Forest and XGBoost. Finally, Linear Regression was the weakest of all models tested and returned a negative R² score, which means that its predictions on the test set were worse than simply predicting the mean of the test target.
These findings support the main thesis of this paper: for hourly forecasting on structured smart meter data, tree-based ensemble models perform better than neural networks.
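As a quick consistency check on Table 1, note that if all models are scored on the same test set, the identity R² = 1 - MSE / Var(y) implies that RMSE² / (1 - R²) should be the same for every row. The sketch below applies this to the reported values; it assumes the standard definition of R² and that all five models shared one test set, as the methods section indicates. The helper name is hypothetical.

```python
# (R², RMSE) pairs copied from Table 1
table_1 = {
    "Random Forest": (0.824, 308.79),
    "XGBoost": (0.818, 314.41),
    "Decision Tree": (0.742, 373.95),
    "MLPRegressor": (0.614, 457.38),
    "Linear Regression": (-3.489, 1560.34),
}

def implied_target_variance(r2: float, rmse: float) -> float:
    """Rearrange R² = 1 - MSE / Var(y) into Var(y) = RMSE² / (1 - R²)."""
    return rmse ** 2 / (1.0 - r2)

implied = {
    name: implied_target_variance(r2, rmse) for name, (r2, rmse) in table_1.items()
}
# Ratio of the largest to the smallest implied variance; close to 1 means consistent
spread = max(implied.values()) / min(implied.values())

for name, var in implied.items():
    print(f"{name}: implied test-set variance = {var:,.0f}")
print(f"max/min ratio = {spread:.4f}")
```

The implied variances agree to within about one percent, which is compatible with rounding in the reported figures, so the negative R² of Linear Regression is numerically consistent with its large RMSE.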
This outcome can be explained by several reasons.
First, Random Forest [4] and XGBoost [8] are well suited to structured tabular data. The input features used in this study were engineered variables such as hour, day, month, temperature, humidity, and pressure. Tree-based algorithms are well known to provide strong predictions for such datasets [9].
Second, the demand for electricity is characterized by nonlinearity. Energy consumption is likely to rise dramatically in cases of severe weather conditions or increased evening household activity. Tree-based models can capture such nonlinear behavior naturally.
Third, neural networks require significantly more preparation and tuning before they perform well. This includes normalizing the input data, optimizing the model architecture, selecting an appropriate learning rate, and applying regularization. In contrast, tree-based algorithms require little preprocessing and are relatively easy to train.
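To illustrate the extra steps mentioned above, the following sketch wraps an MLPRegressor in a scikit-learn pipeline with input standardization and explicit settings for architecture, learning rate, and regularization. All hyperparameter values and the synthetic data are assumptions chosen for illustration; the paper does not report the configuration it used.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_mlp_pipeline(seed: int = 0):
    """Standardize inputs, then fit an MLP with explicit tuning knobs."""
    return make_pipeline(
        StandardScaler(),                 # normalization: zero mean, unit variance
        MLPRegressor(
            hidden_layer_sizes=(64, 32),  # architecture (illustrative choice)
            learning_rate_init=1e-3,      # initial learning rate
            alpha=1e-3,                   # L2 regularization strength
            early_stopping=True,          # hold out part of training data to stop early
            max_iter=500,
            random_state=seed,
        ),
    )

# Synthetic features on very different scales, mimicking pressure and temperature
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(1013, 10, 300), rng.normal(10, 5, 300)])
y = 0.05 * X[:, 1] + rng.normal(0, 0.1, 300)

model = build_mlp_pipeline().fit(X, y)
pred = model.predict(X[:5])
print(pred)
```

Putting the scaler inside the pipeline means it is fitted on training data only, which avoids leaking test-set statistics into the model. Tree-based models need no such step because their splits are invariant to monotonic rescaling of a feature.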
Fourth, the dataset is likely to contain noise caused by holidays, irregular work schedules, or abnormal customer behavior. Tree-based models tend to be more robust to such noise.
Finally, neural networks typically need larger amounts of data to reach their best performance. For medium-sized structured datasets, classical machine learning models remain relevant [10].
These results have practical implications. Hourly forecasts based on smart meter data help utilities balance supply and demand in the grid. Such models can support generation scheduling, peak reduction, and maintenance planning. Countries such as Kazakhstan can adopt the same approach to forecast hourly demand and plan load management. Finally, these forecasts can be used in demand-side management to design price-based and incentive-based policies.
This study has several limitations. The investigation is based on only one publicly available dataset. Consumer demographics and occupancy-related information were not considered. Only one type of neural network was tested; other deep learning techniques, such as LSTMs or transformer architectures, may produce better results under different conditions.
Conclusion
This study empirically evaluated the predictive performance of five machine learning algorithms for forecasting hourly residential electricity consumption using the London Smart Meter dataset combined with meteorological data. Among the models studied, Random Forest achieved the best overall performance with R² = 0.824 and RMSE = 308.79, followed by XGBoost. The Decision Tree showed moderate predictive performance, while the MLPRegressor underperformed all tree-based models. The weakest results were produced by Linear Regression, which suggests that a purely linear relationship is not sufficient to describe this problem.
These results therefore support the view that classical tree-based ensemble models remain a highly effective approach for structured smart meter datasets. Despite the growing popularity of artificial neural networks, in this study tree-based models achieved higher predictive accuracy and proved more practical to apply on a medium-sized tabular dataset from the energy sector. These outcomes may be valuable for further academic research as well as for electricity providers and government agencies that need accurate demand forecasts for energy planning.
References:
1. T. Hong and S. Fan, “Probabilistic electric load forecasting: A tutorial review,” International Journal of Forecasting, vol. 32, no. 3, pp. 914–938, 2016.
2. R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 3rd ed., OTexts, Melbourne, 2021.
3. A. Albert and R. Rajagopal, “Smart meter driven segmentation: What your consumption says about you,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4019–4030, 2013.
4. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
5. G. P. Zhang, “Time series forecasting using a hybrid ARIMA and neural network model,” Neurocomputing, vol. 50, pp. 159–175, 2003.
6. B. Wang, Y. Wang, and J. Li, “Deep learning for load forecasting: Review and future trends,” Energy AI, vol. 1, 2020.
7. J. Kelly and W. Knottenbelt, “The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes,” Scientific Data, vol. 2, 2015.
8. T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” Proceedings of ACM SIGKDD, pp. 785–794, 2016.
9. G. Shmueli, “To explain or to predict?” Statistical Science, vol. 25, no. 3, pp. 289–310, 2010.
10. S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “The M4 competition: Results, findings, conclusion and way forward,” International Journal of Forecasting, vol. 34, no. 4, pp. 802–808, 2018.