Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
SOLAR ENERGY FORECASTING USING LINEAR REGRESSION, RANDOM FOREST AND XGBOOST
ABSTRACT
This study compares machine learning algorithms for forecasting solar power generation using real operational data from two photovoltaic systems. A classical linear regression model and two ensemble methods (Random Forest and XGBoost) were evaluated. The dataset includes key meteorological variables such as solar radiation, ambient temperature, and module temperature. Data preprocessing involved cleaning, handling missing values, and feature scaling. Model performance was assessed using MAE, RMSE, and R². The results demonstrate that ensemble models, particularly Random Forest, outperform linear regression in prediction accuracy. Feature importance analysis indicates that solar radiation and module temperature are the most influential predictors. The study emphasizes not only predictive accuracy but also the robustness and practical applicability of ensemble models under real-world meteorological variability, providing insights into model stability and feature relevance for operational solar energy forecasting.
АННОТАЦИЯ
В работе проведено сравнение алгоритмов машинного обучения для прогнозирования выработки солнечной энергии на основе реальных эксплуатационных данных двух фотоэлектрических систем. Рассмотрены классическая линейная регрессия и ансамблевые методы (Random Forest и XGBoost). Набор данных включает ключевые метеорологические параметры: солнечную радиацию, температуру окружающей среды и температуру модулей. Предварительная обработка данных включала очистку, обработку пропущенных значений и масштабирование признаков. Оценка качества моделей выполнена с использованием метрик MAE, RMSE и R². Результаты показали, что ансамблевые модели, особенно Random Forest, обеспечивают более высокую точность по сравнению с линейной регрессией. Анализ значимости признаков подтвердил, что наибольшее влияние оказывают солнечная радиация и температура модулей. Исследование подчёркивает не только высокую точность прогнозирования, но и устойчивость и практическую применимость ансамблевых моделей в условиях реальной метеорологической изменчивости, что позволяет оценить стабильность моделей и значимость признаков для операционного прогнозирования солнечной генерации.
Keywords: solar energy forecasting, machine learning, ensemble learning, random forest, XGBoost, linear regression.
Ключевые слова: прогнозирование солнечной энергии, машинное обучение, ансамблевое обучение, random forest, XGBoost, линейная регрессия.
Introduction
The global transition to renewable energy is essential for a sustainable future, and one of its key benefits is improved air and water quality. Although a complete shift to renewable sources is difficult, integrating solar energy into the grid and forecasting its production can support this transition [1; 2]. Among renewable sources, solar energy is one of the fastest-growing and most affordable. However, the variability of solar power generation caused by weather fluctuations, cloud cover, and seasonal changes creates a significant challenge for efficient energy management [3]. Accurate prediction of solar output is therefore required [4].
Solar power production is difficult to estimate using typical linear models, but machine learning (ML) and deep learning (DL) offer considerable potential to increase prediction accuracy [5; 6]. Recent advances in AI have shown that hybrid models can significantly increase forecasting precision [1; 7]. However, several challenges must still be addressed to improve forecasts. This paper investigates an AI-driven strategy that applies machine learning models to enhance forecast accuracy and support effective energy management.
The application of ML/DL in energy management has increased significantly in recent years. One of the first studies in this area examined feedforward and Elman neural networks for solar power generation forecasting [8]. Neural networks proved useful in capturing time-series patterns, but they struggled with sudden weather fluctuations [9; 10]. Around the same time, hybrid approaches were investigated that combined numerical weather prediction (NWP) data, such as the Global Ensemble Forecast System (GEFS), with ML techniques like support vector machines (SVM) [11]. That study found that adding more GEFS weather variables improved accuracy but also increased the risk of overfitting [11]. Subsequently, researchers have increasingly focused on combining models into hybrid architectures, which have been shown to outperform standalone methods. A stacked ensemble approach that integrated weak predictors with association rule learning, along with hybrid architectures combining graph convolutional networks (GCNs), variational autoencoders (VAEs), convolutional neural networks (CNNs), and long short-term memory networks (LSTMs), significantly improved forecasting accuracy by leveraging rich weather data rather than simplified inputs [7; 12]. By using adversarial training techniques, these models refined their predictions over time and achieved lower error rates than conventional ML methods.
In addition to hybrid models, comparative studies have examined which ML algorithms are best suited to solar energy prediction [6]. These studies show that model performance varies with the forecasting task and energy source: SVM-based methods demonstrated high accuracy for solar energy prediction, whereas neural networks and Gaussian processes proved more effective in short-term or wind energy forecasting contexts [6; 13].
Additional studies have focused on improving solar energy forecasting through comparative analyses of various machine learning models, revealing that Artificial Neural Networks (ANN) outperformed several standalone approaches, while ensemble learning techniques achieved even higher prediction accuracy by leveraging multiple models [2; 14].
Moreover, feature selection has been a key challenge in solar forecasting research, with studies showing that incorporating too many weather-related variables into machine learning models can lead to overfitting and reduced generalization performance [6; 11]. Such models also struggle with real-world predictions. Careful selection of input data is therefore one of the key factors in model accuracy.
Beyond forecasting, artificial intelligence has also been applied to solar grid optimization and system monitoring; for instance, hybrid fault detection systems combining Type-2 Fuzzy Logic with ANN have been developed to enhance diagnostic accuracy in photovoltaic (PV) systems [15]. This method was particularly effective in identifying faults in PV systems, improving system reliability, and reducing downtime.
Although many studies have improved solar energy forecasting, some important problems remain. Most focused on machine learning models but did not fully consider how to update predictions in real time as weather conditions change, which makes them harder to use in real-world systems [1; 16]. Another issue is that most studies omit important factors such as pollution, seasonal changes, and solar panel aging, which can affect energy production [11; 13].
Materials and methods
In this paper, we apply machine learning models to forecast solar energy production based on weather conditions. The general scheme of the proposed methodology is presented in Figure 1.
Figure 1. Methodology
A. Research Setup
For this research, data analysis and model training were performed in the Jupyter Notebook environment. We used Pandas for data processing, Scikit-learn for implementing the linear regression and random forest models, XGBoost for gradient boosting, and Matplotlib and Seaborn for visualizing results.
B. Data collection
In this study, we used the publicly available “Solar Power Generation Data” dataset from the Kaggle platform, which includes real-world measurements from two photovoltaic (PV) power plants in India, collected at 15-minute intervals over an observation period of approximately 34 days. The data for each plant consist of two components: (1) generation data and (2) weather sensor data. These datasets contain the key variables necessary for forecasting solar power output, including timestamps, DC and AC power readings, daily and total energy yields, ambient and module temperatures, and solar irradiance. For Plant 1, the generation dataset contains 68,778 records with 7 features, and the weather dataset contains 3,182 records. Sample records and summary statistics are presented in Tables 1, 2, and 3:
Table 1.
Sample records from Plant 1 Generation dataset
Table 2.
Summary statistics for generation data (Plant 1 and Plant 2)
Table 3.
Summary statistics for weather sensor data
The inclusion of detailed weather parameters is critical, as they directly influence the performance of PV systems. Prior findings show that integrating meteorological data significantly enhances model accuracy [4]. Our feature selection is supported by research showing that temperature and solar irradiance have the greatest effect on power output [17]. Similarly, the use of a multivariate set of weather and operational features such as humidity, pressure, and sunlight has been shown to substantially reduce forecast error, reinforcing the importance of rich, multidimensional inputs when modeling solar time series data [16].
We followed a similar strategy by integrating comprehensive weather data with historical power generation records. This approach enables our models to learn from both daily and seasonal patterns, increasing robustness to variability in real-world climatic conditions. Unlike synthetic datasets, real measurements provide high-fidelity temporal and environmental diversity, which is essential for deploying predictive solutions in grid operations and energy planning.
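The integration of generation and weather records described above can be sketched as a timestamp join. This is a minimal illustration and not the authors' code: the column names (`DATE_TIME`, `SOURCE_KEY`, `DC_POWER`, `IRRADIATION`, and the temperature columns) are assumed to follow the layout of the Kaggle CSV files, and the small frames below are made-up stand-ins for the real data.

```python
import pandas as pd

# Tiny stand-in frames with illustrative values; the column names are an
# assumption about the Kaggle CSV layout, not confirmed by the paper.
generation = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["2020-05-15 12:00", "2020-05-15 12:00", "2020-05-15 12:15"]
    ),
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a"],  # hypothetical inverter ids
    "DC_POWER": [5000.0, 5200.0, 4800.0],
})
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 12:00", "2020-05-15 12:15"]),
    "AMBIENT_TEMPERATURE": [30.1, 30.4],
    "MODULE_TEMPERATURE": [48.0, 47.2],
    "IRRADIATION": [0.85, 0.80],
})

# One weather reading applies to every inverter at that timestamp, so a
# many-to-one left join keeps all generation rows and attaches the weather.
merged = generation.merge(
    weather, on="DATE_TIME", how="left", validate="many_to_one"
)
print(merged)
```

The `validate="many_to_one"` argument makes pandas raise an error if the weather table ever contains duplicate timestamps, which would otherwise silently multiply generation rows.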
C. Data Analysis
In this study, special attention was paid to data preprocessing, as the quality of the original dataset significantly affects forecast accuracy. At the initial stage, missing values were filled using linear interpolation and forward filling, which preserved the integrity of the time series without introducing significant distortions. To ensure feature comparability and improve model convergence, the data were normalized using MinMax scaling. Although tree-based models are largely insensitive to monotonic rescaling of individual features, prior studies report that feature normalization can improve forecasting accuracy [18; 19].
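A minimal sketch of these preprocessing steps is given below. The values and column names are illustrative, and two choices are assumptions that the text does not specify: interpolation is restricted to interior gaps so that forward filling handles trailing gaps, and the scaler is fitted on the training portion only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative 15-minute series with gaps (values are made up).
idx = pd.date_range("2020-05-15 10:00", periods=6, freq="15min")
df = pd.DataFrame(
    {
        "IRRADIATION": [0.50, np.nan, 0.70, 0.80, np.nan, np.nan],
        "MODULE_TEMPERATURE": [40.0, 42.0, np.nan, 46.0, 47.0, 48.0],
    },
    index=idx,
)

# Interior gaps are filled linearly between neighbours; trailing gaps,
# which have no right-hand neighbour, are then forward filled.
filled = df.interpolate(method="linear", limit_area="inside").ffill()

# Assumption: fit the scaler on the training rows only, so that test-set
# minima and maxima do not leak into the transformation.
train, test = filled.iloc[:4], filled.iloc[4:]
scaler = MinMaxScaler().fit(train)
train_scaled = pd.DataFrame(
    scaler.transform(train), index=train.index, columns=train.columns
)
test_scaled = pd.DataFrame(
    scaler.transform(test), index=test.index, columns=test.columns
)
print(test_scaled)
```

With this design, scaled test values can fall outside the [0, 1] range when the test period contains readings beyond the training extremes, as the module temperature does here.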
Temporal variables such as the day of the week and the hour of the day were treated as categorical and one-hot encoded, enabling the correct integration of temporal characteristics into the model. The dataset was divided in an 80/20 ratio into training and testing subsets to ensure an objective evaluation of model performance. Additionally, prior to training, a correlation matrix analysis was conducted, which made it possible to identify the most significant features and exclude redundant ones. A similar pre-analysis approach has proven successful for solar energy forecasting from weather time series, highlighting the value of feature selection for minimizing overfitting and improving the generalization of ML models [11; 20].
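The encoding, correlation check, and split can be sketched as follows on a synthetic week of 15-minute data. The data generator and column names are hypothetical, and the use of `shuffle=False` (a chronological split) is an assumption, since the text does not state whether the split was random or ordered in time.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic week of 15-minute data (made-up values, hypothetical columns).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-05-15", periods=96 * 7, freq="15min")
hours = idx.hour.to_numpy() + idx.minute.to_numpy() / 60
irr = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None)  # zero at night
df = pd.DataFrame(
    {
        "IRRADIATION": irr,
        "MODULE_TEMPERATURE": 25 + 25 * irr + rng.normal(0, 1, len(idx)),
        "DC_POWER": 12000 * irr + rng.normal(0, 100, len(idx)),
    },
    index=idx,
)

# Correlation of each numeric input with the target.
corr = df.corr()["DC_POWER"].drop("DC_POWER")

# One-hot encode hour of day and day of week.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
X = pd.get_dummies(
    df.drop(columns="DC_POWER"), columns=["hour", "dayofweek"], dtype=float
)
y = df["DC_POWER"]

# 80/20 split; shuffle=False keeps time order (an assumption, see lead-in).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(corr.sort_values(ascending=False))
```

A chronological split evaluates the model on the most recent period, which is closer to operational forecasting than a shuffled split.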
D. Models
This paper focuses on the selection of models capable of solving the problem of predicting solar energy production given meteorological and time data. The main objective was to evaluate both the accuracy and the robustness of the models under data nonlinearities, noise, outliers, and temporal instability [2; 6; 9].
Linear regression: Linear regression is one of the most straightforward and interpretable machine learning models. It assumes a linear relationship between the input features (for example, temperature, humidity, and wind speed) and the output variable (energy production). In several scientific studies, linear regression is used as a reference model for comparison with more complex nonlinear methods [5; 6]. Despite its simplicity, it can produce acceptable results if the input data are highly correlated with the target variable. However, in the presence of complex nonlinear dependencies and outliers, the model loses accuracy, which can lead to significant forecasting errors. In our work, linear regression was used primarily as a baseline for comparison.
Random Forest Regressor: Random forest is an ensemble approach based on decision trees. It has been shown that the Random Forest model achieves high accuracy in predicting solar radiation, particularly under conditions of high cloud cover and weather instability, as demonstrated in case studies conducted in Vietnam [4]; similar findings were observed in studies using real weather data from India, where Random Forest also produced stable and reliable results [3; 16].
In our study, the random forest model was used to capture nonlinear relationships between meteorological parameters and solar energy production. The algorithm proved to be more robust than linear regression and showed improved RMSE and R² values on the test set.
XGBoost Regressor: The use of XGBoost is justified in problems with complex relationships between features and a high degree of nonlinearity. XGBoost has been shown to outperform neural networks in terms of accuracy and generalization ability, demonstrating superior predictive performance when applied to real photovoltaic power plant data [9; 18]. Moreover, combining XGBoost with Numerical Weather Prediction (NWP) predictors has significantly improved short-term forecasting accuracy [9].
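The comparison of the three model families can be sketched as below. This is an illustration on synthetic data with a deliberately nonlinear target, not a reproduction of the paper's experiment; the hyperparameters are arbitrary assumptions, and XGBoost is included only if the `xgboost` package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: power saturates at high irradiance, so the
# relationship is nonlinear (made-up generator, not the paper's data).
rng = np.random.default_rng(42)
n = 1000
irradiation = rng.uniform(0.0, 1.2, n)
module_temp = 25 + 30 * irradiation + rng.normal(0, 2, n)
X = np.column_stack([irradiation, module_temp])
y = 12000 * np.tanh(2 * irradiation) + rng.normal(0, 200, n)

split = int(0.8 * n)  # 80/20 split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Hyperparameters are illustrative assumptions, not the authors' settings.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
try:
    from xgboost import XGBRegressor

    models["xgboost"] = XGBRegressor(
        n_estimators=200, learning_rate=0.1, random_state=0
    )
except ImportError:
    pass  # xgboost is optional in this sketch

# score() returns the coefficient of determination R² on the test set.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(scores)
```

On this synthetic target the tree ensembles should score higher than the linear baseline because a straight line cannot follow the saturation curve; the size of the gap on real plant data depends on the data themselves.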
To evaluate the forecasting performance of the machine learning models, three key metrics were used: the coefficient of determination (R²), mean absolute error (MAE), and root mean squared error (RMSE). Using several metrics together makes it possible to assess the models’ overall accuracy, resilience to outliers, and capacity to capture relationships in the data. Several recent studies on solar generation forecasting follow a similar methodology [4; 10].
Root Mean Squared Error (RMSE): RMSE is one of the metrics most sensitive to large deviations and outliers. Because errors are squared before averaging, large errors are amplified, so RMSE indicates how well a model copes with outliers. Owing to its suitability for time series with abrupt variations, RMSE has been widely used to evaluate short-term solar power forecasting based on weather predictions [10].
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$
where $y_i$ is the actual value of the target variable, $\hat{y}_i$ is the corresponding predicted value, and $n$ denotes the number of data samples.
In our study, Random Forest and XGBoost showed the lowest RMSE values, confirming their effectiveness in handling unstable weather conditions [4].
Mean Absolute Error (MAE): MAE is the mean absolute difference between predicted and actual values. In contrast to RMSE, it does not amplify the effect of large errors. This property makes MAE more robust to outliers and well suited for real-world applications where it is important to estimate the typical magnitude of prediction error. MAE is commonly used in time series forecasting tasks due to its straightforward interpretation and reduced sensitivity to extreme values. In the present study, the XGBoost and Random Forest models achieved lower MAE values than linear regression.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{2}$$
where $y_i$ denotes the actual observed value, $\hat{y}_i$ represents the predicted value, and $n$ is the total number of observations.
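The different sensitivity of RMSE and MAE to large errors can be illustrated with a small numeric sketch. The helper functions and the two hypothetical forecasts below are made up for illustration: both forecasts have the same total absolute error, but one concentrates it in a single point.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square, average, then take the root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


actual = [100.0, 200.0, 300.0, 400.0]
steady = [110.0, 190.0, 310.0, 390.0]  # four errors of 10
spiky = [100.0, 200.0, 300.0, 360.0]   # a single error of 40

print(mae(actual, steady), rmse(actual, steady))  # 10.0 10.0
print(mae(actual, spiky), rmse(actual, spiky))    # 10.0 20.0
```

Both forecasts have an MAE of 10, but the RMSE doubles for the forecast with one large miss, which is why the two metrics are reported together.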
R-squared (R²): The coefficient of determination (R²) quantifies the proportion of variance in the target variable explained by the model. An R² value close to 1 indicates high accuracy and a strong capacity to capture the internal relationships in the data. R² is recognized as a key indicator for comparing models based on their explanatory power [3; 17]. In our experiment, the best R² values were achieved using Random Forest, which confirms its ability to capture the relationships between meteorological parameters and solar energy production.
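$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$
where $y_i$ represents the actual observed value, $\hat{y}_i$ is the predicted value obtained from the model, $\bar{y}$ denotes the mean of the observed values, and $n$ is the number of observations.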
The role of this metric in assessing model quality has also been highlighted for solar radiation prediction tasks [3].
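As a small worked illustration of the coefficient of determination, the sketch below computes R² as one minus the ratio of the residual sum of squares to the total sum of squares, for three hypothetical forecasts. The function name and the example values are made up.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


actual = np.array([100.0, 200.0, 300.0, 400.0])
perfect = actual.copy()                         # reproduces every value
mean_only = np.full_like(actual, actual.mean())  # always predicts the mean
reversed_pred = actual[::-1]                     # systematically wrong

print(r2_manual(actual, perfect))        # 1.0
print(r2_manual(actual, mean_only))      # 0.0
print(r2_manual(actual, reversed_pred))  # -3.0
```

A model that always predicts the mean scores 0, and a model worse than that baseline scores below 0, so R² is bounded above by 1 but not below by 0.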
Results and discussions
The experiments allowed us to assess how well different machine learning models predict solar energy output. Three primary metrics were used to compare the models: RMSE, MAE, and R², in line with current practice for validating forecasting algorithms in renewable energy [3; 5; 10].
The metric values were calculated for three models (linear regression, random forest, and XGBoost) based on historical weather and energy production data. The results are summarized in Table 4.
Table 4.
Model Performance Comparison
The random forest model achieved the highest explained variance (R²), the lowest mean absolute error (MAE), and the lowest root mean squared error (RMSE). Similar results are reported in a number of studies. For example, Random Forest has been shown to outperform other models in forecasting under high cloud cover and has demonstrated robustness when dealing with noisy meteorological data [3; 4].
Feature Importance: Figure 2 shows that the greatest influence on the power forecast is exerted by module temperature, solar radiation, and ambient temperature.
Figure 2. Feature importance scores for DC power prediction using Random Forest
These results align with prior studies that highlight the importance of temperature and solar radiation as major factors influencing photovoltaic system efficiency [10; 16]. Such an analysis allows for more informed parameter selection when designing generation models and control systems.
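Feature importance scores of this kind can be read from a fitted forest through scikit-learn's `feature_importances_` attribute. The sketch below uses synthetic data with hypothetical column names and a deliberately irrelevant `NOISE` feature; it illustrates the mechanism and does not reproduce Figure 2.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic inputs (made-up generator); NOISE is irrelevant by construction.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(
    {
        "IRRADIATION": rng.uniform(0, 1.2, n),
        "AMBIENT_TEMPERATURE": rng.uniform(20, 35, n),
        "NOISE": rng.normal(0, 1, n),
    }
)
y = (
    12000 * X["IRRADIATION"]
    - 20 * (X["AMBIENT_TEMPERATURE"] - 25)
    + rng.normal(0, 100, n)
)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to be shared between strongly correlated inputs, so when two features such as module temperature and solar radiation move together, their individual scores should be interpreted with caution.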
Visualization: According to the metrics obtained, Random Forest is the most successful of the evaluated algorithms. Its high accuracy and robustness to noise make it particularly useful in practical tasks, from short-term generation forecasting to integration into smart energy platforms. A similar approach is recommended in recent studies, which demonstrate that ensemble models offer a better balance of accuracy and robustness than classical linear approaches [10]. Due to its stability under varying environmental conditions, Random Forest is particularly well suited for real-time applications, where reliability and low prediction error are critical [17].
As shown in Figures 3 and 4, the random forest model predicts the output power with high accuracy, almost matching the actual values. This is also supported by the overall model results in Table 4, where Random Forest outperforms both linear regression and XGBoost.
Figure 3. Random Forest: Actual vs Predicted DC Power
Figure 4. Results of models
Table 5.
Comparison of actual DC power with predicted values
Conclusion
Three machine learning models (linear regression, random forest, and XGBoost) were applied in this study to develop and evaluate a forecasting system for solar power generation based on real meteorological and operational data. The results confirm that ensemble learning models significantly outperform classical linear regression in short-term solar energy forecasting. In particular, the random forest model demonstrated the best performance, achieving a root mean squared error of 469.24, a mean absolute error of 168.38, and a coefficient of determination R² of 0.986, indicating high predictive accuracy and robustness under varying weather conditions.
A comparison with related studies shows that previously reported RMSE values for similar datasets typically range from 500 to 800, making the achieved result of 469.24 notably competitive [3; 16]. These findings are consistent with earlier research demonstrating the superiority of ensemble approaches over traditional regression-based methods [3; 4; 16].
Feature importance analysis revealed that solar radiation and module temperature are the most influential factors affecting solar energy production, which highlights the critical role of meteorological conditions in forecasting model performance [10; 16]. Overall, the results demonstrate that integrating weather data with ensemble machine learning algorithms enables the construction of reliable and accurate solar energy forecasting systems. Future research may focus on expanding the dataset, incorporating additional environmental variables, and evaluating alternative machine learning and deep-learning models to further improve forecasting accuracy.
The proposed comparative framework and findings contribute to the practical selection of machine learning models for reliable solar energy forecasting based on real-world photovoltaic data.
References:
- Ahmad T., Zhang D., Huang C., et al. Artificial intelligence in sustainable energy industry: status quo, challenges and opportunities // Renewable and Sustainable Energy Reviews. — 2021. — Vol. 139. — P. 110–120.
- Ledmaoui Y., Maghraoui A. E., Aroussi M. E., et al. Forecasting solar energy production: a comparative study of machine learning algorithms // Energy Reports. — 2023. — Vol. 10. — P. 1004–1012.
- Deka K., Rabha M., Hazarika H., et al. Harnessing machine learning for improved solar radiation prediction // Proceedings of the 2nd IEEE International Conference on Networks, Multimedia and Information Technology (NMITCON). — 2024.
- Nguyen N. T., Dao T. C. T., Nguyen L. N. T., et al. Solar radiation forecasting based on random forest and XGBoost // Proceedings of the 7th International Conference on Green Technology and Sustainable Development (GTSD). — 2024. — P. 136–140.
- Bachulkar R., Bhatkande R., Patil A., Telgi B. F. Machine learning algorithms for the prediction of daily mean solar power // Proceedings of the 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA). — 2021. — P. 860–866.
- Tripathi A. K., Aruna M., Elumalai P. V., et al. Advancing solar PV panel power prediction: a comparative machine learning approach in fluctuating environmental conditions // Case Studies in Thermal Engineering. — 2024. — Vol. 59.
- Gomathi S., Kannan E., Belinda M. J. C. M., et al. Solar energy prediction with synergistic adversarial energy forecasting system (Solar-SAFS) // Case Studies in Thermal Engineering. — 2024. — Vol. 63.
- Dumitru C.-D., Gligor A., Enachescu C. Solar photovoltaic energy production forecast using neural networks // Procedia Technology. — 2016. — Vol. 22. — P. 808–815.
- Li T., Zhang Y., Zhu Z., et al. A comparative study of photovoltaic power prediction models based on BP and XGBoost // Proceedings of the 6th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI). — 2024. — P. 567–570.
- Bamisile O., Ejiyi C. J., Osei-Mensah E., et al. Long-term prediction of solar radiation using XGBoost, LSTM and machine learning algorithms // Proceedings of the Asia Energy and Electrical Engineering Symposium (AEEES). — 2022. — P. 214–218.
- Aler R., Martín R., Valls J. M., Galván I. M. A study of machine learning techniques for daily solar energy forecasting using numerical weather models // Studies in Computational Intelligence. — 2015. — Vol. 570. — P. 269–278.
- Shakhovska N., Medykovskyi M., Gurbych O., et al. Enhancing solar energy production forecasting using advanced machine learning and deep learning techniques // Computers, Materials and Continua. — 2024. — Vol. 81. — P. 3147–3163.
- Sharifzadeh M., Sikinioti-Lock A., Shah N. Machine learning methods for integrated renewable power generation // Renewable and Sustainable Energy Reviews. — 2019. — Vol. 108. — P. 513–538.
- Vennila C., Titus A., Sudha T. S., et al. Forecasting solar energy production using machine learning // International Journal of Photoenergy. — 2022.
- Janarthanan R., Maheshwari R. U., Shukla P. K., et al. Intelligent detection of PV faults based on artificial neural network and type-2 fuzzy systems // Energies. — 2021. — Vol. 14.
- Gottwald D., Parmar M., Zureck A. Forecasting solar power generation: a comparative analysis of machine learning models // Proceedings of the International Conference on Renewable Energies and Smart Technologies (REST). — 2024.
- Elmoaqet H., Karasneh D., Al-Dahidi S., Al-Refai G. Predicting solar photovoltaic power production using artificial intelligence-based algorithms // Proceedings of the International IEEE Conference. — 2024.
- Talukdar A., Panda B., Hota S., et al. Intelligent solar power forecast with machine learning // Proceedings of the International Symposium on Advanced Electrical and Communication Technologies (ISAECT). — 2024.
- Mohammad K. S., Yousuf A. H., Boddu M. K., et al. Predicting power output of solar photovoltaic panels using machine learning techniques // Proceedings of the International Conference for Artificial Intelligence, Applications, Innovation and Ethics (AI2E). — 2025. — P. 1–6.
- Kane S. N., Rathore A., Abhijeet, Anees S., et al. Comparative analysis of machine learning models for solar power prediction using time series weather data // Proceedings of the IEEE International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES). — 2024. — P. 779–784.