CLASSIFYING NEWS ARTICLES BASED ON THEIR HEADLINES USING MACHINE LEARNING ALGORITHMS

Cite as:
Ibragimov T., Kabdrakhova S.S. CLASSIFYING NEWS ARTICLES BASED ON THEIR HEADLINES USING MACHINE LEARNING ALGORITHMS // Universum: технические науки (Universum: Technical Sciences): electronic scientific journal. 2025. 5(134). URL: https://7universum.com/ru/tech/archive/item/20152 (accessed: 05.12.2025).
DOI: 10.32743/UniTech.2025.134.5.20152

 

ABSTRACT

As technology keeps evolving, organizing and analyzing large amounts of text data is becoming more important, especially in the media industry. News outlets generate thousands of articles daily, making it difficult to categorize and structure information efficiently. Headline-based classification of news articles can be automated using machine learning techniques. This study compares several machine learning algorithms, including Support Vector Machine (SVM), Logistic Regression, Naive Bayes, and K-Nearest Neighbors (KNN), to determine which is most effective for headline classification. The models are evaluated using accuracy, precision, recall, and F1 score. Experimental results show that SVM achieves the highest classification accuracy (81.2%), while Naive Bayes performs particularly well in categories with consistent term distributions. These findings indicate that machine learning can improve the analysis and categorization of news headlines. Integrating such models into automated news aggregation systems can help users find relevant information faster, making news consumption more personalized and streamlined.


 

Keywords: Text Classification, Natural Language Processing, Machine Learning, Headline Analysis, TF-IDF, SVM, Naive Bayes, KNN, News Classification, Lemmatization, Vectorization, Stemming, Tokenization.


 

Introduction

With the rapid growth of digital content, online platforms face an unprecedented influx of unstructured textual data [6]. This challenge is especially pronounced on news websites, where users frequently seek concise information related to specific topics instead of reading complete articles. While traditional classification methods typically analyze entire article texts and their thematic elements, headlines offer a concise, focused alternative for effective categorization aligned with the structure of news platforms and editorial practices [1;3].

This article explores how machine learning algorithms can effectively classify news content based solely on headlines [1;3]. Utilizing headlines as the primary input for classification streamlines the process of organizing and retrieving news articles, making it faster and less resource-intensive compared to full-text analysis [3]. Headlines inherently encapsulate the core content of articles, making them ideal targets for classification methods that seek a balance between efficiency and accuracy [1;3].

Several widely used machine learning models are evaluated for their performance in headline-based classification tasks. The models are applied across diverse news categories, including Technology, Economy, Politics, Law Enforcement, Health, Sport, Culture, Emergencies, Social Media, Tourism, and Auto, reflecting the broad spectrum of contemporary news topics [2].

To prepare headlines for classification, essential Natural Language Processing (NLP) techniques are employed. These include tokenization, stemming, lemmatization, and TF-IDF vectorization to convert headline text into numerical formats suitable for processing by machine learning models [8;9;10;11]. Additionally, dimensionality reduction techniques such as Principal Component Analysis (PCA) are applied to enhance computational efficiency without compromising the essential characteristics of the data [9].

This study identifies the most effective machine learning techniques suitable for classifying news headlines [1;2;13]. By doing so, it provides actionable insights for optimizing automated news categorization systems, enabling quicker, more personalized, and more relevant user experiences [6;7].

Related works

Text classification has been a prominent area of study due to the rapid growth of textual data, particularly in digital news platforms. Headlines, being concise and category-driven, present a unique opportunity for automatic classification. This review examines significant contributions in the field, focusing on techniques relevant to headline classification, pre-processing, feature extraction, and machine learning models.

News articles span a broad spectrum of subject areas. Within the scope of this study, categorization was conducted across 11 categories: Technology, Economy, Politics, Law Enforcement, Health, Sport, Culture, Emergencies, Social Media, Tourism, and Auto. Several studies have explored headline-based classification as a lightweight yet effective alternative to full-text analysis.

Rana et al. [1] examined various text pre-processing techniques and classification methods used specifically for headlines. Their work concluded that no single algorithm universally outperforms others, and that model performance often depends on dataset characteristics and preprocessing quality.

Sunagar et al. [2] classified AG’s News Topic Dataset using SVM, Multinomial Naive Bayes, Rocchio, and KNN, along with ensemble approaches such as boosting and bagging. Using Snowball stemming and TF-IDF for feature extraction, they found that SVM achieved the highest accuracy (91%) among the tested models.

Leonard et al. [3] noted that traditional classifiers may struggle with long and complex articles. Headlines, being short and focused, can serve as a more efficient alternative for classification models.

This research is methodologically grounded in Aurélien Géron’s seminal work Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow [4], which provides a detailed and application-oriented guide to machine learning development. Géron’s emphasis on modular preprocessing, model evaluation through cross-validation, and hyperparameter tuning informed the overall design of the system.

For conceptual clarity and theoretical grounding, this study draws extensively from Andriy Burkov’s The Hundred-Page Machine Learning Book [5]. Burkov’s concise yet rigorous presentation helped bridge the gap between algorithmic intuition and implementation decisions.

To contextualize the research within current technological trends, the study references the analysis by Janiesch et al. [6], who examine the transformative role of machine learning and deep learning in modern digital platforms. Their discussion highlights how these technologies enable content classification at scale.

The application layer of this project is inspired by Chris Moroney’s Machine Learning Projects for Mobile Applications [7]. Moroney’s focus on text-based classification tasks guided the development of the prototype application.

Misra and Grover [8] emphasized the critical importance of preprocessing steps including token splitting, removal of non-informative words, and converting words to their base forms. These methods are especially valuable when working with brief and often ambiguous texts like headlines.

Mishra and Vishwakarma [9] examined TF-IDF and its variations in the context of information retrieval, noting the model's effectiveness and limitations, such as sensitivity to term frequency fluctuations and inability to capture semantic meaning.

To manage vocabulary size and enhance generalization, the study integrates lemmatization and stemming techniques. Siddhartha et al. [10] describe the operational mechanisms and effectiveness of both methods, which guided decisions within the preprocessing workflow.

Singh and Gupta [11] offer an in-depth evaluation of various stemming algorithms, highlighting how inappropriate stemming may distort text semantics.

Fernández et al. [12] discussed SMOTE for addressing imbalanced datasets, marking it as a critical technique in managing class distribution in text classification tasks.

Daud et al. [13] found that an optimized SVM model consistently outperformed other models in thematic classification, supporting the robustness of SVM in high-dimensional feature spaces.

Working with high-dimensional textual data, such as TF-IDF representations, poses challenges for machine learning algorithms due to computational costs and potential overfitting. Mishra and Vishwakarma [9] highlighted PCA as effective in compressing feature space while preserving informative components of data.

A critical examination reveals that headline-level classification offers advantages in speed, simplicity, and relevance for real-time applications. Traditional machine learning models, particularly Support Vector Machines (SVM), serve as robust baselines when coupled with thoughtful preprocessing and feature engineering techniques.

Materials and methods

This study evaluates the effectiveness of various machine learning algorithms in classifying news articles using only their headlines [1;3;13]. Compared to full-text analysis, headline-based classification requires less computational power while still offering meaningful categorization. A comparative experimental design was adopted, applying multiple algorithms to a consistent dataset and evaluating their performance using standard metrics.

The dataset was collected from Tengrinews.kz and contains over 20,000 Russian-language headlines categorized into 11 topics: Technology, Economy, Politics, Law Enforcement, Health, Sport, Culture, Emergencies, Social Media, Tourism, and Auto. These concise headlines are well-suited for classification tasks.

 

Figure 1. Distribution of News Headlines Across Categories (Bar chart)

 

Data was gathered using a web scraper built with undetected-chromedriver and BeautifulSoup. It mimicked human browsing to bypass anti-bot protections, extracted headline text and publication dates, and saved the results into CSV format with UTF-8 encoding to support Cyrillic characters. Delays between requests prevented blocking.
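The storage side of this step can be sketched with the Python standard library alone. The sketch below is a minimal illustration, not the authors' scraper: the page-fetching code is omitted because the article does not give the page structure or selectors, the column names in `FIELDS` and the delay bounds are assumptions, and the sample headline is invented for the demonstration.

```python
import csv
import random
import tempfile
import time
from pathlib import Path

FIELDS = ["title", "published", "category"]  # hypothetical column names


def polite_pause(min_s=1.0, max_s=3.0):
    """Sleep for a random interval between requests (bounds are assumptions)."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


def save_headlines(rows, path):
    """Write scraped rows to a UTF-8 CSV file so Cyrillic text survives."""
    with Path(path).open("w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_headlines(path):
    """Read the CSV back as a list of dictionaries."""
    with Path(path).open("r", encoding="utf-8", newline="") as fh:
        return list(csv.DictReader(fh))


# Round-trip check with one invented headline.
demo_rows = [
    {"title": "Курс тенге вырос", "published": "2025-01-10", "category": "Economy"}
]
demo_path = Path(tempfile.mkdtemp()) / "headlines.csv"
save_headlines(demo_rows, demo_path)
restored = load_headlines(demo_path)
```

Opening the file with `encoding="utf-8"` and `newline=""` is what keeps Cyrillic headlines and embedded line breaks intact across platforms.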

 

Figure 2. Flowchart of the Data Collection Process

 

The resulting dataset was clean and structured, with only a slight imbalance across categories, which made it suitable for fair model training and evaluation.

 

Figure 3. Example of the News Headlines Dataset in CSV Format

 

Data preprocessing included several key steps (Figure 5). First, all characters were converted to lowercase to normalize the text (Figure 4). Tokenization was applied to split each headline into words. Common stop words were removed to reduce noise [8;9]. Words were then stemmed and lemmatized to simplify their forms [10;11]. Finally, the cleaned headlines (Figure 6) were converted into numerical representations using TF-IDF vectorization, which captures term importance across the corpus [9].
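The first three steps described above (lowercasing, tokenization, stop-word removal) can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the tokenizer keeps only runs of Latin or Cyrillic letters (so digits are dropped), and stemming and lemmatization are left out because they need a morphological library that the article does not name. The example headline is invented.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full Russian list.
STOP_WORDS = {"и", "в", "на", "с", "по", "для", "о", "из", "за", "не"}

# Runs of Latin or Cyrillic letters (assumption: digits carry no signal here).
TOKEN_RE = re.compile(r"[a-zа-яё]+")


def preprocess(headline, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop stop words from a single headline."""
    lowered = headline.lower()
    tokens = TOKEN_RE.findall(lowered)
    return [tok for tok in tokens if tok not in stop_words]


example = "В Алматы открыли новую станцию метро"
tokens = preprocess(example)
# tokens == ["алматы", "открыли", "новую", "станцию", "метро"]
```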

 

Figure 4. Headlines After Lowercasing

 

Figure 5. Example of Pre-processing Steps Applied to a Sentence

 

Figure 6. Example of Headline Before and After Text Preprocessing
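To make the TF-IDF step concrete, the sketch below computes weights for already tokenized headlines using only the standard library. The article does not say which TF-IDF variant was used; this example assumes raw term counts, the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, and L2 normalization, which is the default behaviour of scikit-learn's `TfidfVectorizer`. The three token lists are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {term: w / norm for term, w in weights.items()} if norm else {}
        )
    return vectors


docs = [
    ["курс", "тенге", "вырос"],
    ["курс", "доллара", "упал"],
    ["сборная", "выиграла", "матч"],
]
vectors = tfidf_vectors(docs)
```

In the first headline, "курс" occurs in two of the three documents and therefore receives a lower weight than "тенге", which occurs in only one.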

 

To address the high dimensionality introduced by TF-IDF, Principal Component Analysis (PCA) was applied (Figure 7). This technique compressed the feature space into a smaller number of principal components, retaining the most informative features. The reduced vectors improved computational efficiency and generalization.

Figure 7. Visualization of Headlines After Dimensionality Reduction with PCA
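For reference, the standard PCA computation behind this step can be written as below. The notation is introduced here and is not the article's: X is the n × d matrix of TF-IDF vectors, k is the number of retained components (the article does not report its value), and W_k collects the k leading eigenvectors of the covariance matrix.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\tilde{X} = X - \mathbf{1}\,\bar{x}^{\top} \\
C &= \frac{1}{n-1}\,\tilde{X}^{\top}\tilde{X} = V \Lambda V^{\top}, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 \\
Z &= \tilde{X}\, W_k, \qquad
W_k = [\,v_1, \dots, v_k\,] \in \mathbb{R}^{d \times k} \\
\text{explained variance ratio} &=
\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}
\end{aligned}
```

The explained variance ratio is the usual guide for choosing k: components are kept until the retained share of variance is judged sufficient.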

 

Because some categories contained fewer headlines, the dataset exhibited slight class imbalance. To address this, SMOTE was used to generate synthetic samples for underrepresented categories (Figure 8). This approach improved the models' ability to learn from all classes equally.

 

Figure 8. Category Distribution Before and After Applying SMOTE
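The core idea of SMOTE, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study: the function name, the value of `k`, the seed, and the toy points are all assumptions. The article does not state at which stage SMOTE was applied; a common practice is to oversample only the training portion so that synthetic points do not leak into the test set.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Every synthetic point lies on a segment between two existing minority samples, so the new data stays inside the region already occupied by that class.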

 

The dataset was divided into training (80%) and testing (20%) sets. To evaluate the classification models, several performance metrics were used: accuracy, precision, recall, and F1-score. These allowed for a balanced comparison of algorithm performance and robustness across the varied headline categories.
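The four metrics can be computed directly from the true and predicted labels, as in the sketch below. The article does not say how per-class scores were averaged; macro averaging (an unweighted mean over classes) is assumed here, and the label lists are invented for the demonstration.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


y_true = ["Sport", "Sport", "Economy", "Health"]
y_pred = ["Sport", "Economy", "Economy", "Health"]
scores = classification_scores(y_true, y_pred)
# accuracy = 0.75, macro precision = macro recall = 5/6, macro F1 = 7/9
```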

Results and discussions

Four popular machine learning models were evaluated for classifying news headlines: Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes, and K-Nearest Neighbors (KNN). These models were selected for their interpretability, computational efficiency, and proven performance in text classification tasks using TF-IDF vectors.

All models were trained on the same preprocessed and balanced dataset. Evaluation was performed using accuracy, precision, recall, and F1-score. The dataset was split 80/20 for training and testing, and stratified sampling was used to preserve class proportions. Grid search with cross-validation was used to optimize hyperparameters.
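A stratified 80/20 split can be sketched as follows: the indices of each class are shuffled separately and the same fraction is taken from each, so class proportions carry over to both subsets. The function name, the seed, and the toy labels are assumptions made for illustration; the hyperparameter grid search is not covered by this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["Sport"] * 10 + ["Economy"] * 5
train_idx, test_idx = stratified_split(labels)
```

With 10 "Sport" and 5 "Economy" headlines, the test set receives 2 and 1 of them respectively, matching the 2:1 ratio of the full data.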

KNN showed moderate performance (approximately 66–67% accuracy) and was most precise in categories with distinctive keywords, such as Economy and Health. However, its reliance on distance measures reduced recall in underrepresented classes.
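The voting mechanism behind KNN can be illustrated with a short sketch over sparse term-weight vectors. The article does not report the distance measure or the value of k; cosine similarity and `k=3` are assumptions here, and the training vectors are invented.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train_vectors = [
    {"курс": 1.0, "тенге": 1.0},
    {"курс": 1.0, "доллара": 1.0},
    {"матч": 1.0, "сборная": 1.0},
    {"матч": 1.0, "турнир": 1.0},
]
train_labels = ["Economy", "Economy", "Sport", "Sport"]
prediction = knn_predict({"курс": 1.0, "евро": 1.0}, train_vectors, train_labels)
```

Because the vote depends on which neighbours happen to be closest, a class with few training examples is easily outvoted, which is consistent with the lower recall reported for underrepresented categories.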

Naive Bayes achieved 72% accuracy overall. It performed particularly well in categories with consistent term distributions, benefiting from its probabilistic nature and simple parameter tuning (optimal alpha = 0.5).
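The role of the smoothing parameter can be seen in a minimal multinomial Naive Bayes sketch. It uses the reported alpha = 0.5 but, as a simplifying assumption, works on raw token counts rather than the study's vectorized features; the function names and the four training headlines are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=0.5):
    """Fit multinomial Naive Bayes with additive (Lidstone) smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / total)
                for t in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen terms are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][t]
            for t in doc
            if t in params["log_likelihood"]
        )
    return max(model, key=score)


docs = [["курс", "тенге"], ["курс", "доллара"], ["матч", "сборная"], ["матч", "турнир"]]
labels = ["Economy", "Economy", "Sport", "Sport"]
model = train_nb(docs, labels)
prediction = predict_nb(model, ["курс", "евро"])
```

Adding alpha to every count keeps terms that never occurred in a class from receiving zero probability, so a single unfamiliar word cannot rule a category out.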

Logistic Regression provided a strong baseline with 79% accuracy. The model’s ability to handle sparse input and its robustness across categories made it suitable for this classification task. Optimal configuration included C=10 and the liblinear solver.
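The meaning of C can be made explicit through the L2-regularized objective that a liblinear-style solver minimizes for each binary subproblem. The formulation below is the standard one and is given here as background, assuming labels y_i ∈ {−1, +1} and ignoring details of intercept handling; the parameter names `C` and `liblinear` match scikit-learn's `LogisticRegression`, although the article does not name the library.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\, w^{\top} w
\;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\big(-y_i\,(w^{\top} x_i + b)\big)\right),
\qquad y_i \in \{-1, +1\}
```

C multiplies the data-fit term, so a larger value such as C = 10 weakens the relative influence of the regularizer. With the liblinear solver, a multi-class task such as the eleven categories here is handled as a set of one-vs-rest binary problems.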

SVM achieved the highest performance with 81% accuracy and consistent scores across categories. Using an RBF kernel with C=1 and gamma=1, the model was able to learn non-linear boundaries effectively, making it the most balanced and reliable classifier in this study.
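The RBF kernel that gives the SVM its non-linear boundaries is short enough to write out. The sketch below shows only the kernel function with the reported gamma = 1, not an SVM trainer; the input vectors are arbitrary illustrative values.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([0.2, 0.8], [0.2, 0.8])        # identical vectors -> 1.0
unit_apart = rbf_kernel([0.0, 0.0], [1.0, 0.0])  # distance 1 -> exp(-1)
```

The kernel value decays from 1 toward 0 as two headlines move apart in feature space; gamma sets how quickly that happens, while C controls how strongly training errors are penalized.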

 

Figure 9. Experimental results

 


Figure 10. Best Accuracy Results of Different Algorithms

 

Conclusion

This study demonstrated that machine learning models can effectively classify news articles using only their headlines. By focusing on headline-based classification, the research prioritized computational efficiency without sacrificing classification accuracy. Headlines, being compact summaries of full articles, proved to be a viable and practical input for automated categorization.

The dataset, collected from Tengrinews.kz, included over 20,000 Russian-language headlines manually labeled across eleven categories. Preprocessing involved standard NLP techniques such as tokenization, stop-word removal, stemming, lemmatization, and TF-IDF vectorization. Dimensionality reduction via PCA helped manage feature space and improve model efficiency.

Four machine learning models were evaluated: KNN, Naive Bayes, Logistic Regression, and SVM. Among them, SVM showed the highest accuracy (81%), making it the most reliable for this task. Logistic Regression followed closely (79%) with strong overall performance and low computational cost. Naive Bayes (72%) performed well in categories with consistent term distributions, while KNN offered moderate results (approximately 66–67%).

Future work may include expanding the dataset with multilingual content, experimenting with more advanced vectorization or dimensionality reduction methods, and integrating user feedback into model refinement.

In summary, the study confirms that headline-based classification is a feasible and efficient solution for automating news categorization and can support scalable applications in digital journalism and personalized news delivery.

 

References:

  1. News classification based on their headlines: a review // IEEE INMIC. – 2014. – DOI: https://doi.org/10.1109/INMIC.2014.7097339.
  2. News Topic Classification Using Machine Learning Techniques // Lecture Notes in Electrical Engineering. – 2021. – Vol. 733. – DOI: https://doi.org/10.1007/978-981-33-4909-4_35.
  3. News Classification Based On News Headline Using SVC Classifier // Proc. of the 16th Int. Conf. on Telecommunication Systems Services and Applications (TSSA). – 2022. – DOI: https://doi.org/10.1109/TSSA56819.2022.10063879.
  4. Géron A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. – O'Reilly Media, 2022. – 861 p.
  5. Burkov A. The Hundred-Page Machine Learning Book. – Expert Systems, 2021. – 160 p.
  6. Janiesch C., Zschech P., Heinrich K. Machine learning and deep learning // Electronic Markets. – 2021. – Vol. 31, No. 3. – DOI: https://doi.org/10.1007/s12525-021-00475-2.
  7. Moroney C. Machine Learning Projects for Mobile Applications. – O'Reilly Media, 2020. – 246 p.
  8. Misra R., Grover J. Sculpting Data for ML: The First Act of Machine Learning. – 2021. – 187 p.
  9. Mishra A., Vishwakarma S. Analysis of TF-IDF Model and its Variant for Document Retrieval // Proc. of CICN. – 2015. – DOI: https://doi.org/10.1109/CICN.2015.157.
  10. Siddhartha S. B., Khyani D., Niveditha M. N., Divya M. B. An Interpretation of Lemmatization and Stemming in NLP // Journal of University of Shanghai for Science and Technology. – 2020. – Vol. 22, No. 10.
  11. Singh J., Gupta V. Text stemming: Approaches, applications, and challenges // ACM Computing Surveys. – 2016. – Vol. 49, No. 3. – DOI: https://doi.org/10.1145/2975608.
  12. Fernández A., García S., Herrera F., Chawla N. V. SMOTE for Learning from Imbalanced Data // Journal of Artificial Intelligence Research. – 2018. – Vol. 61. – DOI: https://doi.org/10.1613/jair.1.11192.
  13. Daud S., Ullah M., Rehman A., Saba T., Damaševičius R., Sattar A. Topic Classification of Online News Articles Using Optimized ML Models // Computers. – 2023. – Vol. 12, No. 1. – DOI: https://doi.org/10.3390/computers12010016.
Information about the authors

Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty


Candidate of Physical and Mathematical Sciences, Al-Farabi Kazakh National University, associate professor, Kazakhstan, Almaty

