Master’s student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
TOXICITY DETECTION IN TEXT USING TRANSFORMER MODELS: A COMPARATIVE STUDY ON THE TOXIGEN DATASET
ABSTRACT
Social media is an important part of the modern world. It connects users from different parts of the world and unites them into one large online community. These platforms provide an opportunity to express oneself, communicate, share opinions, and learn from other people. Unfortunately, this freedom has also led to harmful outcomes: people feel a sense of impunity for their statements on social media, which leads to cyberbullying, and over time offensive comments and toxic content have become widespread. Prior studies show that such behavior is related to people's inability to understand and accept the opinions of others, as well as to their personality traits. This is why it is important to moderate content on social media. The study reviews various ways to moderate content within a common framework, ranging from soft to hard interventions.
The aim of our research is to develop a more effective approach for detecting toxic texts on social media.
Toxic content can take various forms, from obscenities to aggressive statements on racial, gender, or political grounds. The consequences of distributing such content can be extremely damaging, ranging from the deterioration of users' psychological well-being to large-scale information conflicts. In addition, toxicity undermines trust in the platforms and reduces the quality of public dialogue. Existing moderation methods, including manual review and keyword filtering, often turn out to be insufficiently effective or scalable.
АННОТАЦИЯ
Социальные сети являются важной частью современного мира. Они объединяют пользователей из разных уголков мира в одно большое сообщество пользователей социальных сетей. Эти платформы предоставляют возможность самовыражения, общения, выражения своего мнения и обучения у других людей. Но, к сожалению, эта свобода действий привела к катастрофическим результатам. Люди чувствуют себя безнаказанными за свои высказывания в социальных сетях, что приводит к кибербуллингу в интернете. И со временем оскорбительные комментарии и токсичный контент заполонили все вокруг. Эти исследования показывают, что такое поведение связано с неспособностью людей понимать суть и принимать мнения других людей и их личностные черты. Именно поэтому важно модерировать контент в социальных сетях. Исследование показывает различные способы модерации контента на основе предложенной структуры, от мягких до жестких.
Цель нашего исследования — найти более эффективное решение для выявления токсичных текстов в социальных сетях.
Токсичный контент может принимать различные формы, от нецензурных выражений до агрессивных заявлений на расовой, гендерной или политической почве. Последствия распространения подобного контента могут быть крайне разрушительными: от ухудшения психологического состояния пользователей до масштабных информационных конфликтов. Кроме того, токсичность подрывает доверие к платформам и снижает качество публичного диалога. Существующие методы модерации, включая ручную проверку и фильтрацию по ключевым словам, часто оказываются недостаточно эффективными или масштабируемыми.
Keywords: social media, toxic content, cyberbullying, offensive comments.
Ключевые слова: социальные сети, токсичный контент, кибербуллинг, оскорбительные комментарии.
Introduction
A large amount of data is published daily on social media. This data significantly affects users' quality of life [3]-[5], but unfortunately the toxic environment created by other people negatively affects the overall well-being of millions. As a result, healthy discussion on the Web suffers, since toxic comments prevent people from truly being themselves and expressing their feelings. That is why it is very important to detect and block antisocial behavior in the digital world. Accordingly, much research has aimed at finding algorithms that make social media platforms safe for everyone by moderating online discussions. Nevertheless, each social network has its own topics that are regularly moderated to maintain a favorable atmosphere for every user of the platform [6]. These topics include:
- Misinformation and Election Integrity. This was especially prominent during COVID-19, when people spread false information across social media.
- Adult Nudity & Sexual Content. This type of content is not allowed on most social media platforms because it is considered harmful.
- Promotion or Glorification of Self-Harm. This content is also considered abusive, which is why it is not allowed on platforms such as TikTok, YouTube, and Instagram (owned by Meta, an organization designated as extremist and banned in Russia).
- Violence, Incitement, Gore & Mutilation Content. This content is also subject to moderation.
Accordingly, there have been various studies searching for algorithms that make social media websites secure for everyone through moderation of online discussions. Detection and filtering of unwanted information can be carried out with natural language processing (NLP) and artificial intelligence. Most studies have used transformer-based models such as BERT and RoBERTa, which showed high classification accuracy combined with low false-positive rates. In our research, we use the RoBERTa model to examine texts and identify their toxicity level. We want our model to be accurate and easy to understand, so that its results can be readily explained and used in the real world, and we aim to conclude which model performs better at identifying toxic content. Additionally, we want to identify a minimal dataset on which the model can still be considered successful. The results can help create automatic moderation systems that make communication safer and more positive.

Detection of hate, abusive, or toxic speech with NLP is a task addressed widely around the world. Some studies [3], [7], [8], [9] used machine learning along with ensemble methods in their experiments. They also tried word-embedding techniques such as fastText and BERT and found that BERT embeddings combined with a CNN provided the most accurate results. Two datasets were used as reference points; one focused solely on toxic Twitter discussions among youths.
A different paper [10] applies bidirectional LSTM neural networks with optimized hyperparameters. The proposed approach performs best, reaching over 95% accuracy in identifying different kinds of toxicity such as threats, hate, and insults. This confirms that Bi-LSTM can encode textual context and dependencies effectively.
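To make this architecture concrete, the following is a minimal Keras sketch of the kind of Bi-LSTM classifier described in [10]; the vocabulary size, sequence length, and layer sizes are illustrative assumptions, not the configuration used in the cited paper.

```python
# Illustrative Bi-LSTM toxicity classifier (assumed hyperparameters).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000  # assumed vocabulary size
MAX_LEN = 128        # assumed maximum sequence length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),   # reads the sequence in both directions
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # binary toxic / non-toxic output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```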
Besides harmful social media posts, other work by Ortega-Mendoza et al. [11] looked closely at hateful messages about women in songs, since hate speech aimed at women on social media can be deeply harmful and can even instigate physical violence. They used datasets of cleaned song lyrics and applied a lightweight variant of BERT and an attention-based GRU (Gated Recurrent Unit) network model.
Training a machine learning model on data from social media sites to find offensive posts has become a popular and effective method [3], [12]. However, a few studies [13] have aimed to evaluate the performance of solutions that already exist in the area. Their authors systematically searched for comparative studies in search engines and academic databases, shortlisting papers based only on the title and abstract, with the aim of compiling a summary of research to date on automatic hate speech detection in social media.
The remaining experiments largely employed BERT together with Bi-LSTM embeddings and sequential modeling, but could not capture contextual subtleties such as sarcasm, humor, or implicit toxicity. Toxic posts exhibit varied speech patterns that demand deeper contextual understanding, which these models lack. Additionally, in most of the experiments no baseline was used to gauge the performance of the transformer models, even though a baseline helps interpret what a more advanced machine learning model should achieve.
Furthermore, models are often built from the dataset of a single platform, which can be a limitation of an experiment. Social media platforms differ in language usage and user behavior, so a model built from one site might not work on another. Experiments conducted in particular countries therefore tend to skew toward specific demographic groups, and some cultural differences may not be accounted for. For instance, referring to a person as gay is an insult in some countries but not in the West, where individuals do not shy away from identifying as gay and consider it part of who they are.
Methodology
As a first step in our toxic content detection project, we established a simple baseline model. This allowed us to better understand the complexity of the classification task and provided a minimal performance benchmark for evaluating more sophisticated methods. The studies [14], [15], [16] also used logistic regression and obtained results similar to ours. Malmasi and Zampieri [17] divided their input dataset into three parts: HATE (contains hate speech), OFFENSIVE (contains offensive language but no hate speech), and OK (no offensive language).
Figure 1. Logistic Regression
In contrast, our baseline was a majority-class classifier, which always predicted the most frequent class (toxic or non-toxic). The study [18] likewise describes how input data are divided for model training. While such a model is overly simplistic and not viable for real-world use, it was useful for setting a lower bound on performance metrics such as accuracy and F1-score. In our dataset, the toxic class was slightly more common, so the baseline achieved a non-trivial accuracy of around 54–56%, but its F1-score was very low, underscoring the importance of considering other metrics beyond accuracy.

To improve upon the baseline, we implemented classical machine learning models using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization. This technique converts text into sparse numerical representations based on word importance across the dataset. We experimented with logistic regression, support vector machines (SVM), and random forest classifiers. Logistic regression was particularly effective due to its simplicity, robustness, and interpretability in high-dimensional spaces. Support vector machines performed slightly worse, likely due to limited hyperparameter tuning and the linear kernel used. The random forest model exhibited more variability, performing better on training data but generalizing less effectively to validation samples. The logistic regression model achieved the best overall performance among classical models, with an accuracy around 76–78% and an F1-score of approximately 0.77, confirming its status as a strong linear baseline. However, it still lacked the ability to model subtle semantic features and contextual dependencies in language, motivating our shift toward deep learning approaches.

Fig. 1 shows the confusion matrix summarizing the results of the baseline model. Looking at the number of false positives, we analyzed the words that the model considered toxic even though they are not. The results are shown as a word cloud in Fig. 2: words related to ethnicity, religion, and gender were treated as toxic, although such words are not toxic in general unless the context of their usage makes them so. These results show how important it is to train the model on diverse contexts and sentences so that accuracy becomes much higher.
Figure 2. Word Cloud
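To make the baseline setup above concrete, the following is a minimal scikit-learn sketch of the majority-class baseline and the TF-IDF + logistic regression model; the tiny toy sample and the hyperparameters are illustrative assumptions, not our exact experimental configuration.

```python
# Majority-class baseline and TF-IDF + logistic regression (illustrative sketch).
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the real ToxiGen texts (0 = non-toxic, 1 = toxic).
texts = [
    "have a wonderful day", "thanks for sharing this", "great point, I agree",
    "this tutorial was really helpful", "congratulations on the results",
    "you are a complete idiot", "nobody wants people like you here",
    "shut up, you worthless fool", "go away, everyone hates you",
    "you are too stupid to understand this",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Majority-class baseline: always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_val, baseline.predict(X_val)))

# TF-IDF features + logistic regression, the strongest classical model we tried.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

pred = clf.predict(X_val)
print("logreg accuracy:", accuracy_score(y_val, pred), "F1:", f1_score(y_val, pred))
```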
A. Data Representation
Our dataset, ToxiGen, is specifically designed for evaluating language models' ability to detect toxic content. It consists of synthetic prompts and machine-generated responses, each labeled toxic or non-toxic. We loaded the dataset using Python libraries and conducted exploratory analysis to inspect and understand the size, structure, and content of the data. We ensured data integrity by checking for missing values, null entries, or corrupted rows and removed or corrected any inconsistencies we encountered. Labels were converted from string format to binary integers (0 for non-toxic, 1 for toxic) to facilitate training. A crucial part of early data analysis was the assessment of class imbalance. Although the dataset was relatively balanced, a slight predominance of toxic samples (approximately 55–60%) was observed, which could affect model performance. We visualized class distributions and word clouds to gain insights into the vocabulary typically used in toxic and non-toxic texts. Texts classified as toxic often included profanity, slurs, or aggressive language patterns.

Preprocessing steps included converting all text to lowercase, stripping unnecessary whitespace, and removing special characters if they had no linguistic value. We intentionally avoided stemming, lemmatization, and stopword removal, since these steps might remove important contextual cues. For instance, in toxic language, function words or grammatical structures can be essential indicators of intent. Given that transformer models operate best on raw, unaltered text, we kept preprocessing minimal to preserve semantic richness. These decisions aimed to optimize downstream performance and ensure compatibility with contextual embedding models.
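The sketch below shows how this loading and preparation step can look with pandas; the CSV file name and the column names ("text", "label") are assumptions about how a local ToxiGen export might be stored, not the official schema of the release.

```python
# Data-preparation sketch (assumed file name and column names).
import pandas as pd

df = pd.read_csv("toxigen.csv")  # assumed local export of the ToxiGen dataset

# Integrity checks: drop rows with missing values and duplicated texts.
df = df.dropna(subset=["text", "label"]).drop_duplicates(subset=["text"])

# Convert string labels to binary integers: 0 = non-toxic, 1 = toxic.
df["label"] = df["label"].map({"non-toxic": 0, "toxic": 1}).astype(int)

# Assess class imbalance before training.
print(df["label"].value_counts(normalize=True))

# Minimal preprocessing: lowercase and strip whitespace only; stopwords and
# inflections are kept because they carry contextual cues for the transformer.
df["text"] = df["text"].str.lower().str.strip()
```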
B. Model Selection: RoBERTa
Before finalizing our choice of model, we conducted a comparative analysis of two prominent transformer-based architectures: BERT and RoBERTa. Our goal was to determine which model performs more reliably in the context of toxic content detection. To do this, we used a publicly available dataset separate from our main training set, consisting of examples labeled as toxic and non-toxic. We passed these texts through both models, pre-trained BERT and RoBERTa classifiers fine-tuned on similar tasks, and compared their outputs. Fig. 3 shows the scheme of this working principle.
Figure 3. Scheme
The evaluation was both quantitative and qualitative. We observed that while both models achieved high overall accuracy (around 91% for BERT and 93% for RoBERTa), their performance characteristics differed significantly. BERT showed high recall but low precision, meaning it successfully detected most toxic content but often misclassified non-toxic examples as toxic. This led to a large number of false positives, which is problematic in real-world applications where over-flagging harmless content is undesirable. RoBERTa, on the other hand, demonstrated high precision but lower recall. It was more conservative in flagging content as toxic, but when it did, it was usually correct, an important trait for minimizing unjustified censorship.
In addition to metrics, we examined the distribution of predictions. The BERT-based model labeled approximately 20% of the test inputs as toxic and 80% as non-toxic, while RoBERTa labeled only about 10% as toxic and 90% as non-toxic. A manual review of the predictions confirmed that RoBERTa's toxic classifications were generally more accurate and aligned with human judgment. To further ensure objectivity, we conducted an additional evaluation phase in which a small human-annotated sample was reviewed independently, yielding consistent results: RoBERTa made fewer misclassifications.
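As an illustration of this comparison step, the following is a hedged sketch that scores two fine-tuned classifiers on a labeled held-out sample with the Hugging Face transformers pipeline; the two model identifiers are hypothetical placeholders, not the exact checkpoints we evaluated, and the tiny evaluation sample is illustrative only.

```python
# Sketch of the BERT vs. RoBERTa comparison on a held-out labeled sample.
from transformers import pipeline
from sklearn.metrics import precision_recall_fscore_support

# Placeholder model names; substitute real fine-tuned toxicity classifiers.
bert_clf = pipeline("text-classification", model="bert-toxicity-checkpoint")
roberta_clf = pipeline("text-classification", model="roberta-toxicity-checkpoint")

# Tiny stand-in for the held-out labeled sample described above.
eval_texts = ["have a great day", "you are a worthless idiot"]
eval_labels = [0, 1]

def to_binary(outputs):
    # Map each pipeline output to 1 if the predicted class is "toxic", else 0.
    return [1 if out["label"].lower() == "toxic" else 0 for out in outputs]

for name, clf in [("BERT", bert_clf), ("RoBERTa", roberta_clf)]:
    preds = to_binary(clf(eval_texts, truncation=True))
    precision, recall, f1, _ = precision_recall_fscore_support(
        eval_labels, preds, average="binary"
    )
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```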
Figure 4. BERT model
Given these insights, we selected RoBERTa for our final implementation. As Fig. 5 shows, the RoBERTa model produced better results than the BERT model shown in Fig. 4. This careful model selection process gave us confidence that RoBERTa would provide more reliable results in detecting toxic content while minimizing false positives. To improve the quality and robustness of predictions, we transitioned from classical models to transformer-based architectures and selected RoBERTa (Robustly Optimized BERT Pretraining Approach) due to its enhanced training regime, including larger pretraining corpora, longer training times, and the removal of the next-sentence prediction task found in BERT.
Figure 5. RoBERTa model
These improvements make RoBERTa more capable of capturing complex linguistic patterns and long-range dependencies in text, which are essential for tasks such as toxicity detection. Table 1 shows that the RoBERTa model achieved a better result than the BERT model: even though their F1-scores are comparable, the accuracy figures show that RoBERTa outperforms BERT. We used the “roberta-base” variant, which includes 12 transformer layers, 768 hidden dimensions, and 125 million parameters. This model was loaded via the Hugging Face Transformers library and augmented with a simple feed-forward classification head to output binary predictions. Tokenization was performed using the RobertaTokenizer, which converts raw text into token IDs and attention masks. We applied truncation and padding to a maximum length of 128 tokens to ensure uniform input sizes and efficient GPU batch processing. The choice of 128 tokens provided a balance between capturing sufficient context and limiting memory usage. After tokenization, we split the dataset into training and validation sets using an 80/20 ratio. The split was stratified to preserve the class distribution across both subsets. This preparation ensured that the model would have both sufficient data to learn from and a representative validation set for performance assessment. The model was then ready for fine-tuning on the toxicity classification task.
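A condensed sketch of this fine-tuning setup with the Hugging Face Transformers API is shown below; it assumes the prepared data frame `df` with "text" and "label" columns from the earlier step, and the number of epochs and batch size are illustrative assumptions, while the 128-token limit and the stratified 80/20 split follow the description above.

```python
# Sketch of the RoBERTa fine-tuning setup (assumed epochs and batch size).
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

# Stratified 80/20 split preserving the toxic / non-toxic ratio.
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    # Truncate and pad every example to 128 tokens for uniform GPU batches.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-toxigen",    # assumed output directory
    num_train_epochs=3,              # assumed
    per_device_train_batch_size=16,  # assumed
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```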
C. Evaluation Metrics
To evaluate the overall performance of a model, we used several key metrics. These metrics serve as the standard baseline metrics in most related research [18], [19], and a minimal sketch for computing them follows the list:
- Accuracy: The proportion of correctly predicted labels.
- Precision: The ratio of true toxic predictions to all samples predicted as toxic.
- Recall: The ratio of true toxic predictions to all actual toxic samples.
- F1-Score: The harmonic mean of precision and recall, useful in imbalanced datasets.
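As referenced above, the sketch below computes these four metrics with scikit-learn; the label vectors are toy examples standing in for the validation labels and model predictions.

```python
# Compute the four evaluation metrics for binary toxicity classification.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]  # example ground-truth labels (0 = non-toxic, 1 = toxic)
y_pred = [0, 1, 0, 0, 1]  # example model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```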
Table 1. Comparison of Transformer Models

| Model   | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| BERT    | 91%      | Low       | High   | Average  |
| RoBERTa | 93%      | High      | Low    | Average  |
Limitations and Future Work
Our work has several limitations. One of them is that we worked on standard datasets, where textual toxicity is relatively easy to detect; therefore, the errors of the models we used were small. Future work could focus on creating a custom dataset, collecting data from various social networks where people actively use sarcasm, memes, and slang. We also did not account for the specifics of mentalities in different countries: our analysis was carried out only with data from English-speaking countries, which have their own specifics that differ from the countries of Asia and the CIS. In future work, we plan to focus on CIS-region pages and on social networks popular in these countries, because the mentality in our countries and in the West is quite different.
Conclusion
In this research, we worked on the problem of detecting toxic content in texts, first starting from a baseline model and then moving on to transformer-based models. Our first experiments were conducted using simple models, including logistic regression, which gave us a general idea of the task as well as the first results that we used as a guideline when experimenting with more complex models.
To improve the accuracy of detecting toxic messages, we switched to deep learning methods and compared two transformer models, BERT and RoBERTa, to better understand which one is more efficient and produces fewer false positives. Although both models showed very good results with high accuracy (91% for BERT and 93% for RoBERTa), their behavior differed. BERT had high recall but low precision, producing more false positives. RoBERTa, however, was more precise, and its judgments aligned more closely with human assessment, which makes it more suitable for real-world use.
In the end, we chose RoBERTa because it combines precision and overall consistency. After fine-tuning on the ToxiGen dataset, the RoBERTa model achieved an F1-score of 0.91, which indicates its reliability in detecting toxic content with a minimal number of false positives.
Our results show that high accuracy, although necessary, is not the only criterion for detecting toxic content. The main message is that modern NLP techniques, in particular transformer-based architectures, help us create more respectful and safe online communities.
References:
[1] E. Elbasani and J. D. Kim, "AMR-CNN: Abstract meaning representation with convolution neural network for toxic content detection," Journal of Web Engineering, vol. 21, pp. 677–692, 2022.
[2] W. Q. Zhang Jiang, "Efficient toxic content detection by bootstrapping and distilling large language models."
[3] P. Malik, A. Aggrawal, and D. K. Vishwakarma, "Toxic speech detection using traditional machine learning models and BERT and fastText embedding with deep neural networks," in Proceedings - 5th International Conference on Computing Methodologies and Communication, ICCMC 2021. Institute of Electrical and Electronics Engineers Inc., 4 2021, pp. 1254–1259.
[4] Shaik, S. Rohit, B. Raviteja, Barleapally, K. Reddy, and A. Shangloo, "Toxic comment classification based on personality traits using NLP," p. 9. [Online]. Available: http://www.ijritcc.org
[5] M. Chhikara and S. K. Malik, "Classification of cyber hate speech from social networks using machine learning," in Proceedings of the 2022 11th International Conference on System Modeling and Advancement in Research Trends, SMART 2022. Institute of Electrical and Electronics Engineers Inc., 2022, pp. 419–423.
[6] M. Singhal, C. Ling, P. Paudel, P. Thota, N. Kumarswamy, G. Stringhini, and S. Nilizadeh, "SoK: Content moderation in social media, from guidelines to enforcement, and research to practice," pp. 868–895, 2023.
[7] C. Duchene, H. Jamet, P. Guillaume, and R. Dehak, "A benchmark for toxic comment classification on civil comments dataset," 1 2023. [Online]. Available: http://arxiv.org/abs/2301.11125
[8] F. A. Rawther and G. Titus, "Transformer models for recognizing abusive language: an investigation and review on TweetEval and SOLID dataset," in 2023 2nd International Conference on Electrical, Electronics, Information and Communication Technologies, ICEEICT 2023. Institute of Electrical and Electronics Engineers Inc., 2023.
[9] M. Kamphuis, "Tiny-toxic-detector: A compact transformer-based model for toxic content detection," 8 2024. [Online]. Available: http://arxiv.org/abs/2409.02114
[10] A. Maity, R. More, A. Patil, J. Oza, and G. Kambli, "Toxic comment detection using bidirectional sequence classifiers," in 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things, IDCIoT 2024. Institute of Electrical and Electronics Engineers Inc., 2024, pp. 709–716.
[11] R. Calderon-Suarez, R. M. Ortega-Mendoza, M. Montes-Y-Gomez, C. Toxqui-Quitl, and M. A. Marquez-Vera, "Enhancing the detection of misogynistic content in social media by transferring knowledge from song phrases," IEEE Access, vol. 11, pp. 13179–13190, 2023.
[12] E. F. Ayetiran and O. Ozgobek, "A review of deep learning techniques for multimodal fake news and harmful languages detection," IEEE Access, vol. 12, pp. 76133–76153, 2024.
[13] N. S. Mullah and W. M. N. W. Zainon, "Advances in machine learning algorithms for hate speech detection in social media: A review," pp. 88364–88376, 2021.
[14] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Proceedings of the 11th International Conference on Web and Social Media, ICWSM 2017. AAAI Press, 2017, pp. 512–515.
[15] M. A. Al-Garadi, M. R. Hussain, N. Khan, G. Murtaza, H. F. Nweke, Ali, G. Mujtaba, H. Chiroma, H. A. Khattak, and A. Gani, "Predicting cyberbullying on social media in the big data era using machine learning algorithms: Review of literature and open challenges," IEEE Access, vol. 7, pp. 70701–70718, 2019.
[16] Isha, Anjali, K. Sharma, Kirti, and V. Pratap, "Classifying toxic comments with machine learning and deep learning approaches," International Journal of Scientific Research in Science and Technology, vol. 12, pp. 1074–1082, 4 2025. [Online]. Available: https://ijsrst.com/index.php/home/article/view/IJSRST251222664
[17] S. Malmasi and M. Zampieri, "Detecting hate speech in social media," 2017. [Online]. Available: https://data.world/crowdflower/
[18] S. Kaur, S. Singh, and S. Kaushal, "Deep learning-based approaches for abusive content detection and classification for multi-class online user-generated data," International Journal of Cognitive Computing in Engineering, vol. 5, pp. 104–122, 1 2024.
[19] H. Ismail, A. Khalil, and A. Jasmy, "Enhancing online toxicity detection on gaming networks: a novel embeddings-based valence lexicon approach," International Journal of Data Science and Analytics, 2025.