Student, School of IT and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
EVALUATION OF A MOVIE RECOMMENDATION SYSTEM USING NLP TECHNIQUES
ABSTRACT
This paper presents a lightweight, content-based movie recommendation system utilizing TF-IDF vectorization and cosine similarity. Designed for real-time applications, the system supports localized content with a focus on Kazakh-language films. Implemented with Streamlit, it includes a user feedback module that captures ratings from real users. A study involving 96 university students showed a high satisfaction rate (4.45/5), demonstrating that even simple models can offer strong performance in low-resource and culturally specific settings.
Keywords: movie recommendation, TF-IDF, cosine similarity, user feedback, NLP, content-based filtering, real-time system.
Introduction
Personalized recommendations are everywhere, yet users still spend an average of 10.5 minutes searching for something to watch, which signals a gap between algorithmic prediction and user satisfaction [15]. Despite many improvements, recommenders still fail to fully meet the emotional and contextual expectations of viewers. The result is frustration and choice overload, where too many options make it harder to decide [4, 19]. A good recommendation system needs to be not only accurate but also simple, diverse, and aligned with user mood and preferences. Recommendation systems are essential to online platforms such as Netflix and Spotify. They help users discover new content and improve satisfaction. Recommenders also play a crucial role in retention, encouraging users to spend more time on the platform [7, 13, 24].
There are three main types of recommendation systems: collaborative filtering (CF), content-based filtering (CBF), and hybrid models [1, 3]. CF recommends items based on user-item interactions. It performs well when interaction data is plentiful but suffers from the cold-start problem, when little data is available for new users or items [9, 18]. CBF, on the other hand, uses item metadata such as genres, actors, or keywords [10, 12]. It works better for new users but can lead to over-specialization by repeating similar content [5].
To address these issues, hybrid models combine CF and CBF. Several studies show that hybrid methods improve accuracy, especially when data is limited [14, 20, 23]. For example, Lecaros et al. [14] use TF-IDF and SVD together, while others incorporate deep learning [8] or neural attention [9]. Still, these systems require high computing resources, long training times, and often lack transparency.
A different group of studies focuses on simpler approaches such as vector-based models using TF-IDF and cosine similarity [2, 6, 12]. These methods convert movie descriptions into vectors and use cosine similarity to find similar items. They are fast, explainable, and work well in real-time use cases. In addition, some researchers apply keyword analysis [16] or Word2Vec and TF-IDF combinations [21] to enhance semantic matching.
Another key challenge is dataset bias. Many popular datasets such as MovieLens and Netflix Prize focus on Western content [17, 25]. As a result, local users often get recommendations that do not match their interests, which is why localizing the data is important. Research shows that adding regional metadata and using language-specific information can significantly improve recommendation quality [17, 22].
This paper addresses these problems by developing a lightweight content-based recommendation system. It uses TF-IDF to extract key terms from movie metadata and cosine similarity to compute recommendations. The system includes Kazakh movies, works in real time, and is evaluated through a user interface built in Streamlit. The goal is to offer a simple and explainable solution that achieves strong user satisfaction without relying on heavy infrastructure.
0.1 System Architecture
The system consists of five main components: data collection, preprocessing, vectorization, similarity calculation, and recommendation delivery.
Figure 1. System Architecture Diagram
As shown in Fig. 1, data is first collected from movie sources. Then, text preprocessing steps like cleaning, stemming, and tokenization are applied. The cleaned text is converted into vectors using TF-IDF, and cosine similarity is used to compare them.
The main advantage of this architecture is its low complexity and ease of updates. New movies can be appended to the dataset, and TF-IDF can be re-fit periodically without complex retraining. This makes the system flexible for localization and for small teams operating without GPU infrastructure.
Methodology
0.2 Approach
Our system follows a content-based recommendation approach. Unlike collaborative filtering, which relies on user-item interaction history, content-based methods analyze the intrinsic properties of the items themselves.
We selected TF-IDF vectorization in combination with cosine similarity due to their simplicity and effectiveness in text-based recommendation tasks. Each movie is represented as a vector of weighted keywords derived from its metadata, enabling the system to compute similarity between movies based on shared attributes.
The recommendation process begins when a user selects a movie. The system extracts metadata from that movie and compares it to other movies using cosine similarity. The most similar items are returned as recommendations.
Figure 2. Content-based recommendation example
0.3 Dataset
We used a combination of datasets to build a diverse movie database. In total, we gathered information on approximately 50,000 movies from global sources, covering various genres, languages, and styles. This dataset was enriched with information on Kazakh films. After filtering out low-quality entries, movies with insufficient data, and duplicates, the final dataset included 45,000 unique movies.
The metadata fields used for recommendation were selected because they are broadly available across movie catalogs and provide complementary signals. Overview and tagline capture narrative and theme, genres provide high-level category cues, and keywords and recommendations contribute descriptive terms that often align with user expectations. In practice, Kazakh-language entries may have shorter descriptions than mainstream movies, so combining multiple fields reduces sparsity and supports more stable similarity estimates.
0.4 Text Preprocessing
Text preprocessing was essential to prepare the dataset for vectorization. We performed several steps to clean and standardize the text data: removing punctuation, converting all text to lowercase, and removing stopwords. We also applied stemming with the Snowball stemmer in Python to reduce words to their root forms. Finally, we created a new tags field by combining multiple metadata fields (overview, genres, keywords, tagline, and recommendations) into a single string, so that all essential textual information about each movie sits in one unified field.
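The preprocessing steps can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the stopword list is a tiny placeholder, the sample movie is hypothetical, and stemming is passed in as a function (identity by default) so that a Snowball stemmer can be plugged in where one is available.

```python
import string

# Tiny illustrative stopword list (assumption); a real run would use a fuller
# list for each language present in the catalog.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "his", "her"}

# Metadata fields merged into the tags string, as named in the text.
TAG_FIELDS = ("overview", "genres", "keywords", "tagline", "recommendations")


def preprocess(text, stem=lambda token: token):
    """Lowercase, strip ASCII punctuation, drop stopwords, then stem tokens.

    `stem` defaults to the identity function; the `stem` method of a
    Snowball stemmer object can be passed in instead.
    """
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in no_punct.split() if t not in STOPWORDS]
    return " ".join(stem(t) for t in tokens)


def build_tags(movie, stem=lambda token: token):
    """Join the selected metadata fields of one movie into a cleaned string."""
    combined = " ".join(str(movie.get(field, "")) for field in TAG_FIELDS)
    return preprocess(combined, stem)


# Hypothetical catalog entry; missing fields are treated as empty strings.
movie = {
    "overview": "A young nomad defends his homeland.",
    "genres": "History, Drama",
    "keywords": "steppe, warrior",
    "tagline": "The birth of a nation.",
}
tags = build_tags(movie)
```

For this entry, `tags` is a single lowercase string of content words, which is the form the vectorizer consumes. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes would need separate handling.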
A key practical issue is that a multilingual catalog may contain mixed-language tokens within the same entry. For example, Kazakh titles may appear alongside Russian or English keywords. We treat all tokens uniformly at vectorization time, since TF-IDF weighting naturally down-weights frequent language artifacts and emphasizes distinctive terms.
0.5 TF-IDF Vectorization
To turn movie descriptions into vectors, we use TF-IDF. This method gives higher weight to words that are frequent in one movie's text but rare across the rest of the catalog, so each movie's vector emphasizes its most distinctive terms.
$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \cdot \log\frac{N}{\mathrm{DF}(t)} \qquad (1)$$
where:
• TF(t, d) is the term frequency of term t in document d
• DF(t) is the document frequency of term t
• N is the total number of documents
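As a worked illustration of these quantities, the sketch below computes weights by hand, assuming the classic unsmoothed form TF(t, d) · log(N / DF(t)) with raw counts as term frequency. The three toy documents are hypothetical. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using TF * log(N / DF)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # DF(t): number of documents that contain term t at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # TF(t, d): raw count of t in d
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


# Hypothetical cleaned tag strings for three movies.
docs = [
    "steppe warrior drama",
    "space warrior action",
    "steppe family drama",
]
vectors = tfidf_vectors(docs)
```

In the second document, "space" occurs in one of three documents and receives weight log(3), while "warrior" occurs in two and receives log(1.5), so the rarer term dominates that movie's vector.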
0.6 Cosine Similarity
After turning all movies into TF-IDF vectors, we compare them using cosine similarity. It measures the angle between two vectors. A value close to 1 means the movies are highly similar; closer to 0 means they are unrelated.
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (2)$$
where:
• A · B is the dot product of the vectors
• ‖A‖ and ‖B‖ are the magnitudes of the vectors
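The cosine measure, the dot product of two vectors divided by the product of their magnitudes, can be written directly for sparse vectors stored as dictionaries. This is a small sketch with hypothetical weights; returning 0 for an empty vector is a convention chosen here, not something the paper specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_a * norm_b)


movie_a = {"steppe": 1.0, "warrior": 2.0}
movie_b = {"steppe": 2.0, "warrior": 4.0}  # same direction as movie_a
movie_c = {"space": 3.0}                   # no terms shared with movie_a
```

Here `movie_a` and `movie_b` point in the same direction and score 1, while `movie_a` and `movie_c` share no terms and score 0. Because TF-IDF weights are non-negative, scores stay between 0 and 1.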
0.7 Frontend and Feedback
We developed an interactive web interface using Streamlit, enabling users to experience the recommendation system in real time. Users can input a movie title and receive similar movie suggestions. They can rate the results using a 1-to-5 star scale. The feedback is stored for future evaluation and iterative improvement.
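One possible shape for the feedback storage is sketched below. The storage format is not described in the paper, so the CSV layout, the function name `log_feedback`, and the column order are all assumptions. The row deliberately holds no user identifier, only a timestamp, the queried title, and the star rating.

```python
import csv
import io
from datetime import datetime, timezone


def log_feedback(stream, query_title, rating):
    """Append one row (UTC timestamp, query title, stars) to a CSV stream."""
    if not isinstance(rating, int) or not 1 <= rating <= 5:
        raise ValueError("rating must be an integer from 1 to 5")
    timestamp = datetime.now(timezone.utc).isoformat()
    csv.writer(stream).writerow([timestamp, query_title, rating])


# In-memory stream for the demonstration; a deployment would more likely open
# a file in append mode, e.g. open("feedback.csv", "a", newline="").
buffer = io.StringIO()
log_feedback(buffer, "Example Movie", 5)
log_feedback(buffer, "Example Movie", 4)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

In a Streamlit app, a function like this would be called from the handler of the rating widget. Validating the rating before writing keeps out-of-range values from entering later analysis.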
Figure 3. User interface for collecting feedback
Figure 4. Web-based system interface
0.8 Experimental Setup and Reproducibility
This section describes implementation choices that affect reproducibility and runtime behavior.
The system was implemented in Python using standard NLP and data science libraries. TF-IDF vectorization and cosine similarity were computed using scikit-learn, while Pandas and NumPy were used for data handling. The user interface and feedback logging were implemented in Streamlit to support interactive evaluation.
The feature space size was constrained to a fixed maximum vocabulary to reduce memory usage and improve latency. In practice, TF-IDF uses a sparse matrix representation, which allows similarity calculations to remain efficient even for tens of thousands of movies. The recommendation step for a selected movie consists of reading a precomputed similarity vector or computing similarity against the sparse matrix, then selecting top-k candidates.
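A minimal scikit-learn sketch of this setup follows. The toy catalog, the `recommend` helper, and the vocabulary cap of 5000 are assumptions for illustration; the paper does not state the actual cap. Since `TfidfVectorizer` L2-normalizes rows by default, `linear_kernel` (a plain dot product) gives the same result as cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalog standing in for the cleaned tags field (hypothetical data).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
tags = [
    "steppe warrior histori drama",
    "steppe nomad histori drama",
    "space robot action",
    "space alien action thriller",
]

# max_features caps the vocabulary; 5000 is an assumed value, not the paper's.
vectorizer = TfidfVectorizer(max_features=5000)
matrix = vectorizer.fit_transform(tags)  # sparse matrix, rows L2-normalized


def recommend(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    # Rows are unit length, so the dot product equals cosine similarity.
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -1.0  # push the query movie to the bottom of the ranking
    top = np.argsort(scores)[::-1][:k]
    return [titles[i] for i in top]
```

For this toy catalog, `recommend("Movie A", k=1)` returns `["Movie B"]`, since the two share three of four terms. Computing one row of similarities per query avoids storing a full movie-by-movie matrix, which for tens of thousands of movies would be far larger than the sparse TF-IDF matrix itself.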
To reduce noise, the preprocessing pipeline removed punctuation, normalized case, and removed stopwords. Stemming was applied to reduce morphological variation and improve matching between related words. Although stemming can sometimes reduce readability of explanations, it improves retrieval robustness for short descriptions.
Results and Discussion
To evaluate the system, a survey was conducted among 96 university students. Each participant used the system to search for movies and then rated the quality of the recommendations on a 1-to-5 star scale.
Table 1 shows the distribution of the user ratings collected.
Table 1.
User Rating Distribution
| Stars | Number of Users | Percentage |
|---|---|---|
| 5 | 58 | 60.42% |
| 4 | 28 | 29.17% |
| 3 | 6 | 6.25% |
| 2 | 3 | 3.13% |
| 1 | 1 | 1.04% |
| Total | 96 | 100% |
Figure 5. User rating distribution (1 to 5 stars)
In addition to the rating distribution, evaluation metrics were computed from the feedback provided by participants. Table 2 summarizes the results. The average rating was 4.45, while the acceptance rate, defined as the percentage of users who rated the recommendations with 4 or 5 stars, reached 89.58% (86 of 96). The estimated standard deviation of 0.81 is small relative to the 1-to-5 scale, which, together with the high mean, indicates that ratings clustered at the top of the scale.
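The summary metrics can be recomputed from the counts in Table 1, as in the short check below. Dividing the counts directly gives an acceptance rate of 89.58%, which differs in the last digit from the sum of the rounded percentages in Table 1 (60.42% + 29.17%). The sample standard deviation recomputed this way comes out near 0.83, slightly above the estimate of 0.81 quoted in the text; the n - 1 denominator used here is an assumption about how the estimate was formed.

```python
import math

# Rating counts copied from Table 1: stars -> number of users.
counts = {5: 58, 4: 28, 3: 6, 2: 3, 1: 1}

n = sum(counts.values())
mean = sum(stars * c for stars, c in counts.items()) / n
# Acceptance rate: share of users who gave 4 or 5 stars.
acceptance = (counts[4] + counts[5]) / n
# Sample standard deviation (n - 1 in the denominator).
variance = sum(c * (stars - mean) ** 2 for stars, c in counts.items()) / (n - 1)
std_dev = math.sqrt(variance)

print(f"n={n}, mean={mean:.2f}, acceptance={acceptance:.2%}, sd={std_dev:.2f}")
# prints: n=96, mean=4.45, acceptance=89.58%, sd=0.83
```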
Table 2.
Evaluation Metrics Summary
| Metric | Value |
|---|---|
| Mean Rating | 4.45 |
| Acceptance Rate (≥ 4 stars) | 89.58% |
| Standard Deviation (est.) | 0.81 |
| User Sample Size | 96 |
Figure 6. Acceptance rate of recommended movies
0.9 Additional Quantitative Analysis
Beyond the mean rating, it is useful to estimate the uncertainty of the survey mean. Using the estimated standard deviation s = 0.81 and sample size n = 96, the standard error of the mean is:
$$\mathrm{SE} = \frac{s}{\sqrt{n}} = \frac{0.81}{\sqrt{96}} \approx 0.083 \qquad (3)$$
This implies that the observed mean rating is stable for this sample. A rough 95% confidence interval using 1.96 · SE yields:
$$\mathrm{CI}_{95} \approx 4.45 \pm 1.96 \cdot 0.083 \approx [4.29,\ 4.61] \qquad (4)$$
While this interval does not guarantee generalization to all users, it indicates that within the tested group the perceived quality was consistently high.
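We also report the uncertainty of the acceptance rate, which is the proportion of 4- and 5-star responses.
Let p = 86/96 ≈ 0.8958. The binomial standard error is:
$$\mathrm{SE}_p = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.8958 \cdot 0.1042}{96}} \approx 0.031 \qquad (5)$$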
This suggests that the acceptance rate is also stable for the tested sample.
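The uncertainty estimates in this subsection can be reproduced with a few lines of Python. The helper names are illustrative, and the inputs are the reported values s = 0.81, mean 4.45, and n = 96. The interval uses the normal approximation with z = 1.96, as in the text.

```python
import math


def standard_error(std_dev, n):
    """Standard error of a sample mean."""
    return std_dev / math.sqrt(n)


def normal_ci(center, se, z=1.96):
    """Two-sided interval center +/- z * se (normal approximation)."""
    return center - z * se, center + z * se


def proportion_se(p, n):
    """Binomial standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


n = 96
se_mean = standard_error(0.81, n)   # reported s = 0.81
ci_mean = normal_ci(4.45, se_mean)  # reported mean = 4.45
p = 86 / n                          # share of 4- and 5-star responses
se_p = proportion_se(p, n)
```

Rounded, these give a standard error of 0.083 for the mean, an interval of [4.29, 4.61], and a standard error of 0.031 for the acceptance rate.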
0.10 Discussion
The user feedback highlighted several strengths of the system. Participants appreciated the inclusion of Kazakh-language content, the fast response time, and the simple, intuitive interface.
However, many suggested the addition of genre filters and personalization options to improve control over results.
Table 3 presents selected user comments gathered during testing.
Table 3.
Sample User Comments
| ID | Comment |
|---|---|
| 1 | “Cool that there were Kazakh films I didn’t know before.” |
| 2 | “Good response time!” |
| 3 | “It would be great if I could choose genres I like.” |
| 4 | “Simple interface. Very easy to use.” |
| 5 | “Suggestions were mostly good, but I want to skip some genres.” |
Our findings are consistent with those reported in [11], where a lightweight model was used to recommend items in cold-start settings. While their method depends on latent factors derived from past interactions, our approach is fully content-based and does not require any historical data. Despite this, our system achieved a similar satisfaction level, demonstrating that simple and explainable models can deliver effective recommendations when designed around the needs of underrepresented user groups.
0.11 Why Localization Helped
A practical benefit of adding Kazakh movies is improved perceived relevance for local users. In mainstream datasets, regional content may be missing, which reduces the chance that a user sees culturally familiar recommendations. By enriching the catalog with regional entries and including language-specific metadata, the system increases the probability that a user receives recommendations aligned with local viewing habits and language preferences.
0.12 Common Failure Cases
During informal observation, a few failure patterns were noticed. First, for movies with very short or generic descriptions, TF-IDF may not capture distinctive terms, which can lead to recommendations that are only loosely related. Second, some genres share overlapping vocabulary, such as action and thriller, which can produce recommendations across adjacent categories even if a user expects stricter genre boundaries. Third, multilingual mixing can reduce precision when the selected movie is described in one language but similar movies are described in another, especially when descriptions are sparse.
0.13 Complexity and Real-Time Behavior
The system is designed for real-time interaction. TF-IDF produces a sparse matrix representation, and cosine similarity can be computed efficiently. In a typical workflow, the expensive step is building the TF-IDF matrix, which is done offline or periodically. Online recommendation consists of retrieving the vector for the selected movie and computing similarity scores to rank candidates.
Let M be the number of movies and V the vocabulary size. Training TF-IDF is approximately linear in the number of nonzero tokens across the corpus. Online similarity for a single query movie can be implemented as a sparse dot product between the query vector and the matrix, then selecting top-k. This supports low latency and makes the model practical for lightweight deployments and student projects.
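To make the cost argument concrete, the sketch below ranks candidates with an inverted index, so a query touches only the movies that share at least one term with it, not all M rows. This is an illustrative pure-Python version with hypothetical unit-length vectors, not the implementation used in the system.

```python
import heapq
from collections import defaultdict


def build_inverted_index(vectors):
    """Map each term to the (movie_id, weight) pairs where it is nonzero."""
    index = defaultdict(list)
    for movie_id, vector in enumerate(vectors):
        for term, weight in vector.items():
            index[term].append((movie_id, weight))
    return index


def top_k_similar(query_id, vectors, index, k=2):
    """Rank movies by dot product with the query, touching only shared terms."""
    scores = defaultdict(float)
    for term, q_weight in vectors[query_id].items():
        for movie_id, weight in index[term]:
            if movie_id != query_id:
                scores[movie_id] += q_weight * weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical unit-length TF-IDF rows, so dot product equals cosine similarity.
vectors = [
    {"steppe": 0.6, "warrior": 0.8},
    {"steppe": 1.0},
    {"warrior": 0.6, "space": 0.8},
    {"space": 1.0},
]
index = build_inverted_index(vectors)
ranked = top_k_similar(0, vectors, index, k=2)
```

For the first movie, the ranking is movie 1 (score 0.6) followed by movie 2 (score 0.48), and movie 3 is never visited because it shares no terms with the query. The work per query is proportional to the number of index entries under the query's nonzero terms, which is why short metadata strings keep latency low.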
0.14 Limitations and Ethics
This work has several limitations. First, the evaluation was conducted on a student sample, which may not represent broader demographics. Second, ratings measure perceived quality, but they do not directly measure long-term engagement or retention effects. Third, the approach is content-based and may over-recommend similar items, which can reduce novelty and diversity.
From an ethics perspective, the feedback module stores user ratings. The system should minimize stored personal data and avoid collecting identifiers unless necessary. If identifiers are used in future personalization work, the system should provide clear consent mechanisms and basic privacy protections. The dataset may also reflect biases present in the original sources, such as overrepresentation of mainstream content and underrepresentation of niche local films, which can influence exposure.
Conclusion
We presented a recommendation system based on TF-IDF and cosine similarity. The system performed well in user testing, receiving an average rating of 4.45 and an acceptance rate of 89.58%. Users appreciated simplicity and local relevance, while also identifying areas for enhancement such as genre filtering and user controls.
Compared to collaborative filtering approaches [11], our method offers strong performance without requiring prior user behavior data, making it suitable for cold-start scenarios and culturally specific applications. Future work will focus on interactive filtering mechanisms, genre-aware ranking, multilingual normalization, and continuous learning from user feedback.
References:
1. G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
2. Pratik K. Biswas and Songlin Liu. A hybrid recommender system for recommending smartphones to prospective customers. arXiv preprint arXiv:2105.12876, 2021.
3. Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.
4. Johan Bollen, Huina Mao, and Xiao-Jun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2010.
5. Robin Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
6. Yifan Chen, Xiang Zhao, Junjiao Gan, Junkai Ren, and Yang Fang. Content-based top-n recommendation using heterogeneous relations. arXiv preprint arXiv:1606.08104, 2016.
7. Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198, 2016.
8. Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. Proceedings of the 13th ACM Conference on Recommender Systems, pages 101–109, 2019.
9. Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182, 2017.
10. Ran Huang. Improved content recommendation algorithm integrating semantic information. Journal of Big Data, 10(1), 2023.
11. Bhupendra Khadka, Ramesh Pandey, and Subarna Shakya. Explainable recommendation for cold-start users via lightweight matrix factorization. IEEE Access, 11:7637–7650, 2023.
12. Puskal Khadka and Prabhav Lamichhane. Content-based recommendation engine for video streaming platform. International Journal of Advanced Computer Science and Applications, 2023. Preprint.
13. Sudhanshu Kumar, Kanjar De, and Partha Pratim Roy. Movie recommendation system using sentiment analysis from microblogging data. IEEE Transactions on Computational Social Systems, 7(4):915–923, 2020.
14. Lordjette Leigh Lecaros and Concepcion Khan. A tech hybrid-recommendation engine and personalized notification. In International Journal of Computing Sciences Research, volume 6, pages 925–939, 2022.
15. Nielsen. State of play report: Content discovery challenges in streaming. https://www.nielsen.com/news-center/2023/nielsens-state-of-play-report-delivers-new-insights-as-streamings-next-evolution-brings-content-discovery-challenges-for-viewers/, 2023. Accessed: 2025-05-15.
16. Aditya Narayan S., Kumaar Hareesh, Sathya Narayanan D., Srikumaran S., and Veni S. Content-based movie recommender system using keywords and plot overview. In 2022 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), pages 49–53, 2022.
17. Sandipan Sahu, Raghvendra Kumar, et al. Movie popularity and target audience prediction using the content-based recommender system. IEEE Access, 10:42044–42060, 2022.
18. Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. Proceedings of the 25th Annual International ACM SIGIR Conference, pages 253–260, 2002.
19. Barry Schwartz. The Paradox of Choice: Why More Is Less. Harper Perennial, 2004.
20. Bingqing Sun, Haiping Ma, and Jinhua Guo. Research on personalized recommendation algorithm based on deep learning. IEEE Access, 8:122708–122718, 2020.
21. Rui Wang and Yuliang Shi. Research on application of article recommendation algorithm based on word2vec and tfidf. In 2022 IEEE 4th International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), pages 274–278, 2022.
22. Yuchen Xiao and Ruzhe Zhong. A hybrid recommendation algorithm based on weighted stochastic block model. arXiv preprint arXiv:1905.03192, 2019.
23. Qiang Zhang, Jie Lu, and Yu Jin. Artificial intelligence in recommender systems. Complex & Intelligent Systems, 7:439–457, 2021.
24. Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys, 52(1), 2019.
25. A.Y. Zhubatkhan, Z.A. Buribayev, S.S. Aubakirov, et al. Comparison models of machine learning for movie recommendation systems. News of the National Academy of Sciences of the Republic of Kazakhstan, 1(335):26–31, 2021.