A HYBRID MUSIC RECOMMENDATION SYSTEM: INTEGRATING COLLABORATIVE FILTERING AND AUDIO FEATURE ANALYSIS

Izimov A., Naizabayeva L.K.

28.04.2026

Issue 4(145)

10. Computer science, computer engineering and control

Cite as:

Izimov A., Naizabayeva L.K. A HYBRID MUSIC RECOMMENDATION SYSTEM: INTEGRATING COLLABORATIVE FILTERING AND AUDIO FEATURE ANALYSIS // Universum: Technical Sciences : electronic scientific journal. 2026. 4(145). URL: https://7universum.com/ru/tech/archive/item/22543 (accessed: 28.05.2026).

Received: 08.04.2026

Accepted for publication: 14.04.2026

Published: 28.04.2026

ABSTRACT

This study evaluates a hybrid music recommendation system based on the PinSage graph neural network, combining collaborative filtering with deep audio feature analysis. A bipartite playlist-song graph was constructed from Spotify API data; audio embeddings were extracted using L3-Net, VGGish, and MusicNN. PinSage was compared against node2vec, Personalized PageRank, matrix-based collaborative filtering, and content-based baselines on a song co-occurrence prediction task. Graph-based methods consistently outperformed matrix-based approaches. PinSage delivered balanced results across accuracy and beyond-accuracy metrics. No significant performance degradation was observed in the long-tail catalog segment, suggesting graph representations effectively mitigate data sparsity.

ABSTRACT (translated from the Russian)

This paper investigates a hybrid music recommendation system based on the PinSage graph neural network, which combines collaborative filtering with audio feature analysis. A bipartite playlist-track graph was built from Spotify API data; audio embeddings were extracted with the L3-Net, VGGish, and MusicNN models. PinSage is compared with node2vec, personalized PageRank, matrix-based collaborative filtering, and content-based methods. Graph-based methods consistently outperform matrix-based approaches. No significant loss of quality was found in the long tail of the catalog.

Keywords: music recommendation, collaborative filtering, graph neural networks, PinSage, audio features, hybrid recommender systems.

Introduction

Online music platforms such as Spotify and Apple Music host millions of tracks, making automated recommendation essential. Collaborative filtering (CF), the dominant paradigm, models user–item interactions but degrades under data sparsity and cold-start conditions [1, 3]. Content-based filtering (CBF) analyses audio features directly [7], yet tends to produce less diverse results [6]. Hybrid approaches combining both signals have shown promise [5, 9], and graph neural networks (GNNs) have recently enabled richer joint modelling of relational and content information.

This work adapts PinSage [11], a scalable random-walk graph convolutional network, to the music domain. We evaluate it against CF, graph, and content-based baselines on real Spotify and Last.fm data, with a focus on accuracy, diversity, and long-tail performance.

Materials and methods

Dataset

A bipartite playlist-song graph was built via the Spotify Web API (112,050 songs, 53,092 playlists, 4,368,950 edges). Song co-occurrence pairs from the LFM-1B Last.fm dataset (1,853,537 pairs) serve as training and evaluation labels. The degree distribution follows a power law, reflecting the long-tail structure typical of real music catalogs (Table 1).

Table 1.

Dataset Statistics

Statistic	Playlist-Song Graph	Co-occurrences
Songs / Playlists	112,050 / 53,092	—
Edges	4,368,950	—
Mean / Median song degree	3.9 / 1	—
Positive pairs	—	1,853,537
Mean / Median frequency	—	16.5 / 4
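
To make the degree statistics in Table 1 concrete, the following minimal sketch computes the mean and median song degree from a playlist-song edge list. The toy edge list, the function name `song_degree_stats`, and the choice to count a song once per playlist are illustrative assumptions, not the study's data or code.

```python
from collections import Counter
from statistics import mean, median


def song_degree_stats(edges):
    """Return (mean, median) song degree for (playlist_id, song_id) edges."""
    # Assumption: a song repeated inside one playlist counts as a single edge.
    degree = Counter(song for _, song in set(edges))
    values = list(degree.values())
    return mean(values), median(values)


# Hypothetical toy graph: one popular song and three long-tail songs.
toy_edges = [
    ("p1", "s1"), ("p2", "s1"), ("p3", "s1"), ("p4", "s1"),
    ("p1", "s2"), ("p2", "s3"), ("p3", "s4"),
]
mean_deg, median_deg = song_degree_stats(toy_edges)
print(mean_deg, median_deg)
```

Even in this toy graph the mean exceeds the median, which is the signature of a skewed, long-tailed degree distribution.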

Audio feature extraction

Three pre-trained deep learning models produce 512-dimensional audio embeddings per song: L3-Net (audio–visual self-supervised model), VGGish (VGGNet-based model trained on video tags), and MusicNN (convolutional network for music audio tagging). These embeddings serve as node feature vectors in the graph.

PinSage model

For each node u, the most influential neighbours are identified via personalized PageRank (PPR). Each convolutional layer aggregates the neighbours' features, concatenates the result with u's own embedding, applies ReLU, and L2-normalises the output (Algorithm 1). Configuration: L3-Net embeddings, input/hidden dimension 512, output dimension 128, 2 convolutional layers, neighbourhood size 3, 30 epochs, learning rate 10⁻⁴ with decay 0.95 per epoch, max-margin loss.
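
The convolution step and the training loss described above can be sketched in a few lines. This is a minimal pure-Python illustration that follows the general PinSage formulation [11], not the implementation used in the study: the function names, the two-dimensional toy vectors, the weight matrices `Q` and `W`, the importance weights, and the margin value are all assumptions chosen for readability.

```python
import math


def relu(vec):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vec]


def affine(weight, bias, vec):
    """Return weight @ vec + bias, with weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def convolve(h_u, neighbour_feats, importance, Q, q, W, w):
    """One PinSage-style convolution for node u (illustrative sketch)."""
    # 1. Transform every neighbour feature vector: ReLU(Q h_v + q).
    transformed = [relu(affine(Q, q, h_v)) for h_v in neighbour_feats]
    # 2. Importance pooling: neighbour mean weighted by PPR-style scores.
    total = sum(importance)
    n_u = [sum(a * t[i] for a, t in zip(importance, transformed)) / total
           for i in range(len(q))]
    # 3. Concatenate u's own embedding with the pooled vector, then apply
    #    a dense layer followed by ReLU.
    z = relu(affine(W, w, h_u + n_u))
    # 4. L2-normalise, guarding against an all-zero vector.
    norm = math.sqrt(sum(x * x for x in z)) or 1.0
    return [x / norm for x in z]


def max_margin_loss(z_query, z_pos, z_neg, margin=0.1):
    """Hinge loss pushing the positive pair above the negative by `margin`."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(0.0, dot(z_query, z_neg) - dot(z_query, z_pos) + margin)


# Toy example with neighbourhood size 3 (all values hypothetical).
h_u = [1.0, 0.0]
neighbours = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]]
importance = [0.5, 0.3, 0.2]
Q, q = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W, w = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]
z_u = convolve(h_u, neighbours, importance, Q, q, W, w)
```

In the configuration reported above, the input vectors would be 512-dimensional L3-Net embeddings and the final output 128-dimensional; the toy dimensions here only keep the arithmetic easy to follow.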

Baselines

User-behavior methods: node2vec (unsupervised node embeddings via biased random walks), Personalized PageRank (PPR), Playlist-track CF (matrix factorisation of the playlist-song matrix), Track-track CF (matrix factorisation of the co-occurrence matrix). Content-based methods: L3-Net, VGGish, and MusicNN embeddings used directly for nearest-neighbour retrieval.
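
As an illustration of the PPR baseline, the sketch below scores songs by a random walk with restart on the bipartite graph, computed by power iteration. The restart probability, the iteration count, the toy edges, and the function names are assumptions; the study does not state its PPR settings.

```python
from collections import defaultdict


def personalized_pagerank(edges, source, restart=0.15, iterations=50):
    """Power-iteration PPR from `source` on an undirected bipartite graph.

    Assumes playlist IDs and song IDs never collide.
    """
    adj = defaultdict(set)
    for playlist, song in edges:
        adj[playlist].add(song)
        adj[song].add(playlist)
    scores = {node: 0.0 for node in adj}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            # Spread the non-restarting mass evenly over the neighbours.
            share = (1.0 - restart) * mass / len(adj[node])
            for neighbour in adj[node]:
                nxt[neighbour] += share
        # Total mass is 1, so the restarting mass equals `restart`.
        nxt[source] += restart
        scores = nxt
    return scores


def recommend(edges, query_song, k):
    """Return the k songs with the highest PPR score for `query_song`."""
    scores = personalized_pagerank(edges, query_song)
    songs = {song for _, song in edges} - {query_song}
    return sorted(songs, key=lambda s: scores[s], reverse=True)[:k]


# Hypothetical toy graph: s2 shares two playlists with s1, s3 shares one,
# and s4 is reachable only through s3.
toy_edges = [
    ("p1", "s1"), ("p1", "s2"),
    ("p2", "s1"), ("p2", "s2"), ("p2", "s3"),
    ("p3", "s3"), ("p3", "s4"),
]
top = recommend(toy_edges, "s1", k=3)
```

Songs that share more playlists with the query accumulate more walk mass, so the ranking reflects co-occurrence strength without any training.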

Evaluation framework

The dataset is split 70 % / 30 % for training and testing. Accuracy metrics: Hit-Rate@k (HR@10, HR@100, HR@500) and Mean Reciprocal Rank (MRR).

Beyond-accuracy metrics: intra-list diversity, inter-list diversity, catalog coverage, and mean recommended degree. Long-tail metrics: Low-Degree MRR (queries with graph degree < 2) and Sparse MRR.
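
The metrics listed above can be sketched as follows. Hit-Rate@k and MRR follow their standard definitions. The article does not spell out its beyond-accuracy formulas, so intra-list diversity is assumed here to be the mean pairwise cosine distance between recommended items, and catalog coverage the fraction of the catalog that appears in at least one top-k list. All function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


def mean_reciprocal_rank(ranked_lists, targets):
    """Mean of 1/rank of the target; a missing target contributes 0."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked:
            total += 1.0 / (ranked.index(t) + 1)
    return total / len(targets)


def catalog_coverage(ranked_lists, catalog_size, k):
    """Share of the catalog recommended at least once in a top-k list."""
    recommended = {item for ranked in ranked_lists for item in ranked[:k]}
    return len(recommended) / catalog_size


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def intra_list_diversity(embeddings):
    """Assumed definition: mean pairwise cosine distance within one list."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy evaluation: three queries over a four-item catalog {a, b, c, d}.
ranked_lists = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
targets = ["a", "a", "d"]
hr1 = hit_rate_at_k(ranked_lists, targets, k=1)
hr3 = hit_rate_at_k(ranked_lists, targets, k=3)
mrr = mean_reciprocal_rank(ranked_lists, targets)
coverage = catalog_coverage(ranked_lists, catalog_size=4, k=1)
```

Restricting the query set to low-degree songs before calling `mean_reciprocal_rank` would give a long-tail variant in the spirit of Low-Degree MRR.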

Results and discussion

Graph-based methods outperform matrix CF and content-based baselines on every Hit-Rate metric (Table 2). PPR achieves the highest overall accuracy (average rank 1.2), followed by node2vec (rank 2.2) and PinSage Base (rank 4.2). PinSage Base attains the second-best HR@500 (0.1166), behind only node2vec (0.1254). On MRR, however, the content-based VGGish baseline scores highest (0.0128), indicating that perceptual similarity is effective at placing the target near the top of the list.

Table 2.

Recommendation Accuracy

Method	HR@10	HR@100	HR@500	MRR
Random	0.0001	0.0009	0.0043	0.0011
node2vec	0.0244	0.0789	0.1254	0.0114
PPR	0.0267	0.0805	0.0978	0.0122
Playlist-track CF	0.0195	0.0517	0.0851	0.0092
Track-track CF	0.0131	0.0532	0.0955	0.0068
L3-Net	0.0155	0.0294	0.0536	0.0115
VGGish	0.0171	0.0335	0.0666	0.0128
MusicNN	0.0153	0.0271	0.0510	0.0119
PinSage (Base)	0.0201	0.0570	0.1166	0.0112
PinSage (PPR)	0.0209	0.0546	0.0961	0.0103

Beyond-accuracy metrics reveal trade-offs invisible to accuracy alone (Table 3). The high accuracy of PPR comes with strong popularity bias (mean recommended degree 45.01 versus a graph average of 3.94); node2vec favours obscure tracks (mean degree 2.10); PinSage Base is well balanced (4.71), with high coverage (0.985) and good intra-list diversity (0.798). Track-track CF has the highest intra-list diversity of any non-random method (0.932) but suffers the lowest coverage (0.804) and the worst Sparse MRR, which is consistent with co-occurrence factorisation degrading under sparsity. Contrary to expectations, most methods maintain performance in the long tail, suggesting that graph structure helps to mitigate item sparsity.

Table 3.

Beyond-Accuracy Metrics

Method	Intra-div	Inter-div	Coverage	Mean Deg.
Random	0.988	0.999	1.000	3.94
node2vec	0.771	0.999	0.998	2.10
PPR	0.852	0.869	1.000	45.01
Playlist-track CF	0.834	0.999	0.999	3.97
Track-track CF	0.932	0.960	0.804	9.38
L3-Net	0.371	0.998	0.996	4.22
VGGish	0.588	0.999	0.999	4.10
MusicNN	0.620	0.998	0.998	4.20
PinSage (Base)	0.798	0.995	0.985	4.71
PinSage (PPR)	0.673	0.998	0.999	8.26
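
One way to read the Mean Deg. column is as a ratio to the graph average. The sketch below computes that ratio from the Table 3 values; the name `popularity_ratio` and the use of the Random row (3.94) as the reference are illustrative choices, not a metric defined in the study.

```python
# Mean recommended degree per method, copied from Table 3.
MEAN_RECOMMENDED_DEGREE = {
    "node2vec": 2.10,
    "PPR": 45.01,
    "Playlist-track CF": 3.97,
    "Track-track CF": 9.38,
    "PinSage (Base)": 4.71,
    "PinSage (PPR)": 8.26,
}
GRAPH_AVERAGE_DEGREE = 3.94  # Random row of Table 3


def popularity_ratio(mean_degree, graph_average=GRAPH_AVERAGE_DEGREE):
    """Ratio above 1 means recommendations skew towards popular songs."""
    return mean_degree / graph_average


ratios = {m: popularity_ratio(d) for m, d in MEAN_RECOMMENDED_DEGREE.items()}
for method, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{method:18s} {ratio:5.2f}x")
```

On this reading, PPR recommends songs roughly eleven times better connected than the graph average, node2vec about half as connected, and PinSage Base about 20 % above the average.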

The ablation study confirms that audio embedding quality is critical: replacing L3-Net with MusicNN drops HR@10 from 0.0201 to 0.0017, while random features yield only 0.0035. Adding hard negatives unexpectedly reduced performance (HR@10 of 0.0088), likely due to false negatives in co-occurrence data. Increasing model depth to five layers provided no improvement over the two-layer baseline.

Conclusion

Graph-based methods, particularly unsupervised PPR and node2vec, outperform matrix collaborative filtering across all accuracy metrics, demonstrating that the playlist-song graph encodes rich song similarity information. PinSage, combining graph convolution with L3-Net audio embeddings, delivers balanced performance across accuracy and beyond-accuracy dimensions, avoiding the popularity bias of PPR while remaining competitive. Future work should explore lyrics and artist metadata as additional node features, alternative GNN architectures, and user studies to validate whether beyond-accuracy differences translate to real user satisfaction.

References:

1. Bobadilla J., Ortega F., Hernando A., Gutiérrez A. Recommender systems survey // Knowledge-Based Systems. 2013. Vol. 46. P. 109–132.
2. Hamilton W. L., Ying R., Leskovec J. Inductive representation learning on large graphs // Advances in Neural Information Processing Systems. 2017. Vol. 30. P. 1024–1034.
3. Hu Y., Koren Y., Volinsky C. Collaborative filtering for implicit feedback datasets // Proc. 8th IEEE ICDM. Washington : IEEE, 2008. P. 263–272.
4. Oramas S., Nieto O., Sordo M., Serra X. A deep multimodal approach for cold-start music recommendation // Proc. 2nd Workshop on Deep Learning for Recommender Systems. New York : ACM, 2017. P. 32–37.
5. Schedl M. Current challenges and visions in music recommender systems research // Int. J. Multimed. Inf. Retr. 2019. Vol. 7, No. 2. P. 95–116.
6. Schedl M., Gómez E., Urbano J. Music information retrieval: Recent developments and applications // Found. Trends Inf. Retr. 2014. Vol. 8, No. 2–3. P. 127–261.
7. Tzanetakis G., Cook P. Musical genre classification of audio signals // IEEE Trans. Speech Audio Process. 2002. Vol. 10, No. 5. P. 293–302.
8. Van den Oord A., Dieleman S., Schrauwen B. Deep content-based music recommendation // Advances in Neural Information Processing Systems. 2013. Vol. 26. P. 2643–2651.
9. Wang Y., Wang X., Liu H. Improving content-based and hybrid music recommendation using deep learning // Proc. 22nd ACM Int. Conf. Multimedia. New York : ACM, 2014. P. 627–636.
10. Wang X., He X., Wang M., Feng F., Chua T.-S. Neural graph collaborative filtering // Proc. 42nd ACM SIGIR. New York : ACM, 2019. P. 165–174.
11. Ying R., He R., Chen K., Eksombatchai P., Hamilton W. L., Leskovec J. Graph convolutional neural networks for web-scale recommender systems // Proc. 24th ACM SIGKDD. New York : ACM, 2018. P. 974–983.