Specialist Diploma in Piano Performance (concert performer, teacher, accompanist), Moscow State Tchaikovsky Conservatory; Buenos Aires, Argentina
ARTIFICIAL INTELLIGENCE AS A CO-AUTHOR: PROSPECTS FOR THE DEVELOPMENT OF MUSICAL COMPOSITION
ABSTRACT
The present paper investigates the evolution of musical creativity under the influence of artificial intelligence technologies, which are progressively transforming from a passive instrument into an active co-author. The significance of the study stems from the rapid progress of generative models and the need to integrate them with formal systems in order to produce musical works with complex structure and profound semantic content. The aim of the work is to analyze the prospects of employing artificial intelligence in the role of co-author of musical compositions. The research methodology comprises a comparative analysis of contemporary AI architectures for music generation (transformers, diffusion models) and of systems employing knowledge graphs, such as the solution presented at the Teragraph hackathon. Based on the results, a conceptual model of a hybrid system is described in which the knowledge graph functions as a high-level planner of the compositional structure, while the generative neural network acts as an executor, endowing the piece with rich stylistic and timbral content. The findings indicate that combining these two approaches overcomes the limitations inherent in each individually and provides composers with a powerful tool for guided and purposeful creativity. The material will be of interest to researchers in computer music and artificial intelligence, as well as to practitioners: composers and developers of musical software.
Keywords: artificial intelligence, music generation, knowledge graphs, deep learning, transformers, diffusion models, musical composition, computational creativity, symbolic AI, coevolution of humans and machines.
INTRODUCTION
In an era of rapid digitalization, artificial intelligence is gradually permeating all areas of human activity, including domains traditionally governed by intuition and subjective evaluation, art and creative practice in particular. Musical composition is no exception: the evolution of algorithmic solutions from strictly deterministic systems to modern generative neural networks has shaped a new paradigm in which AI functions not merely as a tool but as a full-fledged co-author of the compositional process. The relevance of the study is determined by two simultaneous developments: the accelerating progress of deep learning models capable of generating high-quality audio content, and the growing awareness of their significant limitations in the long-term structuring of a work and in endowing it with semantic integrity. According to analytical estimates, the AI-in-media market is projected to grow from USD 8.21 billion in 2024 to USD 51.08 billion by 2030, a compound annual growth rate of 35.6% over the forecast period; as the media industry continues to evolve, the integration of advanced technologies such as AI becomes a key factor in transforming processes and achieving efficient results [1]. This growing economic interest is a powerful impetus for research aimed at improving the controllability and quality of generated content.
At the same time, the existing scientific literature reveals a clear divide between two approaches: symbolic AI, which operates on explicit structures and formalized rules (for example, knowledge graphs), and subsymbolic, neural-network AI, which, despite its ability to learn from large-scale datasets, often remains a black box that resists precise control.
The aim of the study is to analyze the prospects of using artificial intelligence as a co-author of musical compositions.
The core contribution of this work is the formulation of a framework for AI-assisted musical composition viewed through the lens of the composer as a creative architect. In this model, the composer's remit expands from the meticulous crafting of individual pitches and rhythms to the deliberate engineering of large-scale musical structures and expressive concepts. What distinguishes the approach is not simply the integration of human and machine into a hybrid system, but the shift toward regarding the AI as a genuine co-author: an intelligent collaborator that realizes the composer's abstract, high-level design intentions as a fully formed auditory piece. By fostering a reciprocal interplay between the artist's strategic vision and the AI's precise materialization, this paradigm transforms the nature of the compositional act.
The author's hypothesis posits that the synergy between the explicit, deterministic control provided by knowledge graphs and the stylistic plasticity of generative models is the key to substantially expanding the creative potential of AI co-authoring systems, overcoming the limitations inherent in each approach on its own.
MATERIALS AND METHODS
In recent years, research in the field of artificial intelligence has encompassed a wide range of applied tasks – from the analysis of market trends to the generation of creative content. Below is an overview of key works, grouped by thematic areas.
- Analysis of the AI Technology Market. A number of authors focus on market aspects and infrastructural solutions of AI. For example, the report "AI In Media Market" provides a detailed analysis of the current state and development prospects of AI in the media sector, including growth forecasts, key players, and market entry barriers [1]. Polese M. et al. [3] examine the architecture of open radio access networks (O-RAN), describing in detail its components, interfaces, optimization algorithms, and security issues, and formulating the main research challenges in this area.
- Graph-based Recommender Systems. An independent direction is the use of graph knowledge bases for building recommender systems. Chicaiza J., Valdiviezo-Diaz P. [2] conduct a comprehensive review of existing technologies, synthesize methods for integrating semantic information and deploying scalable solutions, and also analyze the contribution of each approach to improving recommendation accuracy.
- Generation and Processing of Musical Content. Among applied AI tasks, music generation attracts significant interest. Ferreira P., Limongi R., Fávero L. P. [4] demonstrate the application of deep neural networks to symbolic composition, comparing various architectures (RNN, VAE, GAN) and evaluating the quality and novelty of the generated melodies. Copet J. et al. [6] propose a simple and controllable approach to generating musical segments that allows stylistic parameters to be specified and keeps the model interpretable. Hsu J. L., Chang S. J. [8] focus on transitions between composition fragments using transformer models, enabling the generation of smooth musical transitions across genres. Yuan R. et al. [7] present Marble, a benchmark for musical audio representations that serves as a universal tool for evaluating audio encoders and generative models. Reference [9] describes a modular microservices approach to constructing a graph cloud (Cloud Teragraph), wherein each service is responsible for a distinct stage, from audio data acquisition and preprocessing to visualization of a de Bruijn graph of a musical work in a web interface. The authors examine hardware requirements (GPU nodes for computation acceleration, distributed storage in NoSQL clusters) and software architecture (service containerization via Docker, orchestration with Kubernetes, RESTful APIs for module interaction). The provided visualization example highlights the necessity of optimizing the rendering of large graphs containing millions of edges, proposing a level-of-detail (LOD) strategy and incremental data loading in the client application [9]. Source [10] presents a complete processing pipeline for a musical fragment: from extraction of spectrograms and Mel-frequency cepstral coefficient (MFCC) features to construction of a transition graph between acoustic templates, where nodes represent stable musical patterns and edges denote probabilistic transitions in the melody (a minimal sketch of such a pipeline is given at the end of this review). Convolutional and recurrent neural networks within the PyTorch framework are employed for model training, and the authors demonstrate how the resulting graph can serve as the foundation for recommendation systems and automatic harmonization algorithms [10].
- Image Generation. One of the advanced methods in visual generation is latent diffusion modeling. Rombach R. et al. [5] demonstrate the capabilities of high-quality image synthesis at resolutions of up to several megapixels, describing model architectures, training procedures and methods for controlling generation via conditional signals.
Thus, the AI literature encompasses business analytics and infrastructural solutions as well as emerging methods for creating multimedia content. Among the contradictions in the literature is the uncertainty in assessing the economic efficiency of AI adoption: market reports offer optimistic forecasts [1], whereas practical studies point to significant integration costs and security risks [3]. In recommender systems, there is a gap between the theoretical accuracy of graph models and their scalability in real-world conditions [2]. Creative AI applications demonstrate high content quality, but the interpretability and controllability of generation remain insufficiently addressed [4, 6, 8]. Despite progress in image synthesis, the adaptation of diffusion models to specialized domains and small datasets also remains an open problem [5]. Finally, the audience's subjective perception of AI-generated content and methods of copyright protection in such systems are still understudied.
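To make the pipeline described in [10] concrete, the following minimal sketch extracts MFCC features, quantizes the frames into a small vocabulary of acoustic templates via clustering, and builds a weighted transition graph. The file name, the number of states, and the use of k-means are illustrative assumptions; the neural-network training stage of the cited work is omitted here.

```python
# A minimal sketch of an MFCC-to-transition-graph pipeline, in the spirit
# of [10]. The file name and parameter values are illustrative assumptions.
import librosa
import networkx as nx
from sklearn.cluster import KMeans

def build_transition_graph(path: str, n_states: int = 16) -> nx.DiGraph:
    # Load audio and compute MFCC features (one column per analysis frame).
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

    # Quantize frames into a small vocabulary of acoustic templates:
    # each cluster stands for one stable pattern (a graph node).
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(mfcc.T)

    # Count transitions between consecutive frame labels and normalize
    # them into edge probabilities, as in a first-order Markov model.
    G = nx.DiGraph()
    for a, b in zip(labels[:-1], labels[1:]):
        w = G.edges[a, b]["count"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(int(a), int(b), count=w)
    for node in G.nodes:
        total = sum(G.edges[node, nbr]["count"] for nbr in G.successors(node))
        for nbr in G.successors(node):
            G.edges[node, nbr]["prob"] = G.edges[node, nbr]["count"] / total
    return G

graph = build_transition_graph("fragment.wav")  # hypothetical input file
print(graph.number_of_nodes(), "states,", graph.number_of_edges(), "transitions")
```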
RESULTS AND DISCUSSION
The contribution of the present work lies in describing a conceptual model of a hybrid music composition system that combines the strict controllability of symbolic, graph-based knowledge structures with the adaptive expressivity of deep generative neural networks. The proposed architecture is designed to bridge the identified gap between symbolic and subsymbolic representations of musical material. Its foundation is a two-level organization of components: the Planner and the Realizer.
The Planner relies on an expanded knowledge graph intended not for direct synthesis of a note sequence but for elaborating a high-level skeleton of the composition. Similar to the solution applied in the Teragraph project, this graph accumulates the fundamental principles of music theory (for example, rules of voice leading and functional tonal gravitation) as well as the stylistic specifics of various genres and authorial techniques. At this stage, the user specifies key parameters: the form of the piece (for example, sonata form, rondo, or ABA), the tonal plan, the desired emotional dynamics, the tempo, and the metric pattern. Based on these inputs, the Planner generates an abstract representation of the composition, expressed as a tokenized sequence or a specialized structured format describing the harmonic progression, the rhythmic contours of the main themes, and the formal sections with their durations; a sketch of such a structured format is given below. The demonstration of such a graph-based planner at the hackathon showed that a hierarchy from motivic sketch to full form provides logical coherence and integrity of the musical constructs [9].
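As an illustration only, the structured plan emitted by the Planner could look like the following Python sketch; the field names and example values are hypothetical and are not a format prescribed by the Teragraph project.

```python
# A hypothetical composition-skeleton format for the Planner's output.
from dataclasses import dataclass, field

@dataclass
class Section:
    label: str            # formal section, e.g. "A", "B", "development"
    bars: int             # duration in bars
    harmony: list[str]    # harmonic progression in roman-numeral notation
    dynamics: str         # target emotional/dynamic contour

@dataclass
class CompositionPlan:
    form: str             # e.g. "ABA", "rondo", "sonata"
    key: str              # home key of the tonal plan
    tempo_bpm: int
    meter: str
    sections: list[Section] = field(default_factory=list)

plan = CompositionPlan(
    form="ABA", key="d minor", tempo_bpm=96, meter="4/4",
    sections=[
        Section("A", 16, ["i", "iv", "V", "i"], "calm, building"),
        Section("B", 8,  ["VI", "III", "V/V", "V"], "agitated climax"),
        Section("A", 16, ["i", "iv", "V", "i"], "resolution"),
    ],
)
```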
Participation as an organizer and theoretical consultant in the Teragraph hackathon provided an opportunity to evaluate a practical implementation of a graph-based Planner grounded in de Bruijn methods. This tool demonstrated genuine potential to sharpen the perception of musical architecture: by integrating rigorous music-theoretical criteria, it became possible to observe how a graph-based analytical engine enhances awareness of structural coherence even in complex polyphonic compositions.
In the experiment, a MIDI recording of Bach's Toccata and Fugue in D minor served as input, a work renowned for its elaborate contrapuntal interweaving and harmonic tension. Rather than merely interpreting pitch events, the system abstracted the score into a sequence of significant harmonic states, each representing a concise progression of tonal ideas. This filtering phase, informed by compositional intuition, separated stable structural harmonies from transient ornamental figures.
In the resulting de Bruijn graph, each node corresponded to a compact "phrase," while directed edges captured transitions from one idea to the next. When Newman's community detection algorithm was applied, the network naturally divided into colored clusters that aligned precisely with the fugue's principal themes and motifs, thereby visualizing the composer's cognitive pathways (a minimal sketch of this analysis step follows).
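The sketch below builds a de Bruijn-style graph over a symbolic sequence of harmonic states and partitions it with the Clauset-Newman-Moore greedy modularity method, networkx's implementation of Newman-family modularity clustering. The input sequence is a hypothetical stand-in for states extracted from the MIDI score.

```python
# A minimal sketch: order-2 de Bruijn graph over harmonic states plus
# modularity-based community detection. The state list is illustrative.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

states = ["i", "iv", "V", "i", "VI", "iv", "V", "i", "iv", "V", "VI", "V", "i"]

# de Bruijn construction: nodes are overlapping pairs of consecutive
# states ("phrases"); edges connect pairs adjacent in the sequence.
k = 2
grams = [tuple(states[i:i + k]) for i in range(len(states) - k + 1)]
G = nx.DiGraph()
for a, b in zip(grams[:-1], grams[1:]):
    G.add_edge(a, b)

# Modularity clustering works on undirected graphs, so project first.
communities = greedy_modularity_communities(G.to_undirected())
for idx, com in enumerate(communities):
    print(f"cluster {idx}: {sorted(com)}")
```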
The graph-centric representation transcends linear notation, offering researchers and composers novel capabilities. First, it enables in-depth analysis of thematic material by identifying the moments where key ideas are introduced, transformed, and interwoven. Second, the approach functions as a generative foundation: by traversing alternative routes within the graph, new compositions can be created that preserve the original's structural logic (a random-walk sketch of this idea is given below). Furthermore, this methodological framework underlies an interactive Planner that allows high-level structural contours to be specified, linking thematic blocks and defining their hierarchical relationships, while the artificial intelligence manages the detailed realization of the notated material.
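The generative traversal mentioned above can be sketched as a weighted random walk; the code assumes a directed graph whose edges carry transition probabilities under the key "prob", as in the earlier pipeline sketch.

```python
# A minimal sketch of generative traversal: a weighted random walk over a
# transition graph yields a new state sequence that stays within the
# original's structural logic. Edge "prob" weights are assumed to exist.
import random
import networkx as nx

def random_walk(G: nx.DiGraph, start, length: int = 32) -> list:
    path, node = [start], start
    for _ in range(length - 1):
        nbrs = list(G.successors(node))
        if not nbrs:                 # dead end: stop the walk early
            break
        probs = [G.edges[node, n].get("prob", 1.0) for n in nbrs]
        node = random.choices(nbrs, weights=probs, k=1)[0]
        path.append(node)
    return path
```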
Thus, the Teragraph hackathon experience demonstrated that the graph-oriented Planner constitutes not merely a technical module but a new cognitive environment, uniting abstract compositional intent with its concrete realization and elevating the composer to the role of sonic architect [9, 10].
A simplified diagram of the Planner architecture is shown in Figure 1.
Figure 1. Conceptual diagram of the Planner module based on the knowledge graph [2, 3]
The Realizer module embodies a modern generative neural-network architecture (for example, transformer- or diffusion-based) adapted to the task of fleshing out the framework already designed by the Planner. In the proposed configuration, the network does not generate musical material from scratch or from a concise textual description; instead it receives the detailed skeleton of the composition as a complex conditioning signal (a sketch of how such a skeleton could be serialized into conditioning tokens is given below). This separation of roles guarantees that the structure of the piece is not conceived inside the model: the model only concretizes it, selecting harmonic elements, developing melodic ornamentation, arranging the instrumental parts, and finally generating an audio file with a rich timbral palette.
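How the skeleton reaches the Realizer is an open design question; one hypothetical option, sketched below, flattens the plan into a prefix of control tokens for an autoregressive model. The token vocabulary is invented for illustration and reuses the hypothetical CompositionPlan format sketched earlier; no real model API is invoked.

```python
# A hypothetical serialization of the Planner's skeleton into conditioning
# tokens for an autoregressive Realizer. Vocabulary is illustrative only.
def plan_to_tokens(plan) -> list[str]:
    tokens = [f"<form={plan.form}>", f"<key={plan.key}>",
              f"<tempo={plan.tempo_bpm}>", f"<meter={plan.meter}>"]
    for sec in plan.sections:
        tokens.append(f"<section={sec.label} bars={sec.bars}>")
        tokens.extend(f"<chord={c}>" for c in sec.harmony)
        tokens.append(f"<mood={sec.dynamics}>")
    tokens.append("<generate>")
    return tokens
```

Under this design, the token prefix is prepended to the Realizer's input, so the network only concretizes the material instead of inventing the structure.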
An especially suitable candidate for this role is the GETMusic family of networks [5], capable of stepwise in-painting: first a harmonic foundation is formed, then a bass line is laid down, and melodic contours emerge afterwards, all strictly in accordance with the scenario specified by the Planner. A comparison of the functional capabilities of key generative models is presented in Table 1.
Table 1.
Comparative analysis of the architectures of generative AI models in the context of their applicability as a "Realizer" [4, 5]
| Parameter | Music Transformer [8] | GETMusic [5] | MusicGen [6] |
| --- | --- | --- | --- |
| Input type | Symbolic sequence (MIDI-like) | One or more tracks (symbolic) | Text description, motif sample |
| Output type | Symbolic sequence | Full multitrack composition (symbolic) | Stereo audio file |
| Core architecture | Transformer | Diffusion model (GETDiff) | Transformer (autoregressive) + EnCodec |
| Key advantage | Captures long-term dependencies in symbols | Track filling/extension, generation from mixed sources | Intuitive control via text, high audio quality |
| Limitation | Requires subsequent rendering (synthesis) | Harmony can be hard to control precisely without strong conditioning | Weak control over precise musical form and structure |
| Applicability in hybrid model | High (generating symbolic content according to the plan) | Very high (arrangement and rendering according to the structural plan) | Medium (generating textures from a general section description) |
For quantitative substantiation of the proposed improvements, it is advisable to employ metrics from studies on musical content analysis [7, 8]. For instance, long-range coherence can be evaluated via a self-similarity matrix computed over the generated fragment, and adherence to the harmonic outline can be quantified as the proportion of notes that fit the specified chord grid; a minimal sketch of both metrics follows.
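The sketch below assumes the generated fragment has been reduced to per-bar feature vectors (for example, mean chroma) and to aligned lists of note pitch classes and prescribed chord tones; the variable names and feature choices are illustrative assumptions.

```python
# A minimal sketch of the two evaluation metrics described above.
import numpy as np

def self_similarity(features: np.ndarray) -> np.ndarray:
    # Cosine self-similarity matrix over bar-level feature vectors;
    # diagonal stripes indicate long-range repetition and coherence.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-9, None)
    return unit @ unit.T

def harmonic_adherence(notes, chord_grid) -> float:
    # Share of notes whose pitch class belongs to the chord prescribed
    # by the Planner at that moment (the "chord grid").
    hits = sum(1 for pc, chord in zip(notes, chord_grid) if pc in chord)
    return hits / max(len(notes), 1)

bars = np.random.rand(32, 12)                              # stand-in chroma features
print(self_similarity(bars).shape)                         # (32, 32)
print(harmonic_adherence([0, 4, 7, 2], [{0, 4, 7}] * 4))   # 0.75
```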
Thus, the proposed hybrid architecture has the potential to serve as the foundation for a new wave of compositional systems. This platform does not displace the creator but functions as an intellectual co-author, automatically assuming both the routine design of formal structure and the complex tasks of generating sound textures. As a result, the composer is empowered to operate not with individual sonic events but with large-scale semantic and formal constructs, transforming the creative process into a dialogue between authorial intent and computational creativity. The generalized interaction scheme of the hybrid system components is presented in Figure 2.
Figure 2. Architecture of a hybrid AI-coauthorship system in music [4, 5, 6]
However, the implementation of these systems provokes profound changes in the musical landscape. First and foremost, they radically expand access to music creation, removing traditional barriers for those who lack fundamental theoretical knowledge and professional experience. Furthermore, professional composers obtain a powerful tool for rapid prototyping of ideas and for experimenting with new stylistic solutions. Finally, such technologies open up prospects for the dynamic adaptation of musical accompaniment in interactive media (video games, film, installations), where the sound environment transforms in real time depending on user actions and the prescribed dramaturgical structure.
In summary, the hybrid approach, combining symbolic planning based on knowledge graphs with subsymbolic realization through deep generative models, represents a qualitative step forward in AI composition. It mitigates the limitations of existing methods and establishes a robust technological foundation for genuine human-machine partnership, in which strict algorithmic control coexists with rich expressive and stylistic possibilities.
CONCLUSION
The conducted study enabled a comprehensive assessment of the current state and prospects of employing artificial intelligence methods as a co-author of a musical work. Within the framework of the study, two leading paradigms — symbolic systems relying on knowledge graphs, and subsymbolic approaches employing deep generative models — were juxtaposed, revealing their complementary strengths and weaknesses.
The introduction of a symbolic planning layer based on graph structures gives generated compositions the long-term structural coherence and semantic elaboration often lacking in end-to-end neural models. Simultaneously, the use of modern generative mechanisms, such as transformers and diffusion networks, as the Realizer enriches the formal framework of the work with the deep timbral, textural, and stylistic detail inaccessible to purely symbolic algorithms. The proposed hypothesis of synergistic interaction between the two paradigms is supported by the conceptual model, which not only addresses existing technical limitations but also reframes human-machine interaction from the use of a utilitarian tool to a partnership of equals.
Prospects for further work include the practical implementation of the Planner-Realizer architecture, the creation of unified data exchange formats between its components, and the development of intuitive interfaces for managing such systems. The implementation of these technologies has the potential to fundamentally change the music industry, opening new creative possibilities for professionals and enthusiasts alike and solidifying the role of the composer as a strategic visionary in a creative partnership with artificial intelligence.
References:
- MarketsandMarkets. AI in media market. Retrieved from https://www.marketsandmarkets.com/Market-Reports/ai-in-media-market-213984142.html (accessed 20.06.2025).
- Chicaiza, J., & Valdiviezo-Diaz, P. (2021). A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions. Information, 12(6), 1–23. https://doi.org/10.3390/info12060232
- Polese, M., et al. (2023). Understanding O-RAN: Architecture, interfaces, algorithms, security, and research challenges. IEEE Communications Surveys & Tutorials, 25(2), 1376–1411. https://doi.org/10.1109/COMST.2023.3239220
- Ferreira, P., Limongi, R., & Fávero, L. P. (2023). Generating music with data: Application of deep learning models for symbolic music composition. Applied Sciences, 13(7), 1–19. https://doi.org/10.3390/app13074543
- Rombach, R., et al. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.
- Copet, J., et al. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704–47720.
- Yuan, R., et al. (2023). Marble: Music audio representation benchmark for universal evaluation. Advances in Neural Information Processing Systems, 36, 39626–39647.
- Hsu, J. L., & Chang, S. J. (2021). Generating music transition by using a transformer-based model. Electronics, 10(18), 1–18. https://doi.org/10.3390/electronics10182276
- The Cloud Teragraph: Software and hardware architecture. Retrieved from https://alexbmstu.github.io/2023/#:~:text=4.6.2.-,Пример%20визуализации%20графа%20деБрюйна%20музыкального%20произведения,-Данный%20пример%20использует (accessed 08.07.2025).
- Processing of a piece of music. Retrieved from https://latex.bmstu.ru/gitlab/hackathon/ex8/-/blob/main/lab6.ipynb?ref_type=heads (accessed 08.07.2025).