Head and Lead Developer of Machine Learning Department, BrainShells, Samara, Russia
THE QUERY-VIDEO-DESCRIPTION FRAMEWORK: ARCHITECTURE, DATA PROCESSING PIPELINE, AND TEXT GENERATION QUALITY ASSESSMENT
ABSTRACT
This paper presents a comprehensive framework for automatic generation of text descriptions for video content considering user queries. The decentralized system architecture is described, including modules for data collection and validation, multimodal embedding generation, description creation and filtering. Special attention is paid to generation quality assessment mechanisms, including lexical, semantic and embedding metrics. Results of experimental framework validation on various datasets are presented, demonstrating the effectiveness of the proposed approach. Development prospects of the system are described, including integration of new language models and feedback mechanisms. The framework can be applied in a wide range of practical tasks: from improving video content accessibility to creating intelligent assistants and automatic annotation systems.
АННОТАЦИЯ
В статье представлен комплексный фреймворк для автоматической генерации текстовых описаний видеоконтента с учетом пользовательских запросов. Описана децентрализованная архитектура системы, включающая модули сбора и валидации данных, генерации мультимодальных эмбеддингов, создания и фильтрации описаний. Особое внимание уделено механизмам оценки качества генерации, включая лексические, семантические и эмбеддинговые метрики. Представлены результаты экспериментальной валидации фреймворка на различных датасетах, демонстрирующие эффективность предложенного подхода. Описаны перспективы развития системы, включая интеграцию новых языковых моделей и механизмов обратной связи. Фреймворк может применяться в широком спектре практических задач: от повышения доступности видеоконтента до создания интеллектуальных ассистентов и систем автоматического аннотирования.
Keywords: video description generation, multimodal embeddings, text quality assessment, vision-language models, retrieval-based generation.
Ключевые слова: генерация описаний видео, мультимодальные эмбеддинги, оценка качества текста, vision-language models, retrieval-based генерация.
Introduction
In recent decades, there has been exponential growth in video data volumes, which has led to the emergence of new tasks at the intersection of computer vision, natural language processing, and multimodal machine learning. Automatic generation of text descriptions for videos has become a key problem of both fundamental and applied importance. Historically, the first systems focused on generating captions for static images, but with the growth of computing power and the emergence of large datasets, the research community has moved on to more complex tasks: describing dynamic scenes, events, actions, and interactions in a video stream.
Modern applications require not only accurate recognition of objects and actions, but also deep semantic analysis, consideration of context, temporal relationships, and adaptation to user queries. This is especially relevant for areas such as educational platforms, intelligent assistants, automated annotation systems, media content search and filtering, and assistive technologies for people with visual impairments.
The history of the development of video description generation methods includes several stages: classical methods based on manual features and templates; transition to neural network architectures (CNN+RNN for images, LSTM/GRU for video); introduction of attention mechanisms and transformers; emergence of multimodal models (Vision-Language Models, VLM) capable of integrating visual, audio, and text data; development of retrieval-based and hybrid architectures; integration of user queries (query-conditioned generation).
In recent years, special attention has been paid to scalability, distributed processing, automation of data collection and validation, as well as objective assessment of the quality of generated descriptions. Leading research groups and industrial laboratories are developing complex frameworks that can work with millions of videos, support flexible architecture, integration with external services, and extensibility for new tasks.
This paper presents an extended overview and practical implementation of a framework for generating text descriptions for videos taking into account user queries. Particular attention is paid to the architecture, processing pipeline, quality metrics, experimental results, and the specifics of building reliable, scalable, and adaptive systems [1].
Materials and methods
Architecture and modular organization of the framework
The framework for generating text descriptions for videos is implemented as a modular, scalable and decentralized system designed for automated collection, validation, storage and processing of video data. The architecture is built on the principles of microservices and supports integration with external services and research pipelines (see Figure 1).
![](/Isaev.files/image001.png)
Figure 1. Decentralized architecture of the "Query-Video-Description" framework
Key components of the system:
- Decentralized data collection network (miners): automated agents that search, download and pre-process video content from public platforms. Each miner accompanies the video with metadata (titles, tags, descriptions) and ensures the diversity and novelty of the data.
- Generation of multimodal embeddings: using modern models (e.g. ImageBind), latent representations are extracted from video, audio and text modalities, which serve as the basis for further generation and search.
- Validation and Quality Assurance (Validators): A dedicated group of participants responsible for checking for relevance, novelty, and richness of detail, as well as recalculating embeddings and comparing with existing records. Validation includes both automated and manual checks.
- Incentive and Scoring Mechanisms: Algorithms are implemented to reward miners for contributing high-quality, novel, and diverse video data. The system dynamically adjusts thresholds and limits to maintain the quality and efficiency of the network.
- Data Aggregation and Storage: Validated videos are aggregated and stored in a central repository with support for integration into public datasets and research platforms. The architecture is designed to scale to millions of video clips.
- Task-based Video Marketplace: Users can record themselves performing specific tasks and submit recordings for validation and potential purchase, which incentivizes the creation of valuable and unique data.
- Security and reliability: proxy rotation, status monitoring, automatic recovery from failures, resource cleaning, and data validation mechanisms are implemented [2; 3].
Interaction between components is implemented through API and asynchronous task queues, which ensures flexibility and resilience to failures. The system supports integration with external services (YouTube API, cloud storage, MLflow, W&B, DVC) and can be deployed both in the cloud and on-premise.
Particular attention is paid to security, access control, data anonymization and compliance with ethical standards when collecting and processing video content. Modern logging, alerting and statistics collection tools are used for monitoring and optimization.
The Writer module (see Figure 2) is a comprehensive offline pipeline for automated collection, processing, and storage of video content. The architecture is focused on scalable processing of large volumes of data while ensuring high-quality results. The system is based on the integration of modern machine learning methods, efficient resource management, and strict quality control at each stage of the video life cycle — from search to final storage.
![](/Isaev.files/image002.png)
Figure 2. Writer module architecture
Main components of the module:
1. OfflineDBWriter class
The OfflineDBWriter class acts as the central orchestrator of the entire video processing pipeline. It coordinates operations between all system components and manages data flows, processing state, and the distribution of computing resources. An important feature is support for complex state management, which makes it possible to track progress, handle failures, and maintain consistency across distributed computations. Resource pools and dynamic load distribution are implemented to optimize performance.
The class provides a full set of methods for searching, loading, analyzing, and storing video, including error handling and retries, which is critical for reliable operation in a production environment.
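As a rough illustration, the orchestration logic described above can be sketched as follows. This is a minimal skeleton, not the actual implementation: the class layout, method names, and collaborator objects (searcher, downloader, embedder, databases) are hypothetical.

```python
import logging
import time

logger = logging.getLogger("offline_db_writer")


class OfflineDBWriter:
    """Hypothetical skeleton of the pipeline orchestrator described above."""

    def __init__(self, searcher, downloader, embedder, main_db, vector_db,
                 max_retries: int = 3):
        self.searcher = searcher        # YouTube search wrapper
        self.downloader = downloader    # proxy-aware video downloader
        self.embedder = embedder        # ImageBind embedding generator
        self.main_db = main_db          # long-term metadata storage
        self.vector_db = vector_db      # temporary vector DB for similarity checks
        self.max_retries = max_retries

    def process_query(self, query: str) -> None:
        """Search, download, embed, and store videos for one search query."""
        for video in self.searcher.search(query):
            if self.main_db.is_processed(video.id):
                continue                # skip already processed videos
            self._with_retries(self._process_video, video)

    def _process_video(self, video) -> None:
        path = self.downloader.download(video)
        clips = self.embedder.split_into_clips(path)      # 6-second clips
        embeddings = self.embedder.embed_clips(clips)      # batched on GPU
        unique = self.vector_db.filter_unique(clips, embeddings)
        self.main_db.store(video, unique)

    def _with_retries(self, fn, *args) -> None:
        """Retry a step a few times before giving up, logging each failure."""
        for attempt in range(1, self.max_retries + 1):
            try:
                fn(*args)
                return
            except Exception as exc:    # keep the pipeline alive on per-video errors
                logger.warning("attempt %d/%d failed: %s",
                               attempt, self.max_retries, exc)
                time.sleep(2 ** attempt)    # exponential backoff
```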
2. Database Management
The system supports two types of databases:
- Main Database — designed for long-term storage of metadata of processed videos, embeddings and processing status information. Optimized for high-performance recording and searching.
- Temporary Vector Database — used to store and compare embeddings for similarity checking, duplicate detection and content filtering. The database is periodically cleaned and optimized to maintain performance.
A detailed record of processed requests and video IDs is maintained, which prevents re-processing and ensures full coverage. Database operations are optimized for batch processing: batch inserts, updates and similarity checks are implemented, which significantly increases the system throughput.
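A minimal sketch of how such batched bookkeeping could look with SQLite (the paper does not specify the storage engine; the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect("main_db.sqlite")
conn.execute(
    """CREATE TABLE IF NOT EXISTS processed_videos (
           video_id     TEXT PRIMARY KEY,
           query        TEXT,
           status       TEXT,
           processed_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def mark_processed(rows):
    """Batch-insert processed video records; duplicates are ignored,
    which prevents re-processing of already seen video IDs."""
    conn.executemany(
        "INSERT OR IGNORE INTO processed_videos (video_id, query, status) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

mark_processed([("abc123", "street food tour", "done"),
                ("def456", "street food tour", "done")])
```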
3. Video Processing Pipeline
The pipeline implements a robust video search system using the YouTube API, intelligent query management, and proxy-based download capabilities. Videos are automatically split into 6-second clips using precise frame extraction and processing algorithms. To maximize throughput, clips are processed in parallel.
Each clip is analyzed frame by frame using the ImageBind model to generate high-quality embeddings. GPU and batch processing are used to speed up generation. Multi-stage quality control is applied at the filtering stage: uniqueness check, relevance assessment, thematic filtering. Configurable quality thresholds are supported for each stage.
4. Key features and parameters
- Video search and collection: Integration with YouTube API with speed limiting, quota management, intelligent rotation and query optimization.
- Query management and rotation: Dynamic search query pool management system, performance tracking and adaptive rotation.
- Batch processing: All database and video processing operations are optimized for batch mode, which allows for efficient processing of large amounts of data.
- Parallel processing: Using multiprocessing for video cutting and embedding generation, load balancing between CPU and GPU.
- Caching and temporary storage: Modern caching system with automatic management and cleaning.
- Quality control: Multi-stage filtering, uniqueness check, subject relevance, adaptive quality thresholds.
5. Technical implementation and integrations
- Video ingest with proxy management: Robust system with automatic switching and load balancing, error handling and retries.
- Clip and frame extraction: Parallel pipelines for efficient extraction, precise frame alignment, audio and video synchronization.
- Embedding generation: GPU-accelerated processing, batch generation, memory optimization.
- Content filtering: Embedding uniqueness check, thematic filtering, detection of unnecessary content (ads, watermarks, low-quality segments).
- Error handling and logging: Comprehensive logging system, automatic log rotation and archiving, detailed error reporting, performance monitoring.
- Security and reliability: Secure proxy management, automatic failure recovery, data validation, efficient resource management [4].
6. Key Parameters and Limitations
- MAIN_DB_OPTIMAL_SIZE: 20,000,000 — Optimal size of the main database.
- N_VIDEOS_SEARCH_LIMIT: 30 — Search limit to prevent API quota exhaustion.
- N_VIDEOS_PER_QUERY: 10 — Optimizes query efficiency.
- VIDEO_CLIP_DURATION_SEC: 6 — Clip duration for optimal quality and efficiency.
- UNIQUENESS_THRESHOLD: 0.12 — Uniqueness threshold for content filtering.
- TOTAL_SCORE_THRESHOLD: 0.3 — Threshold on the composite quality and relevance score.
- BATCH_SIZE: 8 — Batch size to optimize throughput.
7. Integrations
- YouTube API — integration with rate limiting, quota management, and query optimization.
- Proxy management system — status monitoring, automatic rotation.
- Vector database — efficient similarity search, optimized indexes.
- ImageBind model — GPU-accelerated embedding generation, batch processing.
Data collection and preparation
The system (see Figure 3) is built around a decentralized network of miners that perform automated collection of video content from public platforms based on pre-defined or dynamically generated search queries. Each miner accompanies the video with metadata: titles, tags, descriptions, timestamps, as well as additional information necessary for subsequent filtering and analysis.
![](/Isaev.files/image003.jpg)
Figure 3. Architecture of the distributed system for collecting, validating and storing video data
Database management and storage
The system supports two types of databases: a main database for long-term storage of metadata of processed videos, embeddings, and processing status information, optimized for high-performance write and search operations; and a temporary vector database for storing and comparing embeddings for similarity checking, duplicate detection, and content filtering. The temporary database is periodically cleared and optimized to maintain performance.
The system keeps detailed records of processed queries and video identifiers to prevent re-processing and ensure full coverage. Timestamps and processing status are tracked for each element. Database operations are optimized for batch processing: batch inserts, updates, and similarity checks are implemented to maximize throughput.
Detailed data preparation steps
1. Formation and management of a pool of search queries
Queries are loaded from a text file (SEARCHING_QUERIES_FILE_PATH: 'youtube_queries.txt'), undergo intelligent rotation and adaptation based on performance statistics. The dynamic query management system maintains a pool of search queries and implements intelligent rotation strategies to collect diverse content. The system tracks query performance and adjusts rotation based on statistics, implementing adaptive strategies for selecting the next query based on performance metrics.
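One possible implementation of performance-weighted query rotation is sketched below. The actual selection strategy and statistics are not disclosed in the paper; the acceptance-rate weighting here is illustrative.

```python
import random
from collections import defaultdict


class QueryRotator:
    """Keeps a pool of search queries and favours those that recently
    yielded more accepted (validated) clips."""

    def __init__(self, queries_file: str = "youtube_queries.txt"):
        with open(queries_file, encoding="utf-8") as f:
            self.queries = [line.strip() for line in f if line.strip()]
        self.accepted = defaultdict(int)   # accepted clips per query
        self.attempts = defaultdict(int)   # search attempts per query

    def next_query(self) -> str:
        # Laplace-smoothed acceptance rate acts as a sampling weight,
        # so unexplored queries still get a chance to be picked.
        weights = [(self.accepted[q] + 1) / (self.attempts[q] + 2)
                   for q in self.queries]
        return random.choices(self.queries, weights=weights, k=1)[0]

    def report(self, query: str, accepted_clips: int) -> None:
        """Update per-query statistics after a search/validation round."""
        self.attempts[query] += 1
        self.accepted[query] += accepted_clips
```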
2. Search and download videos via YouTube API
The system implements advanced integration with the YouTube API with speed limiting (N_VIDEOS_SEARCH_LIMIT: 30) and quota management to ensure reliable search. Enabled intelligent rotation and query optimization to improve the quality of search results. The N_VIDEOS_PER_QUERY: 10 parameter optimizes query efficiency and result quality.
The download system uses a pool of rotating proxies with automatic switching and load balancing. Implemented proxy health monitoring and automatic rotation to maintain download reliability. Enabled error handling and retry mechanisms for reliable video downloads, timeout management, and recovery from proxy failures.
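Assuming a standard downloader such as yt-dlp (the paper does not name the tool), proxy rotation with retries could look roughly like this; the proxy pool is, of course, a placeholder:

```python
import itertools

from yt_dlp import YoutubeDL
from yt_dlp.utils import DownloadError

PROXIES = ["http://proxy1:3128", "http://proxy2:3128"]  # illustrative pool
proxy_cycle = itertools.cycle(PROXIES)

def download_video(url: str, out_dir: str = "downloads",
                   max_attempts: int = 3) -> bool:
    """Try to download a video, switching to the next proxy after each failure."""
    for _ in range(max_attempts):
        opts = {
            "proxy": next(proxy_cycle),
            "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
            "format": "best[ext=mp4]/best",
            "socket_timeout": 30,
            "quiet": True,
        }
        try:
            with YoutubeDL(opts) as ydl:
                ydl.download([url])
            return True
        except DownloadError:
            continue        # rotate to the next proxy and retry
    return False
```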
3. Video splitting into short clips
Videos are automatically split into 6-second segments (VIDEO_CLIP_DURATION_SEC: 6) using precise frame extraction and processing algorithms. The system uses precise video segmentation algorithms to create homogeneous clips with correct frame alignment. Quality checks are included to ensure clip integrity and audio/video synchronization. Parallel clip processing with efficient resource allocation and load balancing is used to maximize throughput.
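Assuming FFmpeg is used for the cutting step (the paper does not name the tool), splitting a video into uniform 6-second clips can be sketched as follows; re-encoding is chosen here because stream copying can misalign cuts to keyframes.

```python
import subprocess
from pathlib import Path

VIDEO_CLIP_DURATION_SEC = 6

def split_into_clips(video_path: str, out_dir: str) -> list[str]:
    """Cut a video into fixed-length segments, re-encoding so that every
    clip starts on a clean frame."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pattern = str(Path(out_dir) / "clip_%04d.mp4")
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", video_path,
            "-f", "segment",
            "-segment_time", str(VIDEO_CLIP_DURATION_SEC),
            "-reset_timestamps", "1",
            "-c:v", "libx264", "-c:a", "aac",   # re-encode for precise frame alignment
            pattern,
        ],
        check=True,
    )
    return sorted(str(p) for p in Path(out_dir).glob("clip_*.mp4"))
```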
4. Keyframe extraction and multimodal embedding generation
Each clip is processed frame by frame using optimized extraction algorithms to ensure high-quality frames. The ImageBind model is used to generate high-dimensional embeddings for each frame, reflecting both visual and semantic information. GPU acceleration and batch processing (BATCH_SIZE: 8) are used to speed up the generation, optimizing GPU memory usage and throughput.
Embeddings capture information from the video, audio, and text modalities, enabling efficient post-processing and search. GPU memory allocation is managed automatically to keep memory usage and throughput efficient during batch generation.
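A minimal sketch of batched frame embedding, assuming the reference ImageBind implementation from the facebookresearch/ImageBind repository; the exact preprocessing used inside the framework may differ.

```python
import torch

from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

BATCH_SIZE = 8
device = "cuda" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval().to(device)

def embed_frames(frame_paths: list[str]) -> torch.Tensor:
    """Return one embedding per frame, processed in batches of BATCH_SIZE."""
    chunks = []
    for i in range(0, len(frame_paths), BATCH_SIZE):
        batch = frame_paths[i:i + BATCH_SIZE]
        inputs = {ModalityType.VISION:
                  data.load_and_transform_vision_data(batch, device)}
        with torch.no_grad():
            out = model(inputs)
        chunks.append(out[ModalityType.VISION].cpu())
    return torch.cat(chunks)    # shape: (num_frames, embedding_dim)
```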
5. Multi-stage uniqueness check and content validation
Each uploaded video clip undergoes a strict uniqueness check against the database using embedding similarity analysis (UNIQUENESS_THRESHOLD: 0.12). The system implements multiple validation stages to ensure content uniqueness and quality, including content fingerprinting and uniqueness checks at different processing stages.
Validation criteria include relevance to the search topic, novelty (measured by similarity to existing recordings) and richness of detail (match between text and video content). Content is assessed based on several factors: visual quality, topical relevance, user engagement metrics. The assessment system is constantly updated based on feedback [5; 6].
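A simplified version of the embedding-based uniqueness check is shown below. The real system compares against a vector database rather than an in-memory matrix, and the interpretation of UNIQUENESS_THRESHOLD as a minimum cosine distance is an assumption.

```python
import torch
import torch.nn.functional as F

UNIQUENESS_THRESHOLD = 0.12

def is_unique(clip_embedding: torch.Tensor, db_embeddings: torch.Tensor) -> bool:
    """A clip is considered unique if it is at least UNIQUENESS_THRESHOLD away
    (in cosine distance) from every embedding already stored in the database."""
    if db_embeddings.numel() == 0:
        return True
    sims = F.cosine_similarity(clip_embedding.unsqueeze(0), db_embeddings, dim=1)
    min_distance = 1.0 - sims.max().item()
    return min_distance >= UNIQUENESS_THRESHOLD
```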
6. Topic filtering and quality assessment
Content is filtered based on predefined topics and their relevance (TOPICS_EMBEDDINGS_FILE_PATH: 'omega/assets/topics_embeddings1096_v2_pr54.pth'), which ensures compliance with target categories. The system supports a dynamic topic model that adapts to trends. An overall score threshold (TOTAL_SCORE_THRESHOLD: 0.3) is applied to ensure content quality and relevance.
Modern algorithms detect and filter unnecessary content: advertising, watermarks, low-quality segments. Several methods have been implemented for complex filtering with customizable quality thresholds for each stage.
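The topic-relevance filter can be sketched in the same spirit. This is a simplification: the real total score aggregates several factors (visual quality, relevance, engagement), and the topic-embeddings file is assumed here to contain a single (n_topics, dim) tensor.

```python
import torch
import torch.nn.functional as F

TOTAL_SCORE_THRESHOLD = 0.3
# Assumed to be a (n_topics, dim) tensor of topic embeddings.
topic_embeddings = torch.load("omega/assets/topics_embeddings1096_v2_pr54.pth")

def passes_topic_filter(clip_embedding: torch.Tensor) -> bool:
    """Keep a clip only if it is close enough to at least one target topic.
    Simplification: only the topic-similarity component of the total score is shown."""
    sims = F.cosine_similarity(clip_embedding.unsqueeze(0), topic_embeddings, dim=1)
    return sims.max().item() >= TOTAL_SCORE_THRESHOLD
```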
7. Data augmentation and preprocessing
If necessary, augmentation techniques are applied — random cropping, color jitter, and temporal distortions — to increase data diversity and model robustness. Frames are scaled to a single resolution, text descriptions are tokenized, and annotations are aligned to time intervals. For transformer-based models, text is converted to tokens and padded or truncated to a fixed length.
8. Storing and indexing processed data
All processed data and embeddings are stored in optimized database structures with support for batch processing and efficient indexing. Embeddings are indexed in a vector database for efficient similarity searching with support for optimized index structures. Efficient transaction management with error recovery mechanisms is implemented.
A comprehensive tracking system maintains the status of all processed content with detailed logging and monitoring. Database operations are optimized for batch processing to maximize throughput with efficient resource management.
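For the temporary vector database, a library such as FAISS could serve as the similarity index (the paper does not name the engine). A minimal sketch with an exact inner-product index over L2-normalized vectors:

```python
import faiss
import numpy as np

dim = 1024                                  # embedding dimensionality (ImageBind-Huge)
index = faiss.IndexFlatIP(dim)              # inner product == cosine sim on normalized vectors

def add_embeddings(embs: np.ndarray) -> None:
    """Add L2-normalized embeddings so that inner product equals cosine similarity."""
    embs = np.ascontiguousarray(embs.astype("float32"))
    faiss.normalize_L2(embs)
    index.add(embs)

def most_similar(query: np.ndarray, k: int = 5):
    """Return (similarities, ids) of the k nearest stored embeddings."""
    q = np.ascontiguousarray(query.astype("float32").reshape(1, -1))
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return scores[0], ids[0]
```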
Quality incentives and validation mechanisms
Validated videos are periodically aggregated and stored in a centralized repository (MAIN_DB_OPTIMAL_SIZE: 20,000,000) with support for integration into public datasets or research platforms. The system is designed to scale to millions of video clips, covering a wide range of scenarios and activity types.
Video marketplace and version control
DVC (Data Version Control) is used to track data and artifact versions, Weights & Biases (W&B) is used to monitor experiments, and MLflow is used to manage model versions and deployment. The system provides integration with Backblaze B2 for cost-effective storage of large datasets and models.
Technical implementation and performance optimizations
The system implements comprehensive performance optimizations: multiprocessing for video cutting with efficient resource allocation and load balancing, GPU acceleration for embedding generation, batch processing of database updates, modern caching with automatic cache management and cleaning.
Resource management includes efficient GPU memory allocation, CPU usage optimization through parallel processing, disk space management with automatic cleaning, network bandwidth optimization with speed limiting.
Error handling and logging are implemented at all levels: advanced handling and recovery from proxy failures, download timeout management, comprehensive recovery mechanisms, transaction management to ensure database consistency. The logging system includes detailed logging at all stages of processing, log rotation and archiving, performance monitoring, and comprehensive error reporting.
Resulting dataset and integrations
The result is a scalable, multi-level dataset suitable for training, testing, and benchmarking modern multimodal video description generation models. The system supports integration with external systems: YouTube API, proxy management systems, vector and main databases, ImageBind model; and internal components: video processing utilities, database connectors, embedding generators, content filters.
The architecture encourages the creation and validation of high-quality, diverse video data, supporting the development of reliable visual-to-text models and intelligent agents. The marketplace mechanism facilitates the creation of a dynamic ecosystem, rewarding valuable contributions and allowing external buyers to access curated, validated video content for research and development of artificial intelligence applications [7].
Description generation and selection pipeline
The video description generation pipeline is implemented as a multi-stage process, including automatic processing, generation, filtering and structuring of text fragments. Main stages:
1. Extraction of features and embeddings. For each video clip, visual, audio and (if available) text features are extracted using ImageBind-type models. High-dimensional multimodal embeddings are generated, reflecting both visual and semantic information. Batch processing and GPU acceleration are used to speed up processing.
2. Generation of initial descriptions. Multimodal embeddings are fed to the input of the Writer module, which implements a vision-to-text LLM (e.g., GPT-4V, LLaVA, MiniGPT-4). Brief text descriptions (fragments) are generated for each clip or keyframe. Parallel generation and aggregation of fragments is supported for long videos.
3. Fragmentation and structuring. Generated texts are divided into logically related fragments (usually 2-3 sentences or 50-100 words). Heuristic segmentation methods and syntactic parsing are used. Each fragment is supplied with metadata: time segment, generation model, filtering parameters, relevance assessments.
4. Filtering and cleaning. Several metrics are used: cosine similarity between fragment and video embeddings, TF-IDF uniqueness, thematic diversity, user parameters (no repetitions, length limit). Fragments that do not meet the thresholds for these metrics are excluded. To combat “stuffing” and templates, a classifier is used that penalizes redundant or repeated descriptions.
5. Storage and indexing. Cleaned and validated fragments are stored with metadata and embeddings in the database. Reverse search by embeddings and text features is supported. The storage structure is optimized for batch processing and scalable search.
6. Assembling the final description. Various strategies are used to form a coherent description of the video (a minimal Top-N selection sketch follows this list):
- Top-N: selecting the N most relevant fragments based on cosine similarity;
- Iterative addition taking into account the context: sequential expansion of the description with an assessment of coherence and novelty;
- Simulated annealing: stochastic optimization of the order and composition of fragments;
- Beam search: supporting several hypotheses and selecting the optimal sequence;
- Hybrid methods: a combination of Top-N and global optimization [8; 9].
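As an example, the simplest of these strategies — Top-N selection by cosine similarity — can be sketched as follows; ordering heuristics and de-duplication are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def select_top_n(fragment_embs: torch.Tensor,
                 fragment_texts: list[str],
                 video_emb: torch.Tensor,
                 n: int = 5) -> str:
    """Pick the N fragments whose embeddings are closest to the video embedding
    and concatenate them in their original (temporal) order."""
    sims = F.cosine_similarity(fragment_embs, video_emb.unsqueeze(0), dim=1)
    top = torch.topk(sims, k=min(n, len(fragment_texts))).indices
    keep = sorted(top.tolist())             # restore temporal order
    return " ".join(fragment_texts[i] for i in keep)
```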
The pipeline (see Figure 4) supports batch processing, parallelism, caching of intermediate results, and dynamic threshold adjustment to optimize quality and performance. All stages are logged and monitored to ensure reproducibility and transparency of experiments.
![](/Isaev.files/image004.png)
Figure 4. Video data processing and description generation pipeline
Metrics and quality assessment
The video description generation quality assessment system is built on a combination of lexical, semantic, structural, and embedding metrics, as well as a modular offline testing pipeline. This approach ensures objective and multi-level validation of the generation results.
The architecture includes an incentive mechanism to reward miners for contributing high-quality, novel, and diverse video data. The assessment algorithms analyze each submission based on relevance, novelty, and richness of detail, with validated entries receiving rewards [10]. The system dynamically adjusts thresholds and accumulation limits to maintain data quality and network efficiency.
Main metrics:
1. Lexical and semantic metrics (see the computation sketch after this list):
- BLEU — measures n-gram overlap between generated and reference descriptions;
- ROUGE — evaluates recall based on n-gram and longest-common-subsequence overlap;
- METEOR — balances precision and recall taking into account synonymy and morphology.
2. Cosine similarity and embedding assessment:
- Cosine similarity between embeddings of the generated description and video/audio content (ImageBind);
- Cosine similarity between embeddings of the description and the user query;
- Formula for cosine similarity between two embeddings A and B:
cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)
3. Relevance and novelty score:
- Relevance scores are calculated as a weighted sum of cosine similarities across modalities;
- Novelty score — comparison with existing database records, penalty for duplication;
- Length score — a bonus for staying within the optimal token-count range and a penalty for falling outside it.
4. Structural and heuristic filters:
- Check for domain-specific entities;
- Absence of redundant or repeated phrases;
- Low density of template language;
- Temporal consistency between the video segments and the description sequence.
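The lexical metrics listed under item 1 can be computed with common open-source packages; the paper does not specify the implementations used, so the sketch below assumes nltk and rouge-score.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # requires nltk.download("wordnet")
from rouge_score import rouge_scorer

def lexical_scores(reference: str, candidate: str) -> dict:
    """BLEU-4, ROUGE-L, and METEOR for one candidate description
    against one reference description."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    bleu = sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
        .score(reference, candidate)["rougeL"].fmeasure
    meteor = meteor_score([ref_tokens], cand_tokens)    # nltk >= 3.8 expects tokenized input
    return {"bleu4": bleu, "rougeL": rouge_l, "meteor": meteor}
```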
Offline testing pipeline includes:
- Formation of a dataset of video clips with metadata and (if available) reference descriptions;
- Generation of description candidates using the generation pipeline;
- Calculation of embeddings for video, audio, text;
- Parallel calculation of all metrics for each candidate;
- Aggregation of intermediate scores into a final composite score;
- Comparison with a control set and report generation.
Formula for the final score:
final_score = (query_relevance_score * QUERY_RELEVANCE_SCALING) +
              (description_relevance_score * DESCRIPTION_RELEVANCE_SCALING) +
              (length_boost_or_penalty) -
              (stuffedness_penalty)
Evaluation parameters (used in the code sketch below):
- QUERY_RELEVANCE_SCALING = 1.0
- DESCRIPTION_RELEVANCE_SCALING = 1.0
- MIN_LENGTH = 100 tokens
- MAX_LENGTH = 300 tokens
- LENGTH_BOOST = +0.1
- LENGTH_PENALTY = -0.1
- STUFFED_PENALTY = 0.2
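A direct transcription of this scoring rule into code, under the assumption that the length bonus/penalty and the stuffedness penalty are applied as simple threshold terms (the paper does not spell these details out):

```python
QUERY_RELEVANCE_SCALING = 1.0
DESCRIPTION_RELEVANCE_SCALING = 1.0
MIN_LENGTH, MAX_LENGTH = 100, 300          # tokens
LENGTH_BOOST, LENGTH_PENALTY = 0.1, -0.1
STUFFED_PENALTY = 0.2

def final_score(query_relevance: float,
                description_relevance: float,
                n_tokens: int,
                is_stuffed: bool) -> float:
    """Composite score combining query/description relevance, a length term,
    and a penalty for 'stuffed' (template- or keyword-padded) descriptions."""
    length_term = LENGTH_BOOST if MIN_LENGTH <= n_tokens <= MAX_LENGTH else LENGTH_PENALTY
    stuffed_term = STUFFED_PENALTY if is_stuffed else 0.0
    return (query_relevance * QUERY_RELEVANCE_SCALING
            + description_relevance * DESCRIPTION_RELEVANCE_SCALING
            + length_term
            - stuffed_term)
```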
The pipeline (see Figure 5) supports batch processing, parallelism, caching of intermediate results, and dynamic threshold adjustment to optimize quality and performance. All stages are logged and monitored to ensure reproducibility and transparency of experiments.
![](/Isaev.files/image005.png)
Figure 5. Offline description quality assessment pipeline diagram
Experimental results
The experimental part of the work was aimed at validating the effectiveness of the proposed framework and analyzing the quality of generating video descriptions at various stages of the pipeline. Both open datasets (MSR-VTT, YouCook2, ActivityNet Captions, VATEX) and our own collected video data with annotations were used for the experiments.
The following stages were implemented during the experiments:
1. Data collection and preparation: automated collection of video content, markup, embedding extraction, formation of training and test samples.
2. Model training and inference: various configurations of vision-to-text LLMs (GPT-4V, LLaVA, MiniGPT-4) were tested, as well as hybrid architectures with retrieval-based components.
3. Description generation: for each video clip, sets of description candidates were formed with subsequent selection based on quality metrics.
4. Quality assessment: BLEU, ROUGE, METEOR were calculated, as well as embedding metrics (cosine similarity to video and query), structural filters and composite scores.
5. Comparison of selection strategies: Top-N, iterative addition, simulated annealing, beam search and hybrid methods were analyzed [11].
The results of the experiments showed:
- The use of multimodal embeddings and hybrid selection strategies significantly increases the relevance and informativeness of the final descriptions;
- The best results for BLEU, ROUGE and METEOR are achieved by combining Top-N and global fragment order optimization;
- Embedding metrics correlate well with subjective quality assessment and allow automated filtering of uninformative or duplicate fragments;
- Batch processing, parallelism and caching allow scaling the pipeline to millions of video clips without loss of quality.
Table 1 shows the average values of the main metrics for different models and selection strategies (example).
Table 1.
Comparison of description generation quality metrics for different models and strategies
| Model | BLEU-4 | ROUGE-L | METEOR | CosineSim (video) | CosineSim (query) |
|---|---|---|---|---|---|
| DeepSeekVL2 + TopN | 0.32 | 0.51 | 0.27 | 0.58 | 0.62 |
| DeepSeekVL2 + Beam | 0.34 | 0.54 | 0.28 | 0.65 | 0.62 |
| DeepSeekVL2 + SA | 0.41 | 0.64 | 0.34 | 0.72 | 0.62 |
The scheme of the experimental pipeline and the integration of monitoring tools are shown in Figure 6.
![](/Isaev.files/image006.jpg)
Figure 6. Scheme of experimental pipeline and integration of monitoring tools
The conducted experiments confirm the applicability and scalability of the proposed framework for the tasks of generating descriptions for videos in various application scenarios.
Results and discussion
The results of the conducted experiments demonstrate the high efficiency of the proposed framework for generating text descriptions for videos in various scenarios. The use of multimodal embeddings, hybrid selection strategies and modular architecture ensures scalability, adaptability and high quality of the final descriptions.
Advantages of the implemented approach:
- The decentralized architecture and automated data collection make it possible to create scalable datasets with a high degree of diversity and novelty;
- The modularity of the system provides flexibility in integrating new models, metrics and selection strategies without the need for a complete rework of the pipeline;
- The use of embedding metrics and structural filters makes it possible to objectively assess the relevance and informativeness of descriptions and to automate the filtering of low-quality fragments;
- Batch processing, parallelism, and caching provide high performance and stability of the system when working with large volumes of data.
Limitations and challenges:
- The quality of description generation depends on the completeness and diversity of the source data, as well as the accuracy of annotations and metadata;
- Some metrics (e.g., BLEU, ROUGE) weakly correlate with subjective quality assessment, which requires additional involvement of human experts for validation;
- For complex video scenes with a large number of objects and events, segmentation errors and loss of context are possible when generating descriptions;
- Scaling the system requires optimization of storage and search for embeddings, as well as efficient management of computing resources.
Comparison with alternative approaches shows that the integration of retrieval-based and generative strategies, as well as the use of multimodal embeddings, provides higher relevance and informativeness of descriptions compared to classical and purely generative models.
Development Prospects:
- Integration of new types of large language models, including specialized video chat models and multimodal transformers;
- Development of interactive user interfaces for collecting feedback and retraining models in real time;
- Automation of knowledge base expansion through continuous collection and annotation of new video data;
- Implementation of user feedback loops for iterative improvement of generation quality and adaptation of the system to new tasks.
The proposed framework can be used in a wide range of practical applications: from increasing the availability of video content and educational platforms to creating intelligent assistants and automated annotation systems.
Conclusion
The paper presents a comprehensive architectural and functional overview of the framework for generating text descriptions of video content in the context of user queries. The system integrates modern multimodal models, retrieval-based mechanisms and advanced optimization algorithms to ensure a high level of flexibility, scalability and quality of the result.
The framework implements an end-to-end processing pipeline — from video segmentation and embedding extraction to description generation, filtering, scoring, and final assembly. Particular attention is paid to fragment selection mechanisms, which range from simple heuristics to global optimization methods. Quality assessment metrics provide an objective assessment of the generated texts, and offline testing allows monitoring stability during model updates.
Experimental results confirm that the combination of hybrid selection strategies with multimodal embeddings significantly improves the quality of the resulting descriptions. The proposed approach is highly adaptive and can be extended to more complex scenarios, such as dialog generation, action explanation, or topic annotation.
The framework can be used in a wide range of practical areas: from increasing the accessibility of video content and educational platforms to creating intelligent assistants and automated annotation systems. Development prospects include the integration of new types of language models, expanding the knowledge base, and introducing user feedback loops to iteratively improve the quality of generation.
References:
1. Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
2. Zhou, L., Xu, C., & Corso, J. J. (2018). Towards Automatic Learning of Procedures from Web Instructional Videos. In Proceedings of the AAAI Conference on Artificial Intelligence (YouCook2).
3. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., & Niebles, J. C. (2017). Dense-Captioning Events in Videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (ActivityNet Captions).
4. Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.-F., & Wang, W. Y. (2019). VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
5. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
6. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the ACL Workshop "Text Summarization Branches Out".
7. Banerjee, S., & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
8. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., & Misra, I. (2023). ImageBind: One Embedding Space To Bind Them All. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2305.05665.
9. OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
10. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). arXiv preprint arXiv:2304.08485.
11. Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.