EDUCATIONAL FIGURE GENERATION USING TEXT PERCEPTUAL LOSS

Salamat K., Lytkin S.M.
To cite this article:
Salamat K., Lytkin S.M. EDUCATIONAL FIGURE GENERATION USING TEXT PERCEPTUAL LOSS // Universum: технические науки: electronic scientific journal. 2025. 5(134). URL: https://7universum.com/ru/tech/archive/item/20148 (accessed: 05.12.2025).
DOI: 10.32743/UniTech.2025.134.5.20148

 

ABSTRACT

Generative models have achieved remarkable advancements in creating realistic and intricate images. However, challenges remain in adapting these models for specialized domains, particularly educational figure generation. Educational visuals demand a higher degree of accuracy and contextual relevance, especially when sourced from technical literature. Here, clarity and alignment with the content are crucial for effective learning. In this study, we delve into the generation of educational figures derived from text in machine learning articles and books available online. Our innovative approach focuses on a text perceptual loss function designed to enhance the alignment between source text and generated visuals. This method particularly emphasizes the text elements within figures. By integrating this loss function, we aspire to create figures that not only reflect the original instructional intent but also enhance the clarity and relevance of the information presented. Ultimately, achieving this goal could significantly streamline the learning process and improve educational outcomes.


 

Keywords: Generative Models, Latent Diffusion Models (LDM), Autoencoder, Text-to-Image Generation, OCR, Educational Figures, Text Perceptual Loss (TPL).


 

Introduction

Educational visuals are crucial in enhancing learning experiences [1, 2] by engaging students and facilitating better comprehension of complex concepts. Visual aids provide a representation of information that can make it easier for students to understand and remember key ideas. 

Visual aids help students remember information better. By associating facts with images or symbols, students can create mental connections that make recalling information easier. Engaging visuals capture students’ attention and keep them interested.

However, the process of creating quality educational visuals requires significant time and effort from educators. This process can be time-consuming and complex, often requiring specialized skills in graphic design and digital tools. Automating the generation of visual content through AI-driven solutions could streamline this process and allow educators to focus more on pedagogical tasks.

Automated generation of educational images can also enhance accessibility and personalization in learning. Research shows that personalized learning strategies can improve student performance [3], as they cater to individual learning styles and levels. Such systems are especially beneficial in blended and online learning environments, where understanding unique student needs is crucial. By leveraging generative models, educators can create adaptive visuals that meet specific student requirements and improve the overall effectiveness of educational resources.

Furthermore, educational images play a vital role in helping students grasp complex and abstract concepts. Visual materials, such as graphs, diagrams, and infographics, can simplify the understanding of high-level ideas, particularly in STEM [4]. Studies indicate that visual aids enhance information retention and accelerate comprehension by creating visual associations that support recall.

Literature review

An important contribution to this field is OCR-VQGAN: Taming Text-within-Image Generation by Rodriguez et al. [5], which addresses the challenge of generating readable text within images. In their work, they use an image encoder and decoder trained with an OCR perceptual loss, which relies on the pretrained CRAFT model [6] to extract text features. This helps the model keep text and diagrams clear and accurate, which is especially valuable for figures, as they often contain complex diagrams with many labels. To train and test their approach, they created a large dataset called Paper2Fig100k, which contains over 100,000 figure images and captions from scientific papers. They showed that OCR-VQGAN performs well on the figure reconstruction task.

Their subsequent work, FigGen [7], uses a diffusion approach [8], specifically a latent diffusion model, to generate scientific figures from text descriptions. FigGen tackles the hard parts of figure synthesis: representing complex relationships between discrete components such as boxes, arrows, and text. They use the same encoder approach and the same dataset as in the previous study.

The article [9] explores text-image alignment for applications in scene text detection. This alignment is essential for improving the integration of text with optical character recognition and for enabling pretrained models to adapt to the complex and diverse styles found in scene text detection and spotting tasks. The main focus is on the loss computation used to optimize the model (which produces a binary image); one of the losses used is the OCR perceptual loss from [5].

AnyText [10] emphasizes multilingual capabilities to overcome the challenges of generating visual text in various languages, supporting both text creation and modification within images across multiple languages. The authors introduced a large multilingual dataset, AnyWord-3M. To improve the accuracy of text generation, they used a text perceptual loss, employing a pretrained PP-OCRv3 model to obtain OCR feature maps for the loss computation. At the lowest level, the loss is an MSE that minimizes discrepancies between predicted and original images in text regions.

Assessing the quality of generated images remains vital for educational visuals. Sara et al. [11] compared standard image quality metrics such as SSIM, MSE, and PSNR, finding that while these metrics provide foundational quality assessments, they fall short in capturing perceptual nuances critical to educational effectiveness. Expanding on this, Zhang et al. (2018) [12] demonstrated that perceptual metrics derived from deep features offer a more reliable assessment of image quality, aligning better with human judgment and making them suitable for evaluating visuals used in education. They created their own dataset, BAPPS, which contains original images along with various distortions generated using traditional algorithms and CNN-based approaches. To capture human perceptual judgments regarding image quality, they conducted similarity measurements using the Two Alternative Forced Choice (TAFC) method and Just Noticeable Differences (JND) for validation.

All provided studies have utilized OCR feature maps and focused on the integration of text with images. Our work builds upon these foundations, emphasizing a more detailed analysis of text features to enhance the precision of educational figure generation.

Methods and Materials

1. ML-Figs Dataset

We present the ML-Figs dataset [20], a comprehensive collection of 4,302 figures and captions (see Figure 1) extracted from 43 machine learning books. This dataset is designed to advance research in understanding and interpreting educational materials. It includes 4,000 samples for training and 302 for testing.

 

Figure 1. Sample figures from the dataset

 

The dataset was constructed through a multi-step process:

  • Data collection. Machine learning books were sourced from reputable online repositories and manually curated. To automate the process of collecting books, tools such as Scrapy and Beautiful Soup were used (Figure 2, Step 1).
  • Parsing figures and captions. By using tools such as PDFFigures [13], we extracted images and their captions from books (Figure 2, Step 2).
  • Optical character recognition (OCR). The text from the figures was extracted using Tesseract, enabling its conversion to a machine-readable format (Figure 2, Step 3).

The dataset includes JSON files for each book, representing metadata for figures. Each metadata includes attributes such as captions, bounding boxes for captions and figures, figure types (e.g., “Graph”, “Table”), OCR-extracted text along with its coordinates and confidence scores. This detailed metadata helps with a range of research tasks, including figure-caption association, enhancing OCR tasks, and analyzing the structure of documents. It provides a solid basis for deepening our understanding of educational content.
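As a purely illustrative sketch of how this metadata might be consumed, the Python snippet below reads one book's JSON file and iterates over figure records; the field names are our assumptions for illustration, not the dataset's published schema (see [20] for the actual files).

    import json

    # Load one book's figure metadata (field names here are illustrative assumptions).
    with open("book_01_figures.json", "r", encoding="utf-8") as f:
        figures = json.load(f)

    for fig in figures:
        caption = fig["caption"]          # figure caption text
        fig_type = fig["figure_type"]     # e.g. "Graph", "Table"
        fig_box = fig["figure_bbox"]      # bounding box of the figure on the page
        for token in fig["ocr_text"]:     # OCR-extracted words inside the figure
            word, box, conf = token["text"], token["bbox"], token["confidence"]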

 

Figure 2. Educational Figure Collection. Figure Collection consists of three stages: (1) Educational ML Books, (2) Parsing figures and captions from PDF ML books (ID, Caption), and (3) performing Optical Character Recognition (OCR) on the image to extract text. By combining these elements, we gain a deeper understanding of the figure’s content

 


To improve the coverage and diversity of our data, we expanded the ML-Figs dataset with additional figures and captions from the SciCap dataset [16], particularly those from ACL papers. This expansion (ML-Figs + SciCap) increased the total size of our dataset to 19,514 samples, covering a wider range of topics and making the dataset more useful for training on various tasks. After the expansion, the training set consists of 15,611 samples and the test set of 3,903 samples.

1.1 Text Analysis from Figure Captions and OCR

To better understand the textual content in the ML-Figs dataset, we generated word clouds for figure captions and OCR-extracted text from figures. This analysis helps identify the most common terms and highlights the differences between these two textual representations.
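The word clouds themselves can be produced with the open-source wordcloud package; the snippet below is a minimal sketch with toy stand-in text (the file names and the aggregation step are our choices for illustration, not part of the dataset tooling).

    from wordcloud import WordCloud

    # Toy stand-ins for the caption and OCR text pools collected from ML-Figs.
    captions = ["Illustration of the model architecture", "Distribution of the training data"]
    ocr_texts = ["input output feature algorithm", "user recommendation model serving"]

    for name, texts in [("captions", captions), ("ocr", ocr_texts)]:
        cloud = WordCloud(width=1200, height=600, background_color="white")
        cloud.generate(" ".join(texts))
        cloud.to_file(f"wordcloud_{name}.png")  # one image per text source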

Figure 3 (a) illustrates the most common terms found in figure captions related to machine learning models, data representation, and mathematical concepts. The words that appear most frequently include “figure”, “illustration”, “model”, “data”, “function”, “result” and “distribution”. This suggests that captions typically describe the underlying methodology or key concepts illustrated in the figures.

 

Figure 3. Word cloud analysis for figure captions and OCR-extracted text

 

In contrast, Figure 3 (b) displays the common terms extracted from the OCR text. It includes more technical terminology, such as “input”, “output”, “time”, “feature” and “algorithm”. Additionally, words like “user”, “recommendation” and “model serving” indicate the presence of content related to system architectures and recommendation models. Some OCR noise is observed, particularly in cases where mathematical symbols, equations, or subscripted text are misinterpreted.

These findings highlight the domain-specific nature of figure text in ML and educational contexts. Unlike general text-to-image tasks, where prompts explicitly describe image content, figure captions and extracted text contain technical vocabulary that requires specialized text processing. Due to the need for more accurate semantic representation, we decided to use a BERT-based [15] encoder with additional transformer layers to better align textual inputs with figure generation.

2. Method

Our approach employs latent diffusion models (LDM) for generating and reconstructing figures, complemented by a custom loss function to ensure high accuracy in the synthesized images. This custom loss function focuses on comparing the text regions of the original figures with the reconstructed ones, which is crucial for capturing the finer details, especially in the text. The code is available online [21].

To effectively capture and analyze text elements in the figures, we use a pretrained text recognition model during the third step of data collection (Figure 2, Step 3). Specifically, the text recognition model extracts text bounding boxes from the figures, which are then utilized to train our AutoencoderKL.

2.1 Text Perceptual Loss

The Text Perceptual Loss calculates the perceptual similarity between the text regions of two images by extracting text bounding boxes. The mean squared error (MSE) loss is then computed for each corresponding text region. The final loss is the average of these individual region losses.

Figure 4 illustrates the algorithm for computing the text perceptual loss. The algorithm iterates over each text region specified by the bounding boxes. For each region, the corresponding sections of the original and reconstructed images are extracted, and the MSE loss is computed. The final perceptual loss is the average of all individual losses, representing the overall accuracy of text reconstruction in the figure.

 

Figure 4. Text Perceptual Loss algorithm
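As a rough PyTorch sketch of the procedure in Figure 4, assuming images are (C, H, W) tensors and boxes are integer pixel coordinates; the function below is a simplified illustration, not the released implementation [21].

    import torch
    import torch.nn.functional as F

    def text_perceptual_loss(original, reconstructed, text_boxes):
        # original, reconstructed: image tensors of shape (C, H, W).
        # text_boxes: iterable of (x1, y1, x2, y2) OCR bounding boxes in pixels.
        losses = []
        for (x1, y1, x2, y2) in text_boxes:
            orig_region = original[:, y1:y2, x1:x2]
            recon_region = reconstructed[:, y1:y2, x1:x2]
            losses.append(F.mse_loss(recon_region, orig_region))
        if not losses:  # figure contains no detected text
            return torch.tensor(0.0, device=original.device)
        return torch.stack(losses).mean()  # average over all text regions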

 

2.2 AutoencoderKL

The AutoencoderKL (Variational Autoencoder with Kullback-Leibler divergence) [14] plays a crucial role in learning a compact latent space representation of the images.  It comprises an encoder that maps images into a lower-dimensional latent space, and a decoder that reconstructs images from these latent vectors. The architecture operates with a downsampling factor of 8, making it particularly efficient for high-resolution image modeling.

 

Figure 5. Architecture of AutoencoderKL incorporating Text Perceptual Loss

 

During training, the AutoencoderKL was optimized on images at a resolution of 384×384, which, given the downsampling factor of 8, yields latent representations of size 48×48 (Figure 5). Despite being trained at 384 resolution, the model can generalize to higher resolutions due to its fully convolutional design. In our downstream LDM pipeline, we leverage the same autoencoder to encode and decode images at a higher resolution, resulting in correspondingly larger latent feature maps.

To guide learning, we define a multi-term reconstruction loss (1) that combines pixel-wise and perceptual objectives:

\mathcal{L}_{\mathrm{rec}} = \mathcal{L}_{\mathrm{L1}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{TPL}}\,\mathcal{L}_{\mathrm{TPL}}   (1)

Here, \mathcal{L}_{\mathrm{L1}} denotes the standard pixel-wise L1 loss, \mathcal{L}_{\mathrm{LPIPS}} is the learned perceptual similarity metric [12], and \mathcal{L}_{\mathrm{TPL}} is the text perceptual loss (Figure 4), which encourages alignment between reconstructed images and their associated textual descriptions. The weights \lambda_{\mathrm{LPIPS}} and \lambda_{\mathrm{TPL}} balance the perceptual and text-based contributions.

Additionally, the training process includes Kullback-Leibler divergence and an adversarial loss via a discriminator, which encourages the generation of visually realistic reconstructions.
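A minimal sketch of how the terms in (1) might be combined during training is shown below, reusing the text_perceptual_loss sketch above and the open-source lpips package for the perceptual term; the default weights, tensor shapes, and the omission of the KL and adversarial terms are simplifications on our part.

    import torch
    import lpips

    lpips_fn = lpips.LPIPS(net="vgg")  # learned perceptual similarity metric [12]

    def reconstruction_loss(original, reconstructed, text_boxes,
                            w_lpips=1.0, w_tpl=1.0):
        # Pixel-wise L1 term.
        l1 = torch.abs(original - reconstructed).mean()
        # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
        lp = lpips_fn(original.unsqueeze(0), reconstructed.unsqueeze(0)).mean()
        # Text perceptual term (Section 2.1).
        tpl = text_perceptual_loss(original, reconstructed, text_boxes)
        return l1 + w_lpips * lp + w_tpl * tpl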

2.3 Latent Diffusion Model (LDM)

The Latent Diffusion Model (LDM) [8] offers a powerful way to streamline image generation. By working in a latent space, it significantly reduces the computational demands of directly processing high-dimensional images. After training the autoencoder, we used it within the LDM pipeline (see Figure 6). The compact latent representations produced by the AutoencoderKL make the diffusion process much more efficient, while still yielding high-quality image outputs without the cost of processing high-dimensional data directly.

 

Figure 6. Latent Diffusion Model architecture

 

The denoising core of the LDM is built on a U-Net design equipped with spatial transformers and cross-attention layers. These elements help the model capture local structure and semantic information simultaneously when creating images.

We trained a BERT-based [15] text encoder together with a diffusion model to enhance our conditioning process. This setup features 12 transformer layers that produce 512-dimensional embeddings. These embeddings are then integrated into the U-Net using cross-attention layers. By training both components simultaneously, our model effectively learns to align image generation with the semantics of the textual inputs.
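This conditioning can be pictured as a cross-attention block in which flattened U-Net feature maps attend to the 512-dimensional text embeddings; the PyTorch sketch below is our simplified illustration, not the exact layer definitions used in the code [21].

    import torch
    import torch.nn as nn

    class CrossAttentionBlock(nn.Module):
        """U-Net features (queries) attend to text-encoder embeddings (keys/values)."""
        def __init__(self, feat_dim, context_dim=512, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(feat_dim, num_heads,
                                              kdim=context_dim, vdim=context_dim,
                                              batch_first=True)
            self.norm = nn.LayerNorm(feat_dim)

        def forward(self, x, context):
            # x: (B, N, feat_dim) flattened spatial features;
            # context: (B, T, 512) embeddings from the BERT-based text encoder.
            attended, _ = self.attn(self.norm(x), context, context)
            return x + attended  # residual connection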

2.4 Evaluation Metrics

To evaluate the performance of our autoencoders, we have used a set of metrics to calculate the degree of similarity between the original and reconstructed images. Peak Signal-to-Noise Ratio (PSNR) [11] calculates the ratio of maximum signal power to the power of distorting noise affecting quality representation. Structural Similarity Index Method (SSIM) [11] is a perception-based model that considers image degradation as a change in the perception of structural information. Fréchet Inception Distance (FID) [18] measures the similarity between the distributions of real and reconstructed images. Learned Perceptual Image Patch Similarity (LPIPS) [12] evaluates perceptual similarity between images. Mean Squared Error (MSE) [11] calculates pixel-wise differences between real and reconstructed images. Finally, Text Perceptual Loss (TPL) (see Section 2.1) assesses the alignment of textual content in images, emphasizing semantic consistency, and lower values indicate superior performance in this aspect.
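For the pixel-level metrics, a short sketch using scikit-image is given below, assuming images are float NumPy arrays in [0, 1]; FID, LPIPS, and TPL are computed separately with their respective implementations.

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def pixel_metrics(original, reconstructed):
        # original, reconstructed: float arrays of shape (H, W, 3) in [0, 1].
        mse = float(np.mean((original - reconstructed) ** 2))
        psnr = peak_signal_noise_ratio(original, reconstructed, data_range=1.0)
        ssim = structural_similarity(original, reconstructed,
                                     data_range=1.0, channel_axis=-1)
        return {"MSE": mse, "PSNR": psnr, "SSIM": ssim}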

In addition to FID, Inception Score (IS), Kernel Inception Distance (KID) [17], and CLIP Score [19] were used to evaluate the LDM model. These metrics are widely used in generative modeling to quantify image fidelity, semantic alignment, and diversity. While higher IS and lower FID/KID indicate better generative performance in terms of realism and variation, CLIP Score measures the semantic alignment between generated images and their corresponding text prompts, offering an additional perspective on cross-modal consistency.

Together, these metrics provide a robust framework for assessing both the reconstruction fidelity of autoencoders and the generative quality of diffusion models, which is crucial for downstream tasks like text-to-image generation.

Results and Discussion

In this study, we trained two autoencoders on distinct datasets: ML-Figs and an expanded version, ML-Figs+SciCap. The performance of the two autoencoders was compared on the ML-Figs and ML-Figs+SciCap test sets. Both models (A and B) were trained with the text perceptual loss included to enhance text quality and readability. The pretrained autoencoder from Stable Diffusion v1-4 served as the baseline for comparison.

As shown in Table 1, Model B, trained on the combined ML-Figs+SciCap dataset, achieved the best performance across all evaluation metrics except PSNR. However, Figure 7 and [11] suggest that, from a human visual perspective, SSIM is generally considered more reliable than PSNR, which supports Model B's effectiveness in preserving both perceptual quality and semantic alignment. In comparison, Model A (trained only on ML-Figs) showed strong performance in FID, MSE, and TPL, while the SD baseline achieved better PSNR, SSIM, and LPIPS when compared solely to Model A.

Table 1.

Quantitative Comparison of Autoencoder Models

Method                  PSNR↑    SSIM↑    FID↓     LPIPS↓   MSE↓     TPL↓

ML-Figs Test
SD v1-4 (baseline)      33.01    0.970    20.51    0.022    0.003    0.043
A                       30.71    0.954    16.13    0.056    0.002    0.017

ML-Figs + SciCap Test
SD v1-4 (baseline)      32.60    0.970    12.69    0.023    0.004    0.061
A                       29.94    0.954    9.235    0.057    0.003    0.028
B                       31.47    0.979    6.256    0.016    0.001    0.010

Note: Model A was trained on ML-Figs; Model B was trained on ML-Figs + SciCap. TPL: Text Perceptual Loss. SD v1-4 refers to the Stable Diffusion v1-4 autoencoder trained on LAION.

 

These findings underscore the importance of both dataset richness and the inclusion of text perceptual loss functions. The superior performance of Model B suggests that combining diverse training data with TPL yields significant gains in both perceptual and semantic quality of reconstructed images. Future improvements could further enhance this approach by incorporating even more diverse domain-specific datasets or adaptive perceptual losses.

To assess the downstream impact of improved autoencoder representations, we trained a text-conditioned LDM using the latent space of Model B. Training was conducted on the ML-Figs+SciCap dataset.

The model was tailored to our multimodal dataset of educational figures and their associated captions. We used a learning rate of 1e-6 and optimized the training objective over 1000 diffusion steps, with a linear noise schedule ranging from 0.0015 to 0.0205. A LambdaLR scheduler applied 10,000 warm-up steps followed by a flat learning rate. Exponential Moving Average (EMA) updates were employed to stabilize the training process.
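A minimal sketch of such a warm-up schedule with torch.optim.lr_scheduler.LambdaLR is shown below; the model and optimizer are placeholders, and the optimizer choice is our assumption.

    import torch

    model = torch.nn.Linear(4, 4)  # placeholder standing in for the LDM
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

    warmup_steps = 10_000

    def lr_lambda(step):
        # Linear warm-up over the first 10k steps, then a flat multiplier of 1.0.
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # scheduler.step() is called once per training step after optimizer.step().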

The LDM architecture consists of a U-Net backbone that operates on the latent representations with 4 input/output channels. Spatial transformers are enabled to facilitate text-image alignment via cross-attention mechanisms, using a 512-dimensional context space.

The first-stage model is an AutoencoderKL initialized from the checkpoint trained on ML-Figs+SciCap. It compresses RGB images into a 4-channel latent space. Text conditioning is performed using a BERT-based encoder with 12 transformer layers and a 512-dimensional embedding. Cross-attention enables the diffusion model to align image generation with textual inputs effectively.

Training was carried out on an NVIDIA RTX 6000 Ada GPU (48GB) with a batch size of 7. The resolution was set to . Text prompts were derived from figure captions, and square padding and region-of-interest (ROI) bounding boxes were used to improve input consistency.

 

Figure 7. Qualitative Comparison of Autoencoder Models. Our model B outperforms other models in terms of clarity and legibility of the text

 

Our LDM achieved an IS of 1.055 (↑), an FID of 28.633 (↓), and a KID of 0.027 (↓) on the ML-Figs+SciCap dataset, indicating reasonable alignment between the generated and ground-truth image distributions. Additionally, the CLIP Score reached 21.24 (↑), reflecting a good level of semantic consistency between the generated figures and their corresponding text prompts.

As shown in Figure 8, increasing the classifier-free guidance (CFG) scale significantly enhances both the semantic alignment of the content and the overall visual quality. Among the tested values, a CFG scale of 3.0 produced the most coherent and faithful generations across a range of caption prompts. Based on these observations, we selected CFG = 3.0 for the final evaluation.

 

Figure 8. Generated samples across varying classifier-free guidance (CFG) scales. Each column corresponds to a specific figure caption and its ground-truth reference image (top row). The images are generated at CFG scales 1.0, 1.5, 2.0, 2.5, and 3.0, respectively.
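Classifier-free guidance combines unconditional and text-conditional noise predictions at sampling time; the sketch below shows the standard formulation, where the unet callable and the context arguments are placeholders for illustration rather than our exact sampling code [21].

    def cfg_noise_prediction(unet, x_t, t, text_context, null_context, cfg_scale=3.0):
        # Extrapolate from the unconditional prediction toward the text-conditional one.
        eps_uncond = unet(x_t, t, context=null_context)  # prediction without text
        eps_cond = unet(x_t, t, context=text_context)    # text-conditioned prediction
        return eps_uncond + cfg_scale * (eps_cond - eps_uncond)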

 

These results show that improved autoencoder representations (especially those trained with the text perceptual loss) enhance LDM training and generation quality. Moreover, controlling the CFG scale provides an effective mechanism for balancing fidelity and diversity in image generation.

Conclusion

In this paper, we presented a method for educational figure generation using a text perceptual loss. We introduced the ML-Figs dataset for text-to-image and image-to-text tasks in educational settings. By incorporating our text perceptual loss across several experiments, our autoencoders demonstrated strong performance on perceptual metrics (LPIPS = 0.016, TPL = 0.010, etc.). Qualitative results showed the method's effectiveness in preserving both image quality and textual content during reconstruction, with readable and well-aligned text.

We trained the LDM for the text-to-image task using our pretrained autoencoder with the text perceptual loss. The results showed that the generated samples closely resemble real images and effectively capture the content and layout described in the captions. However, in contrast to reconstruction, where textual elements remain clear, text in generated figures is often blurry or distorted. We conclude that generating figures with clear and readable text requires more training iterations and additional computational resources, which is our main limitation.

Future work will focus on improving the readability and accuracy of textual content in generated figures to better support educational use cases.

 

References:

  1. Veřmiřovský Jan, 2013: The Importance of Visualisation in Education, [in] E-learning & Lifelong Learning. Monograph. Sc. Editor Eugenia Smyrnova-Trybulska, Studio Noa for University of Silesia in Katowice, Katowice-Cieszyn, pp. 453-463. ISBN 978-83-60071-66-3.
  2. Hidayah L. R. et al. The Importance of Using Visual in Delivering Information // VCD. – 2023. – Vol. 8. – No. 1. – pp. 52-61.
  3. Makhambetova A., Zhiyenbayeva N., Ergesheva E. Personalized learning strategy as a tool to improve academic performance and motivation of students // International Journal of Web-Based Learning and Teaching Technologies (IJWLTT). – 2021. – Vol. 16. – No. 6. – pp. 1-17.
  4. Gates P. The importance of diagrams, graphics and other visual representations in STEM teaching // STEM Education in the Junior Secondary: The state of play. – Singapore: Springer Singapore, 2017. – pp. 169-196.
  5. Rodriguez J. A. et al. OCR-VQGAN: Taming text-within-image generation // Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. – 2023. – pp. 3689-3698.
  6. Baek Y. et al. Character region awareness for text detection // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2019. – pp. 9365-9374.
  7. Rodriguez J. A. et al. FigGen: Text to scientific figure generation // arXiv preprint arXiv:2306.00800. – 2023.
  8. Rombach R. et al. High-resolution image synthesis with latent diffusion models // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2022. – pp. 10684-10695.
  9. Duan C. et al. ODM: A text-image further alignment pre-training approach for scene text detection and spotting // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2024. – pp. 15587-15597.
  10. Tuo Y. et al. AnyText: Multilingual visual text generation and editing // arXiv preprint arXiv:2311.03054. – 2023.
  11. Sara U., Akter M., Uddin M. S. Image quality assessment through FSIM, SSIM, MSE and PSNR—a comparative study // Journal of Computer and Communications. – 2019. – Vol. 7. – No. 3. – pp. 8-18.
  12. Zhang R. et al. The unreasonable effectiveness of deep features as a perceptual metric // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. – 2018. – pp. 586-595.
  13. Clark C., Divvala S. PDFFigures 2.0: Mining figures from research papers // Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries. – 2016. – pp. 143-152.
  14. Kingma D. P. et al. Auto-encoding variational Bayes [Electronic resource].
  15. Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). – 2019. – pp. 4171-4186.
  16. Hsu T. Y., Giles C. L., Huang T. H. K. SciCap: Generating captions for scientific figures // arXiv preprint arXiv:2110.11624. – 2021.
  17. Bińkowski M. et al. Demystifying MMD GANs // arXiv preprint arXiv:1801.01401. – 2018.
  18. Heusel M. et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium // Advances in Neural Information Processing Systems. – 2017. – Vol. 30.
  19. Hessel J. et al. CLIPScore: A reference-free evaluation metric for image captioning // arXiv preprint arXiv:2104.08718. – 2021.
  20. ML-Figs Dataset // Hugging Face. URL: https://doi.org/10.57967/hf/5251
  21. Source code ML-FIGS-LDM // GitHub repository. URL: https://github.com/salamnocap/ml-figs-ldm

 

 

Information about the authors

Master's student, Kazakh-British Technical University, Kazakhstan, Almaty

PhD, Assistant Professor, Kazakh-British Technical University, Kazakhstan, Almaty
