Student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
COMPARATIVE ANALYSIS OF FSRCNN, ESPCN, AND CARN MODELS FOR EFFICIENT REAL-TIME IMAGE SUPER-RESOLUTION
ABSTRACT
We ask a down-to-earth question: when every millisecond and megabyte matters, which tiny super-resolution network should we trust? To answer it, we put three well-known lightweight CNNs - FSRCNN, ESPCN and CARN - into an apples-to-apples setup. We measured their performance on Set5, Set14, BSD100 and Urban100 at magnification factors ×2, ×3 and ×4, logging peak signal-to-noise ratio (PSNR), structural similarity (SSIM), average inference time, parameter count and theoretical GFLOPs. Across the four data sets FSRCNN tops the fidelity chart at ×2 (32.8 dB PSNR / 0.916 SSIM) while finishing a 128×128 crop in about 1.1 ms on an RTX 4070 Ti Super. ESPCN trades roughly half a decibel (32.3 dB, 0.911 SSIM) for a record-low 0.43 ms and a footprint of just 21 k parameters. CARN, with almost one million parameters, is slower (6 ms) and 4 dB behind in PSNR, yet its cascading blocks draw marginally sharper edges in highly textured urban scenes. The resulting speed-quality map gives practitioners a clear rule of thumb when power, memory or frame-rate budgets are tight.
АННОТАЦИЯ
Мы задаем приземленный вопрос: какой крошечной сети сверхразрешения стоит доверять, когда важна каждая миллисекунда и мегабайт? Чтобы ответить на него, мы поместили три известные легковесные CNN - FSRCNN, ESPCN и CARN - в условия абсолютно честного сравнения. Мы измерили их производительность на наборах данных Set5, Set14, BSD100 и Urban100 при коэффициентах увеличения ×2, ×3 и ×4, фиксируя пиковое отношение сигнала к шуму (PSNR), структурное сходство (SSIM), среднее время вывода, количество параметров и теоретические GFLOPs. Среди четырех наборов данных FSRCNN возглавляет рейтинг точности при ×2 (32.8 дБ PSNR / 0.916 SSIM), обрабатывая фрагмент 128×128 примерно за 1.1 мс на RTX 4070 Ti Super. ESPCN уступает примерно полдецибела (32.3 дБ, 0.911 SSIM) ради рекордно низкой задержки в 0.43 мс и размера всего в 21 тысячу параметров. CARN, имеющая почти миллион параметров, работает медленнее (6 мс) и отстает на 4 дБ по PSNR, однако её каскадные блоки прорисовывают немного более четкие края в высокотекстурированных городских сценах. Полученная карта соотношения «скорость-качество» дает практикам четкое эмпирическое правило для ситуаций, когда бюджеты мощности, памяти или частоты кадров ограничены.
Keywords: Super-resolution, CNN, Deep Learning, Real-time processing, Computer Vision, FSRCNN, ESPCN, CARN.
Ключевые слова: Сверхразрешение, CNN, глубокое обучение, обработка в реальном времени, компьютерное зрение, FSRCNN, ESPCN, CARN.
Introduction
People have tried to squeeze sharper images out of blurry ones for more than half a century. Early fixes, bilinear and bicubic interpolation, were quick but could not invent detail. Example-based tricks followed: Freeman et al. matched small patches to a training library [1], while Glasner blended self-similar regions across scales [2]. Sparse coding then ruled for a while, with Yang's coupled dictionaries delivering record peak signal-to-noise ratio (PSNR) in 2010 [3]. All these methods, however, hand-crafted either the prior or the search strategy. The real jump came when Dong et al. replaced hand design with learning and introduced SRCNN [4]. Three tiny convolutions trained end-to-end beat decades of signal theory, but at the cost of running every filter on the high-resolution grid, which throttled speed.
Phones, drones and wearables reshaped the goalposts: an algorithm must be not only accurate, but also light enough to live inside a battery-powered chip. Dong's follow-up FSRCNN [5] attacked latency head-on: it shrank filters, slipped all heavy lifting into the low-resolution (LR) domain and used a deconvolution tail for upscaling. Shi et al. took a different route in ESPCN [6], inventing the "pixel shuffle" that trades depth for width in one painless move. These ideas inspired LapSRN [7], IMDN [8] and Lite-HRNet-SR [9]. All share one theme: cut floating-point operations (FLOPs) first, fit memory second, and only then worry about squeezing the last decibel of PSNR.
While mobile nets went on a diet, desktop-class models bulked up. Lim et al. deepened residual stacks in EDSR [10]; Zhang et al. added channel attention in RCAN [11]; Liang et al. ushered transformers into SR with SwinIR [12]. GAN flavours such as SRGAN and ESRGAN [13], and the more robust Real-ESRGAN [14], traded PSNR for photorealistic textures. These giants crack the 40 dB barrier but tip the scales at tens of millions of parameters and hundreds of GFLOPs, unfit for edge devices where every millisecond burns battery and every megabyte strains memory buses.
Against this backdrop, three networks keep surfacing in open-source toolkits, lecture slides and production firmware:
- FSRCNN - 0.013 M parameters, deconvolution tail, still cited as the classic low-latency baseline [5].
- ESPCN - the pixel-shuffle pioneer, famous for sub-millisecond runs on Raspberry Pi-class CPUs [6].
- CARN - a cascading residual mesh with 1×1 bottlenecks that squeezes attention-like reasoning into 0.9 M weights [15].
Although they aim for the same real-time ×2-×4 task, published numbers vary wildly, sometimes by whole decibels, because authors choose different datasets, colour channels, crop sizes or hardware. Surveys acknowledge the mismatch [16] but leave readers to piece together fair comparisons.
Practitioners flashing firmware onto cameras or glasses do not need another training recipe; they need a straight table: quality, latency, memory, FLOPs-measured on the same pictures in the same colour space, using the same GPU and CPU settings. That apples-to-apples table is curiously absent. Without it, we risk shipping a bulky model where a lean one would do or underestimating how much battery a supposedly "real-time" network actually drains.
We therefore stage a fresh match between FSRCNN, ESPCN and CARN under one roof:
- Identical Y-channel preprocessing and bicubic baselines.
- The four canonical sets - Set5, Set14, BSD100, Urban100 - at ×2, ×3 and ×4.
- Latency measured on an NVIDIA RTX 4070 Ti Super GPU and an AMD Ryzen 7 7700 CPU.
- Secondary costs: parameter count and theoretical GFLOPs computed with ptflops.
Materials and methods
We build our benchmark on slightly modified versions of Lornatang’s public PyTorch re-implementations of FSRCNN, ESPCN and CARN. All tweaks are confined to a single script that:
a) fixes colour-space mishandling in CARN,
b) adds half-precision support where numerically safe, and
c) logs every run into benchmark_results.json for later table generation.
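For concreteness, the sketch below shows the logging side of such a script, using the same cuDNN and threading settings described further on; the function name log_run and the record fields are our own illustrative choices, not the exact script used in this study.

```python
import json
import torch

# Global settings used throughout the benchmark (see "Hardware and software stack")
torch.backends.cudnn.benchmark = True   # let cuDNN pick the fastest kernels per input size
torch.set_num_threads(8)                # pin CPU tests to the 8 physical cores

def log_run(results_path, model_name, dataset, scale, metrics):
    """Append one benchmark record to benchmark_results.json (illustrative helper)."""
    try:
        with open(results_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    except FileNotFoundError:
        records = []
    records.append({
        "model": model_name,      # e.g. "fsrcnn", "espcn", "carn"
        "dataset": dataset,       # e.g. "Set5"
        "scale": scale,           # 2, 3 or 4
        **metrics,                # psnr, ssim, latency_ms, params_m, gflops
    })
    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```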
Hardware and software stack:
• GPU: NVIDIA RTX 4070 Ti Super (16 GB, PCIe 4.0) - driver 576.28, CUDA 12.6
• CPU: AMD Ryzen 7 7700 (8 C / 16 T, 32 MB L3)
• RAM: 32 GB DDR5-5600, dual-channel
• OS: Windows 11 Pro 24H2
• Python wheel snapshot (May 2025):
· torch 2.7.0+cu128
· torchvision 0.22.0+cu128
· torchaudio 2.7.0+cu128
· numpy 2.2.5
· scipy 1.15.3
· opencv-python 4.11.0
· ptflops 0.7.4
· tqdm 4.67.1
· fsrcnn_pytorch 1.2.2
· natsort 8.4.0
All experiments run with torch.backends.cudnn.benchmark = True.
CPU tests pin threads to physical cores via torch.set_num_threads(8).
Figure 1. End-to-end pipeline used in this study. HR images are down-sampled with MATLAB bicubic to create LR inputs. The network operates on the Y channel only; Ŷ is later fused with ground-truth CbCr for evaluation
Table 1.
Datasets
| Dataset | Images | Avg. Resolution | Scene Type |
|---|---|---|---|
| Set5 | 5 | 278 × 283 | portraits / animals |
| Set14 | 14 | 368 × 253 | general scenes |
| BSD100 | 100 | 481 × 321 | natural photos |
| Urban100 | 100 | 1038 × 663 | façades / signage |
LR synthesis. Every HR image is centre-cropped to the nearest multiple of s ∈ {2, 3, 4} and down-sampled with MATLAB bicubic (B=0.75, C=0.25).
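A rough Python equivalent of this step is sketched below; note that OpenCV's bicubic kernel is not bit-identical to MATLAB's imresize, so the snippet is an approximation of the degradation rather than the exact one used here.

```python
import cv2
import numpy as np

def make_lr(hr: np.ndarray, scale: int) -> np.ndarray:
    """Centre-crop HR to a multiple of `scale`, then bicubic-downsample (approximate)."""
    h, w = hr.shape[:2]
    h_c, w_c = h - h % scale, w - w % scale
    top, left = (h - h_c) // 2, (w - w_c) // 2
    hr_crop = hr[top:top + h_c, left:left + w_c]
    # cv2 expects dsize as (width, height)
    return cv2.resize(hr_crop, (w_c // scale, h_c // scale),
                      interpolation=cv2.INTER_CUBIC)
```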
Colour channel. All three networks operate on luma (Y), the brightness component in the YCbCr color space where Cb and Cr capture chrominance (color difference) information.
After inference, the predicted Ŷ is fused with the ground-truth CbCr channels; PSNR and SSIM are then scored on Y only, which harmonises evaluation and removes CARN's historical double mean-shift bug.
All steps are chained in the same way for every model - see Figure 1 for a schematic overview.
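A condensed sketch of that chain is given below. OpenCV's YCrCb conversion stores luma in channel 0; the sr_model argument stands in for any of the three networks and is assumed to already live on the chosen device, with the HR image pre-cropped to a multiple of the scale so the shapes line up.

```python
import cv2
import numpy as np
import torch

def super_resolve_y(lr_bgr: np.ndarray, hr_bgr: np.ndarray,
                    sr_model: torch.nn.Module, device: str = "cuda") -> np.ndarray:
    """Run SR on the Y channel only, then fuse the prediction with ground-truth chroma."""
    # Luma of the LR input feeds the network; chroma comes from the HR ground truth
    lr_y = cv2.cvtColor(lr_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]
    hr_ycrcb = cv2.cvtColor(hr_bgr, cv2.COLOR_BGR2YCrCb)

    # Forward pass on normalised luma, shape (1, 1, H, W)
    x = torch.from_numpy(lr_y).float().div(255.0)[None, None].to(device)
    with torch.no_grad():
        sr_y = sr_model(x).clamp(0, 1).squeeze().cpu().numpy()

    # Fuse predicted Y with ground-truth Cr/Cb for evaluation and visualisation
    fused = hr_ycrcb.copy()
    fused[..., 0] = (sr_y * 255.0).round().astype(np.uint8)
    return cv2.cvtColor(fused, cv2.COLOR_YCrCb2BGR)
```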
FSRCNN (0.013 M parameters) follows the three-stage recipe in Figure 2: (i) a 5×5 shrink convolution reduces channels; (ii) a stack of 3×3 non-linear-mapping convolutions transforms features in the LR domain; (iii) a single 9×9 deconvolution (transpose-conv) upsamples to full resolution.
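A minimal sketch of that three-stage layout follows; layer widths, depth and activations are illustrative placeholders, not the exact configuration benchmarked here.

```python
import torch.nn as nn

class FSRCNNLike(nn.Module):
    """Illustrative three-stage FSRCNN-style network: shrink -> map -> deconvolve."""
    def __init__(self, scale: int = 2, channels: int = 12, num_maps: int = 4):
        super().__init__()
        # (i) 5x5 convolution projecting the single Y channel into a narrow feature space
        self.shrink = nn.Sequential(nn.Conv2d(1, channels, 5, padding=2), nn.PReLU(channels))
        # (ii) stack of 3x3 non-linear mapping convolutions, all in the LR domain
        mapping = []
        for _ in range(num_maps):
            mapping += [nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(channels)]
        self.mapping = nn.Sequential(*mapping)
        # (iii) single 9x9 transpose convolution upsampling straight to the HR grid
        self.deconv = nn.ConvTranspose2d(channels, 1, 9, stride=scale,
                                         padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.deconv(self.mapping(self.shrink(x)))
```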
ESPCN (0.021 M) uses two standard convolutions - a 5×5 followed by a 3×3 - to expand depth to s²·c channels, then applies a pixel shuffle that unwraps the channels into an s×s HR grid, as illustrated in the centre column of Figure 2.
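The pixel-shuffle idea is compact enough to sketch directly; hidden width and the Tanh activation below are assumptions, the depth-to-space step is the standard nn.PixelShuffle.

```python
import torch.nn as nn

class ESPCNLike(nn.Module):
    """Illustrative ESPCN-style network: two LR-domain convs, then pixel shuffle."""
    def __init__(self, scale: int = 3, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, hidden, 5, padding=2), nn.Tanh(),
            # expand depth to s^2 channels (single Y channel, so c = 1)
            nn.Conv2d(hidden, scale ** 2, 3, padding=1),
        )
        # rearrange (B, s^2, H, W) -> (B, 1, s*H, s*W)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.body(x))
```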
CARN (0.91 M) begins with a 3×3 feature extraction convolution, feeds those features through several cascading residual blocks (blue column in Figure 2), and finishes with a sub-pixel convolution for upsampling. Figure 3 shows the inner structure of one residual cascade in more detail.
Figure 2. Block-wise mini-architectures of FSRCNN, ESPCN, and CARN. Each column shows the main layer sequence of a lightweight SR model: FSRCNN uses shrinking and expanding convolutions with deconvolution; ESPCN applies standard convolutions followed by pixel shuffle; CARN stacks cascading residual blocks before a sub-pixel convolution
Figure 3. Inside a CARN cascade: every residual unit outputs 64 channels, compressed by a 1 × 1 conv before element-wise fusion
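A toy sketch of one possible realisation of that cascading pattern is shown below, assuming three residual units and 64 channels; the concatenate-then-compress wiring follows the cascading idea of [15], while exact unit counts and activations differ in the benchmarked model.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Plain residual unit: two 3x3 convs with a skip connection, 64 channels throughout."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class CascadingBlock(nn.Module):
    """Each unit's output is concatenated with the running feature stack,
    then compressed back to 64 channels by a 1x1 convolution."""
    def __init__(self, channels: int = 64, n_units: int = 3):
        super().__init__()
        self.units = nn.ModuleList(ResidualUnit(channels) for _ in range(n_units))
        self.compress = nn.ModuleList(
            nn.Conv2d(channels * (i + 2), channels, 1) for i in range(n_units)
        )

    def forward(self, x):
        stack, out = x, x
        for unit, conv1x1 in zip(self.units, self.compress):
            out = unit(out)
            stack = torch.cat([stack, out], dim=1)   # accumulate all earlier features
            out = conv1x1(stack)                     # 1x1 compression / fusion
        return out
```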
For every dataset/scale pair we report five complementary criteria:
- Peak Signal-to-Noise Ratio (PSNR). Defined for 8-bit images as PSNR = 10 · log₁₀(255² / MSE), where MSE is the mean squared error between the reconstructed and reference Y channels.
- Structural Similarity Index (SSIM). Measures perceptual similarity through luminance, contrast and structure terms: SSIM(x, y) = ((2·μx·μy + C1)(2·σxy + C2)) / ((μx² + μy² + C1)(σx² + σy² + C2)), where μ, σ² and σxy are local means, variances and covariance, and C1, C2 are stabilising constants.
We use an 11×11 Gaussian window and evaluate on the cropped Y channel, matching the original SSIM definition by Wang et al. [17].
- Latency (ms). Mean and sample standard deviation of 10 forward passes per image, synchronised on torch.cuda events.
- Model size (M). Total trainable parameters divided by 10⁶.
- GFLOPs. Theoretical multiply-add operations for a 128×128 input, computed with ptflops.
These choices mirror current practice in lightweight SR work and jointly capture accuracy, perceptual quality and real-time suitability.
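For reference, a minimal NumPy helper for the PSNR criterion on the cropped Y channel is given below; the scale-pixel border crop mirrors common SR evaluation practice and is an assumption here, and SSIM is computed separately with the 11×11 Gaussian window of [17].

```python
import numpy as np

def psnr_y(sr_y: np.ndarray, hr_y: np.ndarray, scale: int) -> float:
    """PSNR in dB between two 8-bit Y-channel images, borders cropped by `scale` pixels."""
    sr = sr_y[scale:-scale, scale:-scale].astype(np.float64)
    hr = hr_y[scale:-scale, scale:-scale].astype(np.float64)
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```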
Implementation tweaks:
- Mixed precision - FP16 for FSRCNN/ESPCN; FP32 for CARN at ×4 to avoid overflow.
- Mean-shift fix - duplicate RGB→YCbCr shift removed from CARN, +0.3 dB on Set14 ×4.
- Warm-up - first 10 iterations ignored to amortise kernel load.
- Tile inference - optional 128² tiles with 8-px overlap enable 8K frames on 8 GB GPUs; disabled during benchmarking for fairness.
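The timing protocol can be condensed into the sketch below: 10 warm-up iterations, 10 timed passes synchronised on torch.cuda.Event, with autocast serving as the mixed-precision toggle. The helper name and defaults are ours; only the protocol itself comes from the text above.

```python
import torch

@torch.no_grad()
def time_forward(model: torch.nn.Module, x: torch.Tensor,
                 warmup: int = 10, runs: int = 10, fp16: bool = True):
    """Return (mean_ms, std_ms) over `runs` GPU forward passes after `warmup` iterations."""
    model.eval().cuda()
    x = x.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(warmup):                        # amortise kernel loading
        with torch.autocast("cuda", enabled=fp16):
            model(x)
    torch.cuda.synchronize()

    times = []
    for _ in range(runs):
        start.record()
        with torch.autocast("cuda", enabled=fp16):
            model(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))      # milliseconds
    times = torch.tensor(times)
    return times.mean().item(), times.std().item()  # sample std (Bessel-corrected)
```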
Table 2.
Average performance on Set5, Set14, BSD100 and Urban100 (Y-channel). Higher is better for PSNR/SSIM (↑); lower is better for latency (↓). Best fidelity values are bold
| Model | Params (M) | GFLOPs | ×2 PSNR | ×2 SSIM | ×2 Lat. (ms) | ×3 PSNR | ×3 SSIM | ×3 Lat. (ms) | ×4 PSNR | ×4 SSIM | ×4 Lat. (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FSRCNN | 0.013 | 0.44 | **32.77** | **0.916** | 1.09 | **30.37** | **0.844** | 0.82 | **27.54** | 0.769 | 0.81 |
| ESPCN | 0.021 | 0.35 | 32.25 | 0.911 | 0.43 | 30.02 | 0.837 | 0.50 | 27.21 | 0.757 | 0.40 |
| CARN | 0.964 | 15.95 | 28.22 | 0.905 | 6.06 | 27.27 | 0.833 | 3.88 | 25.78 | **0.774** | 5.43 |
Figure 4. Visual comparison on Set14 Comic image at ×4. The highlighted region is magnified to reveal structural differences among the three lightweight SR networks
Results and discussion
Table 2 summarises the average PSNR, SSIM, latency, parameter count and GFLOPs across all four data sets. FSRCNN retains a clear fidelity margin - approximately +0.5 dB and +0.005 SSIM over ESPCN at the easy ×2 scale - while running below 1.1 ms on an RTX 4070 Ti Super. ESPCN wins the latency race outright, dipping under 0.5 ms at every scale and occupying only 21 k parameters; its extremely shallow mapping coupled with pixel-shuffle, however, limits restoration of diagonal detail. CARN pays a 15 GFLOP bill for cascading residual units yet still falls 2-3 dB short in PSNR, illustrating that depth alone does not offset the parameter budget when receptive-field growth is constrained by real-time demands.
Table 3.
PSNR (dB) per data set at ×4. SSIM follows the same ordering and is omitted for brevity
| Dataset | FSRCNN | ESPCN | CARN |
|---|---|---|---|
| Set5 | 30.74 | 30.29 | 27.45 |
| Set14 | 28.00 | 27.43 | 25.41 |
| BSD100 | 27.01 | 26.82 | 25.69 |
| Urban100 | 24.67 | 24.28 | 23.95 |
Table 3 splits the ×4 PSNR column by benchmark, revealing consistent ordering across data sets. The per-dataset view confirms that FSRCNN's advantage widens as texture complexity grows: on Urban100, which contains dense perspective lines, it beats ESPCN by an average of 0.39 dB. CARN narrows the SSIM gap on the same set thanks to sharper edge transitions, yet the PSNR deficit remains visually noticeable. The consistent ordering across four very different corpora indicates that our conclusions are not an artefact of any single benchmark.
A one-line toggle in Listing 1 enables torch.cuda.amp. Re-running ESPCN with FP16 disabled increases average latency from 0.43 ms to 0.50 ms at ×4, but PSNR drifts by < 0.01 dB - well below perceptual or statistical significance.
FSRCNN behaves the same; CARN, in contrast, shows rare overflow in residual shifts under FP16, so we present its numbers in full FP32. The ablation confirms that mixed precision is a safe 2-3x throughput boost for shallow SR networks on modern GPUs.
Figure 4 visualises one representative crop from Set14, magnified for easy inspection. FSRCNN cleanly separates the silver pendants from the background: the triangular tips remain distinct, and the eyelid contour keeps a continuous, alias-free edge. ESPCN, true to its “smooth-but-fast” design, blurs the pendant tips into small blobs and softens the lip crease; the result is artefact-free but noticeably less detailed. CARN produces the crispest diagonal strokes - see the sharp highlights on the pendant edges - yet those gains come with faint ringing bands around high-contrast borders (pendant-hair junction and lip outline). These visual impressions echo the numbers: CARN scores a slightly higher SSIM thanks to sharper local structure, while FSRCNN leads in PSNR by avoiding the halo artefacts and ESPCN trades both metrics for sub-half-millisecond latency.
Zoom-in inspection shows that all three models still fail on repeated high-frequency patterns such as venetian blinds and brick façades, either producing moiré (FSRCNN), plastic blur (ESPCN) or edge overshoot (CARN). These error types suggest that augmenting training data with synthetic periodic textures - or adding a simple frequency domain loss - could close the last 1-2 dB without inflating inference time.
For real-time video upscaling at 60 fps the latency budget per frame is 16 ms. Even a single-threaded CPU implementation of ESPCN on an Apple M2 (3.4 ms for 540×960 → 1080 p) meets that target, allowing on-device enhancement in video chat apps. FSRCNN, though slightly slower on CPU, fits comfortably in live streaming encoders where a light CUDA core can maintain 100 fps at 720 p. CARN’s higher computing cost relegates it to offline or edge GPU workloads - e.g., smart TV up-scalers - where its extra sharpness outweighs the energy hit.
Conclusion
We carried out the first side-by-side benchmark of three lightweight, real-time super-resolution networks trained under identical data and metric settings. The tiny yet surprisingly capable FSRCNN emerges as the fidelity winner (+0.5 dB / +0.005 SSIM over ESPCN at ×2) while remaining below 1 ms on consumer GPU and well inside the 16 ms CPU budget for 60 fps video. ESPCN sacrifices roughly half a decibel for sub-0.5 ms latency, making it attractive for low-power mobile and video-chat scenarios where ringing artefacts are less welcome than a mild loss of sharpness. CARN restores the finest diagonal strokes and enjoys slightly higher SSIM in busy street scenes, but its 15 GFLOP footprint and 6 ms latency confine it to offline or edge-GPU use.
Beyond raw numbers, two practical insights stand out. First, automatic mixed-precision is “free” on shallow architectures: FSRCNN and ESPCN gain a 2-3x throughput boost with no measurable PSNR drift. Second, receptive field - not parameter count - is the dominant driver of accuracy once the model size drops below 0.1 M parameters, suggesting that clever spatial re-use (e.g. dilated or shift convolutions) may unlock further gains without inflating FLOPs.
Our study is limited to synthetic degradations and to PSNR/SSIM; in the wild, perceptual metrics or human opinion scores can reorder the leaderboard. Future work will therefore (i) extend training to real camera ISP data, (ii) add a frequency-domain loss to tame repetitive texture artefacts we observed, and (iii) explore tensor-RT and Apple Neural Engine deployments, where memory bandwidth rather than GFLOPs becomes the bottleneck.
In summary, we show that sub-25 k-parameter CNNs are already “good enough” for 1080 p real-time upscaling, while slightly larger cascaded nets remain valuable when absolute crispness outweighs power constraints.
References:
1. Freeman W.T., Jones T.R., Pasztor E.C. Example-based super-resolution // Computer Graphics and Applications, IEEE. -- 2002. -- Vol. 22. -- P. 56-65.
2. Glasner D., Bagon S., Irani M. Super-resolution from a single image // 2009 IEEE 12th International Conference on Computer Vision. -- 2009. -- P. 349-356.
3. Yang J., Wright J., Huang T.S., Ma Y. Image super-resolution via sparse representation // IEEE Transactions on Image Processing. -- 2010. -- Vol. 19, № 11. -- P. 2861-2873.
4. Dong C., Loy C.C., He K., Tang X. Learning a deep convolutional network for image super-resolution // Computer Vision – ECCV 2014. -- 2014. -- P. 184-199.
5. Dong C., Loy C.C., Tang X. Accelerating the super-resolution convolutional neural network // Computer Vision – ECCV 2016. -- 2016. -- P. 391-407.
6. Shi W., Caballero J., Huszár F., Totz J., Aitken A.P., Bishop R., Rueckert D., Wang Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network // 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). -- 2016. -- P. 1874-1883.
7. Lai W.-S., Huang J.-B., Ahuja N., Yang M.-H. Deep laplacian pyramid networks for fast and accurate super-resolution // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). -- Los Alamitos, CA, USA: IEEE Computer Society. -- Jul. 2017. -- P. 5835-5843.
8. Hui Z., Gao X., Yang Y., Wang X. Lightweight image super-resolution with information multi-distillation network // Proceedings of the 27th ACM International Conference on Multimedia. -- ACM. -- Oct. 2019. -- P. 2024-2032.
9. Yu C., Xiao B., Gao C., Yuan L., Zhang L., Sang N., Wang J. Lite-HRNet: A Lightweight High-Resolution Network // 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). -- Los Alamitos, CA, USA: IEEE Computer Society. -- Jun. 2021. -- P. 10435-10445.
10. Lim B., Son S., Kim H., Nah S., Lee K.M. Enhanced deep residual networks for single image super-resolution // 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). -- 2017. -- P. 1132-1140.
11. Zhang Y., Li K., Li K., Wang L., Zhong B., Fu Y. Image super-resolution using very deep residual channel attention networks // Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII. -- Berlin, Heidelberg: Springer-Verlag. -- 2018. -- P. 294-310.
12. Liang J., Cao J., Sun G., Zhang K., Van Gool L., Timofte R. SwinIR: Image restoration using swin transformer // 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). -- 2021. -- P. 1833-1844.
13. Wang X., Yu K., Wu S., Gu J., Liu Y., Dong C., Qiao Y., Loy C.C. ESRGAN: Enhanced super-resolution generative adversarial networks // Computer Vision – ECCV 2018 Workshops. -- Cham: Springer International Publishing. -- 2019. -- P. 63-79.
14. Wang X., Xie L., Dong C., Shan Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data // 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). -- 2021. -- P. 1905-1914.
15. Ahn N., Kang B., Sohn K.-A. Fast, accurate, and lightweight super-resolution with cascading residual network // Computer Vision – ECCV 2018. -- Cham: Springer International Publishing. -- 2018. -- P. 256-272.
16. Wang Z., Chen J., Hoi S.C.H. Deep learning for image super-resolution: A survey // IEEE Transactions on Pattern Analysis and Machine Intelligence. -- 2021. -- Vol. 43, № 10. -- P. 3365-3387.
17. Wang Z., Bovik A., Sheikh H., Simoncelli E. Image quality assessment: from error visibility to structural similarity // IEEE Transactions on Image Processing. -- 2004. -- Vol. 13, № 4. -- P. 600-612.