ROBUST RETINAL AI: DIAGNOSTIC ACCURACY UNDER REAL-WORLD IMAGING CONSTRAINTS

Issayev A., Ziro A.
DOI: 10.32743/UniTech.2025.134.5.20125

 

ABSTRACT

This research quantifies how common image-quality artefacts affect automated screening for referable diabetic retinopathy (DR) and evaluates three mitigation strategies.

The experimental workflow used a ResNet-50 network pre-trained on ImageNet, fine-tuned with 35 000 EyePACS fundus photographs, and then tested—without further adjustment—on Messidor-2, APTOS 2019 and IDRiD after six synthetic degradations had been applied.

On clean EyePACS images the baseline achieved AUC = 0.904 (sensitivity = 86.8 %); severe blur (σ = 5 px) and 75 % resolution loss reduced AUC by 0.062 and 0.074, respectively. Aggressive augmentation recovered up to 65 % of the lost accuracy for blur and down-sampling, PGD adversarial fine-tuning mainly improved tolerance to exposure shifts, and a quality-aware inference gate produced the fewest false negatives (107 vs 135 per 10 000 screens) by deferring 8 % of degraded inputs.


 

Keywords: diabetic retinopathy, fundus photography, deep learning, robustness, image quality, tele-ophthalmology.


 

Introduction

Diabetic retinopathy (DR) is a leading cause of vision impairment and blindness in working-age adults worldwide [15]. In 2020, an estimated 103 million people had DR globally, and this number is projected to rise to around 160 million by 2045 due to the growing diabetes prevalence [1, 2]. Early detection and treatment of DR are critical, as timely laser photocoagulation or pharmacotherapy can prevent most vision loss [22]. For this reason, clinical guidelines recommend that individuals with diabetes receive annual dilated fundus exams for DR screening [23]. However, achieving high screening coverage remains a challenge. Many patients do not undergo recommended yearly eye exams because DR is often asymptomatic until late stages, and practical barriers such as limited access to ophthalmologists and lack of awareness impede compliance [16]. These gaps in screening contribute to late diagnoses and preventable blindness.

In response, there has been intense interest in automated DR screening using deep learning [3]. Convolutional neural networks (CNNs) trained on large fundus photograph datasets have achieved expert-level accuracy in detecting referable DR from images [5, 6]. Notably, Gulshan et al. (2016) developed a deep learning algorithm that detected referable DR with high sensitivity and specificity (around 90% or greater) in a clinical validation study [4]. Likewise, Ting et al. (2017) reported a deep learning system with 90.5% sensitivity and 91.6% specificity for referable DR across multiethnic populations [6]. Such performance approaches that of retinal specialists and certified graders, marking a major advance in screening capabilities. Deep CNN models, including architectures like ResNet, have become a common foundation for DR detection algorithms [7]. Over the past few years, multiple groups have validated AI systems for DR screening, and at least two have obtained regulatory approval for autonomous clinical use [8]. For example, the first FDA-approved DR screening device (IDx-DR) demonstrated 87% sensitivity and 91% specificity in a pivotal prospective trial in primary care settings [5]. These successes underscore the potential of deep learning to expand DR screening coverage and alleviate the workload on human graders.

Despite these encouraging results, important limitations emerge when such algorithms are deployed in real-world conditions. A major concern is the robustness of diagnostic performance under the variability of real-world imaging constraints, including differences in patient populations, acquisition devices, and image quality [11]. Deep learning models trained on one dataset often exhibit reduced accuracy on external datasets due to domain shifts. In a multicenter study evaluating seven commercial AI models on 25,000+ real-world fundus images, sensitivities ranged from 50.9% up to 85.9%, and several algorithms performed no better than human graders on an adjudicated reference standard [8]. This variability highlights how generalizability can be limited when algorithms encounter new distributions of images outside their development set. Another study cautioned that AI systems may underperform in racial or ethnic groups that were underrepresented in the training data, calling for thorough external validation to ensure equity [10]. Indeed, an evaluation of a deep learning DR detector in an Indigenous Australian population found it maintained high sensitivity but had a small drop in specificity compared to specialists [10], illustrating both the promise and the need for careful tuning in different settings.

Real-world screening also entails a high proportion of suboptimal images, which can challenge AI diagnostic accuracy [12, 13]. In routine clinical practice, images may be out-of-focus, motion-blurred, low in resolution, or poorly illuminated due to patient or equipment factors. These quality issues are significant – for example, a recent nationwide screening program reported that the deep learning system had to flag roughly 14–15% of patient images as ungradable primarily because of insufficient image quality (such as severe blur or improper exposure) [9]. Ungradable cases must be referred for repeat imaging or human evaluation, delaying diagnosis. Other studies have noted that poor-quality images can not only be ungradable but may also lead to misclassification – one report found that introducing severe image degradation resulted in a spike in false positives from an AI model [18]. Thus, robustness to image quality variation is a key concern. If an algorithm cannot reliably handle the kinds of imperfect images encountered in real clinics (for instance, a slight cataract-induced blur or low-light artifact), its real-world utility will be limited.

To address these challenges, researchers have begun exploring strategies to improve the robustness of DR deep learning models under less-than-ideal conditions [14]. A straightforward approach is to greatly expand training data diversity through data augmentation and multi-domain training. By exposing the model to a wide range of variations – different cameras, ethnic patient cohorts, and simulated image corruptions – the model can learn invariant features that generalize better [16]. For example, applying random rotations, blurring, noise, and brightness shifts during training is common to mimic real-world noise and thus improve tolerance to those factors. Another approach is adversarial training, in which the model is intentionally challenged with small perturbations or synthetic “attack” images during training so that it learns to resist being fooled. Recent work has demonstrated that adversarial training combined with feature fusion can make DR classifiers significantly more resilient to noise-based attacks, preserving accuracy above 99% even on perturbed images [19]. Additionally, some groups incorporate an explicit image quality assessment step into the diagnostic pipeline [20]. In such a quality-aware system, a neural network (or other algorithm) first evaluates whether a given fundus image is of sufficient quality for analysis; images deemed too blurred or obscured can be flagged for retake or handled with special preprocessing, while only good-quality images are fed to the DR classification model. This two-stage approach can prevent the model from making unsound predictions on ungradable images and has been shown to boost overall screening efficiency when implemented in hybrid human-AI workflows [9]. Other methods under investigation include domain adaptation techniques that align feature distributions from different image sources to improve cross-domain performance [16], and uncertainty estimation techniques that enable the model to express low confidence when inputs are far from the training manifold.

While deep learning has achieved impressive accuracy in detecting diabetic retinopathy, there is a pressing need to ensure these models remain robust under real-world imaging constraints. The prevalence of DR and the shortage of specialists make automated screening highly appealing, but issues of image quality and generalizability must be overcome for AI to realize its full clinical potential [8]. Robust retinal AI systems should maintain high diagnostic accuracy not just in ideal circumstances but also when faced with the practical realities of clinic-acquired images. The present study is motivated by this need: we systematically evaluate the diagnostic performance of a state-of-the-art DR detection model (a CNN based on ResNet-50) under various real-world image degradation scenarios and explore techniques to enhance its robustness. We focus on the binary classification of referable DR (clinically significant disease requiring referral) versus non-referable DR, as this distinction is crucial for screening triage. By testing the model on multiple representative fundus image datasets and introducing controlled degradations (simulating blur, low resolution and exposure issues), we quantify how performance is affected and which training or inference-time interventions best mitigate those effects. Through this work, we seek to identify practical strategies for building a robust retinal AI system for DR screening that remains accurate even when image inputs are less than perfect.

Materials and Methods

Four publicly available colour-fundus repositories were consolidated to maximise diversity and to test generalisability in real-world settings: EyePACS, Messidor-2, APTOS 2019 and IDRiD. EyePACS, released through the Kaggle Diabetic Retinopathy Detection challenge, supplies approximately 35 000 macula-centred images captured under heterogeneous clinical conditions; these images formed the entire training and internal-validation pool. Messidor-2 extends the original French Messidor study and contributes 1 748 higher-resolution photographs from 874 screening examinations; its expert-graded subset of 1 744 images served as an external test set to evaluate cross-population transfer [16]. APTOS 2019 adds 3 662 South-Asian images labelled by board-certified ophthalmologists, providing an intermediate validation cohort drawn from a different geographic and camera mix [17]. Finally, the IDRiD database supplies 516 high-clarity images with reference-standard labels for both DR stage and macular-oedema risk, enabling fine-grained error analysis [16].

All photographs depict the posterior pole with 30°–45° fields of view centred on the macula and optic disc. Each image was resized to 224 × 224 pixels, colour-channel normalised, and saved in PNG format; no quality-filtering or manual curation was applied so that the model would encounter realistic artefacts. The key characteristics of the four datasets, including image counts and DR-grade distributions, are summarised in Table 1, while Figure 1 visualises the grade balance after binarisation.

Table 1.

Dataset summary

| Dataset | Images | Acquisition site | Camera type | Non-Referable | Referable |
|---|---|---|---|---|---|
| EyePACS | 35 000 | California, USA | Various clinical | 21 300 | 13 700 |
| Messidor-2 | 1 744 | France | Topcon TRC-NW6 | 1 130 | 614 |
| APTOS 2019 | 3 662 | South Asia | Multiple vendors | 2 197 | 1 465 |
| IDRiD | 516 | India | Canon CR2 | 216 | 300 |

 

Figure 1. Class Distribution After Label Harmonisation

 

The classification target was framed as a binary decision: referable DR (positive screen) versus non-referable DR (negative screen). Following the International Clinical Diabetic Retinopathy (ICDR) scale, referable DR was defined as moderate non-proliferative DR or worse (ICDR levels 2, 3, 4) or any surrogate indicator of vision-threatening disease; non-referable covered no DR and mild DR (ICDR levels 0, 1) [16]. Dataset-specific numeric or descriptive grades were mapped accordingly. Thus, EyePACS and APTOS labels 0–1 became non-referable, labels 2–4 referable; Messidor-2 grades R0–R1 were non-referable, R2–R3 referable; IDRiD grades followed the same dichotomy. All mappings were cross-checked against published definitions of referable DR to ensure inter-dataset consistency [16]. This harmonised labelling guarantees that a positive prediction always corresponds to a patient who should be referred for ophthalmic evaluation, while a negative prediction indicates routine follow-up at the next screening interval.
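For illustration, a minimal sketch of this label harmonisation is given below (Python). The helper names and the Messidor-2 grade strings are ours; the public label files use dataset-specific column formats that must be adapted accordingly.

```python
# Minimal sketch of the label harmonisation described above.
# Function and mapping names are illustrative, not taken from any dataset API.

def to_referable(icdr_grade: int) -> int:
    """Map an ICDR 0-4 grade to the binary screening target.

    0, 1    -> 0 (non-referable: no DR or mild NPDR)
    2, 3, 4 -> 1 (referable: moderate NPDR or worse)
    """
    return int(icdr_grade >= 2)

# Messidor-2 uses R0-R3 retinopathy grades; R2-R3 are treated as referable.
MESSIDOR_MAP = {"R0": 0, "R1": 0, "R2": 1, "R3": 1}

assert to_referable(1) == 0 and to_referable(2) == 1
```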

A 50-layer residual network (ResNet-50) was adopted as the baseline classifier because residual skip-connections enable stable optimisation of deep image models and the architecture has a strong track record in ophthalmic imaging studies [7, 21]. ImageNet-pre-trained weights were transferred to accelerate convergence; the original 1 000-way softmax layer was replaced by a single sigmoid neuron that outputs the probability of referable diabetic retinopathy. Binary cross-entropy served as the loss function. The full set of training hyper-parameters is listed in Table 2.
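A minimal PyTorch sketch of this baseline setup is shown below, assuming torchvision ≥ 0.13 for the pretrained-weights API; BCEWithLogitsLoss is used as a numerically stable equivalent of a sigmoid output trained with binary cross-entropy.

```python
# Sketch of the baseline classifier: ImageNet-pretrained ResNet-50 with the
# 1000-way head replaced by a single logit for referable DR.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)   # single output neuron

# BCEWithLogitsLoss applies the sigmoid internally, which is numerically
# safer than a separate nn.Sigmoid followed by nn.BCELoss.
criterion = nn.BCEWithLogitsLoss()

logits = model(torch.randn(2, 3, 224, 224))     # dummy batch of two images
prob_referable = torch.sigmoid(logits)          # probability of referable DR
```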

Table 2.

Training hyper-parameters for the ResNet-50 baseline

| Hyper-parameter | Setting |
|---|---|
| Initial optimiser | Stochastic gradient descent (momentum = 0.9) |
| Initial learning rate | 0.001 |
| LR schedule | Reduce-on-plateau (factor 0.1, patience = 3 epochs) |
| Weight decay | 1 × 10⁻⁴ |
| Batch size | 32 images |
| Epochs (max / early-stop) | 30 / 14 |
| Data augmentations | rotation ±30°, horizontal flip, Gaussian blur (σ ≤ 1.0), brightness/contrast ±20 %, random resized crop (80–100 %) |

 

EyePACS supplied the development corpus. Images were stratified by International Clinical DR grade and split 80 % / 20 % into training and in-house validation sets. Stochastic gradient descent with momentum (initial learning rate 0.001, momentum 0.9, weight decay 1 × 10⁻⁴) optimised the network. The learning rate was reduced by a factor of ten if validation AUC failed to improve for three consecutive epochs; training stopped when no gain was observed for six epochs. Checkpointing retained the weight file with the highest validation AUC. Training and validation AUC curves are plotted in Figure 2, where the dashed vertical line marks the early-stopping epoch.
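The optimisation schedule can be sketched as follows; train_one_epoch, evaluate_auc, the data loaders and the checkpoint file name are placeholders for routines not shown here.

```python
# Sketch of the schedule in Table 2: SGD with momentum, LR reduced on a
# validation-AUC plateau, early stopping, and best-AUC checkpointing.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)   # monitors validation AUC

best_auc, epochs_without_gain = 0.0, 0
for epoch in range(30):                               # 30-epoch ceiling
    train_one_epoch(model, train_loader, optimizer, criterion)   # assumed helper
    val_auc = evaluate_auc(model, val_loader)                     # assumed helper
    scheduler.step(val_auc)
    if val_auc > best_auc:
        best_auc, epochs_without_gain = val_auc, 0
        torch.save(model.state_dict(), "best_resnet50.pt")       # keep best weights
    else:
        epochs_without_gain += 1
        if epochs_without_gain >= 6:                  # stop after six stagnant epochs
            break
```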

 

Figure 2. Training and validation AUC curves across epochs

 

To combat over-fitting and mimic real-world photographic variation, on-the-fly augmentation was applied: random rotations up to ±30°, horizontal flips, Gaussian blur with σ ≤ 1.0, brightness/contrast jitter (±20 %), and random resized crops covering 80–100 % of the field of view. These operations expanded the effective sample space and improved robustness to orientation errors, mild defocus and illumination shifts reported in screening workflows [16]. After fourteen epochs the best model achieved an AUC of 0.904 on the EyePACS validation split, with sensitivity 0.868 at 90 % specificity.
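One possible torchvision realisation of this augmentation pipeline is sketched below; the ImageNet normalisation statistics and the 5-pixel blur kernel are assumptions, since the text specifies only the augmentation ranges.

```python
# Sketch of the on-the-fly training augmentations listed above.
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # 80-100 % crop
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),                               # ±30 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # ±20 %
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.0)),    # sigma <= 1.0
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],             # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```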

The frozen ResNet-50 was then applied—without further tuning—to Messidor-2, APTOS 2019 and IDRiD. For each dataset the predicted probabilities were compared with ground-truth labels to compute sensitivity, specificity, accuracy and receiver-operating-characteristic area. This protocol mirrors a real-deployment scenario in which a pre-validated model encounters new clinical data from different cameras and patient populations.
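A sketch of this evaluation protocol, with the operating threshold chosen on the EyePACS validation split at 90 % specificity and then held fixed for the external sets, might look as follows (scikit-learn assumed).

```python
# Sketch of external evaluation: fixed operating threshold from validation data,
# then AUC / sensitivity / specificity on each external test set.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def threshold_at_specificity(y_val, p_val, target_spec=0.90):
    """Pick the decision threshold giving ~90 % specificity on validation data."""
    fpr, tpr, thr = roc_curve(y_val, p_val)
    ok = np.where(1 - fpr >= target_spec)[0]
    return thr[ok[-1]]          # loosest threshold that still meets the target

def evaluate(y_true, p_pred, threshold):
    y_hat = (p_pred >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    return {"auc": roc_auc_score(y_true, p_pred),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}
```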

To evaluate robustness under conditions common in primary-care screening and tele-ophthalmology, three degradation families were applied to every image in each external test set while ground-truth labels remained unchanged.

Blur - Out-of-focus capture was emulated with Gaussian convolution at two severities: σ = 2 px (mild defocus) and σ = 5 px (pronounced blur). These levels approximate the range reported for hand-held fundus cameras used in community outreach.

Low resolution - To mimic lower-megapixel hardware or digital zoom, images were down-sampled to 50 % and 25 % of their native side length and then up-sampled to original dimensions with bilinear interpolation, removing fine detail but preserving global structure.

Illumination extremes - Under-exposure and over-exposure were simulated by scaling pixel intensities. A factor of 0.5 darkened images to model weak flash and vignetting; a factor of 1.5 brightened images, producing highlight saturation. No gamma adjustment was applied so that contrast loss remained severe.
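The three degradation families can be reproduced with a few lines of OpenCV/NumPy; the sketch below follows the parameters stated above (σ = 2 or 5, 50 % or 25 % of the side length, 0.5× or 1.5× intensity scaling).

```python
# Sketch of the three synthetic degradation families (OpenCV/NumPy assumed).
import cv2
import numpy as np

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Out-of-focus simulation: sigma = 2 (mild) or 5 (pronounced)."""
    return cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)

def downsample(img: np.ndarray, keep: float) -> np.ndarray:
    """Resolution loss: shrink to `keep` of the side length (0.5 or 0.25),
    then bilinearly upsample back to the original size."""
    h, w = img.shape[:2]
    small = cv2.resize(img, (int(w * keep), int(h * keep)), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)

def exposure(img: np.ndarray, factor: float) -> np.ndarray:
    """Illumination extremes: 0.5 darkens, 1.5 brightens; no gamma correction."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```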

For each corruption the frozen ResNet-50 generated referable-probability scores. Standard performance metrics—including area under the ROC curve (AUC) and sensitivity—were then recalculated to quantify the effect of each artefact.

Having quantified the baseline model’s vulnerability to blur, resolution loss and illumination extremes, we explored three remedial approaches and benchmarked them side by side (Table 3).

Table 3.

AUC / Sensitivity (%) of the four robustness strategies under clean and degraded conditions

| Condition | Baseline | Aggressive-Aug | Adversarial FT | Quality-Aware* |
|---|---|---|---|---|
| Clean reference | 0.904 / 86.8 | 0.908 / 87.5 | 0.906 / 87.1 | 0.904 / 86.8 |
| Blur σ = 2 px | 0.885 / 83.4 | 0.897 / 85.9 | 0.892 / 85.1 | 0.899 / — |
| Blur σ = 5 px | 0.842 / 74.6 | 0.868 / 79.2 | 0.856 / 77.5 | 0.871 / — |
| Resolution ↓ 50 % | 0.876 / 81.9 | 0.889 / 84.2 | 0.882 / 83.6 | 0.886 / — |
| Resolution ↓ 75 % | 0.830 / 73.1 | 0.854 / 77.8 | 0.842 / 75.9 | 0.860 / — |
| Darkened 0.5× | 0.882 / 81.5 | 0.890 / 83.0 | 0.887 / 82.4 | 0.892 / — |
| Brightened 1.5× | 0.861 / 78.7 | 0.874 / 81.3 | 0.868 / 80.0 | 0.880 / — |

* Sensitivity is not reported for degraded conditions under the quality-aware strategy because the gate defers a share of inputs (≈ 8 %), so values computed on the accepted subset are not directly comparable with the other columns.

 

Augmented retraining - The first intervention was an “aggressive-aug” curriculum in which every mini-batch had a 50 % chance of receiving one of the previously defined corruptions. All other hyper-parameters were left unchanged. By repeatedly exposing the network to imperfect inputs, the optimiser was encouraged to learn lesion cues that persist even when edges soften or contrast fades.
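A sketch of this curriculum is given below; it reuses the degradation helpers from the earlier sketch (module name assumed) and applies one randomly chosen corruption to a whole mini-batch with probability 0.5.

```python
# Sketch of the "aggressive-aug" curriculum: each mini-batch has a 50 % chance
# of receiving one of the previously defined corruptions.
import random
from degradations import gaussian_blur, downsample, exposure  # assumed module name

CORRUPTIONS = [
    lambda im: gaussian_blur(im, sigma=2),
    lambda im: gaussian_blur(im, sigma=5),
    lambda im: downsample(im, keep=0.5),
    lambda im: downsample(im, keep=0.25),
    lambda im: exposure(im, factor=0.5),
    lambda im: exposure(im, factor=1.5),
]

def corrupt_batch(images):
    """With probability 0.5, apply one randomly chosen corruption to the batch."""
    if random.random() < 0.5:
        corrupt = random.choice(CORRUPTIONS)
        images = [corrupt(im) for im in images]
    return images
```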

Adversarial fine-tuning - We next adopted a projected-gradient-descent (PGD) scheme to generate adversarial fundus images that fooled the baseline ResNet-50. These perturbed samples (ε = 4/255 in the RGB space, 10 PGD steps) were mixed 1 : 1 with clean images during an additional five-epoch fine-tuning run. Prior work suggests that adversarial training can also raise tolerance to natural noise [21]; our goal was to test that claim in a retinal-imaging context.
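The PGD attack used for fine-tuning can be sketched as follows; the step size (1/255) and the random start are assumptions, as only ε = 4/255 and 10 steps are specified above.

```python
# Sketch of PGD adversarial example generation for the 1:1 clean/adversarial mix.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """L-infinity PGD; x in [0, 1], y is a float tensor shaped like the logits (N, 1)."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start inside the ball
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()       # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# During the five-epoch fine-tune, each batch is doubled: clean + adversarial, e.g.
#   x_mix = torch.cat([x, pgd_attack(model, x, y)]); y_mix = torch.cat([y, y])
```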

Quality-aware inference - Finally, a lightweight MobileNetV2 classifier was trained to label images as “gradable” or “ungradable” based on human-annotated focus, illumination and field-of-view criteria [9, 20]. At inference time each fundus photograph passed through this filter; only gradable cases were forwarded to the DR network, while ungradable ones triggered an automated retake prompt. This gate was calibrated to 95 % sensitivity for detecting truly ungradable frames on a held-out quality dataset.
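A sketch of the two-stage gate is shown below; the decision threshold is a placeholder that would in practice be calibrated to 95 % sensitivity for ungradable frames on the held-out quality set.

```python
# Sketch of the quality-aware inference gate: a MobileNetV2 gradability filter
# placed in front of the DR classifier.
import torch
import torch.nn as nn
from torchvision import models

quality_net = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
quality_net.classifier[1] = nn.Linear(quality_net.last_channel, 1)  # "ungradable" logit

UNGRADABLE_THRESHOLD = 0.30   # placeholder; calibrated on the held-out quality dataset

def screen(image_tensor, dr_model):
    """Return a referable-DR probability, or None to trigger a retake prompt."""
    with torch.no_grad():
        p_ungradable = torch.sigmoid(quality_net(image_tensor)).item()
        if p_ungradable >= UNGRADABLE_THRESHOLD:
            return None                                   # defer: prompt recapture
        return torch.sigmoid(dr_model(image_tensor)).item()
```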

Evaluation protocol - All models were re-evaluated on the same degraded test suites. Sensitivity to referable DR under each corruption condition remained the primary outcome. Paired bootstrap resampling (10 000 replicates) produced 95 % confidence intervals, and McNemar tests assessed significance of paired sensitivity differences between strategies.
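The statistical comparison can be sketched as follows (NumPy and statsmodels assumed); the bootstrap resamples positive cases to obtain a sensitivity CI, and the McNemar test operates on the paired correct/incorrect calls of two strategies on the same positives.

```python
# Sketch of the evaluation statistics: bootstrap CI for sensitivity and a
# McNemar test on paired predictions.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def bootstrap_sensitivity_ci(y_true, y_hat, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y_true == 1)[0]
    sens = [np.mean(y_hat[rng.choice(pos, size=len(pos), replace=True)])
            for _ in range(n_boot)]
    return np.percentile(sens, [2.5, 97.5])      # 95 % CI for sensitivity

def mcnemar_p(y_true, y_hat_a, y_hat_b):
    """Compare two strategies on the same positive cases (paired sensitivity)."""
    pos = y_true == 1
    a_ok, b_ok = (y_hat_a[pos] == 1), (y_hat_b[pos] == 1)
    table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    return mcnemar(table, exact=False, correction=True).pvalue
```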

Results and discussion

Before stress-testing the network, the frozen ResNet-50 was applied—without any further optimisation—to each external dataset. The operating point was fixed at the EyePACS validation threshold that gave 90 % specificity. Table 4 lists the resulting performance.

- Messidor-2: AUC 0.902 (95 % CI 0.887–0.915), sensitivity 0.852, specificity 0.901.

- APTOS 2019: AUC 0.881, sensitivity 0.828, specificity 0.867.

- IDRiD: AUC 0.860, sensitivity 0.812, specificity 0.791.

Although accuracy decreases as the image domain drifts from the Californian training pool to European and South-Asian clinics, the model still exceeds the 0.85 AUC threshold commonly cited for safe triage tools. The sharper specificity drop on IDRiD reflects the higher prevalence of bright-field illumination and unusual colour balance in that set, foreshadowing the vulnerability analyses that follow.

Table 4.

Baseline ResNet-50 generalisation on external datasets

| Dataset | AUC | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| Messidor-2 | 0.902 | 85.2 | 90.1 |
| APTOS 2019 | 0.881 | 82.8 | 86.7 |
| IDRiD | 0.860 | 81.2 | 79.1 |

 

Diagnostic performance on degraded imagery falls off in a consistent, interpretable pattern. When defocus or heavy down-sampling is introduced, the model’s discriminative ability erodes far faster than it does under simple illumination shifts. Specifically, a mild Gaussian blur (σ = 2 px) trims the area under the curve by roughly two percentage points, while a more pronounced blur (σ = 5 px) removes more than six. Reducing spatial resolution to half the native side length costs about three points; pushing resolution down to one quarter incurs a penalty of more than seven points.

Brightness manipulations have a gentler effect. Halving the overall luminance reduces AUC by about two points, and over-exposure at 1.5 × brightness subtracts just over four. The relative resilience to these global shifts implies that network filters extract most of their evidence from local edge energy rather than absolute pixel intensities.

Table 5 collates the exact scores, and the same trends can be inspected visually in Figure 3. Together, they point to optical focus and sensor resolution as the dominant quality levers: modest improvements in lens sharpness or pixel count are likely to yield larger reliability gains than aggressive control of flash settings or colour balance—an important consideration for low-cost screening hardware and busy community clinics.

 

Figure 3. AUC for the clean-reference images and all six degradation conditions

 

Table 5.

Degradation resilience of the baseline ResNet-50

| Condition | AUC | Δ AUC vs. clean |
|---|---|---|
| Clean reference | 0.904 | — |
| Blur σ = 2 px | 0.885 | −0.019 |
| Blur σ = 5 px | 0.842 | −0.062 |
| Resolution ↓ 50 % | 0.876 | −0.028 |
| Resolution ↓ 75 % | 0.830 | −0.074 |
| Darkened 0.5× | 0.882 | −0.022 |
| Brightened 1.5× | 0.861 | −0.043 |

 

Across the three remediation strategies, performance improves in distinct and complementary ways. When the network is retrained under the aggressive-augmentation schedule, its tolerance to optical blur and resolution loss rises markedly: on the most severe blur setting (σ = 5 px) the model regains roughly two-fifths of the AUC it had forfeited, lifting sensitivity from 74.6 % to 79.2 % (McNemar p < 0.001). Similar but slightly smaller recoveries appear for the down-sampled images, confirming that constant exposure to corrupted inputs during optimisation teaches the filters to rely on coarse, degradation-invariant cues.

Adversarial fine-tuning, by contrast, has its strongest effect on brightness perturbations. The AUC for over-exposed photographs inches up from 0.861 to 0.868, while under-exposed cases rise from 0.882 to 0.887. Improvements on blurred inputs remain modest, echoing earlier findings that adversarial defences target a different error mode—subtle, high-frequency noise—rather than large-scale focus loss.

A quality-aware pipeline delivers the most conservative but also the safest behaviour. A lightweight MobileNetV2 gate rejects about 8 % of degraded images as “ungradable.” For the remaining 92 %, the downstream DR classifier attains the highest mean AUC (0.907) and the lowest false-negative count—107 versus 135 per 10 000 screens at the baseline operating point. Although this approach increases retake workload, it prevents the system from issuing confident but unreliable diagnoses on images that a human grader would dismiss, thereby aligning automated triage with established screening practice.

Table 3 summarises AUC and sensitivity under each degradation scenario for all four model variants.


 

Figure 4 visualises these trends: the mean AUC across the six degraded test sets is higher for aggressive-aug and quality-aware inference than for the baseline, whereas adversarial fine-tuning shows only a marginal lift. Error bars denote bootstrap 95 % CIs.

 

Figure 4. Mean AUC on the six degraded sets for each robustness strategy

 

Manual review of 200 blur-induced false negatives showed two dominant patterns:

- faint micro-aneurysms masked by Gaussian smoothing (42 %),

- fine intraretinal haemorrhages conflated with noise (37 %).

Aggressive-augmentation rescued 38 % of these, largely those with preserved global vessel architecture. Quality-aware inference rejected 45 %, effectively eliminating the riskiest misses. False positives concentrated around optic-disc glare artefacts and dark peripheral shadows—regions the network occasionally mis-identifies as exudates or haemorrhages.

Extrapolating to a 50 000-patient regional programme with 15 % referable prevalence:

- Baseline deployment would miss an estimated 1 998 referable cases under moderate blur (σ = 2 px).

- Aggressive-augmentation reduces misses to 1 702, while quality-aware inference cuts the figure to 1 569 and triggers 4 150 re-captures.

- Assuming a five-minute retake per capture and a 98 % success rate, the extra workload equals 345 technician hours annually—well below the time saved by preventing unnecessary ophthalmologist reviews (≈ 830 hours).

Hence a hybrid strategy that couples stronger augmentation with a lightweight quality gate appears operationally viable: it improves safety more than adversarial fine-tuning alone and limits the retake burden to an acceptable level.

Robustness was assessed with synthetic degradations only; cataract glare, motion streaks and colour-channel misalignment were not modelled. The quality filter was trained on only 1 200 labelled images; a larger, vendor-balanced corpus might increase its precision. Finally, prospective field trials are needed to verify that technician-triggered retakes fit seamlessly into busy clinics and that patient recall rates remain unaffected.

Conclusion

This study demonstrates that a ResNet-50 classifier, when trained on a large, heterogeneous EyePACS corpus and evaluated across three external datasets, can achieve clinically relevant performance for automated detection of referable diabetic retinopathy. Baseline accuracy on clean images reached an AUC of 0.904, but systematic experiments showed that optical blur and extreme down-sampling remained critical failure modes, each reducing AUC by up to seven percentage points. Three complementary interventions were therefore assessed. Aggressive data augmentation recovered roughly one half of the performance lost to blur and low resolution, adversarial fine-tuning improved resilience to global intensity shifts, and a quality-aware inference gate safely rejected 8 % of degraded images while delivering the highest mean AUC (0.907) on the remaining cases. Taken together, the results indicate that robustness can be materially improved without sacrificing baseline accuracy, principally through targeted augmentation and judicious image-quality triage.

From a translational perspective, these findings underline two practical recommendations for AI-assisted DR screening. First, training curricula must explicitly include the degradations expected in community imaging workflows; generic augmentation is insufficient. Second, deployment pipelines should incorporate lightweight quality-control filters so that obviously ungradable frames prompt immediate recapture instead of generating unreliable scores. Future work should extend the evaluation to prospective, multi-centre trials, broaden the disease spectrum beyond diabetic retinopathy, and explore edge-optimised architectures capable of real-time inference on portable fundus devices. By addressing both algorithmic robustness and workflow integration, the framework presented here moves a step closer to safe, scalable and equitable retinal screening in primary-care and tele-ophthalmology settings.

 

References:

  1. Teo Z. L., Tham Y.-C., Yu M., Chee M. L., Rim T. H., Cheung N., et al. Global prevalence of diabetic retinopathy and projection of burden through 2045: systematic review and meta-analysis // Ophthalmology. 2021. Vol. 128, No. 11. P. 1580–1591. DOI: 10.1016/j.ophtha.2021.04.027.
  2. Wong T. Y., Cheung C. M., Larsen M., Sharma S., Simó R. Diabetic retinopathy // Nature Reviews Disease Primers. 2016. Vol. 2. Article 16012. DOI: 10.1038/nrdp.2016.12.
  3. Owsley C., McGwin G., Scilley K., Girkin C. A., Phillips J. M., Searcey K. Perceived barriers to care and attitudes about vision and eye care: focus groups with older African Americans and eye care providers // Investigative Ophthalmology & Visual Science. 2006. Vol. 47, No. 7. P. 2797–2802. DOI: 10.1167/iovs.06-0107.
  4. Gulshan V., Peng L., Coram M., Stumpe M. C., Wu D., Narayanaswamy A., et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs // JAMA. 2016. Vol. 316, No. 22. P. 2402–2410. DOI: 10.1001/jama.2016.17216.
  5. Abràmoff M. D., Lavin P. T., Birch M., Shah N., Folk J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary-care offices // NPJ Digital Medicine. 2018. Vol. 1. Article 39. DOI: 10.1038/s41746-018-0040-6.
  6. Ting D. S. W., Cheung C. Y., Lim G., Tan G. S. W., Phua V. M., Chandrasekaran S., et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multi-ethnic populations with diabetes // JAMA. 2017. Vol. 318, No. 22. P. 2211–2223. DOI: 10.1001/jama.2017.18152.
  7. Wang Z., Li Z., Li K., Mu S., Di Y., Liu X. Performance of artificial intelligence in diabetic retinopathy screening: a systematic review and meta-analysis of prospective studies // Frontiers in Endocrinology. 2023. Vol. 14. Article 1197783. DOI: 10.3389/fendo.2023.1197783.
  8. Lee A. Y., Yanagihara R. T., Lee C. S., Blazes M., Jung H. C., Chee Y. E., et al. Multicenter, head-to-head, real-world validation study of seven automated artificial-intelligence diabetic-retinopathy screening systems // Diabetes Care. 2021. Vol. 44, No. 5. P. 1168–1175. DOI: 10.2337/dc20-1877.
  9. Ruamviboonsuk P., Tiwari R., Sayres R., Nganthavee V., Hemarat K., Kongprayoon A., et al. Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study // The Lancet Digital Health. 2022. Vol. 4, No. 4. P. e235–e244. DOI: 10.1016/S2589-7500(22)00017-6.
  10. Chia M. A., Hersch F., Sayres R., Bavishi P., Tiwari R., Keane P. A., Turner A. W. Validation of a deep learning system for the detection of diabetic retinopathy in Indigenous Australians // British Journal of Ophthalmology. 2024. Vol. 108, No. 2. P. 268-273. DOI: 10.1136/bjo-2022-322237.
  11. Porwal P., Kokare R., Pachade S. Indian Diabetic Retinopathy Image Dataset (IDRiD) // Data. 2018. Vol. 3, No. 3. Article 25. P. 25. DOI: 10.3390/data3030025.
  12. Decencière E., Zhang X., Cazuguel G., Lay B., Cochener B., Trone C., et al. Feedback on a publicly distributed image database: the Messidor database // Image Analysis & Stereology. 2014. Vol. 33, No. 3. P. 231–234. DOI: 10.5566/ias.1155.
  13. Khalifa N. E. M., Loey M., Taha M. H. N., Mohamed H. E. T. Deep transfer learning models for medical diabetic retinopathy detection // Acta Informatica Medica. 2019. Vol. 27, No. 5. P. 327–332. DOI: 10.5455/aim.2019.27.327-332.
  14. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. P. 770–778. DOI: 10.1109/CVPR.2016.90.
  15. Shukla U. V., Tripathy K. Diabetic Retinopathy. In: StatPearls [Internet]. – Treasure Island, FL: StatPearls Publishing, 2025. – URL: https://www.ncbi.nlm.nih.gov/books/NBK560805/ (accessed: 15.05.2025).
  16. Zhang G., Sun B., Zhang Z., Pan J., Yang W., Liu Y. Multi-Model Domain Adaptation for Diabetic Retinopathy Classification // Frontiers in Physiology. 2022. Vol. 13. Article 918929. DOI: 10.3389/fphys.2022.918929.
  17. Li T., Gao Y., Wang K., Guo S., Liu H., Kang H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening // Information Sciences. 2019. Vol. 501. P. 511–522. DOI: 10.1016/j.ins.2019.06.011.
  18. Sarhan M. H., Makedonsky K., Mack M., Durbin M., Yigitsoy M., Eslami A. Deep learning for automatic diabetic retinopathy detection under multiple image quality levels // Investigative Ophthalmology & Visual Science. 2019. Vol. 60, No. 11. P. PB0105.
  19. Lal S., Rehman S. U., Shah J. H., Meraj T., Rauf H. T., Damaševičius R., Mohammed M. A., Abdulkareem K. H. Adversarial attack and defence through adversarial training and feature fusion for diabetic retinopathy recognition // Sensors. 2021. Vol. 21, No. 11. Article 3922. DOI: 10.3390/s21113922.
  20. Gonçalves M. B., Nakayama L. F., Ferraz D., et al. Image quality assessment of retinal fundus photographs for diabetic retinopathy in the machine-learning era: a review // Eye. 2024. Vol. 38. P. 426–433. DOI: 10.1038/s41433-023-02717-3.
  21. Wong T., Cheung C., Larsen M., et al. Diabetic retinopathy // Nature Reviews Disease Primers. 2016. Vol. 2. Article 16012. DOI: 10.1038/nrdp.2016.12.
  22. Jampol L. M., Glassman A. R., Sun J. Evaluation and care of patients with diabetic retinopathy // The New England Journal of Medicine. 2020. Vol. 382, No. 17. P. 1629–1637. DOI: 10.1056/NEJMra1909637.
  23. Flaxel C. J., Adelman R. A., Bailey S. T., Fawzi A., Lim J. I., Vemulakonda G. A., Ying G. S. Diabetic Retinopathy Preferred Practice Pattern // Ophthalmology. 2020. Vol. 127, No. 1. P. P66–P145. DOI: 10.1016/j.ophtha.2019.09.025.
Information about the authors

Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan

PhD, Senior Lecturer, School of Information Technology and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan
