RELIABILITY ASSURANCE PRACTICES FOR SERVERLESS FUNCTIONS IN FINANCIAL SYSTEMS

Mykhailenko Y.
Cite as:
Mykhailenko Y. RELIABILITY ASSURANCE PRACTICES FOR SERVERLESS FUNCTIONS IN FINANCIAL SYSTEMS // Universum: Technical Sciences: electronic scientific journal. 2025. No. 10(139). URL: https://7universum.com/ru/tech/archive/item/21049 (accessed: 05.12.2025).
DOI: 10.32743/UniTech.2025.139.10.21049

 

ABSTRACT

The paper analyzes engineering and organizational practices that ensure the reliability of serverless functions in financial systems using the Function-as-a-Service model. Its goal is to systematize solutions that help banks and insurance services comply with the Digital Operational Resilience Act while maintaining RTO, MTTR, and SLA metrics under high load. The proposed architecture integrates active-active multi-regional deployment, write-global read-local replication, multi-level cold-start mitigation, idempotent exactly-once protocols, and OpenTelemetry-based observability. The results show that automatic traffic switching reduces downtime to seconds, combining Provisioned Concurrency with adaptive warming minimizes tail latency, and a triple-consistency mechanism prevents duplicate payments without loss of throughput. The solution enables incident reports within four hours, meeting a key DORA requirement. The approach is valuable for cloud solution architects, DevSecOps engineers, and researchers focusing on the reliability and regulatory compliance of financial ICT systems.


 

Keywords: serverless, Function-as-a-Service, financial systems, reliability, DORA, idempotency, observability, OpenTelemetry


 

Introduction. A serverless function is a short-lived stateless fragment of code that is triggered by an event and automatically scaled by a cloud provider; in financial services, it handles settlement requests, performs transaction scoring, and responds to anomalies in real time. This approach removes the burden of infrastructure administration and shortens time-to-market (TTM), yet shifting critical workloads to Function-as-a-Service requires viewing the compute layer as potentially ephemeral and preparing the system for instant recovery from the loss of any node.

The reliability of this layer is directly tied to regulation. Regulation (EU) 2022/2554, the Digital Operational Resilience Act, became applicable on 17 January 2025 and introduced strict standards for ICT-risk management, incident detection time, and reporting; financial organizations must demonstrate the resilience of cloud and especially serverless chains, including external service providers, or face fines and operational restrictions [1]. Against this backdrop, system-level metrics such as recovery time objective (RTO), service-level availability (SLA), and mean time to repair (MTTR) become not only engineering but also regulatory KPIs. A bank's digital channels thus depend on the ability of FaaS functions to handle load spikes without degradation and to register anomalies before they affect the settlement ledger.

Yet moving to FaaS introduces risks not found in conventional microservices. A case study of an international fintech company that fully migrated to AWS Lambda documents challenges in business control, security, and operational complexity that demand careful planning of decomposition stages and compliance verification at every step [2]. An analysis of eleven such migrations identifies technical obstacles, such as validating event paths, integrating legacy support systems, and a lack of standards, alongside organizational barriers such as a shortage of specialists and a shift in engineering culture [3]. Assessing and minimizing these risks in advance is a key condition for ensuring that serverless functions do not become a new single point of failure but instead enhance the regulatory resilience of a financial system.

Materials and methods. The research drew on 23 up-to-date sources from 2020 to 2025 that cover the regulatory framework, provider engineering guidance, empirical migration studies, and cold-start metrics. The regulatory frame was defined by DORA and related clarifications from European regulators, which specify upper limits for RTO and MTTR and the procedure for incident reporting [1]. This regulatory perspective was complemented by methodological documents of major cloud vendors, chiefly AWS Prescriptive Guidance and the Region Switch mechanism description in ARC, because they formalize the technical practices required for compliance [4, 5].

The empirical part comprised two datasets. The first was a multi-stage case study of a global financial company that fully migrated to AWS Lambda; it provided detailed data on monolith-splitting phases, access management, and DevSecOps processes [2]. The second was a 2023 meta-analysis of eleven migrations that made it possible to compile a list of typical technical and organizational barriers to FaaS adoption and to validate their frequency across the industry [3]. To measure operational risks, a comparative report on cold-start latencies for AWS, Azure, and GCP containing p95 values under burst load was used, which allowed calibration of Provisioned Concurrency and adaptive warming models [8].

Results and discussion. The active-active deployment practice distributes copies of every critical function across two or more regions. It is complemented by automatic traffic switching: when one region degrades, the Region switch plan in Amazon Application Recovery Controller records the event, updates routing-control states, and automatically reroutes DNS queries, while the actual recovery time is calculated and displayed in the console so that it can be checked against the established RTO and DORA requirements [4]. This eliminates the need for human intervention at the moment of an incident and helps pass the operational-resilience tests that are mandatory for the financial sector. The main remaining engineering decision is the choice of consistency strategy. As AWS Prescriptive Guidance notes, network partitions in a multi-region topology are inevitable, so the architecture must choose between availability and strong consistency: asynchronous replication risks losing the latest transactions, while synchronous writes add an order of magnitude of latency and require a quorum of at least two regions, which makes correlated failures harder to protect against [5]. Most financial workloads use the write-global read-local model: writes go to a designated region, while related reads are served locally after the data propagates asynchronously.
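To make the routing idea concrete, the following is a minimal, purely illustrative Python sketch of the write-global read-local pattern. The region names, the `RegionalStore` class, and the in-memory dictionaries standing in for replicated storage are all assumptions for demonstration; real propagation is asynchronous, whereas the model below copies data immediately.

```python
from dataclasses import dataclass, field

@dataclass
class RegionalStore:
    """A toy stand-in for one region's replica of the data store."""
    name: str
    data: dict = field(default_factory=dict)

class WriteGlobalReadLocal:
    def __init__(self, home, regions):
        self.home = home
        self.replicas = {r: RegionalStore(r) for r in regions}

    def write(self, key, value):
        # Writes are accepted only in the designated home region...
        self.replicas[self.home].data[key] = value
        # ...and propagated to the other regions; in reality this is
        # asynchronous replication, modeled here as an immediate copy.
        for region, store in self.replicas.items():
            if region != self.home:
                store.data[key] = value

    def read(self, key, local_region):
        # Reads never cross regions: they hit the caller's local replica.
        return self.replicas[local_region].data.get(key)

store = WriteGlobalReadLocal(home="eu-west-1",
                             regions=["eu-west-1", "eu-central-1"])
store.write("tx-42", "settled")
print(store.read("tx-42", local_region="eu-central-1"))  # settled
```

The design choice this illustrates is that read latency stays local everywhere, while write conflicts are avoided by construction because only the home region accepts writes.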

Aurora Global Database shows the practical limits of such eventual consistency. According to the official documentation, replication lag between clusters typically stays below 100 ms, which fits payment-settlement windows and keeps cross-region read latency low [6]. Where regulations mandate zero data loss, a synchronous three-region scheme with a Paxos-like quorum is employed, but the added latency makes it appropriate only for a narrow set of transactions.

Even a perfectly designed replication is useless if, during a failure, external services such as settlement gateways or KYC systems are unreachable. The Well-Architected FSI Lens methodology prescribes verifying in advance that all third-party dependencies can be invoked from the standby region and decoupling calls through message buffers and compensating mechanisms so that the system continues operating when a partner API is temporarily lost or at least completes operations with a controlled error [7]. This method stops cascading failures, closing the geographic-resilience loop and forming a base for other levels of reliability.

Geographic redundancy does not change the fact that a financial chain will still fail if it hits a cold start right after failover. A comparison of serverless platforms shows average cold-start latency of about 200 ms on AWS versus roughly 500 ms on Azure and 400 ms on GCP, with load spikes pushing the p95 tail to several seconds, well above payment-system norms [8]. Mitigating cold initializations is therefore built into the reliability plan discussed above.

The first line of defense is Provisioned Concurrency, the pool of pre-warmed execution environments that the cloud maintains. It guarantees responses in double-digit milliseconds and suits scoring APIs and real-time settlement paths [9]. Banks typically combine a minimal constant baseline with a dynamic ramp-up to a peak threshold one minute before financial markets open; the added cost is offset by retiring the duplicated container clusters that would otherwise be maintained solely to stabilize the latency tail.

Adaptive warming adds elasticity. A recent systematic review catalogs methods such as preemptive resource assignment based on the inter-request interval histogram [10]. The experimental system FuncMem showed that memory prediction combined with asynchronous code preloading cuts p99 cold-start latency by an average of 63.48 percent without noticeable overpayment for extra megabytes [11]. Financial teams wire these algorithms to APM-trace data to damp the rare but painful initialization spikes after overnight idle periods.
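As an illustration of interval-driven warming (a hedged sketch, not the FuncMem algorithm itself), the following Python model records inter-request gaps, predicts the next invocation with a median estimator, and schedules a warm-up shortly beforehand when the predicted gap would outlive the provider's keep-alive window. The 30-second lead time, the median predictor, and the 300-second keep-alive default are assumptions for demonstration.

```python
import statistics

class AdaptiveWarmer:
    def __init__(self, keep_alive_s=300.0):
        self.keep_alive_s = keep_alive_s
        self.intervals = []   # observed inter-request gaps, seconds
        self.last_ts = None

    def observe(self, ts):
        """Record an invocation timestamp and update the gap history."""
        if self.last_ts is not None:
            self.intervals.append(ts - self.last_ts)
        self.last_ts = ts

    def next_warm_at(self):
        """Timestamp at which to trigger a warm-up, or None if the
        environment is expected to still be warm at the next request."""
        if len(self.intervals) < 3 or self.last_ts is None:
            return None  # not enough history to predict
        predicted_gap = statistics.median(self.intervals)
        if predicted_gap <= self.keep_alive_s:
            return None  # environment survives until the next request
        return self.last_ts + predicted_gap - 30.0  # warm 30 s ahead

warmer = AdaptiveWarmer(keep_alive_s=300.0)
for ts in [0.0, 600.0, 1200.0, 1800.0]:  # a request every 10 minutes
    warmer.observe(ts)
print(warmer.next_warm_at())  # 2370.0: warm 30 s before the expected call
```

In production, the `observe` feed would come from APM traces, as the text notes, rather than from a hard-coded list.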

Figure 1 presents an event-oriented architecture for processing data from heterogeneous sources, with the outer ring containing event sources: smart devices and consumer electronics, wearable trackers and medical sensors, connected vehicles, industrial equipment, video-surveillance systems, agricultural machinery, as well as web requests, file uploads, streaming data, and database events.

 

Figure 1. The architecture and workflow in serverless computing [10]

 

These pass through API gateways to the Function Execution layer, where the master component (event queue and dispatcher) assigns work to workers or invokes FaaS, enabling both scaling and asynchronous code execution. Results and state flow to downstream components: BaaS, data storage, session and object stores, the request-response subsystem, and the task scheduler. End-to-end monitoring and logging, together with load-balancing and authentication services, make the system reliable and manageable. The presence of both scheduled and reactive (database-event and streaming) paths indicates integration flexibility across use cases ranging from fitness trackers to industrial control and precision agriculture.

Even with perfect warming, sharp request spikes can exceed the available pool. A buffering queue comes into play here: Amazon SQS, configured as a façade in front of the functions, smooths traffic, and the recent fair-queues mode protects against a “noisy neighbor,” allocating priorities among tenants and preventing avalanche-type latency growth.
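The smoothing role of such a queue façade can be sketched in a few lines; this toy Python dispatcher (not the SQS API) absorbs a burst without throttling and drains it at a fixed rate matched to downstream warm capacity. The rates and request names are illustrative.

```python
from collections import deque

class BufferedDispatcher:
    """Toy model of a queue façade in front of a function pool."""

    def __init__(self, drain_rate_per_tick):
        self.queue = deque()
        self.drain_rate = drain_rate_per_tick  # downstream warm capacity

    def enqueue_burst(self, requests):
        # The spike is accepted into the buffer instead of being rejected.
        self.queue.extend(requests)

    def tick(self):
        # Each tick forwards at most drain_rate requests downstream,
        # so the function pool never sees more than it can warm-serve.
        n = min(self.drain_rate, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]

d = BufferedDispatcher(drain_rate_per_tick=100)
d.enqueue_burst([f"req-{i}" for i in range(250)])  # a 250-request spike
print(len(d.tick()), len(d.tick()), len(d.tick()))  # 100 100 50
```

A fair-queues mode, as mentioned above, would additionally interleave tenants when draining rather than serving strictly in arrival order.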

The last line of defense is concurrency limiting and safe releases. The basic regional cap of 1,000 concurrent invocations can be raised. Still, financial engineers more often set a reserved quota for the most critical functions so that other services cannot evict them during bursting [12].
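A minimal model of this admission logic, with pool sizes mirroring the 400/400/200 split of Figure 2, might look as follows. The `ConcurrencyPool` class and its shared-counter scheme are illustrative assumptions, not Lambda's internal implementation.

```python
class ConcurrencyPool:
    """Toy admission control: reserved slices plus a shared remainder."""

    def __init__(self, regional_limit, reserved):
        assert sum(reserved.values()) <= regional_limit
        self.reserved = reserved                      # per-function reserves
        self.unreserved = regional_limit - sum(reserved.values())
        self.in_flight = {}

    def try_invoke(self, fn):
        if fn in self.reserved:
            used = self.in_flight.get(fn, 0)
            if used >= self.reserved[fn]:
                return False  # throttled at its own reserve limit
            self.in_flight[fn] = used + 1
            return True
        # All unreserved functions compete for one shared counter.
        shared = self.in_flight.get("_shared", 0)
        if shared >= self.unreserved:
            return False  # unreserved pool exhausted
        self.in_flight["_shared"] = shared + 1
        return True

pool = ConcurrencyPool(1000, {"blue": 400, "orange": 400})
print(all(pool.try_invoke("blue") for _ in range(400)))   # True
print(pool.try_invoke("blue"))                            # False: reserve full
print(sum(pool.try_invoke("other") for _ in range(300)))  # 200: shared pool cap
```

The sketch reproduces the trade-off discussed below: critical functions get guaranteed capacity, but the unreserved remainder shrinks accordingly.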

Figure 2 shows a reserved-concurrency policy: the vertical axis is the number of concurrent instances, the horizontal axis is time, and the outer scale is up to 1,000. Function-blue and function-orange each reserve 400, while the remaining 200 constitute unreserved pool capacity for all other functions. Wavy lines inside the colored rectangles illustrate actual load; at certain moments, the load approaches or reaches the reserve limit, marked by red segments (manifestations of throttling or reserve saturation). At moment t4, the aggregate load of all functions hits the global limit of 1,000, after which additional demand cannot be satisfied. The key intuition of the diagram is that reservation guarantees dedicated capacity for critical functions but simultaneously reduces pool capacity for others and can lead to idle reserve or throttling at peak aggregate load, so reserve sizes must be planned according to the real load profile.

 

Figure 2. Reserved concurrency for functions and its impact on the instance pool [12]

 

During rollout of new versions, alias routing with gradual weight shifting is used; if p95 metrics or initialization errors exceed the threshold, the system rolls back the deployment, and an extra Provisioned Concurrency buffer during the canary window prevents requests from spilling over to the old version [13]. Taken together, these measures turn the cold start from a random source of regulatory risk into a controllable engineering parameter, ensuring seamless function operation even under stress scenarios.
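The canary logic just described can be sketched as a simple control loop. The threshold, step weights, and the `run_canary` helper below are hypothetical illustrations, not the weighted-alias API; a real rollout would shift the alias weight via the provider's deployment tooling.

```python
def run_canary(p95_by_step_ms, threshold_ms=250.0,
               steps=(0.1, 0.25, 0.5, 1.0)):
    """Shift traffic to the new version step by step; revert the alias
    as soon as any step's observed p95 breaches the threshold."""
    weight = 0.0
    for step, p95 in zip(steps, p95_by_step_ms):
        weight = step  # more traffic flows to the new version
        if p95 > threshold_ms:
            return "rolled_back", 0.0  # alias reverts to the old version
    return "promoted", weight

# A healthy rollout completes all steps; a degraded one aborts early.
print(run_canary([180.0, 190.0, 210.0, 205.0]))  # ('promoted', 1.0)
print(run_canary([180.0, 320.0]))                # ('rolled_back', 0.0)
```

An error-budget check could replace the raw p95 threshold without changing the structure of the loop.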

The next tier of reliability is determined not by the number of standby regions but by the platform's ability to guarantee that a financial action is applied exactly once. Calculations by the Institute of Financial Management show that duplicate payments cost companies 0.05-0.1% of total expenses; at the turnover scale of large banks, this share translates into millions of dollars [14]. Regulators therefore require API-level idempotency. The payment provider Stripe goes further: it accepts an idempotency key (typically a UUID) on every POST request regardless of operation type, and the system itself deletes expired keys after 24 hours, simplifying key-store cleanup [15].

In practice, this comes down to an idempotency table with a TTL: the first call atomically writes both the key and the response, and a repeated call gets back the prior result without touching the settlement circuit. The table is kept from becoming a single point of failure by sharding it on a key-hash prefix and replicating it in the same regions as the functions. Combined with concurrency limits, this yields guaranteed at-most-once semantics even under aggressive retry strategies, while remaining compatible with the at-least-once delivery model the provider applies by default.
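A minimal sketch of such an idempotency table, assuming an in-memory dictionary in place of the sharded, replicated store, a 24-hour TTL mirroring the key-expiry behavior described above, and a hypothetical `debit` operation standing in for the settlement logic:

```python
import time

class IdempotencyTable:
    def __init__(self, ttl_s=24 * 3600):
        self.ttl_s = ttl_s
        self._rows = {}  # key -> (written_at, stored_response)

    def execute(self, key, operation, now=None):
        now = time.time() if now is None else now
        row = self._rows.get(key)
        if row is not None and now - row[0] < self.ttl_s:
            return row[1]  # duplicate call: replay the stored response
        response = operation()          # first (or expired) call runs for real
        self._rows[key] = (now, response)  # real systems write this atomically
        return response

calls = []
def debit():
    calls.append(1)
    return "debited 100.00"

table = IdempotencyTable()
print(table.execute("uuid-1", debit, now=0.0))     # debited 100.00
print(table.execute("uuid-1", debit, now=3600.0))  # debited 100.00 (replayed)
print(len(calls))                                  # 1: the debit ran only once
```

The key property is that retries, however aggressive, observe the cached response instead of re-entering the settlement circuit.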

Simple idempotency, however, covers only the risk of double debiting; a failure between several writes can still result in partial visibility. This gap is closed by AFT, a thin layer between FaaS and storage that buffers all changes until the logical request ends and then commits them atomically, providing read-atomic isolation. The authors' experiments show that AFT overhead does not exceed single-digit milliseconds and that the system consistently processes thousands of requests per second without consistency anomalies [16]. For monetary flows, this means that even with automatic retries there is no state in which some registers have been updated while others still hold old values. When a transaction spans a chain of functions, a full two-phase commit is fatal to latency, so recent work proposes cooperative protocols without centralized coordinators. Beldi journals operations in a unified log and lets the functions themselves vote on commit: on 1,000 AWS Lambda instances it slows baseline library operations only two- to four-fold and sustains up to 800 requests per second at under 200 ms median latency, with exactly-once semantics out of the box [17]. Idempotency keys, an atomic commit layer, and cooperative commit protocols together close the consistency loop: financial systems on serverless platforms can perform an operation exactly once and record it in a single, non-contradictory version of the truth.
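The buffering idea behind AFT can be illustrated in a few lines of Python; the `AtomicRequestBuffer` class below is a toy model of read-atomic commit, not the AFT implementation, and the two-account ledger is an invented example.

```python
class AtomicRequestBuffer:
    """All writes of one logical request become visible together or not
    at all, so a mid-request failure leaves no partially updated state."""

    def __init__(self, store):
        self.store = store    # the shared, durable key-value store
        self.buffer = {}      # writes staged for this request only

    def write(self, key, value):
        self.buffer[key] = value  # staged, not yet visible to readers

    def commit(self):
        self.store.update(self.buffer)  # all writes land in one step
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()  # on failure, nothing ever became visible

ledger = {"acct_a": 100, "acct_b": 0}
tx = AtomicRequestBuffer(ledger)
tx.write("acct_a", 60)   # transfer 40 from A...
tx.write("acct_b", 40)   # ...to B
print(ledger["acct_b"])  # 0: nothing visible before commit
tx.commit()
print(ledger)            # {'acct_a': 60, 'acct_b': 40}
```

A retry that reruns the request before `commit` re-stages the same writes, which is exactly the partial-visibility hazard the buffering removes.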

Reliable operation of financial FaaS components is impossible without end-to-end observability, because it is observability that turns discrete functions distributed across availability zones into a deterministic and controllable system. The industry has already taken a significant step forward: in 2025, 34% of banks and insurance companies employed ten or more observability capabilities, including tracing and log analysis, and another 26% reached full-stack level [18]. Centralizing telemetry quickly proves its economic value: 79% of organizations that collected all signals in one pane of glass observed direct time or budget savings through reduced MTTR and licensing costs [19]. As shown in Figure 3, overall sentiment is most positive in telecommunications and financial services, whereas the energy sector combines a low share of positive assessments with high uncertainty and comparatively more negativity.

 

Figure 3. Sectoral Sentiment Composition [19]

 

OpenTelemetry becomes the technical foundation for this effect. A single protocol makes it possible to enable tracing even in short-lived Lambda handlers without additional code: wrapping the entry point in a lightweight layer is sufficient for every transaction to be enriched with context and delivered to the chosen backend almost in real time [20]. Standardized export removes vendor lock-in and provides direct control over data volume; unsurprisingly, 57% of teams that have adopted an OTEL-centric architecture have already reduced telemetry storage and processing costs [21].

For observability to truly protect business metrics, engineering SLOs must be linked to financial indicators such as MTTR and RTO. The methodology recommended in the Well-Architected Framework proposes translating stakeholder requirements into concrete availability and recovery targets, with MTTR serving as the primary indicator of a team’s incident readiness and RTO as the maximum downtime window acceptable to regulators and customers [22]. This “end-to-end contract” helps rank functions by criticality, automate rollback paths, and incorporate error budgets into the canary-release process.
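A worked example of translating an availability SLO into an error budget, which the canary process must then respect; the figures are pure arithmetic over a 30-day month, not a regulatory prescription.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime per period implied by an availability target."""
    return days * 24 * 60 * (1.0 - slo)

for slo in (0.999, 0.9995, 0.9999):
    print(f"SLO {slo:.4%}: {error_budget_minutes(slo):.1f} min/month")
# SLO 99.9000%: 43.2 min/month
# SLO 99.9500%: 21.6 min/month
# SLO 99.9900%: 4.3 min/month
```

Ranking functions by criticality then amounts to assigning each one an SLO tier and checking incident and release time against its remaining budget.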

High-quality telemetry also simplifies audits. In the EU, DORA requires a financial organization to send an initial notification to the regulator within four hours after classifying an incident, an interim report no later than 72 hours, and a final report within one month [23]. When traces, metrics, and logs share a unified correlation identifier, the report template is assembled automatically: the platform builds invocation timelines, ties them to payment events, and adds conclusions on SLO compliance. Thus, observability ceases to be an internal engineering tool and becomes a source of evidence-based reporting for regulators, auditors, and more importantly - the organization’s teams making decisions on further reliability improvements.

Conclusion

The study shows that the reliability of the serverless compute layer in financial systems is achieved not by isolated measures but by a combined approach in which each practice seals a specific type of risk. Active-active deployment across multiple regions with automatic traffic switching reduces downtime to seconds and provides evidence of compliance with DORA's RTO and MTTR requirements. The write-global read-local model keeps replication lag within hundreds of milliseconds while preserving the integrity of the settlement window; synchronous quorums remain a fallback reserved for tasks that demand zero data loss.

Provisioned Concurrency, predictive warming, and buffering queues turn the cold start, once a hidden regulatory liability, into an easily managed parameter, reducing tail latencies by an order of magnitude and smoothing the load profile without excessive infrastructure reservation. Reserved concurrency and assured rollback via alias routing suppress release-time error spikes and improve the continuity of transaction flows. Logical consistency is guaranteed by a triple loop: idempotent API keys prevent an operation from being applied twice, the atomic AFT layer makes visibility read-atomic within a request, and cooperative commit protocols such as Beldi scale exactly-once semantics under parallelism without prohibitive latency penalties. This eliminates precisely the duplicate-payment and partial-apply risks that industry cost studies highlight.

OpenTelemetry-based end-to-end observability turns a disparate set of short-lived functions into a managed system in which engineering SLOs are tied directly to financial KPIs and regulatory reporting windows. It also speeds up audits by generating the timeline of the incident automatically, thereby helping fulfill DORA’s critical four-hour notification requirement. In sum, the proposed set of practices demonstrates that, with well-designed multi-regional deployment, preventive warming, idempotent protocols, and standardized telemetry, serverless functions can not only maintain but also enhance operational resilience and bolster compliance with financial-sector regulations.

 

References:

  1. Central Bank of Ireland, “Digital Operational Resilience Act (DORA),” Central Bank of Ireland, Jan. 2025. https://www.centralbank.ie/regulation/digital-operational-resilience-act-dora (accessed Jul. 14, 2025).
  2. K. K. Suram, “Serverless Infrastructure at Scale: A Comprehensive Framework for Enterprise-Wide FaaS Migration Using AWS Lambda,” IARJSET, vol. 12, no. 5, May 2025, doi: https://doi.org/10.17148/iarjset.2025.125369.
  3. M. Hamza, M. A. Akbar, and K. Smolander, “The Journey to Serverless Migration: An Empirical Analysis of Intentions, Strategies, and Challenges,” arXiv, Nov. 2023, doi: https://doi.org/10.48550/arxiv.2311.13249.
  4. AWS, “Region switch in ARC,” AWS. https://docs.aws.amazon.com/r53recovery/latest/dg/region-switch.html (accessed Jul. 23, 2025).
  5. AWS, “AWS Prescriptive Guidance,” AWS. https://docs.aws.amazon.com/pdfs/prescriptive-guidance/latest/aws-multi-region-fundamentals/aws-multi-region-fundamentals.pdf (accessed Jul. 24, 2025).
  6. AWS, “Replication with Amazon Aurora,” AWS. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Replication.html (accessed Jul. 25, 2025).
  7. AWS, “FSIREL04: Does the resilience and the architecture of your workload reflect the business requirements and resilience tier?” AWS, 2025. https://docs.aws.amazon.com/wellarchitected/latest/financial-services-industry-lens/fsirel04.html (accessed Jul. 26, 2025).
  8. Vasile Crudu, “Comparing Serverless Providers for Node.js - AWS, Azure, and GCP,” Moldstud, Mar. 11, 2025. https://moldstud.com/articles/p-comparing-serverless-providers-for-nodejs-aws-azure-and-gcp (accessed Jul. 27, 2025).
  9. AWS, “Configuring provisioned concurrency,” AWS. https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html (accessed Jul. 28, 2025).
  10. M. Golec, G. K. Walia, M. Kumar, F. Cuadrado, S. S. Gill, and S. Uhlig, “Cold Start Latency in Serverless Computing: A Systematic Review, Taxonomy, and Future Directions,” ACM Computing Surveys, vol. 57, no. 3, pp. 1-36, 2024, doi: https://doi.org/10.1145/3700875.
  11. M. Pandey and Y.-W. Kwon, “FuncMem: Reducing Cold Start Latency in Serverless Computing Through Memory Prediction and Adaptive Task Execution,” SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Apr. 2024, pp. 131-138, doi: https://doi.org/10.1145/3605098.3636033.
  12. AWS, “Understanding Lambda function scaling,” AWS. https://docs.aws.amazon.com/lambda/latest/dg/lambda-concurrency.html (accessed Jul. 31, 2025).
  13. AWS, “Implement Lambda canary deployments using a weighted alias,” AWS. https://docs.aws.amazon.com/lambda/latest/dg/configuring-alias-routing.html (accessed Aug. 02, 2025).
  14. “Duplicate Payments in Accounts Payable: 10 Ways to Prevent Them,” BCS ProSoft. https://www.bcsprosoft.com/duplicate-payments/ (accessed Aug. 03, 2025).
  15. Stripe, “Stripe API Reference,” Stripe. https://docs.stripe.com/api (accessed Aug. 04, 2025).
  16. V. Sreekanti, C. Wu, S. Chhatrapati, J. E. Gonzalez, J. M. Hellerstein, and J. M. Faleiro, “A Fault-Tolerance Shim for Serverless Computing,” arXiv, Mar. 2020, doi: https://doi.org/10.48550/arxiv.2003.06007.
  17. H. Zhang, P. Chen, S. Angel, V. Liu, and A. Cardoza, “Fault-tolerant and transactional stateful serverless workflows,” 2020. Accessed: Aug. 06, 2025. [Online]. Available: https://www.usenix.org/system/files/osdi20-zhang_haoran.pdf
  18. New Relic, “State of Observability for Financial Services and Insurance,” New Relic, 2025. Accessed: Aug. 07, 2025. [Online]. Available: https://newrelic.com/sites/default/files/2025-04/new-relic-state-of-observability-fsi-2025_0425.pdf
  19. Grafana Labs, “Observability Survey,” Grafana Labs, 2024. Accessed: Aug. 08, 2025. [Online]. Available: https://grafana.com/media/observability-survey/Obs-Survey-24-final.pdf
  20. Open Telemetry, “Serverless,” Open Telemetry. https://opentelemetry.io/docs/languages/js/serverless/ (accessed Aug. 09, 2025).
  21. M. Shalash, “Embracing Cost-Effective Observability Through an OpenTelemetry Approach,” Apmdigest, Apr. 29, 2025. https://www.apmdigest.com/embracing-cost-effective-observability-through-opentelemetry-approach (accessed Aug. 10, 2025).
  22. Microsoft Learn, “Recommendations for defining reliability targets,” Microsoft Learn. https://learn.microsoft.com/en-us/azure/well-architected/reliability/metrics (accessed Aug. 11, 2025).
  23. European Banking Authority, “Joint Technical Standards on major incident reporting,” European Banking Authority. https://www.eba.europa.eu/activities/single-rulebook/regulatory-activities/operational-resilience/joint-technical-standards-major-incident-reporting (accessed Aug. 12, 2025).
Information about the author

Software Engineer, PayPal (contracted through Accelon Inc.), Austin, USA
