Data Analyst, 01tech, Russia, Pskov
PRACTICAL FRAMEWORK FOR RELIABLE A/B EXPERIMENTS IN B2C PRODUCTS
ABSTRACT
The article addresses the problem of fragmented design of A/B experiments in B2C analytics, where statistical power parameters, test duration, and evaluation metrics are specified in an uncoordinated manner. Typical methodological limitations are examined, including the absence of prior fixation of the minimum detectable effect and the improper design of monetary metrics. It is shown that such practices reduce the reliability and reproducibility of experimental findings. The paper proposes a practical framework for the preparation and analysis of A/B experiments that links statistical power parameters with sample size and test duration calculations, and formalizes the design of monetary metrics using a ratio-based approach and linearization.
АННОТАЦИЯ
Статья посвящена проблеме фрагментарного проектирования A/B-экспериментов в B2C-аналитике, при котором параметры мощности, длительность тестирования и используемые метрики задаются несогласованно. Рассмотрены типичные ограничения методики, связанные с отсутствием предварительной фиксации минимально детектируемого эффекта и корректного выбора денежных метрик. Показано, что в связи с этим снижается надежность и воспроизводимость экспериментальных выводов. Предложен практический каркас подготовки и анализа A/B-экспериментов, который увязывает параметры статистической мощности с расчетом выборки и длительности теста, а также формализует дизайн денежных метрик с использованием ratio-подхода и линеаризации.
Keywords: A/B testing; B2C analytics; statistical power; minimum detectable effect; experimental design; monetary metrics; ratio metrics; reproducibility.
Ключевые слова: A/B-тестирование; B2C-аналитика; статистическая мощность; минимально детектируемый эффект; дизайн эксперимента; денежные метрики; ratio-метрики; воспроизводимость.
Introduction. Experimentation in digital B2C products is one of the primary decision-making instruments: the release of product changes is grounded in statistically justified conclusions. The quality of experimental design therefore directly determines the quality of the resulting decisions, since methodological inaccuracies at the planning and execution stages distort the interpretation of results, add time costs, and reduce the efficiency of product development. Long-term practical experience indicates that problems with the reliability of A/B experiments stem more from the absence of a unified process for their preparation and execution than from the choice of a particular statistical test.
In practice, situations are frequently observed in which the experimental hypothesis is formulated, yet the minimum detectable effect (MDE) is not fixed prior to launch; the selected metric does not reflect the hypothesized mechanism of impact of the product change; and the sample size calculation is performed in isolation, without being converted into a formalized stopping protocol. International studies on online experimentation rightly emphasize that the reliability of experimental conclusions is determined by the coherence and alignment of all stages of the experimental process [2].
The purpose of this article is to present a reproducible practical framework for conducting reliable A/B experiments in B2C products.
Research Methodology. The theoretical foundation of the study comprises typical errors and reliability requirements for A/B experiments in B2C products, identified through theoretical analysis and a systematic review of the relevant literature. In particular, one of the most common methodological issues in applied A/B experimentation is launching tests without a pre-specified expected MDE. In such cases, the objective of the experiment is effectively formulated after the results are obtained, and the mere presence of statistical significance is interpreted as a sufficient basis for product decisions. At the same time, the duration of the experiment is often determined by calendar considerations rather than by calculations grounded in statistical power parameters and available traffic volume, which complicates interpretation and reduces the reliability of the decisions made.
An additional risk factor is the practice of regularly monitoring interim experimental results and subsequently stopping the test upon the first attainment of a threshold p-value. In the absence of a predefined and valid sequential analysis protocol, such actions lead to an uncontrolled increase in the probability of false-positive conclusions [2]. Applied studies on A/B testing also emphasize the necessity of a formalized selection of statistical tests, underlying assumptions, and decision-making procedures at the experiment planning stage [1; 8].
A further pronounced issue is the conflation of distinct mechanisms within monetary indicators. In B2C products, monetary metrics typically exhibit higher variance than conversion metrics and are largely driven by rare events. For this reason, particular attention must be paid to their interpretation and design. The widely used ARPU metric, calculated as revenue per user, aggregates multiple effects simultaneously: changes in the level of user activity and changes in monetization per unit of activity. As a result, a statistically significant change in ARPU may reflect increased user activity, whereas the original hypothesis may have been related to changes in monetization per session or another elementary action.
For metrics constructed as ratios (ratio metrics), the correct formulation of the indicator and the selection of appropriate data transformations are of critical importance. Prior research notes that, without such transformations, conclusions drawn from ratio metrics may be unstable and difficult to interpret, thereby reducing the reliability of experimental results [7]. Moreover, even the use of correct statistical formulas does not guarantee reliable experimental conclusions if the calculation parameters, data processing methods, and applied transformations are not fixed in a formalized protocol. Review studies of A/B testing practice highlight methodological reproducibility and procedural transparency as primary conditions for transferring experimental approaches across teams and products [11].
In addition, studies in the fields of mobile applications and UX experimentation further note an increased risk of interpretational errors when multiple test iterations are conducted, segmentations are applied, and several metrics are evaluated simultaneously [4; 10]. In scenarios involving marketing communications and frequent product launches, where decision-making speed is a critical parameter and managerial pressure is therefore present, adherence to planning discipline and the application of unified rules for result interpretation play an equally important role in sustaining the effectiveness of the experimental process [3; 12].
Based on the challenges outlined above, the methodology of the present study was developed. First, it is necessary to define the parameters that must be fixed prior to the start of the experiment (see Fig. 1):
Figure 1. Parameters to be fixed prior to the start of the experiment, compiled by the author
At the experiment preparation stage, baseline estimates of the metrics are also fixed based on a stable pre-experimental period. For conversion metrics, the baseline is the event probability p, which represents the initial conversion level. For mean-based metrics, the user-level standard deviation σ is estimated, which is required for correct statistical power and sample size calculations. For ratio metrics, the baseline value of the ratio r₀ is specified, along with statistics of the denominator (such as the mean of the corresponding quantity), which makes it possible to formalize the expected variability of the metric. The time period of the experiment is also explicitly fixed.
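For concreteness, these pre-launch parameters can be captured in a single immutable record. The following is a minimal Python sketch; the field names are illustrative assumptions and do not reproduce the schema used in the repository [9]:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of a pre-launch design record; field names are
# illustrative assumptions, not the schema of the repository [9].
@dataclass(frozen=True)
class ExperimentDesign:
    alpha: float                  # significance level, e.g. 0.05
    power: float                  # target statistical power, e.g. 0.80
    mde: float                    # minimum detectable effect, absolute
    baseline_p: Optional[float]   # baseline conversion (conversion metrics)
    sigma: Optional[float]        # user-level std dev (mean-based metrics)
    r0: Optional[float]           # baseline ratio value (ratio metrics)
    traffic_share: float          # share of traffic per group, e.g. 0.5
    daily_traffic: int            # total eligible daily traffic
    period_start: str             # fixed experimental window
    period_end: str
```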
The calculations are performed based on formulas (1–10).
For a two-sided test, formula (1) is used:
z_α = Φ⁻¹(1 − α/2), z_β = Φ⁻¹(1 − β), (1)
where Φ⁻¹ is the quantile function of the standard normal distribution and 1 − β is the target power.
At a significance level of α = 0,05 and a target power of 0,80, the corresponding quantiles of the standard normal distribution are z_α = 1,959964 and z_β = 0,841621; the square of their sum equals 7,848880 [9].
Let the baseline conversion be p and the absolute MDE be δ, so that p₁ = p, p₂ = p + δ, and p̄ = (p₁ + p₂) / 2. The variance multiplier is then defined by formula (2):
V = 2 p̄ (1 − p̄). (2)
The group size for equal groups is given by formula (3):
n = (z_α + z_β)² · V / δ². (3)
The power for a given n is computed via SE = sqrt(V / n) and z_eff = δ / SE, as the probability that a random variable Z ~ N(z_eff, 1) falls into the critical region |Z| > z_α [9].
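The following Python sketch illustrates formulas (1)–(3) and the power calculation; the function names are illustrative and do not reproduce the API of the repository [9]:

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_conversion(p: float, delta: float, alpha: float = 0.05,
                           power: float = 0.80) -> int:
    """Per-group sample size for an absolute MDE delta on baseline conversion p."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.959964 for alpha = 0.05, formula (1)
    z_beta = norm.ppf(power)            # 0.841621 for power = 0.80
    p_bar = p + delta / 2               # mean of p1 = p and p2 = p + delta
    v = 2 * p_bar * (1 - p_bar)         # variance multiplier, formula (2)
    return ceil((z_alpha + z_beta) ** 2 * v / delta ** 2)  # formula (3)

def achieved_power(p: float, delta: float, n: int, alpha: float = 0.05) -> float:
    """Power for a given per-group size n, via SE = sqrt(V / n)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    p_bar = p + delta / 2
    z_eff = delta / sqrt(2 * p_bar * (1 - p_bar) / n)
    # P(|Z| > z_alpha) for Z ~ N(z_eff, 1)
    return norm.cdf(z_eff - z_alpha) + norm.cdf(-z_eff - z_alpha)

print(sample_size_conversion(0.05, 0.005))  # 31235, matching Table 1
```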
For comparison of mean values with a given common standard deviation σ, formula (4) is used:
n = 2σ² · (z_α + z_β)² / δ². (4)
It follows that the sample size n is proportional to σ². For monetary data with heavy tails, planning experiments without an explicit estimate of σ therefore systematically leads to insufficient statistical power and biased expectations regarding the results [2; 5].
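A short sketch of formula (4) illustrates this dependence; the function name is again an illustrative assumption rather than the repository's API [9]:

```python
from math import ceil
from scipy.stats import norm

def sample_size_mean(sigma: float, delta: float, alpha: float = 0.05,
                     power: float = 0.80) -> int:
    """Per-group size for comparing means with common std dev sigma, formula (4)."""
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * sigma ** 2 * z_sum ** 2 / delta ** 2)

# Doubling sigma quadruples the required sample size (up to rounding):
print(sample_size_mean(1.0, 0.1), sample_size_mean(2.0, 0.1))  # 1570 6280
```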
Let T be the total eligible daily traffic and s be the share of traffic allocated to one group. Then the experiment duration is defined as the ratio of the required sample size to the daily traffic volume assigned to a group (5):
D = ⌈ n / (s · T) ⌉. (5)
The inverse problem, in which the duration D and the daily traffic T are given, is solved analytically for mean-based metrics using formula (6). For conversion metrics, the inverse problem is solved numerically, since the variance V depends on δ through the mean value p̄; the corresponding procedure is implemented in [9].
δ = (z_α + z_β) · σ · √(2 / (s · T · D)). (6)
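The duration calculation and the analytic inverse for mean-based metrics can be sketched as follows; the names are illustrative, not the repository's API [9]:

```python
from math import ceil, sqrt
from scipy.stats import norm

def duration_days(n_per_group: int, daily_traffic: int, share: float = 0.5) -> int:
    """Formula (5): days required to accumulate n users in one group."""
    return ceil(n_per_group / (daily_traffic * share))

def achievable_mde_mean(sigma: float, days: int, daily_traffic: int,
                        share: float = 0.5, alpha: float = 0.05,
                        power: float = 0.80) -> float:
    """Formula (6): analytic inverse for a mean-based metric; for conversion
    metrics the inverse is solved numerically, as noted above."""
    n = days * daily_traffic * share
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z_sum * sigma * sqrt(2 / n)

print(duration_days(31235, 20_000))  # 4 days, matching Table 1
```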
If the hypothesis is related to the estimation of monetization per unit of activity, a ratio metric is used (7):
R = E[X] / E[Y], (7)
where X is revenue per user over the period and Y is the number of sessions per user over the period. In practical calculations, the estimate R̂ = X̄ / Ȳ is used.
Linearization (delta method) is performed with respect to the baseline value r0 = X̄c / Ȳc. In this case, the user-level linearized value is defined as (8):
Li = Xi − r0 · Yi. (8)
The effect estimate on the ratio scale is then given by formula (9):
ΔR̂ = (L̄t − L̄c) / Ȳc. (9)
The analysis of the ratio metric reduces to testing the difference in mean values of Li, after which the effect estimate is transformed back to the ratio scale by dividing by the value Ȳc [7; 9].
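A minimal delta-method sketch of formulas (7)–(9) follows; it is an illustration under the stated definitions, not the repository's exact implementation [9]:

```python
import numpy as np
from scipy.stats import norm

def linearized_ratio_effect(x_c, y_c, x_t, y_t):
    """x: revenue per user, y: sessions per user; suffixes: control / treatment."""
    x_c, y_c, x_t, y_t = map(np.asarray, (x_c, y_c, x_t, y_t))
    r0 = x_c.mean() / y_c.mean()        # baseline ratio, formula (7)
    l_c = x_c - r0 * y_c                # user-level linearized values, formula (8)
    l_t = x_t - r0 * y_t
    diff = l_t.mean() - l_c.mean()      # tested as a difference in means
    se = np.sqrt(l_c.var(ddof=1) / l_c.size + l_t.var(ddof=1) / l_t.size)
    p_value = 2 * norm.sf(abs(diff / se))
    # back to the ratio scale by dividing by the control denominator, formula (9)
    return diff / y_c.mean(), se / y_c.mean(), p_value
```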
In the presence of a pre-period, formula (10) is used, which corresponds to the CUPED approach and is interpreted as a linear adjustment of the target metric using a pre-experimental covariate, equivalent to subtracting the predictable component of variation. In the formulation used below, Y denotes the value of the target metric during the experiment period (for example, revenue per user), while X denotes the covariate measured prior to the intervention (for example, the value of the same metric or a proxy for user activity in the pre-period).
Y′i = Yi − θ · (Xi − X̄), θ = cov(X, Y) / var(X). (10)
The covariance and variance in this expression are computed on pre-experimental data or on the pooled user sample, provided that only pre-treatment values are used. Critically, under correct randomization the CUPED method preserves the unbiasedness of the experimental effect estimate while reducing its standard error. As a first-order approximation, the adjusted variance is approximately (1 − ρ²) times the original, where ρ is the correlation coefficient between the target metric and the covariate.
In applied B2C scenarios, a set of pre-experimental covariates X = (X1, ..., Xk) is often used in practice, including, for example, metric values in the pre-period, user activity indicators, session frequency, and similar variables. In this case, the parameter θ is estimated in a regression framework on pre-experimental data as a vector of coefficients of a linear model of the form Y ~ X, and the adjustment of the target metric is performed using the corresponding linear combination of covariates (which is equivalent to subtracting from Y the portion of variation explained by user heterogeneity and not related to the experimental treatment).
It is important to emphasize that, for the correct application of CUPED, the covariates must satisfy a number of conditions:
- Be measured strictly prior to the start of the experiment and not depend on the experimental treatment.
- Exhibit a clear and interpretable relationship with the target metric.
- Have sufficient variability and coverage across the user population.
In practice, the greatest effect is achieved when covariates of the same nature as the target metric are used. In the presence of heavy-tailed distributions of monetary metrics, the rules for outlier handling and data transformations (winsorization, trimming, log transformation) must be fixed prior to launching the experiment, since the estimation of the parameter θ and the resulting variance are sensitive to rare extreme observations.
The method increases the sensitivity of the experiment in the presence of a pronounced correlation between X and Y [5]; issues related to the selection and tuning of optimal variance reduction methods are discussed in [6].
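A minimal CUPED sketch per formula (10) follows; the single-covariate setting and the winsorization threshold are illustrative assumptions rather than the repository's exact procedure [9]:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray,
                 winsor_q: float = 0.99) -> np.ndarray:
    """y: target metric in the experiment period; x: pre-period covariate.
    Both are winsorized first, since the estimate of theta is sensitive
    to rare extreme observations (the threshold is an assumed example)."""
    y = np.minimum(y, np.quantile(y, winsor_q))
    x = np.minimum(x, np.quantile(x, winsor_q))
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # formula (10)
    return y - theta * (x - x.mean())

# First-order check of the variance reduction factor 1 - rho^2:
# rho = np.corrcoef(x, y)[0, 1]
# np.var(cuped_adjust(y, x)) is approximately (1 - rho**2) * np.var(y)
```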
Thus, the proposed framework is based on the open repository [9] and is presented as a set of logically interconnected modules. The power module is responsible for sample size and statistical power calculations, while the design module links sample size to experiment duration, including the solution of inverse problems. The metrics module implements approaches for working with ratio metrics and their linearization, the variance_reduction module includes variance reduction procedures (CUPED), and the simulate module is used to generate synthetic data reflecting typical B2C distributions (heavy-tailed revenue distributions, heterogeneity of user activity, and rare large purchases).
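For illustration, a generator in the spirit of the simulate module might look as follows; the specific distributional choices (Poisson sessions, lognormal spend, a 1% share of rare large purchases) are assumptions, not the repository's parameters [9]:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def simulate_users(n: int):
    """Synthetic B2C users with typical distributional features."""
    sessions = 1 + rng.poisson(lam=3.0, size=n)           # heterogeneous activity
    revenue = rng.lognormal(mean=1.0, sigma=1.2, size=n)  # heavy-tailed revenue
    whales = rng.random(n) < 0.01                         # rare large purchases
    revenue[whales] *= 20.0
    return revenue, sessions

revenue, sessions = simulate_users(100_000)
print(revenue.mean(), np.median(revenue))  # mean far above median: heavy tail
```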
In summary, it is assumed that the use of a unified and reproducible A/B experimentation framework, in which the minimum detectable effect (MDE), power parameters, metric design, and experiment duration are specified and aligned prior to launch, improves the reliability of statistical conclusions and the interpretability of results in B2C products compared with fragmented experiment planning.
Results and Discussion. In all demonstrations, identical experimental parameters are used: a significance level of 0,05, a target power of 0,80, a 50/50 traffic allocation, and a daily eligible traffic of 20,000 users (interpreted as a typical order of magnitude for a mid-scale B2C product), which allows the scenarios to be compared within a common calculation framework. It is important to note that the chosen values of the significance level, target statistical power, and equal traffic allocation between the control and treatment groups are specified as the most commonly used set of assumptions in product analytics, which is necessary to ensure comparability of results across experiments and teams.
The purpose of the demonstration calculations is to compare experiment design scenarios under identical baseline conditions; for products of a different scale, all estimates are recalculated accordingly.
The first hypothesis is the following: under fixed traffic conditions, a decrease in the MDE leads to a nonlinear increase in sample size and experiment duration.
At a baseline conversion rate of 0,05 and an MDE of 0,5 percentage points, the required sample size is 31,235 users per group, which corresponds to an experiment duration of 4 days. When the MDE is reduced to 0,2 percentage points, the required sample size grows to 189,939 users per group and the duration to 19 days, as shown in Table 1:
Table 1.
Dependence of experiment duration on MDE (conversion), compiled by the author
| MDE, p.p. | Group size | Duration, days |
|---|---|---|
| 0,10 | 752704 | 76 |
| 0,20 | 189939 | 19 |
| 0,30 | 85201 | 9 |
| 0,50 | 31235 | 4 |
| 1,00 | 8159 | 1 |
The result shows that, under fixed traffic conditions, the trade-off between experiment duration and the magnitude of the MDE is strictly constrained. Attempting to shorten the duration without revising the MDE leads to a reduction in statistical power and increases the risk of non-reproducible results.
The second hypothesis is as follows: for monetary metrics, the required sample size grows quadratically with the standard deviation σ, that is, linearly with the variance σ² (see Fig. 2).
Figure 2. Influence of variance on sample size (mean-based metric), compiled by the author
Thus, based on the estimates, for monetary metrics the sample size is proportional to the square of the standard deviation (σ²). In the absence of an explicit estimate of σ, experiments often turn out to be underpowered, which is especially pronounced for heavy-tailed distributions.
The third hypothesis is as follows: ratio metrics with linearization make it possible to correctly separate activity effects from monetization effects (see Table 2):
Table 2.
Comparison of ARPU and the ratio metric across different scenarios, compiled by the author
| Scenario | Metric | Effect estimate | SE | p-value |
|---|---|---|---|---|
| +10% sessions | ARPU | 0,424 | 0,094 | 0,0000059 |
| +10% sessions | Revenue per session (linear) | 0,008 | 0,027 | 0,779 |
| +10% revenue per session | ARPU | 0,519 | 0,094 | 0,000000029 |
| +10% revenue per session | Revenue per session (linear) | 0,165 | 0,027 | 0,0000000019 |
As Table 2 shows, ARPU aggregates different mechanisms and may yield a statistically significant effect that does not correspond to the hypothesized mechanism. A ratio metric with linearization expresses the effect on the scale of the underlying mechanism and makes decision-making better controlled.
The fourth hypothesis is formulated as follows: under a fixed experiment duration, there exists a lower bound on the effect that can be detected with a given statistical power (see Table 3):
Table 3.
Achievable MDE under fixed experiment duration, compiled by the author
| Duration, days | Group size | MDE, p.p. |
|---|---|---|
| 7 | 70000 | 0,335 |
| 14 | 140000 | 0,236 |
| 21 | 210000 | 0,193 |
| 28 | 280000 | 0,167 |
Thus, it is evident that if the expected effect is below the achievable MDE, the experiment cannot provide a reliable conclusion without changing the experimental conditions.
The obtained results confirm that the reliability of A/B experiments is determined not by individual statistical formulas, but by the coherence of the entire planning process. The proposed framework relies on approximations that are valid for sufficiently large sample sizes and stable baseline estimates.
For monetary metrics with heavy-tailed distributions, data transformation rules must be fixed in advance. When a valid pre-period is available, the application of CUPED is preferable, as it reduces variance without altering the interpretation of the metric.
At the same time, it is important to emphasize that the proposed framework does not replace the control of experimental validity. User interference, seasonality, novelty effects, and traffic drift must be accounted for in the experimental protocol and may impose a minimum experiment duration regardless of power calculations. When multiple metrics and segmentations are used, the primary metric and analysis rules must be defined in advance.
Conclusion. Thus, the proposed practical framework for reliable A/B experiments in B2C products makes it possible to integrate power calculation, monetary metric design, and experiment duration planning into a single reproducible structure. The presented demonstrations show that fixing the MDE prior to launch, accounting for traffic constraints, and selecting metrics appropriately are the key conditions for obtaining reliable and interpretable experimental conclusions. The open implementation of the framework lowers adoption barriers and simplifies the audit of calculations, which in turn improves the reliability of A/B experiments in B2C products.
References:
1. Anikin D.A., Svishchev A.V. Method for selecting a statistical criterion for A/B testing // E-Scio. 2021. No. 11 (62). P. 298–303.
2. Budylin R., Drutsa A., Katsev I., Tsoy V. Controlled experiments in online services // Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM 2018). New York: ACM, 2018. P. 468–476.
3. Chumachenko A.A. Improving and speeding up A/B testing // Innovative Science. 2024. No. 9-1. P. 12–20.
4. Chumachenko A.A. Optimization of user experience using iterative A/B testing // Innovations and Investments. 2024. No. 8. P. 408–413. DOI: 10.24412/2307-180X-2024-8-408-413.
5. Deng A., Xu Y., Kohavi R., Walker T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data // Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM 2013). New York: ACM, 2013. P. 365–374.
6. Jin Y., Ba S. Towards optimal variance reduction in online controlled experiments // arXiv preprint. 2022. URL: https://doi.org/10.48550/arXiv.2110.13406
7. Khitskova Yu.V. Ratio metrics and methods for their substitution in A/B experiments // International Journal of Open Information Technologies. 2025. Vol. 13. No. 2. P. 114–120.
8. Martynov I.A., Pestun U.A., Eshkina O.I. A/B testing as a tool for improving the efficiency of digital projects // Natural and Humanitarian Studies. 2024. No. 5 (55). P. 569–572.
9. Niuhych. Trustworthy Experiments Core: code, notebooks, reproducible calculations and synthetic generators. URL: https://github.com/Niuhych/trustworthy-experiments-core/tree/main
10. Pavlovich Yu.G., Kirinovich I.F. A/B testing as an effective method for evaluating mobile applications // Reports of the Belarusian State University of Informatics and Radioelectronics. 2021. Vol. 19. No. 1. P. 30–36.
11. Quin F., Weyns D., Galster M., Costa S. et al. A/B testing: a systematic literature review // arXiv preprint. 2023. URL: https://doi.org/10.48550/arXiv.2308.04929
12. Znatdinov D.I. Prospects for the development of A/B testing in brand communication management // Stolypin Bulletin. 2023. Vol. 5. No. 4. P. 2073–2082.