Product Analytics Team Lead, Simpals, Moldova, Chisinau
WORKING WITH RATIO METRICS IN A/B TESTING
ABSTRACT
This work is devoted to the analysis and application of ratio metrics in the context of A/B testing, a particularly important tool in product development. A/B testing allows you to compare two versions of a product or process, evaluating their effectiveness on the basis of statistically significant data. The paper considers the main stages and methods of conducting A/B tests, including the formulation of hypotheses, the selection and analysis of key performance indicators (KPIs), and the determination of sample size and statistical data processing. Special attention is paid to the problems of ratio metrics, which may involve difficulties due to the internal dependence of data within a single user and therefore require specialized statistical approaches such as the bootstrap, the delta method, or linearization. The results of the study emphasize the importance of the correct choice of analysis methods and interpretation of results to reach objective and well-founded conclusions in A/B testing.
Keywords: ratio metrics, digitalization, programming, IT, information technology, A/B testing.
Introduction
A/B testing, also known as split testing or variant testing, is a method for comparing two versions of a web page, user interface element, or other product to determine which one is more effective in terms of user interaction, customer engagement, or conversion [2].
Working with ratio metrics in A/B testing is a crucial aspect of analyzing and optimizing user experience and business metrics. The relevance of this topic is driven by the growing need for companies to make informed, data-driven decisions, which allows them to enhance user experience, increase profits, and improve competitiveness. Intra-user correlations become significant when dealing with ratio metrics, because such metrics analyze units at a more granular level than the experimental units; examples include click-through rate (CTR), average order value (AOV), search result page to view item (SRP to View), and search result page exit rate (SRP Exit Rate) [1,4].
However, working with ratio metrics presents certain challenges, such as the complexity of interpreting results and the dependence of the metrics on multiple factors. These challenges highlight the importance of a well-thought-out approach to experiment design and data analysis in order to obtain valid and practically meaningful conclusions.
1. Background
A/B testing is a strategic tool that evaluates two alternative versions of a product or business process through the following steps (Figure 1):
Hypothesis Formulation. Define a hypothesis about the potential improvement of your product or process that you want to test. This can relate to anything from improving the user interface to changing your marketing strategy.
Randomization. Divide the target audience into two groups at random to ensure the reliability of the test results. This is critical to avoid sampling bias.
Key Performance Indicators (KPIs). Select the variant that performs best according to predefined key performance indicators (KPIs). KPIs may include:
- Revenue: This KPI evaluates the financial performance of the experiment. It measures the increase in revenue directly attributable to the changes tested in the experiment.
- Conversion rates: This measures the effectiveness of the experiment in converting users to a desired action, such as making a purchase or signing up for a service. Higher conversion rates indicate more successful outcomes.
Likelihood of Side Effects. Assess the likelihood of side effects. This metric measures the probability that the experiment will produce adverse effects. Minimizing side effects is crucial for maintaining user satisfaction and trust.
Minimum Detectable Effect (MDE) is the smallest significant change in a measurable parameter that can be detected during an experiment under specified conditions, such as sample size and significance level. MDE indicates the minimum magnitude of an effect that researchers can reliably detect, which helps determine the necessary sample size to achieve statistical significance and confidence in the experiment’s results. MDE is a crucial parameter for evaluating the cost and potential profitability of conducting A/B experiments. From a practical perspective for mobile app marketers, choosing an appropriate MDE value means achieving a balance between the cost of acquiring paid traffic for the experiment and attaining a meaningful return on investment (ROI) [8].
Calculate Sample Size. A sample size calculator allows for the evaluation of various statistical schemes when planning an experiment (trial, test) whose conclusions are drawn from a null hypothesis statistical test. It can be used both as a sample size calculator and as a statistical power calculator. Typically, the required sample size is determined for a specific power requirement; however, when a predefined sample size exists, the calculator can instead compute the power for the given effect size of interest [11,13]. A minimal sketch of such a calculation is shown after Figure 1.
These metrics will help quantify the impact of each release on the organization's strategic and operational goals.
A/B testing provides a statistically sound method of assessing the impact of intended changes, allowing informed decisions to be made based on data and not just intuitive assumptions [6,9].
Figure 1. The A/B testing process
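As an illustration, the following is a minimal sketch of a sample size calculation for comparing two conversion rates; the baseline rate, MDE, significance level, and power are hypothetical values, and the calculation uses the statsmodels library.

```python
# A minimal sample-size sketch for comparing two proportions
# (all numbers are assumptions for illustration).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10        # assumed control conversion rate
mde = 0.01             # assumed minimum detectable effect (+1 p.p.)
effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h

n = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative='two-sided'
)
print(f"Required sample size per group: {n:.0f}")
```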
Turning to the objectives of A/B testing, the available methods for evaluating potential changes in a product or business process make it possible to achieve the following:
- Verification of changes: Before implementing any modifications, it is important to test them to prevent possible negative impacts on performance or metrics, such as decreased clickability of ads.
- Analyzing user behavior: Testing helps to identify which elements attract users' attention and which ones repel them. This knowledge allows for more effective and engaging designs.
- Improving user experience: A/B tests reveal the most user-friendly interface options, which help optimize navigation and interaction processes.
- Optimize your traffic budget: Experiments help reduce the cost per lead (CPL) by selecting the most effective ad copy and designs.
- Increase conversions: Testing points to those elements (text, images, design) that most motivate the audience to take targeted actions.
2. Methods
In A/B testing, particularly when dealing with ratio metrics, it is crucial to employ advanced statistical methods to ensure accurate and reliable results. Theoretically, the main methods include bootstrap, delta method, and linearization. Each method provides unique advantages and applications in the analysis of ratio metrics, ensuring the robustness and precision of the findings.
1. Linearization
Linearization in A/B testing is a highly computationally efficient and scalable method for transforming a ratio metric into a mean user metric. It preserves the directionality of the observed effect with the change in the target ratio metric. Moreover, the difference in linearized metrics in experiments maintains a consistent level of statistical significance with the original ratio metric and is computed using a t-test. Since linearization yields user-level signals, it opens up opportunities to apply methods that enhance the sensitivity of the ratio metric [12].
Consider a ratio metric such as Click-Through Rate (CTR):

$$\mathrm{CTR} = \frac{\sum_{u} C_u}{\sum_{u} S_u},$$

where $C_u$ is the number of clicks of user $u$, $S_u$ is the number of impressions of user $u$, and the sums run over all users.

To linearize it, define:

$$L(u) = C_u - K \cdot S_u,$$

where $K$ is the CTR of the control group and $S_u$ is the number of impressions of user $u$.

We obtain a new, final metric as the per-user average:

$$\bar{L} = \frac{1}{|U|} \sum_{u \in U} L(u),$$

where $U$ is the total set of users.
In the context of the experiment, the CTR for the control group A, where no changes were made, is denoted as K. The function L(u) determines the error compared to the control, calculated as the difference between the actual number of clicks and the expected number of clicks based on the CTR of the control group and the total number of impressions.
A change in CTR is assumed to produce a response proportional to user activity. In this context, it is proposed to assign different weights to participants, ranging from equal weighting to activity-based weighting, in which highly active users receive a lower weight to reduce their dominant influence on the results. This approach mitigates distortions caused by the high activity of individual users and makes the evaluation more objective.
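To make the procedure concrete, the following is a minimal sketch of linearization on synthetic per-user data; the data model (Poisson impressions, binomial clicks) and all numbers are assumptions for illustration.

```python
# A minimal linearization sketch on synthetic per-user data (assumed model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def make_group(n_users, ctr):
    s = rng.poisson(20, n_users) + 1   # impressions S_u per user
    c = rng.binomial(s, ctr)           # clicks C_u per user
    return c, s

c_a, s_a = make_group(10_000, 0.100)   # control group A
c_b, s_b = make_group(10_000, 0.102)   # treatment group B

k = c_a.sum() / s_a.sum()              # K: CTR of the control group
l_a = c_a - k * s_a                    # linearized metric L(u)
l_b = c_b - k * s_b

# Linearized values are per-user and independent, so a plain t-test applies.
t, p = stats.ttest_ind(l_a, l_b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4f}")
```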
2. Bootstrap
Bootstrap is a method of statistical analysis that involves repeatedly obtaining sample data with replacement from the original data set to estimate population parameters. This approach provides an empirical distribution of statistics, which is particularly useful in the absence of exact theoretical distributions. For ratio metrics, Bootstrap helps to estimate confidence intervals and test hypotheses.
Suppose we have a data set $D = \{x_1, x_2, \dots, x_n\}$, where $n$ is the sample size. We want to estimate the ratio metric $R$, which is a function of this data, such as the mean or median.
1. Bootstrap Sample Generation:
Create $B$ bootstrap samples, each consisting of $n$ elements selected with replacement from $D$. Let us denote these samples as $D^{*1}, D^{*2}, \dots, D^{*B}$.
2. Calculating Bootstrap Statistics:
For each bootstrap sample $D^{*b}$ (where $b = 1, 2, \dots, B$) we calculate the ratio metric of interest $\hat{R}^{*b} = R(D^{*b})$.
3. Constructing the empirical distribution:
Given the $B$ values $\hat{R}^{*1}, \dots, \hat{R}^{*B}$, we can construct an empirical distribution of the ratio metric $R$.
4. Estimation of confidence intervals:
Using the empirical distribution, we can estimate confidence intervals for the ratio metric. For example, for a 95% confidence interval, we use the 2.5th and 97.5th percentiles of the distribution of $\hat{R}^{*b}$.
This method is robust and does not rely on assumptions about the underlying distribution of the data, making it versatile for various experimental conditions.
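As an illustration, here is a minimal bootstrap sketch for a CTR-style ratio metric; the per-user arrays are synthetic, and whole users are resampled so that intra-user dependence is preserved.

```python
# A minimal percentile-bootstrap sketch for sum(clicks)/sum(views)
# (synthetic data; whole users are resampled with replacement).
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ratio_ci(clicks, views, n_boot=10_000, alpha=0.05):
    n = len(clicks)
    r_star = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)               # resample user indices
        r_star[b] = clicks[idx].sum() / views[idx].sum()
    return np.percentile(r_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])

views = rng.poisson(20, 5_000) + 1
clicks = rng.binomial(views, 0.1)
lo, hi = bootstrap_ratio_ci(clicks, views)
print(f"95% CI for CTR: [{lo:.4f}, {hi:.4f}]")
```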
3. Delta Method
The delta method is a theorem used to derive the distribution of a function of asymptotically normal variables. It is commonly used to obtain standard errors and confidence intervals for functions of parameters whose estimates are asymptotically normal [5,14].
For ratio metrics, the delta method involves the following steps:
- Define the ratio metric as a function of two random variables.
- Linearly approximate this function using a Taylor series expansion.
- Use the properties of the normal distribution to estimate the asymptotic distribution of the ratio metric.
Let $X$ and $Y$ be two random variables representing the numerator and denominator of the ratio metric, respectively. The ratio metric $R$ is defined as:

$$R = \frac{X}{Y}.$$

Using the Taylor series expansion around the mean values $\mu_X$ and $\mu_Y$:

$$R \approx \frac{\mu_X}{\mu_Y} + \frac{\partial R}{\partial X}\Big|_{(\mu_X, \mu_Y)} (X - \mu_X) + \frac{\partial R}{\partial Y}\Big|_{(\mu_X, \mu_Y)} (Y - \mu_Y),$$

where the partial derivatives are:

$$\frac{\partial R}{\partial X} = \frac{1}{\mu_Y}, \qquad \frac{\partial R}{\partial Y} = -\frac{\mu_X}{\mu_Y^2}.$$

Thus, the linear approximation becomes:

$$R \approx \frac{\mu_X}{\mu_Y} + \frac{X - \mu_X}{\mu_Y} - \frac{\mu_X (Y - \mu_Y)}{\mu_Y^2},$$

which yields the variance approximation

$$\operatorname{Var}(R) \approx \frac{\operatorname{Var}(X)}{\mu_Y^2} - \frac{2\mu_X \operatorname{Cov}(X, Y)}{\mu_Y^3} + \frac{\mu_X^2 \operatorname{Var}(Y)}{\mu_Y^4}.$$
The delta method allows us to analytically estimate the variance and confidence intervals of ratio metrics, making it a valuable tool in data analysis. However, it relies on the asymptotic normality of the estimators and requires a sufficiently large sample size for the asymptotic approximations to be accurate.
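The variance approximation above translates directly into code. Below is a minimal delta-method sketch for $R = \bar{X}/\bar{Y}$ on synthetic per-user data; the data model is an assumption for illustration.

```python
# A minimal delta-method sketch for R = mean(X) / mean(Y) (synthetic data).
import numpy as np

def delta_var_ratio(x, y):
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov = np.cov(x, y, ddof=1)[0, 1]
    # Var(R) ~ (1/n) * (Var(X)/mu_Y^2 - 2*mu_X*Cov(X,Y)/mu_Y^3
    #                   + mu_X^2*Var(Y)/mu_Y^4)
    return (vx / my**2 - 2 * mx * cov / my**3 + mx**2 * vy / my**4) / n

rng = np.random.default_rng(0)
y = rng.poisson(20, 5_000) + 1      # per-user impressions
x = rng.binomial(y, 0.1)            # per-user clicks
r = x.mean() / y.mean()
se = np.sqrt(delta_var_ratio(x, y))
print(f"R = {r:.4f}, 95% CI = [{r - 1.96 * se:.4f}, {r + 1.96 * se:.4f}]")
```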
In the context of the "Peeking problem," where frequent testing of experimental results can lead to erroneous conclusions, various strategies such as the use of Bayesian A/B tests, multi-armed bandits, and sequential testing are proposed to reduce the probability of false positives. These approaches provide a structured mechanism for real-time adaptation of the testing strategy, which reduces the risks associated with frequent interventions in the experiment process [10].
Bayesian A/B testing allows for the incorporation of prior information and updates the probability of success as data is collected. Multi-armed bandits allocate traffic dynamically to the best-performing variations, thereby minimizing regret and maximizing the overall reward. Sequential testing provides a framework to monitor the experiment continuously and make decisions as soon as sufficient evidence is available, reducing the total sample size required and the duration of the test.
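For the Bayesian variant, a minimal Beta-Binomial sketch is given below; the conversion counts and the uniform Beta(1, 1) prior are assumptions for illustration.

```python
# A minimal Beta-Binomial sketch of a Bayesian A/B comparison
# (hypothetical counts; Beta(1, 1) prior).
import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 480, 5_000            # control: conversions, users
conv_b, n_b = 530, 5_000            # treatment: conversions, users

# Posterior of each conversion rate is Beta(1 + successes, 1 + failures).
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 100_000)

print("P(B > A) =", (post_b > post_a).mean())
```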
3. Modeling for power and significance assessment
The algorithm designed to estimate power and statistical significance in the context of sample size calculation includes the following steps (a simulation sketch follows the list):
Extracting a random sample of data, including the numerator (number of clicks) and denominator (number of visits). Parameters for sample size, such as the level of first-order error, test power, and minimum detectable effect (MDE), are determined using the previously described calculator.
Splitting the sample in half, where 50% of the data act as the control group and the other 50% as the experimental group.
Calculating the p-value based on the comparison of the control and experimental groups.
Adjusting the data in the experimental group based on the true effect size: if the original data shows 3 clicks, then taking into account the true effect of 5%, this number is increased to 3.15 clicks. This process can be called synthetic addition.
Re-calculating the p-value, this time comparing the control with the modified value in the experimental group.
Repeating all the above steps (1 to 5) n times. The relative proportion of cases where the p-value at the third step is less than 0.05 reflects the empirical level of statistical significance. Similarly, the proportion of cases where the p-value at the fifth step is less than 0.05 reflects the empirical power [7].
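A minimal simulation sketch of this procedure is shown below; the data model is synthetic, and the p-value at each step is computed with a Welch t-test on linearized per-user values (one possible choice, following Section 2).

```python
# A minimal power/significance simulation following the steps above
# (synthetic data model; Welch t-test on linearized values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_EFFECT = 1.05                  # assumed true effect: +5% clicks
N_SIM, N_USERS = 2_000, 2_000

def pvalue(c_a, s_a, c_b, s_b):
    k = c_a.sum() / s_a.sum()       # control-group CTR
    return stats.ttest_ind(c_a - k * s_a, c_b - k * s_b,
                           equal_var=False).pvalue

alpha_hits = power_hits = 0
for _ in range(N_SIM):
    # Steps 1-2: draw a sample and split it 50/50.
    s = rng.poisson(20, 2 * N_USERS) + 1          # visits per user
    c = rng.binomial(s, 0.1).astype(float)        # clicks per user
    c_a, s_a, c_b, s_b = c[:N_USERS], s[:N_USERS], c[N_USERS:], s[N_USERS:]
    # Step 3: A/A p-value, no effect injected.
    alpha_hits += pvalue(c_a, s_a, c_b, s_b) < 0.05
    # Steps 4-5: synthetic addition (e.g. 3 -> 3.15 clicks), new p-value.
    power_hits += pvalue(c_a, s_a, c_b * TRUE_EFFECT, s_b) < 0.05

# Step 6: empirical significance level and empirical power.
print(f"alpha = {alpha_hits / N_SIM:.3f}, power = {power_hits / N_SIM:.3f}")
```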
4. Ratio-metrics
In the context of A/B testing, the key element is to define two basic concepts: unit of analysis and unit of randomization.
The unit of analysis is defined as the entity for which the outcome metric is measured. This can be a user, session, order, button, banner, or period. For example, looking at total revenue for a certain period can lead to different conclusions depending on which base unit this value refers to: if revenue is divided by users, we get ARPU (average revenue per user); if by orders, we get the average order value.
The randomization unit is the entity that is randomly assigned to the control or experimental group in A/B tests. The most common randomization unit is users, but other units such as sessions or orders can also be used. Randomization is critical to eliminate the influence of uncontrollable variables such as the user's age, gender, geographic location, or behavioral characteristics that may skew the results of the experiment.
Randomization ensures that any observed differences between metrics in the control and experimental groups can be explained either by the impact of the implemented changes or by random variation. It also promotes the independence of observations within each group, which is a fundamental requirement for the application of statistical criteria.
When the unit of analysis is the same as the unit of randomization, it is safe to use standard statistical criteria to test hypotheses, for example, to measure user conversion or the average metric across users. Each user signal is then treated as independent, which is ideal for statistical evaluation.
However, when a metric is calculated relative to a different unit of analysis, the concept of ratio metrics arises. Such a metric can be described as a ratio between aggregated user data, such as the average order value, which is calculated by dividing total spend by the number of orders fulfilled by users. Ratio metrics present a challenge for the standard t-test because of the intrinsic dependence between observations within a single user.
Synthetic A/A tests can show that, for ratio metrics, unlike average user metrics, the p-value distribution is often skewed, indicating an inflated Type I error. This emphasizes the need to choose alternative methods for assessing statistical significance for ratio metrics, such as proxy metrics, the bootstrap, or the delta method, each with its own features and limitations [3].
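The effect described above can be reproduced with a small A/A simulation: below is a minimal sketch in which a session-level metric is tested naively with a t-test while users (the randomization units) carry correlated sessions; the data model is an assumption for illustration.

```python
# A minimal A/A sketch: a naive session-level t-test under user-level
# randomization inflates the Type I error (synthetic data model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIM, N_USERS = 1_000, 500

def group_sessions():
    # Each user has a personal CTR (intra-user correlation) and several
    # sessions; session-level click indicators are flattened together.
    user_ctr = rng.beta(2, 18, N_USERS)           # mean CTR ~ 0.10
    sessions = rng.poisson(10, N_USERS) + 1
    return np.concatenate([rng.binomial(1, p, k)
                           for p, k in zip(user_ctr, sessions)])

false_pos = 0
for _ in range(N_SIM):
    # The naive test treats sessions as independent observations.
    p = stats.ttest_ind(group_sessions(), group_sessions(),
                        equal_var=False).pvalue
    false_pos += p < 0.05

print(f"empirical Type I error: {false_pos / N_SIM:.3f} (nominal 0.05)")
```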
Conclusion
Within the framework of the conducted research, it was established that the use of ratio metrics in A/B testing requires a special approach in data analysis due to their complex statistical nature. It was confirmed that standard testing methods can be ineffective due to the intrinsic dependence of the observed variables, which makes it important to choose alternative statistical methods. Applying the delta method, bootstrap, and other approaches can improve the accuracy and reliability of the results, thus determining the most effective changes in the product or process. The experimental data and methodological approaches discussed in this paper provide an important framework for planning and conducting A/B testing aimed at improving user experience and business metrics. The study results highlight the need for an integrated approach to data analysis and interpretation, which is key to achieving statistically valid conclusions and sustainable solutions in optimizing business processes.
References:
- Budylin, R., Drutsa, A., Katsev, I., Tsoy, V.: Consistent transformation of ratio metrics for efficient online controlled experiments. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 2018. pp. 55–63. ACM.
- Claeys E. et al. Dynamic allocation optimization in a/b-tests using classification-based preprocessing // IEEE Transactions on Knowledge and Data Engineering. 2021. Vol. 35, no. 1. pp. 335–349.
- Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. [Electronic resource] Access mode https://www.researchgate.net/publication/322969314_Consistent_Transformation_of_Ratio_Metrics_for_Efficient_Online_Controlled_Experiments (accessed 8.05.2024).
- Dealing With Ratio Metrics in A/B Testing in the Presence of Intra-User Correlation and Segments. [Electronic resource] Access mode https://arxiv.org/pdf/1911.03553 (accessed 8.05.2024).
- Delta method. [Electronic resource] Access mode https://www.statlect.com/asymptotic-theory/delta-method (accessed 8.05.2024).
- Deng, A., Knoblich, U., Lu, J.: Applying the delta method in metric analytics: a practical guide with novel ideas // Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. pp. 233–242.
- Huang X., Zhou Y., Wang X. and Wang S. A Machine Learning Approach to Optimizing A/B Testing // Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2022. pp. 4-13.
- Keyu Nie, Yinfei Kong, Ted Tao Yuan, Pauline Berry Burke Dealing with Ratio Metrics in A/B Testing at the Presence of Intra-user Correlation and Segments // Web Information Systems Engineering – WISE 2020. [Electronic resource] Access mode https://link.springer.com/chapter/10.1007/978-3-030-62008-0_39 (accessed 8.05.2024).
- Kohavi, R., Tang, D., Xu, Y.: Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing // Cambridge University Press, Cambridge. 2020. pp. 1-9.
- Mahajan P. et al. Optimizing Experimentation Time in A/B Testing: An Analysis of Two-Part Tests and Upper Bounding Techniques // 2023 IEEE International Conference on Contemporary Computing and Communications (InC4). IEEE, 2023. Vol. 1. pp. 1–4.
- Sample Size and Power Calculation. [Electronic resource] Access mode https://www.researchgate.net/publication/319442443_Sample_Size_and_Power_Calculation (accessed 8.05.2024).
- Sekhon, J.S., Shem-Tov, Y. Inference on a new class of sample average treatment effects. 2020. pp. 1–18.
- Tabea Hoffmann, Eric-Jan Wagenmakers Bayesian Inference for the A/B Test: Example Applications with R and JASP // University of Amsterdam. 2020. pp. 1-30.
- Zhao, Z., Liu, M., Deb, A. Safely and quickly deploying new features with a staged rollout framework using sequential test and adaptive experimental design // 3rd International Conference on Computational Intelligence and Applications (ICCIA). 2018. pp. 59–70.