Understand how the CPP A/B Testing algorithm works
This article explains how the CPP A/B Testing algorithm works. The content is more technical and is intended for users who want a deeper understanding of how tests are structured and results are produced.
You don’t need this information to run tests. The platform applies these calculations automatically.
Nature of the data
The statistical approach we developed for testing a newly created custom product page in Apple Ads campaigns is outlined below. In a custom product page A/B test, each time a user lands on the page, one of two outcomes occurs:
- Either they convert (success)
- or they do not convert (failure)
In this case, each user interaction can be considered a Bernoulli trial. When we repeat this trial n times — for example, 1000 users — the number of successes (conversions) follows a binomial distribution. If the sample size is sufficiently large, this distribution will approximate a normal distribution.
To make this clearer, consider the following example: flip a coin twice and look at the probabilities of getting 0, 1, or 2 heads; these counts follow a binomial distribution. If we instead toss the coin 1,000 times and plot the number of heads observed, the resulting distribution takes the shape of a bell curve. In other words, in a small sample the conversion counts clearly follow a binomial distribution, but in a large sample that binomial distribution is well approximated by a normal distribution.
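To make the coin example concrete, here is a minimal simulation sketch (written in Python with NumPy purely for illustration; it is not part of the platform):

```python
import numpy as np

rng = np.random.default_rng(42)

# Small experiment: two coin flips, repeated many times.
# The number of heads (0, 1, or 2) follows a binomial(n=2, p=0.5) distribution.
small = rng.binomial(n=2, p=0.5, size=10_000)
print(np.bincount(small) / small.size)  # roughly [0.25, 0.5, 0.25]

# Large experiment: 1,000 coin flips per run.
# The number of heads follows binomial(1000, 0.5), which is closely
# approximated by a normal distribution with mean 500 and std ≈ 15.8.
large = rng.binomial(n=1000, p=0.5, size=10_000)
print(large.mean(), large.std())        # close to 500 and 15.8
```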
You can see the details below.


Example based on custom product page conversion rate results
Let’s say the true conversion rate of a custom product page variant is 5%.
- The variant is shown to 1,000 different users, and the process is repeated many times (for example, 1,000 different days, with 1,000 users per day).
- At the end of each day, you record the conversion rate from that day’s 1000 users (for example, 4.8%, 5.3%, 4.9%, etc.).
- Over time, you accumulate 1000 data points made up of these sample means.
- Initially, these values may appear scattered, but according to the Central Limit Theorem (CLT), the distribution of these rates will approach a normal distribution. In other words, if you plot a histogram of these 1000 daily conversion rates, you’ll get a bell curve.
- The center of this bell curve represents the true average conversion rate (5%), while the spread around it represents the random variation.
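The following sketch reproduces this thought experiment under the stated assumptions (a true rate of 5%, 1,000 users per day, 1,000 days); it is illustrative only and not the platform's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_RATE = 0.05      # assumed true conversion rate of the variant
USERS_PER_DAY = 1000  # users shown the variant each day
DAYS = 1000           # number of repeated "days"

# Each day, count conversions among 1,000 Bernoulli trials and record the rate.
daily_conversions = rng.binomial(n=USERS_PER_DAY, p=TRUE_RATE, size=DAYS)
daily_rates = daily_conversions / USERS_PER_DAY

# The daily rates cluster around the true 5% rate, and their spread matches
# the theoretical value sqrt(p * (1 - p) / n) ≈ 0.0069.
print(daily_rates.mean())  # close to 0.05
print(daily_rates.std())   # close to 0.0069
print(np.sqrt(TRUE_RATE * (1 - TRUE_RATE) / USERS_PER_DAY))
```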

Managing random variation
One of the first things to consider in test planning is the question, "How precise do I want the results of this test to be?" The level of precision of the test is determined by the margin of error. The margin of error defines the degree to which the test results are allowed to vary from the true value.
We do not know the true conversion rate of the users who arrive during the test. The observed rate may fluctuate slightly each day. For instance, if the original ad group’s conversion rate is 5% on average, the conversion rate for incoming users may be 5.7% one day and 4.3% the next. This is not a systematic error but random variation: even if the same ad and page are shown, user behavior can vary slightly from day to day.
To measure this natural variability, Standard Error (SE) is used. Standard Error measures the spread of sample means around the true conversion rate: the smaller the Standard Error, the more precise the estimate of the conversion rate. It indicates how much the observed conversion rate can deviate from the true rate on average. The formula is:
Standard Error = √(p(1−p) / n)
where:
- p is the estimated conversion rate,
- n is the sample size (number of users tested).
The logic of this formula is as follows: the more users you have, the more these random fluctuations balance out, and the standard error decreases. In other words, as n increases, the test becomes more stable. The smaller the standard error, the more precise the test results.
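As a rough illustration of how sample size drives the standard error, here is a small Python sketch with values chosen only for demonstration:

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard Error = sqrt(p * (1 - p) / n) for an observed conversion rate p over n users."""
    return math.sqrt(p * (1 - p) / n)

# With a 5% conversion rate, the estimate tightens as the sample grows:
print(standard_error(0.05, 100))     # ≈ 0.0218 (about ±2.2 points of typical fluctuation)
print(standard_error(0.05, 1_000))   # ≈ 0.0069
print(standard_error(0.05, 10_000))  # ≈ 0.0022
```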
Controlling random deviation: Margin of Error (ε)
To keep this random variability within a certain limit, we define a margin of error (ε) in advance. The standard error gives us the size of the "average fluctuation," while the z-value determines how many standard-error widths this fluctuation is allowed to cover. The z-value is selected based on the confidence level: for a 95% confidence level, the z-value is approximately 1.96, meaning the test results hold with 95% confidence. In other words, z × SE is the half-width of the interval that covers the specified confidence level on the normal distribution.
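For reference, these z-values come directly from the standard normal distribution; one way to verify them (here using SciPy, only to show where the constants come from) is:

```python
from scipy.stats import norm

# The z-value leaves (1 - confidence) / 2 probability in each tail
# of the standard normal distribution.
for confidence in (0.90, 0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)
    print(confidence, round(z, 3))  # 0.90 -> 1.645 (≈ 1.65), 0.95 -> 1.96
```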
- Formula to calculate the margin of error: ε = z × √(p(1−p) / n)
- Formula to calculate the sample size for custom product page test: n = z²× p(1−p) / ε²
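A minimal sketch of both formulas, with a 95% confidence level (z ≈ 1.96) chosen purely for illustration:

```python
import math

def margin_of_error(p: float, n: int, z: float) -> float:
    """ε = z × sqrt(p(1−p) / n): how far the observed rate may deviate from the true rate."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(p: float, eps: float, z: float) -> int:
    """n = z² × p(1−p) / ε²: users needed per variant to stay within ±eps."""
    return math.ceil(z ** 2 * p * (1 - p) / eps ** 2)

# With p = 5% and 1,000 users, a 95% confidence level (z ≈ 1.96) gives ε ≈ ±1.35 points.
print(margin_of_error(0.05, 1_000, z=1.96))          # ≈ 0.0135
# Tightening that to ±1 point at the same confidence level needs a larger sample.
print(required_sample_size(0.05, eps=0.01, z=1.96))  # 1825
```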
The test duration is then calculated based on the data below:
- The original ad group's conversion rate for the last 28 days (p)
- The confidence level we choose (z) — for example, for a 90% confidence level, z = 1.65 (We calculate this fixed value when determining the test duration, as we want to maintain this level of confidence in the test results.)
- The margin of error we want to set (ε) — for example, ±1 percentage point (0.01)
Converting to test duration
After determining the required sample size per variant (n), the test duration is calculated as follows:
Test Duration (days) = n / Daily average taps per variant
For example:
- p = 0.05 (i.e., 5% conversion rate)
- ε = 0.01 (±1 percentage point margin of error)
- z = 1.65 (90% confidence level)
When we plug these values into the formula:
n = (1.65² × (0.05 × 0.95)) / 0.01² ≈ 1293
So, approximately 1,293 users are required per variant. If you are getting a total of 500 taps per day, this results in 250 taps per variant per day. Therefore, the test duration is calculated as 1293 / 250 ≈ 5.2 days.
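The whole worked example can be reproduced with a short calculation; the figures below mirror the example above (p = 5%, ε = ±1 point, 90% confidence, 500 total daily taps), and any rounding of the final day count is left to the platform:

```python
p = 0.05                 # original ad group's conversion rate (last 28 days)
eps = 0.01               # accepted margin of error: ±1 percentage point
z = 1.65                 # z-value for a 90% confidence level
daily_taps_total = 500   # taps per day across both variants

# Required users per variant: n = z² × p(1−p) / ε²
n_per_variant = z ** 2 * p * (1 - p) / eps ** 2
print(round(n_per_variant))                    # ≈ 1293

# Taps are split evenly between the two variants, so each gets ~250 per day.
daily_taps_per_variant = daily_taps_total / 2
print(n_per_variant / daily_taps_per_variant)  # ≈ 5.2 days
```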
In conclusion, the sample size (n), the chosen confidence level (z), the original conversion rate (p), and the accepted margin of error (ε) represent a statistical balance. This formula makes it possible to predict in advance how long the test will run, how many users are needed, and how reliable the results will be.
