R Script to analyse your experiments — Fixed sample size and Group Sequential A/B Testing
In 3+ years of experience running experiments, I’ve noticed that being strictly rigorous when it comes to test analysis is hard. Pressure from various stakeholders, or simply the rush to call a test a winner and move on to the next one, can lead to risky decisions for the business.
How to mitigate this risk? How can we be as objective and rigorous as possible when analysing tests without losing pace in running our experiments?
To answer these questions, I decided to put together an R script that helps with:
- Creating a framework to standardise analyses across different teams
- Reducing Type I/Type II errors when performing analyses (more statistical rigour to avoid inflating the chances of error)
- Owning the analysis process, rather than relying completely on the a/b testing platform used by the company
- Stopping tests more quickly when they are performing either better or worse than expected (group sequential testing)
At the bottom of the page you can find the R script we currently use to analyse our experiments. Note: every site and customer base is different, so feel free to tailor it to your needs. Also, there might be parts that need improvement or more statistical rigour; in that case, feel free to reach out!
How to use the script
The code implements the fixed sample size frequentist approach to a/b testing analysis. In addition, for those who prefer it, I added a group sequential testing plot (particularly handy for saving time by early-stopping tests on high-traffic websites).
Choose your preferred approach based on your type of website, traffic volume, customer base…
1. Parameters Setup: the first section of the script is where the test parameters need to be set. You will find everything from significance level and power to actual conversion rates and the number of visitors per experience.
Notes:
- “variations” is used to apply the Bonferroni correction for sample size calculation
- By default, test type is set to one-tailed, based on http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/
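The Bonferroni note above can be sketched as follows. This is a minimal illustration (the numbers are made up, not the script's defaults): the overall alpha is split across the variation-vs-control comparisons, and the corrected alpha feeds into base R's `power.prop.test` sample size estimate.

```r
# Illustrative values, not from a real experiment
alpha <- 0.1       # overall significance level
variations <- 2    # number of variations, not counting control
baseline <- 0.56   # baseline conversion rate
mde <- 0.03        # relative minimum detectable effect

# Bonferroni correction: split alpha across the comparisons
alpha_corrected <- alpha / variations

# Per-group sample size to detect the m.d.e. at the corrected alpha
n <- power.prop.test(p1 = baseline,
                     p2 = baseline + baseline * mde,
                     power = 0.8,
                     sig.level = alpha_corrected,
                     alternative = "one.sided")$n
ceiling(n)  # samples needed per group
```

With two variations the corrected alpha drops from 0.1 to 0.05, which is why adding variations inflates the sample size you need.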
After all the parameters of your test are set up, run the whole script to generate the plots.
Tip: Click on the back arrow in the plots section to get to the next plot
2. Frequentist statistics
The first plot is a summary of the test statistics calculated using the frequentist framework:
- statistically.significant : yes|no
- p.value
- delta : difference measured during the test
- control and variation conversion rates
- samples : samples collected during the test
- samples needed : estimate of the number of samples necessary to reach statistical significance at the power set in the parameters, given the minimum detectable effect of interest
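The summary statistics above come from a standard pooled two-proportion z-test. A minimal sketch, reusing the example parameter values from the script below:

```r
# Example data (same illustrative values as the script's parameters)
nc <- 12500; crc <- 0.52    # control: visitors and conversion rate
nv <- 12492; crv <- 0.527   # variation: visitors and conversion rate
alpha <- 0.1

# Pooled conversion rate and standard error of the difference
cr_pooled <- (nc * crc + nv * crv) / (nc + nv)
se <- sqrt(cr_pooled * (1 - cr_pooled) * (1 / nc + 1 / nv))

# z-score and one-tailed p-value
z <- abs(crv - crc) / se
p_value <- 1 - pnorm(z)

significant <- p_value < alpha
```

With these numbers the p-value comes out above alpha, so the 0.7pp lift is not statistically significant, which is exactly what the summary table would report.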
3. Distribution plot
By clicking the “Back” button again, you will get to the next plot. This one shows at a glance how far the z-score is from the critical z-value.
The two curves are the normal distributions of the test statistic under the null and alternative hypotheses.
If the red dot is within the shaded dark blue area (significance region), then the test is statistically significant.
The shaded light blue area denotes the power (note that it includes the shaded dark blue area).
4. Confidence Interval
This plot shows the confidence interval on the test statistic (the difference between conversion rates) at the confidence level chosen in the parameters setup, e.g. if you chose alpha=0.1 the confidence level will be 90%.
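As a minimal sketch of that interval, continuing the illustrative numbers above (a one-sided test, so only the lower bound is finite):

```r
alpha <- 0.1                     # so the confidence level is 90%
delta <- 0.007                   # measured difference crv - crc
se <- 0.0063                     # standard error of the difference

z_critical <- qnorm(1 - alpha)   # one-sided critical z at 90% confidence
lower_bound <- delta - z_critical * se

# One-sided 90% interval: (lower_bound, +Inf)
```

Here the lower bound is slightly negative, i.e. the interval contains zero, which is consistent with the test not being significant at alpha=0.1.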
5. Group Sequential Testing: shows the 3 checkpoints (how many samples per peek) and their own thresholds to reject or accept the null hypothesis. We use this to early-stop a test without losing statistical rigour.
Note: if you’re interested in knowing more about group sequential testing, check out the first two links at the bottom of the page
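A sketch of where the three checkpoints and their boundaries come from: the script calls `gsDesign` from the gsDesign package with a Kim-DeMets (power) spending function for both bounds. The parameter values here mirror the script's defaults but are only illustrative.

```r
library(gsDesign)

alpha <- 0.1   # significance level
beta <- 0.2    # 1 - power
mde <- 0.03    # minimum detectable effect, passed as the design's delta

# Three analyses (two interim peeks + final), two-sided asymmetric design
design <- gsDesign(k = 3, test.type = 3, alpha = alpha, beta = beta,
                   delta = mde,
                   sfu = sfPower, sfupar = 2,   # upper (efficacy) spending
                   sfl = sfPower, sflpar = 3)   # lower (futility) spending

design$upper$bound   # z thresholds to reject H0 (stop for a winner) at each peek
design$lower$bound   # z thresholds to accept H0 (stop for futility) at each peek
tail(design$n.I, 1)  # maximum samples per variation for the sequential design
```

At each checkpoint you compare the observed z-score against that checkpoint's pair of bounds: cross the upper bound and you can stop early with a winner, cross the lower bound and you can stop for futility; otherwise you keep collecting samples.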
R Script:
library(mise)
library(stringr)
library(plyr)
library(dplyr)
library(pwr)
library(ggplot2)
library(scales)
library(svDialogs)
library(grid)
library(gridExtra)
library(data.table)
library(gtable)
library(bayesAB)
library(gsDesign)

# Cleaning the house
mise()

####### PARAMETERS SETUP #######

# Baseline - used for pre-test analysis
baseline=0.56
# Defining significance level (alpha)
alpha=0.1
# Defining desired power
power=0.8
# How many variations (not counting control)
variations=1
# Conversion Rates (crc=control & crv=variation)
crc=0.52
crv=0.527
# Visitors (nc=control & nv=variation)
nc=12500
nv=12492
# desired minimum detectable effect of interest
mde=0.03
################################

beta=1-power

# total samples
samples_total=nc+nv

# Pooled conversion rate calculation
crpooled=((nc*crc)+(nv*crv))/(nc+nv)
# relative uplift/downlift
delta=(crv-crc)
delta_relative=(crv-crc)/crc
# based on delta, choose the one sided test to perform
if(delta > 0){
alternative="greater"} else {alternative="less"}
# Z calculation
se=sqrt(crpooled*(1-crpooled)*(1/nc+1/nv))
zscore=abs((crv-crc)/se)
# P value (right-tailed) - 2* if wanted for two-tailed
pvalue=(1-(pnorm(abs(zscore))))

# applying Bonferroni correction to avoid the Multiple Testing problem
alpha=alpha/variations
confidence_level=1-alpha
# Is the delta statistically significant?
if(pvalue < alpha){
statistically_significant="yes"} else {statistically_significant="no"}

# Samples needed for detecting m.d.e. at specified power and significance level
samples_needed=power.prop.test( p1 = baseline, p2 = baseline+baseline*mde, power = power, sig.level = alpha, alternative = "one.sided" )
samples_needed_each=round(samples_needed$n,digits=0)
samples_needed=round(samples_needed$n*2,digits=0)

# Confidence intervals calculation
# critical Z score calculation for confidence level
z_critical=qnorm(confidence_level)
# calculation of error
confidence_e=abs(z_critical)*se
# Confidence intervals at same confidence level used for stats calculations
confidence_intervals= delta-confidence_e
delta_interval <- c(confidence_intervals[1],delta,delta)
delta_interval_plot <- c(confidence_intervals[1],delta,delta+0.01)
delta_interval_label <- percent(delta_interval,accuracy = .05)
confidence_plot <- data.frame(delta_interval)
confidence_plot$label=delta_interval_label
confidence_plot$name=c("lower","measured", "+Inf")

# Statistics Summary Plot - Data frame creation
statistics_summary <- data.frame(methodology="frequentist",statistically_significant,round(pvalue,2),percent(round(delta,4),accuracy = .5),percent(crc),percent(crv),samples=nc+nv,samples_needed)

# Statistics Summary Plot - Manipulation
setnames(statistics_summary, old = c('round.pvalue..2.','percent.crc.','percent.crv.','percent.round.delta..4...accuracy...0.5.','samples_needed','statistically_significant'), new = c('p.value','control.cr','variation.cr','delta','fixed.samples.needed (pre-test)','statistically.significant'), skip_absent=TRUE)
if(delta>0){setnames(statistics_summary, old = c('delta.relative'), new = c('uplift.relative'), skip_absent=TRUE)}
if(delta<0){setnames(statistics_summary, old = c('delta.relative'), new = c('downlift.relative'), skip_absent=TRUE)}
statistics_summary <- t(statistics_summary)

# GROUP SEQUENTIAL TESTING
# gsDesign() is used to find boundaries and trial size required for a group sequential design.
# k=Number of analyses planned, including interim and final
# test.type=1 means one sided || type=2 two-sided symmetric || type=3 two-sided asymmetric with binding lower bound
# sfu - spending function for boundary type when (test.type=1 or 2) // choose one later
# sfl - spending function for lower bound when asymmetric or two-sided testing is performed (test.type = 3, 4, 5, or 6) // choose one later
# sfupar - specifies any parameter needed by sfu. default is -4
# n.fix - sample size for fixed design with no interim. It is used to find maximum group sequential sample size.
# The function sfPower() implements a Kim-DeMets (power) spending function
design = gsDesign(k=3, test.type = 3, alpha = alpha, delta=mde,
                  sfu=sfPower, sfupar=2, sfl=sfPower, sflpar=3, beta = beta)

n_sequential = tail(design$n.I, 1)

# checkpoints at 20% 50% 100%
checkpoints = c(ceiling(n_sequential / 5),
ceiling(n_sequential / 2),
                ceiling(n_sequential))

# checkpoints in total samples format
checkpoints = checkpoints*2

# number of samples collected per variation
n_control=nc
n_variation=nv
n_min=min(nc,nv)

if(samples_total>=checkpoints[1]){
checkpoints_data <- data.frame(checkpoints=checkpoints, design$upper$bound,design$lower$bound )
polygon_lower <- c(checkpoints_data$design.lower.bound,-Inf,-Inf,-Inf)
polygon_upper <- c(checkpoints_data$design.upper.bound,Inf,Inf,Inf)
polygon_checkpoints <- c(checkpoints_data$checkpoints,checkpoints[3],checkpoints[2],checkpoints[1])
polygon_data <- data.frame(checkpoints=polygon_checkpoints, lower=polygon_lower, upper=polygon_upper)
# Plot of Group Sequential Testing
ggplot()+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=lower),fill="red",alpha=0.3)+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=upper), fill="lightgreen", alpha=0.5)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound),colour="black",size=4,shape=4)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound),colour="black",size=4,shape=4)+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound))+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound))+
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound, label=round(design.lower.bound,2)),nudge_y=-0.2,show.legend = FALSE) +
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound, label=round(design.upper.bound,2)),nudge_y=0.2,show.legend = FALSE) +
labs(x= "samples", y = 'zscore' )+
labs(title=paste("Group Sequential Testing | ","","checkpoint1:",checkpoints[1]," ","checkpoint2:",checkpoints[2]," ","checkpoint3:",checkpoints[3]," ","collected samples:",samples_total))+
theme(plot.title = element_text(size = 11))
}

# Plotting the overlay of the probability distribution of Z assuming H0 is true and the same assuming H0 is false (Ha true)
normal_x_axis <- seq(-7, 7, 0.1) # used for generating the x axis of the normal distribution

# Plot confidence intervals for delta
if(min(delta_interval) < 0 & max(delta_interval)>0){
ggplot()+
geom_rect(data=NULL,aes(xmin=0,xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=0,ymin=-Inf,ymax=Inf),fill="red",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
geom_vline(xintercept=0, linetype="dashed", color = "black")+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}
if(min(delta_interval) > 0 & max(delta_interval)>0){
ggplot()+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}

# The shaded dark blue area denotes the significance region, while the shaded light blue area denotes the power (note that it includes the shaded dark blue area).
suppressWarnings(print(
if(delta > 0){
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore, 1) )
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true", "z score"), values = c("black", "blue","red") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(zscore,2)," ","z_critical=",round(z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.title = element_text(color = "black")) +
geom_point(y=0,x=zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
} else{
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore*-1, 1) ) # first one has mean 0 because crv-crc is null under H0 true
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true"), values = c("black", "blue") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(-zscore,2)," ","z_critical=",round(-z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.title = element_text(color = "black")) +
geom_point(y=0,x=-zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
}))

# Plot of statistics summary
ggplot()+ theme(panel.background = element_blank())
table_colors <- data.frame(color = c("grey90", "#b9f7fc", "#b9f7fc", "#b9f7fc",
"grey90", "grey90", "grey90", "#b9f7fc","grey90","grey90","grey90", "grey90","grey90", "grey90", "grey90", "grey90"),stringsAsFactors = FALSE)
theme <- ttheme_default(core=list(bg_params = list(fill = table_colors$color, col=NA)))
grid.table(statistics_summary,theme = theme)
Sources