R Script to analyse your experiments — Fixed sample size and Group Sequential A/B Testing

Andrea Corvi
8 min read · Dec 16, 2019


In 3+ years of experience running experiments, I’ve noticed that being strictly rigorous when it comes to test analysis is hard. Pressure from various stakeholders, or simply the rush to call a test a winner and move on to the next one, can lead to risky decisions for the business.

How to mitigate this risk? How can we be as objective and rigorous as possible when analysing tests without losing pace in running our experiments?

To answer these questions, I decided to put together an R script that helps with:

  • Creating a framework to standardise analyses across different teams
  • Reducing Type I/Type II errors when performing analyses (more statistical rigour to avoid inflating the chances of error)
  • Owning the analysis process rather than relying completely on the a/b testing platform used by the company
  • Stopping tests more quickly when they are performing either better or worse than expected (group sequential testing)

At the bottom of the page you can find the R script we currently use to analyse our experiments. Note: every site and customer base is different, so feel free to tailor it to your needs. Also, there might be parts that need improvements or more statistical rigour; in that case feel free to reach out!

How to use the script

The code implements the fixed sample size frequentist approach to a/b test analysis. In addition, for those who prefer it, I added a group sequential testing plot (particularly handy for saving time and early-stopping tests if you have a high-traffic website).

Choose your preferred approach based on your type of website, traffic volume, customer base…

1. Parameters Setup: the first section of the script is where the test parameters need to be set. You will find everything from significance level and power to actual conversion rates and the number of visitors per experience.
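
As a quick pre-test sanity check, the sample size needed to detect the minimum detectable effect can be estimated with base R’s power.prop.test, the same call the full script uses (the parameter values here are the script’s defaults):

```r
# Pre-test sample size estimate for the chosen m.d.e.
baseline <- 0.56   # baseline conversion rate
mde <- 0.03        # 3% relative minimum detectable effect
alpha <- 0.1       # significance level
power <- 0.8       # desired power

est <- power.prop.test(p1 = baseline,
                       p2 = baseline + baseline * mde,
                       sig.level = alpha, power = power,
                       alternative = "one.sided")
samples_needed_each <- round(est$n)           # per experience
samples_needed_total <- 2 * samples_needed_each
```

Running this before the test tells you roughly how long you will need to wait before calling it at the fixed horizon.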

Note: after all the parameters of your test are set up, run the whole script to generate the plots.

Tip: click the back arrow in the Plots pane to move to the next plot

2. Frequentist statistics

The first plot is a summary of the test statistics calculated using the frequentist framework:

  • statistically.significant : yes|no
  • p.value
  • delta : difference measured during the test
  • control and variation conversion rates
  • samples : samples collected during the test
  • samples needed : estimate of the number of samples necessary to reach statistical significance at the power set in the parameters, given the minimum detectable effect of interest
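
Under the hood these numbers come from a standard two-proportion pooled z-test; a minimal sketch using the script’s example parameters:

```r
# Two-proportion pooled z-test (same calculation as the full script)
crc <- 0.52;  nc <- 12500   # control conversion rate and visitors
crv <- 0.527; nv <- 12492   # variation conversion rate and visitors

cr_pooled <- (nc * crc + nv * crv) / (nc + nv)   # pooled conversion rate
se <- sqrt(cr_pooled * (1 - cr_pooled) * (1 / nc + 1 / nv))
z_score <- abs(crv - crc) / se
p_value <- 1 - pnorm(z_score)                    # one-sided p value
```

With these numbers the p value is around 0.13, so the example test would not be significant at alpha = 0.1.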

3. Distribution plot

By clicking the “Back” button again, you will get to the next plot. This one shows you at a glance how far the z-score is from the critical z-value.

The two curves are the normal distributions of the z statistic under H0 (no difference) and under H1

If the red dot is within the shaded dark blue area (significance region), then the test is statistically significant.

The shaded light blue area denotes the power (note that it includes the shaded dark blue area).

Not Statistically Significant Test (red dot to the left of the shaded area)
Statistically Significant Test (red dot to the right of the shaded area)
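
In code, the decision the plot visualises is just a comparison of the z-score with the critical value; a minimal sketch using the script’s example parameters (which give a z-score of roughly 1.11):

```r
# Significance check behind the distribution plot
alpha <- 0.1 / 1                 # Bonferroni-corrected alpha (1 variation)
z_critical <- qnorm(1 - alpha)   # one-sided critical value, about 1.28
z_score <- 1.11                  # approximate z-score for the example data
z_score > z_critical             # FALSE: red dot left of the shaded area
```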

4. Confidence Interval

This plot simply shows the confidence interval on the test statistic (the difference between conversion rates) at the confidence level chosen at the beginning (parameters setup): e.g. if you chose alpha=0.1, the confidence level will be 90%.
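
For the one-sided test the script builds, the interval is a lower confidence bound on delta; a minimal sketch (se is the pooled standard error for the script’s example parameters, roughly 0.0063):

```r
# One-sided lower confidence bound on the measured difference
delta <- 0.007      # crv - crc in the example parameters
se <- 0.00632       # approximate pooled standard error for those parameters
alpha <- 0.1
z_critical <- qnorm(1 - alpha)
lower_bound <- delta - z_critical * se
# lower_bound < 0 here, so a zero (or negative) effect can't be ruled out
```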

5. Group Sequential Testing: shows the 3 checkpoints (how many samples per peek) and their thresholds for rejecting or accepting the null hypothesis. We use this to early-stop a test without losing statistical rigour.
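
The checkpoint spacing itself is simple: the script peeks at 20%, 50% and 100% of the per-variation group sequential sample size, then doubles the counts to express them in total samples. A sketch with a made-up n_sequential (the real value comes from design$n.I in the full script):

```r
# Checkpoints at 20% / 50% / 100% of the per-variation sample size
n_sequential <- 20000   # hypothetical; taken from gsDesign in the full script
checkpoints <- ceiling(n_sequential * c(0.2, 0.5, 1)) * 2   # total samples
```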

Note: if you’re interested in knowing more about group sequential testing, check out the first two links at the bottom of the page.

Checkpoints (measured in samples) and Early-Stop Thresholds

R Script:

library(mise)
library(stringr)
library(plyr)
library(dplyr)
library(pwr)
library(ggplot2)
library(scales)
library(svDialogs)
library(grid)
library(gridExtra)
library(data.table)
library(gtable)
library(bayesAB)
library(gsDesign)
# Cleaning the house
mise()
####### PARAMETERS SETUP ########
# Baseline - used for pre-test analysis
baseline=0.56
# Defining significance level (alpha)
alpha=0.1
# Defining desired power
power=0.8
# How many variations (not counting control)
variations=1
# Conversion Rates (crc=control & crv=variation)
crc=0.52
crv=0.527
# Visitors (nc=control & nv=variation)
nc=12500
nv=12492
# desired minimum detectable effect of interest
mde=0.03
################################
beta=1-power
# total samples
samples_total=nc+nv
# Pooled conversion rate calculation
crpooled=((nc*crc)+(nv*crv))/(nc+nv)
# relative uplift/downlift
delta=(crv-crc)
delta_relative=(crv-crc)/crc
# based on delta, choose the one sided test to perform
if(delta > 0){
alternative="greater"} else {alternative="less"}
# Z calculation
se=sqrt(crpooled*(1-crpooled)*(1/nc+1/nv))
zscore=abs((crv-crc)/se)
# P value (right-tailed) - 2* if wanted for two-tailed
pvalue=(1-(pnorm(abs(zscore))))
#applying bonferroni correction to avoid Multiple Testing problem
alpha=alpha/variations
confidence_level=1-alpha
# Is the delta statistically significant?
if(pvalue < alpha){
statistically_significant="yes"} else {statistically_significant="no"}
# Samples needed for detecting m.d.e. at specified power and significance level
samples_needed=power.prop.test( p1 = baseline, p2 = baseline+baseline*mde, power = power, sig.level = alpha, alternative = "one.sided" )
samples_needed_each=round(samples_needed$n,digits=0)
samples_needed=round(samples_needed$n*2,digits=0)
# Confidence intervals calculation
# critical Z score calculation for confidence level
z_critical=qnorm(confidence_level)
# calculation of error
confidence_e=abs(z_critical)*se
# Confidence intervals at same confidence level used for stats calculations
confidence_intervals= delta-confidence_e
delta_interval <- c(confidence_intervals[1],delta,delta)
delta_interval_plot <- c(confidence_intervals[1],delta,delta+0.01)
delta_interval_label <- percent(delta_interval,accuracy = .05)
confidence_plot <- data.frame(delta_interval)
confidence_plot$label=delta_interval_label
confidence_plot$name=c("lower","measured", "+Inf")
# Statistics Summary Plot - Data frame creation
statistics_summary <- data.frame(methodology="frequentist",statistically_significant,round(pvalue,2),percent(round(delta,4),accuracy = .5),percent(crc),percent(crv),samples=nc+nv,samples_needed)
# Statistics Summary Plot - Manipulation
setnames(statistics_summary, old = c('round.pvalue..2.','percent.crc.','percent.crv.','percent.round.delta..4...accuracy...0.5.','samples_needed','statistically_significant'), new = c('p.value','control.cr','variation.cr','delta','fixed.samples.needed (pre-test)','statistically.significant'), skip_absent=TRUE)
if(delta>0){setnames(statistics_summary, old = c('delta.relative'), new = c('uplift.relative'), skip_absent=TRUE)}
if(delta<0){setnames(statistics_summary, old = c('delta.relative'), new = c('downlift.relative'), skip_absent=TRUE)}
statistics_summary <- t(statistics_summary)
# GROUP SEQUENTIAL TESTING
# gsDesign() is used to find boundaries and trial size required for a group sequential design.
# k=Number of analyses planned, including interim and final
# test.type=1 means one sided || type=2 two-sided symmetric || type=3 two-sided asymmetric with binding lower bound
# sfu - spending function for boundary type when (test.type=1 or 2) // choose one later
# sfl - spending function for lower bound when asymmetric or two-sided testing is performed (test.type = 3, 4, 5, or 6) // choose one later
# sfupar - specifies any parameter needed by sfu. default is -4
# n.fix - sample size for fixed design with no interim. It is used to find maximum group sequential sample size.
#The function sfPower() implements a Kim-DeMets (power) spending function
design = gsDesign(k=3, test.type = 3, alpha = alpha, delta=mde,
sfu=sfPower, sfupar=2, sfl=sfPower, sflpar=3, beta = beta)
n_sequential = tail(design$n.I, 1)
# checkpoints at 20% 50% 100%
checkpoints = c(ceiling(n_sequential / 5),
ceiling(n_sequential / 2),
ceiling(n_sequential))
#checkpoints in total samples format
checkpoints = checkpoints*2
# number of samples collected per variation
n_control=nc
n_variation=nv
n_min=min(nc,nv)
if(samples_total>=checkpoints[1]){

checkpoints_data <- data.frame(checkpoints=checkpoints, design$upper$bound,design$lower$bound )


polygon_lower <- c(checkpoints_data$design.lower.bound,-Inf,-Inf,-Inf)
polygon_upper <- c(checkpoints_data$design.upper.bound,Inf,Inf,Inf)
polygon_checkpoints <- c(checkpoints_data$checkpoints,checkpoints[3],checkpoints[2],checkpoints[1])
polygon_data <- data.frame(checkpoints=polygon_checkpoints, lower=polygon_lower, upper=polygon_upper)





# Plot of Group Sequential Testing

ggplot()+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=lower),fill="red",alpha=0.3)+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=upper), fill="lightgreen", alpha=0.5)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound),colour="black",size=4,shape=4)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound),colour="black",size=4,shape=4)+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound))+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound))+
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound, label=round(design.lower.bound,2)),nudge_y=-0.2,show.legend = FALSE) +
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound, label=round(design.upper.bound,2)),nudge_y=0.2,show.legend = FALSE) +
labs(x= "samples", y = 'zscore' )+
labs(title=paste("Group Sequential Testing | ","","checkpoint1:",checkpoints[1]," ","checkpoint2:",checkpoints[2]," ","checkpoint3:",checkpoints[3]," ","collected samples:",samples_total))+
theme(plot.title = element_text(size = 11))

}
#Plotting the overlay of probability distribution of Z assuming H0 true and the same assuming H0 false (Ha true)
normal_x_axis <- seq(-7, 7, 0.1) # use for generating the x axis of the normal distribution
# Plot confidence intervals for delta
if(min(delta_interval) < 0 & max(delta_interval) > 0){
ggplot()+
geom_rect(data=NULL,aes(xmin=0,xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=0,ymin=-Inf,ymax=Inf),fill="red",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
geom_vline(xintercept=0, linetype="dashed", color = "black")+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}
if(min(delta_interval) > 0 & max(delta_interval) > 0){
ggplot()+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}
# The shaded dark blue area denotes the significance region, while the the shaded blue area denotes the power (note that it includes the shaded dark blue area).
suppressWarnings(print(
if(delta > 0){
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore, 1) )
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true", "z score"), values = c("black", "blue","red") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(zscore,2)," ","z_critical=",round(z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11, hjust = 0.5, color = "black")) +
geom_point(y=0,x=zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
} else {
# first curve has mean 0 because crv-crc is null under H0 true
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore*-1, 1) )
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true"), values = c("black", "blue") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(-zscore,2)," ","z_critical=",round(-z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11, hjust = 0.5, color = "black")) +
geom_point(y=0,x=-zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
}))
# Plot of statistics summary
ggplot()+ theme(panel.background = element_blank())
table_colors <- data.frame(color = c("grey90", "#b9f7fc", "#b9f7fc", "#b9f7fc",
"grey90", "grey90", "grey90", "#b9f7fc","grey90","grey90","grey90", "grey90","grey90", "grey90", "grey90", "grey90"),stringsAsFactors = FALSE)
theme <- ttheme_default(core=list(bg_params = list(fill = table_colors$color, col=NA)))
grid.table(statistics_summary,theme = theme)

Sources


Written by Andrea Corvi

Head of Experimentation @ LiveScore Group. Passionate about user experience, statistics and everything CRO related
