R Script to analyse your experiments — Fixed sample size and Group Sequential A/B Testing
In 3+ years of experience running experiments, I’ve noticed that being strictly rigorous when it comes to test analysis is hard. Pressure from various stakeholders, or simply the rush to call a test a winner and move on to the next one, can lead to risky decisions for the business.
How to mitigate this risk? How can we be as objective and rigorous as possible when analysing tests without losing pace in running our experiments?
To answer these questions, I decided to put together an R script that helps with:
- Creating a framework to standardise analyses across different teams
- Reducing Type I/Type II errors when performing analyses (more statistical rigour to avoid inflating the chances of error)
- Owning the analysis process, rather than relying completely on the a/b testing platform used by the company
- Stopping tests more quickly when they are performing either better or worse than expected (group sequential testing)
At the bottom of the page you can find the R script we currently use to analyse our experiments. Note: every site and customer base is different, so feel free to tailor it to your needs. Also, there might be parts that need improvement or more statistical rigour; in that case, feel free to reach out!
How to use the script
The code implements the fixed sample size frequentist approach to a/b testing analysis. In addition, for those who prefer it, I added a group sequential testing plot (particularly handy for saving time by early-stopping tests on high-traffic websites).
Choose your preferred approach based on your type of website, traffic volume, customer base…
1. Parameters Setup: the first section of the script is where the test parameters need to be set. You will find everything from significance level and power to actual conversion rates and the number of visitors per experience.
Notes:
- “variations” is used to apply the Bonferroni correction for sample size calculation
- By default, test type is set to one-tailed, based on http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/
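The Bonferroni note above can be sketched as follows. This is a minimal illustration (the numbers are made up, not the script's defaults): the overall alpha is split across the variation-vs-control comparisons, and the corrected alpha feeds into base R's `power.prop.test` sample size estimate.

```r
# Illustrative values, not from a real experiment
alpha <- 0.1       # overall significance level
variations <- 2    # number of variations, not counting control
baseline <- 0.56   # baseline conversion rate
mde <- 0.03        # relative minimum detectable effect

# Bonferroni correction: split alpha across the comparisons
alpha_corrected <- alpha / variations

# Per-group sample size to detect the m.d.e. at the corrected alpha
n <- power.prop.test(p1 = baseline,
                     p2 = baseline + baseline * mde,
                     power = 0.8,
                     sig.level = alpha_corrected,
                     alternative = "one.sided")$n
ceiling(n)  # samples needed per group
```

With two variations the corrected alpha drops from 0.1 to 0.05, which is why adding variations inflates the sample size you need.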
After all the parameters of your test are set up, run the whole script to generate the plots.
Tip: Click on the back arrow in the plots section to get to the next plot
2. Frequentist statistics
The first plot is a summary of the test statistics calculated using the frequentist framework:
- statistically.significant : yes|no
- p.value
- delta : difference measured during the test
- control and variation conversion rates
- samples : samples collected during the test
- samples needed : estimate of the number of samples necessary to reach statistical significance at the power set in the parameters, given the minimum detectable effect of interest
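The summary statistics above come from a standard pooled two-proportion z-test. A minimal sketch, reusing the example parameter values from the script below:

```r
# Example data (same illustrative values as the script's parameters)
nc <- 12500; crc <- 0.52    # control: visitors and conversion rate
nv <- 12492; crv <- 0.527   # variation: visitors and conversion rate
alpha <- 0.1

# Pooled conversion rate and standard error of the difference
cr_pooled <- (nc * crc + nv * crv) / (nc + nv)
se <- sqrt(cr_pooled * (1 - cr_pooled) * (1 / nc + 1 / nv))

# z-score and one-tailed p-value
z <- abs(crv - crc) / se
p_value <- 1 - pnorm(z)

significant <- p_value < alpha
```

With these numbers the p-value comes out above alpha, so the 0.7pp lift is not statistically significant, which is exactly what the summary table would report.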
3. Distribution plot
By clicking the “Back” button again, you will get to the next plot. This one shows at a glance how far the z-score is from the critical z-value.
The two curves are the normal distributions of the test statistic under the null and alternative hypotheses.
If the red dot is within the shaded dark blue area (significance region), then the test is statistically significant.
The shaded light blue area denotes the power (note that it includes the shaded dark blue area).
4. Confidence Interval
This plot shows the confidence interval on the test statistic (the difference between conversion rates) at the confidence level chosen in the parameters setup, e.g. if you chose alpha=0.1 the confidence level will be 90%.
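As a minimal sketch of that interval, continuing the illustrative numbers above (a one-sided test, so only the lower bound is finite):

```r
alpha <- 0.1                     # so the confidence level is 90%
delta <- 0.007                   # measured difference crv - crc
se <- 0.0063                     # standard error of the difference

z_critical <- qnorm(1 - alpha)   # one-sided critical z at 90% confidence
lower_bound <- delta - z_critical * se

# One-sided 90% interval: (lower_bound, +Inf)
```

Here the lower bound is slightly negative, i.e. the interval contains zero, which is consistent with the test not being significant at alpha=0.1.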
5. Group Sequential Testing: shows the 3 checkpoints (how many samples per peek) and their own thresholds to reject or accept the null hypothesis. We use this to early-stop a test without losing statistical rigour.
Note: if you’re interested in knowing more about group sequential testing, check out the first two links at the bottom of the page
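A sketch of where the three checkpoints and their boundaries come from: the script calls `gsDesign` from the gsDesign package with a Kim-DeMets (power) spending function for both bounds. The parameter values here mirror the script's defaults but are only illustrative.

```r
library(gsDesign)

alpha <- 0.1   # significance level
beta <- 0.2    # 1 - power
mde <- 0.03    # minimum detectable effect, passed as the design's delta

# Three analyses (two interim peeks + final), two-sided asymmetric design
design <- gsDesign(k = 3, test.type = 3, alpha = alpha, beta = beta,
                   delta = mde,
                   sfu = sfPower, sfupar = 2,   # upper (efficacy) spending
                   sfl = sfPower, sflpar = 3)   # lower (futility) spending

design$upper$bound   # z thresholds to reject H0 (stop for a winner) at each peek
design$lower$bound   # z thresholds to accept H0 (stop for futility) at each peek
tail(design$n.I, 1)  # maximum samples per variation for the sequential design
```

At each checkpoint you compare the observed z-score against that checkpoint's pair of bounds: cross the upper bound and you can stop early with a winner, cross the lower bound and you can stop for futility; otherwise you keep collecting samples.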
R Script:
library(mise)
library(stringr)
library(plyr)
library(dplyr)
library(pwr)
library(ggplot2)
library(scales)
library(svDialogs)
library(grid)
library(gridExtra)
library(data.table)
library(gtable)
library(bayesAB)
library(gsDesign)

# Cleaning the house
mise()

####### PARAMETERS SETUP #######

# Baseline - used for pre-test analysis
baseline=0.56
# Defining significance level (alpha)
alpha=0.1
# Defining desired power
power=0.8
# How many variations (not counting control)
variations=1
# Conversion Rates (crc=control & crv=variation)
crc=0.52
crv=0.527
# Visitors (nc=control & nv=variation)
nc=12500
nv=12492
# desired minimum detectable effect of interest
mde=0.03
################################

beta=1-power

# total samples
samples_total=nc+nv

# Pooled conversion rate calculation
crpooled=((nc*crc)+(nv*crv))/(nc+nv)
# relative uplift/downlift
delta=(crv-crc)
delta_relative=(crv-crc)/crc
# based on delta, choose the one sided test to perform
if(delta > 0){
alternative="greater"} else {alternative="less"}
# Z calculation
se=sqrt(crpooled*(1-crpooled)*(1/nc+1/nv))
zscore=abs((crv-crc)/se)
# P value (right-tailed) - 2* if wanted for two-tailed
pvalue=(1-(pnorm(abs(zscore))))

# applying Bonferroni correction to avoid the Multiple Testing problem
alpha=alpha/variations
confidence_level=1-alpha
# Is the delta statistically significant?
if(pvalue < alpha){
statistically_significant="yes"} else {statistically_significant="no"}

# Samples needed for detecting m.d.e. at specified power and significance level
samples_needed=power.prop.test( p1 = baseline, p2 = baseline+baseline*mde, power = power, sig.level = alpha, alternative = "one.sided" )
samples_needed_each=round(samples_needed$n,digits=0)
samples_needed=round(samples_needed$n*2,digits=0)

# Confidence intervals calculation
# critical Z score calculation for confidence level
z_critical=qnorm(confidence_level)
# calculation of error
confidence_e=abs(z_critical)*se
# Confidence intervals at same confidence level used for stats calculations
confidence_intervals= delta-confidence_e
delta_interval <- c(confidence_intervals[1],delta,delta)
delta_interval_plot <- c(confidence_intervals[1],delta,delta+0.01)
delta_interval_label <- percent(delta_interval,accuracy = .05)
confidence_plot <- data.frame(delta_interval)
confidence_plot$label=delta_interval_label
confidence_plot$name=c("lower","measured", "+Inf")

# Statistics Summary Plot - Data frame creation
statistics_summary <- data.frame(methodology="frequentist",statistically_significant,round(pvalue,2),percent(round(delta,4),accuracy = .5),percent(crc),percent(crv),samples=nc+nv,samples_needed)

# Statistics Summary Plot - Manipulation
setnames(statistics_summary, old = c('round.pvalue..2.','percent.crc.','percent.crv.','percent.round.delta..4...accuracy...0.5.','samples_needed','statistically_significant'), new = c('p.value','control.cr','variation.cr','delta','fixed.samples.needed (pre-test)','statistically.significant'), skip_absent=TRUE)
if(delta>0){setnames(statistics_summary, old = c('delta.relative'), new = c('uplift.relative'), skip_absent=TRUE)}
if(delta<0){setnames(statistics_summary, old = c('delta.relative'), new = c('downlift.relative'), skip_absent=TRUE)}
statistics_summary <- t(statistics_summary)

# GROUP SEQUENTIAL TESTING
# gsDesign() is used to find boundaries and trial size required for a group sequential design.
# k=Number of analyses planned, including interim and final
# test.type=1 means one sided || type=2 two-sided symmetric || type=3 two-sided asymmetric with binding lower bound
# sfu - spending function for boundary type when (test.type=1 or 2) // choose one later
# sfl - spending function for lower bound when asymmetric or two-sided testing is performed (test.type = 3, 4, 5, or 6) // choose one later
# sfupar - specifies any parameter needed by sfu. default is -4
# n.fix - sample size for fixed design with no interim. It is used to find maximum group sequential sample size.
# The function sfPower() implements a Kim-DeMets (power) spending function
design = gsDesign(k=3, test.type = 3, alpha = alpha, delta=mde,
                  sfu=sfPower, sfupar=2, sfl=sfPower, sflpar=3, beta = beta)

n_sequential = tail(design$n.I, 1)

# checkpoints at 20% 50% 100%
checkpoints = c(ceiling(n_sequential / 5),
ceiling(n_sequential / 2),
                ceiling(n_sequential))

# checkpoints in total samples format
checkpoints = checkpoints*2

# number of samples collected per variation
n_control=nc
n_variation=nv
n_min=min(nc,nv)

if(samples_total>=checkpoints[1]){
checkpoints_data <- data.frame(checkpoints=checkpoints, design$upper$bound,design$lower$bound )
polygon_lower <- c(checkpoints_data$design.lower.bound,-Inf,-Inf,-Inf)
polygon_upper <- c(checkpoints_data$design.upper.bound,Inf,Inf,Inf)
polygon_checkpoints <- c(checkpoints_data$checkpoints,checkpoints[3],checkpoints[2],checkpoints[1])
polygon_data <- data.frame(checkpoints=polygon_checkpoints, lower=polygon_lower, upper=polygon_upper)
# Plot of Group Sequential Testing
ggplot()+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=lower),fill="red",alpha=0.3)+
geom_polygon(data=polygon_data,mapping=aes(x=checkpoints, y=upper), fill="lightgreen", alpha=0.5)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound),colour="black",size=4,shape=4)+
geom_point(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound),colour="black",size=4,shape=4)+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound))+
geom_line(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound))+
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.lower.bound, label=round(design.lower.bound,2)),nudge_y=-0.2,show.legend = FALSE) +
geom_text(checkpoints_data,mapping= aes(x=checkpoints, y=design.upper.bound, label=round(design.upper.bound,2)),nudge_y=0.2,show.legend = FALSE) +
labs(x= "samples", y = 'zscore' )+
labs(title=paste("Group Sequential Testing | ","","checkpoint1:",checkpoints[1]," ","checkpoint2:",checkpoints[2]," ","checkpoint3:",checkpoints[3]," ","collected samples:",samples_total))+
theme(plot.title = element_text(size = 11))
}

# Plotting the overlay of the probability distribution of Z assuming H0 is true and the same assuming H0 is false (Ha true)
normal_x_axis <- seq(-7, 7, 0.1) # used for generating the x axis of the normal distribution

# Plot confidence intervals for delta
if(min(delta_interval) < 0 & max(delta_interval)>0){
ggplot()+
geom_rect(data=NULL,aes(xmin=0,xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=0,ymin=-Inf,ymax=Inf),fill="red",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
geom_vline(xintercept=0, linetype="dashed", color = "black")+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}
if(min(delta_interval) > 0 & max(delta_interval)>0){
ggplot()+
geom_rect(data=NULL,aes(xmin=min(delta_interval),xmax=Inf,ymin=-Inf,ymax=Inf),fill="lightgreen",alpha=0.2)+
geom_line(data=confidence_plot,aes(y=c(1),x=delta_interval_plot))+
theme(axis.title.y = element_blank(),axis.text.y=element_blank(),axis.ticks.y = element_blank())+
labs(x = "Δ Confidence Intervals")+
scale_x_continuous(labels = percent)+
geom_point(data=confidence_plot,aes(y=c(1),x=delta_interval_plot),color="black",size = 2)+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval, label = label, vjust = -3,fontface="bold"))+
geom_text(data = confidence_plot, aes(y=c(1),x=delta_interval_plot, label = name, vjust = 3))
}

# The shaded dark blue area denotes the significance region, while the shaded light blue area denotes the power (note that it includes the shaded dark blue area).
suppressWarnings(print(
if(delta > 0){
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore, 1) )
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x > z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true", "z score"), values = c("black", "blue","red") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(zscore,2)," ","z_critical=",round(z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.title = element_text(color = "black")) +
geom_point(y=0,x=zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
} else{
normal_dist_plot <- data.frame( x = normal_x_axis, y1 = dnorm(normal_x_axis, 0, 1), y2 = dnorm(normal_x_axis, zscore*-1, 1) ) # first one has mean 0 because crv-crc is null under H0 true
ggplot( normal_dist_plot, aes(x = normal_x_axis) ) +
geom_line( aes( y = y1, colour = 'H0 is true' ), size = 1.2 ) +
geom_line( aes( y = y2, colour = 'H1 is true' ), size = 1.2 ) +
geom_area( aes( y = y1, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'black' ) +
geom_area( aes( y = y2, x = ifelse(x < -z_critical, normal_x_axis, NA) ), fill = 'blue', alpha = 0.3 ) +
theme( legend.title = element_blank() ) +
scale_colour_manual( breaks = c("H0 is true", "H1 is true"), values = c("black", "blue") ) +
labs( x = '', y = '' ) +
labs(title=paste("power=",power," ","z_score=",round(-zscore,2)," ","z_critical=",round(-z_critical,2)," ","significant:",statistically_significant)) +
theme(plot.title = element_text(size = 11)) +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.title = element_text(color = "black")) +
geom_point(y=0,x=-zscore,color="#f94f21",size = 2,shape=19,aes(fill = "z score"))
}))

# Plot of statistics summary
ggplot()+ theme(panel.background = element_blank())
table_colors <- data.frame(color = c("grey90", "#b9f7fc", "#b9f7fc", "#b9f7fc",
"grey90", "grey90", "grey90", "#b9f7fc","grey90","grey90","grey90", "grey90","grey90", "grey90", "grey90", "grey90"),stringsAsFactors = FALSE)
theme <- ttheme_default(core=list(bg_params = list(fill = table_colors$color, col=NA)))
grid.table(statistics_summary,theme = theme)
Sources