A/B testing is by far the most common technique for marketing optimization. It’s easy to see why. A/B testing is a relatively straightforward approach that can be applied to anything from an email newsletter to an outbound landing page to a Facebook ad.
The idea is to isolate a single, specific element – something like a headline, a subject line, or a header image – and serve multiple variations to determine which is optimal. That way, when you send out the real email or put the landing page up on your site, the majority of your audience interacts with the best possible version.
In email marketing, where you have a set audience to work with, an A/B test usually consists of breaking off a certain percentage of your audience, dividing it into even segments, and serving a different variation to each segment. This can bring up some unique challenges.
Naturally, you want to send the best version of your email to as many contacts as possible. On the other hand, if you don’t send each test segment to enough contacts, you won’t be able to count on your results being statistically significant.
What is Statistical Significance?
A statistically significant result is a change beyond what can be attributed to chance. Conversely, a result that isn’t statistically significant is one that could simply be caused by chance. Statistical significance isn’t so much a pass or fail test as it is a degree of certainty.
We all know a coin has a 50% chance of landing on heads. We also know you could flip a coin twice and get two heads. You could keep flipping it and get three more heads, although this is less likely. You could then theoretically continue to flip heads indefinitely; however, the chance of this happening gets smaller and smaller.
If you got 100 heads in a row, you would very likely conclude there is something wrong with the coin. You’d probably be certain of this result. Likely, you would be right… but, technically, there is a 7.8886091 × 10^-29 % (or roughly eight out of ten nonillion) chance that a totally fair coin would produce that result.
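If you want to check the coin-flip arithmetic yourself, here is a minimal Python sketch. The function name prob_all_heads is just an illustrative choice; the only assumption is a fair coin with independent flips.

```python
from fractions import Fraction


def prob_all_heads(flips: int) -> Fraction:
    """Probability that a fair coin lands heads on every one of `flips` tosses."""
    # Each flip is independent with probability 1/2, so the chances multiply.
    return Fraction(1, 2) ** flips


if __name__ == "__main__":
    for n in (2, 5, 10, 100):
        p = prob_all_heads(n)
        # Convert to a percentage only for display; the Fraction itself is exact.
        print(f"{n:>3} heads in a row: {float(p) * 100:.7e} %")
```

The streak never becomes impossible, it just becomes so unlikely that "the coin is rigged" is the far better explanation.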
That’s essentially the basis of statistical significance. The stronger the result, the more certain we can be that our testing factor was the reason for the change. To declare something “statistically significant” we need to choose a degree of certainty that we’re comfortable with.
Why Does Statistical Significance Matter?
To put it simply, statistical significance lets you know if you can trust your results.
When you’re sacrificing a portion of your hard-earned email audience to testing, you want to make sure these tests generate useable results. There’s nothing worse than losing a chunk of your audience for a test that doesn’t yield any useful insight.
A/B tests literally shape your content. In a business where a couple of percentage points can be the difference between success and failure, you want to be sure these changes are taking you in the right direction.
Testing for Statistical Significance
It’s possible to calculate whether a set of results is statistically significant by comparing the expected value to the observed value. This will produce a Chi-Squared value, which can then be matched to a Chi-Squared distribution table to determine if the result is statistically significant. I won’t get into too much detail in this post, but you can find more about this method here.
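For the curious, here is roughly what that calculation looks like for two segments, sketched in plain Python. This is a standard Pearson Chi-Squared test on a 2×2 table (converted vs. not converted, per segment) without a continuity correction. The function name and the sample numbers are made up for illustration, and the p-value shortcut only applies to this two-segment case (one degree of freedom).

```python
import math


def chi_squared_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pearson chi-squared test comparing conversions in two segments.

    Returns (chi2, p_value). Assumes both segments have at least one
    conversion and one non-conversion overall, so no expected count is zero.
    """
    observed = [
        [conv_a, n_a - conv_a],  # segment A: converted, not converted
        [conv_b, n_b - conv_b],  # segment B: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the segment made no difference at all.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical results: 234 opens of 1,300 vs. 281 opens of 1,300.
    chi2, p = chi_squared_2x2(234, 1300, 281, 1300)
    print(f"chi-squared = {chi2:.2f}, p-value = {p:.4f}")
    print("significant at 95%" if p < 0.05 else "not significant at 95%")
```

A p-value below 0.05 corresponds to the 95% certainty level discussed later in this post.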
The Chi-Squared Test is ideal for determining the validity of a scientific study, but it has two issues when used for our purposes. First, it’s manual. Any hand (or more likely Excel) calculations are time consuming and open up room for errors. Second, it’s retroactive. This calculation is designed to calculate statistical significance after collecting results, which doesn’t help you if you send to 10% of your audience only to find that wasn’t enough to produce a statistically significant result.
Luckily, Optimizely offers this handy A/B Test Sample Size Calculator. You punch in a few factors and the tool spits out a minimum number of contacts for each test segment. We’ll talk a little bit about each input so you can get the most out of the tool.
Baseline Conversion Rate
The baseline conversion rate is the percentage of your audience that you’re expecting to convert. Use recent emails to similar audiences to help inform you about an appropriate conversion rate. When in doubt, shoot low. The larger this rate, the fewer emails you need to send per split.
Minimum Detectable Effect
This number is the smallest relative change in conversion rate that you want the test to be able to detect. For example, if you plugged 20% here and your baseline conversion was 10%, one of your variations would need at least 12% (10% plus 20% of 10%) to secure a statistically significant result. The smaller the minimum detectable effect, the more emails you’ll need to send.
Statistical Significance
The statistical significance slider allows you to select a value from 80% to 99%. It essentially allows you to select your degree of certainty in the results. In general, 95% is a good threshold for most applications. If you are particularly tight on contacts, you might consider ticking it down to 90%. Likewise, if you have contacts to spare, it makes sense to dial this number up.
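To get a feel for how these three inputs pull against each other, here is a rough Python sketch using the classical two-proportion sample size formula. To be clear, this is not how the Optimizely calculator works internally (that isn’t covered here), so expect its numbers to differ from the tool’s. The function name and the 80% statistical power default are assumptions added for this illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_segment(baseline: float, relative_mde: float,
                            significance: float = 0.95,
                            power: float = 0.80) -> int:
    """Approximate contacts needed per segment (classical two-sided z-test).

    baseline      expected conversion rate, e.g. 0.18
    relative_mde  minimum detectable effect relative to baseline, e.g. 0.20
    significance  desired certainty, e.g. 0.95
    power         chance of detecting a real effect (assumed, not a tool input)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate the variation must reach
    alpha = 1 - significance
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    print("baseline 18%, MDE 20%, 95%:", sample_size_per_segment(0.18, 0.20))
    print("higher baseline (30%):     ", sample_size_per_segment(0.30, 0.20))
    print("larger MDE (30%):          ", sample_size_per_segment(0.18, 0.30))
    print("lower significance (90%):  ", sample_size_per_segment(0.18, 0.20, 0.90))
```

Whatever the exact method, the direction of each lever is the same: a higher baseline, a larger minimum detectable effect, or a lower significance level all reduce the number of contacts you need per segment.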
The A/B Test Sample Size Calculator in Practice
Let’s say you wanted to A/B test a subject line for an upcoming email campaign. You have an audience of 12,000 contacts and three variations (two tests and a control). Based on your past several broadcasts to this audience, you expect an open rate of 18%.
Moving to the A/B Test Sample Size Calculator, you plug in your expected open rate. For minimum detectable effect, you plug in 20% (which is a good standard value). That means you’ll be able to confirm a result that achieves at least a 21.6% open rate. Finally, you select 95% as a fairly standard statistical significance.
Your setup will look something like this.
As you can see, the tool recommends 1,300 contacts per segment. It’s important to note this is the minimum number of contacts delivered, not the number of contacts sent to. To find the correct number of contacts to send to, you need to factor in your anticipated deliverability. Let’s say you expect 94% deliverability. You would divide the 1,300 contacts by the 94% deliverability and end with a result of 1,383 contacts.
Each of your three segments will need to be sent to a minimum of 1,383 contacts. Together, the three segments will need to go to 4,149 contacts (35% of your total list). If you’re setting this test up in your marketing automation system, you’ll need to enter in 35% for your test group.
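The arithmetic in this example is easy to script if you run tests often. Here is a small Python sketch; the function name plan_test_group is hypothetical, and rounding up to whole contacts is a choice made here so you never fall short of the minimum.

```python
import math


def plan_test_group(per_segment_delivered: int, deliverability: float,
                    variations: int, list_size: int):
    """Return (sends per segment, total test sends, share of the list)."""
    # Inflate the delivered minimum to cover expected bounces.
    sends_per_segment = math.ceil(per_segment_delivered / deliverability)
    total_sends = sends_per_segment * variations
    share_of_list = total_sends / list_size
    return sends_per_segment, total_sends, share_of_list


if __name__ == "__main__":
    per_segment, total, share = plan_test_group(
        per_segment_delivered=1300,  # from the sample size calculator
        deliverability=0.94,
        variations=3,                # two tests and a control
        list_size=12000,
    )
    # 1383 per segment, 4149 total, 35% of list
    print(f"{per_segment} per segment, {total} total, {share:.0%} of list")
```

Changing variations to 2 shows the effect of dropping one variation without touching the other inputs.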
At this point, you might be hesitant to use more than a third of your audience for testing. Don’t worry. Now is the time for adjustment.
If the number is only slightly too high for your liking, adjust your statistical significance number. Moving the slider down to 90% can save you quite a few sends, without impacting the reliability of your test too much.
If you need a more dramatic adjustment, you have two choices. The first option is to increase the minimum detectable effect. This makes your test less precise, but an adjustment of 5-10% can dramatically drop the necessary number of sends. This is a good option if you’re confident your changes will have a significant impact. In our example, this would knock the test group down to 1,660 contacts (about 14% of the list).
The alternative option is to drop one of the variations. This will maintain the integrity of the test, though it limits your testing options. Here, you’re more likely to get a useable result, but you also run the risk of throwing out the best option. In the above case, this change would reduce the test group down to 2,766 contacts (about 23% of the list).
A/B testing is a pathway to maximized marketing outcomes. However, you can only count on your tests if you know the results are meaningful. That’s where statistical significance comes in. Use the simple and streamlined A/B Test Sample Size Calculator to build this reliability into your A/B tests.
Let us know what you think:
- Do you use A/B testing to optimize your marketing efforts?
- How do you ensure your A/B tests are reliable?
- Do you know of any other resources for better A/B testing?