If you’re going to run a race, you should see it through all the way to the finish line.
Many marketers fail to apply that principle when it comes to split testing. They think their test is complete before they’ve let it run the course.
They pull out too soon. They end too early. They quit before it’s done.
What happens when tests are finished prematurely?
The numbers reported are nothing but statistical salad.
With no dressing.
Simply put: results from incomplete tests are as unreliable as next year’s weather forecast. They’re as dubious as the email you just received from a Nigerian prince.
Let your tests run to completion, and you’ll be rewarded with accurate, actionable statistics.
That’s the point I want to make, in one swell article introduction.
You can stop reading now, and keep watching cat videos. As long as you get that one message: don’t end your split tests too soon.
But to get even more value from this article, let me share some more information with you.
1. Bad News: Your gut is wrong.
Sometimes digital marketers come down with a bad case of C.B.
They run tests to validate their own preconceived notions about the way things should work. Once they get the result they’re looking for, they see no reason to continue the test.
Their bias is confirmed!
Yay. How cool is that.
That’s no way to validate a hypothesis.
Confirmation bias is a real thing, and it’s screwing you over.
Sure, there’s an argument to be made about why you should listen to your gut.
But there’s also an argument, a better one, to be made about why you should admit that your gut was wrong if the numbers say so.
The whole reason you’re split testing in the first place is to avoid gut-based and erroneous decisions. You want real data and hard evidence, not some mysterious message from the meatball sub that you had for lunch.
Numbers don’t have guts. They’re not subjective.
Numbers report cold, hard, objective reality.
They don’t even eat meatball subs.
Sometimes the reality of the data will be at odds with what your gut told you. And that’s a good thing.
But you won’t know that if you don’t let your tests run to completion. Instead, you’ll likely come down with a good, old-fashioned case of C.B. once the numbers tell you what you wanted to hear.
2. How long is long enough?
Once you’re convinced you should let your tests run long enough to give you accurate stats, you’re probably wondering: how long is long enough?
That’s a killer question, and if there were an exact standard number, then I wouldn’t have had to write this article.
How long is long enough, you ask?
Answer: It depends.
Yes, that answer sucks. It also has the virtue of being right.
For starters, when you think about how long a test should run, you’re probably expecting the answer to be delivered in terms of time.
Yes and no.
Sure, you’ll need to make sure that your test runs an adequate number of days, but that length of time will be determined by another metric: the sample size.
Before you can determine how long you need to run your test, you’ll first need to determine how many visitors will give you the right sample size.
Otherwise, you’re likely to get statistical noise in your results.
But here we go with another question. Now you want to know how big of a sample size you need.
Again, it depends.
You’re getting tired of that answer, aren’t you?
It’s still correct.
The right sample size for your website depends on three things:
- Your existing conversion rate
- The change you’d like to see in your conversion rate
- The level of confidence you want to have that your test will be accurate

Let’s look at an example to help clarify things, shall we?
Say you have a 3 percent conversion rate now and you’re shooting for a 4.5 percent conversion rate. That means you’re looking to boost your conversion rate by 50 percent.
That 50 percent number is part of what determines your sample.
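If you want to see the mechanics behind that, here’s a minimal Python sketch of a standard sample-size formula for comparing two conversion rates. Heads up: the two-sided 5 percent significance level and the 80 percent statistical power are assumptions I’m adding for illustration; your testing tool may use different defaults.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, target, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (baseline + target) / 2                # average of the two rates
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(baseline * (1 - baseline)
                                 + target * (1 - target))) ** 2
    return ceil(numerator / (target - baseline) ** 2)

# The example above: 3% baseline, shooting for 4.5% (a 50% relative lift)
print(sample_size_per_variant(0.03, 0.045))  # roughly 2,500 visitors per variant
```

Notice how the math punishes small changes: halve the lift you’re hunting for, and the required sample roughly quadruples.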
Next, you need to learn about p.
What’s p? It’s the probability that you’d see results like yours by pure chance. Your confidence level is simply 1 minus p.
If you say, “I’m 95 percent sure that these results are accurate,” then p is .05 (1 – .05 = .95, or 95 percent).
As you can see, the lower the p-value, the more confident you are about the test results.
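For the curious, here’s a sketch of where that p-value comes from: a pooled two-proportion z-test, which is one common way tools compare two conversion rates. The visitor and conversion counts below are made up for illustration, and real tools may use different statistics or corrections.

```python
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion
    rates, using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 75 of 2,500 visitors convert (3%) vs. 113 of 2,500 (4.5%)
p = p_value(75, 2500, 113, 2500)
print(p, "-> confidence level:", 1 - p)  # p below .05 means 95%+ confidence
```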
3. Get the right sample size and time.
You might be thinking: “I’ve followed your advice and got my sample size with a high confidence level. Now all I need to do is run a test long enough to cover that sample size, amirite?”
Peep Laja of ConversionXL, who really is one of the top guys in this field, says when he first started split testing his most common mistake was ending a test too soon.
And this is important: he says he ended tests too soon even when he had a 95 percent confidence level.
At this point, you might be thinking: “Well, it’s not much of a confidence level if you can end a test at that point and still have inaccurate results.”
Correct. That’s because numbers are dumb.
I said earlier that numbers are objective and don’t lie. That’s true also.
But the statistical calculations we’ve been looking at don’t take into account variations in business cycles, days of the week, peak traffic times, seasons when conversions are more likely, etc.
Bottom line: your confidence level isn’t your bottom line.
Confidence level alone can’t validate a test’s, umm, validity.
That’s why you need a test that runs long enough to cover variations in your sales cycle.
This is a good time for me to put in a plug for the “always be testing” mantra.
Yes, I think you should always be testing. Yes, I think you should favor longer test durations over shorter ones. But even though relentless, constant testing is a virtue, you shouldn’t let this maxim make you rush the process.
In other words, yes, conduct split test after split test. But don’t shortchange all your efforts by pulling a test too early, or not giving yourself time to analyze test results, or rushing through the hypothesis phase. Every part of the test is important.
So, to circle back to the subject at hand, if you’re going to make a mistake, err on the side of having an unusually large sample size rather than a small one.
Keep in mind, though, if your test runs so long that it includes external forces that could affect the outcome (holidays, seasonal factors, weather, etc.), you run the risk of sample pollution and that could skew your tests as well.
Laja also offers this sage advice: your test should run the length of at least one, and preferably two, business cycles.
Or, as he puts it: “the sample would include all weekdays, weekends, various sources of traffic, your blog publishing schedules, newsletters, phases of the moon, weather and everything else that might influence the outcome.”
So, to continue with the numbers from above, if you get 3,600 visits in one business cycle (or, better yet, two business cycles), then the numbers from the tools might work just fine.
On the other hand, if you get 3,600 visits over the period of just a couple of days, then you should lengthen the time of your test to include a couple of business cycles.
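Here’s a back-of-the-napkin sketch of that duration check. The 3,600 visits figure is the one from above; the 300 visitors a day and the seven-day business cycle are placeholder assumptions, so swap in your own traffic and your real cycle length.

```python
from math import ceil

def test_duration_days(required_sample, daily_visitors, cycle_days=7, min_cycles=2):
    """Days to run a test: long enough to hit the sample size,
    rounded up to whole business cycles (at least min_cycles of them)."""
    days_for_sample = ceil(required_sample / daily_visitors)
    cycles = max(min_cycles, ceil(days_for_sample / cycle_days))
    return cycles * cycle_days

# Hypothetical: you need 3,600 visits and average 300 visitors a day.
print(test_duration_days(3600, 300))  # 14 days -- two full weekly cycles
```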
You can learn from Laja’s mistake or you can repeat it. The choice is yours.
4. Make sure the significance curve flattens out.
Even after you’ve followed all the other rules, you still might have to apply one more rule before you finish your testing.
If you decide to become a gung-ho statistician and use a more sophisticated tool than the two mentioned above, you’ll likely see that your conversion rate stats are delivered with a margin of error.
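As a rough illustration of why that curve flattens, here’s how the margin of error around a conversion rate shrinks as traffic accrues. This uses the simple normal-approximation interval, which is an assumption on my part; your analytics tool may compute its intervals differently.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(conversions, visitors, confidence=0.95):
    """Half-width of the normal-approximation confidence interval
    around an observed conversion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    rate = conversions / visitors
    return z * sqrt(rate * (1 - rate) / visitors)

# Watch the error band tighten (and the curve flatten) as visitors pile up.
for n in (100, 500, 2500, 10000):
    conversions = round(n * 0.03)  # assume a steady 3% conversion rate
    print(n, "visitors -> +/-", round(margin_of_error(conversions, n), 4))
```

Early in a test that band is wide enough to drive a truck through, which is exactly why a “winning” variation on day two can turn into a loser by day fourteen.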