How Do You Declare the Winner of Your A/B Test?
A declared winner has to be statistically valid. In other words, the difference has to be significant enough to justify setting a new course in whatever you do.
To understand the statistical significance of your A/B test you have to remember 3 specific parameters:
Sample Size
Test Size
Confidence
Make sure you’re testing something that can actually have an impact.
A smart, well-thought-out test is important: you want to learn something, even if you fail.
Below is a lightly edited transcript of Episode 31 of the Inevitable Success Podcast.
Transcript:
Damian: Today’s episode is about different ways that you can test to improve your marketing program. So for example, you know, we’re big proponents of the champion/challenger methodology, basically always having an incumbent winning approach to all of your marketing that you’re constantly challenging, and we always prefer to do this in a testable format. Now that said, sometimes the metrics that come back are not so clear, and sometimes you look at the wrong metrics. So today we want to go a little bit deeper into how you would determine whether you have a winner or not.
Stephen: Of course, and I’ve been saying this for a long time, that test results are not baseball scores. In a baseball match, with the World Series going on right now, well if you won by one run that’s fine. It’s a one-run game, maybe it was a pitcher’s duel. But in testing it’s not like that, it has to be statistically valid. In other words, there has to be a significant enough difference that you really set a new course in whatever you do.
Damian: So the takeaway is, just because you have a test that would have won a baseball game doesn’t mean that you actually have a winning idea.
Stephen: I call it conclusive evidence that you have a winner.
Damian: So I’ve totally experienced this myself, you know, I’ve run probably hundreds of tests across all types of media at this point. Actually, the most common result that I have found, especially if the test is not aggressive enough, is inconclusive. It’s a very common result. You know, me personally, when I’m working on optimizing things, I actually love to go after bold, aggressive changes, and here’s why. When you’re testing, by the very fact that you’re testing, you’re already managing the risk of rolling out a bad idea.
Stephen: OK. Hopefully, it doesn’t stink too much.
Damian: Well yeah but you’re going to not necessarily roll out to everybody, you can manage that too. But, you know, I love the idea of actually avoiding having inconclusive tests. I either want something to work phenomenally or prove that I should never do that again quickly. And I think when you look for things that can have big changes, the odds of learning nothing and just spinning your wheels go down from there.
Stephen: I think what you’re describing is what we call the scientific approach.
Damian: Yes.
Stephen: What is the scientific approach? We all know it, we don’t practice it, but we all know it. We took some science classes in school, it’s a social science. In the beginning, there is the hypothesis – if we do this, this will happen, or if you give this drug to somebody they’ll be better, or not one to this person drop the trial, it’s the same method, right? The biggest challenge in any analytics is to come up with a hypothesis. In other words, whatever you test here, that idea should come from human beings.
Damian: Yeah, and to that end, if you design it really well, like you have a null hypothesis too, so even if you fail, you learn something as well. So you know, I think that it’s really important you know, to make sure you’re testing something that can actually have an impact.
Stephen: But having said that, to stay with baseball for a moment, sometimes declaring a winner should be like a baseball match. If you’re testing some creative and the difference is slight, maybe a difference in opinion, the cost of being wrong is not that great. Therefore, if you just have to pick one, then fine, don’t worry too much about statistical validity. Just declare a winner and go bat away. But if you’re testing a different audience, if you will, or the result of not mailing anybody at all to see if your mailing is doing any good, the winner should be declared carefully, because you will change the way you acquire your prospect lists and the way you talk to your customers for the foreseeable future. You really want to make sure that what you learn here is something that is sustainable.
Damian: Yeah, and it should be able to be summed up in a quick conversation of what you learned, what you tested –
Stephen: When you say one thing is better than the other, it becomes quite a bit of a history-making endeavor.
Damian: Yes, and you made me think about how to clarify this a little bit more – the test that I feel like we need to avoid at all costs as marketers is the imperceptible test. Using creative shown to the same target as an example, it’s when you show two creatives side by side and the average person actually doesn’t know what’s different between the two of them.
Stephen: Yeah the green button, red button test.
Damian: Yeah. What if it’s like, you know, blue button and slightly less blue button right? And I see these tests happen a lot.
Stephen: We have a joke here, we have a lot of developers’ versions of streams. You’ve been in both high-definition streams, the colors are not always the same.
Damian: Exactly, but here’s another example though. You know if it’s a certain slight color blue, eight percent of the population, the male population can’t even see it because they’re colorblind. Do you know? We actually had a story about this. So now they’re treated like that’s the result.
Stephen: Yeah, so there’s the so what question. In fact, let’s talk about the whole scientific approach. You set up a hypothesis, set up the test rules, execute the test, declare a winner. There’s the last step which is, so what? You always have to end every test with a so what question. So what are we going to do about it? So is this something that you’re going to do forever? Is it that significant? So yeah, I’m using the word significance again.
Damian: Yes let’s dig into that one a little bit.
Stephen: I think we should dig into what statistical significance is for the people who are not stat majors. Simply, for non-stat majors, you just have to remember certain parameters so that you don’t jump to conclusions too hastily. One is, what is the sample size? Here’s an easy example: okay, they do the A/B testing, and between A and B, the difference is three clicks. Well, I don’t even have to test it. Three clicks out of how many? About a few thousand. You know what, that’s not a difference.
Damian: You know what, there’s some math into it.
Stephen: Oh there’s some total math into this, but we’re starting out easy.
Damian: There are a couple of like good ways of thinking about this that I’ve approached over the past few years. So sample size, there are some general rules, of course, larger is typically always better. Right? And the other thing too is, if, and I’ll give you a really clear example of this, you can have a small sample size and still get the statistical significance.
Stephen: The difference is bigger.
Damian: Exactly, and that’s what people miss.
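To make the point concrete, here is a minimal sketch of a two-proportion z-test, one common way to check whether two conversion rates differ by more than chance. The speakers do not name a specific test, so the choice of test, the function name, and all of the counts below are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Large sample, tiny gap: three clicks apart out of 3,000 visitors per arm.
small_gap = two_proportion_p_value(150, 3000, 153, 3000)

# Small sample, large gap: 10 vs. 30 conversions out of 100 visitors per arm.
big_gap = two_proportion_p_value(10, 100, 30, 100)

print(f"three clicks apart, 3,000 per arm: p = {small_gap:.3f}")
print(f"20 points apart, 100 per arm:      p = {big_gap:.5f}")
```

With these made-up numbers, the three-click gap gives a p-value far above the usual 0.05 cutoff, while the much smaller test with the large gap falls well below it.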
Stephen: That’s exactly what I’m talking about. So you’ve got to have all three in your mind. I’ll give you three problems. One is the general sample size. Now people get scared of the large sample for valid reasons. Let’s say you have some holdouts, some mailing or emailing holdouts, that you’re not going to touch. Well if you don’t touch them they’re not going to respond. That’s the belief, right? That’s why we do these things. Well, if I have a big holdout sample, I’m going to lose my money-making opportunity. That is not a wrong way to see it but you’ve still got to test. So what is a good test size? Again the size matters. Now, I talk about it as a response size, not as a test size. Why? Because now you have to think about what is the typical difference that you’re trying to measure. Are you trying to measure a difference of 0.1%? Or is plus or minus 1% good enough for you?
Damian: So define response in this context.
Stephen: In other words in this context this – and by the way, if you are testing alternate click-through rates, they are normally in double-digit percentages, it’s easy. But in a mailing situation, the ultimate response rate is the number of actual conversions divided by the number of touches. That number generally is very small but that’s the ultimate number, isn’t it? Like who cares if you have all these opens if nobody bought it. The ultimate barometer of success is how much actual conversion you saw, and how much money it brought in. So you even have measurements like revenue generated per thousand touches and stuff like that. That’s why we have that ultimate metric, because money talks. Now, what is the typical difference between, say you have a cell that you know you would touch and you have a mail cell here and one gets 1.2% response and the other gets like 0.18 difference – is that a real difference? You have to think about the size of the difference that you’re trying to measure: the smaller the difference you want to see, the bigger the sample size you need. That’s another thing.
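The two measurements mentioned here, response rate and revenue per thousand touches, can be sketched as below. The function names and the campaign figures are hypothetical, chosen only so that the response rate comes out at the 1.2% used in the conversation.

```python
def response_rate(conversions, touches):
    """Actual conversions divided by the number of touches."""
    return conversions / touches


def revenue_per_thousand_touches(revenue, touches):
    """Revenue generated for every 1,000 people touched."""
    return revenue / touches * 1000


# Hypothetical mailing: 50,000 pieces sent, 600 orders, $45,000 in revenue.
touches = 50_000
rate = response_rate(600, touches)
rpm = revenue_per_thousand_touches(45_000, touches)

print(f"response rate: {rate:.1%}")
print(f"revenue per thousand touches: ${rpm:,.2f}")
```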
Damian: Yep.
Stephen: There’s a third element. How confident do you want to be?
Damian: Confidence.
Stephen: Do you want to be 98% confident all the time, or 95% confident, or is even 80% good enough for you?
Damian: Let’s go a little deeper on that. What is the difference, like practically, between how long you have to wait for 95% confidence versus like 98% confidence?
Stephen: The confidence level is most directly related to the sample size at the time that you read the results. Now, it’s slightly related to time, because if you read longer, of course, you have a larger sample. What does not change is the test universe you created; all of that happened in the beginning. Just because you waited longer doesn’t mean that the test universe gets bigger. So this question should be answered before then. So you have to have some idea of the time frame you are going to measure by, and you have to have some idea of what kind of a difference you are going to measure. So you have to know the typical response –
Damian: Right, a range of outcomes.
Stephen: Exactly, you’ve got to have some idea that, oh yeah, I want to measure within just a 5% difference in open rate. That’s fine. So these things determine the size of the sample, and of course the confidence level figures into it.
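The relationship between the three parameters can be sketched with a standard textbook sample-size formula for comparing two proportions. This is an illustration under stated assumptions, not the speakers' own method: the formula choice, the 80% statistical power default, the 1.2% baseline, and the function name are all assumptions added here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, lift, confidence=0.95, power=0.80):
    """Approximate visitors needed in each group to detect `lift`.

    baseline:   expected response rate of the control (e.g. 0.012 for 1.2%)
    lift:       absolute difference you want to be able to detect
    confidence: two-sided confidence level (0.80, 0.95, 0.98, ...)
    power:      chance of detecting the lift if it is real (assumed 80%)
    """
    p1 = baseline
    p2 = baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


# Same baseline and lift, three confidence levels.
n_80 = sample_size_per_group(0.012, 0.01, confidence=0.80)
n_95 = sample_size_per_group(0.012, 0.01, confidence=0.95)
n_98 = sample_size_per_group(0.012, 0.01, confidence=0.98)

# Same confidence, but a much smaller difference to detect.
n_small_lift = sample_size_per_group(0.012, 0.001)

for label, n in [("80%", n_80), ("95%", n_95), ("98%", n_98)]:
    print(f"detect +1.0 point at {label} confidence: {n:,} per group")
print(f"detect +0.1 point at 95% confidence: {n_small_lift:,} per group")
```

Under these assumptions, raising the confidence level increases the required sample moderately, while shrinking the difference you want to detect increases it dramatically, because the lift is squared in the denominator.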
Damian: Right. And you know what, one time I actually remember having this conversation. There was pushback that I got, either from a client or from somebody who was new here, about sample size always being the truth, you know, more sample is better. And I said, “Just think about it this way. The variance in the range of outcomes has a massive impact on how many people you need.” I said, “Go through this thought experiment. Let’s say you’re A/B testing two landing pages. The test goes to a fully functioning landing page that you can check out on. The control goes to a 404 dead page. You’re going to know very quickly; you don’t need a large sample size to figure out that one is better than the other.” And that’s such a good logical test to be like, “Oh, I understand the math of this.” And that’s powerful when you really understand how this stuff works, because then you can start to wrap your head around what you can believe and what you don’t have to.
I mean even in medical testing, there are conditions where they’ll test that one drug is so much more powerful than the other or dangerous, that they end the test early because it’s such, you know, if people start dying then it’s very easy to tell that there’s a problem early. And that’s another thing, ending a test early when you hit a large variance in outcome.
Stephen: Tell you what it ruins baseball. Like you know what, this pitcher stinks, let’s not even continue and further agonize the team. But what you said kind of reminded me of what, a lot of marketers are too greedy about the things that they test. Please don’t do that, because I’ve seen so many tests where they’re testing everything. This source, creatives, segments, and then they go, “Well we’ll just look at all the responders and test group and divide them into all these different cells composed of like three dimensions like this segment, creative.” That’s a lot already right? But that means some cells are big enough by accident, but some cells can be so small we cannot read any result for all those dimensions. Now, when that happens I say go back to economics class again. What is the economic theory proving? We always say things like, “With all things – “
Damian: Being equal.
Stephen: Being equal, what is the outcome?
Damian: I’ve got that one. You’re testing me.
Stephen: Now say it again in Latin. I’m just kidding. But the point is, if you do that, then you know what for this report I’m going to only see from a segment point of view, so which segment? Now you may have enough sample responders in it to see the result. And then you, okay so all other things being equal in terms of creative, which one? You could do it that way too.
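A quick sketch of why multi-dimensional cells get too small to read, and how reading one dimension at a time restores the sample. The segment, creative, and source names and the responder total are hypothetical, and the responders are spread evenly only to keep the arithmetic obvious.

```python
from collections import Counter
from itertools import product

# Hypothetical test dimensions.
segments = ["new", "active", "lapsed"]
creatives = ["A", "B"]
sources = ["email", "mail", "search", "social"]

total_responders = 1200

# Every combination of the three dimensions is one cell.
cells = list(product(segments, creatives, sources))
per_cell = total_responders // len(cells)
by_cell = {cell: per_cell for cell in cells}

# Read by segment only, all other things being equal.
by_segment = Counter()
for (segment, creative, source), responders in by_cell.items():
    by_segment[segment] += responders

print(f"{len(cells)} cells with about {per_cell} responders each")
for segment, responders in by_segment.items():
    print(f"segment {segment}: {responders} responders")
```

In practice the spread is never this even, so some cells end up much smaller than the average, which is the situation described above.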
Damian: Yeah. I think you might be saying something else, but you made me think of another idea. This is what happens, you just start thinking of past experience, and I’m going to share it. So, I remember doing a test early in my career, and I think it was for a landing page of some sort, and it just so happened, because random doesn’t mean even when an A/B test is routing traffic, okay? One of the test pages, in hindsight, had gotten so much more brand traffic than non-brand traffic. And for anybody that knows search, brand traffic tends to convert much higher than non-brand traffic. Right? Sometimes like 10 to 1. And that slight skew in one part of the experiment, which came randomly through traffic randomization, completely changed the results when I isolated it after the fact. So that taught me that being able to get a fair target is really important in constructing a test.
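The effect described here can be reproduced with made-up numbers. In this sketch both pages convert identically within each traffic type (10% for brand, 1% for non-brand, echoing the 10-to-1 ratio), and the only difference is the traffic mix, which is deliberately exaggerated so the effect is easy to see. All names and counts are hypothetical.

```python
# (visits, conversions) per traffic type for each page.
traffic = {
    "page_a": {"brand": (600, 60), "non_brand": (400, 4)},
    "page_b": {"brand": (400, 40), "non_brand": (600, 6)},
}


def overall_rate(page):
    """Conversion rate ignoring where the traffic came from."""
    visits = sum(v for v, _ in page.values())
    conversions = sum(c for _, c in page.values())
    return conversions / visits


def rate_by_type(page):
    """Conversion rate within each traffic type."""
    return {kind: c / v for kind, (v, c) in page.items()}


for name, page in traffic.items():
    print(name, f"overall {overall_rate(page):.1%}", rate_by_type(page))
```

Read overall, page A looks like a clear winner. Read within brand and within non-brand traffic, the two pages are identical, so the apparent win comes entirely from the mix of traffic each page happened to receive.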
Stephen: Oh my god yes. That’s like saying that, I even wrote an article about this, why were all these people dead wrong about predicting Trump winning the election. Do you know what it was? It was a sampling error. They under-sampled a survey of an area. You cannot predict the outcome of an election without fair representation. Think about it, if you just survey a whole bunch of city folks, guess what they are going to say? I mean there’s a regional bias in all of us right? So it was a sampling thing. Also when the sample size was so small, then you are talking about a town with really few people living in it, and what if you missed out on a major household by just randomness. The only way to fix it is well, of course, you have to have a fair randomization routine, otherwise, it’s fraudulent.
Damian: Well this is the whole thing like the randomization, this is confidence level, right? Like that the randomization the higher the sample size you have the more confident you can be right?
Stephen: That’s right, that’s a result that the way we say it is that the higher the sample size, it is more likely to resemble the real universe.
Damian: Right.
Stephen: That’s the key. It’s not about being just like the universe or that you have to call everybody and nobody’s going to do that.
Damian: Yeah. Whenever possible, and this is not possible for everybody. But like let’s say if I could do a paid search test, I would try to like I organize, I just want people to type these keywords in, you know? There’s sometimes you can do that but sometimes it makes it so small that unless you think you’re going to get a big variation in outcome, you don’t learn. But this is where really understanding the math of how all these things tie together, can help you figure out what the best thing to test is you know? If you know that you are going to have a smaller sample size and you’re not sure if it’s going to have a big range of outcomes, you may have to take a different testing approach or maybe think about how could you bubble this up into something thematically bigger to test as a bigger idea to a bigger universe because you’re spinning your wheels with inconclusive to low confidence results.
Stephen: Yeah 100 inconclusive small tests don’t mean anything.
Damian: Yeah, well it does. It means you wasted a lot of time, a lot of money.
Stephen: Somebody kept their job by doing busywork, yeah.
Damian: For some period of time.
Stephen: I’ve found that those people are really good at keeping their jobs. I’m being sarcastic. So going back to the baseball analogy let’s just end with a baseball analogy. We started with the whole baseball thing.
Damian: This is America it’s America’s favorite pastime.
Stephen: And it’s the World Series going on. Now, just like baseball analogies, which is by the way statistically significant because there are like over 160 games so that’s why there’s enough number of pitches and hits and walks that we can predict these things right? That’s why during the postseason it’s harder to predict based on just the statistics alone.
Damian: Right, if you’re trying to do Moneyball at Little League, it would be harder.
Stephen: Now, why are baseball coaches so good at what they do? Because they’re much smarter than us? Maybe they are, but the real reason is that they’ve seen everything. When they move certain players in the field, it’s because they’ve seen it before. That means, just like these testers, if you play this game a lot, you’ll get better at it. So, we only talked about rough guidelines today. But, having a testing mindset is the hardest part. A lot of digital marketers just don’t test.
Damian: It’s very freeing I think to embrace testing.
Stephen: I think it is, you don’t want to be wrong.
Damian: You eliminate yourself from the outcome.
Stephen: It’s the math – that’s the way. So I think the hardest thing is having the scientific approach and actually embracing it, and actually committing to a test. And if you’re wrong, don’t give up, do it again. That’s baseball: you don’t give up after one loss. Just keep at it and you will be better at it, you will think of more dimensions of a test as you do it.
Damian: I also think there’s this intuitiveness that comes from experience in designing tests. That’s a hard one to quantify, but over time you will see, you know, everybody has a supercomputer in their head which is our brain. And this is one of the things that, you know, intuition is basically, we’re actually calculating that and figuring out it’s probably on something objective.
Stephen: And the mother of a hypothesis. Think about it. Now, even with all the automated tests scheduling, one thing the machine never determines is what test.
Damian: Right.
Stephen: Sorry that’s coming from you.
Damian: Yeah exactly. So you know, you’ll gradually figure out things that are worth testing and ways to test them where you can learn something, win or lose.
Stephen: And the next time you do it you will know the expected response rate and such things so you can design a better test. That’s how it works.
Damian: All right. This was a lot of fun.
Stephen: As always.
Damian: Yeah, it really is. This is a topic that I find incredibly stimulating and a lot of other people do, and I see it done wrong so frequently, so I’m glad we spent some time on it.
Stephen: That’s why I call it don’t treat the test result as a baseball game.
Damian: Not a baseball game. All right take care.
Stephen: Thank you.
Damian: If you enjoyed today’s episode we ask that you please leave a rating and write a review. Or better yet share it with another marketer. Be sure to subscribe to the podcast for new episodes. Also, check out the show description for complete show notes and links to all resources covered in today’s episode. If you’d like to speak to someone about any topics covered in today’s episode please visit BuyerGenomics.com and start a chat with the BG team today.