Causation vs Correlation
Correlation can be an amazing tool to discover causation, but sometimes it’s just too expensive or not worthwhile to even go that far. If the correlation works and you test into it, that doesn’t mean you break out an extra million bucks. You test into it and if it holds up and it’s true over time then make money with it. Don’t worry about it. Go solve another problem.
Below is a lightly edited transcript of Episode 32 of the Inevitable Success Podcast.
Damian: Google is literally saving lives. Are they? Maybe, maybe not. So, in a recent study that we found, from 2006 to 2011 the murder rate in the United States dropped every single year, in near-perfect correlation with people shifting away from Internet Explorer and Edge to Google Chrome. So is Google actually improving the safety of Americans? Or is this correlation versus causation?
Stephen: The short answer is we don’t know. Maybe, maybe not. And if you took any economics classes in college they say, “Yeah every time there’s a war, the U.S. economy grows.” So war is good for the U.S.? Well if you just look at it from an economic stance it’s not war. Is the war the cause of all this? Maybe, but we are not here to have a philosophical discussion about causality vs. correlation. We’re here to say that marketers, especially when you’re dealing with a lot of data, we see interesting correlations all the time, but do we jump to conclusions or do we take a step back and say, that sounds interesting but do we act on it? I guess the long and short of it is, no, just act on it if the correlation is really, really strong and if it makes sense, without digging all the way back to causality.
Damian: If we kind of go back to the Google example, I think it’s cute and it’s funny. It’s most certainly not true that it’s causing it. It’s certainly true that it is correlated though, and I think in today’s world as everything that turns into data and there are more data sets that are easy to compare to each other, you’re going to find more and more correlations. So I think the point that you’re making is that sometimes these correlations can tell you stuff that is actionable and can make you money, and sometimes you can be wrong on the causation and it can still work and I think that’s what we’re talking about.
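To make the browser example concrete, here is a minimal sketch of how two unrelated series that both happen to trend downward produce a near-perfect correlation. The numbers are invented for illustration and are not the figures from the study mentioned above; `pearson` is a hand-written helper, not a library call.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical yearly values, made up for this sketch only.
browser_share = [80, 70, 62, 55, 47, 40]        # percent on the older browser
murder_rate = [5.8, 5.7, 5.4, 5.0, 4.8, 4.7]    # per 100,000 people

r = pearson(browser_share, murder_rate)
print(f"r = {r:.2f}")  # very close to 1, yet neither series drives the other
```

Any two series that move steadily in the same direction over the same years will score this way, which is why a high coefficient alone says nothing about cause.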
Stephen: That’s what I’m trying to say. And also in the predictive business, we talked about predictive analytics some time ago, let’s bring back what it does and doesn’t do. Well actually they do a lot of things, but there are easier things to predict and harder things to predict. For example, predicting who’s going to do something. The who part, yeah that’s really established. Do you want to sell something? Who’s going to buy what, we know how to do that pretty, pretty well. So who means – okay who is more likely to go on a luxury cruise? Okay. With all the demographic data and past behavior, we can predict that. If you flip that and say that this person is coming to the store all the time, what is he going to buy next? We can do that too. How do you think that all the collaborative filtering happens on Amazon – if you buy something – oh he must be interested in that too. Well, they’re predicting what you’re going to buy next. The second hardest thing is when. Okay, fine you’ve predicted that somebody is into luxury goods. Will she buy some really expensive Italian handbag this Christmas? Now that’s hard because now you need some other type of empirical data to know exactly when. This is why in the marketing world what we call hotline names is so important. Or anything, like for example, I just moved by the way, and I must have left a lot of trails. In fact, it was a little spooky because I said something about moving on Facebook and before you know it, all the StreetEasy ads are starting to show up on my wall.
Stephen: So I said, “Well this is interesting, they must be listening to everything that I say now. It’s OK because I kind of bought into it and this is what I do for a living too, so I got to say you know it’s okay.” You know? But it’s still innocuous. The point is we know how to do these things, we know how to read, so even the when part is not impossible. Yeah, this guy’s giving user data. In fact, there’s no model, there’s no predictiveness, they just responded to what I said. Now, what is the hardest thing to predict? It’s why? Why do people do things? We don’t know that.
Damian: Well actually, I wanted to see – we were talking a little bit earlier and I know you have an example from your past client experience where the correlation was very profitable.
Stephen: Oh it happens too.
Damian: Yeah? And I have an example from my past experience where the correlation was very unprofitable. So I think we can go through both. Why don’t you jump off, I think it was the septic tank example.
Stephen: Oh septic tank yes, this happened in real life. We were helping out a luxury furniture catalog and online store. And we were building models to find out, again let’s talk about who. Who is more likely to buy furniture through a catalog?
Stephen: This is not cheap furniture by the way.
Damian: Okay so premium catalog for furniture, okay.
Stephen: And then they’re building models with all kinds of data, all kinds of behavioral data, behavioral meaning that he did something similar in other places, that type of thing, and the demographic helpers, also income, what’s the gender, head of household, age, all that stuff. And then all of a sudden this census-level data popped up, and it was the percentage of septic tanks in a neighborhood; it popped up in a model as a very strong variable. And by the way even when something is really highly correlated, we don’t use just one variable, that’s not even a model, that’s more like your gut feeling. But we don’t do that. But that popped out and we all scratched our heads. What does this mean? So again, is this causality? If you have a septic tank you do this? And then we realized that no, it’s telling us something. We’ve got to trace back, trace back to see if it makes sense.
Damian: It certainly was correlated though.
Stephen: It was strongly correlated. So we said, okay so let’s just say, what kind of people have a septic tank? Well their house would be a bit large, right, to have it, and the town should be pretty far away from the city center to have a septic tank, where you don’t even have a sewer system connected to the house. It is telling us something and what we said was, yeah it is a weird variable. We would have never picked it without math on our own, there’s no way. But it’s telling us something and let’s use it. So we used it, and it worked, because it was telling us all those things that I said here: certain size of a house, certain type of a household, single family unit, pretty far away from city center, certain income level, were all correlated to this particular furniture catalog. So we said, I don’t know why – again I stop asking the why part, but let’s use it and it worked. So I’d like to hear your story about when it did not work.
Damian: Sure. So one of the things about that, that when we were talking about it, it’s okay, it gave you a hint to something that you could wrap your head around well why could it – you start using a computer in your head to figure out why, why that would occur.
Stephen: Sort of human function actually.
Stephen: By the way all the machine based models, they just do it. They don’t really reason as humans do. Funny thing about it is that when you have a lot of variables, the machine will find substitutes anywhere.
Damian: Yeah, but I mean I think there are situations where that correlation could break down into unprofitability. So for example, it’s very rare but maybe there’s a growing city that still doesn’t have their sewer system yet, and you live one block away from, you know, a place that you can walk in and buy furniture, that correlation will break down from profitability because the premise, the cause was that they still had a beautiful house but they were just too far away to get in the car and go drive.
Stephen: That’s why you should never use just one variable.
Stephen: This was one of like 10-12 variables in that model. So it’s never the one thing. So that’s another thing that I want to point out is that when we say build a model, by the way, even machines when they build a model, they never use one variable. In fact, we use about 10 variables, and if one variable is really, really obscenely strong and it takes up like 80-90 percent of the predictive power, we throw that out, because if that one variable doesn’t work then you’re really screwed later. So modelers, mathematicians, they’re all about hedging bets, and what is a regression model? Regression is nothing but a curve that has the least amount of error on average. The curve that is the least wrong. That’s the regression curve. So yeah we don’t want to hedge all our money in one variable, we don’t have it –
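The "curve that is the least wrong" idea can be sketched with the simplest case, a straight line fit by ordinary least squares, where "wrong" is measured as the sum of squared errors. This is only an illustration with made-up points and a single input; it is not the multi-variable model described in the episode, and `fit_line` and `squared_error` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between the line and the observed points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up data points for the sketch.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best = squared_error(xs, ys, slope, intercept)

# Nudging the fitted line in any direction only makes the total error larger.
steeper = squared_error(xs, ys, slope + 0.1, intercept)
lower = squared_error(xs, ys, slope, intercept - 0.2)
print(best < steeper, best < lower)
```

The fitted line is not exactly right for any single point; it is simply the line with the smallest total squared error across all of them.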
Stephen: Yeah that’s a big caveat that I want people to remember.
Damian: So the story that I have was, I don’t know, maybe this could have been 5-7 years ago or whatever, but I remember I was looking at Google Analytics accounts for some e-commerce websites and I even remember like, especially earlier in my career you’d read articles that say you know, like it was hard to track things back then. So page views were like a really easy thing to track because everyone had access to it. And there was this like running theme in marketing forums and vice versa, all those places that if you could increase the number of page views in your sessions, then those were more engaged and they had higher conversion rates. And I remember digging deeper and deeper, deeper into it and I was kind of buying into it because I was looking at all these different accounts, and I saw that yeah that’s true. Like the pages that the sessions that have all these high engagements judged by that metric were extremely correlated to very, very high conversion rates. And then I looked just a little bit deeper and I realized that wow, all of these websites have multi-page checkout steps. So by definition, if you went to check out you increased your page views by 5.
Stephen: Oh right.
Damian: So if you, in hindsight like you couldn’t buy unless you had that many page views, therefore like was it really describing a good session that was engaged or were those the people that you know, you had to have that many pages to check out? And then it kind of started this whole other process where, is a landing page really great if you were, or a website really great if you have to go to so many many pages to check out? And then it was like well actually maybe the best sessions and this actually proved to be true, the best converting sessions were the ones where somebody landing on the landing page went straight to check out. There was no navigating or shopping, it was buying. And that actually is one where if you bought into, I should encourage people to keep having more page views, it was wrong. It actually hurt, it was the inverse. And I was just I guess that’s my story.
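The page-view trap can be reproduced with a small sketch. The sessions below are invented, and the five-page checkout is taken from the story above as an assumption; `pearson` is a hand-written helper. Total page views correlate positively with purchase only because the checkout itself adds pages, while the pages browsed before checkout point the other way.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

CHECKOUT_PAGES = 5  # assumed length of the multi-page checkout

# Hypothetical sessions: (pages browsed before checkout, purchased 1/0).
sessions = [(1, 1), (1, 1), (2, 1), (1, 1),
            (2, 0), (3, 0), (4, 0), (3, 0), (2, 0), (4, 0)]

browse_pages = [pages for pages, _ in sessions]
purchased = [bought for _, bought in sessions]
# What the analytics tool reports: checkout pages are counted for buyers.
total_pages = [pages + CHECKOUT_PAGES * bought for pages, bought in sessions]

r_total = pearson(total_pages, purchased)
r_browse = pearson(browse_pages, purchased)
print(f"total page views vs purchase:  {r_total:+.2f}")
print(f"browse page views vs purchase: {r_browse:+.2f}")
```

In this toy data the first coefficient is positive and the second is negative, which mirrors the point that the metric was measuring the checkout flow rather than engagement.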
Stephen: That is a very good example. And what you just did here is exactly why humans will have some place even in the machine-driven world: we reason. The second point is that we have to dig deeper into not just pure data, but you have to even think about how the data is collected. And I have a similar example from when I was at a data vendor, really a compiler, and we had no shortage of data, and we were building a model for a certain client and we found out that certain regions, by the way, when you’re in the compiler business you know that in certain states it’s hard to collect certain types of data. So when that data popped in –
Damian: What do you mean? Give me an example.
Stephen: In other words, when you compile the data, you don’t know everybody’s home value by the way. So a lot of things are outsourced and somebody actually sometimes stands in line in the local city government and finds out what all the house prices are. Well, they can crawl the web, but the point is there are some variables that are collected that way. The point that I’m making is that with certain variables, if you know the history of it, you have to tell the difference between actual consumer behavior and some loophole in the way we collect the data. So you’ve got to really think about not just what you see in front of you, oh yeah it looks like it’s highly correlated. And that’s what you just did, think why so many page views? Because the website is poorly designed. In my case it was more no, no, no in certain states it is hard to collect such and such data, and if that’s popping up so prominently.
So you know what, let’s look at this, compare this with a store footprint because you cannot argue that if you have a lot of store footprint you have more concentration of people in those states right? And it was almost an identical match, so that variable should be thrown out. This is why, again going back to the point number one, humans still have a place to reason and make sense of all this, but that does not mean that the analysts who do these things should have an endless pursuit of oh I want to know why. Because the why part, and this is why the why part is the last and hardest to predict. Sometimes you just have to ask why. We talked about three types of data, about a few episodes ago. You have behavioral data, demographic data, and attitudinal data. Attitudinal data is scarce because you have to actually stop and ask questions in the form of primary research, or survey, or even social media listening. But it’s impossible to listen to everybody and it’s impossible to know everybody who answered it either. It is really hard to marry such data on a personal level with all the other behavioral and demographic data. In the pursuit of why, that’s what you need to do. So, I’m not saying that asking why is not important, even when you see a variable you know, in a really well-built model you have to pursue to find out, okay what’s the background of all this data? Does it make any sense? Why are a septic tank and all those things showing up in my model? Yes, you have to think about it. That doesn’t mean that you have to stop and pursue the why so hard that you have to start primary research.
Stephen: Sometimes you just have to act on it.
Damian: I think the essence of what I take away from you’re saying is, one, correlation can be an amazing tool to discover causation, and two sometimes it’s just too expensive or not worthwhile to even go that far. If the correlation works and you test into it, that doesn’t mean you break out an extra million bucks. You test into it and if it holds up and it’s true over time then make money with it. Don’t worry about it. You know, go solve another problem.
Stephen: That’s right. And I’m trying to communicate the price of prediction. There are a lot of marketers who ask that question first. Now the marketing is a part of the product planning stage and I have met such people in Korea actually, there is an amazing company that does all that social media listening, and they were helping companies like LG, Samsung, and all those companies and they actually figured it out by listening to the social media comments that some company made a very small washer and dryer set, thinking that yes single people might buy this thing. The assumption, great in all scientific research –
Damian: I know this story. You’ve told me this story, it’s a good one.
Stephen: Yeah. So and then they realized, wait a second, we made this thing – they don’t buy them.
Damian: So single people didn’t buy the smaller washer and dryer.
Stephen: Because you know why? They’re too busy socializing, basically, if you follow the tweets that they make, you know? They want to have a big washer and just do one load once in a while. The lifestyle makes sense.
Damian: Basically they don’t want to do laundry all the time so they’re like I’m going to let this laundry pile up in a corner and then I’ll do it all at once.
Stephen: That’s exactly right.
Damian: And I don’t want to spend all day doing it. I want to do it one time.
Stephen: In fact, my wife, who washes quite frequently, doesn’t even need it; because she washes so frequently, it doesn’t even matter for her that much. The moral of the story is this: the company spent a lot of money doing this because they were actually planning a new product. You don’t want to build the wrong product, so you have to listen and ask, do the survey and do the panel research; you’ve got to do all these things, right? But when you are in a one-to-one marketing mode, let’s not go crazy. Sometimes you find a good correlation; count your blessings and act on it. If it doesn’t work, go to Plan B.
Damian: I think that’s a great place to end. And you know in the meantime, if you’re going to use Internet Explorer versus something else, make sure that you do it in the winter because we also found that ice cream sales are extremely correlated with murder rates. So there are lower murder rates in the winter. So that should cancel out your risk of using Internet Explorer.
Stephen: Stay safe.
Damian: Stay safe people. Take care.
Damian: If you enjoyed today’s episode we ask that you please leave a rating and write a review. Or better yet share it with another marketer. Be sure to subscribe to the podcast for new episodes. Also, check out the show description for complete show notes and links to all resources covered in today’s episode. If you’d like to speak to someone about any topics covered in today’s episode please visit BuyerGenomics.com and start a chat with the BG team today.