At Booking.com, Innovation Means Constant Failure
Harvard Business School professor Stefan Thomke discusses how past experience and intuition can mislead companies attempting to launch an innovative new product, service, business model, or process, drawing on his case “Booking.com” (co-authored with Daniela Beyersdorfer) and his new book, “Experimentation Works.” Instead of relying on intuition, Booking.com and other innovative firms embrace a culture where testing, experimentation, and even failure are at the heart of what they do.
BRIAN KENNY: The scientific method. It’s been around since the Renaissance. Historians credit Sir Francis Bacon, back in 1621, for having articulated a common platform and vocabulary that is still used today. We begin learning about it in grade school and, unless you work in a lab, you probably forgot about it a minute after high school. Here’s a refresher. The scientific method has six steps, beginning with a question, followed by research that leads to a hypothesis, which is tested with an experiment, the results of which are analyzed, leading to a conclusion. Simple, yet this approach has led to astonishing advances in science for centuries. It’s been a bit slow to enter the world of business, where intuition, also known as going with your gut, too often drives big important decisions. At the dawn of the age of big data, the scientific method may be experiencing a renaissance of its own in the world of business. Today, we’ll hear from professor Stefan Thomke about his case entitled, “Booking.com.” I’m your host, Brian Kenny, and you’re listening to Cold Call, recorded live in Klarman Hall Studio at Harvard Business School. Stefan Thomke is an expert on managing innovation. His research focuses on the process, economics, and management of business experimentation in innovation. That’s a perfect topic for today’s conversation. Stefan, thanks for joining me.
STEFAN THOMKE: Thanks for having me.
BRIAN KENNY: Let me ask you to start by setting the case for us. How does it begin, who is the protagonist, and what’s on their mind?
STEFAN THOMKE: Booking.com is a remarkable company in the travel space. Many of your listeners have probably used Booking.com. Here are just some numbers: about 1.5 million room nights are booked per day, and somewhere between 400 and 500 million people, by external estimates, visit their website every single month. The case begins as follows. The protagonist, who is a director of testing, walks into the office of the CEO, Gillian Tans, and tells her that he wants to run a radical experiment. He wants to change the Booking.com landing page. This is a landing page that has been optimized for years, through tens of thousands of experiments, and is very refined. All the people who use Booking frequently know what it looks like, know how to navigate it, and so forth. Here, you have this director of testing walking into the office, and he wants to change it all. He wants to run this experiment to make it look like a Google search page. The question to the class is, as the CEO, and she’s listening to all this, what should she do? Should she just encourage him to go ahead? Should she get involved, and try to encourage him to modify it? Just to add to that, he also wants to run it during the Christmas season, which is a very busy travel season, and on a fairly significant percentage of the regular users of Booking.com. So you have to imagine being a user of Booking.com: you wake up in the morning, you’re getting ready to make your hotel bookings for the Christmas season, you’re looking at this webpage, and it looks nothing like anything that you’ve seen before. It looks more like a Google search page. They want to test what the reactions are from their customers.
BRIAN KENNY: As the saying goes, what could go wrong?
STEFAN THOMKE: Yes.
BRIAN KENNY: Seems like a lot could go wrong with this. That’s a pretty big decision.
STEFAN THOMKE: A lot could go wrong. There’s a number of things that could happen. First of all, they could just look at it and leave.
BRIAN KENNY: Yeah, thinking they’re in the wrong place maybe, I don’t know.
STEFAN THOMKE: Go to Expedia, or somewhere else. Could be that they would play around with it a little bit, get confused, and then give up; could be that they get very frustrated. A lot of things could happen. Or, they could actually like it. The problem is, in this kind of approach, the experimental approach, you just don’t know. The data in the case says that Booking has learned, over the years, that they’re wrong about nine out of ten times: you have a hypothesis, the hypothesis seems very reasonable, you go out and test it, and then something really surprising happens. The default is really that you’re much more likely to be wrong than to be right. He’s basically proposing this because he wants to find out what’s going to happen when they fundamentally change this page. Of course, there is also some background going on here as well, which is that Google is getting into the travel business. These are potential competitive forces that are at play here. It’s more than just finding out what users would like to see, or how they react to certain pages; it’s also maybe to find out what their response to a Google-like experience would look like, perhaps even offering different kinds of services than just the regular accommodations that they always have on their website.
BRIAN KENNY: What led you to write about Booking.com? You have a Harvard Business Review article on this topic that relates back to the case as well. Why is Booking.com of particular interest?
STEFAN THOMKE: Brian, there’s a general phenomenon that’s going on right now, especially in the digital space. As many companies are learning, as the digital economy expands, it turns out that the way we interact with customers is fundamentally changing. There is an explosion of touch points, through the various communication channels that we have with customers, and all these touch points have to be optimized. There are just a huge number of decisions to be made, and how do we make decisions involving novelty? The way we usually make decisions is we work from intuition, something that may have worked in the past. We look around at what competitors do, and we may copy competitors, and so forth. Then we have to make all these decisions, but the data says that we’re wrong most of the time. There’s a great example from Microsoft that I’ve written about with my co-author Ronny Kohavi, from Microsoft, in this article. A few years ago, an employee at Bing has this really interesting suggestion about the ads that are placed there. He has this idea: “Why don’t we actually take some of the text that’s below the headlines, and just move it up to the headline?” Bing gets lots of suggestions like this. People have lots of ideas. Bing runs more than 15,000 tests a year. People looked at this, and shrugged, and didn’t really pay much attention to it. The idea was essentially lingering for more than six months, until finally, the employee took matters into his own hands, and just decided to make a few changes in the code. Took only a couple of days, and launched this thing live on-
BRIAN KENNY: That’s a pretty risky move.
STEFAN THOMKE: Here’s what happens. As soon as he launches this, Microsoft gets this too-good-to-be-true alarm. They have KPIs set up, and they have various systems set up, and when something looks really unusual, this alarm goes off. Usually when the alarm goes off, it’s because there’s a bug. This thing goes off, and they all scratch their heads. They check what’s going on, and they run it again. It’s a robust result, and what’s even more amazing is that the impact is remarkably high. That change alone resulted in more than $100 million of additional revenue in the United States. It ended up being, actually, the largest and most successful experiment that Bing has ever run. You kind of get a sense of what’s going on here, right?
BRIAN KENNY: I hope that guy got a promotion.
STEFAN THOMKE: I hope so too. I hope so too. We don’t have to know in advance what works and what doesn’t work. The only way to really find out is to build these testing cultures, where you can run a lot of experiments, on a very, very large scale, and where the game is no longer about getting every single call perfectly right. Let’s say your hit rate is only about 10%, but if you’re running 10,000 experiments, you’re still getting 1,000 right. That can actually have a big impact on your business. That’s kind of the backdrop for this, and of course in my own research, I heard about Booking, and that they have this really unique experimentation culture. I just heard it through my own network, and I approached them, and asked them whether they would be interested in letting me look behind the curtain.
BRIAN KENNY: I mentioned in the introduction, in describing your background, that you look a lot at how organizations and firms manage innovation. I think we hear that term a lot, and it’s present in the business press all the time. Is this a facet of managing innovation?
STEFAN THOMKE: Absolutely, yes. It’s in fact a big part of managing innovation. The problem in innovation is that it involves novelty, and again, trying to predict novelty is very difficult, and again, we get it wrong most of the time. Novelty comes in many different forms. It could be new products, new services, new business models, new processes, and all sorts of things. When you look at how decisions about these innovations are made, what happens? People go back, and they rely on their experience, something that may have worked well in the past. They may look at big data, and they look at correlations, trying to analyze the data, and they quickly find that correlation is not causality. We have a lot of examples where things correlate very highly, and there’s no causal relationship between them. Maybe the listener will enjoy a funny example. There’s a big correlation, for example, between hand size and life expectancy, and before you start looking at your hands and getting depressed, there’s actually no causal relationship. The underlying causal variable is gender: women tend to have smaller hands, and women tend to live longer as well. There are lots of things like that, and so yes, big data gives us correlations, but it doesn’t give us causality. Then of course we look at context as well. We may look at what has worked in other contexts, but that can also be misleading. Here’s a great example, which you may be aware of. When Ron Johnson left Apple, he was clearly a retail god, having co-created the Apple store. He went to J.C. Penney, and they wanted him to do what he did for Apple at J.C. Penney, and things didn’t really work out that well. At the end, he had to go, and J.C. Penney was fighting for its survival, because they didn’t test.
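A note on the numbers: to make the confounding point concrete, here is a minimal simulation sketch. It is our own illustration with invented effect sizes, not data from the case or the episode; it simply shows how a lurking variable (gender, in his example) can produce a negative correlation between hand size and lifespan even though neither causes the other.

```python
# Minimal simulation (illustrative only; all effect sizes are invented):
# a confounder creates a correlation between hand size and lifespan
# even though there is no direct causal link between them.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder: 0 = female, 1 = male (roughly half and half).
is_male = rng.integers(0, 2, size=n)

# Assumed tendencies: men have somewhat larger hands and, on average,
# somewhat shorter lifespans. Numbers are for illustration only.
hand_size_cm = 17.0 + 2.0 * is_male + rng.normal(0, 1.0, size=n)
lifespan_yrs = 84.0 - 5.0 * is_male + rng.normal(0, 5.0, size=n)

# The raw correlation is clearly negative (bigger hands, shorter average life)...
print("overall corr:", np.corrcoef(hand_size_cm, lifespan_yrs)[0, 1])

# ...but within each gender the correlation essentially vanishes,
# which is what "no causal relationship" looks like once you control
# for the underlying variable.
for g in (0, 1):
    mask = is_male == g
    print("within-group corr:", np.corrcoef(hand_size_cm[mask], lifespan_yrs[mask])[0, 1])
```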
BRIAN KENNY: We actually have a Cold Call episode about Ron Johnson and his experience, so listeners can look for that. Let’s talk about Booking.com a little more. What is unique about the way that they approach this experimentation process?
STEFAN THOMKE: One thing that’s interesting is to look at the history. When you go back, even from the very early days, Booking was founded, essentially, in 1996, about a year after I joined the faculty here.
BRIAN KENNY: Wow, they’ve been around for a while.
STEFAN THOMKE: They’ve been around for a while. They started very small, just in the Netherlands, in Amsterdam, but from very early on, they had this ethos of following the data, following what customers actually do. There’s an interesting example in the case. One of the initial decisions that they had to make was where to set up the first office outside of the Netherlands. They looked at the German market, which is a very big travel market, and of course the first logical decision you would make, when you go to Germany and set up somewhere, is to go to Berlin, or one of the other big cities. But they looked at the data, and it turned out that the place where a lot of their Dutch travelers go is a small skiing village. By just looking at the data, they ended up opening the first office in that small skiing village, because this is where the travelers go. Testing or experimentation has always been at the heart of what they do. Over time, they built up these experimentation capabilities, and they learned very early on that trusting their intuition can be very misleading. They looked around, and there are lots of examples in the case. They looked at travel catalogs initially, and of course a lot of these travel catalogs offered package deals, and so forth. So of course, they would go online and try to offer packages. It turns out that customers didn’t like it. What works perhaps in brochures doesn’t work online. They did a lot of these kinds of things, and it was always surprising: when they actually tested it, or when they looked at the actual behavior, it turned out that a lot of the assumptions that they made about how people travel were just not true.
BRIAN KENNY: What kind of a culture do you need to support that though? I would imagine it feels a little chaotic.
STEFAN THOMKE: Yes.
BRIAN KENNY: Is it like anarchy, or is it controlled experimentation? How does this work?
STEFAN THOMKE: It’s not anarchy, but it’s democratized. Anybody can launch an experiment without permission from management, but anybody in the company can also kill an experiment that somebody else launched; they call it pushing the nuclear button. It’s an environment where people are trusted to do the right thing, to test things out a lot. At the same time, it’s also an environment that has a lot of checks and balances, where other people can step in. It’s an environment where transparency is key. That is, when you’re trying to launch an experiment, you actually have to broadcast it first, inside the organization, so other people can look at your experiment, ask questions, criticize it. I think that’s necessary, because they don’t have a committee, or something like that, that oversees every single experiment. If you want to operate at the scale that they’re operating at, you need to create an organization that has these kinds of checks and balances, and self-governance. That’s hard. It took them a long time to build that kind of environment.
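A note for readers: as a rough sketch of the launch-broadcast-kill pattern he describes, here is a hypothetical Python illustration. The case does not describe Booking.com’s actual internal tooling, so every class, field, and method name below is invented.

```python
# Hypothetical sketch of the governance pattern described above: anyone can
# launch an experiment after broadcasting it internally, and anyone can stop
# ("nuke") it. None of these names come from Booking.com's real systems.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Experiment:
    name: str
    owner: str
    hypothesis: str
    broadcast: bool = False        # must be announced internally before launch
    running: bool = False
    stopped_by: Optional[str] = None

@dataclass
class ExperimentRegistry:
    experiments: dict = field(default_factory=dict)

    def announce(self, exp: Experiment) -> None:
        """Broadcast the experiment so colleagues can review and question it."""
        exp.broadcast = True
        self.experiments[exp.name] = exp

    def launch(self, name: str) -> None:
        exp = self.experiments[name]
        if not exp.broadcast:
            raise RuntimeError("Experiments must be broadcast before launch.")
        exp.running = True          # no management sign-off required

    def kill(self, name: str, killed_by: str) -> None:
        """Anyone in the company can press the 'nuclear button'."""
        exp = self.experiments[name]
        exp.running = False
        exp.stopped_by = killed_by
```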
BRIAN KENNY: They are using the scientific method, the one that I described in the beginning?
STEFAN THOMKE: Yes, turbo-charged. What they’re running is a little more than 1,000 concurrent experiments at any point in time, which results in quadrillions of possible variations of the Booking.com landing page. If you and I were sitting here, and we both went to their landing page and wanted to book a trip, we’d very likely end up on different versions, because there are just so many versions out there. There is not one version of a Booking.com landing page.
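A note on the numbers: here is some back-of-envelope arithmetic, our own illustration rather than a figure from the case, assuming each concurrent test toggles one independent binary change (the case does not spell out how variants combine).

```python
# Back-of-envelope arithmetic (illustrative assumption): if k concurrent
# experiments each toggle one independent binary change, a visitor can in
# principle see 2**k distinct combinations of the page.
import math

for k in (10, 50, 1000):
    exponent = k * math.log10(2)   # 2**k expressed as a power of ten
    print(f"{k:>4} concurrent binary tests -> about 10^{exponent:.0f} possible page variants")

# Roughly 50 concurrent binary tests already exceed a quadrillion (10^15)
# combinations; 1,000 concurrent tests give an astronomically larger number.
```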
BRIAN KENNY: That’s amazing. My team oversees the webpage here at Harvard Business School, and when I think about all the time that we put into creating that one experience, I can’t imagine what that must be like. It must feel precarious.
STEFAN THOMKE: They’re not alone, by the way. If you go to Netflix, you go to Google, you go to Amazon, go to Microsoft, all the things that we use every day, they’re all doing this. Customers are often not aware of that, but you are being experimented on every single day, multiple times. It’s just their way of finding out what works and what doesn’t work, which I think at the end, allows them to make better decisions.
BRIAN KENNY: Is the fact that they’re a digital company, does that make it more feasible for them to do this? Could this approach be applied in a brick and mortar institution, just like on their web presence?
STEFAN THOMKE: Yes. That’s the short answer. Of course, in a digital environment with a lot of traffic, it’s a lot easier, because there are certain statistical principles that you have to follow, and when you get a lot more traffic, you can just test a lot more things. It also allows you to test smaller things, small changes, because you get a big enough sample size; we call it the power of the experiment, which is a technical term. But you can actually do it in smaller settings as well. I wrote another article with my co-author Jim Manzi, and his company has done this in the retail space, where a company like Kohl’s, which has retail outlets, wants to run experiments in its stores. Now of course Kohl’s cannot run a sample size of a million; they can’t test something in a million stores, they just don’t have the stores. But you can do it with small sample sizes, and they just use different statistical techniques that allow them to run experiments in small-sample environments. It turns out, actually, that the techniques you use in small-sample environments are much more sophisticated than the statistical methods you use in large-sample environments, because things get a little complicated. But yes, so it’s not just about companies with digital roots; it’s also about companies that don’t have digital roots that are going digital. It’s also about companies that are running experiments in brick-and-mortar environments, and it turns out that a lot of retailers now run these kinds of experiments. Again, you’re probably not aware of it when you go shopping. They may test different product lines, they may try to remodel parts of a store, they do all these kinds of things, and they run them against control stores, to compare behavior across stores, and all these kinds of things are going on all the time. The beauty of it, again, is that you see the scientific method at work here, in a very large-scale way. But of course, the companies that I’m talking about, the Googles, the Booking.coms, LinkedIn, by the way, as well, and Facebook, all do it at massive scale.
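A note on the “power of the experiment” point: here is a minimal sketch of a standard two-proportion sample-size calculation, not Booking.com’s or Kohl’s actual method; the base conversion rate and the lifts are assumed numbers, chosen only to show why small changes need enormous traffic to detect.

```python
# A minimal sketch of the usual two-proportion power calculation for an A/B
# test on conversion rate. Base rate and lifts below are assumptions.
from scipy.stats import norm

def sample_size_per_arm(p_base, lift_rel, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect a relative lift in conversion."""
    p1 = p_base
    p2 = p_base * (1 + lift_rel)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Illustrative numbers: a 2% base conversion rate, shrinking relative lifts.
for lift in (0.10, 0.02, 0.005):        # 10%, 2%, 0.5% relative lift
    n = sample_size_per_arm(0.02, lift)
    print(f"detecting a {lift:.1%} relative lift needs ~{n:,.0f} visitors per arm")
```

The pattern is the point: halving the effect you want to detect roughly quadruples the required sample, which is why high-traffic digital sites can test tiny changes while small-sample settings need more sophisticated methods.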
BRIAN KENNY: What are some of the pitfalls of doing this? There’s got to be a downside. How do you manage the risk, I guess?
STEFAN THOMKE: It’s scary to run these things.
BRIAN KENNY: I would think so, yeah.
STEFAN THOMKE: It is scary, but think about what the opportunity cost is of not doing it, because eventually, you’re going to have to go live. If you go live with something that’s not tested, and again, J.C. Penney is a good example, you can get into even more trouble. The reality is you can launch some of these things, and if you have enough safeguards in place, if something really tanks very quickly, you can turn it off. There is a risk there, and of course the more radical the changes that you make, the more radical the experiments, the bigger the risk. That’s one of the trade-offs, and that brings us back to the case dilemma. Here a director comes in and wants to run this radical experiment. Booking.com is really more in the mode of running incremental, small experiments that are less risky, and he wants to run a big one, a very risky one. Again, the CEO, what should she do? Should she get involved? Should she encourage him maybe to move it to a different time of the year, maybe run it on a smaller scale? Should she ask him to modify the test, maybe make it a little less ambitious, or should she just leave him alone? They have this culture of testing, they have the methodology in place, so should she just trust the organization for these things to work out on their own?
BRIAN KENNY: What’s the culture there like? Let’s say you let one of these experiments fly, and it does fail miserably. There must be a high tolerance for failure?
STEFAN THOMKE: Failure is the status quo. Again, when 9 out of 10 things fail, you’re much more likely to run into a failure than not, so what you have to do is create, basically, an environment where failures are discussed and shared. I always say that not winning is not the same thing as losing. Not winning means yes, things fail, but there’s a learning objective here. You can learn from these failures, you share the learning across the organization, and often these failures are input to the next hypothesis, because it’s an iterative process: you modify the hypothesis, try again, and sometimes these failures lead to some really important insights, which then lead to great experiments, which then lead to big performance improvements. Traditionally, when we think about innovation, managers assume that small changes, small innovations, lead to small performance changes, and that big changes, breakthrough innovations, lead to big performance changes. This is a different game here, because, as the Microsoft example earlier showed, small changes can have huge implications for performance and revenue. The game is a little bit different. The game here is really what I call high-velocity incrementalism. You make a lot of small changes, but you have to make them very fast, and you have to make them at very large scale. The cumulative impact of all this can then lead to big performance changes. That’s the game in these spaces. Once in a while, of course, you do make big changes, but most of the changes would be considered more incremental.
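A note on the arithmetic of high-velocity incrementalism: the percentages below are assumed for illustration, not figures from the case; the point is simply that many small wins compound.

```python
# Illustrative arithmetic (assumed numbers): a 1% improvement shipped 100
# times does not add up to a 100% gain -- it compounds to roughly 170%.
n_wins, lift_per_win = 100, 0.01
cumulative = (1 + lift_per_win) ** n_wins - 1
print(f"{n_wins} wins of {lift_per_win:.0%} each -> {cumulative:.0%} cumulative improvement")
```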
BRIAN KENNY: What does success look like in a situation like that? Do they look back over the course of the past year and say, “We attempted this many experiments, and our hit rate was low, but we’re happy that we attempted this many,” or do they just look at the successes?
STEFAN THOMKE: No, they have KPIs. All these companies have sometimes very sophisticated KPIs. Booking has several KPIs, but of course, the biggest KPI that they pay attention to is conversion. The tricky thing with these KPIs is to find a KPI that’s meaningful, because sometimes short-term KPIs don’t really serve long-term objectives, so the best KPIs you can find are the ones that correlate very well with long-term objectives. In some ways, Booking really has to solve three big problems. The first problem is they have to get traffic. When you go to Google or Bing, and you type in a search, and you’re looking for a hotel, they want to make sure that you end up on their site. That problem can be solved with money: just pay Google or Bing a lot of money, and they’ll send you to Booking’s site, and in fact, when you look at their income statement, you’ll find that’s a huge expense for them. The second problem is, once you get to their site, they need to convert you, because if you go to their site, and then leave their site, and don’t book, that’s a lost opportunity. They have to convert you, so they have to give you the best possible experience. The way to do that is to figure out what works, and what doesn’t work, and that’s where all the testing comes in. They run a massive number of tests, and they try lots of different things to make sure that you convert. The third problem is, of course, after you’ve booked, you’re actually going to travel. They need to make sure that you also have a great travel experience. Let’s say you go to a hotel, and the hotel is overbooked, and something doesn’t work out; they want you to be able to call their customer support, and so they have a lot of people in customer support who make sure that things run without friction, that you’ll get a room, either at that hotel or some other hotel, right away, and that the whole experience is kind of amazing. Then you come back, and they don’t have to pay money to Google to get you to come back, to get their traffic. That’s what it’s all about, and I think the middle one, conversion, is critical, because that’s where the proverbial rubber hits the road.
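A note for readers: as a minimal sketch of how a conversion KPI might be compared between a control and a treatment page, here is a standard two-proportion z-test with hypothetical traffic numbers. The case does not describe Booking.com’s actual analysis pipeline; this is only the textbook version of the comparison.

```python
# A minimal sketch (assumed setup, not Booking.com's pipeline): comparing the
# conversion KPI of a control page (A) and a treatment page (B) with a
# standard two-proportion z-test.
from math import sqrt
from scipy.stats import norm

def conversion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the absolute lift in conversion rate and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p_b - p_a, p_value

# Hypothetical traffic split: 500,000 visitors per arm, a tiny absolute lift.
lift, p = conversion_z_test(conv_a=10_000, n_a=500_000, conv_b=10_300, n_b=500_000)
print(f"absolute lift {lift:.4%}, p-value {p:.4f}")
```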
BRIAN KENNY: I was thinking as I read the case, and I referenced big data in the introduction, and you’ve mentioned it a couple of times here. This felt a little bit to me like the antithesis of big data. This is like lots of shots of small data that are leading you down a path to make decisions incrementally.
STEFAN THOMKE: I think the two are complementary. What big data is really great for is finding interesting relationships, correlations between variables. The problem in innovation, of course, is that you usually don’t have a lot of data, because if there were a lot of data, it wouldn’t be very innovative; it would mean someone has already done it. The way we usually use big data in this approach is in hypothesis generation. You can look for interesting patterns, and those patterns then lead to interesting hypotheses that are testable, that are measurable. Then the experimentation comes in, and the experimentation actually tests for cause and effect, and tells you whether it’s real or not. Big data and experimentation go together, and I think companies that just do big data, and don’t do the experimentation, are missing out in a big way, and I think that they will soon realize that both are needed to be successful. A hypothesis can come from many different places. Big data is one source, but qualitative research is another source. Booking actually does quite a bit of research; they do focus groups, they do all the things that other companies do as well. They look for interesting patterns in their service centers, customer complaints, and all those sorts of things, but the way they use that is very different. They funnel that into a hypothesis pipeline, and that pipeline is then used for the testing, because you have to feed this huge experimentation machine if you’re running so many experiments. They didn’t tell me how many experiments they run per year, but my guess is somewhere between 20,000 and 30,000, or something like that. You can back it out with just a few calculations, so it’s a huge number of tests. Those hypotheses have to come from somewhere, and in this case, they come from all kinds of sources.
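A note on the “back it out” calculation: the test durations below are our assumption; the only figure from the conversation is the roughly 1,000 concurrent experiments.

```python
# Back-of-envelope calculation (durations assumed, not stated in the case):
# ~1,000 concurrent experiments, each running a week or a few, implies
# tens of thousands of experiments per year.
concurrent = 1_000
for weeks_per_test in (1, 2, 4):
    per_year = concurrent * 52 / weeks_per_test
    print(f"if a test runs ~{weeks_per_test} week(s): ~{per_year:,.0f} experiments per year")
```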
BRIAN KENNY: You’ve discussed this case with students?
STEFAN THOMKE: Oh yes, I’ve taught it a number of times now, in Executive Education, and it is really interesting to see their reaction.
BRIAN KENNY: I don’t want you to give away any deep insights, but I’m just curious, is there something that surprised you in their reaction?
STEFAN THOMKE: First of all, I think in the discussions, they do get the scientific method right away, so that’s not really up for debate. I think what really impresses them is to look at the organization itself, and the scale at which they do this. People come from companies that have done some of this, but they just haven’t done it at that scale. Then to build a whole organization, process, and culture around that, that kind of blows them away. I think for many, it’s also a wake-up call. That is, in the digital era, this is where we’re all heading, and this is what gives these companies that I’ve named a competitive advantage, because by the time some companies figure out what to do, they’ve already tested a hundred things. If you’re that slow, you’re going to be behind. I remember in one session, I had a few competitors of Booking, actually, in the class.
BRIAN KENNY: Oh, wow.
STEFAN THOMKE: I’m not going to name them, but at the end of the class, I turned to them and I said, “Just tell me. How do you compete against a model like this?” They threw up their arms and said, “We don’t know,” because they said, “By the time we go through committees, and we agree on what to do, Booking has already tested so many things, so they’re already way ahead of us. Even with all the committees, and all of the thinking that we do, we’re still going to be wrong most of the time.”
BRIAN KENNY: Yeah, that’s something. It’s a great case. Thanks so much for joining me.
STEFAN THOMKE: Thank you. Thank you for having me, Brian.
BRIAN KENNY: If you enjoy Cold Call, you should check out our other podcasts from Harvard Business School, including After Hours, Sky Deck, and Managing the Future of Work. Find them on Apple Podcasts, or wherever you listen. Thanks again for joining us. I’m your host, Brian Kenny, and you’ve been listening to Cold Call, an official podcast of Harvard Business School, brought to you by the HBR Presents Network.
Note for our listeners: You can preorder Stefan Thomke’s new book, “Experimentation Works,” at Amazon, B&N, and IndieBound.