Being Data Driven by DJ Patil on Building Data Science Teams.

Everyone wants to build a data-driven organization. It’s a popular phrase and there are plenty of books, journals, and technical blogs on the topic. But what does it really mean to be “data driven”?

My definition is: A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

There are many ways to assess whether an organization is data driven. Some like to talk about how much data they generate. Others like to talk about the sophistication of data they use, or the process of internalizing data. I prefer to start by highlighting organizations that use data effectively.

Ecommerce companies have a long history of using data to benefit their organizations. Any good salesman instinctively knows how to suggest further purchases to a customer. With “People who viewed this item also viewed …,” Amazon moved this technique online. This simple implementation of collaborative filtering is one of their most used features; it is a powerful mechanism for serendipity outside of traditional search. This feature has become so popular that there are now variants such as “People who viewed this item bought … .” If a customer isn’t quite satisfied with the product he’s looking at, suggest something similar that might be more to his taste. The value to a master retailer is obvious: close the deal if at all possible, and instead of a single purchase, get customers to make two or more purchases by suggesting things they’re likely to want. Amazon revolutionized electronic commerce by bringing these techniques online.
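
To make the idea concrete, here is a minimal sketch of an "also viewed" feature built from co-view counts. It is not Amazon's implementation; the function names (`build_co_views`, `also_viewed`) and the toy browsing sessions are assumptions made up for illustration, and a production recommender would add normalization, recency, and far more data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_views(sessions):
    """Count how often each pair of items appears in the same browsing session."""
    co_views = defaultdict(Counter)
    for items in sessions:
        # Deduplicate within a session so repeat views are counted once.
        for a, b in combinations(sorted(set(items)), 2):
            co_views[a][b] += 1
            co_views[b][a] += 1
    return co_views


def also_viewed(co_views, item, k=3):
    """Return up to k items most often viewed alongside `item`."""
    return [other for other, _ in co_views[item].most_common(k)]


# Hypothetical sessions: each list is what one visitor looked at.
sessions = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "lantern"],
]

co_views = build_co_views(sessions)
print(also_viewed(co_views, "tent"))
```

Swapping view sessions for purchase baskets would give the "bought" variant of the same feature.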

Data products are at the heart of social networks. After all, what is a social network if not a huge dataset of users with connections to each other, forming a graph? Perhaps the most important product for a social network is something to help users connect with others. Any new user needs to find friends, acquaintances, or contacts. It’s not a good user experience to force users to search for their friends, which is often a surprisingly difficult task. At LinkedIn, we invented People You May Know (PYMK) to solve this problem. It’s easy for software to predict that if James knows Mary, and Mary knows John Smith, then James may know John Smith. (Well, conceptually easy. Finding connections in graphs gets tough quickly as the endpoints get farther apart. But solving that problem is what data scientists are for.) But imagine searching for John Smith by name on a network with hundreds of millions of users!
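
The James, Mary, and John Smith example can be sketched as a friends-of-friends lookup that ranks candidates by the number of mutual connections. This is only an illustration of the idea, not LinkedIn's PYMK; the `people_you_may_know` function and the tiny graph are assumptions, and the approach shown here would not scale to hundreds of millions of users without much more engineering.

```python
from collections import Counter


def people_you_may_know(graph, user, k=5):
    """Rank friends-of-friends of `user` by number of mutual connections."""
    direct = graph.get(user, set())
    mutuals = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone they are already connected to.
            if candidate != user and candidate not in direct:
                mutuals[candidate] += 1
    return mutuals.most_common(k)


# Hypothetical network: each person maps to the set of people they know.
graph = {
    "James": {"Mary", "Ann"},
    "Mary": {"James", "John Smith", "Ann"},
    "Ann": {"James", "Mary", "John Smith"},
    "John Smith": {"Mary", "Ann"},
}

print(people_you_may_know(graph, "James"))  # [('John Smith', 2)]
```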

Although PYMK was novel at the time, it has become a critical part of every social network’s offering. Facebook not only supports its own version of PYMK but also monitors the time it takes for users to acquire friends. Using sophisticated tracking and analysis technologies, they have identified the time and number of connections it takes to get a user to long-term engagement. If you connect with only a few friends, or add friends slowly, you won’t stick around for long. By studying the activity levels that lead to commitment, they have designed the site to decrease the time it takes for new users to connect with the critical number of friends.

Netflix does something similar in their online movie business. When you sign up, they strongly encourage you to add to the queue of movies you intend to watch. Their data team has discovered that once you add more than a certain number of movies, the probability you will be a long-term customer is significantly higher. With this data, Netflix can construct, test, and monitor product flows to maximize the number of new users who exceed the magic number and become long-term customers. They’ve built a highly optimized registration/trial service that leverages this information to engage the user quickly and efficiently.
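
One simple way to look for a "magic number" like this is to compare retention above and below a series of candidate thresholds. The sketch below assumes a hypothetical list of (movies added, stayed long-term) records; the data is invented for illustration and says nothing about Netflix's actual numbers or methods.

```python
def retention_by_threshold(users, thresholds):
    """For each threshold t, compare retention of users with >= t movies vs. fewer.

    `users` is a list of (movies_added, retained) pairs.
    Returns rows of (threshold, retention_at_or_above, retention_below).
    """
    rows = []
    for t in thresholds:
        above = [retained for n, retained in users if n >= t]
        below = [retained for n, retained in users if n < t]
        if not above or not below:
            continue  # no comparison possible if one group is empty
        rows.append((t, sum(above) / len(above), sum(below) / len(below)))
    return rows


# Made-up sample: (movies added to the queue during the trial, still a customer later)
users = [
    (0, False), (1, False), (2, False), (3, True),
    (4, False), (5, True), (6, True), (8, True),
]

for t, above, below in retention_by_threshold(users, [2, 4, 6]):
    print(f"queue >= {t}: {above:.0%} retained, otherwise {below:.0%}")
```

A gap between the two groups is only a correlation. Confirming that nudging users past the threshold actually causes retention takes a controlled experiment.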

Netflix, LinkedIn, and Facebook aren’t alone in using customer data to encourage long-term engagement. Zynga isn’t just about games: it constantly monitors who its users are and what they are doing, generating an incredible amount of data in the process. By analyzing how people interact with a game over time, they have identified tipping points that lead to a successful game. They know how the probability that users will become long-term players changes based on the number of interactions they have with others, the number of buildings they build in the first n days, the number of mobsters they kill in the first m hours, etc. They have figured out the keys to the engagement challenge and have built their product to encourage users to reach those goals. Through continued testing and monitoring, they have refined their understanding of these key metrics.

Google and Amazon pioneered the use of A/B testing to optimize the layout of a web page. For much of the web’s history, web designers worked by intuition and instinct. There’s nothing wrong with that, but if you make a change to a page, you owe it to yourself to ensure that the change is effective. Do you sell more product? How long does it take for users to find the result they’re looking for? How many users give up and go to another site? These questions can only be answered by experimenting, collecting the data, and doing the analysis, all of which are second nature to a data-driven company.
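
The analysis step of a basic A/B test can be sketched as a two-proportion z-test on conversion counts. This is one common textbook approach, not a description of how Google or Amazon run their experiments; the function name and the visitor counts below are assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns (z, p_value) for a two-sided test using the pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 5,000 visitors saw each page layout.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between layouts is unlikely to be chance alone, provided visitors were assigned at random and the sample size was fixed before looking at the results.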

Yahoo has made many important contributions to data science. After observing Google’s use of MapReduce to analyze huge datasets, they realized that they needed similar tools for their own business. The result was Hadoop, now one of the most important tools in any data scientist’s repertoire. Hadoop has since been commercialized by Cloudera, Hortonworks (a Yahoo spin-off), MapR, and several other companies. Yahoo didn’t stop with Hadoop; they have observed the importance of streaming data, an application that Hadoop doesn’t handle well, and are working on an open source tool called S4 (still in the early stages) to handle streams effectively.

Payment services, such as PayPal, Visa, American Express, and Square, live and die by their ability to stay one step ahead of the bad guys. To do so, they use sophisticated fraud detection systems to look for abnormal patterns in incoming data. These systems must be able to react in milliseconds, and their models need to be updated in real time as additional data becomes available. It amounts to looking for a needle in a haystack while the workers keep piling on more hay. We’ll go into more detail about fraud and security later in this article.
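
As a toy illustration of spotting abnormal patterns while the model keeps updating, the sketch below flags transaction amounts that sit far from a running mean, using Welford's online algorithm for the mean and variance. Real fraud systems use many more signals and far richer models; the class, function, threshold, and sample amounts here are all assumptions made up for the example.

```python
import math


class RunningStats:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def flag_anomalies(amounts, threshold=3.0, warmup=5):
    """Return indexes of amounts more than `threshold` std devs from the running mean."""
    stats = RunningStats()
    flagged = []
    for i, amount in enumerate(amounts):
        sd = stats.stddev()
        # Score each amount against the history seen so far, after a warm-up period.
        if stats.count >= warmup and sd > 0 and abs(amount - stats.mean) / sd > threshold:
            flagged.append(i)
        stats.update(amount)  # the model is updated with every new observation
    return flagged


# Hypothetical stream of transaction amounts for one card.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 950.0, 23.0]
print(flag_anomalies(amounts))  # [6]
```

Note that this sketch folds flagged amounts back into the running statistics, which inflates the variance afterward. Whether to exclude suspected fraud from the baseline is a design choice a real system would have to make deliberately.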

Google and other search engines constantly monitor search relevance metrics to identify areas where people are trying to game the system or where tuning is required to provide a better user experience. The challenge of moving and processing data on Google’s scale is immense, perhaps larger than at any other company today. To meet this challenge, they have had to invent novel technical solutions that range from hardware (e.g., custom computers) to software (e.g., MapReduce) to algorithms (e.g., PageRank), much of which has now percolated into open source software projects.

I’ve found that the strongest data-driven organizations all live by the motto “if you can’t measure it, you can’t fix it” (a motto I learned from one of the best operations people I’ve worked with). This mindset gives you a fantastic ability to deliver value to your company by:

  • Instrumenting and collecting as much data as you can. Whether you’re doing business intelligence or building products, if you don’t collect the data, you can’t use it.
  • Measuring in a proactive and timely way. Are your products and strategies succeeding? If you don’t measure the results, how do you know?
  • Getting many people to look at data. Any problems that may be present will become obvious more quickly — “with enough eyes all bugs are shallow.”
  • Fostering increased curiosity about why the data has changed or is not changing. In a data-driven organization, everyone is thinking about the data.

It’s easy to pretend that you’re data driven. But if you get into the mindset to collect and measure everything you can, and think about what the data you’ve collected means, you’ll be ahead of most of the organizations that claim to be data driven. And while I have a lot to say about professional data scientists later in this post, keep in mind that data isn’t just for the professionals: Everyone should be looking at the data.
