The “big data” buzzword is taking over in industry and government, and nearly every boardroom is asking its executive team: “What’s your big data strategy?”
But frankly, I don’t care how big your data is. All that matters is what you do with it. Let’s make this clear: “big data” is just a description of an engineering problem. When I started in analytics 20 years ago, my clients in banking and insurance had far too many customers to be able to store and analyse even their customer lists on a single PC – so back then we used clusters of large computers for this simple purpose. But today, the laptop I’m writing on right now has hundreds of times more power than those systems, and can complete sophisticated analysis on what just a few years ago would have been considered massive data sets. So why are we defining this “big data” concept based on constantly changing engineering constraints? Why not instead use a concept like “leveraging data”, and start with the question “what are the most important strategic issues in my organization, and can I use a data-driven approach to address them better?”
My company, Kaggle, has run hundreds of data analysis competitions involving over a hundred thousand data scientists tackling some of the world’s toughest problems, from organizations like Merck (automated drug discovery), GE (predicting flight arrival times), Ford (using car sensors to detect drowsy drivers), NASA (mapping the dark matter in the universe) and Allstate (predicting the cost of insurance claims). In every case, all previous scientific and industry benchmarks have been broken, by taking advantage of the world’s top minds competing against each other. In most cases, the top competitors choose to do the analysis on their laptop computers, using free, open source software.
Why are the world’s top data scientists using free software on their laptops to tackle the world’s toughest problems? It’s because “the right amount of data” always trumps “big data”. Although large organizations today have around a petabyte of stored data on average, the best analytic models are those that are built in a highly iterative, creative way, using just the subset of data necessary to get the job done. The world’s best data scientists nearly always use sampling methods to create a small analytical dataset of the right size.
So how big is “the right size”? It turns out that the answer is simple: it’s whatever will fit in RAM (random access memory) on your laptop. That’s nearly certain to be plenty big enough to be able to extract all the rich insights in your data, but small enough that you don’t have to deal with the huge overhead and complexity of streaming algorithms and cluster or map/reduce computing.
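As a rough illustration of drawing a laptop-sized sample from a data set too large to load in full, here is a minimal sketch using reservoir sampling. The article does not name a specific sampling method, so this technique is an assumption chosen for illustration; the `customer_rows` generator and its fields are hypothetical stand-ins for reading a large file or database table.

```python
import random
from typing import Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    The stream is read once and only k rows are held in memory,
    so the full data set never has to fit in RAM.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def customer_rows(n: int) -> Iterator[Dict[str, int]]:
    """Hypothetical stand-in for streaming rows from a large source."""
    for i in range(n):
        yield {"customer_id": i, "balance": i % 1000}


# Draw a 10,000-row analytical sample from a 200,000-row stream.
sample = reservoir_sample(customer_rows(200_000), k=10_000)
```

The sample size `k` is the knob to tune: pick the largest value whose resulting data set still fits comfortably in memory alongside the models you want to try.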
Through analysing the results of our competitions and the research coming out of the machine learning community, we’ve discovered that there are four specific areas where you really do need “big data” – in fact, you need as much data as you can get. These areas are:
- understanding written and spoken human language
- understanding the contents of images
- analysing videos
- solving problems using data relating to genomes
The good news is that even these problems are becoming more accessible. For instance, Google recently released word2vec, a system in which it used sophisticated machine learning to analyse the meaning of over 1 million concepts expressed in 100 billion words of newspaper articles, and turned each concept into a sequence of 1,000 numbers. These numbers encode each concept in a way that can be used directly in “small data” algorithms, so even the meaning of natural language documents can now be analysed on laptop computers.
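To make the idea of "concepts as sequences of numbers" concrete, here is a minimal sketch of how such vectors might be compared on a laptop using cosine similarity. The four-dimensional vectors below are made up for illustration and are not real word2vec output; the `most_similar` helper is likewise a hypothetical name, not part of any word2vec release.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumption): real embeddings would have
# hundreds or thousands of dimensions learned from text.
toy_vectors: Dict[str, Sequence[float]] = {
    "bank": [0.9, 0.1, 0.3, 0.0],
    "insurer": [0.8, 0.2, 0.4, 0.1],
    "galaxy": [0.0, 0.9, 0.1, 0.8],
}


def most_similar(word: str, vectors: Dict[str, Sequence[float]]) -> str:
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )


nearest = most_similar("bank", toy_vectors)
```

Once each concept is a fixed-length list of numbers, ordinary in-memory techniques such as nearest-neighbour lookup or regression can work with them directly, which is what keeps the problem laptop-sized.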
So does this mean you shouldn’t be investing in data? No. In fact, the power of data is still greatly under-appreciated by most organizations. When you look at the success of companies that are truly data-driven – like Google, which uses machine learning algorithms to rank search results, select advertisements, translate documents and even analyse the effectiveness of interview questions – it’s very clear that there is no industry that will not be disrupted by the new breed of data-driven companies.
What it does mean is that you should stop waiting to leverage your data and start asking what you could do with the data you already have in the format it’s already in.
Author: Jeremy Howard is the President of Kaggle, a World Economic Forum Technology Pioneer company. He will take part in the session Forum Debate: Big Data or Big Hype? at the World Economic Forum’s Annual Meeting of New Champions 2013 in Dalian.
Image: Employees are seen at their workstations in India REUTERS/Vivek Prakash.