You’ve heard the term before. Maybe from me. Big Data. It’s a catchphrase of our time. But have you ever asked what it means? Google’s search engine defines it as a noun referring to “extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.” And Google should know, right? It’s their day job.
But I had another definition proposed to me at a workshop on the topic last week. Roger Downing of the Hartree Centre in Warrington, part of the Science and Technology Facilities Council, described big data as datasets that were “uncomfortably large to deal with on a single machine”. That’s one of the reasons why the Hartree Centre exists and why I and a group of other PhD students were being treated to a workshop on big data there – they have plenty of machines to deal with the datasets comfortably. But over the course of the week, I began to wonder whether big data was about not just the size of the datasets, but also the data analysis decisions that may be uncomfortable for individual humans to deal with.
Certainly the volume of data and the speed with which it’s generated is staggering for humans or machines. Even though it has to be translated at some point into a plethora of ones and zeros, the datasets themselves are made up of numbers, measurements, text, images, audio and visual recordings, shape files and mixed formats collected and stored in a variety of computer programming languages. The datasets come from sources around the world and are produced by scientists, machines, transactions, interactions and ordinary people. Therefore, it is no surprise that some of the data is meticulous, some is missing and some is mendacious.
And all of it only has value if it can be analysed in such a way that can help people in society make better decisions more efficiently and achieve their goals, whether they be health and well-being or the bottom line. So if the analysis is uncomfortable for a single machine, then big data analytics requires tools that enable ‘cluster computing’ with processing in parallel and allowances for ‘fault-tolerance’ or duplication of original and subsequent datasets so that information is not corrupted or lost during processing. The performance of such tools are designed and judged for speed, efficiency, ease of use, compatibility and unity, i.e. the more different data types the tool can handle, programming languages it can interact with, and variety of output it can produce within a unified framework, the better.
Of course tools must be used by well-trained data scientists, because the analysis of data and its value depends upon asking the right questions. Those right questions are most likely to be asked if data scientists not only have statistical and computer science skills, but also expertise in their area of study and a combination of creativity and curiosity that seeks new paths for research. Which again, is why we were there, as it is felt in some circles that it may be easier to offer training in statistics and computer programming to those working and researching within specialist areas than to train statisticians and computer scientists in all the disciplines they may encounter in their work with big data. Furthermore, patterns and predictions coming out of big data analysis are not helpful if the data has not been cleaned first and checked for its accuracy, consistency and completeness, a much easier task with specialist knowledge at your disposal. Machines cannot learn if they are not trained on structured and then validated data. And people cannot trust the output without control over the input and an understanding of how data was transformed into information.
And so there is the issue of comfort again. The technology now exists to economically store big datasets and try to merge them even if there is no certainty that added value will result. Machines analyse big data and offer potential audiences instead of actual ones, probabilities and levels of confidence instead of facts. Machine learning and cognitive computing utilise big data to create machine assistants, enhancing and accelerating human expertise, rather than machine workers, undertaking mundane tasks for humans. Thus we enter a brave new world. But I still can’t say I’m entirely comfortable.