Big Data Busting

You’ve heard the term before. Maybe from me. Big Data. It’s a catchphrase of our time. But have you ever asked what it means? Google’s search engine defines it as a noun referring to “extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.” And Google should know, right? It’s their day job.

But I had another definition proposed to me at a workshop on the topic last week. Roger Downing of the Hartree Centre in Warrington, part of the Science and Technology Facilities Council, described big data as datasets that were “uncomfortably large to deal with on a single machine”. That’s one of the reasons why the Hartree Centre exists and why I and a group of other PhD students were being treated to a workshop on big data there – they have plenty of machines to deal with the datasets comfortably. But over the course of the week, I began to wonder whether big data was not just about the size of the datasets, but also about the data analysis decisions that may be uncomfortable for individual humans to deal with.

Certainly the volume of data and the speed with which it’s generated is staggering for humans or machines. Even though it has to be translated at some point into a plethora of ones and zeros, the datasets themselves are made up of numbers, measurements, text, images, audio and visual recordings, shapefiles and mixed formats, collected and stored in a variety of formats and processed with a variety of computer programming languages. The datasets come from sources around the world and are produced by scientists, machines, transactions, interactions and ordinary people. Therefore, it is no surprise that some of the data is meticulous, some is missing and some is mendacious.

And all of it only has value if it can be analysed in a way that helps people in society make better decisions more efficiently and achieve their goals, whether those goals be health and well-being or the bottom line. So if the analysis is uncomfortable for a single machine, then big data analytics requires tools that enable ‘cluster computing’, with processing in parallel and allowances for ‘fault tolerance’, i.e. duplication of original and subsequent datasets so that information is not corrupted or lost during processing. Such tools are designed, and their performance judged, on speed, efficiency, ease of use, compatibility and unity: the more data types the tool can handle, programming languages it can interact with, and varieties of output it can produce within a unified framework, the better.
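
To make the ‘cluster computing’ and ‘fault tolerance’ ideas a little more concrete, here is a minimal sketch in Python of the pattern such tools rely on: split a dataset into chunks, process the chunks in parallel, and retry a failed chunk rather than losing or corrupting the whole job. The word-count task, the chunk size and the retry policy are my own illustrative assumptions, not a description of any particular big data tool.

```python
# A minimal sketch of parallel, fault-tolerant chunk processing.
# The data, the word-count task and the retry policy are illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def count_words(chunk):
    """Process one chunk of text records into word counts."""
    counts = Counter()
    for record in chunk:
        counts.update(record.lower().split())
    return counts

def process_with_retry(chunk, attempts=3):
    """Re-run a failed chunk a few times instead of abandoning it."""
    for attempt in range(attempts):
        try:
            return count_words(chunk)
        except Exception:
            if attempt == attempts - 1:
                raise

if __name__ == "__main__":
    records = ["Big data is big", "data about data", "small data too"] * 1000

    # Split the dataset into chunks that are each 'comfortable' for one worker.
    chunk_size = 500
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    totals = Counter()
    with ProcessPoolExecutor() as pool:          # one worker per CPU core
        for partial in pool.map(process_with_retry, chunks):
            totals.update(partial)               # combine partial results

    print(totals.most_common(3))
```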

Of course tools must be used by well-trained data scientists, because the analysis of data and its value depends upon asking the right questions. Those right questions are most likely to be asked if data scientists not only have statistical and computer science skills, but also expertise in their area of study and a combination of creativity and curiosity that seeks new paths for research. Which, again, is why we were there: it is felt in some circles that it may be easier to offer training in statistics and computer programming to those working and researching within specialist areas than to train statisticians and computer scientists in all the disciplines they may encounter in their work with big data. Furthermore, patterns and predictions coming out of big data analysis are not helpful if the data has not first been cleaned and checked for accuracy, consistency and completeness, a much easier task with specialist knowledge at your disposal. Machines cannot learn if they are not trained on structured and then validated data. And people cannot trust the output without control over the input and an understanding of how data was transformed into information.
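
As a small illustration of those cleaning checks, the sketch below uses pandas to look at completeness (missing values), consistency (duplicates and unparseable timestamps) and accuracy (physically impossible values). The column names, example records and thresholds are hypothetical, invented purely for the example.

```python
# A minimal sketch of checking completeness, consistency and accuracy.
# The columns and values are hypothetical, not from any real dataset.
import pandas as pd

df = pd.DataFrame({
    "trip_id": [1, 2, 2, 3, 4],
    "distance_km": [5.2, -1.0, -1.0, 12.8, None],
    "start_time": ["2016-07-01 08:05", "2016-07-01 08:20",
                   "2016-07-01 08:20", "not recorded", "2016-07-01 09:10"],
})

# Completeness: how much of each column is actually populated?
print(df.notna().mean())

# Consistency: remove duplicate records and flag unparseable timestamps.
df = df.drop_duplicates(subset="trip_id")
df["start_time"] = pd.to_datetime(df["start_time"], errors="coerce")

# Accuracy: drop physically impossible values rather than silently keeping them.
df = df[df["distance_km"].isna() | (df["distance_km"] >= 0)]

print(df)
```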

And so there is the issue of comfort again. The technology now exists to economically store big datasets and try to merge them even if there is no certainty that added value will result. Machines analyse big data and offer potential audiences instead of actual ones, probabilities and levels of confidence instead of facts. Machine learning and cognitive computing utilise big data to create machine assistants, enhancing and accelerating human expertise, rather than machine workers, undertaking mundane tasks for humans. Thus we enter a brave new world. But I still can’t say I’m entirely comfortable.

Data x3

Data, Data, Data. Does it have the same cachet as Location, Location, Location? Big data. Open data. Standardised data. Personal data. If it doesn’t yet, it soon will.

I attended the Transport Practitioners’ Meeting 2016 last week and the programme was full of presentations and workshops available to any delegate with an interest in data, including me. With multiple, parallel sessions, I could have filled my personal programme twice over.

Transport planning has always been rich in the production and use of data. The difference now is that data is producing itself, the ability for the transport sector to mine data collected for other purposes is growing, and the datasets themselves are multiplying. Transport planners are challenged to keep up, and to keep to their professional aims of using the data for the good of society.

The scale of this challenge is recognised by Research Councils and is probably why I won a studentship to undertake a PhD project that must use big data to assess environmental risk and resilience. Thus my particular interest in finding all the inspiration I could at the conference.

Talk after talk, including my own presentation on bike share, mentioned the trends in data that will guide transport planning delivery in the future, but more specific sources of data were also discussed.

Some were not so much new as newly accessible. In the UK, every vehicle must be registered to an owner and, once it is three years old, must pass an annual test, the MOT. A group of academics has been analysing this data for the government, in part to determine what benefits its use might bring. Our workshop discussion at the conference agreed that the possibilities were extensive.

Crowd-sourced data, on the other hand, could be called new; collected on social media platforms or by apps like Waze. Local people using local transport networks share views on the quality of operation, report potholes, raise issues, and follow operators’ social media accounts to get their personalised transport news. This data is the technological successor to anecdote; still qualitatively rich, but now quantitatively significant. It helps operators and highways authorities respond to customers more quickly. Can it also help transport professionals plan strategically for the future?

Another new source of data is records of ‘mobile phone events’ – data collected by mobile phone network operators that can be used to determine movement, speed, duration of stay, and so on. There are still substantial flaws in translating this data for transport purposes, particularly the significant under-counting of short trips and the extent of verification required. However, accuracy will increase in time, and apps designed to track travel, such as Strava and Moves, can already be analysed with much greater confidence.
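
To give a flavour of how such event records might be turned into movement estimates, the sketch below computes a straight-line speed between two consecutive sightings of the same device using the haversine formula. The coordinates and timestamps are invented for illustration; real processing by network operators is far more involved than this.

```python
# A rough sketch of deriving a speed estimate from two consecutive phone 'events'.
# The coordinates and timestamps are invented for illustration only.
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Two sightings of the same device on different cell masts (hypothetical).
events = [
    (datetime(2016, 7, 1, 8, 0), 53.3900, -2.5970),
    (datetime(2016, 7, 1, 8, 15), 53.4808, -2.2426),
]

(t1, lat1, lon1), (t2, lat2, lon2) = events
distance = haversine_km(lat1, lon1, lat2, lon2)
hours = (t2 - t1).total_seconds() / 3600
print(f"{distance:.1f} km in {hours:.2f} h -> about {distance / hours:.0f} km/h")
```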

Even more reliable are the records now produced automatically by ticketing systems on public transport, sensors in roads and traffic signals, cameras, lasers, GPS trackers and more. Transport is not only at the forefront of machine learning, but the ‘Internet of Things’ is becoming embedded in its infrastructure. Will such data eventually replace traditional traffic counts and surveys, informing reliable models, accurate forecasts and appropriate interventions?

It is certainly possible that we will be able to plan for whole populations using population-scale data sources collected continuously over time, rather than relying on sample surveys of a few hundred people or snapshots of a short period of ‘neutral’ time.

However…

Despite attempts to stop it (note the impossibility of ignoring Brexit in any field; its shadow hung over the conference proceedings), globalisation is here to stay and data operates in an international ecosystem. Thus, it cannot be used to its full potential without international regulations on sharing and privacy, and standards on format and availability.

Transport planners also need the passion and the skills to make data work for us. Substantial analysis of new datasets is required to identify their utility and potential, which calls not only for statistical and modelling training, but also for instruction in analytical methods. People with such skills are in limited supply, as is the time and money for both training and analysis of new datasets.

Therefore, perhaps the most important lesson is that sharing best practice and successful data projects at conferences like TPM2016 matters more than ever.