Data generated by online social networks is now being used for a wide range of purposes. However, this ‘social data’ may not be as reliable as many researchers would like to think. An article published recently in the journal Science highlights some of the issues around social media data collection and analysis.

Social Networks: Reliable or Unreliable Source of Research Data?

What if the huge volumes of Big Data coming from Facebook, Twitter and the other social platforms were seriously flawed? An article by Derek Ruths, Assistant Professor at McGill University’s School of Computer Science (Montreal), and Jürgen Pfeffer of Carnegie Mellon’s Institute for Software Research (Pittsburgh) argues that they may well be. For some years now, data drawn from social networks has been used by both academic sociologists and brand marketers to help them understand human behaviour. Many projects and proposals have been based on information sourced in this way: Kristina Lerman, Associate Research Professor at the University of Southern California’s Department of Computer Science, suggested that Facebook should be used to establish networks of friends as a way of stopping the spread of the Ebola virus. Meanwhile in Boston, researchers have been scrutinising opinions expressed online. However, other computer scientists, Ruths and Pfeffer among them, point out that this type of data is very difficult to use, and their article flags up some of the problems of relying on data culled from social media.

Networks tend to skew results

One major problem stems from the dubious reliability of the information generated by the networks themselves. First, Ruths and Pfeffer stress that the members of a network are far from being truly representative of the general population. A single extreme example serves to illustrate the risk of skewed data: in 2012, according to the startup Buzz Referral, 80% of Pinterest users were women. If each social network attracts its own particular following, any survey sample drawn from it may lack balance. Secondly, Ruths and Pfeffer argue that the design of each platform strongly influences how people’s views will be interpreted. For example, the famous Facebook ‘Like’ button purports to indicate satisfaction or affinity, but the absence of a ‘Dislike’ option makes it much more difficult to gauge true feelings. Scientists should therefore use this sort of data with great caution, they stress. Thirdly, even when data is freely accessible, it has already been filtered by those managing the network, without this necessarily being apparent. So much for the nature of social networks; there are, however, other more insidious forces at work which may entirely vitiate ‘social’ data.
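To see how an unrepresentative user base can distort a result, consider a small illustrative sketch. It takes the 80/20 split between women and men reported for Pinterest and compares a naive average with one reweighted to a population assumed to be split 50/50. The interest rates per group, the 50/50 population figure and the function names are all hypothetical, chosen purely for illustration; reweighting of this kind is a standard survey correction, not a method described by Ruths and Pfeffer.

```python
# Illustrative only: the rates and the 50/50 population split are assumptions.
groups = {
    # sample_share: share of the platform's users (80/20 figure from the article)
    # population_share: assumed share in the general population
    # rate: hypothetical share of each group expressing interest in a product
    "women": {"sample_share": 0.80, "population_share": 0.50, "rate": 0.60},
    "men": {"sample_share": 0.20, "population_share": 0.50, "rate": 0.30},
}


def naive_rate(groups):
    """Average the outcome over the sample exactly as it was collected."""
    return sum(g["sample_share"] * g["rate"] for g in groups.values())


def reweighted_rate(groups):
    """Weight each group by its assumed share of the general population."""
    return sum(g["population_share"] * g["rate"] for g in groups.values())


if __name__ == "__main__":
    print(f"Naive estimate from the platform sample: {naive_rate(groups):.2f}")
    print(f"Estimate reweighted to the population:   {reweighted_rate(groups):.2f}")
```

With these made-up figures the platform sample suggests 54% interest, whereas the reweighted estimate is 45%. Reweighting only corrects for the characteristics that have been measured; it cannot repair the other distortions the authors describe, such as platform design or hidden filtering.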

Watch out for bots and spammers!

Bots and spammers, which masquerade in huge numbers as normal users on social media, are often mistakenly incorporated into measurements and predictions of human behaviour, and so become a source of serious errors. Lastly, Ruths and Pfeffer reckon that researchers often report results for groups of easy-to-classify users, topics and events, making the new ‘social’ analysis methods seem more accurate than they really are. For instance, while studies focusing on politically active users on Twitter claim that political affiliation can be predicted from tweets with 90% accuracy, this success rate falls to 65% when all Twitter users are included. So at the end of the day it is the researchers’ methods that should be called into question. “The common thread in all these issues is the need for researchers to be more acutely aware of what they’re actually analysing when working with social media data,” concludes Derek Ruths.

By Guillaume Scifo