The Big Data Paradox and the Limits of Modern Polling

A look back at the Masterclass with Michael Bailey
Masterclass with Michael Bailey © Students from the Digital stream

On Wednesday, November 5, 2022, the Sciences Po School of Public Affairs had the honor of welcoming Dr. Michael Bailey, Colonel William J. Walsh Professor at the McCourt School of Public Policy of Georgetown University.  

Organized by the students of the Digital, New Technologies and Public Policy stream, the masterclass explored the subject of Dr. Bailey’s upcoming book on public opinion polling. Specifically, it drew our attention to what the literature calls the Big Data Paradox: the counterintuitive claim that, when the sample of respondents is non-random, a survey can become less reliable as its sample size grows. How can this be? Reality, as always, is a little more complex, and Dr. Bailey took us on a journey that put the argument in context and explored its implications.

Let us start with an example from recent history: the run-up to the 2016 U.S. presidential election. You might remember feeling surprised when Trump was elected, or even watching the results come in with disbelief, as the polls conducted at the time predicted that Hillary Clinton would succeed Barack Obama in the White House. Still, as we all easily recall, Donald Trump was the one to occupy it for the next four years. The same kind of surprise almost happened in 2020, although many might not have noticed because the predicted winner did win. Polls had Biden up by 8-15 percentage points in many crucial states. An easy victory, right? Well, the election was much closer than predicted, with Biden coming out on top by less than 1% in many of those same states.

How can this be? How can pollsters get it so badly wrong, and twice in a row?

It turns out that polls have never been immune to such failures, even those employing very large samples. Between 1924 and 1932, the Literary Digest conducted polls sampling millions of people, which predicted the outcome of presidential races quite accurately. However, the fairy tale ended in 1936. Using the same methods as in previous years, the magazine predicted that Republican Alf Landon would come out on top, and by a large margin. The exact opposite happened: Roosevelt swept all but two states and won the White House with 61% of the vote. The magazine's forecast was a monumental miscalculation.

Polling is a statistical method for gathering data about public opinion. As such, its results depend heavily on how the sample is constructed. A perfectly random sample, for instance, allows us to project results onto the overall population; the same can be said of carefully constructed samples that mirror the population’s composition. But what happens when a supposedly random sample is not representative of the population? In that case, our predictions will most likely turn out to be wrong.

That is precisely what happened in the examples above. The sample was, in fact, biased and gave a distorted picture of what the outcome of the election would be. This is mainly due to two factors: unrepresentative quotas and non-response bias. As Dr. Bailey explained, while the former can be addressed by weighting responses according to demographics, the latter is much trickier (and limits how effective the weights can be). The problem, which gives rise to the paradox, stems from people's different attitudes towards surveys. Some like to respond, and some do not and will not answer.
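The weighting remedy mentioned above can be illustrated with a minimal sketch. The group names, sample counts and support rates below are invented for illustration and do not come from the masterclass; the function name `poststratified_share` is likewise hypothetical. Each group's responses are scaled by the ratio between its share of the population and its share of the sample.

```python
def poststratified_share(sample_counts, sample_support, population_shares):
    """Reweight group-level survey results to match known population shares."""
    n = sum(sample_counts.values())
    # Weight = (share of the group in the population) / (share in the sample).
    weights = {
        group: population_shares[group] / (sample_counts[group] / n)
        for group in sample_counts
    }
    weighted_yes = sum(
        weights[g] * sample_counts[g] * sample_support[g] for g in sample_counts
    )
    weighted_n = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return weighted_yes / weighted_n


# Hypothetical survey: older respondents are heavily over-represented.
sample_counts = {"under_50": 200, "50_plus": 800}
sample_support = {"under_50": 0.60, "50_plus": 0.40}   # support within each group
population_shares = {"under_50": 0.50, "50_plus": 0.50}

n_total = sum(sample_counts.values())
raw_share = sum(sample_counts[g] * sample_support[g] for g in sample_counts) / n_total
weighted_share = poststratified_share(sample_counts, sample_support, population_shares)

print(round(raw_share, 3), round(weighted_share, 3))
```

In this toy case the unweighted estimate is pulled towards the over-represented group, and weighting corrects it. The correction only works, however, if the people who answer within each group resemble those in the same group who do not.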

In some cases, a person's willingness to respond to a survey is unrelated to what they would answer and can be ignored. In other cases, it is highly correlated with what the person would answer, and it distorts the survey’s results.

We call the former ignorable non-response and the latter non-ignorable non-response.
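The difference between the two cases comes down to simple arithmetic, sketched here under made-up assumptions: half the population would answer "yes", and the response rates are illustrative numbers rather than figures from the talk. The helper `share_among_respondents` is a hypothetical name.

```python
def share_among_respondents(true_share, response_rate_yes, response_rate_no):
    """Expected share of 'yes' answers among the people who actually respond."""
    responding_yes = true_share * response_rate_yes
    responding_no = (1 - true_share) * response_rate_no
    return responding_yes / (responding_yes + responding_no)


# Ignorable: both groups respond at the same (low) rate, so the estimate is unbiased.
ignorable = share_among_respondents(0.50, 0.05, 0.05)

# Non-ignorable: 'yes' people respond at 4%, 'no' people at 6%.
non_ignorable = share_among_respondents(0.50, 0.04, 0.06)

print(round(ignorable, 3), round(non_ignorable, 3))
```

With equal response rates the respondents mirror the population. With a gap of only two percentage points in response rates, the expected estimate falls to roughly 40% against a true 50%, and collecting more responses of the same kind does not shrink that gap.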

This problem is worsened by the extremely low response rates recorded by all kinds of surveys in recent years. The New York Times, for instance, rarely gets responses from more than 5% of the people it reaches out to, and more often than not the figure is 1%. As should be clear, when non-response is non-ignorable, this undermines the reliability of the results. In other words, the randomness built into the selection of people we reach out to is rendered much less useful by the combination of low response rates and non-ignorable non-response.

Because it is expensive to keep trying to get responses from random samples, many well-known polling firms have given up on random sampling and instead use non-probability samples made up of people who clicked on web ads. These samples can yield huge numbers of responses, but if the non-response is non-ignorable (meaning, for example, that Trump supporters are less likely to opt in even after accounting for demographics), the results may be systematically wrong. What should we do then?

The solution might be much simpler than we expect. More attention has to be paid to deciding who is surveyed and to making sure that the unresponsive portion of the population is at least somewhat represented. This method, called random outreach, relies on the fact that smaller populations are impacted less by non-ignorable non-response than larger ones. This can be seen in the example Dr. Bailey used to start us off: the estimation of vaccination rates. One survey run on Facebook had more than 250,000 people voluntarily filling it in on the platform, while another, smaller one received only 1,000 answers. Which one was more accurate? The second one: thanks to its random outreach, it was better able to mitigate the effects of non-response bias.
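A small simulation can make the comparison concrete. It is only a sketch under stated assumptions, not a reconstruction of either real survey: the population size, the true vaccination share and the opt-in probabilities are invented, and the random-outreach poll is idealised by assuming that everyone who is contacted answers. The function name `compare_polls` is hypothetical.

```python
import random


def compare_polls(seed=0, population_size=200_000, true_share=0.5):
    rng = random.Random(seed)
    # 1 = vaccinated, 0 = not vaccinated (a made-up population).
    population = [1 if rng.random() < true_share else 0 for _ in range(population_size)]
    actual = sum(population) / population_size

    # Large opt-in poll: vaccinated people are ASSUMED more likely to take part
    # (30% versus 20%), which is a non-ignorable form of non-response.
    opt_in = [x for x in population if rng.random() < (0.30 if x == 1 else 0.20)]

    # Small random-outreach poll: 1,000 people drawn at random,
    # all ASSUMED to answer (an idealisation).
    outreach = rng.sample(population, 1000)

    return {
        "actual": actual,
        "opt_in_size": len(opt_in),
        "opt_in_estimate": sum(opt_in) / len(opt_in),
        "outreach_estimate": sum(outreach) / len(outreach),
    }


result = compare_polls()
opt_in_error = abs(result["opt_in_estimate"] - result["actual"])
outreach_error = abs(result["outreach_estimate"] - result["actual"])
print(result)
print("opt-in error:", round(opt_in_error, 3), "outreach error:", round(outreach_error, 3))
```

Under these assumptions the opt-in poll collects tens of thousands of answers yet lands much further from the truth than the poll of 1,000, because its error comes from who chooses to respond and not from sampling noise.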

Polling is a constantly developing field of research, but we can see how one thing does not seem to change. Even in the era of big data, it continues to be the case that the method of gathering the data is more important than the size of the data. 

Article written by Giovanni Maggi

Video © Thomas Arrivé/Sciences Po
