14.10.2022

Is Big Data Better Data? A Conference by Dr. Michael Bailey on Modern Polling

On Wednesday, 5 October, students of Sciences Po’s School of Public Affairs, specifically those in the Digital, New Technology and Public Policy stream, were invited to a masterclass by Dr. Michael Bailey, moderated by Dominique Cardon, Director of the médialab. A Stanford graduate and professor at Georgetown University, Dr. Bailey works on the future of modern polling and will soon release a book on the topic. He was thrilled to meet students who may, in their future careers, rely on big data and sophisticated new polling methods to predict election results and make public policy decisions. Not all of his fellow specialists share his views, but he did not hesitate to voice them: “more data might not always be better”.

Quantity is not always better than quality

Dr. Michael Bailey opened his lecture with a striking example. Of four polls conducted to estimate how many Americans were getting vaccinated against Covid, the one closest to “reality” (meaning the figures from the Centers for Disease Control and Prevention) was the one with the least data: only about a thousand respondents, against more than 250,000 for the weekly Facebook survey. Another example is the polling for the 2020 US presidential election: the polls overstated Joe Biden’s lead over Donald Trump, when the actual results were quite close. The guest speaker added, “We should be nervous when we look at polls” and “It doesn’t mean they’re always wrong, but when they’re wrong, they’re really wrong”.
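The vaccination example can be illustrated with a small simulation. This is only a sketch under invented assumptions, not a reconstruction of the polls Dr. Bailey cited: the population size, the 60% true rate, and the response rates (30% for vaccinated people, 15% for the others) are all hypothetical numbers chosen for illustration.

```python
import random


def compare_polls(seed=0):
    """Compare a small random sample with a large self-selected one.

    All numbers below are illustrative assumptions, not real survey data.
    """
    rng = random.Random(seed)
    population_size = 200_000
    true_rate = 0.6  # assumed share of vaccinated people

    # 1 = vaccinated, 0 = not vaccinated
    population = [1 if rng.random() < true_rate else 0
                  for _ in range(population_size)]
    truth = sum(population) / population_size

    # Small poll: 1,000 people drawn at random, everyone answers.
    small = rng.sample(population, 1000)
    small_estimate = sum(small) / len(small)

    # Large poll: anyone may answer, but vaccinated people are assumed
    # to be twice as likely to respond as unvaccinated people.
    large = [y for y in population
             if rng.random() < (0.30 if y == 1 else 0.15)]
    large_estimate = sum(large) / len(large)

    return truth, small_estimate, len(large), large_estimate


if __name__ == "__main__":
    truth, small_est, large_n, large_est = compare_polls()
    print(f"truth: {truth:.3f}")
    print(f"random sample of 1000: {small_est:.3f}")
    print(f"self-selected sample of {large_n}: {large_est:.3f}")
```

Under these assumptions the self-selected poll collects tens of thousands of answers yet overstates the vaccination rate, while the small random sample stays close to the truth. Adding more respondents of the same kind would not correct the larger poll.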

The professor went on to review the history of polling methods used to predict presidential elections, each of which produced errors at some point: mass polling by the Literary Digest, quota sampling by Time Magazine, phone interviews, and then the rise of internet polling to offset the growing non-response rate of other methods. The most popular methods today are probabilistic polling, with weights that adjust for non-response patterns, and internet panels, which rely on quotas and weights.

One of Dr. Michael Bailey’s findings, which is not popular in his field, is that “sample and population size matter”. The widespread notion that a sample of a given size gives equally accurate results for a population of any size does not always hold. Dr. Bailey took the time to test and confirm his intuition. Data quality is more important than data quantity, and population size does affect data quality: “Big data quality problem is a big problem”. He gave the example of a poll with 20 respondents in a huge city: the people willing to reply could be the odd ones and not very representative. A poll with 20 respondents in a village of 1,000 people, even if they are still the odd ones who are interested in participating, would be more representative of its population.
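The city-versus-village example can be sketched in a few lines of Python. The model here is a deliberately extreme assumption made for illustration, not Dr. Bailey’s own method: each person has a hypothetical “willingness” score, opinions are correlated with it (`rho`), and only the 20 most willing people answer.

```python
import math
import random


def top_willing_error(population_size, respondents=20, rho=0.8, seed=0):
    """Error of a poll answered only by the most willing people.

    Assumed model: willingness and opinion are standard normal
    variables with correlation rho.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1 - rho ** 2)

    people = []
    for _ in range(population_size):
        willingness = rng.gauss(0, 1)
        opinion = rho * willingness + noise_scale * rng.gauss(0, 1)
        people.append((willingness, opinion))

    true_mean = sum(opinion for _, opinion in people) / population_size

    # Keep only the most willing people as respondents.
    people.sort(key=lambda person: person[0], reverse=True)
    poll = people[:respondents]
    poll_mean = sum(opinion for _, opinion in poll) / respondents

    return abs(poll_mean - true_mean)


if __name__ == "__main__":
    print("village of 1,000:", round(top_willing_error(1_000), 2))
    print("city of 100,000:", round(top_willing_error(100_000), 2))
```

With the same 20 respondents, the error is larger in the city: the 20 most willing people out of 100,000 are more extreme, and in this model more unrepresentative, than the 20 most willing out of 1,000.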

Don’t ignore the non-ignorable non-response

One of the main focuses of the guest speaker’s address was the importance of non-response. The fact that non-response has become widespread may or may not be a major issue. There are two types of non-response: ignorable and non-ignorable. The major danger arises when “the decision to respond is related to the content of the response”. When it isn’t, the problem can be “dealt with by weighing” some of the data. Dr. Michael Bailey represents ignorable non-response and poll results with a graph in the shape of a “flat fish”: in this case, non-response does not affect the overall result.

The difficulty arises when non-response is non-ignorable, that is, when the answers non-respondents would have given would have changed the poll result substantially. This is the kind of non-response that explains errors such as those in the polls before the 2016 and 2020 US presidential elections. The corresponding graph presented by Dr. Michael Bailey is a “tilted fish”, in which the tail of the fish drastically changes the overall picture.
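The difference between the two kinds of non-response can be sketched with a simulated poll. Everything in this example is an assumption made for illustration: the two groups (“young” and “old”), their support rates, their response rates, and the simple weighting by known group shares, which stands in for the more elaborate weighting real pollsters use.

```python
import random


def weighted_estimate(respondents, group_shares):
    """Average each group's answers, then combine the group means
    using the known population share of each group."""
    total = 0.0
    for group, share in group_shares.items():
        answers = [y for g, y in respondents if g == group]
        total += share * sum(answers) / len(answers)
    return total


def run_poll(nonignorable, seed=1):
    """Return (truth, raw estimate, weighted estimate) for one poll."""
    rng = random.Random(seed)
    group_shares = {"young": 0.5, "old": 0.5}       # assumed known
    support_rate = {"young": 0.7, "old": 0.3}       # assumed opinions
    base_response = {"young": 0.05, "old": 0.20}    # assumed response rates

    population = []
    for _ in range(100_000):
        group = "young" if rng.random() < 0.5 else "old"
        answer = 1 if rng.random() < support_rate[group] else 0
        population.append((group, answer))
    truth = sum(y for _, y in population) / len(population)

    respondents = []
    for group, answer in population:
        p = base_response[group]
        if nonignorable and answer == 1:
            p *= 2  # supporters are twice as likely to respond
        if rng.random() < p:
            respondents.append((group, answer))

    raw = sum(y for _, y in respondents) / len(respondents)
    weighted = weighted_estimate(respondents, group_shares)
    return truth, raw, weighted


if __name__ == "__main__":
    for label, flag in [("ignorable", False), ("non-ignorable", True)]:
        truth, raw, weighted = run_poll(flag)
        print(f"{label}: truth={truth:.3f} raw={raw:.3f} "
              f"weighted={weighted:.3f}")
```

When the decision to respond depends only on the group, the raw figure is off but weighting brings it back close to the truth. When the decision depends on the answer itself, the weighted figure stays wrong, because within each group the respondents no longer resemble the non-respondents.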

Dr. Bailey explained that one way to manage this issue is to “look harder for non-ignorable non-response”: make a randomised outreach effort, asking the questions again, differently and in a more engaging way, to see whether the poll results change. His metaphor was a professor asking his students whether they have questions. On the first day, the professor says he can take only two questions. The students who raise their hands are the most motivated to express their opinions. The next day, the professor says he has plenty of time and can take questions from everyone: seven students ask questions. If the questions from day 1 are of the same kind as those from day 2, then the non-response on day 1 was ignorable; if they are different, it was non-ignorable, and it was very important to hear those questions.
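The logic of a randomised outreach effort can be sketched as follows. This is a toy model under stated assumptions, not the procedure Dr. Bailey uses: people are randomly split into a standard arm and an intensive arm, the intensive contact is assumed to add 40 percentage points to everyone’s chance of responding, and `supporter_boost` is a hypothetical parameter that makes supporters more likely to respond in the non-ignorable case.

```python
import random


def outreach_gap(supporter_boost, seed=2, sample_size=40_000):
    """Gap between the standard-arm and intensive-arm estimates.

    supporter_boost = 1 means ignorable non-response; values above 1
    mean supporters respond more often (non-ignorable).
    """
    rng = random.Random(seed)
    answers = {"standard": [], "intensive": []}

    for _ in range(sample_size):
        answer = 1 if rng.random() < 0.5 else 0  # assumed true rate: 50%
        # Random assignment to one of the two contact strategies.
        arm = "intensive" if rng.random() < 0.5 else "standard"

        p = 0.10 * (supporter_boost if answer == 1 else 1)
        if arm == "intensive":
            p += 0.40  # assumed effect of the more engaging contact
        if rng.random() < p:
            answers[arm].append(answer)

    standard = sum(answers["standard"]) / len(answers["standard"])
    intensive = sum(answers["intensive"]) / len(answers["intensive"])
    return standard - intensive


if __name__ == "__main__":
    print("ignorable gap:", round(outreach_gap(1), 3))
    print("non-ignorable gap:", round(outreach_gap(3), 3))
```

As in the classroom metaphor, the comparison is what matters: if the two arms give about the same answer, the original non-response looks ignorable; a clear gap between them signals that the people who were hard to reach think differently.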

The American guest speaker concluded that “even in our big data era, the method of gathering is more important than the size of the data”. He invited the students of Sciences Po’s School of Public Affairs to take an interest in this exciting topic and the important challenges it raises in sensitive contexts such as elections and public policy making.
