The Business of Hillary vs. Trump
Data aggregation and cleaning – what we can learn from the US presidential elections.
Every four years, the US media is blessed with a money-making bonanza. During the 2016 presidential debates, CNN was able to charge 40 times its standard rate for a 30-second TV commercial.1 The elections present an easy value creation and capture opportunity. Value creation is simplified because the issues are defined even before the race, the protagonists are predetermined, the journalists know exactly where to look for stories, and the audience, in return, expects the coverage. To enhance their value capture, the news agencies need to provide a reliable flow of information and, most importantly, a credible prediction of the outcome at the end. This credibility determines the long-term ability of the news industry to capture additional value during the election season. However, for the second time in a few months, the news agencies got the prediction wrong: first on Brexit, and then on the US presidential election. While the verdict on why the predictions turned out to be wrong is still not out, some of the issues with data aggregation and faulty assumptions are (now) glaringly obvious.
FiveThirtyEight.com provides an overview of how election outcome prediction models work. At the most basic level, various survey and non-survey data are fed into supervised learning models, which are trained on historical data, and the signals from the various models are aggregated according to predetermined criteria. As the actual election nears, the weights on signals from regression models are reduced, and the weights on sample polling results are increased. The Atlantic and the Pew Research Center discuss some of the issues with the way data was aggregated for regression analysis:
Non-response bias: When sampling the population through polls, certain demographics such as low-income households or the rural population are difficult to reach. Data cleaning and regression models attempt to correct for this problem based on historical behavior of these demographic groups. But in the case of the 2016 elections, the underrepresented groups deviated overwhelmingly from the historical trend.
The “Shy Trumper” hypothesis: Because of the media frenzy and general societal reaction, openly claiming to be a Trump supporter was not a popular move. While still not proven, some claim that many Trump supporters may have represented themselves as Hillary supporters or “undecided” in the polls. The demographic models that classify the undecided lot into the different camps would have then also been skewed towards Hillary Clinton.
Likely voter models: One could argue that this is an unnecessary adjustment to the dataset. But currently, regressions on demographic characteristics are used to adjust poll results for whether a poll participant is likely to vote in the actual election. It is an indirect way of further adjusting the weights on signals from different regression models. So if certain races or income classes show up in higher numbers than expected, or many traditional Midwesterners choose to abstain from voting, the predictions turn out to be inaccurate.
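To make the weighting issues above concrete, here is a minimal Python sketch of how a poll might be reweighted by demographic group and assumed turnout. It is only an illustration: the Group class, the two groups, and every share and rate in it are invented for this example, and it is not the method used by any particular pollster.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Group:
    """One demographic group in a hypothetical poll (all numbers are invented)."""
    name: str
    population_share: float  # share of the electorate, assumed known (e.g. census)
    sample_share: float      # share of poll respondents
    support: float           # fraction of the group backing candidate A in the poll
    turnout: float           # assumed probability that a group member votes


def raw_estimate(groups):
    """Support for candidate A if respondents are taken at face value."""
    return sum(g.sample_share * g.support for g in groups)


def weighted_estimate(groups):
    """Reweight each group to population share times assumed turnout."""
    weights = [g.population_share * g.turnout for g in groups]
    total = sum(weights)
    return sum(w * g.support for w, g in zip(weights, groups)) / total


# Rural voters are under-sampled: 40% of the electorate, 20% of respondents.
groups = [
    Group("urban", population_share=0.60, sample_share=0.80, support=0.55, turnout=0.60),
    Group("rural", population_share=0.40, sample_share=0.20, support=0.35, turnout=0.60),
]

# Same poll, but rural turnout ends up higher than the historical assumption.
surge = [replace(g, turnout=0.75) if g.name == "rural" else g for g in groups]

raw = raw_estimate(groups)
modelled = weighted_estimate(groups)
actual = weighted_estimate(surge)

print(f"raw poll:             {raw:.1%}")
print(f"weighted, historical: {modelled:.1%}")
print(f"weighted, surge:      {actual:.1%}")
```

With these made-up inputs, candidate A leads in the raw poll (about 51%), falls behind once the under-sampled rural group is reweighted (about 47%), and falls further (about 46%) if rural turnout deviates from the historical assumption. The correction is only as good as the turnout figures fed into it.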
So why, in this age of machine learning, and especially given how easy it has become to run sentiment analysis, did the news channels not do anything to boost their predictive power? More importantly, why did they not review their methods after the Brexit outcome? Some of this has to do with competition and incentives. Although in the long run a news agency’s revenue potential is tied to credibility, the short-term focus on competing for revenue share on “speed-to-market” does not leave any time for a detailed examination of modeling assumptions and outputs. Hopefully, the two recent episodes of missed forecasts will serve as a wake-up call for the media channels using predictive analytics.
Student comments on The Business of Hillary vs. Trump
Great topic to choose for this assignment–it’s illustrative of the practical challenges and limitations of data science. Some of the errors you mentioned are systematic errors; for instance, “Shy Trumpers” could cause all polls–not just random ones–to have errors that are biased in the same direction (towards Hillary). One way to handle this is to build a model that accounts for these biases. FiveThirtyEight had one of the few models that did this (they assumed polling errors were correlated) and indeed gave Trump a higher chance of winning than others. Even with this assumption, however, they still got the final prediction wrong. Perhaps, as you suggest with “sentiment analysis”, forecasters need to look to new data sources to eliminate this potential bias.
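The point about correlated polling errors can be illustrated with a small Monte Carlo sketch in Python. This is not FiveThirtyEight's model; the function name upset_probability, the five states, the 3-point leads, the 4-point error, and the 75% shared-variance figure are all assumptions chosen for illustration.

```python
import math
import random


def upset_probability(leads, total_sd, shared_fraction, trials=20000, seed=0):
    """Estimate how often the trailing candidate carries a majority of states.

    leads: polled lead of the favourite in each state, in percentage points.
    total_sd: standard deviation of the polling error in each state.
    shared_fraction: share of the error variance common to all states
                     (0.0 = independent errors, 1.0 = identical errors).
    """
    rng = random.Random(seed)
    shared_sd = total_sd * math.sqrt(shared_fraction)
    local_sd = total_sd * math.sqrt(1.0 - shared_fraction)
    upsets = 0
    for _ in range(trials):
        # One draw per simulated election that hits every state the same way.
        shared_error = rng.gauss(0.0, shared_sd)
        states_lost = sum(
            1
            for lead in leads
            if lead + shared_error + rng.gauss(0.0, local_sd) < 0
        )
        if states_lost > len(leads) / 2:
            upsets += 1
    return upsets / trials


# Hypothetical race: the favourite leads by 3 points in each of five states.
leads = [3.0] * 5
p_independent = upset_probability(leads, total_sd=4.0, shared_fraction=0.0)
p_correlated = upset_probability(leads, total_sd=4.0, shared_fraction=0.75)

print(f"upset chance, independent errors: {p_independent:.1%}")
print(f"upset chance, correlated errors:  {p_correlated:.1%}")
```

The per-state error is the same size in both runs; only the share that is common to all states changes. When errors are independent they tend to cancel across states, so an upset needs several unlucky misses at once. When most of the error is shared, a single miss in one direction flips several states together, which is why a model that assumes correlated errors gives the underdog a noticeably higher chance.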
Very timely discussion. On the topic of ‘likely voter’ models, I wonder if the political polling in the US took into account how young people did not show up to vote or start caring about Brexit until it was too late. On this note, I wonder how people in the 18 to MBA + 5 year range voted on both issues. I also wonder how results correlated according to household internet access and main news source.
538 has an interesting presidential approval meter on the main page, which is aggregated and weighted across multiple polling sources according to 538’s judgment. It is an interesting way to see an instant increase or decrease in approval numbers after an important presidential decision, action or communication.
Regarding Brexit, many outlets said only 36% of young people came out to vote. This information was based on data compiled from the previous general election, which looked at the proportion within each generation who said they always vote.
An LSE poll showed youth turnout was around 64% of registered voters. However the over-65s voted in exceptionally high numbers: 90% turnout!
Not sure if you realized that a student in section 1 wrote a post on how 538 got the prediction wrong. Curious to see your response to that post. On top of the non-response bias, I would actually question the raw polling data. The people who are polled are not necessarily those who actually voted.
Great post, Kunal! There are so many reasons that the media and the pollsters got it wrong. It seems that the initial data, the input, was wrong and it created false output. The forecasting models were based on digital data analysis, aggregating data from the Internet and from online social media networks. It turned out that the sources did not depict the whole picture and the initial data was not representative. There are voters who use the Internet very passively or do not use it at all, especially among socially disadvantaged groups. The ‘Shy Trumper’ hypothesis might have played a role, too.
There were also arguments that journalists lost contact with ‘ordinary’ people and became members of ‘coastal elites’ seeing the world outside of New York, Washington or Boston, only through the Acela train windows. That way journalists found themselves in a perfect filter bubble – but it is a different story.
To me the most important symbol of how wrong we got it is the New York Times’ graph of ‘Chance of Winning Presidency’. The lines for both candidates had never crossed before but they crossed for the first time on the election night: