The big data revolution offers us analytical tools that previous researchers only dreamed of. But data alone is not enough. As the cases of Carmen Reinhart and Kenneth Rogoff, and now Thomas Piketty, demonstrate, transparency of method is just as critical.
In “A Critique of Reinhart and Rogoff”, the paper that first questioned the conclusion of the pair’s 2010 work, “Growth in a Time of Debt”, authors Thomas Herndon, Michael Ash and Robert Pollin claimed to have discovered that “coding errors, selective exclusion of available data and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth”.
Among the clearest problems identified were mistakes made when data was entered into spreadsheets. These errors meant that five countries were excluded from the analysis and, along with other coding errors, caused the authors to underestimate average growth rates for heavily indebted countries and overestimate growth for the least indebted.
However, coding errors account for only a small percentage of the overall result (and were accepted by Reinhart and Rogoff as mistakes). Instead, the disagreement between the conclusions of the two papers stems largely from correcting for Reinhart and Rogoff’s “selective exclusion of available data, and unconventional weighting of summary statistics”. That is, the major question posed by Herndon, Ash and Pollin was over the methodology of the original paper, not just the data.
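To see why the weighting of a summary statistic can matter so much, consider a minimal sketch. The figures below are invented for illustration and are not taken from either paper, and the function names are hypothetical. The sketch compares two defensible ways of averaging growth across countries: giving each country equal weight, or giving each country-year equal weight.

```python
from statistics import mean

# Hypothetical growth rates (%), one entry per year a country spent in a
# high-debt category. These numbers are invented for illustration only.
growth_by_country = {
    "A": [2.5, 2.0, 3.0, 2.5, 2.0, 3.0, 2.5, 2.5],  # eight years of data
    "B": [-4.0],                                    # a single year of data
}


def country_weighted_mean(data):
    """Average each country first, then average the country means."""
    return mean(mean(years) for years in data.values())


def country_year_weighted_mean(data):
    """Pool every country-year observation and average them together."""
    return mean(g for years in data.values() for g in years)


by_country = country_weighted_mean(growth_by_country)
by_country_year = country_year_weighted_mean(growth_by_country)

print(f"Equal weight per country:      {by_country:.2f}%")
print(f"Equal weight per country-year: {by_country_year:.2f}%")
```

With these made-up numbers, the first method reports negative average growth and the second reports positive growth, from exactly the same data. Neither choice is automatically wrong, which is why a reader needs to be told which one was used and why.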
This brings us to the current controversy surrounding French economist Thomas Piketty and his new book, Capital in the Twenty-First Century. After digging into the data that Piketty has laudably placed online, the Financial Times discovered what it claimed were “unexplained data entries and errors in the figures underlying some of the book’s key charts”. These were sufficiently serious, in the eyes of Chris Giles, the FT’s economics editor, to undermine the book’s central claim that the share of wealth owned by the richest in society has been inexorably growing.
As in the Reinhart/Rogoff incident, problems were discovered with the data entered by Piketty and his researchers. Yet, again as in the earlier case, transcription errors alone do not account for the bulk of the alleged flaws discovered by the FT. Instead, most of these stem from methodological disagreements relating to “unexplained alterations of the original source data” and inconsistent weightings.
Unsurprisingly, just as in the previous case, the author and his supporters vehemently deny the accusations levelled at the book. First, Piketty argues that the adjustments to the data were necessary to maintain consistency over time and across countries. Second (unlike in the Reinhart/Rogoff example), the alleged flaws discovered in the analysis fail to disprove the overall trend of rising wealth inequality.
Whatever your view on the veracity of these claims and counter-claims, what both cases show is that failing to make clear the assumptions and adjustments that underpin your methodology leaves your results open to this type of criticism. Even if there is a compelling rationale for having made adjustments to the underlying data, the absence of an explanation can easily lead people to assume nefarious motives where none may exist.
Understanding how the data has been selected and collated is critical for those wishing to replicate the results – a practice that not only helps to clean up any mistakes but can also make the final conclusions more robust. It also lets readers know why certain decisions have been taken, so that they can gain a deeper understanding of the argument.
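One way to make selection choices visible is to record them alongside the result rather than leaving them implicit in a spreadsheet. The following is a minimal sketch under stated assumptions: the observations, the exclusion rules (including the 1950 cutoff) and the helper `select_with_log` are all hypothetical and do not reproduce any method used in the papers or the book discussed here.

```python
from statistics import mean


def select_with_log(observations, rules):
    """Apply named exclusion rules and record why each row was dropped."""
    kept, log = [], []
    for obs in observations:
        # Find the first rule (in insertion order) that excludes this row.
        reason = next((name for name, rule in rules.items() if rule(obs)), None)
        if reason is None:
            kept.append(obs)
        else:
            log.append((obs["country"], obs["year"], reason))
    return kept, log


# Invented observations, for illustration only.
observations = [
    {"country": "A", "year": 1950, "growth": 2.0},
    {"country": "A", "year": 1951, "growth": None},
    {"country": "B", "year": 1946, "growth": -3.0},
    {"country": "B", "year": 1950, "growth": 4.0},
]

# Hypothetical exclusion rules, each with a human-readable name.
rules = {
    "missing growth figure": lambda o: o["growth"] is None,
    "before 1950 (hypothetical cutoff)": lambda o: o["year"] < 1950,
}

kept, log = select_with_log(observations, rules)
average = mean(o["growth"] for o in kept)

print(f"Average growth over {len(kept)} retained rows: {average:.1f}%")
for country, year, reason in log:
    print(f"Excluded {country} {year}: {reason}")
```

Publishing the exclusion log with the headline figure lets a reader see at a glance which observations were left out and on what grounds, and rerun the calculation with different rules.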
To that extent, Piketty, Reinhart and Rogoff are at fault for assuming that the route they took in doing their research mattered less than the results. This fault, however, is not theirs alone. If we are to embrace the potential of big data, we must move beyond simply opening up our spreadsheets to scrutiny. The lesson here is clear: it is only when we reveal the choices we make in selecting data that transparency can become accountability.
Author: Tomas Hirst works for the digital team of the World Economic Forum.