An unprecedented number of individuals and organizations are finding ways to explore, interpret and use Open Data. Public agencies are hosting Open Data events such as meetups, hackathons and data dives. The potential of these initiatives is great, including support for economic development (McKinsey, 2013), anti-corruption (European Public Sector Information Platform, 2014) and accountability (Open Government Partnership, 2012). But is Open Data’s full potential being realized?
A news item from Computer Weekly casts doubt. A recent report notes that, in the United Kingdom, poor data quality is hindering the government’s Open Data program. The report goes on to explain that – in an effort to make the public sector more transparent and accountable – UK public bodies have been publishing spending records every month since November 2010. The authors of the report, who conducted an analysis of 50 spending-related data releases by the Cabinet Office since May 2010, found that the data was of such poor quality that using it would require advanced computer skills.
Far from being a one-off problem, research suggests that this issue is ubiquitous and endemic. Some estimates indicate that as much as 80 percent of the time and cost of an analytics project is attributable to the need to clean up “dirty data” (Dasu and Johnson, 2003).
In addition to data quality issues, data provenance can be difficult to determine. Knowing where data originates and by what means it has been disclosed is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes. Established data provenance does not “spring full blown from the head of Zeus.” It takes considerable effort, including enriching data with metadata – data about data – such as the date of creation, the creator of the data and who has had access to the data over time, and ensuring that both data and metadata remain unalterable.
Similarly, if people think that data could be tampered with, they are unlikely to place trust in it; full comprehension of data relies on the ability to trace its origins. Without knowledge of data provenance, it can be difficult to interpret the meaning of terms, acronyms and measures that data creators may have taken for granted, but that become much more difficult to decipher over time.
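To make the idea of provenance metadata more concrete, the following is a minimal sketch in Python. It is an illustration only, not a method described in the report or by the World Bank: the field names (`creator`, `created`, `access_log`) and the functions `build_provenance_record`, `seal` and `is_unaltered` are hypothetical. It assumes that a SHA-256 digest is an acceptable way to detect changes, and that the resulting seal is stored somewhere the data publisher cannot quietly rewrite it. A digest can only reveal tampering; it cannot prevent it.

```python
import hashlib
import json


def build_provenance_record(data: bytes, creator: str, created: str) -> dict:
    """Describe a dataset with illustrative provenance metadata (hypothetical schema)."""
    return {
        "creator": creator,          # who created the data
        "created": created,          # date of creation, as an ISO 8601 string
        "access_log": [],            # who has had access over time
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the data itself
    }


def seal(record: dict) -> str:
    """Fingerprint the metadata, so that edits to the metadata are detectable too."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def is_unaltered(data: bytes, record: dict, expected_seal: str) -> bool:
    """Check that neither the data nor its metadata changed since the seal was taken."""
    data_ok = hashlib.sha256(data).hexdigest() == record["sha256"]
    metadata_ok = seal(record) == expected_seal
    return data_ok and metadata_ok


# Usage: seal a small, made-up spending file at the time of release.
data = b"department,amount\nExample Department,1200.50\n"
record = build_provenance_record(data, creator="Example Agency", created="2014-01-15")
published_seal = seal(record)
print(is_unaltered(data, record, published_seal))
```

In this sketch, any later entry added to `access_log` changes the metadata, so a new seal would have to be recorded after each logged access.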
A final issue is a lack of consideration for the need for data preservation and stewardship. This can also prevent full realization of the benefits of Open Data, as evidenced by comments from a participant at one of the World Bank’s own data dives: “The big problem…is data curation…[organizations] think the structure of data is what gives it value. But it’s actually…what I call data curation….there is no archive…” (Sonuparlak, 2013).
Poor quality data, lack of information about data provenance and data stewardship issues present common barriers to implementation of Open Data initiatives. If this is an issue in the UK and other developed countries, how much more of an issue is it likely to be for the World Bank’s (often-developing) client countries?
Even though much public sector-related Open Data originates from official government records, in many countries even basic records management controls are missing, particularly in the digital environment. Without these controls, records are likely to be incomplete, difficult to locate and challenging to authenticate; where they exist, they can be easily manipulated, deleted, fragmented or lost. Poorly kept records, resulting in inaccurate or incomplete data, can lead to:
- Misunderstanding and misuse of information;
- Cover-up of fraud;
- Skewed findings and statistics; and
- Misguided policy recommendations and misplaced funding.
All of these problems lead to missed opportunities to realize the full value of Open Data.
So, what’s the solution? Fixing poor quality data after it is created or when it comes time to use it is costly and, in the case of historical data, often not even possible due to technological changes or lack of documentation. There is evidence to suggest that designing controls that will produce good-quality data in the first place is a far better strategy; that is, it is better to arrive at good Open Data by design.
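As one illustration of a control applied at the point of creation rather than after the fact, the sketch below checks a spending record before it is accepted. It is a hedged example under stated assumptions, not a description of any government's actual system: the field names in `REQUIRED_FIELDS` and the function `validate_spending_row` are hypothetical, and the rules (ISO 8601 dates, plain decimal amounts) are simply plausible choices.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical minimum set of fields for a spending record.
REQUIRED_FIELDS = ("date", "department", "supplier", "amount")


def validate_spending_row(row: dict) -> list[str]:
    """Return a list of problems found in one record; an empty list means it passes."""
    problems = []
    # Normalize every required field to a stripped string ("" if absent).
    values = {
        field: "" if row.get(field) is None else str(row[field]).strip()
        for field in REQUIRED_FIELDS
    }

    for field, value in values.items():
        if not value:
            problems.append(f"missing {field}")

    if values["date"]:
        try:
            date.fromisoformat(values["date"])
        except ValueError:
            problems.append("date is not an ISO 8601 date")

    if values["amount"]:
        try:
            Decimal(values["amount"])
        except InvalidOperation:
            problems.append("amount is not a number")

    return problems


# Usage: a record with a local date format and a thousands separator is rejected.
for problem in validate_spending_row(
    {"date": "30/11/2010", "department": "", "supplier": "Example Ltd", "amount": "1,200.50"}
):
    print(problem)
```

Rejecting or flagging such records when they are entered is far cheaper than asking every downstream user to repair them independently.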
The World Bank’s Open Data Readiness Assessment Tool can help client countries identify whether they are ready for Open Data along several dimensions, but two in particular:
- Institutional structures, responsibilities and skills within government
- Data within government
Institutional structures, responsibilities and skills – or what may be described as Information Governance – are important because “Open Data requires agencies to manage their data assets with a transparent, organized process for data gathering, security, quality control and release. To effectively carry out these responsibilities, agencies need to have (or develop) clear business processes for data management as well as staff with adequate ICT skills and technical understanding of data (e.g., formats, metadata, APIs, databases)” (ODRA, 2014).
Data within government is similarly critical because it builds “on established digital data sources and information management procedures within government where they already exist. . . good existing information management practices within government can make it much easier to find data and associate metadata and documentation, identify business ownership, assess what needs to be done to release it as Open Data and put processes in place that make the release of data a sustainable, business-as-usual, downstream process as part of day-to-day information management” (ODRA, 2014).
Designing the institutional structures, responsibilities and skills within government to produce good data means designating one entity with sufficient political weight to coordinate Open Data matters across government. It also requires ensuring that Open Data policies are implemented, and that there is one agency or department responsible for information management – regardless of the form of the information (i.e., paper or digital). In regard to data within government, good design includes a comprehensive inventory of data holdings; coherent information management policies and standards, consistently enforced across government; and a process for digitization of paper records with infrastructure and processes in place for sustaining long-term digital record repositories. International standards point the way to other good practices.
While all of these design features will help produce good Open Data, there is still a lack of knowledge about the antecedents and contributing factors of good Open Data, as well as uncertainty about how best to implement initiatives in the context of the World Bank’s client countries. The conditions that produce good data require more in-depth study as a foundation for a design framework that can, for example, be built into an expanded Open Data Readiness Assessment Tool. Only with this understanding will it be possible to achieve good Open Data by design, and then truly realize the benefits of Open Data for development.
Published in collaboration with The World Bank Blog.
Authors: Dr. Victoria Lemieux is a Senior Public Sector Specialist (Information Management) and an Associate Professor of Archival Studies at the University of British Columbia (on leave). Oleg works for the World Bank’s Transport & ICT Global Practice, where he coordinates the Open Data Program. Roger Burks is an Online Communications Officer for the World Bank.
Image: A robotic tape library used for mass storage of digital data is pictured at the Konrad-Zuse Centre for applied mathematics and computer science (ZIB), in Berlin August 13, 2013. REUTERS/Thomas Peter.