AI Needs Better Data, Not Just More Data

AI has a data quality problem. In a survey of 179 data scientists, over half identified addressing issues related to data quality as the biggest bottleneck in successful AI projects. Big data is so often improperly formatted, lacking metadata, or “dirty,” meaning incomplete, incorrect, or inconsistent, that data scientists typically spend 80 percent of their time cleaning and preparing data to make it usable, leaving just 20 percent of their time for actual analysis. As a result, organizations developing and using AI must devote substantial resources to ensuring they have enough high-quality data for their AI tools to be useful at all. As policymakers pursue national strategies to increase their competitiveness in AI, they should recognize that any country that wants to lead in AI must also lead in data quality.
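
To make that cleaning burden concrete, the sketch below shows the kind of routine quality checks a data scientist might run before a dataset can feed an AI model. It is illustrative only; the records and column names (patient_id, visit_date, blood_pressure) are hypothetical and not drawn from any dataset discussed here.

```python
# Illustrative only: routine checks for the three kinds of "dirty" data the
# article names (incomplete, incorrect, inconsistent), plus duplicates.
# The records and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "patient_id":     [101, 102, 102, 103],          # 102 entered twice
    "visit_date":     ["2019-01-05", "01/07/2019", "01/07/2019", None],
    "blood_pressure": [120, 118, 118, 900],          # 900 is implausible
})

# Incomplete: count missing values in each column.
print(df.isna().sum())

# Inconsistent: the same field recorded in different formats; entries that
# do not match the expected YYYY-MM-DD format become NaT for later review.
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")

# Incorrect: flag values outside a plausible range.
print(df[(df["blood_pressure"] < 40) | (df["blood_pressure"] > 300)])

# Duplicates: the same record entered more than once.
print(df[df.duplicated(keep=False)])
```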

Collecting and storing data may be getting cheaper, but creating high-quality data can be costly, prohibitively so for small organizations or research teams, forcing them to make do with bad data, and thus unreliable or inaccurate AI tools, or preventing them from using AI entirely. The private sector will of course invest in data quality, but policymakers should view increasing the amount of high-quality data as a valuable opportunity to accelerate AI development and adoption, as well as to reduce the potential economic and social harms of AI built with bad data. There are three avenues for policymakers to increase the amount of high-quality data available for AI: require the government to provide high-quality data; promote the voluntary provision of high-quality data from the private and non-profit sectors; and accelerate efforts to digitize all sectors of the economy to support more comprehensive data collection.

First, policymakers have in recent years emphasized the importance of making data available for AI, and quantity does matter: developing AI systems can require vast quantities of data, and open government data can be a valuable platform for innovation. But the federal government’s data often suffers from quality problems, such as a lack of standard identifiers and inconsistent definitions, that make analysis difficult. Policymakers should both invest in efforts to improve the government’s existing data and direct government agencies to develop shared pools of high-quality, application-specific training and validation data in key areas of public interest, such as agriculture, education, health care, public safety and law enforcement, and transportation. For example, the U.S. National Institute of Standards and Technology should work with law enforcement agencies, civil society, and other stakeholders to develop shared, representative datasets of faces that can serve as an unbiased resource for organizations developing facial recognition technology. There is already precedent for government stepping in to provide high-quality data where it is sorely needed. For example, the U.S. Department of Transportation began work on a publicly accessible national address database in 2015 after recognizing that several government agencies, as well as large sectors of the economy, collect and rely on address data but lack a single, comprehensive source for this information, resulting in duplicative collection and fragmented datasets.
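
To illustrate why missing standard identifiers make analysis difficult, the hypothetical sketch below joins two made-up agency tables that record the same facilities under differently formatted addresses; a direct merge finds nothing until a shared identifier, of the kind a national address database would supply, is added.

```python
# Illustrative only: the datasets, fields, and identifier values are
# hypothetical, not taken from any government system described here.
import pandas as pd

inspections = pd.DataFrame({
    "address": ["123 N. Main St.", "45 Oak Avenue"],
    "violations": [2, 0],
})
permits = pd.DataFrame({
    "address": ["123 North Main Street", "45 Oak Ave"],
    "permit_type": ["food service", "retail"],
})

# A direct join finds no matches, even though both tables describe
# the same two facilities.
print(inspections.merge(permits, on="address"))  # empty result

# With a shared identifier, the join is trivial and the combined
# data becomes usable for analysis.
inspections["address_id"], permits["address_id"] = [1, 2], [1, 2]
print(inspections.merge(permits, on="address_id"))
```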

Second, since government data is only a fraction of the data that could be useful for AI development, policymakers should also encourage the private and non-profit sectors to provide voluntary access to high-quality data. In many cases where high-quality data exists, it is dramatically underutilized. For example, in the health care sector, government agencies, universities, and pharmaceutical companies may all have their own rich datasets that could generate substantial benefits for AI if widely shared, but these stakeholders lack the mechanisms to do so while ensuring that this proprietary and sensitive data is protected. Policymakers in the United Kingdom have recognized this as a key barrier to AI development and are attempting to overcome it by developing a model for data trusts, defined as “not a legal entity or institution, but rather a set of relationships underpinned by a repeatable framework, compliant with parties’ obligations to share data in a fair, safe, and equitable way.” Without a coordinating body, such as a government agency specifically devoted to developing and supporting these models, it is unlikely that organizations will develop them on their own. Policymakers should experiment with data trusts and other models to make existing high-quality datasets, including those developed and maintained by government agencies, a more widely available resource for AI.

Third, since datasets are most useful when they are representative and complete, policymakers should accelerate digitization efforts to enable more comprehensive data collection. Many sectors lag in digitization, and organizations in these sectors are consequently limited in their ability to use AI. For example, over half of all U.S. electricity customers do not yet have smart meters monitoring electricity usage, limiting the opportunities to leverage AI to better manage energy use. In addition, there are no comprehensive smart city initiatives in the United States; even leading cities have only a handful of efforts to deploy sensor networks and digitize municipal operations, forcing cities to use AI tools of limited utility where only some data exists, or preventing them from using such tools entirely where no data exists. And despite the promise of smart manufacturing, adoption of digital manufacturing technology is sluggish, limiting the ability of manufacturers to leverage AI to improve operations. Policymakers should direct federal agencies, such as the Department of Housing and Urban Development, the Department of Health and Human Services, the Department of Transportation, and the Federal Energy Regulatory Commission, to identify and implement policies that can accelerate digital transformation in relevant sectors.

Fortunately, some policymakers have recognized the importance of providing high-quality data for AI development. President Trump’s recently announced American AI Initiative promises to “enhance access to high-quality and fully traceable federal data… to increase the value of such resources for AI R&D,” and directs agencies to identify and address data quality limitations. However, more tangible and comprehensive action is needed. Policymakers should allocate funding for agencies to systematically improve the quality of the data they make publicly available, develop new high-quality data resources, promote the broader circulation of high-quality data that could serve as an invaluable resource to all organizations developing AI, and pursue a fully digitized economy. The recently passed OPEN Government Data Act helps on this front by directing federal agencies to appoint chief data officers, who can oversee these efforts and identify additional ways to increase the availability of high-quality data in government. And chief data officers should focus not only on improving data quality throughout government, but also on developing strategies to address the data needs, particularly as they relate to AI, of the universities, nonprofits, and businesses working on issues related to their agency’s mission.

Image: Pxhere.