Does history rhyme? Supercomputing, AI, and the US government’s support for a research data infrastructure

George Strawn

Professor Emeritus and former Department Chair of Computer Science and Director of the Computation Center at Iowa State University

Board Director Emeritus for the National Academies’ Board on Research Data and Information


The author Mark Twain supposedly said that “history does not repeat itself but it rhymes.” And with respect to support for AI research, a number of recent actions by the US government appear to rhyme with similar actions it took in the 1980s, when it recommended (and ultimately implemented) significant support for supercomputing-based research.

In 1982, an influential committee[1] submitted a report (called the Lax Report, named after its chair, Peter D. Lax, a renowned mathematician) recommending that the US government provide supercomputing access to university-based computational scientists. The National Science Foundation (NSF) was given the responsibility of creating supercomputing centers for that purpose.

As an ancillary consideration, the Lax Report also recommended creating a computer network connecting 100 research universities to the new supercomputing centers. This recommendation gave the NSF permission to create the National Science Foundation Network (NSFNET), which within a decade evolved into today’s Internet.

The echo of the Lax Report and its aftermath can be heard today with the release of another influential document, the National AI Research Resource Report (or NAIRR Report). The NAIRR Task Force was cochaired by leaders of the Office of Science and Technology Policy (OSTP) and the NSF, with the final report signed by their directors.

On its heels, a bipartisan group of members of Congress proposed legislation (cleverly named the CREATE AI Act) to codify the report’s recommendations into law. And the White House issued an Executive Order directing NSF to begin a pilot to implement the NAIRR Report’s recommendations while legislation is being considered. (NSF has officially announced the NAIRR pilot and has already made several data awards consistent with the report’s recommendations.)

NAIRR Report

In January 2023, the NAIRR Task Force released its report on the need to increase investment in AI research[2]. The Task Force consisted of 12 experts from academia, government, and the private sector. The Task Force conducted its work before generative AI such as ChatGPT had become a public spectacle, so its recommendations did not highlight that newest AI success.

But the Report did note that both dimensions of machine learning (analytic and generative) require huge amounts of data to train their algorithms. Thus, access to data is a core requirement for AI research. Moreover, public access to AI data could increase trust in AI-generated results.

The Report calls for the creation of an Operating Entity, which should be a distinct non-governmental organization, governed by a charter and associated policies, with an executive team managing day-to-day operations. Among other things, the Operating Entity is charged with supporting data as follows:

Data and Datasets

The Operating Entity should provide a search and discovery service with metadata about the usage of all datasets. Such a service should be consistent with Section 202(c) of the Evidence Act. It should be designed to dovetail with the capabilities anticipated through development of a Federal data catalog, but extend beyond Federal data to include research, administrative, and other data produced by non-federal entities.

The Operating Entity should support data resource providers by either funding the creation of or providing continuing support to existing AI data repositories. In coordination with the Technology Advisory Board, the Operating Entity should publish interoperability guidelines for such data repositories, and encourage data repositories to compete to become NAIRR data resource providers. These guidelines should be informed by the Desired Characteristics of Data Repositories for Federally Funded Research developed by the National Science and Technology Council's Subcommittee on Open Science. Having such repositories and datasets visible, searchable, and discoverable inside the NAIRR, as well as implementing mechanisms to track dataset use, are important to the success of the NAIRR.

To achieve the NAIRR's vision and goals, the Task Force estimated that a budget of $2.6 billion would be needed over an initial six-year period.

These requirements are closely related to those of other Open Science Data Commons such as the European Open Science Cloud[3].

CREATE AI Act

Although the US Congress is highly divided, a bipartisan group of lawmakers has proposed legislation to implement the tenets of the NAIRR Report. (By way of disclosure, Frontiers is on record as supporting the aims of this legislation.)

As a press release[4] states:

The CREATE AI Act establishes the NAIRR, which has four primary goals:

1. Spur innovation and advance the development of safe, reliable, and trustworthy AI research and development.

2. Improve access to AI resources for researchers and students, including groups typically underrepresented in STEM.

3. Improve capacity for AI research in the United States.

4. Support the testing, benchmarking, and evaluation of AI systems developed and deployed in the United States.

The NAIRR will offer the following to researchers, educators, and students at higher education institutions, non-profits, and federally funded agencies:

1. Computational resources, including an open-source software environment and a programming interface providing structured access to AI models.

2. Data, including curated datasets of user interest and an AI data commons.

3. Educational tools and services, including educational materials, technical training, and user support.

4. AI testbeds, including a catalog of open AI testbeds and a collaborative project with the National Institute of Standards and Technology.

While the CREATE AI Act authorizes the NAIRR, it does not appropriate funding for it. Parts of the NAIRR could be funded through existing US governmental programs, or lawmakers could fund the NAIRR as part of the annual appropriations process.

Executive Order

There are signs that passage of the CREATE AI Act is anticipated by the Executive Branch of the government. In particular, the White House has published an Executive Order[5], which among other things “directs the NSF to pilot the NAIRR to explore the infrastructure, governance mechanisms, and user interfaces needed to make distributed computational, data, model, and training resources available to the research community in support of AI-related research and development.”

And indeed, the NSF has officially launched its NAIRR pilot and has made several awards that relate directly to the data dimensions of the NAIRR.

Relevant NSF awards and launch of NAIRR pilot

Two open data awards have been made by NSF consistent with the specifications of the Executive Order: The National Data Platform (NDP) award, begun on September 1, 2023 for $6

NSF officially launched the NAIRR Pilot on January 24, 2024. It is the first step toward realizing the full NAIRR vision of a shared research infrastructure that will strengthen and democratize access to critical resources necessary to power responsible AI discovery and innovation. NSF leads the NAIRR Pilot in collaboration with 10 other federal agencies and 25 private sector, non-profit, and philanthropy partners who have contributed an array of resources such as computing capabilities, datasets, pre-trained models, analytic platforms, and learning opportunities[6].

Does History Rhyme?

The NAIRR report “rhymes” with the Lax Report in that an ancillary consideration—the creation of a research data infrastructure to support AI research—might in the long run have a broader impact than just AI research.

A successful research data infrastructure to support AI could lead to a broader framework to support all research; this in turn could lead to a general data infrastructure that enables both wider access to research assets and greater public wellbeing.

Such an achievement would parallel what happened as the NSFNET grew from a 100-node network supporting supercomputer access to a computer network for US higher education, and ultimately to today’s global Internet.

How might that look for AI research? Take, for example, the current state of most science data: anyone hoping to reuse it finds it difficult to understand. By some estimates, 80% of the time invested in a data project is spent “munging” (i.e., preparing) the data procured, leaving far less time to analyze it[7].

Standards such as the FAIR principles[8] exist to alleviate this burden, but compared to the age, size, and complexity of available datasets, FAIR implementation has barely begun.

If and when a unified data infrastructure emerges, it will do for data what the Internet did for networking, first for science then for society. (Disorganized data is a human problem, not exclusively a scientific one.)

Even a partial solution for data cleanup – say, cutting by half the time required to prep data for analysis – would be significant. Eliminating that time entirely would be revolutionary.

***

George Strawn is Professor Emeritus and former Department Chair of Computer Science and Director of the Computation Center at Iowa State University. He spent a number of years in various positions at NSF, the last of which was on detail to OSTP as co-chair of the Networking and IT Research and Development (NITRD) subcommittee of the National Science and Technology Council, and also as director of the National Coordination Office, which staffs the NITRD subcommittee. After retiring from NSF, he became board director for the National Academies' Board on Research Data and Information, where he is currently board director emeritus.


[1] https://www.nsf.gov/news/special_reports/nsf-net/images/lax_report_1982.pdf

[2] https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf

[3] https://digital-strategy.ec.europa.eu/en/policies/open-science-cloud

[4] https://eshoo.house.gov/media/press-releases/ai-caucus-leaders-introduce-bipartisan-bill-expand-access-ai-research

[5] https://ssti.org/blog/biden-administration-releases-executive-order-regarding-future-ai-us-including-specific

[6] https://new.nsf.gov/news/democratizing-future-ai-rd-nsf-launch-national-ai?auHash=IjZsFGqQHd0-1VVQYDhfQTxqCA3sgu6IpfhlK01W6Xc

[7] https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734

[8] https://www.go-fair.org/fair-principles/
