FAIR use in Artificial Intelligence? Access to data for the benefit of all

Luc Soete

Member of ESIR

Dean of the Brussels School of Governance

Vice-Chairman of the Supervisory Board of Delft University of Technology (TU Delft)


When applying Artificial Intelligence (AI), we face a central challenge, particularly in the personal sphere: access to the data used to train algorithms. In some notable cases – such as when the Dutch tax authorities used an early form of AI to monitor childcare benefit payments to citizens entitled to them[1] – the application of poorly verified data to train an algorithm has caused a major political scandal. More broadly, there is evidence of growing public distrust of public authorities’ use of AI.[2]

At the same time, large, privately-owned technology companies – notably Google, Meta, Microsoft, and Amazon – have used the massive amounts of data accumulated on their users and clients to drive AI research forward dramatically, developing and commercializing new tools at great speed. Paradoxically, rather than being confronted with citizens’ distrust, the AI business sector itself – and the privately-funded AI research community – has expressed concern about an “out-of-control” rate of progress.

In parallel, the speed of progress in the commercial development of AI platforms and tools makes the call for greater regulation harder to realize. “Regulatory sandboxes” – frameworks for testing innovative products or services that challenge existing legal rules, in which participating companies obtain a waiver from specific legal provisions or compliance processes in order to innovate (European Parliament, 2022;[3] OECD, 2023[4]) – are beginning to resemble “quicksand boxes”.

It is no surprise that AI regulation is a continuing subject of debate, review and tentative development. The extent to which the call for regulation is part of incumbents’ own strategies to increase the value of their data stock, or indeed to prevent newcomers from entering the market, is very much open to debate (Burgelman, 2023).[5]

However, from the perspective of progress in research, science, and technology – my focus here – it could be argued that rather than building regulatory “quicksand boxes” for AI, public authorities and regulators should open the “black box” of data used to train and support AI algorithms. They should see to it that this data is made public – or at least transparent – so that its potential for much broader application, and for further improvement through trial and error, can be unleashed.

As Barend Mons (2023)[6] has usefully highlighted, “good practices in data stewardship and analysis could be more effective than regulation.” His plea is to make data inputs “Fully AI Ready” in any further AI research – an alternative wordplay on the original FAIR acronym of Findable, Accessible, Interoperable and Reusable, for machines as well as for people.[7]

From a societal perspective, it is clear that the concentration of huge data sets in the hands of a very few big tech companies creates an oligopoly in the market for new AI applications and arguably slows innovation. This is particularly pertinent in Europe where, as Michael Spence (2023)[8] has argued, productivity gains in many public sectors have been slowing over the past decade. An aging population and possible labor shortages make it urgent to exploit AI opportunities more fully, not least if we are to keep public services open and accessible across health, education, public administration, human protection, and policing.

In that context, Henk van Tuinen, former director of the Dutch Central Bureau of Statistics,[9] has put forward an interesting proposal. He argues that the national statistical offices of EU Member States – or alternatively Eurostat at the EU level – should take charge of the collection and preservation of all FAIR “observational data”.

In most countries, statistical legislation already compels businesses to provide information on their production, employment and revenues to their national statistical agencies. The legal obligation to deliver such data was introduced in the first half of the 20th century with the public aim of obtaining accurate statistics about a country’s economic progress, its infrastructure, transformation, competitiveness, and so forth. These statistical agencies may analyze the collected data, but they cannot publish it in a way that reveals information about individual businesses or other entities; nor can they provide any identifiable information to the government or the judiciary.

These national agencies thus have all the identifying details in the data at their disposal for statistical analysis, allowing researchers in turn, within limits, to draw on the data for their own analyses. In many areas, the successful assessment and evaluation of particular public policies has depended heavily on this micro-econometric statistical analysis. In short, based on the law as it stands, companies could be compelled to provide national statistical offices with the data underlying their AI algorithms, on the condition that it meets FAIR principles.

A large tech company might well lobby against such a move. But insofar as the supplied data would not be made public, the obligation would fall equally on its competitors, and supplying the data would not be administratively costly (in contrast to traditional statistical indicators), a consensus is at least conceivable.

For statistical offices, this would be a logical step, in line with their historical tradition of not revealing any individual company’s data. As van Tuinen points out, “Statistics are about aggregates and relationships, not individual entities. By applying statistical legislation, the data for big tech could become available for statistical purposes and scientific research by all bona fide scientists and research institutions, for the benefit of everyone in society.”

In this scenario, statistical agencies could find a new role in Europe’s emerging digital society. The experience many of them have gained in analyzing very large data files, including big data (the Dutch Central Bureau of Statistics, for example), could become directly useful to AI research in Europe. Intensive collaboration between European statistical agencies and external AI research institutions could drive extra public investment in the independence and professionalism of national statistical offices, including Eurostat. In turn, strengthening the independence and professionalism of these agencies would offer a bulwark against the misinformation, polarization and sharpening international tensions that are eroding public trust in science.


[1] Various systematic problems with data collection, verification and algorithmic processing contributed to the scandal; for an overview, see the KPMG report (July 2020): https://open.overheid.nl/documenten/ronl-42970d17-d8b9-41d6-aa8f-d0cc52ab97c8/pdf

[2] As Nitesh Bharosa puts it: “The main difference between bad and ugly GovTech is intentionality. By design, bad GovTech has immoral intentions that attack human rights. With ugly GovTech, there were good intentions in the beginning, but the solutions are poorly designed.” https://ibestuur.nl/artikel/intreerede-nitesh-bharosa-the-good-the-bad-the-ugly-in-govtech/

[3] European Parliament (2022), Artificial intelligence act and regulatory sandboxes, https://www.europarl.europa.eu/RegData/etudes/BRIE/2022/733544/EPRS_BRI(2022)733544_EN.pdf

[4] OECD (2023), Regulatory Sandboxes in Artificial Intelligence, July, https://www.oecd.org/publications/regulatory-sandboxes-in-artificial-intelligence-8f80a0e6-en.htm

[5] Burgelman, Jean-Claude (2023), Getting a grip on data and Artificial Intelligence, Frontiers Policy Lab, May 8th, 2023, https://policylabs.frontiersin.org/content/getting-a-grip-on-data-and-artificial-intelligence

[6] Mons, Barend (2023), Will the generative ‘AI’ hype need top-down regulation to implode?, Frontiers Policy Lab, https://policylabs.frontiersin.org/content/commentary-does-the-hype-of-generative-ai-need-top-down-regulation-or-will-it-implode   

[7] Wilkinson, M. D. et al. (2016), “The FAIR Guiding Principles for scientific data management and stewardship”, Scientific Data 3, 160018, https://www.nature.com/articles/sdata201618

[8] Spence, Michael (2023), “AI and the productivity imperative”, Project Syndicate, August 9th, 2023, https://www.project-syndicate.org/commentary/generative-ai-large-language-models-boost-productivity-growth-by-michael-spence-2023-08

[9] Van Tuinen, Henk (2023), “Ontsluit data van big tech via statistische wetgeving” [“Unlock big tech data through statistical legislation”], Economisch Statistische Berichten, August 2023.

