Are we entering a Data Winter? On the urgent need to preserve data access for the public interest  

Stefaan G. Verhulst 

Co-Founder and Chief Research and Development Officer of the Governance Laboratory at New York University (NYU)

Co-Founder and Principal Scientific Advisor of The Data Tank

Research Professor, NYU Center for Urban Science + Progress


Introduction  

In an era where data drives decision-making, the accessibility of data for public interest purposes has never been more crucial. Whether shaping public policy, responding to disasters, or empowering research, data plays a pivotal role in our understanding of complex social, environmental, and economic issues. In 2015, I introduced the concept of Data Collaboratives[1] to advance new and innovative partnerships between the public and private sectors that could make data more accessible for public interest purposes. More recently, I have been advocating for a reimagined approach to data stewardship[2] to make data collaboration more systematic, agile, sustainable, and responsible. 

Despite many advances, the project of opening access to data is proving increasingly challenging. Indeed, unless we step up our efforts in 2024, we may be entering a prolonged data winter, in which data assets that could be leveraged for the common good are instead frozen and immobilized. The term is analogous to previous Artificial Intelligence winters[3], which were marked by reduced funding and interest in AI research. This blog examines some developments driving this potential data winter and raises a number of concerns about the current state of data accessibility and its implications for the public interest. We conclude by calling for a new Decade of Data[4], one marked by a reinvigorated commitment to open data and data reuse for the public interest.

  

A Digital Dark Age? Decreased Access to Social Media and Climate Data   

Among the most troubling trends in the digital landscape is a sharp decrease in access to datasets that were previously available for researchers and others working in the public interest. A good example can be found in the limits placed on social media sites like Facebook and X (formerly Twitter). Not long ago, such sites were rich, real-time libraries of public sentiment and behavior. Researchers relied on the resulting data to gain a deeper understanding[5] of diverse phenomena, from political crises to epidemic patterns and natural disasters. However, the landscape in 2024 starkly contrasts with this once-open data environment.   

This trend is highlighted in a recent article by Gina Neff in Wired[6], which  sheds light on restrictions on social media platform data for computational social science[7]. Neff describes how social media, once a goldmine for researchers seeking insights into societal trends, political movements, and human behavior, is now entering a phase of restricted access, ushering in what she describes as a “grim digital dark age.” One example can be found in the decision[8] by Elon Musk to end free access to Twitter's API, which dealt a significant blow to the research community. This move stands as a stark reminder that the modern Internet is increasingly controlled by a small number of gatekeepers who dictate data access.  

Even when researchers are granted access, the use of policies such as Meta’s “independence by permission” reveals a troubling dynamic: control over what types of questions can be asked (and who can ask them) continues to rest with the companies. This control significantly limits the scope and independence of research. All told, despite efforts[9] by organizations such as the European Digital Media Observatory to create an independent intermediary body to support research on digital platforms, the notion of closer cooperation with platform companies to harness data for public good seems increasingly illusory.

A similar trend is evident in the realm of climate data. Once largely the domain of public research institutions and widely accessible, climate data is increasingly being treated as a lucrative asset by the private sector. Another recent article[10], this one by Justin S. Mankin, discusses this trend, showing how climate data has gradually shifted from a public good to a commodity governed by market forces. The privatization of climate data—evidenced, for instance, by the existence of a vibrant market for climate data and risk models—raises important questions about  social equity and justice. Today, there exists a growing divide between those who can afford such data and insights and those who cannot. This divide could lead to scenarios where the wealthy are better equipped to adapt to climate risks, while the less affluent are left vulnerable. Such a scenario perpetuates existing inequalities and, more broadly, undermines the very essence of using data as a tool for broader societal benefit.  

  

Generative AI-nxiety and Legislative Inactivity  

The trend against data openness has been exacerbated by two more general phenomena in the data sphere: the rapid ascent of generative AI, and a general stalling on the policy and regulatory front.  

Generative AI represents a double-edged sword in the context of data accessibility. In theory, it offers immense potential for innovation and democratizing access to data and knowledge. In practice, however, the fear of misuse and the lack of clear regulatory frameworks around how data (and what data) is used in AI models are contributing to a wave of apprehension (sometimes termed "Generative AI-nxiety"[11]) and a general tightening of restrictions. This apprehension has significant implications for data sharing, particularly in the context of public interest reuse and the development of foundational AI models. As illustrated by the recent New York Times case[12] against OpenAI and Microsoft, concerns over the unauthorized re-use of sensitive or proprietary information may be valid.

The trouble is that “AI-nxiety” is now being applied equally to legitimate and non-discriminatory reuses of data, and as a result is starting to stunt data accessibility for public interest purposes. Overall, the apprehension that data could be used to train AI models, whose outputs might then inform decisions with unforeseen consequences, is leading to a more guarded approach to data sharing. This concern is directly related[13] to the quality and features of data made accessible for training. Poor data quality can lead to poor decision-making by AI systems. And if an AI system is trained on biased data, it can perpetuate or even amplify these biases, leading to unfair or discriminatory outcomes. This trend poses a significant challenge for researchers and organizations that rely on open data to address societal issues, develop public policies, and advance scientific understanding.

The difficulties of balancing legitimate and illegitimate uses of data could, in theory, be managed with a responsible data framework (applicable to AI models, and beyond). Yet the advent of a potential data winter is also marked, and being hastened, by a worrying stagnation in the development of open data policies and regulations. Despite growing recognition of the importance of data for public interest purposes, there has been a noticeable lack of recent progress in the development and implementation of open data policies. For instance, our repository on open data policies and regulations[14] has not witnessed any meaningful advances in the past year. Likewise, the European Data Act, which seemed to hold considerable promise for opening up business data, serves as a case study of recent legislative shortcomings. One of the main issues has been the lack of a clear implementation pathway, leaving a gap between legislative intent and practical application. This gap not only hinders the accessibility of business data, but also reflects a broader trend of ineffective policy-making in the data domain.

  

Conclusion: From a Data Winter to a Decade of Data  

As we navigate through the complexities of data accessibility in today's world, it is evident that the battle to keep data open and accessible for public interest purposes is facing significant challenges. Despite these hurdles, the need for accessible data in the public interest has never been more critical. Data drives our understanding of complex global issues, informs policy decisions, and fuels scientific advancements. The ongoing challenges underscore the need for a balanced approach that safeguards community and proprietary interests without stifling the flow of data that can benefit society at large.  

Looking ahead, it is imperative that stakeholders from various sectors (governments, private organizations, academia, and civil society) collaborate to halt this trend of enclosure and hoarding by calling for a Decade of Data[15], so as to forge pathways that ensure data remains a tool for public good. This requires not only effective and pragmatic legislation but also a cultural shift in how we perceive and value data, and investment in human infrastructure, such as data stewards. The goal should be to create an ecosystem where data is not just a commodity to be traded but a resource to empower communities and science and foster a more informed, equitable world. All of this should go hand in hand with, and can be facilitated by, increased digital self-determination[16].

At the global level, we stand at a pivotal juncture, with a unique opportunity to redefine the trajectory of data cooperation—not in the distant future but in the coming months. This change hinges on the decisions to be made during the Global Digital Compact[17], where world leaders will determine the scope and nature of their collaboration on digital matters. The current landscape is marked by a trend towards isolation, with “small gardens” surrounded by “ever higher walls.”   

The Global Digital Compact presents an opportunity for world leaders to make a concerted effort towards enhancing digital cooperation, aiming to lower these barriers. It's crucial to recognize data as a fundamental cornerstone of the AI era, not merely a byproduct. Such recognition underscores the need for a balanced approach that fosters open data exchange while ensuring robust privacy and security measures. By doing so, the Compact has the potential to lay the groundwork for a more interconnected and responsible digital future, where data collaboration and innovation go hand in hand with ethical considerations and global cooperation. It’s not too late to move from a limiting, chilling data winter to an enabling, socially beneficial data decade. 


[1] https://sverhulst.medium.com/data-collaboratives-exchanging-data-to-improve-people-s-lives-d0fcfc1bdd9a

[2] https://medium.com/data-stewards-network/data-stewardship-re-imagined-capacities-and-competencies-d37a0ebaf0ee

[3] https://en.wikipedia.org/wiki/AI_winter

[4] https://unu.edu/sites/default/files/2023-10/call%20for%20international%20decade%20data%20.pdf

[5] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3141457

[6] https://www.wired.com/story/the-new-digital-dark-age/

[7] https://link.springer.com/book/10.1007/978-3-031-16624-2

[8] https://techcrunch.com/2023/02/01/twitter-to-end-free-access-to-its-api/#:~:text=Apps-,Twitter%20to%20end%20free%20access%20to%20its,Elon%20Musk's%20latest%20monetization%20push&text=Twitter%20will%20discontinue%20offering%20free,avenues%20to%20monetize%20the%20platform.

[9] https://edmo.eu/2023/05/15/launch-of-the-edmo-working-group-for-the-creation-of-an-independent-intermediary-body-to-support-research-on-digital-platforms/

[10] https://www.nytimes.com/2024/01/20/opinion/climate-risk-disasters-data.html

[11] https://hbr.org/2023/08/generative-ai-nxiety

[12] https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html

[13] https://www.lorentzcenter.nl/the-road-to-fair-and-equitable-science.html

[14] https://repository.opendatapolicylab.org/

[15] https://unu.edu/publication/unlocking-potential-call-international-decade-data

[16] https://idsd.network/

[17] https://www.un.org/techenvoy/global-digital-compact
