01 Feb 2025

Our Data is Disappearing

Page not found!

I didn't intend to make this newsletter into a political diatribe.

Sure, politics is part of all our lives (or it should be, if we're paying attention), but I want this newsletter to be about topics and issues affecting the data sciences milieu. That is going to overlap on occasion with the world of politics, but I didn't expect the overlap to be quite so rapid and all-consuming, with the change in administrations less than two weeks ago.

Seriously, it hasn't yet been two weeks, and already I'm exhausted.

Our data is disappearing.

Data sets, many housed on Federal government servers, are actively being removed. Here are some examples from the past few days.

The National Science Foundation (NSF) maintains a store of data sets that is required by the OPEN Government Data Act of 2018 – remind me who was president in 2018. At this writing, there are 306,388 datasets in the catalog. Last week, there were over 330,000 records. Roughly 7 to 8% of the data sets collected and archived by the Federal government appear to just be gone.
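
As a rough check on that arithmetic, here is a minimal sketch using the two catalog counts quoted above. It treats 330,000 as a lower bound for last week's count, which is an assumption, since the exact earlier figure isn't given here. The function names are purely illustrative.

```python
def percent_drop(before: int, after: int) -> float:
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100


def implied_before(after: int, drop_pct: float) -> float:
    """Earlier count that would produce a given percentage drop."""
    return after / (1 - drop_pct / 100)


current_count = 306_388          # catalog size quoted above
previous_lower_bound = 330_000   # assumption: "over 330,000" taken as a floor

# Drop implied by the lower bound; a larger earlier count means a larger drop.
print(f"{percent_drop(previous_lower_bound, current_count):.1f}%")

# Earlier count that would correspond to an 8% drop.
print(round(implied_before(current_count, 8)))
```

With these inputs the drop comes out a little above 7%. An earlier count of roughly 333,000 would put it at 8%.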

The Census is more than just the household survey we respond to every ten years. It encompasses data from more than 130 surveys and programs that are collected on an ongoing basis (e.g., the American Community Survey). There are more than 2.5 million data sets and tables available at census.gov that are used by researchers, myself included, every day. As of yesterday evening, the entire census.gov landing page was down. The data portal was still available, but it's nearly impossible to know what had changed. The landing page has been restored as of this morning, which is great news, but again, who knows with what changes or for how long.

Nevertheless, it is possible to spot some specific changes that have occurred. Searching, for example, for "Sexual Orientation & Gender Identity" (SOGI data) returns a bunch of links, but clicking on any of them returns a "page not found" error. The same happens when searching for any topic related to "LGBTQ." A December 2022 report titled "LGBT Adults Report Anxiety, Depression at All Ages" is now missing.

Public health data has been "disappeared" from the Centers for Disease Control and Prevention's (CDC) website. The data portal at the CDC was temporarily removed last night, similar to the landing page at census.gov. Replacing the CDC portal was a landing page, stating "the website will resume operations once in compliance with [Trump's] Executive Orders." The portal is back this morning, but it is not normal for these sites to be down or missing data.

There are conspicuous changes that have occurred at the CDC. The Morbidity and Mortality Weekly Report (MMWR), which has been issued every week since 1952, has not been issued since January 16. A search this morning for the term "transgender" on the CDC's data portal returned 0 results.

These are just a few of the data changes that are most visible to the public.

Maintaining and protecting data is critical to the proper functioning of society. If we lose access to data, we don't just lose historical information. We lose institutional memory. We also lose the ability to react to current and future threats. The COVID-19 pandemic of 2020-22 has a lot to teach us about future pandemics such as the burgeoning Avian Flu (H5N1). If we lose data about the successes and failures in dealing with past health crises, it is going to affect how we deal (or fail to deal) with future ones.

Fortunately, there are groups and institutions seeking to preserve some public access to data.

The Wayback Machine allows people to view what a website looked like at some point in the past. By comparing what a website looked like last month to what it looks like today, we can identify what is missing. That takes a lot of effort, though, and the Wayback Machine doesn't archive the underlying datasets, just what a website once looked like.
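
For anyone who wants to automate part of that comparison, here is a minimal sketch, not a finished tool. It relies on the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`), which returns the archived capture closest to a given date. The function names and the idea of comparing two lists of links are my own illustrative assumptions. Extracting the links from each page is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str) -> str:
    """Build a query for the capture closest to `timestamp` (YYYYMMDD)."""
    return WAYBACK_API + "?" + urlencode({"url": page_url, "timestamp": timestamp})


def closest_snapshot(api_response: dict):
    """Return the archived snapshot URL from a parsed response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def fetch_closest_snapshot(page_url: str, timestamp: str):
    """Network call: ask the Wayback Machine for the nearest capture."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


def missing_links(old_links, new_links):
    """Links present in the older capture but absent from the current page."""
    return sorted(set(old_links) - set(new_links))
```

Calling `fetch_closest_snapshot("census.gov", "20250101")` requires network access and returns the address of the nearest archived capture, if one exists. Passing the links from that capture and from today's page to `missing_links` gives a first list of pages to check by hand.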

The End of Term Web Archive (EOT) is a collaborative group urging people to identify current URLs that are at risk of being "disappeared" so that they can be archived.

A group of students and researchers at the Harvard School of Public Health is working to archive public data sets using a project called the Harvard Dataverse.

I'm glad that people are being proactive in trying to identify and restore many of these missing data repositories. I hope, though, that we can impress on the government how important it is that it maintain these data sources itself, so that they remain free and open to the public (as required by law) and untainted by politics.

As with many things about this new administration, my hopes are not high.