Census Bureau's use of 'synthetic data' worries researchers
Orlando, Fla. — First came the “noise” — small errors the U.S. Census Bureau decided to introduce into the 2020 census data to protect participants' privacy. Now the bureau is looking into “synthetic data,” manipulating the numbers widely used for economic and demographic research, to obscure the identities of people who provided information.
The moves have some researchers up in arms, worried that the statistical agency could sacrifice accuracy in its zeal to protect privacy.
Census Bureau statisticians disclosed at a virtual conference last week that over the next three years they will work toward developing a method to create “synthetic data” for files on individuals and homes that already are devoid of personalized information. These files, known as American Community Survey microdata, are used by researchers to create customized tables tailored to their research.
Census Bureau statisticians said more privacy protections are needed as technological innovations magnify the threat of people being identified through their survey answers, which are confidential. Computing power is now so vast that it can easily crunch third-party data sets that combine personal information from credit rating and social media companies, purchasing records, voting patterns and public documents, among other things.
“It’s a balancing act. The law requires us to do competing things. We need to release statistics on the nation to allow people to make useful decisions. But we also have to protect the privacy of our respondents,” said Rolando Rodriguez, a Census Bureau statistician, at the conference.
But critics say the proposal, coupled with an ongoing effort to add small inaccuracies to the 2020 census data in order to protect participants' privacy, undermines the Census Bureau's credibility as the go-to provider of precise data about the U.S. population.
University of Minnesota demographer Steven Ruggles said bluntly that synthetic data “will not be suitable for research.”
“The Census Bureau is inventing imaginary threats to confidentiality to sharply reduce public access to data,” Ruggles said. “I do not think this will stand, because society needs information to function.”
The microdata are gathered every year from the American Community Survey with a sample size of 3.5 million households, extrapolated across populations of all sizes, from the entire nation down to neighborhoods. This provides a wide range of estimates on the nation’s demographic makeup and housing characteristics. The microdata are used in the drafting of around 12,000 research papers a year, Ruggles said.
The synthetic data are created by taking variables in the microdata to build models recreating the interrelationships of the variables and then constructing a simulated population based on the models. Scholars would conduct their research using the simulated population — or the synthetic data — and then submit it, if they want, to the Census Bureau for double checking against the real data to make sure their analyses are correct.
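The model-then-simulate process described above can be illustrated with a toy sketch. This is not the Census Bureau's method, which the bureau has not finalized; it assumes a made-up two-variable data set (age group and housing tenure) and a deliberately simple model that records how often each tenure value occurs within each age group. The function names `fit_model` and `synthesize` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_model(records):
    """Estimate how often each first value occurs, and how often each
    second value occurs given the first, from (a, b) pairs."""
    marginal = Counter(a for a, _ in records)
    conditional = defaultdict(Counter)
    for a, b in records:
        conditional[a][b] += 1
    return marginal, conditional


def synthesize(model, n, rng):
    """Draw n simulated (a, b) records from the fitted model."""
    marginal, conditional = model
    a_values = list(marginal)
    a_weights = [marginal[a] for a in a_values]
    simulated = []
    for _ in range(n):
        # Sample the first variable, then the second given the first.
        a = rng.choices(a_values, weights=a_weights)[0]
        b_counts = conditional[a]
        b_values = list(b_counts)
        b_weights = [b_counts[b] for b in b_values]
        b = rng.choices(b_values, weights=b_weights)[0]
        simulated.append((a, b))
    return simulated


if __name__ == "__main__":
    # Made-up illustrative records, not real survey data.
    real = [
        ("under 35", "rent"), ("under 35", "rent"), ("under 35", "own"),
        ("35 and over", "own"), ("35 and over", "own"),
    ]
    model = fit_model(real)
    synthetic = synthesize(model, 10, random.Random(0))
    print(Counter(synthetic))
```

In this sketch no simulated record is a copy of a specific respondent's row, but the simulated population can only contain combinations the model has already seen. An older renter, for instance, never appears because none was in the input.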
Ruggles said new discoveries in data will be missed since the models only capture what is already known.
Another problem is that synthetic data can amplify an outlier, said David Swanson, a professor emeritus of sociology at the University of California, Riverside. In a health study where one person engages in risky behavior multiple times but others don't, for example, the synthetic data can make the risky behavior seem more widespread than it actually is.
There are benefits, though, such as the ability to get details about people at really small geographic levels such as neighborhood blocks, said Cornell University economist Lars Vilhuber, who has done research on the method. The synthetic data make that possible because they protect privacy, he said.
“You can actually get far more detail into the data than with traditional methods,” Vilhuber said.
The Census Bureau said in a statement on Thursday that it hasn't made any final decisions on the use of synthetic data in the American Community Survey and that it welcomed feedback from researchers.
The Census Bureau has taken other recent steps to protect individuals’ privacy, which has gotten harder in the face of a proliferation of outside data sources. This year, the bureau proposed using housing units instead of people when defining an urban area. And it has drawn fierce criticism for using a statistical technique known as “differential privacy” in 2020 census data that will be used for drawing congressional and legislative districts.
Differential privacy adds mathematical “noise,” or intentional errors, to the data to obscure any given individual’s identity while still providing statistically valid information. It has been challenged in court by the state of Alabama, which says its use will result in inaccurate data.
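As a rough illustration of how such noise works, the sketch below applies the Laplace mechanism, a standard textbook way of achieving differential privacy for a count. It is an assumption for illustration only and should not be read as the bureau's actual 2020 census procedure, which the article does not describe. The names `laplace_noise` and `noisy_count` and the parameter `epsilon` (the privacy budget, where smaller values mean more noise) are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    exponential draws that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise of scale sensitivity / epsilon.

    A sensitivity of 1 assumes that adding or removing one person
    changes the count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2020)
    # Hypothetical block with 100 residents, published five times over.
    for _ in range(5):
        print(round(noisy_count(100, epsilon=0.5, rng=rng), 1))
```

Each published figure is off by a small random amount, so any single number is slightly wrong, yet the errors average out to zero over many draws. That trade-off between the accuracy of individual figures and overall statistical validity is the one at issue in the dispute.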
“The Census Bureau is saying this is in the tradition of what they have always done” in protecting privacy, said historian Margo Anderson, a professor at the University of Wisconsin-Milwaukee. “There’s an increasingly substantial organization of critics saying this is completely different. They say, ‘You have never made the data intentionally inaccurate.’”
The Census Bureau first floated the idea of using synthetic data three years ago, but concerns over that and differential privacy got shoved aside after the Trump administration tried unsuccessfully to add a citizenship question to the 2020 census questionnaire and the pandemic challenged the nation's head count last year, Anderson said.
For Swanson, the Census Bureau's efforts at privacy remind him of the quote that reporter Peter Arnett attributed to an unnamed U.S. military official during the Vietnam War: “We had to destroy the town in order to save it.”
“I feel they literally would destroy the census data to save it from an uncertain threat,” Swanson said. “If they destroy the data, they are going to destroy the bureau.”