Evaluating the statistical realism of LLM-generated social science data - Institute for Social and Economic Research (ISER)

Publication type

Journal Article

Authors

Publication date

May 12, 2026

Summary:

Large language models (LLMs) enable the generation of data that could potentially be analyzed for social research. While the need for assessing the validity of such AI-generated data is widely recognized, we do not yet have a coherent framework for assessment. This work introduces SSDataBench, a systematic benchmark of population-level statistical realism in LLM-generated social science data. By evaluating 15 LLMs in generating data across five types of statistical patterns against real data from four longitudinal and three cross-sectional social science surveys, we quantify how well LLM-generated data reproduce key statistical moments in actual surveys and reveal representational limitations in current LLMs under sparse conditioning settings. We further provide preliminary evidence that domain-specific training can significantly improve population-level statistical realism.

Published in

Proceedings of the National Academy of Sciences of the United States of America

Volume

123

DOI

https://doi.org/10.1073/pnas.2538145123

ISSN

0027-8424

Subjects

Notes

Open Access

This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

Code: The code used in all experiments is publicly available at https://github.com/lszshu/SSDataBench. The raw datasets include NLSY (47), CFPS (51), Add Health (52), Understanding Society (53), U.S. Census (54), CPS-ASEC (55), and GSS (56). Processed datasets that can be publicly released and all data processing scripts are available on GitHub (59).

#589078
