Publication type
Journal Article
Authors
Publication date
May 12, 2026
Summary:
Large language models (LLMs) enable the generation of data that could potentially be analyzed for social research. While the need for assessing the validity of such AI-generated data is widely recognized, we do not yet have a coherent framework for assessment. This work introduces SSDataBench, a systematic benchmark of population-level statistical realism in LLM-generated social science data. By evaluating 15 LLMs in generating data across five types of statistical patterns against real data from four longitudinal and three cross-sectional social science surveys, we quantify how well LLM-generated data reproduce key statistical moments in actual surveys and reveal representational limitations in current LLMs under sparse conditioning settings. We further provide preliminary evidence that domain-specific training can significantly improve population-level statistical realism.
Published in
Proceedings of the National Academy of Sciences of the United States of America
Volume
123
DOI
https://doi.org/10.1073/pnas.2538145123
ISSN
0027-8424
Subjects
Notes
Copyright © 2026 the Author(s). Published by PNAS.
Open Access
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (CC BY-NC-ND).
Code: The code used in all experiments is publicly available at https://github.com/lszshu/SSDataBench. The raw datasets include NLSY (47), CFPS (51), Add Health (52), Understanding Society (53), U.S. Census (54), CPS-ASEC (55), and GSS (56). Processed datasets that can be released publicly and all data processing scripts are available on GitHub (59).