Evaluating the statistical realism of LLM-generated social science data

Publication type

Journal Article

Authors

Publication date

May 12, 2026

Summary

Large language models (LLMs) enable the generation of data that could potentially be analyzed for social research. While the need for assessing the validity of such AI-generated data is widely recognized, we do not yet have a coherent framework for assessment. This work introduces SSDataBench, a systematic benchmark of population-level statistical realism in LLM-generated social science data. By evaluating 15 LLMs in generating data across five types of statistical patterns against real data from four longitudinal and three cross-sectional social science surveys, we quantify how well LLM-generated data reproduce key statistical moments in actual surveys and reveal representational limitations in current LLMs under sparse conditioning settings. We further provide preliminary evidence that domain-specific training can significantly improve population-level statistical realism.
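As a rough illustration of what comparing "key statistical moments" between real and generated data can look like, the following minimal Python sketch computes the gap in mean and variance between two samples. It is not taken from SSDataBench: the function name `moment_gaps`, the variable names, and the toy values are all hypothetical, and the benchmark's actual metrics may differ.

```python
from statistics import fmean, pvariance


def moment_gaps(real, synthetic):
    """Return absolute gaps in the first two moments of two numeric samples.

    Hypothetical helper for illustration only; not part of SSDataBench.
    """
    return {
        "mean_gap": abs(fmean(real) - fmean(synthetic)),
        "variance_gap": abs(pvariance(real) - pvariance(synthetic)),
    }


# Toy values invented for this example; they come from no survey.
real_years_of_schooling = [10, 12, 12, 14, 16]
synthetic_years_of_schooling = [12, 12, 13, 13, 14]

gaps = moment_gaps(real_years_of_schooling, synthetic_years_of_schooling)
print(gaps)
```

In this toy case the two samples share the same mean but differ in spread, which is why a comparison that stops at the mean can miss a mismatch that the variance exposes.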

Published in

Proceedings of the National Academy of Sciences of the United States of America

Volume

123

DOI

https://doi.org/10.1073/pnas.2538145123

ISSN

0027-8424

Subjects

Notes

Copyright © 2026 the Author(s). Published by PNAS.

Open Access

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

Code: The code used in all experiments is publicly available at https://github.com/lszshu/SSDataBench. The raw datasets include NLSY (47), CFPS (51), Add Health (52), Understanding Society (53), U.S. Census (54), CPS-ASEC (55), and GSS (56). Processed datasets that permit public release and all data processing scripts are available on GitHub (59).

