Synthetic Data

<aside>

WORKING DRAFT FOR PUBLIC FEEDBACK. For more context on this draft, please see here. Please submit feedback here.

</aside>

Overview

Synthetic data is artificially generated information that mimics the behavior and statistical properties of real-world data. It is produced by algorithms, generative models, or simulations. It can be created by analyzing existing datasets to impute missing data (e.g., generating data points for groups of people who are underrepresented in the original sample) or by using synthetic participants, which are programmed to mimic survey responses based on specific user personas. Synthetic data is now used across several domains (e.g., health, privacy, red teaming) and formats (e.g., vision, audio, and text). Its uses include creating datasets with adversarial examples to detect and understand vulnerabilities in models, covering sensitive topics, and improving models' ability to handle real-world inputs, as an alternative to working with real users or experts to collect data from real-world circumstances.
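As a rough illustration of generating data points for an underrepresented group, the sketch below fits simple per-column statistics to a handful of real records and samples new records from them. The function name `synthesize_rows`, the example columns (age, income), and the choice of independent Gaussian columns are all assumptions made for illustration. Real generators usually also model correlations between columns, which this sketch ignores.

```python
import random
import statistics


def synthesize_rows(real_rows, n_new, rng=None):
    """Sample n_new synthetic rows that match each column's mean and spread.

    Hypothetical helper: every column is treated as an independent Gaussian,
    which is a deliberate simplification. Needs at least two real rows so
    that a standard deviation can be estimated.
    """
    rng = rng or random.Random(0)
    # Transpose the rows into columns so each column can be summarized.
    columns = list(zip(*real_rows))
    stats = [(statistics.mean(col), statistics.stdev(col)) for col in columns]
    # Draw each value of each new row from its column's fitted Gaussian.
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in stats)
        for _ in range(n_new)
    ]


# Made-up (age, income) records standing in for an underrepresented group.
minority_rows = [(34, 52000.0), (41, 61000.0), (29, 48000.0), (37, 58000.0)]
synthetic_rows = synthesize_rows(minority_rows, n_new=500, rng=random.Random(42))
```

The synthetic rows reproduce the column averages of the small sample, but they carry no information beyond it. This is one reason the quality of the starting data matters so much.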

One common method for creating synthetic data uses data seeds: initial examples or pieces of data that guide the generation of additional synthetic data. Generating synthetic data is typically a recursive process. First, data seeds are used to prompt language models to generate new synthetic instructions. Then, those synthetic instructions are given to more advanced language models to create responses, which are used to train the target language model. Ultimately, the purpose of data seeds is to provide a starting point from which algorithms can generate new instances while maintaining specific characteristics or patterns.
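The seed-based process described above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not a reference implementation. `toy_generator` and `toy_teacher` are hypothetical stand-ins for calls to real language models, and every function and parameter name is invented for illustration.

```python
import random
from typing import Callable, Dict, List


def generate_instructions(
    seeds: List[str],
    generator: Callable[[List[str]], str],
    rounds: int,
    per_round: int,
    sample_size: int = 2,
    rng: random.Random = None,
) -> List[str]:
    """Grow a pool of instructions from seeds and return only the new ones."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    seen = set(pool)
    for _ in range(rounds):
        for _ in range(per_round):
            # Prompt the generator with a few examples drawn from the pool.
            examples = rng.sample(pool, k=min(sample_size, len(pool)))
            candidate = generator(examples)
            # New instructions rejoin the pool, so later prompts can build
            # on earlier outputs (the recursive part of the process).
            if candidate not in seen:
                seen.add(candidate)
                pool.append(candidate)
    return pool[len(seeds):]


def build_training_pairs(
    instructions: List[str], teacher: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Ask a (stronger) teacher model to respond to each instruction."""
    return [{"instruction": i, "response": teacher(i)} for i in instructions]


def toy_generator(examples: List[str]) -> str:
    # Placeholder for a language model prompted with seed examples.
    return "Rephrase and combine: " + " | ".join(examples)


def toy_teacher(instruction: str) -> str:
    # Placeholder for a more advanced language model writing a response.
    return "[placeholder response to] " + instruction


seeds = ["Summarize this article.", "Translate this sentence into French."]
new_instructions = generate_instructions(seeds, toy_generator, rounds=2, per_round=3)
training_pairs = build_training_pairs(new_instructions, toy_teacher)
```

In practice, a filtering step (for quality, duplication, or safety) would likely sit between generation and training. It is left out here to keep the sketch short.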

Strengths and Useful Applications

Overcome limitations of real-world data
Customizable
Address some concerns around data consent & privacy
Address some concerns around data enrichment labor conditions
Semi-automated approaches

Potential Risks of Use

Generating synthetic datasets requires good “data seeds”
Lack of documentation / “lineage” and “circumvented consent”
Difficult to establish “ground truth”
Real cost of generating synthetic data
Unclear regulation