sdv-dev/SDV

Synthetic data generation for tabular data

Top Related Projects

  • ydata-synthetic: Synthetic data generators for tabular and time-series data
  • AIF360: A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
  • Manifold: A model-agnostic visual debugging tool for machine learning

Quick Overview

The SDV (Synthetic Data Vault) is a Python library that generates synthetic data that mimics the statistical properties of real-world datasets. It is designed to help data scientists and researchers create realistic, privacy-preserving datasets for use in machine learning and data analysis tasks.

Pros

  • Privacy-Preserving: The SDV generates synthetic data that preserves the statistical properties of the original dataset while removing any personally identifiable information, making it a useful tool for working with sensitive data.
  • Flexible and Extensible: The SDV supports a wide range of data types, including tabular, relational, and time-series data, and can be extended to handle custom data formats (see the module sketch after this list).
  • Scalable: The SDV can handle large datasets and can generate synthetic data at scale, making it suitable for a variety of use cases.
  • Open-Source: The SDV is an open-source project, which means that it is freely available and can be customized and extended by the community.
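
As a rough illustration of that flexibility, each data modality maps onto its own synthesizer module in the current sdv package; a minimal sketch, assuming the 1.x module layout (GaussianCopulaSynthesizer, HMASynthesizer, PARSynthesizer):

# Each data modality has its own family of synthesizers in SDV 1.x
from sdv.single_table import GaussianCopulaSynthesizer  # single tabular datasets
from sdv.multi_table import HMASynthesizer              # relational / multi-table datasets
from sdv.sequential import PARSynthesizer               # sequential / time-series datasets

# All of them follow the same metadata-driven fit/sample workflow:
#   synthesizer = SomeSynthesizer(metadata)
#   synthesizer.fit(real_data)
#   synthetic_data = synthesizer.sample(...)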

Cons

  • Complexity: The SDV can be complex to set up and configure, especially for users who are not familiar with machine learning and data generation techniques.
  • Performance: Generating large synthetic datasets can be computationally intensive, and the performance of the SDV may be a concern for some users.
  • Accuracy: While the SDV aims to generate realistic synthetic data, the accuracy of the generated data may not be perfect, and users should carefully evaluate the quality of the synthetic data before using it in their applications.
  • Limited Documentation: The SDV has a relatively small community and may not have as much documentation and support as some other data generation tools.

Code Examples

Here are a few examples of how to use the SDV library:

  1. Generating Synthetic Tabular Data:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

# Load a demo dataset along with its metadata
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

# Fit a GaussianCopula synthesizer to the data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=1000)
  2. Generating Synthetic Relational (Multi-Table) Data:
from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer

# Load a demo dataset made up of multiple connected tables
real_data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels')

# Fit an HMA synthesizer to the relational data
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data for every table
synthetic_data = synthesizer.sample(scale=1)
  3. Evaluating the Quality of Synthetic Data:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality

# Load a demo dataset along with its metadata
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

# Fit a GaussianCopula synthesizer and generate synthetic data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)

# Evaluate the quality of the synthetic data (prints an overall quality score)
quality_report = evaluate_quality(real_data, synthetic_data, metadata)

Getting Started

To get started with the SDV, you can follow these steps:

  1. Install the SDV library using pip:
pip install sdv
  2. Import the necessary modules and load a demo dataset along with its metadata:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')
  3. Fit a GaussianCopula synthesizer to the data and generate synthetic data:
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
  4. Evaluate the quality of the synthetic data:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(real_data, synthetic_data, metadata)
  5. Explore the SDV documentation and try out different synthesizers and data modalities to suit your needs; for example, you can swap in the CTGAN model as shown below.
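
Swapping in the deep-learning-based CTGAN model is a one-line change; a short sketch, assuming the CTGANSynthesizer class from sdv.single_table and its epochs parameter:

from sdv.single_table import CTGANSynthesizer

# CTGAN is a GAN-based model; it typically needs longer training than GaussianCopula
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)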

Competitor Comparisons

ydata-synthetic: Synthetic data generators for tabular and time-series data

Pros of ydata-synthetic

  • More focused on privacy-preserving synthetic data generation
  • Offers specialized algorithms for time series data
  • Provides a user-friendly interface for data scientists and analysts

Cons of ydata-synthetic

  • Less comprehensive documentation compared to SDV
  • Smaller community and fewer contributors
  • More limited in terms of supported data types and structures

Code Comparison

SDV example:

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)

ydata-synthetic example:

from ydata_synthetic.synthesizers import RegularSynthesizer

synthesizer = RegularSynthesizer(modelname='CTGAN', epochs=300)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(1000)

Both libraries expose similar high-level fit/sample APIs for generating synthetic data. SDV is organized around metadata and modality-specific synthesizers (single table, multi table, sequential), while ydata-synthetic centers on GAN-based models configured directly through training parameters such as the number of epochs. SDV's metadata-driven approach may be friendlier for beginners, while ydata-synthetic offers more direct control over the underlying training process.

AIF360: A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

Pros of AIF360

  • Focuses specifically on AI fairness and bias mitigation
  • Provides a comprehensive set of fairness metrics and algorithms
  • Offers educational resources and tutorials on AI ethics

Cons of AIF360

  • More specialized and narrower in scope than SDV
  • May have a steeper learning curve for users new to fairness concepts
  • Less frequent updates and smaller community compared to SDV

Code Comparison

AIF360:

from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = BinaryLabelDataset(...)
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups, privileged_groups)

SDV:

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=1000)

The code snippets demonstrate the different focus areas of the two libraries. AIF360 is centered around fairness metrics and bias detection, while SDV is geared towards synthetic data generation and handling metadata. AIF360 requires more specific inputs related to privileged and unprivileged groups, reflecting its emphasis on fairness analysis. SDV's API is more straightforward for general data synthesis tasks.

Manifold: A model-agnostic visual debugging tool for machine learning

Pros of Manifold

  • Focuses on visual debugging and model performance analysis
  • Provides interactive visualizations for machine learning workflows
  • Supports multiple ML frameworks (TensorFlow, PyTorch, etc.)

Cons of Manifold

  • More specialized tool for ML model analysis, less versatile than SDV
  • Steeper learning curve for non-ML practitioners
  • Less active development and community support compared to SDV

Code Comparison

SDV example:

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=500)

Manifold example:

from manifold import Manifold

manifold = Manifold()
manifold.add_data(features, predictions, labels)
manifold.run()

Key Differences

SDV is a comprehensive synthetic data generation tool, while Manifold focuses on ML model analysis and debugging. SDV offers broader data synthesis capabilities across various domains, whereas Manifold excels in providing visual insights into model performance. SDV is more accessible for general data tasks, while Manifold caters to ML practitioners seeking in-depth model analysis.


README


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.


Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

:brain: Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

:bar_chart: Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

:arrows_counterclockwise: Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization and define business rules in the form of logical constraints.
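
For instance, business rules can be attached to a synthesizer before fitting. Below is a rough sketch, assuming the add_constraints API with the predefined Inequality constraint from recent SDV releases and the checkin_date/checkout_date columns of the hotel demo introduced in Getting Started below; the exact constraint API may vary between versions:

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)

# Require every synthetic row to satisfy checkin_date <= checkout_date
synthesizer.add_constraints(constraints=[{
    'constraint_class': 'Inequality',
    'constraint_parameters': {
        'low_column_name': 'checkin_date',
        'high_column_name': 'checkout_date',
    },
}])

synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=500)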

Important Links

  • Tutorials: Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself.
  • :book: Docs: Learn how to use the SDV library with user guides and API references.
  • :orange_book: Blog: Get more insights about using the SDV, deploying models and our synthetic data community.
  • Community: Join our Slack workspace for announcements and discussions.
  • :computer: Website: Check out the SDV website for more information about the project.

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdv
conda install -c pytorch -c conda-forge sdv

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')

Single Table Metadata Example

The demo also includes metadata, a description of the dataset, including the data types in each column and the primary key (guest_email).
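
To see exactly what that description contains, you can inspect the metadata object directly; a small sketch, assuming the to_dict method available on SDV metadata objects:

# Print the metadata as a plain dictionary: column sdtypes, the primary key, etc.
print(metadata.to_dict())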

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)

And now the synthesizer is ready to create synthetic data!

synthetic_data = synthesizer.sample(num_rows=500)

The synthetic data will have the following properties:

  • Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data so you don't expose the real values.
  • Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check in dates and the correlations between room rate and room type are preserved.
  • Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connections between primary and foreign keys make sense (see the quick check below).
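
A quick way to spot-check the anonymization and key properties above is with plain pandas; a small sketch, assuming the demo's guest_email and credit_card_number column names:

# The primary key should be unique across all sampled rows
assert synthetic_data['guest_email'].is_unique

# Anonymized columns should not leak any real values
leaked = set(synthetic_data['credit_card_number']) & set(real_data['credit_card_number'])
print(f'Real credit card numbers leaked: {len(leaked)}')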

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)
Generating report ...

(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)
    
fig.show()

Real vs. Synthetic Data
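
Beyond the plots, the report returned by evaluate_quality can also be queried programmatically; a brief sketch, assuming it exposes the get_score and get_details accessors of the underlying SDMetrics quality report:

# Overall score as a number between 0 and 1
print(quality_report.get_score())

# Per-column breakdown for one of the report's properties
print(quality_report.get_details(property_name='Column Shapes'))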

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.
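
As one example of the sequential modality, here is a rough sketch of the PAR workflow, assuming the sdv.sequential.PARSynthesizer API and the 'nasdaq100_2019' sequential demo dataset name:

from sdv.datasets.demo import download_demo
from sdv.sequential import PARSynthesizer

# Load a sequential demo dataset (multiple time series identified by a sequence key)
real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019')

# Fit the PAR synthesizer and sample brand new sequences
synthesizer = PARSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_sequences=10)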

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}



The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.