neosync
Open source data anonymization and synthetic data orchestration for developers. Create high fidelity synthetic data and sync it across your environments.
Top Related Projects
- DataHub: The Metadata Platform for your Data Stack
- Airbyte: The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
- Meltano: The declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
- Prefect: A workflow orchestration framework for building resilient data pipelines in Python.
- Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.
Quick Overview
Neosync is an open-source data synchronization tool designed to streamline the process of creating and managing synthetic data for development and testing environments. It aims to provide a secure and efficient way to generate realistic test data while maintaining data privacy and compliance.
Pros
- Enhances data privacy and security by generating synthetic data instead of using production data
- Supports multiple data sources and destinations, including popular databases and file formats
- Offers a user-friendly web interface for easy configuration and management
- Provides customizable data generation rules and transformers
Cons
- Still in early development stages, which may lead to potential instability or missing features
- Limited documentation and community support compared to more established tools
- May require additional setup and configuration for complex data scenarios
- Performance and scalability for large datasets are yet to be thoroughly tested
Getting Started
To get started with Neosync, follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/nucleuscloud/neosync.git
   cd neosync
   ```

2. Install dependencies:

   ```bash
   make install
   ```

3. Set up the development environment:

   ```bash
   make dev
   ```

4. Access the web interface at http://localhost:3000 and follow the on-screen instructions to configure your data synchronization jobs.
For more detailed instructions and configuration options, refer to the project's documentation in the repository.
Competitor Comparisons
DataHub: The Metadata Platform for your Data Stack
Pros of DataHub
- More comprehensive data catalog and metadata management system
- Larger community and ecosystem with extensive documentation
- Supports a wider range of data sources and integrations
Cons of DataHub
- More complex setup and configuration process
- Heavier resource requirements for deployment and operation
- Steeper learning curve for new users
Code Comparison
DataHub (Python):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {"type": "mysql", "config": {...}},
    "sink": {"type": "datahub-rest", "config": {...}}
})
pipeline.run()
```
Neosync (Go):
```go
client := neosync.NewClient(neosync.ClientConfig{...})

job := &neosync.Job{
    Source:      neosync.Source{Type: "mysql", Config: {...}},
    Destination: neosync.Destination{Type: "postgres", Config: {...}},
}

err := client.RunJob(job)
```
DataHub offers a more extensive data catalog solution with broader integration capabilities, while Neosync focuses on data synchronization and transformation. DataHub's larger community and ecosystem provide more resources and support, but it comes with increased complexity and resource requirements. Neosync, being more specialized, may offer a simpler setup for specific data synchronization tasks.
Airbyte: The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Pros of Airbyte
- Extensive library of pre-built connectors for various data sources and destinations
- Strong community support and regular updates
- Comprehensive documentation and user guides
Cons of Airbyte
- More complex setup and configuration process
- Higher resource requirements for deployment and operation
- Steeper learning curve for new users
Code Comparison
Neosync (Go):
```go
func (j *Job) Run(ctx context.Context) error {
    // Job execution logic
    return nil
}
```
Airbyte (Java):
```java
public class Job implements Runnable {
    @Override
    public void run() {
        // Job execution logic
    }
}
```
The two projects are written in different languages, with Neosync primarily in Go and Airbyte in Java. The code snippets show a basic job execution structure, but Airbyte's implementation is typically more complex due to its broader scope and features.
Neosync focuses on data synchronization and anonymization, while Airbyte is a more comprehensive data integration platform. Airbyte offers a wider range of connectors and data transformation capabilities, but this comes at the cost of increased complexity and resource requirements. Neosync, being more specialized, may offer a simpler solution for specific use cases related to data synchronization and privacy.
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Pros of Meltano
- More mature project with a larger community and ecosystem
- Supports a wider range of data sources and destinations
- Offers a comprehensive ELT (Extract, Load, Transform) workflow
Cons of Meltano
- Steeper learning curve due to its extensive features
- Requires more setup and configuration for simple tasks
- Heavier resource usage for small-scale projects
Code Comparison
Meltano configuration (meltano.yml):
```yaml
plugins:
  extractors:
    - name: tap-github
      variant: meltanolabs
  loaders:
    - name: target-postgres
      variant: transferwise
```
Neosync configuration (neosync.yaml):
```yaml
sources:
  - name: github
    type: github
destinations:
  - name: postgres
    type: postgres
```
Summary
Meltano is a more comprehensive data integration tool with a broader range of features and integrations, making it suitable for complex data pipelines and large-scale projects. However, this comes at the cost of increased complexity and resource requirements.
Neosync, on the other hand, appears to be a simpler, more focused tool that may be easier to set up and use for smaller projects or specific use cases. It likely has a smaller learning curve but may lack some of the advanced features and integrations offered by Meltano.
The choice between the two would depend on the specific requirements of the project, the scale of data operations, and the level of complexity needed in the data integration process.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More mature and widely adopted project with a larger community
- Extensive documentation and tutorials available
- Supports a wider range of data workflows and integrations
Cons of Prefect
- Steeper learning curve for beginners
- Can be overkill for simpler data synchronization tasks
- Requires more setup and configuration
Code Comparison
Neosync (JavaScript):
```javascript
const job = new Job({
  name: 'sync-data',
  source: 'postgres',
  destination: 'mysql',
  transformations: [maskPII, anonymizeNames]
});
```
Prefect (Python):
```python
from prefect import flow, task

@task
def extract():
    # Extract data from source
    ...

@task
def transform(data):
    # Apply transformations
    ...

@task
def load(data):
    # Load data to destination
    ...

@flow
def etl_flow():
    data = extract()
    transformed = transform(data)
    load(transformed)
```
Neosync focuses on straightforward data synchronization with built-in anonymization features, while Prefect offers a more comprehensive workflow orchestration platform. Neosync may be easier to set up for simple sync tasks, but Prefect provides greater flexibility and scalability for complex data pipelines.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- Mature and widely adopted workflow orchestration platform with a large community
- Extensive library of pre-built operators and integrations
- Flexible scheduling and dependency management capabilities
Cons of Airflow
- Steeper learning curve and more complex setup compared to Neosync
- Heavier resource requirements, especially for large-scale deployments
- Less focused on data synchronization and generation tasks
Code Comparison
Airflow DAG definition:
```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def my_function():
    # Task logic goes here
    ...

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))

task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    dag=dag
)
```
Neosync job configuration:
```yaml
version: 1
jobs:
  - name: example-job
    source:
      postgres:
        connection: source_db
    destination:
      postgres:
        connection: destination_db
    mappings:
      - source_schema: public
        source_table: users
        destination_schema: public
        destination_table: users
```
While Airflow provides a more general-purpose workflow orchestration platform, Neosync focuses specifically on data synchronization and generation tasks. Airflow offers greater flexibility and extensibility, but Neosync provides a more streamlined experience for its targeted use case.
README
Open Source Data Anonymization and Synthetic Data Orchestration
Introduction
Neosync is an open-source, developer-first way to anonymize PII, generate synthetic data and sync environments for better testing, debugging and developer experience.
Companies use Neosync to:
- Safely test code against production data - Anonymize sensitive production data in order to safely use it locally for a better testing and developer experience
- Easily reproduce production bugs locally - Anonymize and subset production data to get a safe, representative data set that you can use to locally reproduce production bugs quickly and efficiently
- High-quality data for lower-level environments - Catch bugs before they hit production when you hydrate your staging and QA environments with production-like data
- Solve GDPR, DPDP, FERPA, HIPAA and more - Use anonymized and synthetic data to reduce your compliance scope and easily comply with laws like HIPAA, GDPR, and DPDP
- Seed development databases - Easily seed development databases with synthetic data for unit testing, demos and more
Features
- Generate synthetic data based on your schema
- Anonymize existing production data for a better developer experience
- Subset your production database for local and CI testing using any SQL query
- Complete async pipeline that automatically handles job retries, failures and playback using an event-sourcing model
- Automatic referential integrity for your data
- Declarative, GitOps-based configs as a step in your CI pipeline to hydrate your CI DB
- Pre-built data transformers for all major data types
- Custom data transformers using JavaScript or LLMs (see the sketch after this list)
- Pre-built integrations with Postgres, MySQL, S3
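As a quick illustration of the custom-transformer idea, here is a minimal JavaScript sketch. It assumes the transformer is handed each incoming column value as `value` and returns the replacement value; the actual transformer interface is described in the Neosync docs, so treat the function signature here as illustrative only.

```javascript
// Illustrative sketch of a custom transformer: mask the local part of an
// email address while keeping the domain, so domain-level grouping and
// joins still behave like production data.
// NOTE: the `value` argument and return-value contract are assumptions
// for this example, not the documented Neosync transformer API.
function transformEmail(value) {
  // Leave nulls, empty strings, and malformed values untouched.
  if (!value || value.indexOf('@') < 1) {
    return value;
  }
  const [localPart, domain] = value.split('@');
  const masked = localPart[0] + '*'.repeat(Math.max(localPart.length - 1, 1));
  return `${masked}@${domain}`;
}

// transformEmail('jane.doe@acme.com') -> 'j*******@acme.com'
```

A transformer along these lines would be attached to the relevant column mapping in a job, so every synced row gets a masked but structurally valid value.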
Getting started
Neosync is fully dockerized, which makes it easy to get up and running.
A compose.yml file at the root of the repo contains production image references that let you get up and running with just a few commands, without having to build anything on your system.
Neosync uses the newer docker compose command (not the legacy docker-compose binary), so be sure to have that installed on your machine.
To start Neosync, clone the repo into a local directory, make sure Docker is installed and running, and then run:

```bash
make compose/up
```

To stop, run:

```bash
make compose/down
```
Neosync will now be available at http://localhost:3000.
The production compose file pre-seeds Neosync with connections and jobs to get you started. Simply run the generate and sync jobs to watch Neosync in action!
Kubernetes, Auth Mode and more
For more in-depth details on environment variables, Kubernetes deployments, and a production-ready guide, check out the Deploy Neosync section of our Docs.
Resources
Some resources to help you along the way:
- Docs for comprehensive documentation and guides
- Discord for discussion with the community and Neosync team
- X for the latest updates
Contributing
We love contributions big and small. Here are just a few ways that you can contribute to Neosync.
- Join our Discord channel and ask us any questions there
- Open a PR (see our instructions on developing with Neosync locally)
- Submit a feature request or bug report
Licensing
We strongly believe in free and open source software, and this repo is available under the MIT Expat license.