neosync
Open source data anonymization and synthetic data orchestration for developers. Create high fidelity synthetic data and sync it across your environments.
Top Related Projects
- DataHub: The Metadata Platform for your Data Stack
- Airbyte: The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
- Meltano: The declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
- Prefect: A workflow orchestration framework for building resilient data pipelines in Python.
- Apache Airflow: A platform to programmatically author, schedule, and monitor workflows.
Quick Overview
Neosync is an open-source data synchronization tool designed to streamline the process of creating and managing synthetic data for development and testing environments. It aims to provide a secure and efficient way to generate realistic test data while maintaining data privacy and compliance.
Pros
- Enhances data privacy and security by generating synthetic data instead of using production data
- Supports multiple data sources and destinations, including popular databases and file formats
- Offers a user-friendly web interface for easy configuration and management
- Provides customizable data generation rules and transformers
Cons
- Still in early development stages, which may lead to potential instability or missing features
- Limited documentation and community support compared to more established tools
- May require additional setup and configuration for complex data scenarios
- Performance and scalability for large datasets are yet to be thoroughly tested
Getting Started
To get started with Neosync, follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/nucleuscloud/neosync.git
   cd neosync
   ```

2. Install dependencies:

   ```bash
   make install
   ```

3. Set up the development environment:

   ```bash
   make dev
   ```

4. Access the web interface at http://localhost:3000 and follow the on-screen instructions to configure your data synchronization jobs.
For more detailed instructions and configuration options, refer to the project's documentation in the repository.
Competitor Comparisons
DataHub: The Metadata Platform for your Data Stack
Pros of DataHub
- More comprehensive data catalog and metadata management system
- Larger community and ecosystem with extensive documentation
- Supports a wider range of data sources and integrations
Cons of DataHub
- More complex setup and configuration process
- Heavier resource requirements for deployment and operation
- Steeper learning curve for new users
Code Comparison
DataHub (Python):
```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {"type": "mysql", "config": {...}},
    "sink": {"type": "datahub-rest", "config": {...}}
})
pipeline.run()
```
Neosync (Go):
```go
client := neosync.NewClient(neosync.ClientConfig{...})

job := &neosync.Job{
    Source:      neosync.Source{Type: "mysql", Config: {...}},
    Destination: neosync.Destination{Type: "postgres", Config: {...}},
}

err := client.RunJob(job)
```
DataHub offers a more extensive data catalog solution with broader integration capabilities, while Neosync focuses on data synchronization and transformation. DataHub's larger community and ecosystem provide more resources and support, but it comes with increased complexity and resource requirements. Neosync, being more specialized, may offer a simpler setup for specific data synchronization tasks.
Airbyte: The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Pros of Airbyte
- Extensive library of pre-built connectors for various data sources and destinations
- Strong community support and regular updates
- Comprehensive documentation and user guides
Cons of Airbyte
- More complex setup and configuration process
- Higher resource requirements for deployment and operation
- Steeper learning curve for new users
Code Comparison
Neosync (Go):
```go
func (j *Job) Run(ctx context.Context) error {
    // Job execution logic
    return nil
}
```
Airbyte (Java):
```java
public class Job implements Runnable {
    @Override
    public void run() {
        // Job execution logic
    }
}
```
The two projects are written in different languages, with Neosync primarily in Go and Airbyte in Java. The code snippets show a basic job execution structure, but Airbyte's implementation is typically more complex due to its broader scope and features.
Neosync focuses on data synchronization and anonymization, while Airbyte is a more comprehensive data integration platform. Airbyte offers a wider range of connectors and data transformation capabilities, but this comes at the cost of increased complexity and resource requirements. Neosync, being more specialized, may offer a simpler solution for specific use cases related to data synchronization and privacy.
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Pros of Meltano
- More mature project with a larger community and ecosystem
- Supports a wider range of data sources and destinations
- Offers a comprehensive ELT (Extract, Load, Transform) workflow
Cons of Meltano
- Steeper learning curve due to its extensive features
- Requires more setup and configuration for simple tasks
- Heavier resource usage for small-scale projects
Code Comparison
Meltano configuration (meltano.yml):
```yaml
plugins:
  extractors:
    - name: tap-github
      variant: meltanolabs
  loaders:
    - name: target-postgres
      variant: transferwise
```
Neosync configuration (neosync.yaml):
```yaml
sources:
  - name: github
    type: github
destinations:
  - name: postgres
    type: postgres
```
Summary
Meltano is a more comprehensive data integration tool with a broader range of features and integrations, making it suitable for complex data pipelines and large-scale projects. However, this comes at the cost of increased complexity and resource requirements.
Neosync, on the other hand, appears to be a simpler, more focused tool that may be easier to set up and use for smaller projects or specific use cases. It likely has a smaller learning curve but may lack some of the advanced features and integrations offered by Meltano.
The choice between the two would depend on the specific requirements of the project, the scale of data operations, and the level of complexity needed in the data integration process.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Pros of Prefect
- More mature and widely adopted project with a larger community
- Extensive documentation and tutorials available
- Supports a wider range of data workflows and integrations
Cons of Prefect
- Steeper learning curve for beginners
- Can be overkill for simpler data synchronization tasks
- Requires more setup and configuration
Code Comparison
Neosync (JavaScript):
```javascript
const job = new Job({
  name: 'sync-data',
  source: 'postgres',
  destination: 'mysql',
  transformations: [maskPII, anonymizeNames]
});
```
Prefect (Python):
```python
from prefect import flow, task

@task
def extract():
    # Extract data from source
    ...

@task
def transform(data):
    # Apply transformations
    ...

@task
def load(data):
    # Load data to destination
    ...

@flow
def etl_flow():
    data = extract()
    transformed = transform(data)
    load(transformed)
```
Neosync focuses on straightforward data synchronization with built-in anonymization features, while Prefect offers a more comprehensive workflow orchestration platform. Neosync may be easier to set up for simple sync tasks, but Prefect provides greater flexibility and scalability for complex data pipelines.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Pros of Airflow
- Mature and widely adopted workflow orchestration platform with a large community
- Extensive library of pre-built operators and integrations
- Flexible scheduling and dependency management capabilities
Cons of Airflow
- Steeper learning curve and more complex setup compared to Neosync
- Heavier resource requirements, especially for large-scale deployments
- Less focused on data synchronization and generation tasks
Code Comparison
Airflow DAG definition:
```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def my_function():
    # Task logic goes here
    ...

dag = DAG('example_dag', start_date=datetime(2023, 1, 1))

task = PythonOperator(
    task_id='example_task',
    python_callable=my_function,
    dag=dag
)
```
Neosync job configuration:
```yaml
version: 1
jobs:
  - name: example-job
    source:
      postgres:
        connection: source_db
    destination:
      postgres:
        connection: destination_db
    mappings:
      - source_schema: public
        source_table: users
        destination_schema: public
        destination_table: users
```
While Airflow provides a more general-purpose workflow orchestration platform, Neosync focuses specifically on data synchronization and generation tasks. Airflow offers greater flexibility and extensibility, but Neosync provides a more streamlined experience for its targeted use case.
README
Open Source Data Anonymization and Synthetic Data Orchestration
Introduction
Neosync is an open-source, developer-first way to anonymize PII, generate synthetic data and sync environments for better testing, debugging and developer experience.
Companies use Neosync to:
- Safely test code against production data - Anonymize sensitive production data in order to safely use it locally for a better testing and developer experience
- Easily reproduce production bugs locally - Anonymize and subset production data to get a safe, representative data set that you can use to locally reproduce production bugs quickly and efficiently
- High-quality data for lower-level environments - Catch bugs before they hit production when you hydrate your staging and QA environments with production-like data
- Solve GDPR, DPDP, FERPA, HIPAA and more - Use anonymized and synthetic data to reduce your compliance scope and easily comply with laws like HIPAA, GDPR, and DPDP
- Seed development databases - Easily seed development databases with synthetic data for unit testing, demos and more
Features
- Generate synthetic data based on your schema
- Anonymize existing production data for a better developer experience
- Subset your production database for local and CI testing using any SQL query
- Complete async pipeline that automatically handles job retries, failures and playback using an event-sourcing model
- Automatic referential integrity for your data
- Declarative, GitOps-based configs as a step in your CI pipeline to hydrate your CI DB
- Pre-built data transformers for all major data types
- Custom data transformers using JavaScript or LLMs (see the sketch after this list)
- Pre-built integrations with Postgres, MySQL, S3
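As a quick illustration of the custom-transformer idea, here is a minimal JavaScript sketch. It assumes the transformer is handed each incoming column value as `value` and returns the replacement value; the actual transformer interface is described in the Neosync docs, so treat the function signature here as illustrative only.

```javascript
// Illustrative sketch of a custom transformer: mask the local part of an
// email address while keeping the domain, so domain-level grouping and
// joins still behave like production data.
// NOTE: the `value` argument and return-value contract are assumptions
// for this example, not the documented Neosync transformer API.
function transformEmail(value) {
  // Leave nulls, empty strings, and malformed values untouched.
  if (!value || value.indexOf('@') < 1) {
    return value;
  }
  const [localPart, domain] = value.split('@');
  const masked = localPart[0] + '*'.repeat(Math.max(localPart.length - 1, 1));
  return `${masked}@${domain}`;
}

// transformEmail('jane.doe@acme.com') -> 'j*******@acme.com'
```

A transformer along these lines would be attached to the relevant column mapping in a job, so every synced row gets a masked but structurally valid value.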
Getting started
Neosync is fully dockerized, which makes it easy to get up and running.
A compose.yml file at the root of the repo contains production image references that let you get up and running with just a few commands, without having to build anything on your system.
Neosync uses the newer docker compose command (not the legacy docker-compose binary), so be sure to have that installed on your machine.
To start Neosync, clone the repo into a local directory, make sure Docker is installed and running, and then run:

```bash
make compose/up
```

To stop, run:

```bash
make compose/down
```
Neosync will now be available at http://localhost:3000.
The production compose file pre-seeds Neosync with connections and jobs to get you started. Simply run the generate and sync jobs to watch Neosync in action!
Kubernetes, Auth Mode and more
For more in-depth details on environment variables, Kubernetes deployments, and a production-ready guide, check out the Deploy Neosync section of our Docs.
Resources
Some resources to help you along the way:
- Docs for comprehensive documentation and guides
- Discord for discussion with the community and Neosync team
- X for the latest updates
Contributing
We love contributions big and small. Here are just a few ways that you can contribute to Neosync.
- Join our Discord channel and ask us any questions there
- Open a PR (see our instructions on developing with Neosync locally)
- Submit a feature request or bug report
Licensing
We strongly believe in free and open source software, and this repo is available under the MIT Expat license.