data-forge-ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Top Related Projects
- pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
- Apache Arrow: A multi-language toolbox for accelerated data interchange and in-memory processing
- Dask: Parallel computing with task scheduling
- Modin: Scale your Pandas workflows by changing a single line of code
- Vaex: Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
- Polars: Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Quick Overview
Data-Forge is a powerful data transformation and analysis toolkit for JavaScript and TypeScript. It provides a fluent API for working with datasets, offering functionality similar to pandas in Python or dplyr in R. Data-Forge is designed to handle both tabular and time series data efficiently.
Pros
- Comprehensive API for data manipulation and analysis
- Supports both JavaScript and TypeScript
- Extensible through plugins
- Well-documented with extensive examples
Cons
- Learning curve for users new to data manipulation libraries
- Performance may be slower compared to native JavaScript operations for large datasets
- Limited built-in visualization capabilities
- Smaller community compared to more established data libraries
Code Examples
Loading and filtering data:
import { readFile } from 'data-forge-fs';
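// parseCSV() returns a promise, so await it before chaining DataFrame methods.
const df = (await readFile('data.csv').parseCSV())
    .parseInts(['age']) // CSV values are read as strings; convert before comparing.
    .where(row => row.age > 30)
    .select(row => ({
        name: row.name,
        age: row.age
    }));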
console.log(df.head(5).toArray());
Performing calculations on columns:
import { DataFrame } from 'data-forge';
const df = new DataFrame({
    columns: {
        A: [1, 2, 3, 4, 5],
        B: [10, 20, 30, 40, 50]
    }
});
const result = df
    .generateSeries({
        C: row => row.A * 2,
        D: row => row.B / 10
    })
    .toArray();
console.log(result);
Grouping and aggregating data:
import { readFile } from 'data-forge-fs';
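// Await the parsed DataFrame first; parseCSV() returns a promise.
const df = (await readFile('sales.csv').parseCSV())
    .parseFloats(['sales', 'price']); // CSV values are strings until parsed.
const summary = df
    .groupBy(row => row.category)
    .select(group => ({
        category: group.first().category,
        totalSales: group.deflate(row => row.sales).sum(),
        averagePrice: group.deflate(row => row.price).average()
    }))
    .toArray();
console.log(summary);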
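For comparison, here is a rough sketch of the same grouping and aggregation in plain JavaScript, without Data-Forge. The rows are made-up sample data standing in for the parsed contents of sales.csv, and the values are assumed to be numbers already.

```javascript
// Hypothetical in-memory rows standing in for sales.csv (already numeric).
const sales = [
    { category: "books", sales: 10, price: 5 },
    { category: "books", sales: 30, price: 15 },
    { category: "toys", sales: 20, price: 8 },
];

// Collect the rows that belong to each category.
const groups = new Map();
for (const row of sales) {
    if (!groups.has(row.category)) {
        groups.set(row.category, []);
    }
    groups.get(row.category).push(row);
}

// Reduce each group to one summary record.
const summary = Array.from(groups, ([category, rows]) => ({
    category,
    totalSales: rows.reduce((total, row) => total + row.sales, 0),
    averagePrice: rows.reduce((total, row) => total + row.price, 0) / rows.length,
}));

console.log(summary);
// [ { category: 'books', totalSales: 40, averagePrice: 10 },
//   { category: 'toys', totalSales: 20, averagePrice: 8 } ]
```

The Data-Forge version expresses the same steps as a single chained pipeline, which is the main thing the library adds over hand-written loops here.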
Getting Started
To get started with Data-Forge, follow these steps:
- Install Data-Forge:
npm install data-forge
- Import and use Data-Forge in your TypeScript or JavaScript project:
import { DataFrame } from 'data-forge';
const df = new DataFrame({
    columns: {
        Name: ['John', 'Jane', 'Bob'],
        Age: [25, 30, 35]
    }
});
console.log(df.toString());
- Explore the documentation and examples on the official Data-Forge website to learn more about its capabilities and advanced features.
Competitor Comparisons
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Pros of pandas
- Extensive functionality and mature ecosystem
- Highly optimized for performance with C extensions
- Large community and extensive documentation
Cons of pandas
- Steeper learning curve for beginners
- Memory-intensive for large datasets
- Python-specific, not easily portable to other languages
Code Comparison
pandas:
import pandas as pd
df = pd.read_csv('data.csv')
filtered = df[df['column'] > 5]
result = filtered.groupby('category').mean()
Data-Forge:
import { readFile } from 'data-forge-fs';
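const df = (await readFile('data.csv').parseCSV()).parseFloats(['column']);
const filtered = df.where(row => row.column > 5);
// A DataFrame has no mean(), so average the column within each group.
const result = filtered.groupBy(row => row.category).select(group => ({
    category: group.first().category,
    mean: group.deflate(row => row.column).average()
}));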
Both libraries offer similar functionality for data manipulation, but pandas has a more concise syntax due to its specialized data structures. Data-Forge follows a more functional programming approach and is designed for TypeScript, making it more suitable for JavaScript/TypeScript developers working with data in web applications or Node.js environments.
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pros of Arrow
- Highly performant and memory-efficient columnar memory format
- Supports multiple programming languages and platforms
- Extensive ecosystem and community support
Cons of Arrow
- Steeper learning curve due to its complexity
- May be overkill for smaller-scale data processing tasks
- Requires more setup and configuration
Code Comparison
Arrow (C++):
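#include <arrow/api.h>
#include <memory>
// Builds an Int64 array; each Status result is checked rather than ignored.
arrow::Status BuildValues(std::shared_ptr<arrow::Array>* out) {
    arrow::Int64Builder builder(arrow::default_memory_pool());
    ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4, 5}));
    return builder.Finish(out);
}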
Data-Forge-TS (TypeScript):
import { DataFrame } from 'data-forge';
const df = new DataFrame({
columnNames: ["Value"],
rows: [[1], [2], [3], [4], [5]]
});
Key Differences
- Arrow focuses on efficient data representation and interoperability
- Data-Forge-TS is more oriented towards data manipulation and analysis in TypeScript
- Arrow has a broader scope and supports multiple languages, while Data-Forge-TS is TypeScript-specific
- Arrow is better suited for large-scale data processing, while Data-Forge-TS is more appropriate for smaller datasets and simpler operations
Parallel computing with task scheduling
Pros of Dask
- Designed for large-scale parallel computing and distributed processing
- Integrates well with the Python scientific ecosystem (NumPy, Pandas)
- Supports complex workflows and task scheduling
Cons of Dask
- Steeper learning curve, especially for distributed computing concepts
- More complex setup and configuration for distributed environments
- Potentially overkill for smaller datasets or simpler data processing tasks
Code Comparison
Data-Forge-TS:
import { DataFrame } from 'data-forge';
const df = new DataFrame([
{ A: 1, B: 10 },
{ A: 2, B: 20 },
]);
// deflate() produces a Series, matching the Dask column expression below.
const result = df.deflate(row => row.A * 2);
Dask:
import pandas as pd
import dask.dataframe as dd
df = dd.from_pandas(pd.DataFrame({
'A': [1, 2],
'B': [10, 20]
}), npartitions=2)
result = df['A'] * 2
Key Differences
- Data-Forge-TS is TypeScript-based, while Dask is Python-based
- Dask focuses on distributed computing and big data, while Data-Forge-TS is more suited for in-memory data processing
- Dask has a wider range of data structures and integrations with scientific libraries
Modin: Scale your Pandas workflows by changing a single line of code
Pros of Modin
- Designed for large-scale data processing with distributed computing capabilities
- Seamless integration with existing pandas code, requiring minimal changes
- Significantly faster performance for large datasets compared to pandas
Cons of Modin
- Limited functionality compared to pandas, not all operations are supported
- Requires additional setup and dependencies for distributed computing
- May have overhead for small datasets, potentially slower than pandas
Code Comparison
Modin:
import modin.pandas as pd
df = pd.read_csv("large_dataset.csv")
result = df.groupby("category").mean()
Data-Forge-TS:
import { readFile } from "data-forge-fs";
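const df = (await readFile("large_dataset.csv").parseCSV()).parseFloats(["value"]);
// A DataFrame has no mean(); "value" is a hypothetical numeric column averaged per group.
const result = df.groupBy(row => row.category)
    .select(group => group.deflate(row => row.value).average());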
Key Differences
- Modin focuses on scaling pandas operations for big data, while Data-Forge-TS is a TypeScript data manipulation library
- Modin aims for pandas compatibility, whereas Data-Forge-TS has its own API design
- Modin is better suited for large-scale data processing, while Data-Forge-TS is more appropriate for smaller datasets and TypeScript projects
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Pros of Vaex
- Designed for handling large datasets (up to 1 billion rows) efficiently
- Supports out-of-core computing, allowing processing of data larger than RAM
- Offers advanced visualization capabilities for big data
Cons of Vaex
- Primarily focused on tabular data, less versatile for other data structures
- Steeper learning curve due to its specialized nature
- Less integration with TypeScript ecosystem
Code Comparison
Data-Forge-TS:
import { DataFrame } from 'data-forge';
const df = new DataFrame({
columnNames: ["A", "B", "C"],
rows: [[1, 2, 3], [4, 5, 6]]
});
Vaex:
import vaex
df = vaex.from_arrays(
A=[1, 4], B=[2, 5], C=[3, 6]
)
Summary
Vaex excels in handling extremely large datasets and provides powerful visualization tools, making it ideal for big data analysis. However, it may be overkill for smaller projects and has a steeper learning curve. Data-Forge-TS, on the other hand, offers a more general-purpose data manipulation library with better TypeScript integration, but may not be as efficient for very large datasets. The choice between the two depends on the specific requirements of your project, particularly in terms of data size and processing needs.
Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Pros of Polars
- Written in Rust, offering high performance and memory efficiency
- Supports both eager and lazy execution modes
- Provides a wide range of data manipulation and analysis functions
Cons of Polars
- Steeper learning curve due to its Rust foundations
- Less integrated with the TypeScript/JavaScript ecosystem
- May require additional setup for use in web-based environments
Code Comparison
Data-Forge-TS:
import { DataFrame } from 'data-forge';
const df = new DataFrame([
{ A: 1, B: 'x' },
{ A: 2, B: 'y' },
]);
const filtered = df.where(row => row.A > 1);
Polars:
import polars as pl
df = pl.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
filtered = df.filter(pl.col('A') > 1)
Summary
Polars is a high-performance data manipulation library written in Rust, offering excellent speed and efficiency. It provides a rich set of features for data analysis and supports both eager and lazy execution. However, it may have a steeper learning curve and less seamless integration with TypeScript/JavaScript projects compared to Data-Forge-TS.
Data-Forge-TS, on the other hand, is specifically designed for TypeScript and JavaScript environments, making it more accessible for web developers. While it may not match Polars in raw performance, it offers a familiar API and easier integration with existing TypeScript/JavaScript codebases.
README
Data-Forge
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Implemented in TypeScript.
Used in JavaScript ES5+ or TypeScript.
To learn more about Data-Forge visit the home page.
Read about Data-Forge for data science in the book JavaScript for Data Science.
Love this? Please star this repo and click here to support my work
Please note that this TypeScript repository replaces the previous JavaScript version of Data-Forge.
BREAKING CHANGES
As of v1.6.9 the dependencies Sugar, Lodash and Moment have been factored out (or replaced with smaller dependencies). This more than halves the bundle size. Hopefully this won't cause any problems - but please log an issue if something changes that you weren't expecting.
As of v1.3.0 file system support has been removed from the Data-Forge core API. This follows repeated issues from users trying to get Data-Forge working in the browser, especially under Angular 6.
Functions for reading and writing files have been moved to the separate code library Data-Forge FS.
If your code uses the file read and write functions and was written for a version before 1.3.0, it will stop working when you upgrade to 1.3.0 or later. The fix is simple. Previously you would just require Data-Forge as follows:
const dataForge = require('data-forge');
Now you must also require the new library:
const dataForge = require('data-forge');
require('data-forge-fs');
Data-Forge FS augments Data-Forge core so that you can use the readFile/writeFile functions as in previous versions and as is shown in this readme and the guide.
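To illustrate why requiring a second package without using its return value can be enough, here is a generic sketch of the "augmenting" pattern in plain JavaScript. Every name in it (core, installFsPlugin, readLines) is hypothetical, and it is not taken from the Data-Forge FS source; it only shows how one module can attach functions to another module's object.

```javascript
// Hypothetical "core" module object with one built-in function.
const core = {
    describe(rows) {
        return `${rows.length} rows`;
    },
};

// Hypothetical plugin: attaches an extra function to the object it is given.
function installFsPlugin(target) {
    target.readLines = text => text.split("\n");
    return target;
}

// Loading the plugin has a side effect on the core object.
installFsPlugin(core);

const description = core.describe(core.readLines("a\nb"));
console.log(description); // "2 rows"
```

In a real package the attachment would typically happen as a side effect of loading the plugin module, which is why a bare require call is all the caller writes.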
If you still have problems with Angular 6 please see this workaround: https://github.com/data-forge/data-forge-ts/issues/3#issuecomment-438580174
Install
To install for Node.js and the browser:
npm install --save data-forge
If working in Node.js and you want the functions to read and write data files:
npm install --save data-forge-fs
Quick start
Data-Forge can load CSV, JSON or arbitrary data sets.
Parse the data, filter it, transform it, aggregate it, sort it and much more.
Use the data however you want or export it to CSV or JSON.
Here's an example:
const dataForge = require('data-forge');
require('data-forge-fs'); // For readFile/writeFile.
dataForge.readFileSync('./input-data-file.csv') // Read CSV file (or JSON!)
    .parseCSV()
    .parseDates(["Column A"]) // Parse date columns.
    .parseInts(["Column B", "Column C"]) // Parse integer columns.
    .parseFloats(["Column D", "Column E"]) // Parse float columns.
    .dropSeries(["Column F"]) // Drop certain columns.
    .where(row => predicate(row)) // Filter rows.
    .select(row => transform(row)) // Transform the data.
    .asCSV()
    .writeFileSync("./output-data-file.csv"); // Write to output CSV file (or JSON!)
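The predicate and transform functions in the example above are placeholders that you supply yourself. As a rough sketch, they might look like the following. The threshold, the output field names and the sample rows are all made up for illustration, and plain array filter/map stand in for where/select so the snippet runs without any files.

```javascript
// Hypothetical filter: keep rows whose "Column C" is at least 10.
function predicate(row) {
    return row["Column C"] >= 10;
}

// Hypothetical transform: reshape each row into a smaller record.
function transform(row) {
    return {
        id: row["Column C"],
        scaled: row["Column D"] * 2,
    };
}

// Made-up rows standing in for a parsed input file.
const sampleRows = [
    { "Column C": 5, "Column D": 1.5 },
    { "Column C": 12, "Column D": 2.5 },
];

const output = sampleRows.filter(predicate).map(transform);
console.log(output); // [ { id: 12, scaled: 5 } ]
```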
From the browser
Data-Forge has been tested with Browserify and Webpack. Please see links to examples below.
If you aren't using Browserify or Webpack, the npm package includes a pre-packed browser distribution that you can install and include in your HTML as follows:
<script language="javascript" type="text/javascript" src="node_modules/data-forge/dist/web/index.js"></script>
This gives you the data-forge package mounted under the global variable dataForge.
Please remember that you can't use data-forge-fs or the file system functions in the browser.
Features
- Import and export CSV and JSON data and text files (when using Data-Forge FS).
- Or work with arbitrary JavaScript data.
- Many options for working with your data:
- Filtering
- Transformation
- Extracting subsets
- Grouping, aggregation and summarization
- Sorting
- And much more
- Great for slicing and dicing tabular data:
- Add, remove, transform and generate named columns (series) of data.
- Great for working with time series data.
- Your data is indexed so you have the ability to merge and aggregate.
- Your data is immutable! Transformations and modifications produce a new dataset.
- Build data pipelines that are evaluated lazily.
- Inspired by Pandas and LINQ, so it might feel familiar!
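The immutability and lazy evaluation mentioned in the list above can be pictured with a small conceptual sketch. This is not Data-Forge's implementation; LazyRows is a hypothetical class built on generators, written only to show that each step returns a new object and that no work happens until the results are requested.

```javascript
// Conceptual sketch only: a tiny lazy, immutable row pipeline.
class LazyRows {
    constructor(source) {
        this.source = source; // A function that returns a fresh iterator.
    }

    static from(array) {
        return new LazyRows(() => array[Symbol.iterator]());
    }

    // Returns a NEW pipeline; the current one is left untouched.
    where(predicate) {
        const source = this.source;
        return new LazyRows(function* () {
            for (const row of source()) {
                if (predicate(row)) {
                    yield row;
                }
            }
        });
    }

    // Returns a NEW pipeline that maps each row when iterated.
    select(transform) {
        const source = this.source;
        return new LazyRows(function* () {
            for (const row of source()) {
                yield transform(row);
            }
        });
    }

    // Forces evaluation of the whole pipeline.
    toArray() {
        return Array.from(this.source());
    }
}

let evaluations = 0;
const rows = LazyRows.from([{ n: 1 }, { n: 2 }, { n: 3 }]);
const pipeline = rows
    .where(row => {
        evaluations += 1; // Count how often the filter actually runs.
        return row.n > 1;
    })
    .select(row => ({ n: row.n * 10 }));

const evaluationsBefore = evaluations; // Still 0: nothing has run yet.
const result = pipeline.toArray();     // Evaluation happens here.

console.log(evaluationsBefore);     // 0
console.log(result);                // [ { n: 20 }, { n: 30 } ]
console.log(rows.toArray().length); // 3 (the original rows are unchanged)
```

Because every step produces a new pipeline object, earlier stages can be reused for several different downstream results without interfering with each other.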
Contributions
Want a bug fixed or maybe to improve performance?
Don't see your favourite feature?
Need to add your favourite Pandas or LINQ feature?
Please contribute and help improve this library for everyone!
Fork it, make a change, submit a pull request. Want to chat? See my contact details at the end or reach out on Gitter.
Platforms
- Node.js (npm install --save data-forge data-forge-fs) (see example here)
- Browser
- Via bower (bower install --save data-forge) (see example here)
- Via Browserify (see example here)
- Via Webpack (see example here)
Documentation
Resources
Contact
Please reach out and tell me what you are doing with Data-Forge or how you'd like to see it improved.
- Twitter: @codecapers
- Email: ashley@codecapers.com.au
- Linkedin: www.linkedin.com/in/ashleydavis75
- Web: www.codecapers.com.au
Support the developer