corpora

A collection of small corpuses of interesting data for the creation of bots and similar stuff.

Top Related Projects

dataset

4,801

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.

Quick Overview

The dariusk/corpora repository is a collection of small, static JSON datasets on topics ranging from animals and foods to mythology and technology. It is aimed at developers, researchers, and artists who need structured sample data for projects such as generating random content, populating databases, or creating data visualizations.

Pros

Extensive variety of datasets covering numerous categories
Well-structured JSON format for easy integration into projects
Community-driven, allowing for contributions and updates
Free and open-source, accessible to everyone

Cons

Some datasets may be incomplete or require updates
Limited to static data, not real-time or frequently updated information
Potential inconsistencies in data formatting across different datasets
May require additional processing or filtering for specific use cases

Getting Started

To use the corpora datasets in your project:

Clone the repository:

git clone https://github.com/dariusk/corpora.git

Navigate to the desired dataset in the data directory.
Copy the JSON file or use it directly in your project.
Parse the JSON data using your preferred programming language or tool.

Example usage in Python:

import json

with open('path/to/corpora/data/animals/dogs.json') as f:
    dog_data = json.load(f)

print(dog_data['dogs'])

This will load and print the list of dog breeds from the dataset.
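
Bot-style projects often need a few random entries rather than the whole list. The following is a minimal sketch of that step, not part of the corpora project itself: `pick_random` is a hypothetical helper, and the inline `sample` dictionary is placeholder data shaped like the parsed file above, not the real contents of dogs.json.

```python
import random

def pick_random(corpus, key, count=1, seed=None):
    """Return up to `count` distinct random entries from the list stored under `key`."""
    items = corpus[key]
    if not isinstance(items, list):
        raise TypeError(f"expected a list under {key!r}")
    rng = random.Random(seed)  # passing a seed makes the selection repeatable
    return rng.sample(items, min(count, len(items)))

# Placeholder data standing in for the result of json.load() (assumption).
sample = {"description": "placeholder", "dogs": ["Beagle", "Poodle", "Collie", "Boxer"]}

picks = pick_random(sample, "dogs", count=2, seed=1)
print(picks)
```

Capping `count` at the list length means a request for more items than a small corpus holds returns the whole list in random order instead of raising an error.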

Competitor Comparisons

dataset

4,801

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.

Pros of dataset

More focused on data manipulation and database operations
Provides a Python library for working with structured data
Offers features like data import/export and SQL query support

Cons of dataset

Less diverse in terms of pre-built datasets
Requires more programming knowledge to utilize effectively
May have a steeper learning curve for non-technical users

Code comparison

dataset:

import dataset

db = dataset.connect('sqlite:///mydatabase.db')
table = db['users']
table.insert(dict(name='John Doe', age=37))

corpora:

// corpora-project is the offline Node.js package; corpora-api is a server
const corpora = require('corpora-project');

const animals = corpora.getFile('animals', 'common');
console.log(animals.animals);

Summary

dataset is a Python library for working with structured data, offering database operations and data manipulation features. It's more suitable for developers and data analysts who need to perform complex data operations.

corpora is a collection of small datasets and lists, primarily used for quick access to various categories of information. It's more accessible for general users and creative projects that require diverse, pre-compiled data.

The choice between the two depends on the specific needs of the project and the user's technical expertise.

README

Corpora

This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.

I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
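
As a rough illustration of that prototyping workflow (a sketch added for this overview, not code from the project), the generator below only depends on two word sources, so a small corpus can later be swapped for a fuller data source without touching the rest. The `word_source` and `make_phrase` helpers are hypothetical, and the inline dictionaries are stand-ins for parsed adjs.json and nouns.json files; their key names and entries are assumptions.

```python
import random

# Stand-ins for the parsed contents of adjs.json and nouns.json; the key
# names and entries here are assumptions, not the real files.
ADJS = {"adjs": ["quiet", "electric", "ancient"]}
NOUNS = {"nouns": ["harbor", "lantern", "orchard"]}

def word_source(corpus, key):
    """Wrap a small corpus as a callable that takes a random.Random and returns one word."""
    words = corpus[key]
    return lambda rng: rng.choice(words)

def make_phrase(adj_source, noun_source, rng):
    """Build a phrase from any two sources; either can be replaced later."""
    return f"{adj_source(rng)} {noun_source(rng)}"

rng = random.Random(0)
phrase = make_phrase(word_source(ADJS, "adjs"), word_source(NOUNS, "nouns"), rng)
print(phrase)
```

Replacing a corpus then only means writing another callable with the same shape, for example one that queries an external API.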

I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.

License

Since Corpora is more data than code, I have chosen to CC0 license this (rather than MIT license or similar).

To the extent possible under law, Darius Kazemi has waived all copyright and related or neighboring rights to Corpora. This work is published from: United States.

What is Corpora NOT?

This project is not meant to replace exhaustive APIs -- if you want nouns, and you want every noun in the English language, replete with metadata, consider Wordnik. If you want the title of every Wikipedia article, use the MediaWiki API.

What is Corpora?

Corpora is a repository of JSON files, meant to be language-neutral. If you want to create an NPM repo or whatever based on this, be my guest, but this repository will remain a collection of data files that can be interpreted by any language that can parse JSON.
Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
- For example, Corpora will not contain any complete "dictionary" style files. Instead we host a sampling of 1000 common nouns, adjectives, and verbs.
- Some lists are small enough by nature that we may include a complete list of things in their category. For example, a list of heavily populated U.S. cities may only have 75 cities and be considered complete.

List of Corpora-related tools

corpora-project, a Node.js NPM package for accessing corpora data offline.
pycorpora, a simple Python interface for corpora
corpora-api, a Node.js server that offers up the corpora as a JSON API (now live at https://corpora-api.glitch.me)

I have some data, how do I submit?

We accept pull requests to this repository. Some guidelines:

BY SUBMITTING DATA AS A PULL REQUEST, YOU AGREE TO OUR APPLYING A CC0 FREE CULTURE LICENSE TO THE DATA, MEANING THAT ANYONE CAN USE THE DATA FOR ANY REASON WITHOUT ATTRIBUTION IN PERPETUITY.
Please submit all data in JSON format in a file with a .json extension, and please JSONLint your files before submitting -- also, thanks to Matt Rothenberg we have Travis-CI testing, which will jsonlint your pull request automatically. If you see a test failure notification in your PR after you submit, there's a problem with your JSON!
Keep individual files to about 1000 "things" maximum. Fewer than 1000 is fine, too.
If you'd like attribution, I'm happy to include your name in this Readme file. Just remember that nobody who uses this data is obligated to include attribution in their own projects.
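
Two of the guidelines above (valid JSON, roughly 1000 things per file) can be checked locally before opening a pull request. The following is a minimal sketch under stated assumptions, not the project's own test setup and not a replacement for JSONLint: `check_corpus_text` is a hypothetical helper, and it assumes that the "things" in a file live in list values of a top-level object, or in a top-level list.

```python
import json

MAX_ITEMS = 1000  # rough guideline from the README, not a hard limit

def check_corpus_text(text, max_items=MAX_ITEMS):
    """Return a list of problems found in one corpus file's JSON text."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} (line {err.lineno})"]
    # Assumption: the "things" are held in list values of a top-level object,
    # or the file is itself one top-level list.
    if isinstance(data, dict):
        lists = {k: v for k, v in data.items() if isinstance(v, list)}
    elif isinstance(data, list):
        lists = {"<top level>": data}
    else:
        lists = {}
    problems = []
    for key, items in lists.items():
        if len(items) > max_items:
            problems.append(f"{key!r} has {len(items)} items (guideline is about {max_items})")
    return problems

print(check_corpus_text('{"colors": ["red", "green"]}'))   # well-formed and small
print(check_corpus_text('{"colors": ["red", "green",]}'))  # trailing comma is not valid JSON
```

To check a real file, read its text with `open(path, encoding="utf-8").read()` and pass the result to the function; an empty list means neither check found a problem.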

Contributors

By Darius Kazemi and Many Wonderful Contributors.
