data

Data and code behind the articles and graphics at FiveThirtyEight

17,045

11,115

Top Related Projects

covid-19-data

5,666

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data

everything

1,322

An index of all our open-source data, analysis, libraries, tools, and guides.

COVID-19

29,073

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE

Quick Overview

The fivethirtyeight/data repository is a collection of data sets and code used in articles published by FiveThirtyEight, a data journalism website. It provides raw data, data dictionaries, and some analysis scripts for various topics covered in their articles, making it a valuable resource for data enthusiasts, researchers, and journalists.

Pros

Diverse range of topics covered, including politics, sports, economics, and social issues
Well-documented data sets with accompanying README files and data dictionaries
Regular updates with new data sets as new articles are published
Open-source and freely available for public use and analysis

Cons

Some data sets may be incomplete or require additional context from the associated articles
Not all data sets are consistently formatted or follow the same structure
Limited analysis scripts provided; users often need to perform their own analysis
Some older data sets may not be actively maintained or updated
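
Because column labels are not formatted consistently across data sets, a common first step before combining files is to normalize the header names. The following is a minimal sketch using only the standard library. The helper names `normalize_column` and `normalize_header` are hypothetical and are not part of the repository.

```python
import re


def normalize_column(name):
    """Convert a column label to lowercase snake_case."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def normalize_header(columns):
    """Normalize every label in a header row, preserving order."""
    return [normalize_column(col) for col in columns]
```

This sketch does not split camelCase labels or resolve collisions between labels that normalize to the same name. Both would need extra handling in real use.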

Getting Started

To use the data sets from this repository:

Clone the repository:
```
git clone https://github.com/fivethirtyeight/data.git
```

Navigate to the desired data set folder:
```
cd data/dataset-name
```
Read the README.md file for information about the data set and its structure.
Use your preferred data analysis tools (e.g., Python, R, Excel) to load and analyze the data.

Note: This repository does not contain a code library, so there are no specific code examples or installation instructions. Users are expected to work with the raw data files directly using their preferred tools and methods.
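
Since the repository ships raw files rather than a library, the last two steps usually come down to finding the CSV files in a data set folder and checking their columns. The following is a minimal sketch using only Python's standard library. The helper names `list_csv_files` and `read_header` are hypothetical, and the `utf-8-sig` encoding is an assumption that may not hold for every file.

```python
import csv
from pathlib import Path


def list_csv_files(dataset_dir):
    """Return every CSV file under a data set folder, sorted by path."""
    return sorted(Path(dataset_dir).rglob("*.csv"))


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    # utf-8-sig is an assumption; it also tolerates a leading byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return next(csv.reader(handle), [])
```

For example, `list_csv_files("data/dataset-name")` (with the placeholder replaced by a real folder name) returns the paths to pass to `read_header` or to another analysis tool.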

Competitor Comparisons

covid-19-data

5,666

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data

Pros of covid-19-data

More focused and specialized dataset, specifically for COVID-19 data
More frequently updated (daily, according to its description)
Includes a wider range of global data sources and countries

Cons of covid-19-data

Limited to a single topic, whereas fivethirtyeight/data covers various subjects
May require more domain-specific knowledge to interpret and use effectively
Potentially more complex data structure due to the nature of COVID-19 reporting

Code comparison

covid-19-data:

```python
import pandas as pd

# Load COVID-19 data
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df['date'] = pd.to_datetime(df['date'])
```

data:

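```python
import pandas as pd

# Load FiveThirtyEight data
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/poll-quiz-guns/guns-polls.csv'
df = pd.read_csv(url)
# This file records poll dates in 'Start' and 'End' columns rather than 'date'
# (check the data set's README if the layout has changed).
df['Start'] = pd.to_datetime(df['Start'])
```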

Both repositories provide CSV files that load directly into pandas. The main difference lies in the data sets and their structures: covid-19-data focuses on COVID-19 statistics, while fivethirtyeight/data covers a variety of topics, so users must locate the data set that matches their needs and check its column names.
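
The FiveThirtyEight example above reads a file through raw.githubusercontent.com, whose URL follows an owner/repository/branch/path pattern. The following is a minimal sketch of building such a URL. The helper name `raw_github_url` is hypothetical, and the branch name `master` is taken from the example URL, so it may differ for other repositories.

```python
from urllib.parse import quote


def raw_github_url(owner, repo, branch, path):
    """Build a raw.githubusercontent.com URL for a file in a public repo."""
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        owner, repo, branch, quote(path)
    )
```

The resulting string can be passed to `pd.read_csv` in place of a hard-coded URL.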

everything

1,322

An index of all our open-source data, analysis, libraries, tools, and guides.

Pros of everything

Broader scope, covering various topics beyond just data
More frequent updates and active community engagement
Includes code, analysis, and methodologies alongside datasets

Cons of everything

Less structured organization compared to data
May contain more opinion-based or editorial content
Potentially overwhelming for users seeking specific datasets

Code comparison

data:

```python
import pandas as pd

# 'fivethirtyeight_dataset.csv' is a placeholder file name
df = pd.read_csv('fivethirtyeight_dataset.csv')
df.head()
```

everything:

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'buzzfeed_dataset.csv' and its 'date' and 'value' columns are placeholders
df = pd.read_csv('buzzfeed_dataset.csv')
df.plot(x='date', y='value')
plt.show()
```

Key differences

data focuses primarily on datasets and statistical analysis
everything includes a wider range of content, including investigative reporting
data provides more consistent formatting and documentation for datasets
everything offers more diverse types of information and resources
data is more suitable for academic or research purposes
everything caters to a broader audience, including journalists and general public

Both repositories serve as valuable resources for data-driven journalism and analysis, with each having its own strengths and target audience. The choice between them depends on the specific needs and preferences of the user.

COVID-19

29,073

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE

Pros of COVID-19

Focused specifically on COVID-19 data, providing comprehensive and detailed information
Frequently updated, often on a daily basis, ensuring up-to-date statistics
Global coverage with data from countries and regions worldwide

Cons of COVID-19

Limited to COVID-19 data only, lacking diversity in topics
Raw data format may require more processing for analysis
Potential inconsistencies in reporting methods across different regions

Code Comparison

COVID-19 data format (CSV):

```
Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
New York,US,40.7128,-74.0060,2021-03-01,1000000,50000,900000
```

FiveThirtyEight data format (CSV):

```
date,state,positive,negative,pending,hospitalized,death
2021-03-01,New York,1000000,5000000,1000,10000,50000
```

The COVID-19 repository focuses on global data with geographical coordinates, while FiveThirtyEight includes more detailed categories for US states. FiveThirtyEight's data is more diverse, covering various topics beyond COVID-19, making it suitable for a wider range of analyses. However, COVID-19 provides more frequent updates and global coverage specific to the pandemic.
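
To compare records across the two layouts, the columns first need to be mapped onto shared names. The following is a minimal sketch using the two sample rows shown above. The shared names (`date`, `region`, `cases`, `deaths`) and the helper `to_common_rows` are hypothetical choices for illustration, not a schema that either repository defines.

```python
import csv
import io

# Sample rows copied from the two illustrative formats above.
JHU_SAMPLE = (
    "Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered\n"
    "New York,US,40.7128,-74.0060,2021-03-01,1000000,50000,900000\n"
)
STATE_SAMPLE = (
    "date,state,positive,negative,pending,hospitalized,death\n"
    "2021-03-01,New York,1000000,5000000,1000,10000,50000\n"
)

# Hypothetical mappings from each source's column names to shared names.
JHU_COLUMNS = {"Date": "date", "Province/State": "region",
               "Confirmed": "cases", "Deaths": "deaths"}
STATE_COLUMNS = {"date": "date", "state": "region",
                 "positive": "cases", "death": "deaths"}


def to_common_rows(csv_text, column_map):
    """Keep only the mapped columns, renamed to the shared names."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {new: record[old] for old, new in column_map.items()}
        # Counts arrive as text; convert them so they can be compared.
        row["cases"] = int(row["cases"])
        row["deaths"] = int(row["deaths"])
        rows.append(row)
    return rows
```

With both samples mapped this way, the New York rows become directly comparable dictionaries. Columns that exist in only one layout, such as `Lat` or `hospitalized`, are dropped.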

GoogleTrends/data

4,676

An index of all open-source data

Pros of GoogleTrends/data

More focused on search trends and user interest data
Regularly updated with current trending topics
Provides data in multiple formats (CSV, JSON)

Cons of GoogleTrends/data

Limited scope compared to the diverse datasets in fivethirtyeight/data
Less comprehensive documentation and context for datasets
Fewer historical datasets available

Code comparison

GoogleTrends/data:

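```python
import pandas as pd

# 'multiTimeline.csv' is a Google Trends export; skip its leading category line
df = pd.read_csv('multiTimeline.csv', skiprows=1)
# The time column's name depends on the export (it is not always 'date'),
# so parse whichever column comes first.
time_col = df.columns[0]
df[time_col] = pd.to_datetime(df[time_col])
print(df.head())
```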

fivethirtyeight/data:

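```python
import pandas as pd

# Path is relative to the root of the cloned repository
df = pd.read_csv('nba-elo/nbaallelo.csv')
# Game dates are stored in the 'date_game' column
df['date_game'] = pd.to_datetime(df['date_game'])
print(df.head())
```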

Both repositories use similar data loading techniques, but the specific datasets and their structures differ. GoogleTrends/data focuses on search trend data, while fivethirtyeight/data covers a wider range of topics and may require more complex data manipulation depending on the dataset.
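
One example of the extra manipulation mentioned above is that date columns are not guaranteed to share a single text format across data sets. The following is a minimal sketch that tries several formats in turn. The helper name `parse_date` and the list of candidate formats are assumptions to adapt per data set.

```python
from datetime import datetime

# Candidate formats are assumptions; extend the tuple for other data sets.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")


def parse_date(text, formats=DATE_FORMATS):
    """Return a date parsed with the first format that matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: {!r}".format(text))
```

Trying formats in a fixed order keeps the behavior predictable, but it cannot tell month/day apart from day/month. Confirm the convention in each data set's README before relying on it.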

README

See the index for a list of the data and code we've published and their accompanying stories.

As of June 13, 2023, sports predictions and forecasts are no longer being updated.

Unless otherwise noted, our data sets are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License. If you find this information useful, please let us know.
