datasharing

The Leek group guide to data sharing


Quick Overview

The jtleek/datasharing repository is a guide for sharing data in scientific publications. It provides a comprehensive overview of best practices for preparing and sharing data alongside research findings. The repository aims to improve reproducibility and transparency in scientific research.

Pros

Offers clear, step-by-step instructions for data sharing
Covers a wide range of data-related topics, from raw data to metadata
Promotes better research practices and reproducibility
Useful for both novice and experienced researchers

Cons

Primarily focused on tabular data, may not cover all data types
Could benefit from more examples or case studies
Lacks specific guidance for different scientific disciplines
May require regular updates to keep pace with evolving data sharing standards

Note: As this is not a code library, the code examples and getting started instructions sections have been omitted.

Competitor Comparisons

r4ds

4,896

R for data science: a book

Pros of r4ds

Comprehensive coverage of data science in R
Regularly updated with new content and examples
Extensive community support and contributions

Cons of r4ds

More complex and time-consuming to work through
May be overwhelming for complete beginners
Focuses primarily on tidyverse ecosystem

Code comparison

r4ds:

library(tidyverse)
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()

datasharing:

# No specific code examples provided in the repository
# The project focuses on guidelines for data sharing

Key differences

r4ds is a comprehensive book on R for data science
datasharing focuses on best practices for sharing data
r4ds has more active development and contributions
datasharing is more concise and easier to digest quickly

Similarities

Both aim to improve data science practices
Both are open-source and freely available on GitHub
Both have educational value for data scientists

Use cases

r4ds:

Learning R programming for data science
Developing advanced data analysis skills
Exploring the tidyverse ecosystem

datasharing:

Understanding best practices for data sharing
Quick reference for data organization
Improving reproducibility in research

dplyr

4,938

dplyr: A grammar of data manipulation

Pros of dplyr

Comprehensive data manipulation toolkit with a wide range of functions
Efficient performance for large datasets due to C++ backend
Consistent syntax and pipe operator integration for readable code

Cons of dplyr

Steeper learning curve for beginners compared to basic R functions
Potential compatibility issues with base R or other packages
Requires additional package installation and loading

Code Comparison

dplyr:

library(dplyr)
data %>%
  filter(year > 2000) %>%
  group_by(category) %>%
  summarize(mean_value = mean(value))

datasharing:

# No specific code examples provided in the repository
# The project focuses on guidelines for data sharing

Additional Notes

datasharing is a guide for sharing data in scientific publications, while dplyr is a data manipulation package for R. They serve different purposes, making a direct comparison challenging. datasharing provides best practices for researchers, while dplyr offers tools for data analysis and transformation.

ggplot2

6,794

An implementation of the Grammar of Graphics in R

Pros of ggplot2

Comprehensive data visualization library with extensive functionality
Well-documented with a large community and ecosystem of extensions
Follows a consistent grammar of graphics for intuitive plot creation

Cons of ggplot2

Steeper learning curve for beginners compared to simpler plotting methods
Can be slower for large datasets or complex visualizations
Requires more code for basic plots compared to base R graphics

Code Comparison

ggplot2:

library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "MPG vs Weight", x = "Miles per Gallon", y = "Weight")

datasharing:

## Data Dictionary

Here's a sample data dictionary:

1. ID: Unique identifier for each participant
2. Age: Participant's age in years
3. Gender: Participant's gender (Male/Female)

Summary

ggplot2 is a powerful data visualization library, while datasharing is a guide for sharing data in scientific publications. ggplot2 offers extensive plotting capabilities but may be more complex for beginners. datasharing focuses on best practices for organizing and documenting data, which is crucial for reproducibility in research. While they serve different purposes, both contribute significantly to the data science ecosystem.

pandas

46,172

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

Extensive data manipulation and analysis library with powerful features
Large, active community providing support and contributions
Comprehensive documentation and extensive examples

Cons of pandas

Steeper learning curve for beginners
Larger codebase and more complex installation process
Higher resource requirements for large datasets

Code comparison

datasharing:

# Guide to sharing data

1. Start with raw data
2. Provide a tidy data set
3. Include a code book

pandas:

import pandas as pd

# Read CSV file
df = pd.read_csv('data.csv')

# Perform data manipulation
# numeric_only=True avoids a TypeError on text columns in recent pandas versions
df_cleaned = df.dropna().groupby('category').mean(numeric_only=True)

# Export to Excel
df_cleaned.to_excel('output.xlsx')

The datasharing repository focuses on guidelines for sharing data effectively, while pandas provides a robust toolkit for data manipulation and analysis in Python. datasharing is more accessible for beginners and emphasizes best practices, whereas pandas offers powerful functionality but requires more technical expertise to utilize fully.

numpy

30,015

The fundamental package for scientific computing with Python.

Pros of NumPy

Extensive library for numerical computing in Python
Large, active community with frequent updates and improvements
Comprehensive documentation and widespread adoption in scientific computing

Cons of NumPy

More complex and larger codebase
Steeper learning curve for beginners
Requires installation and management of dependencies

Code Comparison

NumPy example:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr)

datasharing example:

# Data sharing

* Raw data
* Tidy data
* Code book
* Instruction list

Summary

NumPy is a powerful numerical computing library for Python, while datasharing is a guide for sharing data in scientific publications. NumPy offers extensive functionality but requires more setup and learning. datasharing is simpler and focuses on best practices for data sharing in research.

NumPy is ideal for complex numerical computations and data analysis, whereas datasharing is better suited for researchers looking to improve their data sharing practices. The choice between them depends on the specific needs of the project or research.

README

How to share data with a statistician

This is a guide for anyone who needs to share data with a statistician or data scientist. The target audiences I have in mind are:

Collaborators who need statisticians or data scientists to analyze data for them
Students or postdocs in various disciplines looking for consulting advice
Junior statistics students whose job it is to collate/clean/wrangle data sets

The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. The Leek group works with a large number of collaborators and the number one source of variation in the speed to results is the status of the data when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.

My strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important to see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of variability in one's data analysis. On the other hand, for many data types, the processing steps are well documented and standardized. So the work of converting the data from raw form to directly analyzable form can be performed before calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn't have to work through all the pre-processing steps first.

What you should deliver to the statistician

To facilitate the most efficient and timely analysis this is the information you should pass to a statistician:

1. The raw data.
2. A tidy data set.
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 -> 2,3.

Let's look at each part of the data package you will transfer.

The raw data

It is critical that you include the rawest form of the data that you have access to. This ensures that data provenance can be maintained throughout the workflow. Here are some examples of the raw form of data:

The strange binary file your measurement machine spits out
The unformatted Excel file with 10 worksheets the company you contracted with sent you
The complicated JSON data you got from scraping the Twitter API
The hand-entered numbers you collected looking through a microscope

You know the raw data are in the right format if you:

Ran no software on the data
Did not modify any of the data values
Did not remove any data from the data set
Did not summarize the data in any way

If you modified the raw data in any way, it is no longer the raw form of the data. Reporting modified data as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a forensic study of your data to figure out why the raw data look weird. (Also imagine what would happen if new data arrived.)
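One way to back up the claim that the raw files were never touched is to record a checksum for each file when you first receive it. The sketch below is only an illustration and is not part of the original guide: the function name `checksum_manifest`, the choice of SHA-256, and the demo file name are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_manifest(raw_dir):
    """Return {relative file name: SHA-256 hex digest} for every file under raw_dir."""
    raw_dir = Path(raw_dir)
    manifest = {}
    for path in sorted(raw_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(raw_dir).as_posix()] = digest
    return manifest


# Demo on a throwaway directory standing in for a real raw-data folder.
with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "machine_output.bin"  # hypothetical file name
    raw_file.write_bytes(b"\x00\x01\x02 raw bytes from the instrument")
    before = checksum_manifest(tmp)

    # Recomputing the manifest gives identical digests while the files are untouched.
    unchanged = checksum_manifest(tmp) == before

    # Any edit, even a single byte, changes the digest.
    raw_file.write_bytes(b"\x00\x01\x03 raw bytes from the instrument")
    changed = checksum_manifest(tmp) != before

print(unchanged, changed)
```

Sending the manifest along with the raw files lets the analyst confirm that what arrived is what you collected.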

The tidy data set

The general principles of tidy data are laid out by Hadley Wickham in this paper and this video. While both the paper and the video describe tidy data using R, the principles are more generally applicable:

Each variable you measure should be in one column
Each different observation of that variable should be in a different row
There should be one table for each "kind" of variable
If you have multiple tables, they should include a column in the table that allows them to be joined or merged

While these are the hard and fast rules, there are a number of other things that will make your data set much easier to handle. The first is to include a row at the top of each data table/spreadsheet that contains full column names. So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead of something like ADx or another abbreviation that may be hard for another person to understand.

Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with RNA-sequencing. You have also collected demographic and clinical information about the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic information. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row for every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data is summarized at the level of the number of counts per exon. Suppose you have 100,000 exons, then you would have a table/spreadsheet that had 21 rows (a header row of exon names, and one row for each patient) and 100,001 columns (one column for patient ids and one column for each exon).
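A scaled-down version of the genomics example (3 patients and 3 exons instead of 20 and 100,000) might look like the sketch below. All values, the column names, and the helper `join_on` are made up for illustration. The point is only that each table has one row per patient and that the shared `PatientId` column lets the two tables be merged.

```python
# Clinical/demographic table: one row per patient, one column per variable.
clinical = [
    {"PatientId": "P01", "Age": 54, "Treatment": "drugA", "Diagnosis": "stage2"},
    {"PatientId": "P02", "Age": 61, "Treatment": "placebo", "Diagnosis": "stage1"},
    {"PatientId": "P03", "Age": 47, "Treatment": "drugA", "Diagnosis": "stage3"},
]

# Expression table: one row per patient, one column per exon (counts are invented).
expression = [
    {"PatientId": "P01", "Exon1": 10, "Exon2": 0, "Exon3": 7},
    {"PatientId": "P02", "Exon1": 3, "Exon2": 12, "Exon3": 0},
    {"PatientId": "P03", "Exon1": 8, "Exon2": 5, "Exon3": 21},
]


def join_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**left_row, **right_by_key[left_row[key]]}
        for left_row in left
        if left_row[key] in right_by_key
    ]


merged = join_on(clinical, expression, "PatientId")

# Same counting as in the text: a header row plus one row per patient,
# and an id column plus one column per exon.
n_rows_with_header = len(clinical) + 1
n_expression_columns = len(expression[0])

print(n_rows_with_header, n_expression_columns)
print(merged[0])
```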

If you are sharing your data with the collaborator in Excel, the tidy data should be in one Excel file per table. They should not have multiple worksheets, no macros should be applied to the data, and no columns/cells should be highlighted. Alternatively share the data in a CSV or TAB-delimited text file. (Beware however that reading CSV files into Excel can sometimes lead to non-reproducible handling of date and time variables.)
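If you export to CSV from a script rather than from Excel, a minimal sketch could look like this. The column names and values are hypothetical, and writing dates as ISO 8601 text (YYYY-MM-DD) is my own suggestion for sidestepping ambiguous day/month handling, not something the guide prescribes.

```python
import csv
import io
from datetime import date

# Hypothetical tidy rows: one observation per row, one variable per column.
rows = [
    {"PatientId": "P01", "VisitDate": date(2014, 3, 1), "WeightKg": 70.5},
    {"PatientId": "P02", "VisitDate": date(2014, 12, 3), "WeightKg": 82.0},
]

# StringIO stands in for open("tidy.csv", "w", newline="") on a real file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["PatientId", "VisitDate", "WeightKg"],
    lineterminator="\n",
)
writer.writeheader()  # the first row holds the full column names
for row in rows:
    # Store dates as plain ISO 8601 text so day and month cannot be swapped.
    writer.writerow({**row, "VisitDate": row["VisitDate"].isoformat()})

csv_text = buffer.getvalue()
print(csv_text)
```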

The code book

For almost any data set, the measurements you calculate will need to be described in more detail than you can or should sneak into the spreadsheet. The code book contains this information. At minimum it should contain:

Information about the variables (including units!) in the data set not contained in the tidy data
Information about the summary choices you made
Information about the experimental study design you used

In our genomics example, the analyst would want to know what the unit of measurement for each clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They would also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They would also want to know any other information about how you did the data collection/study design. For example, are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic like age? Are they randomized to treatments?

A common format for this document is a Word file. There should be a section called "Study design" that has a thorough description of how you collected the data. There should also be a section called "Code book" that describes each variable and its units.
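Even when the code book itself lives in a Word file, a quick check that every column of the tidy data is described can catch omissions. The sketch below assumes a hypothetical machine-readable copy of the code book held in a dictionary. The structure, the column names, and the helper `undocumented` are illustrative only.

```python
# Column names of the tidy data set (hypothetical).
tidy_columns = ["PatientId", "AgeAtDiagnosis", "Treatment", "Diagnosis"]

# A machine-readable copy of the code book: one entry per variable.
code_book = {
    "PatientId": {"description": "Unique patient identifier", "units": "none"},
    "AgeAtDiagnosis": {"description": "Age when diagnosed", "units": "years"},
    "Treatment": {"description": "Name of the treatment given", "units": "none"},
}


def undocumented(columns, book):
    """Return the columns that have no code book entry or no units recorded."""
    return [c for c in columns if c not in book or not book[c].get("units")]


missing = undocumented(tidy_columns, code_book)
print(missing)  # ['Diagnosis']
```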

How to code variables

When you put variables into a spreadsheet there are several main categories you will run into depending on their data type:

Continuous
Ordinal
Categorical
Missing
Censored

Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example would be something like weight measured in kg.

Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered. This could be, for example, survey responses where the choices are: poor, fair, good.

Categorical data are data where there are multiple categories, but they aren't ordered. One example would be sex: male or female. This coding is attractive because it is self-documenting.

Missing data are data that are unobserved and you don't know the mechanism. You should code missing values as NA.

Censored data are data where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit or a patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report to the analyst if there is a reason you know about that some of the data are missing. You should also not impute, make up, or throw away missing observations.

In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3. This will avoid potential mixups about which direction effects go and will help identify coding errors.
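The coding rules above can be sketched as follows. The variable `ViralLoad`, the `below_limit` flag, and the helper `to_tidy_row` are hypothetical. What the example shows is the convention from the text: text labels for categorical and ordinal values, NA for anything you do not have, and a companion "VariableNameCensored" column holding TRUE or FALSE.

```python
# Hypothetical observations; None marks a value that was not obtained.
observations = [
    {"PatientId": "P01", "Sex": "female", "Health": "good",
     "ViralLoad": 1520.0, "below_limit": False},
    # Below the detection limit: missing, but the mechanism is known (censored).
    {"PatientId": "P02", "Sex": "male", "Health": "poor",
     "ViralLoad": None, "below_limit": True},
    # Simply unobserved: missing, mechanism unknown.
    {"PatientId": "P03", "Sex": "female", "Health": "fair",
     "ViralLoad": None, "below_limit": False},
]


def to_tidy_row(obs):
    """Apply the coding rules to one observation."""
    value = obs["ViralLoad"]
    return {
        "PatientId": obs["PatientId"],
        "Sex": obs["Sex"],        # text label, not 0/1
        "Health": obs["Health"],  # "poor"/"fair"/"good", not 1/2/3
        "ViralLoad": "NA" if value is None else value,
        "ViralLoadCensored": "TRUE" if obs["below_limit"] else "FALSE",
    }


tidy_rows = [to_tidy_row(o) for o in observations]
for row in tidy_rows:
    print(row)
```

The reason each censored value is missing (here, the detection limit) still belongs in the code book.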

Always encode every piece of information about your observations using text. For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation ("red variable entries were observed in experiment 1.") then this information will not be exported (and will be lost!) when the data is exported as raw text. Every piece of data should be encoded as actual text that can be exported.

The instruction list/script

You may have heard this before, but reproducibility is a big deal in computational science. That means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform some summarization/data analysis steps before the data can be considered tidy.

The ideal thing for you to do when performing summarization is to create a computer script (in R, Python, or something else) that takes the raw data as input and produces the tidy data you are sharing as output. You can try running your script a couple of times and see if the code produces the same output.
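As a rough sketch of such a script, the example below takes a small, made-up raw table as input, produces a tidy table as output, and runs the conversion twice to confirm the output is identical. The raw layout, the function names `tidy` and `digest`, and the use of a SHA-256 digest for the comparison are all assumptions for illustration.

```python
import csv
import hashlib
import io

# Stands in for the contents of a raw file on disk (invented values).
RAW_TEXT = """sample,reading_1,reading_2
A,1.0,3.0
B,2.0,6.0
"""


def tidy(raw_text):
    """Turn the wide raw readings into one row per (sample, reading) pair."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample", "Reading", "Value"])
    for row in csv.DictReader(io.StringIO(raw_text)):
        for name in ("reading_1", "reading_2"):
            writer.writerow([row["sample"], name, row[name]])
    return out.getvalue()


def digest(text):
    """Fingerprint the output so two runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = tidy(RAW_TEXT)
second = tidy(RAW_TEXT)
same_output = digest(first) == digest(second)

print(first)
print("identical on rerun:", same_output)
```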

In many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process of collaboration. They may not know how to code in a scripting language. In that case, what you should provide the statistician is something called pseudocode. It should look something like:

Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
Step 2 - run the software separately for each sample
Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set

You should also include information about which system (Mac/Windows/Linux) you used the software on and whether you tried it more than once to confirm it gave the same results. Ideally, you will run this by a fellow student/labmate to confirm that they can obtain the same output file you did.
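If Python is available, the system information can be captured automatically rather than typed by hand. This is a minimal sketch using only the standard library. The function name `environment_record` and the choice of fields are assumptions, and note that `platform.system()` reports a Mac as "Darwin".

```python
import platform
import sys


def environment_record():
    """Collect basic facts about the machine the recipe was run on."""
    return {
        "system": platform.system(),    # e.g. "Linux", "Windows", "Darwin" (Mac)
        "release": platform.release(),  # operating system release string
        "machine": platform.machine(),  # hardware architecture
        "python_version": sys.version.split()[0],
    }


record = environment_record()
for key, value in record.items():
    print(f"{key}: {value}")
```

Pasting this output at the top of your instruction list gives the statistician the same details you would otherwise have to describe in words.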

What you should expect from the analyst

When you turn over a properly tidied data set it dramatically decreases the workload on the statistician. So hopefully they will get back to you much sooner. But most careful statisticians will check your recipe, ask questions about steps you performed, and try to confirm that they can obtain the same tidy data that you did with, at minimum, spot checks.

You should then expect from the statistician:

An analysis script that performs each of the analyses (not just instructions)
The exact computer code they used to run the analysis
All output files/figures they generated.

This is the information you will use in the supplement to establish reproducibility and precision of your results. Each of the steps in the analysis should be clearly explained and you should ask questions when you don't understand what the analyst did. It is the responsibility of both the statistician and the scientist to understand the statistical analysis. You may not be able to perform the exact analyses without the statistician's code, but you should be able to explain why the statistician performed each step to a labmate/your principal investigator.

Contributors

Jeff Leek - Wrote the initial version.
L. Collado-Torres - Fixed typos, added links.
Nick Reich - Added tips on storing data as text.
Nick Horton - Minor wording suggestions.
