PythonDataScienceHandbook

Python Data Science Handbook: full text in Jupyter Notebooks

44,377

18,283

217

Top Related Projects

introduction_to_ml_with_python

7,689

Notebooks and code for the book "Introduction to Machine Learning with Python"

pydata-book

23,068

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

handson-ml

25,359

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

data-science-ipython-notebooks

28,305

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

27,337

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

jupyter

15,104

Jupyter metapackage for installation, docs and chat

Quick Overview

The PythonDataScienceHandbook repository contains the full text of Jake VanderPlas's "Python Data Science Handbook," along with all the Jupyter Notebooks used to create the book. It covers essential tools for data science in Python, including IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages.

Pros

Comprehensive coverage of core data science tools in Python
Includes practical examples and explanations for each topic
Free and open-source, making it accessible to everyone
Jupyter Notebooks allow for interactive learning and experimentation

Cons

Written and tested with Python 3.5, so it does not reflect library changes made since publication
Focuses primarily on foundational tools, with little coverage of more advanced or specialized topics
Some examples may need adjustment to run on current versions of Python and its libraries
Large repository size due to inclusion of all book content and notebooks

Code Examples

Using NumPy for array operations:

import numpy as np

# Create a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Perform element-wise operations
result = arr * 2 + 1
print(result)

Data manipulation with Pandas:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Filter and sort the DataFrame
filtered_df = df[df['A'] > 1].sort_values('B', ascending=False)
print(filtered_df)

Plotting with Matplotlib:

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a line plot
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Getting Started

To get started with the Python Data Science Handbook:

Clone the repository:

git clone https://github.com/jakevdp/PythonDataScienceHandbook.git

Install required dependencies:

pip install jupyter numpy pandas matplotlib scikit-learn

Navigate to the repository directory and start Jupyter Notebook:
```
cd PythonDataScienceHandbook
jupyter notebook
```
Open and run the notebooks in the notebooks directory to explore the content interactively.
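
Before opening the notebooks, it can help to confirm that the packages installed above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions, and the list uses import names (scikit-learn imports as `sklearn`, and the `jupyter` metapackage is checked via the `notebook` package it pulls in).

```python
from importlib.util import find_spec

# Import names for the packages installed above (hypothetical list for illustration);
# scikit-learn is imported as "sklearn", and "notebook" stands in for the jupyter metapackage.
REQUIRED = ["notebook", "numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the script reports any missing packages, re-run the install step for those packages before starting Jupyter.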

Competitor Comparisons

introduction_to_ml_with_python

7,689

Notebooks and code for the book "Introduction to Machine Learning with Python"

Pros of Introduction to ML with Python

More focused on machine learning concepts and algorithms
Includes practical examples and case studies for real-world applications
Provides in-depth explanations of model evaluation and parameter tuning

Cons of Introduction to ML with Python

Less comprehensive coverage of data manipulation and visualization
May be more challenging for beginners without prior Python experience
Doesn't cover as wide a range of data science topics as PythonDataScienceHandbook

Code Comparison

Introduction to ML with Python:

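from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Example data; any feature matrix X and label vector y would work here
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)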

PythonDataScienceHandbook:

import pandas as pd
import matplotlib.pyplot as plt

# 'data.csv' is a placeholder for any CSV file with 'x' and 'y' columns
df = pd.read_csv('data.csv')
plt.scatter(df['x'], df['y'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

The code snippets demonstrate the different focus areas of each repository. Introduction to ML with Python emphasizes machine learning algorithms and model training, while PythonDataScienceHandbook covers data manipulation and visualization techniques.
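
Model evaluation, mentioned in the pros above, can be illustrated with cross-validation. This is a minimal sketch and not code taken from either repository; the Iris dataset and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris is used purely as convenient example data (an assumption, not from the source)
X, y = load_iris(return_X_y=True)

# Accuracy on each of 5 cross-validation folds for a 3-nearest-neighbors classifier
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
mean_accuracy = scores.mean()
print(f"Mean CV accuracy: {mean_accuracy:.3f}")
```

Averaging over folds gives a less optimistic estimate than scoring a model on the same data it was trained on.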

pydata-book

23,068

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

Pros of pydata-book

Covers a wider range of topics, including data manipulation, statistical analysis, and machine learning
Provides more in-depth explanations of statistical concepts and their applications
Includes more real-world datasets and examples for practical learning

Cons of pydata-book

Less focus on visualization techniques compared to PythonDataScienceHandbook
Code examples may be slightly outdated due to the book's earlier publication date
Fewer interactive elements and Jupyter notebook integrations

Code Comparison

PythonDataScienceHandbook:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()

pydata-book:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100),
                   'B': np.random.randn(100)})
print(df.describe())

The PythonDataScienceHandbook example focuses on visualization, while the pydata-book example demonstrates data manipulation and basic statistical analysis using pandas.
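
Beyond `describe()`, grouped aggregation is a common form of the data manipulation described here. The sketch below uses a small made-up DataFrame; the column names `group` and `value` are hypothetical.

```python
import pandas as pd

# Small made-up dataset for illustration
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Split the rows by group, then compute the mean and maximum of each group
summary = df.groupby("group")["value"].agg(["mean", "max"])
print(summary)
```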

handson-ml

25,359

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Pros of handson-ml

More focused on machine learning and deep learning techniques
Includes practical exercises and hands-on projects
Has a maintained successor, handson-ml3, where newer ML concepts and frameworks are covered

Cons of handson-ml

Less coverage of general data science topics and exploratory data analysis
May be more challenging for absolute beginners in Python and data science

Code Comparison

PythonDataScienceHandbook:

import pandas as pd
import matplotlib.pyplot as plt

# 'data.csv' is a placeholder for any CSV file with 'x' and 'y' columns
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
plt.show()

handson-ml:

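from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic example data; any feature matrix X and target vector y would work here
X, y = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=0)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)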

The PythonDataScienceHandbook focuses more on data manipulation and visualization, while handson-ml emphasizes machine learning model implementation and evaluation. Both repositories offer valuable resources for data scientists and machine learning practitioners, with PythonDataScienceHandbook providing a broader overview of data science concepts and handson-ml diving deeper into machine learning techniques and applications.
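
The evaluation step mentioned above can be sketched by scoring a fitted model on held-out data. This is an illustrative example on synthetic data, not code from handson-ml; the true slope (3.0), intercept (2.0), and noise level are assumptions chosen so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus Gaussian noise (assumed values for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Fit on 80% of the samples, then measure error on the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Estimated slope: {model.coef_[0]:.2f}, test MSE: {mse:.3f}")
```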

data-science-ipython-notebooks

28,305

Pros of data-science-ipython-notebooks

Covers a wider range of topics, including big data tools like Spark and Hadoop
Includes more practical examples and real-world applications
Offers a variety of mini-projects and exercises for hands-on learning

Cons of data-science-ipython-notebooks

Less structured and cohesive compared to PythonDataScienceHandbook
May be overwhelming for beginners due to the breadth of topics covered
Some notebooks may be outdated or use older versions of libraries

Code Comparison

PythonDataScienceHandbook:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()

data-science-ipython-notebooks:

from pyspark import SparkContext
sc = SparkContext("local", "Word Count")
text_file = sc.textFile("path/to/file.txt")
word_counts = text_file.flatMap(lambda line: line.split()) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)

The code examples highlight the difference in focus between the two repositories. PythonDataScienceHandbook emphasizes core data science libraries like NumPy and Matplotlib, while data-science-ipython-notebooks includes examples using more specialized tools like Apache Spark for big data processing.
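
For readers without a Spark installation, the logic of the word-count pipeline above can be sketched in plain Python. This is only an in-memory analogy under the assumption that the input fits on one machine; the function name `word_counts` is hypothetical.

```python
from collections import Counter


def word_counts(lines):
    """Count word occurrences across an iterable of text lines."""
    # Equivalent of flatMap: split every line and flatten into one stream of words
    words = (word for line in lines for word in line.split())
    # Equivalent of map + reduceByKey: tally one count per occurrence of each word
    return Counter(words)


counts = word_counts(["to be or not", "to be"])
print(counts.most_common(2))
```

Spark distributes these same steps across partitions of a large file, which is what the plain-Python version cannot do.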

Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

27,337

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Pros of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Focuses specifically on Bayesian methods and probabilistic programming
Provides hands-on examples using PyMC3 and TensorFlow Probability
Offers a more in-depth exploration of statistical concepts

Cons of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Narrower scope compared to the broader data science coverage in PythonDataScienceHandbook
May be more challenging for beginners due to its specialized focus

Code Comparison

PythonDataScienceHandbook:

import pandas as pd
import matplotlib.pyplot as plt

# 'data.csv' is a placeholder for any CSV file with 'x' and 'y' columns
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
plt.show()

Probabilistic-Programming-and-Bayesian-Methods-for-Hackers:

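import pymc3 as pm
import numpy as np

# Placeholder observations; substitute real data here
data = np.array([0.8, 1.1, 0.9, 1.3])

with pm.Model() as model:
    # `sigma` replaces the deprecated `sd` argument in recent PyMC3 releases
    mu = pm.Normal('mu', mu=0, sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=1, observed=data)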

The code snippets highlight the difference in focus between the two repositories. PythonDataScienceHandbook covers general data manipulation and visualization, while Probabilistic-Programming-and-Bayesian-Methods-for-Hackers emphasizes probabilistic modeling using specialized libraries like PyMC3.
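
The model in the PyMC3 snippet above (a Normal prior on `mu` and a Normal likelihood with known scale) happens to have a closed-form posterior, which makes it a useful sanity check for sampler output. The sketch below computes that posterior with the standard conjugate-Normal update; the function name `normal_posterior` and the sample observations are hypothetical.

```python
def normal_posterior(data, prior_mu=0.0, prior_sigma=1.0, obs_sigma=1.0):
    """Posterior mean and standard deviation of mu for a Normal-Normal model.

    Assumes mu ~ Normal(prior_mu, prior_sigma) and each observation
    ~ Normal(mu, obs_sigma) with obs_sigma known.
    """
    n = len(data)
    prior_precision = 1.0 / prior_sigma ** 2
    obs_precision = 1.0 / obs_sigma ** 2
    # Precisions add: one contribution from the prior, one per observation
    post_precision = prior_precision + n * obs_precision
    # Posterior mean is the precision-weighted average of prior mean and data
    post_mean = (prior_precision * prior_mu + obs_precision * sum(data)) / post_precision
    post_sigma = post_precision ** -0.5
    return post_mean, post_sigma


post_mean, post_sigma = normal_posterior([1.0, 2.0, 3.0])
print(post_mean, post_sigma)
```

With more complex models no such formula exists, which is where the sampling-based approach of probabilistic programming libraries becomes necessary.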

jupyter

15,104

Jupyter metapackage for installation, docs and chat

Pros of Jupyter

Broader scope, covering the entire Jupyter ecosystem
More active development and community engagement
Provides a comprehensive platform for interactive computing

Cons of Jupyter

Less focused on data science-specific content
Steeper learning curve for beginners
More complex project structure

Code Comparison

PythonDataScienceHandbook:

import pandas as pd
import matplotlib.pyplot as plt

# 'data.csv' is a placeholder for any CSV file with 'x' and 'y' columns
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
plt.show()

Jupyter:

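# This import path applies to the classic Notebook package (releases before version 7)
from notebook.notebookapp import NotebookApp
import sys

if __name__ == '__main__':
    # Start the notebook server and exit with its return status
    sys.exit(NotebookApp.launch_instance())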

The PythonDataScienceHandbook example demonstrates typical data science workflows, while the Jupyter code snippet shows how to launch a Jupyter Notebook application programmatically.

PythonDataScienceHandbook focuses on practical data science examples and tutorials, making it ideal for learning data science concepts. Jupyter, on the other hand, provides the underlying infrastructure for interactive computing, catering to a wider range of use cases beyond just data science.

README

Python Data Science Handbook

This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.

How to Use this Book

Read the book in its entirety online at https://jakevdp.github.io/PythonDataScienceHandbook/
Run the code using the Jupyter notebooks available in this repository's notebooks directory.
Launch executable versions of these notebooks using Google Colab
Launch a live notebook server with these notebooks using binder
Buy the printed book through O'Reilly Media

About

The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.

The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.

See Index.ipynb for an index of the notebooks available to accompany the text.

Software

The code in the book was tested with Python 3.5, though most (but not all) will also work correctly with Python 2.7 and other older Python versions.

The packages I used to run the code in the book are listed in requirements.txt (Note that some of these exact version numbers may not be available on your platform: you may have to tweak them for your own use). To install the requirements using conda, run the following at the command-line:

$ conda install --file requirements.txt

To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:

$ conda create -n PDSH python=3.5 --file requirements.txt

You can read more about using conda environments in the Managing Environments section of the conda documentation.

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.

Text

The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.
