PythonDataScienceHandbook
Python Data Science Handbook: full text in Jupyter Notebooks
Top Related Projects
- Introduction to ML with Python: Notebooks and code for the book "Introduction to Machine Learning with Python"
- pydata-book: Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media
- handson-ml: ⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
- data-science-ipython-notebooks: Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
- Probabilistic-Programming-and-Bayesian-Methods-for-Hackers: aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
- Jupyter: Jupyter metapackage for installation, docs and chat
Quick Overview
The PythonDataScienceHandbook repository contains the full text of Jake VanderPlas's "Python Data Science Handbook," along with all the Jupyter Notebooks used to create the book. It covers essential tools for data science in Python, including IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages.
Pros
- Comprehensive coverage of core data science tools in Python
- Includes practical examples and explanations for each topic
- Free and open-source, making it accessible to everyone
- Jupyter Notebooks allow for interactive learning and experimentation
Cons
- Written and tested with Python 3.5, so some examples may not run unchanged against current library releases
- Focuses primarily on foundational tools, with little coverage of more advanced or specialized topics
- Large repository size due to inclusion of all book content and notebooks
Code Examples
- Using NumPy for array operations:
import numpy as np
# Create a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6]])
# Perform element-wise operations
result = arr * 2 + 1
print(result)
- Data manipulation with Pandas:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Filter and sort the DataFrame
filtered_df = df[df['A'] > 1].sort_values('B', ascending=False)
print(filtered_df)
- Plotting with Matplotlib:
import matplotlib.pyplot as plt
import numpy as np
# Generate data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a line plot
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()
Getting Started
To get started with the Python Data Science Handbook:
- Clone the repository:
  git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
- Install required dependencies:
  pip install jupyter numpy pandas matplotlib scikit-learn
- Navigate to the repository directory and start Jupyter Notebook:
  cd PythonDataScienceHandbook
  jupyter notebook
- Open and run the notebooks in the notebooks directory to explore the content interactively.
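Before opening the notebooks, it can help to confirm that the packages from the install step are actually present in the active environment. The following is a minimal sketch using only the standard library; the function name `installed_versions` and the `REQUIRED` list are illustrative assumptions, with the distribution names copied from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip command above.
REQUIRED = ["jupyter", "numpy", "pandas", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Any package reported as not installed can be added with pip or conda before starting Jupyter.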
Competitor Comparisons
Notebooks and code for the book "Introduction to Machine Learning with Python"
Pros of Introduction to ML with Python
- More focused on machine learning concepts and algorithms
- Includes practical examples and case studies for real-world applications
- Provides in-depth explanations of model evaluation and parameter tuning
Cons of Introduction to ML with Python
- Less comprehensive coverage of data manipulation and visualization
- May be more challenging for beginners without prior Python experience
- Doesn't cover as wide a range of data science topics as PythonDataScienceHandbook
Code Comparison
Introduction to ML with Python:
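from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Example data (an assumption; the original snippet leaves X and y undefined)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)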
PythonDataScienceHandbook:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
plt.scatter(df['x'], df['y'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
The code snippets demonstrate the different focus areas of each repository. Introduction to ML with Python emphasizes machine learning algorithms and model training, while PythonDataScienceHandbook covers data manipulation and visualization techniques.
Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media
Pros of pydata-book
- Covers a wider range of topics, including data manipulation, statistical analysis, and machine learning
- Provides more in-depth explanations of statistical concepts and their applications
- Includes more real-world datasets and examples for practical learning
Cons of pydata-book
- Less focus on visualization techniques compared to PythonDataScienceHandbook
- Code examples may be slightly outdated due to the book's earlier publication date
- Fewer interactive elements and Jupyter notebook integrations
Code Comparison
PythonDataScienceHandbook:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()
pydata-book:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.random.randn(100),
                   'B': np.random.randn(100)})
print(df.describe())
The PythonDataScienceHandbook example focuses on visualization, while the pydata-book example demonstrates data manipulation and basic statistical analysis using pandas.
⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Pros of handson-ml
- More focused on machine learning and deep learning techniques
- Includes practical exercises and hands-on projects
- Tracks newer ML concepts and frameworks through its successor, handson-ml3 (this repository itself is deprecated)
Cons of handson-ml
- Less coverage of general data science topics and exploratory data analysis
- May be more challenging for absolute beginners in Python and data science
Code Comparison
PythonDataScienceHandbook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
handson-ml:
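from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Synthetic data (an assumption; the original snippet leaves X and y undefined)
X, y = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)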
The PythonDataScienceHandbook focuses more on data manipulation and visualization, while handson-ml emphasizes machine learning model implementation and evaluation. Both repositories offer valuable resources for data scientists and machine learning practitioners, with PythonDataScienceHandbook providing a broader overview of data science concepts and handson-ml diving deeper into machine learning techniques and applications.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Pros of data-science-ipython-notebooks
- Covers a wider range of topics, including big data tools like Spark and Hadoop
- Includes more practical examples and real-world applications
- Offers a variety of mini-projects and exercises for hands-on learning
Cons of data-science-ipython-notebooks
- Less structured and cohesive compared to PythonDataScienceHandbook
- May be overwhelming for beginners due to the breadth of topics covered
- Some notebooks may be outdated or use older versions of libraries
Code Comparison
PythonDataScienceHandbook:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()
data-science-ipython-notebooks:
from pyspark import SparkContext
sc = SparkContext("local", "Word Count")
text_file = sc.textFile("path/to/file.txt")
word_counts = text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
The code examples highlight the difference in focus between the two repositories. PythonDataScienceHandbook emphasizes core data science libraries like NumPy and Matplotlib, while data-science-ipython-notebooks includes examples using more specialized tools like Apache Spark for big data processing.
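For readers without a Spark installation, the flatMap / map / reduceByKey chain in the Spark snippet can be mirrored in plain Python. This is only an illustrative, single-machine analogue, not how Spark executes the job; the function name `word_counts` and the sample lines are assumptions made for the example.

```python
from collections import Counter


def word_counts(lines):
    """Count words across lines, mirroring flatMap -> map -> reduceByKey."""
    counts = Counter()
    for line in lines:
        # flatMap step: each line expands into several words.
        for word in line.split():
            # map to (word, 1) and reduceByKey(add), done in one update.
            counts[word] += 1
    return dict(counts)


# Hypothetical input standing in for the lines of a text file.
sample = ["to be or not", "to be"]
print(word_counts(sample))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The Spark version distributes the same three steps across partitions, which is what makes it suitable for files too large for one machine.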
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
Pros of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- Focuses specifically on Bayesian methods and probabilistic programming
- Provides hands-on examples using PyMC3 and TensorFlow Probability
- Offers a more in-depth exploration of statistical concepts
Cons of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
- Narrower scope compared to the broader data science coverage in PythonDataScienceHandbook
- May be more challenging for beginners due to its specialized focus
Code Comparison
PythonDataScienceHandbook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
plt.show()
Probabilistic-Programming-and-Bayesian-Methods-for-Hackers:
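import pymc3 as pm
import numpy as np
# Placeholder observations (an assumption; the original snippet leaves data undefined)
data = np.random.randn(100)
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1)
    obs = pm.Normal('obs', mu=mu, sd=1, observed=data)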
The code snippets highlight the difference in focus between the two repositories. PythonDataScienceHandbook covers general data manipulation and visualization, while Probabilistic-Programming-and-Bayesian-Methods-for-Hackers emphasizes probabilistic modeling using specialized libraries like PyMC3.
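The PyMC3 model above, a Normal(0, 1) prior on mu with unit-variance Normal observations, happens to have a closed-form posterior, which makes it a convenient check on what a sampler should return. The sketch below applies the standard normal-normal conjugate update using only the standard library; the function name `normal_posterior` and the sample data are assumptions for illustration and do not come from either repository.

```python
def normal_posterior(data, prior_mu=0.0, prior_sd=1.0, obs_sd=1.0):
    """Posterior mean and standard deviation of mu for a normal-normal model.

    Prior: mu ~ Normal(prior_mu, prior_sd)
    Likelihood: each observation ~ Normal(mu, obs_sd), with obs_sd known.
    """
    prior_precision = 1.0 / prior_sd ** 2
    obs_precision = 1.0 / obs_sd ** 2
    n = len(data)
    # Precisions add: one term from the prior, one per observation.
    post_precision = prior_precision + n * obs_precision
    # Posterior mean is the precision-weighted average of prior mean and data.
    post_mean = (prior_precision * prior_mu + obs_precision * sum(data)) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd


# Hypothetical observations.
mean, sd = normal_posterior([1.0, 2.0, 3.0])
print(mean, sd)  # 1.5 0.5
```

With three observations summing to 6, the posterior precision is 1 + 3 = 4, giving mean 6 / 4 = 1.5 and standard deviation 0.5.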
Jupyter metapackage for installation, docs and chat
Pros of Jupyter
- Broader scope, covering the entire Jupyter ecosystem
- More active development and community engagement
- Provides a comprehensive platform for interactive computing
Cons of Jupyter
- Less focused on data science-specific content
- Steeper learning curve for beginners
- More complex project structure
Code Comparison
PythonDataScienceHandbook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
Jupyter:
# Import path used by the classic Notebook package (versions before 7)
from notebook.notebookapp import NotebookApp
import sys
if __name__ == '__main__':
    sys.exit(NotebookApp.launch_instance())
The PythonDataScienceHandbook example demonstrates typical data science workflows, while the Jupyter code snippet shows how to launch a Jupyter Notebook application programmatically.
PythonDataScienceHandbook focuses on practical data science examples and tutorials, making it ideal for learning data science concepts. Jupyter, on the other hand, provides the underlying infrastructure for interactive computing, catering to a wider range of use cases beyond just data science.
README
Python Data Science Handbook
This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.
How to Use this Book
- Read the book in its entirety online at https://jakevdp.github.io/PythonDataScienceHandbook/
- Run the code using the Jupyter notebooks available in this repository's notebooks directory.
- Launch executable versions of these notebooks using Google Colab.
- Launch a live notebook server with these notebooks using binder.
- Buy the printed book through O'Reilly Media.
About
The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.
The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.
See Index.ipynb for an index of the notebooks available to accompany the text.
Software
The code in the book was tested with Python 3.5, though most (but not all) will also work correctly with Python 2.7 and other older Python versions.
The packages I used to run the code in the book are listed in requirements.txt (Note that some of these exact version numbers may not be available on your platform: you may have to tweak them for your own use). To install the requirements using conda, run the following at the command-line:
$ conda install --file requirements.txt
To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:
$ conda create -n PDSH python=3.5 --file requirements.txt
You can read more about using conda environments in the Managing Environments section of the conda documentation.
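Since some pinned versions may need tweaking for a given platform, it can be useful to see the pins at a glance before editing them. The following is a rough sketch that assumes requirements.txt holds one `name=version` or `name==version` entry per line. That layout is an assumption, not something confirmed here, and the helper name `parse_pins` and the sample versions are made up for illustration.

```python
import re


def parse_pins(text):
    """Return {package: pinned_version_or_None} from requirements-style text.

    Only exact pins written as name=version or name==version are recognised;
    any other specifier is recorded with a version of None.
    """
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = re.match(r"^([A-Za-z0-9_.\-]+)\s*(?:==?\s*([^\s=]+))?", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


# Made-up sample content; the real file's entries may differ.
sample = "numpy==1.11.1\npandas=0.18.1\n# a comment\n\nscipy\n"
print(parse_pins(sample))
```

To inspect the actual file, pass the result of `open("requirements.txt").read()` to `parse_pins` from the repository root.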
License
Code
The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.
Text
The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.