jakevdp/PythonDataScienceHandbook

Python Data Science Handbook: full text in Jupyter Notebooks

Top Related Projects

Notebooks and code for the book "Introduction to Machine Learning with Python"

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Jupyter metapackage for installation, docs and chat

Quick Overview

The PythonDataScienceHandbook repository contains the full text of Jake VanderPlas's "Python Data Science Handbook," along with all the Jupyter Notebooks used to create the book. It covers essential tools for data science in Python, including IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages.

Pros

  • Comprehensive coverage of core data science tools in Python
  • Includes practical examples and explanations for each topic
  • Free and open-source, making it accessible to everyone
  • Jupyter Notebooks allow for interactive learning and experimentation

Cons

  • May not cover the most recent updates to libraries since its publication
  • Focuses primarily on foundational tools, potentially lacking coverage of more advanced or specialized topics
  • Some examples may become outdated as Python and its libraries evolve
  • Large repository size due to inclusion of all book content and notebooks

Code Examples

  1. Using NumPy for array operations:
import numpy as np

# Create a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Perform element-wise operations
result = arr * 2 + 1
print(result)
  2. Data manipulation with Pandas:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Filter and sort the DataFrame
filtered_df = df[df['A'] > 1].sort_values('B', ascending=False)
print(filtered_df)
  3. Plotting with Matplotlib:
import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a line plot
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Getting Started

To get started with the Python Data Science Handbook:

  1. Clone the repository:

    git clone https://github.com/jakevdp/PythonDataScienceHandbook.git
    
  2. Install required dependencies:

    pip install jupyter numpy pandas matplotlib scikit-learn
    
  3. Navigate to the repository directory and start Jupyter Notebook:

    cd PythonDataScienceHandbook
    jupyter notebook
    
  4. Open and run the notebooks in the notebooks directory to explore the content interactively.
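
Before opening the notebooks, it can help to confirm that the packages from step 2 are actually installed. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_versions` is hypothetical and not part of the repository.

```python
from importlib import metadata

# Distribution names from the pip install command in step 2
REQUIRED = ["jupyter", "numpy", "pandas", "matplotlib", "scikit-learn"]


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:15s} {version or 'MISSING'}")
```

Any package reported as MISSING can be installed with pip before starting Jupyter.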

Competitor Comparisons

Notebooks and code for the book "Introduction to Machine Learning with Python"

Pros of Introduction to ML with Python

  • More focused on machine learning concepts and algorithms
  • Includes practical examples and case studies for real-world applications
  • Provides in-depth explanations of model evaluation and parameter tuning

Cons of Introduction to ML with Python

  • Less comprehensive coverage of data manipulation and visualization
  • May be more challenging for beginners without prior Python experience
  • Doesn't cover as wide a range of data science topics as PythonDataScienceHandbook

Code Comparison

Introduction to ML with Python:

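from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset so the snippet runs on its own (an illustrative choice)
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)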

PythonDataScienceHandbook:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
plt.scatter(df['x'], df['y'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

The code snippets demonstrate the different focus areas of each repository. Introduction to ML with Python emphasizes machine learning algorithms and model training, while PythonDataScienceHandbook covers data manipulation and visualization techniques.

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media

Pros of pydata-book

  • Covers a wider range of topics, including data manipulation, statistical analysis, and machine learning
  • Provides more in-depth explanations of statistical concepts and their applications
  • Includes more real-world datasets and examples for practical learning

Cons of pydata-book

  • Less focus on visualization techniques compared to PythonDataScienceHandbook
  • Code examples may be slightly outdated due to the book's earlier publication date
  • Fewer interactive elements and Jupyter notebook integrations

Code Comparison

PythonDataScienceHandbook:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()

pydata-book:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.random.randn(100),
                   'B': np.random.randn(100)})
print(df.describe())

The PythonDataScienceHandbook example focuses on visualization, while the pydata-book example demonstrates data manipulation and basic statistical analysis using pandas.

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.

Pros of handson-ml

  • More focused on machine learning and deep learning techniques
  • Includes practical exercises and hands-on projects
  • Has a successor, handson-ml3, for newer ML concepts and frameworks; this repository itself is deprecated

Cons of handson-ml

  • Less coverage of general data science topics and exploratory data analysis
  • May be more challenging for absolute beginners in Python and data science

Code Comparison

PythonDataScienceHandbook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])

handson-ml:

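from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data so the snippet runs on its own (an illustrative choice)
X, y = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)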

The PythonDataScienceHandbook focuses more on data manipulation and visualization, while handson-ml emphasizes machine learning model implementation and evaluation. Both repositories offer valuable resources for data scientists and machine learning practitioners, with PythonDataScienceHandbook providing a broader overview of data science concepts and handson-ml diving deeper into machine learning techniques and applications.

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Pros of data-science-ipython-notebooks

  • Covers a wider range of topics, including big data tools like Spark and Hadoop
  • Includes more practical examples and real-world applications
  • Offers a variety of mini-projects and exercises for hands-on learning

Cons of data-science-ipython-notebooks

  • Less structured and cohesive compared to PythonDataScienceHandbook
  • May be overwhelming for beginners due to the breadth of topics covered
  • Some notebooks may be outdated or use older versions of libraries

Code Comparison

PythonDataScienceHandbook:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()

data-science-ipython-notebooks:

from pyspark import SparkContext
sc = SparkContext("local", "Word Count")
text_file = sc.textFile("path/to/file.txt")
word_counts = text_file.flatMap(lambda line: line.split()) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a + b)

The code examples highlight the difference in focus between the two repositories. PythonDataScienceHandbook emphasizes core data science libraries like NumPy and Matplotlib, while data-science-ipython-notebooks includes examples using more specialized tools like Apache Spark for big data processing.

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Pros of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

  • Focuses specifically on Bayesian methods and probabilistic programming
  • Provides hands-on examples using PyMC3 and TensorFlow Probability
  • Offers a more in-depth exploration of statistical concepts

Cons of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

  • Narrower scope compared to the broader data science coverage in PythonDataScienceHandbook
  • May be more challenging for beginners due to its specialized focus

Code Comparison

PythonDataScienceHandbook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])
plt.show()

Probabilistic-Programming-and-Bayesian-Methods-for-Hackers:

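import pymc3 as pm
import numpy as np

# Placeholder observations so the snippet runs on its own
data = np.random.randn(100)

with pm.Model() as model:
    # `sigma` is the current name for the standard-deviation argument (formerly `sd`)
    mu = pm.Normal('mu', mu=0, sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=1, observed=data)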

The code snippets highlight the difference in focus between the two repositories. PythonDataScienceHandbook covers general data manipulation and visualization, while Probabilistic-Programming-and-Bayesian-Methods-for-Hackers emphasizes probabilistic modeling using specialized libraries like PyMC3.

Jupyter metapackage for installation, docs and chat

Pros of Jupyter

  • Broader scope, covering the entire Jupyter ecosystem
  • More active development and community engagement
  • Provides a comprehensive platform for interactive computing

Cons of Jupyter

  • Less focused on data science-specific content
  • Steeper learning curve for beginners
  • More complex project structure

Code Comparison

PythonDataScienceHandbook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
plt.scatter(data['x'], data['y'])

Jupyter:

from notebook.notebookapp import NotebookApp
import sys

if __name__ == '__main__':
    sys.exit(NotebookApp.launch_instance())

The PythonDataScienceHandbook example demonstrates typical data science workflows, while the Jupyter code snippet shows how to launch a Jupyter Notebook application programmatically.

PythonDataScienceHandbook focuses on practical data science examples and tutorials, making it ideal for learning data science concepts. Jupyter, on the other hand, provides the underlying infrastructure for interactive computing, catering to a wider range of use cases beyond just data science.

README

Python Data Science Handbook

This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.

How to Use this Book

About

The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.

The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.

See Index.ipynb for an index of the notebooks available to accompany the text.
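
To see which notebooks are available locally without opening Index.ipynb, the files can be listed directly. This is a small sketch that assumes the notebooks sit in a `notebooks` directory inside the cloned repository; `list_notebooks` is a hypothetical helper name.

```python
from pathlib import Path


def list_notebooks(root="notebooks"):
    """Return the sorted names of all .ipynb files directly under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Directory not found: return an empty list rather than raising
        return []
    return sorted(p.name for p in root_path.glob("*.ipynb"))


for name in list_notebooks():
    print(name)
```

Run it from the repository root, or pass a different path as `root` if the notebooks are stored elsewhere.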

Software

The code in the book was tested with Python 3.5, though most (but not all) will also work correctly with Python 2.7 and other older Python versions.

The packages I used to run the code in the book are listed in requirements.txt. Note that some of these exact version numbers may not be available on your platform, so you may have to tweak them for your own use. To install the requirements using conda, run the following at the command line:

$ conda install --file requirements.txt

To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:

$ conda create -n PDSH python=3.5 --file requirements.txt

You can read more about using conda environments in the Managing Environments section of the conda documentation.
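
If some of the pinned versions in requirements.txt are not available on your platform, one option is to produce an unpinned list of package names and let the package manager pick versions. The sketch below assumes each requirement line has the form `name=version` or `name==version`, optionally followed by a `#` comment; that format, the helper name `strip_pins`, and the example lines are all assumptions for illustration. Unpinned installs may behave differently from the versions the book was tested with.

```python
import re


def strip_pins(lines):
    """Drop version specifiers from requirement lines, keeping package names."""
    names = []
    for line in lines:
        # Remove trailing comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Keep everything before the first version operator (=, ==, >=, <, ~=, !=)
        name = re.split(r"[=<>!~]", line, maxsplit=1)[0].strip()
        if name:
            names.append(name)
    return names


# Made-up example lines, not the repository's actual pins
example = ["numpy=1.11.1", "pandas==0.18.1  # pinned", "", "# a comment", "scipy>=0.17"]
print(strip_pins(example))
```

To apply it to the real file, pass the lines read from requirements.txt and write the returned names to a new file.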

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.

Text

The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.