Statsmodels: statistical modeling and econometrics in Python

Top Related Projects

  • SciPy: SciPy library main repository
  • scikit-learn: machine learning in Python
  • pandas: Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
  • matplotlib: plotting with Python
  • NumPy: The fundamental package for scientific computing with Python.
  • Prophet: Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Quick Overview

Statsmodels is a Python library for statistical modeling and econometrics. It provides a comprehensive set of tools for statistical inference, hypothesis testing, and data exploration. The library is designed to complement other scientific Python libraries like NumPy, SciPy, and Pandas.

Pros

  • Extensive collection of statistical models and methods
  • Well-documented with detailed API references and examples
  • Integrates seamlessly with other scientific Python libraries
  • Actively maintained and regularly updated

Cons

  • Steeper learning curve compared to some other statistical libraries
  • Can be slower for large datasets compared to specialized libraries
  • Some advanced features may require additional dependencies
  • Documentation can be technical and challenging for beginners

Code Examples

  1. Linear Regression
import statsmodels.api as sm

X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [2, 4, 5, 4, 5]

model = sm.OLS(y, X).fit()
print(model.summary())

This example demonstrates how to perform a simple linear regression using statsmodels.

  2. Time Series Analysis
import pandas as pd
import statsmodels.api as sm

# Assuming 'data' is a pandas DataFrame with a DatetimeIndex
model = sm.tsa.ARIMA(data['value'], order=(1, 1, 1))
results = model.fit()
print(results.summary())

This code snippet shows how to fit an ARIMA model to a time series dataset.

  3. Hypothesis Testing
from statsmodels.stats.proportion import proportions_ztest

count = [5, 12]
nobs = [83, 99]

stat, pvalue = proportions_ztest(count, nobs)
print(f'Z-statistic: {stat:.4f}')
print(f'P-value: {pvalue:.4f}')

This example demonstrates how to perform a two-sample test of proportions using statsmodels.
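To make the statistic less of a black box, the sketch below recomputes it by hand with only the standard library. It assumes the pooled-variance form of the two-sample z-test and a two-sided alternative, which is my understanding of the defaults of `proportions_ztest`; treat it as an illustration rather than a description of the library's internals.

```python
import math

count = [5, 12]   # successes in each sample
nobs = [83, 99]   # sample sizes

p1 = count[0] / nobs[0]
p2 = count[1] / nobs[1]

# Pooled proportion under the null hypothesis that both proportions are equal
p_pooled = sum(count) / sum(nobs)
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))

z = (p1 - p2) / std_err
# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Z-statistic: {z:.4f}")
print(f"P-value: {p_value:.4f}")
```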

Getting Started

To get started with statsmodels, follow these steps:

  1. Install statsmodels using pip:

    pip install statsmodels
    
  2. Import the library in your Python script:

    import statsmodels.api as sm
    
  3. Load your data and create a model:

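    import numpy as np
    rng = np.random.default_rng(0)              # seeded for reproducibility
    X = sm.add_constant(rng.random((100, 3)))   # OLS fits no intercept unless one is added
    y = rng.random(100)
    model = sm.OLS(y, X).fit()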
    
  4. Analyze the results:

    print(model.summary())
    

This quick start guide will help you set up statsmodels and run a basic linear regression model.
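When the data already lives in a pandas DataFrame, the formula interface is an alternative to building the design matrix by hand. The following is a minimal sketch on synthetic data; the column names `x` and `y` and the true coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends linearly on x, plus a little noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(100)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=0.1, size=100)

# The formula adds an intercept automatically
results = smf.ols("y ~ x", data=df).fit()
print(results.params)
```

With a formula, the fitted parameters are labeled by term name (`Intercept` and `x` here), which makes the output easier to read than positional coefficients.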

Competitor Comparisons

12,986

SciPy library main repository

Pros of SciPy

  • Broader scope, covering a wide range of scientific computing tasks
  • More mature and established project with a larger community
  • Faster performance for many numerical operations

Cons of SciPy

  • Less specialized for statistical analysis compared to statsmodels
  • May require additional libraries for advanced statistical modeling
  • Documentation can be more technical and less accessible for beginners

Code Comparison

statsmodels:

import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

SciPy:

from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(f"Slope: {slope}, Intercept: {intercept}, R-value: {r_value}")

statsmodels provides more comprehensive statistical output (standard errors, tests, and fit diagnostics in one summary table), while SciPy offers a simpler interface for basic linear regression. statsmodels is better suited to in-depth statistical analysis, whereas SciPy is more versatile for general scientific computing tasks.

scikit-learn: machine learning in Python

Pros of scikit-learn

  • Broader range of machine learning algorithms and tools
  • More user-friendly API with consistent interfaces
  • Extensive documentation and community support

Cons of scikit-learn

  • Less focus on statistical inference and hypothesis testing
  • Fewer specialized econometric models

Code Comparison

scikit-learn:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X_test)

statsmodels:

import statsmodels.api as sm
# OLS fits no intercept by default; pass X through sm.add_constant to include one
model = sm.OLS(y, X)
results = model.fit()
predictions = results.predict(X_test)

Summary

scikit-learn is a versatile machine learning library with a wide range of algorithms and a user-friendly API. It's excellent for predictive modeling and general-purpose machine learning tasks. statsmodels, on the other hand, specializes in statistical modeling and econometrics, offering more detailed statistical output and hypothesis testing capabilities. While scikit-learn is more popular for general machine learning, statsmodels is often preferred for in-depth statistical analysis and econometric modeling.

43,532

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Pros of pandas

  • More versatile for data manipulation and preprocessing
  • Faster performance for large datasets
  • Better integration with other data science libraries

Cons of pandas

  • Less specialized for statistical modeling
  • Steeper learning curve for advanced statistical analyses
  • Limited built-in statistical tests and models

Code Comparison

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
grouped = df.groupby('category').mean()
result = grouped['value'].sort_values(ascending=False)

statsmodels:

import statsmodels.api as sm

model = sm.OLS(y, X).fit()
predictions = model.predict(X_new)
summary = model.summary()

pandas excels in data manipulation and preprocessing, offering powerful tools for handling various data formats and structures. It provides faster performance for large datasets and integrates well with other data science libraries. However, pandas has a steeper learning curve for advanced statistical analyses and offers limited built-in statistical tests and models compared to statsmodels.

statsmodels, on the other hand, specializes in statistical modeling and econometrics. It provides a wide range of statistical tests, estimators, and tools for in-depth statistical analysis. While it may not be as versatile for general data manipulation, statsmodels offers more comprehensive statistical capabilities and is better suited for advanced statistical modeling tasks.

matplotlib: plotting with Python

Pros of matplotlib

  • More comprehensive and versatile plotting library
  • Extensive documentation and large community support
  • Highly customizable with fine-grained control over plot elements

Cons of matplotlib

  • Steeper learning curve for complex visualizations
  • Can be verbose for simple plots
  • Less focus on statistical analysis and modeling

Code Comparison

matplotlib:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.show()

statsmodels:

import statsmodels.api as sm
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, 0.1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())

Summary

matplotlib excels in creating a wide range of visualizations with high customizability, while statsmodels focuses on statistical analysis and modeling. matplotlib is better suited for complex plotting needs, whereas statsmodels is more appropriate for statistical inference and econometric analysis. The choice between the two depends on the specific requirements of your data analysis project.

27,792

The fundamental package for scientific computing with Python.

Pros of NumPy

  • Faster performance for numerical operations
  • More comprehensive array manipulation capabilities
  • Broader adoption and ecosystem integration

Cons of NumPy

  • Limited statistical modeling functionality
  • Less focus on econometrics and time series analysis
  • Steeper learning curve for complex statistical operations

Code Comparison

NumPy:

import numpy as np

# Create an array and perform basic operations
arr = np.array([1, 2, 3, 4, 5])
mean = np.mean(arr)
std = np.std(arr)

Statsmodels:

import statsmodels.api as sm

# Perform linear regression
X = sm.add_constant([1, 2, 3, 4, 5])
y = [2, 4, 5, 4, 5]
model = sm.OLS(y, X).fit()

NumPy excels in array operations and numerical computations, making it ideal for general-purpose scientific computing. It provides a solid foundation for many data science libraries.

Statsmodels, on the other hand, specializes in statistical modeling and econometrics. It offers a wide range of statistical tests, estimators, and tools for analyzing datasets, particularly in economics and social sciences.

While NumPy is more widely used and versatile, Statsmodels provides more specialized statistical functionality out of the box, making it valuable for specific analytical tasks.
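To show how the two relate, the point estimates of an ordinary least squares fit can be reproduced with NumPy alone. This is a sketch of the underlying least-squares computation on the small dataset used above, not a description of how statsmodels implements it; what statsmodels adds on top is the inference (standard errors, tests, summary tables).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ beta ~= y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination from the residual and total sums of squares
residuals = y - X @ beta
r_squared = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

print(beta, r_squared)
```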

18,363

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Pros of Prophet

  • User-friendly interface for time series forecasting
  • Handles missing data and outliers automatically
  • Built-in seasonality detection and holiday effects

Cons of Prophet

  • Less flexible for custom model specifications
  • Built around a decomposable additive model (with an optional multiplicative seasonality mode), which may not suit all use cases
  • Fewer statistical tests and diagnostics available

Code Comparison

Prophet:

from prophet import Prophet  # the package was renamed from fbprophet in version 1.0
model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)

Statsmodels:

import statsmodels.api as sm
model = sm.tsa.statespace.SARIMAX(df['y'], order=(1,1,1), seasonal_order=(1,1,1,12))
results = model.fit()
forecast = results.forecast(steps=365)

Prophet offers a simpler API for quick forecasting, while Statsmodels provides more control over model specification. Prophet is designed for business forecasting with built-in seasonality handling, whereas Statsmodels offers a broader range of statistical models and tests. Prophet is more accessible for non-experts, but Statsmodels provides greater flexibility for advanced users and complex time series analysis.

README

.. image:: docs/source/images/statsmodels-logo-v2-horizontal.svg
  :alt: Statsmodels logo

|PyPI Version| |Conda Version| |License| |Azure CI Build Status| |Codecov Coverage| |Coveralls Coverage| |PyPI downloads| |Conda downloads|

About statsmodels
=================

statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics and estimation and inference for statistical models.

Documentation
=============

The documentation for the latest release is at

https://www.statsmodels.org/stable/

The documentation for the development version is at

https://www.statsmodels.org/dev/

Recent improvements are highlighted in the release notes

https://www.statsmodels.org/stable/release/

Backups of documentation are available at https://statsmodels.github.io/stable/ and https://statsmodels.github.io/dev/.

Main Features
=============

  • Linear regression models:

    • Ordinary least squares
    • Generalized least squares
    • Weighted least squares
    • Least squares with autoregressive errors
    • Quantile regression
    • Recursive least squares
  • Mixed Linear Model with mixed effects and variance components

  • GLM: Generalized linear models with support for all of the one-parameter exponential family distributions

  • Bayesian Mixed GLM for Binomial and Poisson

  • GEE: Generalized Estimating Equations for one-way clustered or longitudinal data

  • Discrete models:

    • Logit and Probit
    • Multinomial logit (MNLogit)
    • Poisson and Generalized Poisson regression
    • Negative Binomial regression
    • Zero-Inflated Count models
  • RLM: Robust linear models with support for several M-estimators.

  • Time Series Analysis: models for time series analysis

    • Complete StateSpace modeling framework

      • Seasonal ARIMA and ARIMAX models
      • VARMA and VARMAX models
      • Dynamic Factor models
      • Unobserved Component models
    • Markov switching models (MSAR), also known as Hidden Markov Models (HMM)

    • Univariate time series analysis: AR, ARIMA

    • Vector autoregressive models, VAR and structural VAR

    • Vector error correction model, VECM

    • Exponential smoothing, Holt-Winters

    • Hypothesis tests for time series: unit root, cointegration and others

    • Descriptive statistics and process models for time series analysis

  • Survival analysis:

    • Proportional hazards regression (Cox models)
    • Survivor function estimation (Kaplan-Meier)
    • Cumulative incidence function estimation
  • Multivariate:

    • Principal Component Analysis with missing data
    • Factor Analysis with rotation
    • MANOVA
    • Canonical Correlation
  • Nonparametric statistics: Univariate and multivariate kernel density estimators

  • Datasets: Datasets used for examples and in testing

  • Statistics: a wide range of statistical tests

    • diagnostics and specification tests
    • goodness-of-fit and normality tests
    • functions for multiple testing
    • various additional statistical tests
  • Imputation with MICE, regression on order statistics, and Gaussian imputation

  • Mediation analysis

  • Graphics includes plot functions for visual analysis of data and model results

  • I/O

    • Tools for reading Stata .dta files (pandas provides a more recent reader)
    • Table output to ascii, latex, and html
  • Miscellaneous models

  • Sandbox: statsmodels contains a sandbox folder with code in various stages of development and testing which is not considered "production ready". This covers among others

    • Generalized method of moments (GMM) estimators
    • Kernel regression
    • Various extensions to scipy.stats.distributions
    • Panel data models
    • Information theoretic measures

How to get it
=============

The main branch on GitHub has the most up-to-date code

https://www.github.com/statsmodels/statsmodels

Source downloads of release tags are available on GitHub

https://github.com/statsmodels/statsmodels/tags

Binaries and source distributions are available from PyPI

https://pypi.org/project/statsmodels/

Binaries can be installed in Anaconda

conda install statsmodels

Getting the latest code
=======================

Installing the most recent nightly wheel

The most recent nightly wheel can be installed using pip.

.. code:: bash

   python -m pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple statsmodels --upgrade --use-deprecated=legacy-resolver

Installing from sources
~~~~~~~~~~~~~~~~~~~~~~~

See INSTALL.txt for requirements or see the documentation

https://statsmodels.github.io/dev/install.html

Contributing
============
Contributions in any form are welcome, including:

* Documentation improvements
* Additional tests
* New features to existing models
* New models

See

https://www.statsmodels.org/stable/dev/test_notes

for instructions on installing statsmodels in *editable* mode.

License
=======

Modified BSD (3-clause)

Discussion and Development
==========================

Discussions take place on the mailing list

https://groups.google.com/group/pystatsmodels

and in the issue tracker. We are very interested in feedback
about usability and suggestions for improvements.

Bug Reports
===========

Bug reports can be submitted to the issue tracker at

https://github.com/statsmodels/statsmodels/issues

.. |Azure CI Build Status| image:: https://dev.azure.com/statsmodels/statsmodels-testing/_apis/build/status/statsmodels.statsmodels?branchName=main
   :target: https://dev.azure.com/statsmodels/statsmodels-testing/_build/latest?definitionId=1&branchName=main
.. |Codecov Coverage| image:: https://codecov.io/gh/statsmodels/statsmodels/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/statsmodels/statsmodels
.. |Coveralls Coverage| image:: https://coveralls.io/repos/github/statsmodels/statsmodels/badge.svg?branch=main
   :target: https://coveralls.io/github/statsmodels/statsmodels?branch=main
.. |PyPI downloads| image:: https://img.shields.io/pypi/dm/statsmodels?label=PyPI%20Downloads
   :alt: PyPI - Downloads
   :target: https://pypi.org/project/statsmodels/
.. |Conda downloads| image:: https://img.shields.io/conda/dn/conda-forge/statsmodels.svg?label=Conda%20downloads
   :target: https://anaconda.org/conda-forge/statsmodels/
.. |PyPI Version| image:: https://img.shields.io/pypi/v/statsmodels.svg
   :target: https://pypi.org/project/statsmodels/
.. |Conda Version| image:: https://anaconda.org/conda-forge/statsmodels/badges/version.svg
   :target: https://anaconda.org/conda-forge/statsmodels/
.. |License| image:: https://img.shields.io/pypi/l/statsmodels.svg
   :target: https://github.com/statsmodels/statsmodels/blob/main/LICENSE.txt