Top Related Projects
scikit-learn: machine learning in Python
Python Library for learning (Structure and Parameter), inference (Probabilistic and Causal), and simulations in Bayesian Networks.
Bayesian Modeling and Probabilistic Programming in Python
Hidden Markov Models in Python, with scikit-learn like API
Statsmodels: statistical modeling and econometrics in Python
Probabilistic reasoning and statistical analysis in TensorFlow
Quick Overview
Pomegranate is a Python library for probabilistic modeling that implements models such as Bayesian networks, hidden Markov models, and mixture models. It offers a user-friendly API for building, training, and using these models. Releases up to v0.14.8 relied on Cython for computationally intensive operations; v1.0.0 and later use PyTorch as the computational backend.
Pros
- Fast performance (Cython in releases up to v0.14.8, PyTorch from v1.0.0 onward)
- Comprehensive set of probabilistic models and algorithms
- User-friendly API with scikit-learn-like interface
- Supports both supervised and unsupervised learning
Cons
- Limited documentation and examples for advanced use cases
- Smaller community compared to more mainstream machine learning libraries
- Pre-1.0 (Cython) releases may require compilation on some systems, which can be challenging for beginners
- Some models may have limited customization options
Code Examples
- Creating and training a Gaussian Mixture Model:
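from pomegranate import GeneralMixtureModel, MultivariateGaussianDistribution
# Pre-1.0 API (pomegranate <= 0.14.8); X and X_new are 2D arrays of samples
# Create and fit a Gaussian mixture model with 3 components
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=X)
# Predict the cluster for new data
predictions = model.predict(X_new)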
- Building a Bayesian Network:
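from pomegranate import BayesianNetwork, ConditionalProbabilityTable, DiscreteDistribution, Node
# Pre-1.0 API (pomegranate <= 0.14.8)
# Distributions for the root variables
a = DiscreteDistribution({'True': 0.7, 'False': 0.3})
b = DiscreteDistribution({'True': 0.6, 'False': 0.4})
# Conditional probability tables; the table rows are elided here
c = ConditionalProbabilityTable([...], [a, b])
d = ConditionalProbabilityTable([...], [c])
# Wrap each distribution in a named node
s_a, s_b = Node(a, name='A'), Node(b, name='B')
s_c, s_d = Node(c, name='C'), Node(d, name='D')
# Define the network structure: A -> C, B -> C, C -> D
network = BayesianNetwork()
network.add_states(s_a, s_b, s_c, s_d)
network.add_edge(s_a, s_c)
network.add_edge(s_b, s_c)
network.add_edge(s_c, s_d)
# Bake the network to finalize its structure
network.bake()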
- Using a Hidden Markov Model:
from pomegranate import HiddenMarkovModel, NormalDistribution, State
# Pre-1.0 API (pomegranate <= 0.14.8)
# Create a Hidden Markov Model
model = HiddenMarkovModel()
# Create states with Gaussian emissions and add them to the model
s1 = State(NormalDistribution(0, 1), name='s1')
s2 = State(NormalDistribution(5, 1), name='s2')
model.add_states(s1, s2)
# Add transitions
model.add_transition(model.start, s1, 0.9)
model.add_transition(model.start, s2, 0.1)
model.add_transition(s1, s1, 0.8)
model.add_transition(s1, s2, 0.2)
model.add_transition(s2, s2, 0.9)
model.add_transition(s2, model.end, 0.1)
# Bake the model
model.bake()
Getting Started
To get started with Pomegranate, first install it using pip:
pip install pomegranate
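Note that this installs v1.0.0 or later, whose API differs from the pre-1.0 examples above. To run those examples as written, install the last Cython release with pip install pomegranate==0.14.8. Then import the models you need:
from pomegranate import BayesianNetwork, GeneralMixtureModel, HiddenMarkovModel
# Use the models as shown in the code examples above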
For more detailed information and advanced usage, refer to the official documentation and examples in the GitHub repository.
Competitor Comparisons
scikit-learn: machine learning in Python
Pros of scikit-learn
- Extensive collection of machine learning algorithms and tools
- Well-established, widely used, and extensively documented
- Large community support and frequent updates
Cons of scikit-learn
- Limited support for probabilistic models and Bayesian inference
- Less flexibility for custom probability distributions
- Steeper learning curve for beginners
Code Comparison
scikit-learn:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
pomegranate:
from pomegranate import NaiveBayes, NormalDistribution
# Pre-1.0 API; from_samples builds and fits the classifier in one call
model = NaiveBayes.from_samples(NormalDistribution, X_train, y_train)
predictions = model.predict(X_test)
Summary
scikit-learn is a comprehensive machine learning library with a wide range of algorithms and tools. It's well-established and has extensive documentation and community support. However, it may have limitations in probabilistic modeling and custom distributions compared to pomegranate. pomegranate offers more flexibility in these areas but has a smaller user base and fewer general-purpose machine learning algorithms. The choice between the two depends on the specific requirements of your project, especially regarding probabilistic modeling needs.
Python Library for learning (Structure and Parameter), inference (Probabilistic and Causal), and simulations in Bayesian Networks.
Pros of pgmpy
- More comprehensive library for probabilistic graphical models, including Bayesian networks, Markov networks, and more
- Extensive documentation and tutorials, making it easier for beginners to get started
- Active community and regular updates
Cons of pgmpy
- Generally slower performance compared to pomegranate, especially for large datasets
- Less focus on machine learning integration and high-performance computing
Code Comparison
pgmpy example:
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
model = BayesianNetwork([('A', 'B'), ('B', 'C')])
cpd_a = TabularCPD('A', 2, [[0.6], [0.4]])
cpd_b = TabularCPD('B', 2, [[0.7, 0.3], [0.3, 0.7]], evidence=['A'], evidence_card=[2])
model.add_cpds(cpd_a, cpd_b)
pomegranate example:
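from pomegranate import BayesianNetwork, ConditionalProbabilityTable, DiscreteDistribution, Node
A = DiscreteDistribution({'0': 0.6, '1': 0.4})
B = ConditionalProbabilityTable([['0', '0', 0.7], ['0', '1', 0.3], ['1', '0', 0.3], ['1', '1', 0.7]], [A])
# Pre-1.0 API: distributions must be wrapped in nodes before being added
s_a, s_b = Node(A, name='A'), Node(B, name='B')
model = BayesianNetwork()
model.add_states(s_a, s_b)
model.add_edge(s_a, s_b)
model.bake()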
Both libraries offer similar functionality for creating Bayesian networks, but pgmpy's syntax is more verbose and explicit, while pomegranate's is more concise and object-oriented.
Bayesian Modeling and Probabilistic Programming in Python
Pros of PyMC
- More comprehensive probabilistic programming framework
- Larger community and more extensive documentation
- Better integration with other scientific Python libraries
Cons of PyMC
- Steeper learning curve for beginners
- Can be slower for certain types of models compared to Pomegranate
Code Comparison
PyMC example:
import pymc as pm
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    obs = pm.Normal('obs', mu=mu, sigma=1, observed=data)
    trace = pm.sample(1000)
Pomegranate example:
from pomegranate import NormalDistribution, GeneralMixtureModel
model = GeneralMixtureModel([NormalDistribution(0, 1), NormalDistribution(5, 1)])
model.fit(data)
PyMC offers a more flexible and expressive syntax for defining complex probabilistic models, while Pomegranate provides a simpler API for specific types of models like mixture models. PyMC is better suited for advanced Bayesian inference tasks, whereas Pomegranate excels in certain machine learning applications and discrete state space models.
Hidden Markov Models in Python, with scikit-learn like API
Pros of hmmlearn
- Focused specifically on Hidden Markov Models (HMMs)
- Lightweight and easy to install
- Follows scikit-learn API conventions
Cons of hmmlearn
- Limited to HMMs only, less versatile than Pomegranate
- Smaller community and less frequent updates
- Fewer advanced features and model types
Code Comparison
hmmlearn:
from hmmlearn import hmm
model = hmm.GaussianHMM(n_components=3, covariance_type="full")
model.fit(X)
hidden_states = model.predict(X)
Pomegranate:
from pomegranate import HiddenMarkovModel, NormalDistribution
model = HiddenMarkovModel.from_samples(NormalDistribution, n_components=3, X=X)
hidden_states = model.predict(X)
Summary
hmmlearn is a specialized library for Hidden Markov Models that follows scikit-learn conventions, making it easy to use for those familiar with the ecosystem. It's lightweight and focused, but limited in scope compared to Pomegranate.
Pomegranate offers a broader range of probabilistic models and more advanced features, making it more versatile for complex tasks. However, this comes at the cost of a slightly steeper learning curve and potentially more complex installation process.
The choice between the two depends on the specific requirements of your project and whether you need the additional capabilities offered by Pomegranate or prefer the simplicity and scikit-learn compatibility of hmmlearn.
Statsmodels: statistical modeling and econometrics in Python
Pros of statsmodels
- Comprehensive statistical modeling library with a wide range of econometric tools
- Well-established and widely used in academic and professional settings
- Extensive documentation and community support
Cons of statsmodels
- Steeper learning curve due to its extensive functionality
- Can be slower for certain operations compared to more specialized libraries
Code Comparison
statsmodels:
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
pomegranate:
from pomegranate import GeneralMixtureModel, NormalDistribution
model = GeneralMixtureModel([NormalDistribution(1, 1), NormalDistribution(5, 2)])
model.fit(X)
print(model.predict(X))
Key Differences
- statsmodels focuses on traditional statistical modeling and econometrics
- pomegranate specializes in probabilistic modeling and machine learning algorithms
- statsmodels offers a broader range of statistical tools, while pomegranate excels in specific areas like mixture models and hidden Markov models
Use Cases
- Choose statsmodels for comprehensive statistical analysis and econometric modeling
- Opt for pomegranate when working with probabilistic models, especially in machine learning contexts
Probabilistic reasoning and statistical analysis in TensorFlow
Pros of TensorFlow Probability
- Extensive integration with TensorFlow ecosystem
- Supports distributed and GPU-accelerated computations
- Wider range of probabilistic models and inference algorithms
Cons of TensorFlow Probability
- Steeper learning curve, especially for those new to TensorFlow
- Can be more complex to set up and use for simpler probabilistic tasks
- Larger overhead for small-scale projects
Code Comparison
Pomegranate:
from pomegranate import HiddenMarkovModel, NormalDistribution
# Pre-1.0 API; from_samples both builds and fits the model
model = HiddenMarkovModel.from_samples(NormalDistribution, n_components=5, X=data)
TensorFlow Probability:
import tensorflow_probability as tfp
tfd = tfp.distributions
model = tfd.HiddenMarkovModel(
    initial_distribution=tfd.Categorical(probs=[0.2] * 5),
    transition_distribution=tfd.Categorical(probs=[[0.2] * 5] * 5),
    observation_distribution=tfd.Normal(loc=0., scale=1.),
    num_steps=100)
Both libraries offer probabilistic modeling capabilities, but TensorFlow Probability provides more advanced features and integration with the TensorFlow ecosystem at the cost of increased complexity. Pomegranate offers a simpler API for basic probabilistic modeling tasks, making it more accessible for beginners or smaller projects.
README
IMPORTANT: pomegranate v1.0.0 is a ground-up rewrite of pomegranate using PyTorch as the computational backend instead of Cython. Although the same functionality is supported, the API is significantly different. Please see the tutorials and examples folders for help rewriting your code.
ReadTheDocs | Tutorials | Examples
pomegranate is a library for probabilistic modeling defined by its modular implementation and treatment of all models as the probability distributions they are. The modular implementation allows one to easily drop normal distributions into a mixture model to create a Gaussian mixture model just as easily as dropping a gamma and a Poisson distribution into a mixture model to create a heterogeneous mixture. But that's not all! Because each model is treated as a probability distribution, Bayesian networks can be dropped into a mixture just as easily as a normal distribution, and hidden Markov models can be dropped into Bayes classifiers to make a classifier over sequences. Together, these two design choices enable a flexibility not seen in any other probabilistic modeling package.
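To make the "every model is a probability distribution" idea concrete, here is a minimal pure-Python sketch. It is not pomegranate's implementation: the classes `Normal1D`, `Exponential1D`, and `Mixture` are hypothetical stand-ins that share only one assumed method, `log_probability`. Because the mixture exposes the same method as its components, a mixture can itself be dropped into another mixture.

```python
import math


class Normal1D:
    """Univariate normal; a hypothetical stand-in, not pomegranate's Normal."""

    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def log_probability(self, x):
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std * math.sqrt(2.0 * math.pi))


class Exponential1D:
    """Univariate exponential parameterized by its scale (mean)."""

    def __init__(self, scale):
        self.scale = scale

    def log_probability(self, x):
        if x < 0:
            return -math.inf
        return -x / self.scale - math.log(self.scale)


class Mixture:
    """Weighted mixture over any objects exposing log_probability(x)."""

    def __init__(self, components, weights):
        self.components, self.weights = components, weights

    def log_probability(self, x):
        # log sum_k w_k * p_k(x), computed stably with the log-sum-exp trick
        terms = [math.log(w) + c.log_probability(x)
                 for c, w in zip(self.components, self.weights)]
        top = max(terms)
        if top == -math.inf:
            return -math.inf
        return top + math.log(sum(math.exp(t - top) for t in terms))


# A heterogeneous mixture, then a mixture that contains another mixture.
inner = Mixture([Normal1D(0.0, 1.0), Exponential1D(2.0)], [0.5, 0.5])
outer = Mixture([inner, Normal1D(5.0, 1.0)], [0.7, 0.3])
print(outer.log_probability(1.0))
```

The only contract the mixture relies on is the shared method, which is the design choice that lets heterogeneous and nested models compose.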
Recently, pomegranate (v1.0.0) was rewritten from the ground up using PyTorch to replace the outdated Cython backend. This rewrite gave me an opportunity to fix many bad design choices that I made as a bb software engineer. Unfortunately, many of these changes are not backwards compatible and will disrupt workflows. On the flip side, these changes have significantly sped up most methods, improved and simplified the code, fixed many issues raised by the community over the years, and made it significantly easier to contribute. I've written more below, but you're likely here now because your code is broken and this is the tl;dr.
Special shout-out to NumFOCUS for supporting this work with a special development grant.
Installation
pip install pomegranate
If you need the last Cython release before the rewrite, use pip install pomegranate==0.14.8. You may need to manually install a version of Cython before v3.
Why a Rewrite?
This rewrite was motivated by four main reasons:
- Speed: Native PyTorch is usually significantly faster than the hand-tuned Cython code that I wrote.
- Features: PyTorch has many features, such as serialization, mixed precision, and GPU support, that can now be directly used in pomegranate without additional work on my end.
- Community Contribution: A challenge that many people faced when using pomegranate was that they could not modify or extend it because they did not know Cython. Even if they did know Cython, coding in it is a pain that I felt each time I tried adding a new feature or fixing a bug or releasing a new version. Using PyTorch as the backend significantly reduces the amount of effort needed to add in new features.
- Interoperability: Libraries like PyTorch offer an invaluable opportunity to not just utilize their computational backends but to better integrate into existing resources and communities. This rewrite will make it easier for people to integrate probabilistic models with neural networks as losses, constraints, and structural regularizations, as well as with other projects built on PyTorch.
High-level Changes
- General
  - The entire codebase has been rewritten in PyTorch and all models are instances of torch.nn.Module
  - This codebase is checked by a comprehensive suite of >800 unit tests calling assert statements several thousand times, much more than previous versions.
  - Installation issues are now likely to come from PyTorch, for which there are countless resources to help out.
- Features
  - All models now have GPU support
  - All models now have support for half/mixed precision
  - Serialization is now handled by PyTorch, yielding more compact and efficient I/O
  - Missing values are now supported through torch.masked.MaskedTensor objects
  - Prior probabilities can now be passed to all relevant models and methods and enable more comprehensive/flexible semi-supervised learning than before
- Models
  - All distributions are now multivariate by default and treat each feature independently (except Normal)
  - "Distribution" has been removed from names so that, for example, NormalDistribution is now Normal
  - FactorGraph models are now supported as first-class citizens, with all the prediction and training methods
  - Hidden Markov models have been split into DenseHMM and SparseHMM models which differ in how the transition matrix is encoded, with DenseHMM objects being significantly faster on truly dense graphs
- Differences
  - NaiveBayes has been permanently removed as it is redundant with BayesClassifier
  - MarkovNetwork has not yet been implemented
  - Constraint graphs and constrained structure learning for Bayesian networks have not yet been implemented
  - Silent states for hidden Markov models have not yet been implemented
  - Viterbi for hidden Markov models has not yet been implemented
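The dense/sparse split mentioned in the list above comes down to how transitions are stored. The sketch below is an illustration only, not the code inside DenseHMM or SparseHMM: the function names are hypothetical, and emission probabilities are left out so that only the transition step of the forward algorithm is shown. A dense encoding touches every entry of an n x n matrix, while a sparse encoding visits only the transitions that exist.

```python
def forward_step_dense(alpha, transitions):
    """One forward-algorithm step with a full n x n transition matrix."""
    n = len(alpha)
    return [sum(alpha[i] * transitions[i][j] for i in range(n)) for j in range(n)]


def forward_step_sparse(alpha, edges):
    """The same step when only nonzero transitions are stored as (i, j, p)."""
    out = [0.0] * len(alpha)
    for i, j, p in edges:
        out[j] += alpha[i] * p
    return out


# A 3-state chain in which most transitions are impossible.
dense = [
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]
# Keep only the nonzero entries for the sparse encoding.
sparse = [(i, j, p) for i, row in enumerate(dense) for j, p in enumerate(row) if p > 0.0]

alpha = [1.0, 0.0, 0.0]  # start in state 0 with probability 1
for _ in range(3):
    alpha_dense = forward_step_dense(alpha, dense)
    alpha_sparse = forward_step_sparse(alpha, sparse)
    alpha = alpha_dense

print(alpha_dense, alpha_sparse)
```

Both encodings give the same result. The dense form does work proportional to n squared per step and vectorizes well, while the sparse form does work proportional to the number of edges, which is the trade-off behind choosing one model class over the other.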
Speed
Most models and methods in pomegranate v1.0.0 are faster than their counterparts in earlier versions. This generally scales by complexity, where one sees only small speedups for simple distributions on small data sets but much larger speedups for more complex models on big data sets, e.g. hidden Markov model training or Bayesian network inference. The notable exception for now is that Bayesian network structure learning, other than Chow-Liu tree building, is still incomplete and not much faster. In the examples below, torchegranate refers to the temporary repository used to develop pomegranate v1.0.0 and pomegranate refers to pomegranate v0.14.8.
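K-Means
[Benchmark figure not reproduced here.] Who knows what's happening here? Wild.
Hidden Markov Models
[Benchmark figures not reproduced here: dense transition matrix (CPU); sparse transition matrix (CPU); training a 125 node model with a dense transition matrix.]
Bayesian Networks
[Benchmark figures not reproduced here.]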
Features
Note: Please see the tutorials folder for code examples.
Switching from a Cython backend to a PyTorch backend has enabled or expanded a large number of features. Because the rewrite is a thin wrapper over PyTorch, as new features get released for PyTorch they can be applied to pomegranate models without the need for a new release from me.
GPU Support
All distributions and methods in pomegranate now have GPU support. Because each distribution is a torch.nn.Module object, the use is identical to other code written in PyTorch. This means that both the model and the data have to be moved to the GPU by the user. For instance:
>>> import torch
>>> from pomegranate.distributions import Exponential
>>> X = torch.exp(torch.randn(50, 4))
# Will execute on the CPU
>>> d = Exponential().fit(X)
>>> d.scales
Parameter containing:
tensor([1.8627, 1.3132, 1.7187, 1.4957])
# Will execute on a GPU
>>> d = Exponential().cuda().fit(X.cuda())
>>> d.scales
Parameter containing:
tensor([1.8627, 1.3132, 1.7187, 1.4957], device='cuda:0')
Likewise, all models are distributions, and so can be used on the GPU similarly. When a model is moved to the GPU, all of the models associated with it (e.g. distributions) are also moved to the GPU.
>>> from pomegranate.gmm import GeneralMixtureModel
>>> X = torch.exp(torch.randn(50, 4)).cuda()
>>> model = GeneralMixtureModel([Exponential(), Exponential()], verbose=True).cuda()
>>> model.fit(X)
[1] Improvement: 1.26068115234375, Time: 0.001134s
[2] Improvement: 0.168121337890625, Time: 0.001097s
[3] Improvement: 0.037841796875, Time: 0.001095s
>>> model.distributions[0].scales
Parameter containing:
tensor([0.9141, 1.0835, 2.7503, 2.2475], device='cuda:0')
>>> model.distributions[1].scales
Parameter containing:
tensor([1.9902, 2.3871, 0.8984, 1.2215], device='cuda:0')
Mixed Precision
pomegranate models can, in theory, operate in the same mixed or low-precision regimes as other PyTorch modules. However, because pomegranate uses more complex operations than most neural networks, this sometimes does not work or help in practice because these operations have not been optimized or implemented in the low-precision regime. So, hopefully this feature will become more useful over time.
>>> X = torch.randn(100, 4)
>>> d = Normal(covariance_type='diag')
>>>
>>> with torch.autocast('cuda', dtype=torch.bfloat16):
...     d.fit(X)
Serialization
pomegranate distributions are all instances of torch.nn.Module and so serialization is the same as any other PyTorch model.
Saving:
>>> X = torch.exp(torch.randn(50, 4)).cuda()
>>> model = GeneralMixtureModel([Exponential(), Exponential()], verbose=True)
>>> model.cuda()
>>> model.fit(X)
>>> torch.save(model, "test.torch")
Loading:
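# On PyTorch >= 2.6, torch.load defaults to weights_only=True, so loading a
# full pickled model requires torch.load("test.torch", weights_only=False)
>>> model = torch.load("test.torch")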
torch.compile
Note: torch.compile is under active development by the PyTorch team and may rapidly improve. For now, you may need to pass in check_data=False when initializing models to avoid one compatibility issue.
In PyTorch v2.0.0, torch.compile was introduced as a flexible wrapper around tools that would fuse operations together, use CUDA graphs, and generally try to remove I/O bottlenecks in GPU execution. Because these bottlenecks can be extremely significant in the small-to-medium sized data settings many pomegranate users are faced with, torch.compile seems like it will be extremely valuable. Rather than targeting entire models, which mostly just compiles the forward method, you should compile individual methods from your objects.
# Create your object as normal
>>> mu = torch.exp(torch.randn(100))
>>> d = Exponential(mu).cuda()
# Create some data on the same device as the model
>>> X = torch.exp(torch.randn(1000, 100)).cuda()
>>> d.log_probability(X)
# Compile the `log_probability` method!
>>> d.log_probability = torch.compile(d.log_probability, mode='reduce-overhead', fullgraph=True)
>>> d.log_probability(X)
Unfortunately, I have had difficulty getting torch.compile to work when methods are called in a nested manner, e.g., when compiling the predict method for a mixture model which, inside it, calls the log_probability method of each distribution. I have tried to organize the code in a manner that avoids some of these errors, but because the error messages right now are opaque I have had some difficulty.
Missing Values
pomegranate supports handling data with missing values through torch.masked.MaskedTensor objects. One simply needs to put a mask over the values that are missing.
>>> X = <your tensor with NaN for the missing values>
>>> mask = ~torch.isnan(X)
>>> X_masked = torch.masked.MaskedTensor(X, mask=mask)
>>> d = Normal(covariance_type='diag').fit(X_masked)
>>> d.means
Parameter containing:
tensor([0.2271, 0.0290, 0.0763, 0.0135])
All algorithms currently treat missingness as something to ignore. As an example, when calculating the mean of a column with missing values, the mean will simply be the average value of the present values. Missing values are not imputed because improper imputation can bias your data, produce unlikely estimates which distort distributions, and also shrink the variance.
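As a small illustration of "ignore rather than impute", the following pure-Python sketch computes a column's mean and variance over only the observed entries, then contrasts that with mean imputation. It does not use MaskedTensor and is not pomegranate's code; the helper name `masked_mean_and_variance` and the data are made up for the example.

```python
import math


def masked_mean_and_variance(column):
    """Mean and (population) variance over the non-NaN entries of a column."""
    present = [x for x in column if not math.isnan(x)]
    if not present:
        raise ValueError("column has no observed values")
    mean = sum(present) / len(present)
    variance = sum((x - mean) ** 2 for x in present) / len(present)
    return mean, variance


column = [1.0, float("nan"), 3.0, 5.0, float("nan")]
mean, variance = masked_mean_and_variance(column)

# Mean imputation, for contrast: the mean is unchanged but the variance shrinks.
imputed = [mean if math.isnan(x) else x for x in column]
_, imputed_variance = masked_mean_and_variance(imputed)
print(mean, variance, imputed_variance)
```

With three observed values the variance is 8/3, whereas filling the two gaps with the mean spreads the same squared deviations over five entries and gives 8/5, which is the variance shrinkage the paragraph above warns about.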
Because not all operations are yet available for MaskedTensors, the following distributions do not yet support missing values: Bernoulli, categorical, normal with full covariance, and uniform.
Prior Probabilities and Semi-supervised Learning
A new feature in pomegranate v1.0.0 is being able to pass in prior probabilities for each observation for mixture models, Bayes classifiers, and hidden Markov models. These are the prior probabilities that an observation belongs to each component of the model before evaluating the likelihood, and they should range between 0 and 1. When these values include a 1.0 for an observation, it is treated as a label, because the likelihood no longer matters in terms of assigning that observation to a state. Hence, one can use these prior probabilities to do labeled training when each observation has a 1.0 for some state, semi-supervised learning when only a subset of observations are labeled (including when sequences are only partially labeled for hidden Markov models), or more sophisticated forms of weighting when the values are between 0 and 1.
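The way a prior of 1.0 behaves like a label can be seen with a few lines of arithmetic. This is a sketch under the assumption that per-observation priors are multiplied with component likelihoods and then normalized; the `posterior` function and the likelihood values are hypothetical and are not taken from pomegranate.

```python
def posterior(priors, likelihoods):
    """Combine per-component priors with likelihoods and normalize."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("priors and likelihoods share no support")
    return [j / total for j in joint]


# Hypothetical P(x | component) for one observation under two components.
likelihoods = [0.02, 0.10]

unlabeled = posterior([0.5, 0.5], likelihoods)  # uniform prior: likelihood decides
labeled = posterior([1.0, 0.0], likelihoods)    # prior of 1.0 acts as a hard label
soft = posterior([0.8, 0.2], likelihoods)       # in-between prior: soft weighting

print(unlabeled, labeled, soft)
```

With the uniform prior the second component wins on likelihood alone, with the one-hot prior the first component receives all of the responsibility regardless of likelihood, and the 0.8/0.2 prior lands in between.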