evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Top Related Projects
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Open source platform for the machine learning lifecycle
Algorithms for outlier, adversarial and drift detection
A game theoretic approach to explain the output of any machine learning model.
Fit interpretable models. Explain blackbox machine learning.
Interpretability and explainability of data and machine learning models
Quick Overview
Evidently AI is an open-source library that provides a set of tools for monitoring and evaluating machine learning models in production. It helps data scientists and machine learning engineers to automatically generate reports and dashboards to monitor model performance, data drift, and other key metrics.
Pros
- Comprehensive Monitoring: Evidently AI offers a wide range of monitoring capabilities, including model performance, data drift, and data quality checks.
- Automated Reporting: The library can automatically generate detailed reports and dashboards, saving time and effort for data teams.
- Customizable Checks: Users can define custom checks and metrics to suit their specific needs.
- Easy Integration: Evidently AI can be easily integrated into existing machine learning workflows and pipelines.
Cons
- Limited Deployment Options: The library currently only supports deployment as a Python package, which may not be suitable for all use cases.
- Steep Learning Curve: Evidently AI has a relatively complex API and may require some time to get familiar with.
- Dependency on Other Libraries: The library relies on several other Python packages, which can increase the complexity of the setup process.
- Limited Community Support: Compared to some other popular machine learning libraries, Evidently AI has a smaller community and may have fewer resources available.
Code Examples
Here are a few examples of how to use Evidently AI:
- Generating a Model Performance Report:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Both DataFrames must contain the target and prediction columns
column_mapping = ColumnMapping(target="target", prediction="prediction")
report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=prod_data, column_mapping=column_mapping)
report.save_html("performance_report.html")
This code generates a report on the performance of a classification model, including metrics like accuracy, precision, recall, and F1-score.
- Detecting Data Drift:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("data_drift_report.html")
This code generates a report on the data drift between the reference and production datasets, which can help identify potential issues with model performance.
- Customizing Checks:
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean, TestShareOfMissingValues

custom_checks = TestSuite(tests=[
    # Fail if the mean of "feature1" moves outside the expected range
    TestColumnValueMean(column_name="feature1", gte=0.9, lte=1.1),
    # Fail if 5% or more of the values are missing
    TestShareOfMissingValues(lt=0.05),
])
custom_checks.run(reference_data=ref_data, current_data=prod_data)
custom_checks.save_html("custom_report.html")
This code demonstrates how to define custom checks as test conditions, in this case verifying that the mean of a column stays within an expected range and that the share of missing values stays below a threshold.
Getting Started
To get started with Evidently AI, follow these steps:
- Install the library using pip:
pip install evidently
- Import the necessary modules and define your reference and production datasets:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# DataFrames that include the feature, target, and prediction columns
ref_data = get_reference_data()
prod_data = get_production_data()
column_mapping = ColumnMapping(target="target", prediction="prediction")
- Generate a report and save it to an HTML file:
report = Report(metrics=[DataDriftPreset(), ClassificationPreset()])
report.run(reference_data=ref_data, current_data=prod_data, column_mapping=column_mapping)
report.save_html("report.html")
- Open the generated HTML report in a web browser to view the results.
That's it! You can now start using Evidently AI to monitor and evaluate your machine learning models.
Competitor Comparisons
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
Pros of Responsible AI Toolbox
- Comprehensive suite of tools for responsible AI, including interpretability, fairness, and error analysis
- Integrates well with Azure Machine Learning and other Microsoft services
- Offers both GUI and programmatic interfaces for accessibility
Cons of Responsible AI Toolbox
- Primarily focused on tabular data, with limited support for other data types
- Steeper learning curve due to its extensive feature set
- Requires more setup and configuration compared to Evidently
Code Comparison
Responsible AI Toolbox (using the RAIInsights workflow):
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

insights = RAIInsights(model, train_df, test_df, target_column="target",
                       task_type="classification", categorical_features=["category"])
insights.explainer.add()
insights.compute()
ResponsibleAIDashboard(insights)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

column_mapping = ColumnMapping(target="target", prediction="prediction",
                               numerical_features=["feature1"])
report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=current_data, column_mapping=column_mapping)
Both tools offer powerful capabilities for analyzing and improving AI models, but they cater to different use cases. Responsible AI Toolbox provides a more comprehensive suite of tools with tighter integration into the Microsoft ecosystem, while Evidently offers a more lightweight and flexible approach to model monitoring and analysis.
Open source platform for the machine learning lifecycle
Pros of MLflow
- Comprehensive end-to-end ML lifecycle management, including experiment tracking, model packaging, and deployment
- Integrates well with various ML frameworks and tools, offering a unified platform for diverse ML workflows
- Provides a user-friendly UI for experiment tracking and model comparison
Cons of MLflow
- Steeper learning curve due to its broader scope and feature set
- May be overkill for smaller projects or teams focused primarily on model monitoring
Code Comparison
MLflow:
import mlflow
mlflow.start_run()
mlflow.log_param("param1", 5)
mlflow.log_metric("metric1", 0.85)
mlflow.end_run()
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=curr_data, column_mapping=column_mapping)
MLflow offers a more general-purpose logging API for tracking experiments, while Evidently focuses specifically on generating data quality and model performance reports. MLflow is better suited for managing the entire ML lifecycle, whereas Evidently excels in detailed model monitoring and data drift detection.
Algorithms for outlier, adversarial and drift detection
Pros of Alibi Detect
- Focuses on drift detection and outlier detection in machine learning models
- Provides advanced algorithms for detecting concept drift and data drift
- Supports both batch and online drift detection scenarios
Cons of Alibi Detect
- Steeper learning curve due to more complex algorithms and concepts
- Less emphasis on data quality and model performance monitoring
- Requires more setup and configuration for basic use cases
Code Comparison
Alibi Detect:
from alibi_detect.cd import TabularDrift
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
preds = cd.predict(X)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=X_ref, current_data=X, column_mapping=column_mapping)
Both libraries offer drift detection capabilities, but Alibi Detect provides more advanced algorithms and configurations, while Evidently focuses on simplicity and ease of use for basic monitoring tasks.
A game theoretic approach to explain the output of any machine learning model.
Pros of SHAP
- Focuses on model interpretability with advanced techniques like Shapley values
- Provides detailed feature importance and impact analysis
- Supports a wide range of machine learning models and frameworks
Cons of SHAP
- Primarily centered on model explanation, not comprehensive ML monitoring
- Can be computationally intensive for large datasets or complex models
- Steeper learning curve for users new to Shapley values and model interpretation
Code Comparison
SHAP example:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Evidently example:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=reference_data, current_data=production_data, column_mapping=column_mapping)
report.save_html("drift_report.html")
SHAP excels in detailed model interpretation, while Evidently offers a broader suite of ML monitoring tools, including data drift detection and model performance tracking. SHAP is ideal for in-depth feature analysis, whereas Evidently provides a more comprehensive approach to ML model monitoring and reporting.
Fit interpretable models. Explain blackbox machine learning.
Pros of Interpret
- Broader scope of interpretability techniques, including global and local explanations
- Supports a wider range of machine learning models, including tree-based models and neural networks
- Offers interactive visualizations for exploring model behavior
Cons of Interpret
- Steeper learning curve due to its more comprehensive feature set
- Less focused on data drift and model monitoring compared to Evidently
- May require more computational resources for complex models and large datasets
Code Comparison
Interpret:
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
report.save_html("data_drift_report.html")
Both libraries offer valuable tools for model interpretation and monitoring, with Interpret providing a more comprehensive set of interpretability techniques, while Evidently focuses on data drift and model performance monitoring.
Interpretability and explainability of data and machine learning models
Pros of AIX360
- Comprehensive suite of explainability algorithms for various AI models
- Supports both local and global explanations for model interpretability
- Includes educational resources and tutorials for understanding AI explainability
Cons of AIX360
- Steeper learning curve due to its broader scope and complexity
- Less focused on data drift and model performance monitoring
- Requires more setup and configuration for specific use cases
Code Comparison
AIX360:
from aix360.algorithms.protodash import ProtodashExplainer
explainer = ProtodashExplainer()
explanation = explainer.explain(X_train, X_test, k=5)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=cur_data, column_mapping=column_mapping)
AIX360 focuses on generating explanations for AI models, while Evidently specializes in data and model monitoring, including drift detection. AIX360 offers a wider range of explainability techniques, but Evidently provides more straightforward tools for ongoing model performance evaluation and data quality checks.
README
Evidently
An open-source framework to evaluate, test and monitor ML and LLM-powered systems.
Documentation | Discord Community | Blog | Twitter | Evidently Cloud
:new: New release
Evidently 0.4.25. LLM evaluation -> Tutorial
:bar_chart: What is Evidently?
Evidently is an open-source Python library for ML and LLM evaluation and observability. It helps evaluate, test, and monitor AI-powered systems and data pipelines from experimentation to production.
- Works with tabular data, text data, and embeddings.
- Supports predictive and generative systems, from classification to RAG.
- 100+ built-in metrics, from data drift detection to LLM judges.
- Python interface for custom metrics and tests.
- Both offline evals and live monitoring.
- Open architecture: easily export data and integrate with existing tools.
Evidently is very modular. You can start with one-off evaluations using Reports or Test Suites in Python, or get a real-time monitoring Dashboard service.
1. Reports
Reports compute various data, ML and LLM quality metrics. You can start with Presets or customize.
- Out-of-the-box interactive visuals.
- Best for exploratory analysis and debugging.
- Get results in Python, export as JSON, Python dictionary, HTML, DataFrame, or view in the monitoring UI (see the sketch below).
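As a quick illustration of these export options, here is a minimal sketch, assuming the 0.4.x Report API and two placeholder pandas DataFrames (reference_df and current_df):
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Compute a drift report on two pandas DataFrames (placeholder names)
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)

report.save_html("drift_report.html")  # interactive HTML file
result_dict = report.as_dict()         # Python dictionary
result_json = report.json()            # JSON string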
2. Test Suites
Test Suites check for defined conditions on metric values and return a pass or fail result.
- Best for regression testing, CI/CD checks, or data validation pipelines.
- Zero setup option: auto-generate test conditions from the reference dataset.
- Simple syntax to set custom test conditions such as gt (greater than), lt (less than), etc. (see the sketch after this list).
- Get results in Python, export as JSON, Python dictionary, HTML, DataFrame, or view in the monitoring UI.
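The sketch below shows what such conditions can look like, assuming the 0.4.x testing API; the column name and thresholds are placeholders:
from evidently.test_suite import TestSuite
from evidently.tests import TestShareOfMissingValues, TestColumnValueMin

# Custom pass/fail conditions: lt = less than, gt = greater than
suite = TestSuite(tests=[
    TestShareOfMissingValues(lt=0.05),            # fail if 5% or more of the values are missing
    TestColumnValueMin(column_name="age", gt=0),  # fail unless the minimum of "age" is positive
])
suite.run(reference_data=reference_df, current_data=current_df)
suite  # renders pass/fail results in a notebook; use suite.json() or suite.as_dict() for export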
3. Monitoring Dashboard
The Monitoring UI service helps you visualize metrics and test results over time.
You can choose:
- Self-host the open-source version. Live demo.
- Sign up for Evidently Cloud (Recommended).
Evidently Cloud offers a generous free tier and extra features like user management, alerting, and no-code evals.
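For the self-hosted option, the rough sketch below shows how a computed Report can be logged to a local workspace so it appears in the UI. This assumes the evidently.ui workspace API from the 0.4.x releases; the workspace path, project name, and DataFrames are placeholders:
from evidently.ui.workspace import Workspace
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Create (or reuse) a local workspace directory and a project inside it
ws = Workspace.create("my_workspace")
project = ws.create_project("Demo project")

# Compute a report and add it to the project
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
ws.add_report(project.id, report)

You can then serve the UI from the same directory with evidently ui --workspace my_workspace.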
:woman_technologist: Install Evidently
Evidently is available as a PyPI package. To install it using pip package manager, run:
pip install evidently
To install Evidently using conda installer, run:
conda install -c conda-forge evidently
:arrow_forward: Getting started
Option 1: Test Suites
This is a simple Hello World. Check the Tutorials for more: Tabular data or LLM evaluation.
Import the Test Suite, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Split the DataFrame into reference and current. Run the Data Stability Test Suite that will automatically generate checks on column value ranges, missing values, etc. from the reference. Get the output in Jupyter notebook:
data_stability = TestSuite(tests=[
DataStabilityTestPreset(),
])
data_stability.run(current_data=iris_frame.iloc[:60], reference_data=iris_frame.iloc[60:], column_mapping=None)
data_stability
You can also save an HTML file. You'll need to open it from the destination folder.
data_stability.save_html("file.html")
To get the output as JSON:
data_stability.json()
You can choose other Presets, individual Tests and set conditions.
Option 2: Reports
Import the Report, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Run the Data Drift Report that will compare column distributions between current and reference:
data_drift_report = Report(metrics=[
DataDriftPreset(),
])
data_drift_report.run(current_data=iris_frame.iloc[:60], reference_data=iris_frame.iloc[60:], column_mapping=None)
data_drift_report
Save the report as HTML. You'll need to open it from the destination folder.
data_drift_report.save_html("file.html")
To get the output as JSON:
data_drift_report.json()
You can choose other Presets and individual Metrics, including LLM evaluations for text data.
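For instance, individual metrics can be combined in a single Report. Here is a minimal sketch reusing the iris data from above; the metric classes assume the 0.4.x API:
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, ColumnSummaryMetric, DatasetMissingValuesMetric

report = Report(metrics=[
    ColumnDriftMetric(column_name="sepal length (cm)"),    # drift check for a single column
    ColumnSummaryMetric(column_name="sepal length (cm)"),  # descriptive statistics for that column
    DatasetMissingValuesMetric(),                          # missing values across the whole dataset
])
report.run(current_data=iris_frame.iloc[:60], reference_data=iris_frame.iloc[60:], column_mapping=None)
report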
Option 3: ML monitoring dashboard
This launches a demo project in the Evidently UI. Check tutorials for Self-hosting or Evidently Cloud.
Recommended step: create a virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
After installing Evidently (pip install evidently), run the Evidently UI with the demo projects:
evidently ui --demo-projects all
Access the Evidently UI service in your browser: go to localhost:8000.
What can you evaluate?
Evidently has 100+ built-in evals. You can also add custom ones. Each metric has an optional visualization: you can use it in Reports, Test Suites, or plot it on a Dashboard.
Here are examples of things you can check:
- Text descriptors: length, sentiment, toxicity, language, special symbols, regular expression matches, etc.
- LLM outputs: semantic similarity, retrieval relevance, summarization quality, etc. with model- and LLM-based evals.
- Data quality: missing values, duplicates, min-max ranges, new categorical values, correlations, etc.
- Data distribution drift: 20+ statistical tests and distance metrics to compare shifts in data distribution.
- Classification: accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc.
- Regression: MAE, ME, RMSE, error distribution, error normality, error bias, etc.
- Ranking (incl. RAG): NDCG, MAP, MRR, Hit Rate, etc.
- Recommendations: serendipity, novelty, diversity, popularity bias, etc.
:computer: Contributions
We welcome contributions! Read the Guide to learn more.
:books: Documentation
For more information, refer to a complete Documentation. You can start with the tutorials:
- Get Started with Tabular and ML Evaluation
- Get Started with LLM Evaluation
- Self-hosting ML monitoring Dashboard
- Cloud ML monitoring Dashboard
See more examples in the Docs.
How-to guides
Explore the How-to guides to understand specific features in Evidently.
:white_check_mark: Discord Community
If you want to chat and connect, join our Discord community!