evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Top Related Projects
- Responsible AI Toolbox: a suite of model and data exploration and assessment interfaces and libraries that help developers and stakeholders understand AI systems and develop and monitor AI more responsibly.
- MLflow: an open source platform for the machine learning lifecycle.
- Alibi Detect: algorithms for outlier, adversarial and drift detection.
- SHAP: a game theoretic approach to explain the output of any machine learning model.
- Interpret: fit interpretable models; explain blackbox machine learning.
- AIX360: interpretability and explainability of data and machine learning models.
Quick Overview
Evidently AI is an open-source library that provides a set of tools for monitoring and evaluating machine learning models in production. It helps data scientists and machine learning engineers to automatically generate reports and dashboards to monitor model performance, data drift, and other key metrics.
Pros
- Comprehensive Monitoring: Evidently AI offers a wide range of monitoring capabilities, including model performance, data drift, and data quality checks.
- Automated Reporting: The library can automatically generate detailed reports and dashboards, saving time and effort for data teams.
- Customizable Checks: Users can define custom checks and metrics to suit their specific needs.
- Easy Integration: Evidently AI can be easily integrated into existing machine learning workflows and pipelines.
Cons
- Limited Deployment Options: The library currently only supports deployment as a Python package, which may not be suitable for all use cases.
- Steep Learning Curve: Evidently AI has a relatively complex API and may require some time to get familiar with.
- Dependency on Other Libraries: The library relies on several other Python packages, which can increase the complexity of the setup process.
- Limited Community Support: Compared to some other popular machine learning libraries, Evidently AI has a smaller community and may have fewer resources available.
Code Examples
Here are a few examples of how to use Evidently AI:
- Generating a Model Performance Report:
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# ref_data and prod_data are DataFrames that contain the target and prediction columns
report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("performance_report.html")
This code generates a comprehensive report on the performance of a machine learning model, including metrics like accuracy, precision, recall, and F1-score.
- Detecting Data Drift:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("data_drift_report.html")
This code generates a report on the data drift between the reference and production datasets, which can help identify potential issues with model performance.
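To act on drift results inside a pipeline rather than reading the HTML report, you can export the report as a Python dictionary and gate a downstream step on the outcome. Below is a minimal sketch under the assumption that the drift preset result exposes a dataset_drift flag; the exact key layout may differ across Evidently versions, so treat the lookup as illustrative.

import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Toy reference and "production" data with an intentional shift
rng = np.random.default_rng(0)
ref_data = pd.DataFrame({"feature_1": rng.normal(0.0, 1.0, 500)})
prod_data = pd.DataFrame({"feature_1": rng.normal(0.7, 1.0, 500)})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)

# Export the results as a dictionary instead of HTML
result = report.as_dict()

# Assumption: one of the preset's metrics reports a boolean "dataset_drift" flag
drift_detected = any(
    metric.get("result", {}).get("dataset_drift", False) for metric in result["metrics"]
)
if drift_detected:
    raise RuntimeError("Data drift detected - blocking the model promotion step")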
- Customizing Checks:
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift, TestShareOfMissingValues

# Set custom pass/fail conditions for individual checks
suite = TestSuite(tests=[
    TestColumnDrift(column_name="feature_1", stattest_threshold=0.1),
    TestShareOfMissingValues(lte=0.05),
])
suite.run(reference_data=ref_data, current_data=prod_data)
suite.save_html("custom_checks_report.html")
This code uses a Test Suite to set custom pass/fail conditions: one test flags drift in a specific column against a custom threshold, and another fails if more than 5% of values are missing.
Getting Started
To get started with Evidently AI, follow these steps:
- Install the library using pip:
pip install evidently
- Import the necessary modules and define your reference and production datasets:
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset, DataDriftPreset

# DataFrames that include features plus the target and prediction columns
ref_data = get_reference_data()
prod_data = get_production_data()
- Generate a report and save it to an HTML file:
report = Report(metrics=[ClassificationPreset(), DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("report.html")
- Open the generated HTML report in a web browser to view the results.
That's it! You can now start using Evidently AI to monitor and evaluate your machine learning models.
Competitor Comparisons
Responsible AI Toolbox: a suite of tools providing model and data exploration and assessment interfaces and libraries that help developers and stakeholders better understand AI systems, develop and monitor AI more responsibly, and take better data-driven actions.
Pros of Responsible AI Toolbox
- Comprehensive suite of tools for responsible AI, including interpretability, fairness, and error analysis
- Integrates well with Azure Machine Learning and other Microsoft services
- Offers both GUI and programmatic interfaces for accessibility
Cons of Responsible AI Toolbox
- Primarily focused on tabular data, with limited support for other data types
- Steeper learning curve due to its extensive feature set
- Requires more setup and configuration compared to Evidently
Code Comparison
Responsible AI Toolbox:
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

rai_insights = RAIInsights(model, train_df, test_df, "target", "classification",
                           categorical_features=["category"])
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=current_data,
           column_mapping=ColumnMapping(target='target', prediction='prediction',
                                        numerical_features=['feature1']))
Both tools offer powerful capabilities for analyzing and improving AI models, but they cater to different use cases. Responsible AI Toolbox provides a more comprehensive suite of tools with tighter integration into the Microsoft ecosystem, while Evidently offers a more lightweight and flexible approach to model monitoring and analysis.
MLflow: Open source platform for the machine learning lifecycle
Pros of MLflow
- Comprehensive end-to-end ML lifecycle management, including experiment tracking, model packaging, and deployment
- Integrates well with various ML frameworks and tools, offering a unified platform for diverse ML workflows
- Provides a user-friendly UI for experiment tracking and model comparison
Cons of MLflow
- Steeper learning curve due to its broader scope and feature set
- May be overkill for smaller projects or teams focused primarily on model monitoring
Code Comparison
MLflow:
import mlflow
mlflow.start_run()
mlflow.log_param("param1", value1)
mlflow.log_metric("metric1", value2)
mlflow.end_run()
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=curr_data, column_mapping=column_mapping)
MLflow offers a more general-purpose logging API for tracking experiments, while Evidently focuses specifically on generating data quality and model performance reports. MLflow is better suited for managing the entire ML lifecycle, whereas Evidently excels in detailed model monitoring and data drift detection.
Alibi Detect: Algorithms for outlier, adversarial and drift detection
Pros of Alibi Detect
- Focuses on drift detection and outlier detection in machine learning models
- Provides advanced algorithms for detecting concept drift and data drift
- Supports both batch and online drift detection scenarios
Cons of Alibi Detect
- Steeper learning curve due to more complex algorithms and concepts
- Less emphasis on data quality and model performance monitoring
- Requires more setup and configuration for basic use cases
Code Comparison
Alibi Detect:
from alibi_detect.cd import TabularDrift
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
preds = cd.predict(X)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=X_ref, current_data=X, column_mapping=column_mapping)
Both libraries offer drift detection capabilities, but Alibi Detect provides more advanced algorithms and configurations, while Evidently focuses on simplicity and ease of use for basic monitoring tasks.
SHAP: A game theoretic approach to explain the output of any machine learning model.
Pros of SHAP
- Focuses on model interpretability with advanced techniques like Shapley values
- Provides detailed feature importance and impact analysis
- Supports a wide range of machine learning models and frameworks
Cons of SHAP
- Primarily centered on model explanation, not comprehensive ML monitoring
- Can be computationally intensive for large datasets or complex models
- Steeper learning curve for users new to Shapley values and model interpretation
Code Comparison
SHAP example:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Evidently example:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=reference_data, current_data=production_data, column_mapping=column_mapping)
report.save_html("drift_report.html")
SHAP excels in detailed model interpretation, while Evidently offers a broader suite of ML monitoring tools, including data drift detection and model performance tracking. SHAP is ideal for in-depth feature analysis, whereas Evidently provides a more comprehensive approach to ML model monitoring and reporting.
Interpret: Fit interpretable models. Explain blackbox machine learning.
Pros of Interpret
- Broader scope of interpretability techniques, including global and local explanations
- Supports a wider range of machine learning models, including tree-based models and neural networks
- Offers interactive visualizations for exploring model behavior
Cons of Interpret
- Steeper learning curve due to its more comprehensive feature set
- Less focused on data drift and model monitoring compared to Evidently
- May require more computational resources for complex models and large datasets
Code Comparison
Interpret:
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
report.save_html("data_drift_report.html")
Both libraries offer valuable tools for model interpretation and monitoring, with Interpret providing a more comprehensive set of interpretability techniques, while Evidently focuses on data drift and model performance monitoring.
AIX360: Interpretability and explainability of data and machine learning models
Pros of AIX360
- Comprehensive suite of explainability algorithms for various AI models
- Supports both local and global explanations for model interpretability
- Includes educational resources and tutorials for understanding AI explainability
Cons of AIX360
- Steeper learning curve due to its broader scope and complexity
- Less focused on data drift and model performance monitoring
- Requires more setup and configuration for specific use cases
Code Comparison
AIX360:
from aix360.algorithms.protodash import ProtodashExplainer
explainer = ProtodashExplainer()
explanation = explainer.explain(X_train, X_test, k=5)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=cur_data, column_mapping=column_mapping)
AIX360 focuses on generating explanations for AI models, while Evidently specializes in data and model monitoring, including drift detection. AIX360 offers a wider range of explainability techniques, but Evidently provides more straightforward tools for ongoing model performance evaluation and data quality checks.
README
Evidently
An open-source framework to evaluate, test and monitor ML and LLM-powered systems.
Documentation | Discord Community | Blog | Twitter | Evidently Cloud
:bar_chart: What is Evidently?
Evidently is an open-source Python library to evaluate, test, and monitor ML and LLM systems, from experiments to production.
- Works with tabular and text data.
- Supports evals for predictive and generative tasks, from classification to RAG.
- 100+ built-in metrics, from data drift detection to LLM judges.
- Python interface for custom metrics.
- Both offline evals and live monitoring.
- Open architecture: easily export data and integrate with existing tools.
Evidently is very modular. You can start with one-off evaluations or host a full monitoring service.
1. Reports and Test Suites
Reports compute and summarize various data, ML and LLM quality evals.
- Start with Presets and built-in metrics or customize.
- Best for experiments, exploratory analysis and debugging.
- View interactive Reports in Python, export them as JSON, a Python dictionary, or HTML, or browse them in the monitoring UI.
Turn any Report into a Test Suite by adding pass/fail conditions.
- Best for regression testing, CI/CD checks, or data validation.
- Zero setup option: auto-generate test conditions from the reference dataset.
- Simple syntax to set test conditions such as gt (greater than), lt (less than), etc., as shown in the sketch below.
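For example, here is a minimal sketch of a Report with test conditions. The RowCount metric, the gte helper, and the tests= argument are assumptions based on the current API; check the documentation for the exact names.

import pandas as pd
from evidently import Report
from evidently.presets import DataDriftPreset
from evidently.metrics import RowCount   # assumed metric name
from evidently.tests import gte          # assumed condition helpers: gt, lt, gte, lte, ...

ref = pd.DataFrame({"value": range(100)})
cur = pd.DataFrame({"value": range(20, 120)})

report = Report(
    [
        DataDriftPreset(),           # preset metrics; conditions can be auto-generated from the reference data
        RowCount(tests=[gte(5)]),    # explicit condition: fail if the current data has fewer than 5 rows
    ],
    include_tests=True,
)

my_eval = report.run(cur, ref)       # first argument: current data, second: reference
my_eval.save_html("tests.html")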
2. Monitoring Dashboard
The monitoring UI service helps you visualize metrics and test results over time.
You can choose:
- Self-host the open-source version. Live demo.
- Sign up for Evidently Cloud (Recommended).
Evidently Cloud offers a generous free tier and extra features like dataset and user management, alerting, and no-code evals. Compare OSS vs Cloud.
:woman_technologist: Install Evidently
To install from PyPI:
pip install evidently
To install Evidently using conda, run:
conda install -c conda-forge evidently
:arrow_forward: Getting started
Reports
LLM evals
This is a simple Hello World. Check the Tutorials for more: LLM evaluation.
Import the necessary components:
import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
Create a toy dataset with questions and answers.
eval_df = pd.DataFrame([
["What is the capital of Japan?", "The capital of Japan is Tokyo."],
["Who painted the Mona Lisa?", "Leonardo da Vinci."],
["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
columns=["question", "answer"])
Create an Evidently Dataset object and add descriptors: row-level evaluators. We'll check the sentiment of each response, its length, and whether it contains words indicative of denial.
eval_dataset = Dataset.from_pandas(eval_df,
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=['sorry', 'apologize'], mode="any", alias="Denials")
])
You can view the dataframe with added scores:
eval_dataset.as_dataframe()
To see the distribution of scores, generate a summary Report:
report = Report([
TextEvals()
])
my_eval = report.run(eval_dataset)
my_eval
# my_eval.json()
# my_eval.dict()
You can also choose other evaluators, including LLM-as-a-judge, and configure pass/fail conditions.
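For instance, you could replace the keyword-based denial check with a built-in LLM judge. A minimal sketch, assuming a DeclineLLMEval descriptor and an OpenAI key as the judge backend; both the descriptor name and the setup are assumptions, so see the LLM evaluation tutorial for the exact interface.

import os
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment
from evidently.descriptors import DeclineLLMEval  # assumed name of a built-in LLM judge

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # LLM judges call an external model provider

judged_dataset = Dataset.from_pandas(
    eval_df,  # the toy question/answer DataFrame from above
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        DeclineLLMEval("answer", alias="Denials"),  # LLM-based refusal label instead of keyword matching
    ],
)
judged_dataset.as_dataframe()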
Data and ML evals
This is a simple Hello World. Check the Tutorials for more: Tabular data.
Import the Report, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently import Report
from evidently.presets import DataDriftPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Run the Data Drift evaluation preset to test for shifts in column distributions. Take the first 60 rows of the dataframe as the "current" data and the rest as the reference. Get the output in a Jupyter notebook:
report = Report([
DataDriftPreset(method="psi")
],
include_tests=True)
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval
You can also save an HTML file. You'll need to open it from the destination folder.
my_eval.save_html("file.html")
To get the output as JSON or Python dictionary:
my_eval.json()
# my_eval.dict()
You can choose other Presets, create Reports from individual Metrics, and configure pass/fail conditions.
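As an illustration, a Report built from individual Metrics instead of a Preset might look like the sketch below; ValueDrift and MissingValueCount are assumed metric names, so check the metrics reference for the exact list.

from evidently import Report
from evidently.metrics import ValueDrift, MissingValueCount  # assumed metric names

report = Report([
    ValueDrift(column="sepal length (cm)"),        # drift in a single column
    MissingValueCount(column="sepal width (cm)"),  # missing values in another column
])
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval.dict()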
Monitoring dashboard
This launches a demo project in the locally hosted Evidently UI. Sign up for Evidently Cloud to instantly get a managed version with additional features.
Recommended step: create a virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
After installing Evidently (pip install evidently), run the Evidently UI with the demo projects:
evidently ui --demo-projects all
Visit localhost:8000 to access the UI.
What can you evaluate?
Evidently has 100+ built-in evals. You can also add custom ones.
Here are examples of things you can check:
- Text descriptors: length, sentiment, toxicity, language, special symbols, regular expression matches, etc.
- LLM outputs: semantic similarity, retrieval relevance, summarization quality, etc., with model- and LLM-based evals.
- Data quality: missing values, duplicates, min-max ranges, new categorical values, correlations, etc.
- Data distribution drift: 20+ statistical tests and distance metrics to compare shifts in data distribution.
- Classification: accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc.
- Regression: MAE, ME, RMSE, error distribution, error normality, error bias, etc.
- Ranking (incl. RAG): NDCG, MAP, MRR, Hit Rate, etc.
- Recommendations: serendipity, novelty, diversity, popularity bias, etc.
:computer: Contributions
We welcome contributions! Read the Guide to learn more.
:books: Documentation
For more examples, refer to a complete Documentation.
:white_check_mark: Discord Community
If you want to chat and connect, join our Discord community!