evidently
Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
Top Related Projects
- Responsible AI Toolbox: a suite of model and data exploration and assessment interfaces and libraries that help developers and stakeholders understand AI systems and develop and monitor AI more responsibly.
- MLflow: an open source platform for the machine learning lifecycle.
- Alibi Detect: algorithms for outlier, adversarial and drift detection.
- SHAP: a game theoretic approach to explain the output of any machine learning model.
- Interpret: fit interpretable models; explain blackbox machine learning.
- AIX360: interpretability and explainability of data and machine learning models.
Quick Overview
Evidently AI is an open-source library that provides a set of tools for monitoring and evaluating machine learning models in production. It helps data scientists and machine learning engineers to automatically generate reports and dashboards to monitor model performance, data drift, and other key metrics.
Pros
- Comprehensive Monitoring: Evidently AI offers a wide range of monitoring capabilities, including model performance, data drift, and data quality checks.
- Automated Reporting: The library can automatically generate detailed reports and dashboards, saving time and effort for data teams.
- Customizable Checks: Users can define custom checks and metrics to suit their specific needs.
- Easy Integration: Evidently AI can be easily integrated into existing machine learning workflows and pipelines.
Cons
- Limited Deployment Options: The library currently only supports deployment as a Python package, which may not be suitable for all use cases.
- Steep Learning Curve: Evidently AI has a relatively complex API and may require some time to get familiar with.
- Dependency on Other Libraries: The library relies on several other Python packages, which can increase the complexity of the setup process.
- Limited Community Support: Compared to some other popular machine learning libraries, Evidently AI has a smaller community and may have fewer resources available.
Code Examples
Here are a few examples of how to use Evidently AI:
- Generating a Model Performance Report:
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# ref_data and prod_data are DataFrames that contain the target and prediction columns
report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("performance_report.html")
This code generates a comprehensive report on the performance of a machine learning model, including metrics like accuracy, precision, recall, and F1-score.
- Detecting Data Drift:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("data_drift_report.html")
This code generates a report on the data drift between the reference and production datasets, which can help identify potential issues with model performance.
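To act on drift results inside a pipeline rather than reading the HTML report, you can export the report as a Python dictionary and gate a downstream step on the outcome. Below is a minimal sketch under the assumption that the drift preset result exposes a dataset_drift flag; the exact key layout may differ across Evidently versions, so treat the lookup as illustrative.

import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Toy reference and "production" data with an intentional shift
rng = np.random.default_rng(0)
ref_data = pd.DataFrame({"feature_1": rng.normal(0.0, 1.0, 500)})
prod_data = pd.DataFrame({"feature_1": rng.normal(0.7, 1.0, 500)})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)

# Export the results as a dictionary instead of HTML
result = report.as_dict()

# Assumption: one of the preset's metrics reports a boolean "dataset_drift" flag
drift_detected = any(
    metric.get("result", {}).get("dataset_drift", False) for metric in result["metrics"]
)
if drift_detected:
    raise RuntimeError("Data drift detected - blocking the model promotion step")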
- Customizing Checks:
from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift, TestShareOfMissingValues

# Set custom pass/fail conditions for individual checks
suite = TestSuite(tests=[
    TestColumnDrift(column_name="feature_1", stattest_threshold=0.1),
    TestShareOfMissingValues(lte=0.05),
])
suite.run(reference_data=ref_data, current_data=prod_data)
suite.save_html("custom_checks_report.html")
This code uses a Test Suite to set custom pass/fail conditions: one test flags drift in a specific column against a custom threshold, and another fails if more than 5% of values are missing.
Getting Started
To get started with Evidently AI, follow these steps:
- Install the library using pip:
pip install evidently
- Import the necessary modules and define your reference and production datasets:
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset, DataDriftPreset

# DataFrames that include features plus the target and prediction columns
ref_data = get_reference_data()
prod_data = get_production_data()
- Generate a report and save it to an HTML file:
report = Report(metrics=[ClassificationPreset(), DataDriftPreset()])
report.run(reference_data=ref_data, current_data=prod_data)
report.save_html("report.html")
- Open the generated HTML report in a web browser to view the results.
That's it! You can now start using Evidently AI to monitor and evaluate your machine learning models.
Competitor Comparisons
Responsible AI Toolbox: a suite of tools providing model and data exploration and assessment interfaces and libraries that help developers and stakeholders better understand AI systems, develop and monitor AI more responsibly, and take better data-driven actions.
Pros of Responsible AI Toolbox
- Comprehensive suite of tools for responsible AI, including interpretability, fairness, and error analysis
- Integrates well with Azure Machine Learning and other Microsoft services
- Offers both GUI and programmatic interfaces for accessibility
Cons of Responsible AI Toolbox
- Primarily focused on tabular data, with limited support for other data types
- Steeper learning curve due to its extensive feature set
- Requires more setup and configuration compared to Evidently
Code Comparison
Responsible AI Toolbox:
from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

rai_insights = RAIInsights(model, train_df, test_df, "target", "classification",
                           categorical_features=["category"])
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref_data, current_data=current_data,
           column_mapping=ColumnMapping(target='target', prediction='prediction',
                                        numerical_features=['feature1']))
Both tools offer powerful capabilities for analyzing and improving AI models, but they cater to different use cases. Responsible AI Toolbox provides a more comprehensive suite of tools with tighter integration into the Microsoft ecosystem, while Evidently offers a more lightweight and flexible approach to model monitoring and analysis.
MLflow: Open source platform for the machine learning lifecycle
Pros of MLflow
- Comprehensive end-to-end ML lifecycle management, including experiment tracking, model packaging, and deployment
- Integrates well with various ML frameworks and tools, offering a unified platform for diverse ML workflows
- Provides a user-friendly UI for experiment tracking and model comparison
Cons of MLflow
- Steeper learning curve due to its broader scope and feature set
- May be overkill for smaller projects or teams focused primarily on model monitoring
Code Comparison
MLflow:
import mlflow
mlflow.start_run()
mlflow.log_param("param1", value1)
mlflow.log_metric("metric1", value2)
mlflow.end_run()
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=curr_data, column_mapping=column_mapping)
MLflow offers a more general-purpose logging API for tracking experiments, while Evidently focuses specifically on generating data quality and model performance reports. MLflow is better suited for managing the entire ML lifecycle, whereas Evidently excels in detailed model monitoring and data drift detection.
Alibi Detect: Algorithms for outlier, adversarial and drift detection
Pros of Alibi Detect
- Focuses on drift detection and outlier detection in machine learning models
- Provides advanced algorithms for detecting concept drift and data drift
- Supports both batch and online drift detection scenarios
Cons of Alibi Detect
- Steeper learning curve due to more complex algorithms and concepts
- Less emphasis on data quality and model performance monitoring
- Requires more setup and configuration for basic use cases
Code Comparison
Alibi Detect:
from alibi_detect.cd import TabularDrift
cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
preds = cd.predict(X)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=X_ref, current_data=X, column_mapping=column_mapping)
Both libraries offer drift detection capabilities, but Alibi Detect provides more advanced algorithms and configurations, while Evidently focuses on simplicity and ease of use for basic monitoring tasks.
SHAP: A game theoretic approach to explain the output of any machine learning model.
Pros of SHAP
- Focuses on model interpretability with advanced techniques like Shapley values
- Provides detailed feature importance and impact analysis
- Supports a wide range of machine learning models and frameworks
Cons of SHAP
- Primarily centered on model explanation, not comprehensive ML monitoring
- Can be computationally intensive for large datasets or complex models
- Steeper learning curve for users new to Shapley values and model interpretation
Code Comparison
SHAP example:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
Evidently example:
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=reference_data, current_data=production_data, column_mapping=column_mapping)
report.save_html("drift_report.html")
SHAP excels in detailed model interpretation, while Evidently offers a broader suite of ML monitoring tools, including data drift detection and model performance tracking. SHAP is ideal for in-depth feature analysis, whereas Evidently provides a more comprehensive approach to ML model monitoring and reporting.
Interpret: Fit interpretable models. Explain blackbox machine learning.
Pros of Interpret
- Broader scope of interpretability techniques, including global and local explanations
- Supports a wider range of machine learning models, including tree-based models and neural networks
- Offers interactive visualizations for exploring model behavior
Cons of Interpret
- Steeper learning curve due to its more comprehensive feature set
- Less focused on data drift and model monitoring compared to Evidently
- May require more computational resources for complex models and large datasets
Code Comparison
Interpret:
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
ebm_global = ebm.explain_global()
show(ebm_global)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
report.save_html("data_drift_report.html")
Both libraries offer valuable tools for model interpretation and monitoring, with Interpret providing a more comprehensive set of interpretability techniques, while Evidently focuses on data drift and model performance monitoring.
AIX360: Interpretability and explainability of data and machine learning models
Pros of AIX360
- Comprehensive suite of explainability algorithms for various AI models
- Supports both local and global explanations for model interpretability
- Includes educational resources and tutorials for understanding AI explainability
Cons of AIX360
- Steeper learning curve due to its broader scope and complexity
- Less focused on data drift and model performance monitoring
- Requires more setup and configuration for specific use cases
Code Comparison
AIX360:
from aix360.algorithms.protodash import ProtodashExplainer
explainer = ProtodashExplainer()
explanation = explainer.explain(X_train, X_test, k=5)
Evidently:
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_data, current_data=cur_data, column_mapping=column_mapping)
AIX360 focuses on generating explanations for AI models, while Evidently specializes in data and model monitoring, including drift detection. AIX360 offers a wider range of explainability techniques, but Evidently provides more straightforward tools for ongoing model performance evaluation and data quality checks.
README
Evidently
An open-source framework to evaluate, test and monitor ML and LLM-powered systems.
Documentation | Discord Community | Blog | Twitter | Evidently Cloud
:bar_chart: What is Evidently?
Evidently is an open-source Python library to evaluate, test, and monitor ML and LLM systems, from experiments to production.
- Works with tabular and text data.
- Supports evals for predictive and generative tasks, from classification to RAG.
- 100+ built-in metrics, from data drift detection to LLM judges.
- Python interface for custom metrics.
- Both offline evals and live monitoring.
- Open architecture: easily export data and integrate with existing tools.
Evidently is very modular. You can start with one-off evaluations or host a full monitoring service.
1. Reports and Test Suites
Reports compute and summarize various data, ML and LLM quality evals.
- Start with Presets and built-in metrics or customize.
- Best for experiments, exploratory analysis and debugging.
- View interactive Reports in Python, export them as JSON, a Python dictionary, or HTML, or browse them in the monitoring UI.
Turn any Report into a Test Suite by adding pass/fail conditions.
- Best for regression testing, CI/CD checks, or data validation.
- Zero setup option: auto-generate test conditions from the reference dataset.
- Simple syntax to set test conditions such as gt (greater than), lt (less than), etc., as shown in the sketch below.
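For example, here is a minimal sketch of a Report with test conditions. The RowCount metric, the gte helper, and the tests= argument are assumptions based on the current API; check the documentation for the exact names.

import pandas as pd
from evidently import Report
from evidently.presets import DataDriftPreset
from evidently.metrics import RowCount   # assumed metric name
from evidently.tests import gte          # assumed condition helpers: gt, lt, gte, lte, ...

ref = pd.DataFrame({"value": range(100)})
cur = pd.DataFrame({"value": range(20, 120)})

report = Report(
    [
        DataDriftPreset(),           # preset metrics; conditions can be auto-generated from the reference data
        RowCount(tests=[gte(5)]),    # explicit condition: fail if the current data has fewer than 5 rows
    ],
    include_tests=True,
)

my_eval = report.run(cur, ref)       # first argument: current data, second: reference
my_eval.save_html("tests.html")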
2. Monitoring Dashboard
The monitoring UI service helps you visualize metrics and test results over time.
You can choose:
- Self-host the open-source version. Live demo.
- Sign up for Evidently Cloud (Recommended).
Evidently Cloud offers a generous free tier and extra features like dataset and user management, alerting, and no-code evals. Compare OSS vs Cloud.
:woman_technologist: Install Evidently
To install from PyPI:
pip install evidently
To install Evidently using conda, run:
conda install -c conda-forge evidently
:arrow_forward: Getting started
Reports
LLM evals
This is a simple Hello World. Check the Tutorials for more: LLM evaluation.
Import the necessary components:
import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
Create a toy dataset with questions and answers.
eval_df = pd.DataFrame([
["What is the capital of Japan?", "The capital of Japan is Tokyo."],
["Who painted the Mona Lisa?", "Leonardo da Vinci."],
["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
columns=["question", "answer"])
Create an Evidently Dataset object and add descriptors: row-level evaluators. We'll check the sentiment of each response, its length, and whether it contains words indicative of denial.
eval_dataset = Dataset.from_pandas(eval_df,
data_definition=DataDefinition(),
descriptors=[
Sentiment("answer", alias="Sentiment"),
TextLength("answer", alias="Length"),
Contains("answer", items=['sorry', 'apologize'], mode="any", alias="Denials")
])
You can view the dataframe with added scores:
eval_dataset.as_dataframe()
To see the distribution of scores, generate a summary Report:
report = Report([
TextEvals()
])
my_eval = report.run(eval_dataset)
my_eval
# my_eval.json()
# my_eval.dict()
You can also choose other evaluators, including LLM-as-a-judge, and configure pass/fail conditions.
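For instance, you could replace the keyword-based denial check with a built-in LLM judge. A minimal sketch, assuming a DeclineLLMEval descriptor and an OpenAI key as the judge backend; both the descriptor name and the setup are assumptions, so see the LLM evaluation tutorial for the exact interface.

import os
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment
from evidently.descriptors import DeclineLLMEval  # assumed name of a built-in LLM judge

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # LLM judges call an external model provider

judged_dataset = Dataset.from_pandas(
    eval_df,  # the toy question/answer DataFrame from above
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        DeclineLLMEval("answer", alias="Denials"),  # LLM-based refusal label instead of keyword matching
    ],
)
judged_dataset.as_dataframe()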
Data and ML evals
This is a simple Hello World. Check the Tutorials for more: Tabular data.
Import the Report, evaluation Preset and toy tabular dataset.
import pandas as pd
from sklearn import datasets
from evidently import Report
from evidently.presets import DataDriftPreset
iris_data = datasets.load_iris(as_frame=True)
iris_frame = iris_data.frame
Run the Data Drift evaluation preset to test for shifts in column distributions. Take the first 60 rows of the dataframe as the "current" data and the rest as the reference. Get the output in a Jupyter notebook:
report = Report([
DataDriftPreset(method="psi")
],
include_tests=True)
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval
You can also save an HTML file. You'll need to open it from the destination folder.
my_eval.save_html("file.html")
To get the output as JSON or Python dictionary:
my_eval.json()
# my_eval.dict()
You can choose other Presets, create Reports from individual Metrics, and configure pass/fail conditions.
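As an illustration, a Report built from individual Metrics instead of a Preset might look like the sketch below; ValueDrift and MissingValueCount are assumed metric names, so check the metrics reference for the exact list.

from evidently import Report
from evidently.metrics import ValueDrift, MissingValueCount  # assumed metric names

report = Report([
    ValueDrift(column="sepal length (cm)"),        # drift in a single column
    MissingValueCount(column="sepal width (cm)"),  # missing values in another column
])
my_eval = report.run(iris_frame.iloc[:60], iris_frame.iloc[60:])
my_eval.dict()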
Monitoring dashboard
This launches a demo project in the locally hosted Evidently UI. Sign up for Evidently Cloud to instantly get a managed version with additional features.
Recommended step: create a virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
After installing Evidently (pip install evidently), run the Evidently UI with the demo projects:
evidently ui --demo-projects all
Visit localhost:8000 to access the UI.
What can you evaluate?
Evidently has 100+ built-in evals. You can also add custom ones.
Here are examples of things you can check:
- Text descriptors: length, sentiment, toxicity, language, special symbols, regular expression matches, etc.
- LLM outputs: semantic similarity, retrieval relevance, summarization quality, etc., with model- and LLM-based evals.
- Data quality: missing values, duplicates, min-max ranges, new categorical values, correlations, etc.
- Data distribution drift: 20+ statistical tests and distance metrics to compare shifts in data distribution.
- Classification: accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc.
- Regression: MAE, ME, RMSE, error distribution, error normality, error bias, etc.
- Ranking (incl. RAG): NDCG, MAP, MRR, Hit Rate, etc.
- Recommendations: serendipity, novelty, diversity, popularity bias, etc.
:computer: Contributions
We welcome contributions! Read the Guide to learn more.
:books: Documentation
For more examples, refer to a complete Documentation.
:white_check_mark: Discord Community
If you want to chat and connect, join our Discord community!