evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Top Related Projects
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Quick Overview
Huggingface/evaluate is an open-source library for evaluating machine learning models, particularly in natural language processing tasks. It provides a collection of evaluation metrics and datasets, making it easier for researchers and practitioners to assess and compare model performance across various tasks and benchmarks.
Pros
- Extensive collection of evaluation metrics and datasets
- Easy integration with Hugging Face's ecosystem (transformers, datasets)
- Supports both local and distributed evaluation (see the sketch after this list)
- Customizable and extensible for adding new metrics or datasets
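For the distributed case, each process loads the same module with the world size and its own rank, and only the main process receives the aggregated result. A hedged sketch of that pattern (`world_size`, `rank`, `local_predictions`, and `local_references` are placeholders for values your launcher provides):

```python
import evaluate

# every process in the job loads the same module with the world size and its own rank
metric = evaluate.load("accuracy", num_process=world_size, process_id=rank)

# each process adds only the predictions it computed locally
metric.add_batch(predictions=local_predictions, references=local_references)

# results are gathered and computed on the main process (process_id == 0);
# the other processes get None back
results = metric.compute()
```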
Cons
- Learning curve for users unfamiliar with Hugging Face's ecosystem
- Some metrics may have limited documentation or examples
- Occasional inconsistencies in metric implementations across different tasks
- Dependency on other Hugging Face libraries may be a drawback for some users
Code Examples
Loading and using a metric:
```python
from evaluate import load

metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)
print(results)
```
Creating a custom metric by subclassing `evaluate.Metric` and implementing `_info` and `_compute`:

```python
import datasets
import evaluate

class CustomAccuracy(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Fraction of predictions that exactly match the references.",
            citation="",
            inputs_description="Predictions and references as lists of integers.",
            features=datasets.Features({"predictions": datasets.Value("int64"),
                                        "references": datasets.Value("int64")}),
        )

    def _compute(self, predictions, references):
        return {"accuracy": sum(p == r for p, r in zip(predictions, references)) / len(predictions)}
```
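A module defined this way can be instantiated and used directly; to share it on the Hub, the scaffolding from `evaluate-cli create` is the usual route. A quick usage sketch (`CustomAccuracy` is the hypothetical class from the snippet above):

```python
metric = CustomAccuracy()
print(metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1]))  # {'accuracy': 0.5}
```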
Using a metric with Hugging Face's datasets:
```python
from datasets import load_dataset
from evaluate import load

dataset = load_dataset("glue", "mrpc", split="validation")
metric = load("glue", "mrpc")

for batch in dataset.iter(batch_size=32):
    # `model` is a placeholder for any sentence-pair classifier returning label ids
    predictions = model(batch["sentence1"], batch["sentence2"])
    metric.add_batch(predictions=predictions, references=batch["label"])

final_score = metric.compute()
```
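If you would rather not write the loop yourself, the library also exposes an `evaluator` abstraction that runs a model over a dataset and scores it in one call. A hedged sketch (requires transformers; the model id, split slice, and label mapping below are illustrative choices, not part of the original example):

```python
from datasets import load_dataset
from evaluate import evaluator

# the task name selects a task-specific evaluator, here for text classification
task_evaluator = evaluator("text-classification")
data = load_dataset("glue", "sst2", split="validation[:100]")

results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    metric="accuracy",
    input_column="sentence",                       # SST-2 stores its text in the "sentence" column
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset label ids
)
print(results)
```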
Getting Started
To get started with huggingface/evaluate, follow these steps:
- Install the library:

```bash
pip install evaluate
```

- Import and use a metric:

```python
from evaluate import load

# Load a metric
metric = load("accuracy")

# Compute the metric
results = metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1])
print(results)
```

- Explore the available metrics and datasets on the Hugging Face Hub.
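The Hub modules can also be listed programmatically. A minimal sketch (the `module_type` filter is optional):

```python
import evaluate

# list every metric, comparison, and measurement registered on the Hub
print(evaluate.list_evaluation_modules())

# or restrict the listing to one kind of module, e.g. comparisons
print(evaluate.list_evaluation_modules(module_type="comparison"))
```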
Competitor Comparisons
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Pros of evals
- Focused on evaluating language models and AI systems
- Includes a variety of pre-built evaluation tasks and metrics
- Designed to work seamlessly with OpenAI's models and APIs
Cons of evals
- Limited to evaluating AI models rather than providing general-purpose metrics
- Less extensive community contributions compared to evaluate
- Primarily tailored for OpenAI's ecosystem
Code comparison
evaluate:
```python
from evaluate import load

metric = load("accuracy")
results = metric.compute(predictions=preds, references=refs)
```
evals (typically driven by its `oaieval` CLI and a YAML registry of evals rather than direct library calls):

```bash
# run a registered eval against a model
oaieval gpt-3.5-turbo test-match
```
Key differences
- evaluate is more versatile, supporting various ML tasks and metrics
- evals is specifically designed for evaluating language models and AI systems
- evaluate has a larger community and more diverse set of metrics
- evals integrates more closely with OpenAI's ecosystem
- evaluate is more suitable for general ML projects, while evals is tailored for AI model evaluation
Both repositories serve important roles in the ML/AI ecosystem, with evaluate offering broader applicability and evals providing specialized tools for AI system evaluation.
README
Tip: For more recent evaluation approaches, for example for evaluating LLMs, we recommend our newer and more actively maintained library LightEval.
🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.
It currently contains:
- implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics. With a simple command like `accuracy = load("accuracy")`, any of these metrics is ready to use for evaluating an ML model in any framework (NumPy/Pandas/PyTorch/TensorFlow/JAX).
- comparisons and measurements: comparisons are used to measure the difference between models, and measurements are tools to evaluate datasets (see the sketch after this list).
- an easy way of adding new evaluation modules to the 🤗 Hub: you can create new evaluation modules and push them to a dedicated Space on the 🤗 Hub with `evaluate-cli create [metric name]`, which lets you easily compare different metrics and their outputs for the same sets of references and predictions.
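A minimal sketch of the two non-metric module types, assuming the `mcnemar` comparison and `word_length` measurement modules are available on the Hub:

```python
import evaluate

# comparison: test whether two models' predictions differ significantly (McNemar's test)
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(predictions1=[1, 0, 1, 1], predictions2=[1, 1, 0, 1], references=[1, 0, 0, 1]))

# measurement: describe a dataset rather than a model, e.g. its average word length
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "evaluate can describe datasets too"]))
```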
🔎 Find a metric, comparison, measurement on the Hub
🌟 Add a new evaluation module
🤗 Evaluate also has lots of useful features like:
- Type checking: the input types are checked to make sure that you are using the right input formats for each metric
- Metric cards: each metric comes with a card that describes its values, limitations and ranges, as well as examples of its usage and usefulness.
- Community metrics: Metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.
Installation
With pip
🤗 Evaluate can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):
pip install evaluate
Usage
🤗 Evaluate's main methods are:
- `evaluate.list_evaluation_modules()` to list the available metrics, comparisons and measurements
- `evaluate.load(module_name, **kwargs)` to instantiate an evaluation module
- `results = module.compute(**kwargs)` to compute the result of an evaluation module
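The keyword arguments accepted by `compute()` are module-specific. A sketch, assuming the `f1` metric's `average` argument, which selects macro-averaging for a multiclass problem:

```python
import evaluate

f1 = evaluate.load("f1")
# `average` is forwarded to the underlying scikit-learn implementation
results = f1.compute(predictions=[0, 2, 1, 1], references=[0, 1, 2, 1], average="macro")
print(results)  # {'f1': 0.5}
```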
Adding a new evaluation module
First install the necessary dependencies to create a new metric with the following command:
pip install evaluate[template]
Then you can get started with the following command which will create a new folder for your metric and display the necessary steps:
evaluate-cli create "Awesome Metric"
See this step-by-step guide in the documentation for detailed instructions.
Credits
Thanks to @marella for letting us use the `evaluate` namespace on PyPI, previously used by his library.