evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Top Related Projects
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Quick Overview
Huggingface/evaluate is an open-source library for evaluating machine learning models, particularly in natural language processing tasks. It provides a collection of evaluation metrics and datasets, making it easier for researchers and practitioners to assess and compare model performance across various tasks and benchmarks.
Pros
- Extensive collection of evaluation metrics and datasets
- Easy integration with Hugging Face's ecosystem (transformers, datasets)
- Supports both local and distributed evaluation (see the sketch after this list)
- Customizable and extensible for adding new metrics or datasets
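For the distributed case, each process loads the same module with the world size and its own rank, and only the main process receives the aggregated result. A hedged sketch of that pattern (`world_size`, `rank`, `local_predictions`, and `local_references` are placeholders for values your launcher provides):

```python
import evaluate

# every process in the job loads the same module with the world size and its own rank
metric = evaluate.load("accuracy", num_process=world_size, process_id=rank)

# each process adds only the predictions it computed locally
metric.add_batch(predictions=local_predictions, references=local_references)

# results are gathered and computed on the main process (process_id == 0);
# the other processes get None back
results = metric.compute()
```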
Cons
- Learning curve for users unfamiliar with Hugging Face's ecosystem
- Some metrics may have limited documentation or examples
- Occasional inconsistencies in metric implementations across different tasks
- Dependency on other Hugging Face libraries may be a drawback for some users
Code Examples
Loading and using a metric:
```python
from evaluate import load

metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)
print(results)
```
Creating a custom metric by subclassing `evaluate.Metric` and implementing `_info` and `_compute`:

```python
import datasets
import evaluate

class CustomAccuracy(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Fraction of predictions that exactly match the references.",
            citation="",
            inputs_description="Predictions and references as lists of integers.",
            features=datasets.Features({"predictions": datasets.Value("int64"),
                                        "references": datasets.Value("int64")}),
        )

    def _compute(self, predictions, references):
        return {"accuracy": sum(p == r for p, r in zip(predictions, references)) / len(predictions)}
```
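A module defined this way can be instantiated and used directly; to share it on the Hub, the scaffolding from `evaluate-cli create` is the usual route. A quick usage sketch (`CustomAccuracy` is the hypothetical class from the snippet above):

```python
metric = CustomAccuracy()
print(metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1]))  # {'accuracy': 0.5}
```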
Using a metric with Hugging Face's datasets:
```python
from datasets import load_dataset
from evaluate import load

dataset = load_dataset("glue", "mrpc", split="validation")
metric = load("glue", "mrpc")

for batch in dataset.iter(batch_size=32):
    # `model` is a placeholder for any sentence-pair classifier returning label ids
    predictions = model(batch["sentence1"], batch["sentence2"])
    metric.add_batch(predictions=predictions, references=batch["label"])

final_score = metric.compute()
```
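If you would rather not write the loop yourself, the library also exposes an `evaluator` abstraction that runs a model over a dataset and scores it in one call. A hedged sketch (requires transformers; the model id, split slice, and label mapping below are illustrative choices, not part of the original example):

```python
from datasets import load_dataset
from evaluate import evaluator

# the task name selects a task-specific evaluator, here for text classification
task_evaluator = evaluator("text-classification")
data = load_dataset("glue", "sst2", split="validation[:100]")

results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    metric="accuracy",
    input_column="sentence",                       # SST-2 stores its text in the "sentence" column
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset label ids
)
print(results)
```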
Getting Started
To get started with huggingface/evaluate, follow these steps:
- Install the library:

```bash
pip install evaluate
```

- Import and use a metric:

```python
from evaluate import load

# Load a metric
metric = load("accuracy")

# Compute the metric
results = metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1])
print(results)
```

- Explore the available metrics and datasets on the Hugging Face Hub.
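The Hub modules can also be listed programmatically. A minimal sketch (the `module_type` filter is optional):

```python
import evaluate

# list every metric, comparison, and measurement registered on the Hub
print(evaluate.list_evaluation_modules())

# or restrict the listing to one kind of module, e.g. comparisons
print(evaluate.list_evaluation_modules(module_type="comparison"))
```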
Competitor Comparisons
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Pros of evals
- Focused on evaluating language models and AI systems
- Includes a variety of pre-built evaluation tasks and metrics
- Designed to work seamlessly with OpenAI's models and APIs
Cons of evals
- Limited to evaluating AI models rather than providing general-purpose metrics
- Less extensive community contributions compared to evaluate
- Primarily tailored for OpenAI's ecosystem
Code comparison
evaluate:
```python
from evaluate import load

metric = load("accuracy")
results = metric.compute(predictions=preds, references=refs)
```
evals (typically driven by its `oaieval` CLI and a YAML registry of evals rather than direct library calls):

```bash
# run a registered eval against a model
oaieval gpt-3.5-turbo test-match
```
Key differences
- evaluate is more versatile, supporting various ML tasks and metrics
- evals is specifically designed for evaluating language models and AI systems
- evaluate has a larger community and more diverse set of metrics
- evals integrates more closely with OpenAI's ecosystem
- evaluate is more suitable for general ML projects, while evals is tailored for AI model evaluation
Both repositories serve important roles in the ML/AI ecosystem, with evaluate offering broader applicability and evals providing specialized tools for AI system evaluation.
README
Tip: For more recent evaluation approaches, for example for evaluating LLMs, we recommend our newer and more actively maintained library LightEval.
🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.
It currently contains:
- implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics. With a simple command like `accuracy = load("accuracy")`, any of these metrics is ready to use for evaluating an ML model in any framework (NumPy/Pandas/PyTorch/TensorFlow/JAX).
- comparisons and measurements: comparisons are used to measure the difference between models, and measurements are tools to evaluate datasets (see the sketch after this list).
- an easy way of adding new evaluation modules to the 🤗 Hub: you can create new evaluation modules and push them to a dedicated Space on the 🤗 Hub with `evaluate-cli create [metric name]`, which lets you easily compare different metrics and their outputs for the same sets of references and predictions.
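A minimal sketch of the two non-metric module types, assuming the `mcnemar` comparison and `word_length` measurement modules are available on the Hub:

```python
import evaluate

# comparison: test whether two models' predictions differ significantly (McNemar's test)
mcnemar = evaluate.load("mcnemar", module_type="comparison")
print(mcnemar.compute(predictions1=[1, 0, 1, 1], predictions2=[1, 1, 0, 1], references=[1, 0, 0, 1]))

# measurement: describe a dataset rather than a model, e.g. its average word length
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "evaluate can describe datasets too"]))
```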
🔎 Find a metric, comparison, measurement on the Hub
🌟 Add a new evaluation module
🤗 Evaluate also has lots of useful features like:
- Type checking: the input types are checked to make sure that you are using the right input formats for each metric
- Metric cards: each metric comes with a card that describes its values, limitations and ranges, as well as examples of its usage and usefulness.
- Community metrics: Metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.
Installation
With pip
🤗 Evaluate can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):
pip install evaluate
Usage
🤗 Evaluate's main methods are:
- `evaluate.list_evaluation_modules()` to list the available metrics, comparisons and measurements
- `evaluate.load(module_name, **kwargs)` to instantiate an evaluation module
- `results = module.compute(**kwargs)` to compute the result of an evaluation module
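The keyword arguments accepted by `compute()` are module-specific. A sketch, assuming the `f1` metric's `average` argument, which selects macro-averaging for a multiclass problem:

```python
import evaluate

f1 = evaluate.load("f1")
# `average` is forwarded to the underlying scikit-learn implementation
results = f1.compute(predictions=[0, 2, 1, 1], references=[0, 1, 2, 1], average="macro")
print(results)  # {'f1': 0.5}
```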
Adding a new evaluation module
First install the necessary dependencies to create a new metric with the following command:
pip install evaluate[template]
Then you can get started with the following command which will create a new folder for your metric and display the necessary steps:
evaluate-cli create "Awesome Metric"
See this step-by-step guide in the documentation for detailed instructions.
Credits
Thanks to @marella for letting us use the `evaluate` namespace on PyPI, previously used by his library.