huggingface/evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Top Related Projects

openai/evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Quick Overview

huggingface/evaluate is an open-source library for evaluating machine learning models, particularly on natural language processing tasks. It provides a collection of evaluation metrics, comparisons, and measurements, making it easier for researchers and practitioners to assess and compare model performance across tasks and benchmarks.

Pros

  • Extensive collection of evaluation metrics and datasets
  • Easy integration with Hugging Face's ecosystem (transformers, datasets); see the sketch after this list
  • Supports both local and distributed evaluation
  • Customizable and extensible for adding new metrics or datasets
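
As a rough sketch of that ecosystem integration (a minimal example; the model and dataset names are only illustrative, and transformers must be installed), the evaluator API can score a pipeline on a Hub dataset in a few lines:

from evaluate import evaluator

# Builds a task-specific evaluator that wires a transformers pipeline,
# a Hub dataset and a metric together (this downloads the model and data)
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",   # example model
    data="imdb",                                   # example dataset
    split="test",
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset label ids
)
print(results)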

Cons

  • Learning curve for users unfamiliar with Hugging Face's ecosystem
  • Some metrics may have limited documentation or examples
  • Occasional inconsistencies in metric implementations across different tasks
  • Dependency on other Hugging Face libraries may be a drawback for some users

Code Examples

Loading and using a metric:

from evaluate import load

metric = load("accuracy")
predictions = [0, 1, 1, 0]
references = [0, 1, 0, 1]
results = metric.compute(predictions=predictions, references=references)
print(results)
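# Expected output: {'accuracy': 0.5}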

Creating a custom metric:

import datasets
import evaluate


# Custom metrics subclass evaluate.Metric and implement _info() and _compute()
class CustomAccuracy(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Fraction of predictions that exactly match the references.",
            citation="",
            inputs_description="Predictions and references as lists of integers.",
            features=datasets.Features(
                {"predictions": datasets.Value("int32"), "references": datasets.Value("int32")}
            ),
        )

    def _compute(self, predictions, references):
        return {"accuracy": sum(p == r for p, r in zip(predictions, references)) / len(predictions)}
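
The class above can then be instantiated and used like any other module (a quick check; CustomAccuracy is the sketch defined above):

custom_metric = CustomAccuracy()
print(custom_metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1]))
# Expected output: {'accuracy': 0.5}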

Using a metric with Hugging Face's datasets:

from datasets import load_dataset
from evaluate import load

dataset = load_dataset("glue", "mrpc", split="validation")
metric = load("glue", "mrpc")

# `model` is a placeholder for your own model returning one label id per sentence pair
for batch in dataset.iter(batch_size=32):
    predictions = model(batch["sentence1"], batch["sentence2"])
    metric.add_batch(predictions=predictions, references=batch["label"])

final_score = metric.compute()
print(final_score)  # for GLUE MRPC this includes accuracy and f1

Getting Started

To get started with huggingface/evaluate, follow these steps:

  1. Install the library:

    pip install evaluate
    
  2. Import and use a metric:

    from evaluate import load
    
    # Load a metric
    metric = load("accuracy")
    
    # Compute the metric
    results = metric.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 1])
    print(results)
    
  3. Explore available metrics and datasets on the Hugging Face Hub, or list them programmatically as shown below.
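
For instance, the available modules can be listed with the library's listing helper (a small sketch; the module_type filter also accepts "comparison" and "measurement"):

import evaluate

# List metric modules registered on the Hub
metrics = evaluate.list_evaluation_modules(module_type="metric")
print(len(metrics), metrics[:5])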

Competitor Comparisons

openai/evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Pros of evals

  • Focused on evaluating language models and AI systems
  • Includes a variety of pre-built evaluation tasks and metrics
  • Designed to work seamlessly with OpenAI's models and APIs

Cons of evals

  • Limited to evaluating AI models, not general-purpose metrics
  • Less extensive community contributions compared to evaluate
  • Primarily tailored for OpenAI's ecosystem

Code comparison

evaluate:

from evaluate import load
metric = load("accuracy")
results = metric.compute(predictions=preds, references=refs)

evals (typically invoked through its oaieval CLI rather than as a Python library):

oaieval gpt-3.5-turbo test-match

Key differences

  • evaluate is more versatile, supporting various ML tasks and metrics
  • evals is specifically designed for evaluating language models and AI systems
  • evaluate has a larger community and more diverse set of metrics
  • evals integrates more closely with OpenAI's ecosystem
  • evaluate is more suitable for general ML projects, while evals is tailored for AI model evaluation

Both repositories serve important roles in the ML/AI ecosystem, with evaluate offering broader applicability and evals providing specialized tools for AI system evaluation.

README

Tip: For more recent evaluation approaches, for example for evaluating LLMs, we recommend our newer and more actively maintained library LightEval.

🤗 Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized.

It currently contains:

  • implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision, and include dataset-specific metrics (such as the glue metric used above). With a simple command like accuracy = load("accuracy"), get any of these metrics ready to use for evaluating an ML model in any framework (Numpy/Pandas/PyTorch/TensorFlow/JAX).
  • comparisons and measurements: comparisons are used to measure the difference between models, and measurements are tools to evaluate datasets (see the short sketch after this list).
  • an easy way of adding new evaluation modules to the 🤗 Hub: you can create new evaluation modules and push them to a dedicated Space in the 🤗 Hub with evaluate-cli create [metric name], which allows you to easily compare different metrics and their outputs for the same sets of references and predictions.
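
As a small illustration of a measurement (a minimal sketch using the built-in word_length module, which may pull in extra dependencies such as nltk; output keys can vary per module):

import evaluate

# Measurements describe properties of a dataset rather than model predictions
word_length = evaluate.load("word_length", module_type="measurement")
results = word_length.compute(data=["hello world", "evaluate also measures datasets"])
print(results)  # e.g. {'average_word_length': ...}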

🎓 Documentation

🔎 Find a metric, comparison, measurement on the Hub

🌟 Add a new evaluation module

🤗 Evaluate also has lots of useful features like:

  • Type checking: the input types are checked to make sure that you are using the right input formats for each metric
  • Metric cards: each metric comes with a card that describes the values, limitations and their ranges, as well as providing examples of their usage and usefulness.
  • Community metrics: Metrics live on the Hugging Face Hub and you can easily add your own metrics for your project or to collaborate with others.

Installation

With pip

🤗 Evaluate can be installed from PyPi and has to be installed in a virtual environment (venv or conda for instance)

pip install evaluate

Usage

🤗 Evaluate's main methods are:

  • evaluate.list_evaluation_modules() to list the available metrics, comparisons and measurements
  • evaluate.load(module_name, **kwargs) to instantiate an evaluation module
  • results = module.compute(**kwargs) to compute the result of an evaluation module (see the sketch below)
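
For example, loading a metric and forwarding module-specific keyword arguments through compute (a minimal sketch; f1 and its average argument are just one example of such kwargs):

import evaluate

f1 = evaluate.load("f1")
# average="macro" is an f1-specific kwarg passed through compute(**kwargs)
results = f1.compute(predictions=[0, 1, 2, 0], references=[0, 1, 1, 0], average="macro")
print(results)  # e.g. {'f1': ...}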

Adding a new evaluation module

First install the necessary dependencies to create a new metric with the following command:

pip install evaluate[template]

Then you can get started with the following command which will create a new folder for your metric and display the necessary steps:

evaluate-cli create "Awesome Metric"

See this step-by-step guide in the documentation for detailed instructions.

Credits

Thanks to @marella for letting us use the evaluate namespace on PyPi previously used by his library.