Top Related Projects
- lm-evaluation-harness: A framework for few-shot evaluation of language models.
- evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
- BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
Quick Overview
The hendrycks/test repository is a comprehensive benchmark for measuring large language model (LLM) capabilities across a wide range of subjects and difficulty levels. It contains 57 tasks spanning STEM fields, the humanities, the social sciences, and more, designed to evaluate the knowledge and reasoning abilities of AI models.
Pros
- Extensive coverage of diverse subjects and difficulty levels
- Provides a standardized benchmark for comparing LLM performance
- Regularly updated with new tasks and improvements
- Open-source and freely available for researchers and developers
Cons
- May not fully capture real-world application scenarios
- Potential for bias in question selection and difficulty assessment
- Requires significant computational resources to evaluate large models
- Limited to multiple-choice format, which may not assess all aspects of language understanding
Getting Started
To use the hendrycks/test benchmark:
- Clone the repository: git clone https://github.com/hendrycks/test.git
- Download the test data and locate the files for the desired subject (e.g., mathematics, computer_science, etc.).
- Use the provided data files (in CSV format) to evaluate your LLM.
- Compare your model's performance against the reported baselines in the repository's documentation.
Note: This repository does not provide a specific code library for evaluation. Users are expected to implement their own evaluation scripts based on the provided data and guidelines.
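As a starting point for such a script, here is a minimal sketch of a grader for a single subject file. It assumes the MMLU-style CSV layout (question, four answer choices, answer letter) and a caller-supplied ask_model function that returns a letter; the file name and function are illustrative, not part of the repository.

import csv

def load_subject(csv_path):
    # Each row in an MMLU-style CSV: question, choice A, B, C, D, answer letter.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

def evaluate_subject(csv_path, ask_model):
    # ask_model(question, choices) is a user-supplied callable returning "A".."D".
    rows = load_subject(csv_path)
    correct = 0
    for question, a, b, c, d, answer in rows:
        prediction = ask_model(question, [a, b, c, d])
        if prediction.strip().upper() == answer.strip().upper():
            correct += 1
    return correct / len(rows)

# Illustrative usage with a trivial always-"A" baseline:
# print(evaluate_subject("college_computer_science_test.csv", lambda q, c: "A"))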
Competitor Comparisons
lm-evaluation-harness: A framework for few-shot evaluation of language models.
Pros of lm-evaluation-harness
- More comprehensive evaluation framework with a wider range of tasks and metrics
- Supports multiple language models and frameworks (e.g., HuggingFace, OpenAI)
- Actively maintained with regular updates and contributions
Cons of lm-evaluation-harness
- More complex setup and configuration required
- Steeper learning curve for new users
- Potentially slower execution due to its comprehensive nature
Code Comparison
lm-evaluation-harness:
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["hellaswag", "winogrande"],
    num_fewshot=2,
)
test:
from test import multiple_choice
score = multiple_choice.evaluate(model, dataset="hellaswag")
The lm-evaluation-harness provides a more flexible and comprehensive evaluation framework, allowing for multiple tasks and customizable parameters. The test repository offers a simpler, more straightforward approach for specific tasks but with less flexibility.
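For completeness, here is a hedged sketch of reading the dictionary returned by simple_evaluate. The exact metric key names (e.g., "acc" vs. "acc,none") vary between lm-evaluation-harness versions, so treat the lookup below as an assumption.

# Per-task metrics live under results["results"]; the "acc" key is an
# assumption and depends on the harness version in use.
for task_name, metrics in results["results"].items():
    print(task_name, metrics.get("acc", metrics))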
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Pros of evals
- More comprehensive evaluation framework with support for various model types
- Actively maintained with regular updates and contributions
- Includes tools for creating custom evaluations and analyzing results
Cons of evals
- Steeper learning curve due to more complex architecture
- Primarily focused on OpenAI models, potentially limiting its applicability
Code comparison
test:
def check_correctness(completion, answer):
    return completion.strip() == answer.strip()

def grade_question(completion, answer):
    return 1 if check_correctness(completion, answer) else 0
evals:
from abc import ABC, abstractmethod
from typing import Any, Dict

class Eval(ABC):
    @abstractmethod
    def eval_sample(self, sample: Any, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        pass

    @abstractmethod
    def run(self, recorder: RecorderBase) -> None:  # RecorderBase comes from the evals package
        pass
The code snippets highlight the difference in complexity between the two repositories. test uses simple functions for grading, while evals employs a more structured, object-oriented approach with abstract classes for creating custom evaluations.
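To make the contrast concrete, here is a hypothetical sketch of what subclassing an abstract Eval base class like the one above could look like. MultipleChoiceEval, its sample format, and the final_report attribute are illustrative and not part of the evals package; a real implementation would report through the recorder API.

from typing import Any, Dict

class MultipleChoiceEval(Eval):  # Eval: the abstract base class shown above
    def __init__(self, samples):
        # Illustrative sample format: {"completion": "B", "answer": "B"}
        self.samples = samples

    def eval_sample(self, sample: Any, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        # Grade one sample by exact match between completion and gold answer.
        return {"correct": sample["completion"].strip() == sample["answer"].strip()}

    def run(self, recorder) -> None:
        # A real Eval would log via the recorder; here the aggregate accuracy
        # is simply kept on the instance for illustration.
        graded = [self.eval_sample(s) for s in self.samples]
        self.final_report = {"accuracy": sum(g["correct"] for g in graded) / len(graded)}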
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Pros of evaluate
- Comprehensive evaluation framework with a wide range of metrics and tasks
- Seamless integration with Hugging Face's ecosystem and models
- Active development and community support
Cons of evaluate
- More complex setup and usage compared to test
- Potentially overwhelming for simple evaluation tasks
- Requires additional dependencies and resources
Code comparison
test:
from test import tasks
score = tasks.math.arithmetic.evaluate_model(model)
evaluate:
from evaluate import load
metric = load("accuracy")
results = metric.compute(predictions=preds, references=refs)
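As a follow-up to the snippet above, here is a small sketch of applying the accuracy metric to MMLU-style multiple-choice answers; the letter-to-index mapping and the sample predictions are illustrative.

import evaluate

accuracy = evaluate.load("accuracy")
letter_to_index = {"A": 0, "B": 1, "C": 2, "D": 3}  # the accuracy metric expects integers
predictions = ["B", "C", "A", "D"]  # illustrative model outputs
references = ["B", "C", "B", "D"]   # illustrative gold answers
results = accuracy.compute(
    predictions=[letter_to_index[p] for p in predictions],
    references=[letter_to_index[r] for r in references],
)
print(results)  # e.g. {'accuracy': 0.75}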
Summary
evaluate offers a more robust and versatile evaluation framework, particularly suited for complex NLP tasks and integration with Hugging Face's ecosystem. test provides a simpler, more focused approach for specific benchmarks, especially in areas like math and reasoning. evaluate's flexibility comes at the cost of increased complexity, while test offers a more straightforward implementation for its targeted tasks. The choice between the two depends on the specific evaluation needs and the desired level of integration with other tools and models.
BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
Pros of BIG-bench
- Larger and more diverse set of tasks, covering a wider range of capabilities
- Collaborative effort with contributions from multiple organizations and researchers
- Includes more complex and multi-step tasks that test advanced reasoning abilities
Cons of BIG-bench
- Less standardized format across tasks, making it potentially more challenging to implement
- May require more computational resources due to the larger number and complexity of tasks
- Some tasks may be less relevant for general language model evaluation
Code comparison
BIG-bench task example:
def evaluate_model(model, examples):
    scores = []
    for example in examples:
        prediction = model.generate(example['input'])
        score = compute_score(prediction, example['target'])
        scores.append(score)
    return sum(scores) / len(scores)
Hendrycks test example:
def evaluate_model(model, examples):
    correct = 0
    for question, choices, answer in examples:
        prediction = model.predict(question, choices)
        if prediction == answer:
            correct += 1
    return correct / len(examples)
Both repositories provide valuable benchmarks for evaluating language models, with BIG-bench offering a broader range of tasks and Hendrycks test focusing on more standardized, multiple-choice formats.
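For reference, the Hendrycks test prompts models with a fixed multiple-choice template built from a few dev-set examples. The sketch below approximates that template; the exact header wording follows the paper's format, but treat the details as an assumption rather than the repository's exact code.

def format_example(question, choices, answer=None):
    # One question block; the answer letter is appended for few-shot examples only.
    letters = ["A", "B", "C", "D"]
    lines = [question] + [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)

def build_prompt(subject, fewshot_examples, test_question, test_choices):
    # Header wording approximates the template used in the paper's evaluation code.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(q, c, a) for q, c, a in fewshot_examples)
    return header + shots + "\n\n" + format_example(test_question, test_choices)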
README
Measuring Massive Multitask Language Understanding
This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).
This repository contains OpenAI API evaluation code, and the test is available for download here.
Test Leaderboard
If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.
Results of the test (accuracy, %):
Model | Authors | Humanities | Social Sciences | STEM | Other | Average |
---|---|---|---|---|---|---|
Chinchilla (70B, few-shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
Gopher (280B, few-shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
GPT-3 (175B, fine-tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
flan-T5-xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
GPT-3 (175B, few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
GPT-3 (6.7B, fine-tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
flan-T5-large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
flan-T5-base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
GPT-2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
flan-T5-small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
Citation
If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}