Top Related Projects
- lm-evaluation-harness: A framework for few-shot evaluation of language models.
- evals: A framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
- BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
Quick Overview
The hendrycks/test repository is a comprehensive benchmark for measuring large language model (LLM) capabilities across a wide range of subjects and difficulty levels. It contains 57 tasks spanning STEM fields, the humanities, the social sciences, and more, designed to evaluate the knowledge and reasoning abilities of AI models.
Pros
- Extensive coverage of diverse subjects and difficulty levels
- Provides a standardized benchmark for comparing LLM performance
- Regularly updated with new tasks and improvements
- Open-source and freely available for researchers and developers
Cons
- May not fully capture real-world application scenarios
- Potential for bias in question selection and difficulty assessment
- Requires significant computational resources to evaluate large models
- Limited to multiple-choice format, which may not assess all aspects of language understanding
Getting Started
To use the hendrycks/test benchmark:
- Clone the repository: git clone https://github.com/hendrycks/test.git
- Download the test data and locate the files for the desired subject (e.g., mathematics, computer_science, etc.).
- Use the provided data files (in CSV format) to evaluate your LLM.
- Compare your model's performance against the reported baselines in the repository's documentation.
Note: This repository does not provide a specific code library for evaluation. Users are expected to implement their own evaluation scripts based on the provided data and guidelines.
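As a starting point for such a script, here is a minimal sketch of a grader for a single subject file. It assumes the MMLU-style CSV layout (question, four answer choices, answer letter) and a caller-supplied ask_model function that returns a letter; the file name and function are illustrative, not part of the repository.

import csv

def load_subject(csv_path):
    # Each row in an MMLU-style CSV: question, choice A, B, C, D, answer letter.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

def evaluate_subject(csv_path, ask_model):
    # ask_model(question, choices) is a user-supplied callable returning "A".."D".
    rows = load_subject(csv_path)
    correct = 0
    for question, a, b, c, d, answer in rows:
        prediction = ask_model(question, [a, b, c, d])
        if prediction.strip().upper() == answer.strip().upper():
            correct += 1
    return correct / len(rows)

# Illustrative usage with a trivial always-"A" baseline:
# print(evaluate_subject("college_computer_science_test.csv", lambda q, c: "A"))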
Competitor Comparisons
lm-evaluation-harness: A framework for few-shot evaluation of language models.
Pros of lm-evaluation-harness
- More comprehensive evaluation framework with a wider range of tasks and metrics
- Supports multiple language models and frameworks (e.g., HuggingFace, OpenAI)
- Actively maintained with regular updates and contributions
Cons of lm-evaluation-harness
- More complex setup and configuration required
- Steeper learning curve for new users
- Potentially slower execution due to its comprehensive nature
Code Comparison
lm-evaluation-harness:
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    tasks=["hellaswag", "winogrande"],
    num_fewshot=2,
)
test:
from test import multiple_choice
score = multiple_choice.evaluate(model, dataset="hellaswag")
The lm-evaluation-harness provides a more flexible and comprehensive evaluation framework, allowing for multiple tasks and customizable parameters. The test repository offers a simpler, more straightforward approach for specific tasks but with less flexibility.
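For completeness, here is a hedged sketch of reading the dictionary returned by simple_evaluate. The exact metric key names (e.g., "acc" vs. "acc,none") vary between lm-evaluation-harness versions, so treat the lookup below as an assumption.

# Per-task metrics live under results["results"]; the "acc" key is an
# assumption and depends on the harness version in use.
for task_name, metrics in results["results"].items():
    print(task_name, metrics.get("acc", metrics))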
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Pros of evals
- More comprehensive evaluation framework with support for various model types
- Actively maintained with regular updates and contributions
- Includes tools for creating custom evaluations and analyzing results
Cons of evals
- Steeper learning curve due to more complex architecture
- Primarily focused on OpenAI models, potentially limiting its applicability
Code comparison
test:
def check_correctness(completion, answer):
    return completion.strip() == answer.strip()

def grade_question(completion, answer):
    return 1 if check_correctness(completion, answer) else 0
evals:
from abc import ABC, abstractmethod
from typing import Any, Dict

class Eval(ABC):
    @abstractmethod
    def eval_sample(self, sample: Any, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        pass

    @abstractmethod
    def run(self, recorder: RecorderBase) -> None:  # RecorderBase comes from the evals package
        pass
The code snippets highlight the difference in complexity between the two repositories. test uses simple functions for grading, while evals employs a more structured, object-oriented approach with abstract classes for creating custom evaluations.
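To make the contrast concrete, here is a hypothetical sketch of what subclassing an abstract Eval base class like the one above could look like. MultipleChoiceEval, its sample format, and the final_report attribute are illustrative and not part of the evals package; a real implementation would report through the recorder API.

from typing import Any, Dict

class MultipleChoiceEval(Eval):  # Eval: the abstract base class shown above
    def __init__(self, samples):
        # Illustrative sample format: {"completion": "B", "answer": "B"}
        self.samples = samples

    def eval_sample(self, sample: Any, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        # Grade one sample by exact match between completion and gold answer.
        return {"correct": sample["completion"].strip() == sample["answer"].strip()}

    def run(self, recorder) -> None:
        # A real Eval would log via the recorder; here the aggregate accuracy
        # is simply kept on the instance for illustration.
        graded = [self.eval_sample(s) for s in self.samples]
        self.final_report = {"accuracy": sum(g["correct"] for g in graded) / len(graded)}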
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Pros of evaluate
- Comprehensive evaluation framework with a wide range of metrics and tasks
- Seamless integration with Hugging Face's ecosystem and models
- Active development and community support
Cons of evaluate
- More complex setup and usage compared to test
- Potentially overwhelming for simple evaluation tasks
- Requires additional dependencies and resources
Code comparison
test:
from test import tasks
score = tasks.math.arithmetic.evaluate_model(model)
evaluate:
from evaluate import load
metric = load("accuracy")
results = metric.compute(predictions=preds, references=refs)
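As a follow-up to the snippet above, here is a small sketch of applying the accuracy metric to MMLU-style multiple-choice answers; the letter-to-index mapping and the sample predictions are illustrative.

import evaluate

accuracy = evaluate.load("accuracy")
letter_to_index = {"A": 0, "B": 1, "C": 2, "D": 3}  # the accuracy metric expects integers
predictions = ["B", "C", "A", "D"]  # illustrative model outputs
references = ["B", "C", "B", "D"]   # illustrative gold answers
results = accuracy.compute(
    predictions=[letter_to_index[p] for p in predictions],
    references=[letter_to_index[r] for r in references],
)
print(results)  # e.g. {'accuracy': 0.75}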
Summary
evaluate offers a more robust and versatile evaluation framework, particularly suited for complex NLP tasks and integration with Hugging Face's ecosystem. test provides a simpler, more focused approach for specific benchmarks, especially in areas like math and reasoning. evaluate's flexibility comes at the cost of increased complexity, while test offers a more straightforward implementation for its targeted tasks. The choice between the two depends on the specific evaluation needs and the desired level of integration with other tools and models.
BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
Pros of BIG-bench
- Larger and more diverse set of tasks, covering a wider range of capabilities
- Collaborative effort with contributions from multiple organizations and researchers
- Includes more complex and multi-step tasks that test advanced reasoning abilities
Cons of BIG-bench
- Less standardized format across tasks, making it potentially more challenging to implement
- May require more computational resources due to the larger number and complexity of tasks
- Some tasks may be less relevant for general language model evaluation
Code comparison
BIG-bench task example:
def evaluate_model(model, examples):
    scores = []
    for example in examples:
        prediction = model.generate(example['input'])
        score = compute_score(prediction, example['target'])
        scores.append(score)
    return sum(scores) / len(scores)
Hendrycks test example:
def evaluate_model(model, examples):
    correct = 0
    for question, choices, answer in examples:
        prediction = model.predict(question, choices)
        if prediction == answer:
            correct += 1
    return correct / len(examples)
Both repositories provide valuable benchmarks for evaluating language models, with BIG-bench offering a broader range of tasks and Hendrycks test focusing on more standardized, multiple-choice formats.
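For reference, the Hendrycks test prompts models with a fixed multiple-choice template built from a few dev-set examples. The sketch below approximates that template; the exact header wording follows the paper's format, but treat the details as an assumption rather than the repository's exact code.

def format_example(question, choices, answer=None):
    # One question block; the answer letter is appended for few-shot examples only.
    letters = ["A", "B", "C", "D"]
    lines = [question] + [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)

def build_prompt(subject, fewshot_examples, test_question, test_choices):
    # Header wording approximates the template used in the paper's evaluation code.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(q, c, a) for q, c, a in fewshot_examples)
    return header + shots + "\n\n" + format_example(test_question, test_choices)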
README
Measuring Massive Multitask Language Understanding
This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).
This repository contains OpenAI API evaluation code, and the test is available for download here.
Test Leaderboard
If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.
Results of the test (accuracy, %):
Model | Authors | Humanities | Social Sciences | STEM | Other | Average |
---|---|---|---|---|---|---|
Chinchilla (70B, few-shot) | Hoffmann et al., 2022 | 63.6 | 79.3 | 54.9 | 73.9 | 67.5 |
Gopher (280B, few-shot) | Rae et al., 2021 | 56.2 | 71.9 | 47.4 | 66.1 | 60.0 |
GPT-3 (175B, fine-tuned) | Brown et al., 2020 | 52.5 | 63.9 | 41.4 | 57.9 | 53.9 |
flan-T5-xl | Chung et al., 2022 | 46.3 | 57.7 | 39.0 | 55.1 | 49.3 |
UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
GPT-3 (175B, few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
GPT-3 (6.7B, fine-tuned) | Brown et al., 2020 | 42.1 | 49.2 | 35.1 | 46.9 | 43.2 |
flan-T5-large | Chung et al., 2022 | 39.1 | 49.1 | 33.2 | 47.4 | 41.9 |
flan-T5-base | Chung et al., 2022 | 34.0 | 38.1 | 27.6 | 37.0 | 34.2 |
GPT-2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
flan-T5-small | Chung et al., 2022 | 29.9 | 30.9 | 27.5 | 29.7 | 29.5 |
Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
Citation
If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}