Top Related Projects
- datasets (TFDS): TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
- AllenNLP: An open-source NLP research library, built on PyTorch.
- text: Models, data loaders and abstractions for language processing, powered by PyTorch
- spaCy: 💫 Industrial-strength Natural Language Processing (NLP) in Python
- fastText: Library for fast text representation and classification.
Quick Overview
CLUEDatasetSearch is a GitHub repository that catalogs datasets for natural language processing (NLP) tasks. Most entries are Chinese, with some English and multilingual datasets alongside them. It aims to support Chinese NLP research and development by giving researchers one place to find and compare datasets.
Pros
- Extensive collection of Chinese NLP datasets covering multiple tasks
- Well-organized structure with clear categorization of datasets
- Includes detailed information about each dataset, such as task type, size, and source
- Regularly updated with new datasets and improvements
Cons
- Focused mainly on Chinese datasets (with some English and multilingual entries), so it is of limited use to researchers working on other languages
- Some datasets may require additional processing or formatting for specific use cases
- Dependency on external sources for some datasets, which may lead to broken links or unavailable data
- Lack of standardized evaluation metrics across all datasets
Code Examples
This repository is primarily a collection of datasets and does not include code libraries. Therefore, code examples are not applicable in this case.
Getting Started
As this is not a code library, there are no specific code-based getting started instructions. However, to begin using the datasets:
- Visit the repository: https://github.com/CLUEbenchmark/CLUEDatasetSearch
- Browse the available datasets in the README file
- Click on the dataset of interest to access more detailed information
- Follow the provided links or instructions to download or access the specific dataset
- Refer to the dataset's documentation for usage guidelines and formatting information
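Because the catalog is a set of pipe-delimited Markdown tables in the README, the browsing steps above can also be scripted. The following is a minimal sketch, assuming you have already copied one table's text into a string; `parse_markdown_table` and `SAMPLE` are hypothetical names used for illustration and are not part of the repository.

```python
def parse_markdown_table(text):
    """Parse a pipe-delimited Markdown table into a list of dicts keyed by header."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        # Skip the separator row, which contains only dashes, colons, and spaces.
        if all(set(cell) <= set("-: ") for cell in cells):
            continue
        # Pad short rows so every header has a value.
        cells += [""] * (len(header) - len(cells))
        rows.append(dict(zip(header, cells)))
    return rows


# Illustrative sample built from two entries in the catalog; columns are abbreviated.
SAMPLE = """
ID | Title | Provider | Category |
---|---|---|---|
1 | LCQMC | HIT Shenzhen | Text matching |
2 | CoNLL-2003 | CNTS | NER |
"""

rows = parse_markdown_table(SAMPLE)
ner_titles = [row["Title"] for row in rows if row["Category"] == "NER"]
print(ner_titles)
```

The sketch does not handle escaped pipes inside cells, and several README rows have fewer cells than the header, so padded values may land in the wrong column for those rows.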
Competitor Comparisons
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Pros of datasets
- Extensive collection of datasets across various domains
- Well-integrated with TensorFlow ecosystem
- Robust documentation and community support
Cons of datasets
- Primarily focused on machine learning datasets
- May have a steeper learning curve for beginners
Code comparison
CLUEDatasetSearch:
# No importable package exists; the catalog is the README itself.
# Find a dataset by browsing the README tables and following the entry's link.
datasets:
import tensorflow_datasets as tfds
dataset = tfds.load('imdb_reviews')
train_dataset = dataset['train']
Key differences
- CLUEDatasetSearch focuses on Chinese language datasets, while datasets covers a broader range of languages and domains
- CLUEDatasetSearch provides a search interface for finding relevant datasets, whereas datasets offers direct access to pre-processed datasets
- datasets is more tightly integrated with TensorFlow, making it easier to use in TensorFlow-based projects
Use cases
CLUEDatasetSearch is ideal for:
- Researchers working on Chinese NLP tasks
- Those seeking a curated list of Chinese language datasets
datasets is better suited for:
- Machine learning practitioners using TensorFlow
- Projects requiring a wide variety of datasets across multiple domains
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Comprehensive NLP toolkit with a wide range of pre-built models and components
- Extensive documentation and tutorials for easy adoption
- Active community and regular updates
Cons of AllenNLP
- Steeper learning curve for beginners
- Primarily focused on English language tasks
- Larger codebase and dependencies
Code Comparison
AllenNLP:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="Did Uriah honestly think he could beat the game in under three hours?")
CLUEDatasetSearch:
# No importable package exists; the catalog is the README itself.
# Find a dataset by browsing the README tables and following the entry's link.
AllenNLP offers a more comprehensive toolkit for various NLP tasks, while CLUEDatasetSearch is a catalog of mostly Chinese datasets. AllenNLP provides pre-built models and predictors, whereas CLUEDatasetSearch only points to datasets. The AllenNLP example shows model prediction; CLUEDatasetSearch has no comparable API because dataset discovery happens in the README.
Models, data loaders and abstractions for language processing, powered by PyTorch
Pros of text
- Broader scope, covering various NLP tasks and datasets
- More extensive documentation and community support
- Integrated with PyTorch ecosystem for seamless deep learning workflows
Cons of text
- Less focused on Chinese language tasks and datasets
- May require more setup and configuration for specific use cases
- Potentially steeper learning curve for beginners
Code Comparison
CLUEDatasetSearch:
# No importable package exists; the catalog is the README itself.
# Find a dataset by browsing the README tables and following the entry's link.
text:
from torchtext.datasets import IMDB
train_dataset, test_dataset = IMDB(split=('train', 'test'))
for label, text in train_dataset:
    print(f"Label: {label}, Text: {text[:50]}...")
CLUEDatasetSearch has no code API, so finding a Chinese NLP dataset means reading the README tables, while the text example shows how to load and iterate through an English sentiment analysis dataset. text offers more flexibility for various NLP tasks, but CLUEDatasetSearch is more specialized for Chinese language datasets.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of functionalities
- Efficient and fast processing, suitable for large-scale applications
- Extensive documentation and active community support
Cons of spaCy
- Steeper learning curve for beginners
- Trained pipelines are most mature for English; coverage for other languages, including Chinese, is narrower
- Requires more system resources compared to lightweight alternatives
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
CLUEDatasetSearch:
# No importable package exists; the catalog is the README itself.
# Find a dataset by browsing the README tables and following the entry's link.
Summary
spaCy is a powerful NLP library with extensive features and performance optimizations, while CLUEDatasetSearch focuses on dataset discovery for Chinese language tasks. spaCy offers more comprehensive NLP capabilities but may be more complex for beginners, whereas CLUEDatasetSearch provides a simpler interface for finding relevant datasets in the CLUE benchmark collection.
Library for fast text representation and classification.
Pros of fastText
- Efficient and fast text classification and word representation learning
- Supports multiple languages and can handle large datasets
- Provides pre-trained models and embeddings for various languages
Cons of fastText
- Limited to shallow neural network architectures
- May not capture complex semantic relationships as well as more advanced models
- Requires careful preprocessing and hyperparameter tuning for optimal performance
Code Comparison
fastText:
import fasttext
model = fasttext.train_supervised("train.txt")
result = model.predict("example text")
CLUEDatasetSearch:
# No importable package exists; the catalog is the README itself.
# Find a dataset by browsing the README tables and following the entry's link.
While fastText focuses on text classification and word embeddings, CLUEDatasetSearch is primarily designed for searching and retrieving Chinese language datasets. fastText offers more general-purpose text processing capabilities, while CLUEDatasetSearch is specialized for dataset discovery within the CLUE (Chinese Language Understanding Evaluation) benchmark ecosystem.
README
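CLUEDatasetSearch
Chinese and English NLP datasets. Click to search.
You can contribute by uploading dataset information. Anyone who uploads information for five or more datasets that passes review is listed as a project contributor.
clueai toolkit: NLP development in three minutes with three lines of code (zero-shot learning)
- NER
- QA
- Sentiment Analysis
- Text Classification
- Text Matching
- Text Summarization
- Machine Translation
- Knowledge Graph
- Corpus
- Reading Comprehension
- Contributing and Participation
If a dataset has a problem, please open an issue.
All datasets were collected from the web and are organized here only for convenient access. If there is any infringement or similar issue, please contact us promptly and we will remove the data.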
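NER
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | CCKS2017 Chinese Electronic Medical Record Named Entity Recognition | May 2017 | Beijing Jimuyun Health Technology Co., Ltd. | Real electronic medical records from the provider's cloud hospital platform, 800 records in total (one visit per patient), de-identified | Electronic medical records | Named entity recognition | \ | Chinese | |
2 | CCKS2018 Chinese Electronic Medical Record Named Entity Recognition | 2018 | Yidu Cloud (Beijing) Technology Co., Ltd. | The CCKS2018 evaluation task on named entity recognition in electronic medical records provides 600 annotated record texts; five entity types must be recognized: anatomical site, independent symptom, symptom description, surgery, and drug | Electronic medical records | Named entity recognition | \ | Chinese | |
3 | Microsoft Research Asia (MSRA) Named Entity Recognition Dataset | \ | MSRA | Data from MSRA, annotated in BIO format, 46,365 entries in total | Msra | Named entity recognition | \ | Chinese | |
4 | 1998 People's Daily Corpus Entity Recognition Annotation Set | January 1998 | People's Daily | Data from the 1998 People's Daily, annotated in BIO format, 23,061 entries in total | 98 People's Daily | Named entity recognition | \ | Chinese | |
5 | Boson | \ | Boson Data | Data from Boson, annotated in BMEO format, 2,000 entries in total | Boson | Named entity recognition | \ | Chinese | |
6 | CLUE Fine-Grain NER | 2020 | CLUE | The CLUENER2020 dataset is built on THUCTC, the text classification dataset open-sourced by Tsinghua University: a subset was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS. It covers 10 label categories, with 10,748 training entries and 1,343 validation entries | Fine-grained; CLUE | Named entity recognition | \ | Chinese | |
7 | CoNLL-2003 | 2003 | CNTS - Language Technology Group | Data from the CoNLL-2003 shared task, annotated with four categories: PER, LOC, ORG, and MISC | CoNLL-2003 | Named entity recognition | Paper | English | |
8 | Weibo Named Entity Recognition | 2015 | https://github.com/hltcoe/golden-horse | EMNLP-2015 | Named entity recognition | ||||
9 | SIGHAN Bakeoff 2005 | 2005 | MSR/PKU | bakeoff-2005 | Named entity recognition |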
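Several NER entries above are annotated in BIO format, where `B-X` opens an entity of type `X`, `I-X` continues it, and `O` marks tokens outside any entity. As a rough illustration of how such tags turn into entity spans, here is a minimal sketch; `bio_to_spans` is a hypothetical helper, the sample sentence is invented, and the on-disk file layout of each dataset is not covered.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag with no matching open entity is treated as outside (an assumption).
    return spans


tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
spans = bio_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
print(entities)
```

For character-level Chinese corpora the same function applies; join the characters with an empty string instead of a space.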
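QA
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NewsQA | 2019/9/13 | Microsoft Research | The Maluuba NewsQA dataset is intended to help the research community build algorithms that can answer questions requiring human-level comprehension and reasoning skills. It contains more than 12,000 news articles and 120,000 answers; articles average 616 words and each question has 2 to 3 answers. | English | QA | Paper | ||
2 | SQuAD | Stanford | The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset made up of questions posed on a set of Wikipedia articles. The answer to each question is a span of text from the corresponding passage, or the question may be unanswerable. | English | QA | Paper | |||
3 | SimpleQuestions | Large-scale simple question answering based on memory networks; the dataset provides a multi-task question answering dataset with answers to 100K simple questions. | English | QA | Paper | ||||
4 | WikiQA | 2016/7/14 | Microsoft Research | To reflect the real information needs of general users, WikiQA uses Bing query logs as the question source. Each question is linked to a Wikipedia page that may contain the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, sentences from that section are used as candidate answers. With the help of crowdsourcing, the dataset includes 3,047 questions and 29,258 sentences, of which 1,473 sentences are labeled as answer sentences for their questions. | English | QA | Paper | ||
5 | cMedQA | 2019/2/25 | Zhang Sheng | Data from an online medical forum, containing 54,000 questions and about 100,000 corresponding answers. | Chinese | QA | Paper | ||
6 | cMedQA2 | 2019/1/9 | Zhang Sheng | An extended version of cMedQA, containing about 100,000 medical questions and about 200,000 corresponding answers. | Chinese | QA | Paper | ||
7 | webMedQA | 2019/3/10 | He Junqing | An online medical question answering dataset containing 60,000 questions and 310,000 answers, together with question categories. | Chinese | QA | Paper | ||
8 | XQA | 2019/7/29 | Tsinghua University | The paper builds a cross-lingual dataset for open-domain question answering. The dataset (training and test sets) covers nine languages and more than 90,000 question-answer pairs. | Multilingual | QA | Paper | ||
9 | AmazonQA | 2019/9/29 | Amazon | Addressing the pain point of questions being answered repeatedly on the Amazon platform, Carnegie Mellon University proposed a review-based QA task: using earlier questions and answers about a product, the QA system automatically summarizes an answer for the customer | English | QA | Paper |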
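Sentiment Analysis
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NLPCC2013 | 2013 | CCF | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 14,000 Weibo posts, 45,431 sentences | NLPCC2013, Emotion | Sentiment analysis | Paper | |
2 | NLPCC2014 Task1 | 2014 | CCF | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 20,000 Weibo posts | NLPCC2014, Emotion | Sentiment analysis | \ | |
3 | NLPCC2014 Task2 | 2014 | CCF | \ | Weibo corpus annotated as positive or negative | NLPCC2014, Sentiment | Sentiment analysis | \ | |
4 | Weibo Emotion Corpus | 2016 | The Hong Kong Polytechnic University | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: more than 40,000 Weibo posts | weibo emotion corpus | Sentiment analysis | Emotion Corpus Construction Based on Selection from Noisy Natural Labels | |
5 | [RenCECPs](Fuji Ren can be contacted (ren@is.tokushima-u.ac.jp) for a license agreement.) | 2009 | Fuji Ren | \ | Annotated blog corpus with emotion and sentiment labels at the document, paragraph, and sentence levels. Contains 1,500 blogs, 11,000 paragraphs, and 35,000 sentences. | RenCECPs, emotion, sentiment | Sentiment analysis | Construction of a blog emotion corpus for Chinese emotional expression analysis | |
6 | weibo_senti_100k | Unknown | Unknown | \ | Sina Weibo posts with sentiment labels, about 50,000 positive and 50,000 negative comments | weibo senti, sentiment | Sentiment analysis | \ | |
7 | BDCI2018 - Automotive Industry User Opinion Topic and Sentiment Recognition | 2018 | CCF | Comments about cars from automotive forums, annotated with topics: power, price, interior, configuration, safety, appearance, handling, fuel consumption, space, and comfort. Each topic carries a sentiment label in three classes, with the numbers 0, 1, and -1 denoting neutral, positive, and negative. | Aspect-based sentiment analysis; topic sentiment analysis | Sentiment analysis | \ | ||
8 | AI Challenger Fine-Grained User Review Sentiment Analysis | 2018 | Meituan | \ | Restaurant reviews with 6 first-level aspects and 20 second-level aspects; each aspect is labeled positive, negative, neutral, or not mentioned. | Aspect-based sentiment analysis | Sentiment analysis | \ | |
9 | BDCI2019 Financial Information Negativity and Entity Determination | 2019 | Zhongyuan Bank | \ | Financial news; each sample is labeled with an entity list and a negative-entity list. The task is to decide whether a sample is negative and which entities are the negative ones. | Entity-level sentiment analysis | Sentiment analysis | \ | |
10 | Zhijiang Cup E-commerce Review Opinion Mining Competition | 2019 | Zhejiang Lab | \ | The brand review opinion mining task is to extract product attribute features and consumer opinions from product reviews and to determine their sentiment polarity and attribute category. For a given attribute feature of a product there is a series of opinion words describing it, which represent consumers' views on that feature. Each {product attribute feature, consumer opinion} pair has a sentiment polarity (negative, neutral, positive) that represents how satisfied the consumer is with the attribute. In addition, several attribute features can fall under one attribute category; for example, features such as appearance and box both fall under the packaging category. Teams submit extracted predictions for the test data with 4 fields: attribute feature word, opinion word, opinion polarity, and attribute category. | Aspect-based sentiment analysis | Sentiment analysis | \ | |
11 | 2019 Sohu Campus Algorithm Competition | 2019 | Sohu | \ | Given a number of articles, the goal is to identify each article's core entities and the sentiment toward them. At most three core entities are identified per article, and the article's sentiment toward each is judged (positive, neutral, or negative). Entity: an entity word for a person, object, region, institution, group, company, industry, specific event, or similar fixed thing that can serve as the subject of an article. Core entity: an entity word that the article mainly describes or that plays the article's main role. | Entity-level sentiment analysis | Sentiment analysis | \ |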
Text Classification
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | [2018 "Daguan Cup" Text Intelligent Processing Challenge](https://www.pkbigdata.com/common/cmpt/ “达观杯”文本智能处理挑战赛_赛体与数据.html) | July 2018 | Daguan Data | Dataset from Daguan Data for a long-text classification task. It has four main fields: id, article, word_seg, and class. The data covers 19 categories with 102,275 samples in total | Long text; de-identified | Text classification | \ | Chinese | |
2 | Toutiao Chinese News (Text) Classification | May 2018 | Toutiao | Dataset from Toutiao for a short-text classification task, covering 15 categories with 382,688 samples in total | Short text; news | Text classification | \ | Chinese | |
3 | THUCNews Chinese Text Classification | 2016 | Tsinghua University | THUCNews was produced by filtering historical data from the Sina News RSS subscription channels for 2005 to 2011; all files are UTF-8 plain text. On top of the original Sina News category system, the authors reorganized the data into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment, with 740,000 news documents in total (2.19 GB) | Documents; news | Text classification | \ | Chinese | |
4 | Fudan University Chinese Text Classification | \ | NLP Group, International Database Center, Department of Computer Information and Technology, Fudan University | Dataset from Fudan University for a short-text classification task, covering 20 categories with 9,804 documents in total | Documents; news | Text classification | \ | Chinese | |
5 | News Headline Short-Text Classification | December 2019 | chenfengshf | CC0 public domain | Dataset from the Kesci platform for short-text classification of news headlines. Most items are short headlines (length<50); the data covers 15 categories with about 380,000 samples | Short text; news headlines | Text classification | \ | Chinese |
6 | 2017 Zhihu Kanshan Cup Machine Learning Challenge | June 2017 | Chinese Association for Artificial Intelligence; Zhihu | Dataset from Zhihu: annotated bindings between questions and topic tags. Each question has 1 or more tags, 1,999 tags overall, covering 3 million questions | Questions; short text | Text classification | \ | Chinese | |
7 | 2019 Zhijiang Cup - E-commerce Review Opinion Mining Competition | August 2019 | Zhejiang Lab | The brand review opinion mining task is to extract product attribute features and consumer opinions from product reviews and to determine their sentiment polarity and attribute category. For a given attribute feature of a product there is a series of opinion words describing it, which represent consumers' views on that feature. Each {product attribute feature, consumer opinion} pair has a sentiment polarity (negative, neutral, positive) that represents how satisfied the consumer is with the attribute | Reviews; short text | Text classification | \ | Chinese | |
8 | IFLYTEK Long Text Classification | \ | iFLYTEK | The dataset has more than 17,000 annotated long texts describing apps, covering application topics related to daily life, with 119 categories in total | Long text | Text classification | \ | Chinese | |
9 | Web-wide News Classification Data (SogouCA) | August 16, 2012 | Sogou | News data from several news sites for June to July 2012, across 18 channels including domestic, international, sports, society, and entertainment | News | Text classification | \ | Chinese | |
10 | Sohu News Data (SogouCS) | August 2012 | Sogou | News data from Sohu News for June to July 2012, across 18 channels including domestic, international, sports, society, and entertainment | News | Text classification | \ | Chinese | |
11 | Zhongkeda (中科大) News Classification Corpus | November 2017 | Liu Yu, Integrated Information Center, Institute of Automation, Chinese Academy of Sciences | Not downloadable for now; the author has been contacted and a reply is pending | News | ||||
12 | ChnSentiCorp_htl_all | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 7,000 hotel reviews: more than 5,000 positive and more than 2,000 negative | |||||
13 | waimai_10k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | User reviews collected from a food delivery platform: 4,000 positive and about 8,000 negative | |||||
14 | online_shopping_10_cats | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 10 categories, more than 60,000 reviews in total, about 30,000 positive and 30,000 negative, covering books, tablets, phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels | |||||
15 | weibo_senti_100k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 100,000 Sina Weibo posts with sentiment labels, about 50,000 positive and 50,000 negative | |||||
16 | simplifyweibo_4_moods | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | More than 360,000 Sina Weibo posts with sentiment labels covering 4 emotions: about 200,000 for joy, and about 50,000 each for anger, disgust, and low mood | |||||
17 | dmsc_v2 | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 28 movies, more than 700,000 users, more than 2 million ratings/reviews | |||||
18 | yf_dianping | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | |||||
19 | yf_amazon | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings |
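Text Matching
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | LCQMC | 2018/6/6 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen) | Creative Commons Attribution 4.0 International License | The dataset contains 260,068 Chinese question pairs from multiple domains. Pairs with the same query intent are labeled 1, otherwise 0. It is pre-split into a training set of 238,766 pairs, a validation set of 8,802 pairs, and a test set of 12,500 pairs | Large-scale question matching; intent matching | Short-text matching; question matching | Paper | |
2 | The BQ Corpus | 2018/9/4 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen); WeBank | The dataset has 120,000 sentence pairs from one year of bank customer-service logs. The pairs cover different intents, and the ratio of positive to negative samples is 1:1 | Bank service questions; intent matching | Short-text matching; question consistency detection | Paper | ||
3 | AFQMC Ant Financial Semantic Similarity | 2018/4/25 | Ant Financial | Provides 100,000 annotated pairs (released in batches; the release is complete) as training data, including synonymous and non-synonymous pairs | Financial questions | Short-text matching; question matching | |||
4 | The 3rd PPDai "Magic Mirror Cup" Competition | 2018/6/10 | PPDai Smart Finance Research Institute | The train.csv file has 3 columns: the label (label, indicating whether question 1 and question 2 mean the same thing; 1 means the same, 0 means different), the ID of question 1 (q1), and the ID of question 2 (q2). Every question ID in this file also appears in question.csv | Financial products | Short-text matching; question matching | |||
5 | CAIL2019 Similar Case Matching Competition | 2019/6 | Tsinghua University; China Judgments Online | Each data item is represented by a triple (A,B,C), where A, B, and C each correspond to one legal document. The similarity between documents A and B is always greater than the similarity between A and C, that is, sim(A,B)>sim(A,C) | Legal documents; similar cases | Long-text matching | |||
6 | CCKS 2018 WeBank Intelligent Customer Service Question Matching Competition | 2018/4/5 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen); WeBank | Bank service questions; intent matching | Short-text matching; question matching | ||||
7 | ChineseTextualInference | 2018/12/15 | Liu Huanyong, Institute of Software, Chinese Academy of Sciences | A Chinese textual inference project, including the translation and construction of a Chinese textual entailment dataset of 880,000 text pairs and a deep-learning model for judging textual entailment | Chinese NLI | Chinese textual inference; textual entailment | |||
8 | NLPCC-DBQA | 2016/2017/2018 | NLPCC | Given question-answer pairs with a label indicating whether the answer is one of the answers to the question; 1 means yes, 0 means no | DBQA | Question-answer matching | |||
9 | Relevance Model between "Technology Demand" and "Technology Achievement" Projects | 201/8/32 | CCF | Given technology demands and technology achievements in text form, together with relevance labels between them. Relevance is divided into four levels: strongly related, fairly strongly related, weakly related, and unrelated | Long text; demand-achievement matching | Long-text matching | |||
10 | CNSD / CLUE-CMNLI | 2019/12 | ZengJunjun | A Chinese natural language inference dataset, generated from the original English dataset by translation plus partial manual correction. It can ease, to some extent, the shortage of datasets for Chinese natural language inference and semantic similarity | Chinese NLI | Chinese natural language inference | Paper | ||
11 | cMedQA v1.0 | 2017/4/5 | Xunyao Xunyi website and the College of Information Systems and Management, National University of Defense Technology | The dataset comes from questions and answers on the Xunyi Xunyao website and has been anonymized. The training set has 50,000 questions and 94,134 answers, averaging 120 and 212 characters per question and answer. The validation set has 2,000 questions and 3,774 answers, averaging 117 and 212 characters. The test set has 2,000 questions and 3,835 answers, averaging 119 and 211 characters. In total there are 54,000 questions and 101,743 answers, averaging 119 and 212 characters | Medical question-answer matching | Question-answer matching | Paper | ||
12 | cMedQA2 | 2018/11/8 | Xunyao Xunyi website and the College of Information Systems and Management, National University of Defense Technology | The dataset comes from questions and answers on the Xunyi Xunyao website and has been anonymized. The training set has 100,000 questions and 188,490 answers, averaging 48 and 101 characters per question and answer. The validation set has 4,000 questions and 7,527 answers, averaging 49 and 101 characters. The test set has 4,000 questions and 7,552 answers, averaging 49 and 100 characters. In total there are 108,000 questions and 203,569 answers, averaging 49 and 101 characters | Medical question-answer matching | Question-answer matching | Paper | ||
13 | ChineseSTS | 2017/9/21 | Tang Shancheng, Bai Yunyue, Ma Fuyu. Xi'an University of Science and Technology | The dataset provides 12,747 pairs of similar Chinese sentences, each followed by the authors' similarity score. The corpus consists of short sentences. | Short-sentence similarity matching | Similarity matching | |||
14 | Medical Question Similarity Competition Dataset from the China Conference on Health Information Processing | 2018 | CHIP 2018 - The 4th China Conference on Health Information Processing (CHIP) | The main goal of this evaluation task is question intent matching on real Chinese patient health-consultation data. Given two sentences, decide whether their intents are the same or similar. All data comes from real patient questions on the internet and has been filtered and manually annotated for intent matching. The dataset is de-identified and questions are represented by numbers. The training set contains about 20,000 annotated items (de-identified, including punctuation), and the test set contains about 10,000 unlabeled items (de-identified, including punctuation). | Medical question similarity matching | Similarity matching | |||
15 | COS960: A Chinese Word Similarity Dataset of 960 Word Pairs | 2019/6/6 | Tsinghua University | The dataset contains 960 word pairs, each scored for similarity by 15 native speakers. The 960 pairs are divided by label into three groups: 480 noun pairs, 240 verb pairs, and 240 adjective pairs. | Similarity between words | Synonyms | Paper | ||
16 | OPPO Mobile Search Ranking query-title Semantic Matching Dataset. (https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw password 7p3n) | 2018/11/6 | OPPO | The dataset comes from the real-time search scenario of OPPO mobile search ranking optimization, in which results are returned in real time as the user keeps typing. The dataset simplifies that scenario and provides a query-title semantic matching problem, that is, CTR prediction. | Question-title matching; CTR prediction | Similarity matching | |||
17 | Web Search Result Evaluation (SogouE) | 2012 | Sogou | Sogou Labs data license agreement | The dataset contains search data with query terms, related URLs, and query categories. Record format: query]\trelated URL\tquery category. Each URL is guaranteed to exist in the corresponding web corpus. In the query category, "1" marks a navigational query and "2" marks an informational query | Automatic Search Engine Performance Evaluation with Click-through Data Analysis | Query type matching prediction |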
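The SogouE entry describes a tab-separated record of query, related URL, and query category, where category 1 is navigational and 2 is informational. A minimal sketch of reading one such line follows. `parse_sogoue_line` and `SogouERecord` are hypothetical names, the optional square brackets around the query are an assumption based on the partly garbled format string, and the file's character encoding is not handled here.

```python
from typing import NamedTuple

# Category codes as described in the catalog entry.
QUERY_TYPES = {"1": "navigational", "2": "informational"}


class SogouERecord(NamedTuple):
    query: str
    url: str
    query_type: str


def parse_sogoue_line(line):
    """Split one tab-separated SogouE line into (query, url, query_type)."""
    query, url, category = line.rstrip("\n").split("\t")
    # Assumption: the query may be wrapped in square brackets; drop them if present.
    query = query.strip("[]")
    return SogouERecord(query, url, QUERY_TYPES.get(category, "unknown"))


record = parse_sogoue_line("[example query]\thttp://example.com/\t1\n")
print(record)
```

Lines with a different number of fields raise `ValueError` at the unpacking step, which is a reasonable place to log and skip malformed records.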
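Text Summarization
ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | LCSTS | 2015/8/6 | Qingcai Chen | Dataset from Sina Weibo containing about two million real Chinese short texts. Each item has two fields: a summary written by the author and the body text. In addition, 10,666 items are manually annotated with the relevance between the short text and its summary, on a scale of 1 to 5 with relevance increasing. | Single-document summarization; short text; text relevance | Text summarization | Paper | ||
2 | Chinese Short-Text Summarization Dataset | 2018/6/20 | He Zhengfang | Data from Weibo posts published by mainstream media on Sina Weibo, 679,898 items in total. | Single-document summarization; short text | Text summarization | \ | ||
3 | Chinese Corpus for Abstractive Summarization in the Education and Training Industry | 2018/6/5 | Anonymous | The corpus collects historical articles from mainstream vertical media in the education and training industry, about 24,500 items. Each item has two fields: a summary written by the author and the body text. | Single-document summarization; education and training | Text summarization | \ | ||
4 | NLPCC2017 Task3 | 2017/11/8 | NLPCC2017 organizers | Dataset from the news domain, provided as task data by NLPCC2017; usable for single-document summarization. | Single-document summarization; news | Text summarization | \ | ||
5 | Shence Cup 2018 | 2018/10/11 | DC competition organizers | Data from news text, provided by the DC competition organizers. It simulates a business scenario and targets core-word extraction from news text, with the end goal of improving recommendation and user profiling. | Text keywords; news | Text summarization | \ | ||
6 | Byte Cup 2018 International Machine Learning Contest | 2018/12/4 | ByteDance | Data from ByteDance's product TopBuzz and from openly licensed articles. The training set contains information on about 1.3 million texts, the validation set 1,000 articles, and the test set 800 articles. Each test and validation item has several possible titles written by human editors as candidate answers. | Single-document summarization; video; news | Text summarization | \ | English | |
7 | NEWSROOM | 2018/6/1 | Grusky | Data obtained from search and social metadata from 1998 to 2017, using a variety of summarization strategies that combine extraction and abstraction. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. | Single-document summarization; social metadata; search | Text summarization | Paper | English | |
8 | [DUC/TAC](https://duc.nist.gov/ https://tac.nist.gov//) | 2014/9/9 | NIST | Full name: Document Understanding Conferences/Text Analysis Conference. The dataset comes from newswire and web text in the corpora used in the annual TAC KBP (TAC Knowledge Base Population) competition. | Single-document/multi-document summarization; news | Text summarization | \ | English | |
9 | CNN/Daily Mail | 2017/7/31 | Stanford | GNU v3 | About one million news items collected from the Cable News Network (CNN) and the Daily Mail, used as a machine reading comprehension corpus. | Multi-document summarization; long text; news | Text summarization | Paper | English |
10 | Amazon SNAP Review | 2013/3/1 | Stanford | Data from shopping reviews on the Amazon website. Data can be fetched per top-level category (such as food or movies) or all at once. | Multi-document summarization; shopping reviews | Text summarization | \ | English | |
11 | Gigaword | 2003/1/28 | David Graff, Christopher Cieri | The dataset includes about 9.5 million news articles, with article headlines used as summaries; it is a single-sentence summarization dataset. | Single-document summarization; news | Text summarization | English | ||
12 | RA-MDS | 2017/9/11 | Piji Li | Full name: Reader-Aware Multi-Document Summarization. The dataset comes from news articles that were collected, annotated, and reviewed by experts. It covers 45 topics; each topic has 10 news documents and 4 model summaries, each news document averages 27 sentences, and each sentence averages 25 words. | Multi-document summarization; news; manual annotation | Text summarization | Paper | English | |
13 | TIPSTER SUMMAC | 2003/5/21 | The MITRE Corporation and the University of Edinburgh | The data consists of 183 documents marked up from the Computation and Language (cmp-lg) collection; the documents are papers published at ACL conferences. | Multi-document summarization; long text | Text summarization | \ | English | |
14 | WikiHow | 2018/10/18 | Mahnaz Koupaee | Each item is an article made up of several paragraphs, and each paragraph begins with a sentence that summarizes it. Articles are formed by merging the paragraphs and summaries by merging the paragraph outlines. The final version of the dataset contains more than 200,000 long sequence pairs. | Multi-document summarization; long text | Text summarization | Paper | English | |
15 | Multi-News | 2019/12/4 | Alex Fabbri | Data consists of input articles from more than 1,500 different websites together with 56,216 professional summaries of those articles obtained from the website newser.com. | Multi-document summarization | Text summarization | Paper | English | |
16 | MED Summaries | 2018/8/17 | D.Potapov | The dataset is for evaluating dynamic video summarization. It contains annotations for 160 videos, 60 in the validation set and 100 in the test set; the test set has 10 event categories. | Single-document summarization; video annotation | Text summarization | Paper | English | |
17 | BIGPATENT | 2019/7/27 | Sharma | The dataset includes 1.3 million U.S. patent documents with human-written abstractive summaries. The summaries contain richer discourse structure and more recurring entities. | Single-document summarization; patents; written language | Text summarization | Paper | English | |
18 | [NYT]( https://catalog.ldc.upenn.edu/LDC2008T19) | 2008/10/17 | Evan Sandhaus | Full name: The New York Times. The dataset contains 150 business articles from The New York Times, crawled from all articles on the New York Times website from November 2009 to January 2010. | Single-document summarization; business articles | Text summarization | \ | English | |
19 | The AQUAINT Corpus of English News Text | 2002/9/26 | David Graff | The dataset consists of English news text from the Xinhua News Agency (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service, about 375 million words. The dataset is paid. | Single-document summarization; news | Text summarization | \ | Chinese and English | |
20 | Legal Case Reports Data Set | 2012/10/19 | Filippo Galgani | The dataset comes from Australian legal cases of the Federal Court of Australia (FCA) from 2006 to 2009, containing about 4,000 legal cases and their summaries. | Single-document summarization; legal cases | Text summarization | \ | English | |
21 | 17 Timelines | 2015/5/29 | G. B. Tran | The data is content extracted from news article web pages, covering news from four countries: Egypt, Libya, Yemen, and Syria. | Single-document summarization; news | Text summarization | Paper | Multilingual | |
22 | PTS Corpus | 2018/10/9 | Fei Sun | Full name: Product Title Summarization Corpus. The data consists of product title summaries for display in e-commerce apps on mobile devices | Single-document summarization; short text | Text summarization | Paper | ||
23 | Scientific Summarization DataSets | 2019/10/26 | Santosh Gupta | The dataset is drawn from the Semantic Scholar Corpus and ArXiv. The title/abstract pairs from the Semantic Scholar corpus, with all biomedical papers filtered out, contain 5.8 million items. The ArXiv data contains title/abstract pairs for every paper from 1991 through July 5, 2019. The dataset contains 10k finance items, 26k biology, 417k mathematics, 1.57 million physics, and 221k CS. | Single-document summarization; papers | Text summarization | \ | English | |
24 | Scientific Document Summarization Corpus and Annotations from the WING NUS group | 2019/3/19 | Jaidka | The dataset includes ACL computational linguistics and natural language processing research papers, along with their citing papers and three output summaries: the traditional author-written abstract, the community summary (a collection of citation sentences, "citances"), and a human summary written by trained annotators. The training set contains 40 articles and their citing papers. | Single-document summarization; papers | Text summarization | Paper | English |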
æºå¨ç¿»è¯
ID | æ é¢ | æ´æ°æ¥æ | æ°æ®éæä¾è | è®¸å¯ | 说æ | å ³é®å | ç±»å« | 论æå°å | å¤æ³¨ |
---|---|---|---|---|---|---|---|---|---|
1 | WMT2017 | 2017/2/1 | EMNLP 2017 Workshop on Machine Translation | æ°æ®ä¸»è¦æ¥æºäº Europarl corpusåUN corpus两个æºæï¼ é带2017å¹´ä»News Commentary corpus ä»»å¡ä¸éæ°æ½åçæç« ã è¿æ¯ç±EMNLPä¼è®®æä¾çç¿»è¯è¯æï¼ ä½ä¸ºå¾å¤è®ºæææ çbenchmarkæ¥æ£æµ | Benchmark, WMT2017 | ä¸è±ç¿»è¯ è¯æ | 论æ | ||
2 | WMT2018 | 2018/11/1 | EMNLP 2018 Workshop on Machine Translation | æ°æ®ä¸»è¦æ¥æºäº Europarl corpusåUN corpus两个æºæï¼ é带2018å¹´ä»News Commentary corpus ä»»å¡ä¸éæ°æ½åçæç« ã è¿æ¯ç±EMNLPä¼è®®æä¾çç¿»è¯è¯æï¼ ä½ä¸ºå¾å¤è®ºæææ çbenchmarkæ¥æ£æµ | Benchmark, WMT2018 | ä¸è±ç¿»è¯ è¯æ | 论æ | ||
3 | WMT2019 | 2019/1/31 | EMNLP 2019 Workshop on Machine Translation | æ°æ®ä¸»è¦æ¥æºäº Europarl corpusåUN corpus两个æºæ, 以åéå äº news-commentary corpus and the ParaCrawl corpusä¸æ¥å¾æ°æ® | Benchmark, WMT2019 | ä¸è±ç¿»è¯ è¯æ | 论æ | ||
4 | UM-Corpus:A Large English-Chinese Parallel Corpus | 2014/5/26 | Department of Computer and Information Science, University of Macau, Macau | ç±æ¾³é¨å¤§å¦åå¸ç ä¸è±æå¯¹ç §ç é«è´¨éç¿»è¯è¯æ | UM-Corpus;English; Chinese;large | ä¸è±ç¿»è¯ è¯æ | 论æ | ||
5 | [Ai challenger translation 2017](https://pan.baidu.com/s/1E5gD5QnZvNxT3ZLtxe_boA) (extraction code: stjf) | 2017/8/14 | AI technology competition jointly launched by Sinovation Ventures, Sogou, and Toutiao | Billed as the largest English-Chinese parallel dataset for spoken language. It provides more than 10 million English-Chinese sentence pairs. All pairs were checked manually, so the dataset is reliable in scale, relevance, and quality. Training set: 10,000,000 sentences; validation set (simultaneous interpretation): 934 sentences; validation set (text translation): 8,000 sentences | AI challenger 2017 | Chinese-English translation corpus | |||
6 | MultiUN | 2010 | Department of Linguistics and Philology Uppsala University, Uppsala/Sweden | Provided by the German Research Center for Artificial Intelligence. Besides this dataset, the site also offers parallel translation corpora for many other language pairs for download | MultiUN | Chinese-English translation corpus | MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010 | ||
7 | NIST 2002 Open Machine Translation (OpenMT) Evaluation | 2010/5/14 | NIST Multimodal Information Group | LDC User Agreement for Non-Members | The data consist of 70 news stories from the Xinhua news service and 30 news stories from the Zaobao news service, 100 in total. The stories selected from the two collections are between 212 and 707 Chinese characters long. The Xinhua portion totals 25,247 characters and the Zaobao portion 39,256 characters | NIST | Chinese-English translation corpus | Paper | The series includes data for multiple years; use of the data requires payment |
8 | The Multitarget TED Talks Task (MTTT) | 2018 | Kevin Duh, JHU | Parallel corpora in multiple languages based on TED talks, covering 20 languages in total, including Chinese and English | TED | Chinese-English translation corpus | The Multitarget TED Talks Task | ||
9 | ASPEC Chinese-Japanese | 2019 | Workshop on Asian Translation | Targets languages of the Asian region, for example translation between Chinese and Japanese and between Japanese and English. The translation corpus comes mainly from scientific and technical papers (paper abstracts, invention descriptions, patents, and so on) | Asian scientific patent Japanese | Chinese-Japanese translation corpus | http://lotus.kuee.kyoto-u.ac.jp/WAT/ | ||
10 | casia2015 | 2015 | research group in Institute of Automation , Chinese Academy of Sciences | About one million sentence pairs collected automatically from the web | casia CWMT 2015 | Chinese-English translation corpus | |||
11 | casict2011 | 2011 | research group in Institute of Computing Technology , Chinese Academy of Sciences | Two parts, each with about 1 million sentence pairs collected automatically from the web (2 million in total). Sentence-level alignment accuracy is about 90%. | casict CWMT 2011 | Chinese-English translation corpus | |||
12 | casict2015 | 2015 | research group in Institute of Computing Technology , Chinese Academy of Sciences | About 2 million sentence pairs, collected from the web (60%), movie subtitles (20%), and English/Chinese lexicons (20%). Sentence-level alignment accuracy is above 99%. | casict CWMT 2015 | Chinese-English translation corpus | |||
13 | datum2015 | 2015 | Datum Data Co., Ltd. | One million sentence pairs covering different genres, such as textbooks for language education, bilingual books, technical documents, bilingual news, government white papers, government documents, and bilingual resources from the web. Note that some parts of the Chinese side are word-segmented. | datum CWMT 2015 | Chinese-English translation corpus | |||
14 | datum2017 | 2017 | Datum Data Co., Ltd. | 20 files covering different genres, such as news, dialogue, legal documents, and novels. Each file has 50,000 sentences, for one million sentences in total. The Chinese text in the first 10 files (Book1-Book10) is word-segmented. | datum CWMT 2017 | Chinese-English translation corpus | |||
15 | neu2017 | 2017 | NLP lab of Northeastern University, China | 2 million sentence pairs collected automatically from the web, including news and technical documents. Sentence-level alignment accuracy is about 90%. | neu CWMT 2017 | Chinese-English translation corpus | |||
16 | Translation corpus (translation2019zh) | 2019 | Xu Liang (徐亮) | Can be used to train Chinese-English translation systems in either direction. Because it contains more than a million Chinese sentences, the Chinese side alone can be extracted and used as a general-purpose Chinese corpus for training word vectors or for pre-training. The English side can be used the same way. |
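
The translation2019zh entry suggests pulling out only the Chinese sentences of a parallel corpus to reuse them as a monolingual corpus. The following is a minimal sketch of that step, assuming the corpus is stored as JSON Lines with one sentence pair per line. The key names `english` and `chinese` are assumptions for illustration and should be checked against the actual download.

```python
import json
from typing import Iterable, Iterator


def chinese_side(lines: Iterable[str], zh_key: str = "chinese") -> Iterator[str]:
    """Yield the Chinese sentence of each JSON Lines record.

    `zh_key` is an assumed field name; blank lines and records without
    a Chinese sentence are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        sentence = record.get(zh_key, "").strip()
        if sentence:
            yield sentence


# Hypothetical sample records standing in for lines read from a file.
sample = [
    '{"english": "Hello, world.", "chinese": "你好,世界。"}',
    "",
    '{"english": "No translation here."}',
]
monolingual = list(chinese_side(sample))
```

Because the function is a generator, it can be fed an open file handle directly, so a multi-million-line file never has to be held in memory at once.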
Knowledge Graph
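ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper URL | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NLPIR Weibo follow-relationship corpus (1 million records) | 2017/12/2 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | About the NLPIR Weibo follow-relationship corpus: 1. The corpus was collected and extracted from public Sina Weibo and Tencent Weibo data by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology. To advance research on microblog computing, 10 million records are shared publicly through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress). The full collection is close to 1 billion records, with a large amount of redundant data already removed. 2. Before release, users' real names and URLs were masked as far as technically possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data deleted. We apologize for any inconvenience and ask for your understanding. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software or paper as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Fields: `person_id` is the ID of the person; `guanzhu_id` is the ID of the person being followed |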
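
The follow-relationship corpus is described by just two fields, `person_id` and `guanzhu_id`, so each record is effectively one directed edge. As a rough sketch, assuming the records have already been read into `(person_id, guanzhu_id)` tuples (the on-disk format is not stated here), both directions of the graph can be indexed like this:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_follow_graph(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Index follow edges in both directions.

    Each pair is (person_id, guanzhu_id): person_id follows guanzhu_id.
    Sets are used so that duplicate records collapse to one edge.
    """
    following: Dict[str, Set[str]] = defaultdict(set)
    followers: Dict[str, Set[str]] = defaultdict(set)
    for person_id, guanzhu_id in pairs:
        following[person_id].add(guanzhu_id)
        followers[guanzhu_id].add(person_id)
    return following, followers


# Hypothetical IDs; the last row repeats the first to show deduplication.
rows = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u1", "u2")]
following, followers = build_follow_graph(rows)
```

Keeping a separate `followers` index makes follower counts a constant-time lookup, at the cost of storing every edge twice.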
Corpora
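ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper URL | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NLPIR Weibo content corpus (230,000 records) | December 2017 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | About the NLPIR Weibo content corpus: 1. The corpus was collected and extracted from public Sina Weibo and Tencent Weibo data by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology. To advance research on microblog computing, 230,000 records are shared publicly through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress). The full collection is close to 10 million records, with a large amount of redundant data already removed. 2. Before release, users' real names and URLs were masked as far as technically possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data deleted. We apologize for any inconvenience and ask for your understanding. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software or paper as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Fields: `id` article number; `article` body text; `discuss` number of comments; `insertTime` time the post was inserted; `origin` source; `person_id` ID of the owning person; `time` time the post was published; `transmit` reposts | |||||
2 | 5 million Weibo posts | January 2018 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | [5 million Weibo posts] Dr. Zhang Huaping (@ICTCLAS), director of the search and mining lab at Beijing Institute of Technology, provides 5 million Weibo posts for public use. The file is an SQL file that can only be imported into a MySQL database; it includes the table-creation statements and 5 million records. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software or paper. [This data looks rougher than the corpus above and has not been processed.] | |||||
3 | NLPIR news corpus (24 million characters) | July 2017 | www.NLPIR.org | About the NLPIR news corpus: 1. The uncompressed data is 48 MB, about 24 million characters of news. 2. The news was collected between October 12, 2009 and December 14, 2009. 3. Each file is named after the news date and contains the body text of multiple news items (junk information has been removed). 4. Copyright in the news content belongs to the original authors or news organizations. 5. Copyright in the compiled corpus belongs to www.NLPIR.org. 6. It can serve as test data for news analysis, natural language processing, search, and similar applications. For a larger corpus, contact the NLPIR.org administrator. | |||||
4 | NLPIR Weibo follow-relationship corpus (1 million records) | December 2017 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | About the NLPIR Weibo follow-relationship corpus: 1. The corpus was collected and extracted from public Sina Weibo and Tencent Weibo data by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology. To advance research on microblog computing, 10 million records are shared publicly through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress). The full collection is close to 1 billion records, with a large amount of redundant data already removed. 2. Before release, users' real names and URLs were masked as far as technically possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data deleted. We apologize for any inconvenience and ask for your understanding. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software or paper as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Fields: `person_id` is the ID of the person; `guanzhu_id` is the ID of the person being followed | |||||
5 | NLPIR Weibo blogger corpus (1 million records) | September 2017 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | About the NLPIR Weibo blogger corpus: 1. The corpus was collected and extracted from public Sina Weibo and Tencent Weibo data by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology. To advance research on microblog computing, 1 million records are shared publicly through the Natural Language Processing and Information Retrieval sharing platform (127.0.0.1/wordpress). The full collection is close to 100 million records, with a large number of redundant records and bot followers already removed. 2. Before release, users' real names and URLs were masked as far as technically possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data deleted. We apologize for any inconvenience and ask for your understanding. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software or paper as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval sharing platform (http://www.nlpir.org/). 4. Fields: `id` internal ID; `sex` gender; `address` home address; `fansNum` number of followers; `summary` personal summary; `wbNum` number of posts; `gzNum` number of accounts followed; `blog` blog URL; `edu` education; `work` employment; `renZh` whether verified; `brithday` birthday | |||||
6 | NLPIR short-text corpus (400,000 characters) | August 2017 | Web Search Mining and Security Lab, Beijing Institute of Technology (SMS@BIT) | About the NLPIR short-text corpus: 1. The uncompressed data is 480,000 characters, about 8,704 short texts. 2. Copyright in the compiled corpus belongs to www.NLPIR.org. 3. It can serve as test data for short-text natural language processing, search, public-opinion analysis, and similar applications. | |||||
7 | Wikipedia corpus | \ | Wikipedia | Wikipedia periodically packages and publishes corpus dumps | |||||
8 | Classical Chinese poetry database | 2020 | Crawled by the GitHub repository owner, http://shici.store | ||||||
9 | Insurance industry corpus | 2017 | Questions and answers collected from the Insurance Library website. To the authors' knowledge, this is the first open QA corpus in the insurance domain. The questions were asked by real-world users, and the high-quality answers were written by professionals with deep domain knowledge, so this is a corpus of real value rather than a toy. In the paper mentioned above, the corpus is used for an answer-selection task. Other uses are also possible, for example self-directed learning such as reading comprehension of the answers or learning by observation, so that a system can eventually produce its own answers to unseen questions. The dataset has two parts, a "QA corpus" and a "QA-pair corpus". The QA corpus is translated from the original English data with no other processing. The QA-pair corpus is built on the QA corpus, with word segmentation, punctuation and stop-word removal, and added labels, so it can be plugged directly into machine learning tasks. If the data format or the segmentation quality is unsatisfactory, the QA corpus can be processed with other methods to obtain data suitable for model training. | ||||||
10 | Chinese character decomposition dictionary | July 1905 | This repository contains the character decomposition database that the Open Dictionary Network uses for lookups by radical and component, which helps users find characters that are hard to type. The database currently records decompositions for 17,803 distinct Chinese characters, in a Traditional version (chaizi-ft.txt) and a Simplified version (chaizi-jt.txt). Decomposition differs from a conventional stroke-order library: it aims to split each character into two or more constituent components wherever possible, rather than into the strokes used in handwriting. | ||||||
11 | News corpus | 2016 | Xu Liang (徐亮) | Can be used as a general-purpose Chinese corpus for training word vectors or for pre-training. It can also be used to train title-generation models or keyword-generation models (selecting the records whose keywords differ from the title), and news items can be categorized by their news channel. | |||||
12 | Encyclopedia QA, JSON version (baike2018qa) | 2018 | Xu Liang (徐亮) | Can be used as a general-purpose Chinese corpus for training word vectors or for pre-training, and for building encyclopedia-style question answering. The category information is particularly useful: it supports supervised training for better sentence-representation models, sentence-similarity tasks, and so on. | |||||
13 | Community QA, JSON version (webtext2019zh): a large-scale, high-quality dataset | 2019 | Xu Liang (徐亮) | 1) Build encyclopedia-style question answering: given a question, use a retrieval system to find a reply or generate one, or filter domain-specific data from the community QA collection by keyword. 2) Train a topic-prediction model: given a question (and/or its description), predict the topic it belongs to. 3) Train a community question answering (cQA) system: for the one-question-many-answers setting, take a question, find the most relevant stored question, and then pick the best answer based on the quality of the different replies and the relevance between question and answer. 4) Use it as a general-purpose Chinese corpus for pre-training large models or training word vectors. The category information is also useful for supervised training of better sentence-representation models, sentence-similarity tasks, and so on. 5) Combine the extra information in the like counts to predict how popular a reply will be or to train an answer-scoring system. | |||||
14 | Wikipedia, JSON version (wiki2019zh) | 2019 | Xu Liang (徐亮) | Can be used as a general-purpose Chinese corpus for pre-training or for building word vectors, and also for building knowledge-based question answering. [Unlike the raw dataset released by Wikipedia, this one has been processed.] |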
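
The cQA use case listed for webtext2019zh is a two-step procedure: find the stored question closest to the query, then choose the best reply among that question's answers. The sketch below illustrates the idea with deliberately simple stand-ins: character-bigram Jaccard overlap for relevance and the like count for answer quality. The field names `title`, `content`, and `star` are assumptions for illustration, and a real system would likely replace both scoring functions with learned models.

```python
from typing import Dict, List, Set


def bigrams(text: str) -> Set[str]:
    """Character bigrams, a crude tokenization that needs no word segmenter."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams; 0.0 if either side has none."""
    sa, sb = bigrams(a), bigrams(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def best_answer(query: str, records: List[Dict]) -> str:
    """Return the most-liked answer to the stored question closest to `query`.

    `records` must be non-empty; each record is one answer to one question.
    """
    # Step 1: most relevant stored question.
    closest = max(records, key=lambda r: jaccard(query, r["title"]))
    # Step 2: best reply among the answers to that question.
    candidates = [r for r in records if r["title"] == closest["title"]]
    return max(candidates, key=lambda r: r["star"])["content"]


# Hypothetical records: two answers to one question, one answer to another.
records = [
    {"title": "如何学习机器翻译", "content": "先读综述。", "star": 3},
    {"title": "如何学习机器翻译", "content": "从平行语料开始动手做。", "star": 12},
    {"title": "北京有什么好吃的", "content": "烤鸭。", "star": 40},
]
answer = best_answer("怎样学习机器翻译", records)
```

Separating relevance from answer quality keeps a highly liked but off-topic reply (the third record here) from winning on popularity alone.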
Reading Comprehension
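ID | Title | Updated | Dataset provider | License | Description | Keywords | Category | Paper URL | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | Baidu WebQA | 2016 | Baidu | \ | From Baidu Zhidao. Each question is paired with several articles of essentially the same meaning; the data is divided into manually annotated and browser-retrieved parts | Reading comprehension, real questions from Baidu Zhidao | Chinese reading comprehension | Paper | |
2 | DuReader 1.0 | 2018/3/1 | Baidu | Apache2.0 | The competition dataset comes from a real search-engine setting. The questions are real questions from Baidu Search users, and each question has 5 candidate document texts plus high-quality answers compiled by humans. | Reading comprehension, real questions from Baidu Search | Chinese reading comprehension | Paper | |
3 | SogouQA | 2018 | Sogou | \ | Data from the CIPS-SOGOU question answering competition, drawn from queries submitted by real users of the Sogou search engine; includes factoid and non-factoid data | Reading comprehension, real questions from the Sogou search engine | Chinese reading comprehension | \ | |
4 | Chinese judicial reading comprehension dataset CJRC | 2019/8/17 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | About 10,000 documents, mainly first-instance civil judgments and first-instance criminal judgments. Questions were annotated on the fact descriptions extracted from the judgment documents, yielding about 50,000 question-answer pairs | Reading comprehension, Chinese legal domain | Chinese reading comprehension | Paper | |
5 | 2019 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | October 2019 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | The task is sentence-level cloze-style reading comprehension. Given a narrative passage and several sentences extracted from it, participants build a model that places the candidate sentences back into the passage accurately so that it forms a complete article. | Sentence-level cloze-style reading comprehension | Chinese reading comprehension | \ | Official competition site: https://hfl-rc.github.io/cmrc2019/ |
6 | 2018 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | 2018/10/19 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | The CMRC 2018 dataset contains about 20,000 questions annotated by humans on Wikipedia text. A challenge set was also annotated, containing harder questions that require reasoning over multiple sentences to answer correctly | Reading comprehension, passage span extraction | Chinese reading comprehension | Paper | Official competition site: https://hfl-rc.github.io/cmrc2018/ |
7 | 2017 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | 2017/10/14 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | PD&CFT, the first Chinese cloze-style reading comprehension dataset | Cloze-style reading comprehension | Chinese reading comprehension | Paper | Official competition site |
8 | LES Cup (莱斯杯): 2nd National "Military Intelligence Machine Reading" Challenge | 2019/9/3 | CETC LES Information System Co., Ltd. (中电莱斯信息系统有限公司) | \ | A large-scale Chinese reading comprehension dataset for military application scenarios. The competition centers on multi-document machine reading comprehension and involves complex techniques such as understanding and reasoning. | Multi-document machine reading comprehension | Chinese reading comprehension | \ | Official competition site |
9 | ReCO | 2020 | Sogou | \ | Drawn from Sogou browser user input; includes multiple-choice and direct answers | Reading comprehension, Sogou Search | Chinese reading comprehension | Paper | \ |
10 | DuReader-checklist | 2021/3 | Baidu | Apache-2.0 | A fine-grained, multi-dimensional evaluation dataset that probes model weaknesses along several dimensions, including vocabulary understanding, phrase understanding, semantic-role understanding, and logical reasoning, with the aim of moving reading comprehension evaluation toward finer granularity | Fine-grained reading comprehension | Chinese reading comprehension | \ | Official competition site |
11 | DuReader-Robust | 2020/8 | Baidu | Apache-2.0 | Data for testing the robustness of reading comprehension along several dimensions: over-sensitivity, over-stability, and generalization | Baidu Search, robust reading comprehension | Chinese reading comprehension | Paper | Official competition site |
12 | DuReader-YesNo | 2020/8 | Baidu | Apache-2.0 | DuReader yesno is a dataset whose target task is opinion-polarity judgment. It compensates for shortcomings in the evaluation metrics of extractive datasets and so gives a better measure of how well a model understands opinion polarity. | Opinion-based reading comprehension | Chinese reading comprehension | \ | Official competition site |
13 | DuReader2.0 | 2021 | Baidu | Apache-2.0 | DuReader2.0 is a new large-scale Chinese reading comprehension dataset drawn from real user input in real scenarios | Reading comprehension | Chinese reading comprehension | Paper | Official competition site |
14 | CAIL2020 | 2020 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | Chinese judicial reading comprehension task. This year's upgraded version extends the document types from civil and criminal to civil, criminal, and administrative, and extends the question types from single-step prediction to multi-step reasoning, making the task harder. | Legal reading comprehension | Chinese reading comprehension | \ | Official competition site |
15 | CAIL2021 | 2021 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | The Chinese legal reading comprehension competition introduces a multi-span question type, in which some questions require extracting several spans from the article and combining them into the final answer. The aim is to broaden the scenarios in which Chinese machine reading comprehension applies. The competition retains the single-span, yes/no, and unanswerable question types. | Legal reading comprehension | Chinese reading comprehension | \ | Official competition site |
16 | CoQA | 2018/9 | Stanford University | CC BY-SA 4.0, Apache, and others | CoQA is a large dataset for building conversational question answering systems. The challenge measures how well machines understand a text and how well they answer interrelated questions that arise in a conversation | Conversational question answering | English reading comprehension | Paper | Official website |
17 | SQuAD2.0 | 2018/1/11 | Stanford University | \ | Widely regarded as a top-level benchmark in machine reading comprehension. It is a large-scale dataset of 100,000 questions built on more than 500 Wikipedia articles. The answer to each question is a short span of text from the given article; in addition, SQuAD 2.0 requires deciding whether the question can be answered from the given text at all | Question answering, includes unanswerable questions | English reading comprehension | Paper | |
18 | SQuAD1.0 | 2016 | Stanford University | \ | A reading comprehension dataset released by Stanford University in 2016. Given an article and a related question, the algorithm must produce the answer. All articles come from Wikipedia; there are 107,785 questions and 536 accompanying articles | Question answering, passage span extraction | English reading comprehension | Paper | |
19 | MCTest | 2013 | Microsoft | \ | 100,000 Bing questions with human-generated answers. Since then, a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and conversational search have been released in succession. | Question answering, search | English reading comprehension | Paper | |
20 | CNN/Dailymail | 2015 | DeepMind | Apache-2.0 | A large-scale English cloze-style machine comprehension dataset in which the answer is a word from the source text. The CNN dataset contains news articles from the Cable News Network and related questions: about 90k articles and 380k questions. The Dailymail dataset contains Daily Mail articles and related questions: about 197k articles and 879k questions. | Question-answer pairs, cloze-style reading comprehension | English reading comprehension | Paper | |
21 | RACE | 2017 | Carnegie Mellon University | / | English reading comprehension exam questions for Chinese middle and high school students. Each article comes with 5 four-option multiple-choice questions; the dataset includes 28,000+ passages and 100,000 questions. | Multiple choice | English reading comprehension | Paper | Download requires an email request |
22 | HEAD-QA | 2019 | aghie | MIT | A multiple-choice question answering dataset in the healthcare domain aimed at complex reasoning. Data is provided in English and Spanish | Medical domain, multiple choice | English reading comprehension, Spanish reading comprehension | Paper | |
23 | Consensus Attention-based Neural Networks for Chinese Reading Comprehension | 2018 | Joint Laboratory of HIT and iFLYTEK Research | / | Chinese cloze-style reading comprehension | Cloze-style reading comprehension | Chinese reading comprehension | Paper | |
24 | WikiQA | 2015 | Microsoft | / | The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering | Span-extraction reading comprehension | English reading comprehension | Paper | |
25 | Children's Book Test (CBT) | 2016 | / | Tests how well language models capture meaning in children's books. Unlike standard language-modeling benchmarks, it separates the task of predicting syntactic function words from the task of predicting lower-frequency words with richer semantic content | Cloze-style reading comprehension | English reading comprehension | Paper | ||
26 | NewsQA | 2017 | Maluuba Research | / | A challenging machine comprehension dataset with more than 100,000 human-generated question-answer pairs. Questions and answers are based on more than 10,000 CNN news articles, and each answer is a span of text from the corresponding article. | Span-extraction reading comprehension | English reading comprehension | Paper | |
27 | Frames dataset | 2017 | Microsoft | / | Introduces the Frames dataset of 1,369 human-human dialogues with an average of 15 turns per dialogue. It was developed to study the role of memory in goal-oriented dialogue systems. | Reading comprehension, dialogue | English reading comprehension | Paper | |
28 | Quasar | 2017 | Carnegie Mellon University | BSD-2-Clause | Presents two large-scale datasets. Quasar-S consists of 37,000 cloze-style queries constructed from the definitions of software entity tags on the popular website Stack Overflow; the site's posts and comments serve as the background corpus for answering the cloze questions. Quasar-T contains 43,000 open-domain trivia questions with answers obtained from various internet sources. | Span-extraction reading comprehension | English reading comprehension | Paper | |
29 | MS MARCO | 2018 | Microsoft | / | A large-scale English reading comprehension dataset that Microsoft built on the Bing search engine, with 100,000 questions and 200,000 distinct documents. All questions come from Bing search logs, so the real questions users typed into Bing simulate a real search-engine setting. It is considered one of the most practically valuable datasets in the field. | Multi-document | English reading comprehension | Paper | |
30 | Chinese cloze | 2016 | Cui Yiming (崔一鸣) | PD&CFT, the first Chinese cloze-style reading comprehension dataset; the full name is People Daily and Children's Fairy Tale. The data comes from the People's Daily and children's stories. | Cloze-style reading comprehension | Chinese cloze | Paper | ||
31 | NLPCC ICCPOL2016 | 2016.12.2 | NLPCC organizers | 14,659 questions written manually from sentences in documents, covering 14K Chinese passages. | Question-answer pair reading comprehension | Chinese reading comprehension | \ |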
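
Several of the datasets above (SQuAD 1.0, CMRC 2018, NewsQA) are span-extraction tasks, where the answer is a contiguous piece of the passage, and SQuAD 2.0 adds questions that cannot be answered from the passage at all. The sketch below shows one way to represent both cases and to sanity-check that a stored span really matches the passage. The `SpanExample` class and its field names are hypothetical and are not the on-disk format of any of these datasets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanExample:
    """One question over one passage (hypothetical, simplified layout)."""
    context: str
    question: str
    answer_text: Optional[str] = None    # None marks an unanswerable question
    answer_start: Optional[int] = None   # character offset into `context`

    def is_answerable(self) -> bool:
        return self.answer_text is not None

    def span_is_consistent(self) -> bool:
        """Check that the offset and the answer text agree with the passage."""
        if not self.is_answerable():
            # An unanswerable question must not carry an offset.
            return self.answer_start is None
        if self.answer_start is None:
            return False
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end] == self.answer_text


context = "SQuAD was released by Stanford University in 2016."
answerable = SpanExample(
    context=context,
    question="Who released SQuAD?",
    answer_text="Stanford University",
    answer_start=context.index("Stanford University"),
)
unanswerable = SpanExample(context=context, question="Who funded SQuAD?")
```

Running `span_is_consistent()` over a freshly loaded dataset is a cheap way to catch offset errors introduced by whitespace normalization or encoding changes before training a model on it.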
Contributing and Participating
Thanks to the following contributors (listed in no particular order):
郑少棉、李明磊、李露、叶琛、薛司悦、章锦川、李小昌、李俊毅
You can contribute by uploading dataset information. Anyone who uploads information for five or more datasets and passes review will be listed as a project contributor.
Share your dataset with the community or make a contribution today! Just send an email to chineseGLUE#163.com,
or join the QQ group: 836811304