Top Related Projects
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
An open-source NLP research library, built on PyTorch.
Models, data loaders and abstractions for language processing, powered by PyTorch
💫 Industrial-strength Natural Language Processing (NLP) in Python
Library for fast text representation and classification.
Quick Overview
CLUEDatasetSearch is a GitHub repository that provides a comprehensive collection of Chinese language datasets for various natural language processing (NLP) tasks. It aims to facilitate research and development in Chinese NLP by offering a centralized resource for accessing and exploring diverse datasets.
Pros
- Extensive collection of Chinese NLP datasets covering multiple tasks
- Well-organized structure with clear categorization of datasets
- Includes detailed information about each dataset, such as task type, size, and source
- Regularly updated with new datasets and improvements
Cons
- Limited to Chinese language datasets, which may not be useful for researchers working on other languages
- Some datasets may require additional processing or formatting for specific use cases
- Dependency on external sources for some datasets, which may lead to broken links or unavailable data
- Lack of standardized evaluation metrics across all datasets
Code Examples
This repository is primarily a collection of datasets and does not include code libraries. Therefore, code examples are not applicable in this case.
Getting Started
As this is not a code library, there are no specific code-based getting started instructions. However, to begin using the datasets:
- Visit the repository: https://github.com/CLUEbenchmark/CLUEDatasetSearch
- Browse the available datasets in the README file
- Click on the dataset of interest to access more detailed information
- Follow the provided links or instructions to download or access the specific dataset
- Refer to the dataset's documentation for usage guidelines and formatting information
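Because the catalog is a set of pipe-delimited tables in the README, the browsing steps above can also be done offline on a saved copy. The following is a minimal sketch under stated assumptions: `parse_table` and `search` are hypothetical helpers written for this example (the repository provides no such API), and `SAMPLE` is a two-row excerpt with simplified columns rather than the real table.

```python
# Hypothetical two-row excerpt with simplified columns (not the real README table).
SAMPLE = """
ID | Title | Provider | Category | Notes |
---|---|---|---|---|
3 | MSRA NER | MSRA | Named entity recognition | Chinese |
7 | CoNLL-2003 | CNTS - Language Technology Group | Named entity recognition | English |
"""


def parse_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header cell."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].rstrip("|").split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.rstrip("|").split("|")]
        # Skip the separator row made only of dashes and colons.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        # Pad short rows so every header has a value.
        cells += [""] * (len(header) - len(cells))
        rows.append(dict(zip(header, cells)))
    return rows


def search(rows, keyword):
    """Return rows where any cell contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in rows if any(needle in v.lower() for v in r.values())]


rows = parse_table(SAMPLE)
hits = search(rows, "english")
print([r["Title"] for r in hits])
```

Matching on every cell keeps the sketch short; filtering on a single column such as `Category` would be a small change to `search`.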
Competitor Comparisons
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
Pros of datasets
- Extensive collection of datasets across various domains
- Well-integrated with TensorFlow ecosystem
- Robust documentation and community support
Cons of datasets
- Primarily focused on machine learning datasets
- May have a steeper learning curve for beginners
Code comparison
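CLUEDatasetSearch (illustrative only: the repository is a dataset catalog and ships no Python package, so the module and class below are hypothetical):
from CLUEDatasetSearch import CLUEDatasetSearch  # hypothetical module, not installable
searcher = CLUEDatasetSearch()
results = searcher.search("sentiment analysis")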
datasets:
import tensorflow_datasets as tfds
dataset = tfds.load('imdb_reviews')
train_dataset = dataset['train']
Key differences
- CLUEDatasetSearch focuses on Chinese language datasets, while datasets covers a broader range of languages and domains
- CLUEDatasetSearch provides a search interface for finding relevant datasets, whereas datasets offers direct access to pre-processed datasets
- datasets is more tightly integrated with TensorFlow, making it easier to use in TensorFlow-based projects
Use cases
CLUEDatasetSearch is ideal for:
- Researchers working on Chinese NLP tasks
- Those seeking a curated list of Chinese language datasets
datasets is better suited for:
- Machine learning practitioners using TensorFlow
- Projects requiring a wide variety of datasets across multiple domains
An open-source NLP research library, built on PyTorch.
Pros of AllenNLP
- Comprehensive NLP toolkit with a wide range of pre-built models and components
- Extensive documentation and tutorials for easy adoption
- Active community and regular updates
Cons of AllenNLP
- Steeper learning curve for beginners
- Primarily focused on English language tasks
- Larger codebase and dependencies
Code Comparison
AllenNLP:
from allennlp.predictors import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz")
result = predictor.predict(sentence="Did Uriah honestly think he could beat the game in under three hours?")
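CLUEDatasetSearch (illustrative only: the repository is a dataset catalog and ships no Python package, so the module and class below are hypothetical):
from CLUEDatasetSearch import CLUEDatasetSearch  # hypothetical module, not installable
searcher = CLUEDatasetSearch()
results = searcher.search("text classification dataset")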
AllenNLP offers a more comprehensive toolkit for various NLP tasks, while CLUEDatasetSearch focuses specifically on Chinese language datasets. AllenNLP provides pre-built models and predictors, whereas CLUEDatasetSearch is primarily a search tool for datasets. The code examples demonstrate the different use cases: AllenNLP for model prediction and CLUEDatasetSearch for dataset discovery.
Models, data loaders and abstractions for language processing, powered by PyTorch
Pros of text
- Broader scope, covering various NLP tasks and datasets
- More extensive documentation and community support
- Integrated with PyTorch ecosystem for seamless deep learning workflows
Cons of text
- Less focused on Chinese language tasks and datasets
- May require more setup and configuration for specific use cases
- Potentially steeper learning curve for beginners
Code Comparison
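CLUEDatasetSearch (illustrative only: the repository is a dataset catalog and ships no Python package, so the module and class below are hypothetical):
from CLUEDatasetSearch import CLUEDatasetSearch  # hypothetical module, not installable
searcher = CLUEDatasetSearch()
results = searcher.search("情感分析")  # query string means "sentiment analysis"
print(results)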
text:
from torchtext.datasets import IMDB
train_dataset, test_dataset = IMDB(split=('train', 'test'))
for label, text in train_dataset:
    print(f"Label: {label}, Text: {text[:50]}...")
The CLUEDatasetSearch code demonstrates a simple search for Chinese NLP datasets, while the text example shows how to load and iterate through an English sentiment analysis dataset. text offers more flexibility for various NLP tasks, but CLUEDatasetSearch is more specialized for Chinese language datasets.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Pros of spaCy
- Comprehensive NLP library with a wide range of functionalities
- Efficient and fast processing, suitable for large-scale applications
- Extensive documentation and active community support
Cons of spaCy
- Steeper learning curve for beginners
- Trained pipelines are most mature for English; pipelines for other languages, including Chinese, exist but vary in coverage
- Requires more system resources compared to lightweight alternatives
Code Comparison
spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
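CLUEDatasetSearch (illustrative only: the repository is a dataset catalog and ships no Python package, so the module, class, and result attributes below are hypothetical):
from CLUEDatasetSearch import CLUEDatasetSearch  # hypothetical module, not installable
searcher = CLUEDatasetSearch()
results = searcher.search("sentiment analysis dataset")
for result in results:
    print(result.name, result.description)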
Summary
spaCy is a powerful NLP library with extensive features and performance optimizations, while CLUEDatasetSearch focuses on dataset discovery for Chinese language tasks. spaCy offers more comprehensive NLP capabilities but may be more complex for beginners, whereas CLUEDatasetSearch provides a simpler interface for finding relevant datasets in the CLUE benchmark collection.
Library for fast text representation and classification.
Pros of fastText
- Efficient and fast text classification and word representation learning
- Supports multiple languages and can handle large datasets
- Provides pre-trained models and embeddings for various languages
Cons of fastText
- Limited to shallow neural network architectures
- May not capture complex semantic relationships as well as more advanced models
- Requires careful preprocessing and hyperparameter tuning for optimal performance
Code Comparison
fastText:
import fasttext
model = fasttext.train_supervised("train.txt")
result = model.predict("example text")
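CLUEDatasetSearch (illustrative only: the repository is a dataset catalog and ships no Python package, so the module, class, and `top_k` parameter below are hypothetical):
from CLUEDatasetSearch import CLUEDatasetSearch  # hypothetical module, not installable
searcher = CLUEDatasetSearch()
results = searcher.search("query", top_k=5)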
While fastText focuses on text classification and word embeddings, CLUEDatasetSearch is primarily designed for searching and retrieving Chinese language datasets. fastText offers more general-purpose text processing capabilities, while CLUEDatasetSearch is specialized for dataset discovery within the CLUE (Chinese Language Understanding Evaluation) benchmark ecosystem.
README
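CLUEDatasetSearch
Chinese and English NLP datasets. Click to search.
You can contribute by uploading dataset information. Anyone who uploads five or more datasets that pass review is listed as a project contributor.
clueai toolkit: NLP development in three minutes with three lines of code (zero-shot learning)
- NER
- QA
- Sentiment analysis
- Text classification
- Text matching
- Text summarization
- Machine translation
- Knowledge graph
- Corpus
- Reading comprehension
- Contributing and participation
If you find a problem with a dataset, feel free to open an issue.
All datasets were collected from the internet and are organized here only for convenient access. If anything infringes your rights, please contact us promptly and we will remove it.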
NER
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | CCKS2017 Chinese electronic medical record named entity recognition | May 2017 | Beijing Jimuyun Health Technology Co., Ltd. | | Real electronic medical records from the provider's cloud hospital platform, 800 records in total (one visit record per patient), de-identified | Electronic medical records | Named entity recognition | \ | Chinese |
2 | CCKS2018 Chinese electronic medical record named entity recognition | 2018 | Yidu Cloud (Beijing) Technology Co., Ltd. | | The CCKS2018 evaluation task provides 600 annotated electronic medical record texts with five entity types to recognize: anatomical site, independent symptom, symptom description, surgery, and drug | Electronic medical records | Named entity recognition | \ | Chinese |
3 | Microsoft Research Asia (MSRA) named entity recognition dataset | \ | MSRA | | Data from MSRA, annotated in BIO format, 46,365 entries | Msra | Named entity recognition | \ | Chinese |
4 | 1998 People's Daily corpus entity recognition annotation set | January 1998 | People's Daily | | Data from the 1998 People's Daily, annotated in BIO format, 23,061 entries | 98 People's Daily | Named entity recognition | \ | Chinese |
5 | Boson | \ | BosonNLP | | Data from Boson, annotated in BMEO format, 2,000 entries | Boson | Named entity recognition | \ | Chinese |
6 | CLUE Fine-Grain NER | 2020 | CLUE | | The CLUENER2020 dataset adds fine-grained named entity annotations to a subset of THUCTC, the open-source text classification dataset from Tsinghua University; the original data comes from Sina News RSS. It has 10 label categories, 10,748 training entries, and 1,343 validation entries | Fine-grained; CLUE | Named entity recognition | \ | Chinese |
7 | CoNLL-2003 | 2003 | CNTS - Language Technology Group | | Data from the CoNLL-2003 shared task, annotated with four categories: PER, LOC, ORG, and MISC | CoNLL-2003 | Named entity recognition | Paper | English |
8 | Weibo entity recognition | 2015 | https://github.com/hltcoe/golden-horse | | | EMNLP-2015 | Named entity recognition | | |
9 | SIGHAN Bakeoff 2005 | 2005 | MSR/PKU | | | bakeoff-2005 | Named entity recognition | | |
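Several of the datasets above are annotated in BIO format, where `B-X` opens an entity of type `X`, `I-X` continues it, and `O` marks tokens outside any entity. The sketch below shows one way to turn a tag sequence into entity spans. It assumes tags are written as `B-TYPE` / `I-TYPE` / `O`; the exact tag strings in each dataset's files may differ, and the function name `bio_to_spans` is made up for this example.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) spans, end exclusive.

    An I- tag that does not continue an open entity of the same type is
    treated like O, so stray continuation tags are dropped.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The current entity continues only on a matching I- tag.
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    # Close an entity that runs to the end of the sequence.
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


tokens = ["Li", "Lei", "visited", "Beijing"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for start, end, etype in bio_to_spans(tags):
    print(etype, tokens[start:end])
```

Returning index ranges instead of joined strings keeps the helper independent of tokenization, which matters for Chinese corpora that are usually tagged per character.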
QA
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NewsQA | 2019/9/13 | Microsoft Research | | The Maluuba NewsQA dataset aims to help the research community build algorithms that can answer questions requiring human-level comprehension and reasoning. It contains more than 12,000 news articles and 120,000 answers; articles average 616 words and each question has 2 to 3 answers. | English | QA | Paper | |
2 | SQuAD | | Stanford | | The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of questions posed on a set of Wikipedia articles. The answer to each question is a span of text from the corresponding passage, or the question may be unanswerable. | English | QA | Paper | |
3 | SimpleQuestions | | | | Large-scale simple question answering with memory networks; provides a multi-task QA dataset with answers to 100K simple questions. | English | QA | Paper | |
4 | WikiQA | 2016/7/14 | Microsoft Research | | To reflect the real information needs of general users, WikiQA uses Bing query logs as the question source. Each question is linked to a Wikipedia page that may contain the answer. Because the summary section of a Wikipedia page gives the basic and usually most important information about the topic, sentences from that section are used as candidate answers. With crowdsourcing, the dataset includes 3,047 questions and 29,258 sentences, of which 1,473 are labeled as answer sentences for their question. | English | QA | Paper | |
5 | cMedQA | 2019/2/25 | Zhang Sheng | | Data from an online medical forum, with 54,000 questions and about 100,000 corresponding answers. | Chinese | QA | Paper | |
6 | cMedQA2 | 2019/1/9 | Zhang Sheng | | An extended version of cMedQA, with about 100,000 medical questions and about 200,000 corresponding answers. | Chinese | QA | Paper | |
7 | webMedQA | 2019/3/10 | He Junqing | | An online medical QA dataset with 60,000 questions and 310,000 answers, including question categories. | Chinese | QA | Paper | |
8 | XQA | 2019/7/29 | Tsinghua University | | The paper builds a cross-lingual open-domain question answering dataset; the dataset (training and test sets) covers nine languages and more than 90,000 question-answer pairs. | Multilingual | QA | Paper | |
9 | AmazonQA | 2019/9/29 | Amazon | | Addressing the pain point of questions being answered repeatedly on Amazon, Carnegie Mellon University proposed a review-based QA task: using earlier questions and answers about a product, the QA system automatically summarizes an answer for the customer | English | QA | Paper | |
Sentiment analysis
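ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | NLPCC2013 | 2013 | CCF | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 14,000 Weibo posts, 45,431 sentences | NLPCC2013, Emotion | Sentiment analysis | Paper | |
2 | NLPCC2014 Task1 | 2014 | CCF | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: 20,000 Weibo posts | NLPCC2014, Emotion | Sentiment analysis | \ | |
3 | NLPCC2014 Task2 | 2014 | CCF | \ | Weibo corpus annotated as positive or negative | NLPCC2014, Sentiment | Sentiment analysis | \ | |
4 | Weibo Emotion Corpus | 2016 | The Hong Kong Polytechnic University | \ | Weibo corpus annotated with 7 emotions: like, disgust, happiness, sadness, anger, surprise, fear. Size: more than 40,000 Weibo posts | weibo emotion corpus | Sentiment analysis | Emotion Corpus Construction Based on Selection from Noisy Natural Labels | |
5 | RenCECPs (Fuji Ren can be contacted at ren@is.tokushima-u.ac.jp for a license agreement) | 2009 | Fuji Ren | \ | Annotated blog corpus with emotion and sentiment labels at document, paragraph, and sentence level. Contains 1,500 blogs, 11,000 paragraphs, and 35,000 sentences. | RenCECPs, emotion, sentiment | Sentiment analysis | Construction of a blog emotion corpus for Chinese emotional expression analysis | |
6 | weibo_senti_100k | Unknown | Unknown | \ | Sina Weibo posts with sentiment labels, about 50,000 positive and 50,000 negative | weibo senti, sentiment | Sentiment analysis | \ | |
7 | BDCI2018 automotive industry user opinion topic and sentiment recognition | 2018 | CCF | | Reviews of cars from automotive forums, annotated with ten topics: power, price, interior, configuration, safety, exterior, handling, fuel consumption, space, and comfort. Each topic carries a sentiment label in three classes, with the numbers 0, 1, and -1 denoting neutral, positive, and negative. | Aspect-based sentiment analysis; topic sentiment analysis | Sentiment analysis | \ | |
8 | AI Challenger fine-grained user review sentiment analysis | 2018 | Meituan | \ | Restaurant reviews with 6 first-level aspects and 20 second-level aspects; each aspect is labeled positive, negative, neutral, or not mentioned. | Aspect-based sentiment analysis | Sentiment analysis | \ | |
9 | BDCI2019 financial information negativity and entity determination | 2019 | Zhongyuan Bank | \ | Financial news in which each sample is labeled with an entity list and a negative-entity list. The task is to decide whether a sample is negative and which entities are the negative ones. | Entity sentiment analysis | Sentiment analysis | \ | |
10 | Zhijiang Cup e-commerce review opinion mining competition | 2019 | Zhejiang Lab | \ | The task is to extract product attribute features and consumer opinions from product reviews and to determine their sentiment polarity and attribute category. For a given attribute feature there is a set of opinion words describing it, which represent consumers' views of that feature. Each {attribute feature, consumer opinion} pair has a sentiment polarity (negative, neutral, positive) that reflects how satisfied the consumer is with the attribute. Several attribute features can also belong to one attribute category; for example, appearance and box both fall under packaging. Teams submit extraction predictions for the test data with four fields: attribute feature word, opinion word, opinion polarity, and attribute category. | Aspect-based sentiment analysis | Sentiment analysis | \ | |
11 | 2019 Sohu Campus Algorithm Competition | 2019 | Sohu | \ | Given a set of articles, the goal is to identify each article's core entities and the sentiment toward them. At most three core entities are identified per article, and the article's sentiment toward each is classified as positive, neutral, or negative. Entity: a person, object, region, organization, group, company, industry, or specific event that exists in fixed form and can serve as the subject of an article. Core entity: an entity that the article mainly describes or that plays the main role in the article. | Entity sentiment analysis | Sentiment analysis | \ | |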
Text classification
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | [2018 "Daguan Cup" Text Intelligent Processing Challenge](https://www.pkbigdata.com/common/cmpt/ “达观杯”文本智能处理挑战赛_赛体与数据.html) | July 2018 | Datagrand | | Data from Datagrand for a long-text classification task. The main fields are id, article, word_seg, and class; 19 categories and 102,275 samples | Long text; de-identified | Text classification | \ | Chinese |
2 | Toutiao Chinese news (text) classification | May 2018 | Toutiao | | Data from Toutiao for a short-text classification task; 15 categories and 382,688 samples | Short text; news | Text classification | \ | Chinese |
3 | THUCNews Chinese text classification | 2016 | Tsinghua University | | THUCNews was built by filtering historical data from the Sina News RSS subscription channels for 2005 to 2011, all in UTF-8 plain text. Starting from the original Sina News taxonomy, the authors reorganized it into 14 candidate categories: finance, lottery, real estate, stocks, home furnishing, education, technology, society, fashion, current affairs, sports, horoscope, games, and entertainment. 740,000 news documents in total (2.19 GB) | Documents; news | Text classification | \ | Chinese |
4 | Fudan University Chinese text classification | \ | Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University | | Data from Fudan University for a short-text classification task; 20 categories and 9,804 documents | Documents; news | Text classification | \ | Chinese |
5 | News headline short-text classification | December 2019 | chenfengshf | CC0 public domain | Data from the Kesci platform for short-text classification of news headlines. Most items are short headlines (length<50); 15 categories and about 380,000 samples | Short text; news headlines | Text classification | \ | Chinese |
6 | 2017 Zhihu Kanshan Cup Machine Learning Challenge | June 2017 | Chinese Association for Artificial Intelligence; Zhihu | | Data from Zhihu: annotated bindings between questions and topic labels. Each question has one or more labels; 1,999 labels and 3 million questions in total | Questions; short text | Text classification | \ | Chinese |
7 | 2019 Zhijiang Cup e-commerce review opinion mining competition | August 2019 | Zhejiang Lab | | The task is to extract product attribute features and consumer opinions from product reviews and to determine their sentiment polarity and attribute category. For a given attribute feature there is a set of opinion words describing it, which represent consumers' views of that feature. Each {attribute feature, consumer opinion} pair has a sentiment polarity (negative, neutral, positive) that reflects how satisfied the consumer is with the attribute | Reviews; short text | Text classification | \ | Chinese |
8 | IFLYTEK' long-text classification | \ | iFLYTEK | | More than 17,000 annotated long texts describing apps, covering a range of everyday application topics in 119 categories | Long text | Text classification | \ | Chinese |
9 | Whole-web news classification data (SogouCA) | August 16, 2012 | Sogou | | News from several news sites for June to July 2012, across 18 channels including domestic, international, sports, society, and entertainment | News | Text classification | \ | Chinese |
10 | Sohu news data (SogouCS) | August 2012 | Sogou | | News from Sohu News for June to July 2012, across 18 channels including domestic, international, sports, society, and entertainment | News | Text classification | \ | Chinese |
11 | USTC news classification corpus | November 2017 | Liu Yu, Integrated Information Center, Institute of Automation, Chinese Academy of Sciences | | Temporarily unavailable for download; the author has been contacted and a reply is pending | News | | | |
12 | ChnSentiCorp_htl_all | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | More than 7,000 hotel reviews: more than 5,000 positive and more than 2,000 negative | | | | |
13 | waimai_10k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | User reviews collected from a food delivery platform: 4,000 positive and about 8,000 negative | | | | |
14 | online_shopping_10_cats | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | 10 categories and more than 60,000 reviews, about 30,000 positive and 30,000 negative, covering books, tablets, mobile phones, fruit, shampoo, water heaters, Mengniu, clothing, computers, and hotels | | | | |
15 | weibo_senti_100k | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | More than 100,000 Sina Weibo posts with sentiment labels, about 50,000 positive and 50,000 negative | | | | |
16 | simplifyweibo_4_moods | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | More than 360,000 Sina Weibo posts with sentiment labels covering 4 emotions: about 200,000 joy, and about 50,000 each of anger, disgust, and depression | | | | |
17 | dmsc_v2 | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | 28 movies, more than 700,000 users, more than 2 million ratings/reviews | | | | |
18 | yf_dianping | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | 240,000 restaurants, 540,000 users, 4.4 million reviews/ratings | | | | |
19 | yf_amazon | March 2018 | https://github.com/SophonPlus/ChineseNlpCorpus | | 520,000 products, more than 1,100 categories, 1.42 million users, 7.2 million reviews/ratings | | | | |
Text matching
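ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | LCQMC | 2018/6/6 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen) | Creative Commons Attribution 4.0 International License | 260,068 Chinese question pairs from multiple domains. Pairs with the same query intent are labeled 1, otherwise 0. Pre-split into a training set of 238,766 pairs, a validation set of 8,802 pairs, and a test set of 12,500 pairs | Large-scale question matching; intent matching | Short-text matching; question matching | Paper | |
2 | The BQ Corpus | 2018/9/4 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen); WeBank | | 120,000 sentence pairs from one year of a bank's customer service logs. The pairs cover different intents, with a 1:1 ratio of positive to negative samples | Banking service questions; intent matching | Short-text matching; question consistency detection | Paper | |
3 | AFQMC Ant Financial semantic similarity | 2018/4/25 | Ant Financial | | 100,000 annotated pairs (released in batches, now complete) as training data, including synonymous and non-synonymous pairs | Financial questions | Short-text matching; question matching | | |
4 | The 3rd PPDai "Magic Mirror Cup" competition | 2018/6/10 | PPDai Smart Finance Research Institute | | The train.csv file has 3 columns: the label (label, whether question 1 and question 2 mean the same thing; 1 means the same, 0 means different), the ID of question 1 (q1), and the ID of question 2 (q2). Every question ID in this file also appears in question.csv | Financial products | Short-text matching; question matching | | |
5 | CAIL2019 similar case matching competition | 2019/6 | Tsinghua University; China Judgments Online | | Each data item is a triple (A,B,C), where A, B, and C each correspond to one legal document. The similarity between A and B is always greater than the similarity between A and C, that is, sim(A,B)>sim(A,C) | Legal documents; similar cases | Long-text matching | | |
6 | CCKS 2018 WeBank intelligent customer service question matching competition | 2018/4/5 | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen); WeBank | | | Banking service questions; intent matching | Short-text matching; question matching | | |
7 | ChineseTextualInference | 2018/12/15 | Liu Huanyong, Institute of Software, Chinese Academy of Sciences | | A Chinese textual inference project that includes translating and building a Chinese textual entailment dataset of 880,000 text pairs, and building a deep-learning entailment classification model | Chinese NLI | Chinese textual inference; textual entailment | | |
8 | NLPCC-DBQA | 2016/2017/2018 | NLPCC | | Given question-answer pairs, with a label for whether the answer is one of the answers to the question: 1 means yes, 0 means no | DBQA | Question-answer matching | | |
9 | Relevance model between "technical requirement" and "technical achievement" projects | 201/8/32 | CCF | | Given technical requirements and technical achievements in text form, with relevance labels between them. Relevance has four levels: strongly related, fairly strongly related, weakly related, and unrelated | Long text; requirement-achievement matching | Long-text matching | | |
10 | CNSD / CLUE-CMNLI | 2019/12 | ZengJunjun | | A Chinese natural language inference dataset generated from the original English datasets by translation plus partial manual correction. It can partly ease the shortage of Chinese datasets for natural language inference and semantic similarity | Chinese NLI | Chinese natural language inference | Paper | |
11 | cMedQA v1.0 | 2017/4/5 | xywy.com and the College of Information Systems and Management, National University of Defense Technology | | Questions and answers from the xywy.com website, anonymized. Training set: 50,000 questions and 94,134 answers, averaging 120 and 212 characters. Validation set: 2,000 questions and 3,774 answers, averaging 117 and 212 characters. Test set: 2,000 questions and 3,835 answers, averaging 119 and 211 characters. Total: 54,000 questions and 101,743 answers, averaging 119 and 212 characters | Medical question-answer matching | Question-answer matching | Paper | |
12 | cMedQA2 | 2018/11/8 | xywy.com and the College of Information Systems and Management, National University of Defense Technology | | Questions and answers from the xywy.com website, anonymized. Training set: 100,000 questions and 188,490 answers, averaging 48 and 101 characters. Validation set: 4,000 questions and 7,527 answers, averaging 49 and 101 characters. Test set: 4,000 questions and 7,552 answers, averaging 49 and 100 characters. Total: 108,000 questions and 203,569 answers, averaging 49 and 101 characters | Medical question-answer matching | Question-answer matching | Paper | |
13 | ChineseSTS | 2017/9/21 | Tang Shancheng, Bai Yunyue, Ma Fuyu. Xi'an University of Science and Technology | | 12,747 pairs of similar Chinese sentences, each followed by the authors' similarity score; the corpus consists of short sentences. | Short-sentence similarity matching | Similarity matching | | |
14 | Medical question similarity competition dataset from the China Health Information Processing Conference | 2018 | CHIP 2018, the 4th China Health Information Processing Conference (CHIP) | | The main goal of this evaluation task is question intent matching on real Chinese patient health consultation text. Given two sentences, decide whether their intents are the same or similar. All text comes from real patient questions on the internet and has been filtered and manually labeled for intent matching. The dataset is de-identified and questions are represented by numbers. The training set has about 20,000 labeled items (de-identified, including punctuation) and the test set has about 10,000 unlabeled items (de-identified, including punctuation). | Medical question similarity matching | Similarity matching | | |
15 | COS960: A Chinese Word Similarity Dataset of 960 Word Pairs | 2019/6/6 | Tsinghua University | | 960 word pairs, each scored for similarity by 15 native speakers. The pairs are split by label into three groups: 480 noun pairs, 240 verb pairs, and 240 adjective pairs. | Word similarity | Synonyms | Paper | |
16 | OPPO mobile search ranking query-title semantic matching dataset (https://pan.baidu.com/s/1Hg2Hubsn3GEuu4gubbHCzw password 7p3n) | 2018/11/6 | OPPO | | Data from OPPO's mobile search ranking optimization for real-time search, where results are returned while the user is still typing. The dataset simplifies this scenario into a query-title semantic matching problem, that is, CTR prediction. | Question-title matching; CTR prediction | Similarity matching | | |
17 | Web search result evaluation (SogouE) | 2012 | Sogou | Sogou Lab data license agreement | Search data containing query terms, related URLs, and query categories. Data format: [query]\trelated URL\tquery category. Each URL is guaranteed to exist in the corresponding web corpus. In the query category, "1" means a navigational query and "2" means an informational query | Automatic Search Engine Performance Evaluation with Click-through Data Analysis | Query type matching prediction | | |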
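The SogouE row describes a tab-separated record layout (query, related URL, query category, with "1" for navigational and "2" for informational queries). The sketch below parses one such line. It is an illustration only: the sample line is made up, the function name `parse_sogoue_line` is hypothetical, and the assumption that the query may be wrapped in square brackets is inferred from the format description rather than checked against the actual files.

```python
# Category codes as described in the table row above.
QUERY_TYPES = {"1": "navigational", "2": "informational"}


def parse_sogoue_line(line):
    """Split one tab-separated record into query, URL, and query type."""
    query, url, category = line.rstrip("\n").split("\t")
    return {
        # Assumption: the query may be wrapped in [ ]; strip them if present.
        "query": query.strip("[]"),
        "url": url,
        "type": QUERY_TYPES.get(category, "unknown"),
    }


# Made-up sample record, not taken from the dataset.
record = parse_sogoue_line("[weather]\thttp://example.com/\t2\n")
print(record["query"], record["type"])
```

Unknown category codes map to `"unknown"` instead of raising, so a malformed category does not stop a batch run; a line with the wrong number of fields still raises `ValueError`.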
Text summarization
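ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | LCSTS | 2015/8/6 | Qingcai Chen | | Data from Sina Weibo with about two million real Chinese short texts; each item has an author-written summary and a body. Another 10,666 items are manually scored for relevance between the short text and the summary on a scale of 1 to 5, with higher scores meaning more relevant. | Single-document summarization; short text; text relevance | Text summarization | Paper | |
2 | Chinese short-text summarization dataset | 2018/6/20 | He Zhengfang | | Weibo posts published by mainstream media on Sina Weibo, 679,898 items in total. | Single-document summarization; short text | Text summarization | \ | |
3 | Chinese corpus for abstractive summarization in the education and training industry | 2018/6/5 | Anonymous | | Historical articles from mainstream vertical media in the education and training industry, about 24,500 items; each item has an author-written summary and a body. | Single-document summarization; education and training | Text summarization | \ | |
4 | NLPCC2017 Task3 | 2017/11/8 | NLPCC2017 organizers | | News-domain task data provided by NLPCC2017, usable for single-document summarization. | Single-document summarization; news | Text summarization | \ | |
5 | Shence Cup 2018 | 2018/10/11 | DC competition organizers | | News text provided by the DC competition organizers, simulating a business scenario. The aim is to extract core keywords from news text in order to improve recommendation and user profiling. | Text keywords; news | Text summarization | \ | |
6 | Byte Cup 2018 International Machine Learning Contest | 2018/12/4 | ByteDance | | Articles from ByteDance's TopBuzz product and openly licensed sources. The training set has about 1.3 million texts, the validation set 1,000 articles, and the test set 800 articles. Each test and validation item has several possible titles written by human editors as candidate answers. | Single-document summarization; video; news | Text summarization | \ | English |
7 | NEWSROOM | 2018/6/1 | Grusky | | Data obtained from search and social metadata from 1998 to 2017, using a mix of extractive and abstractive summarization strategies. Contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. | Single-document summarization; social metadata; search | Text summarization | Paper | English |
8 | [DUC/TAC](https://duc.nist.gov/ https://tac.nist.gov//) | 2014/9/9 | NIST | | Full name Document Understanding Conferences/Text Analysis Conference. The data consists of newswire and web text from the corpora used in the annual TAC KBP (TAC Knowledge Base Population) competition. | Single-document/multi-document summarization; news | Text summarization | \ | English |
9 | CNN/Daily Mail | 2017/7/31 | Stanford | GNU v3 | About one million news items collected from CNN and the Daily Mail as a machine reading comprehension corpus. | Multi-document summarization; long text; news | Text summarization | Paper | English |
10 | Amazon SNAP Review | 2013/3/1 | Stanford | | Shopping reviews from the Amazon website. Data can be fetched per top-level category (such as food or movies) or all at once. | Multi-document summarization; shopping reviews | Text summarization | \ | English |
11 | Gigaword | 2003/1/28 | David Graff, Christopher Cieri | | About 9.5 million news articles with the article headline used as the summary; a single-sentence summarization dataset. | Single-document summarization; news | Text summarization | | English |
12 | RA-MDS | 2017/9/11 | Piji Li | | Full name Reader-Aware Multi-Document Summarization. The data comes from news articles collected, annotated, and reviewed by experts. It covers 45 topics, each with 10 news documents and 4 model summaries; each document averages 27 sentences and each sentence averages 25 words. | Multi-document summarization; news; manual annotation | Text summarization | Paper | English |
13 | TIPSTER SUMMAC | 2003/5/21 | The MITRE Corporation and the University of Edinburgh | | 183 documents marked up from the Computation and Language (cmp-lg) collection, taken from papers published at ACL conferences. | Multi-document summarization; long text | Text summarization | \ | English |
14 | WikiHow | 2018/10/18 | Mahnaz Koupaee | | Each item is an article made of several paragraphs, and each paragraph begins with a sentence that summarizes it. Merging the paragraphs forms the article and merging the paragraph outlines forms the summary. The final version contains more than 200,000 long-sequence pairs. | Multi-document summarization; long text | Text summarization | Paper | English |
15 | Multi-News | 2019/12/4 | Alex Fabbri | | Input articles from more than 1,500 different websites, together with 56,216 professional summaries of these articles obtained from newser.com. | Multi-document summarization | Text summarization | Paper | English |
16 | MED Summaries | 2018/8/17 | D.Potapov | | A dataset for evaluating dynamic video summarization, with annotations for 160 videos: 60 in the validation set and 100 in the test set, with 10 event categories in the test set. | Single-document summarization; video annotation | Text summarization | Paper | English |
17 | BIGPATENT | 2019/7/27 | Sharma | | 1.3 million U.S. patent records with human-written abstractive summaries. The summaries have richer discourse structure and more recurring entities. | Single-document summarization; patents; written language | Text summarization | Paper | English |
18 | [NYT]( https://catalog.ldc.upenn.edu/LDC2008T19) | 2008/10/17 | Evan Sandhaus | | Full name The New York Times. Contains 150 business articles from the New York Times, crawled from all articles on the New York Times website from November 2009 to January 2010. | Single-document summarization; business articles | Text summarization | \ | English |
19 | The AQUAINT Corpus of English News Text | 2002/9/26 | David Graff | | English news text from the Xinhua News Agency (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service, about 375 million words. The dataset is paid. | Single-document summarization; news | Text summarization | \ | Chinese and English |
20 | Legal Case Reports Data Set | 2012/10/19 | Filippo Galgani | | Australian legal cases from the Federal Court of Australia (FCA) for 2006-2009, with about 4,000 legal cases and their summaries. | Single-document summarization; legal cases | Text summarization | \ | English |
21 | 17 Timelines | 2015/5/29 | G. B. Tran | | Content extracted from news article web pages, covering news from four countries: Egypt, Libya, Yemen, and Syria. | Single-document summarization; news | Text summarization | Paper | Multilingual |
22 | PTS Corpus | 2018/10/9 | Fei Sun | | Full name Product Title Summarization Corpus. The data is product title summaries for display in e-commerce apps on mobile devices | Single-document summarization; short text | Text summarization | Paper | |
23 | Scientific Summarization DataSets | 2019/10/26 | Santosh Gupta | | Data taken from the Semantic Scholar Corpus and ArXiv. The title/abstract pairs from the Semantic Scholar corpus, with all biomedical papers filtered out, total 5.8 million items. The ArXiv data contains the title/abstract pair of every paper from 1991 to July 5, 2019. The dataset includes 10k finance items, 26k biology, 417k mathematics, 1.57 million physics, and 221k CS. | Single-document summarization; papers | Text summarization | \ | English |
24 | Scientific Document Summarization Corpus and Annotations from the WING NUS group | 2019/3/19 | Jaidka | | ACL computational linguistics and natural language processing research papers, with their citing papers and three output summaries: the traditional author-written abstract, a community summary (a collection of citation sentences, or "citances"), and a human summary written by trained annotators. The training set contains 40 articles and their citing papers. | Single-document summarization; papers | Text summarization | Paper | English |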
Machine translation
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
1 | WMT2017 | 2017/2/1 | EMNLP 2017 Workshop on Machine Translation | | Data mainly from the Europarl corpus and the UN corpus, plus articles re-extracted in 2017 from the News Commentary corpus task. This translation corpus is provided by the EMNLP conference and serves as a benchmark for the results of many papers | Benchmark, WMT2017 | Chinese-English translation corpus | Paper | |
2 | WMT2018 | 2018/11/1 | EMNLP 2018 Workshop on Machine Translation | | Data mainly from the Europarl corpus and the UN corpus, plus articles re-extracted in 2018 from the News Commentary corpus task. This translation corpus is provided by the EMNLP conference and serves as a benchmark for the results of many papers | Benchmark, WMT2018 | Chinese-English translation corpus | Paper | |
3 | WMT2019 | 2019/1/31 | EMNLP 2019 Workshop on Machine Translation | | Data mainly from the Europarl corpus and the UN corpus, with additional data from the news-commentary corpus and the ParaCrawl corpus | Benchmark, WMT2019 | Chinese-English translation corpus | Paper | |
4 | UM-Corpus:A Large English-Chinese Parallel Corpus | 2014/5/26 | Department of Computer and Information Science, University of Macau, Macau | | A high-quality Chinese-English parallel translation corpus released by the University of Macau | UM-Corpus;English; Chinese;large | Chinese-English translation corpus | Paper | |
5 | [Ai challenger translation 2017](https://pan.baidu.com/s/1E5gD5QnZvNxT3ZLtxe_boA extraction code: stjf) | 2017/8/14 | AI technology competition jointly launched by Sinovation Ventures, Sogou, and Toutiao | | The largest English-Chinese bilingual parallel dataset for spoken language. It provides more than 10 million English-Chinese sentence pairs. All sentence pairs were checked manually, so the dataset is sound in scale, relevance, and quality. Training set: 10,000,000 sentences. Validation set (simultaneous interpretation): 934 sentences. Validation set (text translation): 8,000 sentences | AI challenger 2017 | Chinese-English translation corpus | | |
6 | MultiUN | 2010 | Department of Linguistics and Philology Uppsala University, Uppsala/Sweden | | Provided by the German Research Center for Artificial Intelligence. Besides this dataset, the site offers parallel translation corpora between many other languages for download | MultiUN | Chinese-English translation corpus | MultiUN: A Multilingual corpus from United Nation Documents, Andreas Eisele and Yu Chen, LREC 2010 | |
7 | NIST 2002 Open Machine Translation (OpenMT) Evaluation | 2010/5/14 | NIST Multimodal Information Group | LDC User Agreement for Non-Members | 70 news stories from the Xinhua news service and 30 news stories from the Zaobao news service, 100 in total. The stories selected from the two collections are all between 212 and 707 Chinese characters long. The Xinhua part has 25,247 characters and the Zaobao part has 39,256 characters | NIST | Chinese-English translation corpus | Paper | The series has data for multiple years; use of the data requires payment |
8 | The Multitarget TED Talks Task (MTTT) | 2018 | Kevin Duh, JHU | | Parallel corpora in multiple languages based on TED talks, covering 20 languages in total including Chinese and English | TED | Chinese-English translation corpus | The Multitarget TED Talks Task | |
9 | ASPEC Chinese-Japanese | 2019 | Workshop on Asian Translation | | Focuses on languages of the Asian region, with translation tasks such as Chinese-Japanese and Japanese-English. The translation text comes mainly from scientific literature (paper abstracts, invention descriptions, patents, and so on) | Asian scientific patent Japanese | Chinese-Japanese translation corpus | http://lotus.kuee.kyoto-u.ac.jp/WAT/ | |
10 | casia2015 | 2015 | research group in Institute of Automation , Chinese Academy of Sciences | | About one million sentence pairs collected automatically from the web | casia CWMT 2015 | Chinese-English translation corpus | | |
11 | casict2011 | 2011 | research group in Institute of Computing Technology , Chinese Academy of Sciences | | Two parts, each with about 1 million (2 million in total) sentence pairs collected automatically from the web. Sentence-level alignment accuracy is about 90%. | casict CWMT 2011 | Chinese-English translation corpus | | |
12 | casict2015 | 2015 | research group in Institute of Computing Technology , Chinese Academy of Sciences | | About 2 million sentence pairs, collected from the web (60%), movie subtitles (20%), and English/Chinese lexicons (20%). Sentence-level alignment accuracy is above 99%. | casict CWMT 2015 | Chinese-English translation corpus | | |
13 | datum2015 | 2015 | Datum Data Co., Ltd. | | One million sentence pairs covering different genres, such as language-education textbooks, bilingual books, technical documents, bilingual news, government white papers, government documents, and bilingual web resources. Note that some parts of the Chinese side are word-segmented. | datum CWMT 2015 | Chinese-English translation corpus | | |
14 | datum2017 | 2017 | Datum Data Co., Ltd. | | 20 files covering different genres, such as news, dialogue, legal documents, and novels. Each file has 50,000 sentences and the whole corpus has one million sentences. The Chinese words in the first 10 files (Book1-Book10) are segmented. | datum CWMT 2017 | Chinese-English translation corpus | | |
15 | neu2017 | 2017 | NLP lab of Northeastern University, China | | 2 million sentence pairs collected automatically from the web, including news and technical documents. Sentence-level alignment accuracy is about 90%. | neu CWMT 2017 | Chinese-English translation corpus | | |
16 | Translation corpus (translation2019zh) | 2019 | Xu Liang | | Can be used to train Chinese-English translation systems, from Chinese to English or from English to Chinese. Because it has millions of Chinese sentences, the Chinese side alone can be extracted as a general Chinese corpus for training word vectors or for pretraining. The English side can be used in the same way | | | | |
Knowledge Graph
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
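1 | NLPIR Weibo follow-relationship corpus (1 million entries) | 2017/12/2 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | NLPIR Weibo follow-relationship corpus notes. 1. The corpus was obtained by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology, through public crawling and extraction from Sina Weibo and Tencent Weibo. To advance research on microblog computing, 10 million records are shared publicly through the Natural Language Processing and Information Retrieval Sharing Platform (127.0.0.1/wordpress); the full collection is close to 1 billion records, with a large amount of redundant data already removed. 2. In releasing this corpus, users' real names and URLs have been masked by technical means as far as possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data removed; we apologize for any inconvenience caused. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software, paper, or other work as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval Sharing Platform (http://www.nlpir.org/). 4. Fields: `person_id` (the person's ID), `guanzhu_id` (the ID of the person being followed) |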
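
The two fields listed for the follow-relationship corpus, `person_id` and `guanzhu_id`, together describe a directed "follows" edge. As a rough illustration only, the Python sketch below builds an adjacency map from such pairs and finds mutual follows. The on-disk format of the corpus is not described here, so the sketch takes already-parsed `(person_id, guanzhu_id)` tuples; the function names and sample IDs are hypothetical.

```python
from collections import defaultdict


def build_follow_graph(pairs):
    """Map each person_id to the set of guanzhu_id values that person follows."""
    following = defaultdict(set)
    for person_id, guanzhu_id in pairs:
        following[person_id].add(guanzhu_id)
    return following


def mutual_follows(following):
    """Return sorted unordered pairs (a, b) where a follows b and b follows a."""
    result = set()
    for a, targets in following.items():
        for b in targets:
            # .get avoids creating new keys in the defaultdict while iterating
            if a != b and a in following.get(b, ()):
                result.add((min(a, b), max(a, b)))
    return sorted(result)


# Made-up sample edges, not taken from the corpus.
sample_pairs = [("u1", "u2"), ("u2", "u1"), ("u1", "u3"), ("u3", "u2")]
graph = build_follow_graph(sample_pairs)
print(mutual_follows(graph))
```

Storing the edges as sets keeps membership checks cheap, which matters if the same approach is applied to millions of rows.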
Corpora
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
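1 | NLPIR Weibo content corpus (230,000 entries) | 2017/12 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | NLPIR Weibo content corpus notes. 1. The corpus was obtained by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology, through public crawling and extraction from Sina Weibo and Tencent Weibo. To advance research on microblog computing, 230,000 records are shared publicly through the Natural Language Processing and Information Retrieval Sharing Platform (127.0.0.1/wordpress); the full collection is close to 10 million records, with a large amount of redundant data already removed. 2. In releasing this corpus, users' real names and URLs have been masked by technical means as far as possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data removed; we apologize for any inconvenience caused. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software, paper, or other work as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval Sharing Platform (http://www.nlpir.org/). 4. Fields: `id` (article number), `article` (body text), `discuss` (number of comments), `insertTime` (time the post was inserted), `origin` (source), `person_id` (ID of the person the post belongs to), `time` (time the post was published), `transmit` (reposts) | |||||
2 | 5 million Weibo posts | 2018/1 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | [5 million Weibo posts] Dr. Zhang Huaping (@ICTCLAS), director of the BIT Search Mining Lab, provides 5 million Weibo posts for public use. The file is an SQL file that can only be imported into a MySQL database; it includes the table-creation statements and holds 5 million records in total. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software, paper, or other work. [This data appears rougher than the corpus above and has not been preprocessed.] | |||||
3 | NLPIR news corpus (24 million characters) | 2017/7 | www.NLPIR.org | NLPIR news corpus notes. 1. About 48 MB after decompression, roughly 24 million characters of news. 2. The news was collected from 2009-10-12 to 2009-12-14. 3. Each file is named after the news date and contains the body text of multiple news items (junk content has been removed). 4. Copyright of the news content itself belongs to the original authors or news organizations. 5. Copyright of the compiled corpus belongs to www.NLPIR.org. 6. Can provide test data for news analysis, natural language processing, search, and similar applications. For a larger corpus, contact the NLPIR.org administrator. | |||||
4 | NLPIR Weibo follow-relationship corpus (1 million entries) | 2017/12 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | NLPIR Weibo follow-relationship corpus notes. 1. The corpus was obtained by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology, through public crawling and extraction from Sina Weibo and Tencent Weibo. To advance research on microblog computing, 10 million records are shared publicly through the Natural Language Processing and Information Retrieval Sharing Platform (127.0.0.1/wordpress); the full collection is close to 1 billion records, with a large amount of redundant data already removed. 2. In releasing this corpus, users' real names and URLs have been masked by technical means as far as possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data removed; we apologize for any inconvenience caused. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software, paper, or other work as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval Sharing Platform (http://www.nlpir.org/). 4. Fields: `person_id` (the person's ID), `guanzhu_id` (the ID of the person being followed) | |||||
5 | NLPIR Weibo blogger corpus (1 million entries) | 2017/9 | Dr. Zhang Huaping (张华平), Web Search Mining and Security Lab, Beijing Institute of Technology | NLPIR Weibo blogger corpus notes. 1. The corpus was obtained by Dr. Zhang Huaping of the Web Search Mining and Security Lab, Beijing Institute of Technology, through public crawling and extraction from Sina Weibo and Tencent Weibo. To advance research on microblog computing, 1 million records are shared publicly through the Natural Language Processing and Information Retrieval Sharing Platform (127.0.0.1/wordpress); the full collection is close to 100 million records, with a large number of redundant entries and bot followers already removed. 2. In releasing this corpus, users' real names and URLs have been masked by technical means as far as possible. Users who need full protection of their personal privacy can email Dr. Zhang Huaping at kevinzhang@bit.edu.cn to have their data removed; we apologize for any inconvenience caused. 3. For research and teaching use only; commercial use is not permitted. When citing this corpus, please state the source in your software, paper, or other work as: NLPIR Weibo corpus, from the Natural Language Processing and Information Retrieval Sharing Platform (http://www.nlpir.org/). 4. Fields: `id` (internal ID), `sex` (gender), `address` (home address), `fansNum` (number of followers), `summary` (personal summary), `wbNum` (number of posts), `gzNum` (number of accounts followed), `blog` (blog URL), `edu` (education), `work` (employment), `renZh` (whether verified), `brithday` (birthday) | |||||
6 | NLPIR short-text corpus (400,000 characters) | 2017/8 | Web Search Mining and Security Lab, Beijing Institute of Technology (SMS@BIT) | NLPIR short-text corpus notes. 1. About 480,000 characters after decompression, roughly 8,704 short texts. 2. Copyright of the compiled corpus belongs to www.NLPIR.org. 3. Can provide test data for short-text natural language processing, search, public-opinion analysis, and similar applications. | |||||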
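7 | Wikipedia corpus | \ | Wikipedia | Wikipedia periodically packages and releases corpus dumps | |||||
8 | Classical Chinese poetry database | 2020 | GitHub repository owner's crawler, http://shici.store | ||||||
9 | Insurance industry corpus | 2017 | The corpus contains questions and answers collected from the Insurance Library website. To the providers' knowledge, it is the first open QA corpus in the insurance domain. The questions come from real-world users, and the high-quality answers are written by professionals with deep domain knowledge, so this is a corpus of real value rather than a toy. In the accompanying paper, the corpus is used for the answer-selection task, but other uses are possible: for example, self-directed learning through reading comprehension of the answers or observational learning, so that a system can eventually produce its own answers to unseen questions. The dataset has two parts, a "QA corpus" and a "QA-pair corpus". The QA corpus is translated from the original English data with no other processing. The QA-pair corpus is built on the QA corpus, with word segmentation, punctuation and stop-word removal, and added labels, so it can be fed directly into machine-learning tasks. If the data format or the segmentation quality is unsatisfactory, the QA corpus can be processed with other methods to obtain data suitable for model training. | ||||||
10 | Chinese character-decomposition dictionary | 1905/7 | This repository contains the character-decomposition dictionary database that the Open Dictionary site (開放詞典網) uses for lookup by radical and component; it helps users look up characters that are hard to type. The database currently records decompositions for 17,803 distinct Chinese characters, in a Traditional version (chaizi-ft.txt) and a Simplified version (chaizi-jt.txt). Character decomposition differs from conventional stroke-order databases: it aims to split each character into two or more component parts where possible, rather than into the strokes used in handwriting. | ||||||
11 | News corpus | 2016 | 徐亮 | Can serve as a general-purpose Chinese corpus for training word vectors or for pretraining. It can also be used to train headline-generation models or keyword-generation models (selecting records whose keywords differ from the title), and news type can be distinguished by news channel. | |||||
12 | Encyclopedia-style QA, JSON version (baike2018qa) | 2018 | 徐亮 | Can serve as a general-purpose Chinese corpus for training word vectors or for pretraining. It can also be used to build encyclopedia-style QA. The category information is particularly useful for supervised training, for example to build models with better sentence representations or for sentence-similarity tasks. | |||||
13 | Community QA, JSON version (webtext2019zh): large-scale, high-quality dataset | 2019 | 徐亮 | 1) Build encyclopedia-style QA: given a question, build a retrieval system that returns or generates a reply, or use relevant keywords to filter domain-specific data out of the community QA collection. 2) Train topic-prediction models: given a question (and/or its description), predict its topic. 3) Train community question answering (cQA) systems: for one-question-many-answers scenarios, given a question, find the most related question, then pick the best answer based on reply quality and question-answer relevance. 4) Use as a general-purpose Chinese corpus for pretraining large models or training word vectors. The category information is also useful for supervised training, for example to build models with better sentence representations or for sentence-similarity tasks. 5) Use the extra signal of upvote counts to predict how popular a reply will be or to train an answer-scoring system. | |||||
14 | Wikipedia, JSON version (wiki2019zh) | 2019 | 徐亮 | Can serve as a general-purpose Chinese corpus for pretraining or building word vectors, and can also be used to build knowledge-based QA. [Unlike the raw Wikipedia release, this version has been preprocessed.] |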
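
Several entries in the table above are distributed as "json" versions (baike2018qa, webtext2019zh, wiki2019zh) and are suggested for word-vector training or pretraining. The sketch below shows one way to stream plain text out of such a file. It assumes one JSON object per line and a field named `text`; neither the layout nor the field name is confirmed by this README, so check each dataset's own documentation before relying on it. The helper name `iter_texts` is hypothetical.

```python
import io
import json


def iter_texts(lines, field="text"):
    """Yield the value of `field` from each non-empty JSON line that has it."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        value = record.get(field)
        if value:
            yield value


# In-memory stand-in for a real file; the records are made up for illustration.
sample = io.StringIO(
    '{"title": "A", "text": "first"}\n'
    "\n"
    '{"title": "B"}\n'
    '{"title": "C", "text": "third"}\n'
)
texts = list(iter_texts(sample))
print(texts)
```

With a real file, an open file handle (for example `open(path, encoding="utf-8")`) can be passed in place of `sample`, so records are read one at a time instead of loading the whole corpus into memory.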
Reading Comprehension
ID | Title | Updated | Provider | License | Description | Keywords | Category | Paper | Notes |
---|---|---|---|---|---|---|---|---|---|
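1 | Baidu WebQA | 2016 | Baidu | \ | From Baidu Zhidao. Each question is paired with multiple passages of essentially the same meaning, split into a manually annotated set and a set retrieved through browser search | Reading comprehension, real Baidu Zhidao questions | Chinese reading comprehension | Paper | |
2 | DuReader 1.0 | 2018/3/1 | Baidu | Apache2.0 | The competition dataset comes from a real search-engine setting: the questions are real questions from Baidu Search users, and each question comes with 5 candidate document texts and manually curated high-quality answers. | Reading comprehension, real Baidu Search questions | Chinese reading comprehension | Paper | |
3 | SogouQA | 2018 | Sogou | \ | Data from the CIPS-SOGOU question answering competition, drawn from queries submitted by real users of the Sogou search engine; includes both factoid and non-factoid data | Reading comprehension, real Sogou search-engine questions | Chinese reading comprehension | \ | |
4 | Chinese judicial reading comprehension dataset CJRC | 2019/8/17 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | About 10,000 documents, mainly first-instance civil judgments and first-instance criminal judgments. Questions were annotated on the fact-description sections extracted from the judgments, yielding about 50,000 question-answer pairs | Reading comprehension, Chinese legal domain | Chinese reading comprehension | Paper | |
5 | 2019 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | 2019/10 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | The task is sentence-level cloze-style reading comprehension. Given a narrative passage and several sentences extracted from it, participants build a model that places the candidate sentences back into their correct positions so that the passage is complete again. | Sentence-level cloze-style reading comprehension | Chinese reading comprehension | \ | Official competition site: https://hfl-rc.github.io/cmrc2019/ |
6 | 2018 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | 2018/10/19 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | The CMRC 2018 dataset contains about 20,000 questions manually annotated on Wikipedia text. A challenge set was also annotated, containing more difficult questions that require reasoning over multiple sentences to answer correctly | Reading comprehension, passage-based span extraction | Chinese reading comprehension | Paper | Official competition site: https://hfl-rc.github.io/cmrc2018/ |
7 | 2017 "iFLYTEK Cup" Chinese Machine Reading Comprehension dataset (CMRC) | 2017/10/14 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | CC-BY-SA-4.0 | PD&CFT, the first Chinese cloze-style reading comprehension dataset | Cloze-style reading comprehension | Chinese reading comprehension | Paper | Official competition site |
8 | LES Cup: 2nd National "Military Intelligent Machine Reading" Challenge | 2019/9/3 | CETC LES Information System Co., Ltd. (中电莱斯信息系统有限公司) | \ | A large-scale Chinese reading comprehension dataset for military application scenarios. The competition centers on multi-document machine reading comprehension and involves complex techniques such as understanding and reasoning. | Multi-document machine reading comprehension | Chinese reading comprehension | \ | Official competition site |
9 | ReCO | 2020 | Sogou | \ | Drawn from Sogou browser user input; includes multiple-choice and direct answers | Reading comprehension, Sogou Search | Chinese reading comprehension | Paper | \ |
10 | DuReader-checklist | 2021/3 | Baidu | Apache-2.0 | A fine-grained, multi-dimensional evaluation dataset that probes model weaknesses along several dimensions, including vocabulary understanding, phrase understanding, semantic-role understanding, and logical reasoning, with the aim of moving reading-comprehension evaluation into a "fine-grained" era | Fine-grained reading comprehension | Chinese reading comprehension | \ | Official competition site |
11 | DuReader-Robust | 2020/8 | Baidu | Apache-2.0 | Data for testing reading-comprehension robustness, built along several dimensions: over-sensitivity, over-stability, and generalization | Baidu Search, robust reading comprehension | Chinese reading comprehension | Paper | Official competition site |
12 | DuReader-YesNo | 2020/8 | Baidu | Apache-2.0 | DuReader yesno is a dataset whose target task is opinion-polarity judgment. It compensates for shortcomings in the evaluation metrics of extractive datasets and so gives a better measure of how well a model understands opinion polarity. | Opinion-type reading comprehension | Chinese reading comprehension | \ | Official competition site |
13 | DuReader2.0 | 2021 | Baidu | Apache-2.0 | DuReader2.0 is a new large-scale Chinese reading comprehension dataset drawn from real user input in real scenarios | Reading comprehension | Chinese reading comprehension | Paper | Official competition site |
14 | CAIL2020 | 2020 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | Chinese judicial reading comprehension task. This year's edition is an upgrade: document types are extended from civil and criminal to civil, criminal, and administrative, and question types are extended from single-step prediction to multi-step reasoning, raising the difficulty. | Legal reading comprehension | Chinese reading comprehension | \ | Official competition site |
15 | CAIL2021 | 2021 | Joint Laboratory of HIT and iFLYTEK Research (HFL) | \ | The Chinese legal reading comprehension competition introduces multi-span questions, where some questions require extracting several spans from the passage and combining them into the final answer. The multi-span type is intended to broaden the scenarios in which Chinese machine reading comprehension applies. The competition retains the single-span, yes/no, and unanswerable question types. | Legal reading comprehension | Chinese reading comprehension | \ | Official competition site |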
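16 | CoQA | 2018/9 | Stanford University | CC BY-SA 4.0, Apache, etc. | CoQA is a large dataset for building conversational question answering systems. The challenge measures how well machines understand text and how well they answer interrelated questions that arise in a conversation | Conversational QA | English reading comprehension | Paper | Official website |
17 | SQuAD2.0 | 2018/1/11 | Stanford University | \ | Widely regarded in the field as a top-level benchmark for machine reading comprehension. It is a large-scale dataset of 100,000 questions built on more than 500 Wikipedia articles. The answer to each question is a short span of text from the given article; in SQuAD 2.0, a system must additionally decide whether the question can be answered from the given text at all | QA, includes unanswerable questions | English reading comprehension | Paper | |
18 | SQuAD1.0 | 2016 | Stanford University | \ | A reading comprehension dataset released by Stanford University in 2016: given a passage and a corresponding question, the algorithm must produce the answer. All passages are taken from Wikipedia; there are 107,785 questions and 536 accompanying articles | QA, passage-based span extraction | English reading comprehension | Paper | |
19 | MCTest | 2013 | Microsoft | \ | 100,000 Bing questions with human-generated answers. Since then, a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and conversational search have been released in succession. | QA, search | English reading comprehension | Paper | |
20 | CNN/Dailymail | 2015 | DeepMind | Apache-2.0 | A large-scale English cloze-style machine comprehension dataset in which the answer is a word from the source text. The CNN dataset contains CNN news articles and related questions, about 90k articles and 380k questions. The Dailymail dataset contains Daily Mail articles and related questions, about 197k articles and 879k questions. | QA pairs, cloze-style reading comprehension | English reading comprehension | Paper | |
21 | RACE | 2017 | Carnegie Mellon University | / | English reading comprehension questions for Chinese middle and high school students. Each passage comes with 5 four-option multiple-choice questions; the dataset includes 28,000+ passages and 100,000 questions. | Multiple choice | English reading comprehension | Paper | Download requires an email request |
22 | HEAD-QA | 2019 | aghie | MIT | A multiple-choice healthcare QA dataset aimed at complex reasoning. Data is provided in English and Spanish | Medical domain, multiple choice | English reading comprehension, Spanish reading comprehension | Paper | |
23 | Consensus Attention-based Neural Networks for Chinese Reading Comprehension | 2018 | Joint Laboratory of HIT and iFLYTEK Research | / | Chinese cloze-style reading comprehension | Cloze-style reading comprehension | Chinese reading comprehension | Paper | |
24 | WikiQA | 2015 | Microsoft | / | The WikiQA corpus is a new, publicly available set of question and sentence pairs, collected and annotated for open-domain question answering research | Span-extraction reading comprehension | English reading comprehension | Paper | |
25 | Children's Book Test (CBT) | 2016 | / | Tests how well language models capture meaning in children's books. Unlike standard language-modeling benchmarks, it separates the task of predicting syntactic function words from that of predicting lower-frequency words with richer semantic content | Cloze-style reading comprehension | English reading comprehension | Paper | ||
26 | NewsQA | 2017 | Maluuba Research | / | A challenging machine comprehension dataset with more than 100,000 human-generated question-answer pairs. Questions and answers are based on more than 10,000 CNN news articles, and answers are spans of text from the corresponding article. | Span-extraction reading comprehension | English reading comprehension | Paper | |
27 | Frames dataset | 2017 | Microsoft | / | Introduces Frames, a dataset of 1,369 human-human dialogues averaging 15 turns each. It was developed to study the role of memory in goal-oriented dialogue systems. | Reading comprehension, dialogue | English reading comprehension | Paper | |
28 | Quasar | 2017 | Carnegie Mellon University | BSD-2-Clause | Presents two large-scale datasets. Quasar-S consists of 37,000 cloze-style queries constructed from the definitions of software entity tags on the popular website Stack Overflow; the site's posts and comments serve as the background corpus for answering the cloze questions. Quasar-T contains 43,000 open-domain trivia questions with answers obtained from various internet sources. | Span-extraction reading comprehension | English reading comprehension | Paper | |
29 | MS MARCO | 2018 | Microsoft | / | A large-scale English reading comprehension dataset built by Microsoft on the Bing search engine, with 100,000 questions and 200,000 distinct documents. All questions come from Bing search logs, so real user queries are used to simulate a real search-engine setting; it is one of the datasets with the most practical value in this area. | Multi-document | English reading comprehension | Paper | |
30 | Chinese cloze | 2016 | 崔一鸣 | PD&CFT (People Daily and Children's Fairy Tale), the first Chinese cloze-style reading comprehension dataset; the data comes from People's Daily and children's stories. | Cloze-style reading comprehension | Chinese cloze | Paper | ||
31 | NLPCC ICCPOL2016 | 2016.12.2 | NLPCC organizers | 14,659 questions manually composed from sentences in documents, covering 14K Chinese passages. | QA-pair reading comprehension | Chinese reading comprehension | \ |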
Contributing
Thanks to the following contributors (in no particular order):
郑少棠、李明磊、李露、叶琛、薛司悦、章锦川、李小昌、李俊毅
You can contribute by submitting dataset information. Anyone who submits information on five or more datasets that passes review can be listed as a project contributor.
Share your dataset with the community or make a contribution today! Just send an email to chineseGLUE#163.com,
or join the QQ group: 836811304