HanLP
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified-Traditional conversion, natural language processing
Top Related Projects
Quick Overview
HanLP (Han Language Processing) is a powerful and versatile Natural Language Processing (NLP) library for Chinese language processing. It provides a wide range of NLP tools and algorithms, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more. HanLP supports both traditional and simplified Chinese characters and offers both rule-based and machine learning-based approaches.
Pros
- Comprehensive suite of NLP tools specifically designed for Chinese language processing
- Supports both traditional and simplified Chinese characters
- Offers both rule-based and machine learning-based approaches for various NLP tasks
- Active development and regular updates
Cons
- Steeper learning curve for users not familiar with Chinese NLP concepts
- Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
- Some advanced features require additional resources or models to be downloaded
Code Examples
- Word segmentation:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.tokenize('美国总统拜登今天看望了救助人员'))
- Part-of-speech tagging (the RESTful client exposes POS through the parse interface):

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='pos'))

- Named entity recognition (likewise through parse):

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='ner'))
Getting Started
To get started with HanLP's RESTful API, follow these steps:

- Install the RESTful client using pip (for the offline native library, use pip install hanlp instead):

pip install hanlp_restful

- Import and initialize the client in your Python script:

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')

- Use the client for various NLP tasks:

text = '美国总统拜登今天看望了救助人员'
print(HanLP.tokenize(text))
print(HanLP.parse(text, tasks='pos'))
print(HanLP.parse(text, tasks='ner'))
Note: For more advanced features and offline usage, you may need to download additional resources or models. Refer to the official documentation for detailed instructions.
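For offline usage, a minimal sketch with the native library is shown below; it assumes pip install hanlp and that the pretrained identifier (one of those listed under hanlp.pretrained.mtl) is available in your version:

import hanlp

# Load a joint multi-task model; weights are downloaded on first use
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
doc = HanLP('美国总统拜登今天看望了救助人员', tasks='tok/fine')
print(doc['tok/fine'])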
Competitor Comparisons
jieba ("结巴") Chinese word segmentation
Pros of jieba
- Lightweight and easy to use
- Fast processing speed for basic NLP tasks
- Wide adoption and community support
Cons of jieba
- Limited advanced features compared to HanLP
- Less accurate for some specialized tasks
- Fewer options for customization and fine-tuning
Code Comparison
jieba:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
HanLP:
# pyhanlp wraps the Java-based HanLP 1.x
from pyhanlp import *
sentence = HanLP.segment("我来到北京清华大学")
print([term.word for term in sentence])
Both libraries offer simple APIs for basic Chinese text segmentation. HanLP provides more advanced features and customization options, while jieba focuses on simplicity and speed for common tasks. HanLP generally offers higher accuracy and more comprehensive NLP capabilities, but jieba remains popular due to its ease of use and lightweight nature.
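To make the customization difference concrete, here is a small sketch: jieba's add_word is its documented runtime API, while dict_force is the custom-dictionary attribute of HanLP 2.x tokenizers (the pretrained identifier is assumed available in your version):

import jieba
import hanlp

# jieba: register a new word at runtime
jieba.add_word('自然语义科技公司')
print('/'.join(jieba.cut('阿婆主参观自然语义科技公司')))

# HanLP 2.x: attach a forced dictionary to a loaded tokenizer
tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
tok.dict_force = {'自然语义科技公司'}
print(tok('阿婆主参观自然语义科技公司'))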
Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, and word importance
Pros of LAC
- Developed by Baidu, a leading Chinese tech company, potentially offering industry-standard performance
- Focuses specifically on Chinese language processing, which may result in better accuracy for Chinese texts
- Provides pre-trained models, reducing the need for extensive training data
Cons of LAC
- Limited to Chinese language processing, lacking support for other languages
- Less comprehensive feature set compared to HanLP, which offers a wider range of NLP tasks
- Smaller community and potentially less frequent updates
Code Comparison
HanLP:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
HanLP.parse('我爱自然语言处理')
LAC:
from LAC import LAC
lac = LAC(mode='lac')
lac.run('我爱自然语言处理')
Both libraries offer simple APIs for basic NLP tasks, but HanLP provides a more extensive set of functions, including parsing, word segmentation, part-of-speech tagging, and more, across multiple languages. LAC, on the other hand, specializes in Chinese lexical analysis and named entity recognition with a more focused feature set, as the sketch below shows.
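A brief usage sketch of LAC's documented API: running on a single string returns a pair of lists, the words and their tags:

from LAC import LAC

lac = LAC(mode='lac')  # joint segmentation and tagging
words, tags = lac.run('我爱自然语言处理')  # one string in, [words, tags] out
for word, tag in zip(words, tags):
    print(word, tag)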
Python library for processing Chinese text
Pros of snownlp
- Lightweight and easy to use for basic Chinese NLP tasks
- Includes sentiment analysis functionality out of the box
- Simple API for common operations like word segmentation and POS tagging
Cons of snownlp
- Less comprehensive feature set compared to HanLP
- Not as actively maintained or updated
- Limited documentation and community support
Code Comparison
snownlp:
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.words) # [u'这个', u'东西', u'真心', u'很', u'赞']
print(s.tags) # [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]
print(s.sentiments) # 0.9769663402895832
HanLP:
from pyhanlp import *
sentence = "这个东西真心很赞"
print(HanLP.segment(sentence)) # [这个/rz, 东西/n, 真心/d, 很/d, 赞/vg]
print(HanLP.parseDependency(sentence)) # 1 这个 这个 rz rz _ 2 定中关系 _ _
# 2 东西 东西 n n _ 5 主谓关系 _ _
# 3 真心 真心 d d _ 5 状中结构 _ _
# 4 很 很 d d _ 5 程度修饰 _ _
# 5 赞 赞 vg vg _ 0 核心关系 _ _
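For comparison with snownlp's built-in sentiment score, HanLP offers sentiment analysis through its RESTful client; a minimal sketch, assuming the anonymous quota suffices (sentiment_analysis returns a polarity score in [-1, +1]):

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.sentiment_analysis('这个东西真心很赞'))  # e.g. a value close to +1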
"结巴"中文分词的C++版本
Pros of cppjieba
- Written in C++, offering potentially faster performance for certain tasks
- Lightweight and focused specifically on Chinese word segmentation
- Easy integration into C++ projects
Cons of cppjieba
- Limited to Chinese language processing, while HanLP supports multiple languages
- Fewer features compared to HanLP's comprehensive NLP toolkit
- Less active development and smaller community support
Code Comparison
cppjieba:
#include "cppjieba/Jieba.hpp"
using namespace std;

// DICT_PATH etc. are the dictionary paths from cppjieba's demo (dict/ directory)
cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
vector<string> words;
jieba.Cut("我来到北京清华大学", words, true);  // true enables the HMM for unknown words
HanLP:
import com.hankcs.hanlp.HanLP;
List<String> words = HanLP.segment("我来到北京清华大学");
Both libraries provide simple APIs for word segmentation, but HanLP offers a wider range of NLP functions beyond just segmentation. cppjieba's C++ implementation may provide performance benefits in certain scenarios, while HanLP's Java-based approach offers greater flexibility and a more comprehensive set of NLP tools.
HanLP: Han Language Processing
English | 日本語 | Documentation | Paper | Forum | docker | ▶️ Run Online
HanLP is a multilingual NLP toolkit for production environments, built on both PyTorch and TensorFlow 2.x, with the goal of popularizing and deploying state-of-the-art NLP technology. HanLP is designed to be feature-complete, accurate, efficient, trained on up-to-date corpora, cleanly architected, and customizable.

Drawing on the world's largest multilingual corpus collection, HanLP 2.1 supports 10 joint tasks and several single tasks across 130 languages, including Simplified and Traditional Chinese, English, Japanese, Russian, French, and German. HanLP has pre-trained dozens of models on more than a dozen tasks and keeps iterating on both corpora and models:
Feature | RESTful | Multi-task | Single-task | Model | Annotation standards
---|---|---|---|---|---
Tokenization | tutorial | tutorial | tutorial | tok | coarse, fine
Part-of-speech tagging | tutorial | tutorial | tutorial | pos | CTB, PKU, 863
Named entity recognition | tutorial | tutorial | tutorial | ner | PKU, MSRA, OntoNotes
Dependency parsing | tutorial | tutorial | tutorial | dep | SD, UD, PMT
Constituency parsing | tutorial | tutorial | tutorial | con | Chinese Tree Bank
Semantic dependency parsing | tutorial | tutorial | tutorial | sdp | CSDP
Semantic role labeling | tutorial | tutorial | tutorial | srl | Chinese Proposition Bank
Abstract meaning representation | tutorial | N/A | tutorial | amr | CAMR
Coreference resolution | tutorial | N/A | N/A | N/A | OntoNotes
Semantic textual similarity | tutorial | N/A | tutorial | sts | N/A
Text style transfer | tutorial | N/A | N/A | N/A | N/A
Keyphrase extraction | tutorial | N/A | N/A | N/A | N/A
Extractive summarization | tutorial | N/A | N/A | N/A | N/A
Abstractive summarization | tutorial | N/A | N/A | N/A | N/A
Grammatical error correction | tutorial | N/A | N/A | N/A | N/A
Text classification | tutorial | N/A | N/A | N/A | N/A
Sentiment analysis | tutorial | N/A | N/A | N/A | [-1,+1]
Language identification | tutorial | N/A | tutorial | N/A | ISO 639-1 codes
- For stemming and lexical/grammatical feature extraction, see the English tutorial; for word embeddings and cloze tests, see the corresponding documentation.
- For Simplified-Traditional Chinese conversion, pinyin, new word discovery, and text clustering, see the 1.x tutorial.

Tailored to fit, HanLP offers two APIs, RESTful and native, aimed respectively at lightweight and massive-scale scenarios. Whatever the API and whatever the language, HanLP's interfaces stay semantically consistent and the code stays open source. If you use HanLP in your research, please cite our EMNLP paper.
Lightweight RESTful API

Only a few KB in size, suited to agile development, mobile apps, and similar scenarios. Simple and easy to use: no GPU or environment setup required, installation takes seconds. More corpora, larger models, and higher accuracy; strongly recommended. Server-side GPU compute is limited and anonymous users get a small quota, so applying for a free public API key (auth) is recommended.
Python
pip install hanlp_restful
Create a client, filling in the server address and your API key:

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')  # leave auth as None for anonymous access; language: 'zh' for Chinese, 'mul' for multilingual
Golang
Install with go get -u github.com/hankcs/gohanlp@main, then create a client, filling in the server address and your API key:

HanLP := hanlp.HanLPClient(hanlp.WithAuth(""), hanlp.WithLanguage("zh")) // leave auth empty for anonymous access; "zh" for Chinese, "mul" for multilingual
Java
Add the dependency to pom.xml:
<dependency>
<groupId>com.hankcs.hanlp.restful</groupId>
<artifactId>hanlp-restful</artifactId>
<version>0.0.12</version>
</dependency>
Create a client, filling in the server address and your API key:

HanLPClient HanLP = new HanLPClient("https://www.hanlp.com/api", null, "zh"); // pass null auth for anonymous access; "zh" for Chinese, "mul" for multilingual
Quick Start

Whatever your programming language, call the parse interface with a document and you get HanLP's accurate analysis:

HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。")
More features include semantic similarity, style transfer, coreference resolution, and more; see the documentation and test cases, and the sketch below.
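As a rough illustration of those RESTful extras, the following sketch calls the client methods documented for hanlp_restful (semantic_textual_similarity, text_style_transfer, coreference_resolution); the example inputs are placeholders:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')

# Semantic textual similarity over one or more sentence pairs
print(HanLP.semantic_textual_similarity(('看图猜一电影名', '看图猜电影')))

# Text style transfer, e.g. into formal government-document style
print(HanLP.text_style_transfer('国家对中石油抱有很大的期望.', target_style='gov_doc'))

# Coreference resolution: clusters of mentions referring to the same entity
print(HanLP.coreference_resolution('我姐送我她的猫。我很喜欢它。'))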
Massive-scale Native API

Built on deep learning frameworks such as PyTorch and TensorFlow, the native API suits professional NLP engineers, researchers, and local massive-data scenarios. It requires Python 3.6 to 3.10; Windows is supported, though *nix is recommended. It runs on CPU, but GPU/TPU is recommended. To install the PyTorch flavor:
pip install hanlp
- Every HanLP release passes unit tests on Linux, macOS, and Windows under Python 3.6 through 3.10, so there are no installation issues.

HanLP ships two kinds of models: multi-task models, which are faster and use less VRAM, and single-task models, which are more accurate and more flexible.
Multi-task Models

HanLP's workflow is to load a model and then call it as a function, for example the joint multi-task model below:
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)  # world's largest Chinese corpus
HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'])
The native API takes sentences as its input unit, so documents must first be split into sentences using the multilingual sentence-splitting model or a rule-based splitting function. The RESTful and native APIs share an identical semantic design, so users can swap between them seamlessly. The concise interface also accepts flexible parameters; common tricks include (see the sketch after this list):

- Flexible tasks scheduling: the fewer the tasks, the faster the call; see the tutorial for details. In memory-constrained settings, users can also delete unneeded tasks to slim the model down.
- Efficient trie-based custom dictionaries, with three kinds of rules (forcing, merging, and correcting); see the demo and documentation. The effects of the rule system carry over seamlessly into the downstream statistical models, enabling quick adaptation to new domains.
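A minimal sketch of both tricks, assuming the multi-task model above (task keys such as 'tok/fine' and 'ner/msra' follow that model's task inventory):

import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

# Run only the tasks you need; fewer tasks means a faster call
HanLP('阿婆主来到北京立方庭参观自然语义科技公司。', tasks=['tok/fine', 'ner/msra'])

# Attach a forced custom dictionary to the fine-grained tokenizer
HanLP['tok/fine'].dict_force = {'自然语义科技公司'}
print(HanLP('阿婆主来到北京立方庭参观自然语义科技公司。', tasks='tok/fine')['tok/fine'])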
Single-task Models

According to our latest research, the advantage of multi-task learning lies in speed and VRAM usage, while its accuracy often falls short of single-task models. HanLP therefore pre-trains many single-task models and offers an elegant pipeline pattern for assembling them:
import hanlp
HanLP = hanlp.pipeline() \
.append(hanlp.utils.rules.split_sentence, output_key='sentences') \
.append(hanlp.load('FINE_ELECTRA_SMALL_ZH'), output_key='tok') \
.append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
.append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
.append(hanlp.load('CTB9_DEP_ELECTRA_SMALL', conll=0), output_key='dep', input_key='tok')\
.append(hanlp.load('CTB9_CON_ELECTRA_SMALL'), output_key='con', input_key='tok')
HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')
For more models and usage, see the demo and documentation.

Output Format

Whatever the API, programming language, or natural language, HanLP's output is uniformly a json-formatted, dict-compatible Document:
{
  "tok/fine": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
  ],
  "tok/coarse": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
  ],
  "pos/ctb": [
    ["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "JJ", "DEG", "CD", "NN", "NR", "NN", "PU"],
    ["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
  ],
  "pos/pku": [
    ["t", "nx", "p", "vn", "n", "v", "b", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
    ["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
  ],
  "pos/863": [
    ["nt", "w", "p", "v", "n", "v", "a", "nt", "d", "a", "u", "a", "n", "ws", "n", "w"],
    ["n", "v", "ns", "n", "v", "n", "n", "n", "n", "w"]
  ],
  "ner/pku": [
    [],
    [["北京立方庭", "ns", 2, 4], ["自然语义科技公司", "nt", 5, 9]]
  ],
  "ner/msra": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
    [["北京", "LOCATION", 2, 3], ["立方庭", "LOCATION", 3, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "ner/ontonotes": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORG", 1, 2]],
    [["北京立方庭", "FAC", 2, 4], ["自然语义科技公司", "ORG", 5, 9]]
  ],
  "srl": [
    [[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["技术", "ARG0", 14, 15]]],
    [[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
  ],
  "dep": [
    [[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "amod"], [15, "nn"], [10, "advmod"], [15, "rcmod"], [10, "assm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
    [[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
  ],
  "sdp": [
    [[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[13, "dDesc"]], [[0, "Root"], [8, "Desc"], [13, "Desc"]], [[15, "Time"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[8, "Quan"], [13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Pat"]], [[6, "mPunc"]]],
    [[[2, "Agt"], [5, "Agt"]], [[0, "Root"]], [[4, "Loc"]], [[2, "Lfin"]], [[2, "ePurp"]], [[8, "Nmod"]], [[9, "Nmod"]], [[9, "Nmod"]], [[5, "Datv"]], [[5, "mPunc"]]]
  ],
  "con": [
    ["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["ADJP", [["NP", [["ADJP", [["JJ", ["次"]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["JJ", ["先进"]]]]]], ["DEG", ["的"]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
    ["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
  ]
}
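Since a Document behaves like a dict keyed by task name, with one list entry per sentence, traversal is plain Python; a small sketch, assuming the multi-task model loaded earlier:

doc = HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。'])
# Pair each fine-grained token of the first sentence with its CTB PoS tag
for word, pos in zip(doc['tok/fine'][0], doc['pos/ctb'][0]):
    print(word, pos)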
In particular, the Python RESTful and native APIs support monospace-font visualization, rendering linguistic structures directly in the console:

HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。']).pretty_print()
[Console output: for each of the two sentences, pretty_print draws the dependency tree, tokens, relations, PoS tags, NER spans, SRL predicate-argument structures, and the constituency tree side by side using box-drawing characters.]
For the meaning of each tag set, see the Linguistic Annotation Guidelines and the Data Format specification. We purchased, annotated, or adopted the world's largest and most diverse corpora for joint multilingual multi-task learning, so HanLP's tag sets are also the most comprehensive in coverage.
Train Your Own Domain Models

Writing a deep learning model is not hard at all; what is hard is reproducing high accuracy. The code below shows how to spend 6 minutes training a Chinese word segmentation model on the sighan2005 PKU corpus that surpasses the academic state of the art:
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
# The corpus constants below ship with HanLP; their module path may vary across versions.
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.73'
tokenizer.fit(
SIGHAN2005_PKU_TRAIN_ALL,
SIGHAN2005_PKU_TEST, # Conventionally, no devset is used. See Tian et al. (2020).
save_dir,
'bert-base-chinese',
max_seq_len=300,
char_level=True,
hard_constraint=True,
sampler_builder=SortingSamplerBuilder(batch_size=32),
epochs=3,
adam_epsilon=1e-6,
warmup_steps=0.1,
weight_decay=0.01,
word_dropout=0.1,
seed=1660853059,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
Because the random seed is fixed, the result is guaranteed to be 96.73. Unlike academic papers or commercial projects that overstate their claims, HanLP guarantees that all results are reproducible. If you have any doubt, we will treat it as a fatal bug of the highest priority and resolve it immediately.

See the demo for more training scripts.
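Once fit finishes, the fitted component can be reloaded and used for prediction; a minimal sketch (the call form follows HanLP's component API):

# Reload the fitted model from save_dir and segment new text with it
tokenizer.load(save_dir)
print(tokenizer('商品和服务'))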
Performance
lang | corpora | model | tok (fine) | tok (coarse) | pos (ctb) | pos (pku) | pos (863) | pos (ud) | ner (pku) | ner (msra) | ner (ontonotes) | dep | con | srl | sdp (SemEval16) | sdp (DM) | sdp (PAS) | sdp (PSD) | lem | fea | amr
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
mul | UD2.7 OntoNotes5 | small | 98.62 | - | - | - | - | 93.23 | - | - | 74.42 | 79.10 | 76.85 | 70.63 | - | 91.19 | 93.67 | 85.34 | 87.71 | 84.51 | -
mul | UD2.7 OntoNotes5 | base | 98.97 | - | - | - | - | 90.32 | - | - | 80.32 | 78.74 | 71.23 | 73.63 | - | 92.60 | 96.04 | 81.19 | 85.08 | 82.13 | -
zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | -
zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | -
zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | -
zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | -
zh | close | ernie | 96.95 | 97.29 | 96.76 | 97.64 | 95.22 | - | 97.31 | 96.47 | 77.95 | 85.67 | 89.17 | 78.51 | 74.10 | - | - | - | - | - | -
- According to our latest research, single-task learning often outperforms multi-task learning in accuracy. If you care more about accuracy than speed, use single-task models.

HanLP's data preprocessing and split ratios do not necessarily follow popular practice. For example, HanLP uses the complete MSRA named entity recognition corpus rather than the commonly used truncated version; it adopts the Stanford Dependencies standard, which covers more grammar, instead of the Zhang and Clark (2008) standard academia inherited; and it proposes an even split of CTB rather than the academic split, which is uneven and omits 51 gold files. HanLP has open-sourced the full set of corpus preprocessing scripts and the corresponding corpora, in the hope of making Chinese NLP more transparent.

In short, HanLP does only what we believe is correct and state of the art, which is not necessarily what is popular or authoritative.
Citation

If you use HanLP in your research, please cite it as follows:
@inproceedings{he-choi-2021-stem,
title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
author = "He, Han and Choi, Jinho D.",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.451",
pages = "5555--5577",
abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}
License
Source Code

HanLP's source code is licensed under the Apache License 2.0 and is free for commercial use. Please include a link to HanLP and the license in your product documentation. HanLP is protected by copyright law; infringement will be prosecuted.

Natural Semantics (Qingdao) Technology Co., Ltd.

Since v1.7, HanLP has operated independently, with Natural Semantics (Qingdao) Technology Co., Ltd. (自然语义（青岛）科技有限公司) as the project entity, leading the development of subsequent versions and holding their copyright.

大快搜索

HanLP v1.3 through v1.65 were developed under the leadership of 大快搜索, remain fully open source, and 大快搜索 holds the relevant copyright.

上海林原

HanLP received strong support from 上海林原 in its early days; the company holds the copyright to v1.28 and earlier versions, which were also released on its website.

Pre-trained Models

The licensing of machine learning models remains legally unsettled, but in the spirit of honoring the original licenses of open-source corpora, and unless otherwise stated, HanLP's multilingual models are licensed under CC BY-NC-SA 4.0 and its Chinese models are licensed for research and teaching use only.