HanLP

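Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing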

35,454

10,728



Quick Overview

HanLP (Han Language Processing) is a Natural Language Processing (NLP) library focused on Chinese. It provides word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more. It supports both traditional and simplified Chinese characters and offers both rule-based and machine learning-based approaches.

Pros

Comprehensive suite of NLP tools specifically designed for Chinese language processing
Supports both traditional and simplified Chinese characters
Offers both rule-based and machine learning-based approaches for various NLP tasks
Active development and regular updates

Cons

Steeper learning curve for users not familiar with Chinese NLP concepts
Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
Some advanced features require additional resources or models to be downloaded

Code Examples

Word segmentation:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.tokenize('美国总统拜登今天看望了救助人员'))

Part-of-speech tagging:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
# The client has no pos() method; request the task through parse().
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='pos'))

Named entity recognition:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
# The client has no ner() method; request the task through parse().
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='ner'))

Getting Started

To get started with HanLP, follow these steps:

Install the HanLP RESTful client using pip:
```
pip install hanlp_restful
```

Import and initialize HanLP in your Python script:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')

Use HanLP functions for various NLP tasks:

text = '美国总统拜登今天看望了救助人员'
print(HanLP.tokenize(text))
print(HanLP.parse(text, tasks='pos'))
print(HanLP.parse(text, tasks='ner'))

Note: For more advanced features and offline usage, install the native package (pip install hanlp), which may need to download additional resources or models. Refer to the official documentation for detailed instructions.

Competitor Comparisons

jieba

34,296

Jieba Chinese word segmentation

Pros of jieba

Lightweight and easy to use
Fast processing speed for basic NLP tasks
Wide adoption and community support

Cons of jieba

Limited advanced features compared to HanLP
Less accurate for some specialized tasks
Fewer options for customization and fine-tuning

Code Comparison

jieba:

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

HanLP:

from pyhanlp import *
sentence = HanLP.segment("我来到北京清华大学")
print([term.word for term in sentence])

Both libraries offer simple APIs for basic Chinese text segmentation. HanLP provides more advanced features and customization options, while jieba focuses on simplicity and speed for common tasks. HanLP generally offers higher accuracy and more comprehensive NLP capabilities, but jieba remains popular due to its ease of use and lightweight nature.

lac

3,957

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

Developed by Baidu, a leading Chinese tech company, potentially offering industry-standard performance
Focuses specifically on Chinese language processing, which may result in better accuracy for Chinese texts
Provides pre-trained models, reducing the need for extensive training data

Cons of LAC

Limited to Chinese language processing, lacking support for other languages
Less comprehensive feature set compared to HanLP, which offers a wider range of NLP tasks
Smaller community and potentially less frequent updates

Code Comparison

HanLP:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
HanLP.parse('我爱自然语言处理')

LAC:

from LAC import LAC

lac = LAC(mode='lac')
lac.run('我爱自然语言处理')

Both libraries offer simple APIs for basic NLP tasks. LAC focuses on lexical analysis of Chinese text (word segmentation, part-of-speech tagging, and named entity recognition), while HanLP covers a broader range of tasks, including syntactic and semantic parsing, across multiple languages.

snownlp

6,568

Python library for processing Chinese text

Pros of snownlp

Lightweight and easy to use for basic Chinese NLP tasks
Includes sentiment analysis functionality out of the box
Simple API for common operations like word segmentation and POS tagging

Cons of snownlp

Less comprehensive feature set compared to HanLP
Not as actively maintained or updated
Limited documentation and community support

Code Comparison

snownlp:

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.words)         # [u'这个', u'东西', u'真心', u'很', u'赞']
print(s.tags)          # [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]
print(s.sentiments)    # 0.9769663402895832

HanLP:

from pyhanlp import *

sentence = "这个东西真心很赞"
print(HanLP.segment(sentence))  # [这个/rz, 东西/n, 真心/d, 很/d, 赞/vg]
print(HanLP.parseDependency(sentence))  # 1	这个	这个	rz	rz	_	2	定中关系	_	_
                                        # 2	东西	东西	n	n	_	5	主谓关系	_	_
                                        # 3	真心	真心	d	d	_	5	状中结构	_	_
                                        # 4	很	很	d	d	_	5	程度修饰	_	_
                                        # 5	赞	赞	vg	vg	_	0	核心关系	_	_

cppjieba

2,732

The C++ version of the "Jieba" Chinese word segmenter

Pros of cppjieba

Written in C++, offering potentially faster performance for certain tasks
Lightweight and focused specifically on Chinese word segmentation
Easy integration into C++ projects

Cons of cppjieba

Limited to Chinese language processing, while HanLP supports multiple languages
Fewer features compared to HanLP's comprehensive NLP toolkit
Less active development and smaller community support

Code Comparison

cppjieba:

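#include "cppjieba/Jieba.hpp"
// The five path arguments are placeholders for the dictionary files shipped with cppjieba.
cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
std::vector<std::string> words;
jieba.Cut("我来到北京清华大学", words, true);  // true enables the HMM for unknown words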

HanLP:

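import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;
import java.util.List;
List<Term> terms = HanLP.segment("我来到北京清华大学");  // segment() returns Term objects, not strings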

Both libraries provide simple APIs for word segmentation, but HanLP offers a wider range of NLP functions beyond just segmentation. cppjieba's C++ implementation may provide performance benefits in certain scenarios, while HanLP's Java-based approach offers greater flexibility and a more comprehensive set of NLP tools.


README

HanLP: Han Language Processing

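English | 日本語 | Docs | Paper | Forum | docker | ▶️ Run online

Feature	RESTful	Multi-task	Single-task	Model	Annotation standard
Tokenization	Tutorial	Tutorial	Tutorial	tok	Coarse, fine
Part-of-speech tagging	Tutorial	Tutorial	Tutorial	pos	CTB, PKU, 863
Named entity recognition	Tutorial	Tutorial	Tutorial	ner	PKU, MSRA, OntoNotes
Dependency parsing	Tutorial	Tutorial	Tutorial	dep	SD, UD, PMT
Constituency parsing	Tutorial	Tutorial	Tutorial	con	Chinese Tree Bank
Semantic dependency parsing	Tutorial	Tutorial	Tutorial	sdp	CSDP
Semantic role labeling	Tutorial	Tutorial	Tutorial	srl	Chinese Proposition Bank
Abstract meaning representation	Tutorial	N/A	Tutorial	amr	CAMR
Coreference resolution	Tutorial	N/A	N/A	N/A	OntoNotes
Semantic textual similarity	Tutorial	N/A	Tutorial	sts	N/A
Text style transfer	Tutorial	N/A	N/A	N/A	N/A
Keyphrase extraction	Tutorial	N/A	N/A	N/A	N/A
Extractive summarization	Tutorial	N/A	N/A	N/A	N/A
Abstractive summarization	Tutorial	N/A	N/A	N/A	N/A
Grammatical error correction	Tutorial	N/A	N/A	N/A	N/A
Text classification	Tutorial	N/A	N/A	N/A	N/A
Sentiment analysis	Tutorial	N/A	N/A	N/A	`[-1,+1]`
Language identification	Tutorial	N/A	Tutorial	N/A	ISO 639-1 codes

For stemming and morphological and syntactic feature extraction, see the English tutorial; for word vectors and cloze (fill-in-the-blank) prediction, see the corresponding documentation.
For simplified/traditional Chinese conversion, pinyin, new word discovery, and text clustering, see the 1.x tutorial.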

Lightweight RESTful API

Python

pip install hanlp_restful

Create a client, filling in the server address and API key:

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh') # leave auth as None for anonymous access; zh: Chinese, mul: multilingual

Golang

Install with go get -u github.com/hankcs/gohanlp@main, then create a client, filling in the server address and API key:

HanLP := hanlp.HanLPClient(hanlp.WithAuth(""),hanlp.WithLanguage("zh")) // leave auth empty for anonymous access; zh: Chinese, mul: multilingual

Java

Add the dependency to pom.xml:

<dependency>
    <groupId>com.hankcs.hanlp.restful</groupId>
    <artifactId>hanlp-restful</artifactId>
    <version>0.0.12</version>
</dependency>

Create a client, filling in the server address and API key:

HanLPClient HanLP = new HanLPClient("https://www.hanlp.com/api", null, "zh"); // pass null as auth for anonymous access; zh: Chinese, mul: multilingual

Quick start

HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。")

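Large-scale native API

Built on deep learning frameworks such as PyTorch and TensorFlow, the native API suits professional NLP engineers, researchers, and scenarios with massive amounts of local data. It requires Python 3.6 to 3.10 and supports Windows, although *nix is recommended. It can run on a CPU, but a GPU/TPU is recommended. To install the PyTorch version:

pip install hanlp

Every HanLP release passes unit tests on Linux, macOS, and Windows with Python 3.6 to 3.10, so there are no installation issues.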

Multi-task model

import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH) # the world's largest Chinese corpus
HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'])

Flexible scheduling through the tasks argument: the fewer the tasks, the faster the speed. See the tutorial for details. When memory is limited, users can also delete tasks they do not need to slim down the model.
An efficient trie-based custom dictionary with three kinds of rules (force, combine, and correct): see the demos and documentation. The effect of the rule system carries over seamlessly to the downstream statistical models, which allows quick adaptation to new domains.
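
To make the dictionary idea concrete, here is a minimal sketch of longest-match lookup over a trie in plain Python. It is an illustration only and not HanLP's implementation: the names `Trie` and `longest_match_segment` are hypothetical, and the behavior only loosely resembles a "force" rule, in which dictionary entries always win.

```python
class Trie:
    """A tiny character trie that stores dictionary words."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def longest_prefix(self, text, start):
        """Return the end offset of the longest word starting at `start`, or None."""
        node, end = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def longest_match_segment(text, trie):
    """Keep dictionary words whole; fall back to single characters elsewhere."""
    tokens, i = [], 0
    while i < len(text):
        end = trie.longest_prefix(text, i)
        if end is None:
            end = i + 1  # a real system would defer to the statistical model here
        tokens.append(text[i:end])
        i = end
    return tokens


trie = Trie()
for word in ['自然语义', '科技公司', '自然']:
    trie.insert(word)

tokens = longest_match_segment('参观自然语义科技公司', trie)
print(tokens)
```

The trie lets the lookup stop as soon as no dictionary word can continue from the current character, so the cost per position is bounded by the longest dictionary entry rather than the dictionary size.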

Single-task models

import hanlp
HanLP = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(hanlp.load('FINE_ELECTRA_SMALL_ZH'), output_key='tok') \
    .append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
    .append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
    .append(hanlp.load('CTB9_DEP_ELECTRA_SMALL', conll=0), output_key='dep', input_key='tok')\
    .append(hanlp.load('CTB9_CON_ELECTRA_SMALL'), output_key='con', input_key='tok')
HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')

For more features, see the demos and documentation to learn about more models and usage.
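
The `input_key` and `output_key` arguments in the pipeline above suggest that components share one document and read from or write to named fields. The toy sketch below shows that idea in plain Python. `ToyPipeline` and the stand-in components are hypothetical and do not reproduce HanLP's actual `hanlp.pipeline()`; in particular, the rule that a step without `input_key` consumes the previous step's output is an assumption inferred from the example.

```python
import re


class ToyPipeline:
    """Runs components in order; each reads one field of a dict and writes another."""

    def __init__(self):
        self.steps = []

    def append(self, component, output_key, input_key=None):
        self.steps.append((component, input_key, output_key))
        return self  # allow chaining, as in the example above

    def __call__(self, text):
        doc = {}
        previous = text
        for component, input_key, output_key in self.steps:
            # Without an explicit input_key, feed the previous step's output.
            data = doc[input_key] if input_key is not None else previous
            previous = doc[output_key] = component(data)
        return doc


def split_sentence(text):
    """Stand-in splitter: break after sentence-final punctuation."""
    return [s for s in re.split(r'(?<=[。！？!?])', text) if s]


def char_tokenize(sentences):
    """Stand-in tokenizer: one token per character (a real model would go here)."""
    return [list(s) for s in sentences]


def count_tokens(tokenized):
    """Stand-in downstream component that consumes the 'tok' field."""
    return [len(tokens) for tokens in tokenized]


pipeline = ToyPipeline() \
    .append(split_sentence, output_key='sentences') \
    .append(char_tokenize, output_key='tok') \
    .append(count_tokens, output_key='n_tokens', input_key='tok')

doc = pipeline('我爱自然语言处理。今天天气很好！')
print(doc['sentences'])
print(doc['n_tokens'])
```

Keeping every intermediate result in the document is what lets later components, such as the NER and parsing steps above, all read the same `tok` field.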

Output format

{
  "tok/fine": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
  ],
  "tok/coarse": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
  ],
  "pos/ctb": [
    ["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "JJ", "DEG", "CD", "NN", "NR", "NN", "PU"],
    ["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
  ],
  "pos/pku": [
    ["t", "nx", "p", "vn", "n", "v", "b", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
    ["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
  ],
  "pos/863": [
    ["nt", "w", "p", "v", "n", "v", "a", "nt", "d", "a", "u", "a", "n", "ws", "n", "w"],
    ["n", "v", "ns", "n", "v", "n", "n", "n", "n", "w"]
  ],
  "ner/pku": [
    [],
    [["北京立方庭", "ns", 2, 4], ["自然语义科技公司", "nt", 5, 9]]
  ],
  "ner/msra": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
    [["北京", "LOCATION", 2, 3], ["立方庭", "LOCATION", 3, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "ner/ontonotes": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORG", 1, 2]],
    [["北京立方庭", "FAC", 2, 4], ["自然语义科技公司", "ORG", 5, 9]]
  ],
  "srl": [
    [[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["技术", "ARG0", 14, 15]]],
    [[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
  ],
  "dep": [
    [[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "amod"], [15, "nn"], [10, "advmod"], [15, "rcmod"], [10, "assm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
    [[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
  ],
  "sdp": [
    [[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[13, "dDesc"]], [[0, "Root"], [8, "Desc"], [13, "Desc"]], [[15, "Time"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[8, "Quan"], [13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Pat"]], [[6, "mPunc"]]],
    [[[2, "Agt"], [5, "Agt"]], [[0, "Root"]], [[4, "Loc"]], [[2, "Lfin"]], [[2, "ePurp"]], [[8, "Nmod"]], [[9, "Nmod"]], [[9, "Nmod"]], [[5, "Datv"]], [[5, "mPunc"]]]
  ],
  "con": [
    ["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["ADJP", [["NP", [["ADJP", [["JJ", ["次"]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["JJ", ["先进"]]]]]], ["DEG", ["的"]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
    ["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
  ]
}
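
As a rough illustration of how these fields line up, the sketch below pairs tokens with their tags and resolves NER spans back to tokens. It assumes, based on the sample output, that each NER entry is `[text, label, start, end]`, with `start` and `end` being token offsets (end exclusive) into `tok/fine`. The `doc` literal is a hand-copied subset of the sample, not a live result, and the helper names are hypothetical.

```python
doc = {
    "tok/fine": [["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]],
    "pos/ctb": [["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]],
    "ner/msra": [[["北京", "LOCATION", 2, 3], ["立方庭", "LOCATION", 3, 4],
                  ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
}


def tagged_tokens(doc, tok_key="tok/fine", pos_key="pos/ctb"):
    """Pair every token with its part-of-speech tag, sentence by sentence."""
    return [list(zip(toks, tags)) for toks, tags in zip(doc[tok_key], doc[pos_key])]


def entity_spans(doc, ner_key="ner/msra", tok_key="tok/fine"):
    """Resolve each entity's token offsets into the tokens it covers."""
    result = []
    for toks, entities in zip(doc[tok_key], doc[ner_key]):
        result.append([(label, toks[start:end]) for _text, label, start, end in entities])
    return result


print(tagged_tokens(doc)[0])
print(entity_spans(doc)[0])
```

Because the offsets index tokens rather than characters, joining the covered tokens reproduces the entity text, which is a convenient sanity check when mixing annotations from different tokenizations.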

HanLP(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。']).pretty_print()

[pretty_print() output: one table per sentence, drawn with box characters, with the columns Dep Tree, Token, Relation, PoS, NER Type, SRL PA1, SRL PA2, and a constituency tree over the PoS tags (levels 3 to 9 for the first sentence, 3 to 6 for the second). It visualizes the same annotations as the JSON output.]
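
The `dep` field of the JSON output can be read without the visualization. In the sample, each token carries a `[head, relation]` pair, where `head` appears to be a 1-based token index and 0 marks the root. A minimal sketch under that assumption (the helper `dep_triples` is a hypothetical name, and the data is copied by hand from the second sample sentence):

```python
def dep_triples(tokens, arcs):
    """Turn [head, relation] pairs into (dependent, relation, head_word) triples."""
    triples = []
    for token, (head, relation) in zip(tokens, arcs):
        # Heads are assumed to be 1-based; 0 denotes the artificial root.
        head_word = "ROOT" if head == 0 else tokens[head - 1]
        triples.append((token, relation, head_word))
    return triples


tokens = ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
arcs = [[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"],
        [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]

triples = dep_triples(tokens, arcs)
for triple in triples:
    print(triple)
```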

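Train your own domain model

# Import paths follow HanLP's demo training scripts.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.73'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1660853059,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)

See the demos for more training scripts.

Performance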

lang	corpora	model	tok fine	tok coarse	pos ctb	pos pku	pos 863	pos ud	ner pku	ner msra	ner ontonotes	dep	con	srl	sdp SemEval16	sdp DM	sdp PAS	sdp PSD	lem	fea	amr
mul	UD2.7 OntoNotes5	small	98.62	-	-	-	-	93.23	-	-	74.42	79.10	76.85	70.63	-	91.19	93.67	85.34	87.71	84.51	-
mul	UD2.7 OntoNotes5	base	98.97	-	-	-	-	90.32	-	-	80.32	78.74	71.23	73.63	-	92.60	96.04	81.19	85.08	82.13	-
zh	open	small	97.25	-	96.66	-	-	-	-	-	95.00	84.57	87.62	73.40	84.57	-	-	-	-	-	-
zh	open	base	97.50	-	97.07	-	-	-	-	-	96.04	87.11	89.84	77.78	87.11	-	-	-	-	-	-
zh	close	small	96.70	95.93	96.87	97.56	95.05	-	96.22	95.74	76.79	84.44	88.13	75.81	74.28	-	-	-	-	-	-
zh	close	base	97.52	96.44	96.99	97.59	95.29	-	96.48	95.72	77.77	85.29	88.57	76.52	73.76	-	-	-	-	-	-
zh	close	ernie	96.95	97.29	96.76	97.64	95.22	-	97.31	96.47	77.95	85.67	89.17	78.51	74.10	-	-	-	-	-	-
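
Word segmentation is conventionally scored by span-level F1, and the `tok` columns are presumably reported that way (an assumption; the table itself does not name the metric). The sketch below shows how such a score is computed from a gold and a predicted token list. `to_spans` and `segmentation_f1` are illustrative names, and the "gold" and "predicted" sequences are made up for the example.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def segmentation_f1(gold, pred):
    """Span-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
pred = ["北京立方庭", "参观", "自然", "语义", "科技", "公司"]

score = segmentation_f1(gold, pred)
print(round(score, 4))
```

Here five of the six predicted words match five of the seven gold words, so precision is 5/6, recall is 5/7, and F1 is 10/13.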

Citation

@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.451",
    pages = "5555--5577",
    abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}

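License

Source code

HanLP's source code is licensed under the Apache License 2.0 and may be used for commercial purposes free of charge. Please include a link to HanLP and its license in your product description. HanLP is protected by copyright law; infringement will be prosecuted.

Natural Semantics (Qingdao) Technology Co., Ltd.

Shanghai Linyuan Company

Pretrained models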

References

https://hanlp.hankcs.com/docs/references.html
