HanLP
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing
Top Related Projects
Quick Overview
HanLP (Han Language Processing) is a powerful and versatile Natural Language Processing (NLP) library for Chinese language processing. It provides a wide range of NLP tools and algorithms, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more. HanLP supports both traditional and simplified Chinese characters and offers both rule-based and machine learning-based approaches.
Pros
- Comprehensive suite of NLP tools specifically designed for Chinese language processing
- Supports both traditional and simplified Chinese characters
- Offers both rule-based and machine learning-based approaches for various NLP tasks
- Active development and regular updates
Cons
- Steeper learning curve for users not familiar with Chinese NLP concepts
- Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
- Some advanced features require additional resources or models to be downloaded
Code Examples
- Word segmentation:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
print(HanLP.tokenize('美国总统拜登今天看望了救助人员'))
- Part-of-speech tagging (the client exposes POS through the tasks argument of parse rather than a dedicated method):
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='pos'))
- Named entity recognition (likewise via parse):
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks='ner'))
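A single parse() call can also bundle several annotations, avoiding one HTTP round trip per task. A minimal sketch using the documented tasks argument:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
# One request returns tokenization, POS tags and named entities together.
print(HanLP.parse('美国总统拜登今天看望了救助人员', tasks=['tok', 'pos', 'ner']))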
Getting Started
To get started with HanLP's RESTful client, follow these steps:
- Install the client using pip (the full native library is installed with pip install hanlp instead):
pip install hanlp_restful
- Import and initialize the client in your Python script:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
- Use the client for various NLP tasks:
text = '美国总统拜登今天看望了救助人员'
print(HanLP.tokenize(text))
print(HanLP.parse(text, tasks='pos'))
print(HanLP.parse(text, tasks='ner'))
Note: For more advanced features and offline usage, you may need to download additional resources or models. Refer to the official documentation for detailed instructions.
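For a taste of offline usage, the native hanlp package loads pretrained models locally instead of calling the REST API. A minimal sketch, assuming the weights can be downloaded on first use (the constant below is one of the tokenizers listed in hanlp.pretrained.tok):
import hanlp
# Downloads the tokenizer weights once, then runs fully offline.
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
print(tok('美国总统拜登今天看望了救助人员'))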
Competitor Comparisons
Jieba Chinese word segmentation
Pros of jieba
- Lightweight and easy to use
- Fast processing speed for basic NLP tasks
- Wide adoption and community support
Cons of jieba
- Limited advanced features compared to HanLP
- Less accurate for some specialized tasks
- Fewer options for customization and fine-tuning
Code Comparison
jieba:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
HanLP:
from pyhanlp import *
sentence = HanLP.segment("我来到北京清华大学")
print([term.word for term in sentence])
Both libraries offer simple APIs for basic Chinese text segmentation. HanLP provides more advanced features and customization options, while jieba focuses on simplicity and speed for common tasks. HanLP generally offers higher accuracy and more comprehensive NLP capabilities, but jieba remains popular due to its ease of use and lightweight nature.
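That said, jieba does support some customization through user dictionaries. A minimal sketch of this documented feature:
import jieba
# Register a domain term so the segmenter keeps it as a single token.
jieba.add_word('语义依存分析')
print('/'.join(jieba.cut('HanLP支持语义依存分析', cut_all=False)))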
Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
Pros of LAC
- Developed by Baidu, a leading Chinese tech company, potentially offering industry-standard performance
- Focuses specifically on Chinese language processing, which may result in better accuracy for Chinese texts
- Provides pre-trained models, reducing the need for extensive training data
Cons of LAC
- Limited to Chinese language processing, lacking support for other languages
- Less comprehensive feature set compared to HanLP, which offers a wider range of NLP tasks
- Smaller community and potentially less frequent updates
Code Comparison
HanLP:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='zh')
HanLP.parse('我爱自然语言处理')
LAC:
from LAC import LAC
lac = LAC(mode='lac')
lac.run('我爱自然语言处理')
Both libraries offer simple APIs for basic NLP tasks, but HanLP provides a more extensive set of functions for various language processing tasks. LAC focuses primarily on lexical analysis and named entity recognition for Chinese text.
HanLP offers a broader range of NLP capabilities, including parsing, word segmentation, part-of-speech tagging, and more, across multiple languages. LAC, on the other hand, specializes in Chinese language processing with a more focused feature set.
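The "word importance" in LAC's tagline refers to its rank mode, which scores how important each word is in the sentence. A minimal sketch based on the mode documented in LAC's README:
from LAC import LAC
# mode='rank' adds a 0-3 importance score per word on top of segmentation and tagging.
lac_rank = LAC(mode='rank')
words, tags, ranks = lac_rank.run('我爱自然语言处理')
print(list(zip(words, tags, ranks)))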
Python library for processing Chinese text
Pros of snownlp
- Lightweight and easy to use for basic Chinese NLP tasks
- Includes sentiment analysis functionality out of the box
- Simple API for common operations like word segmentation and POS tagging
Cons of snownlp
- Less comprehensive feature set compared to HanLP
- Not as actively maintained or updated
- Limited documentation and community support
Code Comparison
snownlp:
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.words) # [u'这个', u'东西', u'真心', u'很', u'赞']
print(s.tags) # [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]
print(s.sentiments) # 0.9769663402895832
HanLP:
from pyhanlp import *
sentence = "这个东西真心很赞"
print(HanLP.segment(sentence)) # [这个/rz, 东西/n, 真心/d, 很/d, 赞/vg]
print(HanLP.parseDependency(sentence)) # 1 这个 这个 rz rz _ 2 定中关系 _ _
# 2 东西 东西 n n _ 5 主谓关系 _ _
# 3 真心 真心 d d _ 5 状中结构 _ _
# 4 很 很 d d _ 5 程度修饰 _ _
# 5 赞 赞 vg vg _ 0 核心关系 _ _
"结巴"中文分词的C++版本
Pros of cppjieba
- Written in C++, offering potentially faster performance for certain tasks
- Lightweight and focused specifically on Chinese word segmentation
- Easy integration into C++ projects
Cons of cppjieba
- Limited to Chinese language processing, while HanLP supports multiple languages
- Fewer features compared to HanLP's comprehensive NLP toolkit
- Less active development and smaller community support
Code Comparison
cppjieba:
#include "cppjieba/Jieba.hpp"
// The constructor takes paths to the dictionary files shipped with cppjieba.
cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
std::vector<std::string> words;
jieba.Cut("我来到北京清华大学", words);
HanLP:
import com.hankcs.hanlp.HanLP;
List<String> words = HanLP.segment("我来到北京清华大学");
Both libraries provide simple APIs for word segmentation, but HanLP offers a wider range of NLP functions beyond just segmentation. cppjieba's C++ implementation may provide performance benefits in certain scenarios, while HanLP's Java-based approach offers greater flexibility and a more comprehensive set of NLP tools.
README

HanLP: Han Language Processing
中文 | 日本語 | Docs | Forum
HanLP is the multilingual NLP library designed for researchers and enterprises, built on PyTorch and TensorFlow 2.x to advance state-of-the-art deep learning techniques in academia and industry. HanLP was designed from day one to be efficient, user-friendly and extendable.
Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks on 130 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.
For end users, HanLP offers light-weighted RESTful APIs and native Python APIs.
RESTful APIs
Tiny packages in several KBs for agile development and mobile applications. Although anonymous users are welcome, an auth key is recommended; a free one can be applied for here under the CC BY-NC-SA 4.0 license.
Click to expand tutorials for RESTful APIs
Python
pip install hanlp_restful
Create a client with our API endpoint and your auth.
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul') # Support en, ja, zh, mul
Java
Insert the following dependency into your pom.xml.
<dependency>
<groupId>com.hankcs.hanlp.restful</groupId>
<artifactId>hanlp-restful</artifactId>
<version>0.0.15</version>
</dependency>
Create a client with our API endpoint and your auth.
HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul"); // Support en, ja, zh, mul
Quick Start
No matter which language you use, the same interface can be used to parse a document.
HanLP.parse(
    "In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
See docs for visualization, annotation guidelines and more details.
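The object returned by parse is a hanlp_common Document; its pretty_print() method renders the annotations as aligned trees and tags in the console. A minimal sketch:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul')
doc = HanLP.parse('In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.')
doc.pretty_print()  # console-friendly visualization of all annotations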
Native APIs
pip install hanlp
HanLP requires Python 3.6 or higher. While GPU or TPU acceleration is recommended, it is not mandatory.
Quick Start
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
             '2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
             '2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))
- In particular, the Python HanLPClient can also be used as a callable function following the same semantics. See docs for visualization, annotation guidelines and more details.
- To process English, Chinese or Japanese, HanLP provides mono-lingual models in each language which significantly outperform the multilingual model (see the sketch below). See docs for the list of models.
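For example, a Chinese-only multi-task model loads the same way as the multilingual one. A minimal sketch (the constant below is one of the models listed in hanlp.pretrained.mtl; check the docs for the current list):
import hanlp
# A Chinese multi-task model; weights are downloaded on first use.
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)
print(HanLP('2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。'))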
Train Your Own Models
Writing DL models is not hard; the hard part is writing a model that can reproduce the scores in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.
# Imports per HanLP 2.1's module layout; verify against your installed version.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.7'
tokenizer.fit(
SIGHAN2005_PKU_TRAIN_ALL,
SIGHAN2005_PKU_TEST, # Conventionally, no devset is used. See Tian et al. (2020).
save_dir,
'bert-base-chinese',
max_seq_len=300,
char_level=True,
hard_constraint=True,
sampler_builder=SortingSamplerBuilder(batch_size=32),
epochs=3,
adam_epsilon=1e-6,
warmup_steps=0.1,
weight_decay=0.01,
word_dropout=0.1,
seed=1660853059,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
The result is guaranteed to be 96.73, as the random seed is fixed. Unlike some overclaiming papers and projects, HanLP promises that every single digit in our scores is reproducible. Any issue with reproducibility will be treated and solved as a top-priority fatal bug.
Performance
The performance of multi-task learning models is shown in the following table.
| lang | corpora | model | tok (fine) | tok (coarse) | pos (ctb) | pos (pku) | pos (863) | pos (ud) | ner (pku) | ner (msra) | ner (ontonotes) | dep | con | srl | sdp (SemEval16) | sdp (DM) | sdp (PAS) | sdp (PSD) | lem | fea | amr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mul | UD2.7 OntoNotes5 | small | 98.62 | - | - | - | - | 93.23 | - | - | 74.42 | 79.10 | 76.85 | 70.63 | - | 91.19 | 93.67 | 85.34 | 87.71 | 84.51 | - |
| mul | UD2.7 OntoNotes5 | base | 98.97 | - | - | - | - | 90.32 | - | - | 80.32 | 78.74 | 71.23 | 73.63 | - | 92.60 | 96.04 | 81.19 | 85.08 | 82.13 | - |
| zh | open | small | 97.25 | - | 96.66 | - | - | - | - | - | 95.00 | 84.57 | 87.62 | 73.40 | 84.57 | - | - | - | - | - | - |
| zh | open | base | 97.50 | - | 97.07 | - | - | - | - | - | 96.04 | 87.11 | 89.84 | 77.78 | 87.11 | - | - | - | - | - | - |
| zh | close | small | 96.70 | 95.93 | 96.87 | 97.56 | 95.05 | - | 96.22 | 95.74 | 76.79 | 84.44 | 88.13 | 75.81 | 74.28 | - | - | - | - | - | - |
| zh | close | base | 97.52 | 96.44 | 96.99 | 97.59 | 95.29 | - | 96.48 | 95.72 | 77.77 | 85.29 | 88.57 | 76.52 | 73.76 | - | - | - | - | - | - |
| zh | close | ernie | 96.95 | 97.29 | 96.76 | 97.64 | 95.22 | - | 97.31 | 96.47 | 77.95 | 85.67 | 89.17 | 78.51 | 74.10 | - | - | - | - | - | - |
- Multi-task learning models often under-perform their single-task counterparts according to our latest research. Similarly, mono-lingual models often outperform multilingual models. Therefore, we strongly recommend a single-task mono-lingual model if you are targeting high accuracy rather than faster speed.
- A state-of-the-art AMR model has been released.
Citing
If you use HanLP in your research, please cite our EMNLP paper:
@inproceedings{he-choi-2021-stem,
title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
author = "He, Han and Choi, Jinho D.",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.451",
pages = "5555--5577",
abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}
License
Codes
HanLP is licensed under Apache License 2.0. You can use HanLP in your commercial products for free. We would appreciate it if you add a link to HanLP on your website.
Models
Unless otherwise specified, all models in HanLP are licensed under CC BY-NC-SA 4.0.