pkuseg-python

The pkuseg toolkit for multi-domain Chinese word segmentation

6,632

986


Top Related Projects

THULAC-Python

2,075

An Efficient Lexical Analyzer for Chinese

HanLP

35,454

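Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing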

snownlp

6,568

Python library for processing Chinese text

HarvestText

2,530

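Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.) built on unsupervised or weakly supervised methods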

Quick Overview

PKUSeg-python is a Chinese word segmentation toolkit developed by the Peking University Language Computing Lab. It offers state-of-the-art performance in Chinese text segmentation tasks and is designed to be easy to use and integrate into various natural language processing applications.

Pros

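Higher segmentation accuracy than jieba and THULAC in the README's benchmarks (for example, an F-score of 96.88 on MSRA versus 95.71 for THULAC and 88.42 for jieba)
Pretrained models for several domains: news, web, medicine, and tourism, plus a mixed-domain default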
Easy to install and use with a simple Python API
Allows for custom model training on specific domains

Cons

Primarily focused on Chinese language, limiting its use for other languages
May require more computational resources compared to simpler segmentation tools
Documentation is primarily in Chinese, which might be challenging for non-Chinese speakers
Limited community support compared to more widely-used NLP libraries

Code Examples

Basic word segmentation:

import pkuseg

seg = pkuseg.pkuseg()
text = "我爱北京天安门"
result = seg.cut(text)
print(result)
# Output: ['我', '爱', '北京', '天安门']

Using a specific domain model:

seg = pkuseg.pkuseg(model_name='medicine')
text = "头孢霉素类抗生素可以治疗肺炎"
result = seg.cut(text)
print(result)
# Output: ['头孢霉素', '类', '抗生素', '可以', '治疗', '肺炎']

Custom dictionary usage:

seg = pkuseg.pkuseg(user_dict=['北京大学'])
text = "北京大学是世界一流大学"
result = seg.cut(text)
print(result)
# Output: ['北京大学', '是', '世界', '一流', '大学']

Getting Started

To get started with PKUSeg-python, follow these steps:

Install the library using pip:
```
pip install pkuseg
```
Import the library and create a segmenter:
```
import pkuseg
seg = pkuseg.pkuseg()
```

Segment Chinese text:

text = "我在北京大学学习自然语言处理"
result = seg.cut(text)
print(result)

This will output the segmented words as a list. You can now use PKUSeg-python for various Chinese text processing tasks in your projects.

Competitor Comparisons

jieba

34,296

Jieba Chinese word segmentation

Pros of jieba

Faster processing speed for large-scale text segmentation
More extensive documentation and community support
Broader range of features, including keyword extraction (TF-IDF and TextRank)

Cons of jieba

Less accurate for specialized domains or formal texts
Requires more manual tuning for optimal performance
Larger memory footprint, especially for large dictionaries

Code Comparison

jieba:

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

pkuseg:

import pkuseg
seg = pkuseg.pkuseg()
text = seg.cut("我来到北京清华大学")
print(text)

Both libraries offer simple APIs for text segmentation, but pkuseg provides more flexibility in model selection and training. jieba offers additional features like keyword extraction, while pkuseg focuses on accurate segmentation for various domains.

jieba is generally faster and more suitable for large-scale processing, while pkuseg excels in accuracy, especially for formal texts or specific domains. The choice between the two depends on the specific requirements of the project, such as processing speed, accuracy, and domain specificity.
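The speed claims above are easy to check on your own data. The following is a minimal timing sketch, not a rigorous benchmark: `time_segmenter` and `char_split` are hypothetical helpers introduced here, and `char_split` is only a stand-in so that the code runs without either library installed.

```python
import time
from typing import Callable, Iterable, List


def time_segmenter(cut: Callable[[str], List[str]],
                   sentences: Iterable[str],
                   repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    sentences = list(sentences)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            cut(sentence)
        best = min(best, time.perf_counter() - start)
    return best


def char_split(sentence: str) -> List[str]:
    """Stand-in segmenter: one token per character."""
    return list(sentence)


sample = ["我来到北京清华大学", "我爱北京天安门"] * 100
elapsed = time_segmenter(char_split, sample)
print(f"char_split: {elapsed:.4f} s for {len(sample)} sentences")

# To compare the real libraries, pass their cut functions instead, e.g.
#   time_segmenter(pkuseg.pkuseg().cut, sample)
#   time_segmenter(jieba.lcut, sample)
```

`jieba.lcut` is suggested rather than `jieba.cut` because `jieba.cut` returns a lazy generator, so timing the bare call would not measure the segmentation work.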

THULAC-Python

2,075

An Efficient Lexical Analyzer for Chinese

Pros of THULAC-Python

Faster processing speed for large-scale text segmentation tasks
Includes part-of-speech tagging functionality out of the box
Supports both simplified and traditional Chinese characters

Cons of THULAC-Python

Less flexible in terms of customization and fine-tuning
May have lower accuracy on domain-specific or non-standard text
Requires more memory resources compared to pkuseg-python

Code Comparison

THULAC-Python usage:

import thulac

thu = thulac.thulac()
text = "我爱北京天安门"
result = thu.cut(text)
print(result)

pkuseg-python usage:

import pkuseg

seg = pkuseg.pkuseg()
text = "我爱北京天安门"
result = seg.cut(text)
print(result)

Both libraries offer simple APIs for text segmentation, but THULAC-Python provides additional features like part-of-speech tagging by default. pkuseg-python focuses on customizable segmentation and may be more suitable for specific domain applications.

HanLP

35,454

Pros of HanLP

More comprehensive NLP toolkit with broader functionality beyond segmentation
Supports multiple languages, not just Chinese
Actively maintained with frequent updates and improvements

Cons of HanLP

Larger library size, potentially slower for simple segmentation tasks
May have a steeper learning curve due to more extensive features
Higher computational resource requirements for full functionality

Code Comparison

HanLP:

from hanlp_restful import HanLPClient

HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.tokenize('我爱自然语言处理'))

pkuseg:

import pkuseg

seg = pkuseg.pkuseg()
text = seg.cut('我爱自然语言处理')
print(text)

Both libraries offer straightforward APIs for tokenization, but HanLP provides a more extensive set of NLP tools beyond simple segmentation. pkuseg is more focused on Chinese word segmentation specifically, while HanLP offers a broader range of language processing capabilities.

HanLP's code example uses the hanlp_restful client, which sends the text to a remote HanLP API endpoint instead of running a model locally, so it needs network access. pkuseg, on the other hand, runs locally and provides a simpler, more direct approach to Chinese word segmentation that can be quickly integrated into Python projects.

lac

3,957

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

Supports both word segmentation and part-of-speech tagging
Offers pre-trained models for various domains (e.g., news, web)
Provides both Python and C++ interfaces for flexibility

Cons of LAC

Less customizable for specific domains compared to pkuseg
May have lower accuracy on certain text types or specialized content
Requires more dependencies and setup compared to pkuseg

Code Comparison

pkuseg usage:

import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京天安门"
print(seg.cut(text))

LAC usage:

from LAC import LAC
lac = LAC(mode='seg')
text = "我爱北京天安门"
print(lac.run(text))

Both libraries offer simple APIs for word segmentation, but LAC provides additional functionality for part-of-speech tagging and named entity recognition. pkuseg focuses primarily on customizable word segmentation for various domains.

pkuseg is generally easier to set up and use for basic word segmentation tasks, while LAC offers more comprehensive language processing capabilities at the cost of increased complexity.

snownlp

6,568

Python library for processing Chinese text

Pros of snownlp

More comprehensive NLP toolkit with sentiment analysis, text classification, and more
Simpler installation process and fewer dependencies
Faster processing speed for basic NLP tasks

Cons of snownlp

Less accurate for complex Chinese word segmentation tasks
Not actively maintained (last update in 2020)
Limited documentation and community support

Code Comparison

snownlp:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)  # 分词
print(s.sentiments)  # 情感分析

pkuseg-python:

import pkuseg

seg = pkuseg.pkuseg()
text = "这是一个测试句子"
print(seg.cut(text))  # 分词

Both libraries offer Chinese word segmentation, but pkuseg-python focuses on providing more accurate segmentation, especially for domain-specific texts. snownlp, on the other hand, offers a broader range of NLP functionalities beyond just segmentation. pkuseg-python is more actively maintained and provides better documentation, while snownlp offers a simpler API for quick NLP tasks but may lack in accuracy for complex segmentation scenarios.

HarvestText

2,530

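Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.) built on unsupervised or weakly supervised methods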

Pros of HarvestText

More comprehensive NLP toolkit with additional features like named entity recognition and sentiment analysis
Supports both Chinese and English text processing
Includes built-in dictionaries and regular expression patterns for common tasks

Cons of HarvestText

Less specialized for Chinese word segmentation compared to pkuseg-python
May have slower performance for large-scale text processing tasks
Requires more dependencies and setup compared to the more focused pkuseg-python

Code Comparison

HarvestText:

from harvesttext import HarvestText
ht = HarvestText()
text = "今天是个好日子"
words = ht.seg(text)
print(words)

pkuseg-python:

import pkuseg
seg = pkuseg.pkuseg()
text = "今天是个好日子"
words = seg.cut(text)
print(words)

Both libraries offer simple APIs for Chinese word segmentation, but HarvestText provides a more extensive set of NLP tools beyond just segmentation. pkuseg-python focuses specifically on accurate Chinese word segmentation, while HarvestText aims to be a more comprehensive toolkit for various NLP tasks in both Chinese and English.


README

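pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports segmentation for specific domains, and improves segmentation accuracy.

Highlights

pkuseg has the following features:

Multi-domain segmentation. Unlike earlier general-purpose Chinese segmenters, this toolkit also provides pretrained models tailored to data from different domains, and users are free to choose the model that matches the domain of the text to be segmented. Pretrained segmentation models are currently available for the news, web, medicine, and tourism domains, as well as a mixed domain. If the domain of the text is known, load the corresponding model; if it cannot be determined, the general-purpose model trained on mixed-domain data is recommended. Segmentation samples for each domain can be found in example.txt.
Higher segmentation accuracy. Given the same training and test data, pkuseg achieves higher segmentation accuracy than other segmentation toolkits.
User-trained models. Users can train models on entirely new annotated data.
Part-of-speech tagging.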

Build and installation

Only Python 3 is currently supported.
For the best accuracy and speed, we strongly recommend updating to the latest version with pip install.

Install from PyPI (model files included):
```
pip3 install pkuseg
```
Afterwards, load it with import pkuseg.
**Updating to the latest version is recommended** for a better out-of-the-box experience:
```
pip3 install -U pkuseg
```
If downloads from the official PyPI index are slow, use a mirror, for example:
First-time installation:
```
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
```
Update:
```
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
```
If you download the code from GitHub instead of installing with pip, run the following command to install:
```
python setup.py build_ext -i
```
The code on GitHub does not include pretrained models, so you need to download or train a model yourself; the pretrained models are listed under release. When using one, set "model_name" to the model file.

Performance comparison with other segmentation toolkits

Domain-specific training and test results

The following results compare the toolkits on different datasets:

MSRA	Precision	Recall	F-score
jieba	87.01	89.88	88.42
THULAC	95.60	95.91	95.71
pkuseg	96.94	96.81	96.88

WEIBO	Precision	Recall	F-score
jieba	87.79	87.54	87.66
THULAC	93.40	92.40	92.87
pkuseg	93.78	94.65	94.21
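The README does not show how these scores are computed. A common convention for word segmentation, assumed here, is to count a predicted word as correct only when both of its character boundaries match the gold segmentation. The sketch below applies that convention to a hand-written toy example; `to_spans` and `prf` are hypothetical helpers, not part of pkuseg.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-score for one sentence."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = ["我", "爱", "北京", "天安门"]
predicted = ["我", "爱", "北京", "天安", "门"]  # hypothetical over-segmented output
p, r, f = prf(gold, predicted)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")  # P=60.00% R=75.00% F=66.67%
```

Three of the five predicted words match gold words exactly, giving a precision of 3/5 and a recall of 3/4. Corpus-level scores would normally pool the counts over all sentences instead of averaging per-sentence values.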

Default models tested on different domains

Default	MSRA	CTB8	PKU	WEIBO	All Average
jieba	81.45	79.58	81.83	83.56	81.61
THULAC	85.55	87.84	92.29	86.65	88.08
pkuseg	87.29	91.77	92.68	93.43	91.29

Here, All Average is the mean of the F-scores over all test sets.
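As a quick arithmetic check, the All Average column can be reproduced from the four per-dataset F-scores in the Default table above. The sketch uses `Decimal` with half-up rounding, which is an assumption about how the published figures were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# F-scores of each toolkit's default model, copied from the Default table
# (column order: MSRA, CTB8, PKU, WEIBO).
f_scores = {
    "jieba": ["81.45", "79.58", "81.83", "83.56"],
    "THULAC": ["85.55", "87.84", "92.29", "86.65"],
    "pkuseg": ["87.29", "91.77", "92.68", "93.43"],
}


def all_average(scores):
    """Mean of the scores, rounded half-up to two decimal places."""
    total = sum(Decimal(s) for s in scores)
    mean = total / Decimal(len(scores))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


averages = {name: all_average(scores) for name, scores in f_scores.items()}
for name, avg in averages.items():
    print(name, avg)
# jieba 81.61
# THULAC 88.08
# pkuseg 91.29
```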

For a more detailed comparison, see the comparison with existing toolkits.

Usage

Code examples

The following code examples are intended for an interactive Python environment.

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the text
print(text)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the matching domain-specific model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the text
print(text)

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging
text = seg.cut('我爱北京天安门')    # segment the text and tag parts of speech
print(text)
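With `postag=True`, the result is assumed here to be a list of `(word, tag)` pairs and not a plain list of words; check this against your installed version. Under that assumption, the sketch below shows two small post-processing helpers. `format_tagged` and `words_only` are hypothetical names, and the sample data is written by hand and is not real pkuseg output.

```python
from typing import Iterable, List, Tuple


def format_tagged(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (word, tag) pairs in the common word/tag notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def words_only(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Drop the tags and keep the segmented words."""
    return [word for word, _ in pairs]


# Hand-written sample standing in for seg.cut(...) with postag=True.
sample = [("我", "r"), ("爱", "v"), ("北京", "ns"), ("天安门", "ns")]
print(format_tagged(sample))  # 我/r 爱/v 北京/ns 天安门/ns
print(words_only(sample))     # ['我', '爱', '北京', '天安门']
```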

Code example 4: segmenting a file

import pkuseg

# Segment the file input.txt and write the result to output.txt
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

For other usage examples, see the detailed code examples.
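When the file-segmentation call is placed in a script instead of being typed interactively, standard Python multiprocessing practice is to guard the entry point with `if __name__ == "__main__":`, so that worker processes that re-import the script do not start the job again. The following is a minimal sketch under that assumption. `pick_nthread` and `segment_file` are hypothetical helpers, and capping the process count at the CPU count is a design choice of this sketch, not something pkuseg requires.

```python
import os
import sys


def pick_nthread(requested: int) -> int:
    """Cap the requested process count at the number of available CPUs."""
    available = os.cpu_count() or 1
    return max(1, min(requested, available))


def segment_file(input_path: str, output_path: str, nthread: int = 20) -> None:
    """Segment input_path into output_path with pkuseg.test."""
    import pkuseg  # imported lazily so pick_nthread works without pkuseg installed
    pkuseg.test(input_path, output_path, nthread=pick_nthread(nthread))


if __name__ == "__main__":
    # On platforms that start workers by re-importing the script,
    # this guard keeps the workers from re-running the job.
    if len(sys.argv) == 3:
        segment_file(sys.argv[1], sys.argv[2])
    else:
        print("usage: python segment_file.py INPUT OUTPUT")
```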

Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		Model path.
				"default": the default value; uses our pretrained mixed-domain model (only for users who installed via pip).
				"news": use the news-domain model.
				"web": use the web-domain model.
				"medicine": use the medicine-domain model.
				"tourism": use the tourism-domain model.
				model_path: load the model from a user-specified path.
	user_dict		User dictionary.
				"default": the default value; uses the dictionary we provide.
				None: use no dictionary.
				dict_path: use a user-defined dictionary in addition to the default one. Give the path to your own dictionary file, which contains one word per line (if part-of-speech tagging is enabled and the word's tag is known, write the word and its tag on that line, separated by a tab character).
	postag			Whether to perform part-of-speech tagging.
				False: the default value; segment only, without part-of-speech tagging.
				True: tag parts of speech while segmenting.
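The dictionary format described for `user_dict` (one word per line, optionally followed by a tab and a part-of-speech tag) can be produced with a few lines of code. This is a sketch: `write_user_dict` and `segment_with_user_dict` are hypothetical helpers, the tag `ns` is only an illustrative value, and UTF-8 encoding is an assumption.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_user_dict(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> None:
    """Write one word per line; append a tab and the tag when a tag is given."""
    with open(path, "w", encoding="utf-8") as f:
        for word, tag in entries:
            line = word if tag is None else f"{word}\t{tag}"
            f.write(line + "\n")


def segment_with_user_dict(text: str, dict_path: str):
    """Segment text with the default model plus the dictionary at dict_path."""
    import pkuseg  # imported lazily so the writer works without pkuseg installed
    seg = pkuseg.pkuseg(user_dict=dict_path)
    return seg.cut(text)


dict_path = os.path.join(tempfile.mkdtemp(), "my_dict.txt")
write_user_dict(dict_path, [("北京大学", None), ("天安门", "ns")])
with open(dict_path, encoding="utf-8") as f:
    content = f.read()
print(repr(content))
```

With pkuseg installed, `segment_with_user_dict("北京大学是世界一流大学", dict_path)` would then segment the sentence using the extra dictionary.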

Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		Input file path.
	outputFile		Output file path.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		User dictionary. Same as in pkuseg.pkuseg.
	postag			Whether to enable part-of-speech tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to start when testing.

Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		Training file path.
	testFile		Test file path.
	savedir			Path where the trained model is saved.
	train_iter		Number of training iterations.
	init_model		Model to initialize from. The default, None, uses the default initialization; to start from an existing model, give its path, e.g. init_model='./models/'.
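A training run needs a training file and a test file on disk. The sketch below assumes, without confirmation from this README, that both are UTF-8 text files with one sentence per line and words separated by single spaces. `write_segmented_corpus` and `train_domain_model` are hypothetical helpers, and only the `pkuseg.train` signature is taken from the parameter list above.

```python
import os
import tempfile
from typing import Iterable, List, Optional


def write_segmented_corpus(path: str, sentences: Iterable[List[str]]) -> None:
    """Write pre-segmented sentences, one per line, words joined by spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for words in sentences:
            f.write(" ".join(words) + "\n")


def train_domain_model(train_path: str, test_path: str, save_dir: str,
                       train_iter: int = 20,
                       init_model: Optional[str] = None) -> None:
    """Thin wrapper around pkuseg.train using the documented parameters."""
    import pkuseg  # imported lazily so the writer works without pkuseg installed
    pkuseg.train(train_path, test_path, save_dir,
                 train_iter=train_iter, init_model=init_model)


work_dir = tempfile.mkdtemp()
train_path = os.path.join(work_dir, "train.txt")
write_segmented_corpus(train_path, [["我", "爱", "北京", "天安门"],
                                    ["北京大学", "是", "世界", "一流", "大学"]])
with open(train_path, encoding="utf-8") as f:
    corpus = f.read()
print(corpus)
```

Passing `init_model` continues training from an existing model instead of the default initialization, which is how a general model could be adapted to a new domain.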

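Multi-process segmentation

Pretrained models

news: model trained on MSRA (a news corpus).
web: model trained on Weibo (a web text corpus).
medicine: model trained on medical-domain data.
tourism: model trained on tourism-domain data.
mixed: general-purpose model trained on a mixed dataset. This is the model bundled with the pip package.

art: model trained on the arts and culture domain.
entertainment: model trained on the entertainment and sports domain.
science: model trained on the science domain.
default_v2: optimized general-purpose model obtained with a domain-adaptation method; it is larger than the default model but generalizes better.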
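The README's advice is to load the domain model when the domain of the text is known and to fall back to the general model otherwise. A minimal sketch of that rule follows, restricted to the four domain names documented for `model_name`. `choose_model_name` and `load_segmenter` are hypothetical helpers introduced for illustration.

```python
from typing import Optional

# Domain names documented as valid model_name values, besides "default".
DOMAIN_MODELS = {"news", "web", "medicine", "tourism"}


def choose_model_name(domain: Optional[str]) -> str:
    """Return the domain model name if it is documented, else "default"."""
    if domain is not None and domain.lower() in DOMAIN_MODELS:
        return domain.lower()
    return "default"


def load_segmenter(domain: Optional[str] = None):
    """Create a pkuseg segmenter for the given domain."""
    import pkuseg  # imported lazily so choose_model_name works without pkuseg installed
    return pkuseg.pkuseg(model_name=choose_model_name(domain))


for domain in ["medicine", "News", "finance", None]:
    print(domain, "->", choose_model_name(domain))
# medicine -> medicine
# News -> news
# finance -> default
# None -> default
```

Models that are not covered by these names, such as art, entertainment, science, and default_v2, would instead be passed to `model_name` as a path to the downloaded model.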

We welcome users to share domain-specific models they have trained.

Version history

See the version history for details.

License

This code is released under the MIT license.
Comments and suggestions on the toolkit are welcome; please email jingjingxu@pku.edu.cn.

Citation

Ruixuan Luo, Jingjing Xu, Yi Zhang, Zhiyuan Zhang, Xuancheng Ren, Xu Sun. PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation. Arxiv. 2019.


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for chinese word segmentation. ACL. 2016.
Jingjing Xu and Xu Sun. Transfer learning for low-resource chinese word segmentation with a novel neural network. NLPCC. 2017.

FAQ

Acknowledgements

Authors

Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇), Xu Sun (孙栩)
