THULAC-Python

An Efficient Lexical Analyzer for Chinese

2,075

336


Top Related Projects

HanLP

34,953

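Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing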

pkuseg-python

6,632

The pkuseg toolkit for multi-domain Chinese word segmentation

nlp_chinese_corpus

9,737

Large Scale Chinese Corpus for NLP

Quick Overview

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Computing Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging. Its tag set marks person, place, and organization names (np, ns, ni), which gives a coarse form of named entity recognition rather than a separate NER component.

Pros

Segmentation and Tagging in One Tool: THULAC performs word segmentation and part-of-speech tagging jointly, and a seg_only mode skips tagging when only segmentation is needed. Name-like words can be picked out through the np, ns, and ni tags.
High Accuracy: According to the README, the model is trained on a manually annotated corpus of about 58 million characters and reaches an F1 of 97.3% for segmentation and 92.9% for part-of-speech tagging on Chinese Treebank (CTB5).
Efficient Performance: The README reports about 300 KB/s for joint segmentation and tagging and up to 1.3 MB/s for segmentation alone.
Traditional Chinese Input: The T2S option converts traditional Chinese input to simplified Chinese before analysis. The README does not describe support for languages other than Chinese.

Cons

Limited Customization: Customization is confined to constructor options (user dictionary, model path, traditional-to-simplified conversion, segmentation-only mode, a word filter, and the word/tag delimiter). The README does not describe training on your own corpus.
Model Files: The pip package bundles its model files, but a GitHub checkout needs the models downloaded separately, and the larger Model_3 is only available by application to the authors.
Separate Ports per Language: This package is the Python port. The README history also lists C++, Java, and so (shared library) versions, which are maintained as separate implementations.
Potential Maintenance Challenges: The README's changelog ends with the pip release in January 2017, so long-term maintenance and updates depend on the continued involvement of the Tsinghua University team.

Code Examples

Here are a few examples of how to use the THULAC library in Python:

import thulac

# Create a THULAC instance (seg_only=False keeps part-of-speech tagging enabled)
thu = thulac.thulac(seg_only=False)

# Perform word segmentation and part-of-speech tagging
text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)
# Prints a single string of word_tag tokens; the exact tags depend on the model

# Pick out name-like words (np: person, ns: place, ni: organization)
text = "北京是中国的首都。"
pairs = thu.cut(text)  # text=False (the default) returns [[word, tag], ...]
names = [(word, tag) for word, tag in pairs if tag in ("np", "ns", "ni")]
print(names)

# Customize the THULAC instance
thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut(text, text=True)
print(result)
# Words found in the user dictionary are tagged uw

Getting Started

To get started with THULAC-Python, follow these steps:

Install the THULAC library using pip:

pip install thulac

Import the thulac module and create a thulac instance:

import thulac

# Create a THULAC instance with default settings
thu = thulac.thulac()

Use the cut() method to perform word segmentation and part-of-speech tagging:

text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)

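Call cut() without text=True to get [word, tag] pairs, for example to pick out words tagged as names (run() takes no text argument; it starts an interactive command-line segmenter instead):

text = "北京是中国的首都。"
pairs = thu.cut(text)
print([word for word, tag in pairs if tag in ("np", "ns", "ni")])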

(Optional) Customize the THULAC instance by providing a user dictionary or enabling traditional-to-simplified Chinese conversion:

thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut(text, text=True)
print(result)

You can now use THULAC-Python for Chinese word segmentation and part-of-speech tagging in your Python applications.

Competitor Comparisons

HanLP

34,953

Pros of HanLP

Comprehensive Feature Set: HanLP provides a wide range of natural language processing capabilities, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more.
Multilingual Support: HanLP supports multiple languages, including Chinese, English, and other languages, making it a versatile tool for diverse applications.
Active Development and Community: HanLP has an active development team and a large community of contributors, ensuring regular updates and improvements.

Cons of HanLP

Larger Codebase: HanLP has a more extensive codebase compared to THULAC-Python, which may result in a larger memory footprint and slower startup times.
Steeper Learning Curve: HanLP's comprehensive feature set and configuration options may present a steeper learning curve for some users, especially those new to natural language processing.
Dependency on Java (1.x): HanLP 1.x is written in Java, so Python users reach it through a wrapper. HanLP 2.x is a separate Python package, so this limitation applies only to the 1.x line used in the comparison below.

Code Comparison

THULAC-Python:

import thulac

thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)

HanLP (1.x Java API):

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

public class HanLPExample {
    public static void main(String[] args) {
        String text = "我爱北京天安门。";
        for (Term term : HanLP.segment(text)) {
            System.out.print(term.word + " ");
        }
    }
}

jieba

34,028

Jieba Chinese word segmentation

Pros of Jieba

Jieba is a widely used Chinese text segmentation library with a large and active community.
Jieba supports several use cases, including word segmentation, part-of-speech tagging (jieba.posseg), and keyword extraction (jieba.analyse).
Jieba is customizable: users can load their own dictionaries and choose among its segmentation modes.

Cons of Jieba

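In the benchmark reproduced in the THULAC README, jieba scores lower precision and recall than THULAC (0.814/0.809 versus 0.877/0.899 on msr_test, and 0.850/0.784 versus 0.944/0.908 on pku_test).
That accuracy gap comes with a speed advantage, not a penalty: jieba was the fastest tool in the same benchmark (0.26 s versus 0.62 s on msr_test), although those figures do not necessarily reflect the Python implementations of either tool.
Like THULAC-Python, jieba targets Chinese text only, so neither is a general multilingual solution.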

Code Comparison

Jieba:

import jieba

text = "这是一个测试句子"
words = jieba.cut(text)
print(" ".join(words))

THULAC-Python:

import thulac

text = "这是一个测试句子"
thu = thulac.thulac(seg_only=True)  # segmentation only, to match the jieba example
words = thu.cut(text, text=True)  # returns one space-separated string
print(words)

Both code snippets perform basic Chinese word segmentation. THULAC-Python also tags parts of speech by default when seg_only is left at False, while jieba offers tagging through its separate jieba.posseg module.

pkuseg-python

6,632

The pkuseg toolkit for multi-domain Chinese word segmentation

Pros of pkuseg-python

Performance: pkuseg-python is sometimes reported to be faster than THULAC-Python, but no benchmark figures are given here, so treat the claim as unverified and measure on your own data.
Multi-Domain Models: As its description says, pkuseg targets multi-domain segmentation, which can help on text that differs from the newswire-style corpora many segmenters are trained on.
Simple Installation: pkuseg-python installs with pip, as THULAC-Python does.

Cons of pkuseg-python

Documentation: The documentation for pkuseg-python may be harder for new users to navigate than that of THULAC-Python, though this is a matter of preference.
Feature Overlap: pkuseg-python also offers part-of-speech tagging (through its postag option), so the two libraries cover similar ground. The main difference is THULAC's own tag set and bundled models.
Potential Bias: The dataset used to train the pkuseg-python model may have inherent biases, which could affect the performance on specific types of text.

Code Comparison

THULAC-Python:

import thulac

thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)

pkuseg-python:

import pkuseg

seg = pkuseg.pkuseg()
text = "我爱北京天安门。"
words = seg.cut(text)
print(words)

Both code snippets perform Chinese word segmentation on the same input text, but the API and output format differ slightly between the two libraries.

nlp_chinese_corpus

9,737

Large Scale Chinese Corpus for NLP

Pros of nlp_chinese_corpus

Provides a large-scale collection of Chinese corpora for NLP, as its description states, usable as training material for a range of NLP tasks.
Includes high-quality datasets from reputable sources, making it a valuable resource for researchers and developers.
Offers a diverse range of datasets, catering to different domains and applications.

Cons of nlp_chinese_corpus

The repository does not provide any pre-trained models or tools for processing the datasets, unlike THULAC-Python.
The documentation and usage instructions may not be as detailed or user-friendly as THULAC-Python.
The datasets may not be as actively maintained or updated as the THULAC-Python library.

Code Comparison

THULAC-Python:

import thulac

# Create a THULAC object
thu = thulac.thulac()

# Perform Chinese word segmentation
text = "我爱北京天安门"
seg_text = thu.cut(text, text=True)
print(seg_text)

nlp_chinese_corpus:

# No code example provided, as the repository is a collection of datasets rather than a library or tool.


README

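THULAC: An Efficient Lexical Analyzer for Chinese

Project Introduction

Strong capability. The model is trained on the corpus the authors integrated, which they describe as currently the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters), giving it strong tagging ability.
High accuracy. On the standard Chinese Treebank (CTB5) dataset, the toolkit reaches an F1 of 97.3% for word segmentation and 92.9% for POS tagging, comparable to the best methods on that dataset.
Fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3 MB/s.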

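Compilation and Installation

Python version (compatible with Python 2.x and Python 3.x)

Download from GitHub (model files must be downloaded separately; see Obtaining models)

Place the thulac folder in your directory and reference it with import thulac
thulac needs model files to work; place the downloaded models in the thulac directory.

Install with pip (model files included)

sudo pip install thulac
Reference it with import thulac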

Usage (fast interface newly added)

1. Word segmentation and POS tagging program

1.1. Interface usage examples

Python version

Code example 1
import thulac

thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京天安门", text=True)  # segment a single sentence
print(text)

Code example 2
thu1 = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
thu1.cut_f("input.txt", "output.txt")  # segment the contents of input.txt and write the result to output.txt
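
When text=True is used as in code example 1, the result is a single string. The helper below is a minimal sketch of turning such a string back into [word, tag] pairs. It assumes the output consists of space-separated tokens of the form word + delimiter + tag, with '_' as the default delimiter; parse_tagged_text is a hypothetical name, not part of thulac, and the sample string is hand-written rather than real model output.

```python
def parse_tagged_text(tagged_text, deli="_"):
    """Split a 'word_tag word_tag ...' string into [[word, tag], ...].

    Assumption: tokens are separated by whitespace and the tag follows
    the last occurrence of the delimiter. Tokens without a delimiter
    (as in segmentation-only output) get an empty tag.
    """
    pairs = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition(deli)
        if not sep:  # no delimiter found: treat the whole token as the word
            word, tag = token, ""
        pairs.append([word, tag])
    return pairs


# Hand-written sample in the assumed format, not actual THULAC output
sample = "我_r 爱_v 北京_ns"
print(parse_tagged_text(sample))
```

Splitting on the last delimiter keeps words that themselves contain the delimiter character intact.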

1.2. Interface parameters

thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False, deli='_') initializes the program with custom settings

user_dict	Sets the user dictionary; words in the user dictionary are tagged uw. One word per line, UTF-8 encoded.
T2S	Default False. Whether to convert the sentence from traditional to simplified Chinese.
seg_only	Default False. Whether to perform segmentation only, without POS tagging.
filt	Default False. Whether to use a filter that removes some words carrying little meaning, such as "可以" ("can").
model_path	Sets the folder containing the model files; defaults to models/.
deli	Default '_'. Sets the delimiter between a word and its POS tag.

cut(sentence, text=False) segments a single sentence

text	Default False. Whether to return text; if not, a two-dimensional array ([[word, tag]..]) is returned. In seg_only mode the tag is an empty string.

cut_f(input_file, output_file) segments a file
run() interactive command-line segmentation (screen input, screen output)
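
The user_dict parameter expects a UTF-8 file with one word per line. As a minimal sketch of preparing such a file, the helper below writes a word list in that layout. write_user_dict and the temporary path are illustrative names, not part of thulac, and the thulac call itself is left as a comment because it needs the package and its models installed.

```python
import os
import tempfile


def write_user_dict(words, path):
    """Write one word per line in UTF-8, skipping blanks and duplicates."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                f.write(word + "\n")
    return path


# Hypothetical word list; the temporary directory is only for illustration
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(["清华大学", "天安门", "清华大学", " "], dict_path)

# With thulac installed, the file could then be passed at construction time:
#   thu = thulac.thulac(user_dict=dict_path)
```

Words loaded this way are tagged uw according to the parameter description above.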

1.3. Command-line usage (pip installation only)

Direct invocation

python -m thulac input.txt output.txt
# reads from input.txt and writes the segmentation and POS tagging results to output.txt

# if only segmentation is needed, add the "seg_only" argument
python -m thulac input.txt output.txt seg_only
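
If the command line needs to be driven from a Python script, the invocation can be assembled as an argument list. This is a small sketch under the assumption that the positional arguments are exactly those shown above; build_thulac_command is a hypothetical helper, and the subprocess call is left commented out because it requires a pip-installed thulac and an existing input file.

```python
import sys


def build_thulac_command(input_path, output_path, seg_only=False):
    """Build the 'python -m thulac ...' argument list shown in the README."""
    cmd = [sys.executable, "-m", "thulac", input_path, output_path]
    if seg_only:
        cmd.append("seg_only")  # segmentation only, no POS tagging
    return cmd


cmd = build_thulac_command("input.txt", "output.txt", seg_only=True)
print(" ".join(cmd[1:]))  # -m thulac input.txt output.txt seg_only

# To actually run it:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Using sys.executable keeps the call tied to the interpreter in which thulac was installed.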

1.4. fast interface

(Download and run make, then place the resulting libthulac.so in the same directory as the models folder.)

cut -> fast_cut, cut_f -> fast_cut_f

2. Obtaining models

Performance comparison of representative word segmentation software

In the Second International Chinese Word Segmentation Bakeoff, four institutions provided test corpora (Academia Sinica, City University, Peking University, Microsoft Research). The evaluation resource icwb2-data contains the training sets and testing sets from these four institutions, together with the gold-standard answers for each test set according to each institution's segmentation standard (icwb2-data/scripts/gold). The icwb2-data/scripts directory contains the Perl script score, which scores segmentation output automatically.

msr_test (560 KB)

Algorithm	Time	Precision	Recall
LTP-3.2.0	3.21s	0.867	0.896
ICTCLAS (2015 version)	0.55s	0.869	0.914
jieba	0.26s	0.814	0.809
THULAC	0.62s	0.877	0.899

pku_test (510 KB)

Algorithm	Time	Precision	Recall
LTP-3.2.0	3.83s	0.960	0.947
ICTCLAS (2015 version)	0.53s	0.939	0.944
jieba	0.23s	0.850	0.784
THULAC	0.51s	0.944	0.908

CNKI_journal.txt (51 MB)

Algorithm	Time	Speed
LTP-3.2.0	348.624s	149.80KB/s
ICTCLAS (2015 version)	106.461s	490.59KB/s
jieba	22.5583s	2314.89KB/s
THULAC	42.625s	1221.05KB/s
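
The accuracy tables list precision and recall but not F1. As a rough aid for comparing the tools on a single number, the sketch below computes F1 as the harmonic mean of the two, using the msr_test figures from the table above. The function and dictionary names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the msr_test table
msr_test = {
    "LTP-3.2.0": (0.867, 0.896),
    "ICTCLAS (2015 version)": (0.869, 0.914),
    "jieba": (0.814, 0.809),
    "THULAC": (0.877, 0.899),
}

f1_by_tool = {name: f1_score(p, r) for name, (p, r) in msr_test.items()}

# Print the tools from highest to lowest F1
for name, f1 in sorted(f1_by_tool.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {f1:.3f}")
```

The same function can be applied to the pku_test rows by swapping in that table's values.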

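POS tag explanations

n/noun np/person name ns/place name ni/organization name nz/other proper noun
m/numeral q/measure word mq/numeral-measure compound t/time word f/direction word s/place word
v/verb a/adjective d/adverb h/prefix component k/suffix component
i/idiom j/abbreviation r/pronoun c/conjunction p/preposition u/auxiliary word y/modal particle
e/interjection o/onomatopoeia g/morpheme w/punctuation x/other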
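
Because the tag set has dedicated tags for names, a simple filter over the [[word, tag], ...] structure returned by cut() can pull out name-like words. The following is a minimal sketch: ENTITY_TAGS and extract_entities are illustrative names, and the sample list is hand-written in the documented return format rather than produced by the model.

```python
# Name-related tags from the tag list above, with English glosses
ENTITY_TAGS = {
    "np": "person name",
    "ns": "place name",
    "ni": "organization name",
    "nz": "other proper noun",
}


def extract_entities(tagged_words):
    """Keep only words whose tag is one of the name-related tags."""
    return [
        (word, ENTITY_TAGS[tag])
        for word, tag in tagged_words
        if tag in ENTITY_TAGS
    ]


# Hand-written sample shaped like cut() output with text=False
sample = [["我", "r"], ["爱", "v"], ["北京", "ns"], ["天安门", "ns"]]
print(extract_entities(sample))
```

This is tag filtering, not a dedicated entity recognizer, so multi-word names split across tokens are not merged.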

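Introduction to THULAC models

The THULAC source code ships with a simple segmentation model, Model_1, which supports word segmentation only. It was trained on the People's Daily segmentation corpus.
The THULAC source code also ships with a joint segmentation and POS tagging model, Model_2, which supports segmentation and POS tagging together. It was trained on the People's Daily segmentation and POS tagging corpus.
We also provide a more complex, complete, and accurate joint segmentation and POS tagging model, Model_3, along with a segmentation word list. This model was trained jointly on multiple corpora (including annotated text from several genres and annotated People's Daily text). Because the model is large, institutions or individuals who need it should fill in "doc/资源申请表.doc" (the resource application form) and send it to thunlp@gmail.com. After review, we will send the resources to the contact person.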

Notes

Implementations in other languages

History

Date	Update
2017-01-17	Released the Python version of THULAC on pip.
2016-09-29	Added the so version of THULAC.
2016-03-31	Added the Python version of THULAC.
2016-01-20	Added the Java version of THULAC.
2016-01-10	Open-sourced the C++ version of the THULAC segmentation tool.

Open-source license

Institutions or individuals who intend to use THULAC for commercial purposes should email thunlp@gmail.com to discuss a technology licensing agreement.
Comments and suggestions on the toolkit are welcome. Please email thunlp@gmail.com.
If you publish papers or obtain research results based on THULAC, please state that you used Tsinghua University's THULAC when publishing the paper or reporting the results, and cite it in the following format:

Chinese: 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC：一个高效的中文词法分析工具包. 2016.
English: Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.

Related papers

Zhongguo Li, Maosong Sun. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.

Authors

Maosong Sun (孙茂松, advisor), Xinxiong Chen (陈新雄, PhD student), Kaixu Zhang (张开旭, master's student), Zhipeng Guo (郭志芃, undergraduate student), Junhua Ma (马骏骅, visiting student), Zhiyuan Liu (刘知远, assistant professor).
