Top Related Projects
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified-traditional conversion, natural language processing
结巴中文分词 (Jieba Chinese word segmentation)
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation
大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Quick Overview
THULAC (THU Lexical Analyzer for Chinese) is a lexical analyzer for Chinese developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University. It provides word segmentation and part-of-speech tagging for Chinese text; its tag set includes named-entity categories (person, place, and organization names), from which named entities can be recovered.
Pros
- Comprehensive Functionality: THULAC offers a wide range of natural language processing capabilities, including word segmentation, part-of-speech tagging, and named entity recognition, making it a versatile tool for Chinese text analysis.
- High Accuracy: The model has been trained on large-scale Chinese corpora, resulting in high accuracy for various NLP tasks.
- Efficient Performance: THULAC is designed to be efficient, with fast processing times for real-world applications.
- Traditional Chinese Handling: the library can convert traditional-Chinese input to simplified Chinese via its T2S option.
Cons
- Limited Customization: The library may not provide extensive customization options, which could be a limitation for users with specific requirements.
- Dependency on External Resources: THULAC relies on pre-trained models and external resources, which may require additional setup and configuration.
- Separate Codebases per Language: this repository covers only the Python interface; the official C++ and Java implementations are maintained as separate projects, which can add friction for teams working across languages.
- Potential Maintenance Challenges: As an open-source project, the long-term maintenance and updates of THULAC may depend on the continued involvement of the Tsinghua University team.
Code Examples
Here are a few examples of how to use the THULAC library in Python:
import thulac
# Create a THULAC instance
thu = thulac.thulac(seg_only=False)
# Perform word segmentation and part-of-speech tagging
text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)
# Output: '我_r 爱_v 北京_ns 天安门_ns 。_w'
# Extract named entities (THULAC has no dedicated NER call; entities
# surface as np/ns/ni/nz tags in the cut() output)
text = "北京是中国的首都。"
pairs = thu.cut(text)  # returns [[word, tag], ...]
print([(w, t) for w, t in pairs if t in ("np", "ns", "ni", "nz")])
# e.g. [('北京', 'ns'), ('中国', 'ns')]
# Customize the THULAC instance
thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut("北京是中国的首都。", text=True)
print(result)
# e.g. '北京_ns 是_v 中国_ns 的_u 首都_n 。_w'
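Since cut(..., text=True) returns a plain delimited string, downstream code often needs it back as pairs. The sketch below uses parse_tagged, a hypothetical helper written for this document (not part of the THULAC API), assuming the default '_' delimiter:

```python
def parse_tagged(tagged, deli="_"):
    """Split THULAC's text-mode output (e.g. '我_r 爱_v') back into
    (word, tag) pairs; deli mirrors the deli= constructor parameter."""
    return [tuple(token.rsplit(deli, 1)) for token in tagged.split()]

print(parse_tagged("我_r 爱_v 北京_ns 天安门_ns 。_w"))
# [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns'), ('。', 'w')]
```

Using rsplit with a maxsplit of 1 keeps any delimiter characters inside the word itself intact.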
Getting Started
To get started with THULAC-Python, follow these steps:
- Install the THULAC library using pip:
pip install thulac
- Import the thulac module and create a thulac instance:
import thulac
# Create a THULAC instance with default settings
thu = thulac.thulac()
- Use the cut() method to perform word segmentation and part-of-speech tagging:
text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)
- To extract named entities, filter the named-entity tags (np/ns/ni/nz) from the cut() output; note that run() starts interactive command-line segmentation rather than performing NER:
text = "北京是中国的首都。"
entities = [(w, t) for w, t in thu.cut(text) if t in ("np", "ns", "ni", "nz")]
print(entities)
- (Optional) Customize the THULAC instance by providing a user dictionary or enabling traditional-to-simplified Chinese conversion:
thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut(text, text=True)
print(result)
That's it! You can now use the THULAC-Python library to perform various Chinese natural language processing tasks in your Python applications.
Competitor Comparisons
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified-traditional conversion, natural language processing
Pros of HanLP
- Comprehensive Feature Set: HanLP provides a wide range of natural language processing capabilities, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more.
- Multilingual Support: HanLP supports multiple languages, including Chinese, English, and other languages, making it a versatile tool for diverse applications.
- Active Development and Community: HanLP has an active development team and a large community of contributors, ensuring regular updates and improvements.
Cons of HanLP
- Larger Codebase: HanLP has a more extensive codebase compared to THULAC-Python, which may result in a larger memory footprint and slower startup times.
- Steeper Learning Curve: HanLP's comprehensive feature set and configuration options may present a steeper learning curve for some users, especially those new to natural language processing.
- Dependency on Java: HanLP is primarily written in Java, which may be a limitation for users who prefer to work in Python or other programming languages.
Code Comparison
THULAC-Python:
import thulac
thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)
HanLP:
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

public class HanLPExample {
    public static void main(String[] args) {
        String text = "我爱北京天安门。";
        for (Term term : HanLP.segment(text)) {
            System.out.print(term.word + " ");
        }
    }
}
结巴中文分词 (Jieba Chinese word segmentation)
Pros of Jieba
- Jieba is a widely-used and well-maintained Chinese text segmentation library, with a large and active community.
- Jieba supports a variety of use cases, including word segmentation, part-of-speech tagging, and named entity recognition.
- Jieba is highly customizable, allowing users to add their own dictionaries and adjust the segmentation algorithm.
Cons of Jieba
- Jieba, like THULAC, is focused on Chinese text processing; neither is intended for other languages.
- Jieba's segmentation accuracy trails THULAC's on standard benchmark sets (e.g. the MSR and PKU bakeoff test sets below), although it is considerably faster.
- Jieba's default model is dictionary- plus HMM-based, whereas THULAC ships models trained on a large manually annotated corpus.
Code Comparison
Jieba:
import jieba
text = "这是一个测试句子"
words = jieba.cut(text)
print(" ".join(words))
THULAC-Python:
import thulac
text = "这是一个测试句子"
thu = thulac.thulac(seg_only=True)
words = thu.cut(text, text=True)  # already a space-separated string
print(words)
Both code snippets perform basic Chinese text segmentation, but THULAC-Python provides additional functionality, such as part-of-speech tagging, out of the box.
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation
Pros of pkuseg-python
- Faster Performance: pkuseg-python is reported to be faster than THULAC-Python in terms of processing speed, making it more efficient for large-scale text processing tasks.
- Improved Accuracy: The pkuseg-python model is trained on a larger and more diverse dataset, which can lead to better performance on a wider range of text types.
- Easier Installation: pkuseg-python has a simpler installation process, with fewer dependencies, making it more accessible for users.
Cons of pkuseg-python
- Limited Documentation: The documentation for pkuseg-python is not as comprehensive as that of THULAC-Python, which may make it more challenging for new users to get started.
- Fewer Features: THULAC-Python offers a wider range of features, such as part-of-speech tagging and named entity recognition, which may be important for certain applications.
- Potential Bias: The dataset used to train the pkuseg-python model may have inherent biases, which could affect the performance on specific types of text.
Code Comparison
THULAC-Python:
import thulac
thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)
pkuseg-python:
import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京天安门。"
words = seg.cut(text)
print(words)
Both code snippets perform Chinese word segmentation on the same input text, but the API and output format differ slightly between the two libraries.
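To smooth over that difference, results from both libraries can be normalized to a bare word list. to_words below is a hypothetical helper written for this comparison, assuming thulac's default [[word, tag], ...] shape and pkuseg's plain list of strings:

```python
def to_words(result):
    """Reduce either segmenter's result shape to a flat list of words:
    thulac-style [[word, tag], ...] or pkuseg-style ['word', ...]."""
    return [item[0] if isinstance(item, (list, tuple)) else item
            for item in result]

print(to_words([["我", "r"], ["爱", "v"]]))  # thulac-style input
print(to_words(["我", "爱"]))                # pkuseg-style input
# both print ['我', '爱']
```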
大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Pros of nlp_chinese_corpus
- Provides a comprehensive collection of Chinese language datasets for various NLP tasks, including text classification, named entity recognition, and sentiment analysis.
- Includes high-quality datasets from reputable sources, making it a valuable resource for researchers and developers.
- Offers a diverse range of datasets, catering to different domains and applications.
Cons of nlp_chinese_corpus
- The repository does not provide any pre-trained models or tools for processing the datasets, unlike THULAC-Python.
- The documentation and usage instructions may not be as detailed or user-friendly as THULAC-Python.
- The datasets may not be as actively maintained or updated as the THULAC-Python library.
Code Comparison
THULAC-Python:
import thulac
# Create a THULAC object
thu = thulac.thulac()
# Perform Chinese word segmentation
text = "我爱北京天安门"
seg_text = thu.cut(text, text=True)
print(seg_text)
nlp_chinese_corpus:
# No code example provided, as the repository is a collection of datasets rather than a library or tool.
README
THULAC: An Efficient Chinese Lexical Analysis Toolkit
Contents
- Project Introduction
- Compilation and Installation
- Usage
- Performance Comparison with Other Segmenters
- Part-of-Speech Tags
- THULAC Models
- Notes
- Implementations in Other Languages
- History
- Open Source License
- Related Papers
- Authors
Project Introduction
THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed and released by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University, providing Chinese word segmentation and part-of-speech tagging. THULAC has the following features:
- Strong annotation ability. The models are trained on the largest manually segmented and POS-tagged Chinese corpus integrated to date (about 58 million characters), giving them strong labeling ability.
- High accuracy. On the standard Chinese Treebank (CTB5) dataset, the toolkit reaches an F1 of 97.3% for segmentation and 92.9% for POS tagging, comparable to the best methods on that dataset.
- Fast. Joint segmentation and POS tagging runs at 300KB/s, about 150,000 characters per second; segmentation alone reaches 1.3MB/s.
Compilation and Installation
- Python version (compatible with both Python 2.x and Python 3.x)
- Download from GitHub (model files must be downloaded separately; see "Getting the models"): place the thulac directory on your path and use it via import thulac. THULAC requires model support, so put the downloaded models inside the thulac directory.
- Install via pip (model files included): sudo pip install thulac, then use it via import thulac.
Usage (fast interface added)
1. Segmentation and POS tagging
1.1. Interface examples
- Python version
Code example 1:
import thulac
thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京天安门", text=True)  # segment one sentence
print(text)
Code example 2:
thu1 = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
thu1.cut_f("input.txt", "output.txt")  # segment the contents of input.txt and write to output.txt
1.2. Interface parameters
- thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False, deli='_')
Initializes the analyzer with custom settings:
user_dict: path to a user dictionary; words from it are tagged uw. One word per line, UTF-8 encoded.
model_path: folder containing the model files; defaults to models/.
T2S: default False; whether to convert the input sentence from traditional to simplified Chinese.
seg_only: default False; whether to perform segmentation only, without POS tagging.
filt: default False; whether to use the filter to remove semantically weak words such as "可以".
deli: default '_'; the separator between a word and its POS tag.
rm_space: default False; whether to strip spaces from the original text before segmentation.
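The user-dictionary format is simple enough to generate programmatically: one word per line, UTF-8 encoded, with matched words tagged uw at analysis time. A minimal sketch (the file name and word list are arbitrary examples):

```python
# Write a THULAC-style user dictionary: one word per line, UTF-8 encoded.
custom_words = ["天安门广场", "词法分析"]
with open("user_dict.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(custom_words) + "\n")

# The file can then be passed at initialization:
#   thu = thulac.thulac(user_dict="user_dict.txt")
```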
- cut(text, text=False)
Segments one sentence. The text flag (default False) controls whether a string is returned; if False, cut returns a two-dimensional list ([[word, tag], ...]); in seg_only mode the tag is an empty string.
- cut_f(input_file, output_file)
Segments a file.
- run()
Interactive command-line segmentation (reads from and writes to the screen).
1.3. Running from the command line (requires pip installation)
Invoke directly:
python -m thulac input.txt output.txt
# reads from input.txt and writes the segmentation and POS tagging results to output.txt
# for segmentation only, add the "seg_only" argument:
python -m thulac input.txt output.txt seg_only
1.4. The fast interface
(Download and run make to obtain libthulac.so, then place it in the same directory as the models folder.)
Two functions implement the fast interface; only the function names change, and the parameters are the same as for the ordinary functions:
cut -> fast_cut, cut_f -> fast_cut_f
2. Getting the models
THULAC requires segmentation and POS tagging models. Users can download them by registering and filling in personal information at thulac.thunlp.org, then placing them in THULAC's root directory, or by specifying the model location with the model_path parameter.
Performance Comparison with Representative Segmentation Tools
We compared THULAC with representative domestic segmentation tools: LTP, ICTCLAS, and Jieba. Windows was used as the test environment, and speed and accuracy were measured following the evaluation standard of the Second International Chinese Word Segmentation Bakeoff.
The Second International Chinese Word Segmentation Bakeoff used test corpora from four institutions (Academia Sinica, City University, Peking University, Microsoft Research). The released icwb2-data resource contains the training and test sets from these four institutions, the gold-standard answers for each test set under each institution's segmentation standard (icwb2-data/scripts/gold), and score, a perl script for automatic scoring, in icwb2-data/scripts.
In a unified test environment we evaluated several popular segmentation tools alongside THULAC, each using its bundled model; THULAC used Model_1, the simple model shipped with the software. The test machine was an Intel Core i5 2.4 GHz. Results:
msr_test (560KB)
Algorithm | Time | Precision | Recall |
---|---|---|---|
LTP-3.2.0 | 3.21s | 0.867 | 0.896 |
ICTCLAS (2015) | 0.55s | 0.869 | 0.914 |
jieba | 0.26s | 0.814 | 0.809 |
THULAC | 0.62s | 0.877 | 0.899 |
pku_test (510KB)
Algorithm | Time | Precision | Recall |
---|---|---|---|
LTP-3.2.0 | 3.83s | 0.960 | 0.947 |
ICTCLAS (2015) | 0.53s | 0.939 | 0.944 |
jieba | 0.23s | 0.850 | 0.784 |
THULAC | 0.51s | 0.944 | 0.908 |
Beyond the evaluations on the standard test sets above, we also measured each tool's speed on large data. Results:
CNKI_journal.txt (51 MB)
Algorithm | Time | Speed |
---|---|---|
LTP-3.2.0 | 348.624s | 149.80KB/s |
ICTCLAS (2015) | 106.461s | 490.59KB/s |
jieba | 22.5583s | 2314.89KB/s |
THULAC | 42.625s | 1221.05KB/s |
Part-of-Speech Tags
n/noun np/person name ns/place name ni/organization name nz/other proper noun
m/numeral q/measure word mq/quantifier t/time word f/locative s/place word
v/verb a/adjective d/adverb h/prefix component k/suffix component
i/idiom j/abbreviation r/pronoun c/conjunction p/preposition u/particle y/modal particle
e/interjection o/onomatopoeia g/morpheme w/punctuation x/other
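Since the named-entity categories are just POS tags (np, ns, ni, nz), entity extraction reduces to filtering tagged output. A small sketch using extract_entities, a hypothetical helper written for this document (not part of the THULAC API):

```python
def extract_entities(tagged, deli="_"):
    """Collect (word, tag) pairs whose tag is a named-entity category
    (np/ns/ni/nz) from THULAC text-mode output such as '我_r 北京_ns'."""
    entities = []
    for token in tagged.split():
        word, _, tag = token.rpartition(deli)
        if tag in ("np", "ns", "ni", "nz"):
            entities.append((word, tag))
    return entities

print(extract_entities("我_r 爱_v 北京_ns 天安门_ns 。_w"))
# [('北京', 'ns'), ('天安门', 'ns')]
```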
THULAC Models
- The THULAC source code ships with Model_1, a simple segmentation model supporting segmentation only, trained on the People's Daily segmentation corpus.
- The source code also ships with Model_2, a joint segmentation and POS tagging model supporting both tasks at once, trained on the People's Daily segmentation and POS tagging corpus.
- We additionally provide Model_3, a more elaborate, complete, and accurate joint segmentation and POS tagging model, together with a segmentation lexicon. It is trained jointly on multiple corpora, including annotated text from several genres and annotated People's Daily text. Because the model is large, institutions or individuals who need it should fill in "doc/资源申请表.doc" and send it to thunlp@gmail.com; after review, we will send the resources to the contact person.
Notes
The tool currently handles only UTF-8 encoded Chinese text; support for other encodings will be added gradually.
Implementations in Other Languages
THULAC (C++ version)
https://github.com/thunlp/THULAC
THULAC (Java version)
https://github.com/thunlp/THULAC-Java
THULAC (.so version)
https://github.com/thunlp/THULAC.so
History
Date | Update |
---|---|
2017-01-17 | Released the Python version of THULAC on pip. |
2016-09-29 | Added the .so version of THULAC. |
2016-03-31 | Added the Python version of THULAC. |
2016-01-20 | Added the Java version of THULAC. |
2016-01-10 | Open-sourced the C++ version of THULAC. |
Open Source License
- THULAC's source code is freely available to universities, research institutes, enterprises, and individuals worldwide for research purposes.
- Institutions or individuals intending to use THULAC for commercial purposes should email thunlp@gmail.com to discuss a technical licensing agreement.
- Comments and suggestions on the toolkit are welcome; please email thunlp@gmail.com.
- If you publish papers or obtain research results based on THULAC, please state in your publications and award applications that you "used Tsinghua University's THULAC", and cite it in the following format:
- Chinese: 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC：一个高效的中文词法分析工具包. 2016.
- English: Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.
Related Papers
- Zhongguo Li, Maosong Sun. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.
Authors
Maosong Sun (孙茂松, advisor), Xinxiong Chen (陈新雄, PhD student), Kaixu Zhang (张开旭, MS student), Zhipeng Guo (郭志芃, undergraduate), Junhua Ma (马骏骅, visiting student), Zhiyuan Liu (刘知远, assistant professor).