THULAC-Python (thunlp)

An Efficient Lexical Analyzer for Chinese

2,003 stars, 336 forks, 86 watchers

Top Related Projects

  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • Jieba (33,063 stars): "Jieba" Chinese word segmentation
  • pkuseg-python: The pkuseg toolkit for multi-domain Chinese word segmentation
  • nlp_chinese_corpus: Large Scale Chinese Corpus for NLP

Quick Overview

THULAC (THU Lexical Analyzer for Chinese) is a lexical analysis toolkit for Chinese developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It performs Chinese word segmentation and part-of-speech tagging. Person, place, and organization names are marked through the part-of-speech tags np, ns, and ni; the README documents no separate named entity recognition interface.

Pros

  • Joint Segmentation and Tagging: THULAC performs word segmentation and part-of-speech tagging together, and its tag set marks person names (np), place names (ns), and organization names (ni), which covers basic named entity spotting.
  • High Accuracy: According to the README, THULAC reaches an F1 of 97.3% for segmentation and 92.9% for part-of-speech tagging on Chinese Treebank (CTB5).
  • Efficient Performance: The README reports about 300KB/s (roughly 150,000 characters per second) for joint segmentation and tagging, and up to 1.3MB/s for segmentation only.
  • Traditional Chinese Input: The T2S option converts traditional Chinese to simplified Chinese before analysis. Input must be UTF-8 encoded Chinese text.

Cons

  • Limited Customization: Customization is limited to a handful of constructor options (user dictionary, model path, filter, delimiter, traditional-to-simplified conversion). The README documents no training interface for the Python package.
  • Model Files: The pip package ships with model files, but a GitHub checkout needs separately downloaded models, and the larger Model_3 is available only on application by email.
  • Separate Implementations per Language: The C++, Java, and .so versions live in separate repositories, and the Python package's fast interface requires building libthulac.so yourself.
  • Maintenance and Licensing: The README's history ends in January 2017. The source is free for research use, while commercial use requires a license from the Tsinghua University team.

Code Examples

Here are a few examples of how to use the THULAC library in Python:

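import thulac

# Create a THULAC instance (segmentation plus part-of-speech tagging)
thu = thulac.thulac(seg_only=False)

# Segment and tag a sentence; text=True returns one string of word_tag tokens
text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)

# Without text=True, cut() returns a list of [word, tag] pairs.
# Named entities carry the tags np (person), ns (place), ni (organization), nz (other).
# Note: run() takes no arguments; it starts an interactive command-line session.
text = "北京是中国的首都。"
pairs = thu.cut(text)
print([word for word, tag in pairs if tag in ("np", "ns", "ni", "nz")])

# Customize the instance: user dictionary plus traditional-to-simplified conversion
thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut(text, text=True)
print(result)
# Words found in the user dictionary are tagged uw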
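
When text=True is used, the result is a single string of word_tag tokens, which often has to be split back into pairs. The following is a minimal sketch of such a parser using only the standard library. The function name parse_tagged and the sample string are illustrative assumptions, not part of the thulac API.

```python
def parse_tagged(line, deli="_"):
    """Split 'word_tag word_tag ...' text into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST delimiter, so a word that itself
        # contains the delimiter character is kept intact
        word, sep, tag = token.rpartition(deli)
        if not sep:
            # No delimiter present (e.g. seg_only output): the token is the word
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# Illustrative input in the word_tag format, not captured from a real run
sample = "我_r 爱_v 北京_ns 天安门_ns"
pairs = parse_tagged(sample)
print(pairs)
```

If a different delimiter was passed to the constructor through deli, pass the same value to the parser.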

Getting Started

To get started with THULAC-Python, follow these steps:

  1. Install the THULAC library using pip (the pip package includes the model files):
pip install thulac
  2. Import the thulac module and create a thulac instance:
import thulac

# Create a THULAC instance with default settings
thu = thulac.thulac()
  3. Use the cut() method with text=True to get segmentation and part-of-speech tags as one string:
text = "我爱北京天安门。"
result = thu.cut(text, text=True)
print(result)
  4. Call cut() without text=True to get a list of [word, tag] pairs. Named entities appear with the tags np, ns, ni, and nz. (run() takes no arguments; it starts an interactive command-line session.)
text = "北京是中国的首都。"
pairs = thu.cut(text)
print(pairs)
  5. (Optional) Customize the THULAC instance by providing a user dictionary or enabling traditional-to-simplified Chinese conversion:
thu = thulac.thulac(user_dict="path/to/user_dict.txt", T2S=True)
result = thu.cut(text, text=True)
print(result)

That's it! You can now use the THULAC-Python library to perform various Chinese natural language processing tasks in your Python applications.

Competitor Comparisons

HanLP (33,448 stars)

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • Comprehensive Feature Set: HanLP provides a wide range of natural language processing capabilities, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more.
  • Multilingual Support: HanLP supports multiple languages, including Chinese, English, and other languages, making it a versatile tool for diverse applications.
  • Active Development and Community: HanLP has an active development team and a large community of contributors, ensuring regular updates and improvements.

Cons of HanLP

  • Larger Codebase: HanLP has a more extensive codebase compared to THULAC-Python, which may result in a larger memory footprint and slower startup times.
  • Steeper Learning Curve: HanLP's comprehensive feature set and configuration options may present a steeper learning curve for some users, especially those new to natural language processing.
  • Java and Python Version Split: The HanLP 1.x API is written in Java and is used from Python through a wrapper, while HanLP 2.x is a Python package, so users need to pick the version line that fits their stack.

Code Comparison

THULAC-Python:

import thulac

thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)

HanLP (1.x Java API):

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

public class HanLPExample {
    public static void main(String[] args) {
        String text = "我爱北京天安门。";
        for (Term term : HanLP.segment(text)) {
            System.out.print(term.word + " ");
        }
    }
}

Jieba (33,063 stars)

"Jieba" Chinese word segmentation

Pros of Jieba

  • Jieba is a widely used Chinese text segmentation library with a large community.
  • Jieba supports word segmentation, part-of-speech tagging (jieba.posseg), and keyword extraction (jieba.analyse).
  • Jieba is highly customizable, allowing users to add their own dictionaries and adjust word frequencies.

Cons of Jieba

  • Lower benchmark accuracy: in the benchmark reported in the THULAC README, Jieba scores precision 0.814 and recall 0.809 on msr_test, below THULAC's 0.877 and 0.899.
  • Speed is a trade-off, not a deficit: Jieba is the fastest tool in that benchmark (2314.89KB/s versus 1221.05KB/s for THULAC on a 51 MB file), so its lower accuracy is the price of higher throughput.
  • Like THULAC-Python, Jieba targets Chinese text; neither library is suited to other languages.

Code Comparison

Jieba:

import jieba

text = "这是一个测试句子"
words = jieba.cut(text)
print(" ".join(words))

THULAC-Python:

import thulac

text = "这是一个测试句子"
thu = thulac.thulac()  # the class is thulac.thulac; the module has no THULAC name
words = thu.cut(text, text=True)  # returns one string of word_tag tokens
print(words)

Both code snippets perform basic Chinese text segmentation, but THULAC-Python provides additional functionality, such as part-of-speech tagging, out of the box.

pkuseg-python

The pkuseg toolkit for multi-domain Chinese word segmentation

Pros of pkuseg-python

  • Processing Speed: pkuseg-python is sometimes described as faster than THULAC-Python, but this page gives no benchmark for it; measure both on your own data before relying on that.
  • Multi-Domain Models: pkuseg-python is built for multi-domain segmentation, which may improve accuracy on varied text types. No comparative figures are given here.
  • Simple Installation: pkuseg-python installs with pip, as does THULAC-Python.

Cons of pkuseg-python

  • Limited Documentation: The documentation for pkuseg-python may be less detailed than that of THULAC-Python, which can make it harder for new users to get started.
  • Segmentation by Default: THULAC-Python performs part-of-speech tagging by default, whereas the default pkuseg cut() call returns words only.
  • Potential Bias: The dataset used to train the pkuseg-python model may have inherent biases, which could affect the performance on specific types of text.

Code Comparison

THULAC-Python:

import thulac

thu1 = thulac.thulac()
text = "我爱北京天安门。"
words = thu1.cut(text, text=True)
print(words)

pkuseg-python:

import pkuseg

seg = pkuseg.pkuseg()
text = "我爱北京天安门。"
words = seg.cut(text)
print(words)

Both code snippets perform Chinese word segmentation on the same input text, but the API and output format differ slightly between the two libraries.

nlp_chinese_corpus

Large Scale Chinese Corpus for NLP

Pros of nlp_chinese_corpus

  • Provides a comprehensive collection of Chinese language datasets for various NLP tasks, including text classification, named entity recognition, and sentiment analysis.
  • Includes high-quality datasets from reputable sources, making it a valuable resource for researchers and developers.
  • Offers a diverse range of datasets, catering to different domains and applications.

Cons of nlp_chinese_corpus

  • The repository does not provide any pre-trained models or tools for processing the datasets, unlike THULAC-Python.
  • The documentation and usage instructions may not be as detailed or user-friendly as THULAC-Python.
  • The datasets may not be as actively maintained or updated as the THULAC-Python library.

Code Comparison

THULAC-Python:

import thulac

# Create a THULAC object
thu = thulac.thulac()

# Perform Chinese word segmentation
text = "我爱北京天安门"
seg_text = thu.cut(text, text=True)
print(seg_text)

nlp_chinese_corpus:

# No code example provided, as the repository is a collection of datasets rather than a library or tool.

README

THULAC: An Efficient Lexical Analyzer for Chinese

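Project Introduction

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging. THULAC has the following features:

  1. Strong capability. It is trained on what we believe is currently the world's largest manually segmented and part-of-speech tagged Chinese corpus (about 58 million characters), which we assembled ourselves, so the model's labeling ability is strong.
  2. High accuracy. On the standard Chinese Treebank (CTB5) dataset, the toolkit reaches an F1 of 97.3% for word segmentation and 92.9% for part-of-speech tagging, comparable to the best methods on that dataset.
  3. Fairly fast. Joint segmentation and part-of-speech tagging runs at 300KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3MB/s.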
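
The speed figures above can be turned into rough processing-time estimates. The sketch below is plain arithmetic on the README's numbers and assumes 1 KB = 1024 bytes; the function names are illustrative. It also shows that 300KB/s at 150,000 characters per second implies roughly 2 bytes per character.

```python
KB = 1024  # assumption: the README's KB means 1024 bytes
MB = 1024 * KB


def implied_bytes_per_char(bytes_per_second, chars_per_second):
    """Average bytes per character implied by two throughput figures."""
    return bytes_per_second / chars_per_second


def seconds_to_process(size_bytes, bytes_per_second):
    """Estimated wall-clock time for an input of the given size."""
    return size_bytes / bytes_per_second


joint_speed = 300 * KB       # joint segmentation and tagging
seg_only_speed = 1.3 * MB    # segmentation only

ratio = implied_bytes_per_char(joint_speed, 150_000)
joint_seconds = seconds_to_process(10 * MB, joint_speed)
seg_seconds = seconds_to_process(10 * MB, seg_only_speed)
print(f"{ratio:.2f} bytes/char, {joint_seconds:.1f}s joint, {seg_seconds:.1f}s seg-only for 10 MB")
```

These are estimates from the published figures only; real throughput depends on hardware and on whether the fast interface is used.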

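Build and Installation

  • Python version (compatible with Python 2.x and Python 3.x)
    1. Download from GitHub (the model files must be downloaded separately; see "Obtaining the models")

      Place the thulac folder in your project directory and load it with import thulac
      thulac requires model files; put the downloaded models in the thulac directory.
      
    2. Install with pip (model files included)

      sudo pip install thulac
      Load it with import thulac
      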

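Usage (new: fast interface)

1. Word segmentation and part-of-speech tagging

1.1. Interface usage examples

  • Python version

    Code example 1
    import thulac

    thu1 = thulac.thulac()  # default mode
    text = thu1.cut("我爱北京天安门", text=True)  # segment a single sentence
    print(text)
    
    Code example 2
    thu1 = thulac.thulac(seg_only=True)  # segmentation only, no part-of-speech tagging
    thu1.cut_f("input.txt", "output.txt")  # segment the contents of input.txt and write the result to output.txt
    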

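1.2. Interface parameters

  • thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False, deli='_') initializes the analyzer with custom settings

    user_dict      Path to a user dictionary. Words in the user dictionary are tagged uw. One word per line, UTF-8 encoded.
    T2S            Default False. Whether to convert the sentence from traditional to simplified Chinese.
    seg_only       Default False. Whether to perform segmentation only, without part-of-speech tagging.
    filt           Default False. Whether to use a filter to remove some words that carry little meaning, such as "可以".
    model_path     Folder containing the model files. Defaults to models/
    deli           Default '_'. Delimiter placed between a word and its part-of-speech tag.
    rm_space       Default False. Whether to remove spaces from the original text before segmentation.
    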
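
The user_dict option expects a UTF-8 file with one word per line. A minimal sketch of producing such a file with the standard library follows. The helper name write_user_dict and the sample words are assumptions for illustration; the resulting path would then be passed as thulac.thulac(user_dict=dict_path).

```python
import os
import tempfile


def write_user_dict(path, words):
    """Write one word per line in UTF-8, skipping blanks and duplicates.

    Returns the number of words written.
    """
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                f.write(word + "\n")
    return len(seen)


# Illustrative words; a temporary directory keeps the example self-contained
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
count = write_user_dict(dict_path, ["清华大学", "自然语言处理", "清华大学", " "])
print(count, dict_path)
```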

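  • cut(input_text, text=False) segments a single sentence

    text           Default False. Whether to return text. If False, a two-dimensional array is returned ([[word, tag]..]); in seg_only mode the tag is an empty string.
    
  • cut_f(input_file, output_file) segments a file

  • run() interactive command-line segmentation (reads from the screen, writes to the screen)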

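1.3. Running from the command line (pip installation only)

Direct invocation

python -m thulac input.txt output.txt
# Reads from input.txt and writes the segmentation and part-of-speech tagging results to output.txt

# For segmentation only, add the "seg_only" argument
python -m thulac input.txt output.txt seg_only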
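
To drive the command-line form from a script, the argument list can be assembled programmatically. This is a minimal sketch: build_thulac_command is a hypothetical helper, and only the argument order shown above is taken from the README. Running the command requires thulac to be installed with pip.

```python
import sys


def build_thulac_command(input_path, output_path, seg_only=False):
    """Build the argv list for `python -m thulac input output [seg_only]`."""
    cmd = [sys.executable, "-m", "thulac", input_path, output_path]
    if seg_only:
        # The README passes seg_only as a bare positional argument
        cmd.append("seg_only")
    return cmd


cmd = build_thulac_command("input.txt", "output.txt", seg_only=True)
print(cmd)

# To execute it (requires a pip-installed thulac):
# import subprocess
# subprocess.run(cmd, check=True)
```

Using sys.executable keeps the call in the same Python environment as the calling script.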

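1.4. fast interface

(After downloading and running make, place the resulting libthulac.so in the same directory as the models folder)

Two functions implement the fast interface. Only the function names change; the parameters are the same as for the ordinary functions.

cut -> fast_cut, cut_f -> fast_cut_f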

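2. Obtaining the models

THULAC requires segmentation and part-of-speech tagging models. To obtain them, visit thulac.thunlp.org, fill in your details, and download the models. Then place them in the THULAC root directory, or use the model_path parameter to specify their location.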

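Performance Comparison with Representative Segmentation Software

We compare THULAC with representative Chinese segmentation software, including LTP, ICTCLAS, and Jieba. We used Windows as the test environment and measured the speed and accuracy of each tool following the evaluation standard published by the Second International Chinese Word Segmentation Bakeoff.

In the Second International Chinese Word Segmentation Bakeoff, four organizations provided test corpora (Academia Sinica, City University, Peking University, Microsoft Research). The evaluation resource icwb2-data contains the training and testing sets from these four organizations, along with the gold-standard answers for each test set according to each organization's segmentation standard (icwb2-data/scripts/gold). The icwb2-data/scripts directory contains the perl script score, which scores segmentation output automatically.

We tested several popular segmentation tools and THULAC in the same environment, using the model bundled with each tool. THULAC used Model_1, the simple model shipped with the software. The test machine was an Intel Core i5 2.4 GHz. The results are as follows: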

msr_test (560KB)

Algorithm         Time    Precision   Recall
LTP-3.2.0         3.21s   0.867       0.896
ICTCLAS (2015)    0.55s   0.869       0.914
jieba             0.26s   0.814       0.809
THULAC            0.62s   0.877       0.899

pku_test (510KB)

Algorithm         Time    Precision   Recall
LTP-3.2.0         3.83s   0.960       0.947
ICTCLAS (2015)    0.53s   0.939       0.944
jieba             0.23s   0.850       0.784
THULAC            0.51s   0.944       0.908
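
The tables report precision and recall but not F1. A minimal sketch, assuming the standard harmonic-mean definition F1 = 2PR / (P + R), derives it from the msr_test figures; the dictionary below simply restates those numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs from the msr_test table
msr_test = {
    "LTP-3.2.0": (0.867, 0.896),
    "ICTCLAS (2015)": (0.869, 0.914),
    "jieba": (0.814, 0.809),
    "THULAC": (0.877, 0.899),
}

msr_f1 = {name: f1_score(p, r) for name, (p, r) in msr_test.items()}
for name, score in msr_f1.items():
    print(f"{name}: F1 = {score:.3f}")
```

The same function applies unchanged to the pku_test rows.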

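In addition to the evaluation on the standard test sets above, we also measured the speed of each segmentation tool on a large dataset. The results are as follows:

CNKI_journal.txt (51 MB)

Algorithm         Time        Speed
LTP-3.2.0         348.624s    149.80KB/s
ICTCLAS (2015)    106.461s    490.59KB/s
jieba             22.5583s    2314.89KB/s
THULAC            42.625s     1221.05KB/s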

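Part-of-Speech Tags

n/noun np/person name ns/place name ni/organization name nz/other proper noun
m/numeral q/measure word mq/numeral-measure compound t/time word f/direction word s/location word
v/verb a/adjective d/adverb h/prefix k/suffix
i/idiom j/abbreviation r/pronoun c/conjunction p/preposition u/particle y/modal particle
e/interjection o/onomatopoeia g/morpheme w/punctuation x/other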
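
Since the tag set marks proper nouns with np, ns, ni, and nz, named entities can be picked out of cut() output by filtering on those tags. The sketch below assumes the [[word, tag], ...] structure described for cut(); the sample list is illustrative rather than captured from a real run, and the helper name is hypothetical.

```python
# Proper-noun tags from the tag list above
NAMED_ENTITY_TAGS = {
    "np": "person name",
    "ns": "place name",
    "ni": "organization name",
    "nz": "other proper noun",
}


def extract_named_entities(tagged):
    """Return (word, description) for every pair whose tag is a proper-noun tag."""
    return [
        (word, NAMED_ENTITY_TAGS[tag])
        for word, tag in tagged
        if tag in NAMED_ENTITY_TAGS
    ]


# Illustrative data shaped like the two-dimensional array returned by cut()
tagged = [["我", "r"], ["爱", "v"], ["北京", "ns"], ["天安门", "ns"]]
entities = extract_named_entities(tagged)
print(entities)
```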

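THULAC Models

  1. The THULAC source code ships with Model_1, a simple segmentation model that supports segmentation only. It was trained on the People's Daily segmentation corpus.

  2. The THULAC source code ships with Model_2, a joint segmentation and part-of-speech tagging model that supports both tasks at once. It was trained on the People's Daily segmentation and part-of-speech tagging corpus.

  3. We also provide Model_3, a more complex, complete, and accurate joint segmentation and part-of-speech tagging model, together with a segmentation vocabulary. It was trained jointly on multiple corpora (including annotated text from multiple genres and annotated People's Daily text). Because the model is large, institutions or individuals who need it should fill out "doc/资源申请表.doc" (the resource application form) and send it to thunlp@gmail.com. After review, we will send the resources to the contact person.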

Notes

The tool currently handles only UTF-8 encoded Chinese text. Support for other encodings will be added gradually.
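
Because only UTF-8 input is handled, text in another encoding needs converting first. A minimal sketch with the standard library follows. It assumes the source file is in a GB-family encoding (gb18030 also decodes GBK and GB2312); the function name and file names are illustrative.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file to UTF-8, line by line."""
    with open(src_path, "r", encoding=src_encoding) as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)


# Self-contained demonstration: create a GBK file, then convert it
workdir = tempfile.mkdtemp()
gbk_path = os.path.join(workdir, "input_gbk.txt")
utf8_path = os.path.join(workdir, "input.txt")
with open(gbk_path, "w", encoding="gbk") as f:
    f.write("我爱北京天安门\n")

convert_to_utf8(gbk_path, utf8_path)
with open(utf8_path, encoding="utf-8") as f:
    converted = f.read()
print(converted)
```

The converted file can then be passed to cut_f or to the command-line form.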

Other Language Implementations

THULAC (C++ version)

https://github.com/thunlp/THULAC

THULAC (Java version)

https://github.com/thunlp/THULAC-Java

THULAC (so version)

https://github.com/thunlp/THULAC.so

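History

Date          Update
2017-01-17    Released the Python version of THULAC on pip.
2016-09-29    Added the so version of THULAC.
2016-03-31    Added the Python version of THULAC.
2016-01-20    Added the Java version of THULAC.
2016-01-10    Open-sourced the C++ version of THULAC.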

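License

  1. The THULAC source code is freely available to universities, research institutes, enterprises, and individuals at home and abroad for research purposes.
  2. Institutions or individuals who intend to use THULAC for commercial purposes should email thunlp@gmail.com to negotiate a technology license agreement.
  3. Comments and suggestions on the toolkit are welcome. Please email thunlp@gmail.com.
  4. If you publish a paper or obtain research results based on THULAC, please state "使用了清华大学THULAC" ("used THULAC from Tsinghua University") in the paper or results declaration, and cite it in the following format:
  • Chinese: 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC:一个高效的中文词法分析工具包. 2016.

  • English: Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.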

Related Papers

  • Zhongguo Li, Maosong Sun. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.

Authors

Maosong Sun (孙茂松, advisor), Xinxiong Chen (陈新雄, Ph.D. student), Kaixu Zhang (张开旭, master's student), Zhipeng Guo (郭志芃, undergraduate student), Junhua Ma (马骏骅, visiting student), Zhiyuan Liu (刘知远, assistant professor).