
baidu/lac

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, and word importance


Top Related Projects

  • jieba (33,063 stars): "Jieba" Chinese word segmentation
  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • SnowNLP (6,397 stars): Python library for processing Chinese text
  • ltp (4,924 stars): Language Technology Platform
  • cppjieba: The C++ version of the "Jieba" Chinese word segmenter

Quick Overview

Baidu LAC (Lexical Analysis of Chinese) is an open-source project for Chinese lexical analysis. It provides efficient and accurate word segmentation, part-of-speech tagging, and named entity recognition for Chinese text using deep learning techniques.

Pros

  • High accuracy and performance in Chinese language processing tasks
  • Supports both Python and C++ interfaces for flexibility
  • Includes pre-trained models for immediate use
  • Offers customization options for specific domain applications

Cons

  • Limited documentation, especially for advanced usage
  • Primarily focused on Chinese language, limiting its use for other languages
  • Requires some understanding of deep learning concepts for optimal use
  • May have a steeper learning curve compared to simpler NLP tools

Code Examples

  1. Basic word segmentation:

from LAC import LAC

lac = LAC(mode='seg')
text = "百度是一家高科技公司"
result = lac.run(text)
print(result)
# Output: ['百度', '是', '一家', '高科技', '公司']

  2. Part-of-speech tagging:

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
# Output: (['我', '爱', '北京', '天安门'], ['r', 'v', 'LOC', 'LOC'])
# (tags follow LAC's tag set; exact segmentation and tags may vary by model version)

  3. Named entity recognition:

lac = LAC(mode='lac')
text = "李小明在北京大学读书"
result = lac.run(text)
print(result)
# Output: (['李小明', '在', '北京大学', '读书'], ['PER', 'p', 'ORG', 'v'])

Getting Started

To use Baidu LAC, follow these steps:

  1. Install the package:

    pip install lac
    
  2. Import and initialize LAC:

    from LAC import LAC
    lac = LAC(mode='lac')
    
  3. Process text:

    text = "百度是一家高科技公司"
    result = lac.run(text)
    print(result)
    

For more advanced usage, refer to the project's GitHub repository and documentation.

Competitor Comparisons

jieba (33,063 stars): "Jieba" Chinese word segmentation

Pros of jieba

  • More widely adopted and mature, with a larger community and ecosystem
  • Easier to use and integrate, with simpler API and installation process
  • Supports customization of dictionaries and user-defined words

Cons of jieba

  • Generally slower performance compared to LAC
  • Less accurate for some specific domains or complex sentences
  • Lacks built-in named entity recognition; its part-of-speech tagging (via jieba.posseg) is more basic than LAC's joint model

Code Comparison

jieba:

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

LAC:

from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("我来到北京清华大学")
print("Segmentation result:", seg_result)

Both libraries provide Chinese word segmentation functionality, but LAC offers more advanced features and potentially better performance for specific use cases. jieba is easier to use and has a larger community, making it a popular choice for general-purpose segmentation tasks. The choice between the two depends on the specific requirements of your project, such as accuracy needs, performance constraints, and desired features.
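
To illustrate the dictionary customization noted above, here is a minimal jieba sketch. jieba.add_word and jieba.load_userdict are part of jieba's public API; userdict.txt is a hypothetical example file:

import jieba

# Register a single custom word at runtime so it is kept as one token
jieba.add_word("高科技公司")

# Or load a whole user dictionary: one entry per line, "word [freq] [tag]"
# jieba.load_userdict("userdict.txt")  # hypothetical example file

print("/".join(jieba.cut("百度是一家高科技公司")))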

HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • More comprehensive NLP toolkit with wider range of features
  • Supports multiple languages beyond just Chinese
  • Active community and frequent updates

Cons of HanLP

  • Larger codebase and dependencies may increase complexity
  • Potentially slower performance for basic Chinese NLP tasks
  • Steeper learning curve for beginners

Code Comparison

HanLP:

from pyhanlp import *

text = "我爱北京天安门"
print(HanLP.segment(text))

LAC:

from LAC import LAC

lac = LAC(mode='seg')
text = "我爱北京天安门"
print(lac.run(text))
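
Note that HanLP.segment above returns Term objects rather than plain strings. A minimal sketch of unpacking them, assuming pyhanlp's documented term.word and term.nature attributes:

from pyhanlp import HanLP

# Each Term carries the token and its part-of-speech tag (its "nature")
for term in HanLP.segment("我爱北京天安门"):
    print(term.word, term.nature)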

Key Differences

  • HanLP offers a more extensive set of NLP tools and language support
  • LAC focuses specifically on Chinese language processing
  • HanLP may require more setup and configuration
  • LAC provides a simpler API for basic Chinese NLP tasks

Use Cases

  • Choose HanLP for multi-language or advanced NLP projects
  • Opt for LAC for straightforward Chinese text segmentation and POS tagging

Community and Support

  • HanLP has a larger community and more frequent updates
  • LAC benefits from Baidu's backing and specialized Chinese NLP expertise

SnowNLP (6,397 stars): Python library for processing Chinese text

Pros of SnowNLP

  • Broader range of NLP tasks including sentiment analysis and text summarization
  • Simpler installation process with fewer dependencies
  • More lightweight and suitable for smaller projects or quick prototyping

Cons of SnowNLP

  • Less accurate for complex Chinese language processing tasks
  • Not as actively maintained or updated as LAC
  • Limited support for advanced features like custom model training

Code Comparison

SnowNLP:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)        # Word segmentation
print(list(s.tags))   # Part-of-speech tagging (s.tags is a lazy generator)
print(s.sentiments)   # Sentiment analysis: probability the text is positive

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "这是一个测试句子"
result = lac.run(text)
print(result)  # Word segmentation and part-of-speech tagging

SnowNLP offers a more straightforward API for various NLP tasks, while LAC focuses on providing more accurate results for word segmentation and part-of-speech tagging in Chinese. LAC is better suited for production environments requiring high accuracy in Chinese language processing, whereas SnowNLP is more versatile for quick NLP experiments across different tasks.

ltp (4,924 stars): Language Technology Platform

Pros of ltp

  • More comprehensive NLP toolkit with additional tasks like dependency parsing and semantic role labeling
  • Supports both Python and C++ interfaces for flexibility
  • Provides pre-trained models in several sizes, allowing a speed/accuracy trade-off

Cons of ltp

  • Larger model size and potentially slower processing speed
  • May require more system resources due to its comprehensive nature
  • Less frequent updates compared to lac

Code Comparison

lac usage:

from LAC import LAC
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

ltp usage:

from ltp import LTP
ltp = LTP()
text = "我爱北京天安门"
result = ltp.pipeline(text, tasks=["cws", "pos", "ner"])
print(result)

Both repositories provide Chinese language processing tools, but ltp offers a more comprehensive suite of NLP tasks. lac focuses primarily on word segmentation, part-of-speech tagging, and named entity recognition, making it potentially faster and more lightweight. ltp, on the other hand, includes additional capabilities like dependency parsing and semantic role labeling, but may require more resources. The code examples demonstrate the simplicity of use for both libraries, with lac having a slightly more straightforward API for basic tasks.

"结巴"中文分词的C++版本

Pros of cppjieba

  • Lightweight and easy to integrate into C++ projects
  • Supports multiple segmentation modes (e.g., MPSegment, HMMSegment)
  • Provides a user dictionary feature for customization

Cons of cppjieba

  • Limited to Chinese language segmentation only
  • May have lower accuracy compared to more advanced models like LAC
  • Less actively maintained (last update was in 2020)

Code Comparison

cppjieba:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

// dict_path, hmm_path, and user_dict_path point to the dictionary files shipped with cppjieba
cppjieba::Jieba jieba(dict_path, hmm_path, user_dict_path);
std::vector<std::string> words;
jieba.Cut(sentence, words, true);  // true enables the HMM model for out-of-vocabulary words

LAC:

from LAC import LAC
lac = LAC(mode='lac')
seg_result = lac.run("百度是一家高科技公司")

Key Differences

  • Language: cppjieba is written in C++, while LAC is primarily Python-based
  • Functionality: LAC offers more advanced NLP features beyond segmentation
  • Performance: LAC may provide better accuracy, especially for complex sentences
  • Integration: cppjieba is easier to integrate into C++ projects, while LAC is more suitable for Python environments


README

Introduction

LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's NLP department. It performs Chinese word segmentation, part-of-speech tagging, and named entity recognition. The tool has the following features and advantages:

  • Accurate: a deep learning model jointly learns word segmentation, part-of-speech tagging, named entity recognition, and word importance. Overall F1 exceeds 0.91, part-of-speech tagging F1 exceeds 0.94, and named entity recognition F1 exceeds 0.85, industry-leading results.
  • Fast: with streamlined model parameters and the performance optimizations of the Paddle inference library, single-threaded CPU throughput reaches 800 QPS, also industry-leading.
  • Customizable: a simple, controllable intervention mechanism lets a user dictionary precisely override model output. Dictionary entries can be long multi-word fragments, which makes interventions more accurate.
  • Easy to integrate: one-command installation, plus Python, Java, and C++ interfaces with examples for quick invocation and integration.
  • Mobile-ready: a customized ultra-lightweight model of only 2 MB reaches 200 QPS single-threaded on mainstream entry-level phones, meeting the needs of most mobile applications, with the best accuracy in its size class.

Installation and Usage

This section covers installation and usage in Python; see the corresponding directories for other languages.

Installation Notes

The code is compatible with Python 2 and 3.

  • Fully automatic installation: pip install lac

  • Semi-automatic installation: download from http://pypi.python.org/pypi/lac/, extract, and run python setup.py install

  • After installation, run lac, lac --segonly, or lac --rank from the command line to launch the tool for a quick hands-on trial.

    Users on networks in mainland China can install from the Baidu mirror for faster downloads: pip install lac -i https://mirror.baidu.com/pypi/simple

Features and Usage

Word Segmentation

  • Code example:
from LAC import LAC

# Load the segmentation model
lac = LAC(mode='seg')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
seg_result = lac.run(texts)
  • Output:
Single sample: seg_result = [LAC, 是, 个, 优秀, 的, 分词, 工具]
Batch:         seg_result = [[LAC, 是, 个, 优秀, 的, 分词, 工具], [百度, 是, 一家, 高科技, 公司]]

Part-of-Speech Tagging and Entity Recognition

  • Code example:
from LAC import LAC

# Load the LAC model
lac = LAC(mode='lac')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
lac_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
lac_result = lac.run(texts)
  • Output:

For each sentence, the output contains its segmentation word_list and a tag for each word tags_list, in the format (word_list, tags_list):

Single sample: lac_result = ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n])
Batch:         lac_result = [
                   ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n]),
                   ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n])
               ]

The POS and entity tag sets are listed below; the four most common entity categories are written in uppercase:

Tag | Meaning           | Tag | Meaning             | Tag | Meaning             | Tag  | Meaning
n   | common noun       | f   | directional noun    | s   | locative noun       | nw   | work title
nz  | other proper noun | v   | common verb         | vd  | verb-adverb         | vn   | verbal noun
a   | adjective         | ad  | adverbial adjective | an  | nominal adjective   | d    | adverb
m   | numeral           | q   | measure word        | r   | pronoun             | p    | preposition
c   | conjunction       | u   | particle            | xc  | other function word | w    | punctuation
PER | person name       | LOC | place name          | ORG | organization name   | TIME | time expression
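
Because the four entity categories are the only uppercase tags, entities can be filtered directly from the lac output. A minimal sketch, not part of the official API:

from LAC import LAC

lac = LAC(mode='lac')
words, tags = lac.run(u"李小明在北京大学读书")

# Keep only the four uppercase entity categories
ENTITY_TAGS = {'PER', 'LOC', 'ORG', 'TIME'}
entities = [(w, t) for w, t in zip(words, tags) if t in ENTITY_TAGS]
print(entities)  # e.g. [('李小明', 'PER'), ('北京大学', 'ORG')]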

Word Importance

  • Code example:
from LAC import LAC

# Load the word-importance model
lac = LAC(mode='rank')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
  • Output:
Single sample: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
                              [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]]
Batch:         rank_result = [
                   (['LAC', '是', '个', '优秀', '的', '分词', '工具'],
                    [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
                   (['百度', '是', '一家', '高科技', '公司'],
                    [ORG, v, m, n, n], [3, 0, 2, 3, 1])
               ]

The word-importance labels are listed below; we use a 4-level scale:

Level | Meaning                              | Common POS tags
0     | redundant words in the query         | p, w, xc ...
1     | weakly qualifying words in the query | r, c, u ...
2     | strongly qualifying words in the query | n, s, v ...
3     | core words in the query              | nz, nw, LOC ...
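
A common use of these levels is simple keyword extraction: keep only the strongly qualifying and core words (levels 2 and 3). A minimal sketch based on the rank output format shown above:

from LAC import LAC

lac = LAC(mode='rank')
words, tags, ranks = lac.run(u"百度是一家高科技公司")

# Keep words whose importance level is 2 (strong qualifier) or 3 (core word)
keywords = [w for w, r in zip(words, ranks) if r >= 2]
print(keywords)  # ['百度', '一家', '高科技'] given the levels shown above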

Customization

On top of the model output, LAC also supports user-configured custom segmentation results and entity-type output. When the model's prediction matches an item in the dictionary, the customized result replaces the original one. To achieve more precise matching, an item may be a long fragment composed of multiple words.

This feature is enabled by loading a dictionary file. Each line of the file defines one customized item, consisting of a single word or several consecutive words, each optionally followed by '/' and a tag; if no '/' tag is given, the model's default tag is used. The more words an item contains, the more precise the intervention.

  • Example dictionary file

    This is only an example showing results for various needs. A wildcard-based dictionary mode will be released later; stay tuned.

春天/SEASON
花/n 开/v
秋天的风
落 阳
  • Code example
from LAC import LAC
lac = LAC()

# Load the intervention dictionary. The sep parameter is the separator used in
# the dictionary file; when None, whitespace or tab '\t' is used by default.
lac.load_customization('custom.txt', sep=None)

# Result after intervention
custom_result = lac.run(u"春天的花开秋天的风以及冬天的落阳")
  • Taking the input "春天的花开秋天的风以及冬天的落阳" as an example, the original output is:
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n
  • After adding the example dictionary file, the output becomes:
春天/SEASON 的/u 花/n 开/v 秋天的风/n 以及/c 冬天/TIME 的/u 落/n 阳/n

Incremental Training

We also provide an incremental-training interface so that users can fine-tune the models with their own data. The data must first be converted to the model's input format, and all data files must be UTF-8 encoded:

1. Segmentation training
  • Data sample

    Consistent with most open-source segmentation datasets, words are separated by spaces, as shown below:

LAC 是 个 优秀 的 分词 工具 。
百度 是 一家 高科技 公司 。
春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。
  • Code example
from LAC import LAC

# Use the segmentation model
lac = LAC(mode='seg')

# Training and test datasets use the same format
train_file = "./data/seg_train.tsv"
test_file = "./data/seg_test.tsv"
lac.train(model_save_dir='./my_seg_model/', train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_seg_model')
2. Lexical analysis training
  • Data sample

    On top of the segmentation format, each word is annotated with its POS or entity category in the form "/type". Note that lexical-analysis training currently only supports data using the same tag set as ours; support for new tag sets will be released later, stay tuned.

LAC/nz 是/v 个/q 优秀/a 的/u 分词/n 工具/n 。/w
百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n 。/w
  • Code example
from LAC import LAC

# Use the default lexical analysis model
lac = LAC()

# Training and test datasets use the same format
train_file = "./data/lac_train.tsv"
test_file = "./data/lac_test.tsv"
lac.train(model_save_dir='./my_lac_model/', train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_lac_model')

Directory Structure

.
├── python                      # Scripts for calling LAC from Python
├── c++                         # Code for calling LAC from C++
├── java                        # Code for calling LAC from Java
├── Android                     # Android usage example
├── README.md                   # This file
└── CMakeList.txt               # Build script for the C++ and Java interfaces

Citing LAC in Papers

If you use LAC in your academic work, please add the citation below. We are very pleased that LAC can be of help to your research.

@article{jiao2018LAC,
	title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
	author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
	journal={arXiv preprint arXiv:1807.01882},
	year={2018},
	url={https://arxiv.org/abs/1807.01882}
}

Contributing

We welcome developers to contribute code to LAC. If you develop a new feature or find a bug, please submit a pull request or an issue on GitHub.