
baidu/lac

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, and word importance


Top Related Projects

  • jieba (33,063 stars): "Jieba" Chinese word segmentation
  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • SnowNLP (6,397 stars): Python library for processing Chinese text
  • ltp (4,924 stars): Language Technology Platform
  • cppjieba: The C++ version of the "Jieba" Chinese word segmenter

Quick Overview

Baidu LAC (Lexical Analysis of Chinese) is an open-source project for Chinese lexical analysis. It provides efficient and accurate word segmentation, part-of-speech tagging, and named entity recognition for Chinese text using deep learning techniques.

Pros

  • High accuracy and performance in Chinese language processing tasks
  • Supports both Python and C++ interfaces for flexibility
  • Includes pre-trained models for immediate use
  • Offers customization options for specific domain applications

Cons

  • Limited documentation, especially for advanced usage
  • Primarily focused on Chinese language, limiting its use for other languages
  • Requires some understanding of deep learning concepts for optimal use
  • May have a steeper learning curve compared to simpler NLP tools

Code Examples

  1. Basic word segmentation:

from LAC import LAC

lac = LAC(mode='seg')
text = "百度是一家高科技公司"
result = lac.run(text)
print(result)
# Output: ['百度', '是', '一家', '高科技', '公司']

  2. Part-of-speech tagging:

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
# Output: (['我', '爱', '北京', '天安门'], ['r', 'v', 'LOC', 'LOC'])
# (tags follow LAC's tag set; exact segmentation and tags may vary by model version)

  3. Named entity recognition:

lac = LAC(mode='lac')
text = "李小明在北京大学读书"
result = lac.run(text)
print(result)
# Output: (['李小明', '在', '北京大学', '读书'], ['PER', 'p', 'ORG', 'v'])

Getting Started

To use Baidu LAC, follow these steps:

  1. Install the package:

    pip install lac
    
  2. Import and initialize LAC:

    from LAC import LAC
    lac = LAC(mode='lac')
    
  3. Process text:

    text = "百度是一家高科技公司"
    result = lac.run(text)
    print(result)
    

For more advanced usage, refer to the project's GitHub repository and documentation.

Competitor Comparisons

jieba (33,063 stars): "Jieba" Chinese word segmentation

Pros of jieba

  • More widely adopted and mature, with a larger community and ecosystem
  • Easier to use and integrate, with simpler API and installation process
  • Supports customization of dictionaries and user-defined words

Cons of jieba

  • Generally slower performance compared to LAC
  • Less accurate for some specific domains or complex sentences
  • Lacks built-in named entity recognition; its part-of-speech tagging (via jieba.posseg) is more basic than LAC's joint model

Code Comparison

jieba:

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

LAC:

from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("我来到北京清华大学")
print("Segmentation result:", seg_result)

Both libraries provide Chinese word segmentation functionality, but LAC offers more advanced features and potentially better performance for specific use cases. jieba is easier to use and has a larger community, making it a popular choice for general-purpose segmentation tasks. The choice between the two depends on the specific requirements of your project, such as accuracy needs, performance constraints, and desired features.
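
To illustrate the dictionary customization noted above, here is a minimal jieba sketch. jieba.add_word and jieba.load_userdict are part of jieba's public API; userdict.txt is a hypothetical example file:

import jieba

# Register a single custom word at runtime so it is kept as one token
jieba.add_word("高科技公司")

# Or load a whole user dictionary: one entry per line, "word [freq] [tag]"
# jieba.load_userdict("userdict.txt")  # hypothetical example file

print("/".join(jieba.cut("百度是一家高科技公司")))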

HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • More comprehensive NLP toolkit with wider range of features
  • Supports multiple languages beyond just Chinese
  • Active community and frequent updates

Cons of HanLP

  • Larger codebase and dependencies may increase complexity
  • Potentially slower performance for basic Chinese NLP tasks
  • Steeper learning curve for beginners

Code Comparison

HanLP:

from pyhanlp import *

text = "我爱北京天安门"
print(HanLP.segment(text))

LAC:

from LAC import LAC

lac = LAC(mode='seg')
text = "我爱北京天安门"
print(lac.run(text))
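
Note that HanLP.segment above returns Term objects rather than plain strings. A minimal sketch of unpacking them, assuming pyhanlp's documented term.word and term.nature attributes:

from pyhanlp import HanLP

# Each Term carries the token and its part-of-speech tag (its "nature")
for term in HanLP.segment("我爱北京天安门"):
    print(term.word, term.nature)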

Key Differences

  • HanLP offers a more extensive set of NLP tools and language support
  • LAC focuses specifically on Chinese language processing
  • HanLP may require more setup and configuration
  • LAC provides a simpler API for basic Chinese NLP tasks

Use Cases

  • Choose HanLP for multi-language or advanced NLP projects
  • Opt for LAC for straightforward Chinese text segmentation and POS tagging

Community and Support

  • HanLP has a larger community and more frequent updates
  • LAC benefits from Baidu's backing and specialized Chinese NLP expertise

SnowNLP (6,397 stars): Python library for processing Chinese text

Pros of SnowNLP

  • Broader range of NLP tasks including sentiment analysis and text summarization
  • Simpler installation process with fewer dependencies
  • More lightweight and suitable for smaller projects or quick prototyping

Cons of SnowNLP

  • Less accurate for complex Chinese language processing tasks
  • Not as actively maintained or updated as LAC
  • Limited support for advanced features like custom model training

Code Comparison

SnowNLP:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)        # Word segmentation
print(list(s.tags))   # Part-of-speech tagging (s.tags is a lazy generator)
print(s.sentiments)   # Sentiment analysis: probability the text is positive

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "这是一个测试句子"
result = lac.run(text)
print(result)  # Word segmentation and part-of-speech tagging

SnowNLP offers a more straightforward API for various NLP tasks, while LAC focuses on providing more accurate results for word segmentation and part-of-speech tagging in Chinese. LAC is better suited for production environments requiring high accuracy in Chinese language processing, whereas SnowNLP is more versatile for quick NLP experiments across different tasks.

ltp (4,924 stars): Language Technology Platform

Pros of ltp

  • More comprehensive NLP toolkit with additional tasks like dependency parsing and semantic role labeling
  • Supports both Python and C++ interfaces for flexibility
  • Provides pre-trained models in several sizes, allowing a speed/accuracy trade-off

Cons of ltp

  • Larger model size and potentially slower processing speed
  • May require more system resources due to its comprehensive nature
  • Less frequent updates compared to lac

Code Comparison

lac usage:

from LAC import LAC
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

ltp usage:

from ltp import LTP
ltp = LTP()
text = "我爱北京天安门"
result = ltp.pipeline(text, tasks=["cws", "pos", "ner"])
print(result)

Both repositories provide Chinese language processing tools, but ltp offers a more comprehensive suite of NLP tasks. lac focuses primarily on word segmentation, part-of-speech tagging, and named entity recognition, making it potentially faster and more lightweight. ltp, on the other hand, includes additional capabilities like dependency parsing and semantic role labeling, but may require more resources. The code examples demonstrate the simplicity of use for both libraries, with lac having a slightly more straightforward API for basic tasks.

"结巴"中文分词的C++版本

Pros of cppjieba

  • Lightweight and easy to integrate into C++ projects
  • Supports multiple segmentation modes (e.g., MPSegment, HMMSegment)
  • Provides a user dictionary feature for customization

Cons of cppjieba

  • Limited to Chinese language segmentation only
  • May have lower accuracy compared to more advanced models like LAC
  • Less actively maintained (last update was in 2020)

Code Comparison

cppjieba:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

// dict_path, hmm_path, and user_dict_path point to the dictionary files shipped with cppjieba
cppjieba::Jieba jieba(dict_path, hmm_path, user_dict_path);
std::vector<std::string> words;
jieba.Cut(sentence, words, true);  // true enables the HMM model for out-of-vocabulary words

LAC:

from LAC import LAC
lac = LAC(mode='lac')
seg_result = lac.run("百度是一家高科技公司")

Key Differences

  • Language: cppjieba is written in C++, while LAC is primarily Python-based
  • Functionality: LAC offers more advanced NLP features beyond segmentation
  • Performance: LAC may provide better accuracy, especially for complex sentences
  • Integration: cppjieba is easier to integrate into C++ projects, while LAC is more suitable for Python environments


README

Introduction

LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's NLP department. It performs Chinese word segmentation, part-of-speech tagging, and named entity recognition. The tool has the following features and advantages:

  • Accurate: a deep learning model jointly learns word segmentation, part-of-speech tagging, named entity recognition, and word importance. Overall F1 exceeds 0.91, part-of-speech tagging F1 exceeds 0.94, and named entity recognition F1 exceeds 0.85, industry-leading results.
  • Fast: with streamlined model parameters and the performance optimizations of the Paddle inference library, single-threaded CPU throughput reaches 800 QPS, also industry-leading.
  • Customizable: a simple, controllable intervention mechanism lets a user dictionary precisely override model output. Dictionary entries can be long multi-word fragments, which makes interventions more accurate.
  • Easy to integrate: one-command installation, plus Python, Java, and C++ interfaces with examples for quick invocation and integration.
  • Mobile-ready: a customized ultra-lightweight model of only 2 MB reaches 200 QPS single-threaded on mainstream entry-level phones, meeting the needs of most mobile applications, with the best accuracy in its size class.

Installation and Usage

This section covers installation and usage in Python; see the corresponding directories for other languages.

Installation Notes

The code is compatible with Python 2 and 3.

  • Fully automatic installation: pip install lac

  • Semi-automatic installation: download from http://pypi.python.org/pypi/lac/, extract, and run python setup.py install

  • After installation, run lac, lac --segonly, or lac --rank from the command line to launch the tool for a quick hands-on trial.

    Users on networks in mainland China can install from the Baidu mirror for faster downloads: pip install lac -i https://mirror.baidu.com/pypi/simple

Features and Usage

Word Segmentation

  • Code example:
from LAC import LAC

# Load the segmentation model
lac = LAC(mode='seg')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
seg_result = lac.run(texts)
  • Output:
Single sample: seg_result = [LAC, 是, 个, 优秀, 的, 分词, 工具]
Batch:         seg_result = [[LAC, 是, 个, 优秀, 的, 分词, 工具], [百度, 是, 一家, 高科技, 公司]]

Part-of-Speech Tagging and Entity Recognition

  • Code example:
from LAC import LAC

# Load the LAC model
lac = LAC(mode='lac')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
lac_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
lac_result = lac.run(texts)
  • Output:

For each sentence, the output contains its segmentation word_list and a tag for each word tags_list, in the format (word_list, tags_list):

Single sample: lac_result = ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n])
Batch:         lac_result = [
                   ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n]),
                   ([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n])
               ]

The POS and entity tag sets are listed below; the four most common entity categories are written in uppercase:

Tag | Meaning           | Tag | Meaning             | Tag | Meaning             | Tag  | Meaning
n   | common noun       | f   | directional noun    | s   | locative noun       | nw   | work title
nz  | other proper noun | v   | common verb         | vd  | verb-adverb         | vn   | verbal noun
a   | adjective         | ad  | adverbial adjective | an  | nominal adjective   | d    | adverb
m   | numeral           | q   | measure word        | r   | pronoun             | p    | preposition
c   | conjunction       | u   | particle            | xc  | other function word | w    | punctuation
PER | person name       | LOC | place name          | ORG | organization name   | TIME | time expression
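
Because the four entity categories are the only uppercase tags, entities can be filtered directly from the lac output. A minimal sketch, not part of the official API:

from LAC import LAC

lac = LAC(mode='lac')
words, tags = lac.run(u"李小明在北京大学读书")

# Keep only the four uppercase entity categories
ENTITY_TAGS = {'PER', 'LOC', 'ORG', 'TIME'}
entities = [(w, t) for w, t in zip(words, tags) if t in ENTITY_TAGS]
print(entities)  # e.g. [('李小明', 'PER'), ('北京大学', 'ORG')]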

Word Importance

  • Code example:
from LAC import LAC

# Load the word-importance model
lac = LAC(mode='rank')

# Single-sample input: a Unicode-encoded string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)

# Batch input: a list of sentences; average throughput is higher
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
  • Output:
Single sample: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
                              [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]]
Batch:         rank_result = [
                   (['LAC', '是', '个', '优秀', '的', '分词', '工具'],
                    [nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
                   (['百度', '是', '一家', '高科技', '公司'],
                    [ORG, v, m, n, n], [3, 0, 2, 3, 1])
               ]

The word-importance labels are listed below; we use a 4-level scale:

Level | Meaning                              | Common POS tags
0     | redundant words in the query         | p, w, xc ...
1     | weakly qualifying words in the query | r, c, u ...
2     | strongly qualifying words in the query | n, s, v ...
3     | core words in the query              | nz, nw, LOC ...
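
A common use of these levels is simple keyword extraction: keep only the strongly qualifying and core words (levels 2 and 3). A minimal sketch based on the rank output format shown above:

from LAC import LAC

lac = LAC(mode='rank')
words, tags, ranks = lac.run(u"百度是一家高科技公司")

# Keep words whose importance level is 2 (strong qualifier) or 3 (core word)
keywords = [w for w, r in zip(words, ranks) if r >= 2]
print(keywords)  # ['百度', '一家', '高科技'] given the levels shown above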

Customization

On top of the model output, LAC also supports user-configured custom segmentation results and entity-type output. When the model's prediction matches an item in the dictionary, the customized result replaces the original one. To achieve more precise matching, an item may be a long fragment composed of multiple words.

This feature is enabled by loading a dictionary file. Each line of the file defines one customized item, consisting of a single word or several consecutive words, each optionally followed by '/' and a tag; if no '/' tag is given, the model's default tag is used. The more words an item contains, the more precise the intervention.

  • Example dictionary file

    This is only an example showing results for various needs. A wildcard-based dictionary mode will be released later; stay tuned.

春天/SEASON
花/n 开/v
秋天的风
落 阳
  • Code example
from LAC import LAC
lac = LAC()

# Load the intervention dictionary. The sep parameter is the separator used in
# the dictionary file; when None, whitespace or tab '\t' is used by default.
lac.load_customization('custom.txt', sep=None)

# Result after intervention
custom_result = lac.run(u"春天的花开秋天的风以及冬天的落阳")
  • Taking the input "春天的花开秋天的风以及冬天的落阳" as an example, the original output is:
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n
  • After adding the example dictionary file, the output becomes:
春天/SEASON 的/u 花/n 开/v 秋天的风/n 以及/c 冬天/TIME 的/u 落/n 阳/n

Incremental Training

We also provide an incremental-training interface so that users can fine-tune the models with their own data. The data must first be converted to the model's input format, and all data files must be UTF-8 encoded:

1. Segmentation training
  • Data sample

    Consistent with most open-source segmentation datasets, words are separated by spaces, as shown below:

LAC 是 个 优秀 的 分词 工具 。
百度 是 一家 高科技 公司 。
春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。
  • Code example
from LAC import LAC

# Use the segmentation model
lac = LAC(mode='seg')

# Training and test datasets use the same format
train_file = "./data/seg_train.tsv"
test_file = "./data/seg_test.tsv"
lac.train(model_save_dir='./my_seg_model/', train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_seg_model')
2. Lexical analysis training
  • Data sample

    On top of the segmentation format, each word is annotated with its POS or entity category in the form "/type". Note that lexical-analysis training currently only supports data using the same tag set as ours; support for new tag sets will be released later, stay tuned.

LAC/nz 是/v 个/q 优秀/a 的/u 分词/n 工具/n 。/w
百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n 。/w
  • Code example
from LAC import LAC

# Use the default lexical analysis model
lac = LAC()

# Training and test datasets use the same format
train_file = "./data/lac_train.tsv"
test_file = "./data/lac_test.tsv"
lac.train(model_save_dir='./my_lac_model/', train_data=train_file, test_data=test_file)

# Use your own trained model
my_lac = LAC(model_path='my_lac_model')

Directory Structure

.
├── python                      # Scripts for calling LAC from Python
├── c++                         # Code for calling LAC from C++
├── java                        # Code for calling LAC from Java
├── Android                     # Android usage example
├── README.md                   # This file
└── CMakeList.txt               # Build script for the C++ and Java interfaces

Citing LAC in Papers

If you use LAC in your academic work, please add the citation below. We are very pleased that LAC can be of help to your research.

@article{jiao2018LAC,
	title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
	author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
	journal={arXiv preprint arXiv:1807.01882},
	year={2018},
	url={https://arxiv.org/abs/1807.01882}
}

Contributing

We welcome developers to contribute code to LAC. If you develop a new feature or find a bug, please submit a pull request or an issue on GitHub.