Quick Overview
Baidu LAC (Lexical Analysis of Chinese) is an open-source project for Chinese lexical analysis. It provides efficient and accurate word segmentation, part-of-speech tagging, and named entity recognition for Chinese text using deep learning techniques.
Pros
- High accuracy and performance in Chinese language processing tasks
- Provides Python, Java, and C++ interfaces
- Includes pre-trained models for immediate use
- Offers customization options for specific domain applications
Cons
- Limited documentation, especially for advanced usage
- Primarily focused on Chinese language, limiting its use for other languages
- Requires some understanding of deep learning concepts for optimal use
- May have a steeper learning curve compared to simpler NLP tools
Code Examples
- Basic word segmentation:
from LAC import LAC
lac = LAC(mode='seg')
text = "百度是一家高科技公司"
result = lac.run(text)
print(result)
# Output: ['百度', '是', '一家', '高科技', '公司']
- Part-of-speech tagging:
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
# Output: (['我', '爱', '北京', '天安门'], ['r', 'v', 'ns', 'ns'])
- Named entity recognition:
lac = LAC(mode='lac')
text = "李小明在北京大学读书"
result = lac.run(text)
print(result)
# Output: (['李小明', '在', '北京大学', '读书'], ['PER', 'p', 'ORG', 'v'])
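Because `lac` mode returns parallel word and tag lists, pulling out the named entities amounts to filtering on the uppercase entity tags. The following is a minimal sketch that assumes the (words, tags) shape shown in the outputs above; `ENTITY_TAGS` and `extract_entities` are illustrative names, not part of the LAC API.

```python
# Entity tags as listed in the README's tag table (PER, LOC, ORG, TIME).
ENTITY_TAGS = {"PER", "LOC", "ORG", "TIME"}

def extract_entities(words, tags):
    """Return (word, tag) pairs whose tag is one of the entity tags."""
    return [(w, t) for w, t in zip(words, tags) if t in ENTITY_TAGS]

# Sample result copied from the output above, so this runs without LAC installed.
words, tags = (['李小明', '在', '北京大学', '读书'], ['PER', 'p', 'ORG', 'v'])
entities = extract_entities(words, tags)
print(entities)
```

With a live model, the same function could be applied directly to the tuple returned by `lac.run(text)`.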
Getting Started
To use Baidu LAC, follow these steps:
- Install the package:
pip install lac
- Import and initialize LAC:
from LAC import LAC
lac = LAC(mode='lac')
- Process text:
text = "百度是一家高科技公司"
result = lac.run(text)
print(result)
For more advanced usage, refer to the project's GitHub repository and documentation.
Competitor Comparisons
"Jieba" (结巴) Chinese word segmentation
Pros of jieba
- More widely adopted and mature, with a larger community and ecosystem
- Easier to use and integrate, with simpler API and installation process
- Supports customization of dictionaries and user-defined words
Cons of jieba
- Generally slower performance compared to LAC
- Less accurate for some specific domains or complex sentences
- No built-in named entity recognition (part-of-speech tagging is available through the jieba.posseg module)
Code Comparison
jieba:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
LAC:
from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("我来到北京清华大学")
print("Segmentation result:", seg_result)
Both libraries provide Chinese word segmentation functionality, but LAC offers more advanced features and potentially better performance for specific use cases. jieba is easier to use and has a larger community, making it a popular choice for general-purpose segmentation tasks. The choice between the two depends on the specific requirements of your project, such as accuracy needs, performance constraints, and desired features.
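Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing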
Pros of HanLP
- More comprehensive NLP toolkit with wider range of features
- Supports multiple languages beyond just Chinese
- Active community and frequent updates
Cons of HanLP
- Larger codebase and dependencies may increase complexity
- Potentially slower performance for basic Chinese NLP tasks
- Steeper learning curve for beginners
Code Comparison
HanLP:
from pyhanlp import *
text = "我爱北京天安门"
print(HanLP.segment(text))
LAC:
from LAC import LAC
lac = LAC(mode='seg')
text = "我爱北京天安门"
print(lac.run(text))
Key Differences
- HanLP offers a more extensive set of NLP tools and language support
- LAC focuses specifically on Chinese language processing
- HanLP may require more setup and configuration
- LAC provides a simpler API for basic Chinese NLP tasks
Use Cases
- Choose HanLP for multi-language or advanced NLP projects
- Opt for LAC for straightforward Chinese text segmentation and POS tagging
Community and Support
- HanLP has a larger community and more frequent updates
- LAC benefits from Baidu's backing and specialized Chinese NLP expertise
Python library for processing Chinese text
Pros of SnowNLP
- Broader range of NLP tasks including sentiment analysis and text summarization
- Simpler installation process with fewer dependencies
- More lightweight and suitable for smaller projects or quick prototyping
Cons of SnowNLP
- Less accurate for complex Chinese language processing tasks
- Not as actively maintained or updated as LAC
- Limited support for advanced features like custom model training
Code Comparison
SnowNLP:
from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words) # Word segmentation
print(s.tags) # Part-of-speech tagging
print(s.sentiments) # Sentiment analysis
LAC:
from LAC import LAC
lac = LAC(mode='lac')
text = "这是一个测试句子"
result = lac.run(text)
print(result) # Word segmentation and part-of-speech tagging
SnowNLP offers a more straightforward API for various NLP tasks, while LAC focuses on providing more accurate results for word segmentation and part-of-speech tagging in Chinese. LAC is better suited for production environments requiring high accuracy in Chinese language processing, whereas SnowNLP is more versatile for quick NLP experiments across different tasks.
Language Technology Platform
Pros of ltp
- More comprehensive NLP toolkit with additional tasks like dependency parsing and semantic role labeling
- Supports both Python and C++ interfaces for flexibility
- Provides pre-trained models for multiple languages beyond Chinese
Cons of ltp
- Larger model size and potentially slower processing speed
- May require more system resources due to its comprehensive nature
- Less frequent updates compared to lac
Code Comparison
lac usage:
from LAC import LAC
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
ltp usage:
from ltp import LTP
ltp = LTP()
text = "我爱北京天安门"
result = ltp.pipeline(text, tasks=["cws", "pos", "ner"])
print(result)
Both repositories provide Chinese language processing tools, but ltp offers a more comprehensive suite of NLP tasks. lac focuses primarily on word segmentation, part-of-speech tagging, and named entity recognition, making it potentially faster and more lightweight. ltp, on the other hand, includes additional capabilities like dependency parsing and semantic role labeling, but may require more resources. The code examples demonstrate the simplicity of use for both libraries, with lac having a slightly more straightforward API for basic tasks.
The C++ version of "Jieba" (结巴) Chinese word segmentation
Pros of cppjieba
- Lightweight and easy to integrate into C++ projects
- Supports multiple segmentation modes (e.g., MPSegment, HMMSegment)
- Provides a user dictionary feature for customization
Cons of cppjieba
- Limited to Chinese language segmentation only
- May have lower accuracy compared to more advanced models like LAC
- Less actively maintained (last update was in 2020)
Code Comparison
cppjieba:
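#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"
// The path variables are placeholders; some cppjieba versions also require IDF and stop-word dictionary paths.
cppjieba::Jieba jieba(dict_path, hmm_path, user_dict_path);
std::vector<std::string> words;
jieba.Cut(sentence, words, true);  // true enables the HMM model for unknown words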
LAC:
from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("百度是一家高科技公司")
Key Differences
- Language: cppjieba is a C++ library, while LAC is typically used from Python and also provides C++ and Java interfaces
- Functionality: LAC offers more advanced NLP features beyond segmentation
- Performance: LAC may provide better accuracy, especially for complex sentences
- Integration: cppjieba is easier to integrate into C++ projects, while LAC is more suitable for Python environments
README
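Introduction
LAC (Lexical Analysis of Chinese) is a joint lexical analysis tool developed by Baidu's Natural Language Processing Department. It performs Chinese word segmentation, part-of-speech tagging, and proper-noun recognition. Its main features and advantages are:
- Accurate: a deep learning model jointly learns word segmentation, part-of-speech tagging, proper-noun recognition, and word importance. The overall F1 score exceeds 0.91, the part-of-speech tagging F1 score exceeds 0.94, and the proper-noun recognition F1 score exceeds 0.85, which is among the best in the industry.
- Efficient: the model parameters are pruned and inference is optimized with the Paddle inference library. Single-threaded CPU throughput reaches 800 QPS, which is among the best in the industry.
- Customizable: a simple, controllable intervention mechanism overrides the model by exact matching against a user dictionary. The dictionary supports long multi-word segments, which makes the intervention more precise.
- Easy to call: one-step installation is supported, and Python, Java, and C++ interfaces with usage examples are provided for fast adoption and integration.
- Mobile support: a custom ultra-lightweight model is only 2 MB in size and reaches 200 QPS single-threaded on mainstream budget phones. This meets the needs of most mobile applications, with leading accuracy among models of comparable size.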
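The F1 figures quoted above are word-level scores. As a rough illustration of how a segmentation F1 is commonly computed, the sketch below converts each segmentation to character spans and compares predicted spans with gold spans. This is the standard span-matching metric, not necessarily the exact evaluation script used for LAC, and the function names are illustrative.

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_f1(gold_words, pred_words):
    """F1 over word spans: a predicted word counts only if both boundaries match."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ['百度', '是', '一家', '高科技', '公司']
pred = ['百度', '是', '一', '家', '高科技', '公司']  # hypothetical over-segmentation
score = segmentation_f1(gold, pred)
print(round(score, 3))
```

Here four of the six predicted words match four of the five gold words, so precision is 4/6, recall is 4/5, and F1 is 8/11.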
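Installation and Usage
This section covers installation and usage from Python. Other languages are documented separately.
Installation
The code is compatible with Python 2 and 3.
- Fully automatic installation:
pip install lac
- Semi-automatic installation: download the package from http://pypi.python.org/pypi/lac/, extract it, and run
python setup.py install
- After installation, run lac, lac --segonly, or lac --rank on the command line to start the service and try it out quickly.
Users in mainland China can install from the Baidu mirror, which is faster:
pip install lac -i https://mirror.baidu.com/pypi/simple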
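After any of the installation routes above, it can be useful to confirm that the package is importable before loading a model. A small sketch using only the standard library follows; it assumes the import name is `LAC`, as in the `from LAC import LAC` examples, and `lac_installed` is an illustrative helper name.

```python
import importlib.util

def lac_installed():
    """Return True if a top-level module named LAC can be found."""
    return importlib.util.find_spec("LAC") is not None

print("LAC installed:", lac_installed())
```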
Features and Usage
Word segmentation
- Code example:
from LAC import LAC
# Load the word segmentation model
lac = LAC(mode='seg')
# Single-sample input: a Unicode string
text = u"LAC是个优秀的分词工具"
seg_result = lac.run(text)
# Batch input: a list of sentences, which gives higher average throughput
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
seg_result = lac.run(texts)
- Output:
[Single sample]: seg_result = [LAC, 是, 个, 优秀, 的, 分词, 工具]
[Batch]: seg_result = [[LAC, 是, 个, 优秀, 的, 分词, 工具], [百度, 是, 一家, 高科技, 公司]]
Part-of-speech tagging and entity recognition
- Code example:
from LAC import LAC
# Load the LAC model
lac = LAC(mode='lac')
# Single-sample input: a Unicode string
text = u"LAC是个优秀的分词工具"
lac_result = lac.run(text)
# Batch input: a list of sentences, which gives higher average throughput
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
lac_result = lac.run(texts)
- Output:
For each sentence, the output is the segmentation result word_list and the tag of each word tags_list, in the format (word_list, tags_list).
[Single sample]: lac_result = ([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n])
[Batch]: lac_result = [
([LAC, 是, 个, 优秀, 的, 分词, 工具], [nz, v, q, a, u, n, n]),
([百度, 是, 一家, 高科技, 公司], [ORG, v, m, n, n])
]
The part-of-speech and proper-noun tag set is listed in the table below. The four most commonly used proper-noun categories are written in uppercase:
Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | locative noun | s | place noun | nw | work title |
nz | other proper noun | v | common verb | vd | adverbial verb | vn | nominal verb |
a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
m | numeral-quantifier | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | location name | ORG | organization name | TIME | time |
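To make tagged output easier to read, the tag codes can be mapped to their meanings and the result rendered as word/tag tokens. The sketch below works on a sample (word_list, tags_list) tuple taken from the outputs above; `TAG_NAMES` covers only a subset of the table, and `annotate` and `describe` are illustrative helpers rather than LAC functions.

```python
# English glosses for a subset of the tags in the table above.
TAG_NAMES = {
    "n": "common noun", "nz": "other proper noun", "v": "common verb",
    "a": "adjective", "m": "numeral-quantifier", "q": "measure word",
    "u": "particle", "PER": "person name", "LOC": "location name",
    "ORG": "organization name", "TIME": "time",
}

def annotate(words, tags):
    """Format a (word_list, tags_list) result as space-separated word/tag tokens."""
    return " ".join("{}/{}".format(w, t) for w, t in zip(words, tags))

def describe(tags):
    """Map each tag to its gloss; tags missing from TAG_NAMES are returned as-is."""
    return [TAG_NAMES.get(t, t) for t in tags]

words, tags = (['百度', '是', '一家', '高科技', '公司'], ['ORG', 'v', 'm', 'n', 'n'])
line = annotate(words, tags)
print(line)
print(describe(tags))
```

The word/tag rendering matches the format the README uses for customization results and for lexical analysis training data.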
Word importance
- Code example:
from LAC import LAC
# Load the word importance model
lac = LAC(mode='rank')
# Single-sample input: a Unicode string
text = u"LAC是个优秀的分词工具"
rank_result = lac.run(text)
# Batch input: a list of sentences, which gives higher average throughput
texts = [u"LAC是个优秀的分词工具", u"百度是一家高科技公司"]
rank_result = lac.run(texts)
- Output:
[Single sample]: rank_result = [['LAC', '是', '个', '优秀', '的', '分词', '工具'],
[nz, v, q, a, u, n, n],[3, 0, 0, 2, 0, 3, 1]]
[Batch]: rank_result = [
(['LAC', '是', '个', '优秀', '的', '分词', '工具'],
[nz, v, q, a, u, n, n], [3, 0, 0, 2, 0, 3, 1]),
(['百度', '是', '一家', '高科技', '公司'],
[ORG, v, m, n, n], [3, 0, 2, 3, 1])
]
The word importance labels are listed in the table below. Words are classified on a 4-level scale:
Label | Meaning | Common parts of speech |
---|---|---|
0 | redundant words in the query | p, w, xc ... |
1 | weakly qualifying words in the query | r, c, u ... |
2 | strongly qualifying words in the query | n, s, v ... |
3 | core words in the query | nz, nw, LOC ... |
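One straightforward use of the importance levels is keyword selection: keep only words at or above a chosen level. The sketch below operates on the sample rank-mode output shown above, so it runs without a model; `keywords` is an illustrative helper, and the threshold of 2 is an arbitrary choice.

```python
def keywords(words, ranks, min_rank=2):
    """Keep words whose importance level is at least min_rank."""
    return [w for w, r in zip(words, ranks) if r >= min_rank]

# Sample rank-mode result copied from the output above: (words, tags, ranks).
words, tags, ranks = (
    ['LAC', '是', '个', '优秀', '的', '分词', '工具'],
    ['nz', 'v', 'q', 'a', 'u', 'n', 'n'],
    [3, 0, 0, 2, 0, 3, 1],
)
core = keywords(words, ranks, min_rank=3)   # level 3 only: core words
strong = keywords(words, ranks)             # levels 2 and 3
print(core, strong)
```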
Customization
On top of the model output, LAC also lets users configure custom segmentation results and proper-noun types. When the model's prediction matches an item in the dictionary, the customized result replaces the original one. To allow more precise matching, an item may be a long segment made up of several words.
This feature works by loading a dictionary file. Each line of the file is one customized item, consisting of one word or several consecutive words. A tag may follow each word after a '/'; if there is no '/' tag, the model's default tag is used. The more words an item contains, the more precise the intervention.
- Dictionary file example
This is only an example that shows the results for different requirements. A mode for configuring the dictionary with wildcards is planned for a later release.
春天/SEASON
花/n 开/v
秋天的风
落 阳
- Code example
from LAC import LAC
lac = LAC()
# Load the intervention dictionary. sep is the separator used in the dictionary file; when None, a space or tab ('\t') is used by default
lac.load_customization('custom.txt', sep=None)
# Result after intervention
custom_result = lac.run(u"春天的花开秋天的风以及冬天的落阳")
- For the input "春天的花开秋天的风以及冬天的落阳", the original output is:
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n
- After adding the dictionary file from the example, the result is:
春天/SEASON 的/u 花/n 开/v 秋天的风/n 以及/c 冬天/TIME 的/u 落/n 阳/n
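To make the dictionary format concrete, the sketch below parses dictionary lines into items of (word, tag) pairs, with None standing for "use the model's default tag". It only illustrates how the file format described above reads; it is not LAC's internal loader, and `parse_custom_line` is a hypothetical helper.

```python
def parse_custom_line(line, sep=None):
    """Parse one dictionary line into a list of (word, tag_or_None) pairs."""
    item = []
    for token in line.strip().split(sep):
        word, slash, tag = token.rpartition("/")
        if slash:
            # Token has the form word/tag.
            item.append((word, tag))
        else:
            # No '/', so the model's default tag would apply.
            item.append((token, None))
    return item

# The four lines of the dictionary file example above.
sample = ["春天/SEASON", "花/n 开/v", "秋天的风", "落 阳"]
items = [parse_custom_line(line) for line in sample]
for item in items:
    print(item)
```

The four items show the four cases in the example: a retagged word, a forced split with tags, a forced merge into one word, and a forced split that keeps default tags.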
Incremental training
An incremental training interface is also provided so that users can continue training on their own data. The data must first be converted to the model's input format, and all data files must be UTF-8 encoded.
1. Word segmentation training
- Data sample
The format matches most open-source word segmentation datasets: words are separated by spaces, as shown below:
LAC 是 个 优秀 的 分词 工具 。
百度 是 一家 高科技 公司 。
春天 的 花开 秋天 的 风 以及 冬天 的 落阳 。
- Code example
from LAC import LAC
# Use the word segmentation model
lac = LAC(mode = 'seg')
# Training and test datasets, in the same format
train_file = "./data/seg_train.tsv"
test_file = "./data/seg_test.tsv"
lac.train(model_save_dir='./my_seg_model/',train_data=train_file, test_data=test_file)
# Use the model you trained
my_lac = LAC(model_path='my_seg_model')
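Preparing segmentation training data mostly means writing one space-separated sentence per line in UTF-8. A minimal sketch follows, assuming the sentences are already available as lists of words; `write_seg_data` is an illustrative helper, and a temporary directory stands in for ./data/ so the example has no side effects.

```python
import io
import os
import tempfile

def write_seg_data(sentences, path):
    """Write one sentence per line, words separated by single spaces, in UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for words in sentences:
            f.write(u" ".join(words) + u"\n")

sentences = [
    [u'LAC', u'是', u'个', u'优秀', u'的', u'分词', u'工具', u'。'],
    [u'百度', u'是', u'一家', u'高科技', u'公司', u'。'],
]
path = os.path.join(tempfile.mkdtemp(), "seg_train.tsv")
write_seg_data(sentences, path)

# Read the file back to confirm the on-disk format.
with io.open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting file path could then be passed as train_data in the training call shown above.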
2. Lexical analysis training
- Data sample
Starting from the segmentation data, each word is marked with its part of speech or entity category in the form "/type". Note that lexical analysis training currently supports only data whose tag set matches the one used by LAC. Support for new tag sets is planned for a later release.
LAC/nz 是/v 个/q 优秀/a 的/u 分词/n 工具/n 。/w
百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w
春天/TIME 的/u 花开/v 秋天/TIME 的/u 风/n 以及/c 冬天/TIME 的/u 落阳/n 。/w
- Code example
from LAC import LAC
# Use the default lexical analysis model
lac = LAC()
# Training and test datasets, in the same format
train_file = "./data/lac_train.tsv"
test_file = "./data/lac_test.tsv"
lac.train(model_save_dir='./my_lac_model/',train_data=train_file, test_data=test_file)
# Use the model you trained
my_lac = LAC(model_path='my_lac_model')
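Since lexical analysis training only accepts data that uses the same tag set, it may help to check a training file for unexpected tags before training. The sketch below assumes the valid tags are exactly the 24 listed in the README's tag table, which is an assumption about what the trainer accepts; `parse_tagged_line` and `unknown_tags` are illustrative helpers.

```python
# The 24 tags from the README's tag table (assumed to be the accepted set).
LAC_TAGS = {
    "n", "f", "s", "nw", "nz", "v", "vd", "vn", "a", "ad", "an", "d",
    "m", "q", "r", "p", "c", "u", "xc", "w", "PER", "LOC", "ORG", "TIME",
}

def parse_tagged_line(line):
    """Split 'word/tag word/tag ...' into parallel (words, tags) lists."""
    words, tags = [], []
    for token in line.split():
        # rpartition splits on the last '/', so a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        words.append(word)
        tags.append(tag)
    return words, tags

def unknown_tags(line):
    """Return the sorted tags in the line that are not in LAC_TAGS."""
    _, tags = parse_tagged_line(line)
    return sorted(set(tags) - LAC_TAGS)

ok_line = "百度/ORG 是/v 一家/m 高科技/n 公司/n 。/w"
bad_line = "春天/SEASON 的/u 花/n"  # SEASON is a custom tag, not in the table
print(unknown_tags(ok_line), unknown_tags(bad_line))
```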
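File Structure
.
├── python # Scripts for calling from Python
├── c++ # Code for calling from C++
├── java # Code for calling from Java
├── Android # Examples for calling from Android
├── README.md # This file
└── CMakeList.txt # Script for building the C++ and Java interfaces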
Citing LAC in Papers
If you use LAC in your academic work, please add the following citation. We are glad that LAC can help your research.
@article{jiao2018LAC,
title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network},
author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke},
journal={arXiv preprint arXiv:1807.01882},
year={2018},
url={https://arxiv.org/abs/1807.01882}
}
Contributing
We welcome contributions to LAC from developers. If you have developed a new feature or found a bug, feel free to submit a pull request or an issue on GitHub.