Top Related Projects
Quick Overview
Jieba is a popular Chinese text segmentation library for Python. It provides efficient and accurate word segmentation, part-of-speech tagging, and keyword extraction for Chinese text processing tasks.
Pros
- Fast and efficient segmentation algorithm
- Supports both simplified and traditional Chinese
- Customizable dictionary for domain-specific segmentation
- Offers various segmentation modes (accurate, full, search engine)
Cons
- Limited support for other languages besides Chinese
- May require fine-tuning for specific domains or dialects
- Documentation is primarily in Chinese, which can be challenging for non-Chinese speakers
Code Examples
- Basic word segmentation:
import jieba
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
- Part-of-speech tagging:
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
print(f'{word} {flag}')
- Keyword extraction:
import jieba.analyse
text = "线程是程序执行流的最小单元,是进程中的一个实体,是被系统独立调度和分派的基本单位"
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for keyword, weight in keywords:
print(f'{keyword} {weight}')
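- Custom dictionary (a minimal sketch using the load_userdict/add_word APIs described in the README below; the file name userdict.txt and the sample sentence are only illustrative):
import jieba
# userdict.txt: one word per line, optionally followed by a frequency and POS tag,
# e.g. "云计算 5" or "创新办 3 i"
jieba.load_userdict("userdict.txt")
jieba.add_word("云计算")  # words can also be added at runtime
print("/ ".join(jieba.cut("李小福是创新办主任也是云计算方面的专家")))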
Getting Started
To get started with Jieba, follow these steps:
- Install Jieba using pip:
pip install jieba
- Import Jieba in your Python script:
import jieba
- Use Jieba for word segmentation:
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print("/ ".join(seg_list))
This will output the segmented words separated by slashes.
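For the example sentence above, the default (accurate) mode prints the following (this matches the sample output in the project README further down):
我/ 来到/ 北京/ 清华大学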
Competitor Comparisons
HanLP: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other natural language processing tasks
Pros of HanLP
- More comprehensive NLP functionality beyond just word segmentation
- Supports multiple languages and domains
- Actively maintained with frequent updates
Cons of HanLP
- Steeper learning curve due to more complex API
- Larger memory footprint and slower processing speed
- Requires more setup and configuration
Code Comparison
HanLP:
from pyhanlp import *
text = "我爱北京天安门"
print(HanLP.segment(text))
Jieba:
import jieba
text = "我爱北京天安门"
print("/".join(jieba.cut(text)))
Both libraries offer simple APIs for basic word segmentation, but HanLP provides more advanced NLP functions out of the box. Jieba's simpler interface makes it easier to get started with basic tasks, while HanLP offers more flexibility for complex NLP projects.
HanLP's broader feature set comes at the cost of increased complexity and resource usage. Jieba, being more focused on word segmentation, is generally faster and lighter.
For projects requiring only Chinese word segmentation, Jieba may be sufficient. However, for more comprehensive NLP tasks or multi-language support, HanLP could be a better choice despite its steeper learning curve.
Baidu NLP: word segmentation, POS tagging, named entity recognition, and word importance
Pros of LAC
- More advanced NLP capabilities, including part-of-speech tagging and named entity recognition
- Potentially better accuracy for complex sentences and specialized domains
- Actively maintained by Baidu, a major tech company with NLP expertise
Cons of LAC
- Larger model size and potentially slower processing speed
- More complex setup and usage compared to Jieba's simplicity
- Less community support and third-party integrations
Code Comparison
Jieba usage:
import jieba
seg_list = jieba.cut("我来到北京清华大学")
print(" ".join(seg_list))
LAC usage:
from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("我来到北京清华大学")
print(" ".join(seg_result))
Both libraries offer simple APIs for basic word segmentation, but LAC requires an additional step of initializing the model. LAC also provides more advanced NLP functions beyond this basic example.
LAC is better suited for projects requiring comprehensive NLP capabilities and potentially higher accuracy, especially in specialized domains. Jieba remains a solid choice for simpler projects prioritizing ease of use and faster processing speed.
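To illustrate the extra tagging step, here is a minimal sketch of LAC's joint segmentation and tagging mode; it assumes LAC's documented 'lac' mode, and the exact return format may vary between versions:
from LAC import LAC

lac = LAC(mode='lac')  # 'lac' mode returns words together with their POS/NER tags
words, tags = lac.run("我来到北京清华大学")
for word, tag in zip(words, tags):
    print(word, tag)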
Python library for processing Chinese text
Pros of SnowNLP
- Offers sentiment analysis and text classification features
- Includes pinyin conversion functionality
- Provides text summarization capabilities
Cons of SnowNLP
- Less active development and community support
- More limited word segmentation accuracy compared to Jieba
- Smaller corpus and dictionary size
Code Comparison
Jieba word segmentation:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
SnowNLP word segmentation:
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.words)
Both libraries offer Chinese text processing capabilities, but they have different strengths. Jieba is primarily focused on word segmentation and is widely regarded as one of the best options for this task. It has a larger user base and more frequent updates.
SnowNLP, on the other hand, provides a broader range of NLP functions, including sentiment analysis and text summarization. However, its word segmentation may not be as accurate as Jieba's for certain texts.
The choice between the two depends on the specific requirements of your project. If you need advanced word segmentation, Jieba is likely the better choice. For a more comprehensive NLP toolkit with additional features, SnowNLP might be more suitable.
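As a rough sketch of the extra features mentioned above (sentiment scoring and pinyin conversion), assuming SnowNLP's documented attributes:
from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.words)       # word segmentation
print(s.sentiments)  # sentiment score in [0, 1]; higher means more positive
print(s.pinyin)      # pinyin conversion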
"结巴"中文分词的C++版本
Pros of cppjieba
- Faster performance due to C++ implementation
- Lower memory usage compared to Python-based Jieba
- Suitable for high-performance applications and embedded systems
Cons of cppjieba
- Less extensive documentation and community support
- Fewer built-in features and extensions compared to Jieba
- Requires C++ knowledge for integration and customization
Code Comparison
Jieba (Python):
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
cppjieba (C++):
#include "cppjieba/Jieba.hpp"
// assumes a cppjieba::Jieba instance named "jieba", constructed with the dictionary paths
std::vector<std::string> words;
jieba.Cut("我来到北京清华大学", words, false);
std::cout << limonp::Join(words.begin(), words.end(), "/") << std::endl;
Both libraries aim to provide Chinese word segmentation functionality, but they cater to different use cases and developer preferences. Jieba offers a more user-friendly Python interface with extensive features, making it ideal for rapid development and data analysis tasks. On the other hand, cppjieba provides a performance-oriented C++ implementation, suitable for scenarios where speed and resource efficiency are crucial. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and available system resources.
README
jieba
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
- Scroll down for English documentation.
Features
- Supports four segmentation modes:
- Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
- Full mode: scans out all the words in the sentence that can possibly form a word; very fast, but cannot resolve ambiguity.
- Search engine mode: based on accurate mode, long words are cut again into shorter words to improve recall; suitable for search-engine segmentation.
- Paddle mode: uses the PaddlePaddle deep learning framework to train a sequence-labelling (bidirectional GRU) network model for segmentation, and also supports POS tagging. Paddle mode requires installing paddlepaddle-tiny:
pip install paddlepaddle-tiny==1.6.1
Paddle mode is supported by jieba v0.40 and above; for older versions, upgrade jieba with pip install jieba --upgrade. See the PaddlePaddle website.
- Supports Traditional Chinese segmentation
- Supports custom dictionaries
- MIT license
Installation
The code is compatible with both Python 2 and 3.
- Fully automatic installation:
easy_install jieba
or pip install jieba / pip3 install jieba
- Semi-automatic installation: first download http://pypi.python.org/pypi/jieba/ , extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the site-packages directory.
- Import it with
import jieba
- To use paddle-mode segmentation and POS tagging, first install paddlepaddle-tiny:
pip install paddlepaddle-tiny==1.6.1
Algorithm
- Based on a prefix dictionary for efficient word-graph scanning, building a directed acyclic graph (DAG) of all possible word combinations formed by the Chinese characters in a sentence
- Uses dynamic programming to find the maximum-probability path, i.e. the most likely segmentation based on word frequency
- For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm
Main Functions
1. Segmentation
- The jieba.cut method accepts four input parameters: the string to be segmented; the cut_all parameter, which controls whether to use full mode; the HMM parameter, which controls whether to use the HMM model; and the use_paddle parameter, which controls whether to use paddle-mode segmentation. Paddle mode is loaded lazily: install paddlepaddle-tiny and import the relevant code via the enable_paddle interface.
- The jieba.cut_for_search method accepts two parameters: the string to be segmented and whether to use the HMM model. It is suitable for building inverted indexes for search engines, since its granularity is finer.
- The string to be segmented can be a unicode or UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, because it may be unexpectedly mis-decoded as UTF-8.
- jieba.cut and jieba.cut_for_search both return an iterable generator; use a for loop to obtain each word (unicode), or use jieba.lcut and jieba.lcut_for_search to return a list directly.
- jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized tokenizer, which lets you use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer.
Code example
# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode (supported since v0.40; earlier versions do not support it)
strs = ["我来到北京清华大学", "乒乓球拍卖完了", "中国科学技术大学"]
for s in strs:
    seg_list = jieba.cut(s, use_paddle=True)  # paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here, "杭研" is not in the dictionary, but is still identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2. Add a custom dictionary
Load a dictionary
- Developers can specify their own custom dictionary so that it contains words not found in the jieba dictionary. Although jieba can recognize new words, adding your own new words guarantees a higher accuracy.
- Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line has three parts separated by spaces: word, frequency (optional), and POS tag (optional), and the order must not be changed. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
- When the frequency is omitted, an automatically computed value that guarantees the word can be segmented out is used.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
- Change the tmp_dir and cache_file attributes of the tokenizer (jieba.dt by default) to specify the folder and file name of the cache file, for use on restricted file systems.
- Examples:
- Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
- Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
- After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Adjust the dictionary
- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically within a program.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented out.
- Note: automatically computed frequencies may be ineffective when the HMM new-word discovery feature is used.
Code example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
- "Using a user-defined dictionary to improve disambiguation" --- https://github.com/fxsjy/jieba/issues/14
3. Keyword extraction
Keyword extraction based on the TF-IDF algorithm
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence is the text to extract keywords from
- topK is the number of keywords with the highest TF-IDF weights to return; the default is 20
- withWeight controls whether to also return the keyword weights; the default is False
- allowPOS includes only words with the specified POS tags; the default is empty, i.e. no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
Code example (keyword extraction):
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
- Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-word corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom corpus
- Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
Example of returning keyword weights together with the keywords
Keyword extraction based on the TextRank algorithm
- jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same, but note that it filters POS tags by default.
- jieba.analyse.TextRank() creates a new custom TextRank instance
Algorithm paper: TextRank: Bringing Order into Texts
Basic idea:
- Segment the text from which keywords are to be extracted
- Build a graph from co-occurrence relations between words within a fixed window size (5 by default, adjustable via the span attribute)
- Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
Usage example:
See test/demo.py
4. POS tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized tokenizer; the tokenizer parameter specifies the internally used jieba.Tokenizer. jieba.posseg.dt is the default POS-tagging tokenizer.
- Tags the POS of every word of the segmented sentence, using labels compatible with ictclas.
- In addition to jieba's default segmentation mode, POS tagging in paddle mode is also provided. Paddle mode is loaded lazily: install paddlepaddle-tiny and import the relevant code via enable_paddle().
- Usage example
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")  # jieba default mode
>>> jieba.enable_paddle()  # enable paddle mode (supported since v0.40; earlier versions do not support it)
>>> words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle mode
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
The POS tag set used in paddle mode is shown below.
Paddle mode uses 24 POS tags (lowercase letters) and 4 named-entity category tags (uppercase letters):
Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | noun of direction | s | noun of place | t | time word |
nr | person name | ns | place name | nt | organization name | nw | work/title name |
nz | other proper noun | v | common verb | vd | verb used as adverb | vn | verb used as noun |
a | adjective | ad | adjective used as adverb | an | adjective used as noun | d | adverb |
m | numeral | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | location name | ORG | organization name | TIME | time |
5. Parallel segmentation
- Principle: split the target text by line, distribute the lines to multiple Python processes for parallel segmentation, and then merge the results, obtaining a considerable speed-up.
- Based on Python's built-in multiprocessing module; Windows is currently not supported.
- Usage:
jieba.enable_parallel(4)  # enable parallel segmentation; the parameter is the number of parallel processes
jieba.disable_parallel()  # disable parallel segmentation
- Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
- Experimental result: on a 4-core 3.4GHz Linux machine, accurate segmentation of the complete works of Jin Yong reached 1 MB/s, 3.3 times the speed of the single-process version.
- Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.
6. Tokenize: return the start and end positions of words in the original text
- Note: the input only accepts unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限公司            start: 6                end:10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限                start: 6                end:8
word 公司                start: 8                end:10
word 有限公司            start: 6                end:10
7. ChineseAnalyzer for the Whoosh search engine
- Import: from jieba.analyse import ChineseAnalyzer
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8. Command-line segmentation
Usage example: python -m jieba news.txt > cut_result.txt
Command-line options (see the --help output):
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Lazy loading
jieba uses lazy loading: import jieba and jieba.Tokenizer() do not immediately trigger loading of the dictionary; the dictionary is loaded and the prefix dictionary built only once it becomes necessary. If you prefer, you can also initialize jieba manually:
import jieba
jieba.initialize()  # manual initialization (optional)
Versions before 0.28 could not specify the path of the main dictionary; with the lazy loading mechanism you can now change it:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other dictionaries
- A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
- A dictionary file with better support for Traditional Chinese: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt with it, or use jieba.set_dictionary('data/dict.txt.big')
Implementations in other languages
- Jieba for Java (author: piaolingxue): https://github.com/huaban/jieba-analysis
- Jieba for C++ (author: yanyiwu): https://github.com/yanyiwu/cppjieba
- Jieba for Rust (authors: messense, MnO2): https://github.com/messense/jieba-rs
- Jieba for Node.js (author: yanyiwu): https://github.com/yanyiwu/nodejieba
- Jieba for Erlang (author: falood): https://github.com/falood/exjieba
- Jieba for R (author: qinwf): https://github.com/qinwf/jiebaR
- Jieba for iOS (author: yanyiwu): https://github.com/yanyiwu/iosjieba
- Jieba for PHP (author: fukuball): https://github.com/fukuball/jieba-php
- Jieba for .NET (C#) (author: anderscui): https://github.com/anderscui/jieba.NET/
- Jieba for Go (author: wangbin): https://github.com/wangbin/jiebago ; (author: yanyiwu): https://github.com/yanyiwu/gojieba
- Jieba for Android (author: Dongliang.W): https://github.com/452896915/jieba-android
Related links
- https://github.com/baidu/lac Baidu Chinese lexical analysis (segmentation + POS + named entities) system
- https://github.com/baidu/AnyQ Baidu FAQ automatic question-answering system
- https://github.com/baidu/Senta Baidu sentiment analysis system
System integration
Segmentation speed
- 1.5 MB / Second in Full Mode
- 400 KB / Second in Default Mode
- Test environment: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt
FAQ
1. How was the model data generated?
See: https://github.com/fxsjy/jieba/issues/7
2. "台中" is always segmented as "台 中"? (and similar cases)
P(台中) < P(台) × P(中): the frequency of "台中" in the dictionary is too low, so its word-forming probability is low.
Solution: force a higher frequency.
jieba.add_word('台中')
or
jieba.suggest_freq('台中', True)
3. "今天天气 不错" should be segmented as "今天 天气 不错"? (and similar cases)
Solution: force a lower frequency.
jieba.suggest_freq(('今天', '天气'), True)
or delete the word directly: jieba.del_word('今天天气')
4. Words that are not in the dictionary are being segmented out, with unsatisfactory results?
Solution: disable new-word discovery.
jieba.cut('丰田太省了', HMM=False)
jieba.cut('我们中出了一个叛徒', HMM=False)
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Revision history
https://github.com/fxsjy/jieba/blob/master/Changelog
jieba
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
Features
- Supports three segmentation modes:
- Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
- Full Mode gets all the possible words from the sentence. Fast but not accurate.
- Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
- Supports Traditional Chinese
- Supports customized dictionaries
- MIT License
Online demo
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Usage
- Fully automatic installation:
easy_install jieba
or pip install jieba
- Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the Python site-packages directory.
- Import it with import jieba.
Algorithm
- Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
- Use dynamic programming to find the most probable combination based on the word frequency.
- For unknown words, an HMM-based model is used with the Viterbi algorithm.
Main Functions
1. Cut
- The jieba.cut function accepts three input parameters: the string to be cut; the cut_all parameter, controlling the cut mode; and the HMM parameter, controlling whether to use the Hidden Markov Model.
- jieba.cut_for_search accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
- The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
- jieba.cut and jieba.cut_for_search return a generator, from which you can use a for loop to get the segmentation result (in unicode). jieba.lcut and jieba.lcut_for_search return a list.
- jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.
Code example: segmentation
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # Full Mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # Default Mode

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # Search Engine Mode
print(", ".join(seg_list))
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Words Recognize]: 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2. Add a custom dictionary
Load dictionary
- Developers can specify their own custom dictionary to supplement the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
- Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
- The word frequency and POS tag can each be omitted. The word frequency will be filled with a suitable value if omitted.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
- Change a Tokenizer's tmp_dir and cache_file attributes to specify the path of the cache file, for use on a restricted file system.
- Example:
云计算 5
李小福 2
创新办 3
[Before]: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
[After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Modify dictionary
- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented.
- Note that HMM may affect the final result.
Example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
3. Keyword Extraction
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: how many keywords with the highest TF/IDF weights to return. The default value is 20
- withWeight: whether to return TF/IDF weights together with the keywords. The default value is False
- allowPOS: keep only words with the specified POS tags. Empty means no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path specifies the IDF file path.
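A short usage sketch; the sample sentence is the one used in the overview near the top of this page, and the call itself is the extract_tags interface documented above:
import jieba.analyse

text = "线程是程序执行流的最小单元，是进程中的一个实体，是被系统独立调度和分派的基本单位"
for keyword, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(keyword, weight)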
Example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Developers can specify their own custom IDF corpus in jieba keyword extraction
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path for the custom corpus
- Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
Developers can specify their own custom stop words corpus in jieba keyword extraction
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path for the custom corpus
- Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
There's also a TextRank implementation available.
Use: jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
Note that it filters POS by default.
jieba.analyse.TextRank() creates a new TextRank instance.
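A usage sketch mirroring the TF-IDF call above; textrank accepts the same arguments, so only the function name changes:
import jieba.analyse

text = "线程是程序执行流的最小单元，是进程中的一个实体，是被系统独立调度和分派的基本单位"
for keyword, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print(keyword, weight)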
4. Part of Speech Tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer; the tokenizer parameter specifies the jieba.Tokenizer to use internally. jieba.posseg.dt is the default POSTokenizer.
- Tags the POS of each word after segmentation, using labels compatible with ictclas.
- Example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...    print('%s %s' % (w.word, w.flag))
...
我 r
爱 v
北京 ns
天安门 ns
5. Parallel Processing
- Principle: Split the target text by line, assign the lines to multiple Python processes for parallel segmentation, and then merge the results, which is considerably faster.
- Based on the multiprocessing module of Python.
- Usage:
jieba.enable_parallel(4)   # Enable parallel processing. The parameter is the number of processes.
jieba.disable_parallel()   # Disable parallel processing.
- Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
- Result: On a four-core 3.4GHz Linux machine, accurate word segmentation of the Complete Works of Jin Yong reaches 1 MB/s, which is 3.3 times faster than the single-process version.
- Note that parallel processing supports only the default tokenizers, jieba.dt and jieba.posseg.dt.
6. Tokenize: return words with position
- The input must be unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限公司            start: 6                end:10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限                start: 6                end:8
word 公司                start: 8                end:10
word 有限公司            start: 6                end:10
7. ChineseAnalyzer for Whoosh
- Import: from jieba.analyse import ChineseAnalyzer
- Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
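A minimal sketch of plugging the analyzer into a Whoosh schema; the index directory and field names are illustrative, not part of jieba:
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))
ix = create_in("indexdir", schema)  # the "indexdir" directory must already exist
writer = ix.writer()
writer.add_document(title=u"1", content=u"我来到北京清华大学")
writer.commit()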
8. Command Line Interface
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Initialization
By default, Jieba doesn't build the prefix dictionary unless it's necessary. This takes 1-3 seconds, after which it is not initialized again. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Using Other Dictionaries
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
- A smaller dictionary for a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
- There is also a bigger dictionary that has better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called dict.txt and included in the distribution.
In either case, download the file you want, then call jieba.set_dictionary('data/dict.txt.big') or just replace the existing dict.txt.
Segmentation speed
- 1.5 MB / Second in Full Mode
- 400 KB / Second in Default Mode
- Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt