Top Related Projects
Quick Overview
Jieba is a popular Chinese text segmentation library for Python. It provides efficient and accurate word segmentation, part-of-speech tagging, and keyword extraction for Chinese text processing tasks.
Pros
- Fast and efficient segmentation algorithm
- Supports both simplified and traditional Chinese
- Customizable dictionary for domain-specific segmentation
- Offers various segmentation modes (accurate, full, search engine)
Cons
- Limited support for other languages besides Chinese
- May require fine-tuning for specific domains or dialects
- Documentation is primarily in Chinese, which can be challenging for non-Chinese speakers
Code Examples
- Basic word segmentation:
import jieba
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
- Part-of-speech tagging:
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
print(f'{word} {flag}')
- Keyword extraction:
import jieba.analyse
text = "线程是程序执行流的最小单元,是进程中的一个实体,是被系统独立调度和分派的基本单位"
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for keyword, weight in keywords:
print(f'{keyword} {weight}')
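- Custom dictionary (a minimal sketch using the load_userdict/add_word APIs described in the README below; the file name userdict.txt and the sample sentence are only illustrative):
import jieba
# userdict.txt: one word per line, optionally followed by a frequency and POS tag,
# e.g. "云计算 5" or "创新办 3 i"
jieba.load_userdict("userdict.txt")
jieba.add_word("云计算")  # words can also be added at runtime
print("/ ".join(jieba.cut("李小福是创新办主任也是云计算方面的专家")))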
Getting Started
To get started with Jieba, follow these steps:
- Install Jieba using pip:
pip install jieba
- Import Jieba in your Python script:
import jieba
- Use Jieba for word segmentation:
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print("/ ".join(seg_list))
This will output the segmented words separated by slashes.
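For the example sentence above, the default (accurate) mode prints the following (this matches the sample output in the project README further down):
我/ 来到/ 北京/ 清华大学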
Competitor Comparisons
HanLP: Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other natural language processing tasks
Pros of HanLP
- More comprehensive NLP functionality beyond just word segmentation
- Supports multiple languages and domains
- Actively maintained with frequent updates
Cons of HanLP
- Steeper learning curve due to more complex API
- Larger memory footprint and slower processing speed
- Requires more setup and configuration
Code Comparison
HanLP:
from pyhanlp import *
text = "我爱北京天安门"
print(HanLP.segment(text))
Jieba:
import jieba
text = "我爱北京天安门"
print("/".join(jieba.cut(text)))
Both libraries offer simple APIs for basic word segmentation, but HanLP provides more advanced NLP functions out of the box. Jieba's simpler interface makes it easier to get started with basic tasks, while HanLP offers more flexibility for complex NLP projects.
HanLP's broader feature set comes at the cost of increased complexity and resource usage. Jieba, being more focused on word segmentation, is generally faster and lighter.
For projects requiring only Chinese word segmentation, Jieba may be sufficient. However, for more comprehensive NLP tasks or multi-language support, HanLP could be a better choice despite its steeper learning curve.
Baidu NLP: word segmentation, POS tagging, named entity recognition, and word importance
Pros of LAC
- More advanced NLP capabilities, including part-of-speech tagging and named entity recognition
- Potentially better accuracy for complex sentences and specialized domains
- Actively maintained by Baidu, a major tech company with NLP expertise
Cons of LAC
- Larger model size and potentially slower processing speed
- More complex setup and usage compared to Jieba's simplicity
- Less community support and third-party integrations
Code Comparison
Jieba usage:
import jieba
seg_list = jieba.cut("我来到北京清华大学")
print(" ".join(seg_list))
LAC usage:
from LAC import LAC
lac = LAC(mode='seg')
seg_result = lac.run("我来到北京清华大学")
print(" ".join(seg_result))
Both libraries offer simple APIs for basic word segmentation, but LAC requires an additional step of initializing the model. LAC also provides more advanced NLP functions beyond this basic example.
LAC is better suited for projects requiring comprehensive NLP capabilities and potentially higher accuracy, especially in specialized domains. Jieba remains a solid choice for simpler projects prioritizing ease of use and faster processing speed.
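To illustrate the extra tagging step, here is a minimal sketch of LAC's joint segmentation and tagging mode; it assumes LAC's documented 'lac' mode, and the exact return format may vary between versions:
from LAC import LAC

lac = LAC(mode='lac')  # 'lac' mode returns words together with their POS/NER tags
words, tags = lac.run("我来到北京清华大学")
for word, tag in zip(words, tags):
    print(word, tag)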
Python library for processing Chinese text
Pros of SnowNLP
- Offers sentiment analysis and text classification features
- Includes pinyin conversion functionality
- Provides text summarization capabilities
Cons of SnowNLP
- Less active development and community support
- More limited word segmentation accuracy compared to Jieba
- Smaller corpus and dictionary size
Code Comparison
Jieba word segmentation:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
SnowNLP word segmentation:
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.words)
Both libraries offer Chinese text processing capabilities, but they have different strengths. Jieba is primarily focused on word segmentation and is widely regarded as one of the best options for this task. It has a larger user base and more frequent updates.
SnowNLP, on the other hand, provides a broader range of NLP functions, including sentiment analysis and text summarization. However, its word segmentation may not be as accurate as Jieba's for certain texts.
The choice between the two depends on the specific requirements of your project. If you need advanced word segmentation, Jieba is likely the better choice. For a more comprehensive NLP toolkit with additional features, SnowNLP might be more suitable.
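As a rough sketch of the extra features mentioned above (sentiment scoring and pinyin conversion), assuming SnowNLP's documented attributes:
from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.words)       # word segmentation
print(s.sentiments)  # sentiment score in [0, 1]; higher means more positive
print(s.pinyin)      # pinyin conversion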
"结巴"中文分词的C++版本
Pros of cppjieba
- Faster performance due to C++ implementation
- Lower memory usage compared to Python-based Jieba
- Suitable for high-performance applications and embedded systems
Cons of cppjieba
- Less extensive documentation and community support
- Fewer built-in features and extensions compared to Jieba
- Requires C++ knowledge for integration and customization
Code Comparison
Jieba (Python):
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
cppjieba (C++):
#include "cppjieba/Jieba.hpp"
// assumes a cppjieba::Jieba instance named "jieba", constructed with the dictionary paths
std::vector<std::string> words;
jieba.Cut("我来到北京清华大学", words, false);
std::cout << limonp::Join(words.begin(), words.end(), "/") << std::endl;
Both libraries aim to provide Chinese word segmentation functionality, but they cater to different use cases and developer preferences. Jieba offers a more user-friendly Python interface with extensive features, making it ideal for rapid development and data analysis tasks. On the other hand, cppjieba provides a performance-oriented C++ implementation, suitable for scenarios where speed and resource efficiency are crucial. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and available system resources.
README
jieba
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
- Scroll down for English documentation.
Features
- Supports four segmentation modes:
- Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis.
- Full mode: scans out all the words in the sentence that can possibly form a word; very fast, but cannot resolve ambiguity.
- Search engine mode: based on accurate mode, long words are cut again into shorter words to improve recall; suitable for search-engine segmentation.
- Paddle mode: uses the PaddlePaddle deep learning framework to train a sequence-labelling (bidirectional GRU) network model for segmentation, and also supports POS tagging. Paddle mode requires installing paddlepaddle-tiny:
pip install paddlepaddle-tiny==1.6.1
Paddle mode is supported by jieba v0.40 and above; for older versions, upgrade jieba with pip install jieba --upgrade. See the PaddlePaddle website.
- Supports Traditional Chinese segmentation
- Supports custom dictionaries
- MIT license
Installation
The code is compatible with both Python 2 and 3.
- Fully automatic installation:
easy_install jieba
or pip install jieba / pip3 install jieba
- Semi-automatic installation: first download http://pypi.python.org/pypi/jieba/ , extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the site-packages directory.
- Import it with
import jieba
- To use paddle-mode segmentation and POS tagging, first install paddlepaddle-tiny:
pip install paddlepaddle-tiny==1.6.1
Algorithm
- Based on a prefix dictionary for efficient word-graph scanning, building a directed acyclic graph (DAG) of all possible word combinations formed by the Chinese characters in a sentence
- Uses dynamic programming to find the maximum-probability path, i.e. the most likely segmentation based on word frequency
- For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm
Main Functions
1. Segmentation
- The jieba.cut method accepts four input parameters: the string to be segmented; the cut_all parameter, which controls whether to use full mode; the HMM parameter, which controls whether to use the HMM model; and the use_paddle parameter, which controls whether to use paddle-mode segmentation. Paddle mode is loaded lazily: install paddlepaddle-tiny and import the relevant code via the enable_paddle interface.
- The jieba.cut_for_search method accepts two parameters: the string to be segmented and whether to use the HMM model. It is suitable for building inverted indexes for search engines, since its granularity is finer.
- The string to be segmented can be a unicode or UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, because it may be unexpectedly mis-decoded as UTF-8.
- jieba.cut and jieba.cut_for_search both return an iterable generator; use a for loop to obtain each word (unicode), or use jieba.lcut and jieba.lcut_for_search to return a list directly.
- jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized tokenizer, which lets you use different dictionaries at the same time. jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer.
Code example
# encoding=utf-8
import jieba

jieba.enable_paddle()  # enable paddle mode (supported since v0.40; earlier versions do not support it)
strs = ["我来到北京清华大学", "乒乓球拍卖完了", "中国科学技术大学"]
for s in strs:
    seg_list = jieba.cut(s, use_paddle=True)  # paddle mode
    print("Paddle Mode: " + '/'.join(list(seg_list)))

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here, "杭研" is not in the dictionary, but is still identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2. Add a custom dictionary
Load a dictionary
- Developers can specify their own custom dictionary so that it contains words not found in the jieba dictionary. Although jieba can recognize new words, adding your own new words guarantees a higher accuracy.
- Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line has three parts separated by spaces: word, frequency (optional), and POS tag (optional), and the order must not be changed. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
- When the frequency is omitted, an automatically computed value that guarantees the word can be segmented out is used.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
- Change the tmp_dir and cache_file attributes of the tokenizer (jieba.dt by default) to specify the folder and file name of the cache file, for use on restricted file systems.
- Examples:
- Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
- Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
- After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Adjust the dictionary
- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically within a program.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented out.
- Note: automatically computed frequencies may be ineffective when the HMM new-word discovery feature is used.
Code example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
- "Using a user-defined dictionary to improve disambiguation" --- https://github.com/fxsjy/jieba/issues/14
3. Keyword extraction
Keyword extraction based on the TF-IDF algorithm
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence is the text to extract keywords from
- topK is the number of keywords with the highest TF-IDF weights to return; the default is 20
- withWeight controls whether to also return the keyword weights; the default is False
- allowPOS includes only words with the specified POS tags; the default is empty, i.e. no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
Code example (keyword extraction):
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
- Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-word corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom corpus
- Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
Example of returning keyword weights together with the keywords
Keyword extraction based on the TextRank algorithm
- jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) can be used directly; the interface is the same, but note that it filters POS tags by default.
- jieba.analyse.TextRank() creates a new custom TextRank instance
Algorithm paper: TextRank: Bringing Order into Texts
Basic idea:
- Segment the text from which keywords are to be extracted
- Build a graph from co-occurrence relations between words within a fixed window size (5 by default, adjustable via the span attribute)
- Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
Usage example:
See test/demo.py
4. POS tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized tokenizer; the tokenizer parameter specifies the internally used jieba.Tokenizer. jieba.posseg.dt is the default POS-tagging tokenizer.
- Tags the POS of every word of the segmented sentence, using labels compatible with ictclas.
- In addition to jieba's default segmentation mode, POS tagging in paddle mode is also provided. Paddle mode is loaded lazily: install paddlepaddle-tiny and import the relevant code via enable_paddle().
- Usage example
>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")  # jieba default mode
>>> jieba.enable_paddle()  # enable paddle mode (supported since v0.40; earlier versions do not support it)
>>> words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle mode
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
The POS tag set used in paddle mode is shown below.
Paddle mode uses 24 POS tags (lowercase letters) and 4 named-entity category tags (uppercase letters):
Tag | Meaning | Tag | Meaning | Tag | Meaning | Tag | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | noun of direction | s | noun of place | t | time word |
nr | person name | ns | place name | nt | organization name | nw | work/title name |
nz | other proper noun | v | common verb | vd | verb used as adverb | vn | verb used as noun |
a | adjective | ad | adjective used as adverb | an | adjective used as noun | d | adverb |
m | numeral | q | measure word | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | location name | ORG | organization name | TIME | time |
5. Parallel segmentation
- Principle: split the target text by line, distribute the lines to multiple Python processes for parallel segmentation, and then merge the results, obtaining a considerable speed-up.
- Based on Python's built-in multiprocessing module; Windows is currently not supported.
- Usage:
jieba.enable_parallel(4)  # enable parallel segmentation; the parameter is the number of parallel processes
jieba.disable_parallel()  # disable parallel segmentation
- Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
- Experimental result: on a 4-core 3.4GHz Linux machine, accurate segmentation of the complete works of Jin Yong reached 1 MB/s, 3.3 times the speed of the single-process version.
- Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.
6. Tokenize: return the start and end positions of words in the original text
- Note: the input only accepts unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限公司            start: 6                end:10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限                start: 6                end:8
word 公司                start: 8                end:10
word 有限公司            start: 6                end:10
7. ChineseAnalyzer for the Whoosh search engine
- Import: from jieba.analyse import ChineseAnalyzer
- Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
8. Command-line segmentation
Usage example: python -m jieba news.txt > cut_result.txt
Command-line options (see the --help output):
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Lazy loading
jieba uses lazy loading: import jieba and jieba.Tokenizer() do not immediately trigger loading of the dictionary; the dictionary is loaded and the prefix dictionary built only once it becomes necessary. If you prefer, you can also initialize jieba manually:
import jieba
jieba.initialize()  # manual initialization (optional)
Versions before 0.28 could not specify the path of the main dictionary; with the lazy loading mechanism you can now change it:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other dictionaries
- A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
- A dictionary file with better support for Traditional Chinese: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt with it, or use jieba.set_dictionary('data/dict.txt.big')
Implementations in other languages
- Jieba for Java (author: piaolingxue): https://github.com/huaban/jieba-analysis
- Jieba for C++ (author: yanyiwu): https://github.com/yanyiwu/cppjieba
- Jieba for Rust (authors: messense, MnO2): https://github.com/messense/jieba-rs
- Jieba for Node.js (author: yanyiwu): https://github.com/yanyiwu/nodejieba
- Jieba for Erlang (author: falood): https://github.com/falood/exjieba
- Jieba for R (author: qinwf): https://github.com/qinwf/jiebaR
- Jieba for iOS (author: yanyiwu): https://github.com/yanyiwu/iosjieba
- Jieba for PHP (author: fukuball): https://github.com/fukuball/jieba-php
- Jieba for .NET (C#) (author: anderscui): https://github.com/anderscui/jieba.NET/
- Jieba for Go (author: wangbin): https://github.com/wangbin/jiebago ; (author: yanyiwu): https://github.com/yanyiwu/gojieba
- Jieba for Android (author: Dongliang.W): https://github.com/452896915/jieba-android
Related links
- https://github.com/baidu/lac Baidu Chinese lexical analysis (segmentation + POS + named entities) system
- https://github.com/baidu/AnyQ Baidu FAQ automatic question-answering system
- https://github.com/baidu/Senta Baidu sentiment analysis system
System integration
Segmentation speed
- 1.5 MB / Second in Full Mode
- 400 KB / Second in Default Mode
- Test environment: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt
FAQ
1. How was the model data generated?
See: https://github.com/fxsjy/jieba/issues/7
2. "台中" is always segmented as "台 中"? (and similar cases)
P(台中) < P(台) × P(中): the frequency of "台中" in the dictionary is too low, so its word-forming probability is low.
Solution: force a higher frequency.
jieba.add_word('台中')
or
jieba.suggest_freq('台中', True)
3. "今天天气 不错" should be segmented as "今天 天气 不错"? (and similar cases)
Solution: force a lower frequency.
jieba.suggest_freq(('今天', '天气'), True)
or delete the word directly: jieba.del_word('今天天气')
4. Words that are not in the dictionary are being segmented out, with unsatisfactory results?
Solution: disable new-word discovery.
jieba.cut('丰田太省了', HMM=False)
jieba.cut('我们中出了一个叛徒', HMM=False)
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Revision history
https://github.com/fxsjy/jieba/blob/master/Changelog
jieba
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.
Features
- Supports three segmentation modes:
- Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
- Full Mode gets all the possible words from the sentence. Fast but not accurate.
- Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
- Supports Traditional Chinese
- Supports customized dictionaries
- MIT License
Online demo
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Usage
- Fully automatic installation:
easy_install jieba
or pip install jieba
- Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ , extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the Python site-packages directory.
- Import it with import jieba.
Algorithm
- Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
- Use dynamic programming to find the most probable combination based on the word frequency.
- For unknown words, an HMM-based model is used with the Viterbi algorithm.
Main Functions
1. Cut
- The jieba.cut function accepts three input parameters: the string to be cut; the cut_all parameter, controlling the cut mode; and the HMM parameter, controlling whether to use the Hidden Markov Model.
- jieba.cut_for_search accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
- The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that using GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
- jieba.cut and jieba.cut_for_search return a generator, from which you can use a for loop to get the segmentation result (in unicode). jieba.lcut and jieba.lcut_for_search return a list.
- jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.
Code example: segmentation
#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # Full Mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # Default Mode

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # Search Engine Mode
print(", ".join(seg_list))
Output:
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Words Recognize]: 他, 来到, 了, 网易, 杭研, 大厦 (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
2. Add a custom dictionary
Load dictionary
- Developers can specify their own custom dictionary to supplement the jieba default dictionary. Jieba is able to identify new words, but adding your own new words can ensure a higher accuracy.
- Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
- The dictionary format is the same as that of dict.txt: one word per line; each line is divided into three parts separated by a space: word, word frequency, POS tag. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
- The word frequency and POS tag can each be omitted. The word frequency will be filled with a suitable value if omitted.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
- Change a Tokenizer's tmp_dir and cache_file attributes to specify the path of the cache file, for use on a restricted file system.
- Example:
云计算 5
李小福 2
创新办 3
[Before]: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
[After]: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
Modify dictionary
- Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
- Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented.
- Note that HMM may affect the final result.
Example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
3. Keyword Extraction
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: how many keywords with the highest TF/IDF weights to return. The default value is 20
- withWeight: whether to return TF/IDF weights together with the keywords. The default value is False
- allowPOS: keep only words with the specified POS tags. Empty means no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path specifies the IDF file path.
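A short usage sketch; the sample sentence is the one used in the overview near the top of this page, and the call itself is the extract_tags interface documented above:
import jieba.analyse

text = "线程是程序执行流的最小单元，是进程中的一个实体，是被系统独立调度和分派的基本单位"
for keyword, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(keyword, weight)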
Example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Developers can specify their own custom IDF corpus in jieba keyword extraction
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path for the custom corpus
- Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
Developers can specify their own custom stop words corpus in jieba keyword extraction
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path for the custom corpus
- Custom Corpus Sample: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Sample Code: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
There's also a TextRank implementation available.
Use: jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
Note that it filters POS by default.
jieba.analyse.TextRank() creates a new TextRank instance.
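A usage sketch mirroring the TF-IDF call above; textrank accepts the same arguments, so only the function name changes:
import jieba.analyse

text = "线程是程序执行流的最小单元，是进程中的一个实体，是被系统独立调度和分派的基本单位"
for keyword, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print(keyword, weight)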
4. Part of Speech Tagging
- jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer; the tokenizer parameter specifies the jieba.Tokenizer to use internally. jieba.posseg.dt is the default POSTokenizer.
- Tags the POS of each word after segmentation, using labels compatible with ictclas.
- Example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...    print('%s %s' % (w.word, w.flag))
...
我 r
爱 v
北京 ns
天安门 ns
5. Parallel Processing
- Principle: Split the target text by line, assign the lines to multiple Python processes for parallel segmentation, and then merge the results, which is considerably faster.
- Based on the multiprocessing module of Python.
- Usage:
jieba.enable_parallel(4)   # Enable parallel processing. The parameter is the number of processes.
jieba.disable_parallel()   # Disable parallel processing.
- Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
- Result: On a four-core 3.4GHz Linux machine, accurate word segmentation of the Complete Works of Jin Yong reaches 1 MB/s, which is 3.3 times faster than the single-process version.
- Note that parallel processing supports only the default tokenizers, jieba.dt and jieba.posseg.dt.
6. Tokenize: return words with position
- The input must be unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限公司            start: 6                end:10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和                start: 0                end:2
word 服装                start: 2                end:4
word 饰品                start: 4                end:6
word 有限                start: 6                end:8
word 公司                start: 8                end:10
word 有限公司            start: 6                end:10
7. ChineseAnalyzer for Whoosh
- Import: from jieba.analyse import ChineseAnalyzer
- Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
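A minimal sketch of plugging the analyzer into a Whoosh schema; the index directory and field names are illustrative, not part of jieba:
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))
ix = create_in("indexdir", schema)  # the "indexdir" directory must already exist
writer = ix.writer()
writer.add_document(title=u"1", content=u"我来到北京清华大学")
writer.commit()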
8. Command Line Interface
$> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
Initialization
By default, Jieba doesn't build the prefix dictionary unless it's necessary. This takes 1-3 seconds, after which it is not initialized again. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Using Other Dictionaries
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
- A smaller dictionary for a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
- There is also a bigger dictionary that has better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called dict.txt and included in the distribution.
In either case, download the file you want, then call jieba.set_dictionary('data/dict.txt.big') or just replace the existing dict.txt.
Segmentation speed
- 1.5 MB / Second in Full Mode
- 400 KB / Second in Default Mode
- Test Env: Intel(R) Core(TM) i7-2600 CPU @ 3.4GHz; 《围城》.txt