
yanyiwu/cppjieba

"结巴"中文分词的C++版本

Top Related Projects

  • jieba (33,063 stars): "Jieba" Chinese text segmentation
  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • LAC (3,840 stars): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • NLPIR (3,401 stars)

Quick Overview

CppJieba is a Chinese word segmentation library implemented in C++. It provides efficient and accurate tokenization for Chinese text, supporting various segmentation modes and custom dictionaries.

Pros

  • High performance and low memory usage
  • Supports multiple segmentation modes (MPSegment, HMMSegment, MixSegment, FullSegment, QuerySegment)
  • Allows custom dictionary integration
  • Thread-safe implementation

Cons

  • Limited documentation, especially for advanced usage
  • Focused on Chinese text; not suitable for segmenting other languages
  • Requires C++ knowledge for integration and customization
  • Less active development in recent years

Code Examples

  1. Basic word segmentation:
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Dictionary files ship in the repository's dict/ directory
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    for (const auto& word : words) {
        std::cout << word << " ";
    }
    std::cout << std::endl;
    return 0;
}
  2. Using different segmentation modes:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";

    // HMM segmentation
    jieba.CutHMM(sentence, words);

    // Full segmentation
    jieba.CutAll(sentence, words);

    // Query (search-engine) segmentation
    jieba.CutForSearch(sentence, words);

    return 0;
}
  3. Adding a custom dictionary:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // The third path loads a custom user dictionary
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    return 0;
}

Getting Started

  1. Clone the repository together with its limonp submodule:

    git clone --recursive https://github.com/yanyiwu/cppjieba.git
    
  2. Include the necessary headers in your C++ project:

    #include "cppjieba/Jieba.hpp"
    
  3. Create a Jieba instance and use it for segmentation:

    cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8", "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    std::string sentence = "你好世界";
    jieba.Cut(sentence, words);
    
  4. Compile your project with C++11 support, adding the cppjieba headers and the bundled limonp headers (a git submodule under deps/) to the include path:

    g++ -std=c++11 your_file.cpp -I/path/to/cppjieba/include -I/path/to/cppjieba/deps/limonp/include
    

Competitor Comparisons

jieba (33,063 stars)

"Jieba" Chinese text segmentation

Pros of jieba

  • Written in Python, making it more accessible for Python developers
  • Larger community and more frequent updates
  • Supports more features like keyword extraction and TextRank

Cons of jieba

  • Slower performance compared to C++ implementation
  • Higher memory usage due to Python's overhead
  • Less suitable for embedding in C/C++ applications

Code Comparison

jieba (Python):

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

cppjieba (C++):

#include "cppjieba/Jieba.hpp"
// assumes a cppjieba::Jieba instance constructed with its dictionary paths
vector<string> words;
jieba.Cut("我来到北京清华大学", words, true);
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide similar functionality for Chinese word segmentation, but cppjieba offers better performance and lower memory usage due to its C++ implementation. jieba, on the other hand, is more feature-rich and has a larger community, making it a popular choice for Python developers. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and desired features.

HanLP (33,448 stars)

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • More comprehensive NLP toolkit with broader functionality beyond just word segmentation
  • Supports multiple languages, not limited to Chinese
  • Actively maintained with frequent updates and improvements

Cons of HanLP

  • Larger codebase and dependencies, potentially more complex to integrate
  • May have higher resource requirements due to its extensive feature set
  • Learning curve can be steeper for users only needing basic word segmentation

Code Comparison

HanLP example:

HanLP.segment("我的希望是希望张晚霞的背影被晚霞映红")

CppJieba example:

vector<string> words;
jieba.Cut("我的希望是希望张晚霞的背影被晚霞映红", words);

Summary

HanLP offers a more comprehensive NLP toolkit with support for multiple languages, while CppJieba focuses specifically on Chinese word segmentation. HanLP's broader feature set comes at the cost of increased complexity and potentially higher resource requirements. CppJieba, being more focused, may be simpler to integrate for projects only needing Chinese word segmentation. The choice between the two depends on the specific requirements of the project and the desired balance between functionality and simplicity.

LAC (3,840 stars)

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • Supports both word segmentation and part-of-speech tagging
  • Utilizes deep learning techniques for improved accuracy
  • Offers pre-trained models for various domains

Cons of LAC

  • Larger resource footprint due to deep learning models
  • May require more setup and dependencies
  • Potentially slower processing speed for simple tasks

Code Comparison

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

CppJieba:

#include "cppjieba/Jieba.hpp"

cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                      "./dict/user.dict.utf8", "./dict/idf.utf8",
                      "./dict/stop_words.utf8");
std::string s = "我爱北京天安门";
std::vector<std::string> words;
jieba.Cut(s, words);

Summary

LAC offers more advanced features and potentially higher accuracy for complex NLP tasks, leveraging deep learning techniques. However, it may require more resources and setup compared to CppJieba. CppJieba, on the other hand, is lightweight and easy to integrate, making it suitable for simpler word segmentation tasks or resource-constrained environments. The choice between the two depends on the specific requirements of the project, such as accuracy needs, resource availability, and the complexity of the NLP tasks at hand.

NLPIR (3,401 stars)

Pros of NLPIR

  • More comprehensive NLP toolkit with additional features beyond word segmentation
  • Supports multiple languages, including English and Chinese
  • Includes named entity recognition and sentiment analysis capabilities

Cons of NLPIR

  • Less actively maintained, with fewer recent updates
  • More complex setup and integration process
  • Closed-source core components, limiting customization options

Code Comparison

NLPIR (C++):

CNLPIR::GetInstance().Init();
const char* sResult = CNLPIR::GetInstance().ParagraphProcess("我是中国人", 0);
cout << sResult << endl;
CNLPIR::GetInstance().Exit();

cppjieba (C++):

cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                      "./dict/user.dict.utf8", "./dict/idf.utf8",
                      "./dict/stop_words.utf8");
vector<string> words;
jieba.Cut("我是中国人", words);
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide word segmentation functionality, but NLPIR offers a more comprehensive toolkit with additional NLP features. cppjieba is more focused on Chinese word segmentation and is easier to integrate into projects. NLPIR supports multiple languages, while cppjieba is primarily designed for Chinese text processing. The code examples demonstrate the basic usage of each library for word segmentation tasks.

README

CppJieba

Introduction

CppJieba is the C++ version of the "Jieba" Chinese word segmentation library.

Features

  • All of the source code lives in header files under include/cppjieba/*.hpp; just include and use.
  • Supports UTF-8 encoding.
  • Ships with a fairly complete unit test suite; the core segmentation functionality (UTF-8) has been proven stable in production environments.
  • Supports loading custom user dictionaries; multiple dictionary paths can be joined with the separators '|' or ';' (see the sketch after this list).
  • Supports Linux, Mac OSX, and Windows.
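
A minimal sketch of loading more than one user dictionary, assuming the standard five-path constructor and the dictionaries shipped in dict/; the second user-dictionary path, ./extra.dict.utf8, is hypothetical:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Two user dictionaries joined with ';' (the second path is hypothetical)
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8;./extra.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    jieba.Cut("令狐冲是云计算行业的专家", words);
    return 0;
}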

Usage

Dependencies

  • g++ (version >= 4.1 recommended) or clang++
  • cmake (version >= 2.6 recommended)

Download and Build

git clone https://github.com/yanyiwu/cppjieba.git
cd cppjieba
git submodule init
git submodule update
mkdir build
cd build
cmake ..
make

Optionally, run the tests:

make test

Demo

./demo

Sample output:

他来到了网易杭研大厦
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]

See test/demo.cpp for details.
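
The offset-bearing output above comes from the overload of CutForSearch that fills cppjieba::Word structs instead of plain strings. A condensed sketch following test/demo.cpp; the dictionary paths assume the files shipped in dict/:

#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8", "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::string s = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造";

    // CutForSearch into plain strings
    std::vector<std::string> words;
    jieba.CutForSearch(s, words);
    std::cout << limonp::Join(words.begin(), words.end(), "/") << std::endl;

    // CutForSearch into cppjieba::Word, which also records byte offsets
    std::vector<cppjieba::Word> jiebawords;
    jieba.CutForSearch(s, jiebawords, true);
    std::cout << jiebawords << std::endl;  // limonp supplies the vector printer
    return 0;
}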

Segmentation Examples

MPSegment

Output:

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

HMMSegment

我来到北京清华大学
我来/到/北京/清华大学

他来到了网易杭研大厦
他来/到/了/网易/杭/研大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业于/中国/科学院/计算所/,/后/在/日/本/京/都/大/学/深/造

MixSegment

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

FullSegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算/计算所/,/后/在/日本/日本京都大学/京都/京都大学/大学/深造

QuerySegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算所/,/后/在/中国/中国科学院/科学/科学院/学院/日本/日本京都大学/京都/京都大学/大学/深造

The first three blocks above show the results of the MP, HMM, and Mix methods respectively.

Mix, which combines the MP and HMM segmentation algorithms, clearly performs best: it accurately segments words that are already in the dictionary, while still recognizing out-of-vocabulary words such as "杭研".

The Full method emits every dictionary word found in the text.

The Query method first segments with Mix, then re-segments the longer resulting words with Full.
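
Each method is also available as a standalone class under include/cppjieba/. A sketch of instantiating the segmenters directly instead of going through cppjieba::Jieba; the constructor arguments follow the headers but should be treated as a sketch, not a reference:

#include <string>
#include <vector>
#include "cppjieba/MPSegment.hpp"
#include "cppjieba/HMMSegment.hpp"
#include "cppjieba/MixSegment.hpp"

int main() {
    std::string s = "他来到了网易杭研大厦";
    std::vector<std::string> words;

    // MP: dictionary-based max-probability segmentation
    cppjieba::MPSegment mp("./dict/jieba.dict.utf8");
    mp.Cut(s, words);

    // HMM: pure hidden-Markov-model segmentation
    cppjieba::HMMSegment hmm("./dict/hmm_model.utf8");
    hmm.Cut(s, words);

    // Mix: MP first, then HMM over the unrecognized spans
    cppjieba::MixSegment mix("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8");
    mix.Cut(s, words);
    return 0;
}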

Custom User Dictionary

See dict/user.dict.utf8 for an example of a custom dictionary.

Result without the custom user dictionary:

令狐冲/是/云/计算/行业/的/专家

Result with the custom user dictionary:

令狐冲/是/云计算/行业/的/专家
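
Words can also be added at runtime rather than through a dictionary file; a sketch using InsertUserWord, which the demo above also exercises:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8", "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    // Register a word at runtime instead of editing the dictionary file
    jieba.InsertUserWord("云计算");
    std::vector<std::string> words;
    jieba.Cut("令狐冲是云计算行业的专家", words);  // "云计算" now stays intact
    return 0;
}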

Keyword Extraction

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089", "巅峰:9.49396"]

See test/demo.cpp for details.
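
A sketch of the extraction call, following the pattern in test/demo.cpp, where the Jieba instance exposes a public extractor member:

#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8", "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::string s = "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。";

    // Extract the top 5 keywords, ranked by TF-IDF weight
    const size_t topk = 5;
    std::vector<cppjieba::KeywordExtractor::Word> keywords;
    jieba.extractor.Extract(s, keywords, topk);
    std::cout << keywords << std::endl;  // printable via limonp's operators
    return 0;
}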

Part-of-Speech Tagging

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["我:r", "是:v", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当上:t", "CEO:eng", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]

See test/demo.cpp for details.
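
A sketch of the tagging call, mirroring test/demo.cpp; Tag fills a vector of (word, tag) pairs:

#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8", "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8", "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::pair<std::string, std::string>> tagres;
    jieba.Tag("我是拖拉机学院手扶拖拉机专业的。", tagres);
    for (const auto& wt : tagres) {
        std::cout << wt.first << ":" << wt.second << " ";  // word:tag
    }
    std::cout << std::endl;
    return 0;
}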

Custom part-of-speech tags are supported. For example, add the following line to dict/user.dict.utf8:

蓝翔 nz

Tagging 我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,迎娶白富美,走上人生巅峰。 then yields:

["我:r", "是:v", "蓝翔:nz", "技工:n", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当:t", "上:f", "总经理:n", ",:x", "出任:v", "CEO:eng", ",:x", "迎娶:v", "白富美:x", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]

Other Dictionary Resources

Applications

  • GoJieba: Go version of Jieba Chinese word segmentation.
  • NodeJieba: Node.js version of Jieba Chinese word segmentation.
  • simhash: similarity computation for Chinese documents.
  • exjieba: Erlang version of Jieba Chinese word segmentation.
  • jiebaR: R version of Jieba Chinese word segmentation.
  • cjieba: C version of Jieba word segmentation.
  • jieba_rb: Ruby version of Jieba word segmentation.
  • iosjieba: iOS version of Jieba word segmentation.
  • SqlJieba: Jieba plugin for MySQL full-text indexing.
  • pg_jieba: word segmentation plugin for PostgreSQL.
  • simple: word segmentation plugin for SQLite3 FTS5.
  • gitbook-plugin-search-pro: gitbook plugin with Chinese search support.
  • ngx_http_cppjieba_module: Nginx word segmentation plugin.
  • cppjiebapy: Python bindings for CppJieba by jannson; see cppjiebapy_discussion for the related discussion.
  • cppjieba-py: Python bindings by bung87 based on pybind11, with a feel close to the original jieba.
  • KeywordServer: a Chinese keyword-extraction server in 50 lines.
  • cppjieba-server: CppJieba HTTP server.
  • phpjieba: PHP extension of Jieba word segmentation.
  • perl5-jieba: Perl extension of Jieba word segmentation.
  • jieba-dlang: Deimos bindings of Jieba word segmentation for the D language.

Online Demo

Web-Demo (Chrome recommended)

Performance

Benchmarks of the Jieba Chinese word segmentation family

Sponsorship

Contributors

Code Contributors

This project exists thanks to all the people who contribute.