Top Related Projects
Quick Overview
CppJieba is a Chinese word segmentation library implemented in C++. It provides efficient and accurate tokenization for Chinese text, supporting various segmentation modes and custom dictionaries.
Pros
- High performance and low memory usage
- Supports multiple segmentation modes (MPSegment, HMMSegment, MixSegment, FullSegment, QuerySegment)
- Allows custom dictionary integration
- Thread-safe implementation
Cons
- Limited documentation, especially for advanced usage
- Primarily focused on Chinese language, may not be suitable for other languages
- Requires C++ knowledge for integration and customization
- Less active development in recent years
Code Examples
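- Basic word segmentation:
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Assumes a CppJieba release whose constructor falls back to the bundled
    // dictionaries; older releases need explicit dictionary paths.
    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    for (const auto& word : words) {
        std::cout << word << " ";
    }
    std::cout << std::endl;
    return 0;
}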
- Using different segmentation modes:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Assumes the bundled default dictionaries, as in the basic example.
    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    // HMM segmentation
    jieba.CutHMM(sentence, words);
    // Full segmentation
    jieba.CutAll(sentence, words);
    // Query segmentation
    jieba.CutForSearch(sentence, words);
    return 0;
}
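- Adding a custom dictionary:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Paths are relative to the repository root. The third argument is the
    // user dictionary; the IDF and stop-word files are passed explicitly so
    // the call also works with releases that require all five paths.
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    return 0;
}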
Getting Started
- Clone the repository together with its limonp submodule:
git clone --recurse-submodules https://github.com/yanyiwu/cppjieba.git
- Include the necessary header in your C++ project:
#include "cppjieba/Jieba.hpp"
- Create a Jieba instance and use it for segmentation:
cppjieba::Jieba jieba;
std::vector<std::string> words;
std::string sentence = "你好世界";
jieba.Cut(sentence, words);
- Compile your project with C++11 support, adding the CppJieba and limonp include directories:
g++ -std=c++11 your_file.cpp -I/path/to/cppjieba/include -I/path/to/cppjieba/deps/limonp/include
Competitor Comparisons
jieba: "Jieba" Chinese word segmentation
Pros of jieba
- Written in Python, making it more accessible for Python developers
- Larger community and more frequent updates
- Supports more features like keyword extraction and TextRank
Cons of jieba
- Slower performance compared to C++ implementation
- Higher memory usage due to Python's overhead
- Less suitable for embedding in C/C++ applications
Code Comparison
jieba (Python):
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
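cppjieba (C++):
#include <iostream>
#include "cppjieba/Jieba.hpp"
// Fragment: place the statements below inside a function such as main().
cppjieba::Jieba jieba;
std::vector<std::string> words;
jieba.Cut("我来到北京清华大学", words, true);  // true enables the HMM
std::cout << limonp::Join(words.begin(), words.end(), "/") << std::endl;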
Both libraries provide similar functionality for Chinese word segmentation, but cppjieba offers better performance and lower memory usage due to its C++ implementation. jieba, on the other hand, is more feature-rich and has a larger community, making it a popular choice for Python developers. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and desired features.
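HanLP: Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing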
Pros of HanLP
- More comprehensive NLP toolkit with broader functionality beyond just word segmentation
- Supports multiple languages, not limited to Chinese
- Actively maintained with frequent updates and improvements
Cons of HanLP
- Larger codebase and dependencies, potentially more complex to integrate
- May have higher resource requirements due to its extensive feature set
- Learning curve can be steeper for users only needing basic word segmentation
Code Comparison
HanLP example:
HanLP.segment("我的希望是希望张晚霞的背影被晚霞映红")
CppJieba example:
cppjieba::Jieba jieba;
std::vector<std::string> words;
jieba.Cut("我的希望是希望张晚霞的背影被晚霞映红", words);
Summary
HanLP offers a more comprehensive NLP toolkit with support for multiple languages, while CppJieba focuses specifically on Chinese word segmentation. HanLP's broader feature set comes at the cost of increased complexity and potentially higher resource requirements. CppJieba, being more focused, may be simpler to integrate for projects only needing Chinese word segmentation. The choice between the two depends on the specific requirements of the project and the desired balance between functionality and simplicity.
LAC: Baidu NLP toolkit for word segmentation, part-of-speech tagging, named entity recognition, and word importance
Pros of LAC
- Supports both word segmentation and part-of-speech tagging
- Utilizes deep learning techniques for improved accuracy
- Offers pre-trained models for various domains
Cons of LAC
- Larger resource footprint due to deep learning models
- May require more setup and dependencies
- Potentially slower processing speed for simple tasks
Code Comparison
LAC:
from LAC import LAC
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
CppJieba:
#include "cppjieba/Jieba.hpp"
cppjieba::Jieba jieba;
std::string s = "我爱北京天安门";
std::vector<std::string> words;
jieba.Cut(s, words);
Summary
LAC offers more advanced features and potentially higher accuracy for complex NLP tasks, leveraging deep learning techniques. However, it may require more resources and setup compared to CppJieba. CppJieba, on the other hand, is lightweight and easy to integrate, making it suitable for simpler word segmentation tasks or resource-constrained environments. The choice between the two depends on the specific requirements of the project, such as accuracy needs, resource availability, and the complexity of the NLP tasks at hand.
Pros of NLPIR
- More comprehensive NLP toolkit with additional features beyond word segmentation
- Supports multiple languages, including English and Chinese
- Includes named entity recognition and sentiment analysis capabilities
Cons of NLPIR
- Less actively maintained, with fewer recent updates
- More complex setup and integration process
- Closed-source core components, limiting customization options
Code Comparison
NLPIR (C++):
CNLPIR::GetInstance().Init();
const char* sResult = CNLPIR::GetInstance().ParagraphProcess("我是中国人", 0);
cout << sResult << endl;
CNLPIR::GetInstance().Exit();
cppjieba (C++):
cppjieba::Jieba jieba;
std::vector<std::string> words;
jieba.Cut("我是中国人", words);
std::cout << limonp::Join(words.begin(), words.end(), "/") << std::endl;
Both libraries provide word segmentation functionality, but NLPIR offers a more comprehensive toolkit with additional NLP features. cppjieba is more focused on Chinese word segmentation and is easier to integrate into projects. NLPIR supports multiple languages, while cppjieba is primarily designed for Chinese text processing. The code examples demonstrate the basic usage of each library for word segmentation tasks.
README
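CppJieba
Introduction
CppJieba is the C++ version of the "Jieba" (结巴) Chinese word segmentation library.
Features
- All source code lives in the header files include/cppjieba/*.hpp; including them is all that is needed to use the library.
- Supports utf8 encoding.
- Ships with a fairly complete set of unit tests; the stability of the core feature, Chinese word segmentation (utf8), has been proven in production environments.
- Supports loading custom user dictionaries; multiple paths can be separated with '|' or ';'.
- Supports the Linux, Mac OSX, and Windows operating systems.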
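The multi-path user dictionary option in the feature list can be pictured with a small standalone sketch. It is not CppJieba's own code: `SplitDictPaths` is a hypothetical helper, and the only behaviour taken from the README is that both '|' and ';' act as separators.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of CppJieba's API): splits a user-dictionary
// argument such as "a.utf8|b.utf8;c.utf8" into individual paths, treating
// both '|' and ';' as separators and skipping empty pieces.
std::vector<std::string> SplitDictPaths(const std::string& paths) {
    std::vector<std::string> result;
    std::string::size_type start = 0;
    while (start <= paths.size()) {
        std::string::size_type end = paths.find_first_of("|;", start);
        if (end == std::string::npos) {
            end = paths.size();  // last piece runs to the end of the string
        }
        if (end > start) {
            result.push_back(paths.substr(start, end - start));
        }
        start = end + 1;
    }
    return result;
}
```

Called with a string that joins three paths using '|' and ';', the helper returns the three paths in order; an empty string yields an empty vector.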
Usage
Dependencies
- g++ (version >= 4.1 is recommended) or clang++
- cmake (version >= 2.6 is recommended)
Download and build
git clone https://github.com/yanyiwu/cppjieba.git
cd cppjieba
git submodule init
git submodule update
mkdir build
cd build
cmake ..
make
Optionally, run the tests:
make test
Demo
./demo
Sample output:
[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]
For more details, please see demo.
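The offsets in the "Word With Offset" output count bytes of the UTF-8 sentence rather than characters: each of these Chinese characters takes three bytes, so the second two-character word starts at offset 6. The sketch below reproduces that arithmetic for consecutive, non-overlapping words; `ByteOffsets` is a hypothetical helper, not a CppJieba function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Computes the byte offset of each word, assuming the words are consecutive,
// non-overlapping pieces of one UTF-8 sentence (as a plain Cut would return).
std::vector<std::size_t> ByteOffsets(const std::vector<std::string>& words) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    for (const std::string& w : words) {
        offsets.push_back(pos);
        pos += w.size();  // size() counts bytes, not characters
    }
    return offsets;
}
```

For the words 小明, 硕士, 毕业, 于 this yields 0, 6, 12, 18, matching the first four offsets in the demo output, provided the source file is saved as UTF-8.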
Segmentation result examples
MPSegment
Output:
我来到北京清华大学
我/来到/北京/清华大学
他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造
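MPSegment stands for maximum-probability segmentation: among all ways of splitting the sentence into dictionary words, it keeps the split with the best total score. The following is a toy dynamic-programming sketch of that idea, not CppJieba's implementation. `ToyCutMP`, the log-probability table, and the penalty for unknown characters are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Splits a UTF-8 string into one std::string per character.
// Assumes well-formed UTF-8 input.
std::vector<std::string> SplitUtf8(const std::string& s) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Toy maximum-probability cut: picks the split whose words have the highest
// summed log-probability. Characters missing from the table are scored with
// unknown_log_prob (a made-up penalty).
std::vector<std::string> ToyCutMP(const std::string& sentence,
                                  const std::map<std::string, double>& log_prob,
                                  double unknown_log_prob) {
    const std::vector<std::string> chars = SplitUtf8(sentence);
    const std::size_t n = chars.size();
    std::vector<double> best(n + 1, 0.0);     // best[i]: best score for chars[i..n)
    std::vector<std::size_t> next(n + 1, n);  // next[i]: end of the word chosen at i
    for (std::size_t i = n; i-- > 0;) {
        // Baseline: the single character at position i.
        std::map<std::string, double>::const_iterator it = log_prob.find(chars[i]);
        best[i] = (it != log_prob.end() ? it->second : unknown_log_prob) + best[i + 1];
        next[i] = i + 1;
        // Try every longer dictionary word that starts at position i.
        std::string candidate = chars[i];
        for (std::size_t j = i + 1; j < n; ++j) {
            candidate += chars[j];
            it = log_prob.find(candidate);
            if (it != log_prob.end() && it->second + best[j + 1] > best[i]) {
                best[i] = it->second + best[j + 1];
                next[i] = j + 1;
            }
        }
    }
    // Follow the chosen word boundaries from left to right.
    std::vector<std::string> words;
    for (std::size_t i = 0; i < n; i = next[i]) {
        std::string word;
        for (std::size_t k = i; k < next[i]; ++k) word += chars[k];
        words.push_back(word);
    }
    return words;
}
```

With a table that scores 清华大学 higher than 清华 plus 大学, the sketch keeps the long word, which is the behaviour visible in the MPSegment output above.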
HMMSegment
我来到北京清华大学
我来/到/北京/清华大学
他来到了网易杭研大厦
他来/到/了/网易/杭/研大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业于/中国/科学院/计算所/,/后/在/日/本/京/都/大/学/深/造
MixSegment
我来到北京清华大学
我/来到/北京/清华大学
他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造
FullSegment
我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学
他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算/计算所/,/后/在/日本/日本京都大学/京都/京都大学/大学/深造
QuerySegment
我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学
他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算所/,/后/在/中国/中国科学院/科学/科学院/学院/日本/日本京都大学/京都/京都大学/大学/深造
The first three outputs above show the MP, HMM, and Mix methods, in that order.
Mix gives the best results. It combines the MP and HMM algorithms, so it accurately cuts words that are already in the dictionary and can also cut out-of-vocabulary words such as "杭研".
The Full method cuts out every word found in the dictionary.
The Query method first segments with Mix, then applies the Full method to the longer words that Mix produced.
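The Full method described above can be illustrated with a toy sketch that lists every dictionary word found at every position of the sentence. It is not CppJieba's implementation: `ToyCutAll`, `SplitUtf8`, and the small dictionary passed in are assumptions, and a character that no dictionary word covers is simply emitted on its own.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Splits a UTF-8 string into one std::string per character.
// Assumes well-formed UTF-8 input.
std::vector<std::string> SplitUtf8(const std::string& s) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Toy "Full"-style cut: emits every multi-character dictionary word found at
// every start position. A character that no emitted word covers is emitted
// by itself so nothing in the sentence is lost.
std::vector<std::string> ToyCutAll(const std::string& sentence,
                                   const std::set<std::string>& dict) {
    const std::vector<std::string> chars = SplitUtf8(sentence);
    std::vector<std::string> words;
    std::size_t covered_until = 0;  // first index not covered by an emitted word
    for (std::size_t i = 0; i < chars.size(); ++i) {
        std::string candidate = chars[i];
        bool found = false;
        for (std::size_t j = i + 1; j < chars.size(); ++j) {
            candidate += chars[j];
            if (dict.count(candidate) != 0) {
                words.push_back(candidate);
                found = true;
                if (j + 1 > covered_until) covered_until = j + 1;
            }
        }
        if (!found && i >= covered_until) {
            words.push_back(chars[i]);
            covered_until = i + 1;
        }
    }
    return words;
}
```

With a dictionary holding 来到, 北京, 清华, 清华大学, 华大, and 大学, the sentence 我来到北京清华大学 comes out as 我/来到/北京/清华/清华大学/华大/大学, the same shape as the FullSegment output above.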
Custom user dictionary
For a sample custom dictionary, see dict/user.dict.utf8.
Result without the custom user dictionary:
令狐冲/是/云/计算/行业/的/专家
Result with the custom user dictionary:
令狐冲/是/云计算/行业/的/专家
Keyword extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089", "巅峰:9.49396"]
For more details, please see demo.
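The README does not spell out how the keyword weights are computed. As a rough intuition only, the sketch below ranks words with a TF-IDF style score (term frequency times inverse document frequency), which is an assumption about the general technique rather than a description of CppJieba's extractor. `ToyExtractKeywords`, the IDF table, the default IDF, and the stop-word set are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy TF-IDF ranking over an already segmented sentence. Words missing from
// the idf table receive default_idf; stop words are ignored.
std::vector<std::pair<std::string, double> > ToyExtractKeywords(
        const std::vector<std::string>& words,
        const std::map<std::string, double>& idf,
        double default_idf,
        const std::set<std::string>& stop_words,
        std::size_t top_n) {
    // Term frequency: how often each non-stop word occurs.
    std::map<std::string, double> tf;
    for (const std::string& w : words) {
        if (stop_words.count(w) == 0) tf[w] += 1.0;
    }
    // Weight = term frequency * inverse document frequency.
    std::vector<std::pair<std::string, double> > scored;
    for (const auto& kv : tf) {
        std::map<std::string, double>::const_iterator it = idf.find(kv.first);
        double weight = kv.second * (it != idf.end() ? it->second : default_idf);
        scored.push_back(std::make_pair(kv.first, weight));
    }
    // Highest weight first; ties are broken by the word itself.
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (scored.size() > top_n) scored.resize(top_n);
    return scored;
}
```

Under this scoring, a word that is rare in the reference corpus (high IDF) or repeated in the sentence (high TF) rises to the top, which is one plausible reason an out-of-dictionary token such as "CEO" can rank first.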
Part-of-speech tagging
我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,迎娶白富美,走上人生巅峰。
["我:r", "是:v", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当上:t", "CEO:eng", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]
For more details, please see demo.
Custom part-of-speech tags are supported. For example, add the following line to dict/user.dict.utf8:
蓝翔 nz
The result is then:
["我:r", "是:v", "蓝翔:nz", "技工:n", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当:t", "上:f", "总经理:n", ",:x", "出任:v", "CEO:eng", ",:x", "迎娶:v", "白富美:x", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]
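A user dictionary line such as `蓝翔 nz` pairs a word with a part-of-speech tag. The sketch below parses lines of that shape. The README only shows the two-field "word tag" form; the one-field and three-field ("word freq tag") layouts handled here are assumptions, and `UserDictEntry` and `ParseUserDictLine` are hypothetical names, not CppJieba's API.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One parsed user-dictionary line (hypothetical structure).
struct UserDictEntry {
    std::string word;
    std::string freq;  // empty when the line gives no frequency
    std::string tag;   // empty when the line gives no part-of-speech tag
};

// Parses one whitespace-separated line: "word", "word tag", or
// "word freq tag" (the first and last layouts are assumptions).
// Returns false for blank lines and lines with more than three fields.
bool ParseUserDictLine(const std::string& line, UserDictEntry& entry) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) fields.push_back(field);
    if (fields.empty() || fields.size() > 3) return false;

    entry.word = fields[0];
    entry.freq.clear();
    entry.tag.clear();
    if (fields.size() == 2) {
        entry.tag = fields[1];
    } else if (fields.size() == 3) {
        entry.freq = fields[1];
        entry.tag = fields[2];
    }
    return true;
}
```

Parsing `蓝翔 nz` with this helper gives the word 蓝翔 and the tag nz, the pair that appears as "蓝翔:nz" in the tagging result above.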
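Other dictionary resources
- dict.367W.utf8 iLife(562193561 at qq.com)
Applications
- GoJieba: Go version of the Jieba Chinese word segmenter.
- NodeJieba: Node.js version of the Jieba Chinese word segmenter.
- simhash: similarity computation for Chinese documents.
- exjieba: Erlang version of the Jieba Chinese word segmenter.
- jiebaR: R version of the Jieba Chinese word segmenter.
- cjieba: C version of the Jieba word segmenter.
- jieba_rb: Ruby version of the Jieba word segmenter.
- iosjieba: iOS version of the Jieba word segmenter.
- SqlJieba: Jieba Chinese word segmentation plugin for MySQL full-text indexing.
- pg_jieba: word segmentation plugin for PostgreSQL.
- simple: word segmentation plugin for SQLite3 FTS5.
- gitbook-plugin-search-pro: gitbook plugin with Chinese search support.
- ngx_http_cppjieba_module: Nginx word segmentation plugin.
- cppjiebapy: a project by jannson that exposes CppJieba to Python modules; related discussion: cppjiebapy_discussion.
- cppjieba-py: a Python module by bung87, wrapped with pybind11; its usage is close to that of the original jieba.
- KeywordServer: a Chinese keyword extraction service built in 50 lines.
- cppjieba-server: CppJieba HTTP server.
- phpjieba: PHP extension for Jieba word segmentation.
- perl5-jieba: Perl extension for Jieba word segmentation.
- jieba-dlang: Deimos bindings of Jieba word segmentation for the D language.
Performance evaluation
Jieba Chinese word segmentation series: performance evaluation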
Sponsorship
Contributors
Code Contributors
This project exists thanks to all the people who contribute.