
yanyiwu / cppjieba

"结巴"中文分词的C++版本


Top Related Projects

  • jieba (33,199 ★): "Jieba" Chinese word segmentation
  • HanLP (33,691 ★): Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing
  • LAC (3,863 ★): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • NLPIR (3,401 ★)

Quick Overview

CppJieba is a Chinese word segmentation library implemented in C++. It provides efficient and accurate tokenization for Chinese text, supporting various segmentation modes and custom dictionaries.

Pros

  • High performance and low memory usage
  • Supports multiple segmentation modes (MPSegment, HMMSegment, MixSegment, FullSegment, QuerySegment)
  • Allows custom dictionary integration
  • Thread-safe implementation

Cons

  • Limited documentation, especially for advanced usage
  • Primarily focused on the Chinese language; may not be well suited to other languages
  • Requires C++ knowledge for integration and customization
  • Less active development in recent years

Code Examples

  1. Basic word segmentation:
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Note: depending on the cppjieba version, the constructor may require
    // explicit dictionary paths (see the custom-dictionary example below).
    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    for (const auto& word : words) {
        std::cout << word << " ";
    }
    return 0;
}
  2. Using different segmentation modes:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";

    // HMM-based segmentation
    jieba.CutHMM(sentence, words);

    // Full segmentation (every dictionary word in the text)
    jieba.CutAll(sentence, words);

    // Query segmentation (for search engines)
    jieba.CutForSearch(sentence, words);

    return 0;
}
  3. Adding a custom dictionary:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Main dictionary, HMM model, and user dictionary; some versions also
    // take IDF and stop-word dictionary paths as further arguments.
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    return 0;
}

Getting Started

  1. Clone the repository:

    git clone https://github.com/yanyiwu/cppjieba.git
    
  2. Include the necessary headers in your C++ project:

    #include "cppjieba/Jieba.hpp"
    
  3. Create a Jieba instance and use it for segmentation:

    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "你好世界";
    jieba.Cut(sentence, words);
    
  4. Compile your project with C++11 support, pointing the compiler at cppjieba's include directory (and at the bundled limonp headers if you cloned the submodules):

    g++ -std=c++11 your_file.cpp -I/path/to/cppjieba/include -I/path/to/cppjieba/deps/limonp/include
    

Competitor Comparisons

jieba (33,199 ★)

"Jieba" Chinese word segmentation

Pros of jieba

  • Written in Python, making it more accessible for Python developers
  • Larger community and more frequent updates
  • Supports more features like keyword extraction and TextRank

Cons of jieba

  • Slower performance compared to the C++ implementation
  • Higher memory usage due to Python's overhead
  • Less suitable for embedding in C/C++ applications

Code Comparison

jieba (Python):

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

cppjieba (C++):

#include "cppjieba/Jieba.hpp"
vector<string> words;
jieba.Cut("我来到北京清华大学", words, true);
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide similar functionality for Chinese word segmentation, but cppjieba offers better performance and lower memory usage due to its C++ implementation. jieba, on the other hand, is more feature-rich and has a larger community, making it a popular choice for Python developers. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and desired features.

HanLP (33,691 ★)

Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing

Pros of HanLP

  • More comprehensive NLP toolkit with broader functionality beyond just word segmentation
  • Supports multiple languages, not limited to Chinese
  • Actively maintained with frequent updates and improvements

Cons of HanLP

  • Larger codebase and dependencies, potentially more complex to integrate
  • May have higher resource requirements due to its extensive feature set
  • Learning curve can be steeper for users only needing basic word segmentation

Code Comparison

HanLP example:

HanLP.segment("我的希望是希望张晚霞的背影被晚霞映红")

CppJieba example:

vector<string> words;
jieba.Cut("我的希望是希望张晚霞的背影被晚霞映红", words);

Summary

HanLP offers a more comprehensive NLP toolkit with support for multiple languages, while CppJieba focuses specifically on Chinese word segmentation. HanLP's broader feature set comes at the cost of increased complexity and potentially higher resource requirements. CppJieba, being more focused, may be simpler to integrate for projects only needing Chinese word segmentation. The choice between the two depends on the specific requirements of the project and the desired balance between functionality and simplicity.

LAC (3,863 ★)

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • Supports both word segmentation and part-of-speech tagging
  • Utilizes deep learning techniques for improved accuracy
  • Offers pre-trained models for various domains

Cons of LAC

  • Larger resource footprint due to deep learning models
  • May require more setup and dependencies
  • Potentially slower processing speed for simple tasks

Code Comparison

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

CppJieba:

#include "cppjieba/Jieba.hpp"

cppjieba::Jieba jieba;
std::string s = "我爱北京天安门";
std::vector<std::string> words;
jieba.Cut(s, words);

Summary

LAC offers more advanced features and potentially higher accuracy for complex NLP tasks, leveraging deep learning techniques. However, it may require more resources and setup compared to CppJieba. CppJieba, on the other hand, is lightweight and easy to integrate, making it suitable for simpler word segmentation tasks or resource-constrained environments. The choice between the two depends on the specific requirements of the project, such as accuracy needs, resource availability, and the complexity of the NLP tasks at hand.

NLPIR (3,401 ★)

Pros of NLPIR

  • More comprehensive NLP toolkit with additional features beyond word segmentation
  • Supports multiple languages, including English and Chinese
  • Includes named entity recognition and sentiment analysis capabilities

Cons of NLPIR

  • Less actively maintained, with fewer recent updates
  • More complex setup and integration process
  • Closed-source core components, limiting customization options

Code Comparison

NLPIR (C++):

CNLPIR::GetInstance().Init();
const char* sResult = CNLPIR::GetInstance().ParagraphProcess("我是中国人", 0);
cout << sResult << endl;
CNLPIR::GetInstance().Exit();

cppjieba (C++):

cppjieba::Jieba jieba;
vector<string> words;
jieba.Cut("我是中国人", words);
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide word segmentation functionality, but NLPIR offers a more comprehensive toolkit with additional NLP features. cppjieba is more focused on Chinese word segmentation and is easier to integrate into projects. NLPIR supports multiple languages, while cppjieba is primarily designed for Chinese text processing. The code examples demonstrate the basic usage of each library for word segmentation tasks.


README

CppJieba


Introduction

CppJieba is the C++ version of the "Jieba" Chinese word segmentation library.

Features

  • All of the source code lives in header files under include/cppjieba/*.hpp; just include them to use the library.
  • Supports UTF-8 encoding.
  • Ships with a fairly complete unit-test suite; the stability of the core feature, Chinese (UTF-8) word segmentation, has been verified in production environments.
  • Supports loading custom user dictionaries; multiple dictionary paths can be separated with '|' or ';' (see the sketch below).
  • Supports Linux, Mac OS X, and Windows.
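
For example, a minimal sketch of loading two user dictionaries at once by joining their paths with ';' (the second file name below is a hypothetical placeholder; the other paths are the dictionaries shipped in dict/):

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Two user dictionaries passed as a single ';'-separated path argument.
    const std::string user_dicts = "./dict/user.dict.utf8;./dict/my_extra_words.utf8";
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          user_dicts,
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;
    jieba.Cut("我来到北京清华大学", words);
    return 0;
}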

Usage

Dependencies

  • g++ (version >= 4.1 recommended) or clang++;
  • cmake (version >= 2.6 recommended);

Download and Build

git clone https://github.com/yanyiwu/cppjieba.git
cd cppjieba
git submodule init
git submodule update
mkdir build
cd build
cmake ..
make

If you are interested, you can also run the tests (optional):

make test

Demo

./demo

Sample output:

[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]

For details, see test/demo.cpp.
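
The "Insert User Word" step shown above adds a word to the dictionary at runtime. A minimal sketch of that call, assuming a Jieba instance constructed with the dictionaries shipped in dict/:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8",
                          "./dict/idf.utf8",
                          "./dict/stop_words.utf8");
    std::vector<std::string> words;

    jieba.Cut("男默女泪", words);      // before inserting: 男默 / 女泪
    jieba.InsertUserWord("男默女泪");  // register the word at runtime
    jieba.Cut("男默女泪", words);      // after inserting: 男默女泪
    return 0;
}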

Segmentation Examples

MPSegment

Output:

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

HMMSegment

我来到北京清华大学
我来/到/北京/清华大学

他来到了网易杭研大厦
他来/到/了/网易/杭/研大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业于/中国/科学院/计算所/,/后/在/日/本/京/都/大/学/深/造

MixSegment

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

FullSegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算/计算所/,/后/在/日本/日本京都大学/京都/京都大学/大学/深造

QuerySegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算所/,/后/在/中国/中国科学院/科学/科学院/学院/日本/日本京都大学/京都/京都大学/大学/深造

The first three blocks above show the results of the MP, HMM, and Mix methods, respectively.

As you can see, Mix, which combines the MP and HMM segmentation algorithms, gives the best results: it accurately segments words that are already in the dictionary and can also recognize out-of-vocabulary words such as "杭研".

The Full method outputs every dictionary word that appears in the text.

The Query method first segments with Mix, then applies Full to the longer words in the result.
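
In terms of the Jieba facade class, these modes roughly map to the following calls (a sketch; it assumes a constructed cppjieba::Jieba instance named jieba and a std::string named sentence):

std::vector<std::string> words;
jieba.Cut(sentence, words, false);    // MP only: dictionary-based maximum-probability segmentation
jieba.CutHMM(sentence, words);        // HMM only
jieba.Cut(sentence, words, true);     // Mix (default): MP first, then HMM on the remaining pieces
jieba.CutAll(sentence, words);        // Full: every dictionary word in the text
jieba.CutForSearch(sentence, words);  // Query: Mix, then Full applied to the longer words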

Custom User Dictionary

See dict/user.dict.utf8 for an example of a custom dictionary.

Result without the custom user dictionary:

令狐冲/是/云/计算/行业/的/专家

Result with the custom user dictionary:

令狐冲/是/云计算/行业/的/专家
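
A minimal sketch of how this is wired up: dict/user.dict.utf8 contains one entry per line (optionally followed by a POS tag), and its path is passed as the user-dictionary argument of the constructor (some versions also require the IDF and stop-word paths, as in test/demo.cpp):

// dict/user.dict.utf8 (one word per line, optional POS tag), e.g.:
//   云计算
//   蓝翔 nz
cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                      "./dict/hmm_model.utf8",
                      "./dict/user.dict.utf8");   // user dictionary path
std::vector<std::string> words;
jieba.Cut("令狐冲是云计算行业的专家", words);     // 令狐冲 / 是 / 云计算 / 行业 / 的 / 专家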

Keyword Extraction

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089", "巅峰:9.49396"]

For details, see test/demo.cpp.
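
A minimal sketch of the call behind this output (it assumes a Jieba instance constructed with IDF and stop-word dictionary paths, as in test/demo.cpp; recent versions expose the keyword extractor as the public member jieba.extractor):

const size_t topk = 5;
std::vector<cppjieba::KeywordExtractor::Word> keywords;
jieba.extractor.Extract(sentence, keywords, topk);
for (const auto& kw : keywords) {
    std::cout << kw.word << ":" << kw.weight << std::endl;  // e.g. CEO:11.7392
}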

Part-of-Speech Tagging

我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,迎娶白富美,走上人生巅峰。
["我:r", "是:v", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当上:t", "CEO:eng", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]

For details, see test/demo.cpp.
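
A minimal sketch of the tagging call (it assumes a constructed cppjieba::Jieba instance named jieba; Tag fills a vector of word/POS pairs):

std::vector<std::pair<std::string, std::string>> tagres;
jieba.Tag("我是拖拉机学院手扶拖拉机专业的。", tagres);
for (const auto& p : tagres) {
    std::cout << p.first << ":" << p.second << " ";  // e.g. 拖拉机:n 学院:n ...
}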

Custom POS tags are supported. For example, add the following line to dict/user.dict.utf8:

蓝翔 nz

The result is then:

["我:r", "是:v", "蓝翔:nz", "技工:n", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当:t", "上:f", "总经理:n", ",:x", "出任:v", "CEO:eng", ",:x", "迎娶:v", "白富美:x", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]

Other Dictionary Resources

Applications

  • GoJieba: Go version of Jieba Chinese word segmentation.
  • NodeJieba: Node.js version of Jieba Chinese word segmentation.
  • simhash: similarity computation for Chinese documents.
  • exjieba: Erlang version of Jieba Chinese word segmentation.
  • jiebaR: R version of Jieba Chinese word segmentation.
  • cjieba: C version of Jieba word segmentation.
  • jieba_rb: Ruby version of Jieba word segmentation.
  • iosjieba: iOS version of Jieba word segmentation.
  • SqlJieba: Jieba Chinese word segmentation plugin for MySQL full-text indexing.
  • pg_jieba: word segmentation plugin for the PostgreSQL database.
  • simple: word segmentation plugin for SQLite3 FTS5.
  • gitbook-plugin-search-pro: GitBook plugin with Chinese search support.
  • cppjiebapy: a Python binding of cppjieba developed by jannson; see the related discussion at cppjiebapy_discussion.
  • cppjieba-py: a Python module by bung87 based on pybind11, with a usage experience close to the original jieba.
  • KeywordServer: build a Chinese keyword-extraction service in 50 lines.
  • cppjieba-server: CppJieba HTTP server.
  • phpjieba: PHP extension for Jieba word segmentation.
  • perl5-jieba: Perl extension for Jieba word segmentation.
  • jieba-dlang: Deimos bindings for Jieba word segmentation in D.

Online Demo

Web-Demo (Chrome is recommended)

Performance Benchmarks

Performance benchmarks of the Jieba Chinese word segmentation family

Sponsorship


Contributors

Code Contributors

This project exists thanks to all the people who contribute.