yanyiwu / cppjieba

The C++ version of the "Jieba" Chinese word segmentation library


Top Related Projects

  • jieba (34,028 ⭐): "Jieba" Chinese text segmentation
  • HanLP (34,953 ⭐): Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other natural language processing tasks
  • LAC (3,931 ⭐): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • NLPIR (3,449 ⭐)

Quick Overview

CppJieba is a Chinese word segmentation library implemented in C++. It provides efficient and accurate tokenization for Chinese text, supporting various segmentation modes and custom dictionaries.

Pros

  • High performance and low memory usage
  • Supports multiple segmentation modes (MPSegment, HMMSegment, MixSegment, FullSegment, QuerySegment)
  • Allows custom dictionary integration
  • Thread-safe implementation

Cons

  • Limited documentation, especially for advanced usage
  • Primarily focused on Chinese; not designed for other languages
  • Requires C++ knowledge for integration and customization
  • Less active development in recent years

Code Examples

  1. Basic word segmentation:
#include <iostream>
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Depending on the cppjieba version, the constructor may require
    // explicit dictionary paths (see example 3).
    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);  // default (Mix) segmentation
    for (const auto& word : words) {
        std::cout << word << " ";
    }
    std::cout << std::endl;
    return 0;
}
  2. Using different segmentation modes:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    cppjieba::Jieba jieba;  // see example 3 for explicit dictionary paths
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";

    // HMM-only segmentation; each call writes its result into `words`
    jieba.CutHMM(sentence, words);

    // Full segmentation (every dictionary word found in the text)
    jieba.CutAll(sentence, words);

    // Query segmentation (finer-grained, for search engines)
    jieba.CutForSearch(sentence, words);

    return 0;
}
  3. Adding a custom dictionary:
#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

int main() {
    // Main dictionary, HMM model, and user dictionary; some versions also
    // take IDF and stop-word paths as the 4th and 5th arguments.
    cppjieba::Jieba jieba("./dict/jieba.dict.utf8",
                          "./dict/hmm_model.utf8",
                          "./dict/user.dict.utf8");
    std::vector<std::string> words;
    std::string sentence = "我来到北京清华大学";
    jieba.Cut(sentence, words);
    return 0;
}

Getting Started

  1. Clone the repository and fetch the limonp submodule:

    git clone https://github.com/yanyiwu/cppjieba.git
    cd cppjieba
    git submodule update --init
    
  2. Include the necessary headers in your C++ project:

    #include "cppjieba/Jieba.hpp"
    
  3. Create a Jieba instance and use it for segmentation (depending on your cppjieba version, the constructor may require explicit dictionary paths; see the sketch after step 4):

    cppjieba::Jieba jieba;
    std::vector<std::string> words;
    std::string sentence = "你好世界";
    jieba.Cut(sentence, words);
    
  4. Compile your project with C++11 support, adding the include paths for cppjieba and its limonp submodule:

    g++ -std=c++11 your_file.cpp -I/path/to/cppjieba/include -I/path/to/cppjieba/deps/limonp/include
    
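Putting the steps together, a minimal complete program looks roughly like this (a sketch: the five dictionary paths match the files shipped in the repository's dict/ directory, and assume the program is run from the repository root):

    #include <iostream>
    #include <string>
    #include <vector>
    #include "cppjieba/Jieba.hpp"

    int main() {
        // Dictionaries shipped with the repository
        cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                              "dict/hmm_model.utf8",
                              "dict/user.dict.utf8",
                              "dict/idf.utf8",
                              "dict/stop_words.utf8");
        std::vector<std::string> words;
        jieba.Cut("你好世界", words);
        for (const auto& w : words) {
            std::cout << w << "\n";  // expected: 你好 and 世界 on separate lines
        }
        return 0;
    }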

Competitor Comparisons

jieba (34,028 ⭐): "Jieba" Chinese text segmentation

Pros of jieba

  • Written in Python, making it more accessible for Python developers
  • Larger community and more frequent updates
  • Supports more features like keyword extraction and TextRank

Cons of jieba

  • Slower performance compared to C++ implementation
  • Higher memory usage due to Python's overhead
  • Less suitable for embedding in C/C++ applications

Code Comparison

jieba (Python):

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

cppjieba (C++):

#include <iostream>
#include "cppjieba/Jieba.hpp"
using namespace std;

// assumes `jieba` is a constructed cppjieba::Jieba instance
vector<string> words;
jieba.Cut("我来到北京清华大学", words, true);  // true enables HMM for unknown words
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide similar functionality for Chinese word segmentation, but cppjieba offers better performance and lower memory usage due to its C++ implementation. jieba, on the other hand, is more feature-rich and has a larger community, making it a popular choice for Python developers. The choice between the two depends on the specific requirements of the project, such as programming language preference, performance needs, and desired features.

HanLP (34,953 ⭐): Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other natural language processing tasks

Pros of HanLP

  • More comprehensive NLP toolkit with broader functionality beyond just word segmentation
  • Supports multiple languages, not limited to Chinese
  • Actively maintained with frequent updates and improvements

Cons of HanLP

  • Larger codebase and dependencies, potentially more complex to integrate
  • May have higher resource requirements due to its extensive feature set
  • Learning curve can be steeper for users only needing basic word segmentation

Code Comparison

HanLP example (Java):

HanLP.segment("我的希望是希望张晚霞的背影被晚霞映红")

CppJieba example:

// assumes a constructed cppjieba::Jieba instance named `jieba`
vector<string> words;
jieba.Cut("我的希望是希望张晚霞的背影被晚霞映红", words);

Summary

HanLP offers a more comprehensive NLP toolkit with support for multiple languages, while CppJieba focuses specifically on Chinese word segmentation. HanLP's broader feature set comes at the cost of increased complexity and potentially higher resource requirements. CppJieba, being more focused, may be simpler to integrate for projects only needing Chinese word segmentation. The choice between the two depends on the specific requirements of the project and the desired balance between functionality and simplicity.

LAC (3,931 ⭐): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • Supports both word segmentation and part-of-speech tagging
  • Utilizes deep learning techniques for improved accuracy
  • Offers pre-trained models for various domains

Cons of LAC

  • Larger resource footprint due to deep learning models
  • May require more setup and dependencies
  • Potentially slower processing speed for simple tasks

Code Comparison

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

CppJieba:

#include <string>
#include <vector>
#include "cppjieba/Jieba.hpp"

// note the cppjieba namespace; dictionary paths omitted for brevity
cppjieba::Jieba jieba;
std::string s = "我爱北京天安门";
std::vector<std::string> words;
jieba.Cut(s, words);

Summary

LAC offers more advanced features and potentially higher accuracy for complex NLP tasks, leveraging deep learning techniques. However, it may require more resources and setup compared to CppJieba. CppJieba, on the other hand, is lightweight and easy to integrate, making it suitable for simpler word segmentation tasks or resource-constrained environments. The choice between the two depends on the specific requirements of the project, such as accuracy needs, resource availability, and the complexity of the NLP tasks at hand.

NLPIR (3,449 ⭐)

Pros of NLPIR

  • More comprehensive NLP toolkit with additional features beyond word segmentation
  • Supports multiple languages, including English and Chinese
  • Includes named entity recognition and sentiment analysis capabilities

Cons of NLPIR

  • Less actively maintained, with fewer recent updates
  • More complex setup and integration process
  • Closed-source core components, limiting customization options

Code Comparison

NLPIR (C++):

// A sketch using the NLPIR C API; data path and licence code omitted
NLPIR_Init("", UTF8_CODE, "");
const char* sResult = NLPIR_ParagraphProcess("我是中国人", 0);  // 0 = no POS tags
cout << sResult << endl;
NLPIR_Exit();

cppjieba (C++):

// note the cppjieba namespace; dictionary paths omitted for brevity
cppjieba::Jieba jieba;
vector<string> words;
jieba.Cut("我是中国人", words);
cout << limonp::Join(words.begin(), words.end(), "/") << endl;

Both libraries provide word segmentation functionality, but NLPIR offers a more comprehensive toolkit with additional NLP features. cppjieba is more focused on Chinese word segmentation and is easier to integrate into projects. NLPIR supports multiple languages, while cppjieba is primarily designed for Chinese text processing. The code examples demonstrate the basic usage of each library for word segmentation tasks.


README

CppJieba

Introduction

CppJieba is the C++ version of the "Jieba" Chinese word segmentation library.

Key Features

  • 🚀 High performance: stability and speed proven in production environments
  • 📦 Easy to integrate: the source is provided as header files (include/cppjieba/*.hpp); just include and use
  • 🔍 Multiple segmentation modes: precise mode, full mode, search-engine mode, and more
  • 📚 Custom dictionaries: supports user-defined dictionaries, with multiple dictionary paths separated by '|' or ';' (see the sketch after this list)
  • 💻 Cross-platform: runs on Linux, macOS, and Windows
  • 🌈 UTF-8 encoding: native support for UTF-8-encoded Chinese text
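
For example, multiple user dictionaries can be passed as one path string (a sketch; dict/extra.dict.utf8 is a hypothetical second dictionary, the other paths are the files shipped in dict/):

cppjieba::Jieba jieba("dict/jieba.dict.utf8",
                      "dict/hmm_model.utf8",
                      "dict/user.dict.utf8;dict/extra.dict.utf8",  // two user dictionaries, ';'-separated
                      "dict/idf.utf8",
                      "dict/stop_words.utf8");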

Quick Start

Requirements

  • C++ compiler:
    • g++ (version 4.1 or later recommended)
    • or clang++
  • cmake (version 2.6 or later recommended)

Installation

git clone https://github.com/yanyiwu/cppjieba.git
cd cppjieba
git submodule init
git submodule update
mkdir build
cd build
cmake ..
make

Optionally, run the test suite:

make test

Usage Example

./demo

Sample output:

[demo] Cut With HMM
他/来到/了/网易/杭研/大厦
[demo] Cut Without HMM
他/来到/了/网易/杭/研/大厦
我来到北京清华大学
[demo] CutAll
我/来到/北京/清华/清华大学/华大/大学
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
[demo] Insert User Word
男默/女泪
男默女泪
[demo] CutForSearch Word With Offset
[{"word": "小明", "offset": 0}, {"word": "硕士", "offset": 6}, {"word": "毕业", "offset": 12}, {"word": "于", "offset": 18}, {"word": "中国", "offset": 21}, {"word": "科学", "offset": 27}, {"word": "学院", "offset": 30}, {"word": "科学院", "offset": 27}, {"word": "中国科学院", "offset": 21}, {"word": "计算", "offset": 36}, {"word": "计算所", "offset": 36}, {"word": ",", "offset": 45}, {"word": "后", "offset": 48}, {"word": "在", "offset": 51}, {"word": "日本", "offset": 54}, {"word": "京都", "offset": 60}, {"word": "大学", "offset": 66}, {"word": "日本京都大学", "offset": 54}, {"word": "深造", "offset": 72}]
[demo] Tagging
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[我:r, 是:v, 拖拉机:n, 学院:n, 手扶拖拉机:n, 专业:n, 的:uj, 。:x, 不用:v, 多久:m, ,:x, 我:r, 就:d, 会:v, 升职:v, 加薪:nr, ,:x, 当上:t, CEO:eng, ,:x, 走上:v, 人生:n, 巅峰:n, 。:x]
[demo] Keyword Extraction
我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
[{"word": "CEO", "offset": [93], "weight": 11.7392}, {"word": "升职", "offset": [72], "weight": 10.8562}, {"word": "加薪", "offset": [78], "weight": 10.6426}, {"word": "手扶拖拉机", "offset": [21], "weight": 10.0089}, {"word": "巅峰", "offset": [111], "weight": 9.49396}]

For more details, please see demo.
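
The "Insert User Word" step in the output above adds a word to the dictionary at runtime. A minimal sketch of that call, assuming a constructed Jieba instance named jieba:

// Before the insert, "男默女泪" is split into 男默/女泪
jieba.InsertUserWord("男默女泪");
std::vector<std::string> words;
jieba.Cut("男默女泪", words);  // now kept as a single token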

Segmentation Examples

MPSegment

Output:

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

HMMSegment

我来到北京清华大学
我来/到/北京/清华大学

他来到了网易杭研大厦
他来/到/了/网易/杭/研大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业于/中国/科学院/计算所/,/后/在/日/本/京/都/大/学/深/造

MixSegment

我来到北京清华大学
我/来到/北京/清华大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国科学院/计算所/,/后/在/日本京都大学/深造

FullSegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭/研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小/明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算/计算所/,/后/在/日本/日本京都大学/京都/京都大学/大学/深造

QuerySegment

我来到北京清华大学
我/来到/北京/清华/清华大学/华大/大学

他来到了网易杭研大厦
他/来到/了/网易/杭研/大厦

小明硕士毕业于中国科学院计算所,后在日本京都大学深造
小明/硕士/毕业/于/中国/中国科学院/科学/科学院/学院/计算所/,/后/在/中国/中国科学院/科学/科学院/学院/日本/日本京都大学/京都/京都大学/大学/深造

The first three result sets above come from the MP, HMM, and Mix methods, in that order.

Mix, which combines the MP and HMM algorithms, performs best: it accurately segments words that are in the dictionary while still recognizing out-of-vocabulary words such as "杭研".

The Full method emits every dictionary word that appears in the text.

The Query method first segments with Mix, then re-segments the longer resulting words with Full. A sketch of how these modes map onto the Jieba API follows.
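
Based on the bundled demo, the mapping looks roughly like this (each call writes its result into words):

std::vector<std::string> words;
jieba.Cut(sentence, words, false);   // MPSegment: dictionary-only, no HMM
jieba.CutHMM(sentence, words);       // HMMSegment: HMM only
jieba.Cut(sentence, words, true);    // MixSegment: MP + HMM (the default)
jieba.CutAll(sentence, words);       // FullSegment: all dictionary words
jieba.CutForSearch(sentence, words); // QuerySegment: search-engine mode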

Custom User Dictionary

See dict/user.dict.utf8 for an example user dictionary.

Result without a custom user dictionary:

令狐冲/是/云/计算/行业/的/专家

Result with a custom user dictionary:

令狐冲/是/云计算/行业/的/专家
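
Judging from the entries shown in this README, a user dictionary is a plain UTF-8 text file with one word per line, optionally followed by a part-of-speech tag:

云计算
蓝翔 nz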

Keyword Extraction

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["CEO:11.7392", "升职:10.8562", "加薪:10.6426", "手扶拖拉机:10.0089", "巅峰:9.49396"]
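
A sketch of producing this output via the extractor member that Jieba exposes (as used in the bundled demo); topk is the number of keywords to return:

const size_t topk = 5;
std::vector<cppjieba::KeywordExtractor::Word> keywords;
jieba.extractor.Extract(sentence, keywords, topk);  // top-5 keywords with TF-IDF weights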

For more details, please see demo.

Part-of-Speech Tagging

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。
["我:r", "是:v", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当上:t", "CEO:eng", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]
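
A sketch of the corresponding call; Tag fills a vector of (word, POS tag) pairs:

std::vector<std::pair<std::string, std::string>> tagres;
jieba.Tag(sentence, tagres);  // each element pairs a word with its tag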

For more details, please see demo.

Custom part-of-speech tags are supported. For example, add the following line to dict/user.dict.utf8:

蓝翔 nz

Tagging the sentence 我是蓝翔技工拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上总经理,出任CEO,迎娶白富美,走上人生巅峰。 then produces:

["我:r", "是:v", "蓝翔:nz", "技工:n", "拖拉机:n", "学院:n", "手扶拖拉机:n", "专业:n", "的:uj", "。:x", "不用:v", "多久:m", ",:x", "我:r", "就:d", "会:v", "升职:v", "加薪:nr", ",:x", "当:t", "上:f", "总经理:n", ",:x", "出任:v", "CEO:eng", ",:x", "迎娶:v", "白富美:x", ",:x", "走上:v", "人生:n", "巅峰:n", "。:x"]

Other Dictionary Resources

  • [dict.367W.utf8] iLife(562193561 at qq.com)

Ecosystem

CppJieba has been widely used as the basis for word segmentation implementations in many programming languages:

Projects Using CppJieba

Contributing

Contributions of all kinds are welcome, including but not limited to:

  • Reporting issues and making suggestions
  • Improving the documentation
  • Submitting bug fixes
  • Adding new features

If you find CppJieba helpful, please star ⭐️ the project!