
isnowfy/snownlp

Python library for processing Chinese text

6,397 stars · 1,364 forks · 44 open issues

Top Related Projects

  • jieba (33,063 stars): "Jieba" Chinese text segmentation
  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • LAC (3,840 stars): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • cppjieba: C++ version of the "Jieba" Chinese word segmenter

Quick Overview

SnowNLP is a Python library for processing Chinese text. It provides tools for Chinese word segmentation, part-of-speech tagging, sentiment analysis, and other natural language processing tasks specifically tailored for the Chinese language.

Pros

  • Specialized for Chinese language processing
  • Includes a variety of NLP tasks in a single library
  • Easy to use, with a simple API
  • Supports both simplified and traditional Chinese

Cons

  • Limited documentation, especially in English
  • Not as actively maintained as some other NLP libraries
  • May not perform as well as more specialized tools for specific tasks
  • Limited support for the latest deep learning-based NLP techniques

Code Examples

Word segmentation:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)  # e.g. ['这是', '一个', '测试', '句子']

Sentiment analysis:

from snownlp import SnowNLP

s = SnowNLP(u'这个产品非常好用!')
print(s.sentiments)  # e.g. 0.98; values close to 1 indicate positive sentiment

Keyword extraction:

from snownlp import SnowNLP

s = SnowNLP(u'这是一篇关于自然语言处理的文章')
print(s.keywords(3))  # e.g. ['自然语言', '处理', '文章']
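
Part-of-speech tagging (the tags property documented in the project README; the output shown is the README's own example):

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(list(s.tags))  # e.g. [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]

Traditional-to-simplified conversion (the han property, also from the README):

from snownlp import SnowNLP

s = SnowNLP(u'「繁體字」的叫法在臺灣亦很常見。')
print(s.han)  # e.g. u'「繁体字」的叫法在台湾亦很常见。'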

Getting Started

To get started with SnowNLP, follow these steps:

  1. Install SnowNLP using pip:

    pip install snownlp
    
  2. Import and use SnowNLP in your Python script:

    from snownlp import SnowNLP
    
    text = u'这是一个中文文本处理的例子'
    s = SnowNLP(text)
    
    print(s.words)  # Word segmentation
    print(s.tags)   # Part-of-speech tagging
    print(s.sentiments)  # Sentiment analysis
    
  3. For more advanced usage (keyword extraction, summarization, and more), refer to the project's GitHub repository and the sketch below.
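
As a sketch of that more advanced usage: the same SnowNLP object also exposes TextRank-based keyword extraction and summarization, plus sentence splitting, all documented in the project README (outputs are illustrative):

from snownlp import SnowNLP

text = u'自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。'
s = SnowNLP(text)

print(s.keywords(3))  # top 3 keywords, extracted with TextRank
print(s.summary(1))   # 1-sentence extractive summary, also TextRank
print(s.sentences)    # the text split into sentences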

Competitor Comparisons

jieba (33,063 stars)

"Jieba" Chinese text segmentation

Pros of Jieba

  • More comprehensive Chinese text processing capabilities, including part-of-speech tagging and keyword extraction
  • Larger user base and more active development, resulting in better documentation and community support
  • Offers multiple segmentation modes (accurate, full, and search engine) for different use cases

Cons of Jieba

  • Slower processing speed compared to SnowNLP, especially for large text datasets
  • Requires more memory and computational resources
  • Less focus on sentiment analysis and text classification features

Code Comparison

SnowNLP:

from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words)

Jieba:

import jieba
seg_list = jieba.cut("这是一个测试句子", cut_all=False)
print(" ".join(seg_list))

Both libraries provide Chinese word segmentation functionality, but Jieba offers more granular control over the segmentation process. SnowNLP's API is simpler and more straightforward for basic tasks, while Jieba provides more advanced features and customization options.

SnowNLP excels in sentiment analysis and text summarization, making it a better choice for projects focused on these areas. Jieba, on the other hand, is more suitable for general-purpose Chinese text processing tasks and projects requiring detailed linguistic analysis.
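
The segmentation modes mentioned above are selected through Jieba's documented API; a minimal sketch:

import jieba

text = "这是一个测试句子"
print("/".join(jieba.cut(text)))                # accurate mode (the default)
print("/".join(jieba.cut(text, cut_all=True)))  # full mode: emit every possible word
print("/".join(jieba.cut_for_search(text)))     # search-engine mode: re-segment long words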

HanLP (33,448 stars)

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • More comprehensive NLP toolkit with a wider range of features
  • Better documentation and active community support
  • Higher performance and accuracy for various NLP tasks

Cons of HanLP

  • Steeper learning curve due to its complexity
  • Larger resource footprint and slower processing speed
  • May be overkill for simple Chinese NLP tasks

Code Comparison

SnowNLP:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)
print(s.tags)
print(s.sentiments)

HanLP:

from pyhanlp import *

sentence = "这是一个测试句子"
print(HanLP.segment(sentence))
print(HanLP.parseDependency(sentence))
print(HanLP.extractKeyword(sentence, 3))

Both libraries offer Chinese text processing capabilities, but HanLP provides more advanced features and flexibility. SnowNLP is simpler and easier to use for basic tasks, while HanLP offers a more comprehensive toolkit for complex NLP applications. The choice between them depends on the specific requirements of your project and the level of complexity you need in Chinese language processing.

LAC (3,840 stars)

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • More comprehensive NLP toolkit with advanced features like named entity recognition and word segmentation
  • Actively maintained by Baidu, with regular updates and improvements
  • Supports both Python and C++ implementations for flexibility

Cons of LAC

  • Larger and more complex, potentially harder to integrate for simple tasks
  • Primarily focused on Chinese language processing, less versatile for other languages
  • Requires more computational resources due to its advanced features

Code Comparison

SnowNLP sentiment analysis:

from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.sentiments)

LAC word segmentation and part-of-speech tagging:

from LAC import LAC
lac = LAC(mode='lac')
text = "LAC是个优秀的分词工具"
result = lac.run(text)
print(result)

SnowNLP is simpler and more straightforward for basic tasks like sentiment analysis, while LAC offers more advanced features and better accuracy for complex NLP tasks, especially in Chinese language processing. SnowNLP is lighter and easier to use for beginners, but LAC provides more comprehensive tools for professional NLP applications.
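
LAC's mode argument switches between plain segmentation and joint segmentation plus tagging; a minimal sketch following the documented interface (output shown is illustrative):

from LAC import LAC

text = "LAC是个优秀的分词工具"

lac_seg = LAC(mode='seg')         # segmentation only
print(lac_seg.run(text))          # e.g. ['LAC', '是', '个', '优秀', '的', '分词', '工具']

lac_full = LAC(mode='lac')        # segmentation plus part-of-speech/entity tags
words, tags = lac_full.run(text)  # two parallel lists: tokens and their tags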

"结巴"中文分词的C++版本

Pros of cppjieba

  • Written in C++, offering better performance for large-scale text processing
  • Provides multiple segmentation modes (e.g., maximum probability, hidden Markov model, and query segmentation)
  • Extensive documentation and examples available in both Chinese and English

Cons of cppjieba

  • Focused solely on Chinese word segmentation, lacking other NLP features
  • Steeper learning curve due to C++ implementation, compared to Python-based snownlp
  • May require additional setup and compilation steps for integration

Code Comparison

SnowNLP:

from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words)

cppjieba:

#include "cppjieba/Jieba.hpp"
// DICT_PATH, HMM_PATH, etc. are the dictionary file paths from cppjieba's demo
cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
std::vector<std::string> words;
jieba.Cut("这是一个测试句子", words, true);  // true: enable HMM for out-of-vocabulary words

Both libraries provide word segmentation functionality, but cppjieba offers more granular control over the segmentation process. SnowNLP's Python implementation makes it easier to use for quick prototyping, while cppjieba's C++ implementation is better suited for high-performance applications.


README

SnowNLP: Simplified Chinese Text Processing

SnowNLP is a Python library for conveniently processing Chinese text, written under the inspiration of TextBlob. Since most natural language processing libraries target English, this library was written to make working with Chinese convenient. Unlike TextBlob, it does not build on NLTK: all algorithms are implemented from scratch, and some trained dictionaries are bundled. Note that the library operates on unicode strings, so decode your input to unicode before use.
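
The unicode note above mainly matters on Python 2; a minimal sketch of the difference:

from snownlp import SnowNLP

# Python 2: decode byte strings first, e.g. SnowNLP('这个东西真心很赞'.decode('utf-8'))
# Python 3: str literals are already unicode, so this works as-is
s = SnowNLP(u'这个东西真心很赞')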

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

s.sentiments    # 0.9769663402895832, probability of positive sentiment

s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')

s.han           # u'「繁体字」「繁体中文」的叫法
                # 在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

s.keywords(3)   # [u'语言', u'自然', u'计算机']

s.summary(3)    # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能领域中的一个重要方向']

s.sentences     # the text split into a list of sentences

s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])
s.tf             # term frequencies for each document
s.idf            # inverse document frequencies
s.sim([u'文章'])  # [0.3756070762985226, 0, 0], BM25 similarity of the query to each document

Features

  • Chinese word segmentation (character-based generative model)
  • Part-of-speech tagging (TnT, 3-gram hidden Markov model)
  • Sentiment analysis (the training data is mainly e-commerce product reviews, so accuracy on other domains may be lower; to be improved)
  • Text classification (Naive Bayes)
  • Conversion to pinyin (maximum matching over a trie)
  • Traditional-to-simplified conversion (maximum matching over a trie)
  • Keyword extraction (TextRank)
  • Text summarization (TextRank)
  • tf, idf
  • Tokenization (splitting text into sentences)
  • Text similarity (BM25)
  • Python 3 support (thanks to erning)

Get It now

$ pip install snownlp

Training

Training is currently supported for word segmentation, part-of-speech tagging, and sentiment analysis, and the original data files used for training are provided. Taking segmentation as an example, the segmentation code lives in the snownlp/seg directory:

from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')

The trained model is then stored as seg.marshal; afterwards, modify data_path in snownlp/seg/__init__.py to point to the newly trained file. The commented-out lines show the same pattern for tagging and sentiment.
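
For reference, the line being edited in snownlp/seg/__init__.py looks roughly like this (a sketch, not the verbatim source; adjust the file name to your trained model):

import os

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         'seg.marshal')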

License

MIT licensed.