
isnowfy/snownlp

Python library for processing Chinese text

6,397 stars · 1,364 forks · 44 open issues

Top Related Projects

  • jieba (33,063 stars): "Jieba" Chinese text segmentation
  • HanLP (33,448 stars): Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification
  • LAC (3,840 stars): Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • cppjieba: C++ version of the "Jieba" Chinese word segmenter

Quick Overview

SnowNLP is a Python library for processing Chinese text. It provides tools for Chinese word segmentation, part-of-speech tagging, sentiment analysis, and other natural language processing tasks specifically tailored for the Chinese language.

Pros

  • Specialized for Chinese language processing
  • Includes a variety of NLP tasks in a single library
  • Easy to use, with a simple API
  • Supports both simplified and traditional Chinese

Cons

  • Limited documentation, especially in English
  • Not as actively maintained as some other NLP libraries
  • May not perform as well as more specialized tools for specific tasks
  • Limited support for the latest deep learning-based NLP techniques

Code Examples

Word segmentation:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)  # e.g. ['这是', '一个', '测试', '句子']

Sentiment analysis:

from snownlp import SnowNLP

s = SnowNLP(u'这个产品非常好用!')
print(s.sentiments)  # e.g. 0.98; values close to 1 indicate positive sentiment

Keyword extraction:

from snownlp import SnowNLP

s = SnowNLP(u'这是一篇关于自然语言处理的文章')
print(s.keywords(3))  # e.g. ['自然语言', '处理', '文章']
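
Part-of-speech tagging (the tags property documented in the project README; the output shown is the README's own example):

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(list(s.tags))  # e.g. [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]

Traditional-to-simplified conversion (the han property, also from the README):

from snownlp import SnowNLP

s = SnowNLP(u'「繁體字」的叫法在臺灣亦很常見。')
print(s.han)  # e.g. u'「繁体字」的叫法在台湾亦很常见。'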

Getting Started

To get started with SnowNLP, follow these steps:

  1. Install SnowNLP using pip:

    pip install snownlp
    
  2. Import and use SnowNLP in your Python script:

    from snownlp import SnowNLP
    
    text = u'这是一个中文文本处理的例子'
    s = SnowNLP(text)
    
    print(s.words)  # Word segmentation
    print(s.tags)   # Part-of-speech tagging
    print(s.sentiments)  # Sentiment analysis
    
  3. For more advanced usage (keyword extraction, summarization, and more), refer to the project's GitHub repository and the sketch below.
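
As a sketch of that more advanced usage: the same SnowNLP object also exposes TextRank-based keyword extraction and summarization, plus sentence splitting, all documented in the project README (outputs are illustrative):

from snownlp import SnowNLP

text = u'自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。'
s = SnowNLP(text)

print(s.keywords(3))  # top 3 keywords, extracted with TextRank
print(s.summary(1))   # 1-sentence extractive summary, also TextRank
print(s.sentences)    # the text split into sentences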

Competitor Comparisons

jieba (33,063 stars)

"Jieba" Chinese text segmentation

Pros of Jieba

  • More comprehensive Chinese text processing capabilities, including part-of-speech tagging and keyword extraction
  • Larger user base and more active development, resulting in better documentation and community support
  • Offers multiple segmentation modes (accurate, full, and search engine) for different use cases

Cons of Jieba

  • Slower processing speed compared to SnowNLP, especially for large text datasets
  • Requires more memory and computational resources
  • Less focus on sentiment analysis and text classification features

Code Comparison

SnowNLP:

from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words)

Jieba:

import jieba
seg_list = jieba.cut("这是一个测试句子", cut_all=False)
print(" ".join(seg_list))

Both libraries provide Chinese word segmentation functionality, but Jieba offers more granular control over the segmentation process. SnowNLP's API is simpler and more straightforward for basic tasks, while Jieba provides more advanced features and customization options.

SnowNLP excels in sentiment analysis and text summarization, making it a better choice for projects focused on these areas. Jieba, on the other hand, is more suitable for general-purpose Chinese text processing tasks and projects requiring detailed linguistic analysis.
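
The segmentation modes mentioned above are selected through Jieba's documented API; a minimal sketch:

import jieba

text = "这是一个测试句子"
print("/".join(jieba.cut(text)))                # accurate mode (the default)
print("/".join(jieba.cut(text, cut_all=True)))  # full mode: emit every possible word
print("/".join(jieba.cut_for_search(text)))     # search-engine mode: re-segment long words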

HanLP (33,448 stars)

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • More comprehensive NLP toolkit with a wider range of features
  • Better documentation and active community support
  • Higher performance and accuracy for various NLP tasks

Cons of HanLP

  • Steeper learning curve due to its complexity
  • Larger resource footprint and slower processing speed
  • May be overkill for simple Chinese NLP tasks

Code Comparison

SnowNLP:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)
print(s.tags)
print(s.sentiments)

HanLP:

from pyhanlp import *

sentence = "这是一个测试句子"
print(HanLP.segment(sentence))
print(HanLP.parseDependency(sentence))
print(HanLP.extractKeyword(sentence, 3))

Both libraries offer Chinese text processing capabilities, but HanLP provides more advanced features and flexibility. SnowNLP is simpler and easier to use for basic tasks, while HanLP offers a more comprehensive toolkit for complex NLP applications. The choice between them depends on the specific requirements of your project and the level of complexity you need in Chinese language processing.

LAC (3,840 stars)

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • More comprehensive NLP toolkit with advanced features like named entity recognition and word segmentation
  • Actively maintained by Baidu, with regular updates and improvements
  • Supports both Python and C++ implementations for flexibility

Cons of LAC

  • Larger and more complex, potentially harder to integrate for simple tasks
  • Primarily focused on Chinese language processing, less versatile for other languages
  • Requires more computational resources due to its advanced features

Code Comparison

SnowNLP sentiment analysis:

from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.sentiments)

LAC word segmentation and part-of-speech tagging:

from LAC import LAC
lac = LAC(mode='lac')
text = "LAC是个优秀的分词工具"
result = lac.run(text)
print(result)

SnowNLP is simpler and more straightforward for basic tasks like sentiment analysis, while LAC offers more advanced features and better accuracy for complex NLP tasks, especially in Chinese language processing. SnowNLP is lighter and easier to use for beginners, but LAC provides more comprehensive tools for professional NLP applications.
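
LAC's mode argument switches between plain segmentation and joint segmentation plus tagging; a minimal sketch following the documented interface (output shown is illustrative):

from LAC import LAC

text = "LAC是个优秀的分词工具"

lac_seg = LAC(mode='seg')         # segmentation only
print(lac_seg.run(text))          # e.g. ['LAC', '是', '个', '优秀', '的', '分词', '工具']

lac_full = LAC(mode='lac')        # segmentation plus part-of-speech/entity tags
words, tags = lac_full.run(text)  # two parallel lists: tokens and their tags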

"结巴"中文分词的C++版本

Pros of cppjieba

  • Written in C++, offering better performance for large-scale text processing
  • Provides multiple segmentation modes (e.g., maximum probability, hidden Markov model, and query segmentation)
  • Extensive documentation and examples available in both Chinese and English

Cons of cppjieba

  • Focused solely on Chinese word segmentation, lacking other NLP features
  • Steeper learning curve due to C++ implementation, compared to Python-based snownlp
  • May require additional setup and compilation steps for integration

Code Comparison

SnowNLP:

from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words)

cppjieba:

#include "cppjieba/Jieba.hpp"
// DICT_PATH, HMM_PATH, etc. are the dictionary file paths from cppjieba's demo
cppjieba::Jieba jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH);
std::vector<std::string> words;
jieba.Cut("这是一个测试句子", words, true);  // true: enable HMM for out-of-vocabulary words

Both libraries provide word segmentation functionality, but cppjieba offers more granular control over the segmentation process. SnowNLP's Python implementation makes it easier to use for quick prototyping, while cppjieba's C++ implementation is better suited for high-performance applications.


README

SnowNLP: Simplified Chinese Text Processing

SnowNLP is a Python library for conveniently processing Chinese text, written under the inspiration of TextBlob. Since most natural language processing libraries target English, this library was written to make working with Chinese convenient. Unlike TextBlob, it does not build on NLTK: all algorithms are implemented from scratch, and some trained dictionaries are bundled. Note that the library operates on unicode strings, so decode your input to unicode before use.
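
The unicode note above mainly matters on Python 2; a minimal sketch of the difference:

from snownlp import SnowNLP

# Python 2: decode byte strings first, e.g. SnowNLP('这个东西真心很赞'.decode('utf-8'))
# Python 3: str literals are already unicode, so this works as-is
s = SnowNLP(u'这个东西真心很赞')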

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

s.sentiments    # 0.9769663402895832, probability of positive sentiment

s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')

s.han           # u'「繁体字」「繁体中文」的叫法
                # 在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

s.keywords(3)   # [u'语言', u'自然', u'计算机']

s.summary(3)    # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能领域中的一个重要方向']

s.sentences     # the text split into a list of sentences

s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])
s.tf             # term frequencies for each document
s.idf            # inverse document frequencies
s.sim([u'文章'])  # [0.3756070762985226, 0, 0], BM25 similarity of the query to each document

Features

  • Chinese word segmentation (character-based generative model)
  • Part-of-speech tagging (TnT, 3-gram hidden Markov model)
  • Sentiment analysis (the training data is mainly e-commerce product reviews, so accuracy on other domains may be lower; to be improved)
  • Text classification (Naive Bayes)
  • Conversion to pinyin (maximum matching over a trie)
  • Traditional-to-simplified conversion (maximum matching over a trie)
  • Keyword extraction (TextRank)
  • Text summarization (TextRank)
  • tf, idf
  • Tokenization (splitting text into sentences)
  • Text similarity (BM25)
  • Python 3 support (thanks to erning)

Get It now

$ pip install snownlp

Training

Training is currently supported for word segmentation, part-of-speech tagging, and sentiment analysis, and the original data files used for training are provided. Taking segmentation as an example, the segmentation code lives in the snownlp/seg directory:

from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')

The trained model is then stored as seg.marshal; afterwards, modify data_path in snownlp/seg/__init__.py to point to the newly trained file. The commented-out lines show the same pattern for tagging and sentiment.
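
For reference, the line being edited in snownlp/seg/__init__.py looks roughly like this (a sketch, not the verbatim source; adjust the file name to your trained model):

import os

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         'seg.marshal')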

License

MIT licensed.