
crownpku / Awesome-Chinese-NLP

A curated list of resources for Chinese NLP 中文自然语言处理相关资料


Top Related Projects

  • lac - 百度NLP:分词,词性标注,命名实体识别,词重要性 (Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance)

  • jieba - 结巴中文分词 (Jieba Chinese word segmentation)

  • HanLP - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

  • snownlp - Python library for processing Chinese text

  • bert4keras - Keras implementation of transformers for humans

Quick Overview

Awesome-Chinese-NLP is a curated list of resources for Chinese Natural Language Processing (NLP). It provides a comprehensive collection of tools, datasets, papers, and other materials specifically focused on NLP tasks for the Chinese language. This repository serves as a valuable reference for researchers, developers, and enthusiasts working on Chinese language processing.

Pros

  • Extensive collection of resources covering various aspects of Chinese NLP
  • Regularly updated with new tools, datasets, and research papers
  • Well-organized structure, making it easy to find specific resources
  • Includes both open-source and commercial tools, providing a broad overview of the field

Cons

  • May be overwhelming for beginners due to the large number of resources
  • Some links may become outdated over time if not regularly maintained
  • Lacks detailed explanations or comparisons of the listed resources
  • Primarily in English, which may be a barrier for some Chinese-speaking users

Code Examples

This repository is not a code library but a curated list of resources. Therefore, there are no code examples to provide.

Getting Started

As this is not a code library, there are no specific getting started instructions. However, users can begin by exploring the repository's README file on GitHub, which provides an organized list of resources categorized by different aspects of Chinese NLP, such as:

  1. Chinese Word Segmentation
  2. Named Entity Recognition
  3. Sentiment Analysis
  4. Machine Translation
  5. Information Extraction
  6. Text Summarization
  7. Datasets
  8. Toolkits

Users can click on the links provided in each category to access the relevant resources, tools, or papers.
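For instance, a developer might pick a word segmenter and a sentiment tool from the Toolkits section and combine them. The following is a minimal sketch, assuming the jieba and snownlp packages (both listed in the repository) are installed:

import jieba
from snownlp import SnowNLP

text = "这家餐厅的菜味道很好,服务也很周到。"

# Word segmentation with jieba
print("Segmented:", "/ ".join(jieba.cut(text)))

# Sentiment analysis with SnowNLP: probability that the sentence is positive
print("Sentiment score:", SnowNLP(text).sentiments)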

Competitor Comparisons

lac

百度NLP:分词,词性标注,命名实体识别,词重要性 (Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance)

Pros of lac

  • Focused, production-ready Chinese NLP toolkit
  • Provides pre-trained models for immediate use
  • Optimized for performance and efficiency

Cons of lac

  • Limited scope compared to Awesome-Chinese-NLP's comprehensive resource list
  • Less frequently updated than Awesome-Chinese-NLP
  • Primarily maintained by a single organization (Baidu)

Code comparison

lac:

from LAC import LAC

# Load the combined model (segmentation + POS tagging + NER)
lac = LAC(mode='lac')

text = "我爱北京天安门"
result = lac.run(text)  # returns [word_list, tag_list] for a single sentence
print(result)

Awesome-Chinese-NLP doesn't provide direct code examples but offers links to various tools and libraries. A typical usage might involve selecting a specific tool from the list and implementing it separately.
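As an illustration, here is a minimal sketch of that workflow using THULAC, one of the toolkits listed in the repository (assuming the thulac Python package is installed):

import thulac

# seg_only=True skips part-of-speech tagging and returns plain segmentation
thu = thulac.thulac(seg_only=True)
print(thu.cut("我爱北京天安门", text=True))  # space-separated segmented string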

Summary

lac is a focused, ready-to-use Chinese NLP toolkit optimized for performance, while Awesome-Chinese-NLP serves as a comprehensive resource list for Chinese NLP tools and research. lac offers immediate functionality but has a narrower scope, whereas Awesome-Chinese-NLP provides a broader overview of available resources but requires additional effort to implement specific tools.

jieba

结巴中文分词 (Jieba Chinese word segmentation)

Pros of jieba

  • Focused, specialized tool for Chinese word segmentation
  • Lightweight and easy to integrate into projects
  • Offers multiple segmentation modes (accurate, full, search engine); see the sketch after the code comparison below

Cons of jieba

  • Limited to word segmentation, not a comprehensive NLP toolkit
  • May require additional libraries for advanced NLP tasks
  • Less frequently updated compared to Awesome-Chinese-NLP

Code Comparison

Awesome-Chinese-NLP is a curated list of resources, not a code library. However, here's a basic usage example of jieba:

import jieba

text = "我来到北京清华大学"
# Accurate mode (cut_all=False, the default) suits most text-analysis tasks
seg_list = jieba.cut(text, cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

Summary

jieba is a specialized Chinese word segmentation tool, offering efficient and accurate text processing for specific tasks. It's lightweight and easy to use but limited in scope compared to the comprehensive resource list provided by Awesome-Chinese-NLP.

Awesome-Chinese-NLP serves as a curated collection of various Chinese NLP tools, datasets, and research papers, providing a broader overview of the field. While it doesn't offer direct functionality, it guides users to a wide range of resources for different NLP tasks.

Choose jieba for quick integration of Chinese word segmentation into your project. Opt for Awesome-Chinese-NLP when seeking a comprehensive guide to Chinese NLP resources and tools for more complex or diverse NLP tasks.

HanLP

Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification

Pros of HanLP

  • Comprehensive NLP toolkit with a wide range of functionalities
  • Actively maintained with regular updates and improvements
  • Provides both Java and Python interfaces for flexibility

Cons of HanLP

  • Steeper learning curve due to its extensive feature set
  • May be overkill for simple NLP tasks or projects
  • Requires more system resources compared to lightweight alternatives

Code Comparison

HanLP:

from hanlp_restful import HanLPClient

# The RESTful API is accessed through HanLPClient; auth=None uses the free anonymous quota
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.parse('我爱自然语言处理技术!'))

Awesome-Chinese-NLP (using jieba as an example):

import jieba

# jieba.cut returns a generator; join it to see the segmented words
print("/ ".join(jieba.cut('我爱自然语言处理技术!')))

Summary

HanLP is a comprehensive NLP toolkit offering a wide range of functionalities for Chinese language processing. It provides both Java and Python interfaces, making it versatile for different development environments. However, its extensive feature set may result in a steeper learning curve and higher resource requirements.

Awesome-Chinese-NLP, on the other hand, is a curated list of resources and tools for Chinese NLP. It doesn't provide direct functionality but serves as a valuable reference for various Chinese NLP tools and libraries. This makes it more suitable for developers looking to explore different options or find specific tools for their projects.

While HanLP offers a unified solution for many NLP tasks, Awesome-Chinese-NLP allows users to pick and choose from a variety of specialized tools, potentially resulting in a more tailored and lightweight solution for specific use cases.

snownlp

Python library for processing Chinese text

Pros of snownlp

  • Focused tool: Provides a specific set of Chinese NLP functionalities
  • Ready-to-use: Offers pre-trained models for immediate application
  • Lightweight: Easy to install and integrate into projects

Cons of snownlp

  • Limited scope: Covers fewer NLP tasks compared to Awesome-Chinese-NLP
  • Less frequently updated: May not include the latest advancements in Chinese NLP
  • Smaller community: Less active development and support

Code comparison

snownlp:

from snownlp import SnowNLP

s = SnowNLP(u'这是一个测试句子')
print(s.words)         # word segmentation (分词)
print(list(s.tags))    # part-of-speech tagging (词性标注); s.tags is a generator
print(s.sentiments)    # sentiment score (情感分析), probability of being positive

Awesome-Chinese-NLP: (Note: This is a curated list, not a tool, so there's no direct code comparison)
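For completeness, snownlp also bundles keyword extraction and extractive summarization. A minimal sketch, assuming the standard snownlp API:

from snownlp import SnowNLP

doc = SnowNLP(u'自然语言处理是人工智能的一个重要方向。它研究如何让计算机理解和生成人类语言。')
print(doc.keywords(3))  # top-3 keywords
print(doc.summary(1))   # one-sentence extractive summary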

Summary

snownlp is a practical, ready-to-use Chinese NLP library with a focused set of features. It's suitable for quick implementation of basic Chinese NLP tasks. Awesome-Chinese-NLP, on the other hand, is a comprehensive resource list that provides a wider range of tools and research papers for Chinese NLP. It's more suitable for researchers and developers looking to explore various options and stay updated with the latest advancements in the field.

bert4keras

Keras implementation of transformers for humans

Pros of bert4keras

  • Focused specifically on BERT implementation in Keras
  • Provides ready-to-use BERT models for Chinese NLP tasks
  • Offers more hands-on code examples and implementations

Cons of bert4keras

  • Limited to BERT-based models and Keras framework
  • Less comprehensive in covering other Chinese NLP resources
  • May require more technical expertise to use effectively

Code Comparison

bert4keras example:

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# config_path, checkpoint_path and dict_path must point to a downloaded
# pretrained Chinese BERT release (bert_config.json, bert_model.ckpt, vocab.txt)
model = build_transformer_model(config_path, checkpoint_path)
tokenizer = Tokenizer(dict_path)
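A minimal end-to-end sketch of how those objects might be used, assuming Google's chinese_L-12_H-768_A-12 BERT release has been downloaded locally (the paths below are hypothetical):

import numpy as np
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Hypothetical local paths to the pretrained Chinese BERT files
config_path = 'chinese_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'

model = build_transformer_model(config_path, checkpoint_path)
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Encode a sentence and run it through the encoder
token_ids, segment_ids = tokenizer.encode('我爱自然语言处理技术')
outputs = model.predict([np.array([token_ids]), np.array([segment_ids])])
print(outputs.shape)  # (1, sequence_length, hidden_size) for the plain BERT encoder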

Awesome-Chinese-NLP doesn't provide direct code examples but offers links to various Chinese NLP tools and resources:

## Chinese Word Segmentation

- [THULAC](http://thulac.thunlp.org/) - An Efficient Lexical Analyzer for Chinese
- [Jieba](https://github.com/fxsjy/jieba) - Python Chinese Word Segmentation Module

While Awesome-Chinese-NLP serves as a comprehensive resource hub for Chinese NLP, bert4keras focuses on providing a specific implementation of BERT for Chinese language tasks. Awesome-Chinese-NLP covers a broader range of topics and tools, making it more suitable for researchers and developers looking for an overview of the field. bert4keras, on the other hand, is more appropriate for those specifically interested in using BERT models with Keras for Chinese NLP projects.


README

awesome-chinese-nlp


A curated list of resources for NLP (Natural Language Processing) for Chinese

中文自然语言处理相关资料

(Image credit: Professor 邱锡鹏 Qiu Xipeng, Fudan University)

Contents 列表

1. Chinese NLP Toolkits 中文NLP工具

2. Corpus 中文语料

3. Organizations 中文NLP学术组织及竞赛

4. Industry 中文NLP商业服务

5. Learning Materials 学习资料



Chinese NLP Toolkits 中文NLP工具

Toolkits 综合NLP工具包

  • THULAC 中文词法分析工具包 (Chinese lexical analysis toolkit) by 清华 (Tsinghua University) (C++/Java/Python)

  • NLPIR by 中科院 (Chinese Academy of Sciences) (Java)

  • LTP 语言技术平台 (Language Technology Platform) by 哈工大 (Harbin Institute of Technology) (C++); pyltp is the Python wrapper for LTP

  • FudanNLP by 复旦 (Fudan University) (Java)

  • BaiduLac by 百度 Baidu's open-source lexical analysis tool for Chinese, including word segmentation, part-of-speech tagging & named entity recognition.

  • HanLP (Java)

  • FastNLP (Python) A lightweight NLP processing suite (一款轻量级的 NLP 处理套件)

  • SnowNLP (Python) Python library for processing Chinese text

  • YaYaNLP (Python) A Chinese NLP package written in pure Python, named after 牙牙学语 ("babbling")

  • 小明NLP (Python) Lightweight Chinese natural language processing toolkit

  • DeepNLP (Python) Deep Learning NLP Pipeline implemented on Tensorflow with pretrained Chinese models.

  • chinese_nlp (C++ & Python) Chinese Natural Language Processing tools and examples

  • lightNLP (Python) A deep learning NLP framework based on PyTorch and torchtext

  • Chinese-Annotator (Python) Annotator for Chinese Text Corpus 中文文本标注工具

  • Poplar (Typescript) A web-based annotation tool for natural language processing (NLP)

  • Jiagu (Python) Built on BiLSTM and related models and trained on large-scale corpora; provides common Chinese NLP functions such as word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, knowledge-graph relation extraction, keyword extraction, text summarization, and new-word discovery.

  • SmoothNLP (Python & Java) Focused on interpretable NLP techniques (专注于可解释的NLP技术)

  • FoolNLTK (Python & Java) A Chinese Natural Language Toolkit

Popular NLP Toolkits for English/Multi-Language 常用的英文或支持多语言的NLP工具包

  • CoreNLP by Stanford (Java) A Java suite of core NLP tools.

  • Stanza by Stanford (Python) A Python NLP Library for Many Human Languages

  • NLTK (Python) Natural Language Toolkit

  • spaCy (Python) Industrial-Strength Natural Language Processing, with an online course

  • textacy (Python) NLP, before and after spaCy

  • OpenNLP (Java) A machine learning based toolkit for the processing of natural language text.

  • gensim (Python) Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

  • Kashgari - A simple and powerful NLP framework; build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (POS), and text classification tasks. Includes BERT and word2vec embeddings.

Chinese Word Segmentation 中文分词

Information Extraction 信息提取

  • MITIE (C++) library and tools for information extraction

  • Duckling (Haskell) Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

  • IEPY (Python) IEPY is an open source tool for Information Extraction focused on Relation Extraction.

  • Snorkel A training data creation and management system focused on information extraction

  • Neural Relation Extraction implemented with LSTM in TensorFlow

  • A neural network model for Chinese named entity recognition

  • bert-chinese-ner Chinese NER using the pretrained language model BERT (使用预训练语言模型BERT做中文NER)

  • Information-Extraction-Chinese Chinese Named Entity Recognition with IDCNN/biLSTM+CRF, and Relation Extraction with biGRU+2ATT 中文实体识别与关系提取

  • Familia A Toolkit for Industrial Topic Modeling, by Baidu (百度出品)

  • Text Classification All kinds of text classification models and more with deep learning, using Zhihu Q&A data as the test corpus (用知乎问答语料作为测试数据)

  • ComplexEventExtraction Concepts and explicit patterns of Chinese compound events: extraction of conditional, causal, sequential, and reversal events, assembled into an event-logic graph (事理图谱)

  • TextRank4ZH Automatic keyword and summary extraction from Chinese text (从中文文本中自动提取关键词和摘要)

QA & Chatbot 问答和聊天机器人

Multi-Modal Representation & Retrieval 多模态表征与检索

  • Chinese-CLIP (Python) A Chinese multimodal image-text representation pretraining model. Based on OpenAI's CLIP architecture and pretrained on large-scale native Chinese image-text data; multiple model sizes are open-sourced, along with a technical report and a retrieval demo.


Corpus 中文语料


Organizations 中文NLP学术组织及竞赛



Industry 中文NLP商业服务

  • 华为云NLP Huawei Cloud NLP: cloud services for text analysis and mining for enterprises and developers, designed to help users process text efficiently

  • 百度云NLP Baidu Cloud NLP: industry-leading natural language processing technology for high-quality text processing and understanding

  • 阿里云NLP Alibaba Cloud NLP: core tools for text analysis and mining for enterprises and developers

  • 腾讯云NLP Tencent Cloud NLP: built on parallel computing and a distributed crawler system combined with proprietary semantic analysis, covering NLP, transcoding, extraction, and data-crawling needs in one place

  • 讯飞开放平台 iFLYTEK Open Platform: an open AI platform centered on voice interaction

  • 搜狗实验室 Sogou Lab: word segmentation and part-of-speech tagging

  • 玻森数据 BosonNLP (Shanghai Boson Data Technology Co., Ltd.): focused on Chinese semantic analysis

  • 云孚科技: NLP toolkits, knowledge graphs, text mining, dialogue systems, public-opinion analysis, and more

  • 智言科技: an AI company focused on breakthroughs in deep learning and knowledge graph technology

  • 追一科技 Zhuiyi Technology: focused on deep learning and natural language processing



Learning Materials 学习资料