
Synonyms

:herb: Chinese synonyms: a toolkit for chatbots and intelligent question answering (中文近义词:聊天机器人,智能问答工具包)


Top Related Projects

  • jieba (33,063 ★) — "结巴" Chinese text segmentation
  • HanLP (33,448 ★) — Natural Language Processing for the next decade: tokenization, part-of-speech tagging, named entity recognition, syntactic & semantic dependency parsing, document classification
  • SnowNLP (6,397 ★) — Python library for processing Chinese text
  • LAC (3,840 ★) — Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
  • bert4keras — Keras implementation of Transformers "for humans"

Quick Overview

Synonyms is a Chinese Natural Language Processing (NLP) library for word similarity and sentence similarity calculations. It provides tools for semantic analysis, word embedding, and related NLP tasks specifically tailored for the Chinese language.

Pros

  • Specialized for Chinese language processing
  • Offers both word and sentence similarity calculations
  • Includes pre-trained word vectors for immediate use
  • Supports custom word vectors and dictionaries

Cons

  • Limited documentation in English
  • Requires large pre-trained word-vector files (about 1.5 GB of disk space)
  • Focused primarily on similarity calculations, lacking broader NLP features
  • Relatively small community compared to more general-purpose NLP libraries

Code Examples

  1. Calculate word similarity:

import synonyms
r = synonyms.compare('北京', '上海')
print(r)  # Output: 0.63395087

  2. Find synonyms for a given word:

import synonyms
words, scores = synonyms.nearby('北京')
print(words[:3])  # Output: ['首都', '城市', '中国']
print(scores[:3])  # Output: [0.896611, 0.754657, 0.74305]

  3. Calculate sentence similarity:

import synonyms
sen1 = '我喜欢吃苹果'
sen2 = '我喜欢吃香蕉'
r = synonyms.compare(sen1, sen2, seg=True)
print(r)  # Output: 0.9285714285714286

Getting Started

To get started with Synonyms:

  1. Install the library:

pip install synonyms

  2. Import and use in your Python script:
import synonyms

# Calculate word similarity
similarity = synonyms.compare('北京', '上海')
print(f"Similarity between '北京' and '上海': {similarity}")

# Find synonyms
words, scores = synonyms.nearby('学习')
print(f"Synonyms for '学习': {words[:5]}")
print(f"Scores: {scores[:5]}")

Note: Make sure you have sufficient disk space (about 1.5GB) for the pre-trained word vectors.

Competitor Comparisons

jieba (33,063 ★): "结巴" Chinese text segmentation

Pros of jieba

  • More mature and widely adopted Chinese text segmentation library
  • Supports multiple segmentation modes (accurate, full, search engine)
  • Extensive documentation and community support

Cons of jieba

  • Primarily focused on word segmentation, not synonyms
  • Less emphasis on semantic analysis and word relationships

Code Comparison

jieba:

import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

Synonyms:

import synonyms

words = synonyms.seg("我来到北京清华大学")
print(words)

Key Differences

  • jieba is primarily a word segmentation tool, while Synonyms focuses on both segmentation and synonym generation
  • Synonyms provides word vectors and similarity calculations, which are not core features of jieba
  • jieba offers more granular control over segmentation modes, while Synonyms emphasizes semantic understanding

Use Cases

  • jieba: Best for applications requiring precise Chinese word segmentation
  • Synonyms: Ideal for projects needing both segmentation and semantic analysis, such as text similarity comparisons or synonym suggestions

Community and Maintenance

  • jieba: Larger user base, more frequent updates, and extensive third-party integrations
  • Synonyms: Smaller but growing community, with a focus on semantic analysis and NLP applications
HanLP (33,448 ★): Natural Language Processing for the next decade — tokenization, part-of-speech tagging, named entity recognition, syntactic & semantic dependency parsing, document classification

Pros of HanLP

  • More comprehensive NLP toolkit with broader functionality
  • Better support for traditional Chinese characters
  • More active development and frequent updates

Cons of HanLP

  • Steeper learning curve due to more complex API
  • Larger library size, potentially impacting performance
  • Requires more setup and configuration

Code Comparison

HanLP:

from pyhanlp import *

text = "我爱北京天安门"
print(HanLP.segment(text))

Synonyms:

import synonyms

words = synonyms.seg("我爱北京天安门")
print(words)

Both libraries offer word segmentation functionality, but HanLP provides more detailed output with part-of-speech tagging. Synonyms focuses primarily on word similarity and segmentation, while HanLP offers a wider range of NLP tasks.

HanLP is better suited for projects requiring advanced NLP capabilities in Chinese, including named entity recognition, dependency parsing, and more. Synonyms is more appropriate for simpler tasks focused on word relationships and basic segmentation.

Consider your project's specific needs, performance requirements, and the level of NLP functionality required when choosing between these libraries.

SnowNLP (6,397 ★): Python library for processing Chinese text

Pros of SnowNLP

  • Broader functionality including sentiment analysis, text classification, and word segmentation
  • Includes tools for pinyin conversion and simplified/traditional Chinese conversion
  • More comprehensive documentation and examples

Cons of SnowNLP

  • Less focused on synonyms and semantic similarity
  • May require more setup and configuration for specific tasks
  • Not as actively maintained (last update was in 2020)

Code Comparison

SnowNLP example:

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.sentiments)  # Sentiment analysis
print(s.pinyin)  # Pinyin conversion

Synonyms example:

import synonyms

print(synonyms.nearby('人脸'))
print(synonyms.compare('北京', '上海', seg=True))

Summary

SnowNLP offers a wider range of NLP functionalities for Chinese text processing, including sentiment analysis and text classification. However, it's less focused on synonyms and semantic similarity compared to Synonyms. SnowNLP provides more comprehensive documentation but hasn't been updated as recently as Synonyms. The choice between the two depends on the specific NLP tasks required for your project.

LAC (3,840 ★): Baidu NLP — word segmentation, part-of-speech tagging, named entity recognition, word importance

Pros of LAC

  • Offers comprehensive Chinese language processing capabilities, including word segmentation, part-of-speech tagging, and named entity recognition
  • Provides pre-trained models for various domains, enhancing accuracy and performance
  • Supports both Python and C++ interfaces, offering flexibility for different development environments

Cons of LAC

  • Primarily focused on Chinese language processing, limiting its applicability for other languages
  • May require more computational resources due to its comprehensive feature set
  • Has a steeper learning curve compared to simpler synonym-focused libraries

Code Comparison

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "LAC是个优秀的中文处理工具"
result = lac.run(text)
print(result)

Synonyms:

import synonyms

word = "优秀"
# Bind the result to a fresh name; reusing `synonyms` would shadow the module.
words, scores = synonyms.nearby(word)
print(words, scores)

Key Differences

LAC is a more comprehensive Chinese language processing tool, offering a wide range of features beyond synonym detection. It's well-suited for complex NLP tasks in Chinese. Synonyms, on the other hand, is more focused on providing synonym functionality and is simpler to use for basic word similarity tasks. LAC may be preferred for large-scale Chinese NLP projects, while Synonyms could be more appropriate for quick synonym lookups or simpler language processing needs.

bert4keras: Keras implementation of Transformers "for humans"

Pros of bert4keras

  • More comprehensive and flexible BERT implementation
  • Supports multiple BERT variants and architectures
  • Better suited for advanced NLP tasks beyond synonyms

Cons of bert4keras

  • Steeper learning curve and more complex setup
  • Requires more computational resources
  • May be overkill for simple synonym-related tasks

Code Comparison

Synonyms:

import synonyms
synonyms.nearby("你好")

bert4keras:

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

model = build_transformer_model(config_path, checkpoint_path)
tokenizer = Tokenizer(dict_path)

Summary

Synonyms is a lightweight library focused specifically on Chinese synonym detection and word similarity. It's easy to use and suitable for simple NLP tasks. bert4keras, on the other hand, is a more powerful and versatile BERT implementation that can handle a wide range of NLP tasks but requires more setup and resources. Choose Synonyms for quick synonym-related projects, and bert4keras for more complex NLP applications requiring BERT's capabilities.


README


Synonyms

Chinese Synonyms for Natural Language Processing and Understanding.

A better Chinese synonyms toolkit for chatbots and intelligent question answering.

synonyms can be used in many natural-language-understanding tasks: text alignment, recommendation algorithms, similarity computation, semantic shift, keyword extraction, concept extraction, automatic summarization, search engines, and more.

To provide a stable, reliable, and continuously optimized service, Synonyms has switched to the Chunsong Public License, v1.0, and charges for downloads of its machine-learning models; see the license store for details. Previous contributors (those with outstanding code contributions) may contact us to discuss the fees. -- Chatopera Inc. @ Oct. 2023

Table of Contents:

Welcome

Follow steps below to install and activate packages.

1/3 Install the Source Code Package

pip install -U synonyms

The current stable version is v3.x.

2/3 Configure the License Id

Synonyms' machine-learning model packages require a license from the Chatopera License Store. First purchase a license, then copy the license id from the license's detail page in the store (click 【复制证书标识】, "copy license id").


Secondly, set environment variable in your terminal or shell scripts as below.

  • For Shell Users

e.g. Shell, CMD Scripts on Linux, Windows, macOS.

# Linux / macOS
export SYNONYMS_DL_LICENSE=YOUR_LICENSE
## e.g. if your license id is `FOOBAR`, run `export SYNONYMS_DL_LICENSE=FOOBAR`

# Windows
## 1/2 Command Prompt
set SYNONYMS_DL_LICENSE=YOUR_LICENSE
## 2/2 PowerShell
$env:SYNONYMS_DL_LICENSE='YOUR_LICENSE'
  • For Python Code Users

Jupyter Notebook, etc.

import os
os.environ["SYNONYMS_DL_LICENSE"] = "YOUR_LICENSE"
_licenseid = os.environ.get("SYNONYMS_DL_LICENSE", None)
print("SYNONYMS_DL_LICENSE=", _licenseid)

Tip: the word-vector file is downloaded on first use after installation; download speed depends on your network.

3/3 Download Model Package

Last, download the model package by command or script -

python -c "import synonyms; synonyms.display('能量')" # download word vectors file

Usage

The segmentation dictionary and the word2vec word-vector file can be configured through environment variables:

  • SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN: word-vector file trained with word2vec, in binary format
  • SYNONYMS_WORDSEG_DICT: main dictionary for Chinese word segmentation; see the reference for format and usage
  • SYNONYMS_DEBUG: ["TRUE"|"FALSE"], whether to print debug logs; set to "TRUE" to enable, default "FALSE"
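These variables must be set before the library is imported, because the model and dictionaries are loaded at import time. A minimal sketch from Python (the file paths below are placeholders, not real files):

```python
import os

# Set configuration BEFORE `import synonyms`; the library reads these
# variables while loading its model and dictionaries.
# The paths here are illustrative placeholders only.
os.environ["SYNONYMS_WORD2VEC_BIN_MODEL_ZH_CN"] = "/data/words.vector.gz"
os.environ["SYNONYMS_WORDSEG_DICT"] = "/data/vocab.txt"
os.environ["SYNONYMS_DEBUG"] = "TRUE"

print(os.environ["SYNONYMS_DEBUG"])  # TRUE
```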

synonyms#nearby(word [, size = 10])

import synonyms
print("人脸: ", synonyms.nearby("人脸"))
print("识别: ", synonyms.nearby("识别"))
print("NOT_EXIST: ", synonyms.nearby("NOT_EXIST"))

synonyms.nearby(WORD [, SIZE]) returns a tuple of two items: ([nearby_words], [nearby_words_score]). nearby_words are the synonyms of WORD, stored as a list and sorted from nearest to farthest by distance; nearby_words_score are the distance scores of the words at the **corresponding positions** in nearby_words. Scores fall in the (0-1) interval; the closer to 1, the more similar. SIZE is the number of words to return, default 10. For example:

synonyms.nearby("人脸", 10) = (
    ["图片", "图像", "通过观察", "数字图像", "几何图形", "脸部", "图象", "放大镜", "面孔", "Mii"],
    [0.597284, 0.580373, 0.568486, 0.535674, 0.531835, 0.530095, 0.525344, 0.524009, 0.523101, 0.516046])

For OOV (out-of-vocabulary) words it returns ([], []). Current vocabulary size: 435,729.
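The behavior of this kind of lookup can be illustrated with a self-contained sketch: rank a toy vocabulary by cosine similarity over made-up 3-dimensional vectors. The words and vectors below are illustrative only; the real model holds 435,729 words with much higher-dimensional word2vec vectors, and its exact scoring may differ:

```python
import math

# Toy 3-d "word vectors" -- stand-ins for the real 435,729-word model.
vectors = {
    "人脸": [0.9, 0.1, 0.2],
    "图像": [0.8, 0.2, 0.3],
    "面孔": [0.85, 0.15, 0.1],
    "苹果": [0.1, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearby(word, size=10):
    """Return ([words], [scores]) sorted nearest-first; ([], []) for OOV."""
    if word not in vectors:
        return [], []  # same shape as synonyms.nearby on OOV input
    scored = sorted(
        ((w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word),
        key=lambda pair: pair[1], reverse=True)[:size]
    return [w for w, _ in scored], [s for _, s in scored]

print(nearby("人脸", 3))
print(nearby("NOT_EXIST"))  # ([], [])
```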

synonyms#compare(sen1, sen2 [, seg=True])

Compares the similarity of two sentences:

    sen1 = "发生历史性变革"
    sen2 = "发生历史性变革"
    r = synonyms.compare(sen1, sen2, seg=True)

The seg parameter controls whether synonyms.compare segments sen1 and sen2 before comparing; it defaults to True. The return value is in [0-1]; the closer to 1, the more similar the two sentences.

旗帜引领方向 vs 道路决定命运: 0.429
旗帜引领方向 vs 旗帜指引道路: 0.93
发生历史性变革 vs 发生历史性变革: 1.0

synonyms#display(word [, size = 10])

Prints synonyms in a friendly format, convenient for debugging; display(WORD [, SIZE]) calls the synonyms#nearby method.

>>> synonyms.display("飞机")
'飞机'近义词:
  1. 飞机:1.0
  2. 直升机:0.8423391
  3. 客机:0.8393003
  4. 滑翔机:0.7872388
  5. 军用飞机:0.7832081
  6. 水上飞机:0.77857226
  7. 运输机:0.7724742
  8. 航机:0.7664748
  9. 航空器:0.76592904
  10. 民航机:0.74209654

SIZE is the number of words to print, default 10.

synonyms#describe()

Prints a description of the current package:

>>> synonyms.describe()
Vocab size in vector model: 435729
model_path: /Users/hain/chatopera/Synonyms/synonyms/data/words.vector.gz
version: 3.18.0
{'vocab_size': 435729, 'version': '3.18.0', 'model_path': '/chatopera/Synonyms/synonyms/data/words.vector.gz'}

synonyms#v(word)

Returns the vector of a word as a numpy array; raises a KeyError when the word is out of vocabulary.

>>> synonyms.v("飞机")
array([-2.412167  ,  2.2628384 , -7.0214124 ,  3.9381874 ,  0.8219283 ,
       -3.2809453 ,  3.8747153 , -5.217062  , -2.2786229 , -1.2572327 ],
      dtype=float32)

synonyms#sv(sentence, ignore=False)

Returns the vector of a segmented sentence, composed in bag-of-words (BoW) fashion.

    sentence: the sentence, pre-segmented with tokens joined by spaces
    ignore: whether to ignore OOV words; when False, a random vector is generated for each OOV word
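A BoW-style sentence vector of this kind can be sketched as the average of the token vectors. This is a simplification under stated assumptions: toy 3-d vectors stand in for the real model, and plain averaging stands in for the library's actual composition, which is not shown in this document:

```python
import random

# Toy 3-d word vectors; placeholders for the real word2vec model.
vectors = {
    "我": [0.1, 0.2, 0.3],
    "喜欢": [0.4, 0.5, 0.6],
    "苹果": [0.7, 0.8, 0.9],
}

def sv(sentence, ignore=False):
    """Sentence is pre-segmented, tokens joined by spaces.
    OOV tokens are skipped when ignore=True, otherwise assigned a random vector."""
    vecs = []
    for token in sentence.split():
        if token in vectors:
            vecs.append(vectors[token])
        elif not ignore:
            vecs.append([random.uniform(-1, 1) for _ in range(3)])
    if not vecs:
        return [0.0, 0.0, 0.0]
    # Bag-of-words composition: average the token vectors dimension-wise.
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print(sv("我 喜欢 苹果"))
```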

synonyms#seg(sentence)

Chinese word segmentation:

synonyms.seg("中文近义词工具包")

The segmentation result is a tuple of two lists: the words and their corresponding part-of-speech tags.

(['中文', '近义词', '工具包'], ['nz', 'n', 'n'])

The segmenter does not remove stop words or punctuation.

synonyms#keywords(sentence [, topK=5, withWeight=False])

Extracts keywords, ranked by importance by default.

keywords = synonyms.keywords("9月15日以来,台积电、高通、三星等华为的重要合作伙伴,只要没有美国的相关许可证,都无法供应芯片给华为,而中芯国际等国产芯片企业,也因采用美国技术,而无法供货给华为。目前华为部分型号的手机产品出现货少的现象,若该形势持续下去,华为手机业务将遭受重创。")
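As an illustration of importance-based ranking, here is a minimal frequency-based sketch. This is an analogy only: the actual algorithm behind synonyms.keywords is not shown in this document, and the function and token list below are hypothetical:

```python
from collections import Counter

def extract_keywords(tokens, topK=5, withWeight=False):
    """Toy keyword ranking: score each token by its relative frequency.
    The real synonyms.keywords likely uses a more sophisticated measure."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = counts.most_common(topK)
    if withWeight:
        return [(word, count / total) for word, count in ranked]
    return [word for word, _ in ranked]

# Hypothetical pre-segmented tokens from a news snippet.
tokens = ["华为", "芯片", "华为", "手机", "芯片", "华为"]
print(extract_keywords(tokens, topK=2))  # ['华为', '芯片']
print(extract_keywords(tokens, topK=2, withWeight=True))
```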

Contribution

To get more logs for debugging, set the environment variable:

SYNONYMS_DEBUG=TRUE

PCA

Principal component analysis, using "人脸" as an example:

Quick Get Start

$ pip install -r Requirements.txt
$ python demo.py

Change logs

Release notes.

Voice of Users

What users say:

Data

The data is built from wikidata-corpus.

Evaluation

同义词词林 (Tongyici Cilin)

《同义词词林》 was compiled by Mei Jiaju et al. in 1983. The version widely used today is the extended edition, 《同义词词林扩展版》, maintained by the HIT Research Center for Social Computing and Information Retrieval. It finely divides Chinese words into major and minor categories and organizes the relations between words. The extended edition contains more than 70,000 words, over 30,000 of which are shared as open data.

知网 (HowNet)

HowNet, also known as 知网, is not merely a semantic dictionary but a knowledge system; the relations between words are one of its basic use cases. HowNet contains over 80,000 word entries.

The standard international evaluation for word-similarity algorithms uses human judgments on the English word-pair set published by Miller & Charles: 30 pairs consisting of ten highly related, ten moderately related, and ten weakly related word pairs. 38 subjects judged the semantic relatedness of these 30 pairs, and the average of their judgments serves as the human standard. Different synonym tools then score the same pairs, and their scores are compared against the human standard, for example with the Pearson correlation coefficient. In Chinese, using a translated version of this word list for synonym comparison is also a common approach.
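The Pearson correlation used in this evaluation protocol is straightforward to compute. A minimal sketch, with hypothetical human-judgment and tool scores (not real benchmark data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores: human judgments vs. a tool's similarity scores
# for six word pairs (high, medium, and low relatedness).
human = [0.95, 0.90, 0.55, 0.50, 0.10, 0.05]
tool = [0.90, 0.85, 0.60, 0.40, 0.20, 0.05]
print(round(pearson(human, tool), 4))
```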

Comparison

Synonyms' vocabulary size is 435,729. Below, for several words that exist in 同义词词林, HowNet, and Synonyms alike, we compare their similarity scores:

Note: the 同义词词林 and HowNet data and scores come from the cited source. Synonyms is continuously being optimized, so newer scores may differ from the figure above.

More comparison results.

Used by

List of associated GitHub users

Benchmark

Tested with Python 3 on a MacBook Pro.

python benchmark.py

++++++++++ OS Name and version ++++++++++

Platform: Darwin

Kernel: 16.7.0

Architecture: ('64bit', '')

++++++++++ CPU Cores ++++++++++

Cores: 4

CPU Load: 60

++++++++++ System Memory ++++++++++

meminfo 8GB

synonyms#nearby: 100000 loops, best of 3 epochs: 0.209 usec per loop
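A micro-benchmark of the same shape can be reproduced with the standard library's timeit. The function below is only a stand-in, since loading the real model depends on the environment and a license:

```python
import timeit

def lookup():
    """Stand-in for synonyms.nearby; the real call would be
    synonyms.nearby("人脸") after the model is loaded."""
    return {"人脸": 1}.get("人脸")

# Mirror benchmark.py's reporting: best of 3 runs of 100000 loops each.
best = min(timeit.repeat(lookup, number=100000, repeat=3))
print(f"100000 loops, best of 3: {best / 100000 * 1e6:.3f} usec per loop")
```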

Live Sharing

52nlp.cn

机器之心

Transcript of the live sharing session: the Synonyms Chinese synonym toolkit @ 2018-02-07

Statement

Synonyms is released under the MIT license. The data and programs may be used in research and commercial products, provided the citation and address are acknowledged, for example in any published media, journals, magazines, or blogs.

@online{Synonyms:hain2017,
  author = {Hai Liang Wang and Hu Ying Xi},
  title = {中文近义词工具包Synonyms},
  year = 2017,
  url = {https://github.com/chatopera/Synonyms},
  urldate = {2017-09-27}
}

References

wikidata-corpus

word2vec 原理推导与代码分析

Frequently Asked Questions (FAQ)

  1. Can words be added to the vocabulary?

No. For details, see #5.

  2. Which tool was used to train the word vectors?

word2vec, released by Google. It is written in C, memory-efficient, and fast to train. gensim can load the model files that word2vec outputs.

  3. What method is used for similarity computation?

See #64 for details.

  4. #118 The word-vector file fails to download?

Authors

Hai Liang Wang

Hu Ying Xi

Recommended introductory book and tools for natural language processing

This book was co-authored by the authors of Synonyms.

Quick purchase link

《智能问答与深度学习》 ("Intelligent Question Answering and Deep Learning") serves students and software engineers getting started with machine learning and natural language processing. It introduces many principles and algorithms in theory while also providing many sample programs for hands-on practice. These programs, collected in a sample code repository, mainly help readers understand the principles and algorithms; everyone is welcome to download and run them. The repository address is:

https://github.com/l11x0m7/book-of-qna-code

Give credits to

Word2vec by Google

Wikimedia: source of the training corpus

gensim: word2vec.py

SentenceSim: similarity evaluation corpus

jieba: Chinese word segmentation

License

Chunsong Public License, version 1.0

Project Sponsor

Chatopera Cloud Service

https://bot.chatopera.com/

Chatopera Cloud Service is a one-stop cloud service for building chatbots, billed by the number of API calls. It is the software-as-a-service instance of the Chatopera bot platform. Built on cloud computing, Chatopera Cloud Service is **chatbot as a service**.

The Chatopera bot platform includes components such as a knowledge base, multi-turn dialogue, intent recognition, and speech recognition, standardizing chatbot development and supporting scenarios such as enterprise OA question answering, HR question answering, customer service, and online marketing. Enterprise IT and business departments can bring chatbots online quickly with Chatopera Cloud Service!