Top Related Projects
Jieba Chinese word segmentation
Baidu NLP: word segmentation, POS tagging, named entity recognition, word importance
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Python library for processing Chinese text
Keras implementation of transformers for humans
Quick Overview
LTP (Language Technology Platform) is an open-source Chinese natural language processing toolkit developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It provides a comprehensive set of Chinese language processing tools, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling.
Pros
- Comprehensive suite of Chinese NLP tools in a single package
- High accuracy and performance for various Chinese language processing tasks
- Actively maintained and regularly updated with new features and improvements
- Supports both Python and C++ interfaces for flexibility in integration
Cons
- Primarily focused on Chinese language, limiting its use for other languages
- Requires some understanding of Chinese linguistics for optimal usage
- Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
- Resource-intensive for some tasks, potentially requiring significant computational power
Code Examples
- Word segmentation and part-of-speech tagging:
from ltp import LTP
ltp = LTP()  # loads the default model; seg() is the LTP 4.0/4.1 interface
segment, hidden = ltp.seg(["我爱北京天安门"])
pos = ltp.pos(hidden)
print(segment)  # e.g. [['我', '爱', '北京', '天安门']]
print(pos)
- Named entity recognition:
from ltp import LTP
ltp = LTP()
segment, hidden = ltp.seg(["华北电力大学位于北京市昌平区"])
ner = ltp.ner(hidden)
print(ner)
- Dependency parsing:
from ltp import LTP
ltp = LTP()
segment, hidden = ltp.seg(["他送了一本书给我"])
dep = ltp.dep(hidden)
print(dep)
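- Semantic role labeling, mentioned in the overview, follows the same pattern (a sketch against the same pre-4.2 seg()/srl() interface; srl() also appears in the Getting Started snippet below):
from ltp import LTP
ltp = LTP()
segment, hidden = ltp.seg(["他送了一本书给我"])
srl = ltp.srl(hidden)  # predicate-argument structures, one list per sentence
print(srl)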
Getting Started
To get started with LTP, follow these steps:
- Install LTP using pip:
pip install ltp
- Import and initialize LTP in your Python script:
from ltp import LTP
ltp_model = LTP()
- Use LTP for various NLP tasks:
text = ["我爱北京天安门"]
segment, hidden = ltp_model.seg(text)
pos = ltp_model.pos(hidden)
ner = ltp_model.ner(hidden)
dep = ltp_model.dep(hidden)
srl = ltp_model.srl(hidden)
For more detailed usage and advanced features, refer to the official documentation on the GitHub repository.
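Note that the seg()-based interface shown above belongs to LTP 4.0/4.1; since LTP 4.2 the library exposes a single pipeline() call instead (see the README below). A minimal sketch of the newer form:
from ltp import LTP
# downloads the Small model from the Hugging Face Hub on first use
ltp = LTP("LTP/small")
output = ltp.pipeline(["我爱北京天安门"], tasks=["cws", "pos", "ner"])
print(output.cws, output.pos, output.ner)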
Competitor Comparisons
Jieba Chinese word segmentation
Pros of jieba
- Lightweight and easy to use, with simple installation and integration
- Fast processing speed, especially for large-scale text segmentation tasks
- Supports customization of dictionaries and user-defined words (see the sketch below)
Cons of jieba
- Limited functionality compared to ltp, focusing primarily on word segmentation
- Less accurate for complex linguistic tasks like named entity recognition or dependency parsing
- Fewer options for fine-tuning and model customization
Code Comparison
jieba:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
LTP:
from ltp import LTP
ltp = LTP()
segment, _ = ltp.seg(["我来到北京清华大学"])
print(segment)
Both libraries provide Chinese word segmentation, but ltp offers a more comprehensive set of NLP tools. jieba is simpler to use and faster for basic segmentation tasks, while ltp provides higher accuracy and additional linguistic analysis capabilities. The choice between them depends on the specific requirements of your project, balancing simplicity and speed against accuracy and advanced features.
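Since dictionary customization is one of jieba's main selling points, here is a brief sketch of that feature (standard jieba calls; the example word and frequency are illustrative):
import jieba
# register a single out-of-vocabulary word at runtime
jieba.add_word("清华大学", freq=10000)
# or load a whole user dictionary file ("word freq tag" per line)
# jieba.load_userdict("user_dict.txt")
print("/ ".join(jieba.cut("我来到北京清华大学")))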
Baidu NLP: word segmentation, POS tagging, named entity recognition, word importance
Pros of LAC
- Higher performance and speed for Chinese word segmentation and part-of-speech tagging
- Simpler API and easier integration into existing projects
- Better support for specialized domains like medicine and finance (see the customization sketch below)
Cons of LAC
- Limited functionality compared to LTP (focuses mainly on word segmentation and POS tagging)
- Less comprehensive documentation and community support
- Fewer language options (primarily focused on Chinese)
Code Comparison
LAC:
from LAC import LAC
lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)
LTP:
from ltp import LTP
ltp = LTP()
text = "我爱北京天安门"
seg, hidden = ltp.seg([text])
pos = ltp.pos(hidden)
print(seg, pos)
Both repositories provide Chinese natural language processing tools, but they differ in scope and implementation. LAC offers a more streamlined approach for specific tasks, while LTP provides a broader range of NLP functionalities. The choice between them depends on the specific requirements of your project and the depth of NLP analysis needed.
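LAC's domain adaptation mentioned above is driven by an intervention (customization) dictionary; a brief sketch of how that typically looks (the file name is illustrative, and load_customization is LAC's documented hook for it):
from LAC import LAC
# mode='seg' performs segmentation only; mode='lac' adds POS/NER tags
lac = LAC(mode='lac')
# each line of the file is an entry such as "春天/SEASON" that forces
# the tool to keep the span together and tag it as specified
lac.load_customization('custom_dict.txt')
# run() also accepts a list of texts for batch processing
print(lac.run(["我爱北京天安门", "春天的花开秋天的风"]))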
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Pros of HanLP
- More comprehensive feature set, including advanced NLP tasks like text classification and sentiment analysis (one example is sketched below)
- Better documentation and examples, making it easier for new users to get started
- More active development and frequent updates
Cons of HanLP
- Larger resource footprint, potentially slower for basic tasks
- More complex setup and configuration process
- May be overkill for simple NLP tasks
Code Comparison
HanLP:
from pyhanlp import *
text = "我爱北京天安门"
print(HanLP.segment(text))
LTP:
from ltp import LTP
ltp = LTP()
text = "我爱北京天安门"
seg, _ = ltp.seg([text])
print(seg)
Both libraries offer similar basic functionality for Chinese text segmentation, but HanLP provides a more straightforward API for this task. However, LTP's approach allows for more flexibility in processing multiple sentences at once.
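One example of HanLP's broader coverage is dependency parsing, exposed through the same facade object (a minimal sketch using pyhanlp's parseDependency call):
from pyhanlp import HanLP
# prints a CoNLL-style dependency parse of the sentence
print(HanLP.parseDependency("我爱北京天安门"))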
Python library for processing Chinese text
Pros of SnowNLP
- Lightweight and easy to use for basic Chinese NLP tasks
- Includes sentiment analysis functionality out of the box
- Simpler installation process with fewer dependencies
Cons of SnowNLP
- Less comprehensive feature set compared to LTP
- Lower accuracy for complex NLP tasks
- Less active development and community support
Code Comparison
SnowNLP example:
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.words) # [u'这个', u'东西', u'真心', u'很', u'赞']
print(s.tags) # [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]
print(s.sentiments) # 0.9769663402895832 # Positive sentiment
LTP example:
from ltp import LTP
ltp = LTP()
seg, hidden = ltp.seg(["这个东西真心很赞"])
pos = ltp.pos(hidden)
ner = ltp.ner(hidden)
dep = ltp.dep(hidden)
srl = ltp.srl(hidden)
print(seg)
print(pos)
print(ner)
print(dep)
print(srl)
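Beyond segmentation, tagging, and sentiment, SnowNLP bundles several small utilities out of the box; a brief sketch (outputs are illustrative):
from snownlp import SnowNLP
s = SnowNLP(u'这个东西真心很赞')
print(s.pinyin)       # pinyin transcription of the text
print(s.han)          # traditional-to-simplified conversion
print(s.keywords(3))  # top-3 keywords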
Keras implementation of transformers for humans
Pros of bert4keras
- Focused specifically on BERT and related models, offering more specialized functionality
- Simpler API and easier to use for BERT-based tasks
- More actively maintained with frequent updates
Cons of bert4keras
- Limited to BERT and related models, less versatile for other NLP tasks
- Smaller community and fewer resources compared to LTP
- May require more manual configuration for complex tasks
Code Comparison
bert4keras:
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model

# dict_path, config_path, and checkpoint_path point at a downloaded BERT checkpoint
tokenizer = Tokenizer(dict_path)
model = build_transformer_model(config_path, checkpoint_path)
LTP:
from ltp import LTP
ltp = LTP()
seg, hidden = ltp.seg(["我爱北京天安门"])
pos = ltp.pos(hidden)
Summary
bert4keras is more specialized for BERT-related tasks with a simpler API, while LTP offers a broader range of NLP functionalities. bert4keras may be preferable for BERT-specific projects, while LTP is more suitable for general Chinese NLP tasks. The choice depends on the specific requirements of your project and the level of customization needed.
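For completeness, feeding text through the bert4keras model built above typically looks like this (a sketch; tokenizer and model are the objects constructed in the earlier snippet, and the output shape assumes a standard BERT checkpoint):
import numpy as np
# encode() returns token ids and segment ids for one sentence
token_ids, segment_ids = tokenizer.encode("我爱北京天安门")
# the bare transformer returns one hidden vector per token
hidden = model.predict([np.array([token_ids]), np.array([segment_ids])])
print(hidden.shape)  # (1, sequence_length, hidden_size)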
README
LTP ships packages for both Python and Rust.
LTP 4
LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic parsing, and other tasks on Chinese text.
Citation
If you use LTP in your work, please cite this paper:
@inproceedings{che-etal-2021-n,
title = "N-{LTP}: An Open-source Neural Language Technology Platform for {C}hinese",
author = "Che, Wanxiang and
Feng, Yunlong and
Qin, Libo and
Liu, Ting",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-demo.6",
doi = "10.18653/v1/2021.emnlp-demo.6",
pages = "42--49",
abstract = "We introduce N-LTP, an open-source neural language technology platform supporting six fundamental Chinese NLP tasks: lexical analysis (Chinese word segmentation, part-of-speech tagging, and named entity recognition), syntactic parsing (dependency parsing), and semantic parsing (semantic dependency parsing and semantic role labeling). Unlike the existing state-of-the-art toolkits, such as Stanza, that adopt an independent model for each task, N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks. In addition, a knowledge distillation method (Clark et al., 2019) where the single-task model teaches the multi-task model is further introduced to encourage the multi-task model to surpass its single-task teacher. Finally, we provide a collection of easy-to-use APIs and a visualization tool to make users to use and view the processing results more easily and directly. To the best of our knowledge, this is the first toolkit to support six Chinese NLP fundamental tasks. Source code, documentation, and pre-trained models are available at https://github.com/HIT-SCIR/ltp.",
}
Reference book: "Natural Language Processing: A Pre-trained Model Approach", co-authored by several scholars from the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR) (authors: Wanxiang Che, Jiang Guo, Yiming Cui; chief reviewer: Ting Liu), has now been formally published. The book focuses on the new generation of NLP techniques based on pre-trained models, covering fundamentals, pre-trained word vectors, and pre-trained models, and can serve as a learning reference for LTP users.
Release Notes
- 4.2.0
- [Structural change] LTP was split into two packages, making maintenance and training easier and the structure clearer
- [Legacy model] To meet users' demand for inference speed, the perceptron-based algorithms were rewritten in Rust; accuracy is on par with LTP 3, while speed is 3.55x that of LTP v3 (and up to 17.17x with multithreading enabled), though currently only the word segmentation, POS tagging, and named entity recognition tasks are supported
- [Deep learning model] A PyTorch-based implementation supporting all six tasks (word segmentation / POS tagging / named entities / semantic roles / dependency parsing / semantic dependencies)
- [Other improvements] Improved the model training workflow
- [Common] Training scripts and examples are provided, so users can more easily train personalized models on their own private data
- [Deep learning model] Training is configured via hydra, making it convenient to adjust training parameters and to extend LTP (e.g. with Modules from other packages)
- [Other changes] The decoding algorithms for word segmentation, dependency parsing (Eisner), and semantic dependency parsing (Eisner) are now implemented in Rust and run faster
- [New feature] Models are hosted on the Hugging Face Hub with automatic (and faster) downloads, and users can upload their own trained models for LTP inference
- [Breaking change] Inference now goes through the Pipeline API, which enables deeper performance optimizations later (e.g. SDP and SDPG overlap heavily, so reuse speeds up inference); see the Quick Start section on GitHub and the before/after sketch below
- 4.1.0
- Added support for custom segmentation (user-defined words)
- Fixed several bugs
- 4.0.0
- Built on PyTorch with a native Python interface
- Models with different speed/accuracy trade-offs can be chosen freely
- Six tasks: word segmentation, POS tagging, named entity recognition, dependency parsing, semantic role labeling, semantic dependency parsing
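The breaking change listed above replaced the per-task calls of LTP 4.0/4.1 with a single pipeline call; a minimal before/after sketch (using the same example sentence as the Quick Start below):
from ltp import LTP
ltp = LTP()
# LTP 4.0/4.1 style (removed in 4.2):
# seg, hidden = ltp.seg(["他叫汤姆去拿外衣。"])
# pos = ltp.pos(hidden)
# LTP 4.2 Pipeline API:
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos"])
print(output.cws, output.pos)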
Quick Start
Python
# Option 1: install LTP from the Tsinghua (TUNA) mirror
# 1. Install the PyTorch and Transformers dependencies
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch transformers
# 2. Install LTP
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple ltp ltp-core ltp-extension
# Option 2: switch the global index to the mirror, then install LTP
# 1. Set the TUNA mirror globally
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# 2. Install the PyTorch and Transformers dependencies
pip install torch transformers
# 3. Install LTP
pip install ltp ltp-core ltp-extension
Note: if you run into any errors, try reinstalling ltp with the commands above; if the errors persist, please report them in the GitHub issues.
import torch
from ltp import LTP
# Downloads from the Hugging Face Hub by default; a proxy may be needed
ltp = LTP("LTP/small")  # loads the Small model by default
# A local path also works: ltp = LTP("/path/to/your/model")
# /path/to/your/model must contain config.json and the other model files
# Move the model to the GPU
if torch.cuda.is_available():
    # ltp.cuda()
    ltp.to("cuda")
# Custom vocabulary
ltp.add_word("汤姆去", freq=2)
ltp.add_words(["外套", "外衣"], freq=2)
# Tasks: cws (word segmentation), pos (POS tagging), ner (named entity recognition),
# srl (semantic role labeling), dep (dependency parsing), sdp (semantic dependency tree),
# sdpg (semantic dependency graph)
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner", "srl", "dep", "sdp", "sdpg"])
# Results come back in a dict-like structure
print(output.cws)  # print(output[0]) / print(output['cws']) also work (index access)
print(output.pos)
print(output.sdp)
# Segmentation, POS tagging, and NER with the perceptron algorithm: faster, slightly less accurate
ltp = LTP("LTP/legacy")
# cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "ner"]).to_tuple()  # error: NER needs the POS tagging results
cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner"]).to_tuple()  # to_tuple() converts the result to a tuple
# Results as a tuple
print(cws, pos, ner)
Rust
use std::fs::File;
use itertools::multizip;
use ltp::{CWSModel, POSModel, NERModel, ModelSerde, Format, Codec};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let file = File::open("data/legacy-models/cws_model.bin")?;
let cws: CWSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let file = File::open("data/legacy-models/pos_model.bin")?;
let pos: POSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let file = File::open("data/legacy-models/ner_model.bin")?;
let ner: NERModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let words = cws.predict("他叫汤姆去拿外衣。")?;
let pos = pos.predict(&words)?;
let ner = ner.predict((&words, &pos))?;
for (w, p, n) in multizip((words, pos, ner)) {
println!("{}/{}/{}", w, p, n);
}
Ok(())
}
Model performance and download links
Deep learning models (HF Hub / archive) | CWS | POS | NER | SRL | DEP | SDP | Speed (sents/s)
---|---|---|---|---|---|---|---
Base | 98.7 | 98.5 | 95.4 | 80.6 | 89.5 | 75.2 | 39.12
Base1 | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.--
Base2 | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.--
Small | 98.4 | 98.2 | 94.3 | 78.4 | 88.3 | 74.7 | 43.13
Tiny | 96.8 | 97.1 | 91.6 | 70.9 | 83.8 | 70.1 | 53.22
Perceptron models (HF Hub / archive) | CWS | POS | NER | Speed (sents/s) | Notes
---|---|---|---|---|---
Legacy | 97.93 | 98.41 | 94.28 | 21581.48 | performance details
Note: the perceptron speeds were measured with 16 threads enabled.
How to download the models
# Download over HTTP
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/LTP/base
# Download over SSH
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:LTP/base
# Download the archive
wget http://39.96.43.154/ltp/v4/base.tgz
tar -zxvf base.tgz -C base
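If you prefer to stay in Python, the same model repositories can be fetched with the huggingface_hub package (a sketch; assumes huggingface_hub is installed):
from huggingface_hub import snapshot_download
# downloads LTP/base into the local Hugging Face cache and returns its path
local_path = snapshot_download(repo_id="LTP/base")
print(local_path)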
How to use a downloaded model
from ltp import LTP
# Pass the path where the model was downloaded or extracted
# e.g. the folder for the base model is "path/to/base"
# "path/to/base" must contain "config.json"
ltp = LTP("path/to/base")
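From there, inference works exactly as with models fetched automatically from the Hub (a minimal sketch reusing the pipeline call from the Quick Start above):
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos"])
print(output.cws, output.pos)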
Build the wheel packages
make bdist
Bindings for other languages
Perceptron algorithm
Deep learning algorithm
Authors
- Wanxiang Che <<car@ir.hit.edu.cn>>
- Yunlong Feng <<ylfeng@ir.hit.edu.cn>>
License
- The Language Technology Platform is open source and free of charge for universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers; if such institutions or individuals use the platform for commercial purposes (e.g. joint projects with companies), a fee is required.
- Enterprises and institutions other than those listed above must pay a fee to use the platform.
- For any questions about fees, please contact car@ir.hit.edu.cn.
- If you publish papers or obtain research results based on LTP, please state "the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing the paper or filing the results, and email car@ir.hit.edu.cn with the title and venue of the paper or result.