
HIT-SCIR / ltp

Language Technology Platform

Top Related Projects

  • jieba (33,199 stars): "Jieba" Chinese text segmentation
  • LAC (3,863 stars): Baidu NLP toolkit for word segmentation, part-of-speech tagging, named entity recognition, and word importance
  • HanLP (33,691 stars): Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other NLP tasks
  • SnowNLP (6,411 stars): Python library for processing Chinese text
  • bert4keras: Keras implementation of transformers for humans

Quick Overview

LTP (Language Technology Platform) is an open-source Chinese natural language processing toolkit developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It provides a comprehensive set of Chinese language processing tools, including word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling.

Pros

  • Comprehensive suite of Chinese NLP tools in a single package
  • High accuracy and performance for various Chinese language processing tasks
  • Actively maintained and regularly updated with new features and improvements
  • Supports both Python and Rust interfaces (see the README below) for flexibility in integration

Cons

  • Primarily focused on Chinese language, limiting its use for other languages
  • Requires some understanding of Chinese linguistics for optimal usage
  • Documentation is primarily in Chinese, which may be challenging for non-Chinese speakers
  • Resource-intensive for some tasks, potentially requiring significant computational power
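
One mitigation for the resource cost is to load a smaller checkpoint. A minimal sketch, assuming the Huggingface Hub model names shown in the README below (e.g. "LTP/tiny"):

from ltp import LTP

# smaller checkpoints trade some accuracy for lower memory use and higher speed;
# see the model performance table in the README below
ltp = LTP("LTP/tiny")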

Code Examples

  1. Word segmentation and part-of-speech tagging (these examples use the seg-based API of LTP 4.0/4.1; LTP 4.2 replaced it with the Pipeline API shown in the README below):

from ltp import LTP

ltp = LTP()
segment, hidden = ltp.seg(["我爱北京天安门"])
pos = ltp.pos(hidden)
print(segment)
print(pos)

  2. Named entity recognition:

from ltp import LTP

ltp = LTP()
segment, hidden = ltp.seg(["华北电力大学位于北京市昌平区"])
ner = ltp.ner(hidden)
print(ner)

  3. Dependency parsing:

from ltp import LTP

ltp = LTP()
segment, hidden = ltp.seg(["他送了一本书给我"])
dep = ltp.dep(hidden)
print(dep)

Getting Started

To get started with LTP, follow these steps:

  1. Install LTP using pip:

    pip install ltp
    
  2. Import and initialize LTP in your Python script:

    from ltp import LTP
    ltp_model = LTP()
    
  3. Use LTP for various NLP tasks:

    text = ["我爱北京天安门"]
    segment, hidden = ltp_model.seg(text)  # word segmentation
    pos = ltp_model.pos(hidden)            # part-of-speech tagging
    ner = ltp_model.ner(hidden)            # named entity recognition
    dep = ltp_model.dep(hidden)            # dependency parsing
    srl = ltp_model.srl(hidden)            # semantic role labeling
    

For more detailed usage and advanced features, refer to the official documentation on the GitHub repository.

Competitor Comparisons

jieba (33,199 stars): "Jieba" Chinese text segmentation

Pros of jieba

  • Lightweight and easy to use, with simple installation and integration
  • Fast processing speed, especially for large-scale text segmentation tasks
  • Supports customization of dictionaries and user-defined words

Cons of jieba

  • Limited functionality compared to ltp, focusing primarily on word segmentation
  • Less accurate for complex linguistic tasks like named entity recognition or dependency parsing
  • Fewer options for fine-tuning and model customization

Code Comparison

jieba:

import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

ltp:

from ltp import LTP
ltp = LTP()
segment, _ = ltp.seg(["我来到北京清华大学"])
print(segment)

Both libraries provide Chinese word segmentation, but ltp offers a more comprehensive set of NLP tools. jieba is simpler to use and faster for basic segmentation tasks, while ltp provides higher accuracy and additional linguistic analysis capabilities. The choice between them depends on the specific requirements of your project, balancing simplicity and speed against accuracy and advanced features.
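
Since custom dictionaries are a notable jieba strength, here is a minimal sketch of that feature (the dictionary file name is a placeholder):

import jieba

# register domain words at runtime
jieba.add_word("石墨烯")
jieba.add_word("自定义词", freq=100, tag="n")

# a whole user dictionary ("word [freq] [tag]" per line) can also be loaded:
# jieba.load_userdict("userdict.txt")  # hypothetical file
print("/ ".join(jieba.cut("我来到北京清华大学")))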

LAC (3,863 stars): Baidu NLP toolkit for word segmentation, part-of-speech tagging, named entity recognition, and word importance

Pros of LAC

  • Higher performance and speed for Chinese word segmentation and part-of-speech tagging
  • Simpler API and easier integration into existing projects
  • Better support for specialized domains like medicine and finance

Cons of LAC

  • Limited functionality compared to LTP (focuses mainly on word segmentation and POS tagging)
  • Less comprehensive documentation and community support
  • Fewer language options (primarily focused on Chinese)

Code Comparison

LAC:

from LAC import LAC

lac = LAC(mode='lac')
text = "我爱北京天安门"
result = lac.run(text)
print(result)

LTP:

from ltp import LTP

ltp = LTP()
text = "我爱北京天安门"
seg, hidden = ltp.seg([text])
pos = ltp.pos(hidden)
print(seg, pos)

Both repositories provide Chinese natural language processing tools, but they differ in scope and implementation. LAC offers a more streamlined approach for specific tasks, while LTP provides a broader range of NLP functionalities. The choice between them depends on the specific requirements of your project and the depth of NLP analysis needed.
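
For the specialized-domain point above, LAC's README documents a customization dictionary that overrides segmentation and tagging in domain text; a brief sketch, assuming that load_customization API (the file name is a placeholder):

from LAC import LAC

lac = LAC(mode='lac')

# each line of the file pins how a word or phrase should be segmented and tagged:
# lac.load_customization("custom.txt", sep=None)  # hypothetical file
print(lac.run("我爱北京天安门"))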

HanLP (33,691 stars): Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other NLP tasks

Pros of HanLP

  • More comprehensive feature set, including advanced NLP tasks like text classification and sentiment analysis
  • Better documentation and examples, making it easier for new users to get started
  • More active development and frequent updates

Cons of HanLP

  • Larger resource footprint, potentially slower for basic tasks
  • More complex setup and configuration process
  • May be overkill for simple NLP tasks

Code Comparison

HanLP:

from pyhanlp import *

text = "我爱北京天安门"
print(HanLP.segment(text))

LTP:

from ltp import LTP

ltp = LTP()
text = "我爱北京天安门"  # defined here as well, so the snippet runs standalone
seg, _ = ltp.seg([text])
print(seg)

Both libraries offer similar basic functionality for Chinese text segmentation, but HanLP provides a more straightforward API for this task. However, LTP's approach allows for more flexibility in processing multiple sentences at once.
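
To make the batching point concrete, ltp.seg accepts a list of sentences and segments them in one call:

from ltp import LTP

ltp = LTP()

# several sentences are processed in a single batched call
sentences = ["我爱北京天安门", "他叫汤姆去拿外衣。"]
seg, hidden = ltp.seg(sentences)
for words in seg:
    print("/ ".join(words))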

SnowNLP (6,411 stars): Python library for processing Chinese text

Pros of SnowNLP

  • Lightweight and easy to use for basic Chinese NLP tasks
  • Includes sentiment analysis functionality out of the box
  • Simpler installation process with fewer dependencies

Cons of SnowNLP

  • Less comprehensive feature set compared to LTP
  • Lower accuracy for complex NLP tasks
  • Less active development and community support

Code Comparison

SnowNLP example:

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.words)         # [u'这个', u'东西', u'真心', u'很', u'赞']
print(s.tags)          # [(u'这个', u'r'), (u'东西', u'n'), (u'真心', u'd'), (u'很', u'd'), (u'赞', u'Vg')]
print(s.sentiments)    # 0.9769663402895832 # Positive sentiment

LTP example:

from ltp import LTP

ltp = LTP()
seg, hidden = ltp.seg(["这个东西真心很赞"])
pos = ltp.pos(hidden)
ner = ltp.ner(hidden)
dep = ltp.dep(hidden)
srl = ltp.srl(hidden)

print(seg)
print(pos)
print(ner)
print(dep)
print(srl)
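
Beyond segmentation and sentiment, SnowNLP bundles further lightweight utilities such as keyword extraction, summarization, and pinyin conversion; a brief sketch of those documented features:

from snownlp import SnowNLP

s = SnowNLP("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究人与计算机之间的自然语言通信。")
print(s.keywords(3))  # top keywords
print(s.summary(1))   # one-sentence extractive summary
print(s.pinyin)       # pinyin transliteration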

bert4keras: Keras implementation of transformers for humans

Pros of bert4keras

  • Focused specifically on BERT and related models, offering more specialized functionality
  • Simpler API and easier to use for BERT-based tasks
  • More actively maintained with frequent updates

Cons of bert4keras

  • Limited to BERT and related models, less versatile for other NLP tasks
  • Smaller community and fewer resources compared to LTP
  • May require more manual configuration for complex tasks

Code Comparison

bert4keras:

from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model

# dict_path, config_path, and checkpoint_path point to a downloaded BERT checkpoint
tokenizer = Tokenizer(dict_path)
model = build_transformer_model(config_path, checkpoint_path)

LTP:

from ltp import LTP

ltp = LTP()
seg, hidden = ltp.seg(["我爱北京天安门"])
pos = ltp.pos(hidden)

Summary

bert4keras is more specialized for BERT-related tasks with a simpler API, while LTP offers a broader range of NLP functionalities. bert4keras may be preferable for BERT-specific projects, while LTP is more suitable for general Chinese NLP tasks. The choice depends on the specific requirements of your project and the level of customization needed.
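
To extend the bert4keras snippet above into something runnable, here is a feature-extraction sketch modeled on the bert4keras README; the checkpoint paths are placeholders for a downloaded Chinese BERT:

import numpy as np
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model

# hypothetical paths to a pre-trained checkpoint
tokenizer = Tokenizer("vocab.txt", do_lower_case=True)
model = build_transformer_model("bert_config.json", "bert_model.ckpt")

# encode one sentence and extract contextual features
token_ids, segment_ids = tokenizer.encode("我爱北京天安门")
features = model.predict([np.array([token_ids]), np.array([segment_ids])])
print(features.shape)  # (1, sequence_length, hidden_size)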

README

Language | Packages
Python | LTP, LTP-Core, LTP-Extension
Rust | LTP

LTP 4

LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic parsing, and other tasks on Chinese text.

Citation

If you use LTP in your work, you can cite this paper:

@inproceedings{che-etal-2021-n,
    title = "N-{LTP}: An Open-source Neural Language Technology Platform for {C}hinese",
    author = "Che, Wanxiang  and
      Feng, Yunlong  and
      Qin, Libo  and
      Liu, Ting",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.6",
    doi = "10.18653/v1/2021.emnlp-demo.6",
    pages = "42--49",
    abstract = "We introduce N-LTP, an open-source neural language technology platform supporting six fundamental Chinese NLP tasks: lexical analysis (Chinese word segmentation, part-of-speech tagging, and named entity recognition), syntactic parsing (dependency parsing), and semantic parsing (semantic dependency parsing and semantic role labeling). Unlike the existing state-of-the-art toolkits, such as Stanza, that adopt an independent model for each task, N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks. In addition, a knowledge distillation method (Clark et al., 2019) where the single-task model teaches the multi-task model is further introduced to encourage the multi-task model to surpass its single-task teacher. Finally, we provide a collection of easy-to-use APIs and a visualization tool to make users to use and view the processing results more easily and directly. To the best of our knowledge, this is the first toolkit to support six Chinese NLP fundamental tasks. Source code, documentation, and pre-trained models are available at https://github.com/HIT-SCIR/ltp.",
}

Reference book: 《自然语言处理:基于预训练模型的方法》 (Natural Language Processing: A Method Based on Pre-trained Models; authors: Che Wanxiang, Guo Jiang, Cui Yiming; reviewer: Liu Ting), co-authored by researchers at HIT-SCIR, has been officially published. The book focuses on the new generation of NLP techniques based on pre-trained models and covers three major parts: fundamentals, pre-trained word vectors, and pre-trained models. It is a useful reference for LTP users.

Release Notes

  • 4.2.0
    • [Structural change] LTP has been split into two parts, making maintenance and training easier and the structure clearer
      • [Legacy model] To address the widespread demand for inference speed, the perceptron-based algorithms were rewritten in Rust; accuracy is on par with LTP 3, while speed is 3.55x that of LTP v3, rising to a 17.17x speedup with multithreading enabled. Currently only the word segmentation, POS tagging, and named entity recognition tasks are supported
      • [Deep learning model] The PyTorch-based deep learning models support all 6 tasks (word segmentation / POS tagging / named entities / semantic roles / dependency syntax / semantic dependencies)
    • [Other improvements] Improved the model training workflow
      • [Both] Training scripts and examples are provided so that users can more easily train customized models on their own data
      • [Deep learning model] Training is configured with hydra, making it easy to adjust training parameters and extend LTP (e.g., with Modules from other packages)
    • [Other changes] The decoding algorithms for word segmentation, dependency parsing (Eisner), and semantic dependency parsing (Eisner) are implemented in Rust for higher speed
    • [New feature] Models are hosted on the Huggingface Hub with automatic and faster downloads; users can also upload models they have trained themselves for LTP inference
    • [Breaking change] Inference now goes through the Pipeline API, which enables deeper performance optimizations (e.g., SDP and SDPG overlap to a large extent, so reusing shared computation speeds up inference); see the Quick Start section below for usage
  • 4.1.0
    • Added custom word segmentation and other features
    • Fixed several bugs
  • 4.0.0
    • Built on PyTorch with a native Python interface
    • Models with different speed/accuracy trade-offs can be selected as needed
    • 6 tasks: word segmentation, POS tagging, named entity recognition, dependency parsing, semantic role labeling, semantic dependency parsing

Quick Start

Python

# Method 1: install LTP from the Tsinghua (TUNA) PyPI mirror
# 1. Install the PyTorch and Transformers dependencies
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch transformers
# 2. Install LTP
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple ltp ltp-core ltp-extension

# Method 2: switch the global index first, then install LTP
# 1. Set the TUNA mirror globally
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# 2. Install the PyTorch and Transformers dependencies
pip install torch transformers
# 3. Install LTP
pip install ltp ltp-core ltp-extension

Note: if you run into any errors, first try reinstalling ltp with the commands above; if the error persists, please report it in the GitHub issues.

import torch
from ltp import LTP

# models are downloaded from Huggingface by default; a proxy may be required

ltp = LTP("LTP/small")  # loads the Small model by default
                        # a local path also works: ltp = LTP("/path/to/your/model")
                        # /path/to/your/model must contain config.json and the other model files

# move the model to the GPU
if torch.cuda.is_available():
    # ltp.cuda()
    ltp.to("cuda")

# custom vocabulary
ltp.add_word("汤姆去", freq=2)
ltp.add_words(["外套", "外衣"], freq=2)

# tasks: cws = word segmentation, pos = POS tagging, ner = named entity recognition,
# srl = semantic role labeling, dep = dependency parsing,
# sdp = semantic dependency tree, sdpg = semantic dependency graph
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner", "srl", "dep", "sdp", "sdpg"])
# results are returned in a dict-like structure
print(output.cws)  # print(output[0]) / print(output['cws']) # index access also works
print(output.pos)
print(output.sdp)

# the perceptron-based models cover word segmentation, POS tagging, and NER;
# they are faster but slightly less accurate
ltp = LTP("LTP/legacy")
# cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "ner"]).to_tuple() # error: NER requires the POS tagging results
cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner"]).to_tuple()  # to_tuple() converts the result to a tuple
# results as a tuple
print(cws, pos, ner)

Detailed documentation

Rust

use std::fs::File;
use itertools::multizip;
use ltp::{CWSModel, POSModel, NERModel, ModelSerde, Format, Codec};

fn main() -> Result<(), Box<dyn std::error::Error>> {
  let file = File::open("data/legacy-models/cws_model.bin")?;
  let cws: CWSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
  let file = File::open("data/legacy-models/pos_model.bin")?;
  let pos: POSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
  let file = File::open("data/legacy-models/ner_model.bin")?;
  let ner: NERModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;

  let words = cws.predict("他叫汤姆去拿外衣。")?;
  let pos = pos.predict(&words)?;
  let ner = ner.predict((&words, &pos))?;

  for (w, p, n) in multizip((words, pos, ner)) {
    println!("{}/{}/{}", w, p, n);
  }

  Ok(())
}

Model Performance and Download Links

Deep learning models (🤗 HF / 🗜 archive) | CWS | POS | NER | SRL | DEP | SDP | Speed (sents/s)
🤗Base 🗜Base | 98.7 | 98.5 | 95.4 | 80.6 | 89.5 | 75.2 | 39.12
🤗Base1 🗜Base1 | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.--
🤗Base2 🗜Base2 | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.--
🤗Small 🗜Small | 98.4 | 98.2 | 94.3 | 78.4 | 88.3 | 74.7 | 43.13
🤗Tiny 🗜Tiny | 96.8 | 97.1 | 91.6 | 70.9 | 83.8 | 70.1 | 53.22

Perceptron models (🤗 HF / 🗜 archive) | CWS | POS | NER | Speed (sents/s) | Notes
🤗Legacy 🗜Legacy | 97.93 | 98.41 | 94.28 | 21581.48 | performance details

Note: the perceptron speed was measured with 16 threads enabled.

How to Download the Models

# Download via HTTP link
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/LTP/base

# Download via SSH
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:LTP/base

# Download the archive
wget http://39.96.43.154/ltp/v4/base.tgz
mkdir -p base && tar -zxvf base.tgz -C base  # tar -C requires the target directory to exist

How to Use the Downloaded Model

from ltp import LTP

# pass the path to the downloaded or extracted model
# e.g. the folder path for the base model is "path/to/base"
#      "path/to/base" must contain "config.json"
ltp = LTP("path/to/base")
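
Once loaded, a local model drives the same Pipeline API as a Hub model; a minimal sketch continuing from the snippet above:

output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner"])
print(output.cws, output.pos, output.ner)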

Building the Wheel Package

make bdist

Other Language Bindings

Perceptron algorithm

Deep learning algorithm

Authors

Open-Source License

  1. The source code of the Language Technology Platform is freely available to universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers; if these institutions or individuals use the platform for commercial purposes (such as joint projects with companies), a fee is required.
  2. Enterprises and public institutions other than those above must pay to use the platform.
  3. For any payment-related matters, please email car@ir.hit.edu.cn.
  4. If you publish papers or obtain research results based on LTP, please state "the Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing or filing them, and email car@ir.hit.edu.cn with the title and venue of the paper or result.
  4. 如果您在 LTP 基础上发表论文或取得科研成果,请您在发表论文和申报成果时声明“使用了哈工大社会计算与信息检索研究中心研制的语言技术平台(LTP)”. 同时,发信给car@ir.hit.edu.cn,说明发表论文或申报成果的题目、出处等。