pkuseg-python
The pkuseg toolkit for multi-domain Chinese word segmentation
Top Related Projects
Jieba Chinese word segmentation
An Efficient Lexical Analyzer for Chinese
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Baidu NLP: word segmentation, POS tagging, named entity recognition, word importance
Python library for processing Chinese text
Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods
Quick Overview
pkuseg-python is a Chinese word segmentation toolkit developed by the Language Computing and Machine Learning Group at Peking University. It offers state-of-the-art accuracy on Chinese segmentation tasks and is designed to be easy to use and to integrate into natural language processing applications.
Pros
- High accuracy in Chinese word segmentation compared to other popular tools
- Supports multiple domains, including news, web, medicine, and more
- Easy to install and use with a simple Python API
- Allows for custom model training on specific domains
Cons
- Primarily focused on the Chinese language, limiting its use for other languages
- May require more computational resources compared to simpler segmentation tools
- Documentation is primarily in Chinese, which might be challenging for non-Chinese speakers
- Limited community support compared to more widely-used NLP libraries
Code Examples
- Basic word segmentation:
import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京天安门"
result = seg.cut(text)
print(result)
# Output: ['我', '爱', '北京', '天安门']
- Using a specific domain model:
import pkuseg
seg = pkuseg.pkuseg(model_name='medicine')
text = "头孢霉素类抗生素可以治疗肺炎"
result = seg.cut(text)
print(result)
# Output: ['头孢霉素', '类', '抗生素', '可以', '治疗', '肺炎']
- Custom dictionary usage:
import pkuseg
seg = pkuseg.pkuseg(user_dict=['北京大学'])
text = "北京大学是世界一流大学"
result = seg.cut(text)
print(result)
# Output: ['北京大学', '是', '世界', '一流', '大学']
Getting Started
To get started with pkuseg-python, follow these steps:
- Install the library using pip:
pip install pkuseg
- Import the library and create a segmenter:
import pkuseg
seg = pkuseg.pkuseg()
- Segment Chinese text:
text = "我在北京大学学习自然语言处理"
result = seg.cut(text)
print(result)
This will output the segmented words as a list. You can now use pkuseg-python for various Chinese text processing tasks in your projects.
Competitor Comparisons
Jieba Chinese word segmentation
Pros of jieba
- Faster processing speed for large-scale text segmentation
- More extensive documentation and community support
- Broader range of features, including keyword extraction and text summarization
Cons of jieba
- Less accurate for specialized domains or formal texts
- Requires more manual tuning for optimal performance
- Larger memory footprint, especially for large dictionaries
Code Comparison
jieba:
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
pkuseg:
import pkuseg
seg = pkuseg.pkuseg()
text = seg.cut("我来到北京清华大学")
print(text)
Both libraries offer simple APIs for text segmentation, but pkuseg provides more flexibility in model selection and training. jieba offers additional features like keyword extraction, while pkuseg focuses on accurate segmentation for various domains.
jieba is generally faster and more suitable for large-scale processing, while pkuseg excels in accuracy, especially for formal texts or specific domains. The choice between the two depends on the specific requirements of the project, such as processing speed, accuracy, and domain specificity.
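As a quick illustration of that flexibility, pkuseg can load a domain-specific pre-trained model with a single constructor argument; a minimal sketch using the news model named in pkuseg's documentation:
import pkuseg
# Load the pre-trained news-domain model instead of the default mixed-domain one
seg = pkuseg.pkuseg(model_name='news')
print(seg.cut("我来到北京清华大学"))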
An Efficient Lexical Analyzer for Chinese
Pros of THULAC-Python
- Faster processing speed for large-scale text segmentation tasks
- Includes part-of-speech tagging functionality out of the box
- Supports both simplified and traditional Chinese characters
Cons of THULAC-Python
- Less flexible in terms of customization and fine-tuning
- May have lower accuracy on domain-specific or non-standard text
- Requires more memory resources compared to pkuseg-python
Code Comparison
THULAC-Python usage:
import thulac
thu = thulac.thulac()
text = "我爱北京天安门"
result = thu.cut(text)
print(result)
pkuseg-python usage:
import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京天安门"
result = seg.cut(text)
print(result)
Both libraries offer simple APIs for text segmentation, but THULAC-Python provides additional features like part-of-speech tagging by default. pkuseg-python focuses on customizable segmentation and may be more suitable for specific domain applications.
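pkuseg can also return POS tags when asked, narrowing that gap; a minimal sketch using pkuseg's documented postag flag:
import pkuseg
# Enable part-of-speech tagging alongside segmentation
seg = pkuseg.pkuseg(postag=True)
print(seg.cut("我爱北京天安门"))  # yields (word, tag) pairs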
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, natural language processing
Pros of HanLP
- More comprehensive NLP toolkit with broader functionality beyond segmentation
- Supports multiple languages, not just Chinese
- Actively maintained with frequent updates and improvements
Cons of HanLP
- Larger library size, potentially slower for simple segmentation tasks
- May have a steeper learning curve due to more extensive features
- Higher computational resource requirements for full functionality
Code Comparison
HanLP:
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')
print(HanLP.tokenize('我爱自然语言处理'))
pkuseg:
import pkuseg
seg = pkuseg.pkuseg()
text = seg.cut('我爱自然语言处理')
print(text)
Both libraries offer straightforward APIs for tokenization, but HanLP provides a more extensive set of NLP tools beyond simple segmentation. pkuseg is more focused on Chinese word segmentation specifically, while HanLP offers a broader range of language processing capabilities.
HanLP's code example demonstrates its client-server architecture, which allows for more advanced processing but may require additional setup. pkuseg, on the other hand, provides a simpler, more direct approach to Chinese word segmentation that can be quickly implemented in Python projects.
Baidu NLP: word segmentation, POS tagging, named entity recognition, word importance
Pros of LAC
- Supports both word segmentation and part-of-speech tagging
- Offers pre-trained models for various domains (e.g., news, web)
- Provides both Python and C++ interfaces for flexibility
Cons of LAC
- Less customizable for specific domains compared to pkuseg
- May have lower accuracy on certain text types or specialized content
- Requires more dependencies and setup compared to pkuseg
Code Comparison
pkuseg usage:
import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京天安门"
print(seg.cut(text))
LAC usage:
from LAC import LAC
lac = LAC(mode='seg')
text = "我爱北京天安门"
print(lac.run(text))
Both libraries offer simple APIs for word segmentation, but LAC provides additional functionality for part-of-speech tagging and named entity recognition. pkuseg focuses primarily on customizable word segmentation for various domains.
pkuseg is generally easier to set up and use for basic word segmentation tasks, while LAC offers more comprehensive language processing capabilities at the cost of increased complexity.
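For reference, switching LAC from pure segmentation to joint segmentation and tagging is a one-line change; a sketch assuming LAC's documented 'lac' mode, which returns parallel word and tag lists:
from LAC import LAC
# 'lac' mode performs segmentation together with POS tagging
lac = LAC(mode='lac')
words, tags = lac.run("我爱北京天安门")
print(words, tags)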
Python library for processing Chinese text
Pros of snownlp
- More comprehensive NLP toolkit with sentiment analysis, text classification, and more
- Simpler installation process and fewer dependencies
- Faster processing speed for basic NLP tasks
Cons of snownlp
- Less accurate for complex Chinese word segmentation tasks
- Not actively maintained (last update in 2020)
- Limited documentation and community support
Code Comparison
snownlp:
from snownlp import SnowNLP
s = SnowNLP(u'这是一个测试句子')
print(s.words) # 分词
print(s.sentiments) # 情感分析
pkuseg-python:
import pkuseg
seg = pkuseg.pkuseg()
text = "这是一个测试句子"
print(seg.cut(text)) # 分词
Both libraries offer Chinese word segmentation, but pkuseg-python focuses on providing more accurate segmentation, especially for domain-specific texts. snownlp, on the other hand, offers a broader range of NLP functionalities beyond just segmentation. pkuseg-python is more actively maintained and provides better documentation, while snownlp offers a simpler API for quick NLP tasks but may lack in accuracy for complex segmentation scenarios.
Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods
Pros of HarvestText
- More comprehensive NLP toolkit with additional features like named entity recognition and sentiment analysis
- Supports both Chinese and English text processing
- Includes built-in dictionaries and regular expression patterns for common tasks
Cons of HarvestText
- Less specialized for Chinese word segmentation compared to pkuseg-python
- May have slower performance for large-scale text processing tasks
- Requires more dependencies and setup compared to the more focused pkuseg-python
Code Comparison
HarvestText:
from harvesttext import HarvestText
ht = HarvestText()
text = "今天是个好日子"
words = ht.seg(text)
print(words)
pkuseg-python:
import pkuseg
seg = pkuseg.pkuseg()
text = "今天是个好日子"
words = seg.cut(text)
print(words)
Both libraries offer simple APIs for Chinese word segmentation, but HarvestText provides a more extensive set of NLP tools beyond just segmentation. pkuseg-python focuses specifically on accurate Chinese word segmentation, while HarvestText aims to be a more comprehensive toolkit for various NLP tasks in both Chinese and English.
README
pkuseg: a multi-domain Chinese word segmentation toolkit (English Version)
pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports domain-specific segmentation, and effectively improves segmentation accuracy.
Contents
- Main Features
- Compilation and Installation
- Performance Comparison of Segmentation Toolkits
- Usage
- Citation
- Authors
- FAQ
Main Features
pkuseg has the following features:
- Multi-domain segmentation. Unlike previous general-purpose Chinese word segmentation tools, this toolkit aims to provide pre-trained models tailored to data from different domains. You can freely choose the model that matches the domain of the text to be segmented. We currently provide pre-trained segmentation models for the news, web, medicine, and tourism domains, as well as a mixed-domain model. If you know the domain of your text, load the corresponding model; if you cannot determine the domain, we recommend the general-purpose model trained on mixed-domain data. Sample segmentations for each domain can be found in example.txt.
- Higher segmentation accuracy. Compared with other segmentation toolkits, pkuseg achieves higher accuracy when the same training and test data are used.
- Support for user-trained models. You can train a model on entirely new annotated data.
- Support for part-of-speech tagging.
Compilation and Installation
- Currently only Python 3 is supported.
- For the best accuracy and speed, we strongly recommend updating to the latest version via pip install.
- Install via PyPI (model files included):
pip3 install pkuseg
Then use it with import pkuseg.
**We recommend updating to the latest version** for a better out-of-the-box experience:
pip3 install -U pkuseg
- If download speeds from the official PyPI source are poor, we recommend using a mirror, e.g.:
First installation:
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
Update:
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
- If you prefer not to install via pip, download the code from GitHub and run the following command to install:
python setup.py build_ext -i
The code on GitHub does not include the pre-trained models, so you will need to download or train a model yourself; pre-trained models are available under release. When using the toolkit, set "model_name" to the model file's path.
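For a GitHub install, loading a downloaded pre-trained model then looks like this (a minimal sketch; the directory name ./news_model is hypothetical and should point at a model downloaded from the release page):
import pkuseg
# model_name also accepts a local path to a pre-trained model
seg = pkuseg.pkuseg(model_name='./news_model')
print(seg.cut('我爱北京天安门'))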
Note: **installation methods 1 and 2 currently only support Python 3 on Linux (Ubuntu), macOS, and 64-bit Windows**. On other systems, please use installation method 3 to compile and install locally.
Performance Comparison of Segmentation Toolkits
We chose jieba, THULAC, and other representative Chinese segmentation toolkits for a performance comparison with pkuseg; detailed settings can be found in the experimental environment description.
Domain-specific training and test results
The following are comparison results on different datasets:
MSRA | Precision | Recall | F-score
---|---|---|---
jieba | 87.01 | 89.88 | 88.42
THULAC | 95.60 | 95.91 | 95.71
pkuseg | 96.94 | 96.81 | 96.88

WEIBO | Precision | Recall | F-score
---|---|---|---
jieba | 87.79 | 87.54 | 87.66
THULAC | 93.40 | 92.40 | 92.87
pkuseg | 93.78 | 94.65 | 94.21
Performance of default models across domains
Since many users try out a segmentation tool with its bundled default model, we also compared the default models of each toolkit across different domains to measure "out-of-the-box" performance. Note that this comparison is only meant to illustrate default behavior and is not necessarily fair.

Default | MSRA | CTB8 | PKU | WEIBO | All Average
---|---|---|---|---|---
jieba | 81.45 | 79.58 | 81.83 | 83.56 | 81.61
THULAC | 85.55 | 87.84 | 92.29 | 86.65 | 88.08
pkuseg | 87.29 | 91.77 | 92.68 | 93.43 | 91.29

Here, All Average is the average F-score over all test sets.
More detailed comparisons can be found in the comparison with existing toolkits.
Usage
Code Examples
The following code examples assume an interactive Python environment.
Example 1: segmentation with the default configuration (if you cannot determine the domain of your text, we recommend the default model):
import pkuseg
seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # perform segmentation
print(text)
Example 2: domain-specific segmentation (if you know the domain of your text, we recommend the corresponding domain model):
import pkuseg
seg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # perform segmentation
print(text)
Example 3: part-of-speech tagging along with segmentation; see tags.txt for the detailed meaning of each POS tag:
import pkuseg
seg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging
text = seg.cut('我爱北京天安门')    # segment and tag
print(text)
Example 4: segmenting a file:
import pkuseg
# Segment the file input.txt and write the output to output.txt,
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)
Other usage examples can be found in the detailed code examples.
Parameter Description
Model configuration
pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
model_name The model path.
"default": the default; use our pre-trained mixed-domain model (pip installs only).
"news": use the news-domain model.
"web": use the web-domain model.
"medicine": use the medicine-domain model.
"tourism": use the tourism-domain model.
model_path: load the model from a user-specified path.
user_dict The user dictionary setting.
"default": the default; use the dictionary we provide.
None: do not use any dictionary.
dict_path: use a user-defined dictionary in addition to the default one; supply the path to your dictionary file. The format is one word per line (if POS tagging is enabled and the word's tag is known, write the word and its tag on that line, separated by a tab character).
postag Whether to perform POS tagging.
False: the default; segment only, without POS tagging.
True: perform POS tagging along with segmentation.
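To make the dictionary format concrete, here is a minimal sketch (the file name my_dict.txt is hypothetical):
import pkuseg
# Write a tiny user dictionary: one word per line, optionally
# followed by a tab character and the word's POS tag
with open('my_dict.txt', 'w', encoding='utf-8') as f:
    f.write('北京大学\n')
    f.write('自然语言处理\tn\n')
seg = pkuseg.pkuseg(user_dict='my_dict.txt')  # used in addition to the default dictionary
print(seg.cut('北京大学是世界一流大学'))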
Segmenting a File
pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
readFile The input file path.
outputFile The output file path.
model_name The model path; same as in pkuseg.pkuseg.
user_dict The user dictionary setting; same as in pkuseg.pkuseg.
postag Whether to enable POS tagging; same as in pkuseg.pkuseg.
nthread The number of processes to use during segmentation.
Model Training
pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
trainFile The training file path.
testFile The test file path.
savedir The path where the trained model is saved.
train_iter The number of training iterations.
init_model The initialization model; the default None means default initialization. You can supply the path of a model to initialize from, e.g. init_model='./models/'.
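A minimal training call, following the signature above (the file and directory names are hypothetical, and the data must be in pkuseg's annotated format):
import pkuseg
# Train for 20 iterations on train.txt, evaluate on test.txt,
# and save the resulting model under ./models
pkuseg.train('train.txt', 'test.txt', './models', train_iter=20)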
Multiprocess Segmentation
When the code examples above are placed in a script file and run, if multiprocessing is involved, be sure to guard the top-level statements with if __name__ == '__main__'; see the multiprocess segmentation documentation for details.
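For instance, a script that segments a file with multiple processes should be structured as follows (a minimal sketch reusing Example 4 above):
import pkuseg

if __name__ == '__main__':
    # Without this guard, spawning worker processes can fail with
    # RuntimeError / BrokenPipeError on some platforms
    pkuseg.test('input.txt', 'output.txt', nthread=20)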
Pre-trained Models
Users who installed via pip only need to set the model_name field to the desired domain to use domain-specific segmentation; the corresponding model is downloaded automatically.
Users who downloaded from GitHub need to download the corresponding pre-trained model themselves and set the model_name field to the model's path. Pre-trained models can be downloaded from the release section. The pre-trained models are described below:
- news: a model trained on MSRA (news corpus).
- web: a model trained on Weibo (web text corpus).
- medicine: a model trained on medical-domain text.
- tourism: a model trained on tourism-domain text.
- mixed: a general-purpose model trained on a mixed dataset. This is the model bundled with the pip package.
Using a domain-adaptation approach on unannotated Wikipedia data, we also automatically built several additional domain-specific pre-trained models and optimized the general-purpose model. These models can currently only be downloaded from release:
- art: a model trained on art and culture text.
- entertainment: a model trained on entertainment and sports text.
- science: a model trained on science text.
- default_v2: an optimized general-purpose model obtained with the domain-adaptation method; it is larger than the default model but generalizes better.
We welcome users to share their own trained domain-specific models.
Version History
See the version history.
License
- This code is released under the MIT License.
- Any valuable comments and suggestions about this toolkit are welcome; please email jingjingxu@pku.edu.cn.
Citation
This toolkit is based mainly on the following research paper; if you use the toolkit, please cite:
- Ruixuan Luo, Jingjing Xu, Yi Zhang, Zhiyuan Zhang, Xuancheng Ren, Xu Sun. PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation. arXiv. 2019.
@article{pkuseg,
author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},
journal = {CoRR},
title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
url = {https://arxiv.org/abs/1906.11455},
volume = {abs/1906.11455},
year = 2019
}
Other Related Papers
- Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
- Jingjing Xu and Xu Sun. Dependency-Based Gated Recursive Neural Network for Chinese Word Segmentation. ACL. 2016.
- Jingjing Xu and Xu Sun. Transfer Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network. NLPCC. 2017.
FAQ
- Why release pkuseg?
- What techniques does pkuseg use?
- Multiprocess segmentation and training fail with RuntimeError and BrokenPipeError.
- How was pkuseg compared with other toolkits on domain-specific data?
- How does it perform when compared on black-box test sets?
- What if I don't know the domain of the text to be segmented?
- How should segmentation results on particular examples be interpreted?
- What about running speed?
- What about multiprocessing speed?
Acknowledgments
Thanks to Professor Shiwen Yu (Institute of Computational Linguistics, Peking University) and Dr. Likun Qiu for providing the training datasets!
Authors
Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇), Xu Sun (孙栩)
Language Computing and Machine Learning Group, Peking University