polyglot

Multilingual text (NLP) processing toolkit


Top Related Projects

CNTK

17,597

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

tensorflow

191,921

An Open Source Machine Learning Framework for Everyone

pytorch

93,668

Tensors and Dynamic neural networks in Python with strong GPU acceleration

scikit-learn

63,533

scikit-learn: machine learning in Python

transformers

150,567

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Quick Overview

Polyglot is a natural language processing (NLP) library that provides multilingual support for tasks such as language detection, sentiment analysis, and named entity recognition. It is designed to be easy to use and integrate into various applications.

Pros

Multilingual Support: Polyglot supports a wide range of languages, making it suitable for applications that need to handle content in multiple languages.
Ease of Use: The library provides a simple and intuitive API, making it easy to integrate into existing projects.
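Lightweight Models: Polyglot's pretrained models are downloaded per language and per task, so an application only loads what it needs. The README does not mention GPU acceleration or parallel processing.
Open Source: The project is free software released under the GPLv3 license, and its source is available on GitHub.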

Cons

Uneven Language Coverage: Coverage depends on the task. The README lists 196 languages for language detection but only 40 for named entity recognition and 16 for part-of-speech tagging.
Dependency on External Resources: Polyglot relies on pretrained word embeddings and task models that are downloaded separately from the library; they are not available for every language and require additional setup.
Potential Accuracy Issues: As with any NLP library, the accuracy of Polyglot's predictions may vary depending on the task and the quality of the underlying data.
Limited Documentation: The project's documentation could be more comprehensive, which may make it challenging for new users to get started.
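
The per-task language counts in the project README make the coverage gap concrete. The sketch below copies those counts into a dictionary and lists the tasks that fall under a chosen threshold; the helper name `tasks_below` and the threshold of 100 are illustrative assumptions, not part of Polyglot.

```python
# Language counts per task, copied from the Features list in the polyglot README.
COVERAGE = {
    "Tokenization": 165,
    "Language detection": 196,
    "Named Entity Recognition": 40,
    "Part of Speech Tagging": 16,
    "Sentiment Analysis": 136,
    "Word Embeddings": 137,
    "Morphological analysis": 135,
    "Transliteration": 69,
}


def tasks_below(threshold, coverage=COVERAGE):
    """Return (task, count) pairs covering fewer than `threshold` languages, narrowest first."""
    narrow = [(task, count) for task, count in coverage.items() if count < threshold]
    return sorted(narrow, key=lambda pair: pair[1])


# Hypothetical threshold chosen only for illustration.
for task, count in tasks_below(100):
    print(f"{task:<26}{count:>4}")
```

Checking a table like this before committing to the library shows quickly whether the tasks an application needs are available for its target languages.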

Code Examples

Here are a few examples of how to use Polyglot in your code:

Language Detection:

from polyglot.text import Text

text = "Hola, cómo estás?"
lang = Text(text).language.code
print(lang)  # Output: es

Sentiment Analysis:

from polyglot.text import Text

text = "I love this product!"
polarity = Text(text).polarity
print(polarity)  # a float between -1.0 and 1.0; requires the English sentiment model
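
The README reports polarity per word as -1, 0, or 1. One way to turn those word scores into a single score for a piece of text is to average the non-neutral words. The sketch below assumes that averaging rule for illustration; it is written in plain Python, does not call Polyglot, and `mean_polarity` is a hypothetical helper name.

```python
def mean_polarity(word_polarities):
    """Average the non-neutral word polarities; return 0.0 if every word is neutral."""
    scored = [p for p in word_polarities if p != 0]
    if not scored:
        return 0.0
    return sum(scored) / len(scored)


# Word-level values shown in the README for "Beautiful is better than ugly."
readme_words = {"Beautiful": 0, "is": 0, "better": 1, "than": 0, "ugly": -1, ".": 0}

# "better" (+1) and "ugly" (-1) cancel out, so the average is 0.0.
print(mean_polarity(readme_words.values()))
```

Ignoring neutral words keeps function words and punctuation from diluting the score, at the cost of treating a long neutral text with one positive word as fully positive.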

Named Entity Recognition:

from polyglot.text import Text

text = "Barack Obama was the 44th president of the United States."
entities = Text(text).entities
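for entity in entities:
    # Each entity is a chunk of words with a tag such as I-PER, I-LOC, or I-ORG
    print(entity.tag, " ".join(entity))
# Requires the English embeddings and NER models to be downloaded first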
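
The README shows entities tagged as `I-PER` and `I-LOC`, each wrapping a list of words. The sketch below illustrates how token-level tags of that style can be merged into multi-word entities. It is plain Python and makes no claim about Polyglot's internals; the tagged tokens are hypothetical input written by hand, and `group_entities` is an illustrative name.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share the same I-* tag into (tag, words) chunks."""
    chunks = []
    prev_tag = "O"
    for token, tag in tagged_tokens:
        if tag == "O":
            # Outside any entity: close the current chunk, if one is open.
            prev_tag = tag
            continue
        if tag == prev_tag:
            chunks[-1][1].append(token)
        else:
            chunks.append((tag, [token]))
        prev_tag = tag
    return chunks


# Hypothetical token-level tags for illustration; not actual polyglot output.
tagged = [
    ("Barack", "I-PER"), ("Obama", "I-PER"), ("was", "O"), ("the", "O"),
    ("44th", "O"), ("president", "O"), ("of", "O"), ("the", "O"),
    ("United", "I-LOC"), ("States", "I-LOC"), (".", "O"),
]

for tag, words in group_entities(tagged):
    print(tag, " ".join(words))
```

A tagging scheme with only `I-` prefixes cannot separate two adjacent entities of the same type, which is why this grouping treats any unbroken run of one tag as a single entity.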

Word Embeddings:

from polyglot.text import Word

word = Word("dog", language="en")
print(word.vector[:10])  # first 10 dimensions; requires the English embeddings model
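
Word embeddings map each word to a vector, and related words can be found by comparing those vectors. The sketch below ranks neighbors by cosine similarity over a tiny hand-made vocabulary. Cosine similarity is a common choice used here as an assumption; Polyglot's own neighbor search may use a different distance, and the toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical 3-dimensional vectors; real embeddings have far more dimensions.
TOY_VECTORS = {
    "dog": [1.0, 0.9, 0.0],
    "puppy": [0.9, 1.0, 0.1],
    "cat": [0.8, 0.5, 0.3],
    "car": [0.0, 0.1, 1.0],
}

print(nearest("dog", TOY_VECTORS))
```

With these toy values, "puppy" and "cat" rank ahead of "car" because their vectors point in nearly the same direction as the vector for "dog".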

Getting Started

To get started with Polyglot, you can follow these steps:

Install the library using pip:

pip install polyglot

Import the necessary modules from the library:

from polyglot.text import Text, Word

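Use the library's functions to perform various NLP tasks, such as language detection, sentiment analysis, and named entity recognition. Sentiment analysis and named entity recognition rely on pretrained models for the text's language, which must be downloaded separately before these calls succeed: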

# Language detection
text = "Hola, cómo estás?"
lang = Text(text).language.code
print(lang)  # Output: es

# Sentiment analysis
text = "I love this product!"
polarity = Text(text).polarity
print(polarity)  # a float between -1.0 and 1.0

# Named entity recognition
text = "Barack Obama was the 44th president of the United States."
entities = Text(text).entities
for entity in entities:
    # Each entity is a chunk of words with a tag such as I-PER, I-LOC, or I-ORG
    print(entity.tag, " ".join(entity))

Explore the library's documentation and available features to learn more about how to use Polyglot in your projects.
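
Because Polyglot depends on native libraries and separately downloaded models, an installation can be incomplete. A defensive wrapper is one way to keep an application running when that happens. The sketch below assumes only the `Text(...).language.code` call shown in the README; the wrapper name `detect_language_code` and the choice to return `None` on any failure are illustrative decisions.

```python
def detect_language_code(text):
    """Return polyglot's language code for `text`, or None if detection is unavailable."""
    try:
        # Imported lazily so that a missing or broken install does not stop the program.
        from polyglot.text import Text
        return Text(text).language.code
    except Exception:
        # Broad on purpose for this sketch: missing packages, missing native
        # libraries, and detection errors are all treated as "unavailable".
        return None


code = detect_language_code("Bonjour, Mesdames.")
print(code if code is not None else "language detection unavailable")
```

In production code it is usually better to catch narrower exception types and log the cause, so that a configuration problem is not silently hidden.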

Competitor Comparisons

CNTK

17,597

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

Pros of CNTK

CNTK is a highly scalable and efficient deep learning framework, capable of running on a wide range of hardware, including GPUs and CPUs.
CNTK provides a comprehensive set of tools and APIs for building and training complex neural network models, making it a powerful choice for advanced deep learning projects.
The framework has been extensively used and tested by Microsoft, ensuring its reliability and performance.

Cons of CNTK

CNTK has a steeper learning curve compared to Polyglot, as it requires a deeper understanding of deep learning concepts and the framework's specific syntax and architecture.
CNTK's documentation is comprehensive but aimed at deep learning practitioners, so it can be less approachable than Polyglot's short, tutorial-style README.
CNTK is a general-purpose deep learning framework and ships no ready-made NLP pipeline, whereas Polyglot provides task-level tools such as language detection, named entity recognition, and transliteration out of the box.

Code Comparison

CNTK:

import cntk as C

# Define the input and target variables
x = C.input_variable(shape=(1,), name='x')
y = C.input_variable(shape=(1,), name='y')

# Define a simple linear regression model: one dense unit with a squared-error loss
model = C.layers.Dense(1)(x)
loss = C.squared_error(model, y)

Polyglot:

from polyglot.text import Text

# Create a Text object from a string
text = Text("The quick brown fox jumps over the lazy dog.")

# Extract named entities from the text
entities = text.entities

# Print the extracted entities
print(entities)

tensorflow

191,921

An Open Source Machine Learning Framework for Everyone

Pros of TensorFlow

Extensive documentation and community support
Wide range of pre-built models and tools for various machine learning tasks
Highly scalable and optimized for large-scale deployments

Cons of TensorFlow

Steeper learning curve compared to Polyglot
More complex to set up and configure for simple use cases
Larger codebase and dependencies

Code Comparison

TensorFlow:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

Polyglot:

from polyglot.text import Text

text = Text("Hello, world!")
print(text.entities)

pytorch

93,668

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

PyTorch is a widely-used, mature, and well-supported deep learning framework, with a large and active community.
PyTorch provides a flexible and intuitive interface for building and training neural networks, with a focus on ease of use and rapid prototyping.
PyTorch has excellent support for GPU acceleration, making it well-suited for training large-scale models.

Cons of PyTorch

PyTorch may have a steeper learning curve compared to some other deep learning frameworks, especially for beginners.
PyTorch's dynamic computational graph can make it more challenging to optimize for deployment, compared to frameworks with static computational graphs.

Code Comparison

PyTorch:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

Polyglot:

from polyglot.text import Text

text = Text("The quick brown fox jumps over the lazy dog.")
print(text.entities)

scikit-learn

63,533

scikit-learn: machine learning in Python

Pros of scikit-learn

Extensive documentation and community support
Wide range of machine learning algorithms and models
Efficient and optimized implementation of algorithms

Cons of scikit-learn

Steep learning curve for beginners
Limited support for deep learning and neural networks
Slower performance compared to specialized libraries for certain tasks

Code Comparison

scikit-learn:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

polyglot:

from polyglot.text import Text
text = Text("This is a sample text.")
print(text.entities)

keras

63,453

Deep Learning for humans

Pros of Keras

Keras is a high-level neural networks API written in Python. Early releases ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports JAX, TensorFlow, and PyTorch as backends. It is designed to enable fast experimentation with deep neural networks and supports both convolutional and recurrent networks, as well as their combinations.
Keras provides a simple, consistent interface to a variety of backend neural network engines, making it easy to switch between backends.
Keras has a large and active community, with extensive documentation and a wealth of pre-built models and examples available.

Cons of Keras

Keras is focused on building and training neural networks and does not provide ready-made multilingual NLP tools such as the language detection or transliteration that Polyglot offers.
Keras can be less flexible than lower-level libraries like TensorFlow, as it abstracts away some of the underlying complexity.
Keras may have a steeper learning curve for users who are new to deep learning or machine learning in general.

Code Comparison

Keras:

from keras import Input
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Input(shape=(100,)))  # declare the input shape explicitly
model.add(Dense(64))
model.add(Activation('relu'))

Polyglot:

from polyglot.text import Text

text = Text("The quick brown fox jumps over the lazy dog.")
print(text.entities)
print(text.polarity)
print(text.language)

transformers

150,567

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of Transformers

Transformers provides a wide range of pre-trained models for various NLP tasks, including text classification, question answering, and language generation.
The library offers a user-friendly API that simplifies the process of fine-tuning and using these pre-trained models.
Transformers has a large and active community, with regular updates and a wealth of documentation and tutorials.

Cons of Transformers

Transformers is organized around large pre-trained models, while Polyglot exposes lightweight task-level tools such as transliteration and morphological analysis that need no model selection or fine-tuning.
The Transformers library can be more complex to set up and configure, especially for users who are new to deep learning and NLP.
The size of the Transformers library and the number of pre-trained models can be overwhelming, making it challenging to choose the right model for a specific task.

Code Comparison

Polyglot:

from polyglot.text import Text
text = Text("The quick brown fox jumps over the lazy dog.")
print(text.entities)

Transformers:

from transformers import pipeline
classifier = pipeline('text-classification')
result = classifier("The quick brown fox jumps over the lazy dog.")
print(result)


README

polyglot

.. |Downloads| image:: https://img.shields.io/pypi/dm/polyglot.svg
   :target: https://pypi.python.org/pypi/polyglot
.. |Latest Version| image:: https://badge.fury.io/py/polyglot.svg
   :target: https://pypi.python.org/pypi/polyglot
.. |Build Status| image:: https://travis-ci.org/aboSamoor/polyglot.png?branch=master
   :target: https://travis-ci.org/aboSamoor/polyglot
.. |Documentation Status| image:: https://readthedocs.org/projects/polyglot/badge/?version=latest
   :target: https://readthedocs.org/builds/polyglot/

Polyglot is a natural language pipeline that supports massive multilingual applications.

Free software: GPLv3 license
Documentation: http://polyglot.readthedocs.org.

Features


-  Tokenization (165 Languages)
-  Language detection (196 Languages)
-  Named Entity Recognition (40 Languages)
-  Part of Speech Tagging (16 Languages)
-  Sentiment Analysis (136 Languages)
-  Word Embeddings (137 Languages)
-  Morphological analysis (135 Languages)
-  Transliteration (69 Languages)

Developer

Rami Al-Rfou @ rmyeid gmail com

Quick Tutorial

.. code:: python

    import polyglot
    from polyglot.text import Text, Word

Language Detection
~~~~~~~~~~~~~~~~~~


.. code:: python

    text = Text("Bonjour, Mesdames.")
    print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))


.. parsed-literal::

    Language Detected: Code=fr, Name=French
    


Tokenization
~~~~~~~~~~~~

.. code:: python

    zen = Text("Beautiful is better than ugly. "
               "Explicit is better than implicit. "
               "Simple is better than complex.")
    print(zen.words)


.. parsed-literal::

    [u'Beautiful', u'is', u'better', u'than', u'ugly', u'.', u'Explicit', u'is', u'better', u'than', u'implicit', u'.', u'Simple', u'is', u'better', u'than', u'complex', u'.']


.. code:: python

    print(zen.sentences)


.. parsed-literal::

    [Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]


Part of Speech Tagging
~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

    print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
    for word, tag in text.pos_tags:
        print(u"{:<16}{:>2}".format(word, tag))


.. parsed-literal::

    Word            POS Tag
    ------------------------------
    O               DET
    primeiro        ADJ
    uso             NOUN
    de              ADP
    desobediência   NOUN
    civil           ADJ
    em              ADP
    massa           NOUN
    ocorreu         ADJ
    em              ADP
    setembro        NOUN
    de              ADP
    1906            NUM
    .               PUNCT

Named Entity Recognition


.. code:: python

    text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
    print(text.entities)


.. parsed-literal::

    [I-LOC([u'Gro\xdfbritannien']), I-PER([u'Gandhi'])]


Polarity
~~~~~~~~

.. code:: python

    print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
    for w in zen.words[:6]:
        print("{:<16}{:>2}".format(w, w.polarity))


.. parsed-literal::

    Word            Polarity
    ------------------------------
    Beautiful        0
    is               0
    better           1
    than             0
    ugly            -1
    .                0


Embeddings
~~~~~~~~~~

.. code:: python

    word = Word("Obama", language="en")
    print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
    for w in word.neighbors:
        print("{:<16}".format(w))
    print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
    print(word.vector[:10])


.. parsed-literal::

    Neighbors (Synonyms) of Obama
    ------------------------------
    Bush            
    Reagan          
    Clinton         
    Ahmadinejad     
    Nixon           
    Karzai          
    McCain          
    Biden           
    Huckabee        
    Lula            
    
    
    The first 10 dimensions out of the 256 dimensions
    
    [-2.57382345  1.52175975  0.51070285  1.08678675 -0.74386948 -1.18616164
      2.92784619 -0.25694436 -1.40958667 -2.39675403]


Morphology
~~~~~~~~~~

.. code:: python

    word = Text("Preprocessing is an essential step.").words[0]
    print(word.morphemes)


.. parsed-literal::

    [u'Pre', u'process', u'ing']


Transliteration
~~~~~~~~~~~~~~~

.. code:: python

    from polyglot.transliteration import Transliterator
    transliterator = Transliterator(source_lang="en", target_lang="ru")
    print(transliterator.transliterate(u"preprocessing"))


.. parsed-literal::

    препрокессинг
