python-goose

Html Content / Article Extractor, web scraping lib in Python

4,041

782

4,041

109

Top Related Projects

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3.

python-readability

2,810

fast python port of arc90's readability tool, updated to match latest readability.js!

sumy

3,598

Module for automatic summarization of text documents and HTML pages.

dragnet

1,268

Just the facts -- web page content extraction

news-please

2,268

news-please - an integrated web crawler and information extractor for news that just works

Quick Overview

Python-Goose is an article extractor library that uses natural language processing techniques to extract the main content, metadata, and images from web pages. It aims to provide clean, readable text from cluttered web pages, making it useful for content analysis, data mining, and information retrieval tasks.

Pros

Effective at extracting main content from various types of web pages
Supports multiple languages
Includes image extraction capabilities
Can be easily integrated into Python projects

Cons

Not actively maintained (last commit was in 2018)
May struggle with some modern web designs or JavaScript-heavy pages
Limited documentation and examples
Potential performance issues with large-scale scraping tasks

Code Examples

Basic article extraction:

from goose3 import Goose

url = 'https://example.com/article'
g = Goose()
article = g.extract(url=url)

print(article.title)
print(article.cleaned_text)

Extracting images:

from goose3 import Goose

url = 'https://example.com/article-with-images'
g = Goose()
article = g.extract(url=url)

if article.top_image:
    print(f"Top image URL: {article.top_image.src}")

for image in article.images:
    print(f"Image URL: {image.src}")

Working with non-English content:

from goose3 import Goose

url = 'https://example.fr/french-article'
g = Goose({'use_meta_language': False, 'target_language': 'fr'})
article = g.extract(url=url)

print(article.title)
print(article.cleaned_text)

Getting Started

To get started with Python-Goose, follow these steps:

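Install the library using pip. The package name goose3 refers to the Python 3 fork of the project; the original python-goose code shown in the README below targets Python 2: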
```
pip install goose3
```

Import and use the Goose class in your Python script:

from goose3 import Goose

g = Goose()
article = g.extract(url='https://example.com/article')

print(article.title)
print(article.cleaned_text)

Customize the extraction process by passing configuration options:

g = Goose({'strict': True, 'browser_user_agent': 'MyBot/1.0'})

For more advanced usage and configuration options, refer to the project's documentation on GitHub.

Competitor Comparisons

newspaper

14,624

newspaper3k is a news, full-text, and article metadata extraction in Python 3.

Pros of newspaper

More actively maintained with recent updates
Supports multi-threaded article downloads for improved speed
Includes built-in NLP features for keyword extraction and summarization

Cons of newspaper

Heavier dependencies, potentially more complex setup
May be slower for single article parsing compared to Goose
Less flexible for custom parsing rules

Code Comparison

newspaper:

from newspaper import Article

url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.text)

Goose:

from goose3 import Goose

url = 'http://example.com/article'
g = Goose()
article = g.extract(url=url)

print(article.title)
print(article.cleaned_text)

Both libraries offer similar basic functionality for article extraction, but newspaper provides additional features like keyword extraction and summarization out of the box. Goose's implementation is simpler and may be faster for single article parsing, while newspaper offers more advanced features and multi-threading capabilities for bulk processing.
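Goose's extract call, as used above, handles one URL at a time, so bulk processing has to be arranged by the caller. The sketch below shows one possible approach with the standard library's thread pool. `extract_many` and its parameters are hypothetical names, not part of either library, and the sketch assumes the extractor callable you pass in is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_many(urls, extract_one, max_workers=4):
    """Run extract_one(url) for every URL on a thread pool.

    extract_many is a hypothetical helper, not part of Goose or newspaper.
    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it is working on.
        futures = {pool.submit(extract_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                errors[url] = exc
    return results, errors
```

With Goose, `extract_one` could be something like `lambda url: Goose().extract(url=url).cleaned_text`, which creates one instance per call so that no extractor state is shared between threads.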

python-readability

2,810

fast python port of arc90's readability tool, updated to match latest readability.js!

Pros of python-readability

More actively maintained with recent updates
Supports Python 3, while python-goose is primarily for Python 2
Lighter weight and focused specifically on content extraction

Cons of python-readability

Less comprehensive feature set compared to python-goose
May require additional processing for tasks like image extraction
Limited language support compared to python-goose's multi-language capabilities

Code Comparison

python-readability:

from readability import Document
import requests

response = requests.get('http://example.com')
doc = Document(response.text)
print(doc.summary())
print(doc.title())

python-goose:

from goose3 import Goose

g = Goose()
article = g.extract(url='http://example.com')
print(article.cleaned_text)
print(article.title)

Both libraries aim to extract content from web pages, but python-readability focuses on simplicity and core content extraction, while python-goose offers a more comprehensive set of features for article parsing and analysis. The choice between them depends on specific project requirements and the desired level of functionality.

sumy

3,598

Module for automatic summarization of text documents and HTML pages.

Pros of sumy

Supports multiple summarization algorithms (LSA, LexRank, TextRank, etc.)
Implements extractive summarization, selecting the most relevant sentences from the source text
Actively maintained with recent updates and contributions

Cons of sumy

Focused solely on text summarization, while python-goose offers additional features like content extraction
May require more setup and configuration for specific use cases
Less comprehensive documentation compared to python-goose

Code Comparison

sumy:

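from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lsa import LsaSummarizer

url = 'http://example.com/article'
# Download the page and split its text into sentences
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()
# Select the five highest-ranked sentences
summary = summarizer(parser.document, sentences_count=5)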

python-goose:

from goose3 import Goose

url = 'http://example.com/article'
g = Goose()
article = g.extract(url=url)
# Goose does not summarize; the page's meta description is the closest field
summary = article.meta_description

sumy generates summaries and lets you choose the summarization algorithm, whereas python-goose does not summarize at all: it extracts the article's content and metadata, and the page's own meta description is the closest thing to a summary that it returns.

dragnet

1,268

Just the facts -- web page content extraction

Pros of dragnet

More actively maintained with recent updates
Supports Python 3, while python-goose is Python 2 only
Includes machine learning models for improved content extraction

Cons of dragnet

Steeper learning curve due to more complex API
Requires additional dependencies like NumPy and scikit-learn
Less straightforward installation process compared to python-goose

Code Comparison

dragnet:

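import requests
from dragnet import extract_content
html = requests.get('http://example.com').text  # extract_content takes HTML, not a URL
content = extract_content(html)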

python-goose:

from goose import Goose
g = Goose()
article = g.extract(url='http://example.com')
content = article.cleaned_text

Both libraries aim to extract main content from web pages, but dragnet uses machine learning techniques for potentially more accurate results. python-goose offers a simpler API and easier setup, making it more suitable for quick projects or beginners. dragnet might be preferable for more complex content extraction tasks or when working with Python 3.

The choice between the two depends on specific project requirements, Python version compatibility, and the desired balance between ease of use and extraction accuracy.

news-please

2,268

news-please - an integrated web crawler and information extractor for news that just works

Pros of news-please

More comprehensive feature set, including extraction of metadata like authors, publish date, and language
Supports multiple input formats (URL, HTML, or local files) and output formats (JSON, CSV, SQLite)
Actively maintained with regular updates and improvements

Cons of news-please

More complex setup and configuration compared to python-goose
Heavier resource usage due to its extensive feature set
Slower processing speed for single articles compared to python-goose

Code Comparison

news-please:

from newsplease import NewsPlease

article = NewsPlease.from_url('https://example.com/article')
print(article.title, article.authors, article.date_publish)

python-goose:

from goose3 import Goose

g = Goose()
article = g.extract(url='https://example.com/article')
print(article.title, article.authors, article.publish_date)

Both libraries offer similar basic functionality for extracting article content, but news-please provides more detailed metadata and configuration options out of the box.

README

Python-Goose - Article Extractor

Intro

Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project (https://github.com/GravityLabs/goose).

This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in article
Meta Description
Meta tags

The Python version was rewritten by:

Xavier Grangier

Licensing

If you find Goose useful or have issues please drop me a line. I'd love to hear how you're using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.

Setup

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

Take it for a spin

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg
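The fields read in the session above can be gathered in one place. The following is a minimal sketch: `article_overview` is a hypothetical helper, it assumes `top_image` is `None` when no image candidate was found, and it uses a stand-in object with shortened sample text so that it runs without Goose or network access.

```python
from types import SimpleNamespace


def article_overview(article, preview_chars=150):
    """Collect the fields shown in the session above into a plain dict.

    article_overview is a hypothetical helper; it only reads attributes
    that appear in the README (title, meta_description, cleaned_text,
    top_image.src).
    """
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        "preview": article.cleaned_text[:preview_chars],
        # Assumption: top_image is None when no image candidate was found.
        "image": top_image.src if top_image is not None else None,
    }


# Stand-in object (shortened sample text), not real extractor output.
sample = SimpleNamespace(
    title="Occupy London loses eviction fight",
    meta_description="Occupy London protesters lost their court bid.",
    cleaned_text="(CNN) -- Occupy London protesters who have been camped outside",
    top_image=None,
)
overview = article_overview(sample, preview_chars=8)
print(overview)
```

In real use, the object returned by `g.extract(url=url)` would take the place of `sample`.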

Configuration

There are two ways to pass configuration to goose. The first one is to pass goose a Configuration() object. The second one is to pass a configuration dict.

For instance, to change the user agent that Goose sends, just pass:

>>> g = Goose({'browser_user_agent': 'Mozilla'})

Switching parsers: Goose can be used with either the lxml HTML parser or the lxml soup parser. The HTML parser is the default; to use the soup parser, set it in the configuration dict:

>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})
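The first option mentioned at the start of this section, passing a Configuration() object, is not illustrated here. The following is a minimal sketch, assuming the Python 3 fork goose3 is installed and that its `Configuration` class is importable from `goose3.configuration` (for the original Python 2 package the module would be `goose.configuration`). `make_goose` is a hypothetical helper name, and the imports sit inside the function so the file still loads when goose3 is absent.

```python
def make_goose(user_agent="Mozilla", parser_class=None):
    """Build a Goose instance from a Configuration object (sketch)."""
    # Assumption: goose3 is installed; use `goose` for the Python 2 package.
    from goose3 import Goose
    from goose3.configuration import Configuration

    config = Configuration()
    config.browser_user_agent = user_agent
    if parser_class is not None:
        # 'soup' selects the lxml soup parser described above.
        config.parser_class = parser_class
    return Goose(config)
```

Calling `make_goose('Mozilla', parser_class='soup')` is intended to be equivalent to the dict form `Goose({'browser_user_agent': 'Mozilla', 'parser_class': 'soup'})`.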

Goose is now language aware

For example, scraping a Spanish content page with correct meta language tags:

>>> from goose import Goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'

Some pages don't have correct meta language tags. In that case you can force the language through the configuration:

>>> from goose import Goose
>>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '

Passing {'use_meta_language': False, 'target_language':'es'} will forcibly select Spanish.

Video extraction

>>> import goose
>>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
>>> g = goose.Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x25f60d0>]
>>> article.movies[0].src
'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
>>> article.movies[0].embed_code
'<iframe src="http://sa.kewego.com/embed/vp/?language_code=fr&amp;playerKey=1764a824c13c&amp;configKey=dcc707ec373f&amp;suffix=&amp;sig=9bc77afb496s&amp;autostart=false" frameborder="0" scrolling="no" width="476" height="357"/>'
>>> article.movies[0].embed_type
'iframe'
>>> article.movies[0].width
'476'
>>> article.movies[0].height
'357'
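When an article embeds several videos, the attributes shown above can be collected in a loop. This is a sketch only: `movie_records` is a hypothetical helper, the stand-in article below is not real extractor output, and the `int()` conversion assumes that width and height are always present as numeric strings, as they are in the session above.

```python
from types import SimpleNamespace


def movie_records(article):
    """Return one dict per embedded video found on the article.

    movie_records is a hypothetical helper; the attribute names
    (src, embed_type, width, height) are the ones shown in the session above.
    """
    records = []
    for movie in article.movies:
        records.append({
            "src": movie.src,
            "embed_type": movie.embed_type,
            # The session above shows width and height as strings.
            "width": int(movie.width),
            "height": int(movie.height),
        })
    return records


# Stand-in article so the sketch runs without Goose or network access.
fake_article = SimpleNamespace(movies=[
    SimpleNamespace(src="http://example.com/embed", embed_type="iframe",
                    width="476", height="357"),
])
print(movie_records(fake_article))
```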

Goose in Chinese

Some users want to use Goose for Chinese content. Chinese word segmentation is much harder to handle than that of Western languages, so Chinese needs a dedicated stop-word analyser, which must be passed to the config object.

>>> from goose import Goose
>>> from goose.text import StopWordsChinese
>>> url  = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。

梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有

Goose in Arabic

In order to use Goose in Arabic you have to use the StopWordsArabic class.

>>> from goose import Goose
>>> from goose.text import StopWordsArabic
>>> url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل

Goose in Korean

In order to use Goose in Korean you have to use the StopWordsKorean class.

>>> from goose import Goose
>>> from goose.text import StopWordsKorean
>>> url='http://news.donga.com/3/all/20131023/58406128/1'
>>> g = Goose({'stopwords_class':StopWordsKorean})
>>> article = g.extract(url=url)
>>> print article.cleaned_text[:150]
경기도 용인에 자리 잡은 민간 시험인증 전문기업 ㈜디지털이엠씨(www.digitalemc.com).
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다.
그는 전기전자·무선통신·자동차 전장품 분야에
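The language-specific settings from the preceding sections can be put behind one function. The sketch below mirrors the README examples: a dedicated stop-words class for Chinese, Arabic, and Korean, and a forced `target_language` for everything else. `language_config`, the two-letter keys, and the `package` parameter are hypothetical; the stop-words class is imported lazily so the function still works for other languages when Goose is not installed.

```python
import importlib

# Languages that, per the sections above, need a dedicated stop-words class.
STOPWORDS_CLASS_NAMES = {
    "zh": "StopWordsChinese",
    "ar": "StopWordsArabic",
    "ko": "StopWordsKorean",
}


def language_config(language_code, package="goose"):
    """Return a Goose configuration dict for the given language (sketch).

    language_config and the keys above are hypothetical names; the dict
    contents mirror the examples in this README.
    """
    class_name = STOPWORDS_CLASS_NAMES.get(language_code)
    if class_name is None:
        # Other languages: force the language instead of trusting meta tags.
        return {"use_meta_language": False, "target_language": language_code}
    # Assumption: the classes live in <package>.text, as in the imports above.
    text_module = importlib.import_module(package + ".text")
    return {"stopwords_class": getattr(text_module, class_name)}


print(language_config("es"))
```

The result is meant to be passed straight to the constructor, for example `Goose(language_config("ko"))`.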

Known issues

There are some issues with unicode URLs.
Cookie handling: Some websites need cookie handling. At the moment the only workaround is to fetch the page yourself and use the raw_html extraction. For instance:

>>> import urllib2
>>> import goose
>>> url = "http://www.nytimes.com/2013/08/18/world/middleeast/pressure-by-us-failed-to-sway-egypts-leaders.html?hp"
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
>>> response = opener.open(url)
>>> raw_html = response.read()
>>> g = goose.Goose()
>>> a = g.extract(raw_html=raw_html)
>>> a.cleaned_text
u'CAIRO \u2014 For a moment, at least, American and European diplomats trying to defuse the volatile standoff in Egypt thought they had a breakthrough.\n\nAs t'
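The workaround above relies on urllib2, which exists only in Python 2. A rough Python 3 equivalent using the standard library might look like the sketch below. The three function names are hypothetical, and the final step assumes the goose3 fork is installed and accepts the same `raw_html` argument; it is imported lazily so the rest of the code runs without it.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_raw_html(url, opener, timeout=10):
    """Download a page through the cookie-aware opener and decode it."""
    with opener.open(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_with_cookies(url):
    """Fetch with cookies, then hand the markup to Goose via raw_html."""
    # Assumption: the goose3 fork is installed; imported lazily.
    from goose3 import Goose

    opener, _jar = build_cookie_opener()
    return Goose().extract(raw_html=fetch_raw_html(url, opener))
```

Keeping the download separate from the extraction also makes it possible to swap in another HTTP client that handles cookies or sessions.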

TODO

Video html5 tag extraction

