news-please
news-please - an integrated web crawler and information extractor for news that just works
Top Related Projects
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
Html Content / Article Extractor, web scraping lib in Python
fast python port of arc90's readability tool, updated to match latest readability.js!
List any node_modules 📦 dir in your system and how heavy they are. You can then select which ones you want to erase to free up space 🧹
Quick Overview
news-please is an open-source, integrated web crawler and information extractor for news websites. It automatically detects and extracts information from news articles, including the article's title, author, main text, and publication date. The tool is designed to be easy to use while offering advanced features for more complex crawling tasks.
Pros
- Easy to set up and use, with minimal configuration required
- Supports both on-demand article extraction and continuous crawling
- Automatically detects and extracts various article components
- Highly customizable and extensible for advanced users
Cons
- May occasionally misidentify or incorrectly extract certain article elements
- Performance can be slower for large-scale crawling tasks
- Requires Python knowledge for advanced customization
- Limited support for non-news websites or unconventional article structures
Code Examples
- Extracting information from a single URL:
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.example.com/news/article')
print(article.title)
print(article.maintext)
- Crawling multiple URLs:
from newsplease import NewsPlease
urls = [
    'https://www.example1.com/article1',
    'https://www.example2.com/article2',
    'https://www.example3.com/article3'
]
# from_urls returns a dict that maps each URL to its extracted article
articles = NewsPlease.from_urls(urls)
for url, article in articles.items():
    print(f"Title: {article.title}")
    print(f"Date: {article.date_publish}")
- Continuous crawling (available in CLI mode only; add the root URLs of the sites to sitelist.hjson first):
$ news-please
Getting Started
- Install news-please:
pip install news-please
- Basic usage:
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.example.com/news/article')
print(f"Title: {article.title}")
print(f"Author: {article.authors}")
print(f"Text: {article.maintext[:200]}...")  # Print first 200 characters
print(f"Date: {article.date_publish}")
This will extract and print basic information from the specified news article URL.
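Extraction does not always succeed for every field, so slicing `article.maintext` directly can fail when no main text was found. A minimal sketch of a guard follows; `preview` is a hypothetical helper (not part of news-please), and it assumes that missing fields are returned as `None`.

```python
from typing import Optional


def preview(text: Optional[str], limit: int = 200) -> str:
    """Return at most `limit` characters of `text`, or a placeholder if it is missing."""
    if not text:
        return "(no text extracted)"
    # Append an ellipsis only when something was actually cut off.
    return text if len(text) <= limit else text[:limit] + "..."
```

With this helper, the text line above could be written as `print(f"Text: {preview(article.maintext)}")`.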
Competitor Comparisons
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
Pros of newspaper
- Lightweight and fast, focusing on article extraction and parsing
- Extensive language support with multi-language summarization
- Simple and intuitive API for quick integration
Cons of newspaper
- Less comprehensive web crawling capabilities
- Fewer options for customizing the extraction process
- Limited support for extracting additional metadata
Code Comparison
newspaper:
from newspaper import Article
url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()
print(article.title)
print(article.text)
news-please:
from newsplease import NewsPlease
article = NewsPlease.from_url('http://example.com/article')
print(article.title)
print(article.maintext)
print(article.authors)
print(article.date_publish)
Both libraries expose an article's title, text, authors, and publication date. news-please returns them from a single from_url call and adds crawling, several storage backends, and access to the commoncrawl.org archive, which makes it more suitable for news analysis and archiving projects. newspaper keeps downloading and parsing as explicit steps and offers multi-language support and summarization, making it a good choice for simpler article extraction tasks or projects that process several languages.
Html Content / Article Extractor, web scraping lib in Python
Pros of python-goose
- Simpler and more lightweight, focusing primarily on content extraction
- Supports multiple languages out of the box
- Has been around longer, potentially more stable and battle-tested
Cons of python-goose
- Less actively maintained, with fewer recent updates
- More limited in scope, lacking some advanced features of news-please
- May require additional libraries or tools for full-fledged news scraping
Code Comparison
news-please:
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.example.com/article')
print(article.title)
print(article.maintext)
python-goose:
from goose3 import Goose
g = Goose()
article = g.extract(url='https://www.example.com/article')
print(article.title)
print(article.cleaned_text)
Both libraries offer similar basic functionality for extracting article content, but news-please provides additional features like metadata extraction and more comprehensive parsing options. python-goose focuses on simplicity and language support, while news-please aims to be a more complete solution for news article extraction and analysis.
fast python port of arc90's readability tool, updated to match latest readability.js!
Pros of python-readability
- Lightweight and focused on extracting readable content from HTML
- Simpler to use for basic content extraction tasks
- Can be easily integrated into other projects as a library
Cons of python-readability
- Limited functionality compared to news-please
- Lacks advanced features like article metadata extraction
- No built-in crawling or scraping capabilities
Code Comparison
news-please:
from newsplease import NewsPlease
article = NewsPlease.from_url('https://example.com/article')
print(article.title)
print(article.maintext)
python-readability:
from readability import Document
import requests
response = requests.get('https://example.com/article')
doc = Document(response.text)
print(doc.title())
print(doc.summary())
Summary
news-please is a more comprehensive solution for news article extraction, offering advanced features like metadata extraction and built-in crawling. python-readability, on the other hand, is a simpler library focused on extracting readable content from HTML. While news-please is better suited for large-scale news aggregation projects, python-readability may be preferable for simpler content extraction tasks or as a lightweight component in larger systems.
List any node_modules 📦 dir in your system and how heavy they are. You can then select which ones you want to erase to free up space 🧹
Pros of npkill
- Focused utility for cleaning up node_modules directories
- Interactive CLI interface for easy navigation and selection
- Lightweight and fast, designed for a specific task
Cons of npkill
- Limited in scope compared to news-please's comprehensive functionality
- Not designed for web scraping or content extraction
- Lacks advanced features for data processing and analysis
Code Comparison
npkill (JavaScript):
const deleteFolder = (folderPath) => {
  return new Promise((resolve, reject) => {
    rimraf(folderPath, (err) => {
      if (err) reject(err);
      else resolve();
    });
  });
};
news-please (Python):
def download_url(self, url, timeout=None):
    if timeout is None:
        timeout = self.config.request_timeout
    return requests.get(url, timeout=timeout, headers=self.headers)
Summary
npkill is a specialized tool for managing node_modules directories, offering an interactive CLI for easy cleanup. news-please, on the other hand, is a comprehensive web scraping and news extraction library with broader capabilities. While npkill excels in its focused task, news-please provides more extensive features for content retrieval and processing. The code examples highlight the different languages and purposes of each project, with npkill focusing on file system operations and news-please on web requests and data extraction.
README
news-please
news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both most recent and also old, archived articles. You only need to provide the root URL of the news website to crawl it completely. news-please combines the power of multiple state-of-the-art libraries and tools, such as scrapy, Newspaper, and readability.
news-please also features a library mode, which allows Python developers to use the crawling and extraction functionality within their own programs. Moreover, news-please makes it convenient to crawl and extract articles from the (very) large news archive at commoncrawl.org.
If you want to contribute to news-please, please first read here.
Announcements
03/23/2021: If you're interested in sentiment classification in news articles, check out our large-scale dataset for target-dependent sentiment classification. We also publish an easy-to-use neural model that achieves state-of-the-art performance. Visit the project here.
06/01/2018: If you're interested in event extraction from news, you might also want to check out our new project, Giveme5W1H - a tool that extracts phrases answering the journalistic five W and one H questions to describe an article's main event, i.e., who did what, when, where, why, and how.
Extracted information
news-please extracts the following attributes from news articles. An example JSON file as extracted by news-please can be found here.
- headline
- lead paragraph
- main text
- main image
- name(s) of author(s)
- publication date
- language
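To make the attribute list concrete, here is a minimal sketch of a record holding these seven attributes and serializing them to JSON. `ExtractedArticle` is a hypothetical stand-in, not the library's own class; the field names are assumptions chosen to resemble the attributes used in the examples on this page (`title`, `maintext`, `authors`, `date_publish`).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ExtractedArticle:
    """Hypothetical stand-in for the attributes listed above."""
    title: Optional[str] = None          # headline
    description: Optional[str] = None    # lead paragraph (assumed field name)
    maintext: Optional[str] = None       # main text
    image_url: Optional[str] = None      # main image (assumed field name)
    authors: List[str] = field(default_factory=list)  # name(s) of author(s)
    date_publish: Optional[str] = None   # publication date, kept as a string here
    language: Optional[str] = None       # language code (assumed field name)


record = ExtractedArticle(title="Example headline", authors=["Jane Doe"], language="en")
# Fields that were not extracted stay None and serialize as JSON null.
as_json = json.dumps(asdict(record), ensure_ascii=False)
```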
Features
- works out of the box: install with pip, add URLs of your pages, run :-)
- run news-please conveniently using its CLI mode
- use it as a library within your own software
- extract articles from commoncrawl.org's news archive
Modes and use cases
news-please supports three main use cases, which are explained in more detail in the following.
CLI mode
- stores extracted results in JSON files, PostgreSQL, ElasticSearch, Redis, or your own storage
- simple but extensive configuration (if you want to tweak the results)
- revisions: crawl articles multiple times and track changes
Library mode
- crawl and extract information given a list of article URLs
- to use news-please within your own Python code
News archive from commoncrawl.org
- commoncrawl.org provides an extensive, free-to-use archive of news articles from small and major publishers worldwide
- news-please enables users to conveniently download and extract articles from commoncrawl.org
- you can optionally define filter criteria, such as news publisher(s) or the date period, within which articles need to be published
- clone the news-please repository, adapt the config section in newsplease/examples/commoncrawl.py, and execute
python3 -m newsplease.examples.commoncrawl
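The filter criteria mentioned above amount to a predicate over each article's publisher host and publication date. The following is only a sketch of that idea: `keep_article` and its parameters are hypothetical and are not taken from newsplease/examples/commoncrawl.py, whose config section remains the place to set the real filters.

```python
from datetime import datetime
from typing import Iterable, Optional
from urllib.parse import urlparse


def keep_article(url: str,
                 published: Optional[datetime],
                 hosts: Iterable[str],
                 start: datetime,
                 end: datetime) -> bool:
    """Keep an article only if its host is wanted and its date lies in [start, end)."""
    host = (urlparse(url).hostname or "").lower()
    # Accept the exact host or any subdomain of a wanted host.
    host_ok = any(host == h or host.endswith("." + h) for h in hosts)
    # Articles without a known date cannot be checked, so this sketch drops them.
    date_ok = published is not None and start <= published < end
    return host_ok and date_ok
```

Dropping undated articles is a design choice of this sketch; a stricter or looser policy may fit other projects better.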
Getting started
It's super easy, we promise!
Installation
news-please runs on Python 3.8+.
$ pip install news-please
Use within your own code (as a library)
You can access the core functionality of news-please, i.e. extraction of semi-structured information from one or more news articles, in your own code by using news-please in library mode. If you want to use news-please's full website extraction (given only the root URL) or continuous crawling mode (using RSS), you'll need to use the CLI mode, which is described later.
from newsplease import NewsPlease
article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
print(article.title)
A sample of an extracted article can be found here (as a JSON file).
If you want to crawl multiple articles at a time, optionally passing any parameter accepted by requests.request()
NewsPlease.from_urls([url1, url2, ...], request_args={"timeout": 6})
or if you have a file containing all URLs (each line containing a single URL)
NewsPlease.from_file(path)
or if you have raw HTML data (you can also provide the original URL to increase the accuracy of extracting the publishing date)
NewsPlease.from_html(html, url=None)
or if you have a WARC file (also check out our commoncrawl workflow, which provides convenient methods to filter commoncrawl's archive for specific news outlets and dates)
NewsPlease.from_warc(warc_record)
In library mode, news-please will attempt to download and extract information from each URL. The previously described functions are blocking, i.e., will return once news-please has attempted all URLs. The resulting list contains all successfully extracted articles.
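Because these calls block until every URL has been attempted, it can pay off to clean a URL list before passing it on. The sketch below reads a one-URL-per-line file and drops blank lines and duplicates; `read_url_list` is a hypothetical helper and makes no claim about how `NewsPlease.from_file` itself parses the file.

```python
from pathlib import Path
from typing import List


def read_url_list(path: str) -> List[str]:
    """Read one URL per line, skipping blanks and duplicates while keeping order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list can then be handed to `NewsPlease.from_urls(...)` as shown above.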
Finally, you can process the extracted information contained in the article object(s). For example, to export into a JSON format, you may use:
import json
with open("article.json", "w") as file:
    json.dump(article.get_serializable_dict(), file)
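For several articles, one JSON object per line (JSON Lines) is often easier to process than one file per article. A minimal sketch, assuming each article object offers the `get_serializable_dict()` method used above; `dump_json_lines` is a hypothetical helper, not part of news-please.

```python
import json
from typing import Iterable


def dump_json_lines(articles: Iterable, path: str) -> int:
    """Write one JSON object per line and return how many articles were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as file:
        for article in articles:
            if article is None:  # skip entries where extraction produced nothing
                continue
            line = json.dumps(article.get_serializable_dict(), ensure_ascii=False)
            file.write(line + "\n")
            count += 1
    return count
```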
Run the crawler (via the CLI)
$ news-please
news-please will then start crawling a few example pages. To terminate the process, press CTRL+C. news-please will then shut down within 5-60 seconds. You can also press CTRL+C twice, which will immediately kill the process (not recommended, though).
The results are stored by default in JSON files in the data folder. In the default configuration, news-please also stores the original HTML files.
Crawl other pages
Most likely, you will not want to crawl the websites provided in our example configuration. Simply head over to the sitelist.hjson file and add the root URLs of the news outlets' web pages of your choice. news-please can also extract the most recent events from the GDELT project, see here.
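As a rough illustration of what adding root URLs can look like, the sketch below generates a minimal site list as JSON, which is also valid Hjson. The `base_urls` and `url` key names are an assumption about the layout of the default sitelist.hjson, and the two sites are placeholders; check the file shipped with your installation before relying on this.

```python
import json

# Placeholder sites; replace them with the news outlets you want to crawl.
root_urls = ["https://www.example.com/", "https://news.example.org/"]

# Assumed layout: a top-level "base_urls" array with one object per site.
sitelist = {"base_urls": [{"url": url} for url in root_urls]}
sitelist_text = json.dumps(sitelist, indent=2)
print(sitelist_text)
```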
ElasticSearch
news-please also supports export to Elasticsearch. Using Elasticsearch will also enable the versioning feature. First, enable it in the config.cfg in the config directory, which is ~/news-please/config by default but can be changed to a custom location with the -c parameter. If the directory does not exist, a default directory will be created at the specified location.
[Scrapy]
ITEM_PIPELINES = {
                   'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
                   'newsplease.pipeline.pipelines.ElasticsearchStorage':350
                   }
That's it, unless your Elasticsearch database is not located at http://localhost:9200, uses a different username/password, or requires CA-certificate authentication. In these cases, you will also need to change the following.
[Elasticsearch]
host = localhost
port = 9200
...
# Credentials used for authentication (supports CA-certificates):
use_ca_certificates = False # True if authentication needs to be performed
ca_cert_path = '/path/to/cacert.pem'
client_cert_path = '/path/to/client_cert.pem'
client_key_path = '/path/to/client_key.pem'
username = 'root'
secret = 'password'
PostgreSQL
news-please can store articles in a PostgreSQL database, including the versioning feature. To export to PostgreSQL, open the corresponding config file (config_lib.cfg for library mode and config.cfg for CLI mode), add the PostgresqlStorage module to the pipeline, and adjust the database credentials:
[Scrapy]
ITEM_PIPELINES = {
                   'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
                   'newsplease.pipeline.pipelines.PostgresqlStorage':350
                   }
[Postgresql]
# PostgreSQL connection required for saving meta-information
host = localhost
port = 5432
database = 'news-please'
# schema = 'news-please'
user = 'user'
password = 'password'
If you plan to use news-please and its export to PostgreSQL in a production environment, we recommend uninstalling the psycopg2-binary package and installing psycopg2. We use the former since it does not require a C compiler in order to be installed. See here for more information on the differences between psycopg2 and psycopg2-binary and on how to set up a production environment.
Redis
news-please can store articles in a Redis database, including the versioning feature. To export to Redis, open the corresponding config file (config_lib.cfg for library mode and config.cfg for CLI mode), add the RedisStorage module to the pipeline, and adjust the connection credentials:
[Scrapy]
ITEM_PIPELINES = {
                   'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
                   'newsplease.pipeline.pipelines.RedisStorage':350
                   }
[Redis]
host = localhost
port = 6379
db = 0
# You can add any redis connection parameter here
ssl_check_hostname = True
username = "news-please"
max_connections = 24
This pipeline should also be compatible with AWS ElastiCache and GCP Memorystore.
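To show how a section like `[Redis]` maps to connection parameters, the sketch below parses it with the standard library and converts values to Python literals where possible. This is an illustration only, not how news-please itself reads its configuration; `section_to_kwargs` is a hypothetical helper.

```python
import ast
import configparser

CONFIG_TEXT = """
[Redis]
host = localhost
port = 6379
db = 0
ssl_check_hostname = True
username = "news-please"
max_connections = 24
"""


def section_to_kwargs(text: str, section: str) -> dict:
    """Parse one INI section and convert values to Python literals where possible."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    kwargs = {}
    for key, raw in parser[section].items():
        try:
            # Numbers, booleans, and quoted strings become Python values.
            kwargs[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Bare words such as localhost stay plain strings.
            kwargs[key] = raw
    return kwargs


redis_kwargs = section_to_kwargs(CONFIG_TEXT, "Redis")
```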
What's next?
We have collected a bunch of useful information for both users and developers. As a user, you will most likely only deal with two files: sitelist.hjson (to define sites to be crawled) and config.cfg (probably only rarely, in case you want to tweak the configuration).
Support
You can find more information on usage and development in our wiki! Before contacting us, please check out the wiki. If you still have questions on how to use news-please, please create a new question in Discussions here on GitHub. Please understand that we are not able to provide individual support via email. We think that help is more valuable if it is shared publicly so that more people can benefit from it. However, if you still require individual support, e.g., due to confidentiality of your project, we may be able to provide you with private consultation. Contact us for information about pricing and further details.
Issues
For bug reports, we ask you to use the Bug report template. Make sure you're using the latest version of news-please, since we cannot give support for older versions. As described earlier, we cannot give support for issues or questions sent by email.
Donation
Your donations are greatly appreciated! They will free us up to work on this project more, to take on tasks such as adding new features, bug-fix support, and addressing further concerns with the library.
Acknowledgements
This project would not have been possible without the contributions of the following students (ordered alphabetically):
- Moritz Bock
- Michael Fried
- Jonathan Hassler
- Markus Klatt
- Kevin Kress
- Sören Lachnit
- Marvin Pafla
- Franziska Schlor
- Matt Sharinghousen
- Claudio Spener
- Moritz Steinmaier
We also thank all other contributors, whom you can find on the contributors page!
How to cite
If you are using news-please, please cite our paper (ResearchGate, Mendeley):
@InProceedings{Hamborg2017,
author = {Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela},
title = {news-please: A Generic News Crawler and Extractor},
year = {2017},
booktitle = {Proceedings of the 15th International Symposium of Information Science},
location = {Berlin},
doi = {10.5281/zenodo.4120316},
pages = {218--223},
month = {March}
}
You can find more information on this and other news projects on our website.
Contributions
Do you want to contribute? Great, we are always happy for any support on this project! We are particularly looking for pull requests that fix bugs. We also welcome pull requests that contribute your own ideas.
By contributing to this project, you agree that your contributions will be licensed under the project's license.
Pull requests
We love contributions by our users! If you plan to submit a pull request, please open an issue first and describe the issue you want to fix or what you want to improve and how. This way, we can discuss whether your idea could be added to news-please in the first place and, if so, how it could best be implemented to fit the architecture and coding style. In the issue, please state that you're planning to implement the described features.
Custom features
Unfortunately, we do not have the resources to implement features requested by users. Instead, we recommend that you implement the features you need and, if you'd like, open a pull request here so that the community can benefit from your improvements, too.
License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use news-please except in compliance with the License. A copy of the License is included in the project, see the file LICENSE.txt.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The news-please logo is courtesy of Mario Hamborg.
Copyright 2016-2024 The news-please team