Top Related Projects
newspaper3k is a news, full-text, and article metadata extraction in Python 3.
fast python port of arc90's readability tool, updated to match latest readability.js!
Html Content / Article Extractor, web scrapping lib in Python
Quick Overview
Dragnet is a Python library for extracting content from web pages. It uses machine learning algorithms to identify and extract the main content from HTML pages, separating it from navigation, headers, footers, and other boilerplate elements. Dragnet is particularly useful for web scraping and content analysis tasks.
Pros
- Highly accurate content extraction using machine learning techniques
- Supports both Python 2 and Python 3
- Can be easily integrated into existing web scraping pipelines
- Includes pre-trained models for immediate use
Cons
- Limited documentation and examples
- Requires some understanding of machine learning concepts
- May require fine-tuning for specific use cases
- Not actively maintained (last update was in 2019)
Code Examples
- Basic content extraction:
import requests
from dragnet import extract_content
url = "https://example.com/article"
html = requests.get(url).text
content = extract_content(html)
print(content)
- Extracting content and comments:
from dragnet import extract_content_and_comments
html = "<html>...</html>" # Your HTML string
# Returns a single string holding the article text plus any user comments
content_comments = extract_content_and_comments(html)
print("Content and comments:", content_comments)
- Using a custom model:
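from dragnet.util import load_pickled_model
html = "<html>...</html>" # Your HTML string
# Placeholder path; assumes the file holds a pickled, already-trained extractor
model = load_pickled_model('/path/to/custom/model.pkl')
content = model.extract(html)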
print(content)
Getting Started
To get started with Dragnet, follow these steps:
- Install Dragnet using pip:
pip install dragnet
- Import and use the library in your Python script:
from dragnet import extract_content
html = "<html>...</html>" # Your HTML string
content = extract_content(html)
print(content)
Note: Dragnet requires some additional dependencies, including lxml and scikit-learn. Make sure to install these if you encounter any import errors.
Competitor Comparisons
newspaper3k is a news, full-text, and article metadata extraction in Python 3.
Pros of newspaper
- More comprehensive feature set, including article extraction, keyword extraction, and summarization
- Better documentation and examples for ease of use
- Actively maintained with regular updates and bug fixes
Cons of newspaper
- May be slower for large-scale processing compared to Dragnet
- Less focused on content extraction, potentially less accurate in some cases
- Requires more dependencies, which can increase complexity in some environments
Code Comparison
newspaper:
from newspaper import Article
url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()
print(article.text)
Dragnet:
import requests
from dragnet import extract_content
html = requests.get('http://example.com/article').text
content = extract_content(html)
print(content)
Both libraries aim to extract content from web pages, but newspaper offers a more comprehensive set of features beyond just content extraction. Dragnet focuses specifically on content extraction and may be more suitable for large-scale processing tasks. Both code examples are short: newspaper fetches the page itself through its download() and parse() calls, while Dragnet expects you to fetch the HTML yourself (here with requests) and pass it to extract_content.
fast python port of arc90's readability tool, updated to match latest readability.js!
Pros of python-readability
- Simpler implementation, easier to understand and modify
- Focuses specifically on extracting main content from web pages
- Lightweight with fewer dependencies
Cons of python-readability
- Less accurate on complex web layouts
- Limited to main-content extraction; lacks dragnet's optional extraction of user comments
- May struggle with non-standard HTML structures
Code Comparison
python-readability:
from readability import Document
import requests
response = requests.get('http://example.com')
doc = Document(response.text)
print(doc.summary())
dragnet:
import dragnet
from urllib.request import urlopen
content = urlopen('http://example.com').read()
extracted_content = dragnet.extract_content(content)
print(extracted_content)
Both libraries aim to extract main content from web pages, but dragnet offers more advanced features and potentially better accuracy, especially for complex layouts. python-readability is simpler and more focused on basic content extraction, making it easier to use for straightforward tasks. The choice between the two depends on the specific requirements of your project and the complexity of the web pages you're working with.
Html Content / Article Extractor, web scrapping lib in Python
Pros of Goose
- More actively maintained with recent updates
- Supports multiple languages beyond English
- Includes additional features like image extraction
Cons of Goose
- Less focused on content extraction accuracy
- Requires more dependencies and setup
- May be slower for large-scale processing
Code Comparison
Dragnet example:
from dragnet import extract_content
content = extract_content(html)
Goose example:
from goose3 import Goose
g = Goose()
article = g.extract(url='http://example.com')
content = article.cleaned_text
Both libraries aim to extract main content from web pages, but Dragnet focuses on accuracy and speed for content extraction, while Goose offers a broader range of features. Dragnet's API is simpler, requiring fewer lines of code for basic content extraction. Goose provides more options and flexibility, but at the cost of a slightly more complex setup and usage.
Dragnet is better suited for large-scale content extraction tasks where speed and accuracy are crucial. Goose is more appropriate for projects requiring additional features like image extraction or multi-language support, and where processing speed is less critical.
README
Dragnet
Dragnet isn't interested in the shiny chrome or boilerplate dressing of a web page. It's interested in... 'just the facts.' The machine learning models in Dragnet extract the main article content and optionally user generated comments from a web page. They provide state of the art performance on a variety of test benchmarks.
For more information on our approach check out:
- Our paper Content Extraction Using Diverse Feature Sets, published at WWW in 2013, gives an overview of the machine learning approach.
- A comparison of Dragnet and alternate content extraction packages.
- This blog post explains the intuition behind the algorithms.
This project was originally inspired by Kohlschütter et al, Boilerplate Detection using Shallow Text Features and Weninger et al CETR -- Content Extraction with Tag Ratios, and more recently by Readability.
GETTING STARTED
Depending on your use case, we provide two separate functions to extract just the main article content or the content and any user generated comments. Each function takes an HTML string and returns the content string.
import requests
from dragnet import extract_content, extract_content_and_comments
# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)
# get main article without comments
content = extract_content(r.content)
# get article and comments
content_comments = extract_content_and_comments(r.content)
We also provide a sklearn-style extractor class (complete with fit and predict methods). You can either train an extractor yourself, or load a pre-trained one:
from dragnet.util import load_pickled_model
content_extractor = load_pickled_model(
'kohlschuetter_readability_weninger_content_model.pkl.gz')
content_comments_extractor = load_pickled_model(
'kohlschuetter_readability_weninger_comments_content_model.pkl.gz')
content = content_extractor.extract(r.content)
content_comments = content_comments_extractor.extract(r.content)
A note about encoding
If you know the encoding of the document (e.g. from HTTP headers), you can pass it down to the parser:
content = content_extractor.extract(html_string, encoding='utf-8')
Otherwise, we try to guess the encoding from a meta tag or specified <?xml encoding=".."?> tag. If that fails, we assume "UTF-8".
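The fallback order described here (declared encoding first, then "UTF-8") can be illustrated with a small standard-library sketch. This is not Dragnet's parser code; the function name guess_encoding, the two regular expressions, and the 2048-byte window are assumptions made for the example.

```python
import re

# Hypothetical patterns for this sketch; real HTML parsers are more lenient.
_XML_DECL = re.compile(rb'<\?xml[^>]*\bencoding=["\']([\w.:-]+)["\']', re.I)
_META_CHARSET = re.compile(rb'<meta[^>]*\bcharset=["\']?([\w.:-]+)', re.I)


def guess_encoding(html_bytes, default="utf-8"):
    """Guess a document encoding from its first bytes (illustrative only)."""
    head = html_bytes[:2048]  # declarations normally appear near the top
    for pattern in (_META_CHARSET, _XML_DECL):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default  # nothing declared: fall back to UTF-8


print(guess_encoding(b'<html><head><meta charset="ISO-8859-1"></head></html>'))
print(guess_encoding(b'<?xml version="1.0" encoding="UTF-8"?><html></html>'))
print(guess_encoding(b"<html><body>no declaration</body></html>"))
```

When the encoding is already known, for example from the HTTP Content-Type header, passing it explicitly as shown earlier avoids the guess altogether.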
Installing
Dragnet is written in Python (developed with 2.7, with support recently added for 3) and built on the numpy/scipy/Cython numerical computing environment. In addition we use lxml (libxml2) for HTML parsing.
We recommend installing from the master branch to ensure you have the latest version.
Installing with Docker:
This is the easiest method to install Dragnet and builds a Docker container with Dragnet and its dependencies.
- Install Docker.
- Clone the master branch:
git clone https://github.com/dragnet-org/dragnet.git
- Build the docker container:
docker build -t dragnet .
- Run the tests:
docker run dragnet make test
You can also run an interactive Python session:
docker run -ti dragnet python3
Installing without Docker
- Install the dependencies needed for Dragnet. The build depends on GCC, numpy, Cython and lxml (which in turn depends on libxml2). We use provision.sh to set up the dependencies in the Docker container, so you can use it as a template and modify as appropriate for your operating system.
- Clone the master branch:
git clone https://github.com/dragnet-org/dragnet.git
- Install the requirements:
cd dragnet; pip install -r requirements.txt
- Build dragnet:
$ cd dragnet
$ make install
# these should now pass
$ make test
Contributing
We love contributions! Open an issue, or fork/create a pull request.
More details about the code structure
The Extractor class encapsulates a blockifier, some feature extractors and a machine learning model.
A blockifier implements blockify that takes a HTML string and returns a list of block objects. A feature extractor is a callable that takes a list of blocks and returns a numpy array of features (len(blocks), nfeatures). There is some additional optional functionality to "train" the feature (e.g. estimate parameters needed for centering) specified in features.py. The machine learning model implements the scikit-learn interface (predict and fit) and is used to compute the content/no-content prediction for each block.
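To make the division of labour concrete, here is a toy, standard-library-only sketch of the same three-part design. None of these names (ToyBlock, toy_blockify, toy_features, ThresholdModel, ToyExtractor) exist in Dragnet, and the regex blockifier, the two features, and the fixed thresholds are assumptions chosen only for illustration. Plain nested lists stand in for the numpy feature array.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyBlock:
    text: str
    link_chars: int  # number of characters that sit inside <a> tags


def toy_blockify(html):
    """Split HTML on a few block-level tags and return ToyBlock objects."""
    blocks = []
    for chunk in re.split(r"(?i)</?(?:p|div|li|h[1-6])\b[^>]*>", html):
        anchors = re.findall(r"(?is)<a\b[^>]*>(.*?)</a>", chunk)
        link_text = "".join(re.sub(r"<[^>]+>", "", a).strip() for a in anchors)
        text = re.sub(r"<[^>]+>", "", chunk).strip()
        if text:
            blocks.append(ToyBlock(text, len(link_text)))
    return blocks


def toy_features(blocks):
    """Feature extractor: one row per block -> [word count, link density]."""
    return [
        [len(b.text.split()), b.link_chars / max(len(b.text), 1)]
        for b in blocks
    ]


class ThresholdModel:
    """Stand-in for a scikit-learn classifier: exposes fit and predict."""

    def __init__(self, min_words=10, max_link_density=0.5):
        self.min_words = min_words
        self.max_link_density = max_link_density

    def fit(self, X, y):
        return self  # a real model would estimate its parameters here

    def predict(self, X):
        return [
            int(words >= self.min_words and density <= self.max_link_density)
            for words, density in X
        ]


class ToyExtractor:
    """Glue a blockifier, a feature extractor, and a model together."""

    def __init__(self, blockify, features, model):
        self.blockify, self.features, self.model = blockify, features, model

    def extract(self, html):
        blocks = self.blockify(html)
        keep = self.model.predict(self.features(blocks))
        return "\n".join(b.text for b, k in zip(blocks, keep) if k)


html = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>'
    "<p>Dragnet separates the main article text from navigation menus, "
    "headers, footers and other boilerplate found on a page.</p>"
    '<div><a href="/terms">Terms of use</a></div>'
)
extractor = ToyExtractor(toy_blockify, toy_features, ThresholdModel())
result = extractor.extract(html)
print(result)
```

Only the long, link-free paragraph survives; the two link-heavy navigation blocks are classified as no-content. Swapping in a different feature function or model requires no change to ToyExtractor, which is the point of keeping the three pieces separate.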
Training/test data
The training and test data is available at dragnet_data.
Training content extraction models
- Download the training data (see above). In what follows ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
- Create the block corrected files needed to do supervised learning on the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:
  from dragnet.data_processing import extract_all_gold_standard_data
  rootdir = '/path/to/dragnet_data/'
  extract_all_gold_standard_data(rootdir)
  This solves the longest common sub-sequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse a HTML document. In this case, remove the offending document and restart the process.
- Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
  - Whether to use just article content or content and comments.
  - The features to use.
  - The machine learning model to use.
  For example, to train the randomized decision tree classifier from sklearn using the shallow text features from Kohlschuetter et al., the CETR features from Weninger et al., and the readability features:
  from dragnet.extractor import Extractor
  from dragnet.model_training import train_model
  from sklearn.ensemble import ExtraTreesClassifier
  rootdir = '/path/to/dragnet_data/'
  features = ['kohlschuetter', 'weninger', 'readability']
  to_extract = ['content', 'comments'] # or ['content']
  model = ExtraTreesClassifier(
      n_estimators=10,
      max_features=None,
      min_samples_leaf=75
  )
  base_extractor = Extractor(
      features=features,
      to_extract=to_extract,
      model=model
  )
  extractor = train_model(base_extractor, rootdir)
  This trains the model and, if a value is passed to output_dir, writes a pickled version of it along with some block level classification errors to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
- Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_models.
- As a last step, test the performance of the model on the test set (see below).
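The longest-common-subsequence idea used when building the block corrected files can be illustrated with a short, self-contained sketch. It is not Dragnet's implementation; the function name lcs_token_flags and the toy token lists are assumptions for the example. It flags which tokens of the raw page text also occur, in order, in the gold-standard text.

```python
def lcs_token_flags(page_tokens, gold_tokens):
    """Return one boolean per page token: True if the token is part of a
    longest common subsequence shared with the gold-standard tokens."""
    n, m = len(page_tokens), len(gold_tokens)
    # table[i][j] = LCS length of page_tokens[i:] and gold_tokens[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if page_tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover which page tokens matched.
    flags = [False] * n
    i = j = 0
    while i < n and j < m:
        if page_tokens[i] == gold_tokens[j]:
            flags[i] = True
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return flags


page = "Home About The quick brown fox Share this".split()
gold = "The quick brown fox".split()
flags = lcs_token_flags(page, gold)
matched = [tok for tok, keep in zip(page, flags) if keep]
print(matched)  # ['The', 'quick', 'brown', 'fox']
```

A block-level label could then be derived, for instance, from the fraction of a block's tokens that are flagged; the exact rule Dragnet applies is not described in this README.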
Evaluating content extraction models
Use evaluate_model_predictions in model_training to compute the token level accuracy, precision, recall, and F1. For example, to evaluate a trained model run:
from dragnet.compat import train_test_split
from dragnet.data_processing import prepare_all_data
from dragnet.model_training import evaluate_model_predictions
rootdir = '/path/to/dragnet_data/'
data = prepare_all_data(rootdir)
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)
extractor.fit(train_html, train_labels, weights=train_weights)
predictions = extractor.predict(test_html)
scores = evaluate_model_predictions(test_labels, predictions, test_weights)
Note that this is the same evaluation that is run/printed in train_model
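For intuition about what token-level scores mean, the sketch below computes weighted accuracy, precision, recall, and F1 from block-level labels. It is a plain-Python illustration, not the body of evaluate_model_predictions; the function name weighted_scores is hypothetical, and it assumes each weight is the number of tokens in the corresponding block.

```python
def weighted_scores(labels, predictions, weights):
    """Accuracy, precision, recall and F1 where each block counts
    `weight` times (for example, once per token it contains)."""
    tp = fp = fn = tn = 0
    for y, p, w in zip(labels, predictions, weights):
        if y and p:
            tp += w  # content block correctly kept
        elif p:
            fp += w  # boilerplate block wrongly kept
        elif y:
            fn += w  # content block wrongly dropped
        else:
            tn += w  # boilerplate block correctly dropped
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Four blocks: gold labels, predicted labels, and token counts per block.
labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
weights = [50, 10, 5, 35]
scores = weighted_scores(labels, predictions, weights)
print(scores)
```

Weighting by token count means that misclassifying one long article paragraph costs far more than misclassifying a short navigation block, which matches how extraction quality is perceived by a reader.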