dragnet

Just the facts -- web page content extraction


Top Related Projects

newspaper

14,801

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

python-readability

2,849

fast python port of arc90's readability tool, updated to match latest readability.js!

python-goose

4,047

Html Content / Article Extractor, web scrapping lib in Python

Quick Overview

Dragnet is a Python library for extracting content from web pages. It uses machine learning algorithms to identify and extract the main content from HTML pages, separating it from navigation, headers, footers, and other boilerplate elements. Dragnet is particularly useful for web scraping and content analysis tasks.

Pros

Highly accurate content extraction using machine learning techniques
Supports both Python 2 and Python 3
Can be easily integrated into existing web scraping pipelines
Includes pre-trained models for immediate use

Cons

Limited documentation and examples
Requires some understanding of machine learning concepts
May require fine-tuning for specific use cases
Not actively maintained (last update was in 2019)

Code Examples

Basic content extraction:

import requests
from dragnet import extract_content

url = "https://example.com/article"
html = requests.get(url).text
content = extract_content(html)
print(content)

Extracting content and comments:

from dragnet import extract_content_and_comments

html = "<html>...</html>"  # Your HTML string
content, comments = extract_content_and_comments(html)
print("Content:", content)
print("Comments:", comments)

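Using a pickled model (loaded with load_pickled_model, as in the README below):

from dragnet.util import load_pickled_model

html = "<html>...</html>"  # Your HTML string
# Pre-trained model file name as given in the README
model = load_pickled_model('kohlschuetter_readability_weninger_content_model.pkl.gz')
content = model.extract(html)
print(content)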

Getting Started

To get started with Dragnet, follow these steps:

Install Dragnet using pip:
```
pip install dragnet
```

Import and use the library in your Python script:

from dragnet import extract_content

html = "<html>...</html>"  # Your HTML string
content = extract_content(html)
print(content)

Note: Dragnet requires some additional dependencies, including lxml and scikit-learn. Make sure to install these if you encounter any import errors.
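
If an import error does come up, it can help to see which dependencies are missing before reinstalling anything. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the module list simply mirrors the dependencies named in this document.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the dependencies mentioned in this document
# (scikit-learn is imported as "sklearn").
required = ["numpy", "lxml", "sklearn"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```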

Competitor Comparisons

newspaper

14,801

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:

Pros of newspaper

More comprehensive feature set, including article extraction, keyword extraction, and summarization
Better documentation and examples for ease of use
Actively maintained with regular updates and bug fixes

Cons of newspaper

May be slower for large-scale processing compared to Dragnet
Less focused on content extraction, potentially less accurate in some cases
Requires more dependencies, which can increase complexity in some environments

Code Comparison

newspaper:

from newspaper import Article

url = 'http://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.text)

Dragnet:

import requests
from dragnet import extract_content

html = requests.get('http://example.com/article').text
content = extract_content(html)

print(content)

Both libraries aim to extract content from web pages, but newspaper offers a more comprehensive set of features beyond just content extraction. Dragnet focuses specifically on content extraction and may be more suitable for large-scale processing tasks. The code examples show that both are simple to use: newspaper downloads and parses the article itself, while Dragnet operates on HTML that you fetch separately (here with requests).

python-readability

2,849

fast python port of arc90's readability tool, updated to match latest readability.js!

Pros of python-readability

Simpler implementation, easier to understand and modify
Focuses specifically on extracting main content from web pages
Lightweight with fewer dependencies

Cons of python-readability

Less accurate on complex web layouts
Limited to main-content extraction, with no counterpart to dragnet's optional comment extraction
May struggle with non-standard HTML structures

Code Comparison

python-readability:

from readability import Document
import requests

response = requests.get('http://example.com')
doc = Document(response.text)
print(doc.summary())

dragnet:

import dragnet
from urllib.request import urlopen

content = urlopen('http://example.com').read()
extracted_content = dragnet.extract_content(content)
print(extracted_content)

Both libraries aim to extract main content from web pages, but dragnet offers more advanced features and potentially better accuracy, especially for complex layouts. python-readability is simpler and more focused on basic content extraction, making it easier to use for straightforward tasks. The choice between the two depends on the specific requirements of your project and the complexity of the web pages you're working with.

python-goose

4,047

Html Content / Article Extractor, web scrapping lib in Python

Pros of Goose

More actively maintained with recent updates
Supports multiple languages beyond English
Includes additional features like image extraction

Cons of Goose

Less focused on content extraction accuracy
Requires more dependencies and setup
May be slower for large-scale processing

Code Comparison

Dragnet example:

from dragnet import extract_content

html = "<html>...</html>"  # Your HTML string
content = extract_content(html)

Goose example:

from goose3 import Goose
g = Goose()
article = g.extract(url='http://example.com')
content = article.cleaned_text

Both libraries aim to extract main content from web pages, but Dragnet focuses on accuracy and speed for content extraction, while Goose offers a broader range of features. Dragnet's API is simpler, requiring fewer lines of code for basic content extraction. Goose provides more options and flexibility, but at the cost of a slightly more complex setup and usage.

Dragnet is better suited for large-scale content extraction tasks where speed and accuracy are crucial. Goose is more appropriate for projects requiring additional features like image extraction or multi-language support, and where processing speed is less critical.

README

Dragnet

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a web page. It's interested in... 'just the facts.' The machine learning models in Dragnet extract the main article content and optionally user generated comments from a web page. They provide state of the art performance on a variety of test benchmarks.

For more information on our approach check out:

Our paper Content Extraction Using Diverse Feature Sets, published at WWW in 2013, gives an overview of the machine learning approach.
A comparison of Dragnet and alternate content extraction packages.
This blog post explains the intuition behind the algorithms.

This project was originally inspired by Kohlschütter et al., Boilerplate Detection using Shallow Text Features, and Weninger et al., CETR -- Content Extraction with Tag Ratios, and more recently by Readability.

GETTING STARTED

Depending on your use case, we provide two separate functions to extract just the main article content or the content and any user generated comments. Each function takes an HTML string and returns the content string.

import requests
from dragnet import extract_content, extract_content_and_comments

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = extract_content(r.content)

# get article and comments
content_comments = extract_content_and_comments(r.content)

We also provide a sklearn-style extractor class (complete with fit and predict methods). You can either train an extractor yourself, or load a pre-trained one:

from dragnet.util import load_pickled_model

content_extractor = load_pickled_model(
    'kohlschuetter_readability_weninger_content_model.pkl.gz')
content_comments_extractor = load_pickled_model(
    'kohlschuetter_readability_weninger_comments_content_model.pkl.gz')

content = content_extractor.extract(r.content)
content_comments = content_comments_extractor.extract(r.content)

A note about encoding

If you know the encoding of the document (e.g. from HTTP headers), you can pass it down to the parser:

content = content_extractor.extract(html_string, encoding='utf-8')

Otherwise, we try to guess the encoding from a meta tag or specified <?xml encoding=".."?> tag. If that fails, we assume "UTF-8".
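
One way to obtain the encoding from HTTP headers is to read the charset parameter of the Content-Type header. The following is a minimal sketch using only the standard library; the helper name charset_from_content_type is hypothetical and not part of Dragnet, and the commented usage assumes the r and content_extractor objects from the examples above.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))

# Hypothetical usage with the objects from the README examples above:
#   encoding = charset_from_content_type(r.headers.get("Content-Type", "text/html"))
#   if encoding:
#       content = content_extractor.extract(r.content, encoding=encoding)
#   else:
#       content = content_extractor.extract(r.content)  # let the parser guess
```

Returning None when no charset is declared leaves the fallback guessing described above in place instead of forcing a default.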

Installing

Dragnet is written in Python (developed with 2.7, with support recently added for 3) and built on the numpy/scipy/Cython numerical computing environment. In addition we use lxml (libxml2) for HTML parsing.

We recommend installing from the master branch to ensure you have the latest version.

Installing with Docker:

This is the easiest method to install Dragnet and builds a Docker container with Dragnet and its dependencies.

Install Docker.
Clone the master branch: git clone https://github.com/dragnet-org/dragnet.git
Build the docker container: docker build -t dragnet .
Run the tests: docker run dragnet make test

You can also run an interactive Python session:

docker run -ti dragnet python3

Installing without Docker

Install the dependencies needed for Dragnet. The build depends on GCC, numpy, Cython and lxml (which in turn depends on libxml2). We use provision.sh to set up the dependencies in the Docker container, so you can use it as a template and modify it as appropriate for your operating system.
Clone the master branch: git clone https://github.com/dragnet-org/dragnet.git
Install the requirements: cd dragnet; pip install -r requirements.txt
Build dragnet:

$ cd dragnet
$ make install
# these should now pass
$ make test

Contributing

We love contributions! Open an issue, or fork/create a pull request.

More details about the code structure

The Extractor class encapsulates a blockifier, some feature extractors and a machine learning model.

A blockifier implements blockify, which takes an HTML string and returns a list of block objects. A feature extractor is a callable that takes a list of blocks and returns a numpy array of features with shape (len(blocks), nfeatures). There is some additional optional functionality to "train" the feature (e.g. estimate parameters needed for centering) specified in features.py. The machine learning model implements the scikit-learn interface (predict and fit) and is used to compute the content/no-content prediction for each block.
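
To make the three roles concrete, here is a toy sketch of the same blockify, featurize, predict flow using only the standard library. It is not Dragnet's implementation: ToyBlockifier, the two features (word count and link density) and ThresholdModel are all hypothetical stand-ins, and plain lists replace the numpy array.

```python
from html.parser import HTMLParser


class ToyBlockifier(HTMLParser):
    """Splits HTML into text blocks at block-level tags (hypothetical stand-in)."""
    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks
        self._text = []         # text pieces of the block being built
        self._link_text = []    # text pieces found inside <a> tags
        self._in_link = False

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join(" ".join(self._text).split())
        if text:
            link_text = " ".join(" ".join(self._link_text).split())
            self.blocks.append({"text": text, "link_text": link_text})
        self._text, self._link_text = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_text.append(data)


def blockify(html):
    """HTML string -> list of blocks."""
    parser = ToyBlockifier()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def features(blocks):
    """List of blocks -> one row per block: [word count, link density]."""
    rows = []
    for block in blocks:
        n_words = len(block["text"].split())
        n_link_words = len(block["link_text"].split())
        rows.append([n_words, n_link_words / n_words])
    return rows


class ThresholdModel:
    """Stand-in for a trained classifier: content = longer block with few links."""

    def predict(self, rows):
        return [int(n_words >= 8 and density < 0.5) for n_words, density in rows]


def extract(html):
    blocks = blockify(html)
    labels = ThresholdModel().predict(features(blocks))
    return "\n".join(b["text"] for b, keep in zip(blocks, labels) if keep)


sample = (
    "<div><a href='/'>Home</a> <a href='/about'>About</a></div>"
    "<p>Dragnet separates the main article text from navigation and other boilerplate.</p>"
    "<div><a href='/terms'>Terms of use</a></div>"
)
result = extract(sample)
print(result)
```

In Dragnet the hand-written threshold is replaced by a model fitted on labelled blocks, which is what the training sections below describe.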

Training/test data

The training and test data is available at dragnet_data.

Training content extraction models

Download the training data (see above). In what follows ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
Create the block corrected files needed to do supervised learning on the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:
```
from dragnet.data_processing import extract_all_gold_standard_data
rootdir = '/path/to/dragnet_data/'
extract_all_gold_standard_data(rootdir)
```
This solves the longest common subsequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse an HTML document. In this case, remove the offending document and restart the process.
Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
- Whether to use just article content or content and comments.
- The features to use
- The machine learning model to use
For example, to train the randomized decision tree classifier from sklearn using the shallow text features from Kohlschuetter et al., the CETR features from Weninger et al., and the Readability-inspired features:
```
from dragnet.extractor import Extractor
from dragnet.model_training import train_model
from sklearn.ensemble import ExtraTreesClassifier

rootdir = '/path/to/dragnet_data/'

features = ['kohlschuetter', 'weninger', 'readability']

to_extract = ['content', 'comments']   # or ['content']

model = ExtraTreesClassifier(
    n_estimators=10,
    max_features=None,
    min_samples_leaf=75
)
base_extractor = Extractor(
    features=features,
    to_extract=to_extract,
    model=model
)

extractor = train_model(base_extractor, rootdir)
```
This trains the model and, if a value is passed to output_dir, writes a pickled version of it along with some block-level classification errors to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_models.
As a last step, test the performance of the model on the test set (see below).
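
The block-corrected step above relies on a longest common subsequence (LCS) between the page's tokens and the gold-standard tokens. The sketch below illustrates the idea with a textbook dynamic-programming LCS; it is not the code in dragnet.data_processing, and the function names, the sample tokens and the "more than half the tokens matched" labelling rule are assumptions made for illustration.

```python
def lcs_mask(page_tokens, gold_tokens):
    """Return booleans marking which page tokens belong to one LCS with gold_tokens."""
    n, m = len(page_tokens), len(gold_tokens)
    # table[i][j] = LCS length of page_tokens[i:] and gold_tokens[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if page_tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    mask = [False] * n
    i = j = 0
    while i < n and j < m:
        if page_tokens[i] == gold_tokens[j]:
            mask[i] = True
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask


def label_blocks(blocks, gold_tokens):
    """Label a block 1 if more than half of its tokens are matched (assumed rule)."""
    flat = [token for block in blocks for token in block]
    mask = lcs_mask(flat, gold_tokens)
    labels, start = [], 0
    for block in blocks:
        matched = sum(mask[start:start + len(block)])
        labels.append(int(matched > len(block) / 2))
        start += len(block)
    return labels


blocks = [["Home", "About"], ["Dragnet", "extracts", "content"], ["Terms"]]
gold = ["Dragnet", "extracts", "content"]
labels = label_blocks(blocks, gold)
print(labels)
```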

Evaluating content extraction models

Use evaluate_model_predictions in model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model run:

from dragnet.compat import train_test_split
from dragnet.data_processing import prepare_all_data
from dragnet.model_training import evaluate_model_predictions

rootdir = '/path/to/dragnet_data/'
data = prepare_all_data(rootdir)
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)

extractor.fit(train_html, train_labels, weights=train_weights)
predictions = extractor.predict(test_html)
scores = evaluate_model_predictions(test_labels, predictions, test_weights)

Note that this is the same evaluation that is run/printed in train_model.
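
For intuition about what token-level scores mean, the sketch below computes weighted accuracy, precision, recall and F1 by hand. It assumes each block's weight is its token count, so a misclassified long block costs more than a short one; the function weighted_scores and the sample numbers are hypothetical and are not Dragnet's own evaluation code.

```python
def weighted_scores(y_true, y_pred, weights):
    """Weighted binary classification scores; weights are assumed to be token counts."""
    tp = fp = fn = tn = 0
    for truth, pred, weight in zip(y_true, y_pred, weights):
        if truth == 1 and pred == 1:
            tp += weight
        elif truth == 0 and pred == 1:
            fp += weight
        elif truth == 1 and pred == 0:
            fn += weight
        else:
            tn += weight

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Four blocks: true labels, predicted labels, and tokens per block (made-up values).
example_scores = weighted_scores(
    y_true=[0, 1, 1, 0],
    y_pred=[0, 1, 0, 1],
    weights=[5, 40, 10, 5],
)
for name, value in example_scores.items():
    print(f"{name}: {value:.3f}")
```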
