API Reference

Diff Types

web-monitoring-diff provides a variety of diff algorithms for use in comparing web content. They all follow a similar standardized signature and return format.

Diff Signatures

All diffs should have parameters named a_<body|text> and b_<body|text> as their first two arguments. These represent the two pieces of content to compare, where a represents the “from” or left-hand side and b represents the “to” or right-hand side of the comparison. The name indicates whether the function takes bytes (a_body/b_body) or a decoded string (a_text/b_text). The web server inspects argument names to determine what to pass to a given diff type.

Additionally, diffs may take several other standardized parameters:

  • a_body, b_body: Raw HTTP response body (bytes), described above.

  • a_text, b_text: Decoded text of HTTP response body (str), described above.

  • a_url, b_url: URL at which the content being diffed is found. (This is useful when content contains location-relative information, like links.)

  • a_headers, b_headers: Dict of HTTP headers.

Finally, some diffs take additional, diff-specific parameters.
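
The name-based dispatch described above can be sketched with the standard library's inspect module. This is only an illustration of the idea, not the server's actual code: example_bytes_diff, example_text_diff, and call_differ are hypothetical names, and the values in available are made up.

```python
import inspect


def example_bytes_diff(a_body, b_body):
    # Hypothetical differ that asks for raw bytes by naming its
    # parameters a_body and b_body.
    return {"diff": len(b_body) - len(a_body)}


def example_text_diff(a_text, b_text, a_url=None, b_url=None):
    # Hypothetical differ that asks for decoded text plus the URLs.
    return {"diff": a_text != b_text, "urls": [a_url, b_url]}


def call_differ(differ, available):
    """Pass a differ only the values whose names appear in its signature."""
    wanted = inspect.signature(differ).parameters
    kwargs = {name: value for name, value in available.items() if name in wanted}
    return differ(**kwargs)


# Everything a caller might know about the two versions (illustrative values).
available = {
    "a_body": b"<p>old</p>",
    "b_body": b"<p>newer</p>",
    "a_text": "<p>old</p>",
    "b_text": "<p>newer</p>",
    "a_url": "https://example.com/v1",
    "b_url": "https://example.com/v2",
    "a_headers": {"Content-Type": "text/html"},
    "b_headers": {"Content-Type": "text/html"},
}

bytes_result = call_differ(example_bytes_diff, available)
text_result = call_differ(example_text_diff, available)
```

Because the caller filters by parameter name, a differ opts in to bytes, text, URLs, or headers simply by how it names its arguments.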

Return Values

All diffs return a dict with a key named "diff". The value of this dict entry varies by diff type, but is usually:

  • An array of changes. Each entry will be a 2-tuple, where the first item is an int representing the type of change (-1 for removal, 0 for unchanged, 1 for addition, or other numbers for diff-specific meanings) and the second item is the data or string that was added/removed/unchanged.

  • A string representing a custom view of the diff, e.g. an HTML document.

  • A bytestring representing a custom binary view of the diff, e.g. an image.

Each diff may add additional, diff-specific keys to the dict. For example, web_monitoring_diff.html_diff_render() includes a "change_count" key indicating how many changes there were, since it’s tough to inspect the HTML of the resulting diff and count yourself.
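
As a minimal sketch of consuming the array-of-changes format, the helpers below rebuild each side of a comparison and count the changed entries. The result dict is hand-written in the documented shape, and rebuild_sides and count_changes are hypothetical helper names, not part of the package.

```python
REMOVED, UNCHANGED, ADDED = -1, 0, 1

# Hand-written sample in the documented shape: a dict with a "diff" key
# holding [change_code, value] entries.
result = {"diff": [[-1, "Delet"], [1, "Add"], [0, "ed Unchanged"]]}


def rebuild_sides(changes):
    """Reassemble the "a" and "b" content from a list of changes."""
    a_parts, b_parts = [], []
    for code, value in changes:
        # Removed and unchanged values belong to the "a" side.
        if code in (REMOVED, UNCHANGED):
            a_parts.append(value)
        # Added and unchanged values belong to the "b" side.
        if code in (ADDED, UNCHANGED):
            b_parts.append(value)
        # Diff-specific codes are skipped; their meaning varies by diff type.
    return "".join(a_parts), "".join(b_parts)


def count_changes(changes):
    """Count every entry that is not marked as unchanged."""
    return sum(1 for code, _value in changes if code != UNCHANGED)


a_side, b_side = rebuild_sides(result["diff"])
changed = count_changes(result["diff"])
```

Here a_side is 'Deleted Unchanged', b_side is 'Added Unchanged', and changed is 2.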

web_monitoring_diff.compare_length(a_body, b_body)

Compute difference in response body lengths. (Does not compare contents.)

web_monitoring_diff.identical_bytes(a_body, b_body)

Compute whether response bodies are exactly identical.

web_monitoring_diff.side_by_side_text(a_text, b_text)

Extract the visible text from both response bodies.

web_monitoring_diff.html_text_diff(a_text, b_text)

Diff the visible textual content of an HTML document.

Examples

>>> html_text_diff('<p>Deleted</p><p>Unchanged</p>',
...                '<p>Added</p><p>Unchanged</p>')
[[-1, 'Delet'], [1, 'Add'], [0, 'ed Unchanged']]

web_monitoring_diff.html_source_diff(a_text, b_text)

Diff the full source code of an HTML document.

Examples

>>> html_source_diff('<p>Deleted</p><p>Unchanged</p>',
...                  '<p>Added</p><p>Unchanged</p>')
[[0, '<p>'], [-1, 'Delet'], [1, 'Add'], [0, 'ed</p><p>Unchanged</p>']]

Extracts all the outgoing links from a page and produces a diff of an HTML document that is simply a list of the text and URL of those links.

It ignores links that merely navigate within the page.

NOTE: this diff currently suffers from the fact that our diff server does not know the original URL of the content, so it can identify:

    <a href="#anchor-in-this-page">Text</a>

as an internal link, but not:

    <a href="http://this.domain.com/this/page#anchor-in-this-page">Text</a>
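
To illustrate why the page URL matters here, the sketch below classifies a link as in-page with and without knowing where the page lives. It is not the library's implementation; is_in_page_link is a hypothetical helper built on urllib.parse.

```python
from urllib.parse import urldefrag, urljoin


def is_in_page_link(href, page_url=None):
    """Return True if href only navigates within the page it appears on."""
    if href.startswith("#"):
        # A bare fragment is recognizable without any other context.
        return True
    if page_url is None:
        # Without the page URL, an absolute link back to the same page
        # cannot be told apart from an outgoing link.
        return False
    # Resolve the link against the page, then compare with fragments removed.
    target, _fragment = urldefrag(urljoin(page_url, href))
    current, _fragment = urldefrag(page_url)
    return target == current


page = "http://this.domain.com/this/page"
absolute = "http://this.domain.com/this/page#anchor-in-this-page"

without_url = is_in_page_link(absolute)
with_url = is_in_page_link(absolute, page_url=page)
```

Without the page URL the absolute link is treated as outgoing (without_url is False); once the URL is supplied, it is recognized as in-page (with_url is True).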

Generate a diff of all outgoing links (see links_diff()) where the diff property is formatted as a list of change codes and values.

Generate a diff of all outgoing links (see links_diff()) where the diff property is an HTML string. Note the actual return type is still JSON.

web_monitoring_diff.html_diff_render(a_text, b_text, a_headers=None, b_headers=None, include='combined', content_type_options='normal', url_rules='jsessionid')

HTML Diff for rendering. This is focused on visually highlighting portions of a page’s text that have been changed. It does not do much to show how node types or attributes have been modified (save for link or image URLs).

The overall page returned primarily represents the structure of the “new” or “B” version. However, it contains some useful metadata in the <head>:

  1. A <template id="wm-diff-old-head"> contains the contents of the “old” or “A” version’s <head>.

  2. A <style id="wm-diff-style"> contains diff-specific styling.

  3. A <meta name="wm-diff-title" content="[diff]"> contains a renderable HTML diff of the page’s <title>. For example:

    The <del>old</del><ins>new</ins> title

NOTE: you may want to be careful with rendering this response as-is; inline <script> and <style> elements may be included twice if they had changes, which could have undesirable runtime effects.
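
One way to read the <head> metadata without rendering the page is to scan it with the standard library's html.parser. The following is a rough sketch: the sample string is hand-written to mimic the structure described above (including the assumption that the title diff is entity-escaped inside the content attribute), and DiffMetadataParser is a hypothetical helper.

```python
from html.parser import HTMLParser


class DiffMetadataParser(HTMLParser):
    """Collect the diff metadata elements from a rendered diff's <head>."""

    def __init__(self):
        super().__init__()
        self.title_diff = None
        self.has_old_head = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "wm-diff-title":
            # HTMLParser unescapes entities in attribute values.
            self.title_diff = attributes.get("content")
        elif tag == "template" and attributes.get("id") == "wm-diff-old-head":
            self.has_old_head = True


# Hand-written stand-in for a rendered diff, not real library output.
sample = (
    '<html><head>'
    '<template id="wm-diff-old-head"><title>The old title</title></template>'
    '<style id="wm-diff-style">ins { background: #cfc; }</style>'
    '<meta name="wm-diff-title" '
    'content="The &lt;del&gt;old&lt;/del&gt;&lt;ins&gt;new&lt;/ins&gt; title">'
    '<title>The new title</title>'
    '</head><body></body></html>'
)

parser = DiffMetadataParser()
parser.feed(sample)
parser.close()
```

After parsing, parser.title_diff holds the renderable title diff and parser.has_old_head reports whether the old <head> template was present.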

Parameters:
a_text : str

Source HTML of one document to compare

b_text : str

Source HTML of the other document to compare

a_headers : dict

Any HTTP headers associated with the a document

b_headers : dict

Any HTTP headers associated with the b document

include : str

Which comparisons to include in output. Options are:

  • combined returns an HTML document with insertions and deletions together.

  • insertions returns an HTML document with only the unchanged text and text inserted in the b document.

  • deletions returns an HTML document with only the unchanged text and text that was deleted from the a document.

  • all returns all of the above documents. You might use this for efficiency – the most expensive part of the diff is only performed once and reused for all three return types.

content_type_options : str

Change how content type detection is handled. It doesn’t make a lot of sense to apply an HTML-focused diffing algorithm to, say, a JPEG image, so this function uses a combination of headers and content sniffing to determine whether a document is not HTML (it’s lenient; if it’s not pretty clear that it’s not HTML, it’ll try and diff). Options are:

  • normal uses the Content-Type header and then falls back to sniffing to determine content type.

  • nocheck ignores the Content-Type header but still sniffs.

  • nosniff uses the Content-Type header but does not sniff.

  • ignore doesn’t do any checking at all.
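
The four options can be pictured as two independent switches: whether to consult the Content-Type header and whether to sniff the content. The sketch below is an assumption-laden illustration of that idea, not the library's detection logic; sketch_is_not_html, the list of media types, and the "%PDF-" sniff are all hypothetical.

```python
# Media type prefixes this sketch treats as clearly not HTML (an assumption).
NON_HTML_PREFIXES = ("image/", "audio/", "video/", "application/pdf")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def sketch_is_not_html(text, headers=None, content_type_options="normal"):
    """Return True only when the content is clearly not HTML."""
    use_header = content_type_options in ("normal", "nosniff")
    use_sniffing = content_type_options in ("normal", "nocheck")

    if use_header and headers:
        lowered = {key.lower(): value for key, value in headers.items()}
        media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
        if media_type.startswith(NON_HTML_PREFIXES):
            return True
        if media_type in HTML_TYPES:
            return False

    # Sniffing: a single, very crude signature check for PDF content.
    if use_sniffing and text.lstrip().startswith("%PDF-"):
        return True

    # Lenient default: when nothing clearly says "not HTML", try to diff.
    return False


pdf_headers = {"Content-Type": "application/pdf"}
by_header = sketch_is_not_html("<p>hi</p>", pdf_headers, "normal")
header_ignored = sketch_is_not_html("<p>hi</p>", pdf_headers, "nocheck")
by_sniffing = sketch_is_not_html("%PDF-1.7 ...", None, "nocheck")
sniffing_off = sketch_is_not_html("%PDF-1.7 ...", None, "nosniff")
```

With ignore, both switches are off and the sketch always returns False, which corresponds to attempting the diff regardless of content type.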

url_rules : str

Use specialized rules for comparing URLs in links, images, etc. Possible values are:

  • jsessionid ignores Java Servlet session IDs in URLs.

  • wayback considers two Wayback Machine links as equivalent if they have the same original URL, regardless of each of their timestamps.

  • wayback_uk is like wayback, but for the UK Web Archive (webarchive.org.uk).

You can also combine multiple comparison rules with a comma, e.g. jsessionid,wayback. Use None or an empty string for exact comparisons. (Default: jsessionid)
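
To give a feel for what such comparison rules do, here is a minimal sketch that normalizes URLs before comparing them. The regular expressions and the helper names (strip_jsessionid, wayback_original, urls_match) are assumptions made for illustration; the library's actual rules may match different URL shapes, and only two of the rules are sketched.

```python
import re

# Assumed URL shapes: ";jsessionid=..." path parameters and
# "web.archive.org/web/<timestamp>/<original URL>" Wayback links.
JSESSIONID = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$")


def strip_jsessionid(url):
    return JSESSIONID.sub("", url)


def wayback_original(url):
    match = WAYBACK.match(url)
    return match.group(1) if match else url


RULES = {"jsessionid": strip_jsessionid, "wayback": wayback_original}


def urls_match(a_url, b_url, url_rules="jsessionid"):
    """Compare two URLs after applying each comma-separated rule in turn."""
    names = [name.strip() for name in (url_rules or "").split(",") if name.strip()]
    for name in names:
        normalize = RULES[name]
        a_url, b_url = normalize(a_url), normalize(b_url)
    return a_url == b_url


session_a = "https://example.com/app;jsessionid=ABC123?page=1"
session_b = "https://example.com/app;jsessionid=XYZ789?page=1"
archive_a = "https://web.archive.org/web/20190101000000/https://example.com/"
archive_b = "https://web.archive.org/web/20200615123000/https://example.com/"

same_session_page = urls_match(session_a, session_b)
exact_only = urls_match(session_a, session_b, url_rules=None)
same_archived_page = urls_match(archive_a, archive_b, url_rules="jsessionid,wayback")
```

Passing None or an empty string applies no rules, so the URLs must match exactly.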

Examples

>>> from web_monitoring_diff import html_diff_render
>>> text1 = '<!DOCTYPE html><html><head></head><body><p>Paragraph</p></body></html>'
>>> text2 = '<!DOCTYPE html><html><head></head><body><h1>Header</h1></body></html>'
>>> test_diff_render = html_diff_render(text1, text2)

Experimental External Diffs

The functions in web_monitoring_diff.experimental wrap diff algorithms available from other repositories that we consider relatively experimental or unproven. They may be new and still have lots of edge cases, may not be publicly available via PyPI or another package server, or may have any number of other issues.

They are not installed by default, so calling them may raise an exception. To install them, use pip:

$ pip install -r requirements-experimental.txt

Experimental modules are typically named by the package they wrap, and can be called with a function named diff. For example:

>>> from web_monitoring_diff.experimental import htmldiffer
>>> htmldiffer.diff("<some>html</some>", "<some>other html</some>")

web_monitoring_diff.experimental.htmldiffer.diff(a_text, b_text)

Wraps the htmldiffer package with the standard arguments and output format used by all diffs in web-monitoring-diff.

htmldiffer is mainly developed as part of Perma CC (https://perma.cc/), a web archival service, and the Harvard Library Innovation Lab. At a high level, it parses the text and tags of a page into one list and uses Python’s built-in difflib.SequenceMatcher to compare them. This contrasts with web_monitoring_diff.html_diff_render, where it is primarily the text of the page being diffed, with additional content from the surrounding tags added in as appropriate (tags there are still kept in order to rebuild the page structure after diffing the text).
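
The general technique (one flat list of tags and words, compared with difflib.SequenceMatcher) can be sketched as follows. This is a simplified stand-in and not htmldiffer's code: the tokenizing regular expression and the tokenize and token_diff helpers are hypothetical, and tokens are joined with spaces purely for display.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of non-space text outside tags.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    return TOKEN.findall(html)


def token_diff(a_text, b_text):
    """Diff two HTML strings as flat lists of tags and words."""
    a_tokens, b_tokens = tokenize(a_text), tokenize(b_text)
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    changes = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op == "equal":
            changes.append((0, " ".join(a_tokens[a_start:a_end])))
            continue
        # "replace" yields both a removal and an addition.
        if a_end > a_start:
            changes.append((-1, " ".join(a_tokens[a_start:a_end])))
        if b_end > b_start:
            changes.append((1, " ".join(b_tokens[b_start:b_end])))
    return changes


changes = token_diff("<some>html</some>", "<some>other html</some>")
# changes == [(0, '<some>'), (1, 'other'), (0, 'html </some>')]
```

Because tags and words share one sequence, a changed tag shows up as a change in its own right rather than only through the text it surrounds.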

While htmldiffer is available on PyPI, the public release hasn’t been updated in quite some time. Its authors recommend installing via git instead of PyPI:

$ pip install git+https://github.com/anastasia/htmldiffer@develop

You can also install all experimental differs with:

$ pip install -r requirements-experimental.txt

NOTE: this differ parses HTML in pure Python and can be very slow when using the standard, CPython interpreter. If you plan to use it in a production or performance-sensitive environment, consider using PyPy or another, more optimized interpreter.

Parameters:
a_text : str

Source HTML of one document to compare

b_text : str

Source HTML of the other document to compare

Returns:
dict

web_monitoring_diff.experimental.htmltreediff.diff(a_text, b_text)

Wraps the htmltreediff package with the standard arguments and output format used by all diffs in web-monitoring-diff.

htmltreediff parses HTML documents into an XML DOM and attempts to diff the document structures, rather than look at streams of tags & text (like htmldiffer) or the readable text content of the HTML (like web_monitoring_diff.html_diff_render). Because of this, it can give extremely accurate and detailed information for documents that are very similar, but its output gets complicated or opaque as the two documents diverge in structure. It can also be very slow.

In practice, we’ve found that many real-world web pages vary their structure enough (over periods as short as a few months) to reduce the value of this diff. It’s best used for narrowly-defined scenarios like:

  • Comparing versions of a page that are very similar, often at very close points in time.

  • Comparing XML structures you can expect to be very similar, like XML API responses, RSS documents, etc.

  • Comparing two documents that were generated from the same template with differing underlying data. (Assuming the template is fairly rigid, and does not leave too much document structure up to the underlying data.)
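
The trade-off behind these scenarios can be seen in a toy structural comparison. The sketch below is deliberately naive and is not htmltreediff's algorithm: tree_changes is a hypothetical helper that walks two xml.etree.ElementTree trees position by position.

```python
import xml.etree.ElementTree as ET


def tree_changes(a_node, b_node, path=""):
    """Walk two element trees in parallel and list the paths that differ."""
    here = f"{path}/{a_node.tag}"
    if a_node.tag != b_node.tag:
        return [f"{here}: tag changed to {b_node.tag}"]
    changes = []
    if (a_node.text or "").strip() != (b_node.text or "").strip():
        changes.append(f"{here}: text changed")
    a_children, b_children = list(a_node), list(b_node)
    if len(a_children) != len(b_children):
        changes.append(f"{here}: child count {len(a_children)} -> {len(b_children)}")
    # Children are paired by position, which is what makes this naive.
    for a_child, b_child in zip(a_children, b_children):
        changes.extend(tree_changes(a_child, b_child, here))
    return changes


a_doc = ET.fromstring(
    "<feed><item><title>Old</title></item><item><title>Same</title></item></feed>"
)
# Same structure, one text edit.
b_doc = ET.fromstring(
    "<feed><item><title>New</title></item><item><title>Same</title></item></feed>"
)
# One extra element inserted at the front shifts every later position.
c_doc = ET.fromstring(
    "<feed><note>Moved</note><item><title>Old</title></item>"
    "<item><title>Same</title></item></feed>"
)

similar = tree_changes(a_doc, b_doc)
diverged = tree_changes(a_doc, c_doc)
```

For the near-identical pair the result pinpoints a single path (similar is ['/feed/item/title: text changed']), while one inserted element in c_doc already produces three reported changes in this naive version. A real tree differ handles such shifts far better, but structural divergence still makes its output harder to read.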

htmltreediff is no longer under active development; we maintain a fork with minimal fixes and Python 3 support. It is not available on PyPI, so you must install via git:

$ pip install git+https://github.com/danielballan/htmltreediff@customize

You can also install all experimental differs with:

$ pip install -r requirements-experimental.txt

Parameters:
a_text : str

Source HTML of one document to compare

b_text : str

Source HTML of the other document to compare

Returns:
dict

Web Server

web_monitoring_diff.server.make_app()

Create and return a Tornado application object that serves diffs.

web_monitoring_diff.server.cli()

Start the diff server from the CLI. This will parse the current process’s arguments, start an event loop, and begin serving.

Exception Classes

class web_monitoring_diff.exceptions.UndecodableContentError

Raised when the content downloaded for diffing could not be decoded.

class web_monitoring_diff.exceptions.UndiffableContentError

Raised when the content provided to a differ is incompatible with the diff algorithm, for example when a PDF is provided to an HTML differ.