Scrapers

A scraper is a class that extends the abstract Scraper class and implements its scrape() method. That method is a generator function yielding Scraped Data, or callables returning such data.

class hircine.scraper.Scraper(comic)

The abstract base class for scrapers.

The following variables must be accessible after the instance is initialized:

Variables:
  • name (str) – The name of the scraper (displayed in the scraper dropdown).

  • source (str) – The data source. Usually a well-defined name.

  • is_available (bool) – Whether this scraper is available for the given comic.

__init__(comic)

Initializes a scraper with the instance of the comic it is scraping.

Parameters:

comic (FullComic) – The comic being scraped.

abstract scrape()

A generator function that yields Scraped Data or a callable returning such data.

A callable may raise the ScrapeWarning exception. This exception will be caught automatically and its message will be collected for display to the user after the scraping process concludes.
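The deferred-callable pattern described above can be sketched without hircine installed. In this minimal sketch, ScrapeWarning is a local stand-in for hircine.scraper.ScrapeWarning, and Rating, parse_rating and collect are hypothetical names invented for illustration. In particular, collect only approximates the collection step that hircine performs itself; it is not hircine's implementation.

```python
from dataclasses import dataclass
from functools import partial


class ScrapeWarning(Exception):
    """Local stand-in for hircine.scraper.ScrapeWarning."""


@dataclass(frozen=True)
class Rating:
    """Hypothetical scraped item; not one of hircine's data types."""

    value: int


def parse_rating(raw):
    # Signal a non-fatal problem when the value is malformed.
    try:
        return Rating(int(raw))
    except ValueError:
        raise ScrapeWarning(f"Could not parse rating: {raw!r}")


def scrape(raw_values):
    # Yield callables rather than data, so parsing is deferred and a
    # failure in one item does not abort the generator.
    for raw in raw_values:
        yield partial(parse_rating, raw)


def collect(generator):
    # ASSUMPTION: a simplified imitation of how warnings are gathered.
    items, warnings = [], []
    for item in generator:
        try:
            items.append(item() if callable(item) else item)
        except ScrapeWarning as warning:
            warnings.append(str(warning))
    return items, warnings


items, warnings = collect(scrape(["5", "five", "3"]))
```

Here the malformed value "five" produces a warning message while the two valid ratings are still collected.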

Exceptions

A scraper may raise two kinds of exceptions:

exception hircine.scraper.ScrapeWarning

An exception signalling a non-fatal error. Its message will be shown to the user once the scraping process concludes.

This is usually raised within a callable yielded by scrape() and should generally only be used to notify the user that a piece of metadata was ignored because it was malformed.

exception hircine.scraper.ScrapeError

An exception signalling a fatal error, stopping the scraping process immediately.

This should only be raised if it is impossible for the scraping process to continue, for example if a file or URL is inaccessible.
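As a rough illustration of the fatal case, the sketch below converts an inaccessible file into a ScrapeError. The ScrapeError class here is a local stand-in for hircine.scraper.ScrapeError, and load_metadata is a hypothetical helper, not part of hircine.

```python
import json


class ScrapeError(Exception):
    """Local stand-in for hircine.scraper.ScrapeError."""


def load_metadata(path):
    # An unreadable file makes further scraping pointless, so the
    # low-level OSError is re-raised as a fatal ScrapeError.
    try:
        with open(path, encoding="utf-8") as file:
            return json.load(file)
    except OSError as error:
        raise ScrapeError(f"Could not read {path}: {error}") from error
```

A malformed individual field, by contrast, would be a better fit for ScrapeWarning, since the rest of the metadata can still be scraped.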

Utility functions

hircine.scraper.utils.open_archive_file(archive, member, check_sidecar=True)

Open a file from an archive as a context manager, for use in a with statement. Yields a file object obtained from:

  1. The archive’s sidecar file, if it exists and check_sidecar is True.

  2. Otherwise, the archive itself.

Parameters:
  • archive (Archive) – The archive.

  • member (str) – The name of the file within the archive (or its sidecar suffix).

  • check_sidecar (bool) – Whether to check for the sidecar file.
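The lookup order above can be sketched with the standard library. This is not hircine's implementation: open_archive_file_sketch is a hypothetical name, it takes a plain path instead of an Archive object, it assumes ZIP archives, and it assumes the sidecar is named "<archive path>.<member>". The real naming scheme may differ.

```python
import os
import tempfile
import zipfile
from contextlib import contextmanager


@contextmanager
def open_archive_file_sketch(archive_path, member, check_sidecar=True):
    # ASSUMPTION: the sidecar sits next to the archive and is named
    # "<archive path>.<member>".
    sidecar = f"{archive_path}.{member}"
    if check_sidecar and os.path.exists(sidecar):
        with open(sidecar, "rb") as file:
            yield file
        return
    # Otherwise fall back to the member inside the archive itself.
    with zipfile.ZipFile(archive_path) as zf, zf.open(member) as file:
        yield file


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = os.path.join(tmp, "comic.zip")
        with zipfile.ZipFile(archive_path, "w") as zf:
            zf.writestr("metadata.json", "from archive")

        # No sidecar exists yet, so the archive member is used.
        with open_archive_file_sketch(archive_path, "metadata.json") as f:
            before = f.read()

        # Once a sidecar exists, it takes precedence.
        with open(f"{archive_path}.metadata.json", "wb") as f:
            f.write(b"from sidecar")
        with open_archive_file_sketch(archive_path, "metadata.json") as f:
            with_sidecar = f.read()

        # check_sidecar=False ignores the sidecar.
        with open_archive_file_sketch(
            archive_path, "metadata.json", check_sidecar=False
        ) as f:
            skipped = f.read()

    return before, with_sidecar, skipped
```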

hircine.scraper.utils.parse_dict(parsers, data)

Return a generator that yields callables, each applying a parser function to its matching input data. parsers and data must both be dictionaries. Parser functions are matched to input data by dictionary key. If a parser's key is not present in data, the parser is ignored.

A key in parsers may map to another dictionary of parsers. In this case, this function will be applied recursively to the matching value in data, which is assumed to be a dictionary as well.

If a parser is matched to a list, one callable is yielded for each list item.

Parameters:
  • parsers (dict) – A mapping of parsers.

  • data (dict) – A mapping of data to be parsed.
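The matching rules above can be illustrated with a small reimplementation. This is only a sketch of the documented behaviour, not hircine's actual code: parse_dict_sketch is a hypothetical name, and plain string functions stand in for real parsers such as Title or Tag.from_string.

```python
from functools import partial


def parse_dict_sketch(parsers, data):
    for key, parser in parsers.items():
        # Parsers whose key is missing from the data are ignored.
        if key not in data:
            continue
        value = data[key]
        if isinstance(parser, dict):
            # Nested parser mapping: recurse into the nested data.
            yield from parse_dict_sketch(parser, value)
        elif isinstance(value, list):
            # One callable per list item.
            for item in value:
                yield partial(parser, item)
        else:
            yield partial(parser, value)


parsers = {
    "title": str.upper,
    "pages": int,  # has no match in the data below, so it is skipped
    "tags": {"misc": str.strip},
}
data = {
    "title": "a title",
    "tags": {"misc": [" horror ", "sci-fi"]},
    "unparsed": "data keys without a parser are never visited",
}

# Calling each yielded callable applies its parser to its value.
results = [parse() for parse in parse_dict_sketch(parsers, data)]
# results == ["A TITLE", "horror", "sci-fi"]
```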

Registering a scraper

To register your class as a scraper, add it to the hircine.scraper entry point group. For example, put the following in your project's pyproject.toml file:

[project.entry-points.'hircine.scraper']
my_scraper = 'myscraper.MyScraper'

Example

import json

from hircine.scraper import Scraper
from hircine.scraper.types import Artist, Character, Tag, Title
from hircine.scraper.utils import open_archive_file, parse_dict


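class MyScraper(Scraper):
    # Shown in the scraper dropdown.
    name = "Example scraper"
    # Identifies the data source.
    source = "example"

    def __init__(self, comic):
        super().__init__(comic)

        self.data = self.load()

        # Only offer this scraper if metadata was found for the comic.
        if self.data:
            self.is_available = True

    def load(self):
        # Read metadata.json from the archive (or its sidecar file).
        # Any failure is treated as "no metadata available".
        try:
            with open_archive_file(self.comic.archive, "metadata.json") as jif:
                return json.load(jif)
        except Exception:
            return {}

    def scrape(self):
        # Map keys in the JSON data to the parsers applied to their values.
        parsers = {
            "title": Title,
            "tags": {
                "artists": Artist,
                "misc": Tag.from_string,
                "characters": Character,
            },
        }

        # parse_dict yields one callable per matched value or list item.
        yield from parse_dict(parsers, self.data)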

The scraper above will scrape a JSON file with the following structure:

{
	"title": "This is a Title",
	"tags": {
		"artists": ["Alan Smithee", "Noah Ward"],
		"characters": ["A", "B", "C"],
		"misc": ["horror", "sci-fi"]
	}
}