Scrapers¶
A scraper extends the abstract Scraper class and implements its scrape() method. The latter is a generator function yielding Scraped Data.
- class hircine.scraper.Scraper(comic)¶
The abstract base class for scrapers.
The following variables must be accessible after the instance is initialized:
- Variables:
- __init__(comic)¶
Initializes a scraper with the instance of the comic it is scraping.
- Parameters:
comic (FullComic) – The comic being scraped.
- abstract scrape()¶
A generator function that yields Scraped Data or a callable returning such data.
A callable may raise the ScrapeWarning exception. This exception is caught automatically, and its message is collected for display to the user after the scraping process concludes.
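The following is a minimal, self-contained sketch of the pattern described above. It does not import hircine: ScrapeWarning and Title are local stand-ins, scrape() is written as a plain function rather than a method, and parse_title() and collect() are hypothetical helpers that only illustrate how yielded callables might be resolved. hircine's own collection logic may differ.

```python
from dataclasses import dataclass


class ScrapeWarning(Exception):
    """Local stand-in for hircine.scraper.ScrapeWarning."""


@dataclass(frozen=True)
class Title:
    """Local stand-in for a scraped data type."""

    value: str


def parse_title(raw):
    # Hypothetical parser: reject malformed input with a non-fatal warning.
    if not raw.strip():
        raise ScrapeWarning("Ignored empty title")
    return Title(raw.strip())


def scrape(data):
    # Scraped data may be yielded directly...
    yield Title("Direct title")
    # ...or as a callable that produces the data when it is called.
    for raw in data:
        yield lambda raw=raw: parse_title(raw)


def collect(generator):
    # Hypothetical consumer: call each callable and collect warning messages.
    results, warnings = [], []
    for item in generator:
        try:
            results.append(item() if callable(item) else item)
        except ScrapeWarning as warning:
            warnings.append(str(warning))
    return results, warnings


results, warnings = collect(scrape(["  Good title ", "   "]))
```

Deferring the parsing into a callable means a malformed value only costs that one item: the warning is recorded and the remaining items are still processed.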
Exceptions¶
A scraper may raise two kinds of exceptions:
- exception hircine.scraper.ScrapeWarning¶
An exception signalling a non-fatal error. Its message will be shown to the user once the scraping process concludes.
This is usually raised within a callable yielded by scrape() and should generally only be used to notify the user that a piece of metadata was ignored because it was malformed.
- exception hircine.scraper.ScrapeError¶
An exception signalling a fatal error, stopping the scraping process immediately.
This should only be raised if it is impossible for the scraping process to continue, for example if a file or URL is inaccessible.
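As a rough illustration of when each exception fits, the sketch below uses local stand-ins for both exception classes so that it runs without hircine installed. load_metadata() and parse_page_count() are hypothetical helpers, not part of the hircine API.

```python
import json
import os
import tempfile


class ScrapeError(Exception):
    """Local stand-in for hircine.scraper.ScrapeError."""


class ScrapeWarning(Exception):
    """Local stand-in for hircine.scraper.ScrapeWarning."""


def load_metadata(path):
    # Fatal: without the file there is nothing to scrape.
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except OSError as error:
        raise ScrapeError(f"Could not open {path}: {error}") from error


def parse_page_count(raw):
    # Non-fatal: only this one field is dropped.
    try:
        return int(raw)
    except (TypeError, ValueError):
        raise ScrapeWarning(f"Ignored malformed page count: {raw!r}") from None


# A path inside a directory that is assumed not to exist.
missing = os.path.join(
    tempfile.gettempdir(), "hircine-example-missing", "metadata.json"
)
try:
    load_metadata(missing)
    outcome = "loaded"
except ScrapeError:
    outcome = "fatal"
```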
Utility functions¶
- hircine.scraper.utils.open_archive_file(archive, member, check_sidecar=True)¶
Open an archive file for use with the with statement. Yields a file object obtained from:
- The archive’s sidecar file, if it exists and check_sidecar is True.
- Otherwise, the archive itself.
- hircine.scraper.utils.parse_dict(parsers, data)¶
Make a generator that yields callables applying parser functions to their matching input data. parsers and data must both be dictionaries. Parser functions are matched to input data using their dictionary keys. If a parser’s key is not present in data, it is ignored.
A key in parsers may map to another dictionary of parsers. In this case, this function will be applied recursively to the matching value in data, which is assumed to be a dictionary as well.
If a parser is matched to a list type, one callable for each list item is yielded.
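The matching rules above can be approximated with the simplified re-implementation below. It is an illustrative sketch written for this page, not hircine's actual code: parse_dict_sketch is a hypothetical name, and ordinary string methods stand in for real parser functions.

```python
from functools import partial


def parse_dict_sketch(parsers, data):
    for key, parser in parsers.items():
        if key not in data:
            # A parser without matching input data is ignored.
            continue
        value = data[key]
        if isinstance(parser, dict):
            # Nested parsers: recurse into the matching nested dictionary.
            yield from parse_dict_sketch(parser, value)
        elif isinstance(value, list):
            # List input: one callable per list item.
            for item in value:
                yield partial(parser, item)
        else:
            yield partial(parser, value)


parsers = {
    "title": str.upper,
    "tags": {"misc": str.strip, "artists": str.title},
    "missing": int,
}
data = {"title": "a title", "tags": {"misc": [" horror ", "sci-fi"]}}

# Nothing is parsed until each yielded callable is invoked.
results = [call() for call in parse_dict_sketch(parsers, data)]
```

Because the generator yields callables instead of parsed values, a parser that raises only affects its own item when it is eventually called.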
Registering a scraper¶
To register your class as a scraper, place it into the hircine.scraper entry point group. For example, put the following in a pyproject.toml file:
[project.entry-points.'hircine.scraper']
my_scraper = 'myscraper.MyScraper'
Example¶
import json

from hircine.scraper import Scraper
from hircine.scraper.types import Artist, Character, Tag, Title
from hircine.scraper.utils import open_archive_file, parse_dict


class MyScraper(Scraper):
    name = "Example scraper"
    source = "example"

    def __init__(self, comic):
        super().__init__(comic)

        self.data = self.load()

        if self.data:
            self.is_available = True

    def load(self):
        try:
            with open_archive_file(self.comic.archive, "metadata.json") as jif:
                return json.load(jif)
        except Exception:
            return {}

    def scrape(self):
        parsers = {
            "title": Title,
            "tags": {
                "artists": Artist,
                "misc": Tag.from_string,
                "characters": Character,
            },
        }

        yield from parse_dict(parsers, self.data)
The scraper above will scrape a JSON file with the following structure:
{
    "title": "This is a Title",
    "tags": {
        "artists": ["Alan Smithee", "Noah Ward"],
        "characters": ["A", "B", "C"],
        "misc": ["horror", "sci-fi"]
    }
}