archivebox.extractors package
Submodules
archivebox.extractors.archive_org module
- archivebox.extractors.archive_org.should_save_archive_dot_org(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.archive_org.save_archive_dot_org(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Submit the site to archive.org for archiving via their service and save the returned archive URL.
archivebox.extractors.dom module
- archivebox.extractors.dom.should_save_dom(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.dom.save_dom(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Print the HTML of the site to a file using chrome --dump-dom.
archivebox.extractors.favicon module
- archivebox.extractors.favicon.should_save_favicon(link: Link, out_dir: str | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.favicon.save_favicon(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Download the site's favicon from Google's favicon API.
archivebox.extractors.git module
- archivebox.extractors.git.should_save_git(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.git.save_git(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Download the full site using git.
archivebox.extractors.media module
- archivebox.extractors.media.should_save_media(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.media.save_media(link: Link, out_dir: Path | None = None, timeout: int = 3600) → ArchiveResult
Download playlists or individual video, audio, and subtitles using youtube-dl or yt-dlp.
archivebox.extractors.pdf module
- archivebox.extractors.pdf.should_save_pdf(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.pdf.save_pdf(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Print a PDF of the site to a file using chrome --headless.
archivebox.extractors.screenshot module
- archivebox.extractors.screenshot.should_save_screenshot(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.screenshot.save_screenshot(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Take a screenshot of the site using chrome --headless.
archivebox.extractors.title module
- class archivebox.extractors.title.TitleParser(*args, **kwargs)
Bases: HTMLParser
- property title
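The listing above only says that TitleParser subclasses HTMLParser and exposes a title property. As a rough illustration of that pattern, here is a minimal sketch built on the standard library's html.parser. The class name SketchTitleParser and its internals are assumptions made for this example; they are not ArchiveBox's implementation.

```python
from html.parser import HTMLParser


class SketchTitleParser(HTMLParser):
    """Hypothetical stand-in that collects the text of the first <title> tag."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._chunks = []            # text pieces seen inside <title>
        self._inside_title = False   # True while between <title> and </title>

    @property
    def title(self):
        # Join the collected pieces; report None when nothing was found.
        text = "".join(self._chunks).strip()
        return text or None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag.lower() == "title" and not self._chunks:
            self._inside_title = True

    def handle_data(self, data):
        if self._inside_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag.lower() == "title":
            self._inside_title = False


parser = SketchTitleParser()
parser.feed("<html><head><title> Example Domain </title></head><body></body></html>")
print(parser.title)
```

Exposing the result as a property lets a caller feed HTML in chunks and read the title at any point, which may be why the documented class takes the same shape.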
- archivebox.extractors.title.get_html(link: Link, path: Path, timeout: int = 60) → str
Try to find the wget, singlefile, and then dom output files, in that order. If none is found, download the URL again.
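The lookup order described for get_html is a "first existing file wins, otherwise refetch" fallback. The sketch below shows that general pattern only. The helper name first_available_html, the file names in CANDIDATE_FILES, and the download callback are all assumptions introduced for illustration; the real function takes a Link and a timeout and may locate its files differently.

```python
import tempfile
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical file names in the documented order (wget, singlefile, dom).
# The actual names inside a snapshot directory may differ.
CANDIDATE_FILES = ("wget.html", "singlefile.html", "output.html")


def first_available_html(
    snapshot_dir: Path,
    download: Callable[[], str],
    candidates: Sequence[str] = CANDIDATE_FILES,
) -> str:
    """Return the first readable, non-empty candidate file, else call download()."""
    for name in candidates:
        candidate = snapshot_dir / name
        if not candidate.is_file():
            continue
        try:
            text = candidate.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: fall through to the next candidate
        if text.strip():
            return text
    # Nothing usable on disk, so fetch the page again.
    return download()


with tempfile.TemporaryDirectory() as tmp:
    snapshot_dir = Path(tmp)

    def refetch() -> str:
        # Stand-in for a real network request.
        return "<html>fresh download</html>"

    from_fallback = first_available_html(snapshot_dir, refetch)
    (snapshot_dir / "singlefile.html").write_text("<html>saved copy</html>", encoding="utf-8")
    from_disk = first_available_html(snapshot_dir, refetch)

print(from_fallback)
print(from_disk)
```

Passing the download step in as a callable keeps the sketch free of network access and makes the fallback easy to exercise in isolation.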
- archivebox.extractors.title.should_save_title(link: Link, out_dir: str | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.title.save_title(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Try to guess the page's title from its content.
archivebox.extractors.wget module
- archivebox.extractors.wget.should_save_wget(link: Link, out_dir: Path | None = None, overwrite: bool | None = False) → bool
- archivebox.extractors.wget.save_wget(link: Link, out_dir: Path | None = None, timeout: int = 60) → ArchiveResult
Download the full site using wget.
Module contents
- archivebox.extractors.get_default_archive_methods() → List[tuple[str, Callable[[Link, Path | None, bool | None], bool], Callable[[Link, Path | None, int], ArchiveResult]]]
- archivebox.extractors.get_archive_methods_for_link(link: Link) → Iterable[tuple[str, Callable[[Link, Path | None, bool | None], bool], Callable[[Link, Path | None, int], ArchiveResult]]]
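Both functions return (name, should_save, save) tuples whose callables match the should_save_* and save_* signatures listed for each submodule. The sketch below shows one way such a list could be consumed. It is an assumption about usage, not ArchiveBox's own loop: Link is replaced by a plain string, ArchiveResult by a dict, and run_methods plus the demo callables are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# Simplified stand-ins for ArchiveBox's Link and ArchiveResult types.
Link = str
Result = Dict[str, str]

ShouldSave = Callable[[Link, Optional[Path], Optional[bool]], bool]
Save = Callable[[Link, Optional[Path], int], Result]
ArchiveMethod = Tuple[str, ShouldSave, Save]


def should_save_demo(link: Link, out_dir: Optional[Path] = None,
                     overwrite: Optional[bool] = False) -> bool:
    # Hypothetical rule: always run when overwriting, else only for https links.
    return True if overwrite else link.startswith("https://")


def never_save(link: Link, out_dir: Optional[Path] = None,
               overwrite: Optional[bool] = False) -> bool:
    return False


def save_demo(link: Link, out_dir: Optional[Path] = None, timeout: int = 60) -> Result:
    # A real extractor would write files under out_dir and honor the timeout.
    return {"status": "succeeded", "output": f"saved {link}"}


def run_methods(link: Link, methods: List[ArchiveMethod],
                out_dir: Optional[Path] = None, overwrite: bool = False,
                timeout: int = 60) -> Dict[str, Result]:
    """Call each save function whose should_save check passes."""
    results: Dict[str, Result] = {}
    for name, should_save, save in methods:
        if should_save(link, out_dir, overwrite):
            results[name] = save(link, out_dir, timeout)
    return results


methods: List[ArchiveMethod] = [
    ("demo", should_save_demo, save_demo),
    ("skipped", never_save, save_demo),
]
results = run_methods("https://example.com", methods)
print(sorted(results))
```

Keeping the check and the action as separate callables lets a caller skip work that is already done without each save function having to repeat that logic.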