archivebox.index package
Submodules
archivebox.index.csv module
archivebox.index.html module
- archivebox.index.html.parse_html_main_index(out_dir: Path = OUTPUT_DIR) → Iterator[str] [source]
parse an archive index html file and return the list of urls
- archivebox.index.html.main_index_template(links: List[Link], template: str = 'static_index.html') → str [source]
render the template for the entire main index
- archivebox.index.html.write_html_link_details(link: Link, out_dir: str | None = None) → None [source]
archivebox.index.json module
- archivebox.index.json.generate_json_index_from_links(links: List[Link], with_headers: bool)[source]
- archivebox.index.json.parse_json_main_index(out_dir: Path = OUTPUT_DIR) → Iterator[Link] [source]
parse an archive index json file and return the list of links
- archivebox.index.json.write_json_link_details(link: Link, out_dir: str | None = None) → None [source]
write a json file with some info about the link
- archivebox.index.json.parse_json_link_details(out_dir: Path | str, guess: bool | None = False) → Link | None [source]
load the json link index from a given directory
- archivebox.index.json.parse_json_links_details(out_dir: Path | str) → Iterator[Link] [source]
read through all the archive data folders and return the parsed links
- class archivebox.index.json.ExtendedEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]
Bases: JSONEncoder
Extended json serializer that supports serializing several model fields and objects
- default(obj)[source]
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError). For example, to support arbitrary iterators, you could implement default like this:

    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
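In practice, such an encoder is passed to json.dumps via the cls argument. A minimal sketch, using a hypothetical DemoEncoder stand-in (not the real ExtendedEncoder, which handles more model fields) to serialize datetimes and sets:

```python
import json
from datetime import datetime

class DemoEncoder(json.JSONEncoder):
    """Hypothetical stand-in for ExtendedEncoder: handles a couple
    of types the base JSONEncoder rejects."""
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        if isinstance(o, (set, frozenset)):
            return sorted(o)  # sets become sorted lists for stable output
        return super().default(o)

print(json.dumps({"tags": {"b", "a"}}, cls=DemoEncoder))
# {"tags": ["a", "b"]}
```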
archivebox.index.schema module
WARNING: THIS FILE IS ALL LEGACY CODE TO BE REMOVED.
DO NOT ADD ANY NEW FEATURES TO THIS FILE, NEW CODE GOES HERE: core/models.py
- class archivebox.index.schema.ArchiveResult(cmd: List[str], pwd: str | None, cmd_version: str | None, output: str | Exception | None, status: str, start_ts: datetime.datetime, end_ts: datetime.datetime, index_texts: List[str] | None = None, schema: str = 'ArchiveResult')[source]
Bases: object
- cmd: List[str]
- pwd: str | None
- cmd_version: str | None
- output: str | Exception | None
- status: str
- start_ts: datetime
- end_ts: datetime
- index_texts: List[str] | None = None
- schema: str = 'ArchiveResult'
- property duration: int
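The duration property presumably derives from the two timestamps above. A sketch of that calculation under that assumption (the timestamps here are hypothetical; the actual property may round or clamp differently):

```python
from datetime import datetime, timedelta

# hypothetical start/end timestamps for an ArchiveResult
start_ts = datetime(2024, 1, 1, 12, 0, 0)
end_ts = start_ts + timedelta(seconds=42)

# duration as whole seconds between end_ts and start_ts
duration = int((end_ts - start_ts).total_seconds())
print(duration)  # 42
```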
- class archivebox.index.schema.Link(timestamp: str, url: str, title: str | None, tags: str | None, sources: List[str], history: Dict[str, List[ArchiveResult]] = <factory>, updated: datetime.datetime | None = None, schema: str = 'Link')[source]
Bases: object
- timestamp: str
- url: str
- title: str | None
- tags: str | None
- sources: List[str]
- history: Dict[str, List[ArchiveResult]]
- updated: datetime | None = None
- schema: str = 'Link'
- snapshot_id
- property link_dir: str
- property archive_path: str
- property archive_size: float
- property url_hash
- property scheme: str
- property extension: str
- property domain: str
- property path: str
- property basename: str
- property base_url: str
- property bookmarked_date: str | None
- property updated_date: str | None
- property archive_dates: List[datetime]
- property oldest_archive_date: datetime | None
- property newest_archive_date: datetime | None
- property num_outputs: int
- property num_failures: int
- property is_static: bool
- property is_archived: bool
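Several of the URL-derived properties above (scheme, domain, path, extension, base_url) can be understood as stdlib URL parsing. A sketch of one plausible derivation — the URL is a made-up example and the real Link implementation may normalize differently:

```python
from urllib.parse import urlparse

url = 'https://example.com/blog/post.html'   # hypothetical snapshot URL
parts = urlparse(url)

scheme = parts.scheme                         # 'https'
domain = parts.netloc                         # 'example.com'
path = parts.path                             # '/blog/post.html'
extension = path.rsplit('.', 1)[-1]           # 'html'
base_url = f'{domain}{path}'                  # scheme-less form, one possible convention
```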
archivebox.index.sql module
- archivebox.index.sql.parse_sql_main_index(out_dir: Path = OUTPUT_DIR) → Iterator[Link] [source]
- archivebox.index.sql.remove_from_sql_main_index(snapshots: QuerySet, atomic: bool = False, out_dir: Path = OUTPUT_DIR) → None [source]
- archivebox.index.sql.write_sql_main_index(links: List[Link], out_dir: Path = OUTPUT_DIR) → None [source]
- archivebox.index.sql.write_sql_link_details(link: Link, out_dir: Path = OUTPUT_DIR) → None [source]
- archivebox.index.sql.list_migrations(out_dir: Path = OUTPUT_DIR) → List[Tuple[bool, str]] [source]
Module contents
- archivebox.index.merge_links(a: Link, b: Link) → Link [source]
deterministically merge two links, favoring longer field values over shorter, and “cleaner” values over worse ones.
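The "favor longer field values" rule can be sketched for a single field like so — a hypothetical helper for illustration, not the actual merge code:

```python
def prefer_longer(a, b):
    """Pick the longer (presumably more complete) of two field values;
    fall back to whichever one is non-empty."""
    if a and b:
        return a if len(a) >= len(b) else b
    return a or b

print(prefer_longer('Example', 'Example Domain'))  # 'Example Domain'
print(prefer_longer(None, 'fallback'))             # 'fallback'
```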
- archivebox.index.archivable_links(links: Iterable[Link]) → Iterable[Link] [source]
remove chrome://, about:// or other schemed links that can't be archived
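The scheme check described can be sketched with stdlib URL parsing; the set of archivable schemes below is an assumption, not the library's actual list:

```python
from urllib.parse import urlparse

# assumed whitelist for illustration; archivebox defines its own
ARCHIVABLE_SCHEMES = {'http', 'https', 'ftp'}

def is_archivable(url: str) -> bool:
    """True if the URL's scheme is one we can fetch and archive."""
    return urlparse(url).scheme in ARCHIVABLE_SCHEMES

urls = ['https://example.com', 'chrome://settings', 'about:blank']
print([u for u in urls if is_archivable(u)])  # ['https://example.com']
```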
- archivebox.index.fix_duplicate_links(sorted_links: Iterable[Link]) → Iterable[Link] [source]
ensures that all non-duplicate links have monotonically increasing timestamps
- archivebox.index.links_after_timestamp(links: Iterable[Link], resume: float | None = None) → Iterable[Link] [source]
- archivebox.index.lowest_uniq_timestamp(used_timestamps: OrderedDict, timestamp: str) → str [source]
resolve duplicate timestamps by appending a decimal: 1234, 1234 -> 1234.1, 1234.2
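A minimal sketch of the behavior the docstring describes — increment a decimal suffix until the timestamp is unused. This is an illustration of the stated contract, not the actual implementation:

```python
from collections import OrderedDict

def lowest_uniq_timestamp(used_timestamps: OrderedDict, timestamp: str) -> str:
    """Append .1, .2, ... to a duplicate timestamp until it is unique."""
    if timestamp not in used_timestamps:
        return timestamp
    suffix = 1
    while f'{timestamp}.{suffix}' in used_timestamps:
        suffix += 1
    return f'{timestamp}.{suffix}'

used = OrderedDict.fromkeys(['1234', '1234.1'])
print(lowest_uniq_timestamp(used, '1234'))  # '1234.2'
print(lowest_uniq_timestamp(used, '5678'))  # '5678' (already unique)
```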
- archivebox.index.write_main_index(links: List[Link], out_dir: Path = OUTPUT_DIR) → None [source]
Writes links to a sqlite3 file for a given list of links
- archivebox.index.load_main_index(out_dir: Path = OUTPUT_DIR, warn: bool = True) → List[Link] [source]
parse and load existing index with any new links from import_path merged in
- archivebox.index.load_main_index_meta(out_dir: Path = OUTPUT_DIR) → dict | None [source]
- archivebox.index.parse_links_from_source(source_path: str, root_url: str | None = None, parser: str = 'auto') → Tuple[List[Link], List[Link]] [source]
- archivebox.index.fix_duplicate_links_in_index(snapshots: QuerySet, links: Iterable[Link]) → Iterable[Link] [source]
Given a list of in-memory Links, dedupe and merge them with any conflicting Snapshots in the DB.
- archivebox.index.dedupe_links(snapshots: QuerySet, new_links: List[Link]) → List[Link] [source]
Link validation happens at an earlier stage; this method focuses on actual deduplication and timestamp fixing.
- archivebox.index.write_link_details(link: Link, out_dir: str | None = None, skip_sql_index: bool = False) → None [source]
- archivebox.index.load_link_details(link: Link, out_dir: str | None = None) → Link [source]
check for an existing link archive in the given directory, and load+merge it into the given link dict
- archivebox.index.q_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → QuerySet [source]
- archivebox.index.search_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'search') → QuerySet [source]
- archivebox.index.snapshot_filter(snapshots: QuerySet, filter_patterns: List[str], filter_type: str = 'exact') → QuerySet [source]
- archivebox.index.get_indexed_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
indexed links without checking archive status or data directory validity
- archivebox.index.get_archived_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
indexed links that are archived with a valid data directory
- archivebox.index.get_unarchived_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
indexed links that are unarchived with no data directory or an empty data directory
- archivebox.index.get_present_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs that actually exist in the archive/ folder
- archivebox.index.get_valid_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs with a valid index matched to the main index and archived content
- archivebox.index.get_invalid_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized
- archivebox.index.get_duplicate_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs that conflict with other directories that have the same link URL or timestamp
- archivebox.index.get_orphaned_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs that contain a valid index but aren’t listed in the main index
- archivebox.index.get_corrupted_folders(snapshots, out_dir: Path = OUTPUT_DIR) → Dict[str, Link | None] [source]
dirs that don’t contain a valid index and aren’t listed in the main index