Skip to content
On this page

Internet Archive

The class for fetching metadata and images from Internet Archive with its Python API.

The class inherits the corresponding class in libquery, and additionally supports validation and processing of the metadata.

Usage

Create a querier for Internet Archive:

python
from libquery import InternetArchive

directory = "./output/internet-archive"
querier = InternetArchive(
    metadata_dir=f"{directory}/metadata",
    download_dir=f"{directory}/downloads",
    img_dir=f"{directory}/imgs",
)

Query metadata:

python
queries = [
    "creator:(Snow, John) AND -collection:(david-rumsey-map-collection) AND date:[1800-01-01 TO 1950-12-31]",
    "creator:(Cheysson, Émile) AND -collection:(david-rumsey-map-collection) AND date:[0001-01-01 TO 1990-12-31]",
]
querier.fetch_metadata(queries=queries)

Validate metadata:

python
querier.validate_metadata()

Query images:

python
querier.fetch_image()

Process metadata:

python
querier.process_metadata(save_path=f"{directory}/processed/processed.json")

Processed Metadata Schema

Each processed metadata entry is stored as:

typescript
interface TimePoint {
    year: number
    month?: number
    day?: number
}

interface ProcessedMetadataEntry {
    uuid: string
    authors: string[]
    displayName: string
    publishDate: TimePoint | null
    viewUrl: string
    downloadUrl: string
    md5?: string
    phash?: string
    resolution?: [number, number]
    fileSize?: number
    languages: string[] | null
    tags: string[]
    abstract: string
    source: {
        name: 'Internet Archive'
        /** The url where the metadata is collected. */
        url: string
        accessDate: string
    }
}