Dataset

Using libquery and libquery_extensions (see querier), we query the data sources, filter the retrieved images, and process their metadata to construct a dataset of about 13,000 historical visualizations. The data processing scripts and the dataset itself are available in this repository.

The dataset can be downloaded using our Python package oldvis_dataset (see downloader).

Dataset Construction Workflow

The construction of the dataset involves the following steps:

Querying: Using specialized query utilities to search digital libraries for historical visualizations.
Filtering: Distinguishing visualizations from other artifacts (such as photographs and illustrations) using semi-automatic labeling approaches.
Processing: Normalizing heterogeneous metadata from various sources into a unified data structure, as described below.
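
The processing step can be pictured as mapping each source's own field names onto one shared set of keys. The following is a minimal sketch under stated assumptions, not the project's actual scripts: the source names, the raw field names, and the `normalize` helper are hypothetical, and the sample values are illustrative. Only the output keys (`authors`, `displayName`, `publishDate`, `languages`) are taken from the unified metadata structure.

```python
from typing import Any

# Hypothetical per-source mappings from raw field names to unified keys.
# The raw names are illustrative assumptions, not real API fields.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "source_a": {
        "creator": "authors",
        "title": "displayName",
        "year": "publishDate",
        "lang": "languages",
    },
    "source_b": {
        "author_list": "authors",
        "name": "displayName",
        "date": "publishDate",
        "language": "languages",
    },
}

# Unified keys whose values are always stored as lists.
LIST_FIELDS = {"authors", "languages"}


def normalize(source: str, raw: dict[str, Any]) -> dict[str, Any]:
    """Map one raw record onto the unified keys (hypothetical helper)."""
    entry: dict[str, Any] = {}
    for raw_key, unified_key in FIELD_MAPS[source].items():
        value = raw.get(raw_key)
        if unified_key in LIST_FIELDS:
            # Wrap scalars so every source yields a list.
            if value is None:
                value = []
            elif not isinstance(value, list):
                value = [value]
        elif unified_key == "publishDate" and value is not None:
            # Store the date as an object with a numeric year.
            value = {"year": int(value)}
        entry[unified_key] = value
    return entry


record_a = normalize(
    "source_a",
    {"creator": "Snow, John", "title": "Example title", "year": "1855", "lang": "eng"},
)
record_b = normalize(
    "source_b",
    {"author_list": ["Snow, John"], "name": "Example title", "date": 1855, "language": ["eng"]},
)
print(record_a == record_b)
```

Keeping the mappings in a table rather than in per-source code paths means that adding a source is mostly a matter of adding one more dictionary.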

Metadata Structure

Each visualization entry in the dataset includes processed metadata with 17 fields:

UUID (uuid): Unique identifier for each entry
Authors (authors): Creator(s) of the visualization
Display Name (displayName): Title or descriptive name
Publish Date (publishDate): Publication or creation date
View URL (viewUrl): Link to view the item in the original source
Download URL (downloadUrl): Link to download the image
MD5 Hash (md5): MD5 checksum of the image file, used for file integrity verification and duplicate detection
Perceptual Hash (phash): Perceptual hash of the image, used for finding visually similar or duplicate images
Resolution (resolution): Image dimensions as [width, height] in pixels
File Size (fileSize): Size of the image file in bytes
Languages (languages): Language(s) used in the visualization
Tags (tags): Subject tags or keywords
Abstract (abstract): Description or notes about the visualization
Rights (rights): Copyright and usage information
Source (source): Information about the original data source, including name, URL, and access date
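
As an illustration of how the `md5` and `phash` fields might be used, the sketch below checks a downloaded file against its recorded MD5 checksum and compares two perceptual hashes by Hamming distance. It relies on assumptions: the helper names are hypothetical, the source does not state which perceptual hashing algorithm produced `phash`, and the distance threshold is an arbitrary illustrative value rather than one used by the authors.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, entry: dict) -> bool:
    """Check a local file against the `md5` field of a metadata entry."""
    return md5_of_file(path) == entry["md5"].lower()


def phash_distance(a: str, b: str) -> int:
    """Hamming distance between two hex-encoded hashes of equal length."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return bin(int(a, 16) ^ int(b, 16)).count("1")


# Assumed threshold for "visually similar"; tune it for real use.
MAX_DISTANCE = 8


def is_near_duplicate(a: str, b: str, max_distance: int = MAX_DISTANCE) -> bool:
    """Treat two images as near-duplicates if their hashes differ in few bits."""
    return phash_distance(a, b) <= max_distance
```

An exact MD5 match identifies byte-identical files, while a small perceptual-hash distance can flag the same image stored at a different resolution or compression level.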

Example:

{
    "uuid": "e69c9258-568f-50ab-97ef-89a673648fa1",
    "authors": [
        "Snow, John"
    ],
    "displayName": "On the mode of communication of cholera - Page 58",
    "publishDate": {
        "year": 1855
    },
    "viewUrl": "https://archive.org/details/b28985266/page/n58",
    "downloadUrl": "https://ia601500.us.archive.org/view_archive.php?archive=/19/items/b28985266/b28985266_jp2.zip&file=b28985266_jp2%2Fb28985266_0058.jp2&ext=jpg",
    "md5": "e99fa7f7cd91c82b6f73c7d98d25c5ea",
    "phash": "952a2a55776caa17",
    "resolution": [
        3507,
        3324
    ],
    "fileSize": 1104681,
    "languages": [
        "eng"
    ],
    "tags": [
        "Cholera"
    ],
    "abstract": "Includes bibliographical footnotes\n",
    "rights": "public domain",
    "source": {
        "name": "Internet Archive",
        "url": "https://archive.org/search?query=identifier:(b28985266)",
        "accessDate": "2023-03-21T10:25:36.352011+00:00"
    }
}
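
An entry like the one above can be read with Python's standard library alone. The sketch below parses a trimmed copy of the example (only a few of its fields are reproduced) and derives the pixel count and the access time. The variable names are illustrative and not part of any package API.

```python
import json
from datetime import datetime

# Trimmed copy of the example entry; values are taken from the example above.
entry_json = """
{
    "uuid": "e69c9258-568f-50ab-97ef-89a673648fa1",
    "authors": ["Snow, John"],
    "publishDate": {"year": 1855},
    "resolution": [3507, 3324],
    "fileSize": 1104681,
    "source": {
        "name": "Internet Archive",
        "accessDate": "2023-03-21T10:25:36.352011+00:00"
    }
}
"""

entry = json.loads(entry_json)

# `resolution` is stored as [width, height] in pixels.
width, height = entry["resolution"]
pixel_count = width * height

# `accessDate` is an ISO 8601 timestamp with a UTC offset.
accessed = datetime.fromisoformat(entry["source"]["accessDate"])

print(entry["authors"], entry["publishDate"]["year"])
print(pixel_count, accessed.date())
```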

Data Sources

As of June 20, 2023, our dataset consists of old visualizations from seven data sources. The number of visualizations obtained from each source is listed in the following table.

Data Source	#Entries
David Rumsey Map Collection	7816
Internet Archive	2985
Gallica	2090
Telefact	225
Library of Congress	212
British Library	132
Alabama Maps	51
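
The per-source counts in the table can be totaled and turned into shares with a short script. The numbers below are copied from the table. The script is a convenience sketch, not part of the dataset tooling.

```python
# Entry counts per data source, copied from the table above.
counts = {
    "David Rumsey Map Collection": 7816,
    "Internet Archive": 2985,
    "Gallica": 2090,
    "Telefact": 225,
    "Library of Congress": 212,
    "British Library": 132,
    "Alabama Maps": 51,
}

total = sum(counts.values())

# Print each source with its count and its share of the total.
for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30}{n:>6}{n / total:>8.1%}")

print(f"{'Total':<30}{total:>6}")
```

The seven counts sum to 13,511 entries.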
