Dataset
Using libquery and libquery_extensions (see querier), we query the data sources, filter images, and process metadata. In this way, we construct a dataset of 13,000 historical visualizations. The data processing scripts and the dataset itself can be found in this repository.
The dataset can be downloaded using our Python package oldvis_dataset (see downloader).
Dataset Construction Workflow
The construction of the dataset involves the following steps:
- Querying: Using specialized query utilities to search digital libraries for historical visualizations.
- Filtering: Distinguishing visualizations from other artifacts (such as photographs and illustrations) using semi-automatic labeling approaches.
- Processing: Normalizing heterogeneous metadata from various sources into a unified data structure, as described below.
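To make the processing step more concrete, here is a minimal sketch of mapping one raw record onto part of the unified structure. The raw keys (`title`, `creator`, `year`, `page_url`, `image_url`, `subject`, `language`) and the helpers `as_list` and `normalize` are hypothetical, and deriving the UUID from the view URL with `uuid5` is an assumption made for illustration. None of this is the actual code in libquery or the processing scripts.

```python
import uuid
from datetime import datetime, timezone
from typing import Any


def as_list(value: Any) -> list[str]:
    """Coerce a missing, scalar, or list value into a list of strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    return [str(value)]


def normalize(raw: dict[str, Any], source_name: str) -> dict[str, Any]:
    """Map one hypothetical raw record onto a subset of the unified fields."""
    view_url = raw["page_url"]
    return {
        # Assumption: a deterministic identifier derived from the view URL.
        "uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, view_url)),
        "authors": as_list(raw.get("creator")),
        "displayName": raw.get("title", ""),
        "publishDate": {"year": int(raw["year"])} if raw.get("year") else None,
        "viewUrl": view_url,
        "downloadUrl": raw.get("image_url"),
        "languages": as_list(raw.get("language")),
        "tags": as_list(raw.get("subject")),
        "source": {
            "name": source_name,
            "accessDate": datetime.now(timezone.utc).isoformat(),
        },
    }


# Hypothetical raw record, shaped loosely after one source's search result.
raw_record = {
    "title": "On the mode of communication of cholera - Page 58",
    "creator": "Snow, John",
    "year": "1855",
    "page_url": "https://archive.org/details/b28985266/page/n58",
    "subject": ["Cholera"],
    "language": "eng",
}
entry = normalize(raw_record, "Internet Archive")
```

Coercing scalar and missing values into lists keeps fields such as `authors` and `tags` uniform across sources, whatever shape each source returns them in.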
Metadata Structure
Each visualization entry in the dataset includes processed metadata with 17 fields:
- UUID (`uuid`): Unique identifier for each entry
- Authors (`authors`): Creator(s) of the visualization
- Display Name (`displayName`): Title or descriptive name
- Publish Date (`publishDate`): Publication or creation date
- View URL (`viewUrl`): Link to view the item in the original source
- Download URL (`downloadUrl`): Link to download the image
- MD5 Hash (`md5`): MD5 checksum of the image file, used for file integrity verification and duplicate detection
- Perceptual Hash (`phash`): Perceptual hash of the image, used for finding visually similar or duplicate images
- Resolution (`resolution`): Image dimensions as [width, height] in pixels
- File Size (`fileSize`): Size of the image file in bytes
- Languages (`languages`): Language(s) used in the visualization
- Tags (`tags`): Subject tags or keywords
- Abstract (`abstract`): Description or notes about the visualization
- Rights (`rights`): Copyright and usage information
- Source (`source`): Information about the original data source, including name, URL, and access date
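The two hash fields lend themselves to simple checks. The sketch below verifies a downloaded file against `md5` and compares two `phash` values by Hamming distance. It assumes that `phash` is a hex-encoded hash, and that Hamming distance is a suitable similarity measure. That is a common convention for perceptual hashes, but it is not confirmed as the method used to build this dataset. The helper names are hypothetical.

```python
import hashlib
from typing import Any


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def matches_md5(data: bytes, entry: dict[str, Any]) -> bool:
    """Check downloaded image bytes against the entry's `md5` field."""
    return md5_hex(data) == entry["md5"].lower()


def phash_distance(a: str, b: str) -> int:
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return bin(int(a, 16) ^ int(b, 16)).count("1")


# Identical hashes are at distance 0; flipping one bit gives distance 1.
same = phash_distance("952a2a55776caa17", "952a2a55776caa17")
near = phash_distance("952a2a55776caa17", "952a2a55776caa16")
```

A small distance suggests visually similar images. Any cutoff for treating two entries as duplicates would need to be tuned on the data.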
Example entry:
```json
{
  "uuid": "e69c9258-568f-50ab-97ef-89a673648fa1",
  "authors": [
    "Snow, John"
  ],
  "displayName": "On the mode of communication of cholera - Page 58",
  "publishDate": {
    "year": 1855
  },
  "viewUrl": "https://archive.org/details/b28985266/page/n58",
  "downloadUrl": "https://ia601500.us.archive.org/view_archive.php?archive=/19/items/b28985266/b28985266_jp2.zip&file=b28985266_jp2%2Fb28985266_0058.jp2&ext=jpg",
  "md5": "e99fa7f7cd91c82b6f73c7d98d25c5ea",
  "phash": "952a2a55776caa17",
  "resolution": [
    3507,
    3324
  ],
  "fileSize": 1104681,
  "languages": [
    "eng"
  ],
  "tags": [
    "Cholera"
  ],
  "abstract": "Includes bibliographical footnotes\n",
  "rights": "public domain",
  "source": {
    "name": "Internet Archive",
    "url": "https://archive.org/search?query=identifier:(b28985266)",
    "accessDate": "2023-03-21T10:25:36.352011+00:00"
  }
}
```
Data Sources
As of June 20, 2023, our dataset consists of old visualizations from seven data sources. The number of visualizations obtained from each source is listed in the following table.
| Data Source | #Entries |
|---|---|
| David Rumsey Map Collection | 7816 |
| Internet Archive | 2985 |
| Gallica | 2090 |
| Telefact | 225 |
| Library of Congress | 212 |
| British Library | 132 |
| Alabama Maps | 51 |
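The counts in the table sum to 13,511 entries. As a rough sketch, they could be cross-checked against the metadata by tallying the `name` field of each entry's `source` object. This assumes the metadata has already been loaded as a list of dictionaries and that `source.name` matches the table labels exactly. The example entry confirms this only for "Internet Archive". The helpers `count_by_source` and `mismatches` are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable

# Per-source counts as reported in the table (as of June 20, 2023).
REPORTED_COUNTS = {
    "David Rumsey Map Collection": 7816,
    "Internet Archive": 2985,
    "Gallica": 2090,
    "Telefact": 225,
    "Library of Congress": 212,
    "British Library": 132,
    "Alabama Maps": 51,
}


def count_by_source(entries: Iterable[dict[str, Any]]) -> Counter:
    """Tally entries by the `name` field of their `source` object."""
    return Counter(entry["source"]["name"] for entry in entries)


def mismatches(entries: Iterable[dict[str, Any]]) -> dict[str, tuple[int, int]]:
    """Return {source: (reported, actual)} for sources whose counts differ."""
    actual = count_by_source(entries)
    names = set(REPORTED_COUNTS) | set(actual)
    return {
        name: (REPORTED_COUNTS.get(name, 0), actual.get(name, 0))
        for name in names
        if REPORTED_COUNTS.get(name, 0) != actual.get(name, 0)
    }


total_reported = sum(REPORTED_COUNTS.values())
```

An empty result from `mismatches` would indicate that the loaded metadata agrees with the table.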