Processor

We have two Python packages for processing digital library query results:

  • libprocess: Utilities for processing results from general-purpose, large-scale digital libraries (based on libquery), and for normalizing the data into a uniform structure.
  • libprocess_extensions: An extension of libprocess that handles query results for historical visualizations (based on libquery_extensions) and likewise normalizes the data into a uniform structure.

These packages normalize and clean heterogeneous metadata from various digital libraries. They handle schema mapping, field extraction, data cleaning, UUID generation, date parsing, author parsing, language detection, and URL validation. Each data source has its own processing rules.
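
To make the normalization steps concrete, here is a minimal sketch of what mapping one raw record to a uniform structure could look like. It is not the libprocess API: the function names, the raw field names (id, title, creator, date, link), and the output fields are all hypothetical, and language detection is left out because it would need a third-party library.

```python
import re
import uuid
from urllib.parse import urlparse


def parse_year(raw):
    """Return the first four-digit year (1000-2099) found in raw, or None."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", raw or "")
    return int(match.group(1)) if match else None


def parse_authors(raw):
    """Split a ';'-separated author string into trimmed, non-empty names."""
    return [name.strip() for name in (raw or "").split(";") if name.strip()]


def is_valid_url(url):
    """Accept only http(s) URLs that have a host part."""
    parts = urlparse(url or "")
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def normalize(record, source):
    """Map one raw record (hypothetical schema) to a uniform structure."""
    url = record.get("link")
    return {
        # Deterministic UUID: the same source and id always give the same value.
        "uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}:{record['id']}")),
        "source": source,
        "title": (record.get("title") or "").strip(),
        "authors": parse_authors(record.get("creator")),
        "year": parse_year(record.get("date")),
        "url": url if is_valid_url(url) else None,
    }


raw = {
    "id": "42",
    "title": "  Atlas ",
    "creator": "Doe, Jane; Roe, Richard",
    "date": "c. 1850",
    "link": "https://example.org/item/42",
}
normalized = normalize(raw, "example")
```

Deriving the UUID from the source name and the source's own identifier keeps it stable across repeated runs, which is one possible design; the actual packages may generate identifiers differently.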

The processed data is then used to build the unified dataset, ensuring a consistent metadata structure across all sources.
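
As an illustration of that last step, the sketch below merges batches of already normalized records into one list and checks that each record has the expected fields. The function name, the required field set, and the rule of keeping the first record per UUID are assumptions made for this example, not the project's actual build procedure.

```python
# Hypothetical minimum set of fields every normalized record must carry.
REQUIRED_KEYS = {"uuid", "source", "title"}


def build_dataset(*batches):
    """Merge batches of normalized records, keeping the first record per UUID."""
    merged = {}
    for batch in batches:
        for record in batch:
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"record is missing fields: {sorted(missing)}")
            # setdefault leaves an existing entry untouched, so duplicates are dropped.
            merged.setdefault(record["uuid"], record)
    return list(merged.values())


batch_a = [{"uuid": "a1", "source": "library_a", "title": "Atlas"}]
batch_b = [
    {"uuid": "a1", "source": "library_b", "title": "Atlas (duplicate)"},
    {"uuid": "b2", "source": "library_b", "title": "Chart"},
]
dataset = build_dataset(batch_a, batch_b)
```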

For detailed documentation, see the libprocess and libprocess_extensions repositories.