Collection Assessment in a Collaborative Environment: BHL and DPLA
Session Type: Working Session
Session Description
The Biodiversity Heritage Library (BHL) and the Digital Public Library of
America (DPLA) work collaboratively with their partners to make scientifically
and culturally significant resources openly available to the world. The BHL
(with a core consortium of fifteen libraries and hundreds of contributors) and
the DPLA (with seventeen partners, including the BHL, who themselves represent
over 600 libraries, archives, museums, and historical societies) continue to
grow within the US and internationally.
Challenges:
The BHL collection consists of more than 60,000 scanned full-text
biodiversity-related book and journal titles with attached OCR, and the DPLA
aggregates nearly 3 million records from diverse catalogs, local databases,
and multiple metadata formats. Because of the variety of metadata, deep
analysis of these data sets has proved to be a significant challenge. Simple
questions like “how many entomology or illustration records are there?” are
difficult to answer because not all records include the exact term in the
subject headings or genre descriptions, or indeed anywhere in the record.
Especially challenging for the BHL is that Library of Congress Subject Headings seldom connect directly with the biological taxonomies behind the literature, so an analysis of the collection’s strengths and gaps based on those headings does not line up with the needs of its core user group of biodiversity scholars. Likewise, the DPLA brings together subject headings, genre terms, and format descriptions from a variety of specialized thesauri and controlled vocabularies, both widely accepted standards and local constructs, which complicates true subject analysis and any usage beyond keyword search.
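As a rough illustration of the counting problem described above, the following Python sketch compares an exact-term count with a count over a small set of related terms. The records, the field names (`subjects`, `genre`), and the `RELATED` term set are all invented for illustration; they are not BHL or DPLA data, and a real term set would need to come from a biodiversity thesaurus or similar source.

```python
# Hypothetical records; field names and values are assumptions for illustration.
RECORDS = [
    {"id": "r1", "subjects": ["Entomology"]},
    {"id": "r2", "subjects": ["Insects -- North America"]},
    {"id": "r3", "subjects": ["Coleoptera"], "genre": ["Illustrations"]},
    {"id": "r4", "subjects": ["Botany"]},
]

# Hand-picked related terms (an assumption, not drawn from any real thesaurus).
RELATED = {"entomology": {"entomology", "insects", "coleoptera"}}


def record_terms(record):
    """Return the lower-cased heading components found in a record.

    Subdivided headings such as "Insects -- North America" are split on
    "--" so that each component can be matched separately.
    """
    terms = set()
    for field in ("subjects", "genre"):
        for heading in record.get(field, []):
            for part in heading.split("--"):
                terms.add(part.strip().lower())
    return terms


def count_matching(records, wanted):
    """Count records that carry at least one of the wanted terms."""
    wanted = {term.lower() for term in wanted}
    return sum(1 for record in records if record_terms(record) & wanted)


exact = count_matching(RECORDS, {"entomology"})
expanded = count_matching(RECORDS, RELATED["entomology"])
print(exact, expanded)
```

In this toy data set the exact-term count finds one record, while the expanded count finds three, which is the kind of gap that makes simple coverage questions hard to answer from the metadata alone.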
Needs:
Both the BHL and the DPLA are interested in developing a detailed view of each
collection’s subject coverage in order to locate gaps, identify additional
materials for digitization, and connect with specific audiences and new
partners that can fill these content voids. In addition, the DPLA is
interested in automatically identifying the controlled vocabularies or
thesauri behind subject terms (or genres, formats, etc.) that arrive from its
partners without a URI, pointer, or other indicator.
The two organizations are interested in discussing possible applications that might address some of these challenges:
- Visualization tools to drill down from broad terms (e.g., Trees–North America) to more specific terms (e.g., Pinus banksiana) using thesauri specific to biodiversity and connecting them to the LCSH hierarchies.
- Vocabulary identification tool that might “lob” subject headings at open controlled vocabularies to associate terms and grab URIs.
- Other ideas proposed by session attendees
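One possible starting point for the vocabulary identification idea above is a normalized label lookup, sketched here in Python. This is only a minimal sketch under stated assumptions: the vocabulary names (`subjects-demo`, `genres-demo`, `taxa-demo`) and the `example.org` URIs are placeholders, and a working tool would load labels from real open vocabularies rather than a hard-coded dictionary.

```python
import re
import unicodedata

# Placeholder vocabularies with example.org URIs; names and URIs are
# hypothetical and stand in for real controlled vocabularies.
VOCABULARIES = {
    "subjects-demo": {
        "Trees": "http://example.org/subjects/trees",
        "Illustrations": "http://example.org/subjects/illustrations",
    },
    "genres-demo": {
        "Illustrations": "http://example.org/genres/illustrations",
    },
    "taxa-demo": {
        "Pinus banksiana": "http://example.org/taxa/pinus-banksiana",
    },
}


def normalize(label):
    """Case-fold a label, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def build_index(vocabularies):
    """Map each normalized label to its (vocabulary name, URI) candidates."""
    index = {}
    for name, terms in vocabularies.items():
        for label, uri in terms.items():
            index.setdefault(normalize(label), []).append((name, uri))
    return index


def identify(heading, index):
    """Return every (vocabulary name, URI) pair whose label matches the heading."""
    return index.get(normalize(heading), [])


INDEX = build_index(VOCABULARIES)

for heading in ["pinus banksiana.", "ILLUSTRATIONS", "Unknown heading"]:
    print(heading, "->", identify(heading, INDEX))
```

A heading that matches more than one vocabulary, like "Illustrations" here, returns several candidates, so a real tool would need extra evidence (for example the source field or the contributing partner) to choose among them. Exact matching after normalization is also a deliberately simple choice; fuzzy matching would catch more variants at the cost of false positives.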
The working session will start with brief presentations from BHL and DPLA representatives, followed by a discussion of common challenges. The last two hours will provide time for a mini “hackathon” to experiment with subject heading datasets and conceptualize prototypes for potential tools to satisfy collection assessment challenges highlighted by the discussion.
Session Leaders
Constance Rinaldo, Harvard University
Mark Phillips, University of North Texas
View the community reporting Google doc for this session.