International research on bibliographical data – challenges for data-driven research
Vojtěch Malinek (Institute of Czech Literature, Czech Academy of Sciences, Czech Republic, Praha) Tomasz Umerle (Institute for Literary Research of the Polish Academy of Sciences, Poznań, Warsaw)
DFG subject areas: 101 Ancient Cultures, 102 History, 103 Fine Arts, Music, Theatre and Media Studies, 104 Linguistics, 105 Literary Studies, 106 Social and Cultural Anthropology, Non-European Cultures, Jewish Studies and Religious Studies
Text+ data domain: Collections
User story from the DARIAH-ERIC “Bibliographical Data” Working Group
This user story is based on the experiences of the DARIAH-ERIC “Bibliographical Data” (Bibliodata) Working Group (WG). The group consists of approximately 30 members from 15 European countries who work in different disciplines, chiefly the humanities, but also the social sciences, software engineering, etc. In terms of the “DFG-Fachsystematik”, WG activities mainly cover disciplines 102-106 and 409, but they relate to all of the social sciences and humanities and, on a general level, can be relevant for nearly any scientific discipline.
One of the primary goals of the WG is to establish a communication platform for all stakeholders engaged in bibliographical data processing and research. These are (a) data producers (bibliographers, librarians, scholars, data stewards / research infrastructures, libraries, scientific institutes, scientific projects, etc.), (b) data researchers (DH researchers, book historians, literary scholars, etc.), and (c) software developers, library IT experts, etc.
There are two main factors that drive current initiatives to bring the international bibliodata community together.
- Many data standards and norms have been used all over Europe (MARC21, Dublin Core, library cataloguing rules, citation formats, proprietary systems, etc.). However, national interpretations and variants have reshaped these standards over the past decades, and the expectations of both curators and researchers have grown considerably. Meeting these expectations demands new collaborative efforts that bring data curators and researchers together at both the international and the national level. Hence our Working Group regards the Text+ consortium as a great contribution to the international bibliographical data ecosystem.
- We observe increased interest in using bibliographical data not only as a tool for discovering and identifying resources, but also as a vehicle for data-driven studies (“bibliographical data science”) of culture, society, history, and art, in bibliodata research domains such as bibliometrics, cultural analytics, science evaluation, literary history, etc. Our Working Group aims to identify and use different national datasets to answer research questions through bibliographical data. Hence, we are interested in the work of the Text+ consortium, as we hope it will allow for the inclusion of new high-quality datasets that will enrich the international bibliodata ecosystem.
The existing bibliographical data sources have not been sufficiently used for data-based research (e.g. comparative, transnational literary, historical, cultural analysis) and advanced data curation (e.g. metadata aggregation, linked data services) on the international level.
Researchers interested in combining datasets originating from different national curating institutions face serious challenges.
Firstly, the bibliodata landscape is rich and complex and involves very diverse data curators: GLAM institutions (libraries, especially national libraries, which curate much of the cultural heritage bibliodata), researchers (creators of many traditional bibliographical resources, but also everyday producers of bibliographic descriptions and citations), research institutions (which organize and manage the research outputs of researchers), and information services (repositories, digital libraries, etc.). As a result, it is difficult for researchers interested in international, comparative research topics to understand the available national datasets, their connections, overlaps, and curatorial history. In short, bibliodata suffers from insufficient documentation.
Secondly, although bibliodata is highly standardized in terms of data formats and internationally recognized standards, much of the data has not been fully “FAIR-ified” in the aspects critical for international research, namely the use of linked data publication methods and the inclusion of persistent identifiers, international authority files, and thesauri.
Thirdly, many bibliodata resources have still not been made accessible in compliance with open data standards and through open infrastructures. GLAM and research institutions, especially big libraries, are leading the way, but many information services function as “discovery silos” (access is limited, subscription-based, etc.). At the same time, much of the bibliographical information (catalogs, printed bibliographies, inventories, etc.) has still not been digitized. Last but not least, new types of data, especially web content, have not been systematically organized into bibliographical resources.
The international community organized within the Bibliodata WG would very much welcome infrastructural solutions offered by the Text+ consortium that would (1) provide high-quality, universal, researcher-friendly documentation of the existing bibliographical resources, (2) guarantee the continued introduction of the FAIR data principles critical for comparative research on transnational datasets (linked data methods, PIDs, authority files, thesauri mapping, etc.) and the development of advanced metadata services, such as linked data services and data shops, and (3) make large-scale investments in providing open access to bibliographical datasets originating from GLAM institutions (e.g. non-digitized bibliodata collections) and research institutions, but also from information services (repositories, aggregators) and publishers (journals, citation indexes).
Leo Lahti, Jani Marjanen, Hege Roivainen & Mikko Tolonen (2019) Bibliographic Data Science and the History of the Book (c. 1500–1800), Cataloging & Classification Quarterly, 57:1, 5–23, DOI: 10.1080/01639374.2018.1543747
A.C. Montoya (2018) The MEDIATE project, Jaarboek voor Nederlandse Boekgeschiedenis / Yearbook for Dutch Book History, 25, 229–232
Silvio Peroni, Paolo Ciancarini, Aldo Gangemi et al. (2020) The practice of self-citations: a longitudinal study. Scientometrics 123, 253–282, https://doi.org/10.1007/s11192-020-03397-6
Mikko Tolonen, Leo Lahti, Hege Roivainen & Jani Marjanen (2019) A Quantitative Approach to Book-Printing in Sweden and Finland, 1640–1828, Historical Methods: A Journal of Quantitative and Interdisciplinary History, 52:1, 57–78, DOI: 10.1080/01615440.2018.1526657