bdDwC: user level standardization of biodiversity data
bdDwC is an R package that provides an interactive Shiny app and a set of functions for standardizing field names in compliance with the Darwin Core (DwC) format. Running bdDwC lets you standardize all field names in your dataset, which allows the bdverse to handle data from various biodiversity portals seamlessly and lets you use all of its features regardless of publishers’ variation in field names.
The development of bdDwC was inspired by the Kurator project ‘Darwinizer tool’. bdDwC uses the Darwin Cloud dictionary (Wieczorek et al. 2017), a lookup table maintained by the Kurator team that accumulates different variations of DwC field names. You can also add your own dictionary by creating a CSV file with two columns, one for the field names and one for the standard names.
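As an illustration of the two-column dictionary idea, the following is a minimal sketch in base R. It does not call any bdDwC function and is not the package's implementation. The column names `fieldname` and `standard`, the helper `standardize_names()`, and the toy records are all assumptions made for this example.

```r
# Hypothetical custom dictionary: one column holds field names as they appear
# in the raw data, the other holds the matching Darwin Core standard names.
# The column names `fieldname` and `standard` are assumptions for this sketch.
dictionary <- data.frame(
  fieldname = c("lat", "long", "sci_name"),
  standard  = c("decimalLatitude", "decimalLongitude", "scientificName"),
  stringsAsFactors = FALSE
)

# Round-trip through a CSV file, as a user-supplied dictionary would be stored.
dictionary_path <- tempfile(fileext = ".csv")
write.csv(dictionary, dictionary_path, row.names = FALSE)
dictionary <- read.csv(dictionary_path, stringsAsFactors = FALSE)

# Toy dataset with non-standard field names (made up for illustration).
records <- data.frame(
  lat       = c(52.1, 48.9),
  long      = c(4.3, 2.4),
  sci_name  = c("Parus major", "Turdus merula"),
  collector = c("A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive lookup of each field name in the
# dictionary; fields without a match keep their original name.
standardize_names <- function(data, dictionary) {
  idx <- match(tolower(names(data)), tolower(dictionary$fieldname))
  names(data) <- ifelse(is.na(idx), names(data), dictionary$standard[idx])
  data
}

standardized <- standardize_names(records, dictionary)
names(standardized)
```

In this sketch, unmatched fields such as `collector` are left untouched rather than dropped, so no data is lost when the dictionary is incomplete.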
Architecture overview
Major challenges ahead
- Establishing and maintaining a robust workflow for feeding the Darwin Cloud. To address this issue, we’ll consult key members of the biodiversity informatics community.
- “Darwinizing” a dataset is a basic task for all bdverse tools and workflows, thus developing an intensive QA shell is in order.
Future plans
- Enhance the UI.
- Explore the idea of creating and maintaining a specific dictionary for each data publisher.
- Experiment with fuzzy matching techniques to generate suggestions for matching fields.
- Explore techniques for enforcing recommended DwC vocabulary.