The Data Quality Review Framework
- Data must be of high quality so that future users can trust and use them.
- There are many different dimensions of data quality, and different stakeholders may prioritize different ones.
- Reproducibility, the ability to reproduce the original analysis and results, sets an even higher bar for quality.
- The Data Quality Review (DQR) framework is a set of recommended curation activities that promote the reproducibility of the research.
- DQR involves a review of the files, the documentation, the data, and the code being curated.
File Review
- Each file must be opened and inspected to confirm that it is readable and complete.
- Study and file metadata should be detailed, accurate, and formatted in a standard schema.
- Aspects of file review may vary based on file format.
- Transforming files into non-proprietary formats facilitates reuse and ensures long-term accessibility.
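Part of a file inspection can be automated. The following is a minimal sketch, not a prescribed DQR tool: it walks a deposit, records each file's size and SHA-256 checksum, and flags empty files and files whose extension appears in a list of proprietary formats. The function names and the `PROPRIETARY_EXTENSIONS` set are assumptions made for illustration; the actual list of formats to flag depends on local policy.

```python
import hashlib
from pathlib import Path

# Hypothetical list of extensions treated as proprietary; adjust to local policy.
PROPRIETARY_EXTENSIONS = {".sav", ".dta", ".xlsx", ".sas7bdat"}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory_files(root: Path) -> list[dict]:
    """List every file under root with its size, checksum, and format flags."""
    records = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        records.append(
            {
                "path": path.relative_to(root).as_posix(),
                "bytes": size,
                "sha256": sha256_of(path),
                "empty": size == 0,
                "proprietary": path.suffix.lower() in PROPRIETARY_EXTENSIONS,
            }
        )
    return records
```

The resulting records can be saved alongside the file metadata so that later checks can detect files that were changed or corrupted. A flagged extension is only a prompt for the curator to consider conversion, not proof that a file is unusable.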
Documentation Review
- Study Identification is critical to ensure re-users are able to use the correct reproducibility package, contact study authors, and locate files and communications related to the study.
- Keep the instructions for running the code succinct.
- Using relative paths in your code and a master file to execute the scripts helps simplify the instructions.
- Information about the computing environment and the date the analysis was last run is crucial for rebuilding the environment should the need arise.
- The Data Availability Statement should be clear enough that researchers can access the same data used in the analysis and give the files the same names used in the code, so that the code can load them correctly.
- A codebook or data dictionary is critical to the interpretation of data and output files.
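The points about relative paths, a master file, and the computing environment can be sketched together. The example below is one possible layout under stated assumptions, not a required structure: `run_pipeline` and `environment_record` are hypothetical helper names, and the scripts are assumed to be Python files listed in execution order.

```python
import platform
import subprocess
import sys
from datetime import date
from pathlib import Path


def environment_record() -> dict:
    """Capture basic facts about the computing environment and the run date."""
    return {
        "run_date": date.today().isoformat(),
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
    }


def run_pipeline(project_root: Path, scripts: list[str]) -> list[str]:
    """Run each script in order, with paths given relative to project_root.

    Stops at the first failure, because check=True raises CalledProcessError.
    Returns the names of the scripts that completed.
    """
    completed = []
    for relative_name in scripts:
        script = project_root / relative_name
        # Run from the project root so relative paths inside the scripts resolve.
        subprocess.run([sys.executable, str(script)], cwd=project_root, check=True)
        completed.append(relative_name)
    return completed
```

A master file placed at the top of the package could call `run_pipeline(Path(__file__).resolve().parent, [...])` with its own list of script names, so the instructions reduce to "run the master file". Writing the output of `environment_record()` into the documentation gives re-users a starting point for rebuilding the environment, although a full record would also list package versions.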
Data Review
- Data must be checked for missing or ambiguous variable and value labels.
- Observation counts should be consistent across files.
- Data values should be in range and error-free.
- Metadata and references to the data in associated files should be checked for consistency and accuracy.
- Personally identifiable information (PII) should be identified and anonymized or removed.
- Clear metadata about who created, owns, and stewards the data, and what the licensing terms are for reuse, should be created.
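Several of the data checks above lend themselves to simple scripts. The sketch below uses only the Python standard library and assumes the data are available as CSV text and the codebook as a mapping from variable name to label. All function names, and the idea of passing the documented observation count as a number, are illustrative assumptions rather than part of the DQR framework.

```python
import csv
import io


def read_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def unlabeled_variables(rows: list[dict], codebook: dict) -> list[str]:
    """Return column names that have no non-empty label in the codebook."""
    columns = rows[0].keys() if rows else []
    return [name for name in columns if not codebook.get(name, "").strip()]


def counts_match(rows: list[dict], documented_count: int) -> bool:
    """Check the observation count against the count stated in the documentation."""
    return len(rows) == documented_count


def out_of_range(rows: list[dict], column: str, low: float, high: float) -> list[int]:
    """Return 1-based row numbers whose value is missing, non-numeric, or outside [low, high]."""
    flagged = []
    for number, row in enumerate(rows, start=1):
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row, or a value that is not a number.
            flagged.append(number)
            continue
        if not low <= value <= high:
            flagged.append(number)
    return flagged
```

For example, calling `out_of_range(rows, "age", 0, 120)` on a hypothetical `age` column would list the rows a curator should look at by hand. These checks only surface candidates; deciding whether a flagged value is an error still requires the documentation or the original researchers.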
Code Review
- Difficulties with executing code can prevent computational reproducibility.
- Researchers should write code with reproducibility in mind.
- Curators reviewing the code can take steps to ensure that it is executable and documented to facilitate reproducibility.
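Two of the steps a curator might take can be sketched in code: scanning source files for hard-coded absolute paths, which commonly stop code from running on another machine, and running a script to see whether it exits without error. This is a heuristic illustration under stated assumptions. The regular expression covers only quoted Windows drive paths and Unix-style `/home/` or `/Users/` prefixes, the function names are hypothetical, and the runner assumes the script is a Python file.

```python
import re
import subprocess
import sys
from pathlib import Path

# Heuristic pattern for quoted absolute paths (an assumption, not exhaustive):
# Windows drive letters such as C:\ or C:/, and Unix-style /home/ or /Users/ prefixes.
ABSOLUTE_PATH = re.compile(r"""["'](?:[A-Za-z]:[\\/]|/(?:home|Users)/)[^"']*["']""")


def find_absolute_paths(source: str) -> list[tuple[int, str]]:
    """Return (line number, quoted path) pairs for suspected absolute paths."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((number, match.group(0)))
    return hits


def runs_cleanly(script: Path, timeout_seconds: int = 60) -> tuple[bool, str]:
    """Run a Python script from its own directory; report success and any stderr."""
    result = subprocess.run(
        [sys.executable, script.name],
        cwd=script.parent,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return result.returncode == 0, result.stderr
```

A clean exit shows only that the code executes, not that it reproduces the published results; comparing the generated output with the reported tables and figures remains a separate step.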