Before using your dataset, think carefully about whether it is relevant enough and of high enough quality for your purposes. This is especially true because, unlike for scholarly articles, what it means to "peer review" a dataset is not well established (Mayernik et al., 2015). Some popular data quality dimensions are (Pipino et al., 2002):
Dimensions | Questions to Ask |
---|---|
Accessibility | Can you access the data? How easily? |
Amount | Is there a sufficient amount of data (e.g. is the sample size large enough)? Is there too much? |
Believability | Do you think the data is true or credible? |
Completeness | Do you think the data covers your whole topic? Or just a part? Is the data missing values? |
Concise Representation | How compactly is the data represented? |
Ease of manipulation | Is the data always presented in the same format? Are the variable names, units, and scales always the same? |
Free-of-Error | Is the data free of errors such as typos or misformatted fields (e.g. a string of numbers accidentally formatted as a date in Excel)? |
Interpretability | How easy is the data to read and understand? |
Objectivity | Could there be bias or prejudice in the data? In who/what was sampled, the sampling method, or the mode of analysis? |
Relevancy | How applicable is the data to your topic? |
Reputability | Who created the data and what is their reputation? Can they be trusted? |
Timeliness | Was the data collected during the time period you're studying? If you're looking at trends over time, does the data cover the whole time period? How frequently was it collected? |
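
A few of these dimensions (Amount, Completeness, and Timeliness) can be checked quickly with a short script. The sketch below is only an illustration using Python's standard library. The sample data, the column names (`id`, `age`, `country`, `collected`), the list of "missing" markers, and the helper functions are all hypothetical, so adapt them to your own dataset and its codebook.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice, open your own CSV file instead.
SAMPLE_CSV = """id,age,country,collected
1,34,CA,2019-05-01
2,,US,2020-06-15
3,29,,2021-07-30
4,41,US,
"""

# Assumed markers for missing values; check your codebook for the real ones.
MISSING = {"", "NA", "N/A", "null"}


def is_missing(value):
    """Treat None, blank strings, and the assumed markers as missing."""
    return value is None or value.strip() in MISSING


def missing_share(rows, columns):
    """Completeness: fraction of missing values in each column."""
    counts = Counter()
    for row in rows:
        for col in columns:
            if is_missing(row[col]):
                counts[col] += 1
    return {col: counts[col] / len(rows) for col in columns}


def date_range(rows, column):
    """Timeliness: earliest and latest value in a column of ISO dates."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    dates = sorted(row[column] for row in rows if not is_missing(row[column]))
    return (dates[0], dates[-1]) if dates else None


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
columns = reader.fieldnames

print("Rows:", len(rows))                                # Amount
print("Missing share:", missing_share(rows, columns))    # Completeness
print("Date range:", date_range(rows, "collected"))      # Timeliness
```

Checks like these only flag possible problems. Dimensions such as Believability, Objectivity, and Reputability still require your own judgment about who created the data and how.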
Supplementary materials and codebooks contain information that helps you interpret the dataset.
Codebooks (aka supplementary materials or documentation) should (but don't always!) include:
Much of the content in this box was adapted with permission from Princeton University's How to Use a Codebook webpage.