Finding Datasets

This guide explains how to find and access pre-existing datasets.

How Good is the Dataset?

Before using a dataset, think carefully about whether it is sufficiently relevant and of sufficient quality. This matters especially because, unlike peer review of scholarly articles, what it means to "peer review" a dataset is not well established (Mayernik et al., 2015). Some popular data quality dimensions are (Pipino et al., 2002):

Dimension: Questions to Ask
Accessibility: Can you access the data? How easily?
Amount: Is there a sufficient amount of data (e.g. is the sample size large enough)? Is there too much?
Believability: Do you think the data is true or credible?
Completeness: Does the data cover your whole topic, or just a part? Is the data missing values?
Concise Representation: How compactly is the data represented?
Ease of Manipulation: Is the data always presented in the same format? Are the variable names, units, and scales always the same?
Free-of-Error: Is the data free of errors such as typos or mis-formatted fields (e.g. a string of numbers accidentally formatted as a date in Excel)?
Interpretability: How easy is the data to read and understand?
Objectivity: Could there be bias or prejudice in the data, for example in who or what was sampled, the sampling method, or the mode of analysis?
Relevancy: How applicable is the data to your topic?
Reputability: Who created the data and what is their reputation? Can they be trusted?
Timeliness: Was the data collected during the time period you're studying? If you're looking at trends over time, does the data include the whole time period, and how frequently was it collected?
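
A few of the dimensions above (Completeness, Free-of-Error, and Timeliness) can be partly checked with a short script before you commit to a dataset. The following is a minimal sketch in Python using only the standard library. The sample data, the column names (id, year, income), and the quality_report function are hypothetical and are here only for illustration; dimensions such as Believability or Reputability still require human judgment.

```python
import csv
import io

# Hypothetical sample data; in practice, open your own CSV file instead.
SAMPLE_CSV = """id,year,income
1,2015,52000
2,2016,
3,2016,forty thousand
4,2019,61000
"""


def quality_report(rows, numeric_columns, year_column, start_year, end_year):
    """Summarize a few data quality dimensions for a list of row dicts."""
    columns = list(rows[0].keys()) if rows else []

    # Completeness: count empty cells in each column.
    missing = {
        c: sum(1 for r in rows if not (r[c] or "").strip()) for c in columns
    }

    # Free-of-Error: non-empty cells in numeric columns that are not numbers.
    malformed = {}
    for c in numeric_columns:
        bad = []
        for r in rows:
            value = (r[c] or "").strip()
            if value:
                try:
                    float(value)
                except ValueError:
                    bad.append(value)
        malformed[c] = bad

    # Timeliness: years in the study period with no observations at all.
    years_present = {
        int(r[year_column])
        for r in rows
        if (r[year_column] or "").strip().isdigit()
    }
    missing_years = [
        y for y in range(start_year, end_year + 1) if y not in years_present
    ]

    return {
        "n_rows": len(rows),
        "missing": missing,
        "malformed": malformed,
        "missing_years": missing_years,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = quality_report(
    rows,
    numeric_columns=["income"],
    year_column="year",
    start_year=2015,
    end_year=2019,
)
print(report)
```

For the sample data, the report flags one missing income value, one income value that is not a number, and no observations for 2017 or 2018. Whether gaps like these matter depends on your research question.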
Mayernik, M. S., Callaghan, S., Leigh, R., Tedds, J., & Worley, S. (2015). Peer Review of Datasets: When, Why, and How. Bulletin of the American Meteorological Society, 96(2), 191–201. https://doi.org/10.1175/BAMS-D-13-00083.1
Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211–218. https://doi.org/10.1145/505248.506010

Supplementary Materials and Codebooks

Supplementary materials and codebooks contain information that helps you interpret the dataset.

Codebooks (aka supplementary materials or documentation) should (but don't always!) include:

  1. Description of the study: who did it, why they did it, how they did it.
  2. Sampling information: what was the population studied, how was the sample drawn, what was the response rate.
  3. Technical information about the dataset files: number of observations, record length, number of records per observation, etc.
  4. Structure of the data within the file: e.g., one big table, multiple tables connected by shared values, tables nested within other tables, etc.
  5. Details about the data: columns in which specific variables can be found, whether they are numeric, alphanumeric, etc.
  6. Text of the questions and possible responses (if applicable): some codebooks even report how many people gave each response.
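
One way to put items 5 and 6 to work is to transcribe the relevant codebook entries into a small lookup structure and use it to decode and sanity-check raw values. The following is a minimal Python sketch. The variable names (age, q1, zip), the question text, the response codes, and the decode_row function are all hypothetical; a real codebook defines its own variables and codes.

```python
# Hypothetical codebook entries, transcribed by hand from the documentation.
CODEBOOK = {
    "age": {"type": "numeric", "label": "Age of respondent in years"},
    "q1": {
        "type": "categorical",
        "label": "Overall, how would you rate your health?",
        "codes": {
            "1": "Excellent",
            "2": "Good",
            "3": "Fair",
            "4": "Poor",
            "-1": "Refused",
        },
    },
}


def decode_row(row, codebook):
    """Return (decoded_row, problems) for one record of raw string values."""
    decoded, problems = {}, []
    for name, raw in row.items():
        entry = codebook.get(name)
        if entry is None:
            # The variable is in the data but not documented.
            problems.append(f"{name}: not documented in codebook")
            decoded[name] = raw
        elif entry["type"] == "numeric":
            try:
                decoded[name] = float(raw)
            except ValueError:
                problems.append(f"{name}: expected a number, got {raw!r}")
                decoded[name] = None
        else:
            # Categorical variable: translate the stored code into its label.
            label = entry["codes"].get(raw)
            if label is None:
                problems.append(f"{name}: code {raw!r} not listed in codebook")
            decoded[name] = label
    return decoded, problems


decoded, problems = decode_row({"age": "67", "q1": "2", "zip": "48109"}, CODEBOOK)
print(decoded)
print(problems)
```

Undocumented variables and unlisted codes are reported rather than silently dropped, since they often point to a mismatch between the data file and the version of the codebook you are reading.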

Much of the content in this box was adapted with permission from Princeton University's How to Use a Codebook webpage.

Codebooks - Examples

An example of short prose sections within a codebook providing information about what the dataset was used for and citation information for the publication which uses that data.

Leon-Moreta, Agustin. Municipal Incorporation in the United States. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2018-07-22. https://doi.org/10.3886/E100977V3

An example of short prose sections within a codebook explaining the methodology for one type of variable in the dataset, "International trade flows and gravity variables", and providing relevant citations.

Head, Keith, and Mayer, Thierry. Data and code for: “The United States of Europe: A gravity model evaluation of the four freedoms.” Nashville, TN: American Economic Association [publisher], 2021. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2021-05-03. https://doi.org/10.3886/E133281V1

An example of short prose sections within a codebook providing information about how variables are formatted and calculated.

Malani, Preeti, Kullgren, Jeffrey, and Solway, Erica. National Poll on Healthy Aging (NPHA), [United States], March 2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2021-06-11. https://doi.org/10.3886/E130461V2

An example of a codebook using a captioned table to describe study methodology.

Borick, Christopher, Mills, Sarah, and Rabe, Barry. National Surveys on Energy and Environment [United States]: Fall 2017 NSEE: nsee fall 2017 codebook.pdf. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2019-08-14. https://doi.org/10.3886/E100167V19-21789

An example of a codebook defining variable characteristics.

Caraballo, César, Mahajan, Shiwani, Valero-Elizondo, Javier, Herrin, Jeph, and Krumholz, Harlan. Trends in Differences in Health Status and Health Care Access and Affordability by Race and Ethnicity in the United States, 1999-2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2021-09-05. https://doi.org/10.3886/E149221V1

Another example of a codebook using a table to summarize variable characteristics.

Kim, Min Hee, Li, Mao, Sylvers, Dominique, Esposito, Michael, Gomez-Lopez, Iris, Clarke, Philippa, and Chenoweth, Megan. National Neighborhood Data Archive (NaNDA): School Counts by Census Tract, United States, 2000-2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2021-12-10. https://doi.org/10.3886/E156024V1