
Chapter 6

Finding external data

By Jane Foo

Running your own study to collect data is not the only way, or always the best way, to start your data analysis. Using someone else’s dataset and sharing your own data are both on the rise, and both practices have helped advance much of the recent research. Using external data offers several benefits:

Time / Cost Can decrease the work required to collect and prepare data for analysis
Access May allow you to work with data that requires more resources to collect than you have, or data that you wouldn’t otherwise have access to at all
Community Promotes new ideas and interesting collaborations by connecting you to people who are interested in the same topic

Where to Find External Data

All those benefits sound great! So where do you find external data? To help narrow your search, ask yourself the following questions:

Scope What is the scope of the data you’re looking for? What are the:
  • geographic boundaries?
  • specific data attributes (such as age range)?
  • time periods?
Type What type of data are you looking for? Do you need:
  • statistics?
  • research data?
  • raw data?
  • data that have been collected using a specific method?
Contribution How will the data contribute to your existing data analysis?
Do you need several external datasets to complete your analysis?

Public Data

Once you have a better idea of what you’re looking for in an external dataset, you can start your search at one of the many public data sources available to you, thanks to the open content and access movement that has been gaining traction on the Internet. Many institutions, governments, and organizations have established policies that support the release of data to the public in order to provide more transparency and accountability and to encourage the development of new services and products. Here’s a breakdown of public data sources:

Source Examples
Search Engines Google
Data Repositories re3data.org, DataBib, DataCite, Dryad, DataCatalogs.org, Open Access Directory, Gapminder, Google Public Data Explorer, IBM Many Eyes, Knoema
Government Datasets World Bank, United Nations, Open Data Index, Open Data Barometer, U.S. Government Data, Kenya’s Open Data Initiative
Research Institutions Academic Torrents, American Psychological Association, other professional associations, academic institutions

If you decide to use a search engine (like Google) to look for datasets, keep in mind that you’ll only find things that the search engine has indexed. Sometimes a website (and the resource associated with it) is visible only to registered users or is set to block search engines, so these kinds of sites won’t turn up in your search results. Even so, the Internet is a big playground, so save yourself the headache of scrolling through lots of irrelevant search results by being clear and specific about what you’re looking for.

If you’re not sure what to do with a particular type of data, try browsing through the Information is Beautiful awards for inspiration. You can also attend events such as the annual Open Data Day to see what others have done with open data.

Open data repositories benefit both contributors and users by providing an online forum for sharing data and brainstorming new ways to study and discuss it. In some cases, data crowdsourcing has led to findings that would otherwise have developed much more slowly or would not have been possible at all. One of the more publicized crowdsourcing projects is Foldit from the University of Washington, a Web-based puzzle game that lets anyone submit protein folding variations, which scientists use to build new solutions in bioinformatics and medicine. More recently, Cancer Research UK released a mobile game called Genes in Space that tasks users with identifying cancer cells in biopsy slides, which in turn helps researchers cut down data analysis time.

Non-Public Data

Of course, not all data is public. There may come a time when you have access to a special collection of data because of your status within a particular network or through an existing relationship. Or maybe you come across a dataset that you can buy. In either case, you typically have to agree to and sign a license in order to get the data, so always make sure that you review the Terms of Use before you buy. If no terms are provided, insist on getting written permission to use the dataset.

Assessing External Data

Let’s say you’ve found a dataset that fits your criteria. But is the quality good enough?

Assessing data quality means looking at all the details provided about the data (including metadata, or “data about the data,” such as time and date of creation) and the context in which the data is presented. Good datasets will provide details about the dataset’s purpose, ownership, methods, scope, dates, and other notes. For online datasets, you can often find this information by navigating to the “About” or “More Information” web pages or by following a “Documentation” link.
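As a rough illustration, the check for these details can be sketched in a few lines of Python. This is a minimal sketch, not a standard tool: the field names simply mirror the details listed above, and the example record is made up.

```python
# Hypothetical field names, taken from the details a good dataset should provide.
EXPECTED_DETAILS = ("purpose", "ownership", "methods", "scope", "dates", "notes")


def missing_details(metadata: dict) -> list:
    """Return the expected details that are absent or blank in the metadata."""
    missing = []
    for field in EXPECTED_DETAILS:
        value = metadata.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Made-up metadata record, for illustration only.
example = {
    "purpose": "Annual household survey",
    "ownership": "Example Statistics Office",
    "methods": "",
    "dates": "2012-2013",
}

print(missing_details(example))  # ['methods', 'scope', 'notes']
```

A non-empty result does not mean the dataset is bad; it only tells you which details to go looking for in the documentation or to request from the data owner.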

Feel free to use general information evaluation techniques when reviewing data. For instance, one popular method used by academic libraries is the CRAAP Test, which is a set of questions that help you determine the quality of a text. The acronym stands for:

Currency Is the information up-to-date? When was it collected / published / updated?
Relevance Is the information suitable for your intended use? Does it address your research question? Is there other (better) information?
Authority Is the creator of the information reputable, with the necessary credentials? Can you trust the information?
Accuracy Do you spot any errors? What is the source of the information? Can other data or research support this information?
Purpose What was the intended purpose of the information collected? Are other potential uses identified?

Finally, when you review the dataset and its details, watch out for the following red flags:

Using External Data

So now you have a dataset that meets your criteria and quality requirements, and you have permission to use it. What other things should you consider before you start your work?

Checklist  
Did you get all the necessary details about the data? Don’t forget to obtain variable specifications, external data dictionaries, and referenced works.
Is the data part of a bigger dataset or body of research? If yes, look for relevant specifications or notes from the bigger dataset.
Has the dataset been used before? If it has and you’re using the data for an analysis, make sure your analysis is adding new insights to what you know has been done with the data previously.
How are you documenting your process and use of the data? Make sure to keep records of licensing rights, communication with data owners, and, if applicable, data storage and retention.
Are you planning to share your results or findings in the future? If yes, you’ll need to include your data dictionary and a list of your additional data sources.

Your answers to these questions can change the scope of your analysis or prompt you to look for additional data. They may even lead you to think of an entirely new research question.

The checklist encourages you to document (a lot). Careful documentation is important for two big reasons. First, in case you need to redo your analysis, your documentation will help you retrace what you did. Second, your documentation will provide evidence to other researchers that your analysis was conducted properly and allow them to build on your data findings.
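One lightweight way to keep such records is a small structured log that lives next to your analysis files. The sketch below is only one possible layout: the class name, the field names, and the example values are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetRecord:
    """One entry in a hypothetical data-use log; every field name is an assumption."""
    title: str
    source: str
    license_terms: str
    date_obtained: str  # ISO date string, for example "2014-03-01"
    owner_contacts: list = field(default_factory=list)
    processing_steps: list = field(default_factory=list)

    def log_step(self, description: str) -> None:
        """Append a short note describing what was done to the data."""
        self.processing_steps.append(description)


# Made-up example values, for illustration only.
record = DatasetRecord(
    title="Example survey extract",
    source="Example data repository",
    license_terms="Written permission from the data owner",
    date_obtained="2014-03-01",
)
record.log_step("Removed rows with missing region codes")
record.log_step("Converted currency amounts to a single unit")

# Serialize the record so it can be saved alongside the analysis.
print(json.dumps(asdict(record), indent=2))
```

Writing each step down as you go, rather than reconstructing it afterward, is what makes the log useful when you need to redo the analysis or explain it to someone else.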

Giving Credit to External Data Sources

Simply put, crediting the source of your external dataset is the right thing to do. It’s also mandatory. Ethical research guidelines state that crediting sources is required for any type of research. So always make sure that you properly credit any external data you use by providing citations.

Good citations give the reader enough information to find the data that you have accessed and used. Wondering what a good citation looks like? Try using an existing citation style manual from APA, MLA, Chicago, Turabian, or Harvard. Unlike citations for published items (like books), citations for a dataset vary a great deal from style to style.

As a general rule, all styles require the author and the title. In addition, editor, producer or distributor information (location, publication date), access date (when you first viewed the data), details about the dataset (unique identifier, edition, material type), and the URL may be needed. For government datasets, use the name of the department, committee or agency as the group / corporate author.
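To make the required and optional elements concrete, here is a minimal Python sketch that collects them into a single line. It deliberately does not follow APA, MLA, or any other style manual: the function name, the element names, and the output layout are assumptions, and the sample values are made up. Its only purpose is to show that author and title are required while the other elements are optional.

```python
# Optional elements, in the order they are appended. Names are hypothetical.
OPTIONAL_ELEMENTS = ("publication_date", "distributor", "identifier", "url", "access_date")


def build_reference(author: str, title: str, **details: str) -> str:
    """Join citation elements into one generic line (not any official style)."""
    if not author.strip() or not title.strip():
        raise ValueError("author and title are required")
    unknown = set(details) - set(OPTIONAL_ELEMENTS)
    if unknown:
        raise ValueError(f"unexpected elements: {sorted(unknown)}")
    parts = [author.strip(), title.strip()]
    for name in OPTIONAL_ELEMENTS:
        if details.get(name):
            parts.append(f"{name.replace('_', ' ')}: {details[name]}")
    return "; ".join(parts)


# Made-up values, for illustration only.
print(build_reference(
    "Example Statistics Agency",
    "Example Payroll Survey",
    publication_date="2013",
    url="https://example.org/data",
))
```

For a real citation, take the collected elements and arrange them according to the style manual or the repository's own guidelines.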

For example, let’s say you’re using the U.S. Census Annual Survey of Public Employment and Payroll.

The APA Style Manual (Publication Manual of the American Psychological Association, 6th edition) would cite this the following way:

APA citation

while the MLA Style Manual (MLA Handbook for Writers of Research Papers, 7th edition) cites the same census data as:

MLA citation

Data repositories and organizations often have their own citation guidelines and provide ready citations that you can use “as is”. The Inter-university Consortium for Political and Social Research (ICPSR), the National Center for Health Statistics, Dryad, PANGAEA, and Roper Center Data all provide guidelines for citing their datasets.

This chapter gives you a brief look into external data. The important takeaway is that we are only at the start of significant growth in data, thanks to the technologies that now make massive data storage and processing an affordable reality. Open datasets in particular have the potential to become a de facto standard for anyone looking for data to analyze.