I attended the Mina Rees Library’s workshop on data management plans. A data management plan is usually required in grant applications and papers. It describes the research data, meaning the material needed to reach the project’s conclusions, along with the methods and procedures used to collect it.
We talked about a few reasons to share this data, chief among them ensuring reproducibility.
We also talked about a few things a data management plan needs to cover:
- How is the data exposed? What will be shared, who is the audience, and is it citable?
- How will it be preserved? The CUNY Academic Works repository came up as a good example, since it makes the data discoverable, for instance through a Google search. It is important not to archive the data in a proprietary format: it should be open, unencrypted and uncompressed.
We also discussed some best practices for handling data:
- Some disciplines have specific data structure standards, such as conventions for labeling fields. It is important to follow the ones that apply to your field.
- Column names should be human-readable, not coded, unless a data dictionary is included.
- It’s important to consider how NULL (missing) values are represented.
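As a rough illustration of the last two points, a small Python sketch might look like the one below. Everything in it is made up for the example: the file contents, the coded column names (`q1`, `q2`, `q3`), the data dictionary and the choice of `""` and `-99` as missing-value markers are assumptions, not something shown in the workshop.

```python
import csv
import io

# Hypothetical survey export: coded column names and two different
# conventions for missing values ("" and "-99").
RAW_CSV = """q1,q2,q3
34,-99,yes
,12,no
51,7,
"""

# Hypothetical data dictionary mapping coded names to readable ones.
DATA_DICTIONARY = {
    "q1": "age_years",
    "q2": "library_visits_per_month",
    "q3": "uses_repository",
}

# Every marker that stands for "no value" in this made-up file.
MISSING_MARKERS = {"", "-99"}


def load_rows(text):
    """Return rows with readable column names and None for missing values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            DATA_DICTIONARY[code]: (None if value in MISSING_MARKERS else value)
            for code, value in record.items()
        })
    return rows


rows = load_rows(RAW_CSV)
for row in rows:
    print(row)
```

Keeping the dictionary and the list of missing-value markers next to the data means that a later reader does not have to guess whether `-99` is a real measurement or a placeholder.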
Another best practice we talked about, and the one I want to discuss further in this blog post, is “context”. Publishing spreadsheets without a readme and a data abstract almost guarantees that the data will be taken out of context and used in ways it should not be, for example to answer questions it cannot answer. This brought me back to Data Feminism by Catherine D’Ignazio and Lauren Klein. We read a chapter of this book for the week where we discussed Epistemologies of DH, and I recently read chapter 6 for the Advanced Interactive Data Visualization course. The chapter, entitled “The Numbers Don’t Speak for Themselves”, presents the 6th principle of Data Feminism:
“Principle #6 of Data Feminism is to Consider Context. Data feminism asserts that data are not neutral or objective. They are the products of unequal social relations, and this context is essential for conducting accurate, ethical analysis.”
Klein and D’Ignazio bring up very interesting examples of missing context and its unwanted repercussions. At a time when open-source is a widely used and encouraged model, it is necessary to consider the impact that one’s data can have once it is published and easily accessible.
The first example is a data-driven report by FiveThirtyEight titled “Kidnapping of Girls in Nigeria Is Part of a Worsening Problem.” The article aims to show that the number of kidnappings is at a peak, using data from the Global Database of Events, Language and Tone (GDELT). The report said that there were 3608 kidnappings of young women in 2013. But that was not true. GDELT is a project that collects and parses news reports, which means its data can contain multiple records for a single kidnapping, or any other event, since multiple news reports were probably written about it. GDELT might not have explained this clearly on their website, and FiveThirtyEight used data that was wrong for their research question, resulting in a misleading data visualization.
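To see why counting rows overstates the number of events, here is a toy Python sketch. The records are invented and the field names (`event_id`, `source`) are hypothetical, not GDELT’s actual schema. It also assumes each report already carries a reliable event identifier, which real news-derived data often does not.

```python
from collections import Counter

# Made-up records: each row is one news report, and several reports
# can describe the same underlying event.
reports = [
    {"event_id": "A", "source": "outlet-1"},
    {"event_id": "A", "source": "outlet-2"},
    {"event_id": "A", "source": "outlet-3"},
    {"event_id": "B", "source": "outlet-1"},
    {"event_id": "C", "source": "outlet-2"},
    {"event_id": "C", "source": "outlet-4"},
]

# Counting rows counts reports, not events.
report_count = len(reports)

# Grouping by the (hypothetical) event identifier counts distinct events.
reports_per_event = Counter(r["event_id"] for r in reports)
event_count = len(reports_per_event)

print(f"reports: {report_count}, distinct events: {event_count}")
```

In this toy data the row count is twice the number of events. A readme that says “one row is one news report” is the kind of context that would warn a reader away from treating the first number as the second.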
I know I will keep this in mind when working on future data projects and when including a data management plan for my capstone project.