Chapter 4: Data Governance
We provide a data governance protocol to determine which datasets are safe to release.
4.1 Principles
Data governance is an institutional framework that promotes accountable and ethical “planning, oversight, and control over the management of data and the use of data and data-related sources” (DAMA). Simply put, strong data governance fosters accountability and ethics in the management and use of data. It establishes mutual trust within the community your open data portal serves and mitigates potential harms to individuals and populations.
KEY IDEA
It is the responsibility of your open data group to collect and publish data in a way that protects privacy, prevents misuse and misinterpretation, and builds community trust.
4.2 Opening Data Responsibly
An open data group must take a holistic approach to publishing data responsibly. Below are some questions to consider; skip to the 'Data Governance Protocol' section for a comprehensive breakdown.
- What is the benefit of publishing a dataset?
- What metadata should be collected?
- What are the risks associated with the dataset - both at the individual and group level?
- Can any of the risks be mitigated through de-identification?
- How do we assess the impact of a dataset after publication?
4.3 Roles and Responsibilities
An open data group can consider itself the ‘gatekeeper’ of the open data portal. We believe that no one person should hold this important responsibility; rather, the entire group should work with the administration to create effective and ethical data governance. Having a diverse team is critical to data governance – people with different backgrounds contribute different and important perspectives on the harms and benefits of a dataset.
4.4 Pre-Publishing
4.4.1 Dataset Collection
Responsible data governance starts at dataset acquisition. After all, an open data portal team must collect the information necessary to (1) assess and contextualize the dataset properly, and (2) present the dataset with the appropriate context and description. Although there are different channels for acquiring datasets, we have created a standardized approach to this important step in the process. Below are important fields to include when creating a ‘Dataset Upload’ form; a schema sketch follows the list.
- Dataset (.csv, hyperlink, .pdf)
- Dataset Name (String)
- Email of uploader (String). Note: this won’t be made public; it is kept only for internal records
- Publish Date (JS Date Object)
- Source (String)
- Link to the raw data source (hyperlink, .pdf)
- Category Tag (Drop-down menu, string)
- Data Type Tag (Drop-down menu, string)
- Description
- Provide a 1-2 sentence high-level description of the dataset.
- Define any key terms mentioned in the dataset.
- Who is the primary audience?
- What is the purpose of the dataset and why does it exist?
- Who does this data belong to, and have they given permission to share the data?
- What kind of decisions are being made with the dataset (currently and in the future)?
- Who can provide context around the dataset and answer questions about how it was collected, the steps taken to process it, and how to interpret the variables?
- Where has this dataset been used before?
- Is there anything else you would like to share about this dataset, such as privacy considerations or quality concerns?
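As a rough illustration, the fields above might map to a submission record like the TypeScript sketch below. The interface and field names are our own assumptions, not a required format; adapt them to whatever intake form or tool your portal uses.

```typescript
// Sketch of a dataset submission record, assuming the fields listed above.
// Names and types are illustrative, not a required schema.
interface DatasetSubmission {
  dataset: string;              // file path or hyperlink to the .csv/.pdf
  datasetName: string;
  uploaderEmail: string;        // internal record only; never published
  publishDate: Date;
  source: string;
  rawDataLink: string;          // hyperlink or .pdf reference to the raw source
  categoryTag: string;          // chosen from a drop-down menu
  dataTypeTag: string;          // chosen from a drop-down menu
  description: {
    summary: string;            // 1-2 sentence high-level description
    keyTerms: Record<string, string>;
    primaryAudience: string;
    purpose: string;
    ownershipAndPermission: string;
    decisionsSupported: string;
    contactForContext: string;
    priorUses: string;
    otherNotes?: string;        // privacy considerations, quality concerns, etc.
  };
}
```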
4.4.2 Dataset Assessment
Posting a dataset on your portal gives it an appearance of statistical legitimacy to the public, so your open data group should assess the quality of a dataset before posting. A dataset may capture only part of the true picture, the data collection may be biased, and it can be difficult to assess this bias using a single dataset. For example, when deciding whether to publish a dataset about sexual violence, it would be important to note that sexual assault is historically under-documented on college campuses. We discuss an approach to approximating data quality in the ‘Data Governance Protocol’ section.
Next, you must holistically evaluate the utility and risks of publishing the dataset. Teams may not understand all the factors at play, so we recommend opening constructive channels for dialogue with the administration and/or the uploader of the data.
First, the group must assess the benefits of publishing the dataset. While individual datasets can serve specific purposes, publishing datasets more broadly can achieve important goals for universities such as increasing student trust in the administration, inspiring new products and ideas, simplifying data requests, and enabling data sharing across the administration.
- Take a look at the ‘Community Impact’ and ‘Generate Buy-in and Demand’ sections for discussions on the benefits of open data.
- Skip to the ‘Data Governance Protocol’ section for a set of questions that can frame and quantify the perceived utility.
Second, the group must assess the risks of publishing the dataset. Common risks include privacy concerns, potential misuse and misrepresentation of data, bias, and issues of consent. It is important to remember that “risk can be managed, but will not be zero” (Open Data Release Toolkit). Below are some risks to think about; again, skip to the ‘Data Governance Protocol’ section for a set of questions that can frame and quantify the perceived risks.
Consent is a bedrock principle of data collection and analysis. In general, before adding a dataset to your portal, you should:
- Obtain consent for the collection of data, or ensure that consent was obtained before its collection
- Obtain consent for how the data will be used
If a dataset is published without safeguards, personal information can be discovered through it. It is important to work with the administration to create guidelines that maintain anonymity. Here, we adopt recommendations from Harvard’s Berkman Klein Center for Internet & Society:
- Conduct risk-benefit analyses to inform which, and how, datasets are published
- Consider privacy at each stage of the data lifecycle
- Develop operational structures and processes that codify privacy management widely
- Emphasize campus engagement and campus priorities as essential aspects of data management
With public data, malicious actors may try to use the data in nefarious ways. Some strategies to prevent this include:
- Identify vulnerable datasets, including datasets that are composed of individual-level information
- Specify misuse cases, including examples that may break the law or university policy
- Conduct due diligence to see if any other data exists that could deanonymize users; removing personally identifiable information is often not enough! A sketch of one such check follows this list.
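One lightweight way to start this due diligence is to count how many records share each combination of quasi-identifiers (fields that could be matched against outside data). The sketch below is an illustration of that idea only; the column names are hypothetical, and a real review should also consider external datasets your group knows about.

```typescript
// Sketch: flag quasi-identifier combinations that appear in very few rows,
// since rare combinations are easier to link to outside data sources.
type Row = Record<string, string>;

function rareCombinations(rows: Row[], quasiIdentifiers: string[], k = 5): Row[] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = quasiIdentifiers.map((col) => row[col] ?? "").join("|");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Return rows whose quasi-identifier combination occurs fewer than k times.
  return rows.filter((row) => {
    const key = quasiIdentifiers.map((col) => row[col] ?? "").join("|");
    return (counts.get(key) ?? 0) < k;
  });
}

// Hypothetical example: fewer than 5 students sharing a major/year/residence
// combination would be flagged for closer review before publication.
// rareCombinations(rows, ["major", "classYear", "residenceHall"]);
```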
Many individuals who are interested in this data may have limited experience working with data. As a result, even with the best of intentions, the data may be misconstrued or used harmfully. To reduce this risk:
- Present the data without personally identifiable information
- Consider adding minor obstacles to access sensitive data — logins, access requests, and data ethics training are possible options for a healthier balance between benefits and risks.
It’s easy to misrepresent your datasets. As popularized by Mark Twain, “There are three kinds of lies: lies, damned lies, and statistics.” Here are a few strategies to reduce these errors:
- Describe datasets clearly! Explain who collected the information, when it was collected, and how they did it.
- Clarify confusing entries in a dataset (e.g. outliers) by providing context if it exists
- Clarify confusing variables by providing definitions and example interpretations
- If the dataset has embedded sanity checks (e.g. every row adds to a certain number), make sure to mention those; a small sketch of one such check follows this list.
- How did selection bias factor into the dataset? As the Human Rights Data Analysis Group puts it, “Data collected by what we can observe is what statisticians call a convenience sample, which is subject to selection bias.” To help minimize this risk, make sure to specify where and why your sample may be unrepresentative. Also remember that data might be missing to conceal information, possibly with malicious intent.
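Where such sanity checks exist, it can help to document them as executable assertions rather than prose. The snippet below sketches that idea for the hypothetical case where per-category counts in each row should add up to the row's reported total; the column names are placeholders.

```typescript
// Sketch: verify a documented sanity check, e.g. that the per-category counts
// in each row add up to the row's reported total. Column names are hypothetical.
function checkRowTotals(
  rows: Array<Record<string, number>>,
  categoryColumns: string[],
  totalColumn: string
): number[] {
  const failingRows: number[] = [];
  rows.forEach((row, i) => {
    const sum = categoryColumns.reduce((acc, col) => acc + (row[col] ?? 0), 0);
    if (sum !== row[totalColumn]) failingRows.push(i); // note the row index for follow-up
  });
  return failingRows;
}
```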
Data may perpetuate inequities in society. Here are a few questions to think about:
- Who collected the data?
- Who was the data collected from? Were certain groups over- or under-represented?
- What questions were asked and how were they framed?
- Can this dataset be used to perpetuate dangerous stereotypes or decisions? For example, a dataset that shows grades based on race or gender should be treated very carefully.
Third, the group must decide whether any of the risks can be mitigated through de-identification. De-identification refers to a collection of processes that prevent personal information from being revealed. However, de-identification may strip away a dataset’s context and meaning, rendering it useless to the public. For most datasets, we recommend following Khaled El Emam’s De-Identification Protocol for Open Data, clearly laid out on pages 15 and 25 of DataSF’s Open Data Release Toolkit. In some cases, it may also be worth considering other methods such as differential privacy.
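As one concrete illustration of the differential privacy idea mentioned above (not the El Emam protocol itself), noise can be added to aggregate counts before release so that any single individual's presence has a limited effect on the published number. This is a minimal sketch under assumed parameters, not a vetted implementation; real deployments should rely on an audited library and expert review.

```typescript
// Minimal sketch of a differentially private count: add Laplace noise scaled to
// the sensitivity of the query (1 for a simple count) divided by epsilon.
// The epsilon value and rounding choice here are illustrative assumptions.
function laplaceNoise(scale: number): number {
  const u = Math.random() - 0.5;                      // uniform in (-0.5, 0.5)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function privateCount(trueCount: number, epsilon = 1.0): number {
  const sensitivity = 1;                              // one person changes a count by at most 1
  const noisy = trueCount + laplaceNoise(sensitivity / epsilon);
  return Math.max(0, Math.round(noisy));              // clamp and round for publication
}

// Example: publish a noisy count of students in a small category.
// privateCount(12); // e.g. 11, 12, or 13 on different runs
```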
Lastly, use the risk and utility assessments to create an overall risk-utility score. We have put together a data governance protocol later in the guide.
4.5 Post-Publishing
Although an open data portal team can discuss the utility and risks of a dataset in advance, in most cases the true impact of a dataset is not felt until after it has been published. As such, the open data group must:
- Have discussions with the resident expert/data steward about updating the dataset
- Maintain good relationships with all data stakeholders by having conversations with community members that could be impacted by the dataset. To paraphrase a popular slogan, “No data about a stakeholder without a stakeholder”
- Understand and document risks and utility after the dataset has been published and incorporate the knowledge into future data governance guidelines.
4.6 Data Governance Protocol
We have put together a step-by-step process for assessing a dataset; the protocol has been reviewed by the Northwestern Data Governance group. The framework serves as both a checklist and a conversation-starter. The open data group should work to mitigate the risks and amplify the benefits of releasing a dataset. In most cases, the protocol will be used for datasets that are not already public, but it can still be applied to existing public datasets. It is best to have conversations with the university administration, especially when determining the risk score, because they will have valuable information that most students would not know about. A dataset may unilaterally be considered 'un-publishable' by the administration even if it passes the process. Lastly, remember that this is guidance, not a hard-and-fast rule. The weights and questions are based on our own experiences, discussions amongst ourselves, and discussions with experts in the field. Feel free to adjust the weights and questions based on your group’s and university’s needs. Please do not simply interpret the output as the correct decision without doing your own independent assessment and analysis.
Ask questions about what types of analysis are valid to conduct, the representativeness of the dataset, the completeness of the dataset, whether the dataset was the submitter’s to share, and so on. At the same time, we believe that administrative data should be more open to the public: it is often the source from which university decisions are made, so it is important for the public to be able to find biases and errors in it. We have provided a simple matrix to determine whether a dataset is okay to publish based on its source and sensitivity. The protocol outputs two scores, a utility level and a risk level, which correspond to the two axes of the matrix; their intersection yields a relative risk score. Lower relative risk scores mean that the utility of the dataset likely outweighs the risk, and vice versa.
Utility-Risk Matrix: What is the relative risk level of publishing the dataset?

| Risk Level \ Utility Level | High (5+) | Medium (3-4) | Low (1-2) | None (0) |
|---|---|---|---|---|
| Minimal (0-2) | Very low (e.g. already public datasets) | Very low (e.g. sports data) | Low (e.g. US News, dining hall) | Moderate (e.g. library volume data) |
| Low (3-4) | Very low | Low (e.g. transports to hospital) | Moderate | Significant |
| Medium (5-7) | Low (e.g. crime data) | Moderate (e.g. student demographic data by category) | Significant | High (e.g. student ID numbers) |
| High (8-10+) | Moderate | Significant (e.g. religious affiliation; race/gender/identity; alumni giving data) | High (e.g. personal health data) | Extreme (e.g. student names and addresses) |
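For groups that want to automate the lookup, the matrix above can be encoded directly. The sketch below assumes the score bands shown in the table; it simply maps a utility score and a risk score to the corresponding relative risk label and is not a substitute for the discussion the protocol calls for.

```typescript
// Sketch: map a utility score and a risk score to the relative risk level
// from the matrix above. Bands follow the table; adjust to your group's weights.
type RelativeRisk = "Very low" | "Low" | "Moderate" | "Significant" | "High" | "Extreme";

// Rows: risk bands (Minimal, Low, Medium, High); columns: utility bands (High, Medium, Low, None).
const matrix: RelativeRisk[][] = [
  ["Very low", "Very low", "Low", "Moderate"],
  ["Very low", "Low", "Moderate", "Significant"],
  ["Low", "Moderate", "Significant", "High"],
  ["Moderate", "Significant", "High", "Extreme"],
];

function utilityBand(utility: number): number {
  if (utility >= 5) return 0;        // High (5+)
  if (utility >= 3) return 1;        // Medium (3-4)
  if (utility >= 1) return 2;        // Low (1-2)
  return 3;                          // None (0)
}

function riskBand(risk: number): number {
  if (risk >= 8) return 3;           // High (8-10+)
  if (risk >= 5) return 2;           // Medium (5-7)
  if (risk >= 3) return 1;           // Low (3-4)
  return 0;                          // Minimal (0-2)
}

function relativeRisk(utility: number, risk: number): RelativeRisk {
  return matrix[riskBand(risk)][utilityBand(utility)];
}

// Example: a dataset scoring 3 on utility and 6 on risk maps to "Moderate".
// relativeRisk(3, 6);
```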
This data governance protocol is a working draft and it may not reflect all potential concerns. Please let us know if you have any comments or suggestions.