Chapter 4: Data Governance

We provide a data governance protocol to determine which datasets are safe to release.

4.1 Principles

Data governance is an institutional framework that promotes accountable and ethical “planning, oversight, and control over the management of data and the use of data and data-related sources” (DAMA). Simply put, strong data governance fosters accountability and ethics in management and use of data. It establishes mutual trust within the community your open data portal serves, and mitigates potential harms to individuals and populations.

KEY IDEA

It is the responsibility of your open data group to collect and publish data in a way that protects privacy, prevents misuse and misinterpretation, and builds community trust.
While data can be a powerful tool for understanding the impact and efficiency of the many functions of a university, making university data public can pose real risks, such as revealing information about students, communities, and groups. For example, your group might publish a dataset of the racial distribution of each department over time, but if the dataset breaks down the university’s Chemistry Department by gender, race, and tenure status, it may be possible to identify individuals in the dataset (DataSF 2).
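This kind of re-identification risk can be screened for mechanically. The sketch below (with made-up column names and data) counts how many people share each combination of quasi-identifiers and flags combinations held by fewer than k individuals, the idea behind the k-anonymity heuristic:

```python
from collections import Counter

def small_groups(rows, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k
    individuals -- rows in these groups risk re-identification."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical department roster
rows = [
    {"gender": "F", "race": "Asian", "tenure": "Tenured"},
    {"gender": "F", "race": "Asian", "tenure": "Tenured"},
    {"gender": "M", "race": "Black", "tenure": "Untenured"},
]

print(small_groups(rows, ["gender", "race", "tenure"], k=2))
# {('M', 'Black', 'Untenured'): 1}
```

Any combination the check flags is one where publishing the full breakdown would single out a real person, so those cells should be suppressed or aggregated before release.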

4.2 Opening Data Responsibly

An open data group must take a holistic approach to publishing data responsibly. Below are some questions to consider; skip to the 'Data Governance Protocol' section for a comprehensive breakdown.

  • What is the benefit of publishing a dataset?
  • What metadata should be collected?
  • What are the risks associated with the dataset - both at the individual and group level?
  • Can any of the risks be mitigated through de-identification?
  • How do we assess the impact of a dataset after publication?

4.3 Roles and Responsibilities

An open data group can consider itself the ‘gatekeeper’ of the open data portal. We believe that no one person should hold this important responsibility; rather, the entire group should work with the administration to create effective and ethical data governance. Having a diverse team is critical to data governance: people with different backgrounds contribute different and important perspectives on the harms and benefits of a dataset.

4.4 Pre-Publishing

4.4.1 Dataset Collection

Responsible data governance starts at dataset acquisition. After all, an open data portal team must collect the necessary information to (1) assess and contextualize the dataset properly, and (2) present the dataset with the appropriate context and description. Although there are different channels to acquire datasets, we have created a standardized approach to this important step in the process. Below are important fields to have when creating a ‘Dataset Upload’ form.

  • Dataset (.csv, hyperlink, .pdf)
  • Dataset Name (String)
  • Email of uploader (String). Note: this won’t be made public; it is kept for internal records only
  • Publish Date (JS Date Object)
  • Source (String)
  • Link to the raw data source (hyperlink, .pdf)
  • Category Tag (Drop-down menu, string)
  • Data Type Tag (Drop-down menu, string)
  • Description
    • Provide a 1-2 sentence high-level description of the dataset.
    • Define any key terms mentioned in the dataset.
    • Who is the primary audience?
    • What is the purpose of the dataset and why does it exist?
    • Who does this data belong to, and have they given permission to share the data?
    • What kind of decisions are being made with the dataset (currently and in the future)?
    • Who can provide context around the dataset and answer questions about how it was collected, the steps taken to process it, and how to interpret the variables?
    • Where has this dataset been used before?
    • Is there anything else you would like to share about this dataset, such as privacy considerations or quality concerns?
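As an illustration, the form fields above could be modeled as a simple record. The class and field names below are our own, not a prescribed schema; adapt them to whatever stack powers your upload form:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DatasetUpload:
    """One submission to the 'Dataset Upload' form (hypothetical schema)."""
    dataset_file: str            # path or URL to the .csv / .pdf / hyperlink
    dataset_name: str
    uploader_email: str          # internal records only; never published
    publish_date: date
    source: str
    raw_source_link: str
    category_tag: str            # chosen from a fixed drop-down list
    data_type_tag: str           # chosen from a fixed drop-down list
    description: str             # answers to the description prompts above
    privacy_notes: Optional[str] = None

# Hypothetical example submission
upload = DatasetUpload(
    dataset_file="enrollment.csv",
    dataset_name="Enrollment by School, 2015-2024",
    uploader_email="registrar@university.edu",
    publish_date=date(2024, 9, 1),
    source="Office of the Registrar",
    raw_source_link="https://example.edu/registrar/enrollment",
    category_tag="Academics",
    data_type_tag="Time series",
    description="Annual enrollment counts per school.",
)
print(upload.dataset_name)
```

Making every field required (except the optional privacy notes) mirrors the goal of the form: a dataset should not enter the review pipeline without its provenance and context attached.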

4.4.2 Dataset Assessment

Posting a dataset on your portal lends it an appearance of statistical legitimacy to the public, so your open data group should assess the quality of a dataset before posting. A dataset may capture only part of the true picture, the data collection may be biased, and that bias can be difficult to assess from a single dataset. For example, when deciding whether to publish a dataset about sexual violence, it would be important to note that sexual assault data is historically under-documented on college campuses. We discuss an approach to approximating data quality in the ‘Data Governance Protocol’ section.

Next, you must holistically evaluate the utility and risks of publishing the dataset. Teams may not understand all the factors at play, so we recommend opening constructive channels for dialogue with the administration and/or the uploader of the data.

First, the group must assess the benefits of publishing the dataset. While individual datasets can serve specific purposes, publishing datasets more broadly can achieve important goals for universities such as increasing student trust in the administration, inspiring new products and ideas, simplifying data requests, and enabling data sharing across the administration.

Second, the group must assess the risks of publishing the dataset. Risks include privacy concerns, potential misuse and misrepresentation of the data, bias, and consent issues, to name a few. It is important to remember that “risk can be managed, but will not be zero” (Open Data Release Toolkit). Below are some risks to think about; again, skip to the ‘Data Governance Protocol’ section for a set of questions that can frame and quantify the perceived risks.

Consent is a bedrock principle of data collection and analysis. In general, before adding a dataset to your portal, you should:

  • Obtain consent for the collection of data, or ensure that consent was obtained before its collection
  • Obtain consent for how the data will be used

Personal information can be discovered through datasets if publicized. It is important to work with the administration to create guidelines that maintain anonymity. Here, we adopt recommendations from Harvard’s Berkman Klein Center for Internet & Society:

  • Conduct risk-benefit analyses to inform which, and how, datasets are published
  • Consider privacy at each stage of the data lifecycle
  • Develop operational structures and processes that codify privacy management widely
  • Emphasize campus engagement and campus priorities as essential aspects of data management

With public data, malignant actors may try to use data in nefarious ways. To prevent this, some strategies include:

  • Identify vulnerable datasets, including datasets that are composed of individual-level information
  • Specify misuse cases, including examples that may break the law or university policy
  • Conduct due diligence to see if any other data exists that could deanonymize users — removing personally identifiable information is often not enough!

Many individuals who are interested in this data may have limited experience working with data. As a result, even with the best of intentions, data may be misconstrued or used harmfully. To reduce this risk:

  • Present the data without personally identifiable information
  • Consider adding minor obstacles to access sensitive data — logins, access requests, and data ethics training are possible options for a healthier balance between benefits and risks.

It’s easy to misrepresent your datasets. As popularized by Mark Twain, “There are three kinds of lies: lies, damned lies, and statistics.” Here are a few strategies to reduce these errors:

  • Describe datasets clearly! Explain who collected the information, when it was collected, and how they did it.
  • Clarify confusing entries in a dataset (e.g. outliers) by providing context if it exists
  • Clarify confusing variables by providing definitions and example interpretations 
  • If there are sanity checks embedded (e.g. every row adds to a certain number), make sure to mention those.
  • How did selection bias factor into the dataset? As the Human Rights Data Analysis Group puts it, “Data collected by what we can observe is what statisticians call a convenience sample, which is subject to selection bias.” To help minimize this risk, make sure to specify where and why your sample may be unrepresentative. Also remember that data might be missing to conceal information, possibly with malicious intent.

Data may perpetuate inequities in society. Here are a few questions to think about:

  • Who collected the data?
  • Who was the data collected from? Were certain groups over- or under-represented?
  • What questions were asked and how were they framed? 
  • Can this dataset be used to perpetuate dangerous stereotypes or decisions? For example, a dataset that shows grades based on race or gender should be treated very carefully.

Third, the group must decide whether any of the risks can be mitigated through de-identification. De-identification refers to a collection of processes that prevent personal information from being revealed. However, the process may render a dataset useless to the public, because de-identification can remove the context and meaning of a dataset. For most datasets, we recommend following Khaled El Emam’s De-Identification Protocol for Open Data, clearly laid out on pages 15 and 25 of DataSF’s Open Data Release Toolkit. In some cases, it might be worth considering other methods such as differential privacy.
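As a minimal sketch of what two common de-identification steps, suppression and generalization, look like in practice (column names are hypothetical, and a real release should follow the full protocol, including measuring re-identification risk after each step):

```python
def deidentify(rows, drop_fields=("name", "student_id"), age_bucket=10):
    """Remove direct identifiers (suppression) and coarsen age into
    bands (generalization) -- a toy version of two de-identification
    steps, not a substitute for a full protocol."""
    out = []
    for row in rows:
        # Suppression: drop columns that directly identify a person.
        clean = {k: v for k, v in row.items() if k not in drop_fields}
        # Generalization: replace exact age with a coarse range.
        if "age" in clean:
            lo = (clean["age"] // age_bucket) * age_bucket
            clean["age"] = f"{lo}-{lo + age_bucket - 1}"
        out.append(clean)
    return out

records = [{"name": "A. Student", "student_id": "123",
            "age": 21, "major": "Chemistry"}]
print(deidentify(records))
# [{'age': '20-29', 'major': 'Chemistry'}]
```

Note the trade-off the section describes: the wider the age bands and the more columns dropped, the lower the risk, but the less useful the published table becomes.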

Lastly, use the risk and utility assessments to create an overall risk-utility score. We have put together a data governance protocol later in the guide.

4.5 Post-Publishing

In most cases, although an open data portal team can discuss the utility and risks of a dataset, the true impact of a dataset is not felt until after it has been published. As such, the open data group must:

  • Have discussions with the resident expert/data steward about updating the dataset
  • Maintain good relationships with all data stakeholders by having conversations with community members that could be impacted by the dataset. To paraphrase a popular slogan, “No data about a stakeholder without a stakeholder”
  • Understand and document risks and utility after the dataset has been published and incorporate the knowledge into future data governance guidelines.

4.6 Data Governance Protocol

We have put together a step-by-step process for assessing a dataset. The protocol has been reviewed by the Northwestern Data Governance group. The framework serves as both a checklist and a conversation-starter: the open data group should work to mitigate the risks and amplify the benefits of releasing a dataset. In most cases the protocol will be used for datasets that are not already public, but it can still be applied to existing public datasets. It is best to have conversations with the university administration, especially when determining the risk score, because they will have valuable information that most students would not. A dataset may unilaterally be considered 'un-publishable' by the administration even if it passes the process. Lastly, remember this is guidance, not a hard-and-fast rule. The weights and questions are based on our own experiences, discussions amongst ourselves, and discussions with experts in the field. Feel free to adjust the weights and questions based on your group’s and university’s needs, and do not simply interpret the output as the correct decision without doing your own independent assessment and analysis.

Ask questions about what types of analysis are valid to conduct, the representativeness and completeness of the dataset, whether the dataset was the submitter’s to share, and so on. We believe that administrative data in particular should be more open to the public: it is often the basis on which university decisions are made, so it is important that the public can find biases and errors in it. We have provided a simple matrix to determine whether a dataset is okay to publish based on its source and sensitivity. The questionnaire outputs two scores: a utility level score and a risk level score. These correspond to the two axes of the matrix, and their intersection yields a relative risk score. Lower relative risk scores mean that the utility of the dataset likely outweighs the risk, and vice versa.

Utility

Each utility question is scored 1 or 3 points; higher totals indicate greater benefit from publishing.

  • Is the data already public?
    • Yes, data exists publicly in a non-machine-readable format (1)
    • No, data is currently not public (3)

Risks

  • Scale
    • Is the data private or public in nature?
      • Private, such as data that is medical in nature
      • Public, such as data about the university’s budget at a high level
    • Could the data harm members of marginalized communities?
      • Yes, such as data about grades broken down by race
      • No, there is low potential this data will harm members of marginalized communities
  • Personally Identifiable Information (PII)
    • What level of personal information does the data contain?
      • Data contains nonsensitive info that has to be linked with other datasets to be identified
      • Data contains nonsensitive info that can be directly identified in the dataset
      • Data contains sensitive info (e.g. addresses, phone numbers, medical info, personal opinions) that is directly identifiable or can be determined by combining with easily accessible datasets
    • Could aggregated counts reveal individuals?
      • Yes, data tables that reveal counts of subgroups could make it easy to identify people. For example, the dataset is a table that shows students by major, year, and gender; there is only one Psychology student in 2014, so it is easy to identify that student.
      • No, there is low potential this data will reveal PII even when aggregated.
    • Could revealed PII harm a student?
      • Yes, data contains sensitive info (e.g. addresses, phone numbers, medical info, personal opinions) that could harm the student. For example, revealing a student’s sexuality could harm them if they are from a country where homosexuality is banned.
      • No, there is no PII in the data or there is low potential that revealed PII could be harmful.
  • Quality (scored 0 or 3)
    • Does the dataset have known quality problems?
      • No, there is no quality risk, such as endowment data (0)
      • Yes, such as a dataset that undercounts the occurrence of sexual assault on campus (3)
  • Misuse
    • Could the data be used in nefarious ways?
      • No risk
      • Yes, such as student addresses from a geospatial map, which would be a dangerous use of machine-readable addresses
  • Other
    • Are there other risks associated with the dataset?
      • Low risk
      • Medium risk, such as data about crime on campus
      • High risk, such as previously undisclosed data about sponsored research from companies/foreign governments
    • Does the dataset raise consent concerns?
      • Yes, such as survey data with personal questions on student opinion
      • No
  • Mitigation
    • Can any of the risks be mitigated? (Yes / No)
    • How much would de-identification reduce the risk?
      • Low effect, such as survey responses about the quality of dining options on campus
      • Medium effect, such as ratings of professors where the professors’ names are removed
      • High effect, such as removing race/ethnicity entries from a dataset about policing on campus
    • Can known errors be disclosed?
      • Yes, by knowing and listing the major expected errors in the dataset and requiring users to acknowledge them before they can download the dataset
      • No risks can be mitigated

Utility-Risk Matrix: What is the relative risk level of publishing the dataset?

  • Minimal risk (0-2)
    • High utility (5+): Very low, e.g. already public datasets
    • Medium utility (3-4): Very low, e.g. sports data
    • Low utility (1-2): Low, e.g. US News or dining hall data
    • No utility (0): Moderate, e.g. library volume data
  • Low risk (3-4)
    • High utility (5+): Very low
    • Medium utility (3-4): Low, e.g. transports to hospital
    • Low utility (1-2): Moderate
    • No utility (0): Significant
  • Medium risk (5-7)
    • High utility (5+): Low, e.g. crime data
    • Medium utility (3-4): Moderate, e.g. student demographic data by category
    • Low utility (1-2): Significant
    • No utility (0): High, e.g. student ID numbers
  • High risk (8-10+)
    • High utility (5+): Moderate
    • Medium utility (3-4): Significant, e.g. religious affiliation; race/gender/identity; alumni giving data
    • Low utility (1-2): High, e.g. personal health data
    • No utility (0): Extreme, e.g. student names and addresses
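If your group wants to automate the lookup, the matrix and score buckets translate directly into code. This is a sketch; the bucket boundaries follow the ranges given in the matrix, and the labels are the relative risk levels from the cells:

```python
# Relative risk lookup: outer keys are risk levels, inner keys utility levels.
MATRIX = {
    "minimal": {"high": "very low", "medium": "very low",    "low": "low",         "none": "moderate"},
    "low":     {"high": "very low", "medium": "low",         "low": "moderate",    "none": "significant"},
    "medium":  {"high": "low",      "medium": "moderate",    "low": "significant", "none": "high"},
    "high":    {"high": "moderate", "medium": "significant", "low": "high",        "none": "extreme"},
}

def utility_level(score):
    """Bucket a utility score: High (5+), Medium (3-4), Low (1-2), None (0)."""
    if score >= 5:
        return "high"
    if score >= 3:
        return "medium"
    return "low" if score >= 1 else "none"

def risk_level(score):
    """Bucket a risk score: Minimal (0-2), Low (3-4), Medium (5-7), High (8+)."""
    if score <= 2:
        return "minimal"
    if score <= 4:
        return "low"
    return "medium" if score <= 7 else "high"

def relative_risk(utility_score, risk_score):
    """Intersect the two bucketed scores to get the relative risk level."""
    return MATRIX[risk_level(risk_score)][utility_level(utility_score)]

print(relative_risk(utility_score=6, risk_score=1))  # very low
print(relative_risk(utility_score=0, risk_score=9))  # extreme
```

As the protocol stresses, treat the output as a conversation-starter, not a verdict: a "very low" result still deserves an independent read of the dataset before release.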

This data governance protocol is a working draft and it may not reflect all potential concerns. Please let us know if you have any comments or suggestions.
