Chapter 3: Building a Data Portal
We detail the tools and workflows to build an open data portal.
3.1 Timeline and Product Management
Building an open data portal requires a methodical approach to the development process. We recommend a process based on Scrum, an agile development framework grounded in the Agile Manifesto's core values:
- Individuals and interactions over processes and tools
- Working software over comprehensive documentation
- Responding to change over following a plan
At a high level, Scrum holds all of a project's stakeholders accountable: the team, its individual members, and any advisors. Keep two timelines in mind: a longer one focused on the target date for launching an MVP (minimum viable product), and a shorter one focused on the tasks in each sprint.
Refer to this guide for more details on Scrum.
3.2 Front-End Development
The front-end of your portal is the first and main point of contact for new users. In this section, we discuss key features and design principles that maximize the user experience.
A simple, user-centered home page is the bedrock of any data portal. It should have clear access points to the various functions of your website and be accessible to any user. For example, a good front page might include:
- A short description of the website’s purpose
- The ability to search for and/or easily find datasets on the portal
- Featured datasets that showcase high-value datasets to your community members
- Featured projects to highlight the outcomes of your data portal
An open data portal should have a page that links to all available datasets. This page should contain:
- The ability to search and filter datasets
- Metadata such as description, data type, and publication date to better capture the context of each dataset (see the example record below)
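To make these fields concrete, here is one way a single dataset's metadata record might look. The field names are illustrative, not a required schema:

```python
# A hypothetical metadata record for one dataset. Field names are
# illustrative; choose whatever schema your governance protocol requires.
metadata_record = {
    "title": "Campus Energy Usage",
    "description": "Monthly electricity and gas usage by building.",
    "data_type": "CSV",
    "publication_date": "2021-04-01",
    "tags": ["sustainability", "facilities"],
    "download_url": "https://example-bucket.s3.amazonaws.com/energy.csv",
}
```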
Contributing to your portal should be simple and encouraged. Here are a few pipelines that you should consider:
- Projects sourced from your users, ideally aligned with your community's goals.
- Datasets collected by users that may be helpful to the broader community.
- Requests for datasets that don’t exist on your portal yet. Getting this type of feedback from real users who have a use for these datasets is an invaluable contribution to your conversations with administration as you continue to add datasets to your portal.
A data portal built by students, for students should be showcased as such. Dedicate a page to telling your group's story and motivations. Provide outlets for feedback, questions, and requests from community members looking to get involved, and respond to these requests in a timely manner.
The style of your open data portal should feel welcoming, clean, and user-friendly, with a minimal aesthetic. Have some fun choosing colors and themes that support your vision; just be sure to follow ADA Site Compliance guidelines. Assume many of your users won't know about open data portals, and incorporate features that can engage users from a broad range of backgrounds. A good way to check whether you've succeeded is to evaluate your site against Nielsen's 10 usability heuristics.
3.3 Back-End Development
Some open data portals may choose to go serverless. Regardless of whether you run a full database and server, however, you will need some sort of backend datastore and a system for managing user permissions. Here are some general best practices for building an effective open data portal backend, geared toward student teams.
Implement some way to filter and search datasets on your website. This may require structuring a backend database, or it can be done on the frontend using a React module. For a small portal, it can even be done in plain application code, as sketched below.
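The following minimal sketch assumes metadata records shaped like the example shown earlier and filters them by a free-text query and an optional data type:

```python
# Minimal search/filter over in-memory metadata records. For a larger
# catalog, move this logic into database queries or a search index.
records = [
    {"title": "Campus Energy Usage",
     "description": "Monthly electricity and gas usage by building.",
     "data_type": "CSV"},
]

def search_datasets(records, query, data_type=None):
    """Return records whose title or description contains the query."""
    q = query.lower()
    return [
        r for r in records
        if (q in r["title"].lower() or q in r["description"].lower())
        and (data_type is None or r["data_type"] == data_type)
    ]

print(search_datasets(records, "energy", data_type="CSV"))
```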
Datasets can be stored in the cloud with Amazon Web Services (S3), Microsoft Azure (Blob Storage), Google Cloud (GCS), or another provider. You can also run your own server and store datasets locally, though we advise against this because it adds complexity.
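As one illustration, uploading a dataset to S3 takes only a few lines with boto3. The bucket and file names below are placeholders, and credentials are assumed to come from your AWS configuration:

```python
# Upload a local CSV to an S3 bucket (names are placeholders).
# Requires AWS credentials configured via the CLI or environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "energy.csv",            # local file
    "my-portal-datasets",    # bucket (hypothetical)
    "datasets/energy.csv",   # object key
    ExtraArgs={"ContentType": "text/csv"},
)
```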
A relatively convenient structure for an open data portal is to store metadata in a separate file from your datasets. This lets you query the metadata and populate a dataset listings page without fetching the datasets themselves.
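Under this structure, the listings page needs only one small fetch. A sketch, assuming the metadata lives in a single metadata.json object on S3 (names and keys are placeholders):

```python
# Fetch only the metadata file and print a dataset listing; the
# datasets themselves are never downloaded.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-portal-metadata", Key="metadata.json")
records = json.loads(obj["Body"].read())

for r in records:
    print(f'{r["title"]} ({r["publication_date"]})')
```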
Regardless of how or where you store metadata and datasets, only authorized users should be able to update them. Such a permissions system can be as simple as storing your metadata in a Google Sheet shared with a specific set of people, with a similar setup in AWS or another datastore for the datasets themselves.
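On the AWS side, a bucket policy is one way to enforce this split: public reads, writes restricted to named maintainers. A hedged sketch, with hypothetical bucket and account names:

```python
# Allow anyone to read datasets, but grant writes only to a specific
# IAM user. Bucket name and ARNs are hypothetical. Note that newer AWS
# accounts block public bucket policies by default; adjust the bucket's
# Block Public Access settings accordingly.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # public, read-only access to datasets
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-portal-datasets/*",
        },
        {   # uploads limited to a designated maintainer account
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/portal-maintainer"},
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-portal-datasets/*",
        },
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-portal-datasets", Policy=json.dumps(policy)
)
```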
3.4 Open Data Pipeline Examples
Northwestern Open Data Initiative
The Northwestern Open Data Initiative implemented a full-stack, hybridized Django-React server housed in a single folder to make deployment easy. The team uses PostgreSQL to store metadata. To store the actual datasets, the back-end team developed a process of first placing datasets in an 'unapproved' AWS S3 bucket, approving them through a Django-based admin panel, and then moving them to an 'approved' AWS S3 bucket (sketched below). The Django admin panel allows non-technical members to view datasets, see user engagement, and approve datasets, a helpful feature as the team develops the portal as a product. The React front-end makes requests to the AWS S3 upload URL through the Django back-end.
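The approval step of such a two-bucket pipeline might look like the following sketch. This is our illustration, not Northwestern's actual code, and the bucket names are hypothetical:

```python
# Move a dataset from the 'unapproved' bucket to the 'approved' one,
# e.g. triggered by an approve action in an admin panel.
import boto3

s3 = boto3.client("s3")

def approve_dataset(key: str) -> None:
    s3.copy_object(
        Bucket="portal-approved",
        Key=key,
        CopySource={"Bucket": "portal-unapproved", "Key": key},
    )
    s3.delete_object(Bucket="portal-unapproved", Key=key)

approve_dataset("datasets/energy.csv")
```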
On the React-based front-end, the team applied user-centered design principles in accordance with the data governance protocol discussed later in this document. For example, working with the Northwestern Data Governance group, the front-end team created an 'Upload' form that asks targeted questions to accurately capture a dataset's context and description. An uploader cannot submit the form without filling in all of the fields the data governance protocol requires. In addition, the team made the deliberate decision to require users to view a dataset's metadata before they can download it.
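Whatever form library you use, enforcing required fields server-side as well is straightforward. A minimal sketch, with hypothetical field names standing in for the governance protocol's questions:

```python
# Reject an upload whose metadata is missing required fields. Field
# names are hypothetical; take them from your governance protocol.
REQUIRED_FIELDS = ("title", "description", "source", "collection_method")

def missing_fields(form: dict) -> list[str]:
    """Return the required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(form.get(f, "")).strip()]

submission = {"title": "Course Enrollments", "description": ""}
errors = missing_fields(submission)
if errors:
    print("Submission rejected; missing:", ", ".join(errors))
```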
Stanford Open Data Project
The Stanford Open Data Portal implemented a React frontend hosted on Netlify, with AWS S3 buckets for dataset storage. Metadata is stored in a Google Sheets spreadsheet, and a Python script calls the Google Sheets API to download the metadata, convert it to JSON, and upload it to S3 (sketched below). The AWS console can be used to drag and drop both the metadata and individual datasets in CSV format for upload; another option is boto3 for programmatic access to S3. AWS credentials can be managed so that only designated individuals have permission to upload new datasets, and permissions on the Google Sheet can likewise be managed to control access to the metadata.
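A hedged reconstruction of that sync script is below. The spreadsheet ID, sheet range, and bucket name are placeholders, and authentication details will vary with your Google Cloud setup:

```python
# Pull metadata rows from a Google Sheet, convert them to JSON, and
# upload the result to S3. Assumes Google application-default
# credentials and configured AWS credentials.
import json
import boto3
from googleapiclient.discovery import build

SPREADSHEET_ID = "your-spreadsheet-id"   # placeholder

sheets = build("sheets", "v4")
rows = (
    sheets.spreadsheets()
    .values()
    .get(spreadsheetId=SPREADSHEET_ID, range="Metadata!A1:Z")
    .execute()
    .get("values", [])
)

header = rows[0]
records = [dict(zip(header, r)) for r in rows[1:]]

boto3.client("s3").put_object(
    Bucket="my-portal-metadata",          # placeholder
    Key="metadata.json",
    Body=json.dumps(records),
    ContentType="application/json",
)
```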
The React frontend can access data and metadata stored in AWS by making AJAX calls directly to S3, without the need for a server. This reduces complexity and lets teams with less programming experience, or those who simply do not want to maintain a server, get their open data portal up and running. Additionally, the Stanford team sent a Google Form to several news organizations and on-campus groups to allow for easy dataset submission. The form included the questions the team deemed necessary to represent each dataset's context.
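For those direct browser-to-S3 requests to work, the bucket must allow cross-origin GETs. A sketch of setting a CORS rule with boto3; the bucket name and origin are placeholders:

```python
# Allow the portal's domain to fetch objects directly from S3.
import boto3

boto3.client("s3").put_bucket_cors(
    Bucket="my-portal-metadata",  # placeholder
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["https://opendata.example.edu"],  # placeholder
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)
```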
3.5 Additional Considerations
Google Analytics is a useful tool for monitoring activity on your portal. Use it to identify pain points in your users' experience and to determine which datasets users engage with most.
Converting data from PDFs to CSVs is an important yet time-consuming task that will likely arise as you obtain datasets for your portal. We recommend Tabula (https://tabula.technology) for easy PDF table extraction.
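Tabula also has a Python wrapper, tabula-py (which requires a Java runtime), that makes batch conversion scriptable. A short sketch with placeholder file names:

```python
# Extract tables from a PDF with tabula-py (pip install tabula-py;
# requires Java). File names are placeholders.
import tabula

# One-step conversion: write every detected table to a single CSV.
tabula.convert_into("report.pdf", "report.csv",
                    output_format="csv", pages="all")

# Or load the tables as pandas DataFrames to clean them first.
tables = tabula.read_pdf("report.pdf", pages="all")
print(f"Found {len(tables)} tables.")
```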
3.6 Recommended Resources
Building an open data portal requires full-stack development knowledge that can be daunting even to relatively experienced student developers. Here are some resources we have collected that may help with building an open data portal.