Northeast Big Data Innovation Hub Seed Fund Awards

The Northeast Big Data Innovation Hub Seed Fund is designed to promote collaboration and support the cross-pollination of tools, data, and ideas across disciplines and sectors including academia, industry, government, and communities. Funding provided through this program is intended to support the northeast region and align with the Major Goals and Focus Areas of the Northeast Big Data Hub.

The Northeast Big Data Innovation Hub is now accepting applications for our 2021 Seed Fund Program! Learn more here.

2020 Awards

After two rounds of applications and evaluations, the hub awarded 19 seed fund grants out of 40 proposals in 2020.

  • The most popular focus areas were Health and Education + Data Literacy.
  • 36 of the proposals were from academic institutions and 4 were from non-profits.
  • 18 of the seed fund grants were awarded to academic institutions and 1 to a non-profit.

See below the seed fund awards by their project focus areas.

2020 Seed Fund Recipients

Responsible Data Science

Using Data Science to Study Environmental Racism, Justice, and Policy

Lead PI: Aunshul Rege (Temple University)

Primary Focus Area: Responsible Data Science

Across the United States, thousands of families that reside in federally assisted housing are living on dangerously contaminated land where they face urgent, ongoing environmental and health crises. In fact, 70% of hazardous waste sites officially listed on the National Priorities List (NPL) under the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA or Superfund) are located within one mile of federally assisted housing. That constitutes roughly 77,000 families living in public housing and homes paid for with vouchers at or near one of the nation’s polluted Superfund sites designated for cleanup by the federal government. These households disproportionately include low-income communities of color. The proximity of contaminated sites to public housing is an example of environmental racism, instances of which range from the internationally criticized (climate change more acutely harming developing nations) to nationally condemned (lead-poisoned water in Flint, Mich.) to locally known (the legacy of waste treatment plants in the city of Chester). Access to affordable, healthy food is another matter of environmental justice. Many of the city’s community gardens are in the poorest neighborhoods, under threat of development and pollution, which has changed the taste of residents’ food.
This proposal offers a green criminology lens to the study of environmental racism and injustice using a qualitative data science approach to (i) create a harms matrix that focuses on impacts to environment, health, and food, and categorize environmental injustice case studies using this matrix, (ii) compare the various case studies to identify patterns, rank the various harms along incidence and severity, and identify corresponding remediation processes and costs, and policy recommendations, and (iii) verify the harms occurrences and prioritization rankings, any patterns, and response and recovery processes and costs by speaking with environmental and health subject matter experts.

Procurement Roundtables: Algorithmic Justice and Responsible AI

Lead PI: Mona Sloane, Ph.D. (New York University)
Collaborators: Rumman Chowdhury, Ph.D. (Parity), John C. Havens (IEEE)

Primary Focus Area: Responsible Data Science

Project Website: Evolving Procurement for Artificial Intelligence Systems in Cities and Beyond and Innovating AI Procurement

Success Story: How to Innovate AI Procurement?

Final Project DOI: AI and Procurement: A Primer

This project is a collaboration between the NYU Alliance for Public Interest Technology, the Institute of Electrical and Electronics Engineers (IEEE), and Parity, a collaborative platform that uses AI/ML to extract useful information from qualitative methods and combines them with rigorous quantitative assessments. It contributes to the emerging field of Public Interest Technology (PIT) and addresses the lack of research and interdisciplinary exchange on issues pertaining to data science, public procurement, and transparency and justice. This is a glaring gap: 12% of global GDP is spent following procurement regulation (World Economic Forum, 2020), and procurement is a core mechanism through which algorithmic power is distributed in public institutions. To fill this gap, the project will comprise three interdisciplinary “Procurement Roundtables”: one focused on data science solutions used by public institutions, one focused on algorithmic justice and responsible AI, and one focused on governance innovation. These roundtables will bring together experts in data science, social science (particularly critical technology studies), and governance.

Urban to Rural Communities

Forecasting Salinity in Rivers during Storm Events

Lead PI: Laura Dietz (University of New Hampshire, Computer Science)
Collaborators: Adam Wymore (University of New Hampshire, Natural Resources and the Environment)

Primary Focus Area: Urban to Rural Communities

Project Website: Link

This project takes a data science approach in forecasting the salt concentration in rivers across New Hampshire. The purpose is to analyze what-if scenarios regarding salinity at particular river sites, in order to estimate the impact of changing weather patterns, such as rain-on-snow, drought, or intense rainfall, and different road treatment events.

Location-based Citizen Science in Augmented Reality Image Categorization

Lead PI: Seth Cooper (Northeastern University)
Collaborators: Sara Wylie (Northeastern University)

Primary Focus Area: Urban to Rural Communities

Project Website: Cartoscope

Image categorization is a common citizen science task, and can help to make sense of large image sets. The main goal of the proposed project is to integrate location-based images for categorization into an augmented reality (AR) citizen science toolkit we are developing. The toolkit, called Tile-o-scope AR, uses an AR mobile app to project images onto a set of physical tiles, which can be used for a variety of activities. The activities are built on matching images containing similar objects, which helps to categorize the images. In the proposed work we aim to add support for dynamically assembled location-based image sets, making the images to be categorized more relevant to participants. We will gather feedback from testers of our toolkit about their experience using location-based image sets.

Education + Data Literacy

Curricular Structures to Blend Data Science & the Digital Humanities

Lead PI: Amanda K. Greene (Lehigh University)
Collaborators: Dominic DiFranzo (Lehigh University), Edward Whitley (Lehigh University), Annie Laurie Nichols (Saint Vincent College), Lauren Churilla (Saint Vincent College), Alice Goldfarb (Boston University), Belle Lipton (Norman B. Leventhal Map & Education Center), Catherine Nikolovski (CIVIC Software Foundation)

Primary Focus Area: Education + Data Literacy

Secondary Focus Area: Responsible Data Science

This seed grant will support a collaborative working group, bringing together academics and industry professionals to create shared pedagogical resources that integrate humanist perspectives, ethics, and data science. The group would develop flexible, adaptable curriculum structures that support project-based learning and connect students to socially impactful data science projects beyond the classroom. By establishing learning outcome frameworks, creating model lesson plans with case studies and data examples, and running workshops during the grant term, this collaboration will drive innovative new digital humanities and data science programming at Lehigh University and Saint Vincent College. The resources that the working group develops will guide efforts to redesign existing courses, create new courses, and onboard new faculty for an innovative Digital Humanities major at Saint Vincent College and a Data Science + Digital Humanities certificate at Lehigh University. Emphasizing the role of data science in forwarding social justice initiatives and prioritizing ethical data literacy, these cutting edge programs and pedagogical structures will cultivate students’ technical capacities and enable them to apply socially conscious humanities skills in all phases of the data lifecycle.

Data Literacy as an Enabler to Broaden the Participation Of Underrepresented Minorities in STEM Careers

Lead PI: Babak D. Beheshti (New York Institute of Technology)

Primary Focus Area: Education + Data Literacy

This project aims to increase data science capacity and talent, first by creating a sustainable pipeline from high schools and community colleges to universities for students to pursue degrees in computer science and data science, and second by increasing the accessibility of data science in the broader community. The objective is to expand data literacy and broaden the participation of underrepresented minorities and women in disciplines, and ultimately careers, in which an understanding of data science is foundational. The project’s goal is to make this course accessible to high school and community college students, as well as to the general public, by converting it to a fully asynchronous online mode. The project will also leverage partnerships with local high schools and community colleges to advertise and, through a competitive vetting process, provide scholarships for a group of students to take this course free of charge. This will allow students from lower-income communities and students underrepresented in data science fields to have access to these educational resources.


Building the Community to Address Data Integration of the Ecological Long Tail

Lead PI: Beverly Woolf (University of Massachusetts, Amherst)
Collaborators: Ivon Arroyo (University of Massachusetts, Amherst), Will Lee (University of Massachusetts, Amherst), Danielle Allessio (University of Massachusetts, Amherst)

Primary Focus Area: Education + Data Literacy

This research focuses on Educational Data Mining, Learning Science, and Machine Learning. We build on current big data research and combine teacher inquiry and learning analytics to enhance teachers’ ability to collect and utilize real-time data about their students. The research will explore, evaluate, and apply machine learning techniques for optimizing and simplifying teachers’ assessment of students’ strengths, weaknesses, and socio-affective profiles to better create and adjust educational plans.

Development of a Data Analytics Learning Community

Lead PI: Cathie LeBlanc (Plymouth State University)
Collaborators: Daniel Lee (Plymouth State University), Rebecca Noel (Plymouth State University), Hyun Joong-Kim (Plymouth State University), Jonathan Couser (Plymouth State University)

Primary Focus Area: Education + Data Literacy

Project Website: Link

The goal of this project is to increase the capacity of the existing faculty at PSU to teach data analytics, particularly in our General Education program. The project will result in a General Education course with data analytics content related to changing societal understandings of mental health, team-taught by a data analytics expert and a historian. In addition, the project funds faculty participation in a data analytics learning community. The learning community kicks off with a week-long workshop in January 2021 through which participants will learn about data analytics as well as major principles in the science of learning and how those principles can be applied to the teaching of data analytics content. Each participant in the learning community has committed to incorporating some data analytics content into at least one of their classes. This project is seeding larger conversations on campus about the role that data analytics can play in student projects, particularly in our General Education program. Through our previous work with faculty learning communities, we know that discussions in these communities expand beyond those directly working in the community.

Building Tools and Training for Public & Educational Use of Geospatial Big Data

Lead PI: Garrett Dash Nelson (Leventhal Map & Education Center at the Boston Public Library)
Collaborators: Belle Lipton (Leventhal Map & Education Center at the Boston Public Library), Michelle LeBlanc (Leventhal Map & Education Center at the Boston Public Library)

Primary Focus Area: Education + Data Literacy

Project Website: Public Data Project, Leventhal Map & Education Center

The Leventhal Map & Education Center is a public-facing institution dedicated to geographic education using both historical materials and modern GIS tools. In this project, we will develop an interlocking set of “technical infrastructure” and “social infrastructure” aimed at equipping the public with better access to geospatial data, as well as better skills and critical attitudes in relation to such forms of information. The technical infrastructure consists primarily of a bespoke data portal designed specifically for nonspecialist users of geospatial data, incorporating human-readable metadata driven by concerns around data justice and data feminism. The social infrastructure consists of training materials for access to the public data portal aimed at K-12 public school teachers and library patrons, including both asynchronous tutorials and a multi-part introductory course, “Examining the World Through Maps of Data,” to be offered in spring 2021 as a free public program. Work will be supported by student interns associated with the MIT Data + Feminism Lab, as well as an advisory panel of K-12 teachers.

DEFLAB: Data Education and Feminism at Lafayette and Beyond

Lead PI: Trent Gaugler (Lafayette College)
Collaborators: Jason Simms (Lafayette College), Christopher Phillips (Lafayette College)

Primary Focus Area: Education + Data Literacy 

Our project has three main goals. First, we will introduce local community college students to the fundamentals of data science through socially relevant projects. Second, we will enhance Lafayette College students’ ability to design data science projects and communicate data science methods to other students. Third, we will incorporate principles of “data feminism” into the entire project to facilitate students’ learning data science with awareness of gender and other social inequities and strategies for promoting gender equity through the use of data — in other words, to show students that they can use data science to effect social change. By introducing college students to ways of understanding the role gender equity plays in data science and principles for promoting gender equity in their study and growth, this project offers a community-engaged, sustainable way to take steps from a world starting to reckon with the effects of its social ills to a world in which students well-prepared in both technical and liberal arts ways of thinking are equipped to take leading roles in working for the good of society.

Data Science Training and Research Program

Lead PI: Yusuf Danisman (Queensborough Community College)

Primary Focus Area: Education + Data Literacy

This fully online program is aimed at enhancing the coding and data science skills of students at Queensborough Community College (QCC), a minority serving institution located in Queens, NY. For this purpose, 10-15 students will be recruited for the Spring Semester of 2021. All students, as early as their first semester, can participate in the program. Underrepresented groups in STEM are especially encouraged to apply. This program will be organized in three modules: Python as a Programming Language (4 weeks), Data Science & Machine Learning (8 weeks), Capstone Group Project (3 weeks). During the program students will attend virtual workshops and talks that will be given by professionals in industry and academia. Students are also required to do weekly lab assignments and complete a capstone project as a group of 3-4 members.

Health

Nonlinear Dynamics and Machine Learning for Accurate Detection of Early-stage Atrial Fibrillation

Lead PI: Changqing Cheng (State University of New York at Binghamton)

Primary Focus Area: Health

Secondary Focus Area: Education + Data Literacy

The global coronavirus pandemic has put once-niche telemedicine in the spotlight and is driving Health Internet of Things (HIOT) adoption in virtual health settings. Notably, realizing the full potential of HIOT depends heavily on novel data science tools that can transform potentially elusive raw data into actionable information for timely detection of, and optimal intervention in, chronic diseases such as atrial fibrillation (AF). AF is the most prevalent abnormal cardiac rhythm, with symptoms of rapid yet irregular heartbeats, and its early detection is paramount to reverse the course of progression and prevent further complications. In clinical settings, AF is often diagnosed through lengthy procedures, including inspection of symptoms (e.g., heart palpitations and chest pain) and physical examination (e.g., blood test and chest X-ray). Nonetheless, AF at the incipient stage is usually sporadic, and subtle symptoms can be fleeting or even absent. Hence, in-hospital diagnostic procedures are not well suited to early detection of arrhythmia, and there is a pressing need for an automatic and easy-to-deploy screening approach for early-stage AF in out-of-hospital scenarios. Although portable ECG-based monitoring approaches abound in the literature, they mostly apply only to long temporal data, rely on cumbersome data pre-processing and feature engineering in the time/frequency domain, or ignore the nonlinear and nonstationary dynamics underlying short ECG recordings. Another challenge is data imbalance: the majority of ECG recordings are taken under normal conditions, whereas AF represents only a small portion. In addition, the ECG can be contaminated by noise or artifacts due to poor contact between the electrode and the skin. All of these factors make accurate detection of early-stage AF and timely intervention with short-term ECG difficult.
The overarching goal of this proposed study is to develop a platform that integrates nonlinear dynamics analysis and data science for incipient-stage AF detection. Specifically, the PI seeks to characterize short-term temporal data using nonlinear time series analysis techniques and tackle data imbalance via data augmentation with a generative adversarial network. This will potentially add a new dimension to the evolving data science research and education.
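As an illustration only: the abstract does not say which nonlinear time series techniques the PI will use. One common starting point in nonlinear dynamics is time-delay embedding, which turns a short one-dimensional signal into a set of state vectors. The sketch below assumes that technique; the function name, parameters, and sample values are hypothetical and are not taken from the project.

```python
def delay_embed(series, dimension, delay):
    """Return time-delay embedding vectors of a 1-D series.

    Each vector is (x[i], x[i + delay], ..., x[i + (dimension - 1) * delay]).
    This is a generic technique, assumed here for illustration; it is not
    confirmed as the method used in the funded project.
    """
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    n_vectors = len(series) - (dimension - 1) * delay
    if n_vectors <= 0:
        raise ValueError("series is too short for this dimension and delay")
    return [
        tuple(series[i + k * delay] for k in range(dimension))
        for i in range(n_vectors)
    ]


# Hypothetical short signal (stand-in for a few ECG samples).
signal = [0, 1, 2, 3, 4, 5]
vectors = delay_embed(signal, dimension=3, delay=2)
print(vectors)
```

With a short recording, the number of embedding vectors shrinks quickly as the dimension or delay grows, which is one reason short-term ECG analysis is harder than long-term analysis.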

Contacts patterns during the 2020 COVID-19 epidemic

Lead PI: Eli Fenichel (Yale University – School of the Environment)
Collaborators: Anna Gilbert (Yale University), Roy Lederman (Yale University)

Primary Focus Area: Health

Secondary Focus Area: Urban to Rural Communities

Epidemics are social-behavioral phenomena. Smart device data are the key to understanding how human behaviors are changing during the COVID-19 pandemic. Our goal is to use individual level smart device data to understand the behaviors driving the 2020 COVID-19 pandemic.

We acquired a unique data set of US smart device location data, where the average number of unique devices observed covers about 13% of the US population. These data pair two unique devices by hashed ID into co-locations. The data series runs from January 1st, 2020 through present (and is updating daily with a 5-day lag). We also have detailed spatial parcel and building data for the United States. We have merged the two data sets and propose to use the product to develop a platform for understanding the behavioral aspects of COVID-19 and specifically answering five questions:
1. What are the sources of COVID-19 infection? For example, are there specific contact patterns that are associated with hospitalization?
2. How do high risk locations, e.g., meat packers, connect the broader community graph and spread infection?
3. How have individuals changed, and how are they still changing, their behavior in response to COVID-19 and policies?
4. What types of people should be prioritized for vaccination based on the contact patterns?
5. How can contact patterns be used to stratify COVID-19 testing surveillance regimes?
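To make the idea of a contact graph concrete, here is a minimal sketch that assumes co-locations arrive as pairs of hashed device IDs, as the description above suggests. The IDs, function names, and data layout are hypothetical; the project's actual data schema is not described in the source.

```python
from collections import defaultdict


def build_contact_graph(colocations):
    """Build an undirected contact graph from (device_a, device_b) pairs."""
    graph = defaultdict(set)
    for a, b in colocations:
        if a == b:
            continue  # a device cannot be in contact with itself
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def contact_degree(graph):
    """Count the distinct contacts of each device."""
    return {device: len(neighbors) for device, neighbors in graph.items()}


# Hypothetical hashed IDs; the repeated pair is counted once.
pairs = [("a1", "b2"), ("a1", "c3"), ("b2", "c3"), ("d4", "a1"), ("a1", "b2")]
graph = build_contact_graph(pairs)
degrees = contact_degree(graph)
print(degrees)
```

Degree is only the simplest summary of a contact pattern. Questions such as how high-risk locations connect to the broader community would need richer graph measures and the parcel and building data mentioned above.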

A landscape of virus-host protein-protein interactions in SARS-CoV-2 infection in humans by machine learning

Lead PI: Ho-Joon Lee, PhD (Yale University)
Collaborators: Prashant Emani, PhD (Yale University), Mark Gerstein, PhD (Yale University), Shrikant Mane, PhD (Yale University)

Primary Focus Area: Health

We aim to predict and identify, at the proteome level, the human proteins targeted by viral proteins of SARS-CoV-2, the novel coronavirus that causes COVID-19, using advanced machine learning methods. We will use tree-based ensemble learning and deep learning models with protein sequence-based features as multi-class classifiers for multi-level evidence of virus-host protein-protein interactions. A large-scale public database will be used for model training, and human target proteins predicted with high evidence will be functionally characterized for biological insight.

A scalable computational pipeline to develop polygenic risk scores from biobank data

Lead PI: Hongyu Zhao (Yale School of Public Health)
Collaborators: Robert Bjornson (Yale University), Wei Jiang (Yale School of Public Health)

Primary Focus Area: Health

Project Website: Zhao Lab

Genome-wide association studies (GWAS) have been very successful in delineating the genetic basis of human diseases. Tens of thousands of associations have been identified between single-nucleotide polymorphisms in the human genome and hundreds of complex traits/diseases. These GWAS results offer an opportunity to develop disease risk prediction models with genetic information and other risk factors, e.g. age and smoking. Because genetic factors make significant contributions to many diseases, accurate risk prediction using genetic information may significantly improve prevention, screening, diagnosis, and treatment for common diseases. These models derive polygenic risk scores (PRS) to quantify the genetic risk for a disease. The overarching goals of this project are to address the computational and implementation issues through developing a unified and user-friendly web-platform for practicing PRS analysis, and benchmarking most existing PRS methods. More specifically, we will: significantly reduce the computational time of current PRS methods and integrate these methods within a toolkit with a unified input/output interface; build a user-friendly web platform for calculating genetic risks of many common diseases based on different PRS methods; and benchmark current PRS methods for different diseases, by using publicly accessible GWAS summary statistics from large consortia to develop PRS models and testing their performances by using more than 500,000 individuals from the UK Biobank.
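For readers unfamiliar with PRS, the simplest form is a weighted sum: each variant's effect size from GWAS summary statistics multiplied by the number of risk alleles a person carries (0, 1, or 2). The sketch below shows only that basic calculation. The variant names and numbers are made up, treating a missing variant as zero is an assumption, and the methods the project benchmarks are more sophisticated than this.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Weighted sum of risk-allele dosages.

    effect_sizes: mapping of variant ID -> per-allele effect size (beta).
    dosages: mapping of variant ID -> allele count (0, 1, or 2).
    Variants missing from `dosages` contribute 0 (an assumption made here).
    """
    return sum(beta * dosages.get(variant, 0) for variant, beta in effect_sizes.items())


# Hypothetical effect sizes and genotype for one individual.
betas = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.05}
genotype = {"variant_1": 2, "variant_2": 1, "variant_3": 0}
score = polygenic_risk_score(betas, genotype)
print(round(score, 3))
```

Computing this sum is cheap; the computational burden described above comes from estimating good weights across many variants and from scoring hundreds of thousands of individuals.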

Convolutional Neural Network Facilitated Functional Cortical Mapping using tEEG Signals

Lead PI: Kaushallya Adhikari (University of Rhode Island)

Primary Focus Area: Health

Approximately one percent of the world population has epilepsy [WHO]. Many patients, children in particular, cannot tolerate awake craniotomy and intraoperative language mapping, and this need drives the search for optimal preoperative noninvasive functional language mapping. Mapping of eloquent brain areas is critical for preservation of function after resective brain surgery [Luders et al. Epileptic Disorders, 2006]. This creates a strong demand for reliable functional mapping methods in children, who are characteristically noncompliant patients. The ideal pediatric paradigm therefore should be simple for the patient to perform and fast to acquire. A quick, high-fidelity, and noninvasive mapping technique is desired to address this need.
We propose to perform functional cortical mapping using tripolar electroencephalography (tEEG) data. A convolutional neural network (CNN), a deep learning methodology, is a promising technique for tEEG data analysis. CNNs have been widely successful in computer vision, image processing, communications, and localization of neural dipole sources using EEG data. Since children cannot be expected to stay still for prolonged periods for tEEG data recordings, a robust methodology that can process data quickly and reliably is a must for tEEG data processing and analysis. Moreover, since tEEG data have higher frequency signals than conventional EEG, they are recorded at higher sampling rates, typically 2000 samples per second or higher [Toole et al. Epilepsy & Behavior, 2019]. Such high sampling rates quickly produce voluminous data, which calls for an automated, computationally efficient machine learning technique with potential for substantial speed-up; a CNN is a promising candidate. A CNN involves alternating convolution layers with pooling layers. With recent advances in graphics processing, a CNN with two or more hidden layers is easily feasible. Additionally, we propose to use a sparse CNN, which replaces the computationally expensive convolution operation with multiplication of sparse matrices and significantly speeds up computations [Liu et al., IEEE CVPR, 2015]. These sparse CNNs retain the property of invariance to translation, scaling, skewing, and other types of distortion [S. Haykin, Prentice Hall, 2009]. Once pretrained, a sparse CNN can facilitate fast recognition of the dominant brain hemisphere, and eventually the dominant quadrant. We will explore CNNs with two input formulations, (1) spectrograms and (2) raw signals, enabling the notion of end-to-end learning and preempting the need for hand-designed features. We will compare accuracy while varying the number of convolutional layers in a CNN.
For the activation function of the convolutional layers, we will use the rectified linear unit (ReLU), exponential linear unit, leaky rectified linear unit, and scaled exponential linear unit [Craik et al., IOP, 2019]. We will perform comprehensive statistical analysis of the sensitivity and accuracy of CNNs over different numbers of convolutional layers, input formulations, and activation functions. We will also compare the statistical analysis results with the outcomes of the minimum norm method, which is prevalent in EEG data analysis.
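The four activation functions named above differ mainly in how they treat negative inputs. The sketch below gives their standard scalar definitions for comparison. The leaky slope of 0.01 and ELU alpha of 1.0 are common defaults assumed here, not values stated in the proposal; the SELU constants are the standard published ones.

```python
import math

# Standard SELU constants (self-normalizing networks).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """Rectified linear unit: zero for negative inputs."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: small linear response for negative inputs (slope assumed)."""
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    """Exponential linear unit: smooth saturation toward -alpha (alpha assumed)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    """Scaled ELU with fixed scale and alpha."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu), ("selu", selu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

All four agree (up to SELU's scale factor) for positive inputs, so any accuracy differences in a comparison like the one proposed would come from how negative pre-activations are handled.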

CritCOVIDView: A Critical Care Visualization Tool for COVID-19

Lead PI: Mohammad Al-Mamun, PhD (The University of Rhode Island – College of Pharmacy)
Collaborators: Todd Brothers (The University of Rhode Island – College of Pharmacy)

Primary Focus Area: Health

In the midst of the COVID-19 global pandemic, the US healthcare system is under extreme duress. As mortality during the pandemic remains high, prompt decision making regarding medication use and treatment strategies requires overlapping insights from patient-centered data, interpreted by an interdisciplinary critical care team of pharmacists, nurses, respiratory therapists, and providers. Extracting and interpreting patient-specific health status data from multiple locations within the electronic medical record (EMR) is an arduous process and may lead to oversight of vital information. Currently, no tool exists to visualize the overlapping components of the EMR to support clinicians. The main goal of this project is to develop a cutting-edge tool, CritCOVIDView, that bedside clinicians can use to interpret individualized patient data through an interactive dashboard.

Objective 1: Develop a data mining algorithm to understand the prescribed medication patterns and analyze the treatment modality complexities prior to and during the COVID-19 crisis. We will develop a custom association rule mining algorithm that will efficiently discover associations among the prescribed medications given the health status of critically ill patients.

Objective 2: Develop an interactive critical care dashboard to visualize prescribed medications patterns, laboratory results, and vital signs to facilitate prompt decision making.
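Objective 1 refers to association rule mining. The project's custom algorithm is not described here, so the sketch below shows only the three textbook metrics such algorithms typically rank rules by: support, confidence, and lift. The medication names and patient records are placeholders.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets, e.g. the medications prescribed to each patient.
    """
    n = len(transactions)
    if n == 0:
        raise ValueError("transactions must not be empty")
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_both = sum(1 for t in transactions if (a | c) <= t)
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical prescription sets for four patients.
records = [
    {"drug_a", "drug_b"},
    {"drug_a", "drug_b", "drug_c"},
    {"drug_a", "drug_c"},
    {"drug_b"},
]
support, confidence, lift = rule_metrics(records, {"drug_a"}, {"drug_b"})
print(support, round(confidence, 3), round(lift, 3))
```

A lift below 1, as in this toy data, means the two items co-occur less often than they would if prescribed independently.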

Harnessing Data to Predict and Prevent Cancer Treatment Adverse Event through Artificial Intelligence

Lead PI: Robert Wieder (NJMS, Rutgers University)
Collaborators: Nabil Adam (Rutgers University)

Primary Focus Area: Health

Adverse events from cancer therapy represent a significant challenge to patients’ well-being and survival and to the cancer treatment community. Adequate and rigorous approaches to predicting the occurrence of adverse events are lacking. Our long-term goal is to develop a systematic methodology to reliably predict and mitigate potential adverse events from cancer therapy. The specific aims of this proposal seek to understand the interactions of highly variable, exceptionally complex, multidimensional patient-, tumor-, and treatment-related factors that predictably predispose to specific adverse events from breast cancer therapy. The approach will apply Artificial Intelligence (AI)/Machine Learning (ML) to the SEER/Medicare database to identify interactions not previously recognized with existing approaches. Our novel approach will identify characteristics that collectively predict, with a high degree of confidence, the likelihood of adverse events in specific circumstances or at-risk subpopulations that can be ameliorated or avoided. The approach will permit intelligent hypothesis testing in prospective clinical trials to demonstrate the effects of modifying, avoiding, or ameliorating particular interactions of predictive variables to lower the probability or severity of adverse events. The results will provide contributions to the fields of oncology, health care delivery, health care disparities, and big data.

Knowledge Graph Embedding Evolution for COVID-19

Lead PI: Steven Skiena (Stony Brook University, SUNY)
Collaborators: Baojian Zhou (Stony Brook University, SUNY)

Primary Focus Area: Health

In this seed fund grant, we propose to create and analyze a knowledge graph associated with COVID-19, studying how it has evolved in Wikipedia since January 2020. The COVID-19 graph presents a unique opportunity to study the development of a new topic from day 1. Graph evolution models such as preferential attachment have proven successful at capturing power-law behavior, but they do not adequately capture the growth of knowledge reflected in graph and word embeddings of real-world networks. Our proposed work is expected to advance embedding models for these real-world networks.
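For context, preferential attachment is the baseline growth model mentioned above: each new node links to an existing node with probability proportional to that node's current degree. The sketch below is a minimal single-edge version written for illustration; the function names and parameters are assumptions, and it is not the project's model.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph in which each new node attaches to one existing node,
    chosen with probability proportional to that node's degree."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


def degree_counts(edges):
    """Return a Counter mapping node -> degree."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


edges = preferential_attachment(1000, seed=42)
degrees = degree_counts(edges)
print("max degree:", max(degrees.values()))
print("nodes with degree 1:", sum(1 for d in degrees.values() if d == 1))
```

A model like this reproduces a heavy-tailed degree distribution but says nothing about what the nodes mean, which is the gap in capturing knowledge growth that the abstract points to.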

Learn more about the Northeast Hub Seed Fund here