Big Data Spoke Projects

Big Data Spokes are multi-sector projects addressing regionally-defined priority areas, convening stakeholders whose work is guided by the following themes:

  • Accelerating progress towards addressing societal grand challenges relevant to the regional and national priority areas defined by the BD Hubs;
  • Helping automate the Big Data lifecycle; and
  • Enabling access to and spurring the use of important and valuable available data assets, including international data sets where relevant.

The Big Data Spokes program ran concurrently with the Big Data Hubs in 2016 and again in 2018. Funded by the National Science Foundation through independent awards, the Spokes work together with the Big Data Hubs on topically-focused, goal-driven projects in data science. 

FY2018 Spokes

Data Science Foundry: A Collaborative Platform for Computational Social Science

PIs: Matthew Salganik (Princeton), Devavrat Shah (MIT)
Co-PIs: Alberto Abadie (MIT), Munther Dahleh (MIT), Kalyan Veeramachaneni (MIT)

Start: September 1, 2018

End: August 31, 2021 (Estimated)

From their abstract: This research project will develop a collaborative data science platform for computational social science, called the Data Science Foundry, that will allow social scientists to collaborate and validate each other’s studies. The collection and management of large-scale data currently is a relatively unstructured process, with data-processing decisions being made in an ad hoc fashion. This project has the potential to transform how studies are designed and how data will be processed. The collaborative platform will result in a higher level of trust in the studies conducted via the collaborative curation of study design, procedures, and validation. The collaborative platform also will increase the number of studies that can be done in a short span of time. The platform will be developed as open-source, thereby facilitating interactions with the community and enabling different institutions to install the program. The project will bring together three distinct teams to develop this platform: computer scientists to develop abstractions, APIs and systems; statisticians to help with methods and study design; and social scientists to help define the problems and workflow and to provide user feedback.

Advancing a Data-Driven Discovery and Rational Design Paradigm in Chemistry

PI: Johannes Hachmann (SUNY at Buffalo)
Co-PIs: Alan Aspuru-Guzik (Harvard), Marcus Hanwell (Kitware), Geoffrey Hutchison (University of Pittsburgh)

Start: September 1, 2018

End: August 31, 2021 (Estimated)

Website: NSF Big Data Spoke on Advancing a Data-Driven Discovery and Rational Design Paradigm in Chemistry

From their abstract: The project will advance the field of data-driven chemical research by promoting the use of machine learning and other data mining techniques in the molecular sciences and by fostering and coalescing a community of stakeholders. The four signature initiatives of this Spoke project include:

  • the planning, coordination, integration, and consolidation of community-developed software tools for big data research in chemistry as well as the formulation of guidelines, best practices, and standards;
  • the organization of workshops for community building, to connect solution seekers with solution providers, and to address questions ranging from strategic to technical;
  • the creation and dissemination of community-developed teaching materials as well as the formulation of course, program, and curricular recommendations for education and workforce development that reflect the changing, data-centric approach in chemical research;
  • providing access to a shared hardware infrastructure for community data sets, on-site data mining capacity, and the exploration of domain-specific method and hardware issues.

Building the Community to Address Data Integration of the Ecological Long Tail

PIs: Kenneth Chiu (Binghamton University), Holly Ewing (Bates College), Kevin Rose (Rensselaer Polytechnic Institute), Kathleen Weathers (Institute of Ecosystem Studies)

Start: September 15, 2018

End: August 31, 2020 (Estimated)

From their abstract: Frequently, research on data integration is carried out by computer scientists, and the resulting tools must be modified to fit the needs of domain practitioners (ecologists in this case). This challenge is a socio-technical, collective action problem that can be addressed through a combination of tools and incentives. The project proposes to hold a series of workshops along with proof-of-concept implementations. These workshops will result in approaches to decentralize the sharing of data in the long tail, through socio-technical approaches that appropriately incentivize and facilitate data integration by smaller labs. Such an interdisciplinary community will provide crucial real-world input to computer science researchers, which will give their research into tools the potential for larger impact in ecological practice and will yield better tools for ecologists.

FY2016 Spokes

In FY2016, the Northeast Big Data Innovation Hub received $3.3 million in funding for seven projects: three full Spoke projects and four planning projects (seed funding to assist with the planning of future BD Spokes proposals). Details on these projects are included below.

A Licensing Model and Ecosystem for Data Sharing (Data Sharing Spoke Project)

PIs: Jane Greenberg (Drexel), Tim Kraska (Brown), Samuel Madden (MIT)
Co-PIs: Carsten Binnig (Brown), Daniel Weitzner (MIT)

Start: September 1, 2016

End: August 31, 2020 (Estimated)

Website: A Licensing Model and Ecosystem for Data Sharing

As a community, we seek to address key data sharing challenges relating to policy and privacy, platforms and formats, software and costs, and ethics and education about data sharing benefits.

The NSF Big Data Spoke project, “A Licensing Model and Ecosystem for Data Sharing,” is addressing some of these challenges. Our team is developing a safe and secure platform for sharing data, whether or not that data is open or free, among different organizations (industry, academia, government).

Workshop on “Enabling Seamless Data Sharing in Industry and Academia,” September 29-30, 2016

The Northeast Big Data Innovation Hub: “Enabling Seamless Data Sharing in Industry and Academia” Workshop Report

Integration of Environmental Factors and Causal Reasoning Approaches for Large-Scale Observational Health Research (Health Spoke Project)

PIs: Gregory Cooper (U. Pittsburgh), Noemie Elhadad (Columbia), Vasant Honavar (Penn State), Chirag Patel (Harvard)

Start: January 1, 2017

End: December 31, 2020 (Estimated)

Our Health Spoke project is assembling a first-ever data warehouse to hold numerous health/clinical, environmental, behavioral, and economic data streams. By breaking down current data silos and bringing together multiple large environmental and clinical data streams, this project will enhance health research, enabling causal discovery across these data sources. The ultimate goal of the project is to facilitate community-led and collaborative causal discovery through dissemination of integrated and open big data and analytics tools.

Grand Challenges for Data-Driven Education (Big Data for Education Spoke Project)

PIs: Ivon Arroyo (Worcester Polytechnic Institute), Ryan Baker (U Penn), Beverly Woolf (U. Mass, Amherst)
Co-PI: Neil Heffernan (Worcester Polytechnic Institute)

Start: September 1, 2016

End: August 31, 2020 (Estimated)

The Northeast is a center of gravity for innovations in education, anchored by universities and publishers who drive K-12 education in America. This project will improve capacity in data-driven education by sharing educational databases, managing yearly data competitions, and conducting educational data science workshops and hackathons. The team intends to improve classroom learning and leverage the unique types of data available from digital education to better understand students, groups and the settings in which they learn.

Building Capacity for Regional Collaboration in Closing the Big Data Divide (Data Literacy Planning Project)

PI: Stephen Uzzo (New York Hall of Science)

Start: September 1, 2016

End: August 31, 2018

The Data Science for All initiative, led by members of the Northeast Big Data Innovation Hub, is actively identifying knowledge and resource gaps, with the goal of helping lifelong learners of all ages become data literate throughout the Northeast and beyond.

Planning for Privacy and Security in Big Data (Privacy & Security Planning Project)

PIs: Adam Smith (Penn State), Rebecca Wright (Rutgers)

Start: September 1, 2016

End: March 31, 2019

This planning project is bringing together stakeholder communities to understand how privacy currently limits data sharing, develop standards and best practices to enable new information flows, and highlight privacy and security issues associated with our priority areas.

Workshop on Privacy and Security for Big Data, Rutgers University, April 24-25, 2017

Cross-organization Big Data Cyber Attack Awareness – CROSSBAR (Privacy & Security Planning Project)

PI: John Yen (Penn State)
Co-PIs: Vijayalakshmi Atluri (Rutgers), George Cybenko (Dartmouth), Peng Liu (Penn State), Andrew Sears (Penn State)

Start: September 1, 2016

End: August 31, 2017

With a focus on protected sharing of cybersecurity data for countering attacks on digital infrastructure, the CROSSBAR project is developing a platform to enhance collaborative cybersecurity operations through cross-organization sharing of relevant data.

A Workshop on Cross-Organization Big Data Cyber Attack Awareness was convened in Washington, D.C., November 11, 2016.

Partnerships for Energy Cycle Innovation through Big Data (Energy Planning Project)

PI: Abani Patra (U. Buffalo)

Start: September 1, 2016

End: August 31, 2018

The initial planning project explores how to use a brownfield redevelopment and the associated reinvention of energy infrastructure in Buffalo, NY, as a case study to frame the energy sector’s big data innovation needs.