Big Data Spoke Projects


What’s a Hub without Spokes? Each Big Data Hub supports its constituent members—drawn from academia, industry, non-profit organizations, and/or government—to work in concert to achieve common Big Data goals that would not be possible for the independent members to achieve alone. Big Data Spokes are multi-sector projects addressing regionally-defined priority areas, convening stakeholders whose work is guided by the following themes:

  • Accelerating progress towards addressing societal grand challenges relevant to the regional and national priority areas defined by the BD Hubs;
  • Helping automate the Big Data lifecycle; and
  • Enabling access to and spurring the use of important and valuable available data assets, including international data sets where relevant.

FY2018 Spokes


Data Science Foundry: A Collaborative Platform for Computational Social Science

PIs: Matthew Salganik (Princeton), Devavrat Shah (Brown)
Co-PIs: Alberto Abadie (MIT), Munther Dahleh (MIT), Kalyan Veeramachaneni (MIT)

From their abstract: This research project will develop a collaborative data science platform for computational social science called the Data Science Foundry, that will allow social scientists to collaborate and validate each other’s studies. The collection and management of large-scale data currently is a relatively unstructured process, with data-processing decisions being made in an ad hoc fashion. This project has the potential to transform how studies are designed and how data will be processed. The collaborative platform will result in a higher level of trust in the studies conducted via the collaborative curation of study design, procedures, and validation. The collaborative platform also will increase the number of studies that can be done in a short span of time. The platform will be developed as open-source, thereby facilitating interactions with the community and enabling different institutions to install the program. The project will bring together three distinct teams to develop this platform: computer scientists to develop abstractions, APIs and systems; statisticians to help with methods and study design; and social scientists to help define the problems and workflow and to provide user feedback.


Advancing a Data-Driven Discovery and Rational Design Paradigm in Chemistry

PI: Johannes Hachmann (SUNY at Buffalo)
Co-PIs: Alan Aspuru-Guzik (Harvard), Marcus Hanwell (Kitware), Geoffrey Hutchison (University of Pittsburgh)

Website: NSF Big Data Spoke on Advancing a Data-Driven Discovery and Rational Design Paradigm in Chemistry

From their abstract: The project will advance the field of data-driven chemical research by promoting the use of machine learning and other data mining techniques in the molecular sciences and by fostering and coalescing a community of stakeholders. The four signature initiatives of this Spoke project include:

  • the planning, coordination, integration, and consolidation of community-developed software tools for big data research in chemistry as well as the formulation of guidelines, best practices, and standards;
  • the organization of workshops for community building, to connect solution seekers with solution providers, and to address questions ranging from strategic to technical;
  • the creation and dissemination of community-developed teaching materials as well as the formulation of course, program, and curricular recommendations for education and workforce development that reflect the changing, data-centric approach in chemical research;
  • the provision of access to a shared hardware infrastructure for community data sets, on-site data mining capacity, and the exploration of domain-specific method and hardware issues.

Building the Community to Address Data Integration of the Ecological Long Tail

PIs: Kenneth Chiu (Binghamton University), Holly Ewing (Bates College), Kevin Rose (Rensselaer Polytechnic Institute), Kathleen Weathers (Institute of Ecosystem Studies)

From their abstract: Frequently, research on data integration carried out by computer scientists, and the resulting tools, must be modified to fit the needs of domain practitioners (ecologists in this case). This challenge is a socio-technical, collective action problem that can be addressed through a combination of tools and incentives. The project proposes holding a series of workshops along with proof-of-concept implementations. These workshops will result in approaches to decentralize the sharing of data in the long tail, through socio-technical approaches that appropriately incentivize and facilitate data integration by smaller labs. Such an interdisciplinary community will provide crucial real-world input to computer science researchers, which will give their research into tools the potential for larger impact in ecological practice and will yield better tools for ecologists.


FY2016 Spokes


In FY2016, the Northeast Big Data Innovation Hub received $3.3 million in funding for seven projects: three full Spoke projects and four planning projects (seed funding to assist with the planning of future BD Spokes proposals). Details on these projects are included below.


A Licensing Model and Ecosystem for Data Sharing (Data Sharing Spoke Project)

PIs: Jane Greenberg (Drexel), Tim Kraska (Brown), Samuel Madden (MIT)
Co-PIs: Carsten Binnig (Brown), Daniel Weitzner (MIT)

As a community, we seek to address key data sharing challenges relating to policy and privacy, platforms and formats, software and costs, and ethics and education about data sharing benefits.

The NSF Big Data Spoke project, “A Licensing Model and Ecosystem for Data Sharing,” is addressing some of these challenges. Our team is developing a safe and secure data sharing platform that facilitates sharing data that may or may not be open or free between different organizations (industry, academia, government).

Website for “A Licensing Model and Ecosystem for Data Sharing” project

Workshop on “Enabling Seamless Data Sharing in Industry and Academia,” September 29-30, 2016

The Northeast Big Data Innovation Hub: “Enabling Seamless Data Sharing in Industry and Academia” Workshop Report


Integration of Environmental Factors and Causal Reasoning Approaches for Large-Scale Observational Health Research (Health Spoke Project)

PIs: Gregory Cooper (U. Pittsburgh), Noemie Elhadad (Columbia), Vasant Honavar (Penn State), Chirag Patel (Harvard)

Our Health Spoke project is assembling a first-ever data warehouse to house numerous health/clinical, environmental, behavioral, and economic data streams. By breaking current data silos and bringing together multiple large environmental and clinical data streams, this project will enhance health research, allowing causal discovery between these data sources. The ultimate goal of the project is to facilitate community-led and collaborative causal discovery through dissemination of integrated and open big data and analytics tools.


Grand Challenges for Data-Driven Education (Education Spoke Project)

PIs: Ivon Arroyo (Worcester Polytechnic Institute), Ryan Baker (U Penn), Beverly Woolf (U. Mass, Amherst)
Co-PI: Neil Heffernan (Worcester Polytechnic Institute)

The Northeast is a center of gravity for innovations in education, anchored by universities and publishers who drive K-12 education in America. This project will improve capacity in data-driven education by sharing educational databases, managing yearly data competitions, and conducting educational data science workshops and hackathons. The team intends to improve classroom learning and leverage the unique types of data available from digital education to better understand students, groups and the settings in which they learn.


Building Capacity for Regional Collaboration in Closing the Big Data Divide (Data Literacy Planning Project)

PI: Stephen Uzzo (New York Hall of Science)

The Data Science for All initiative led by members of the Northeast Big Data Innovation Hub is actively identifying knowledge and resource gaps, with a goal to help lifelong learners of all ages become data literate, throughout the Northeast and beyond.


Planning for Privacy and Security in Big Data (Privacy & Security Planning Project)

PIs: Adam Smith (Penn State), Rebecca Wright (Rutgers)

This planning project is bringing together stakeholder communities to understand how privacy currently limits data sharing, develop standards and best practices to enable new information flows, and highlight privacy and security issues associated with our priority areas.

Workshop on Privacy and Security for Big Data, Rutgers University, April 24-25, 2017


Cross-organization Big Data Cyber Attack Awareness – CROSSBAR (Privacy & Security Planning Project)

PI: John Yen (Penn State)
Co-PIs: Vijayalakshmi Atluri (Rutgers), George Cybenko (Dartmouth), Peng Liu (Penn State), Andrew Sears (Penn State)

With a focus on protected sharing of cybersecurity data for countering attacks on digital infrastructure, the CROSSBAR project is developing a platform to enhance collaborative cyber security operations through cross-organization sharing of relevant cybersecurity data.

A Workshop on Cross-Organization Big Data Cyber Attack Awareness was convened in Washington, D.C., on November 11, 2016.


Partnerships for Energy Cycle Innovation through Big Data (Energy Planning Project)

PI: Abani Patra (U. Buffalo)

The initial planning project explores how to use a brownfield redevelopment and associated energy infrastructure reinvention in Buffalo, NY as a case study to frame the energy sector’s big data innovation needs.