Cybersecurity Risk Initiative | Northeast Big Data Innovation Hub

Cybersecurity risks, including cyber-crime and cyber-breaches, have escalated due to the increase in commerce conducted on the Internet and the growing quantities of sensitive information captured through the Internet of Things and stored in databases that can be accessed by outside parties.

Seeking to address these challenges, the Northeast Big Data Innovation Hub (NEBDHub), funded by the National Science Foundation (NSF) Award #1748395, hosted two Cybersecurity Risk Workshops between 2019 and 2021, gathering leading academic, government, risk management, general industry and healthcare participants, with the goal of identifying, quantifying, and helping to mitigate the risk associated with cyber-related criminal activity.

These workshops led to an opportunity to partner with Dr. Jay S. Yang at Rochester Institute of Technology (RIT) and the RIT Cybersecurity Visiting Students, establishing the 2022 Cybersecurity Data Science Student Summer Program. This Summer Program invited the Visiting Students to look at the use of artificial intelligence and machine learning for current and emerging cybersecurity challenges. Student researchers applied real-world analytics to study cybersecurity tools for cross-organizational threat benchmarking.

Learn more about the 2022 Cybersecurity Data Science Student Summer Program, 2021 Cybersecurity as Big Data Science Interactive Workshop, and 2019 Cybersecurity Risk Workshop below.

2022 Cybersecurity Data Science Student Summer Program

During the summer of 2022, in collaboration with the Northeast Big Data Innovation Hub (NEBDHub), Rochester Institute of Technology (RIT) invited students to curate cross-organizational cyber intrusion data and use the data to produce benchmarking information about ongoing cyberattacks. This information can be used to enhance current risk and threat assessments, which are typically done through hypothesized threat reports (which are not necessarily relevant) or through penetration testing (which is costly). The effort was designed to help the student cohort learn about cybersecurity data science and produce relevant results with real-world data.

For this project, Dr. Yang recruited and mentored five undergraduate students and one graduate student to explore data science approaches for cross-organizational cyber threat and risk assessment.

Student Participants:

Chanel Cheng (Undergraduate, Computer Science, RIT)
Serena Yang (Undergraduate, Information Science, Cornell University; remote participant)
Wei-Ting Chen (Undergraduate, National Chi-Nan University, Taiwan; remote participant)
Vazgen Tadevosyan (MS, Data Science, RIT)
Matthew Heller (Undergraduate, Computer Engineering, RIT)
Pradumna Gautam (Undergraduate, Anna University, Chennai, India; visiting student to RIT)

Specifically, two sets of efforts were put forth with the award:

1. Cross-organization attack pattern analysis for data collected through STINGAR, supported by Serena Yang, Wei-Ting Chen, and Vazgen Tadevosyan.

Honey-nets are commonly used to assess potential risks and threats to the networked systems they are modeled after. STINGAR is a community honey-net project run by Duke University’s cybersecurity team. PI Jay Yang has engaged the STINGAR team to obtain honey-net data across many higher-ed institutions that use STINGAR. Two students, Serena Yang and Wei-Ting Chen, were recruited in the summer of 2022 to explore and assess statistical and data-driven means to identify threat patterns across victim organizations and attacker origins. In September 2022, they presented their findings to the STINGAR team. Serena Yang further presented their findings to the NSF Big Data Hubs’ Data Sharing and Cyberinfrastructure Working Group in October 2022 and at the 2023 NEBDHub Inaugural Student Research Symposium. Vazgen Tadevosyan was later recruited to continue working on additional STINGAR data with clustering analysis of engineered attack features, and presented to the STINGAR team in March 2023. Serena Yang and Vazgen Tadevosyan are now preparing a paper based on their findings.

Key Findings and Impacts:

Statistical analysis reveals how the frequency of attacks on the community honey-nets varies by country of origin (as indicated by IP addresses). This analysis is limited by the data collected. The STINGAR team has improved its system since the summer of 2022 and has looked further into generating more contextually rich attack features and data science techniques.
Data reveals a large variety of attack attributes across victim organizations and attacking origins. Intelligent engineering of features that can be used for data science and machine learning techniques is critical. We have since derived several features that transform the large variety of attack attributes into more compact behavioral summaries and have shown their effectiveness for clustering analysis. Details of the results will appear in a paper now in preparation.
Regular exchanges between community-based honey-net teams and data scientists can benefit both sides, making both the systems designed to collect data and the threat analysis that uses such data more meaningful.
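
As a rough illustration of the first two findings, the sketch below counts honey-net events by origin country and then condenses each source's raw attributes into a small behavioral summary. The event fields (`src_ip`, `country`, `victim_org`, `dst_port`), the sample records, and the chosen summary features are hypothetical; they are not taken from the STINGAR data or from the students' analysis.

```python
import math
from collections import Counter, defaultdict

# Hypothetical honey-net events; field names and values are illustrative only.
EVENTS = [
    {"src_ip": "198.51.100.7", "country": "AA", "victim_org": "org1", "dst_port": 22},
    {"src_ip": "198.51.100.7", "country": "AA", "victim_org": "org2", "dst_port": 22},
    {"src_ip": "198.51.100.7", "country": "AA", "victim_org": "org2", "dst_port": 23},
    {"src_ip": "203.0.113.9", "country": "BB", "victim_org": "org1", "dst_port": 445},
]


def attacks_by_country(events):
    """Count events per origin country (as inferred from the source IP)."""
    return Counter(e["country"] for e in events)


def port_entropy(ports):
    """Shannon entropy (bits) of the destination-port distribution."""
    counts = Counter(ports)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def behavioral_summary(events):
    """Summarize each source IP by volume, breadth of victims, and port diversity."""
    by_src = defaultdict(list)
    for e in events:
        by_src[e["src_ip"]].append(e)
    summary = {}
    for src, evs in by_src.items():
        ports = [e["dst_port"] for e in evs]
        summary[src] = {
            "n_events": len(evs),
            "n_victims": len({e["victim_org"] for e in evs}),
            "n_ports": len(set(ports)),
            "port_entropy": port_entropy(ports),
        }
    return summary
```

Summary vectors of this kind could then be passed to a clustering algorithm. The features and clustering method actually used in the project are not described here.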

2. Continual learning across organizational cyberattack data for efficient threat recognition, supported by Chanel Cheng, Matthew Heller, and Pradumna Gautam.

Chanel Cheng and Matthew Heller were recruited in the summer of 2022 to explore the use of continual learning to differentiate cyberattacks by analyzing network traffic across organizations. Typical machine learning approaches that differentiate cyberattacks focus on the traffic of one network at a time. This effort develops network-agnostic features and an efficient continual learning approach to process data across datasets generated by different data providers (which use different networks). Matthew Heller helped identify open-domain network flow datasets that share similar features and normalized those features. Chanel Cheng investigated and implemented a continual learning approach and demonstrated its effectiveness on DoS attack data across datasets. Chanel Cheng continued after the summer and further investigated the limitations and enhancements of the continual learning approach when treating a broader variety of cyberattack behaviors. Meanwhile, Pradumna Gautam performed a preliminary literature review on the use of Natural Language Processing to interpret various cyberattack tactics.
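
One way to picture the feature-normalization step is a mapping from each provider's column names onto a shared schema, followed by per-dataset scaling so that values are comparable across networks. The sketch below assumes hypothetical dataset names, column names, and a shared schema, and it uses min-max scaling as a stand-in; the project's actual datasets, features, and normalization method are not specified here.

```python
# Hypothetical mapping from each dataset's column names to a shared schema.
SCHEMA_MAPS = {
    "dataset_a": {"Flow Duration": "duration", "Tot Fwd Pkts": "fwd_packets"},
    "dataset_b": {"dur": "duration", "spkts": "fwd_packets"},
}


def to_shared_schema(rows, column_map):
    """Rename columns to the shared schema and drop anything unmapped."""
    return [
        {shared: float(row[raw]) for raw, shared in column_map.items()}
        for row in rows
    ]


def min_max_scale(rows):
    """Scale every feature to [0, 1] within one dataset."""
    if not rows:
        return []
    features = list(rows[0].keys())
    lo = {f: min(r[f] for r in rows) for f in features}
    hi = {f: max(r[f] for r in rows) for f in features}
    scaled = []
    for r in rows:
        scaled.append({
            # A constant feature carries no information; map it to 0.0.
            f: (r[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
            for f in features
        })
    return scaled


def harmonize(datasets):
    """Return one list of rows with shared, per-dataset-normalized features."""
    merged = []
    for name, rows in datasets.items():
        merged.extend(min_max_scale(to_shared_schema(rows, SCHEMA_MAPS[name])))
    return merged
```

Scaling within each dataset, rather than across the merged data, is a design choice in this sketch: it keeps one provider's traffic volume from dominating the feature ranges of another.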

Additional Findings:

Continual learning with memory replay that replaces old data points with uncertain predictions is effective for learning new attack behaviors, even for traffic across organizations.
The above approach has been shown to be effective for a variety of DoS attacks [Cheng ’22]; further investigation to broaden the attack types is ongoing, with results expected to be disseminated in 2023.
There are significant limitations in the open-domain network traffic datasets used for cyber intrusion and threat analysis. PI Jay Yang is in the process of sharing the findings and requesting quality datasets from the community to conduct responsible research using these datasets.
While exploring cyber risk and threat assessment, PI Yang’s group also saw the need to advance NLP techniques. Follow-up research efforts in this area are ongoing in the community and by PI Jay Yang’s research group.
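
The memory-replay idea in the first finding above can be sketched as a fixed-size buffer whose contents are rehearsed alongside new data. The text does not spell out the exact replacement rule, so this sketch assumes one plausible reading: when the buffer is full, the most confidently predicted stored sample is evicted in favor of a new sample whose prediction is more uncertain, with uncertainty measured as the entropy of the predicted class probabilities. The class name, method names, and entropy criterion are all assumptions and should not be read as the method from [Cheng ’22].

```python
import math


def prediction_entropy(probs):
    """Entropy (nats) of a predicted class distribution; higher = more uncertain."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


class UncertaintyReplayBuffer:
    """Fixed-size replay memory that favors samples the model is unsure about."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (uncertainty, sample, label)

    def add(self, sample, label, probs):
        """Store a sample, evicting the least uncertain one if the buffer is full."""
        entry = (prediction_entropy(probs), sample, label)
        if len(self.items) < self.capacity:
            self.items.append(entry)
            return True
        idx = min(range(len(self.items)), key=lambda i: self.items[i][0])
        if entry[0] > self.items[idx][0]:
            self.items[idx] = entry
            return True
        return False  # new sample is no more uncertain than anything stored

    def replay_batch(self):
        """Return stored (sample, label) pairs to mix into the next training step."""
        return [(s, y) for _, s, y in self.items]
```

In a training loop, the pairs returned by `replay_batch()` would typically be combined with each batch from a new dataset, so the model keeps seeing earlier attack behaviors while it learns new ones.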

Summary of Outputs/Dissemination:
Papers:

Chanel Cheng and S. Jay Yang, “Cross-Organizational Continual Learning of Cyber Threat Models,” Poster Paper, ACSAC 2022, December 7-9, 2022, Austin TX, USA.
Reza Fayyazi, Steve Wufeng, Pradumna Gautam, and S. Jay Yang, “Translating Cybersecurity Descriptions into Interpretable MITRE Tactics using Transfer Learning,” Poster Paper, ACSAC 2022, December 7-9, 2022, Austin TX, USA.
Pradumna assisted in curating data and conducting this analysis. He is not the main author of this poster paper.
Vazgen Tadevosyan and Serena Yang, “Cross-organization attack pattern analysis through clustering of engineered features,” in preparation.
Chanel Cheng and S. Jay Yang, “Continual Learning of Large Variety of Cyberattacks across Networks,” in preparation.

Presentations:

Chanel Cheng and Serena Yang, “Data-driven Pattern Analysis and Continual Learning of Cyber Attacks across Organizations,” webinar presentation to the NSF Big Data Hubs Data Sharing and Cyberinfrastructure Working Group, October 7, 2022.
Chanel Cheng, “Cross-Organizational Continual Learning of Cyber Threat Models,” presentation in NEBDHub Inaugural Student Research Symposium, January 27, 2023. Video linked below.
Serena Yang and Wei-Ting Chen, “A Data Driven Approach to Cyber Risk and Threat Assessment,” presentation to Duke University STINGAR team, September 23, 2022.
Serena Yang, “A Data-Driven Approach for Cross-organization Cyber Risk and Threat Assessment,” presentation in NEBDHub Inaugural Student Research Symposium, January 27, 2023. Video linked below.
Vazgen Tadevosyan, “Clustering of Malicious Attacks,” presentation to Duke University STINGAR team, March 3, 2023.

2021 Cybersecurity as Big Data Science Interactive Workshop

The 2021 Cybersecurity as Big Data Science Interactive Workshop brought together experts in data science, databases, visualization, statistics, feature engineering, modeling, reproducibility, streaming data, heterogeneous data, irregular data, explainable AI, and human-centered computing in all application areas to discuss and design data science techniques to address current and emerging cybersecurity challenges.

The workshop’s objectives were to:

- Understand common and unique data science challenges for cybersecurity data.

- Share best data science practices, technical and non-technical, for cybersecurity challenges.

Cybersecurity challenges discussed included:

- Ever-evolving data from sources such as log files, intrusion alerts, vulnerability databases, dark web, and GitHub.

- Data quality issues such as susceptibility to adversarial attacks, lack of labeled data, low signal-to-noise ratios, and data fragmentation.

2019 Cybersecurity Risk Workshop

The Cybersecurity Risk Workshop, hosted on July 24-25, 2019 at RiskEcon® Lab @ Courant Institute of Mathematical Sciences, New York University, brought together a group of experts to set an agenda for developing a context-oriented taxonomy that offers a better framework for understanding cyber-physical security risk, specifically in the context of exploits and vulnerabilities in commercial and industrial IoT.

With a focus on cloud computing and industrial process control surfaces across telecommunications, transportation, utilities, infrastructure, municipal facilities/fleets/services (e.g., hospitals, emergency response), smart devices and connected vehicles, particularly around machine-to-machine (M2M) and machine-to-human/human-to-machine (M2H/H2M) connectivity, this working session asked participants to discuss their relevant work and contribute to the creation of a draft action plan on how to develop an attack-surface taxonomy.