Harnessing Data to Predict and Prevent Cancer Treatment Adverse Events through Artificial Intelligence

Guest post by Robert Wieder, Rutgers New Jersey Medical School

This Success Story is a report on the results of the Northeast Big Data Innovation Hub’s 2020 Seed Fund program.

The goal of the project was to conduct an in-depth, large-scale study of the comprehensive 1991-2016 Surveillance, Epidemiology, and End Results (SEER)-Medicare dataset*, which the investigators, Dr. Robert Wieder (PI) and Dr. Nabil Adam (Co-PI), obtained through a two-tiered review. The aim was to identify predictors of adverse events in patients treated for breast cancer by applying machine learning and artificial intelligence (ML/AI) techniques.

Task 1. The Dataset.

The dataset used for this project combines clinical data from population-based cancer registries with claims data from the Centers for Medicare & Medicaid Services (CMS) Medicare program. The dataset consists of the following files:

  1. Patient Entitlement and Diagnosis Summary File (PEDSF), which has SEER data for breast cancer cases diagnosed between the 1990s and 2015
  2. Physician/Supplier (NCH) 1991-2015
  3. Outpatient 1991-2016
  4. MEDPAR 1991-2016
  5. Chronic Conditions Flags 1999-2016
  6. Part D (PDE) 2007-2016
  7. Part D Enrollment (PTden)
  8. Census Data by Zip Code
  9. Census Data by Census Tract

A part-time undergraduate senior, Aqsa Syed, was hired with this grant funding. Under the supervision of the investigators, the student helped set up the dataset.

Task 2. Data Fusion and Integration.

We developed an OMOP (Observational Medical Outcomes Partnership)-based data model to fuse and integrate the SEER-Medicare dataset. OMOP provides a Common Data Model (CDM) for storing data within observational databases with common semantics (terminologies, vocabularies, coding schemes). The CDM cancer module provides the needed granularity and abstraction (e.g., recurrence, remission, end-of-life events, chemotherapy regimens, treatment cycles, response to treatments) of cancer data: diagnoses, treatments, and outcomes. A cancer diagnosis is defined by an assemblage of histology, site, stage, grade, and genetic biomarkers. Cancer treatments cannot be described by individual drugs alone, because they are administered in specific orders and cycles.

Task 3. Data Preprocessing and Analysis.

  1. Data Cleaning: renaming, sorting and recording, handling duplicate, missing, and invalid data, and filtering to the desired subset of data.
  2. Data transformation and standardization: ensure all features are numeric and standardized.
    1. Apply StandardScaler as part of the ML pipelines.
    2. Apply Bucketizing for continuous values, e.g., age.
    3. Apply encoding, e.g., one-hot encoding for categorical features.
  3. Tumors are characterized by site, laterality, stage, grade, ER and PR status, and HER2 (human epidermal growth factor receptor 2) status.
  4. Patients are characterized for race/ethnicity, age, marital status, obese/overweight status, hypertension, comorbidities, syndromes, prior malignancies, and therapy.
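
The three transformations in step 2 can be illustrated with a minimal sketch. It is not the project's pipeline: it uses only the Python standard library to show what each step computes, and the patient records, field names, age bucket edges, and stage categories are all hypothetical. The standardization here divides by the population standard deviation, which is the same convention scikit-learn's StandardScaler uses.

```python
from bisect import bisect_right
from statistics import fmean, pstdev


def standardize(values):
    """Z-score a numeric column: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def bucketize(value, edges):
    """Return the index of the bucket that contains value.

    With edges [70, 75, 80] the buckets are <70, 70-74, 75-79, and >=80.
    """
    return bisect_right(edges, value)


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical records; these are illustrative values, not SEER-Medicare data.
patients = [
    {"age": 66, "tumor_size_mm": 12.0, "stage": "I"},
    {"age": 72, "tumor_size_mm": 25.0, "stage": "II"},
    {"age": 81, "tumor_size_mm": 38.0, "stage": "IV"},
]
AGE_EDGES = [70, 75, 80]           # assumed bucket boundaries
STAGES = ["I", "II", "III", "IV"]  # assumed category order

sizes = standardize([p["tumor_size_mm"] for p in patients])
features = [
    [size, float(bucketize(p["age"], AGE_EDGES))]
    + [float(x) for x in one_hot(p["stage"], STAGES)]
    for p, size in zip(patients, sizes)
]

for row in features:
    print(row)
```

Each row ends up fully numeric: one standardized column, one bucket index, and four one-hot columns for stage. In a real pipeline the mean, standard deviation, bucket edges, and category list would be fitted on training data only and then reused for validation and test data.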

The following paper is under review: Robert Wieder and Nabil Adam, “Drug Repositioning for Cancer in the Era of Big Omics and Real-World Data,” in the journal Critical Reviews in Oncology/Hematology.

The following proposal was submitted to the National Cancer Institute (NCI) for funding as a result of this seed funding:

Deep intelligence Comprehensive Cancer Care 1.1 (Di3C1.1)

Nabil Adam (PI), Robert Wieder (Co-PI), Sita Kapoor (Co-PI), and Tarek Adam (Co-PI)


Most deaths from cancer occur in patients with recurrent or metastatic disease. Treatment guidelines and clinical practice decisions for stage IV cancers vary considerably and are often not based on level 1 evidence obtained from randomized clinical trials. Instead, treatment decisions in the very large spectrum of stage IV clinical scenarios are often based on lower tiers of evidence and depend on physicians’ experiences, preferences, and biases. This is exacerbated by an ever-growing number and complexity of treatment options as first, second, or third-line therapy. Practically, the effects of prior therapy, disease, and comorbid conditions all affect treatment decisions and response to therapy. This underscores an urgent need for longitudinal strategies that advise treatment decisions and timing while considering patient characteristics, medical history, and patient preferences. Ideally, an optimal treatment recommendation will be evidence-driven, not be subject to inherent biases, and be clear. These considerations challenge us to develop approaches to generate high-quality, real-world evidence to guide the treatment of cancer patients. One such approach, machine learning/artificial intelligence (ML/AI), has been considered a potentially useful tool to generate treatment guidelines for therapy. Opportunities for its application have materialized through the emergence of electronic health records (EHRs), which provide a wealth of information on a patient’s medical history, complications, laboratory results, and medication usage. We propose to demonstrate that ML/AI in the setting of stage IV cancer can generate a new level of evidence, level 1.1, derived from real-world data (RWD), to provide prognostic and adverse events predictions in limitless scenarios. 
This evidence would have a level of credibility second only to randomized clinical trials, provide high-level evidence when clinical trial evidence is lacking, and complement trial-derived evidence where available. Our aims are:

  • Aim 1. Design ML/AI models to identify longitudinal scenario-specific outcomes in cancer patients. We will develop a set of novel ML/AI models that: 1) take into account the longitudinal nature of EHRs and their representation as a sequence of clinical events with irregular time gaps between events; 2) capture generalizable relationships and changes in interrelationships over time in a patient’s treatment and conditions; and 3) are free of the biases about patient populations, treatment planning, and geographical location that result from using single-institution data for model validation and testing.
  • Aim 2. Apply ML/AI models to generate scenario-specific treatment guidelines in stage IV cancer. We will develop, train, validate, and test our models using RWD (including both structured and unstructured EHR data) from multiple hospitals, healthcare facilities, and office-based physicians. We will demonstrate the feasibility of integrating these models with associated tools and interfaces into a clinical practice system to generate scenario-specific guidelines for use by cancer researchers and clinicians. Our future research plan is to build on the results of this work and develop it into an industrial-strength clinical practice system.
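
To make the first point of Aim 1 concrete, the sketch below shows one simple way a patient history could be represented as a sequence of clinical events with irregular time gaps. It is only an illustration under stated assumptions, not the proposed models: the function name `to_event_sequence`, the event labels, and the dates are all hypothetical.

```python
from datetime import date


def to_event_sequence(events):
    """Sort (date, code) events and pair each code with days since the prior event.

    The first event gets a gap of 0. The gaps are what make the sequence
    irregular: a sequence model can take them as an extra input per event.
    """
    ordered = sorted(events, key=lambda e: e[0])
    sequence = []
    previous_day = None
    for day, code in ordered:
        gap = 0 if previous_day is None else (day - previous_day).days
        sequence.append((code, gap))
        previous_day = day
    return sequence


# Hypothetical history; the labels and dates are illustrative only.
history = [
    (date(2015, 3, 1), "diagnosis"),
    (date(2015, 1, 10), "mammogram"),
    (date(2015, 3, 20), "chemo_cycle_1"),
    (date(2015, 4, 10), "chemo_cycle_2"),
]

sequence = to_event_sequence(history)
print(sequence)
# [('mammogram', 0), ('diagnosis', 50), ('chemo_cycle_1', 19), ('chemo_cycle_2', 21)]
```

Keeping the gap alongside each event preserves timing information that would be lost if the events were treated as evenly spaced steps.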

* For more information on the National Cancer Institute (NCI) SEER data from the Surveillance, Epidemiology, and End Results Program, visit https://seer.cancer.gov/

Robert Wieder is an Attending Physician at Rutgers Cancer Institute of New Jersey at University Hospital and Provost of Rutgers Biomedical and Health Sciences Newark.

Nabil Adam is a Distinguished Professor of Medicine and Computer & Information Systems and specializes in cybersecurity, machine learning, healthcare technology, and clinical/healthcare informatics at Rutgers University. Nabil is also the Co-founder & CEO of Phalcon, LLC. He has served as the Vice-Chancellor for Research & Collaborations at Rutgers University.