UMass Amherst, WPI, Penn announce winners of Northeast big data competition

Winning Schemes for Predicting Student Interest in Science 


AMHERST, Mass. – After a year-long, global data-mining competition, organizers today honored the top three teams, from Hong Kong, Japan and Michigan, at the National Science Foundation’s (NSF) Northeast Big Data Spoke meeting on the MIT campus in Cambridge. A total of 74 teams took part in the challenge.

As competition organizer Beverly Woolf of the University of Massachusetts Amherst explains, the winners used algorithms of their choice to predict a student’s first job out of college. They worked from an anonymized dataset, provided by the contest organizers, of 423 middle school students followed for 10 years.

The competition was launched last year by computer scientists Neil Heffernan of Worcester Polytechnic Institute (WPI) and Woolf of UMass Amherst and data scientist Ryan Baker of the University of Pennsylvania, with learning scientist Ivon Arroyo of WPI. They used a three-year, $1 million grant from NSF, which included a goal “to improve our ability to solve the nation’s most pressing challenges by extracting knowledge and insights from large, complex collections of digital data.”

The researchers focused the competition on using big data to predict a student’s likely attendance in college or college major, based on existing large-scale longitudinal data sets about the students’ behavior in middle school while interacting with an online mathematics tutor.

Woolf says, “One of our goals is to build a community of researchers and practitioners around gathering, storing, and analyzing data from schools, students, and administrators. Development of longitudinal analyses to predict students’ career choice is one use of exploratory analyses to identify regular, or unusual, patterns in data, thus helping to formulate new scientific hypotheses. We are excited to provide this opportunity and associated training opportunity to winners of this competition.”

The first-place team of Chun Kit Yeung, Kai Yang, and Dit-yan Yeung is from the Hong Kong University of Science and Technology. They combined a state-of-the-art “Deep Knowledge Tracing” model, which predicts student performance, with boosting algorithms. Prof. Yeung said, “It was great to participate in this challenge and to be recognized. We learned a lot about deep learning, and were glad to be given a chance to practice those skills.”

Organizers say the second-place winner was Makhlouf Jihed from Japan’s Kyushu University, and third-place honors went to the University of Michigan Data Science Team, a group that regularly competes in data competitions like this one.

Clickstream data used by the competing teams came from middle school students who had used an online mathematics software platform called ASSISTments, created by Heffernan a decade ago and now used by over 50,000 U.S. students a year. A clickstream is a record of the path a student took through the website. Teams also had data on whether the middle school students had gone on to a STEM career after college. No personally identifiable information was included in either set.

Heffernan says, “It makes sense that students whose clickstream data shows they are more engaged in math class are probably more likely to go into a STEM career, and on the flip side, students who are bored, frustrated, off task or confused are less likely to want to study more math. Researchers can detect who is losing interest in math, and these effects persist for over a decade.”

He adds, “One purpose of this competition was to develop better tools to understand and predict students’ future interests in STEM careers. Once these interests are known, young students can be supported and encouraged to continue their interest in science.”

Baker says, “We already knew that by using this data we could better predict state test scores, who will go on to college, and whether they will choose a STEM major. So for this competition, we asked participants to predict the next thing. Did the students wind up in a STEM career?”

Meeting host Justin Reich, an assistant professor of comparative media studies and writing at MIT, says, “Open data is a crucial component of the future of science, and we are thrilled at the MIT Teaching Systems Lab to host the announcement of this important competition. It’s so exciting to see a global community of students and researchers collaborating to better understand how learning analytics can meaningfully support learners, while accounting for considerations of privacy and ethics. I applaud WPI and UPenn for sharing their data, and offering this competition in order to advance data science in education.”

Woolf’s team plans to hold educational data science workshops, yearly data competitions and hackathons for researchers and teachers over the coming three years, and to share information from several educational databases with teachers.


Contact: Janet Lathrop, 413/545-2989