
Promoting replication

Submitted on Tuesday, March 22, 2022

Professor Rohan Alexander examines and creates workflow practices that make studies reproducible 

With a background in economic history and extensive experience using data science approaches to explore questions in economics, history and politics, Rohan Alexander is used to working with messy historical records that need to be digitized, cleaned and prepared for analysis. When he arrived at the Faculty of Information as a postdoctoral researcher in 2019, one of the first things he did was co-found a series of brown bag lunches aimed at spreading the word about best practices in data cleaning.

“Like other economic historians, I’ve been thinking about poor data for a long time and how to use poor data better,” says Alexander. His goal was to give participants from all different fields a chance to share tips about the initial stages in a typical data science workflow. The meetings, renamed the Toronto Data Workshop, expanded beyond data cleaning and preparation to also cover data gathering and scraping as well as data sharing and dissemination.

Alexander, who was appointed Assistant Professor at both the Faculty of Information and the Department of Statistical Sciences in 2020, has recently turned his attention to the replication crisis and how to avoid becoming part of it. The idea is to promote and use best practices that would enable researchers to create a consistent and reproducible workflow in their data gathering. 

Applying the knowledge gained using methods such as web scraping and survey collection to construct datasets from historical sources in a reproducible way, Alexander and Master of Information student A Mahfouz developed the heapsofpapers R package. It allows researchers who need to gather a lot of content – or heaps of papers – from the Internet to do so in a consistent way, so that the workflow can be easily reproduced. In the first three months that heapsofpapers was available on CRAN, it had 1,400 downloads.

Alexander has also investigated the reproducibility of Covid-19 research, more specifically “pre-prints”, which are not peer reviewed and became ubiquitous during the pandemic as authors rushed to share their research. “Because these papers are not reviewed, the onus is very much on authors to ensure their work is reproducible,” said Alexander. “Reproducibility is the basis of knowledge.”

He and student researcher Annie Collins downloaded multiple datasets from pre-print servers, analyzing the text for keywords signaling the availability of open data or open code, which they found in only 25% of papers. “It speaks to how we need to do a better job of integrating open science training,” said Alexander. “We need to make this the default so scientific standards don’t drop in situations like Covid-19.” 
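The kind of keyword screen described above can be sketched in a few lines. This is a minimal Python illustration, assuming a hypothetical keyword list; the actual search terms and pipeline Collins and Alexander used may differ:

```python
# Sketch: flag papers whose text signals the availability of open data
# or open code, then compute the share of flagged papers.
# The keyword list below is illustrative only, not the study's real terms.
KEYWORDS = ["open data", "open code", "github.com",
            "data availability", "code availability"]

def signals_openness(text):
    """Return True if any open-science keyword appears in the text."""
    text = text.lower()
    return any(keyword in text for keyword in KEYWORDS)

# Hypothetical stand-ins for downloaded pre-print texts
abstracts = [
    "All code is available at github.com/example/repo.",
    "We analyse case counts across provinces.",
    "Data availability: upon reasonable request.",
    "Results were computed in-house.",
]
share = sum(signals_openness(t) for t in abstracts) / len(abstracts)
print(f"Share signalling open data/code: {share:.0%}")  # → 50%
```

A real analysis would also need to handle false positives (e.g. “data available upon request” is not open data), which is why keyword screens like this give an upper bound rather than a precise count.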

Yet another area of research interest for Alexander is multilevel regression with post-stratification (MRP), a popular way to adjust non-representative samples to better analyze opinion and other survey responses. For example, if researchers want to ensure their sample is 50% female but have a biased, unrepresentative sample that is only 25% female, they would traditionally have to spend more money to get a larger, better sample. “What MRP does instead is train a model based on that survey,” says Alexander. “Then you can apply this trained model to a data set more representative of society.”
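The post-stratification half of that idea can be shown with a toy calculation. The sketch below, in Python, uses each cell’s sample mean as a stand-in for a fitted model and re-weights by known population shares; a genuine MRP analysis would instead fit a multilevel model across many demographic cells (and Alexander’s own tooling is in R):

```python
# Toy post-stratification: a biased sample (25% female) is re-weighted
# to known population cell shares (50% female, 50% male).
def poststratify(sample, population_shares):
    """Weight per-cell estimates by known population cell shares."""
    cells = {}
    for cell, value in sample:
        cells.setdefault(cell, []).append(value)
    # Use each cell's sample mean as a stand-in for a fitted model
    cell_means = {c: sum(v) / len(v) for c, v in cells.items()}
    return sum(population_shares[c] * cell_means[c]
               for c in population_shares)

# Hypothetical survey: 25 female respondents answering 0.8 on average,
# 75 male respondents answering 0.4 on average
sample = [("female", 0.8)] * 25 + [("male", 0.4)] * 75

raw_mean = sum(v for _, v in sample) / len(sample)             # 0.5
adjusted = poststratify(sample, {"female": 0.5, "male": 0.5})  # 0.6
```

The unweighted mean (0.5) understates the female respondents’ answers because they are underrepresented; the post-stratified estimate (0.6) weights each cell as it occurs in the population.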

To help demystify MRP, critically analyze research that uses it, and apply it, Alexander devised an MRP kit using the programming language R. “If you follow this workflow, you’ll have a process able to be redone by any researcher, including yourself,” he said.

Alexander’s book, Telling Stories with Data, will be published by CRC Press in 2022.