Students Capitalize on Computational Genomics Research Using AWS

by Katherine Kendig

The Amazon Web Services (AWS) campus contract, which integrates security and support from both Amazon and Illinois’s Technology Services, enabled the NCSA Genomics team to initiate two forays into cloud computing during the summer of 2017: Ellen Nie, a student researcher, deployed a software prototype on the cloud that will make data-intensive genomic studies faster and more feasible, while Jennie Zermeno investigated cloud-based containerization to streamline workflows.

Amazon Elastic Compute Cloud (EC2), part of AWS, provides massive computational power that doesn’t require proximity to large computer clusters. Users can rent access to processing resources, setting up “instances” — virtual computers — to run whatever programs they require. This allows students and programmers to set up specific computing environments based on their needs at any given time; instances can be set up, taken down, and reconfigured as tasks and requirements change.

The software Nie tested was developed by colleague Jacob Heldenbrand and improves on existing software options by running in parallel on multiple nodes, vastly reducing runtime. Nie says that working with EC2 made testing the prototype much easier. Amazon Simple Storage Service (S3), for example, allowed Nie to store her data within the cloud and pull them out whenever she wanted, eliminating the need to upload and download large amounts of data with each use.

Amazon EMR (Elastic MapReduce) also makes it simple for users to run widely used frameworks like Apache Spark on virtual clusters sized to the job. Students using traditional clusters are necessarily limited in terms of the resources they can combine, constricted by the nodes, cores per node, memory and data storage specifications, and software frameworks associated with each setup. AWS lifts many of those constraints, giving users the freedom to construct tailor-made, elastic computing environments for their workflows.

Incompatibility across clusters can be a frustrating obstacle for collaboration, but cloud computing largely eliminates that issue. The improvements Zermeno makes to the workflow she’s currently streamlining will benefit collaborators in Australia, California, and Africa, across institutions with highly disparate computing environments. Thanks to Amazon, those differences won’t prevent meaningful cooperative progress.

Illinois’s campus contract with AWS has been in operation since June 2016. Dr. Liudmila Mainzer, head of the NCSA Genomics team, notes that grant-bestowing institutions like NIH have set up similar contracts. “Cloud computing provides a useful model for healthcare researchers to address the challenges of doing research on extremely large datasets,” she says. Because cloud computing is cost-effective and eliminates the need to transfer data between clusters, it can help researchers overcome regulatory and financial restrictions. Illinois’s agreement has provided a new pathway for researchers like Nie and Zermeno to develop efficient, affordable, and innovative options for computation and analysis.

‘It’s very flexible’

Nie and Heldenbrand’s software prototype would allow researchers to use Amazon cloud computing to analyze data from genome-wide association studies (GWAS). Researchers use GWAS to look for relationships between genetic variation and specific phenotypic features across the entire genome in humans, animals, and plants. Currently these analyses are limited in size and scope due to the high computational requirements of GWAS algorithms. GWAS analysis performed by an existing software package, PLINK, is limited in speed and efficiency because the software can’t distribute tasks to multiple computers, or nodes, at the same time.

NCSA researchers are used to large multi-node computing clusters with significant power, but the average medical researcher, agroscientist, or biologist performing GWAS may not have access to such expansive infrastructure. Working with AWS will enable a fast, parallelized alternative to PLINK that can be deployed on the cloud, pairing computational power with wide accessibility.

Nie and Heldenbrand’s software prototype calculates complete models for GWAS that account for both individual genetic effects and epistatic (gene-to-gene) interactions that influence traits. Calculating epistasis is especially computationally intensive because it requires analyzing huge numbers of possible interactions; however, it can also yield more robust results than basic models afford. The software is written in Scala and uses the Apache Spark framework, which is well suited to parallelization and distribution, both necessary features for software performing huge numbers of calculations. Unlike PLINK, the prototype can process many pieces of data at the same time by distributing large tasks across nodes and then re-integrating the computed results. This parallelization significantly reduces the runtime needed and the memory required on each node, while harnessing the full power of a computer cluster. Amazon EC2 will enable scientists across institutions to use the software, regardless of the computing resources they have onsite.
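To see why epistasis is so demanding, note that n genetic markers produce n(n-1)/2 possible pairs, so a hypothetical panel of 500,000 markers would yield 124,999,750,000 pairwise interactions. The Python sketch below is not the team's Scala and Spark code; it is a toy illustration of the same distribute-then-recombine pattern, in which the marker pairs are split into chunks, each chunk is scored by a separate worker, and the partial results are merged. The scoring function, the sample data, and all function names are assumptions made for illustration, and local threads merely stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def interaction_score(snp_a, snp_b, phenotype):
    """Toy statistic (an assumption, not a real GWAS model): covariance
    between the product of two genotype vectors and the phenotype."""
    n = len(phenotype)
    term = [a * b for a, b in zip(snp_a, snp_b)]
    mean_t = sum(term) / n
    mean_p = sum(phenotype) / n
    return sum((t - mean_t) * (p - mean_p) for t, p in zip(term, phenotype)) / n


def score_chunk(pairs, genotypes, phenotype):
    """Work done by one worker: score every marker pair in its chunk."""
    return {
        (i, j): interaction_score(genotypes[i], genotypes[j], phenotype)
        for i, j in pairs
    }


def epistasis_scan(genotypes, phenotype, n_workers=4):
    """Split all marker pairs into chunks, score the chunks concurrently,
    then merge the partial results into one dictionary."""
    pairs = list(combinations(range(len(genotypes)), 2))
    chunks = [pairs[k::n_workers] for k in range(n_workers)]
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(
            lambda chunk: score_chunk(chunk, genotypes, phenotype), chunks
        )
        for partial in partials:
            results.update(partial)
    return results


# Hypothetical toy data: 4 markers (0/1/2 allele counts) in 6 samples.
GENOTYPES = [
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 2],
]
PHENOTYPE = [1.0, 2.5, 0.5, 3.0, 2.0, 4.5]

scores = epistasis_scan(GENOTYPES, PHENOTYPE, n_workers=2)
```

Because each pair is scored independently, the chunks can be processed in any order and on any worker. That independence is what lets a framework like Spark spread the same kind of work across many nodes.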

The prototype Nie deployed was funded by CCBGM, a computational genomics partnership between UIUC, the Mayo Clinic, and the University of Chicago. Nie is optimistic about the use of AWS in future projects with CCBGM due to the ease of use: “The good thing about Amazon Cloud is that it’s very flexible,” Nie says. Eric Jakobsson, a CCBGM advisor, is excited that “ubiquitous computing” is now possible as a result of the cloud. AWS eliminates many obstacles of incompatibility and access.

Now, Jakobsson marvels, “computing power is everywhere.”

‘Play and develop your own directions’

The main goal of Zermeno’s work is to make genomics research — and, ultimately, medical research — easier by developing straightforward, accessible strategies for optimized computational analysis.

Zermeno’s initial avenue of investigation involved the use of containerization to streamline analysis workflows. When using Amazon Elastic Compute Cloud (EC2), users typically set up “instances,” or virtual computers. Containerization, however, allows users to launch specific components of the operating system without setting up an entire instance. Each container performs just one specific task and shuts down when its job has completed, maximizing efficiency. Amazon EC2 Container Service provides an environment for users to run and manage Docker containers; Docker is a widely used containerization platform.

Zermeno’s first successful foray into containerization involved a variant calling workflow used by H3Africa, a consortium aiming to expand genomic testing among African populations with collaborators in Australia, Africa, and California. The goal was to give the workflow, which represents a major step in the pipeline of genomic analysis, the interoperability and mobility inherent to containerization. Now Zermeno is building on her work by writing code to connect and configure cloud-native components provided by AWS, like CloudWatch, to provide performance monitoring, security, and other features for the H3A workflow. Unlike clusters, which typically have these features built in, AWS offers optional building blocks that can be stitched together with custom code. This allows for a programmatic approach to configuring infrastructure. CloudWatch, for example, provides real-time monitoring of metrics like CPU utilization and network performance, and offers the option of adding custom scripts that monitor metrics like used and available memory.
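A custom memory-monitoring script of the kind described above might look something like the following Python sketch. It is not Zermeno's code. It assumes a Linux instance that exposes memory figures in the `/proc/meminfo` format, and the namespace, metric names, and instance ID are hypothetical placeholders. Publishing uses the `put_metric_data` call from the boto3 CloudWatch client, which requires boto3 to be installed and AWS credentials to be configured, so that step is kept in a separate function that the example does not call.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value}.
    Memory fields in that file are reported in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_metrics(meminfo, instance_id):
    """Build CloudWatch metric entries for used and available memory.
    Metric names here are hypothetical, chosen for illustration."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    used = total - available
    dimensions = [{"Name": "InstanceId", "Value": instance_id}]
    return [
        {
            "MetricName": "MemoryUsedPercent",
            "Dimensions": dimensions,
            "Unit": "Percent",
            "Value": 100.0 * used / total,
        },
        {
            "MetricName": "MemoryAvailable",
            "Dimensions": dimensions,
            "Unit": "Bytes",
            "Value": float(available * 1024),  # kB figure converted to bytes
        },
    ]


def publish(metric_data, namespace="Custom/H3AWorkflow"):
    """Send the metrics to CloudWatch. The namespace is a hypothetical
    placeholder; boto3 and valid AWS credentials are required."""
    import boto3  # imported here so the rest of the sketch runs without it

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=metric_data)


# Sample input standing in for the contents of /proc/meminfo.
SAMPLE = """MemTotal:       8000000 kB
MemFree:        1000000 kB
MemAvailable:   2000000 kB
"""

metrics = memory_metrics(parse_meminfo(SAMPLE), "i-0123456789abcdef0")
```

On a real instance, a scheduler such as cron could read `/proc/meminfo`, build the entries, and call `publish` at a regular interval so the values appear alongside CloudWatch's built-in CPU and network metrics.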

Zermeno, who only began exploring cloud containerization recently, says it’s surprisingly easy to learn. EC2, she says, offers “the freedom to screw things up.” Whereas mistakes can cause lasting consequences on traditional computer clusters, Amazon’s instances can be deleted and recreated when errors occur. AWS “allows you to play and develop your own directions,” says Zermeno, who has created several guided walkthroughs to help orient new users who may be unfamiliar with cloud environments. She notes that Amazon is also highly responsive to customer feedback and frequently adds new features and functionality.

One of the biggest challenges of using EC2, Zermeno jokes, is keeping track of the many acronyms Amazon employs. “But if that’s the biggest challenge,” she adds, “you’re in a good place.”