Bioinformatics Technology Lab
Timeline: Jan - Apr 2019
I had a wonderful first co-op experience working for 4 months at the Bioinformatics Technology Lab (BTL). The main tools and skills I used during my co-op were the Linux (CentOS) operating system and command line, scripting, data pipelining, benchmarking, data visualization, multithreading, and the programming languages C++, Python, R, and Bash - along with learning (and sometimes remembering from high school:) some very interesting biology!
Confession: I wasn’t studying Computer Science when I started this internship. I enjoyed the software development work so much that it became one of the reasons (along with a course I took) why I decided to transfer from Engineering Physics to Computer Science.
I worked on two different projects during my time at BTL, both of which are described below.
Project #1: Sequence Homology - Antimicrobial Peptide Discovery
In this project, I followed a procedure (a data pipeline) for antimicrobial peptide (AMP) discovery using sequence homology tools. AMPs are natural antibiotics: protein sequences that contribute to the innate immune systems of many organisms - including humans. The idea behind this project was to investigate “subject” sequences that are similar to given “query” sequences. In this case, this meant investigating protein sequences (subjects) that are similar to known AMP sequences (queries), with the aim of discovering novel AMP sequences among the subject proteins (I worked with the Apis mellifera proteins). I wrote Bash and Python scripts to run widely used bioinformatics tools such as BLAST and Jackhmmer, and to investigate the results they produced. At the end of the pipeline, the candidate novel AMP sequences I identified in the Apis mellifera proteins were all either known AMPs or listed as “predicted” AMPs on NCBI - which showed that this introductory workflow worked correctly.
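To make the "investigate the results" step more concrete, here is a minimal sketch of how BLAST hits could be filtered in Python. It assumes the search was run with BLAST's standard 12-column tabular output (`-outfmt 6`). The function name `candidate_subjects`, the e-value and identity thresholds, and the sequence IDs in the example are all hypothetical and chosen for illustration; they are not the values or scripts used at BTL.

```python
import csv
import io

# The 12 standard columns of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def candidate_subjects(handle, max_evalue=1e-5, min_identity=30.0):
    """Return {subject_id: best_hit_row} for hits passing both thresholds.

    The thresholds are illustrative assumptions, not recommended values.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        identity = float(row["pident"])
        if evalue > max_evalue or identity < min_identity:
            continue  # hit is too weak to count as a candidate
        current = best.get(row["sseqid"])
        # Keep only the most significant (lowest e-value) hit per subject.
        if current is None or evalue < float(current["evalue"]):
            best[row["sseqid"]] = row
    return best


# Made-up rows: two hits on subject_A, one weak hit on subject_B.
example = (
    "amp_query_1\tsubject_A\t85.0\t40\t6\t0\t1\t40\t5\t44\t1e-20\t90.0\n"
    "amp_query_2\tsubject_A\t60.0\t38\t15\t0\t1\t38\t7\t44\t1e-8\t55.0\n"
    "amp_query_1\tsubject_B\t25.0\t30\t22\t1\t3\t32\t10\t39\t0.5\t20.0\n"
)
hits = candidate_subjects(io.StringIO(example))
for subject_id, hit in hits.items():
    print(subject_id, hit["qseqid"], hit["evalue"])
```

In a real run, the handle would be an open results file rather than an in-memory string, and each surviving subject ID would then be looked up (for example on NCBI) to see whether it is already a known or predicted AMP.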
Project #2: Performance Improvement of a Bioinformatics Software Tool
In this project, I worked on improving the runtime and memory usage of a bioinformatics software program developed at the lab. As part of this task, I re-implemented the Python code in C++, modularized the code, made some logic changes without changing the functionality of the program, benchmarked the code using various data structure implementations, performed code profiling, wrote a multi-threaded version of the program, and wrote Bash and Python scripts to test the program. For the human chromosomes, which were the largest dataset the program ran on, the runtime improved by 4.5x and the memory usage improved by 1.4x.
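As a rough illustration of what benchmarking different data structure implementations can look like, here is a small Python sketch that compares a list and a set for membership lookups, using only the standard library (`time.perf_counter` and `tracemalloc`). It is not the lab's program or my actual benchmark code: the `benchmark` helper, the two counting functions, and the `kmer_` strings are hypothetical stand-ins, and the real work was done on the C++ implementation.

```python
import time
import tracemalloc


def benchmark(func, *args):
    """Run func(*args) once; return (result, elapsed seconds, peak traced bytes).

    tracemalloc adds overhead, so the timings are only useful for comparing
    runs measured the same way, not as absolute numbers.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def count_shared_list(reference, queries):
    # Each lookup scans the list: O(len(reference)) per query.
    return sum(1 for q in queries if q in reference)


def count_shared_set(reference, queries):
    # Builds a hash set first: extra memory, but O(1) average lookups.
    lookup = set(reference)
    return sum(1 for q in queries if q in lookup)


# Made-up data: 2000 reference strings, 2000 queries, 1000 in common.
reference = [f"kmer_{i}" for i in range(2000)]
queries = [f"kmer_{i}" for i in range(1000, 3000)]

list_result, list_time, list_peak = benchmark(count_shared_list, reference, queries)
set_result, set_time, set_peak = benchmark(count_shared_set, reference, queries)

print(f"list: {list_time:.4f} s, peak {list_peak} bytes")
print(f"set:  {set_time:.4f} s, peak {set_peak} bytes")
if set_time > 0:
    # An improvement factor such as "4.5x" is the old time over the new time.
    print(f"runtime improvement: {list_time / set_time:.1f}x")
```

The point of a comparison like this is the trade-off it exposes: one structure may be faster while using more memory, so both numbers need to be measured on realistic input before choosing.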
You can take a look at the repository, which includes some highlights of my work at BTL, here.