How a Biologist Became a Data Scientist

How I Transitioned from a Non-Technical Background into Data Science

In this article, I will share how I transitioned from being a biologist to becoming a data scientist. It is an expanded version of the original video on my YouTube channel, Data Professor.

Brief Look at My Academic Journey

So before we begin, perhaps a little bit about myself. I have been working in data science since 2004, when I was in the second year of my PhD studies. By 2006, I had completed my PhD thesis, entitled Computer-aided molecular design for biological and chemical applications: Quantum chemical and machine learning approach. In a nutshell, my research lies at the interface of biology, chemistry and computer science, and the fusion of these fields makes it possible for me to explore the underlying origins of protein function and its modulation (i.e. inhibition or activation). Such understanding has important implications for drug discovery, particularly the discovery of novel bioactive compounds with therapeutic activity.

Fast forward to 2020, and I am still using data science to make sense of data from biology, chemistry and medicine. Much of my work revolves around discovering drugs that act against diseases by specifically modulating target proteins of interest. This is done by using machine learning to reveal which specific features of drug-like molecules give rise to promising modulation of the target protein.

How Did I Start Data Science?

My data science journey started back in 2004. At that time the field was not yet called data science, and the more popular term was data mining. I can vividly recall that my first data science project was to predict DNA splice junction sites. The primary data mining tool that I started out with was a program called WEKA, which was developed at the University of Waikato. It is GUI-based software in which we click various buttons to import data, perform feature selection, normalize the data, remove missing data and build machine learning models. Some of the machine learning algorithms that I used include decision trees, linear regression, artificial neural networks and support vector machines.

I started out as a user of GUI-based data mining software, and over time I became aware of some of its limitations and hurdles. In particular, I noticed that running the data mining workflow took a lot of time. Whenever I wanted to optimize the learning parameters, I had to manually modify the parameter values in the program (imagine doing this for 100–1000 different parameter settings). After a few years, I felt the urge to learn how to automate these manual and mundane tasks.

The natural next step was to move on to a programming language such as Python or R.
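To give a feel for what automating those parameter settings can look like, here is a minimal sketch in Python. It is not the workflow I actually used. The function names, the parameter names (c and gamma) and the toy scoring function are all made up for illustration, and in a real project the scoring function would train and validate a model.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination of parameter values and return the best one.

    `evaluate` is any function that takes keyword parameters and returns a
    score where higher is better. `grid` maps parameter names to value lists.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # product() yields every combination, so nothing is typed in by hand.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train a model and measure its accuracy".
# It simply peaks at c=10 and gamma=0.1 so the search has something to find.
def toy_score(c, gamma):
    return -((c - 10) ** 2) - (gamma - 0.1) ** 2


grid = {
    "c": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

This small grid has only 16 combinations, but the same loop works just as well for hundreds or thousands of settings, which is exactly the part that is painful to do by clicking through a GUI.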

Learning to Code

Learning to code is indeed a challenging endeavor, and it is especially difficult when coming from a non-technical background. Personally, I tried pretty much everything: reading books, following tutorials, searching Stack Overflow, asking colleagues, etc. From my own learning experience, I discovered that if I use my own research problem as the basis for learning to code (instead of an example data set), then it feels like less of a burden, because upon accomplishing the coding task I am immediately rewarded with progress on my research project.

How I Break Coding Problems into Manageable Bits

Let’s say that I have a research problem that I want to solve. The first thing that I would do is break the problem down into self-contained tasks (e.g. merging the contents of specific columns from several Excel files into one file). To tackle this one task, I would look into tutorials, Stack Overflow answers and specific sections of coding books. Doing this over and over again led me to slowly grasp the coding concepts and realize that coding is not that tough and is something attainable. As I solved more and more problems, I gained motivation and became confident in coding. Slowly building upon small, individual coding tasks added up to steady progress on the project and its eventual completion. At that stage, I had a great sense of satisfaction that I had coded a data science workflow that significantly boosted productivity and saved time and cost. What used to take six months to do may take only a couple of minutes once a solution is coded in R or Python. Come to think of it, being able to code is kind of like having super powers!
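As an illustration of one such self-contained task, here is a minimal sketch of merging specific columns from several files into one. To keep it to the Python standard library, it assumes the Excel sheets have first been exported as CSV files (with the pandas library, pandas.read_excel can read Excel files directly). The function name, the source_file column and all file and column names are hypothetical.

```python
import csv
from pathlib import Path


def merge_columns(input_paths, columns, output_path):
    """Copy the chosen columns from each input CSV into one output CSV.

    A `source_file` column records which file each row came from.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=["source_file", *columns])
        writer.writeheader()
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.DictReader(in_file):
                    # Keep only the requested columns; a column that is
                    # missing from this file is left blank.
                    selected = {name: row.get(name, "") for name in columns}
                    selected["source_file"] = Path(path).name
                    writer.writerow(selected)
                    rows_written += 1
    return rows_written
```

A call such as merge_columns(["batch1.csv", "batch2.csv"], ["compound_id", "activity"], "merged.csv") would then produce a single file with the two chosen columns from both inputs. Each small task like this becomes a reusable building block for the larger workflow.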

Coding = Super Powers

If a biologist like me can learn to code, then I believe that everyone can. With determination and practice, anyone can code.

How I Use Coding in My Research

As a Biomedical Data Scientist, I am faced with the challenge of trying to make sense of biomedical data. Oftentimes, I have to spend most of my time curating the collected data to pre-process it for further meaningful analysis. As this task is very tedious and repetitive, I am glad that I am able to code a solution in R and Python (yes, I have coded two separate versions in these two languages) that programmatically pre-processes the data so that it is of high quality for further analysis. Aside from data pre-processing, I also use coding to perform exploratory data analysis and to build machine learning models. My favorite part of the data science workflow is designing appropriate data visualizations that best convey the data story.

Concluding Remarks

And there you have it, my story of how I transitioned from a non-technical background as a Biologist to being a Data Scientist. I hope this is helpful in giving you an idea of how you too can start your own data science journey.