Genetics - Simon Chu

DNA stands for deoxyribonucleic acid; it carries the genetic code of every living thing. RNA stands for ribonucleic acid; it is copied from the portions of the DNA that a cell expresses, which is what gives each cell its specific function. There are roughly 30 trillion cells in the human body, and nearly all of them carry the same DNA for a given individual. A process called transcription uses a portion of the cell's DNA as a template to create an RNA molecule. The specific type of RNA that carries the information stored in DNA is called messenger RNA (mRNA).
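
As a rough illustration of transcription, the sketch below builds an mRNA string from a DNA template strand by pairing each base with its RNA complement (A with U, T with A, C with G, G with C). The function name `transcribe` and the sample strand are made up for this example, and the template is assumed to be written 3' to 5' so that the result reads 5' to 3'.

```python
# Complement map from DNA template bases to the RNA bases paired with them.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_strand: str) -> str:
    """Return the mRNA for a DNA template strand.

    Assumption: the template is written 3'->5', so the mRNA comes out 5'->3'
    without any reversal. Raises KeyError on characters other than A, C, G, T.
    """
    return "".join(DNA_TO_RNA[base] for base in template_strand.upper())


# Hypothetical nine-base template strand, for illustration only.
template = "TACGGCATT"
print(transcribe(template))  # AUGCCGUAA
```

Real transcription also involves promoters, splicing and other steps that this sketch leaves out; it only shows the base-pairing rule.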

The human genome comprises about 3 billion base pairs: (A)denine pairs with (T)hymine and (C)ytosine pairs with (G)uanine. Subtle differences in the genome among humans distinguish us through expressed traits (the phenotype). Since around 2009, with the advent of NGS (Next Generation Sequencing), the cost, speed and accuracy of capturing these sequences have improved dramatically, although errors still occur. In a nutshell, short snippets of DNA (reads) are processed massively in parallel: DNA polymerase adds fluorescently labeled terminator bases one cycle at a time, and a camera photographs the illuminated bases. Sometimes during a sequencing cycle one or more strands within a cluster fail to terminate and fall out of sync with the rest, which lowers the base quality. These spurious errors, along with read alignment and assembly problems, become statistical and algorithmic problems that need to be tackled.
One interesting fact is that any two human genomes are 99.8% to 99.9% similar. This is leveraged to place each read at its position in a reference genome using an approximate matching algorithm.
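
To make the idea of placing a read on a reference concrete, here is a minimal sketch of naive approximate matching that tolerates a fixed number of mismatched bases (substitutions only). The function name `approximate_positions`, the toy reference and the read are all assumptions for illustration; real aligners use indexed data structures rather than scanning every offset.

```python
def approximate_positions(read: str, reference: str, max_mismatches: int):
    """Return (start, mismatches) for every offset where the read fits the
    reference with at most max_mismatches substituted bases."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = 0
        for offset, base in enumerate(read):
            if reference[start + offset] != base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this offset
        else:
            # The inner loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits


# Toy data: the read differs from reference[0:7] by one substituted base.
reference = "GATTACAGATTGCA"
read = "GATAACA"
print(approximate_positions(read, reference, 1))  # [(0, 1)]
```

Because this version compares position by position, it cannot handle inserted or deleted bases; those need an edit-distance approach.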

To facilitate read alignment, several data structures are used to index positions in the genome. The suffix tree, suffix array and FM index are common choices, with the FM index being the most compact because it borrows from compression technology (the Burrows-Wheeler transform). Boyer-Moore is a baseline algorithm for exact matching, while approximate matching relies on measures such as Levenshtein (edit) distance. Approximate matching is necessary to compensate for sequencing errors and for genuine variation relative to the reference genome. Differences can take the form of substitutions, insertions or deletions. Distinguishing a read error from an actual mutation is of the utmost importance. Dynamic programming can be used for approximate matching within a predefined maximum edit distance.
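
The dynamic programming approach can be sketched as follows. This is a textbook-style edit-distance matrix in which the first row is all zeros, so a match may start anywhere in the text; it is not taken from any particular aligner. The function name `edit_distance_hits` and the toy sequences are assumptions for illustration.

```python
def edit_distance_hits(pattern: str, text: str, max_edits: int):
    """Return (end, edits) pairs: pattern matches some substring of text that
    ends just before index `end` with `edits` <= max_edits, counting
    substitutions, insertions and deletions."""
    # Row for the empty pattern prefix: zero cost at every text position,
    # which lets an occurrence begin anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i in range(1, len(pattern) + 1):
        # Column 0: matching i pattern characters against nothing costs i.
        curr = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # extra base in the pattern
                curr[j - 1] + 1,     # extra base in the text
            )
        prev = curr
    # The last row holds the best edit count for each end position.
    return [(end, edits) for end, edits in enumerate(prev) if edits <= max_edits]


# Toy data: the read is "GATTACA" with one T missing.
reference = "GATTACAGATTGCA"
read = "GATACA"
hits = edit_distance_hits(read, reference, 1)
print((7, 1) in hits)  # True
```

The table needs time proportional to the read length times the text length, which is why it is normally applied only to candidate regions found through an index rather than to the whole genome.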