Our current knowledge of genome function derives largely from sequence data in the form of one-dimensional strings of letters. However, DNA-binding proteins recognize the double helix as a three-dimensional object. Therefore, an understanding of transcription factor (TF) binding specificity must ultimately include DNA shape. The sequence-structure relationship in DNA is highly degenerate: different nucleotide sequences can give rise to the same structure, while a single-nucleotide variant can change DNA shape over a region of several base pairs. To explore these effects on a genomic scale, we developed a method for the high-throughput prediction of DNA shape features. We used these structural features to augment nucleotide sequence in binding specificity models derived from statistical machine learning approaches such as support vector regression (SVR) and regularized multiple linear regression (MLR). Using these approaches, we learned in vitro DNA binding specificity models from protein binding microarray (PBM), genomic-context PBM, and HT-SELEX/SELEX-seq data. Based on data for many TFs from diverse protein families, we demonstrated that shape-augmented models generally outperform existing sequence models in accuracy, number of features, and computation time. Our models provide information on the importance of specific DNA sequence and shape features, and thus reveal TF family-specific readout mechanisms and better explain why a given TF binds in vivo to a specific genomic target site.

Host: Boian Alexandrov