COMPUTATIONAL METHODS FOR IDENTIFYING TRANSCRIPTION FACTOR BINDING SITES IN PLASMODIUM FALCIPARUM

Sean Westfall, Kim Williamson, Catherine Putonti*

Loyola University Chicago, Computer Science, Chicago, IL 60626

cputonti@luc.edu


Abstract

Relatively few transcription factor binding sites (TFBS) are known for Plasmodium falciparum, the causative agent of malaria, even though knowledge of these sites would prove invaluable in understanding the unusual transcriptional regulation employed by this organism. With the entire genome of P. falciparum sequenced and assembled through the Malaria Genome Project, it is now possible to use computational methods to find TFBS. Numerous algorithms have been developed to identify TFBS, particularly in model organisms such as yeast and E. coli. These methods are predominantly based upon finding statistically overrepresented motifs, or short segments of DNA, in the upstream regions of clustered genes known to be co-regulated. Much of the difficulty in extending these methods to P. falciparum is due to the organism’s high AT richness (~80%) and a lack of understanding of its transcriptional regulatory mechanisms. Thus far, algorithms designed to recognize TFBS in P. falciparum have used microarray expression data to first cluster genes that appear to be co-regulated and have then applied existing motif detection algorithms to those clusters. As a result, biases may be introduced by the underlying expression data or by the particular clustering algorithm employed. Herein, we present a multi-tier analysis for TFBS discovery in higher order organisms. First, we compared the occurrences of 7-mer motifs in the 2000 basepair (bp) regions upstream of coding regions of interest with their occurrences in the upstream regions of the entire genome sequence. In doing so, we identified putative motifs that appear upstream of the genes of interest and do not appear (or appear only rarely) elsewhere in the genome; such motifs could be functionally important sites. Next, we looked for correlations between the genes in which the putative motifs occurred and the expression profiles of these genes from microarray data. Furthermore, we used a sliding window approach to gain insight into the relationship between the GC content of these putative motifs and the background AT content within the upstream regions of these genes.
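The first tier described above, comparing 7-mer occurrences upstream of genes of interest with occurrences upstream of all genes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the frequency-ratio score, and the pseudocount are assumptions chosen for illustration, and the abstract does not state which statistic was actually used.

```python
from collections import Counter

K = 7  # motif length stated in the abstract


def count_kmers(sequences, k=K):
    """Count overlapping k-mers over all sequences.

    Windows containing anything other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def enrichment(target_upstreams, genome_upstreams, k=K, pseudocount=1.0):
    """Score each k-mer seen in the target set by the ratio of its
    frequency there to its frequency in the genome-wide upstream set.

    The ratio and the pseudocount are illustrative assumptions,
    not the statistic used in the original study.
    """
    target = count_kmers(target_upstreams, k)
    background = count_kmers(genome_upstreams, k)
    t_total = sum(target.values())
    b_total = sum(background.values())
    scores = {}
    for kmer, t in target.items():
        t_freq = t / t_total
        # The pseudocount keeps the ratio finite for k-mers absent elsewhere.
        b_freq = (background.get(kmer, 0) + pseudocount) / (b_total + pseudocount)
        scores[kmer] = t_freq / b_freq
    return scores
```

In use, `target_upstreams` would hold the 2000 bp upstream sequences of the gene family under study and `genome_upstreams` the upstream sequences of every annotated gene; k-mers with the highest scores are candidates for the later tiers.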
To assess the strength of our method, we analyzed putative motifs in well-studied gene families, specifically the var gene family and a family of heat shock proteins, for which the actual TFBS have been experimentally verified. Because clustering is applied only after putative motifs have been identified, bias introduced by the expression data is removed. Our results demonstrate that there is a relationship between the GC content of specific motifs and the background AT content within the upstream regions of co-regulated gene families. Further, our results show that, although more experimentation is necessary, our method can provide considerable insight into P. falciparum’s transcriptional regulatory control system.
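The sliding window comparison of motif GC content against background AT content could look roughly like the sketch below. The abstract does not give a window size, a step, or the exact comparison made, so the function names, `window=100`, `step=10`, and the choice to exclude the motif from its own flanking background are all assumptions for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def sliding_gc(seq, window=100, step=10):
    """Return (start, GC fraction) for each full window along seq.

    Window and step sizes are assumed values, not taken from the abstract.
    """
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


def motif_vs_background(seq, motif, window=100):
    """For every occurrence of motif in seq, return a tuple of
    (position, motif GC fraction, GC fraction of the flanking region).

    The flanking region spans window // 2 bases on each side and
    excludes the motif itself (an illustrative design choice).
    """
    seq, motif = seq.upper(), motif.upper()
    half = window // 2
    results = []
    pos = seq.find(motif)
    while pos != -1:
        lo = max(0, pos - half)
        hi = min(len(seq), pos + len(motif) + half)
        flank = seq[lo:pos] + seq[pos + len(motif):hi]
        results.append((pos, gc_fraction(motif), gc_fraction(flank)))
        pos = seq.find(motif, pos + 1)  # step by one to allow overlapping hits
    return results
```

In a genome that is roughly 80% AT, a GC-rich motif sitting in an AT-rich flank yields a large gap between the second and third values of each tuple, which is the kind of contrast the abstract describes.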
