Brian Y Chen,
Drew H Bryant,
Viacheslav Y Fofanov,
David M Kristensen,
Amanda E Cruess,
Marek Kimmel,
Olivier Lichtarge,
Lydia E Kavraki
Department of Computer Science, Rice University, Houston, TX 77005, USA. kavraki@rice.edu.
Determining the function of proteins is a problem with immense practical impact on the identification of inhibition targets and the causes of side effects. Unfortunately, experimental determination of protein function is expensive and time consuming. For this reason, algorithms for computational function prediction have been developed to focus and accelerate this effort. These algorithms are comparison techniques which identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Matches of statistically significant geometric and chemical similarity can identify targets with active sites cognate to the matching motif. Unfortunately, statistically significant matches can include false positive matches to functionally unrelated proteins. We target this problem by presenting Cavity Aware Match Augmentation (CAMA), a technique which uses C-spheres to represent active clefts which must remain vacant for ligand binding. CAMA rejects matches to targets without similar binding volumes. On 18 sample motifs, we observed that introducing C-spheres eliminated 80% of false positive matches and maintained 87% of true positive matches found with identical motifs lacking C-spheres. Analyzing a range of C-sphere positions and sizes, we observed that some high-impact C-spheres eliminate more false positive matches than others. High-impact C-spheres can be detected with a geometric analysis we call Cavity Scaling, permitting us to refine our initial cavity-aware motifs to contain only high-impact C-spheres. In the absence of expert knowledge, Cavity Scaling can guide the design of cavity-aware motifs to eliminate many false positive matches.
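The vacancy requirement described in this abstract can be illustrated with a minimal sketch. It is not the authors' CAMA implementation: it assumes the target atoms have already been superimposed into the motif's coordinate frame, and the function names and the `atoms` dictionary key are hypothetical.

```python
import math

def violates_c_spheres(target_atoms, c_spheres):
    """Return True if any target atom lies strictly inside any C-sphere.

    target_atoms: iterable of (x, y, z) coordinates, assumed to be already
                  aligned to the motif (an assumption of this sketch).
    c_spheres:    iterable of ((cx, cy, cz), radius) pairs.
    """
    for center, radius in c_spheres:
        for atom in target_atoms:
            if math.dist(atom, center) < radius:
                return True
    return False

def filter_matches(matches, c_spheres):
    """Keep only matches whose aligned atoms leave every C-sphere vacant."""
    return [m for m in matches if not violates_c_spheres(m["atoms"], c_spheres)]
```

A match whose atoms intrude into a sphere is rejected, which mirrors the idea that the cleft must remain empty for ligand binding. How the published method handles partial overlap or atomic radii is not stated in the abstract.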
Latest citations:
Department of Information Engineering, University of Padova, Padova, Italy.
Abstract The functional prediction of proteins is one of the most challenging problems in modern biology. An established computational technique involves the identification of three-dimensional local similarities in proteins. In this article, we present a novel method to quickly identify promising binding sites. Our aim is to efficiently detect putative binding sites without explicitly aligning them. Using the theory of Spherical Harmonics, a candidate binding site is modeled as a Binding Ball. The Binding Ball signature, offered by the Spherical Fourier coefficients, can be efficiently used for a fast detection of putative regions. Our contribution includes the Binding Ball modeling and the definition of a scoring function that does not require aligning candidate regions. Our scoring function can be computed efficiently using a property of Spherical Fourier transform (SFT) that avoids the evaluation of all alignments. Experiments on different ligands show good discrimination power when searching for known binding sites. Moreover, we prove that this method can save up to 40% in time compared with traditional approaches.
Brian Y Chen,
Drew H Bryant,
Amanda E Cruess,
Joseph H Bylund,
Viacheslav Y Fofanov,
David M Kristensen,
Marek Kimmel,
Olivier Lichtarge,
Lydia E Kavraki
The study of disease often hinges on the biological function of proteins, but determining protein function is a difficult experimental process. To minimize duplicated effort, algorithms for function prediction seek characteristics indicative of possible protein function. One approach is to identify substructural matches of geometric and chemical similarity between motifs representing known active sites and target protein structures with unknown function. In earlier work, statistically significant matches of certain effective motifs have identified functionally related active sites. Effective motifs must be carefully designed to maintain similarity to functionally related sites (sensitivity) and avoid incidental similarities to functionally unrelated protein geometry (specificity). Existing motif design techniques use the geometry of a single protein structure. Poor selection of this structure can limit motif effectiveness if the selected functional site lacks similarity to functionally related sites. To address this problem, this paper presents composite motifs, which combine structures of functionally related active sites to potentially increase sensitivity. Our experimentation compares the effectiveness of composite motifs with simple motifs designed from single protein structures. On six distinct families of functionally related proteins, leave-one-out testing showed that composite motifs had sensitivity comparable to the most sensitive of all simple motifs and specificity comparable to the average simple motif. On our data set, we observed that composite motifs simultaneously capture variations in active site conformation, diminish the problem of selecting motif structures, and enable the fusion of protein structures from diverse data sources.
J Comput Biol. ;14 (6):791-816
17691895
Cit:4
Brian Y Chen,
Viacheslav Y Fofanov,
Drew H Bryant,
Bradley D Dodson,
David M Kristensen,
Andreas M Lisewski,
Marek Kimmel,
Olivier Lichtarge,
Lydia E Kavraki
The development of new and effective drugs is strongly affected by the need to identify drug targets and to reduce side effects. Resolving these issues depends partially on a thorough understanding of the biological function of proteins. Unfortunately, the experimental determination of protein function is expensive and time consuming. To support and accelerate the determination of protein functions, algorithms for function prediction are designed to gather evidence indicating functional similarity with well studied proteins. One such approach is the MASH pipeline, described in the first half of this paper. MASH identifies matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Observations from several research groups concur that statistically significant matches can indicate functionally related active sites. One major subproblem is the design of effective motifs, which have many matches to functionally related targets (sensitive motifs), and few matches to functionally unrelated targets (specific motifs). Current techniques select and combine structural, physical, and evolutionary properties to generate motifs that mirror functional characteristics in active sites. This approach ignores incidental similarities that may occur with functionally unrelated proteins. To address this problem, we have developed Geometric Sieving (GS), a parallel distributed algorithm that efficiently refines motifs, designed by existing methods, into optimized motifs with maximal geometric and chemical dissimilarity from all known protein structures. In exhaustive comparison of all possible motifs based on the active sites of 10 well-studied proteins, we observed that optimized motifs were among the most sensitive and specific.
Brian Y Chen,
Drew H Bryant,
Viacheslav Y Fofanov,
David M Kristensen,
Amanda E Cruess,
Marek Kimmel,
Olivier Lichtarge,
Lydia E Kavraki
Department of Computer Science, Rice University, Houston, TX 77005, USA.
Algorithms for geometric and chemical comparison of protein substructure can be useful for many applications in protein function prediction. These motif matching algorithms identify matches of geometric and chemical similarity between well-studied functional sites, motifs, and substructures of functionally uncharacterized proteins, targets. For the purpose of function prediction, the accuracy of motif matching algorithms can be evaluated with the number of statistically significant matches to functionally related proteins, true positives (TPs), and the number of statistically insignificant matches to functionally unrelated proteins, false positives (FPs). Our earlier work developed cavity-aware motifs which use motif points to represent functionally significant atoms and C-spheres to represent functionally significant volumes. We observed that cavity-aware motifs match significantly fewer FPs than motifs containing only motif points. We also observed that high-impact C-spheres, which significantly contribute to the reduction of FPs, can be isolated automatically with a technique we call Cavity Scaling. This paper extends our earlier work by demonstrating that C-spheres can be used to accelerate point-based geometric and chemical comparison algorithms, maintaining accuracy while reducing runtime. We also demonstrate that the placement of C-spheres can significantly affect the number of TPs and FPs identified by a cavity-aware motif. While the optimal placement of C-spheres remains a difficult open problem, we compared two logical placement strategies to better understand C-sphere placement.
Other papers by authors:
Protein Sci. 2006 May 2
16672239
Cit:14
David M Kristensen,
Brian Y Chen,
Viacheslav Y Fofanov,
R Matthew Ward,
Andreas Martin Lisewski,
Marek Kimmel,
Lydia E Kavraki,
Olivier Lichtarge
The annotation of protein function has not kept pace with the exponential growth of raw sequence and structure data. An emerging solution to this problem is to identify 3D motifs or templates in protein structures that are necessary and sufficient determinants of function. Here, we demonstrate the recurrent use of evolutionary trace information to construct such 3D templates for enzymes, search for them in other structures, and distinguish true from spurious matches. Serine protease templates built from evolutionarily important residues distinguish between proteases and other proteins nearly as well as the classic Ser-His-Asp catalytic triad. In 53 enzymes spanning 33 distinct functions, an automated pipeline identifies functionally related proteins with an average positive predictive power of 62%, including correct matches to proteins with the same function but with low sequence identity (the average identity for some templates is only 17%). Although these template building, searching, and match classification strategies are not yet optimized, their sequential implementation demonstrates a functional annotation pipeline which does not require experimental information, but only local molecular mimicry among a small number of evolutionarily important residues.
Pac Symp Biocomput. 2005:334-45
15759639
Cit:10
Brian Y Chen,
Viacheslav Y Fofanov,
David M Kristensen,
Marek Kimmel,
Olivier Lichtarge,
Lydia E Kavraki
Rice University, Department of Computer Science, Houston, TX 77005, USA.
The comparison of structural subsites in proteins is increasingly relevant to the prediction of their biological function. To address this problem, we present the Match Augmentation algorithm (MA). Given a structural motif of interest, such as a functional site, MA searches a target protein structure for a match: the set of atoms with the greatest geometric and chemical similarity. MA is extremely efficient because it exploits the fact that the amino acids in a structural motif are not equally important to function. Using motif residues ranked on functional significance via the Evolutionary Trace (ET), MA prioritizes its search by initially forming matches with functionally significant residues, then, guided by ET, it augments this partial match stepwise until the whole motif is found. With this hierarchical strategy, MA runs considerably faster than other methods, and almost always identifies matches in homologs known to have cognate functional sites. Second, in order to interpret matches, we further introduce a statistical method using nonparametric density estimation of the frequency distribution of structural matches. Our results show that the hierarchy of functional importance within structural motifs speeds up the search within targets, and points to a new method to score their statistical significance.
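The statistical step mentioned in this abstract, scoring a match against a nonparametric density estimate of match frequencies, can be sketched as follows. This is only an illustration under stated assumptions: the score is taken to be a dissimilarity where smaller means more similar, the kernel is Gaussian, and the default bandwidth uses the normal-reference rule of thumb. None of these choices is confirmed by the abstract, and `kde_p_value` is a hypothetical name.

```python
import math
import statistics

def kde_p_value(observed, background, bandwidth=None):
    """Estimate P(score <= observed) under a Gaussian KDE of background scores.

    observed:   dissimilarity score of the match being assessed (assumed:
                smaller means more similar).
    background: scores of matches to a reference set of structures.
    bandwidth:  kernel width; defaults to 1.06 * stdev * n**(-1/5), a common
                rule of thumb and an assumption of this sketch.
    """
    n = len(background)
    if n < 2:
        raise ValueError("need at least two background scores")
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(background) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # The CDF of a Gaussian KDE is the average of the per-sample normal CDFs.
    total = 0.0
    for b in background:
        z = (observed - b) / bandwidth
        total += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return total / n
```

A small value means few background matches are as similar as the observed one, so the match would be called statistically significant at a chosen threshold.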
ABSTRACT: BACKGROUND: Structural variations caused by a wide range of physicochemical and biological sources directly influence the function of a protein. For enzymatic proteins, the structure and chemistry of the catalytic binding site residues can be loosely defined as a substructure of the protein. Comparative analysis of drug-receptor substructures across and within species has been used for lead evaluation. Substructure-level similarity between the binding sites of functionally similar proteins has also been used to identify instances of convergent evolution among proteins. In functionally homologous protein families, shared chemistry and geometry at catalytic sites provide a common, local point of comparison among proteins that may differ significantly at the sequence, fold, or domain topology levels. RESULTS: This paper describes two key results that can be used separately or in combination for protein function analysis. The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to determine Substructural Clusters (SCs). SCs characterize the binding site substructural variation within a protein family. In this paper we focus on examples of automatically determined SCs that can be linked to phylogenetic distance between family members, segregation by conformation, and organization by homology among convergent protein lineages. The Motif Ensemble Statistical Hypothesis (MESH) framework constructs a representative motif for each protein cluster among the SCs determined by FASST to build motif ensembles that are shown through a series of function prediction experiments to improve the function prediction power of existing motifs. CONCLUSIONS: FASST contributes a critical feedback and assessment step to existing binding site substructure identification methods and can be used for the thorough investigation of structure-function relationships. 
The application of MESH allows for an automated, statistically rigorous procedure for incorporating structural variation data into protein function prediction pipelines. Our work provides an unbiased, automated assessment of the structural variability of identified binding site substructures among protein structure families and a technique for exploring the relation of substructural variation to protein function. As available proteomic data continues to expand, the techniques proposed will be indispensable for the large-scale analysis and interpretation of structural data.
Hui Yao,
David M Kristensen,
Ivana Mihalek,
Mathew E Sowa,
Chad Shaw,
Marek Kimmel,
Lydia Kavraki,
Olivier Lichtarge
Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza T921, Houston, TX 77030, USA.
Functional sites determine the activity and interactions of proteins and as such constitute the targets of most drugs. However, the exponential growth of sequence and structure data far exceeds the ability of experimental techniques to identify their locations and key amino acids. To fill this gap we developed a computational Evolutionary Trace method that ranks the evolutionary importance of amino acids in protein sequences. Studies show that the best-ranked residues form fewer and larger structural clusters than expected by chance and overlap with functional sites, but until now the significance of this overlap has remained qualitative. Here, we use 86 diverse protein structures, including 20 determined by the structural genomics initiative, to show that this overlap is a recurrent and statistically significant feature. An automated ET correctly identifies seven of ten functional sites by the least favorable statistical measure, and nine of ten by the most favorable one. These results quantitatively demonstrate that a large fraction of functional sites in the proteome may be accurately identified from sequence and structure. This should help focus structure-function studies, rational drug design, protein engineering, and functional annotation to the relevant regions of a protein.
Bioinformatics. 2009 Mar 23
19307237
Cit:2
R Matthew Ward,
Eric Venner,
Bryce Daines,
Stephen Murray,
Serkan Erdin,
David M Kristensen,
Olivier Lichtarge
Departments of Molecular and Human Genetics, Program in Structural and Computational Biology and Molecular Biophysics, Department of Biochemistry and Molecular Biology, Department of Pharmacology, One Baylor Plaza, Houston, TX 77030, W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, TX 77005.
SUMMARY: The Evolutionary Trace Annotation (ETA) Server predicts enzymatic activity. ETA starts with a structure of unknown function, such as those from structural genomics, and with no prior knowledge of its mechanism uses the phylogenetic Evolutionary Trace method (ET) to extract key functional residues and propose a function-associated three-dimensional motif, called a 3D template. ETA then searches previously annotated structures for geometric template matches that suggest molecular and thus functional mimicry. In order to maximize the predictive value of these matches, ETA then applies three distinctive specificity filters: evolutionary similarity, function plurality and match reciprocity. In large scale controls on enzymes, prediction coverage is 43% but the positive predictive value rises to 92%, thus minimizing false annotations (Ward, et al., 2008). Users may modify any search parameter including the template. ETA thus expands the ET suite for protein structure annotation (Mihalek, et al., 2006; Morgan, et al., 2006), and can contribute to the annotation efforts of metaservers. AVAILABILITY: The ETA Server is a web application available at http://mammoth.bcm.tmc.edu/eta/.
PLoS ONE. 2008;3(5):e2136
18461181
Cit:12
R Matthew Ward,
Serkan Erdin,
Tuan A Tran,
David M Kristensen,
Andreas Martin Lisewski,
Olivier Lichtarge
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.
Function prediction frequently relies on comparing genes or gene products to search for relevant similarities. Because the number of protein structures with unknown function is mushrooming, however, we asked here whether such comparisons could be improved by focusing narrowly on the key functional features of protein structures, as defined by the Evolutionary Trace (ET). Therefore a series of algorithms was built to (a) extract local motifs (3D templates) from protein structures based on ET ranking of residue importance; (b) assess their geometric and evolutionary similarity to other structures; and (c) transfer enzyme annotation whenever a plurality was reached across matches. Whereas a prototype had only been 80% accurate and was not scalable, here a speedy new matching algorithm enabled large-scale searches for reciprocal matches and thus raised annotation specificity to 100% in both positive and negative controls of 49 enzymes and 50 non-enzymes, respectively (in one case even identifying an annotation error), while maintaining sensitivity (approximately 60%). Critically, this Evolutionary Trace Annotation (ETA) pipeline requires no prior knowledge of functional mechanisms. It could thus be applied in a large-scale retrospective study of 1218 structural genomics enzymes and reached 92% accuracy. Likewise, it was applied to all 2935 unannotated structural genomics proteins and predicted enzymatic functions in 320 cases: 258 on first pass and 62 more on second pass. Controls and initial analyses suggest that these predictions are reliable. Thus the large-scale evolutionary integration of sequence-structure-function data, here through reciprocal identification of local, functionally important structural features, may contribute significantly to de-orphaning the structural proteome.
David Kristensen,
R Matthew Ward,
Andreas Martin Lisewski,
Serkan Erdin,
Brian Chen,
Viacheslav Fofanov,
Marek Kimmel,
Lydia Kavraki,
Olivier Lichtarge
ABSTRACT: BACKGROUND: Structural genomics projects such as the Protein Structure Initiative (PSI) yield many new structures, but often these have no known molecular functions. One approach to recover this information is to use 3D templates: structure-function motifs that consist of a few functionally critical amino acids and may suggest functional similarity when geometrically matched to other structures. Since experimentally determined functional sites are not common enough to define 3D templates on a large scale, this work tests a computational strategy to select relevant residues for 3D templates. RESULTS: Based on evolutionary information and heuristics, an Evolutionary Trace Annotation (ETA) pipeline built templates for 98 enzymes, half taken from the PSI, and sought matches in a non-redundant structure database. On average each template matched 2.7 distinct proteins, of which 2.0 share the first three Enzyme Commission digits as the template's enzyme of origin. In many cases (61%) a single most likely function could be predicted as the annotation with the most matches, and in these cases such a plurality vote identified the correct function with 87% accuracy. ETA was also found to be complementary to sequence homology-based annotations. When matches are required to both geometrically match the 3D template and to be sequence homologs found by BLAST or PSI-BLAST, the annotation accuracy is greater than either method alone, especially in the region of lower sequence identity where homology-based annotations are least reliable. CONCLUSIONS: These data suggest that knowledge of evolutionarily important residues improves functional annotation among distant enzyme homologs. Since, unlike other 3D template approaches, the ETA method bypasses the need for experimental knowledge of the catalytic mechanism, it should prove a useful, large scale, and general adjunct to combine with other methods to decipher protein function in the structural proteome.
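The plurality vote over the first three Enzyme Commission digits can be sketched as below. This is a hedged illustration rather than the ETA pipeline's code: the function name is hypothetical, and returning no prediction on a tie is an assumption drawn from the abstract's statement that a single most likely function could be predicted only in some cases.

```python
from collections import Counter

def plurality_vote(match_ec_numbers, digits=3):
    """Return the most common EC prefix among matches, or None if no single winner.

    match_ec_numbers: EC numbers of the matched proteins, e.g. "3.4.21.4".
    digits:           number of leading EC digits to compare (three here,
                      following the abstract).
    """
    prefixes = [".".join(ec.split(".")[:digits]) for ec in match_ec_numbers]
    ranked = Counter(prefixes).most_common()
    if not ranked:
        return None  # no matches, so no prediction
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place (assumed to yield no prediction)
    return ranked[0][0]
```

For example, `plurality_vote(["3.4.21.4", "3.4.21.5", "2.7.1.1"])` returns `"3.4.21"`, since two of the three matches share that prefix.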
Latest similar papers:
J Chem Inf Model. 2010 Apr 7
20373791
Department of Molecular Design and Synthesis, Higher Institute of Technologies and Applied Sciences, Habana, Cuba, Department of Chemistry, University of Calgary, Calgary, Alberta, Canada, and Institute for Physical and Theoretical Chemistry, University of Bonn, Bonn, Germany.
A novel approach is applied for the prediction of potential binding sites in ligand-protein interactions. This methodology introduces an integral strategy based on the calculation of protein geometrical parameters and the use of a quantum mechanical descriptor, Binding Local Site (B(LS)). A screening of the most likely cavities in the protein crystal structure is carried out where the analysis of geometric cavities is performed, and the virtual centers for binding (VCB) are located. The VCB surrounding amino acid residues (AA) are evaluated through the calculation of the B(LS) by using the theoretical affinity order between the ligand and each AA. It includes a quantum scoring function based on the ligand-AA association energies and entropies. A contribution to the understanding of flavonoid-protein interactions is provided as well. The new bioinformatic strategy makes good predictions for flavonoid ligands. The calculated binding sites are quite in agreement with the crystal binding sites of 10 flavonoid binding proteins. This is a contribution of quantum mechanics in some phases of in silico drug design.
Bioinformatics. 2010 Jan 14
20080513
Cit:1
Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA, 30318.
SUMMARY: In the post-genomic era, the annotation of protein function facilitates the understanding of various biological processes. To extend the range of function annotation methods to the twilight zone of sequence identity, we have developed approaches that exploit both protein tertiary structure and/or protein sequence evolutionary relationships. To serve the scientific community, we have integrated the structure prediction tools, TASSER, TASSER-Lite and METATASSER, and the functional inference tools, FINDSITE, a structure based algorithm for binding site prediction, GO molecular function inference and ligand screening, EFICAz(2), a sequence based approach to enzyme function inference, and DBD-hunter, an algorithm for predicting DNA binding proteins and associated DNA binding residues, into a unified web resource, PSiFR (Protein Structure and Function prediction Resource). Availability and Implementation: PSiFR is freely available for use on the web at http://psifr.cssb.biology.gatech.edu/ CONTACT: skolnick@gatech.edu.
BMC Genomics. 2009 ;10 Suppl 3 :S6
19958504
Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore. lawrence@bic.nus.edu.sg
BACKGROUND: Caspases belong to a class of cysteine proteases which function as critical effectors in cellular processes such as apoptosis and inflammation by cleaving substrates immediately after unique tetrapeptide sites. With hundreds of reported substrates and many more expected to be discovered, the elucidation of the caspase degradome will be an important milestone in the study of these proteases in human health and disease. Several computational methods for predicting caspase cleavage sites have been developed recently for identifying potential substrates. However, as most of these methods are based primarily on the detection of the tetrapeptide cleavage sites - a factor necessary but not sufficient for predicting in vivo substrate cleavage - prediction outcomes will inevitably include many false positives. RESULTS: In this paper, we show that structural factors such as the presence of disorder and solvent exposure in the vicinity of the cleavage site are important and can be used to enhance results from cleavage site prediction. We constructed a two-step model incorporating cleavage site prediction and these factors to predict caspase substrates. Cleavage sites are first predicted from sequences using CASVM or GraBCas. Predicted cleavage sites are then scored, ranked and filtered against a cut-off based on their propensities for being located in disordered and solvent exposed regions. Using an independent dataset of caspase substrates, the model was shown to achieve greater positive predictive values compared to CASVM or GraBCas alone, and was able to reduce the false positives pool by up to 13% and 53% respectively while retaining all true positives. We applied our prediction model to the family of receptor tyrosine kinases (RTKs) and highlighted several members as potential caspase targets. The results suggest that RTKs may be generally regulated by caspase cleavage and in some cases promote the induction of apoptotic cell death - a function distinct from their role as transducers of survival and growth signals. CONCLUSION: As a step towards the prediction of in vivo caspase substrates, we have developed an accurate method incorporating cleavage site prediction and structural factors. The multi-factor model augments existing methods and complements experimental efforts to define the caspase degradome on a systems-wide basis.
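The second step of the two-step model described above (score, rank, and filter predicted sites) can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the equal-weight average of disorder and exposure propensities, the function name `filter_sites`, and the default cut-off are all assumptions made for illustration only.

```python
def filter_sites(sites, cutoff=0.5):
    """Rank predicted cleavage sites and drop those below a cut-off.

    sites: list of (position, disorder, exposure) tuples, where disorder and
    exposure are propensities in [0, 1]. The equal-weight average used as the
    combined score is an assumption, not the published scoring function.
    """
    scored = [(pos, (disorder + exposure) / 2.0) for pos, disorder, exposure in sites]
    kept = [s for s in scored if s[1] >= cutoff]
    # Highest combined propensity first.
    return sorted(kept, key=lambda s: s[1], reverse=True)
```

In this sketch the first step (CASVM or GraBCas) is assumed to have already produced the candidate positions; only the structural filter is shown.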
Bioinformatics. 2009 Oct 21;:
19846440
Cit:1
Roll: A new algorithm for the detection of protein pockets and cavities with a rolling probe sphere.
Graduate School of Life Science, Hokkaido University, Kita-Ku Kita-10 Nishi-8, Sapporo, 0600810, Japan.
MOTIVATION: Prediction of ligand binding sites of proteins is significant as it can provide insight into biological functions and reaction mechanisms of proteins. It is also a prerequisite for protein-ligand docking and an important step in structure-based drug design. RESULTS: We present a new algorithm, Roll, implemented in a program named POCASA, which can predict binding sites by detecting pockets and cavities of proteins with a rolling sphere. To evaluate the performance of POCASA, a test with the same data set as used in several existing methods was carried out. POCASA achieved a high success rate of 77%. In addition, the test results indicated that POCASA can predict the shapes of ligand binding sites well. AVAILABILITY: A web version of POCASA is freely available at http://altair.sci.hokudai.ac.jp/g6/Research/POCASA_e.html CONTACT: yao@castor.sci.hokudai.ac.jp.
Department of Computer Science, Rice University, Houston, TX 77005, USA. mmoll@cs.rice.edu
There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it still remains a challenging problem to quickly find all functionally-related matches for structural motifs in large data sets with high specificity. In this context, a structural motif is a set of 3D points annotated with physicochemical information that characterize a molecular function. We propose a new method called LabelHash that creates hash tables of n-tuples of residues for a set of targets. Using these hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high specificity in very large data sets and for very general structural motifs. We demonstrate that our method requires a reasonable amount of storage when employing a simple geometric filter and further improves on the specificity of our previous work while maintaining very high sensitivity. Our algorithm is evaluated on 20 homolog classes and a non-redundant version of the Protein Data Bank as our background data set. We use cluster analysis to analyze why certain classes of homologs are more difficult to classify than others. The LabelHash algorithm is implemented on a web server at http://kavrakilab.org/labelhash/.
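The idea of hashing n-tuples of residues and looking up partial matches can be sketched as follows. This is a toy illustration under stated assumptions, not the LabelHash implementation: the function names `build_table` and `lookup`, the residue-name labels, and the distance-threshold filter are hypothetical choices.

```python
from collections import defaultdict
from itertools import permutations
from math import dist

def build_table(targets, n=2, max_dist=10.0):
    """Hash every ordered n-tuple of residues that passes a mild distance filter.

    targets: dict mapping a target name to a list of (label, (x, y, z)) residues.
    Keys are tuples of labels; values are lists of (target name, residue indices).
    """
    table = defaultdict(list)
    for name, residues in targets.items():
        for idx in permutations(range(len(residues)), n):
            pts = [residues[i][1] for i in idx]
            # Assumed geometric filter: consecutive points must lie within max_dist.
            if all(dist(a, b) <= max_dist for a, b in zip(pts, pts[1:])):
                key = tuple(residues[i][0] for i in idx)
                table[key].append((name, idx))
    return table

def lookup(table, motif_labels):
    """Return all partial matches whose labels equal the motif's label tuple."""
    return table.get(tuple(motif_labels), [])
```

A partial match returned by `lookup` would then be expanded to a complete match against the remaining motif points, a step omitted here.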
Bioinformatics Research Group, School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA. ezeng001@cs.fiu.edu
Almost every cellular process requires the interactions of pairs or larger complexes of proteins. High throughput protein-protein interaction (PPI) data have been generated using techniques such as yeast two-hybrid systems, mass spectrometry, and many others. Such data provide us with a new perspective to predict protein functions and to generate protein-protein interaction networks, and many recent algorithms have been developed for this purpose. However, PPI data generated using high throughput techniques contain a large number of false positives. In this paper, we propose a novel method to evaluate the support for PPI data based on gene ontology information. If the semantic similarity between genes is computed using gene ontology information and Resnik's formula, then our results show that we can model the PPI data as a mixture model predicated on the assumption that true protein-protein interactions will have higher support than the false positives in the data. Thus semantic similarity between genes serves as a metric of support for PPI data. Taking it one step further, we also propose new function prediction approaches that use this support metric. These new function prediction approaches outperform their conventional counterparts. New evaluation methods are also proposed.
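Resnik's formula scores two ontology terms by the information content, -log p(c), of their most informative common ancestor. A minimal sketch follows, assuming a toy ontology given as a child-to-parents dictionary and precomputed term probabilities; the function names and the choice to take the maximum over annotation pairs for gene-level similarity are illustrative assumptions, not details from the abstract.

```python
from math import log

def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def resnik(t1, t2, parents, prob):
    """Information content of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((-log(prob[c]) for c in common), default=0.0)

def gene_similarity(terms1, terms2, parents, prob):
    """Assumed gene-level score: best Resnik similarity over all annotation pairs."""
    return max((resnik(a, b, parents, prob) for a in terms1 for b in terms2), default=0.0)
```

Under this sketch, a pair of interacting proteins whose annotations share only the ontology root receives a support of zero, while pairs sharing a specific term receive higher support.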
ABSTRACT: Nuclear localization signals (NLSs) are stretches of residues within a protein that are important for the regulated nuclear import of the protein. Of the many import pathways that exist in yeast, the best characterized is termed the 'classical' NLS pathway. The classical NLS contains specific patterns of basic residues and computational methods have been designed to predict the location of these motifs on proteins. The consensus sequences, or patterns, for the other import pathways are less well-understood. In this paper, we present an analysis of characterized NLSs in yeast, and find, despite the large number of nuclear import pathways, that NLSs seem to show similar patterns of amino acid residues. We test current prediction methods and observe a low true positive rate. We therefore suggest an approach using hidden Markov models (HMMs) to predict novel NLSs in proteins. We show that our method is able to consistently find 37% of the NLSs with a low false positive rate and that our method retains its true positive rate outside of the yeast data set used for the training parameters. Our implementation of this model, NLStradamus, is made available at: http://www.moseslab.csb.utoronto.ca/NLStradamus/
In Silico Biol. 2009 ;9 (1-2):23-34
19537159
Cit:1
Department of Mathematics and Statistics, University of Helsinki, Helsinki, FI-00014, Finland. jukka.kohonen@helsinki.fi
A Naive Bayes classifier tool is presented for annotating proteins on the basis of amino acid motifs, cellular localization and protein-protein interactions. Annotations take the form of posterior probabilities within the Molecular Function hierarchy of the Gene Ontology (GO). Experiments with the data available for yeast, Saccharomyces cerevisiae, show that our prediction method can yield a relatively high level of accuracy. Several apparent challenges and possibilities for future developments are also discussed. A common approach to functional characterization is to use sequence similarities at varying levels, by utilizing several existing databases and local alignment/identification algorithms. Such an approach is typically quite labor-intensive when performed by an expert in a manual fashion. Integration of several sources of information is in this context generally considered as the only possibility to obtain valuable predictions with practical implications. However, some improvements in the prediction accuracy of the molecular functions, and thereby also savings in the computational effort, can be achieved by restricting attention to only those data sources that involve a higher degree of specificity. We employ here a Naive Bayes model in order to provide probabilistic predictions, and to enable a computationally efficient approach to data integration.
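How a Naive Bayes model turns heterogeneous evidence into posterior probabilities can be shown with a small sketch. It assumes binary features (a motif, localization, or interaction partner is either present or absent), a Bernoulli likelihood with Laplace smoothing, and flat class labels; the GO hierarchy is not modeled, and the function names `train` and `posterior` are hypothetical.

```python
from collections import Counter, defaultdict
from math import exp, log

def train(examples, alpha=1.0):
    """examples: list of (set_of_features, label) pairs. Returns count tables."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        feat_counts[label].update(feats)
        vocab |= feats
    return class_counts, feat_counts, vocab, alpha

def posterior(model, feats):
    """Posterior probability of each label given the observed feature set."""
    class_counts, feat_counts, vocab, alpha = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = log(n / total)  # class prior
        for f in vocab:
            # Smoothed probability that feature f is present in this class.
            p = (feat_counts[label][f] + alpha) / (n + 2 * alpha)
            score += log(p if f in feats else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    m = max(log_scores.values())
    z = sum(exp(s - m) for s in log_scores.values())
    return {label: exp(s - m) / z for label, s in log_scores.items()}
```

Because each feature contributes an independent factor, adding a new data source only adds terms to the sum, which is what makes the integration computationally cheap.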
Nobuyoshi Nagamine,
Takayuki Shirakawa,
Yusuke Minato,
Kentaro Torii,
Hiroki Kobayashi,
Masaya Imoto,
Yasubumi Sakakibara
Department of Biosciences and Informatics, Keio University, Yokohama, Japan.
Predictions of interactions between target proteins and potential leads are of great benefit in the drug discovery process. We present a comprehensively applicable statistical prediction method for interactions between any proteins and chemical compounds, which requires only protein sequence data and chemical structure data and utilizes the statistical learning method of support vector machines. Because such comprehensive predictions can involve many false positives, we propose two approaches for the reduction of false positives: (i) efficient use of multiple statistical prediction models in the framework of two-layer SVM and (ii) reasonable design of the negative data to construct statistical prediction models. In two-layer SVM, outputs produced by the first-layer SVM models, which are constructed with different negative samples and reflect different aspects of classifications, are utilized as inputs to the second-layer SVM. In order to design negative data which produce fewer false positive predictions, we iteratively construct SVM models or classification boundaries from positive and tentative negative samples and select additional negative sample candidates according to pre-determined rules. Moreover, in order to fully utilize the advantages of statistical learning methods, we propose a strategy to effectively feed back experimental results to computational predictions with consideration of biological effects of interest. We show the usefulness of our approach in predicting potential ligands binding to human androgen receptors from more than 19 million chemical compounds and verifying these predictions by in vitro binding. Moreover, we utilize this experimental validation as feedback to enhance subsequent computational predictions, and experimentally validate these predictions again. This efficient procedure of iterating in silico prediction and in vitro or in vivo experimental verification with sufficient feedback enabled us to identify novel ligand candidates which were distant from known ligands in the chemical space.