BioInfoBank Library


 

Bi, C (Chengpeng)

Latest papers:

Department of Pathology and Laboratory Medicine, The Children's Mercy Hospitals and Clinics, Kansas City, Missouri 64108, USA. csaunders@cmh.edu
Although few examples are formally documented, all polymerase chain reaction-based testing is theoretically vulnerable to allele drop-out (ADO), the failure to amplify one of the two alleles present in a cell. In a clinical setting, this can lead to a false-positive or false-negative diagnosis. We investigated the mechanisms leading to ADO in the MECP2 gene in two unrelated female patients undergoing testing for Rett syndrome. Both patients had two benign DNA variations, c.819G > T and c.1161C > T, that appeared homozygous due to ADO. Bioinformatics analyses indicate that this region of the MECP2 gene is rich in complex tertiary structures called G-quadruplexes and i-motifs, the disruption of which by the c.819G > T and c.1161C > T variants leads to preferential amplification of the variant allele. Other examples of ADO likely occur, and disruption of G-quadruplex and i-motif structures should be considered when this phenomenon is unexpected. We identify factors in both the polymerase chain reaction amplification and the sequencing steps that help overcome ADO.
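As an illustration of how candidate G-quadruplex and i-motif sequences can be located, the sketch below scans a sequence with a widely used folding rule: four runs of at least three guanines (or cytosines) separated by loops of 1 to 7 nucleotides. The rule, the function names, and the toy sequence are assumptions made for this example; the abstract does not say which tools or criteria the authors used.

```python
import re

# Assumed folding rule (not taken from the abstract):
# four runs of >= 3 guanines separated by loops of 1-7 nucleotides.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}?G{3,}){3}")
# C-rich mirror image, used here as a rough proxy for i-motif candidates.
IMOTIF_PATTERN = re.compile(r"C{3,}(?:[ACGT]{1,7}?C{3,}){3}")


def find_candidates(sequence, pattern):
    """Return (start, end, matched_text) for each non-overlapping match."""
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


def disrupted_by_substitution(sequence, position, new_base, pattern):
    """True if a single-base change removes at least one candidate motif."""
    before = len(find_candidates(sequence, pattern))
    mutated = sequence[:position] + new_base + sequence[position + 1:]
    after = len(find_candidates(mutated, pattern))
    return after < before


# Toy sequence, not MECP2: a G>T change inside the second G-run
# leaves only three runs, so the candidate is lost.
toy = "TTGGGAGGGTGGGCGGGTT"
hits = find_candidates(toy, G4_PATTERN)
lost = disrupted_by_substitution(toy, 7, "T", G4_PATTERN)
```

A pattern match only flags sequence that could fold; it says nothing about whether the structure forms under PCR conditions.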
[My paper] Chengpeng Bi
Bioinformatics and Intelligent Computing Laboratory, Division of Clinical Pharmacology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA. cbi@cmh.edu
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model and then iteratively performs Monte Carlo simulation and parameter updates until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase-shift and multimodality issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word-enumerative motif algorithms tested using simulated (l, d)-motif cases and documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits excellent capacity when compared with other multiple sequence alignment methods.
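The alternation between Monte Carlo simulation and parameter update described in this abstract can be sketched roughly as follows. This is a minimal illustration, not MCEMDA itself: it assumes exactly one site per sequence, a uniform background, a fixed iteration count instead of a convergence test, and it omits the log-likelihood profiling, top-k, and GMA steps. All function names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def estimate_pwm(sequences, starts, width, pseudocount=1.0):
    """Column-wise base frequencies of the currently aligned sites."""
    pwm = []
    for col in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for seq, start in zip(sequences, starts):
            counts[seq[start + col]] += 1.0
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def site_weights(seq, pwm, background=0.25):
    """Likelihood ratio (motif vs. uniform background) for every start position."""
    width = len(pwm)
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for col in range(width):
            ratio *= pwm[col][seq[start + col]] / background
        weights.append(ratio)
    return weights


def monte_carlo_em(sequences, width, iterations=50, seed=0):
    """Alternate PWM re-estimation with stochastic sampling of site positions."""
    rng = random.Random(seed)
    # Initial model: a random alignment, one start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        # Parameter update from the current alignment.
        pwm = estimate_pwm(sequences, starts, width)
        # Monte Carlo step: draw each site in proportion to its likelihood ratio.
        starts = [
            rng.choices(range(len(s) - width + 1), weights=site_weights(s, pwm))[0]
            for s in sequences
        ]
    return starts, estimate_pwm(sequences, starts, width)


toy_sequences = ["ACGTTGACATTT", "TTGACAGGCCAA", "GGCATTGACACC"]
toy_starts, toy_pwm = monte_carlo_em(toy_sequences, width=6, iterations=20, seed=1)
```

Because positions are sampled rather than chosen greedily, a run can leave a poor alignment that a deterministic EM update would stay in, which is the motivation the abstract gives for the stochastic variant.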

Most cited papers:

UC Davis Genome Center, University of California, One Shields Avenue, Davis, CA 95616, USA.
SUMMARY: WebSIDD is a Web-based service designed to predict locations and extents of stress-induced duplex destabilization (SIDD) that occur in a double-stranded DNA molecule of specified base sequence, on which a specified level of superhelical stress is imposed. The algorithm calculates the approximate equilibrium statistical mechanical distribution of a population of identical molecules among its accessible states. The user inputs the DNA sequence, and the program outputs the calculated transition probability and destabilization energy of each base pair in the sequence. As options, the user can specify the temperature and the level of superhelicity. The values of all structural and energy parameters used in the calculation have been experimentally measured. WebSIDD should prove useful for finding SIDD-susceptible sites in genomic sequences, and correlating their occurrence with locations involved in regulatory and pathological processes. This strategy already has illuminated the roles of SIDD in diverse biological regulatory processes, including transcriptional initiation and termination, and the eukaryotic nuclear scaffold attachments that partition chromosomes into domains. AVAILABILITY: http://orange.genomecenter.ucdavis.edu/benham/sidd/index.html
German Research Center for Biotechnology, RDIF/Epigenetic Regulation, D-38124 Braunschweig, Mascheroder Weg 1, Germany.
Scaffold or matrix-attachment regions (S/MARs) are thought to be involved in the organization of eukaryotic chromosomes and in the regulation of several DNA functions. Their characteristics are conserved between plants and humans, and a variety of biological activities have been associated with them. The identification of S/MARs within genomic sequences has proved to be unexpectedly difficult, as they do not appear to have consensus sequences or sequence motifs associated with them. We have shown that S/MARs do share a characteristic structural property: they have a markedly high predicted propensity to undergo strand separation when placed under negative superhelical tension. This result agrees with experimental observations that S/MARs contain base-unpairing regions (BURs). Here, we perform a quantitative evaluation of the association between the ease of stress-induced DNA duplex destabilization (SIDD) and S/MAR binding activity. We first use synthetic oligomers to investigate how the arrangement of localized unpairing elements within a base-unpairing region affects S/MAR binding. The organizational properties found in this way are applied to the investigation of correlations between specific measures of stress-induced duplex destabilization and the binding properties of naturally occurring S/MARs. For this purpose, we analyze S/MAR and non-S/MAR elements that have been derived from the human genome or from the tobacco genome. We find that S/MARs exhibit long regions of extensive destabilization. Moreover, quantitative measures of the SIDD attributes of these fragments calculated under uniform conditions are found to correlate very highly (r² > 0.8) with their experimentally measured S/MAR-binding strengths. These results suggest that duplex destabilization may be involved in the mechanisms by which S/MARs function. They also suggest that SIDD properties may be incorporated into an improved computational strategy to search genomic DNA sequences for sites having the necessary attributes to function as S/MARs, and even to estimate their relative binding strengths.
UC Davis Genome Center, University of California, One Shields Avenue, Davis, CA 95616, USA. cjbenham@ucdavis.edu
We present a method for calculating predicted locations and extents of stress-induced DNA duplex destabilization (SIDD) as functions of base sequence and stress level in long DNA molecules. The base pair denaturation energies are assigned individually, so the influences of near neighbors, methylated bases, adducts, or lesions can be included. Sample calculations indicate that copolymeric energetics give results that are close to those derived when full near-neighbor energetics are used; small but potentially informative differences occur only in the calculated SIDD properties of moderately destabilized regions. The method presented here for analyzing long sequences calculates the destabilization properties within windows of fixed length N, with successive windows displaced by an offset distance d(o). The final values of the relevant destabilization parameters for each base pair are calculated as weighted averages of the values computed for each window in which that base pair appears. This approach implicitly assumes that the strength of the direct coupling between remote base pairs that is induced by the imposed stress attenuates with their separation distance. This strategy enables calculations of the destabilization properties of DNA sequences of any length, up to and including complete chromosomes. We illustrate its utility by calculating the destabilization properties of the entire E. coli genomic DNA sequence. A preliminary analysis of the results shows that promoters are associated with SIDD regions in a highly statistically significant manner, suggesting that SIDD attributes may prove useful in the computational prediction of promoter locations in prokaryotes.
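The windowing scheme in this abstract (fixed-length windows, a fixed offset between successive windows, and a weighted average per base pair) can be sketched as below. The per-window SIDD calculation is not reproduced: `score_window` is a hypothetical placeholder for it, and the default uniform weighting is an assumption, since the abstract does not state which weights are used.

```python
def windowed_profile(sequence, window_length, offset, score_window, weight=None):
    """Average per-position values over all windows that contain each position.

    score_window(sub) must return one value per character of sub.
    weight(i, length) gives the weight of position i inside a window of
    the given length; the default weights every position equally.
    """
    if not 0 < offset <= window_length:
        raise ValueError("offset must be positive and no larger than window_length")
    if weight is None:
        weight = lambda i, length: 1.0

    n = len(sequence)
    totals = [0.0] * n
    weight_sums = [0.0] * n
    start = 0
    while True:
        end = min(start + window_length, n)
        values = score_window(sequence[start:end])
        for i, value in enumerate(values):
            w = weight(i, end - start)
            totals[start + i] += w * value
            weight_sums[start + i] += w
        if end == n:  # the last window has reached the end of the sequence
            break
        start += offset
    return [t / w for t, w in zip(totals, weight_sums)]


# Toy scorer standing in for the real per-window calculation:
# 1.0 for A/T positions, 0.0 for G/C positions.
def at_indicator(sub):
    return [1.0 if base in "AT" else 0.0 for base in sub]


profile = windowed_profile("AATTGGCC", window_length=4, offset=2,
                           score_window=at_indicator)
```

A weight function that down-weights positions near window edges would reflect the abstract's assumption that coupling between base pairs attenuates with distance, but the exact form is left open here.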
Laboratory of Human Molecular Genetics, Children's Mercy Hospital & Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA.
Many multimeric transcription factors recognize DNA sequence patterns by cooperatively binding to bipartite elements composed of half sites separated by a flexible spacer. We developed a novel bipartite algorithm, bipartite pattern discovery (Bipad), which produces a mathematical model based on the information maximization or Shannon entropy minimization principle for the discovery of bipartite sequence patterns. Bipad is a C++ program that applies greedy methods to search the bipartite alignment space and examines the upstream or downstream regions of co-regulated genes, looking for cis-regulatory bipartite patterns. An input sequence file with zero or one site per locus is required, and the left and right motif widths and a range of possible gap lengths must be specified. Bipad can run in either single-block or bipartite pattern search modes, and it is capable of comprehensively searching all four orientations of half-site patterns. Simulation studies showed that the accuracy of this motif discovery algorithm depends on sample size and motif conservation level, but results were independent of background composition. Bipad performed as well as or better than other pattern search algorithms in correctly identifying Escherichia coli cyclic AMP receptor protein and Bacillus subtilis sigma factor binding site sequences based on experimentally defined benchmarks. Finally, a new bipartite information weight matrix for vitamin D3 receptor/retinoid X receptor alpha (VDR/RXRalpha) binding sites was derived that comprehensively models the natural variability inherent in these sequence elements.
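The information-maximization idea can be illustrated with the standard per-column information content of aligned sites, computed as the maximum entropy minus the observed Shannon entropy. The sketch is in Python rather than Bipad's C++, assumes a uniform background, and leaves out the small-sample correction and any gap-length term; the function names and the toy sites are hypothetical, and none of this is Bipad's actual objective function.

```python
import math


def column_information(sites, alphabet="ACGT"):
    """Information content in bits for each column of equal-length aligned sites."""
    width = len(sites[0])
    max_bits = math.log2(len(alphabet))
    info = []
    for col in range(width):
        column = [site[col] for site in sites]
        entropy = 0.0
        for base in alphabet:
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        info.append(max_bits - entropy)
    return info


def bipartite_information(left_sites, right_sites):
    """Total information of two half-site alignments (gap term omitted)."""
    return sum(column_information(left_sites)) + sum(column_information(right_sites))


# Toy half sites: the left block is fully conserved, the right block
# varies in its last column.
left = ["TGTGA", "TGTGA"]
right = ["TCACA", "TCACT"]
total_bits = bipartite_information(left, right)
```

Lower entropy in the aligned columns means higher information, so a search over alignments that minimizes entropy is the same as one that maximizes this score.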
[My paper] Chengpeng Bi
Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Pediatrics Research Building, Third Floor, Kansas City, Missouri 64108, USA. cbi@cmh.edu.
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
ABSTRACT: BACKGROUND: Many dimeric protein complexes bind cooperatively to families of bipartite nucleic acid sequence elements consisting of pairs of conserved half-sites with sequences and intervening distances that vary among individual sites. RESULTS: We introduce the Bipad Server and Logo Plotter (http://bipad.cmh.edu), a web interface to predict sequence elements embedded within unaligned sequences and generate sequence logos from the aligned elements. Either a bipartite model consisting of a pair of one-block position weight matrices (PWMs) with a gap distribution, or a single PWM for contiguous single-block motifs, may be produced. The Bipad program performs multiple local alignment by entropy minimization and cyclic refinement using a stochastic greedy search strategy. Optimal models are refined by maximizing incremental information contents among a set of potential models with varying half-site and gap lengths. CONCLUSIONS: The web service graphically represents the set of discovered elements as a sequence logo and depicts the gap distribution as a histogram. Server performance was evaluated by generating a collection of bipartite models for distinct DNA binding proteins.
[My paper] Chengpeng Bi
Protein conserved domains are distinct units of molecular structure, usually associated with particular aspects of molecular function such as catalysis or binding. These conserved subsequences are often unobserved and thus in need of detection. Motif discovery methods can be used to find these unobserved domains given a set of sequences. This paper presents the data augmentation (DA) framework, which unifies a suite of motif-finding algorithms that maximize the same likelihood function by imputing the unobserved data. Data augmentation refers to methods that formulate iterative optimization by exploiting the unobserved data. Two categories of maximum likelihood-based motif-finding algorithms are illustrated under the DA framework. The first comprises deterministic algorithms that maximize the likelihood function by performing an iteratively optimal local search in the alignment space. The second comprises stochastic algorithms that iteratively draw motif location samples via Monte Carlo simulation while keeping track of the best solution found so far, that is, the one with the highest likelihood. As a result, four DA motif discovery algorithms are described, evaluated, and compared by aligning real and simulated protein sequences.
Since the completion of human genome sequencing, cataloging of all genomic functional elements has been one of the challenging problems in bioinformatics. Deciphering cis-regulatory elements in the human genome still remains elusive although much effort has been expended. This paper reviews a suite of methods for two-block motif discovery including mathematical modeling, de novo motif-finding based on multiple local alignment, and genomic sequence scanning method for putative sites. We formulate a general method to address this challenge and compare two major existing algorithms (i.e., greedy local search and Gibbs sampling) implemented to solve the popular two-block structured motif discovery issue. We demonstrate how to use this suite of methods and apply them to human nuclear receptor response elements (i.e., protein binding sites of several relevant nuclear receptors, HNF4alpha, CAR/RXR, and PXR/RXR).
[My paper] Chengpeng Bi
Bioinformatics and Intelligent Computing Lab, Division of Clinical Pharmacology, Children's Mercy Hospitals, Kansas City, Missouri, USA. cbi@cmh.edu
BACKGROUND: Deciphering cis-regulatory elements or de novo motif-finding in genomes still remains elusive although much algorithmic effort has been expended. Markov chain Monte Carlo (MCMC) methods such as Gibbs motif samplers have been widely employed to solve the de novo motif-finding problem through sequence local alignment. Nonetheless, the MCMC-based motif samplers still suffer from local maxima, as EM does. Therefore, as a prerequisite for finding good local alignments, these motif algorithms are often run independently many times, but without information exchange between different chains. Hence it would be worthwhile to design a new algorithm that enables such information exchange. RESULTS: This paper presents a novel motif-finding algorithm that evolves a population of Markov chains with information exchange (PMC), each of which is initialized as a random alignment and run by the Metropolis-Hastings sampler (MHS). Each chain is progressively updated through a series of stochastically sampled local alignments. Explicitly, the PMC motif algorithm performs stochastic sampling as specified by a population-based proposal distribution rather than individual ones, and adaptively evolves the population as a whole towards a global maximum. The alignment information exchange is accomplished by taking advantage of the pooled motif site distributions. A distinct method that runs multiple independent Markov chains (IMC) without information exchange, dubbed the IMC motif algorithm, is also devised for comparison with its PMC counterpart. CONCLUSION: Experimental studies demonstrate that performance can be improved when pooled information is used to run a population of motif samplers. The new PMC algorithm improved convergence and outperformed other popular algorithms tested using simulated and biological motif sequences.
2012-05-17 08:14:44 © BioInfoBank Institute