Bioinformatics (Oxford, England)
Valéry Ozenne,
Frédéric Bauer,
Loïc Salmon,
Jie-Rong Huang,
Malene Ringkjøbing Jensen,
Stéphane Segard,
Pau Bernadó,
Céline Charavay,
Martin Blackledge
Protein Dynamics and Flexibility, Institut de Biologie Structurale Jean-Pierre Ebel, CEA; CNRS; UJF UMR 5075, 41 Rue Jules Horowitz, Grenoble 38027, Groupe Informatique pour les Scientifiques du Sud Est (GIPSE), IRTSV / Laboratoire Biologie à Grande Echelle, CEA - INSERM U1038 - UJF, 17 avenue des Martyrs, 38054 Grenoble Cedex 9 and Centre de Biochimie Structurale, CNRS UMR 5048 - UM 1 - INSERM UMR 1054, 34090, Montpellier, France.
MOTIVATION: Intrinsically disordered proteins (IDPs) represent a significant fraction of the human proteome. The classical structure-function paradigm that has successfully underpinned our understanding of molecular biology breaks down when considering proteins that have no stable tertiary structure in their functional form. One convenient approach is to describe the protein in terms of an equilibrium of rapidly inter-converting conformers. Currently, tools to generate such ensemble descriptions are extremely rare, and poorly adapted to the prediction of experimental data. RESULTS: We present flexible-meccano, a highly efficient algorithm that generates ensembles of molecules on the basis of amino acid-specific conformational potentials and volume exclusion. Conformational sampling depends uniquely on the primary sequence, with the possibility of introducing additional local or long-range conformational propensities at an amino acid-specific resolution. The algorithm can also be used to calculate expected values of experimental parameters measured at atomic or molecular resolution, such as nuclear magnetic resonance (NMR) and small angle scattering, respectively. We envisage that flexible-meccano will be useful for researchers who wish to compare experimental data with those expected from a fully disordered protein, researchers who see experimental evidence of deviation from 'random coil' behaviour in their protein, or researchers who are interested in working with a broad ensemble of conformers representing the flexibility of the IDP of interest. AVAILABILITY: A fully documented multi-platform executable is provided, with examples, at http://www.ibs.fr/science-213/scientific-output/software/flexible-meccano/ CONTACT: martin.blackledge@ibs.fr.
Bioinformatics. 2012 May 18;:
22611132
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030.
SUMMARY: Understanding the differences between knotted and unknotted protein structures may offer insights into how proteins fold. To characterize the type of knot in a protein, we have developed PyKnot, a plugin that works seamlessly within the PyMOL molecular viewer and gives quick results including the knot's invariants, crossing numbers and simplified knot projections and backbones. PyKnot may be useful to researchers interested in classifying knots in macromolecules, and provides tools for students of biology and chemistry with which to learn topology and macromolecular visualization. AVAILABILITY: PyMOL is available at http://www.pymol.org. The PyKnot module and tutorial videos are available through http://youtu.be/p95aif6xqcM.
Bioinformatics. 2012 May 18;:
22611131
NICTA Victoria Research Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Australia.
MOTIVATION: The de novo assembly of short read high-throughput sequencing data poses significant computational challenges. The volume of data is huge; the reads are tiny compared to the underlying sequence; and there are significant numbers of sequencing errors. There are numerous software packages that allow users to assemble short reads, but most are either limited to relatively small genomes (e.g., bacteria), or require large computing infrastructure, or employ greedy algorithms and thus often do not yield high quality results. RESULTS: We have developed Gossamer, an implementation of the de Bruijn approach to assembly that requires close to the theoretical minimum of memory, but still allows efficient processing. Our results show that it is space efficient, and produces high quality assemblies. AVAILABILITY: Gossamer is available for non-commercial use from http://www.genomics.csse.unimelb.edu.au/product-gossamer.php. CONTACT: tom.conway@nicta.com.au.
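As a rough illustration of the de Bruijn approach mentioned above, the following Python sketch builds a small graph from reads and walks a non-branching path. It is a toy under stated assumptions: the function names, the example reads and the choice of k are hypothetical, and it makes no attempt to reproduce Gossamer's memory-efficient representation or its error handling.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers it points to.

    Every distinct k-mer found in the reads contributes one edge
    from its (k-1)-prefix to its (k-1)-suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_linear_path(graph, start):
    """Extend a contig from start while the current node has exactly one outgoing edge."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # stop if the path loops back on itself
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical error-free reads that overlap by four bases.
reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = build_de_bruijn(reads, k=4)
contig = walk_linear_path(graph, "ACG")
print(contig)
```

Real assemblers must also cope with sequencing errors, reverse complements and branching nodes, none of which this sketch handles.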
Bioinformatics. 2012 May 18;:
22611130
Mingxiang Teng,
Shoji Ichikawa,
Leah R Padgett,
Yadong Wang,
Matthew Mort,
David N Cooper,
Daniel L Koller,
Tatiana Foroud,
Howard J Edenberg,
Michael J Econs,
Yunlong Liu
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, Center for Computational Biology and Bioinformatics, Department of Medical and Molecular Genetics, Department of Medicine, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA and Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff CF14 4XN, UK.
MOTIVATION: One of the fundamental questions in genetic studies is how to identify functional DNA variants that are responsible for a disease or phenotype of interest. Results from large-scale genetics studies, such as genome-wide association studies (GWAS), and the availability of high throughput sequencing technologies provide opportunities for identifying causal variants. Despite the technical advances, informatics methodologies need to be developed to prioritize thousands of variants for potential causative effects. RESULTS: We present regSNPs, an informatics strategy that integrates several established bioinformatics tools, for prioritizing regulatory SNPs, i.e. the SNPs in the promoter regions that potentially affect phenotype through changing transcription of downstream genes. Compared with existing tools, regSNPs has two distinct features. It considers degenerative features of binding motifs by calculating the differences in binding affinity caused by the candidate variants, and it integrates potential phenotypic effects of various transcription factors. When tested using the disease-causing variants documented in the Human Gene Mutation Database, regSNPs showed mixed performance on various diseases. regSNPs predicted 3 SNPs that can potentially affect bone density in a region detected in an earlier linkage study. Potential effects of one of the variants were validated using a luciferase reporter assay. CONTACT: yunliu@iupui.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
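The idea of scoring a variant by the change in binding affinity it causes can be sketched with a position weight matrix. The Python example below is only an illustration: the motif, the uniform background, the function names and the example sequence are all hypothetical, and the abstract does not say that regSNPs computes affinity in this way.

```python
import math

# Hypothetical 4-position motif: probability of each base at each position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds_score(site):
    """Sum of per-position log2 odds of the motif versus the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, site))


def best_window_score(seq):
    """Highest log-odds score over all motif-length windows of seq."""
    width = len(MOTIF)
    return max(log_odds_score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def affinity_change(ref_seq, pos, alt_base):
    """Score difference (alternate minus reference) after substituting one base."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_window_score(alt_seq) - best_window_score(ref_seq)


# The reference carries a perfect ACGT site; the variant changes C to T at index 3.
delta = affinity_change("TTACGTTT", pos=3, alt_base="T")
print(round(delta, 3))
```

A negative value means the variant weakens the best predicted site. Working with a graded score rather than a match/no-match call is what lets degenerate motif positions contribute.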
Bioinformatics. 2012 May 18;:
22611129
SIRET Research Group, Department of Software Engineering, FMP, Charles University in Prague, Czech Republic.
MOTIVATION: Understanding the architecture and function of RNA molecules requires methods for comparing and analyzing their 3D structures. While a structural alignment of short RNAs is achievable in a reasonable amount of time, large structures represent a much bigger challenge. However, the growth of the number of large RNAs deposited in the PDB database calls for the development of fast and accurate methods for analyzing their structures, as well as for rapid similarity searches in databases. RESULTS: In this article, a novel algorithm for RNA structural comparison, SETTER (SEcondary sTructure-based TERtiary Structure Similarity Algorithm), is introduced. SETTER utilizes a pairwise comparison method based on 3D similarity of the so-called generalized secondary structure units (GSSU). For each pair of structures, SETTER produces a distance score and an indication of its statistical significance. SETTER can be used both for the structural alignment of structures that are already known to be homologous, and for 3D structure similarity searches and functional annotation. The algorithm presented is both accurate and fast and does not impose limits on the size of aligned RNA structures. AVAILABILITY: The SETTER program, as well as all datasets, are freely available from http://siret.cz/hoksza/projects/setter/. CONTACT: hoksza@ksi.mff.cuni.cz, svozild@vscht.cz SUPPLEMENTARY INFORMATION: Supplementary Information is available at Bioinformatics online.
Bioinformatics. 2012 May 17;:
22595210
Department of Biochemistry, MSc Program in Bioinformatics with Systems Biology, Department of Computer Science, University College Cork, Cork, Ireland.
MOTIVATION: Conserved patterns across a multiple sequence alignment can be visualized by generating Sequence Logos. Sequence Logos show the conserved regions as stacks of symbol(s) where the height of a stack is proportional to its informational content, while the height of each symbol within the stack is proportional to its frequency in the alignment. Sequence logos use symbols of either nucleotide or amino acid alphabets. However, certain regulatory signals in mRNA act as combinations of codons. Yet no tool is available for visualization of conserved codon patterns. RESULTS: We present the first application which allows visualization of conserved regions in a multiple sequence alignment in the context of codons. CodonLogo is based on WebLogo3 and uses the same heuristics but treats codons as inseparable units of a 64-letter alphabet. CodonLogo can discriminate patterns of codon conservation from patterns of nucleotide conservation that appear indistinguishable in standard Sequence Logos. With CodonLogo it is often possible to discriminate the protein coding frame in an mRNA from two alternative frames. AVAILABILITY: The CodonLogo source code and its implementation (in a local version of the Galaxy Browser) are available at http://recode.ucc.ie/CodonLogo and through the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/.
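The stack and symbol heights described above can be sketched for a 64-letter codon alphabet as follows. This Python example is a minimal illustration, assuming ungapped, in-frame sequences of equal length and omitting any small-sample correction. The function names and the toy alignment are hypothetical and are not taken from the CodonLogo source.

```python
import math
from collections import Counter


def codon_columns(aligned_seqs):
    """Split in-frame, equal-length sequences into columns of codons."""
    length = len(aligned_seqs[0])
    usable = length - length % 3  # ignore a trailing partial codon
    return [[seq[i:i + 3] for seq in aligned_seqs] for i in range(0, usable, 3)]


def stack_heights(column, alphabet_size=64):
    """Return (information content in bits, {codon: symbol height in bits})."""
    counts = Counter(column)
    total = len(column)
    freqs = {codon: n / total for codon, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    info = math.log2(alphabet_size) - entropy
    # Each symbol takes a share of the stack proportional to its frequency.
    return info, {codon: p * info for codon, p in freqs.items()}


# Toy alignment: a fully conserved ATG followed by four synonymous alanine codons.
alignment = ["ATGGCT", "ATGGCC", "ATGGCA", "ATGGCG"]
columns = codon_columns(alignment)
info_first, heights_first = stack_heights(columns[0])
info_second, heights_second = stack_heights(columns[1])
print(info_first, info_second)
```

In this toy alignment the second codon column is conserved at its first two nucleotides and variable at the third. The codon-level stack for that column is therefore shorter than the stack for the fully conserved first column.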
Bioinformatics. 2012 May 17;:
22595209
Department of Neurology and Center of Translational System Biology, Mount Sinai School of Medicine, New York, NY, 10029, USA.
MOTIVATION: For flow cytometry data, there are two common approaches to the unsupervised clustering problem; one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shapes. The latter approach cannot be applied directly to high dimensional data as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. RESULTS: In this paper, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high dimensional data and can identify irregularly shaped clusters. The algorithm first uses the K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function and the cells are clustered by the associated local peak. The algorithm flowPeaks is automatic, fast, reliable and robust to cluster shape and outliers. This algorithm has been applied to flow cytometry data and it has been compared with state of the art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. AVAILABILITY: The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. CONTACT: yongchao.ge@mssm.edu.
Bioinformatics. 2012 May 17;:
22595208
Bradley Department of Electrical and Computer Engineering, Virginia Tech, Arlington, VA 22203, USA.
MOTIVATION: Identification of transcriptional regulatory networks (TRNs) is of significant importance in computational biology for cancer research, providing a critical building block to unravel disease pathways. However, existing methods for TRN identification suffer from the inclusion of excessive 'noise' in microarray data and false-positives in binding data, especially when applied to human tumor-derived cell line studies. More robust methods that can counteract the imperfection of data sources are therefore needed for reliable identification of TRNs in this context. RESULTS: In this paper, we propose to establish a link between the quality of one target gene to represent its regulator and the uncertainty of its expression to represent other target genes. Specifically, an outlier sum statistic was used to measure the aggregated evidence for regulation events between target genes and their corresponding transcription factors. A Gibbs sampling method was then developed to estimate the marginal distribution of the outlier sum statistic, hence, to uncover underlying regulatory relationships. To evaluate the effectiveness of our proposed method, we compared its performance with that of an existing sampling-based method using both simulation data and yeast cell cycle data. The experimental results show that our method consistently outperforms the competing method in different settings of signal-to-noise ratio and network topology, indicating its robustness for biological applications. Finally, we applied our method to breast cancer cell line data and demonstrated its ability to extract biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer. AVAILABILITY AND IMPLEMENTATION: The Gibbs sampler MATLAB package is freely available at http://www.cbil.ece.vt.edu/software.htm CONTACT: xuan@vt.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Bioinformatics. 2012 May 16;:
22595207
Statistics, University of Wisconsin-Madison, Madison, WI 53706.
SUMMARY: R/EBcoexpress implements the approach of Dawson and Kendziorski (2011) using R, a freely available, open source statistical programming language (R Development Core Team, 2009). The approach identifies differential co-expression (DC) by examining the correlations among gene pairs using an empirical Bayesian approach, producing a false discovery rate (FDR) controlled list of DC pairs. This interrogation of DC gene pairs complements but is distinct from differential expression (DE) analyses, under the general goal of understanding differential regulation across biological conditions. AVAILABILITY AND IMPLEMENTATION: R/EBcoexpress is freely available and hosted on Bioconductor; a source file and vignette may be found at http://www.bioconductor.org/packages/release/bioc/html/EBcoexpress.html CONTACT: kendzior@biostat.wisc.edu SUPPLEMENTARY INFORMATION: None.
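The quantity at the heart of differential co-expression, the correlation of a gene pair computed separately in each condition, can be sketched as below. This Python example shows only that first step, with hypothetical function names and made-up expression values. It does not implement the empirical Bayesian model or the FDR control of R/EBcoexpress.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_by_condition(gene_a, gene_b, conditions):
    """Correlation of one gene pair within each condition label."""
    result = {}
    for cond in sorted(set(conditions)):
        idx = [i for i, c in enumerate(conditions) if c == cond]
        result[cond] = pearson([gene_a[i] for i in idx], [gene_b[i] for i in idx])
    return result


# Made-up expression values: the pair is co-regulated in "ctl" and anti-correlated in "trt".
conditions = ["ctl"] * 4 + ["trt"] * 4
gene_a = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 8.0, 6.0, 4.0, 2.0]
r = correlation_by_condition(gene_a, gene_b, conditions)
print(r)
```

Both genes have the same mean in each condition (2.5 for gene_a, 5.0 for gene_b), so a differential expression test comparing means would see nothing. Only the correlation changes, which is why DC analysis complements DE analysis.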
Bioinformatics. 2012 May 15;:
22592383
Department of Biostatistics, University of Washington, Seattle, WA, USA.
MOTIVATION: Statistical analyses of Genome Wide Association Studies require fitting large numbers of very similar regression models, each with low statistical power. Taking advantage of repeated observations or correlated phenotypes can increase this statistical power, but fitting the more complicated models required can make computation impractical. RESULTS: In this paper we present simple methods that capitalize on the structure inherent in GWAS studies to dramatically speed up computation for a wide variety of problems, with a special focus on methods for correlated phenotypes. AVAILABILITY: The R package 'boss' is available on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/boss/ CONTACT: voorma@u.washington.edu.
Bioinformatics. 2012 May 15;:
22592382
Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX.
MOTIVATION: In early drug development it would be beneficial to be able to identify those dynamic patterns of gene response that indicate that drugs targeting a particular gene will be likely or not to elicit the desired response. One approach would be to quantitate the degree of similarity between the responses that cells show when exposed to drugs, so that consistencies in the regulation of cellular response processes that produce success or failure can be more readily identified. RESULTS: We track drug response using fluorescent proteins as transcription activity reporters. Our basic assumption is that drugs inducing very similar alteration in transcriptional regulation will produce similar temporal trajectories on many of the reporter proteins and hence be identified as having similarities in their mechanisms of action (MOA). The main body of this work is devoted to characterizing similarity in temporal trajectories/signals. To do so, we must first identify the key points that determine mechanistic similarity between two drug responses. Directly comparing points on the two signals is unrealistic, since it cannot handle delays and speed variations on the time axis. Hence, to capture the similarities between reporter responses, we develop an alignment algorithm that is robust to noise, time delays, and is able to find all the contiguous parts of signals centered about a core alignment (reflecting a core mechanism in drug response). Applying the proposed algorithm to a range of real drug experiments shows that the result agrees well with the prior drug MOA knowledge. AVAILABILITY: The R code for the RLCSS algorithm is available at http://gsp.tamu.edu/Publications/supplementary/zhao12a. CONTACT: edward@ece.tamu.edu.
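The abstract does not spell out the alignment algorithm, so the following Python sketch shows only the classic longest common subsequence (LCSS) measure for real-valued time series. It is offered as an assumption about the general family of methods, not as the RLCSS algorithm itself. The tolerance `epsilon`, the delay window `delta`, the function names and the example signals are all hypothetical.

```python
def lcss_length(x, y, epsilon, delta):
    """Length of the longest common subsequence of two time series.

    Two samples match when their values differ by at most epsilon
    and their time indices differ by at most delta.
    """
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) <= epsilon and abs(i - j) <= delta:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_similarity(x, y, epsilon, delta):
    """LCSS length normalised by the length of the shorter series."""
    return lcss_length(x, y, epsilon, delta) / min(len(x), len(y))


# Two hypothetical reporter trajectories: the second is the first, one step earlier.
signal_a = [0, 0, 1, 2, 3, 2, 1, 0]
signal_b = [0, 1, 2, 3, 2, 1, 0, 0]

pointwise = lcss_length(signal_a, signal_b, epsilon=0.1, delta=0)
shift_tolerant = lcss_length(signal_a, signal_b, epsilon=0.1, delta=1)
print(pointwise, shift_tolerant)
```

With `delta=0` the comparison is point by point and only the two end samples agree. Allowing a one-step delay recovers seven of the eight samples, which illustrates why a direct pointwise comparison cannot handle delays on the time axis.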
Bioinformatics. 2012 May 15;:
22592381
Shigeo Fujimori,
Naoya Hirai,
Kazuyo Masuoka,
Tomohiro Oshikubo,
Tatsuhiro Yamashita,
Takanori Washio,
Ayumu Saito,
Masao Nagasaki,
Satoru Miyano,
Etsuko Miyamoto-Sato
Division of Interactome Medical Sciences, Institute of Medical Science, The University of Tokyo, Tokyo 108-8039, Japan, Production Solution Business Unit, Production Solution Division.II, Solution Department I, Fujitsu Advanced Engineering Ltd., Tokyo 163-1017, Japan, BioIT Business Development Unit, Fujitsu Ltd., Chiba 261-8588, Japan, RIKEN GENESIS Co., Ltd., Yokohama 230-0045, Japan and Human Genome Center, Institute of Medical Sciences, The University of Tokyo, Tokyo 108-8039, Japan.
SUMMARY: Protein-protein interactions (PPIs) are mediated through specific regions on proteins. Some proteins have two or more protein interacting regions (IRs) and some IRs are competitively used for interactions with different proteins. IRView currently contains data for 3,417 IRs in human and mouse proteins. The data was obtained from several different sources and combined with annotated region data from InterPro. Information on non-synonymous single nucleotide polymorphism (nsSNP) sites and variable regions owing to alternative mRNA splicing is also included. The IRView web interface displays all IR data, including user-uploaded data, on reference sequences so that the positional relationship between IRs can be easily understood. IRView should be useful for analyzing underlying relationships between the proteins behind the PPI networks. AVAILABILITY: IRView is publicly available on the web at http://ir.hgc.jp/. CONTACT: nekoneko@ims.u-tokyo.ac.jp.
Bioinformatics. 2012 May 15;:
22592380
Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, Department of Electrical Engineering, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee, Belgium, School of Life Sciences - LifeNet, Freiburg Institute for Advanced Studies (FRIAS), University of Freiburg, Albertstr. 19, 79104 Freiburg, Germany, Department of Plant Biotechnology and Bioinformatics (VIB), Ghent University, Technologiepark 927, B-9052 Ghent, Belgium.
MOTIVATION: Probabilistic motif detection requires a multi-step approach going from the actual de novo regulatory motif finding up to a tedious assessment of the predicted motifs. MotifSuite, a user-friendly web interface, streamlines this analysis flow. Its core consists of two post-processing procedures that allow prioritizing the motif detection output. The tools offered by MotifSuite are built around the well-established motif detection tool MotifSampler, but can also be used in combination with any other probabilistic motif detection tool. Elaborate guidelines on each of its applications have been provided. AVAILABILITY: http://homes.esat.kuleuven.be/~bioi_marchal/MotifSuite/Index.htm CONTACT: kamar@psb.ugent.be.
Bioinformatics. 2012 May 15;:
22592379
Division of Bioinformatics, Omicsoft Inc., 164 Quade Drive, Cary, NC 27513, USA.
SUMMARY: Accurately mapping RNA-Seq reads to the reference genome is a critical step for performing downstream analysis such as transcript assembly, isoform detection and quantification. Many tools have been developed; however, given the huge size of next generation sequencing (NGS) datasets and the complexity of the transcriptome, RNA-Seq read mapping remains a challenge with the ever-increasing amount of data. We develop OSA (Omicsoft Sequence Aligner), a fast and accurate alignment tool for RNA-Seq data. Benchmarked against existing methods, OSA improves mapping speed 4-10 fold with better sensitivity and fewer false positives. AVAILABILITY: OSA can be downloaded from http://omicsoft.com/osa. It is free to academic users. OSA has been tested extensively on Linux, Mac OS X and Windows platforms.
Bioinformatics. 2012 May 15;:
22592378
School of Biological Sciences, University of Reading, Whiteknights, Reading RG6 6AS, Biocomputing Section, MRC Harwell, Harwell Oxford Campus, Didcot OX11 0RD and Diamond Light Source, Beamline B23, Chilton, Didcot OX11 0DE, UK.
MOTIVATION: Modelling the 3D structures of proteins can often be enhanced if more than one fold template is used during the modelling process. However, in many cases, this may also result in poorer model quality for a given target or alignment method. There is a need for modelling protocols that can both consistently and significantly improve 3D models and provide an indication of when models might not benefit from the use of multiple target-template alignments. Here, we investigate the use of both global and local model quality prediction scores produced by ModFOLDclust2, to improve the selection of target-template alignments for the construction of multiple-template models. Additionally, we evaluate clustering the resulting population of multi- and single-template models for the improvement of our IntFOLD-TS tertiary structure prediction method. RESULTS: We find that using accurate local model quality scores to guide alignment selection is the most consistent way to significantly improve models for each of the sequence to structure alignment methods tested. In addition, using accurate global model quality for re-ranking alignments, prior to selection, further improves the majority of multi-template modelling methods tested. Furthermore, subsequent clustering of the resulting population of multiple-template models significantly improves the quality of selected models compared with the previous version of our tertiary structure prediction method, IntFOLD-TS. AVAILABILITY AND IMPLEMENTATION: Source code and binaries can be freely downloaded from http://www.reading.ac.uk/bioinf/downloads/. CONTACT: l.j.mcguffin@reading.ac.uk SUPPLEMENTARY INFORMATION: http://www.reading.ac.uk/bioinf/MTM_suppl_info.pdf.
Bioinformatics. 2012 May 15;:
22592377
Lars J Kangas,
Thomas O Metz,
Giorgis Isaac,
Brian T Schrom,
Bojana Ginovska-Pangovska,
Luning Wang,
Li Tan,
Robert R Lewis,
John H Miller
Computational and Statistical Analytics Division, Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA 99352, lars.kangas@pnnl.gov.
MOTIVATION: Liquid chromatography-mass spectrometry-based metabolomics has gained importance in the life sciences, yet it is not supported by software tools for high throughput identification of metabolites based on their fragmentation spectra. An algorithm (ISIS: in silico identification software) and its implementation are presented and show great promise in generating in silico spectra of lipids for the purpose of structural identification. Instead of using chemical reaction rate equations or rules-based fragmentation libraries, the algorithm uses machine learning to find accurate bond cleavage rates in a mass spectrometer employing collision-induced dissociation tandem mass spectrometry. A preliminary test of the algorithm with 45 lipids from a subset of lipid classes shows both high sensitivity and specificity. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Bioinformatics. 2012 May 13;:
22586178
Cellular Differentiation and Toxicity Prediction, Institut Pasteur Korea 696 Sampyeong-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, 463-400 Korea.
MOTIVATION: High-Throughput Screening is a powerful technology principally used by pharmaceutical industries allowing the identification of molecules of interest within large libraries. Originally target based, cellular assays provide a way to test compounds (or other biological material such as small interfering RNA) in a more physiologically realistic in vitro environment. High-Content Screening (HCS) platforms are now available at lower cost, giving universities and research institutes the opportunity to access those technologies for research purposes. However, the amount of information extracted from each experiment is multiplexed and hence difficult to handle. In this context, there is an important need for easy-to-use yet powerful software able to manage multidimensional screening data by performing adapted quality control and classification. HCS analyzer includes: a user-friendly interface specifically dedicated to HCS readouts, an automated approach to identify systematic errors potentially occurring during screening and a set of tools to classify, cluster and identify phenotypes of interest among large and multivariate data. AVAILABILITY: The application, the C#.Net source code, as well as detailed documentation, are freely available at the following URL: http://hcs-analyzer.ip-korea.org CONTACT: dorvalt@ip-korea.org.
Bioinformatics. 2012 May 10;:
22581181
Department of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center, Houston, TX 77030.
SUMMARY: Sequencing by hybridization to oligonucleotides has evolved into an inexpensive, reliable and fast technology for targeted sequencing. Hundreds of human genes can now be sequenced within a day using a single hybridization to a resequencing microarray. However, several issues inherent to these arrays (e.g., cross-hybridization, variable probe/target affinity) cause sequencing errors and have prevented more widespread applications. We developed an R package for resequencing microarray data analysis that integrates a novel statistical algorithm, sequence robust multi-array analysis (SRMA), for rare variant detection with high sensitivity (false negative rate, FNR 5%) and accuracy (false positive rate, FPR 1 × 10^-5). The SRMA package consists of five modules for quality control, data normalization, single array analysis, multi-array analysis and output analysis. The entire workflow is efficient and identifies rare DNA single nucleotide variations (SNVs) and structural changes such as gene deletions with high accuracy and sensitivity. AVAILABILITY: http://cran.r-project.org/, http://odin.mdacc.tmc.edu/~wwang7/SRMAIndex.html. CONTACT: wwang7@mdanderson.org.
Bioinformatics. 2012 May 10;:
22581180
Evolutionary Biology Unit, South Australian Museum, North Terrace, Adelaide, SA, 5000 Australia, School of Molecular and Biomedical Science, North Terrace, Adelaide, Ecology, Evolution and Landscape Science, University of Adelaide, North Terrace, Adelaide, Department of Ecology and Evolutionary Biology, Yale University, New Haven, USA, School of Biological Sciences, Flinders University, GPO Box 2100, Adelaide, and Australian Centre for Evolutionary Biology and Biodiversity, School of Earth and Environmental Science, University of Adelaide.
MOTIVATION: When working with non-model organisms, few if any species-specific markers are available for phylogenetic, phylogeographic and population studies. Therefore, researchers often try to adapt markers developed in distantly related taxa, resulting in poor amplification and ascertainment bias in their target taxa. Markers can be developed de novo, and anonymous nuclear loci (ANL) are proving to be a boon for researchers seeking large numbers of fast-evolving, independent loci. However, the development of ANL can be laboratory intensive and expensive. A workflow is described to identify suitable low copy anonymous loci from high throughput shotgun sequences, dramatically reducing the cost and time required to develop these markers and produce robust multilocus datasets. RESULTS: By successively removing repetitive and evolutionarily conserved sequences from low coverage shotgun libraries, we were able to isolate thousands of potential ANL. Empirical testing of loci developed from two reptile taxa confirmed that our methodology yields markers with comparable amplification rates and nucleotide diversities to ANL developed using other methodologies. Our approach capitalises on next-generation sequencing technologies to enable the development of phylogenetic, phylogeographic and population markers for taxa lacking suitable genomic resources. CONTACT: terry.bertozzi@samuseum.sa.gov.au.
Bioinformatics. 2012 May 10;:
22581179
Illumina, Inc, 5200 Illumina Way, San Diego, CA 92122.
MOTIVATION: Whole genome and exome sequencing of matched tumor-normal sample pairs is becoming routine in cancer research. The consequent increased demand for somatic variant analysis of paired samples requires methods specialized to model this problem so as to sensitively call variants at any practical level of tumor impurity. RESULTS: We describe Strelka, a method for somatic SNV and small indel detection from sequencing data of matched tumor-normal samples. The method employs a novel Bayesian approach which represents continuous allele frequencies for both tumor and normal samples, whilst leveraging the expected genotype structure of the normal. This is achieved by representing the normal sample as a mixture of germline variation with noise, and representing the tumor sample as a mixture of the normal sample with somatic variation. A natural consequence of the model structure is that sensitivity can be maintained at high tumor impurity without requiring purity estimates. We demonstrate that the method has superior accuracy and sensitivity on impure samples compared to approaches based on either diploid genotype likelihoods or general allele-frequency tests. AVAILABILITY: The Strelka workflow source code is available from ftp://strelka@ftp.illumina.com/. CONTACT: csaunders@illumina.com.