Latest Paper:
Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA. pickrell@uchicago.edu
Li et al. (Research Articles, 1 July 2011, p. 53; published online 19 May 2011) reported more than 10,000 mismatches between messenger RNA and DNA sequences from the same individuals, which they attributed to previously unrecognized mechanisms of gene regulation. We found that at least 88% of these sequence mismatches can likely be explained by technical artifacts such as errors in mapping sequencing reads to a reference genome, sequencing errors, and genetic variation.
PLoS One. 2012;7(2):e30629
22359548
Jean-Baptiste Veyrieras,
Daniel J Gaffney,
Joseph K Pickrell,
Yoav Gilad,
Matthew Stephens,
Jonathan K Pritchard
Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America.
Mapping of expression quantitative trait loci (eQTLs) is an important technique for studying how genetic variation affects gene regulation in natural populations. In a previous study using Illumina expression data from human lymphoblastoid cell lines, we reported that cis-eQTLs are especially enriched around transcription start sites (TSSs) and immediately upstream of transcription end sites (TESs). In this paper, we revisit the distribution of eQTLs using additional data from Affymetrix exon arrays and from RNA sequencing. We confirm that most eQTLs lie close to the target genes; that transcribed regions are generally enriched for eQTLs; that eQTLs are more abundant in exons than introns; and that the peak density of eQTLs occurs at the TSS. However, we find that the intriguing TES peak is greatly reduced or absent in the Affymetrix and RNA-seq data. Instead our data suggest that the TES peak observed in the Illumina data is mainly due to exon-specific QTLs that affect 3' untranslated regions, where most of the Illumina probes are positioned. Nonetheless, we do observe an overall enrichment of eQTLs in exons versus introns in all three data sets, consistent with an important role for exonic sequences in gene regulation.
Daniel G MacArthur,
Suganthi Balasubramanian,
Adam Frankish,
Ni Huang,
James Morris,
Klaudia Walter,
Luke Jostins,
Lukas Habegger,
Joseph K Pickrell,
Stephen B Montgomery,
Cornelis A Albers,
Zhengdong D Zhang,
Donald F Conrad,
Gerton Lunter,
Hancheng Zheng,
Qasim Ayub,
Mark A DePristo,
Eric Banks,
Min Hu,
Robert E Handsaker,
Jeffrey A Rosenfeld,
Menachem Fromer,
Mike Jin,
Xinmeng Jasmine Mu,
Ekta Khurana,
Kai Ye,
Mike Kay,
Gary Ian Saunders,
Marie-Marthe Suner,
Toby Hunt,
If H A Barnes,
Clara Amid,
Denise R Carvalho-Silva,
Alexandra H Bignell,
Catherine Snow,
Bryndis Yngvadottir,
Suzannah Bumpstead,
David N Cooper,
Yali Xue,
Irene Gallego Romero,
Jun Wang,
Yingrui Li,
Richard A Gibbs,
Steven A McCarroll,
Emmanouil T Dermitzakis,
Jonathan K Pritchard,
Jeffrey C Barrett,
Jennifer Harrow,
Matthew E Hurles,
Mark B Gerstein,
Chris Tyler-Smith
Wellcome Trust Sanger Institute, Hinxton, UK. macarthur@atgu.mgh.harvard.edu
Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
Jacob F Degner,
Athma A Pai,
Roger Pique-Regi,
Jean-Baptiste Veyrieras,
Daniel J Gaffney,
Joseph K Pickrell,
Sherryl De Leon,
Katelyn Michelini,
Noah Lewellen,
Gregory E Crawford,
Matthew Stephens,
Yoav Gilad,
Jonathan K Pritchard
Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA.
The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation. However, it remains difficult to identify the causal variants underlying eQTLs, and little is known about the regulatory mechanisms by which they act. Here we show that genetic variants that modify chromatin accessibility and transcription factor binding are a major mechanism through which genetic variation leads to gene expression differences among humans. We used DNase I sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines, for which genome-wide genotypes and estimates of gene expression levels are also available. We obtained a total of 2.7 billion uniquely mapped DNase I-sequencing (DNase-seq) reads, which allowed us to produce genome-wide maps of chromatin accessibility for each individual. We identified 8,902 locations at which the DNase-seq read depth correlated significantly with genotype at a nearby single nucleotide polymorphism or insertion/deletion (false discovery rate = 10%). We call such variants 'DNase I sensitivity quantitative trait loci' (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial fraction (16%) of dsQTLs are also associated with variation in the expression levels of nearby genes (that is, these loci are also classified as eQTLs). Conversely, we estimate that as many as 55% of eQTL single nucleotide polymorphisms are also dsQTLs. Our observations indicate that dsQTLs are highly abundant in the human genome and are likely to be important contributors to phenotypic variation.
Genome Biol. 2012 Jan 31;13(1):R7
22293038
Daniel J Gaffney,
Jean-Baptiste Veyrieras,
Jacob F Degner,
Roger Pique-Regi,
Athma A Pai,
Gregory E Crawford,
Matthew Stephens,
Yoav Gilad,
Jonathan K Pritchard
Department of Human Genetics, University of Chicago, 920 E. 58th Street, Chicago, IL 60637, USA. dg13@sanger.ac.uk
ABSTRACT: BACKGROUND: Expression quantitative trait loci (eQTLs) are likely to play an important role in the genetics of complex traits; however, their functional basis remains poorly understood. Using the HapMap lymphoblastoid cell lines, we combine 1000 Genomes genotypes and an extensive catalogue of human functional elements to investigate the biological mechanisms that eQTLs perturb. RESULTS: We use a Bayesian hierarchical model to estimate the enrichment of eQTLs in a wide variety of regulatory annotations. We find that approximately 40% of eQTLs occur in open chromatin, and that they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions. Analysis of core promoter regions shows that eQTLs also frequently disrupt some known core promoter motifs but, surprisingly, are not enriched in other well-known motifs such as the TATA box. We also show that information from regulatory annotations alone, when weighted by the hierarchical model, can provide a meaningful ranking of the SNPs that are most likely to drive gene expression variation. CONCLUSIONS: Our study demonstrates how regulatory annotation and the association signal derived from eQTL-mapping can be combined into a single framework. We used this approach to further our understanding of the biology that drives human gene expression variation, and of the putatively causal SNPs that underlie it.
Genome Res. 2011 Dec 29
22207615
George H Perry,
Pall Melsted,
John C Marioni,
Ying Wang,
Russell Bainer,
Joseph K Pickrell,
Katelyn Michelini,
Sarah Zehr,
Anne D Yoder,
Matthew Stephens,
Jonathan K Pritchard,
Yoav Gilad
University of Chicago
Comparative genomic studies in primates have yielded important insights into the evolutionary forces that shape genetic diversity and revealed the likely genetic basis for certain species-specific adaptations. To date, however, these studies have focused on only a small number of species. For the majority of non-human primates, including some of the most critically endangered, genome-level data are not yet available. In this study, we have taken the first steps towards addressing this gap by sequencing RNA from the livers of multiple individuals from each of 16 mammalian species, including humans and 11 non-human primates. Of the non-human primate species, five are lemurs and two are lorisoids, for which little or no genomic data were previously available. To analyze these data, we developed a method for de novo assembly and alignment of orthologous gene sequences across species. We assembled an average of 5,721 genes per species, and characterized diversity and divergence of both gene sequences and gene expression levels. We identified patterns of variation that are consistent with the action of positive or directional selection, including an 18-fold enrichment of peroxisomal genes among genes whose regulation likely evolved under directional selection in the ancestral primate lineage. Importantly, we found no relationship between genetic diversity and endangered status, with the two most endangered species in our study, the black-and-white ruffed lemur and the Coquerel's sifaka, having the highest genetic diversity among all primates. Our observations imply that many endangered lemur populations still harbor considerable genetic variation. Timely efforts to conserve these species alongside their habitats therefore have strong potential to achieve long-term success.
Genome Biol Evol. 2011 Dec 7
22155688
George H Perry,
Darryl Reeves,
Páll Melsted,
Aakrosh Ratan,
Webb Miller,
Katelyn Michelini,
Edward E Louis Jr,
Jonathan K Pritchard,
Christopher E Mason,
Yoav Gilad
Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
We present a high-coverage, draft genome assembly of the aye-aye (Daubentonia madagascariensis), a highly unusual, nocturnal primate from Madagascar. Our assembly totals ∼3.0 billion base pairs (3.0 Gb), roughly the size of the human genome, comprised of ∼2.6 million scaffolds (N50 scaffold size = 13,597 bp) based on short, paired-end sequencing reads. We compared the aye-aye genome sequence data to the four other published primate genomes (human, chimpanzee, orangutan, and rhesus macaque), as well as to the mouse and dog genomes as non-primate outgroups. Unexpectedly, we observed strong evidence for a relatively slow substitution rate in the aye-aye lineage compared to these and other primates. In fact, the aye-aye branch length is estimated to be ∼10% shorter than that of the human lineage, which is known for its low substitution rate. This finding may be explained, in part, by the protracted aye-aye life history pattern, including late weaning and age of first reproduction relative to other lemurs. Additionally, the availability of this draft lemur genome sequence allowed us to polarize nucleotide and protein sequence changes to the ancestral primate lineage - a critical period in primate evolution, for which the relevant fossil record is sparse. Finally, we identified 293,800 high-confidence single nucleotide polymorphisms (SNPs) in the donor individual for our aye-aye genome sequence, a captive-born individual from two wild-born parents. The resulting heterozygosity estimate of 0.051% is the lowest of any primate studied to date, which is understandable considering the aye-aye's extensive home range size and relatively low population densities. Yet this level of genetic diversity also suggests that conservation efforts benefitting this unusual species should be prioritized, especially in the face of the accelerating degradation and fragmentation of Madagascar's forests.
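The N50 scaffold size quoted in the abstract above is a standard summary of assembly contiguity: the length L such that scaffolds of length L or longer together contain at least half of all assembled bases. The following is a minimal sketch of that definition, not the authors' pipeline; the function name `n50` and the toy lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length of the scaffold at which, walking from the
    longest scaffold downward, the running total first reaches at
    least half of the summed assembly length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not the aye-aye scaffolds).
toy_scaffolds = [2, 3, 4, 5, 6]
print(n50(toy_scaffolds))
```

A small N50 relative to total assembly size, as reported here, indicates that the assembly is split into many short scaffolds.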
Lucy Huang,
Mattias Jakobsson,
Trevor J Pemberton,
Muntaser Ibrahim,
Thomas Nyambo,
Sabah Omar,
Jonathan K Pritchard,
Sarah A Tishkoff,
Noah A Rosenberg
Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan.
Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations, for example by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry. Genet. Epidemiol. 35:766-780, 2011. © 2011 Wiley Periodicals, Inc.
Nat Genet. 2011;43(10):923-5
21956387
Department of Human Genetics, University of Chicago, Chicago, USA.
Two new studies take distinct population genetic approaches to analyzing whole-genome sequencing data sets in order to estimate human demographic parameters. These papers refine our understanding of the relationships among human populations while illustrating both the possibilities and the statistical challenges of fitting demographic models to whole-genome data sets.
BMC Bioinformatics. 2011;12:333
21831268
Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA. pmelsted@gmail.com
BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html.
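The two-sweep idea in the abstract above can be sketched in a few lines of Python. This is an illustrative toy, not BFCounter itself (which is written in C++): the class `BloomFilter`, the function `count_repeated_kmers`, the SHA-256 double-hashing scheme, and the default filter size are all assumptions chosen for readability. For simplicity the sketch also treats a k-mer and its reverse complement as distinct.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_repeated_kmers(reads, k, num_bits=1 << 20, num_hashes=4):
    """Return exact counts for k-mers seen more than once.

    `reads` must be re-iterable (for example a list), because the
    data are swept twice.
    """
    seen = BloomFilter(num_bits, num_hashes)
    counts = {}
    # Sweep 1: a k-mer enters the table only when the filter reports
    # it was seen before, so most singletons never use table memory.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen:
                counts[kmer] = 0
            else:
                seen.add(kmer)
    # Sweep 2: exact counts for the candidate k-mers only.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1
    # Bloom-filter false positives show up as count 1; drop them.
    return {kmer: n for kmer, n in counts.items() if n > 1}


reads = ["ACGTACGT", "TTACGT"]
print(count_repeated_kmers(reads, 4))
```

Because a Bloom filter never reports a stored item as absent, every k-mer that truly repeats reaches the table in the first sweep; the second sweep then removes the false positives, so the final counts are exact.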