Human Genome Variation


Each individual carries two copies of the human genome: one maternal, one paternal. Any two human genomes differ at about 3 million single nucleotide variants, roughly one-tenth as many small insertions and deletions, and thousands of larger structural variants (SVs). Human genome variation is important for two reasons:

It is a main cause of phenotypic differences: it is the reason humans differ from one another in their phenotypes, including their predisposition to disease.
It contains information that can be mined to reconstruct our ancestry, demography, and relatedness.

Discovering human genome variation, and using it to understand ancestry, relatedness and phenotype, requires sophisticated algorithms. This is especially true for three orthogonal reasons: (1) the massive scale of the data (for example, several projects, such as 1KGP, UK10K and UK100K, are underway to sequence thousands of individuals); (2) the complexity of the human genome, in particular complex repeat structures that create "dark" regions and confound the discovery of SVs; (3) complex correlations across variants within a population (linkage disequilibrium, LD). The challenge is to infer variants and their correlations jointly across a population, and to model LD in a way that permits both efficient computation and accurate inference.
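To make the notion of correlation between variants concrete, here is a minimal sketch of the standard pairwise LD statistic r^2 for two biallelic variants, computed from phased haplotypes. The function name `ld_r_squared` and the 0/1 allele encoding are assumptions chosen for illustration; this is the textbook definition, not the LD model used in any of the methods described on this page.

```python
from typing import Sequence


def ld_r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation (r^2) between two biallelic variants.

    hap_a[i] and hap_b[i] are the alleles (0 = reference, 1 = alternate)
    carried by haplotype i at variant A and at variant B.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two non-empty allele vectors of equal length")

    p_a = sum(hap_a) / n  # frequency of the alternate allele at A
    p_b = sum(hap_b) / n  # frequency of the alternate allele at B
    # frequency of haplotypes carrying the alternate allele at both sites
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a variant is monomorphic")
    return d * d / denom
```

Two variants that always co-occur on the same haplotypes give r^2 = 1, and two variants that are statistically independent give r^2 = 0. Computing this for every pair of variants scales quadratically, which is one reason population-scale methods need more compact ways to model LD.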

We develop algorithms for modeling and discovering human genome variation, and for leveraging variation to study ancestry, relatedness and phenotype. Examples of our recent work include:

Discovering variation in repeat regions and segmental duplications that were previously "dark" to short-read technologies. We developed an algorithm that uses read cloud barcoding (produced, for example, by 10X GemCode or Moleculo) to map short reads to the correct source copy of a repeat and to call germline or somatic variants there. Using these techniques with 10X technology, we uncover roughly 200,000 previously dark variants in the human genome.
Ultrafast and accurate variant calling and genotyping in a population of genomes. We developed Reveel, an algorithm that models LD in a simple and effective way and jointly discovers and genotypes variants across a large population (thousands or more) of sequenced individuals. Reveel works well with low-coverage sequencing data, which allows more individuals to be sequenced for the same cost. We are applying Reveel to 1KGP and UK10K data with great results.
Inference of identity by descent (IBD) and relatedness. Our methods, starting with Parente and Carrot and followed by SpeeDB and Parente2, find IBD segments across a large database of individuals efficiently and with high accuracy, down to a resolution of 2 cM. See the Parente website.
Ancestry inference. How can we paint an individual's population ancestries along their chromosomes? In earlier work we introduced HAPAA, the first ancestry inference method that explicitly leverages the training populations' haplotypes rather than individual markers (similar to HAPMIX, which appeared subsequently). We later developed ALLOY, which combines the HAPAA model with compressed population models using BEAGLE.
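As a rough illustration of what an IBD segment search involves, the sketch below scans two phased haplotypes for maximal runs of identical alleles and keeps the runs spanning at least a minimum genetic length (2 cM by default, matching the resolution mentioned above). This is a deliberately naive identity-by-state baseline, not the model used by Parente, Carrot, SpeeDB or Parente2; the function name `shared_segments`, its parameters, and the input layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def shared_segments(
    hap1: Sequence[int],
    hap2: Sequence[int],
    cm_pos: Sequence[float],
    min_cm: float = 2.0,
) -> List[Tuple[int, int, float]]:
    """Return (start_index, end_index, length_cM) for each maximal run of
    markers at which the two haplotypes carry identical alleles and whose
    genetic length is at least min_cm.

    cm_pos[i] is the genetic map position (in cM) of marker i and must be
    sorted in increasing order.
    """
    if not (len(hap1) == len(hap2) == len(cm_pos)):
        raise ValueError("haplotypes and map positions must have equal length")

    segments: List[Tuple[int, int, float]] = []
    start = None  # index where the current matching run began, if any
    # Iterate one step past the end so that a run reaching the last marker
    # is closed and reported as well.
    for i in range(len(hap1) + 1):
        match = i < len(hap1) and hap1[i] == hap2[i]
        if match and start is None:
            start = i
        elif not match and start is not None:
            end = i - 1
            length = cm_pos[end] - cm_pos[start]
            if length >= min_cm:
                segments.append((start, end, length))
            start = None
    return segments
```

A single genotyping or phasing error breaks a run in this baseline, and short runs can match by chance because of LD, which is why practical methods use probabilistic models rather than exact matching. Comparing every pair of individuals this way is also quadratic in the size of the database.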

Sequencing and genotyping a large population.

Uncovering dark variants in the human genome.

Modeling LD and inferring relatedness.
