Computationally efficient, exact covariate-adjusted multivariate methods for genetic analysis leveraging summary statistics from large biobanks

Jack M. Wolf, Martha Barnard, Xueting Xia, Nathan Ryder, Jason Westra, Nathan Tintle

January 2020

Abstract

The popularization of biobanks provides an unprecedented amount of genetic and phenotypic information that can be used to research the relationship between genetics and human health. Despite the opportunities these datasets provide, they also pose many problems associated with computational time and costs, data size and transfer, and privacy and security. Publishing summary statistics from these biobanks, and using them in a variety of downstream statistical analyses, alleviates many of these logistical problems. However, major questions remain about how to use summary statistics in all but the simplest downstream applications. Here, we present a novel approach that uses basic summary statistics (estimates from single marker regressions on single phenotypes) to evaluate more complex phenotypes using multivariate methods. In particular, we present a covariate-adjusted method for conducting principal component analysis (PCA) using only biobank summary statistics. Through simulation, we validate exact formulas for this method and provide a framework for estimation when specific summary statistics are not available. We apply our method to a real data set of fatty acid and genomic data.
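
To illustrate why single-marker, single-phenotype estimates can carry over to a composite phenotype, the sketch below relies on one standard fact: an ordinary least squares slope is linear in the response, so the slope for a weighted sum of phenotypes equals the same weighted sum of the per-phenotype slopes. This is not the paper's implementation and does not cover covariate adjustment or standard errors. The data are simulated, and every name in it (`simple_slope`, `effects`, `betas`, `cov_Y`) is a hypothetical choice made for this example.

```python
import numpy as np


def simple_slope(x, y):
    """OLS slope of y on x, with an intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


rng = np.random.default_rng(0)
n, m = 500, 3

# Simulated (hypothetical) genotypes for one SNP and three phenotypes.
x = rng.binomial(2, 0.3, size=n).astype(float)
effects = np.array([0.2, -0.1, 0.05])
Y = np.outer(x, effects) + rng.normal(size=(n, m))

# Stand-ins for summary statistics: one slope per phenotype, plus the
# phenotype covariance matrix.
betas = np.array([simple_slope(x, Y[:, k]) for k in range(m)])
cov_Y = np.cov(Y, rowvar=False)

# Loadings of the first principal component (eigh sorts eigenvalues in
# ascending order, so the last column has the largest eigenvalue).
eigvals, eigvecs = np.linalg.eigh(cov_Y)
w = eigvecs[:, -1]

# Slope for the PC score, computed from summary statistics alone.
beta_pc_summary = float(w @ betas)

# The same slope, computed from individual-level data for comparison.
pc_scores = (Y - Y.mean(axis=0)) @ w
beta_pc_direct = simple_slope(x, pc_scores)

print(np.isclose(beta_pc_summary, beta_pc_direct))
```

The two slopes agree up to floating-point error, which is the sense in which a result of this kind is exact rather than approximate. The individual-level data appear here only to simulate the inputs and to check the answer.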

Type

Journal article

Publication

Pacific Symposium on Biocomputing, 25(1)

Update (Feb 25, 2020): This paper was selected by the International Genetic Epidemiology Society (IGES) communication committee as the January IGES highlight!

The IGES communication January highlight paper is by @nathantintle and colleagues! This paper opens up new avenues for analyzing summary statistic data from biobanks using post-hoc covariate adjustment. Open Access: https://t.co/Ue2MNESHUZ
— IGES (@genepisociety) February 25, 2020

Statistical Genetics

Jack M. Wolf

Biostatistician and Educator

I’m a biostatistics PhD student at the University of Minnesota interested in causal inference, clinical trial design, and statistics and data science education.