Heterogeneous Large Datasets Integration Using Bayesian Factor Regression

Authors: Alejandra Avalos-Pacheco, David Rossell and Richard S. Savage

Bayesian Analysis, Vol. 17, No. 1, 33–66, January 2022

Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. We provide a flexible methodology for sparse factor regression which is not limited to data with batch effects. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for batch effects can result in unreliable inference. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity in scenarios both with and without batch effects and is highly computationally efficient.
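To make the setting concrete, the following is a minimal illustrative sketch of the kind of data the abstract describes: observations driven by a few latent factors through sparse loadings, with each batch adding its own mean shift and its own noise scale. It is not the authors' model, priors, or EM algorithm. The function name `simulate_batch_factor_data`, all parameter names, the 30% loading density, and the chosen distributions are assumptions made purely for illustration.

```python
import numpy as np


def simulate_batch_factor_data(n=200, p=30, q=3, n_batches=2, seed=0):
    """Simulate data from a generic sparse factor model with batch effects.

    Hypothetical illustration only: the distributions and sparsity level
    below are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Batch label for each of the n individuals.
    batch = rng.integers(0, n_batches, size=n)

    # Sparse p x q loading matrix: roughly 30% of entries are non-zero.
    mask = rng.random((p, q)) < 0.3
    loadings = rng.normal(size=(p, q)) * mask

    # Latent factors, one q-vector per individual.
    factors = rng.normal(size=(n, q))

    # Batch effects on the mean (additive shift per batch and variable)
    # and on the variance (noise standard deviation per batch and variable).
    batch_shift = rng.normal(scale=2.0, size=(n_batches, p))
    batch_sd = rng.uniform(0.5, 2.0, size=(n_batches, p))

    noise = rng.normal(size=(n, p)) * batch_sd[batch]
    x = factors @ loadings.T + batch_shift[batch] + noise
    return x, batch, loadings


x, batch, loadings = simulate_batch_factor_data()

# Per-batch column means differ because of the additive batch shifts.
batch_means = np.vstack([x[batch == b].mean(axis=0) for b in np.unique(batch)])
print(x.shape, batch_means.shape)
```

Under these assumptions, pooling the batches without correction mixes the batch shifts and batch-specific noise into the estimated covariance, which is the distortion in mean and variance that the abstract refers to.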