

Data of this scale has effectively outgrown the practical limits of first-generation GWAS data formats 4, 5 and new tools for storage, computation, and analysis are required. Studies incorporating millions of individuals with genotypes either directly observed or inferred at tens of millions of genetic variants are now becoming common 1– 3. The need to discover and dissect genetic associations at ever finer detail, and advances in the throughput of genotyping and sequencing technologies has driven rapid increases in the scale of genetic epidemiology studies. The UK Biobank is one of a number of projects that have used BGEN for release of imputed data, and we expect the format to continue to be widely implemented and used. To make using BGEN as easy as possible, we provide a detailed specification and a freely available reference implementation, and we leverage this by developing additional tools including an indexing tool (bgenix) and an R package (rbgen) that permits loading of BGEN-encoded data into the R statistical programming environment. We investigate the properties of this format using imputed data from the UK BiLEVE study, demonstrating both storage efficiency, and fast data loading performance on the order of hundreds of millions of imputed genotypes per second. Here we present a binary file format (the BGEN format) that can store both directly-typed and statistically imputed genotype data, and achieves substantial space savings by data compression and the use of an efficient representation for probabilities. Studies on this scale provide logistical and analytic challenges starting with the issue of efficiently storing, transmitting, and accessing underlying data.

The impact of modern technology on genetic epidemiology has been significant, with studies comprising millions of individuals assessed at tens of millions of genetic variants now becoming common.
