Given that Population Calling, also known as Group Calling or Joint Genotyping, has proven advantages, one wonders why it is not used in every research project dealing with cohorts of samples.
The answer is simple: pain! Or, to put it in other words, the resource requirements in terms of compute, storage, network bandwidth, manageability and competence. In short, the budget has been out of reach for all except a ‘happy’ few who don’t mind taking aspirin.
The root cause of this low adoption is the lack of a sound data processing approach that makes it easy and effective. I would like to address three areas: 1. Resource Usage, 2. Consistency and 3. Access.
1. Resource Usage
The method deals with hundreds, thousands and even 10K+ samples and is reasonably simple. You take the reads from all samples and, for each sample, you filter noise and check whether you can call a variant. Then you check how many samples ‘agree’ on a call and, depending on the result, you increase or decrease the sensitivity of the caller for that specific call and run the caller again on all samples. The final result is reported in a multi-sample VCF. The model requires the data to be processed twice, since we don’t know the ‘constellation’ of the group up front. Here is where the footprint of the BAM and the cost of variant calling kick in hard. If you thought single-sample variant calling was expensive, brace yourself for the multi-sample approach.
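To make the two-pass idea concrete, here is a minimal sketch in Python. The caller is a toy stand-in (one evidence score per candidate site), and names such as `call_variants` and `population_call` are mine, not any real tool's API; the point is only to show where the cohort-wide consensus feeds back into a second pass over every sample.

```python
from collections import Counter

def call_variants(sample_evidence, sensitivity_at):
    """Toy stand-in for a real caller: a site is reported when its evidence
    score clears the calling threshold (higher sensitivity = lower threshold)."""
    return {site for site, score in sample_evidence.items()
            if score >= 1.0 - sensitivity_at(site)}

def population_call(cohort, base=0.5, boost=0.3):
    """Two-pass sketch: call every sample, measure per-site agreement,
    then re-call every sample with consensus-adjusted sensitivity."""
    # Pass 1: every sample is called at the same baseline sensitivity.
    first = {s: call_variants(ev, lambda _: base) for s, ev in cohort.items()}

    # Count how many samples 'agree' on each candidate site.
    support = Counter(site for calls in first.values() for site in calls)

    # Pass 2: sensitivity rises at well-supported sites and falls elsewhere.
    def adjusted(site):
        return base + boost * (support[site] / len(cohort) - 0.5)

    return {s: call_variants(ev, adjusted) for s, ev in cohort.items()}

# Three samples with evidence scores per genomic position.
cohort = {
    "S1": {1000: 0.55, 2000: 0.40},
    "S2": {1000: 0.60, 3000: 0.45},
    "S3": {1000: 0.52, 2000: 0.35},
}
print(population_call(cohort))  # only position 1000 survives the consensus pass
```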
That double pass is exactly the reason why the concept of Joint Genotyping was developed. With Joint Genotyping the Variant Call phase is done only once. The variant caller produces a VCF file, which holds not only the variants but also the intermediate reference bases. The latter is needed to distinguish genomic areas with reference calls from areas with no coverage. For efficiency, the reference bases in the VCF are clustered. The notation, still standard VCF, is known as gVCF or genomic VCF.
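To illustrate what clustering the reference bases means, here is a small sketch, not the gVCF specification or any caller's actual banding rules, that collapses consecutive reference calls into block records with a start and end coordinate, the way a gVCF represents reference-called regions compactly while uncovered regions simply have no record at all.

```python
def cluster_reference_blocks(calls):
    """Collapse runs of consecutive reference calls into (start, end) blocks,
    keeping variant positions as individual records, gVCF-style."""
    records, block_start, prev = [], None, None
    for pos, is_variant in sorted(calls.items()):
        if is_variant:
            if block_start is not None:                      # close any open reference block
                records.append(("REF_BLOCK", block_start, prev))
                block_start = None
            records.append(("VARIANT", pos, pos))
        else:
            if block_start is not None and pos != prev + 1:  # coverage gap: close and restart
                records.append(("REF_BLOCK", block_start, prev))
                block_start = None
            if block_start is None:
                block_start = pos
        prev = pos
    if block_start is not None:
        records.append(("REF_BLOCK", block_start, prev))
    return records

# Positions 100-104 are reference calls, 105 carries a variant, 106-107 are reference again.
calls = {**{p: False for p in range(100, 105)}, 105: True, 106: False, 107: False}
print(cluster_reference_blocks(calls))
# [('REF_BLOCK', 100, 104), ('VARIANT', 105, 105), ('REF_BLOCK', 106, 107)]
```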
The approach comes at a cost. The size of a traditional BAM is around 100GB. There are cases where a single-sample genomic VCF gets close to 70GB, because the first pass must be done in high-sensitivity mode. So not having to do a true two-pass operation on the BAM comes at the expense of additional storage. The gVCF files are finally used to do the Joint Genotyping, and the result is presented in a multi-column VCF file.
A second advantage of the gVCF concept shows when samples are added to an existing cohort. In that case the genomic VCFs for existing samples do not have to be regenerated. However, the Joint Genotyping pass must be redone for all samples, because the consensus of the new group may be different.
If we now take a step back, there are a number of obvious and simple measures that would remove ALL the pain from the above process and bring the benefits of population calling to all researchers working with cohorts:
a) Reduce the footprint of the aligned reads file. GAR (GENALICE Aligned Reads) format uses 5GB for a 37x human WGS sample.
b) Create a lightweight binary format for multi-column (g)VCF. GVM (GENALICE Variant Map) format uses 200MB for a 37x human WGS sample.
c) Apply consensus based call enhancement on output.
The approach yields a processing time of 6 minutes per sample on a single node of commodity hardware, requires no intermediate files, and leaves a final storage footprint of 200MB per sample.
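A back-of-the-envelope comparison, using only the per-sample figures quoted in this article (roughly 100GB per BAM, up to ~70GB per genomic VCF, 5GB per GAR file and 200MB per sample in a GVM) rather than any independent benchmark:

```python
# Per-sample footprints in GB, as quoted in the text above.
BAM_GB, GVCF_GB = 100, 70     # classic pipeline: aligned reads + genomic VCF per sample
GAR_GB, GVM_GB = 5, 0.2       # GAR aligned reads + per-sample share of the GVM

def cohort_footprint_gb(n_samples):
    classic = n_samples * (BAM_GB + GVCF_GB)   # gVCFs kept around for future joint genotyping
    compact = n_samples * (GAR_GB + GVM_GB)
    return classic, compact

for n in (100, 1_000, 10_000):
    classic, compact = cohort_footprint_gb(n)
    print(f"{n} samples: classic ~{classic / 1000:.1f} TB, compact ~{compact / 1000:.1f} TB")
# 100 samples: classic ~17.0 TB, compact ~0.5 TB
# 1000 samples: classic ~170.0 TB, compact ~5.2 TB
# 10000 samples: classic ~1700.0 TB, compact ~52.0 TB
```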
2. Consistency
As cohorts grow over time, and people want to start their analysis before the cohort has reached its full population, we need to guarantee consistent results in a changing environment.
This is actually a non-issue for smaller cohorts, as the cost of making a full copy of a Variant Map (a multi-column VCF or GVM) is very low: a GVM of 100 samples takes 20GB and one of 1,000 samples takes 200GB. Once we enter cohort sizes of 10K samples or more, 2TB and beyond warrants a bit more attention. Even though institutes have high-speed, petabyte-scale storage installed, it makes sense to consider the speed at which a cohort grows and the need to work on its different growth plateaus.
If you, for example, have a target cohort of 5,000 samples that grows at a rate of 200 per month, it will take 25 months to populate the cohort. There is value in having early access, yet that access must be consistent. One could take a structural, time-based approach and make a copy of the Variant Map every 6 months. The four copies of the data at different sizes will, in this example, result in roughly 2.5x the footprint. When the timeline is extended and the number of ‘snapshots’ increases, this multiplier goes up.
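One plausible reading of that roughly 2.5x figure: with 200 samples a month and a snapshot every 6 months, full copies are taken at 1,200, 2,400, 3,600 and 4,800 samples, and at 200MB per sample those four copies together weigh about 2.4x the final 5,000-sample map. A sketch of that arithmetic, under exactly those assumptions:

```python
SAMPLES_PER_MONTH = 200
SNAPSHOT_EVERY_MONTHS = 6
FINAL_SAMPLES = 5_000
GVM_GB_PER_SAMPLE = 0.2                                       # 200MB per sample, as quoted above

months_to_fill = FINAL_SAMPLES // SAMPLES_PER_MONTH           # 25 months to populate the cohort
snapshot_sizes_gb = [
    m * SAMPLES_PER_MONTH * GVM_GB_PER_SAMPLE
    for m in range(SNAPSHOT_EVERY_MONTHS, months_to_fill, SNAPSHOT_EVERY_MONTHS)
]                                                             # copies at months 6, 12, 18 and 24
final_size_gb = FINAL_SAMPLES * GVM_GB_PER_SAMPLE             # 1,000 GB for the live Variant Map

print(snapshot_sizes_gb)                                      # [240.0, 480.0, 720.0, 960.0]
print(sum(snapshot_sizes_gb) / final_size_gb)                 # 2.4x the final footprint in copies
```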
The metric is harsh and climbs quickly with time and snapshot frequency, even when older snapshots are purged, because the real pain sits in the tail of the process: the latest snapshots are also the largest. The storage multiplier is driven solely by the service level you offer, i.e. how frequently snapshots are taken and when they are retired. As such, the multiplier applies to every cohort governed by that service level, so in this case size does not matter.
The concepts of ‘view’ and ‘snapshot’, which are common in database environments, are ideally suited to solve this problem. Source control has the similar concepts of ‘tag’ and ‘branch’. Both provide consistent access to data without making a full copy. Such a ‘snapshot’ exists only at the metadata and aggregate levels; the heavy load of detail data is all shared.
The binary GVM format supports this transactional concept outside of a database system. It turns the storage multiplier into a constant amount per snapshot, which ensures efficiency while still allowing consistent access to the cohort at each increment.
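A minimal sketch of that transactional idea, assuming nothing about the real GVM internals: a snapshot records only which samples it covers plus its own aggregates, while the heavy per-sample detail lives once in a shared store and is never copied.

```python
from dataclasses import dataclass, field

@dataclass
class VariantStore:
    """Shared detail data: one entry per sample, written once, never duplicated."""
    samples: dict = field(default_factory=dict)   # sample_id -> per-sample call data

    def add_sample(self, sample_id, calls):
        self.samples[sample_id] = calls

@dataclass(frozen=True)
class Snapshot:
    """A consistent 'view' of the cohort: metadata and aggregates only."""
    name: str
    sample_ids: tuple        # which increments this snapshot covers
    aggregates: dict         # e.g. per-site carrier counts at snapshot time

def take_snapshot(name, store):
    counts = {}
    for calls in store.samples.values():
        for site in calls:
            counts[site] = counts.get(site, 0) + 1
    return Snapshot(name, tuple(store.samples), counts)

store = VariantStore()
store.add_sample("S1", {1000, 2000})
snap_q1 = take_snapshot("2024-Q1", store)       # consistent view of the first increment
store.add_sample("S2", {1000})                  # later growth does not disturb snap_q1
print(snap_q1.sample_ids, snap_q1.aggregates)   # ('S1',) {1000: 1, 2000: 1}
```

The cost of a snapshot here is the metadata and aggregate record, not a copy of the detail data, which is what turns the storage multiplier into a constant amount per snapshot.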
3. Access
A Variant Map is a very valuable repository, and its basic structure is reasonably simple. At each genomic position we list all mutations found across all samples. Then, for that position, we list for each sample the mutations that were reported for it. This gives a grid, where the crosshair of sample and position may contain bare or very rich data, depending on the flavour of the tool you use.
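Purely as an illustration of that grid (this is not the GVM layout), nested maps from position to sample to cell data capture the idea that a crosshair can be bare or rich:

```python
# position -> sample -> cell data; a cell may be bare (just a genotype)
# or rich (depth, quality, phasing, ...), depending on the tool.
variant_map = {
    ("chr1", 1_014_143): {
        "S1": {"GT": "0/1"},                         # bare cell
        "S2": {"GT": "1/1", "DP": 34, "GQ": 99},     # richer cell
    },
    ("chr1", 2_304_201): {
        "S1": {"GT": "0/1", "DP": 21},
    },
}

# The row across all samples at one position corresponds to one line of a
# multi-column VCF; the column down all positions for one sample is that
# sample's call set.
print(sorted(variant_map[("chr1", 1_014_143)]))   # ['S1', 'S2']
```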
The data, by tradition, is organized in VCF format. A GVM, or a selection of it, can be converted to VCF to stay compatible with common standards. VCF, however, is a monolithic (quite often plain-text) object, which is hard to analyze. Storing the data in a database, which is designed to work with structured data, would make the query capabilities much better.
I think it is still too early to store large cohorts in a database. Just do the math: between 1K and 5K samples we will collect on the order of 20 million mutations, and 20 million mutations times 5K samples amounts to 100 billion data points. Organizing and indexing this data is close to impossible; traditional transaction, space management and index maintenance models are not designed to deal with the volume, gradual influx and search requirements of this data.
I do believe that database systems, i.e. SQL, are the proper endpoint for integrating genotype and phenotype data analysis. I also think this will be a slow process, in which we first organize the data properly outside of the database, provide a proper API to query it, and build the access methods and access path optimization strategies. Only then can such an engine be blended into a database system, making the data source ‘language’ part of SQL.
At this point in time, we are still working on organizing the data for simple and efficient access, which means ‘by sample’ and ‘by position’. The binary GVM format provides such a dual access method by storing the crosshairs both by sample and by position. This, together with embedded aggregates, is a first approach to accelerate simple operations such as range, sample and combined data extraction, and clearly a first step towards more complex query capabilities.
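A hedged sketch of that dual access method, again illustrative rather than the binary GVM layout: the same call records are reachable through a by-position and a by-sample index, and a small embedded aggregate (a per-position carrier count, chosen here as one plausible example) lets range queries filter positions without touching the detail data.

```python
from collections import defaultdict
from bisect import bisect_left, bisect_right

class DualIndexMap:
    """Store each call once, index it twice: by position and by sample."""
    def __init__(self):
        self.by_position = defaultdict(dict)   # pos -> {sample: record}
        self.by_sample = defaultdict(dict)     # sample -> {pos: record}
        self.carriers = defaultdict(int)       # embedded aggregate per position

    def add(self, sample, pos, record):
        self.by_position[pos][sample] = record
        self.by_sample[sample][pos] = record
        self.carriers[pos] += 1

    def sample_calls(self, sample):
        """The 'by sample' access path."""
        return self.by_sample[sample]

    def range_query(self, start, end, min_carriers=1):
        """The 'by position' access path; the aggregate prunes positions cheaply."""
        positions = sorted(self.by_position)
        lo, hi = bisect_left(positions, start), bisect_right(positions, end)
        return {p: self.by_position[p] for p in positions[lo:hi]
                if self.carriers[p] >= min_carriers}

gvm = DualIndexMap()
gvm.add("S1", 1000, {"GT": "0/1"})
gvm.add("S2", 1000, {"GT": "1/1"})
gvm.add("S1", 5000, {"GT": "0/1"})
print(gvm.sample_calls("S1"))                    # calls for one sample across positions
print(gvm.range_query(0, 2000, min_carriers=2))  # only position 1000 qualifies
```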
To sum it up
The pen turns out to carry more ink than I anticipated when starting this write-up, but I hope the thinking is clear. To exploit Population Calling efficiently we must deal with footprint (storage) reduction. We need a simple, transaction-driven cohort growth model to provide consistent, time-based access. And we should take a stepwise approach to move the format that holds the genetic gold, and its access methods, to the common place for structured data, which is still the relational database. Providing genetic access via SQL will accelerate application development, leading to global use of genetic data in all facets of society, even areas we never thought of. When that happens, cost models will transition from license, to pay-per-use, to subscription and community value through targeted advertising. We still have a long way to go. Yet we stay on course towards economical health in a healthy economy.