Skip to content

Part 1: Method overview and manual testing

Variant calling is a genomic analysis method that aims to identify variations in a genome sequence relative to a reference genome. Here we are going to use tools and methods designed for calling short germline variants, i.e. SNPs and indels, in whole-genome sequencing data.

GATK pipeline

A full variant calling pipeline typically involves a lot of steps, including mapping to the reference (sometimes referred to as genome alignment) and variant filtering and prioritization. For simplicity, in this course we are going to focus on just the variant calling part.

Methods

We're going to show you two ways to apply variant calling to whole-genome sequencing samples to identify germline SNPs and indels. First we'll start with a simple per-sample approach that calls variants independently from each sample. Then we'll show you a more sophisticated joint calling approach that analyzes multiple samples together, producing more accurate and informative results.

Before we dive into writing any workflow code for either approach, we are going to try out the commands manually on some test data.

Dataset

We provide the following data and related resources:

  • A reference genome consisting of a small region of the human chromosome 20 (from hg19/b37) and its accessory files (index and sequence dictionary).
  • Three whole genome sequencing samples corresponding to a family trio (mother, father and son), which have been subset to a small slice of data on chromosome 20 to keep the file sizes small. This is Illumina short-read sequencing data that have already been mapped to the reference genome, provided in BAM format (Binary Alignment Map, a compressed version of SAM, Sequence Alignment Map).
  • A list of genomic intervals, i.e. coordinates on the genome where our samples have data suitable for calling variants, provided in BED format.

Software

The two main tools involved are Samtools, a widely used toolkit for manipulating sequence alignment files, and GATK (Genome Analysis Toolkit), a set of tools for variant discovery developed at the Broad Institute.

These tools are not installed in the GitHub Codespaces environment, so we'll use them via containers (see Hello Containers).

Note

Make sure you're in the nf4-science/genomics directory so that the last part of the path shown when you type pwd is genomics.


1. Per-sample variant calling

Per-sample variant calling processes each sample independently: the variant caller examines the sequencing data for one sample at a time and identifies positions where the sample differs from the reference.

In this section we test the two commands that make up the per-sample variant calling approach: indexing a BAM file with Samtools and calling variants with GATK HaplotypeCaller. These are the commands we'll wrap into a Nextflow workflow in Part 2 of this course.

  1. Generate an index file for a BAM input file using Samtools
  2. Run the GATK HaplotypeCaller on the indexed BAM file to generate per-sample variant calls in VCF (Variant Call Format)
BAMSamtools indexBAM indexIntervalsReference+ index & dictGATK HaplotypeCallerVCF + index

We start by testing the two commands on just one sample.

1.1. Index a BAM input file with Samtools

Index files are a common feature of bioinformatics file formats; they contain information about the structure of the main file that allows tools like GATK to access a subset of the data without having to read through the whole file. This is important because of how large these files can get.

BAM files are often provided without an index, so the first step in many analysis workflows is to generate one using samtools index.

We're going to pull down a Samtools container, spin it up interactively and run the samtools index command on one of the BAM files.

1.1.1. Pull the Samtools container

Run the docker pull command to download the Samtools container image:

docker pull community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464
Command output
1.20--b5dfbd93de237464: Pulling from library/samtools
6360b3717211: Pull complete
2ec3f7ad9b3c: Pull complete
7716ca300600: Pull complete
4f4fb700ef54: Pull complete
8c61d418774c: Pull complete
03dae77ff45c: Pull complete
aab7f787139d: Pull complete
4f4fb700ef54: Pull complete
837d55536720: Pull complete
897362c12ca7: Pull complete
3893cbe24e91: Pull complete
d1b61e94977b: Pull complete
c72ff66fb90f: Pull complete
0e0388f29b6d: Pull complete
Digest: sha256:bbfc45b4f228975bde86cba95e303dd94ecf2fdacea5bfb2e2f34b0d7b141e41
Status: Downloaded newer image for community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464
community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

If you haven't downloaded this image before, it may take a minute to complete. Once it's done, you have a local copy of the container image.

1.1.2. Spin up the Samtools container interactively

To run the container interactively, use docker run with the -it flags. The -v ./data:/data option mounts the local data directory into the container so the tools can access the input files.

docker run -it -v ./data:/data community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

Your prompt changes to something like (base) root@a1b2c3d4e5f6:/tmp#, indicating you are now inside the container. The data files are accessible under /data.

1.1.3. Run the indexing command

The Samtools documentation gives us the command line to run to index a BAM file.

We only need to provide the input file; the tool will automatically generate a name for the output by appending .bai to the input filename.

samtools index /data/bam/reads_mother.bam
Directory contents
data/bam/
├── reads_father.bam
├── reads_mother.bam
├── reads_mother.bam.bai
└── reads_son.bam

You should now see a file called reads_mother.bam.bai in the same directory as the original BAM input file.

1.1.4. Exit the Samtools container

To exit the container, type exit.

exit

Your prompt should now be back to what it was before you started the container.

1.2. Call variants with GATK HaplotypeCaller

We're going to pull down a GATK container, spin it up interactively and run the gatk HaplotypeCaller command on the BAM file we just indexed.

1.2.1. Pull the GATK container

Run the docker pull command to download the GATK container image:

docker pull community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867
Command output

Some layers show Already exists because they are shared with the Samtools container image we pulled earlier.

4.5.0.0--730ee8817e436867: Pulling from library/gatk4
6360b3717211: Already exists
2ec3f7ad9b3c: Already exists
7716ca300600: Already exists
4f4fb700ef54: Already exists
8c61d418774c: Already exists
03dae77ff45c: Already exists
aab7f787139d: Already exists
4f4fb700ef54: Already exists
837d55536720: Already exists
897362c12ca7: Already exists
3893cbe24e91: Already exists
d1b61e94977b: Already exists
e5c558f54708: Pull complete
087cce32d294: Pull complete
Digest: sha256:e33413b9100f834fcc62fd5bc9edc1e881e820aafa606e09301eac2303d8724b
Status: Downloaded newer image for community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867
community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

This should be faster than the first pull because the two container images share most of their layers.

1.2.2. Spin up the GATK container interactively

Spin up the GATK container interactively with the data directory mounted, just as we did for Samtools.

docker run -it -v ./data:/data community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

Your prompt changes to indicate you are now inside the GATK container.

1.2.3. Run the variant calling command

The GATK documentation gives us the command line to run to perform variant calling on a BAM file.

We need to provide the BAM input file (-I) as well as the reference genome (-R), a name for the output file (-O) and a list of genomic intervals to analyze (-L).

However, we don't need to specify the path to the index file; the tool will automatically look for it in the same directory, based on the established naming and co-location convention. The same applies to the reference genome's accessory files (index and sequence dictionary files, *.fai and *.dict).

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_mother.bam \
        -O reads_mother.vcf \
        -L /data/ref/intervals.bed
Command output

The tool produces verbose logging output. The highlighted lines confirm successful completion.

Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.vcf -L /data/ref/intervals.bed
00:27:50.687 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:27:50.854 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.858 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:27:50.858 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
00:27:50.858 INFO  HaplotypeCaller - Executing as root@a1fe8ff42d07 on Linux v6.10.14-linuxkit amd64
00:27:50.858 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:27:50.859 INFO  HaplotypeCaller - Start Date/Time: February 8, 2026 at 12:27:50 AM GMT
00:27:50.859 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.859 INFO  HaplotypeCaller - ------------------------------------------------------------
00:27:50.861 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
00:27:50.861 INFO  HaplotypeCaller - Picard Version: 3.1.1
00:27:50.861 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:27:50.862 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:27:50.863 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:27:50.864 INFO  HaplotypeCaller - Deflater: IntelDeflater
00:27:50.864 INFO  HaplotypeCaller - Inflater: IntelInflater
00:27:50.864 INFO  HaplotypeCaller - GCS max retries/reopens: 20
00:27:50.864 INFO  HaplotypeCaller - Requester pays: disabled
00:27:50.865 INFO  HaplotypeCaller - Initializing engine
00:27:50.991 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
00:27:51.016 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
00:27:51.029 INFO  HaplotypeCaller - Done initializing engine
00:27:51.040 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
00:27:51.042 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
00:27:51.042 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
00:27:51.046 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
00:27:51.063 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
00:27:51.085 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
00:27:51.086 INFO  IntelPairHmm - Available threads: 10
00:27:51.086 INFO  IntelPairHmm - Requested threads: 4
00:27:51.086 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
00:27:51.128 INFO  ProgressMeter - Starting traversal
00:27:51.136 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
00:27:51.882 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
00:27:52.969 INFO  HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
00:27:52.971 INFO  ProgressMeter - 20_10037292_10066351:13499              0.0                    35           1145.7
00:27:52.971 INFO  ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
00:27:52.976 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.003346916
00:27:52.976 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.045731709
00:27:52.977 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
00:27:52.981 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2026 at 12:27:52 AM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=203423744

The output file reads_mother.vcf is created inside your working directory in the container, so you won't see it in the VS Code file explorer unless you change the output file path. However, it's a small test file, so you can cat it to open it and view the contents. If you scroll all the way up to the start of the file, you'll find a header composed of many lines of metadata, followed by a list of variant calls, one per line.

File contents
reads_mother.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	reads_mother
20_10037292_10066351	3480	.	C	CT	503.03	.	AC=2;AF=1.00;AN=2;DP=23;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.95;SOR=1.179	GT:AD:DP:GQ:PL	1/1:0,18:18:54:517,54,0
20_10037292_10066351	3520	.	AT	A	609.03	.	AC=2;AF=1.00;AN=2;DP=18;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.83;SOR=0.693	GT:AD:DP:GQ:PL	1/1:0,18:18:54:623,54,0
20_10037292_10066351	3529	.	T	A	155.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-0.544;DP=21;ExcessHet=0.0000;FS=1.871;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=7.78;ReadPosRankSum=-1.158;SOR=1.034	GT:AD:DP:GQ:PL	0/1:12,8:20:99:163,0,328

Each line describes a possible variant identified in the sample's sequencing data. For guidance on interpreting VCF format, see this helpful article.

The output VCF file is accompanied by an index file called reads_mother.vcf.idx that was automatically created by GATK. It has the same function as the BAM index file, to allow tools to seek and retrieve subsets of data without loading in the entire file.

1.2.4. Exit the GATK container

To exit the container, type exit.

exit

Your prompt should be back to normal. That concludes the per-sample variant calling test.


2. Joint calling on a cohort

The variant calling approach we just used generates variant calls per sample. That's fine for looking at variants from each sample in isolation, but it yields limited information. It's often more interesting to look at how variant calls differ across multiple samples. GATK offers an alternative method called joint variant calling for this purpose.

Joint variant calling involves generating a special kind of variant output called GVCF (for Genomic VCF) for each sample, then combining the GVCF data from all the samples and running a 'joint genotyping' statistical analysis.

Joint analysis

What's special about a sample's GVCF is that it contains records summarizing sequence data statistics about all positions in the targeted area of the genome, not just the positions where the program found evidence of variation. This is critical for the joint genotyping calculation (further reading).

The GVCF is produced by GATK HaplotypeCaller, the same tool we just tested, with an additional parameter (-ERC GVCF). Combining the GVCFs is done with GATK GenomicsDBImport, which combines the per-sample calls into a data store (analogous to a database). The actual 'joint genotyping' analysis is then done with GATK GenotypeGVCFs.

Here we test the commands needed to generate GVCFs and run joint genotyping. These are the commands we'll wrap into a Nextflow workflow in Part 3 of this course.

  1. Generate an index file for each BAM input file using Samtools
  2. Run the GATK HaplotypeCaller on each BAM input file to generate a GVCF of per-sample genomic variant calls
  3. Collect all the GVCFs and combine them into a GenomicsDB data store
  4. Run joint genotyping on the combined GVCF data store to produce a cohort-level VCF
BAMSamtools indexBAM indexIntervalsReference+ index & dictGATK HaplotypeCallerGVCF + indexx multiple samplesGVCF + indexGVCF + indexGVCF + indexGenomicsDBvariant storeGATK GenomicsDBImportGATK GenotypeGVCFsJoint-calledVCFGVCF mode

We now need to test all of these commands, starting with indexing all three BAM files.

2.1. Index BAM files for all three samples

In the first section above, we only indexed one BAM file. Now we need to index all three samples so that GATK HaplotypeCaller can process them.

2.1.1. Spin up the Samtools container interactively

We already pulled the Samtools container image, so we can spin it up directly:

docker run -it -v ./data:/data community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

Your prompt changes to indicate you are inside the container, with the data directory mounted as before.

2.1.2. Run the indexing command on all three samples

Run the indexing command on each of the three BAM files:

samtools index /data/bam/reads_mother.bam
samtools index /data/bam/reads_father.bam
samtools index /data/bam/reads_son.bam
Directory contents
data/bam/
├── reads_father.bam
├── reads_father.bam.bai
├── reads_mother.bam
├── reads_mother.bam.bai
├── reads_son.bam
└── reads_son.bam.bai

This should produce the index files in the same directory as the corresponding BAM files.

2.1.3. Exit the Samtools container

To exit the container, type exit.

exit

Your prompt should be back to normal.

2.2. Generate GVCFs for all three samples

To run the joint genotyping step, we need GVCFs for all three samples.

2.2.1. Spin up the GATK container interactively

We already pulled the GATK container image earlier, so we can spin it up directly:

docker run -it -v ./data:/data community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

Your prompt changes to indicate you are inside the GATK container.

2.2.2. Run the variant calling command with the GVCF option

In order to produce a genomic VCF (GVCF), we add the -ERC GVCF option to the base command, which switches on the HaplotypeCaller's GVCF mode.

We also change the file extension for the output file from .vcf to .g.vcf. This is technically not a requirement, but it is a strongly recommended convention.

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_mother.bam \
        -O reads_mother.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.g.vcf -L /data/ref/intervals.bed -ERC GVCF
00:28:03.593 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:28:03.765 INFO  HaplotypeCaller - ------------------------------------------------------------
00:28:03.768 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:28:03.768 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
00:28:03.768 INFO  HaplotypeCaller - Executing as root@8515e5a0598e on Linux v6.10.14-linuxkit amd64
00:28:03.768 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:28:03.769 INFO  HaplotypeCaller - Start Date/Time: February 8, 2026 at 12:28:03 AM GMT
00:28:03.769 INFO  HaplotypeCaller - ------------------------------------------------------------
00:28:03.770 INFO  HaplotypeCaller - ------------------------------------------------------------
00:28:03.772 INFO  HaplotypeCaller - HTSJDK Version: 4.1.0
00:28:03.773 INFO  HaplotypeCaller - Picard Version: 3.1.1
00:28:03.773 INFO  HaplotypeCaller - Built for Spark Version: 3.5.0
00:28:03.773 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:28:03.773 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:28:03.773 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:28:03.774 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:28:03.774 INFO  HaplotypeCaller - Deflater: IntelDeflater
00:28:03.774 INFO  HaplotypeCaller - Inflater: IntelInflater
00:28:03.775 INFO  HaplotypeCaller - GCS max retries/reopens: 20
00:28:03.775 INFO  HaplotypeCaller - Requester pays: disabled
00:28:03.776 INFO  HaplotypeCaller - Initializing engine
00:28:03.896 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
00:28:03.919 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
00:28:03.934 INFO  HaplotypeCaller - Done initializing engine
00:28:03.935 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
00:28:03.943 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
00:28:03.945 INFO  NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
00:28:03.946 INFO  SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
00:28:03.955 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
00:28:03.956 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
00:28:03.972 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
00:28:03.993 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
00:28:03.994 INFO  IntelPairHmm - Available threads: 10
00:28:03.994 INFO  IntelPairHmm - Requested threads: 4
00:28:03.994 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
00:28:04.044 INFO  ProgressMeter - Starting traversal
00:28:04.070 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
00:28:04.874 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
00:28:06.535 INFO  HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
00:28:06.537 INFO  ProgressMeter - 20_10037292_10066351:13499              0.0                    35            851.6
00:28:06.538 INFO  ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
00:28:06.543 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.003648749
00:28:06.544 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.031498916
00:28:06.544 INFO  SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
00:28:06.547 INFO  HaplotypeCaller - Shutting down engine
[February 8, 2026 at 12:28:06 AM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=281018368

This creates the GVCF output file reads_mother.g.vcf in the current working directory in the container.

If you cat it to view the contents, you'll see it's much longer than the equivalent VCF we generated in section 1. You can't even scroll up to the start of the file, and most of the lines look quite different from what we saw in the VCF.

File contents
reads_mother.g.vcf
20_10037292_10066351    14714   .       T       <NON_REF>       .       .       END=14718       GT:DP:GQ:MIN_DP:PL       0/0:37:99:37:0,99,1192
20_10037292_10066351    14719   .       T       <NON_REF>       .       .       END=14719       GT:DP:GQ:MIN_DP:PL       0/0:36:82:36:0,82,1087
20_10037292_10066351    14720   .       T       <NON_REF>       .       .       END=14737       GT:DP:GQ:MIN_DP:PL       0/0:42:99:37:0,100,1160

These represent non-variant regions where the variant caller found no evidence of variation, so it captured some statistics describing its level of confidence in the absence of variation. This makes it possible to distinguish between two very different case figures: (1) there is good quality data showing that the sample is homozygous-reference, and (2) there is not enough good data available to make a determination either way.

In a GVCF, there are typically lots of such non-variant lines, with a smaller number of variant records sprinkled among them. Try running head -176 on the GVCF to load in just the first 176 lines of the file to find an actual variant call.

File contents
reads_mother.g.vcf
20_10037292_10066351    3479    .       T       <NON_REF>       .       .       END=3479        GT:DP:GQ:MIN_DP:PL       0/0:34:36:34:0,36,906
20_10037292_10066351    3480    .       C       CT,<NON_REF>    503.03  .       DP=23;ExcessHet=0.0000;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=82800,23    GT:AD:DP:GQ:PL:SB       1/1:0,18,0:18:54:517,54,0,517,54,517:0,0,7,11
20_10037292_10066351    3481    .       T       <NON_REF>       .       .       END=3481        GT:DP:GQ:MIN_DP:PL       0/0:21:51:21:0,51,765

The second line shows the first variant record in the file, which corresponds to the first variant in the VCF file we looked at earlier.

Just like the original VCF was, the output GVCF file is also accompanied by an index file, called reads_mother.g.vcf.idx.

2.2.3. Repeat the process on the other two samples

Generate GVCFs for the remaining two samples by running the commands below, one after another.

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_father.bam \
        -O reads_father.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF
gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_son.bam \
        -O reads_son.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF

Once this completes, you should have three files ending in .g.vcf in your current directory (one per sample) and their respective index files ending in .g.vcf.idx.

But don't exit the container! We're going to use the same container in the next step.

2.3. Run joint genotyping

Now that we have all the GVCFs, we can try out the joint genotyping approach to generating variant calls for a cohort of samples. It's a two-step method that consists of combining the data from all the GVCFs into a data store, then running the joint genotyping analysis proper to generate the final VCF of joint-called variants.

2.3.1. Combine all the per-sample GVCFs

This first step uses another GATK tool, called GenomicsDBImport, to combine the data from all the GVCFs into a GenomicsDB data store.

gatk GenomicsDBImport \
    -V reads_mother.g.vcf \
    -V reads_father.g.vcf \
    -V reads_son.g.vcf \
    -L /data/ref/intervals.bed \
    --genomicsdb-workspace-path family_trio_gdb
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenomicsDBImport -V reads_mother.g.vcf -V reads_father.g.vcf -V reads_son.g.vcf -L /data/ref/intervals.bed --genomicsdb-workspace-path family_trio_gdb
00:28:20.772 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:28:20.914 INFO  GenomicsDBImport - ------------------------------------------------------------
00:28:20.917 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:28:20.917 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
00:28:20.917 INFO  GenomicsDBImport - Executing as root@8515e5a0598e on Linux v6.10.14-linuxkit amd64
00:28:20.917 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:28:20.918 INFO  GenomicsDBImport - Start Date/Time: February 8, 2026 at 12:28:20 AM GMT
00:28:20.918 INFO  GenomicsDBImport - ------------------------------------------------------------
00:28:20.918 INFO  GenomicsDBImport - ------------------------------------------------------------
00:28:20.920 INFO  GenomicsDBImport - HTSJDK Version: 4.1.0
00:28:20.921 INFO  GenomicsDBImport - Picard Version: 3.1.1
00:28:20.921 INFO  GenomicsDBImport - Built for Spark Version: 3.5.0
00:28:20.922 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:28:20.923 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:28:20.923 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:28:20.923 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:28:20.923 INFO  GenomicsDBImport - Deflater: IntelDeflater
00:28:20.924 INFO  GenomicsDBImport - Inflater: IntelInflater
00:28:20.924 INFO  GenomicsDBImport - GCS max retries/reopens: 20
00:28:20.924 INFO  GenomicsDBImport - Requester pays: disabled
00:28:20.925 INFO  GenomicsDBImport - Initializing engine
00:28:21.144 INFO  FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
00:28:21.152 INFO  IntervalArgumentCollection - Processing 6369 bp from intervals
00:28:21.157 INFO  GenomicsDBImport - Done initializing engine
00:28:21.287 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
00:28:21.290 INFO  GenomicsDBImport - Vid Map JSON file will be written to /tmp/family_trio_gdb/vidmap.json
00:28:21.290 INFO  GenomicsDBImport - Callset Map JSON file will be written to /tmp/family_trio_gdb/callset.json
00:28:21.291 INFO  GenomicsDBImport - Complete VCF Header will be written to /tmp/family_trio_gdb/vcfheader.vcf
00:28:21.291 INFO  GenomicsDBImport - Importing to workspace - /tmp/family_trio_gdb
00:28:21.453 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
00:28:21.757 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
00:28:21.859 INFO  GenomicsDBImport - Importing batch 1 with 3 samples
00:28:21.979 INFO  GenomicsDBImport - Done importing batch 1/1
00:28:21.988 INFO  GenomicsDBImport - Import completed!
00:28:21.988 INFO  GenomicsDBImport - Shutting down engine
[February 8, 2026 at 12:28:21 AM GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=305135616

The output of this step is effectively a directory containing a set of further nested directories holding the combined variant data in the form of multiple different files. You can poke around it but you'll quickly see this data store format is not intended to be read directly by humans.

Note

GATK includes tools that make it possible to inspect and extract variant call data from the data store as needed.

2.3.2. Run the joint genotyping analysis proper

This second step uses yet another GATK tool, called GenotypeGVCFs, to recalculate variant statistics and individual genotypes in light of the data available across all samples in the cohort.

gatk GenotypeGVCFs \
    -R /data/ref/ref.fasta \
    -V gendb://family_trio_gdb \
    -O family_trio.vcf
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenotypeGVCFs -R /data/ref/ref.fasta -V gendb://family_trio_gdb -O family_trio.vcf
00:28:24.625 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:28:24.798 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:28:24.801 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:28:24.801 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
00:28:24.801 INFO  GenotypeGVCFs - Executing as root@8515e5a0598e on Linux v6.10.14-linuxkit amd64
00:28:24.801 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:28:24.802 INFO  GenotypeGVCFs - Start Date/Time: February 8, 2026 at 12:28:24 AM GMT
00:28:24.802 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:28:24.802 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:28:24.804 INFO  GenotypeGVCFs - HTSJDK Version: 4.1.0
00:28:24.804 INFO  GenotypeGVCFs - Picard Version: 3.1.1
00:28:24.804 INFO  GenotypeGVCFs - Built for Spark Version: 3.5.0
00:28:24.805 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:28:24.805 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:28:24.806 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:28:24.806 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:28:24.806 INFO  GenotypeGVCFs - Deflater: IntelDeflater
00:28:24.806 INFO  GenotypeGVCFs - Inflater: IntelInflater
00:28:24.807 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
00:28:24.807 INFO  GenotypeGVCFs - Requester pays: disabled
00:28:24.808 INFO  GenotypeGVCFs - Initializing engine
00:28:25.023 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
00:28:25.081 INFO  NativeGenomicsDB - pid=162 tid=163 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
00:28:25.082 INFO  NativeGenomicsDB - pid=162 tid=163 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
00:28:25.082 INFO  NativeGenomicsDB - pid=162 tid=163 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
00:28:25.109 INFO  GenotypeGVCFs - Done initializing engine
00:28:25.184 INFO  ProgressMeter - Starting traversal
00:28:25.187 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
00:28:25.446 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.15034835899999904,Cpu time(s),0.1355218420000006
00:28:26.189 INFO  ProgressMeter - 20_10037292_10066351:13953              0.0                  3390         202994.0
00:28:26.190 INFO  ProgressMeter - Traversal complete. Processed 3390 total variants in 0.0 minutes.
00:28:26.194 INFO  GenotypeGVCFs - Shutting down engine
[February 8, 2026 at 12:28:26 AM GMT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=296747008

This creates the VCF output file family_trio.vcf in the current working directory in the container. It's another reasonably small file so you can cat this file to view its contents, and scroll up to find the first few variant lines.

File contents
family_trio.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  reads_father    reads_mother    reads_son
20_10037292_10066351    3480    .       C       CT      1625.89 .       AC=5;AF=0.833;AN=6;BaseQRankSum=0.220;DP=85;ExcessHet=0.0000;FS=2.476;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=21.68;ReadPosRankSum=-1.147e+00;SOR=0.487    GT:AD:DP:GQ:PL  0/1:15,16:31:99:367,0,375       1/1:0,18:18:54:517,54,0 1/1:0,26:26:78:756,78,0
20_10037292_10066351    3520    .       AT      A       1678.89 .       AC=5;AF=0.833;AN=6;BaseQRankSum=1.03;DP=80;ExcessHet=0.0000;FS=2.290;MLEAC=5;MLEAF=0.833;MQ=60.00;MQRankSum=0.00;QD=22.39;ReadPosRankSum=0.701;SOR=0.730 GT:AD:DP:GQ:PL   0/1:18,13:31:99:296,0,424       1/1:0,18:18:54:623,54,0 1/1:0,26:26:78:774,78,0
20_10037292_10066351    3529    .       T       A       154.29  .       AC=1;AF=0.167;AN=6;BaseQRankSum=-5.440e-01;DP=104;ExcessHet=0.0000;FS=1.871;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=7.71;ReadPosRankSum=-1.158e+00;SOR=1.034       GT:AD:DP:GQ:PL  0/0:44,0:44:99:0,112,1347       0/1:12,8:20:99:163,0,328        0/0:39,0:39:99:0,105,1194

This looks similar to the VCF we generated earlier, except this time we have genotype-level information for all three samples. The last three columns in the file are the genotype blocks for the samples, listed in alphabetical order.

If we look at the genotypes called for our test family trio for the very first variant, we see that the father is heterozygous-variant (0/1), and the mother and son are both homozygous-variant (1/1).

That is ultimately the information we're looking to extract from the dataset!

2.3.3. Exit the GATK container

To exit the container, type exit.

exit

Your prompt should be back to normal. That concludes the manual testing of the variant calling commands.


Takeaway

You know how to test the Samtools indexing and GATK variant calling commands in their respective containers, including how to generate GVCFs and run joint genotyping on multiple samples.

What's next?

Learn how to wrap those same commands into workflows that use containers to execute the work.