Part 3: Joint calling on a cohort¶
In Part 2, you built a per-sample variant calling pipeline that processed each sample's data independently. Now we're going to extend it to implement joint variant calling, as covered in Part 1.
Assignment¶
In this part of the course, we're going to extend the workflow to do the following:
- Generate an index file for each BAM input file using Samtools
- Run the GATK HaplotypeCaller on each BAM input file to generate a GVCF of per-sample genomic variant calls
- Collect all the GVCFs and combine them into a GenomicsDB data store
- Run joint genotyping on the combined GVCF data store to produce a cohort-level VCF
This part builds directly on the workflow produced by Part 2.
How to begin from this section
This section of the course assumes you have completed Part 2: Per-sample variant calling and have a working genomics.nf pipeline.
If you did not complete Part 2 or want to start fresh for this part, you can use the Part 2 solution as your starting point.
Run these commands from inside the nf4-science/genomics/ directory:
cp solutions/part2/genomics-2.nf genomics.nf
cp solutions/part2/nextflow.config .
cp solutions/part2/modules/* modules/
This gives you a complete per-sample variant calling workflow. You can test that it runs successfully by running the following command:
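nextflow run genomics.nf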
Lesson plan¶
We've broken this down into two steps:
- Modify the per-sample variant calling step to produce a GVCF. This covers updating process commands and outputs.
- Add a joint genotyping step that combines and genotypes the per-sample GVCFs.
This introduces the collect() operator, Groovy closures for command-line construction, and multi-command processes.
Note
Make sure you're in the correct working directory:
cd /workspaces/training/nf4-science/genomics
1. Modify the per-sample variant calling step to produce a GVCF¶
The pipeline from Part 2 produces VCF files, but joint calling requires GVCF files. We need to switch on the GVCF variant calling mode and update the output file extension.
Recall the GVCF variant calling command from Part 1:
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_mother.bam \
-O reads_mother.g.vcf \
-L /data/ref/intervals.bed \
-ERC GVCF
Compared to the base HaplotypeCaller command we wrapped in Part 2, the differences are the -ERC GVCF parameter and the .g.vcf output extension.
1.1. Tell HaplotypeCaller to emit a GVCF and update the output extension¶
Open the modules/gatk_haplotypecaller.nf module file to make two changes:
- Add the -ERC GVCF parameter to the GATK HaplotypeCaller command;
- Update the output file path to use the corresponding .g.vcf extension, as per GATK convention.
Make sure you add a backslash (\) at the end of the previous line when you add -ERC GVCF.
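After the change, the command in the script: block should look something like this (a sketch assuming the Part 2 input names input_bam, ref_fasta, and interval_list):
gatk HaplotypeCaller \
    -R ${ref_fasta} \
    -I ${input_bam} \
    -O ${input_bam}.g.vcf \
    -L ${interval_list} \
    -ERC GVCF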
We also need to update the process output: block to match: since the command now writes .g.vcf files instead of .vcf, the declared output paths must reflect the same change.
1.2. Update the output file extension in the process outputs block¶
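The updated output: block might look like this (a sketch assuming the named outputs vcf and idx, which are referenced again later in this section):
output:
path "${input_bam}.g.vcf", emit: vcf
path "${input_bam}.g.vcf.idx", emit: idx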
We also need to update the workflow's publish and output configuration to reflect the new GVCF outputs.
1.3. Update the publish targets for the new GVCF outputs¶
Since we're now producing GVCFs instead of VCFs, we should update the workflow's publish: section to use more descriptive names.
We'll also organize the GVCF files into their own subdirectory for clarity.
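For example, the GVCF lines in the publish: section might become (a sketch; the target names gvcfs and gvcf_idxs are assumptions):
publish:
gvcfs = GATK_HAPLOTYPECALLER.out.vcf
gvcf_idxs = GATK_HAPLOTYPECALLER.out.idx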
Now update the output block to match.
1.4. Update the output block for the new directory structure¶
We also need to update the output block to put the GVCF files in a gvcf subdirectory.
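A sketch of the corresponding entries, assuming the target names above:
output {
    gvcfs {
        path 'gvcf'
    }
    gvcf_idxs {
        path 'gvcf'
    }
}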
With the module, publish targets, and output block all updated, we can test the changes.
1.5. Run the pipeline¶
Run the workflow to verify the changes work.
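nextflow run genomics.nf -resume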
Command output
The Nextflow output looks the same as before, but the .g.vcf files and their index files are now organized in subdirectories.
Directory contents (symlinks shortened)
results/
├── gvcf/
│ ├── reads_father.bam.g.vcf -> */27/0d7eb9*/reads_father.bam.g.vcf
│ ├── reads_father.bam.g.vcf.idx -> */27/0d7eb9*/reads_father.bam.g.vcf.idx
│ ├── reads_mother.bam.g.vcf -> */e4/4ed55e*/reads_mother.bam.g.vcf
│ ├── reads_mother.bam.g.vcf.idx -> */e4/4ed55e*/reads_mother.bam.g.vcf.idx
│ ├── reads_son.bam.g.vcf -> */08/e95962*/reads_son.bam.g.vcf
│ └── reads_son.bam.g.vcf.idx -> */08/e95962*/reads_son.bam.g.vcf.idx
└── indexed_bam/
├── reads_father.bam -> */9a/c7a873*/reads_father.bam
├── reads_father.bam.bai -> */9a/c7a873*/reads_father.bam.bai
├── reads_mother.bam -> */f1/8d8486*/reads_mother.bam
├── reads_mother.bam.bai -> */f1/8d8486*/reads_mother.bam.bai
├── reads_son.bam -> */cc/fbc705*/reads_son.bam
└── reads_son.bam.bai -> */cc/fbc705*/reads_son.bam.bai
If you open one of the GVCF files and scroll through it, you can verify that GATK HaplotypeCaller produced GVCF files as requested.
Takeaway¶
When you change a tool command's output filename, the process output: block and publish/output configuration must be updated to match.
What's next?¶
Learn to collect the contents of a channel and pass them on to the next process as a single input.
2. Add a joint genotyping step¶
We now need to collect the per-sample GVCFs, combine them into a GenomicsDB data store, and run joint genotyping to produce a cohort-level VCF.
As covered in Part 1, this is a two-tool operation: GenomicsDBImport combines the GVCFs, then GenotypeGVCFs produces the final variant calls.
We'll wrap both tools in a single process called GATK_JOINTGENOTYPING.
Recall the two commands from Part 1:
gatk GenomicsDBImport \
-V reads_mother.g.vcf \
-V reads_father.g.vcf \
-V reads_son.g.vcf \
-L /data/ref/intervals.bed \
--genomicsdb-workspace-path family_trio_gdb
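And the second, shown here in a representative form (note the gendb:// prefix, which points GATK at the data store):
gatk GenotypeGVCFs \
    -R /data/ref/ref.fasta \
    -V gendb://family_trio_gdb \
    -L /data/ref/intervals.bed \
    -O family_trio.joint.vcf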
The first command takes the per-sample GVCFs and an intervals file, and produces a GenomicsDB data store.
The second takes that data store and a reference genome, and produces the final cohort-level VCF.
The container URI is the same as for HaplotypeCaller: community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867.
2.1. Set up the inputs¶
The joint genotyping process needs two kinds of inputs that we don't have yet: an arbitrary cohort name, and the collected GVCF outputs from all samples bundled together.
2.1.1. Add a cohort_name parameter¶
We need to provide an arbitrary name for the cohort.
Later in the training series you'll learn how to use sample metadata for this sort of thing, but for now we just declare a CLI parameter using params and give it a default value for convenience.
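For example, near the other parameter declarations in genomics.nf (the default family_trio matches the output filenames shown later in this section):
params.cohort_name = "family_trio"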
2.1.2. Gather the HaplotypeCaller outputs across samples¶
If we were to plug the output channel from GATK_HAPLOTYPECALLER directly into the new process, Nextflow would call the process on each sample GVCF separately.
We want to bundle all three GVCFs (and their index files) so that Nextflow hands all of them together to a single process call.
We can do that using the collect() channel operator.
Add the following lines to the workflow body, right after the call to GATK_HAPLOTYPECALLER:
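all_gvcfs_ch = GATK_HAPLOTYPECALLER.out.vcf.collect()
all_idxs_ch = GATK_HAPLOTYPECALLER.out.idx.collect()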
Breaking this down:
- We take the output channel from GATK_HAPLOTYPECALLER using the .out property.
- Because we named the outputs using emit: in section 1, we can select the GVCFs with .vcf and the index files with .idx. Without named outputs, we would have to use .out[0] and .out[1].
- The collect() operator bundles all files into a single element, so all_gvcfs_ch contains all three GVCFs together, and all_idxs_ch contains all three index files together.
We can collect the GVCFs and their index files separately (as opposed to keeping them together in tuples) because Nextflow will stage all input files together for execution, so the index files will be present alongside the GVCFs.
Tip
You can use the view() operator to inspect the contents of channels before and after applying channel operators.
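For example, to compare the channel contents before and after collecting:
GATK_HAPLOTYPECALLER.out.vcf.view()           // emits one GVCF path per sample
GATK_HAPLOTYPECALLER.out.vcf.collect().view() // emits all GVCF paths as a single element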
2.2. Write the joint genotyping process and call it in the workflow¶
Following the same pattern we used in Part 2, we'll write the process definition in a module file, import it into the workflow, and call it on the inputs we just prepared.
2.2.1. Construct a string to give each GVCF a -V argument¶
Before we start filling in the process definition, there's one thing to work out.
The GenomicsDBImport command expects a separate -V argument for each GVCF file, like this:
gatk GenomicsDBImport \
-V reads_mother.bam.g.vcf \
-V reads_father.bam.g.vcf \
-V reads_son.bam.g.vcf \
...
If we were to write -V ${all_gvcfs} (where all_gvcfs is the process input that receives the collected GVCFs), Nextflow would simply concatenate the filenames, and that part of the command would look like this:
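-V reads_mother.bam.g.vcf reads_father.bam.g.vcf reads_son.bam.g.vcf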
But we need the string to look like this:
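-V reads_mother.bam.g.vcf -V reads_father.bam.g.vcf -V reads_son.bam.g.vcf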
Importantly, we need to construct this string dynamically from whatever files are in the collected channel. Nextflow (via Groovy) provides a concise way to do this:
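def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')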
Breaking this down:
- all_gvcfs.collect { gvcf -> "-V ${gvcf}" } iterates over each file path and prepends -V to it, producing ["-V A.g.vcf", "-V B.g.vcf", "-V C.g.vcf"].
- .join(' ') concatenates them with spaces: "-V A.g.vcf -V B.g.vcf -V C.g.vcf".
- The result is assigned to a local variable gvcfs_line (defined with def), which we can interpolate into the command template.
This line goes inside the process's script: block, before the command template.
You can place arbitrary Groovy code between script: and the opening """ of the command template.
Then you can interpolate the whole string as ${gvcfs_line} inside the command template.
2.2.2. Fill in the module for the joint genotyping process¶
Now we can tackle writing the full process.
Open modules/gatk_jointgenotyping.nf and examine the outline of the process definition.
Go ahead and fill in the process definition using the information provided above, then check your work against the sketch below.
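Here is one way the completed module could look, assembled from the commands and names used in this section (the input ordering and some variable names are assumptions):

process GATK_JOINTGENOTYPING {

    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"

    input:
    path all_gvcfs
    path all_idxs
    path interval_list
    val cohort_name
    path ref_fasta
    path ref_index
    path ref_dict

    output:
    path "${cohort_name}.joint.vcf", emit: vcf
    path "${cohort_name}.joint.vcf.idx", emit: idx

    script:
    // Build one -V argument per collected GVCF
    def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')
    """
    gatk GenomicsDBImport \
        ${gvcfs_line} \
        -L ${interval_list} \
        --genomicsdb-workspace-path ${cohort_name}_gdb

    gatk GenotypeGVCFs \
        -R ${ref_fasta} \
        -V gendb://${cohort_name}_gdb \
        -L ${interval_list} \
        -O ${cohort_name}.joint.vcf
    """
}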
There are several things worth calling out here.
As previously, several inputs are listed even though the commands don't reference them directly: all_idxs, ref_index, and ref_dict.
Listing them ensures Nextflow stages these files in the working directory alongside the files that do appear in the commands, which GATK expects to find based on naming conventions.
The gvcfs_line variable uses the Groovy closure described above to construct the -V arguments for GenomicsDBImport.
This process runs two commands in sequence, just as you would in the terminal.
GenomicsDBImport combines the per-sample GVCFs into a data store, then GenotypeGVCFs reads that data store and produces the final cohort-level VCF.
The GenomicsDB data store (${cohort_name}_gdb) is an intermediate artifact used only within the process; it doesn't appear in the output block.
Once you've completed this, the process is ready to use. To use it in the workflow, you'll need to import the module and add a process call.
2.2.3. Import the module¶
Add the import statement to genomics.nf, below the existing import statements:
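include { GATK_JOINTGENOTYPING } from './modules/gatk_jointgenotyping.nf'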
The process is now available in the workflow scope.
2.2.4. Add the process call¶
Add the call to GATK_JOINTGENOTYPING in the workflow body, after the collect() lines:
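A sketch, assuming the reference and intervals variables from Part 2 are named ref_file, ref_index_file, ref_dict_file, and intervals_file:
GATK_JOINTGENOTYPING(
    all_gvcfs_ch,
    all_idxs_ch,
    intervals_file,
    params.cohort_name,
    ref_file,
    ref_index_file,
    ref_dict_file
)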
The process is now fully wired up. Next, we configure how the outputs are published.
2.3. Configure the output handling¶
We need to publish the joint VCF outputs. Add publish targets and output block entries for the joint genotyping results.
2.3.1. Add publish targets for the joint VCF¶
Add the joint VCF and its index to the publish: section of the workflow:
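For example (the target names joint_vcf and joint_vcf_idx are assumptions, matching the emit names from the module sketch):
publish:
// ...existing targets (indexed_bam, gvcfs, gvcf_idxs) unchanged
joint_vcf = GATK_JOINTGENOTYPING.out.vcf
joint_vcf_idx = GATK_JOINTGENOTYPING.out.idx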
Now update the output block to match.
2.3.2. Add output block entries for the joint VCF¶
Add entries for the joint VCF files. We'll put them at the root of the results directory since this is the final output.
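A sketch, assuming path '.' publishes at the root of the results directory:
output {
    // ...existing entries (gvcfs, gvcf_idxs, indexed_bam) unchanged
    joint_vcf {
        path '.'
    }
    joint_vcf_idx {
        path '.'
    }
}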
With the process, publish targets, and output block all in place, we can test the complete workflow.
2.4. Run the workflow¶
Run the workflow to verify everything works.
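nextflow run genomics.nf -resume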
Command output
The first two steps are cached from the previous run, and the new GATK_JOINTGENOTYPING step runs once on the collected inputs from all three samples.
The final output files, family_trio.joint.vcf and its index, are at the top level of the results directory.
Directory contents (symlinks shortened)
results/
├── family_trio.joint.vcf -> */a6/7cc8ed*/family_trio.joint.vcf
├── family_trio.joint.vcf.idx -> */a6/7cc8ed*/family_trio.joint.vcf.idx
├── gvcf/
│ ├── reads_father.bam.g.vcf -> */27/0d7eb9*/reads_father.bam.g.vcf
│ ├── reads_father.bam.g.vcf.idx -> */27/0d7eb9*/reads_father.bam.g.vcf.idx
│ ├── reads_mother.bam.g.vcf -> */e4/4ed55e*/reads_mother.bam.g.vcf
│ ├── reads_mother.bam.g.vcf.idx -> */e4/4ed55e*/reads_mother.bam.g.vcf.idx
│ ├── reads_son.bam.g.vcf -> */08/e95962*/reads_son.bam.g.vcf
│ └── reads_son.bam.g.vcf.idx -> */08/e95962*/reads_son.bam.g.vcf.idx
└── indexed_bam/
├── reads_father.bam -> */9a/c7a873*/reads_father.bam
├── reads_father.bam.bai -> */9a/c7a873*/reads_father.bam.bai
├── reads_mother.bam -> */f1/8d8486*/reads_mother.bam
├── reads_mother.bam.bai -> */f1/8d8486*/reads_mother.bam.bai
├── reads_son.bam -> */cc/fbc705*/reads_son.bam
└── reads_son.bam.bai -> */cc/fbc705*/reads_son.bam.bai
If you open the joint VCF file, you can verify that the workflow produced the expected variant calls.
You now have an automated, fully reproducible joint variant calling workflow!
Note
Keep in mind the data files we gave you cover only a tiny portion of chromosome 20. The real size of a variant callset would be counted in millions of variants. That's why we use only tiny subsets of data for training purposes!
Takeaway¶
You know how to collect outputs from a channel and bundle them as a single input to another process. You also know how to construct a command line using Groovy closures, and how to run multiple commands in a single process.
What's next?¶
Celebrate your success and take a well-deserved break.
In the next part of this course, you'll learn how to run a production-ready variant calling pipeline from nf-core and compare it to the pipeline you built manually.