Part 2: Per-sample variant calling¶
In Part 1, you tested the Samtools and GATK commands manually in their respective containers. Now we're going to wrap those same commands into a Nextflow workflow.
Assignment¶
In this part of the course, we're going to develop a workflow that does the following:
- Generate an index file for each BAM input file using Samtools
- Run the GATK HaplotypeCaller on each BAM input file to generate per-sample variant calls in VCF (Variant Call Format)
This replicates the steps from Part 1, where you ran these commands manually in their containers.
As a starting point, we provide you with a workflow file, genomics.nf, that outlines the main parts of the workflow, as well as two module files, samtools_index.nf and gatk_haplotypecaller.nf, that outline the structure of the modules.
These files are not functional; their purpose is just to serve as scaffolds for you to fill in with the interesting parts of the code.
Lesson plan¶
In order to make the development process more educational, we've broken this down into four steps:
- Write a single-stage workflow that runs Samtools index on a BAM file. This covers creating a module, importing it, and calling it in a workflow.
- Add a second process to run GATK HaplotypeCaller on the indexed BAM file. This introduces chaining process outputs to inputs and handling accessory files.
- Adapt the workflow to run on a batch of samples. This covers parallel execution and introduces tuples to keep associated files together.
- Make the workflow accept a text file containing a batch of input files. This demonstrates a common pattern for providing inputs in bulk.
Each step focuses on a specific aspect of workflow development.
1. Write a single-stage workflow that runs Samtools index on a BAM file¶
This first step focuses on the basics: loading a BAM file and generating an index for it.
Recall the samtools index command from Part 1:
The command takes a BAM file as input and produces a .bai index file alongside it.
The container URI was community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464.
We're going to take this information and wrap it in Nextflow in three stages:
- Set up the input
- Write the indexing process and call it in the workflow
- Configure the output handling
1.1. Set up the input¶
We need to declare an input parameter, create a test profile to provide a convenient default value, and create an input channel.
1.1.1. Add an input parameter declaration¶
In the main workflow file genomics.nf, under the Pipeline parameters section, declare a CLI parameter called reads_bam.
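In recent Nextflow versions, parameters can be declared in a typed params block. A minimal sketch (the annotation syntax may vary with your Nextflow version; compare against the provided scaffold):

```groovy
/*
 * Pipeline parameters
 */
params {
    // Primary input: BAM file(s) to index and call variants on
    reads_bam: Path
}
```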
That sets up the CLI parameter, but we don't want to type out the file path every time we run the workflow during development. There are multiple options for providing a default value; here we use a test profile.
1.1.2. Create a test profile with a default value in nextflow.config¶
A test profile provides convenient default values for trying out a workflow without specifying inputs on the command line. This is a common convention in the Nextflow ecosystem (see Hello Config for more detail).
Add a profiles block to nextflow.config with a test profile that sets the reads_bam parameter to one of the test BAM files.
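A sketch of what the profile might look like, assuming the training repository's data layout:

```groovy
profiles {
    test {
        params.reads_bam = "${projectDir}/data/bam/reads_mother.bam"
    }
}
```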
Here, we're using ${projectDir}, a built-in Nextflow variable that points to the directory where the workflow script is located.
This makes it easy to reference data files and other resources without hardcoding absolute paths.
1.1.3. Set up the input channel¶
In the workflow block, create an input channel from the parameter value using the .fromPath channel factory (as used in Hello Channels).
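A minimal sketch of the channel construction:

```groovy
workflow {
    main:
    // Create a channel that emits each input BAM file path
    reads_ch = channel.fromPath(params.reads_bam)
}
```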
Now we need to create the process to run the indexing on this input.
1.2. Write the indexing process and call it in the workflow¶
We need to write the process definition in the module file, import it into the workflow using an include statement, and call it on the input.
1.2.1. Fill in the module for the indexing process¶
Open modules/samtools_index.nf and examine the outline of the process definition.
You should recognize the main structural elements; if not, consider reading through Hello Nextflow for a refresher.
Go ahead and fill in the process definition by yourself using the information provided above, then check your work against the solution in the "After" tab below.
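For reference, here is a sketch of what the completed module might look like, based on the command and container URI given above (treat it as a starting point to compare against the provided solution, not as canonical):

```groovy
#!/usr/bin/env nextflow

/*
 * Generate a BAM index file with Samtools
 */
process SAMTOOLS_INDEX {

    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'

    input:
    path input_bam

    output:
    path "${input_bam}.bai"

    script:
    """
    samtools index '$input_bam'
    """
}
```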
Once you've filled this in, the process definition is complete. To use it in the workflow, you'll need to import the module and add a process call.
1.2.2. Include the module¶
In genomics.nf, add an include statement to make the process available to the workflow:
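The include statement goes near the top of genomics.nf, with the module path given relative to the workflow script:

```groovy
// Import the indexing process from its module file
include { SAMTOOLS_INDEX } from './modules/samtools_index.nf'
```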
The process is now available in the workflow scope.
1.2.3. Call the indexing process on the input¶
Now, let's add a call to SAMTOOLS_INDEX in the workflow block, passing the input channel as an argument.
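A sketch of the workflow body at this point:

```groovy
workflow {
    main:
    reads_ch = channel.fromPath(params.reads_bam)

    // Run the indexing process on each BAM emitted by the channel
    SAMTOOLS_INDEX(reads_ch)
}
```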
The workflow now loads the input and runs the indexing process on it. Next, we need to configure how the output is published.
1.3. Configure the output handling¶
We need to declare which process outputs to publish and specify where they should go.
1.3.1. Declare an output in the publish: section¶
The publish: section inside the workflow block declares which process outputs should be published.
Assign the output of SAMTOOLS_INDEX to a named target called bam_index.
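A sketch of the workflow block with the publish: section added:

```groovy
workflow {
    main:
    reads_ch = channel.fromPath(params.reads_bam)
    SAMTOOLS_INDEX(reads_ch)

    publish:
    // Named target for the index output
    bam_index = SAMTOOLS_INDEX.out
}
```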
Now we need to tell Nextflow where to put the published output.
1.3.2. Configure the output target in the output {} block¶
The output {} block sits outside the workflow and specifies where each named target is published.
Let's add a target for bam_index that publishes into a bam/ subdirectory.
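A sketch of the output block (the exact directives available may differ slightly across Nextflow versions):

```groovy
// Top level of genomics.nf, outside the workflow block
output {
    bam_index {
        path 'bam'
    }
}
```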
Note
By default, Nextflow publishes output files as symbolic links, which avoids unnecessary duplication.
Even though the data files we're using here are very small, in genomics they can get very large.
Symlinks will break when you clean up your work directory, so for production workflows you may want to override the default publish mode to 'copy'.
1.4. Run the workflow¶
At this point, we have a one-step indexing workflow that should be fully functional. Let's test that it works!
We can run it with -profile test to use the default value set up in the test profile and avoid having to write the path on the command line.
Command output
You can check that the index file has been generated correctly by looking in the work directory or in the results directory.
Work directory contents
There it is!
Takeaway¶
You know how to create a module containing a process, import it into a workflow, call it with an input channel, and publish the results.
What's next?¶
Add a second step that takes the output of the indexing process and uses it to run variant calling.
2. Add a second process to run GATK HaplotypeCaller on the indexed BAM file¶
Now that we have an index for our input file, we can move on to setting up the variant calling step.
Recall the gatk HaplotypeCaller command from Part 1:
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_mother.bam \
-O reads_mother.vcf \
-L /data/ref/intervals.bed
The command takes a BAM file (-I), a reference genome (-R), and an intervals file (-L), and produces a VCF file (-O) along with its index.
The tool also expects the BAM index, reference index, and reference dictionary to be co-located with their respective files.
The container URI was community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867.
We follow the same three stages as before:
- Set up the inputs
- Write the variant calling process and call it in the workflow
- Configure the output handling
2.1. Set up the inputs¶
The variant calling step requires several additional input files. We need to declare parameters for them, add default values to the test profile, and create variables to load them.
2.1.1. Add parameter declarations for accessory inputs¶
Since our new process expects a handful of additional files to be provided, add parameter declarations for them in genomics.nf under the Pipeline parameters section:
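A sketch of the expanded params block (the accessory parameter names here are illustrative; use whatever names the provided scaffold declares):

```groovy
params {
    // Primary input
    reads_bam: Path

    // Accessory files required by GATK
    reference: Path
    reference_index: Path
    reference_dict: Path
    intervals: Path
}
```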
As before, we provide default values through the test profile rather than inline.
2.1.2. Add accessory file defaults to the test profile¶
Just as we did for reads_bam in section 1.1.2, add default values for the accessory files to the test profile in nextflow.config:
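A sketch using the same illustrative parameter names and the training data layout:

```groovy
profiles {
    test {
        params.reads_bam       = "${projectDir}/data/bam/reads_mother.bam"
        params.reference       = "${projectDir}/data/ref/ref.fasta"
        params.reference_index = "${projectDir}/data/ref/ref.fasta.fai"
        params.reference_dict  = "${projectDir}/data/ref/ref.dict"
        params.intervals       = "${projectDir}/data/ref/intervals.bed"
    }
}
```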
Now we need to create variables that load these file paths for use in the workflow.
2.1.3. Create variables for the accessory files¶
Add variables for the accessory file paths inside the workflow block:
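For example (variable names are illustrative):

```groovy
// Load the accessory file paths; file() returns a Path object
ref_file       = file(params.reference)
ref_index_file = file(params.reference_index)
ref_dict_file  = file(params.reference_dict)
intervals_file = file(params.intervals)
```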
The file() syntax tells Nextflow explicitly to handle these inputs as file paths.
You can learn more about this in the Side Quest Working with files.
2.2. Write the variant calling process and call it in the workflow¶
We need to write the process definition in the module file, import it into the workflow using an include statement, and call it on the input reads plus the output of the indexing step and the accessory files.
2.2.1. Fill in the module for the variant calling process¶
Open modules/gatk_haplotypecaller.nf and examine the outline of the process definition.
Go ahead and fill in the process definition by yourself using the information provided above, then check your work against the solution in the "After" tab below.
You'll notice that this process declares more inputs than the GATK command itself references. GATK locates the BAM index and the reference genome's accessory files by naming convention, but Nextflow is domain-agnostic and doesn't know about those conventions. We need to list the files explicitly so that Nextflow stages them in the working directory at runtime; otherwise GATK will throw an error about missing files.
Similarly, we list the output VCF's index file ("${input_bam}.vcf.idx") explicitly so that Nextflow keeps track of it for subsequent steps.
We use the emit: syntax to assign a name to each output channel, which will become useful when we wire the outputs into the publish block.
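A sketch of what the completed module might look like, using the command and container URI from above (the input names and emit labels are illustrative; compare against the provided solution):

```groovy
/*
 * Call variants with GATK HaplotypeCaller
 */
process GATK_HAPLOTYPECALLER {

    container 'community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867'

    input:
    path input_bam
    path input_bam_index
    path ref_fasta
    path ref_index
    path ref_dict
    path interval_list

    output:
    path "${input_bam}.vcf",     emit: vcf
    path "${input_bam}.vcf.idx", emit: idx

    script:
    """
    gatk HaplotypeCaller \
        -R ${ref_fasta} \
        -I ${input_bam} \
        -O ${input_bam}.vcf \
        -L ${interval_list}
    """
}
```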
Once you've filled this in, the process definition is complete. To use it in the workflow, you'll need to import the module and add a process call.
2.2.2. Import the new module¶
Update genomics.nf to import the new module:
The process is now available in the workflow scope.
2.2.3. Add the process call¶
Add the process call in the workflow body, under the main: label:
You should recognize the .out syntax from the Hello Nextflow training series; we are telling Nextflow to take the channel output by SAMTOOLS_INDEX and plug it into the GATK_HAPLOTYPECALLER process call.
Note
Notice that the inputs are provided in the same order in the process call as they are listed in the input block of the process definition. In Nextflow, process inputs are positional, so the call must match the input block in both order and number of elements.
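Putting that together, a sketch of the call with all six inputs in process-declaration order (variable names as set up earlier):

```groovy
GATK_HAPLOTYPECALLER(
    reads_ch,
    SAMTOOLS_INDEX.out,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)
```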
2.3. Configure the output handling¶
We need to add the new outputs to the publish declaration and configure where they go.
2.3.1. Add publish targets for the variant calling outputs¶
Add the VCF and index outputs to the publish: section:
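A sketch of the expanded publish: section, assuming the emit labels vcf and idx on the new process:

```groovy
publish:
bam_index = SAMTOOLS_INDEX.out
vcf       = GATK_HAPLOTYPECALLER.out.vcf
vcf_idx   = GATK_HAPLOTYPECALLER.out.idx
```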
Now we need to tell Nextflow where to put the new outputs.
2.3.2. Configure the new output targets¶
Add entries for the vcf and vcf_idx targets in the output {} block, publishing both into a vcf/ subdirectory:
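A sketch of the expanded output block:

```groovy
output {
    bam_index {
        path 'bam'
    }
    vcf {
        path 'vcf'
    }
    vcf_idx {
        path 'vcf'
    }
}
```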
The VCF and its index are published as separate targets that both go into the vcf/ subdirectory.
2.4. Run the workflow¶
Run the expanded workflow, adding -resume this time so that we don't have to run the indexing step again.
Command output
Now if we look at the console output, we see the two processes listed.
The first process call was retrieved from the cache thanks to -resume, as expected, whereas the second process was run since it's brand new.
You'll find the output files in the results directory (as symbolic links to the work directory).
Directory contents
If you open the VCF file, you should see the same contents as in the file you generated by running the GATK command directly in the container.
File contents
This is the output we care about generating for each sample in our study.
Takeaway¶
You know how to make a two-step modular workflow that does real analysis work and is capable of dealing with genomics file format idiosyncrasies like the accessory files.
What's next?¶
Make the workflow handle multiple samples in bulk.
3. Adapt the workflow to run on a batch of samples¶
It's all well and good to have a workflow that can automate processing on a single sample, but what if you have 1000 samples? Do you need to write a bash script that loops through all your samples?
No, thank goodness! Just make a minor tweak to the code and Nextflow will handle that for you too.
3.1. Update the input to list three samples¶
To run on multiple samples, update the test profile to provide an array of file paths instead of a single one. This is a quick way to test multi-sample execution; in the next step we'll switch to a more scalable approach using a file of inputs.
First, comment out the type annotation in the parameter declaration, since arrays cannot use typed declarations:
Then update the test profile to list all three samples:
The channel factory in the workflow body (.fromPath) accepts multiple file paths just as well as a single one, so no other changes are needed.
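A sketch of the multi-sample test profile:

```groovy
profiles {
    test {
        params.reads_bam = [
            "${projectDir}/data/bam/reads_mother.bam",
            "${projectDir}/data/bam/reads_father.bam",
            "${projectDir}/data/bam/reads_son.bam",
        ]
    }
}
```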
3.2. Run the workflow¶
Try running the workflow now that the plumbing is set up to run on all three test samples.
Funny thing: this might work, OR it might fail. For example, here's a run that succeeded:
Command output
If your workflow run succeeded, run it again until you get an error like this:
Command output
N E X T F L O W ~ version 25.10.2
┃ Launching `genomics.nf` [loving_pasteur] DSL2 - revision: d2a8e63076
executor > local (4)
[01/eea165] SAMTOOLS_INDEX (2) | 3 of 3, cached: 1 ✔
[a5/fa9fd0] GATK_HAPLOTYPECALLER (3) | 1 of 3, cached: 1
ERROR ~ Error executing process > 'GATK_HAPLOTYPECALLER (2)'
Caused by:
Process `GATK_HAPLOTYPECALLER (2)` terminated with an error exit status (2)
Command executed:
gatk HaplotypeCaller -R ref.fasta -I reads_father.bam -O reads_father.bam.vcf -L intervals.bed
Command exit status:
2
Command error:
...
A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
...
If you look at the GATK command error output, there will be a line like this:
A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
Well, that's weird, considering we explicitly indexed the BAM files in the first step of the workflow. Could there be something wrong with the plumbing?
3.3. Troubleshoot the problem¶
We'll inspect the work directories and use the view() operator to figure out what went wrong.
3.3.1. Check the work directories for the relevant calls¶
Take a look inside the work directory for the failed GATK_HAPLOTYPECALLER process call listed in the console output.
Directory contents
work/a5/fa9fd0994b6beede5fb9ea073596c2
├── intervals.bed -> /workspaces/training/nf4-science/genomics/data/ref/intervals.bed
├── reads_father.bam.bai -> /workspaces/training/nf4-science/genomics/work/01/eea16597bd6e810fb4cf89e60f8c2d/reads_father.bam.bai
├── reads_son.bam -> /workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
├── reads_son.bam.vcf
├── reads_son.bam.vcf.idx
├── ref.dict -> /workspaces/training/nf4-science/genomics/data/ref/ref.dict
├── ref.fasta -> /workspaces/training/nf4-science/genomics/data/ref/ref.fasta
└── ref.fasta.fai -> /workspaces/training/nf4-science/genomics/data/ref/ref.fasta.fai
Pay particular attention to the names of the BAM file and the BAM index that are listed in this directory: reads_son.bam and reads_father.bam.bai.
What the heck? Nextflow has staged an index file in this process call's work directory, but it's the wrong one. How could this have happened?
3.3.2. Use the view() operator to inspect channel contents¶
Add these two lines in the workflow body before the GATK_HAPLOTYPECALLER process call to view the contents of the channel:
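For example:

```groovy
// Temporary debugging: print every element flowing through each channel
reads_ch.view()
SAMTOOLS_INDEX.out.view()
```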
Then run the workflow command again.
Once again, this may succeed or fail. Here's what the output of the two .view() calls looks like for a failed run:
/workspaces/training/nf4-science/genomics/data/bam/reads_mother.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_father.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
/workspaces/training/nf4-science/genomics/work/9c/53492e3518447b75363e1cd951be4b/reads_father.bam.bai
/workspaces/training/nf4-science/genomics/work/cc/37894fffdf6cc84c3b0b47f9b536b7/reads_son.bam.bai
/workspaces/training/nf4-science/genomics/work/4d/dff681a3d137ba7d9866e3d9307bd0/reads_mother.bam.bai
The first three lines correspond to the input channel and the last three to the output channel. You can see that the BAM files and index files for the three samples are not listed in the same order!
Note
When you call a Nextflow process on a channel containing multiple elements, Nextflow will try to parallelize execution as much as possible, and will collect outputs in whatever order they become available. The consequence is that the corresponding outputs may be collected in a different order than the original inputs were fed in.
As currently written, our workflow script assumes that the index files will come out of the indexing step listed in the same mother/father/son order as the inputs were given. But that is not guaranteed to be the case, which is why sometimes (though not always) the wrong files get paired up in the second step.
To fix this, we need to make sure the BAM files and their index files travel together through the channels.
Tip
The view() statements don't affect the workflow's results, so leaving them in won't cause any problems.
However, they will clutter up your console output, so we recommend removing them once you're done troubleshooting.
3.4. Update the workflow to handle the index files correctly¶
The fix is to bundle each BAM file with its index into a tuple, then update the downstream process and workflow plumbing to match.
3.4.1. Change the output of the SAMTOOLS_INDEX module to a tuple¶
The simplest way to ensure a BAM file and its index stay closely associated is to package them together into a tuple coming out of the index task.
Note
A tuple is a finite, ordered list of elements that is commonly used for returning multiple values from a function. Tuples are particularly useful for passing multiple inputs or outputs between processes while preserving their association and order.
Update the output in modules/samtools_index.nf to include the BAM file:
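A sketch of the updated output block:

```groovy
output:
// Emit the BAM and its index together so they stay paired downstream
tuple path(input_bam), path("${input_bam}.bai")
```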
This way, each index file will be tightly coupled with its original BAM file, and the overall output of the indexing step will be a single channel containing pairs of files.
3.4.2. Change the input of the GATK_HAPLOTYPECALLER module to accept a tuple¶
Since we've changed the 'shape' of the output of the first process, we need to update the input definition of the second process to match.
Update modules/gatk_haplotypecaller.nf:
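A sketch of the updated input block (names as before, illustrative):

```groovy
input:
// The BAM and its index now arrive together as a single tuple
tuple path(input_bam), path(input_bam_index)
path ref_fasta
path ref_index
path ref_dict
path interval_list
```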
Now we need to update the workflow to reflect the new tuple structure in the process call and the publish targets.
3.4.3. Update the call to GATK_HAPLOTYPECALLER in the workflow¶
We no longer need to provide the original reads_ch to the GATK_HAPLOTYPECALLER process, since the BAM file is now bundled into the channel output by SAMTOOLS_INDEX.
Update the call in genomics.nf:
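A sketch of the updated call, with reads_ch dropped:

```groovy
// The tuple from SAMTOOLS_INDEX now carries both the BAM and its index
GATK_HAPLOTYPECALLER(
    SAMTOOLS_INDEX.out,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)
```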
Finally, we need to update the publish targets to reflect the new output structure.
3.4.4. Update the publish target for the indexed BAM output¶
Since the SAMTOOLS_INDEX output is now a tuple containing both the BAM file and its index, rename the publish target from bam_index to indexed_bam to better reflect its contents:
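A sketch of the renamed publish target:

```groovy
publish:
indexed_bam = SAMTOOLS_INDEX.out
vcf         = GATK_HAPLOTYPECALLER.out.vcf
vcf_idx     = GATK_HAPLOTYPECALLER.out.idx
```

If your output {} block references the old target name, rename it there to match.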
With these changes, the BAM and its index are guaranteed to travel together, so the pairing will always be correct.
3.5. Run the corrected workflow¶
Run the workflow again to make sure this will work reliably going forward.
This time (and every time) everything should run correctly:
Command output
The results directory now contains both BAM and BAI files for each sample (from the tuple), along with the VCF outputs:
Results directory contents
results/
├── bam/
│ ├── reads_father.bam -> ...
│ ├── reads_father.bam.bai -> ...
│ ├── reads_mother.bam -> ...
│ ├── reads_mother.bam.bai -> ...
│ ├── reads_son.bam -> ...
│ └── reads_son.bam.bai -> ...
└── vcf/
├── reads_father.bam.vcf -> ...
├── reads_father.bam.vcf.idx -> ...
├── reads_mother.bam.vcf -> ...
├── reads_mother.bam.vcf.idx -> ...
├── reads_son.bam.vcf -> ...
└── reads_son.bam.vcf.idx -> ...
By bundling associated files into tuples, we ensured the correct files always travel together through the workflow. The workflow now processes any number of samples reliably, but listing them individually in the config is not very scalable. In the next step, we'll switch to reading inputs from a file.
Takeaway¶
You know how to make your workflow run on multiple samples (independently).
What's next?¶
Make it easier to handle samples in bulk.
4. Make the workflow accept a text file containing a batch of input files¶
A very common way to provide multiple data input files to a workflow is with a text file containing the file paths. It can be as simple as a text file listing one file path per line, or it can contain additional metadata, in which case it's often called a samplesheet.
Here we are going to show you how to do the simple case.
4.1. Examine the provided text file listing the input file paths¶
We already made a text file listing the input file paths, called sample_bams.txt, which you can find in the data/ directory.
/workspaces/training/nf4-science/genomics/data/bam/reads_mother.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_father.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
As you can see, we listed one file path per line, and they are absolute paths.
Note
The files we are using here live on your GitHub Codespaces local filesystem, but we could also point to files in cloud storage. If you are not using the provided Codespaces environment, you may need to adapt the file paths to match your local setup.
4.2. Update the parameter and test profile¶
Switch the reads_bam parameter to point to the sample_bams.txt file instead of listing individual samples.
Restore the type annotation in the params block (since it's a single path again):
Then update the test profile to point to the text file:
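A sketch of the updated test profile:

```groovy
profiles {
    test {
        params.reads_bam = "${projectDir}/data/sample_bams.txt"
    }
}
```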
The list of files no longer lives in the code at all, which is a big step in the right direction.
4.3. Update the channel factory to read lines from a file¶
Currently, our input channel factory treats any files we give it as the data inputs we want to feed to the indexing process. Since we're now giving it a file that lists input file paths, we need to change its behavior to parse the file and treat the file paths it contains as the data inputs.
We can do this using the same pattern we used in Part 2 of Hello Nextflow: applying the splitCsv() operator to parse the file, then a map operation to select the first field of each line.
Technically we could do this more simply using the .splitText() operator, since our input file currently only contains file paths.
However, by using the more versatile splitCsv operator (supplemented by map), we can future-proof our workflow in case we decide to add metadata to the file containing file paths.
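A sketch of the updated channel factory:

```groovy
reads_ch = channel.fromPath(params.reads_bam)
    .splitCsv()                        // parse each line into a list of fields
    .map { line -> file(line[0]) }     // treat the first field as a file path
```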
Tip
If you're not confident you understand what the operators are doing here, this is another great opportunity to use the .view() operator to look at what the channel contents look like before and after applying them.
4.4. Run the workflow¶
Run the workflow one more time. This should produce the same result as before, right?
Command output
Yes! In fact, Nextflow correctly detects that the process calls are exactly the same, and doesn't even bother re-running everything, since we were running with -resume.
And that's it! Our simple variant calling workflow has all the basic features we wanted.
Takeaway¶
You know how to make a multi-step modular workflow to index a BAM file and apply per-sample variant calling using GATK.
More generally, you've learned how to use essential Nextflow components and logic to build a simple genomics pipeline that does real work, taking into account the idiosyncrasies of genomics file formats and tool requirements.
What's next?¶
Celebrate your success and take an extra long break!
In the next part of this course, you'll learn how to transform this simple per-sample variant calling workflow to apply joint variant calling to the data.