Welcome to the Nextflow Training Workshop
We are excited to have you on the path to writing reproducible and scalable scientific workflows using Nextflow. This guide complements the full Nextflow documentation - if you ever have any doubts, head over to the docs located here.
By the end of this course you should:
- Be proficient in writing Nextflow pipelines
- Know the basic Nextflow concepts of Channels, Processes and Operators
- Have an understanding of containerized workflows
- Understand the different execution platforms supported by Nextflow
- Be introduced to the Nextflow community and ecosystem
Overview
To get you started with Nextflow as quickly as possible, we will walk through the following steps:
- Set up a development environment to run Nextflow
- Explore Nextflow concepts using some basic workflows, including a multi-step RNA-Seq analysis
- Build and use Docker containers to encapsulate all workflow dependencies
- Dive deeper into the core Nextflow syntax, including Channels, Processes, and Operators
- Cover cluster and cloud deployment scenarios and explore Nextflow Tower capabilities
This will give you a broad understanding of Nextflow so that you can start writing your own pipelines. We hope you enjoy the course! This is an ever-evolving document — feedback is always welcome.
1. Environment setup
There are two main ways to get started with Seqera’s Nextflow training course.
The first is to install the requirements locally (Local installation), which is best if you are already familiar with Git and Docker, or working offline.
The second is to use Gitpod, which is best for first-timers as this platform contains all the programs and data required. Simply click the link and log in using your GitHub account to start the tutorial.
1.1. Local installation
Nextflow can be used on any POSIX-compatible system (Linux, macOS, Windows Subsystem for Linux, etc.).
Requirements:
- Bash
- Git
Optional requirements for this tutorial:
- Singularity 2.5.x (or later)
- Conda 4.5 (or later)
- A configured AWS Batch computing environment
1.1.1. Download Nextflow:
Enter this command in your terminal:
wget -qO- https://get.nextflow.io | bash
Or, if you prefer curl:
curl -s https://get.nextflow.io | bash
Then ensure that the downloaded binary is executable:
chmod +x nextflow
Finally, move the nextflow executable to a directory in your $PATH (e.g. /usr/local/bin or ~/bin/).
1.1.2. Docker
Ensure you have Docker Desktop running on your machine. Download Docker here.
1.1.3. Training material
You can view the training material here: training.seqera.io/
To download the material, run:
git clone https://github.com/seqeralabs/nf-training-public.git
Then cd into the nf-training directory.
1.1.4. Checking your installation
Check the correct installation of nextflow by running the following command:
nextflow info
This should show the current version, system and runtime.
1.2. Gitpod
A preconfigured Nextflow development environment is available using Gitpod.
Requirements:
- A GitHub account
- Web browser (Google Chrome, Firefox)
- Internet connection
1.2.1. Gitpod quick start
To run Gitpod:
- Click the following URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public (which is our GitHub repository URL, prefixed with gitpod.io/#).
- Log in to your GitHub account (and allow authorization).
Once you have signed in, Gitpod should load (skip prebuild if asked).
1.2.2. Explore your Gitpod IDE
You should now see something similar to the following:

The sidebar allows you to customize your Gitpod environment and perform basic tasks (copy, paste, open files, search, git, etc.). Click the Explorer button to see which files are in this repository.
The terminal allows you to run all the programs in the repository. For example, both nextflow and docker are installed and can be executed.
The main window allows you to view and edit files. Clicking on a file in the explorer will open it within the main window. You should also see the nf-training material browser (training.seqera.io/).
To test that the environment is working correctly, type the following into the terminal:
nextflow info
This should come up with the Nextflow version and runtime information:
Version: 22.09.7-edge build 5806
Created: 28-09-2022 09:36 UTC
System: Linux 5.15.0-47-generic
Runtime: Groovy 3.0.13 on OpenJDK 64-Bit Server VM 11.0.1+13-LTS
Encoding: UTF-8 (UTF-8)
1.2.3. Gitpod resources
- Gitpod gives you 500 free credits per month, which is equivalent to 50 hours of free environment runtime using the standard workspace (up to 4 cores, 8 GB RAM and 30 GB storage).
- There is also a large workspace option that gives you up to 8 cores, 16 GB RAM, and 50 GB storage for each workspace (up to 4 parallel workspaces in the free version). However, the large workspace uses your free credits more quickly, so your account will have fewer hours of access to this space.
- Gitpod will time out after 30 minutes. However, changes are saved for up to 2 weeks (see the next section for reopening a timed-out session).
See gitpod.io for more details.
1.2.4. Reopening a Gitpod session
You can reopen an environment from gitpod.io/workspaces. Find your previous environment in the list, then select the ellipsis (three dots icon) and select Open.
If you have saved the URL for your previous Gitpod environment, you can simply open it in your browser to open the previous environment.
Alternatively, you can start a new workspace by following the Gitpod URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public
If you have lost your environment, you can find the main scripts used in this tutorial in the nf-training directory to resume with a new environment.
1.2.5. Saving files from Gitpod to your local machine
To save any file from the explorer panel, right-click the file and select Download.
1.2.6. Training material
The training course can be accessed in your browser from training.seqera.io/.
1.3. Selecting a Nextflow version
By default, Nextflow will pull the latest stable version into your environment.
However, Nextflow is constantly evolving as we make improvements and fix bugs.
The latest releases can be viewed on GitHub here.
If you want to use a specific version of Nextflow, you can set the NXF_VER variable as shown below:
export NXF_VER=22.04.5
Most of this tutorial workshop requires NXF_VER=22.04.0 or later, in order to use DSL2 as the default syntax.
Run nextflow -version again to confirm that the change has taken effect.
2. Introduction
2.1. Basic concepts
Nextflow is a workflow orchestration engine and domain specific language (DSL) that makes it easy to write data-intensive computational pipelines.
It is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.
Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment, based on the dataflow programming model. Nextflow’s core features are:
- Workflow portability and reproducibility
- Scalability of parallelization and deployment
- Integration of existing tools, systems, and industry standards
2.1.1. Processes and Channels
In practice, a Nextflow pipeline is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).
Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous first-in, first-out (FIFO) queues, called channels in Nextflow.
Any process can define one or more channels as an input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

2.1.2. Execution abstraction
While a process defines what command or script has to be executed, the executor determines how that script is actually run on the target platform.
If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and testing purposes, but for real world computational pipelines, a High Performance Computing (HPC) or cloud platform is often required.
In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system (or runtime). Thus, it is possible to write a pipeline which runs seamlessly on your computer, a cluster, or the cloud, without being modified. You simply define the target execution platform in the configuration file.
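For instance, a minimal nextflow.config along these lines is all it takes to move the same pipeline onto a SLURM cluster (a hedged sketch; the queue name below is a placeholder, not part of the tutorial material):
// nextflow.config - run every process through the SLURM scheduler
process.executor = 'slurm'
process.queue    = 'long'   // hypothetical queue/partition name on your cluster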

2.1.3. Scripting language
Nextflow implements a declarative DSL that simplifies the writing of complex data analysis workflows as an extension of a general purpose programming language.
This approach makes Nextflow flexible — it provides the benefits of a concise DSL for the handling of recurrent use cases with ease, and the flexibility and power of a general purpose programming language to handle corner cases, in the same computing environment. This would be difficult to implement using a purely declarative approach.
In practical terms, Nextflow scripting is an extension of the Groovy programming language which, in turn, is a super-set of the Java programming language. Groovy can be thought of as "Python for Java", in that it simplifies the writing of code and is more approachable.
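As a small illustration (not part of any tutorial script), plain Groovy can be mixed freely into a Nextflow script:
// Plain Groovy in a Nextflow script: a variable, string interpolation and a closure
x = 'Hello'
println "$x from Groovy"
[1, 2, 3].each { n -> println n * 2 }   // closures act like anonymous functions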
2.2. Your first script
Here you will execute your first Nextflow script (hello.nf), which we will go through line by line.
In this toy example, the script takes an input string (a parameter called params.greeting) and splits it into chunks of six characters in the first process. The second process then converts the characters to upper case. The result is finally displayed on-screen.
2.2.1. Nextflow code (hello.nf)
#!/usr/bin/env nextflow (1)

params.greeting = 'Hello world!' (2)
greeting_ch = Channel.of(params.greeting) (3)

process SPLITLETTERS { (4)
    input: (5)
    val x (6)

    output: (7)
    path 'chunk_*' (8)

    """ (9)
    printf '$x' | split -b 6 - chunk_ (10)
    """ (11)
} (12)

process CONVERTTOUPPER { (13)
    input: (14)
    path y (15)

    output: (16)
    stdout (17)

    """ (18)
    cat $y | tr '[a-z]' '[A-Z]' (19)
    """ (20)
} (21)

workflow { (22)
    letters_ch = SPLITLETTERS(greeting_ch) (23)
    results_ch = CONVERTTOUPPER(letters_ch.flatten()) (24)
    results_ch.view{ it } (25)
} (26)
1 | The code begins with a shebang, which declares Nextflow as the interpreter. |
2 | Declares a parameter greeting that is initialized with the value 'Hello world!'. |
3 | Initializes a channel labelled greeting_ch , which contains the value from params.greeting . Channels are the input type for processes in Nextflow. |
4 | Begins the first process, defined as SPLITLETTERS . |
5 | Input declaration for the SPLITLETTERS process. Inputs can be values (val ), files or paths (path ), or other qualifiers (see here). |
6 | Tells the process to expect an input value (val ), that we assign to the variable 'x'. |
7 | Output declaration for the SPLITLETTERS process. |
8 | Tells the process to expect an output file(s) (path ), with a filename starting with 'chunk_*', as output from the script. The process sends the output as a channel. |
9 | Three double quotes initiate the code block to execute in this process . |
10 | Code to execute — printing the input value x (called using the dollar symbol [$] prefix), splitting the string into chunks with a length of 6 characters ("Hello " and "world!"), and saving each to a file (chunk_aa and chunk_ab). |
11 | Three double quotes end the code block. |
12 | End of the first process block. |
13 | Begins the second process, defined as CONVERTTOUPPER . |
14 | Input declaration for the CONVERTTOUPPER process . |
15 | Tells the process to expect an input file(s) (path ; i.e. chunk_aa and chunk_ab), that we assign to the variable 'y'. |
16 | Output declaration for the CONVERTTOUPPER process. |
17 | Tells the process to expect output as standard output (stdout) and send this output as a channel. |
18 | Three double quotes initiate the code block to execute in this process . |
19 | Script to read files (cat) using the '$y' input variable, then pipe to uppercase conversion, outputting to standard output. |
20 | Three double quotes end the code block. |
21 | End of the second process block. |
22 | Start of the workflow scope, where each process can be called. |
23 | Execute the process SPLITLETTERS on the greeting_ch (aka greeting channel), and store the output in the channel letters_ch . |
24 | Execute the process CONVERTTOUPPER on the letters channel letters_ch , which is flattened using the operator .flatten() . This transforms the input channel in such a way that every item is a separate element. We store the output in the channel results_ch . |
25 | The final output (in the results_ch channel) is printed to screen using the view operator (appended onto the channel name). |
26 | End of the workflow scope. |
The use of the operator .flatten() here is to split the two files into two separate items to be put through the next process (else they would be treated as a single element).
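If you want to see the effect of flatten in isolation, this toy snippet (unrelated to the files above) shows how a single list item becomes separate channel elements:
// flatten turns one list item into several individual items
Channel.of( ['chunk_aa', 'chunk_ab'] )
    .flatten()
    .view()   // prints chunk_aa and chunk_ab on separate lines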
2.2.2. In practice
Now copy the above example into your favourite text editor and save it to a file named hello.nf.
For the Gitpod tutorial, make sure you are in the folder called nf-training.
Execute the script by entering the following command in your terminal:
nextflow run hello.nf
The output will look similar to the text shown below:
N E X T F L O W ~ version 22.09.7-edge
Launching `hello.nf` [fabulous_torricelli] DSL2 - revision: 197a0e289a
executor > local (3)
[c8/c36893] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
[1a/3c54ed] process > CONVERTTOUPPER (2) [100%] 2 of 2 ✔
HELLO
WORLD!
The standard output shows (line by line):
- 1: The Nextflow version executed.
- 2: The script and version names.
- 3: The executor used (in the above case: local).
- 4: The first process is executed once (1). The line starts with a unique hexadecimal value (see TIP below), and ends with the percentage and job completion information.
- 5: The second process is executed twice (2) (once for chunk_aa and once for chunk_ab).
- 6-7: The result string from stdout is printed.
The hexadecimal numbers, like c8/c36893, identify the unique process execution. These numbers are also the prefix of the directories where each process is executed. You can inspect the files produced by changing to the directory $PWD/work and using these numbers to find the process-specific execution path.
The second process runs twice, executing in two different work directories for each input file. Therefore, in the previous example the work directory [1a/3c54ed] represents just one of the two directories that were processed. To print all the relevant paths to the screen, use the -ansi-log flag (e.g. nextflow run hello.nf -ansi-log false).
It’s worth noting that the process CONVERTTOUPPER is executed in parallel, so there’s no guarantee that the instance processing the first split (the chunk 'Hello ') will be executed before the one processing the second split (the chunk 'world!').
Thus, it is perfectly possible that your final result will be printed out in a different order:
WORLD!
HELLO
2.3. Modify and resume
Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result will be used instead.
This allows for testing or modifying part of your pipeline without having to re-execute it from scratch.
For the sake of this tutorial, modify the CONVERTTOUPPER process in the previous example, replacing the process script with the string rev $y, so that the process looks like this:
process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    """
    rev $y
    """
}
Then save the file with the same name, and execute it by adding the -resume option to the command line:
nextflow run hello.nf -resume
It will print output similar to this:
N E X T F L O W ~ version 22.09.7-edge
Launching `hello.nf` [mighty_mcnulty] DSL2 - revision: 525206806b
executor > local (2)
[c8/c36893] process > SPLITLETTERS (1)   [100%] 1 of 1, cached: 1 ✔
[77/cf83b6] process > CONVERTTOUPPER (1) [100%] 2 of 2 ✔
!dlrow
olleH
You will see that the execution of the process SPLITLETTERS is actually skipped (the process ID is the same as in the first output) — its results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.
The pipeline results are cached by default in the directory $PWD/work. Depending on your script, this folder can take a lot of disk space. If you are sure you won’t need to resume your pipeline execution, clean this folder periodically.
2.4. Pipeline parameters
Pipeline parameters are simply declared by prepending the prefix params to a variable name, separated by a dot character. Their value can be specified on the command line by prefixing the parameter name with a double dash character, i.e. --paramName.
Now, let’s try to execute the previous example specifying a different input string parameter, as shown below:
nextflow run hello.nf --greeting 'Bonjour le monde!'
The string specified on the command line will override the default value of the parameter. The output will look like this:
N E X T F L O W ~ version 22.09.7-edge
Launching `hello.nf` [fervent_galileo] DSL2 - revision: 525206806b
executor > local (4)
[e9/139d7d] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
[bb/fc8548] process > CONVERTTOUPPER (1) [100%] 3 of 3 ✔
m el r
!edno
uojnoB
2.4.1. In DAG-like format
To better understand how Nextflow is dealing with the data in this pipeline, below is a DAG-like figure to visualise all the inputs, outputs, channels and processes:

3. Simple RNA-Seq pipeline
To demonstrate a real-world biomedical scenario, we will implement a proof of concept RNA-Seq pipeline which:
- Indexes a transcriptome file
- Performs quality controls
- Performs quantification
- Creates a MultiQC report
This will be done using a series of seven scripts, each of which builds on the previous to create a complete workflow. You can find these in the tutorial folder (script1.nf - script7.nf).
3.1. Define the pipeline parameters
Parameters are inputs and options that can be changed when the pipeline is run.
The script script1.nf defines the pipeline input parameters.
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
println "reads: $params.reads"
Run it by using the following command:
nextflow run script1.nf
Try to specify a different input parameter in your execution command, for example:
nextflow run script1.nf --reads '/workspace/nf-training-public/nf-training/data/ggal/lung_{1,2}.fq'
Exercise
Modify script1.nf by adding a fourth parameter named outdir and set it to a default path that will be used as the pipeline output directory.
Click here for the answer:
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
params.outdir = "results"
Exercise
Modify script1.nf to print all of the pipeline parameters by using a single log.info command as a multiline string statement. See an example here.
Click here for the answer:
Add the following to your script file:
log.info """\
    R N A S E Q - N F   P I P E L I N E
    ===================================
    transcriptome: ${params.transcriptome_file}
    reads        : ${params.reads}
    outdir       : ${params.outdir}
    """
    .stripIndent()
Recap
In this step you have learned:
- How to define parameters in your pipeline script
- How to pass parameters by using the command line
- The use of $var and ${var} variable placeholders (see the short example after this list)
- How to use multiline strings
- How to use log.info to print information and save it in the log execution file
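As a quick, hypothetical illustration of the two placeholder styles (not taken from the tutorial scripts):
// $var works for a simple variable; ${...} is required for expressions
name = 'world'
log.info "Hello $name"
log.info "Hello ${name.toUpperCase()}"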
3.2. Create a transcriptome index file
Nextflow allows the execution of any command or script by using a process definition.
A process is defined by providing three main declarations: the process input, output and command script.
To add a transcriptome INDEX processing step, try adding the following code blocks to your script1.nf. Alternatively, these code blocks have already been added to script2.nf.
/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process INDEX {
    input:
    path transcriptome

    output:
    path 'salmon_index'

    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i salmon_index
    """
}
Additionally, add a workflow scope containing an input channel definition and the index process:
workflow {
    index_ch = INDEX(params.transcriptome_file)
}
Here, the params.transcriptome_file parameter is used as the input for the INDEX process. The INDEX process (using the salmon tool) creates salmon_index, an indexed transcriptome that is passed as an output to the index_ch channel.
The input declaration defines a transcriptome path variable which is used in the script as a reference (using the dollar symbol) in the Salmon command line.
Resource requirements such as CPUs and memory limits can change with different workflow executions and platforms. Nextflow can use $task.cpus as a variable for the number of CPUs. See the process directives documentation for more details.
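As a hedged sketch (the values are examples only), the same resources could also be set in nextflow.config so that $task.cpus picks them up without editing the script:
// nextflow.config - hypothetical per-process resource settings
process {
    withName: 'INDEX' {
        cpus   = 2
        memory = '4 GB'
    }
}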
Run it by using the command:
nextflow run script2.nf
The execution will fail because salmon is not installed in your environment.
Add the command line option -with-docker to launch the execution through a Docker container, as shown below:
nextflow run script2.nf -with-docker
This time the execution will work because it uses the Docker container nextflow/rnaseq-nf that is defined in the nextflow.config file in your current directory. If you are running this script locally, you will need to download Docker to your machine, log in and activate Docker, and allow the script to download the container containing the run scripts. You can learn more about Docker here.
To avoid adding -with-docker each time you execute the script, add the following line to the nextflow.config file:
docker.enabled = true
Exercise
Enable the Docker execution by default by adding the above setting in the nextflow.config file.
Exercise
Print the output of the index_ch channel by using the view operator.
Click here for the answer:
Add the following to the end of your workflow block in your script file
index_ch.view()
Exercise
If you have more CPUs available, try changing your script to request more resources for this process; for example, see the directive docs. $task.cpus is already used in this script, so setting the number of CPUs as a directive will tell Nextflow to run this job with that many CPUs.
Click here for the answer:
Add cpus 2 to the top of the index process:
process INDEX {
    cpus 2
    input:
    ...
Then check that it worked by looking at the script executed in the work directory. Look for the hexadecimal directory name (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), then cat the .command.sh file.
Bonus Exercise
Use the command tree work to see how Nextflow organizes the process work directory. Check here if you need to download tree.
Click here for the answer:
It should look something like this:
work
├── 17
│   └── 263d3517b457de4525513ae5e34ea8
│       ├── index
│       │   ├── complete_ref_lens.bin
│       │   ├── ctable.bin
│       │   ├── ctg_offsets.bin
│       │   ├── duplicate_clusters.tsv
│       │   ├── eqtable.bin
│       │   ├── info.json
│       │   ├── mphf.bin
│       │   ├── pos.bin
│       │   ├── pre_indexing.log
│       │   ├── rank.bin
│       │   ├── refAccumLengths.bin
│       │   ├── ref_indexing.log
│       │   ├── reflengths.bin
│       │   ├── refseq.bin
│       │   ├── seq.bin
│       │   └── versionInfo.json
│       └── transcriptome.fa -> /workspace/Gitpod_test/data/ggal/transcriptome.fa
├── 7f
Recap
In this step you have learned:
- How to define a process executing a custom command
- How process inputs are declared
- How process outputs are declared
- How to print the content of a channel
- How to access the number of available CPUs
3.3. Collect read files by pairs
This step shows how to match read files into pairs, so they can be mapped by Salmon.
Edit the script script3.nf by adding the following statement as the last line of the file:
read_pairs_ch.view()
Save it and execute it with the following command:
nextflow run script3.nf
It will print something similar to this:
[gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]
The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is the read pair prefix and the second is a list representing the actual files.
Try it again specifying different read files by using a glob pattern:
nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'
File paths that include one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to avoid Bash expanding the glob.
Exercise
Use the set operator in place of the = assignment to define the read_pairs_ch channel.
Click here for the answer:
Channel
    .fromFilePairs( params.reads )
    .set { read_pairs_ch }
Exercise
Use the checkIfExists option for the fromFilePairs method to check if the specified path contains file pairs.
Click here for the answer:
Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .set { read_pairs_ch }
Recap
In this step you have learned:
- How to use fromFilePairs to handle read pair files
- How to use the checkIfExists option to check for the existence of input files
- How to use the set operator to define a new channel variable
The declaration of a channel can be before the workflow scope or within it, as long as it is upstream of the process that requires the specific channel.
3.4. Perform expression quantification
script4.nf adds a gene expression QUANTIFICATION process and a call to it within the workflow scope. Quantification requires the indexed transcriptome and the RNA-Seq read pair fastq files.
In the workflow scope, note how the index_ch channel is assigned the output of the INDEX process.
Next, note that the first input channel for the QUANTIFICATION process is the previously declared index_ch, which contains the path to the salmon_index.
Also, note that the second input channel for the QUANTIFICATION process is the read_pair_ch we just created. This is a tuple composed of two elements (a value: sample_id, and a list of paths to the fastq reads: reads) in order to match the structure of the items emitted by the fromFilePairs channel factory.
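As a hedged sketch of what such a process can look like (the exact code lives in script4.nf and may differ in details):
process QUANTIFICATION {
    input:
    path salmon_index
    tuple val(sample_id), path(reads)

    output:
    path "$sample_id"

    script:
    """
    salmon quant --threads $task.cpus --libType=U \
        -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    """
}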
Execute it by using the following command:
nextflow run script4.nf -resume
You will see the execution of the QUANTIFICATION process.
When using the -resume option, any step that has already been processed is skipped.
Try to execute the same script again with more read files, as shown below:
nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'
You will notice that the QUANTIFICATION process is executed multiple times.
Nextflow parallelizes the execution of your pipeline simply by providing multiple sets of input data to your script.
It may be useful to apply optional settings to a specific process using directives by specifying them in the process body.
Exercise
Add a tag directive to the QUANTIFICATION process to provide a more readable execution log.
Click here for the answer:
Add the following before the input declaration:
tag "Salmon on $sample_id"
Exercise
Add a publishDir directive to the QUANTIFICATION process to store the process results in a directory of your choice.
Click here for the answer:
Add the following before the input declaration in the QUANTIFICATION process:
publishDir params.outdir, mode:'copy'
Recap
In this step you have learned:
- How to connect two processes together by using the channel declarations
- How to resume the script execution and skip cached steps
- How to use the tag directive to provide a more readable execution output
- How to use the publishDir directive to store a process's results in a path of your choice
3.5. Quality control
Next, we implement a FASTQC quality control step for your input reads (using the label fastqc). The inputs are the same as the read pairs used in the QUANTIFICATION step.
You can run it by using the following command:
nextflow run script5.nf -resume
Nextflow DSL2 knows to split the reads_pair_ch into two identical channels, as they are required twice as an input for both the FASTQC and the QUANTIFICATION processes.
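A hedged sketch of the workflow scope illustrates the idea; the actual script5.nf may differ slightly:
workflow {
    read_pairs_ch = Channel.fromFilePairs( params.reads, checkIfExists: true )
    index_ch      = INDEX( params.transcriptome_file )
    quant_ch      = QUANTIFICATION( index_ch, read_pairs_ch )
    fastqc_ch     = FASTQC( read_pairs_ch )   // the same channel is used a second time
}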
3.6. MultiQC report
This step collects the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the MultiQC tool.
Execute the next script with the following command:
nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'
It creates the final report in the results folder in the current work directory.
In this script, note the use of the mix and collect operators chained together to gather the outputs of the QUANTIFICATION and FASTQC processes as a single input. Operators can be used to combine and transform channels.
MULTIQC(quant_ch.mix(fastqc_ch).collect())
We only want one task of MultiQC to be executed to produce one report. Therefore, we use the mix channel operator to combine the two channels, followed by the collect operator, to return the complete channel contents as a single element.
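To see the combined behaviour on toy values (an illustration, not part of script6.nf):
// mix merges two channels; collect bundles everything into a single emission
ch1 = Channel.of('a', 'b')
ch2 = Channel.of('c')
ch1.mix(ch2).collect().view()   // one item, e.g. [a, b, c] (element order may vary)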
Recap
In this step you have learned:
- How to collect many outputs to a single input with the collect operator
- How to mix two channels into a single channel
- How to chain two or more operators together
3.7. Handle completion event
This step shows how to execute an action when the pipeline completes the execution.
Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another as if they were written in the pipeline script in a common imperative programming language.
The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
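A minimal sketch of such a handler (the message text is an example, not necessarily the one used in script7.nf):
workflow.onComplete {
    log.info ( workflow.success ? "\nDone! The results are in --> $params.outdir\n" : "Oops .. something went wrong" )
}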
Try to run it by using the following command:
nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'
3.8. Email notifications
Send a notification email when the workflow execution completes using the -N <email address> command-line option.
Note: this requires the configuration of an SMTP server in the Nextflow config file. Below is an example nextflow.config file showing the settings you would have to configure:
mail {
    from = 'info@nextflow.io'
    smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
    smtp.port = 587
    smtp.user = "xxxxx"
    smtp.password = "yyyyy"
    smtp.auth = true
    smtp.starttls.enable = true
    smtp.starttls.required = true
}
See mail documentation for details.
3.9. Custom scripts
Real-world pipelines use a lot of custom user scripts (BASH, R, Python, etc.). Nextflow allows you to consistently use and manage these scripts. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.
For example, create a file named fastqc.sh with the following content:
#!/bin/bash
set -e
set -u
sample_id=${1}
reads=${2}
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
Save it, give it execute permission, and move it into the bin directory as shown below:
chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin
Then, open the script7.nf file and replace the FASTQC process' script with the following code:
script:
"""
fastqc.sh "$sample_id" "$reads"
"""
Run it as before:
nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'
Recap
In this step you have learned:
- How to write or use existing custom scripts in your Nextflow pipeline.
- How to avoid the use of absolute paths by having your scripts in the bin/ folder.
3.10. Metrics and reports
Nextflow can produce multiple reports and charts providing several runtime metrics and execution information.
Run the rnaseq-nf pipeline previously introduced as shown below:
nextflow run rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png
The -with-docker option launches each task of the execution as a Docker container run command.
The -with-report option enables the creation of the workflow execution report. Open the file report.html with a browser to see the report created with the above command.
The -with-trace option enables the creation of a tab separated file containing runtime information for each executed task. Check the trace.txt for an example.
The -with-timeline option enables the creation of the workflow timeline report showing how processes were executed over time. This may be useful to identify the most time consuming tasks and bottlenecks. See an example at this link.
Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph representation. Note: this feature requires the installation of Graphviz on your computer. See here for further details.
Then try running:
dot -Tpng dag.dot > graph.png
open graph.png
Note: runtime metrics may be incomplete for short running tasks, as is the case in this tutorial.
You can view the HTML files by right-clicking on the file name in the left side-bar and choosing the Preview menu item.
3.11. Run a project from GitHub
Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services, e.g., BitBucket and GitLab).
This simplifies the sharing and deployment of complex projects and tracking changes in a consistent manner.
The following GitHub repository hosts a complete version of the workflow introduced in this tutorial: https://github.com/nextflow-io/rnaseq-nf
You can run it by specifying the project name and launching each task of the execution as a Docker container run command:
nextflow run nextflow-io/rnaseq-nf -with-docker
It automatically downloads the container and stores it in the $HOME/.nextflow folder.
Use the info command to show the project information:
nextflow info nextflow-io/rnaseq-nf
Nextflow allows the execution of a specific revision of your project by using the -r command line option. For example:
nextflow run nextflow-io/rnaseq-nf -r v2.1 -with-docker
Revisions are defined by using Git tags or branches defined in the project repository.
Tags enable precise control of the changes in your project files and dependencies over time.
3.12. More resources
- Nextflow documentation - The Nextflow docs home.
- Nextflow patterns - A collection of Nextflow implementation patterns.
- CalliNGS-NF - A variant calling pipeline implementing GATK best practices.
- nf-core - A community collection of production-ready genomic pipelines.
4. Manage dependencies and containers
Computational workflows are rarely composed of a single script or tool. Most of the time they require the usage of dozens of different software components or libraries.
Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in scientific applications.
To overcome these issues we use containers that allow the encapsulation of software dependencies, i.e. tools and libraries required by a data analysis application in one or more self-contained, ready-to-run, immutable Linux container images, that can be easily deployed in any platform supporting the container runtime.
Containers can be executed in an isolated manner from the hosting system, having their own copy of the file system, processing space, memory management, etc.
Containers were first introduced with kernel 2.6 as a Linux feature known as Control Groups or Cgroups.
4.1. Docker
Docker is a handy management tool to build, run and share container images.
These images can be uploaded and published in a centralized repository known as Docker Hub, or hosted by other parties like Quay.
4.1.1. Run a container
A container can be run using the following command:
docker run <container-name>
Try for example the following publicly available container (if you have docker installed):
docker run hello-world
4.1.2. Pull a container
The pull command allows you to download a Docker image without running it. For example:
docker pull debian:stretch-slim
The above command downloads a Debian Linux image. You can check it exists by using:
docker images
4.1.3. Run a container in interactive mode
Launching a BASH shell in the container allows you to operate in an interactive mode in the containerized operating system. For example:
docker run -it debian:stretch-slim bash
Once launched, you will notice that it is running as root (!). Use the usual commands to navigate the file system. This is useful to check if the expected programs are present within a container.
To exit from the container, stop the BASH session with the exit
command.
4.1.4. Your first Dockerfile
Docker images are created by using a so-called Dockerfile, which is a simple text file containing a list of commands to assemble and configure the image with the software packages required.
Here, you will create a Docker image containing cowsay and the Salmon tool.
The Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big/many files exist. For this reason, it’s important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use the .dockerignore file to select paths to exclude from the build.
Use your favorite editor (e.g., vim or nano) to create a file named Dockerfile and copy the following content:
FROM debian:stretch-slim
LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"
RUN apt-get update && apt-get install -y curl cowsay
ENV PATH=$PATH:/usr/games/
4.1.5. Build the image
Build the Dockerfile image by using the following command:
docker build -t my-image .
Where "my-image" is the user-specified name tag for the Dockerfile, present in the current directory.
Don’t miss the dot in the above command. |
When it completes, verify that the image has been created by listing all available images:
docker images
You can try your new container by running this command:
docker run my-image cowsay Hello Docker!
4.1.6. Add a software package to the image
Add the Salmon package to the Docker image by adding the following snippet to the Dockerfile:
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
Save the file and build the image again with the same command as before:
docker build -t my-image .
You will notice that it creates a new Docker image with the same name but with a different image ID.
4.1.7. Run Salmon in the container
Check that Salmon is running correctly in the container as shown below:
docker run my-image salmon --version
You can even launch a container in an interactive mode by using the following command:
docker run -it my-image bash
Use the exit command to terminate the interactive session.
4.1.8. File system mounts
Create a genome index file by running Salmon in the container.
Try to run Salmon in the container with the following command:
docker run my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
The above command fails because Salmon cannot access the input file.
This happens because the container runs in a completely separate file system and it cannot access the hosting file system by default.
You will need to use the --volume command-line option to mount the input file(s), e.g.:
docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
    salmon index -t /transcriptome.fa -i transcript-index
The generated transcript-index directory is still not accessible in the host file system.
An easier way is to mount a parent directory to an identical one in the container. This allows you to use the same path when running it in the container, e.g.:
docker run --volume $PWD:$PWD --workdir $PWD my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
Or set a folder you want to mount as an environment variable, called DATA:
DATA=/workspace/nf-training-public/nf-training/data
docker run --volume $DATA:$DATA --workdir $PWD my-image \
    salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
Now check the content of the transcript-index folder by entering the command:
ls -la transcript-index
Note that the permissions for files created by the Docker execution are root.
4.1.9. Upload the container in the Docker Hub (bonus)
Publish your container in the Docker Hub to share it with other people.
Create an account on the hub.docker.com website. Then from your shell terminal run the following command, entering the user name and password you specified when registering in the Hub:
docker login
Tag the image with your Docker user name account:
docker tag my-image <user-name>/my-image
Finally push it to the Docker Hub:
docker push <user-name>/my-image
After that anyone will be able to download it by using the command:
docker pull <user-name>/my-image
Note how after a pull and push operation, Docker prints the container digest number e.g.
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest
This is a unique and immutable identifier that can be used to reference a container image unambiguously. For example:
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
4.1.10. Run a Nextflow script using a Docker container
The simplest way to run a Nextflow script with a Docker image is using the -with-docker command-line option:
nextflow run script2.nf -with-docker my-image
As seen in the last section, you can also configure the Nextflow config file (nextflow.config) to select which container to use instead of having to specify it as a command-line argument every time.
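For example, a hedged nextflow.config sketch that removes the need for the command-line flag might look like this (the image name is whatever you built or pushed earlier):
// nextflow.config - hypothetical settings replacing "-with-docker my-image"
process.container = 'my-image'
docker.enabled    = true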
4.2. Singularity
Singularity is a container runtime designed to work in high-performance computing data centers, where the usage of Docker is generally not allowed due to security constraints.
Singularity implements a container execution model similar to Docker. However, it uses a completely different implementation design.
A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by many computing nodes managed using a batch scheduler.
Singularity will not work with Gitpod. If you wish to try this section, please do it locally, or on an HPC.
4.2.1. Create a Singularity image
Singularity images are created using a Singularity file, in a similar manner to Docker but using a different syntax.
Bootstrap: docker
From: debian:stretch-slim
%environment
export PATH=$PATH:/usr/games/
%labels
AUTHOR <your name>
%post
apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
Once you have saved the Singularity file, you can create the image with the following command:
sudo singularity build my-image.sif Singularity
Note: the build command requires sudo permissions. A common workaround consists of building the image on a local workstation and then deploying it to the cluster by copying the image file.
4.2.2. Running a container
Once done, you can run your container with the following command:
singularity exec my-image.sif cowsay 'Hello Singularity'
By using the shell command you can enter the container in interactive mode. For example:
singularity shell my-image.sif
Once in the container instance run the following commands:
touch hello.txt
ls -la
Note how the files on the host environment are shown. Singularity automatically mounts the host $HOME directory and uses the current work directory.
4.2.3. Import a Docker image
An easier way to create a Singularity container, without requiring sudo permission and boosting the container's interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:
singularity pull docker://debian:stretch-slim
The above command automatically downloads the Debian Docker image and converts it to a Singularity image in the current directory, named after the image and tag (e.g. debian_stretch-slim.sif).
4.2.4. Run a Nextflow script using a Singularity container
Nextflow allows the transparent usage of Singularity containers as easily as with Docker. Simply enable the use of the Singularity engine in place of Docker by using the -with-singularity command-line option:
nextflow run script7.nf -with-singularity nextflow/rnaseq-nf
As before, the Singularity container can also be provided in the Nextflow config file. We’ll see how to do this later.
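As a hedged preview, the settings typically amount to something like the following (autoMounts is an optional convenience and an assumption of this sketch):
// nextflow.config - example Singularity settings
process.container      = 'nextflow/rnaseq-nf'
singularity.enabled    = true
singularity.autoMounts = true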
4.2.5. The Singularity Container Library
The authors of Singularity, SyLabs, have their own repository of Singularity containers.
In the same way that we can push Docker images to Docker Hub, we can upload Singularity images to the Singularity Library.
4.3. Conda/Bioconda packages
Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow pipelines to automatically create and activate the Conda environment(s), given the dependencies specified by each process.
In this Gitpod environment, conda is already installed.
4.3.1. Using conda
A Conda environment is defined using a YAML file, which lists the required software packages. The first thing you need to do is to initiate conda for shell interaction, and then open a new terminal by running bash.
conda init bash
Then write your YAML file (to env.yml). For example:
name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2
Given the recipe file, the environment is created using the command shown below:
conda env create --file env.yml
You can check the environment was created successfully with the command shown below:
conda env list
This should look something like this:
# conda environments: # base * /opt/conda nf-tutorial /opt/conda/envs/nf-tutorial
To enable the environment, you can use the activate command:
conda activate nf-tutorial
Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-conda option (using the same path shown by the list function). For example:
nextflow run script7.nf -with-conda /opt/conda/envs/nf-tutorial
When creating a Conda environment with a YAML recipe file, Nextflow automatically downloads the required dependencies, builds the environment and activates it.
This makes it easier to manage different environments for the processes in the workflow script.
See the docs for details.
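A hedged configuration sketch showing the equivalent of -with-conda in nextflow.config (the conda.enabled flag is needed on recent Nextflow versions and is an assumption here):
// nextflow.config - example Conda settings
conda.enabled = true
process.conda = '/opt/conda/envs/nf-tutorial'   // or the path to an env.yml recipe file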
4.3.2. Create and use conda-like environments using micromamba
Another way to build conda-like environments is through a Dockerfile and micromamba.
micromamba is a fast and robust package for building small conda-based environments.
This saves having to build a conda environment each time you want to use it (as outlined in previous sections).
To do this, you simply require a Dockerfile and you use micromamba to install the packages. However, a good practice is to have a YAML recipe file like in the previous section, so we'll do it here too.
name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2
Then, we can write our Dockerfile with micromamba installing the packages from this recipe file.
FROM mambaorg/micromamba:0.25.1
LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"
COPY --chown=$MAMBA_USER:$MAMBA_USER micromamba.yml /tmp/env.yml
RUN micromamba create -n nf-tutorial
RUN micromamba install -y -n nf-tutorial -f /tmp/env.yml && \
micromamba clean --all --yes
ENV PATH /opt/conda/envs/nf-tutorial/bin:$PATH
The above Dockerfile takes the parent image mambaorg/micromamba, then installs a conda environment using micromamba, and installs salmon, fastqc and multiqc.
Exercise
Try executing the RNA-Seq pipeline from earlier (script7.nf). Start by building your own micromamba Dockerfile (from above), save it to your Docker Hub repo, and direct Nextflow to run from this container (changing your nextflow.config).
Building a Docker container and pushing to your personal repo can take >10 minutes.
For an overview of steps to take, click here:
- Make a file called Dockerfile in the current directory (with the code above).
- Build the image: docker build -t my-image . (don't forget the '.').
- Publish the Docker image to your online Docker account. Something similar to the following, with <myrepo> replaced with your own Docker ID, without '<' and '>' characters! "my-image" could be replaced with any name you choose. As good practice, choose something memorable and ensure the name matches the name you used in the previous command.
docker login
docker tag my-image <myrepo>/my-image
docker push <myrepo>/my-image
- Add the image file name to the nextflow.config file, e.g. remove the following from the nextflow.config: process.container = 'nextflow/rnaseq-nf' and replace it with: process.container = '<myrepo>/my-image'
- Try running Nextflow, e.g.: nextflow run script7.nf -with-docker
Nextflow should now be able to find salmon to run the process.
4.4. BioContainers
Another useful resource linking together Bioconda and containers is the BioContainers project. BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.
So far, we’ve seen how to install packages with conda and micromamba, both locally and within containers. With BioContainers, you don’t need to create your own container image for the tools you want, and you don’t need to use conda or micromamba to install the packages. It already provides you with a Docker image containing the programs you want installed. For example, you can get the container image of fastqc using BioContainers with:
docker pull biocontainers/fastqc:v0.11.5
You can check the registry for the packages you want on the BioContainers official website.
Contrary to other registries that will pull the latest image when no tag (version) is provided, you must specify a tag when pulling from BioContainers (after a colon, e.g. fastqc:v0.11.5). Check the tags within the registry and pick the one that best suits your needs.
Exercise
During the earlier RNA-Seq tutorial (script2.nf), we created an index with the salmon tool. Given we do not have salmon installed locally in the machine provided by Gitpod, we had to either run it with -with-conda or -with-docker. Your task now is to run it again with -with-docker, but without having to create your own Docker container image. Instead, use the BioContainers image for Salmon 1.7.0.
For the result, click here:
nextflow run script2.nf -with-docker quay.io/biocontainers/salmon:1.7.0--h84f40af_0
Bonus Exercise
Change the process directives in script5.nf or the nextflow.config file to make the pipeline automatically use BioContainers when using salmon or fastqc.
HINT: Temporarily comment out the line process.container = 'nextflow/rnaseq-nf' in the nextflow.config file to make sure the processes are using the BioContainers that you set, and not the container image we have been using in this training. With these changes, you should be able to run the pipeline with BioContainers by running the following in the command line:
For the result, click here:
nextflow run script5.nf
with the following container directives for each process:
process FASTQC {
    container 'biocontainers/fastqc:v0.11.5'
    tag "FASTQC on $sample_id"
    ...
and
process QUANTIFICATION {
    tag "Salmon on $sample_id"
    container 'quay.io/biocontainers/salmon:1.7.0--h84f40af_0'
    publishDir params.outdir, mode:'copy'
    ...
Check the .command.run file in the work directory and ensure that the run line contains the correct BioContainers.
Bonus content
5. Channels
Channels are a key data structure of Nextflow that allows the implementation of reactive-functional oriented computational workflows based on the Dataflow programming paradigm.
They are used to logically connect tasks to each other or to implement functional style data transformations.

5.1. Channel types
Nextflow distinguishes two different kinds of channels: queue channels and value channels.
5.1.1. Queue channel
A queue channel is an asynchronous unidirectional FIFO queue that connects two processes or operators.
- asynchronous means that operations are non-blocking.
- unidirectional means that data flows from a producer to a consumer.
- FIFO means that the data is guaranteed to be delivered in the same order as it is produced (First In, First Out).
A queue channel is implicitly created by process output definitions or using channel factories such as Channel.of or Channel.fromPath.
Try the following snippets:
ch = Channel.of(1,2,3)
println(ch) (1)
ch.view() (2)
1 | Use the built-in print line function println to print the ch channel |
2 | Apply the view method to the ch channel to print each item emitted by the channel |
Exercise
Try to execute this snippet. You can do that by creating a new .nf file or by editing an already existing .nf file.
ch = Channel.of(1,2,3)
ch.view()
5.1.2. Value channels
A value channel (a.k.a. singleton channel) by definition is bound to a single value and it can be read unlimited times without consuming its contents. A value channel is created using the value factory method or by operators returning a single value, such as first, last, collect, count, min, max, reduce, and sum.
To better understand the difference between value and queue channels, save the snippet below as example.nf.
ch1 = Channel.of(1,2,3)
ch2 = Channel.of(1)

process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    SUM(ch1,ch2).view()
}
When you run the script, it prints only 2, as you can see below:
2
To understand why, we can inspect the queue channel; running Nextflow with DSL1 gives us a more explicit view of what is behind the curtains.
ch1 = Channel.of(1)
println ch1
Then:
nextflow run example.nf -dsl1
This will print:
DataflowQueue(queue=[DataflowVariable(value=1), DataflowVariable(value=groovyx.gpars.dataflow.operator.PoisonPill@34be065a)])
We have the value 1 as the single element of our queue channel and a poison pill, which will tell the process that there’s nothing left to be consumed. That’s why we only have one output for the example above, which is 2. Let’s inspect a value channel now.
ch1 = Channel.value(1)
println ch1
Then:
nextflow run example.nf -dsl1
This will print:
DataflowVariable(value=1)
There is no poison pill, and that’s why we get a different output with the code below, where ch2 is turned into a value channel through the first operator.
ch1 = Channel.of(1,2,3)
ch2 = Channel.of(1)

process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    SUM(ch1,ch2.first()).view()
}
4
3
2
Besides, in many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation. For example, when you invoke a process with a pipeline parameter (params.example) which has a string value, it is automatically cast into a value channel.
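A small illustration of this implicit conversion (a made-up process, not part of the tutorial scripts):
params.example = 'Hola'

process SAYHELLO {
    input:
    val greeting

    output:
    stdout

    script:
    """
    echo '$greeting mundo'
    """
}

workflow {
    // params.example is a plain string, implicitly wrapped in a value channel
    SAYHELLO(params.example).view()
}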
5.2. Channel factories
These are Nextflow commands for creating channels that have implicit expected inputs and functions.
5.2.1. value
The value factory method is used to create a value channel. An optional not-null argument can be specified to bind the channel to a specific value. For example:
ch1 = Channel.value() (1)
ch2 = Channel.value( 'Hello there' ) (2)
ch3 = Channel.value( [1,2,3,4,5] ) (3)
1 | Creates an empty value channel |
2 | Creates a value channel and binds a string to it |
3 | Creates a value channel and binds a list object to it that will be emitted as a sole emission |
5.2.2. of
The factory Channel.of allows the creation of a queue channel with the values specified as arguments.
ch = Channel.of( 1, 3, 5, 7 )
ch.view{ "value: $it" }
The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as a parameter in the of method. Thus the second line will print the following:
value: 1
value: 3
value: 5
value: 7
The method Channel.of works in a similar manner to Channel.from (which is now deprecated), fixing some inconsistent behaviors of the latter and providing better handling when specifying a range of values.
For example, the following works with a range from 1 to 23:
1
2
3
Channel
.of(1..23, 'X', 'Y')
.view()
5.2.3. fromList
The method Channel.fromList
creates a channel emitting the elements provided
by a list object specified as an argument:
1
2
3
4
5
list = ['hello', 'world']
Channel
.fromList(list)
.view()
5.2.4. fromPath
The fromPath
factory method creates a queue channel emitting one or more files
matching the specified glob pattern.
Channel.fromPath( './data/meta/*.csv' )
This example creates a channel and emits as many items as there are files with a csv
extension in the data/meta
folder. Each element is a file object implementing the Path interface.
Two asterisks, i.e. ** , work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.
|
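For example, a small sketch of such patterns (the paths below are illustrative only):
// '*' matches within a single directory, '**' also crosses directory boundaries,
// and '{csv,tsv}' matches either of the two extensions
Channel.fromPath( './data/**/*.{csv,tsv}' )
    .view()
The fromPath factory also accepts the following options: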
Name | Description |
---|---|
glob | When true, interprets characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal strings (default: true) |
type | Type of path returned, either file, dir or any (default: file) |
hidden | When true, includes hidden files in the resulting paths (default: false) |
maxDepth | Maximum number of directory levels to visit (default: no limit) |
followLinks | When true, symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true) |
relative | When true, returned paths are relative to the top-most common directory (default: false) |
checkIfExists | When true, throws an exception if the specified path does not exist in the file system (default: false) |
Learn more about the glob patterns syntax at this link.
Exercise
Use the Channel.fromPath
method to create a channel emitting all files with the suffix .fq
in the data/ggal/
directory and any subdirectory, including hidden files. Then print the file names.
Click here for the answer:
1
2
Channel.fromPath( './data/ggal/**.fq' , hidden:true)
.view()
5.2.5. fromFilePairs
The fromFilePairs
method creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples, in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).
1
2
3
Channel
.fromFilePairs('./data/ggal/*_{1,2}.fq')
.view()
It will produce an output similar to the following:
[liver, [/user/nf-training/data/ggal/liver_1.fq, /user/nf-training/data/ggal/liver_2.fq]] [gut, [/user/nf-training/data/ggal/gut_1.fq, /user/nf-training/data/ggal/gut_2.fq]] [lung, [/user/nf-training/data/ggal/lung_1.fq, /user/nf-training/data/ggal/lung_2.fq]]
The glob pattern must contain at least one star wildcard character. |
Name | Description |
---|---|
type | Type of paths returned, either file, dir or any (default: file) |
hidden | When true, includes hidden files in the resulting paths (default: false) |
maxDepth | Maximum number of directory levels to visit (default: no limit) |
followLinks | When true, symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true) |
size | Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any |
flat | When true, the matching files are produced as sole elements in the emitted tuples (default: false) |
checkIfExists | When true, throws an exception if the specified path does not exist in the file system (default: false) |
Exercise
Use the fromFilePairs
method to create a channel emitting all pairs of fastq reads in the data/ggal/
directory and print them. Then use the flat:true
option and compare the output with the previous execution.
Click here for the answer:
Use the following, with or without 'flat:true':
1
2
Channel.fromFilePairs( './data/ggal/*_{1,2}.fq', flat:true)
.view()
Then check the square brackets around the file names, to see the difference with flat
.
5.2.6. fromSRA
The Channel.fromSRA
method makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.
The query can be project ID(s) or accession number(s) supported by the NCBI ESearch API.
This function now requires an API key that you can only get by logging into your NCBI account. |
For help with NCBI login and key acquisition, click here:
-
Go to: www.ncbi.nlm.nih.gov/
-
Click the top right "Log in" button to sign into NCBI. Follow their instructions.
-
Once logged into your account, click the button at the top right (usually your account ID).
-
Go to Account settings
-
Scroll down to the API Key Management section.
-
Click on "Create an API Key".
-
The page will refresh and the key will be displayed where the button was. Copy your key.
You also need to use the latest edge version of Nextflow. Check your nextflow -version; it should say -edge. If not, download the newest Nextflow version, following the instructions linked here.
|
For example, the following snippet will print the contents of an NCBI project ID:
1
2
3
4
5
params.ncbi_api_key = '<Your API key here>'
Channel
.fromSRA(['SRP073307'], apiKey: params.ncbi_api_key)
.view()
Replace <Your API key here> with your API key.
|
This should print:
[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]] [SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]] [SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]] [SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]] (remaining omitted)
Multiple accession IDs can be specified using a list object:
1
2
3
4
ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
.fromSRA(ids, apiKey: params.ncbi_api_key)
.view()
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]] [ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]] [ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Read pairs are implicitly managed and are returned as a list of files. |
It’s straightforward to use this channel as an input using the usual Nextflow syntax. The code below creates a channel containing two samples from a public SRA study and runs FASTQC on the resulting files. See:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
params.ncbi_api_key = '<Your API key here>'
params.accession = ['ERR908507', 'ERR908506']
process fastqc {
input:
tuple val(sample_id), path(reads_file)
output:
path("fastqc_${sample_id}_logs")
script:
"""
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
"""
}
workflow {
reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)
fastqc(reads)
}
If you want to run the pipeline above and do not have fastqc installed on your machine, don’t forget what you learned in the previous section. Run this pipeline with -with-docker biocontainers/fastqc:v0.11.5
, for example.
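For example, assuming you saved the snippet above in a file called fastqc_sra.nf (a hypothetical name), the command would be:
nextflow run fastqc_sra.nf -with-docker biocontainers/fastqc:v0.11.5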
5.2.7. Text files
The splitText
operator allows you to split multi-line strings or text file items emitted by a source channel into chunks containing n lines, which will be emitted by the resulting channel. See:
1
2
3
4
Channel
.fromPath('data/meta/random.txt') (1)
.splitText() (2)
.view() (3)
1 | Instructs Nextflow to make a channel from the path "data/meta/random.txt". |
2 | The splitText operator splits each item into chunks of one line by default. |
3 | View contents of the channel. |
You can define the number of lines in each chunk by using the parameter by
, as shown in the following example:
1
2
3
4
5
6
7
Channel
.fromPath('data/meta/random.txt')
.splitText( by: 2 )
.subscribe {
print it;
print "--- end of the chunk ---\n"
}
The subscribe operator permits the execution of user-defined functions each time a new value is emitted by the source channel.
|
An optional closure can be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and transform them into capital letters:
1
2
3
4
Channel
.fromPath('data/meta/random.txt')
.splitText( by: 10 ) { it.toUpperCase() }
.view()
You can also keep a running count of the emitted lines:
1
2
3
4
5
6
count=0
Channel
.fromPath('data/meta/random.txt')
.splitText()
.view { "${count++}: ${it.toUpperCase().trim()}" }
Finally, you can also use the operator on plain files (outside of the channel context):
1
2
3
4
5
6
def f = file('data/meta/random.txt')
def lines = f.splitText()
def count=0
for( String row : lines ) {
log.info "${count++} ${row.toUpperCase()}"
}
5.2.8. Comma separated values (.csv)
The splitCsv
operator allows you to parse CSV-formatted text items emitted by a channel.
It then splits them into records or groups them as a list of records with a specified length.
In the simplest case, just apply the splitCsv
operator to a channel emitting CSV-formatted text files or text entries. For example, to view only the first and fourth columns:
1
2
3
4
5
Channel
.fromPath("data/meta/patients_1.csv")
.splitCsv()
// row is a list object
.view { row -> "${row[0]},${row[3]}" }
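The rows can also be grouped into lists of a fixed length with the by option mentioned above. A minimal sketch, reusing the same file:
Channel
    .fromPath("data/meta/patients_1.csv")
    // emit the parsed rows in groups of two
    .splitCsv( by: 2 )
    .view()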
When the CSV begins with a header line defining the column names, you can specify the parameter header: true
which allows you to reference each value by its column name, as shown in the following example:
1
2
3
4
5
Channel
.fromPath("data/meta/patients_1.csv")
.splitCsv(header: true)
// row is a map keyed by the column names
.view { row -> "${row.patient_id},${row.num_samples}" }
Alternatively, you can provide custom header names by specifying a list of strings in the header parameter as shown below:
1
2
3
4
5
Channel
.fromPath("data/meta/patients_1.csv")
.splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'] )
// row is a map keyed by the custom header names
.view { row -> "${row.col1},${row.col4}" }
You can also process multiple csv files at the same time:
1
2
3
4
Channel
.fromPath("data/meta/patients_*.csv") // <-- just use a pattern
.splitCsv(header:true)
.view { row -> "${row.patient_id}\t${row.num_samples}" }
Notice that you can change the output format simply by adding a different delimiter. |
Finally, you can also operate on csv files outside the channel context:
1
2
3
4
5
def f = file('data/meta/patients_1.csv')
def lines = f.splitCsv()
for( List row : lines ) {
log.info "${row[0]} -- ${row[2]}"
}
Exercise
Try inputting fastq reads into the RNA-Seq workflow from earlier using .splitCsv
.
Click here for the answer:
Add a csv text file containing the following, as an example input with the name "fastq.csv":
gut,/workspace/nf-training-public/nf-training/data/ggal/gut_1.fq,/workspace/nf-training-public/nf-training/data/ggal/gut_2.fq
Then replace the input channel for the reads in script7.nf
. Changing the following lines:
1
2
3
Channel
.fromFilePairs( params.reads, checkIfExists: true )
.set { read_pairs_ch }
To a splitCsv channel factory input:
1
2
3
4
5
Channel
.fromPath("fastq.csv")
.splitCsv()
.view { row -> "${row[0]},${row[1]},${row[2]}" }
.set { read_pairs_ch }
Finally, change the cardinality of the processes that use the input data. For example, for the quantification process, change it from:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process QUANTIFICATION {
tag "$sample_id"
input:
path salmon_index
tuple val(sample_id), path(reads)
output:
path sample_id, emit: quant_ch
script:
"""
salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
"""
}
To:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process QUANTIFICATION {
tag "$sample_id"
input:
path salmon_index
tuple val(sample_id), path(reads1), path(reads2)
output:
path sample_id, emit: quant_ch
script:
"""
salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads1} -2 ${reads2} -o $sample_id
"""
}
Repeat the above for the fastqc step.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process FASTQC {
tag "FASTQC on $sample_id"
input:
tuple val(sample_id), path(reads1), path(reads2)
output:
path "fastqc_${sample_id}_logs"
script:
"""
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads1} ${reads2}
"""
}
Now the workflow should run from a CSV file.
5.2.9. Tab separated values (.tsv)
Parsing tsv files works in a similar way; simply add the sep:'\t'
option in the splitCsv
context:
1
2
3
4
5
6
Channel
.fromPath("data/meta/regions.tsv", checkIfExists:true)
// use `sep` option to parse TAB separated files
.splitCsv(sep:'\t')
// row is a list object
.view()
Exercise
Try using the tab separation technique on the file "data/meta/regions.tsv", but print just the first column, and remove the header.
Answer:
1
2
3
4
5
6
Channel
.fromPath("data/meta/regions.tsv", checkIfExists:true)
// use `sep` option to parse TAB separated files
.splitCsv(sep:'\t', header:true )
// row is a list object
.view { row -> "${row.patient_id}" }
5.3. More complex file formats
5.3.1. JSON
We can also easily parse the JSON file format using Groovy's JsonSlurper class, as shown below:
1
2
3
4
5
6
7
8
9
import groovy.json.JsonSlurper
def f = file('data/meta/regions.json')
def records = new JsonSlurper().parse(f)
for( def entry : records ) {
log.info "$entry.patient_id -- $entry.feature"
}
When using an older Groovy version, you may need to replace parse(f) with parseText(f.text)
|
5.3.2. YAML
A similar approach can be used to parse YAML files, using the SnakeYAML library:
1
2
3
4
5
6
7
8
9
import org.yaml.snakeyaml.Yaml
def f = file('data/meta/regions.yml')
def records = new Yaml().load(f)
for( def entry : records ) {
log.info "$entry.patient_id -- $entry.feature"
}
5.3.3. Storage of parsers into modules
The best way to store parser scripts is to keep them in a Nextflow module file.
See the following Nextflow script:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
include{ parseJsonFile } from './modules/parsers.nf'
process foo {
input:
tuple val(meta), path(data_file)
"""
echo your_command $meta.region_id $data_file
"""
}
workflow {
Channel.fromPath('data/meta/regions*.json') \
| flatMap { parseJsonFile(it) } \
| map { entry -> tuple(entry,"/some/data/${entry.patient_id}.txt") } \
| foo
}
For this script to work, a module file called parsers.nf
needs to be created and stored in a modules folder in the current directory.
The parsers.nf
file should contain the parseJsonFile
function.
Nextflow will use this as a custom function within the workflow scope.
You will learn more about module files later in section 9.1 of this tutorial. |
6. Processes
In Nextflow, a process
is the basic computing primitive to execute foreign functions (i.e., custom scripts or tools).
The process
definition starts with the keyword process
, followed by the process name and finally the process body delimited by curly brackets.
The process
name is commonly written in upper case by convention.
A basic process
, only using the script
definition block, looks like the following:
1
2
3
4
5
6
7
process SAYHELLO {
script:
"""
echo 'Hello world!'
"""
}
In more complex examples, the process body can contain up to five definition blocks:
-
Directives are initial declarations that define optional settings
-
Input defines the expected input file(s) and the channel from where to find them
-
Output defines the expected output file(s) and the channel to send the data to
-
When is an optional clause statement to allow conditional processes
-
Script is a string statement that defines the command to be executed by the process
The full process syntax is defined as follows:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
process < name > {
[ directives ] (1)
input: (2)
< process inputs >
output: (3)
< process outputs >
when: (4)
< condition >
[script|shell|exec]: (5)
"""
< user script to be executed >
"""
}
1 | Zero, one or more process directives |
2 | Zero, one or more process inputs |
3 | Zero, one or more process outputs |
4 | An optional boolean conditional to trigger the process execution |
5 | The command to be executed |
6.1. Script
The script
block is a string statement that defines the command to be executed by the process.
A process can execute only one script
block. It must be the last statement when the process contains input and output declarations.
The script
block can be a single or a multi-line string. The latter simplifies the writing of non-trivial scripts
composed of multiple commands spanning over multiple lines. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
process EXAMPLE {
script:
"""
echo 'Hello world!\nHola mundo!\nCiao mondo!\nHallo Welt!' > file
cat file | head -n 1 | head -c 5 > chunk_1.txt
gzip -c chunk_1.txt > chunk_archive.gz
"""
}
workflow {
EXAMPLE()
}
By default, the process
command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process PYSTUFF {
script:
"""
#!/usr/bin/env python
x = 'Hello'
y = 'world!'
print ("%s - %s" % (x,y))
"""
}
workflow {
PYSTUFF()
}
Multiple programming languages can be used within the same workflow script. However, for large chunks of code it is better to save them into separate files and invoke them from the process script. One can store the specific scripts in the ./bin/ folder.
|
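As a minimal sketch of the ./bin/ approach (the script name greet.py is hypothetical): create an executable script bin/greet.py in the project directory (starting with #!/usr/bin/env python and made executable with chmod +x bin/greet.py). Nextflow automatically adds the project bin/ folder to the PATH of each task, so the process can simply call it by name:
process GREETER {
    debug true
    script:
    """
    greet.py
    """
}
workflow {
    GREETER()
}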
6.1.1. Script parameters
Script parameters (params
) can be defined dynamically using variable values. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
params.data = 'World'
process FOO {
script:
"""
echo Hello $params.data
"""
}
workflow {
FOO()
}
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information. |
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character.
|
1
2
3
4
5
6
7
8
9
10
11
process FOO {
script:
"""
echo "The current directory is \$PWD"
"""
}
workflow {
FOO()
}
It can be tricky to write a script that uses many Bash variables. One possible alternative is to use a script string delimited by single-quote characters:
1
2
3
4
5
6
7
8
9
10
11
process BAR {
script:
"""
echo $PATH | tr : '\\n'
"""
}
workflow {
BAR()
}
However, this blocks the usage of Nextflow variables in the command script.
Another alternative is to use a shell
statement instead of script
and use a different
syntax for Nextflow variables, e.g., !{..}
. This allows the use of both Nextflow and Bash variables in the same script.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
params.data = 'le monde'
process BAZ {
shell:
'''
X='Bonjour'
echo $X !{params.data}
'''
}
workflow {
BAZ()
}
6.1.2. Conditional script
The process script can also be defined in a completely dynamic manner using an if
statement or any other expression for evaluating a string value. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.compress = 'gzip'
params.file2compress = "$baseDir/data/ggal/transcriptome.fa"
process FOO {
input:
path file
script:
if( params.compress == 'gzip' )
"""
gzip -c $file > ${file}.gz
"""
else if( params.compress == 'bzip2' )
"""
bzip2 -c $file > ${file}.bz2
"""
else
throw new IllegalArgumentException("Unknown compression method $params.compress")
}
workflow {
FOO(params.file2compress)
}
6.2. Inputs
Nextflow processes are isolated from each other but can communicate between themselves by sending values through channels.
Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:

The input
block defines which channels the process
is expecting to receive data from. You can only define one input
block at a time, and it must contain one or more input declarations.
The input
block follows the syntax shown below:
1
2
input:
<input qualifier> <input name>
6.2.1. Input values
The val
qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
num = Channel.of( 1, 2, 3 )
process BASICEXAMPLE {
debug true
input:
val x
script:
"""
echo process job $x
"""
}
workflow {
myrun = BASICEXAMPLE(num)
}
In the above example the process is executed three times, each time a value is received from the channel num
and used to run the script. Thus, it results in an output similar to the one shown below:
process job 3 process job 1 process job 2
The channel guarantees that items are delivered in the same order as they have been sent - but - since the process is executed in a parallel manner, there is no guarantee that they are processed in the same order as they are received. |
6.2.2. Input files
The path
qualifier allows the handling of file values in the process execution context. This means that
Nextflow will stage the file in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )
process FOO {
debug true
input:
path 'sample.fastq'
script:
"""
ls sample.fastq
"""
}
workflow {
result = FOO(reads)
}
The input file name can also be defined using a variable reference as shown below:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )
process FOO {
debug true
input:
path sample
script:
"""
ls $sample
"""
}
workflow {
result = FOO(reads)
}
The same syntax is also able to handle more than one input file in the same execution and only requires changing the channel composition.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )
process FOO {
debug true
input:
path sample
script:
"""
ls -lh $sample
"""
}
workflow {
FOO(reads.collect())
}
In the past, the file qualifier was used for files, but the path qualifier
should be preferred over file to handle process input files when using Nextflow 19.10.0
or later. When a process declares an input file, the corresponding channel elements
must be file objects, created with the file helper function or by the file-specific
channel factories (e.g., Channel.fromPath or Channel.fromFilePairs ).
|
Exercise
Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq
followed by a process that concatenates them into a single file and prints the first 20 lines.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
params.reads = "$baseDir/data/ggal/*_1.fq"
Channel
.fromPath( params.reads )
.set { read_ch }
process CONCATENATE {
tag "Concat all files"
input:
path '*'
output:
path 'top_20_lines'
script:
"""
cat * > concatenated.txt
head -n 20 concatenated.txt > top_20_lines
"""
}
workflow {
concat_ch = CONCATENATE(read_ch.collect())
concat_ch.view()
}
6.2.3. Combine input channels
A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to understand how channel contents and their semantics affect the execution of a process.
Consider the following example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
ch1 = Channel.of(1,2,3)
ch2 = Channel.of('a','b','c')
process FOO {
debug true
input:
val x
val y
script:
"""
echo $x and $y
"""
}
workflow {
FOO(ch1, ch2)
}
Both channels emit three values, therefore the process is executed three times, each time with a different pair:
-
(1, a)
-
(2, b)
-
(3, c)
What is happening is that the process waits until there’s a complete input configuration, i.e., it receives an input value from all the channels declared as input.
When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.
This means channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even if there are other values in other channels.
So what happens when channels do not have the same cardinality (i.e., they emit a different number of elements)?
For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
input1 = Channel.of(1,2)
input2 = Channel.of('a','b','c','d')
process FOO {
debug true
input:
val x
val y
script:
"""
echo $x and $y
"""
}
workflow {
FOO(input1, input2)
}
In the above example, the process is only executed twice because the process stops when a channel has no more data to be processed.
However, what happens if you replace input1 with a value channel?
Compare the previous example with the following one:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
input1 = Channel.value(1)
input2 = Channel.of('a','b','c')
process BAR {
debug true
input:
val x
val y
script:
"""
echo $x and $y
"""
}
workflow {
BAR(input1, input2)
}
The output should look something like:
1 and b 1 and a 1 and c
This is because value channels can be consumed multiple times and do not affect process termination.
Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq
and
use the same data/ggal/transcriptome.fa
in each execution.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
Channel
.fromPath( params.reads )
.set { read_ch }
process COMMAND {
tag "Run_command"
input:
path reads
path transcriptome
output:
path result
script:
"""
echo your_command $reads $transcriptome > result
"""
}
workflow {
concat_ch = COMMAND(read_ch, params.transcriptome_file)
concat_ch.view()
}
6.2.4. Input repeaters
The each
qualifier allows you to repeat the execution of a process for each item in a collection every time new data is received. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'espresso', 'psicoffee']
process ALIGNSEQUENCES {
debug true
input:
path seq
each mode
script:
"""
echo t_coffee -in $seq -mode $mode
"""
}
workflow {
ALIGNSEQUENCES(sequences, methods)
}
In the above example, every time a file of sequences is received as an input by the process, it executes three tasks, each running a different alignment method set as a mode
variable. This is useful when you need to repeat the same task for a given set of parameters.
Exercise
Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq
and repeat the same task with both salmon
and kallisto
.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
methods= ['salmon', 'kallisto']
Channel
.fromPath( params.reads )
.set { read_ch }
process COMMAND {
tag "Run_command"
input:
path reads
path transcriptome
each mode
output:
path result
script:
"""
echo $mode $reads $transcriptome > result
"""
}
workflow {
concat_ch = COMMAND(read_ch , params.transcriptome_file, methods)
concat_ch
.view { "To run : ${it.text}" }
}
6.3. Outputs
The output declaration block defines the channels used by the process to send out the results produced.
Only one output block, which can contain one or more output declarations, can be defined. The output block follows the syntax shown below:
1
2
output:
<output qualifier> <output name> , emit: <output channel>
6.3.1. Output values
The val
qualifier specifies a defined value in the script context.
Values are frequently defined in the input and/or output declaration blocks, as shown in the following example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
methods = ['prot','dna', 'rna']
process FOO {
input:
val x
output:
val x
script:
"""
echo $x > file
"""
}
workflow {
receiver_ch = FOO(Channel.of(methods))
receiver_ch.view { "Received: $it" }
}
6.3.2. Output files
The path
qualifier specifies one or more files, produced by the process, that are sent over the specified channel as an output.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
process RANDOMNUM {
output:
path 'result.txt'
script:
"""
echo \$RANDOM > result.txt
"""
}
workflow {
receiver_ch = RANDOMNUM()
receiver_ch.view { "Received: " + it.text }
}
In the above example the process RANDOMNUM
creates a file named result.txt
containing a random number.
Since a file parameter using the same name is declared in the output block, the
file is sent over the receiver_ch
channel when the task is complete. A downstream process
declaring the same channel as input will
be able to receive it.
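As a minimal sketch of that idea (the PRINTNUM process below is hypothetical and simply consumes the output of the RANDOMNUM process defined above):
process PRINTNUM {
    debug true
    input:
    path result
    script:
    """
    cat $result
    """
}
workflow {
    receiver_ch = RANDOMNUM()
    // the result.txt file emitted by RANDOMNUM is staged as the input of PRINTNUM
    PRINTNUM(receiver_ch)
}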
6.3.3. Multiple output files
When an output file name contains a wildcard character (*
or ?
) it is interpreted as a
glob path matcher.
This allows us to capture multiple files into a list object and output them as a sole emission. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
process SPLITLETTERS {
output:
path 'chunk_*'
"""
printf 'Hola' | split -b 1 - chunk_
"""
}
workflow {
letters = SPLITLETTERS()
letters
.flatMap()
.view { "File: ${it.name} => ${it.text}" }
}
Prints the following:
File: chunk_aa => H File: chunk_ab => o File: chunk_ac => l File: chunk_ad => a
Some caveats on glob pattern behavior:
-
Input files are not included in the list of possible matches
-
Glob pattern matches both files and directory paths
-
When a two stars pattern
**
is used to recurse across directories, only file paths are matched, i.e., directories are not included in the result list.
Exercise
Remove the flatMap
operator and see how the output changes. The documentation
for the flatMap
operator is available at this link.
Click here for the answer:
File: [chunk_aa, chunk_ab, chunk_ac, chunk_ad] => [H, o, l, a]
6.3.4. Dynamic output file names
When an output file name needs to be expressed dynamically, it is possible to define it using a dynamic string that references values defined in the input declaration block or in the script global context. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
species = ['cat','dog', 'sloth']
sequences = ['AGATAG','ATGCTCT', 'ATCCCAA']
Channel.fromList(species)
.set { species_ch }
process ALIGN {
input:
val x
val seq
output:
path "${x}.aln"
script:
"""
echo align -in $seq > ${x}.aln
"""
}
workflow {
genomes = ALIGN( species_ch, sequences )
genomes.view()
}
In the above example, each time the process is executed an alignment file is produced whose name depends
on the actual value of the x
input.
6.3.5. Composite inputs and outputs
So far we have seen how to declare multiple input and output channels that can handle one value at a time. However, Nextflow can also handle a tuple of values.
The input and output declarations for tuples must be declared with a tuple
qualifier followed by the definition of each element in the tuple.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
process FOO {
input:
tuple val(sample_id), path(sample_files)
output:
tuple val(sample_id), path('sample.bam')
script:
"""
echo your_command_here --reads $sample_id > sample.bam
"""
}
workflow {
bam_ch = FOO(reads_ch)
bam_ch.view()
}
In previous versions of Nextflow tuple was called set but it was used the same way with the same semantic.
|
Exercise
Modify the script of the previous exercise so that the bam file is named as the given sample_id
.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
process FOO {
input:
tuple val(sample_id), path(sample_files)
output:
tuple val(sample_id), path("${sample_id}.bam")
script:
"""
echo your_command_here --reads $sample_id > ${sample_id}.bam
"""
}
workflow {
bam_ch = FOO(reads_ch)
bam_ch.view()
}
6.4. When
The when
declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates a boolean value.
It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)
process FIND {
debug true
input:
path fasta
val type
when:
fasta.name =~ /^BB11.*/ && type == 'nr'
script:
"""
echo blastp -query $fasta -db nr
"""
}
workflow {
result = FIND(proteins, params.dbtype)
}
6.5. Directives
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantic of the task itself.
They must be entered at the top of the process body, before any other declaration blocks (i.e., input
, output
, etc.).
Directives are commonly used to define the amount of computing resources to be used or other meta directives that allow the definition of extra configuration of logging information. For example:
1
2
3
4
5
6
7
8
9
10
process FOO {
cpus 2
memory 1.GB
container 'image/name'
script:
"""
echo your_command --this --that
"""
}
The complete list of directives is available at this link.
Name | Description |
---|---|
cpus | Allows you to define the number of (logical) CPUs required by the process’ task. |
time | Allows you to define how long a process is allowed to run (e.g., time '1h': 1 hour, '1s' 1 second, '1m' 1 minute, '1d' 1 day). |
memory | Allows you to define how much memory the process is allowed to use (e.g., '2 GB' is 2 GB). Can also use B, KB, MB, GB and TB. |
disk | Allows you to define how much local disk storage the process is allowed to use. |
tag | Allows you to associate each process execution with a custom label to make it easier to identify them in the log file or the trace execution report. |
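A minimal sketch combining several of the directives listed above (the values shown are illustrative only):
process FOO {
    tag "example task"
    cpus 2
    memory 1.GB
    time '1h'
    disk '2 GB'
    script:
    """
    echo your_command --this --that
    """
}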
6.6. Organize outputs
6.6.1. PublishDir directive
Given that each process is executed in a separate temporary work/
folder (e.g., work/f1/850698…; work/g3/239712…; etc.), we may want to save important, non-intermediary, and/or final files in a results folder.
Remember to delete the work folder from time to time to clear your intermediate files and stop them from filling your computer! |
To store our workflow result files, we need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.outdir = 'my-results'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)
process BLASTSEQ {
publishDir "$params.outdir/bam_files", mode: 'copy'
input:
path fasta
output:
path ('*.txt')
script:
"""
echo blastp $fasta > ${fasta}_result.txt
"""
}
workflow {
blast_ch = BLASTSEQ(proteins)
blast_ch.view()
}
The above example will copy all blast output files created by the BLASTSEQ
task into the directory path my-results/bam_files.
The publish directory can be local or remote. For example, output files
could be stored using an AWS S3 bucket by using the s3:// prefix in the target path.
|
6.6.2. Manage semantic sub-directories
You can use more than one publishDir
to keep different outputs in separate directories. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'
samples_ch = Channel.fromFilePairs(params.reads, flat: true)
process FOO {
publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'
input:
tuple val(sampleId), path('sample1.fq.gz'), path('sample2.fq.gz')
output:
path "*"
script:
"""
< sample1.fq.gz zcat > sample1.fq
< sample2.fq.gz zcat > sample2.fq
awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt
head -n 50 sample1.fq > sample1_outlook.txt
head -n 50 sample2.fq > sample2_outlook.txt
"""
}
workflow {
out_channel = FOO(samples_ch)
}
The above example will create an output structure in the directory my-results
,
that contains a separate sub-directory for each given sample ID, each containing the folders counts
and outlooks
.
7. Operators
Operators are methods that allow you to connect channels, transform values emitted by a channel, or apply some user-provided rules.
There are seven main groups of operators, which are described in greater detail within the Nextflow Reference Documentation.
7.1. Basic example
1
2
3
nums = Channel.of(1,2,3,4) (1)
square = nums.map { it -> it * it } (2)
square.view() (3)

1 | Creates a queue channel emitting four values |
2 | Creates a new channel, transforming each number into its square |
3 | Prints the channel content |
Operators can also be chained to implement custom behaviors, so the previous snippet can also be written as:
1
2
3
4
Channel
.of(1,2,3,4)
.map { it -> it * it }
.view()
7.2. Basic operators
Here we explore some of the most commonly used operators.
7.2.1. view
The view
operator prints the items emitted by a channel to the console standard output, appending a
new line character to each item. For example:
1
2
3
Channel
.of('foo', 'bar', 'baz')
.view()
It prints:
foo bar baz
An optional closure parameter can be specified to customize how items are printed. For example:
1
2
3
Channel
.of('foo', 'bar', 'baz')
.view { "- $it" }
It prints:
- foo - bar - baz
7.2.2. map
The map
operator applies a function of your choosing to every item emitted by a channel and returns the items obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:
1
2
3
4
Channel
.of( 'hello', 'world' )
.map { it -> it.reverse() }
.view()
A map
can associate to each element a generic tuple containing any kind of data.
1
2
3
4
Channel
.of( 'hello', 'world' )
.map { word -> [word, word.size()] }
.view { word, len -> "$word contains $len letters" }
Exercise
Use fromPath
to create a channel emitting the fastq files matching the pattern data/ggal/*.fq
,
then use map
to return a pair containing the file name and the path itself, and finally, use view
to print the resulting channel.
Click here for the answer:
1
2
3
4
Channel
.fromPath('data/ggal/*.fq')
.map { file -> [ file.name, file ] }
.view { name, file -> "> $name : $file" }
7.2.3. mix
The mix
operator combines the items emitted by two (or more) channels into a single channel.
1
2
3
4
5
c1 = Channel.of( 1,2,3 )
c2 = Channel.of( 'a','b' )
c3 = Channel.of( 'z' )
c1 .mix(c2,c3).view()
1 2 a 3 b z
The items in the resulting channel have the same order as in the respective original channels.
However, there is no guarantee that the elements of the second channel are appended after the elements
of the first. Indeed, in the example above, the element a has been printed before 3 .
|
7.2.4. flatten
The flatten
operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.
1
2
3
4
5
6
7
foo = [1,2,3]
bar = [4,5,6]
Channel
.of(foo, bar)
.flatten()
.view()
The above snippet prints:
1 2 3 4 5 6
7.2.5. collect
The collect
operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.
1
2
3
4
Channel
.of( 1, 2, 3, 4 )
.collect()
.view()
It prints a single value:
[1,2,3,4]
The result of the collect operator is a value channel.
|
7.2.6. groupTuple
The groupTuple
operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.
Try the following example:
1
2
3
4
Channel
.of( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
.groupTuple()
.view()
It shows:
[1, [A, B, C]] [2, [C, A]] [3, [B, D]]
This operator is useful to process together all of the elements that share a common property or grouping key.
Exercise
Use fromPath
to create a channel emitting all of the files in the folder data/meta/
,
then use a map
to associate the baseName
prefix to each file. Finally, group all
files that have the same common prefix.
Click here for the answer:
1
2
3
4
Channel.fromPath('data/meta/*')
.map { file -> tuple(file.baseName, file) }
.groupTuple()
.view { baseName, file -> "> $baseName : $file" }
7.2.7. join
The join
operator creates a channel that joins together the items emitted by two channels with a matching key. The key is defined, by default, as the first element in each item emitted.
1
2
3
left = Channel.of(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right = Channel.of(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()
The resulting channel emits:
[Z, 3, 6] [Y, 2, 5] [X, 1, 4]
Notice 'P' is missing in the final result. |
7.2.8. branch
The branch
operator allows you to forward the items emitted by a source channel to one or more output channels.
The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to a true value, the item is bound to a named channel as the label identifier. For example:
1
2
3
4
5
6
7
8
9
10
Channel
.of(1,2,3,40,50)
.branch {
small: it < 10
large: it > 10
}
.set { result }
result.small.view { "$it is small" }
result.large.view { "$it is large" }
The branch operator returns a multi-channel object (i.e., a variable that holds more than one channel object).
|
In the above example, what would happen to a value of 10? To deal with this, you can also use >= .
|
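For example, a small sketch that handles the boundary value explicitly by using >= in the second condition:
Channel
    .of(1,2,3,10,40,50)
    .branch {
        small: it < 10
        large: it >= 10
    }
    .set { result }
result.small.view { "$it is small" }
result.large.view { "$it is large" }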
7.3. More resources
Check the operators documentation on Nextflow web site.
8. Groovy basic structures and idioms
Nextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.
Here are some important Groovy constructs that are commonly used in Nextflow.
8.1. Printing values
To print something is as easy as using one of the print
or println
methods.
println("Hello, World!")
The only difference between the two is that the println
method implicitly appends a new line character to the printed string.
Parentheses for function invocations are optional. Therefore, the following syntax is also valid: |
println "Hello, World!"
8.2. Comments
Comments use the same syntax as C-family programming languages:
1
2
3
4
5
6
// comment a single line
/*
a comment spanning
multiple lines
*/
8.3. Variables
To define a variable, simply assign a value to it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
x = 1
println x
x = new java.util.Date()
println x
x = -3.1499392
println x
x = false
println x
x = "Hi"
println x
Local variables are defined using the def
keyword:
def x = 'foo'
The def
keyword should always be used when defining variables local to a function or a closure.
8.4. Lists
A List object can be defined by placing the list items in square brackets:
list = [10,20,30,40]
You can access a given item in the list with square-bracket notation (indexes start at 0
) or using the get method:
1
2
println list[0]
println list.get(0)
In order to get the length of a list you can use the size method:
println list.size()
We use the assert
keyword to test if a condition is true (similar to an if
function).
Here, Groovy will print nothing if it is correct, else it will raise an AssertionError message.
assert list[0] == 10
This assertion should be correct, but try changing it to an incorrect one. |
Lists can also be indexed with negative indexes and reversed ranges.
1
2
3
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()
In the last assert line we are referencing the initial list and converting this with a "shorthand" range (..), to run from the -1th element (2) to the 0th element (0). |
List objects implement all methods provided by the java.util.List interface, plus the extension methods provided by Groovy.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]
8.5. Maps
Maps are like lists that have an arbitrary key instead of an integer index. Therefore, the syntax is very similar.
map = [a:0, b:1, c:2]
Maps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.
1
2
3
assert map['a'] == 0 (1)
assert map.b == 1 (2)
assert map.get('c') == 2 (3)
1 | Using square brackets. |
2 | Using dot notation. |
3 | Using the get method. |
To add data or to modify a map, the syntax is similar to adding values to a list:
1
2
3
4
map['a'] = 'x' (1)
map.b = 'y' (2)
map.put('c', 'z') (3)
assert map == [a:'x', b:'y', c:'z']
1 | Using square brackets. |
2 | Using dot notation. |
3 | Using the put method. |
Map objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy.
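A few of those methods in action, as a quick sketch (not an exhaustive list):
map = [a:0, b:1, c:2]
assert map.containsKey('a')
assert map.size() == 3
assert map.values().sum() == 3
assert map.collect { k, v -> v + 1 } == [1, 2, 3]
assert map.findAll { k, v -> v > 0 } == [b:1, c:2]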
8.6. String interpolation
String literals can be defined by enclosing them with either single- ('') or double- ("") quotation marks.
Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:
1
2
3
4
5
6
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"
x = 'Hello'
println '$x + $y'
This code prints:
1
2
The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.
|
Finally, string literals can also be defined using the /
character as a delimiter. They are known as
slashy strings and are useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted strings, they allow you to interpolate variables prefixed with a $
character.
Try the following to see the difference:
1
2
3
4
5
x = /tic\tac\toe/
y = 'tic\tac\toe'
println x
println y
it prints:
tic\tac\toe tic ac oe
8.7. Multi-line strings
A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:
1
2
3
4
5
text = """
Hello there James.
How are you today?
"""
println text
Finally, multi-line strings can also be defined with slashy strings. For example:
1
2
3
4
5
6
text = /
This is a multi-line
slashy string!
It's cool, isn't it?!
/
println text
Like before, multi-line strings inside double quotes and slash characters support variable interpolation, while single-quoted multi-line strings do not. |
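A quick sketch of that difference:
name = 'James'
interpolated = """
Hello ${name}
"""
literal = '''
Hello ${name}
'''
println interpolated   // contains: Hello James
println literal        // contains the literal text: Hello ${name}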
8.8. If statement
The if
statement uses the same syntax common in other programming languages, such as Java, C, JavaScript, etc.
1
2
3
4
5
6
if( < boolean expression > ) {
// true branch
}
else {
// false branch
}
The else
branch is optional. Also, the curly brackets are optional when the branch defines just a single
statement.
1
2
3
x = 1
if( x > 10 )
println 'Hello'
null , empty strings, and empty collections are evaluated to false .
|
Therefore a statement like:
1
2
3
4
5
6
7
list = [1,2,3]
if( list != null && list.size() > 0 ) {
println list
}
else {
println 'The list is empty'
}
Can be written as:
1
2
3
4
5
list = [1,2,3]
if( list )
println list
else
println 'The list is empty'
See the Groovy-Truth for further details.
In some cases it can be useful to replace the if statement with a ternary expression (aka
conditional expression). For example:
|
println list ? list : 'The list is empty'
The previous statement can be further simplified using the Elvis operator, as shown below:
println list ?: 'The list is empty'
8.9. For statement
The classical for
loop syntax is supported as shown here:
1
2
3
for (int i = 0; i <3; i++) {
println("Hello World $i")
}
Iteration over list objects is also possible using the syntax below:
1
2
3
4
5
list = ['a','b','c']
for( String elem : list ) {
println elem
}
8.10. Functions
It is possible to define a custom function into a script, as shown here:
1
2
3
4
5
int fib(int n) {
return n < 2 ? 1 : fib(n-1) + fib(n-2)
}
assert fib(10)==89
A function can take multiple arguments separating them with a comma. The return
keyword can be omitted and the function implicitly returns the value of the last evaluated expression. Also, explicit types can be omitted, though not recommended:
1
2
3
4
5
def fact( n ) {
n > 1 ? n * fact(n-1) : 1
}
assert fact(5) == 120
8.11. Closures
Closures are the Swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that can be passed as an argument to a function. A closure can also be used to define an anonymous function.
More formally, a closure allows the definition of functions as first-class objects.
square = { it * it }
The curly brackets around the expression it * it
tell the script interpreter to treat this expression as code. The it
identifier is an implicit variable that represents the value that is passed to the function when it is invoked.
Once compiled, the function object is assigned to the variable square
as any other variable assignment shown previously. To invoke the closure execution use the special method call
or just use the round parentheses to specify the closure parameter(s). For example:
1
2
assert square.call(5) == 25
assert square(9) == 81
As is, this may not seem interesting, but we can now pass the square
function as an argument to other functions or methods.
Some built-in functions take a function like this as an argument. One example is the collect
method on lists:
1
2
x = [ 1, 2, 3, 4 ].collect(square)
println x
It prints:
[ 1, 4, 9, 16 ]
By default, closures take a single parameter called it
; to give it a different name, use the
->
syntax. For example:
square = { num -> num * num }
It’s also possible to define closures with multiple, custom-named parameters.
For example, when the method each()
is applied to a map it can take a closure with two arguments,
to which it passes the key-value pair for each entry in the map
object. For example:
1
2
3
printMap = { a, b -> println "$a with value $b" }
values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
values.each(printMap)
It prints:
Yue with value Wu Mark with value Williams Sudha with value Kumari
A closure has two other important features.
First, it can access and modify variables in the scope where it is defined.
Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.
As an example showing both these features, see the following code fragment:
1
2
3
4
result = 0 (1)
values = ["China": 1 , "India" : 2, "USA" : 3] (2)
values.keySet().each { result += values[it] } (3)
println result
1 | Defines a global variable. |
2 | Defines a map object. |
3 | Invokes the each method passing the closure object which modifies the result variable. |
Learn more about closures in the Groovy documentation.
8.12. More resources
The complete Groovy language documentation is available at this link.
A great resource to master Apache Groovy syntax is the book: Groovy in Action.
9. Modularization
The definition of module libraries simplifies the writing of complex data analysis pipelines and makes re-use of processes much easier.
Using the hello.nf
example from earlier, we will convert the pipeline’s processes into modules, then call them within the workflow scope in a variety of ways.
9.1. Modules
Nextflow DSL2 allows for the definition of stand-alone module scripts that can be included and shared across multiple workflows. Each module can contain its own process
or workflow
definition.
9.1.1. Importing modules
Components defined in the module script can be imported into other Nextflow scripts using the include
statement. This allows you to store these components in a separate file(s) so that they can be re-used in multiple workflows.
Using the hello.nf
example, we can achieve this by:
-
Creating a file called
modules.nf
in the top-level directory. -
Cutting and pasting the two process definitions for
SPLITLETTERS
andCONVERTTOUPPER
intomodules.nf
. -
Removing the
process
definitions in thehello.nf
script. -
Importing the processes from
modules.nf
within thehello.nf
script anywhere above theworkflow
definition:
1
2
include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'
In general, you would use relative paths to define the location of the module scripts using the ./ prefix.
|
Exercise
Create a modules.nf
file with the previously defined processes from hello.nf
. Then remove these processes from hello.nf
and add the include
definitions shown above.
Click here for the answer:
The hello.nf
script should look similar like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env nextflow
params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)
include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'
workflow {
letters_ch = SPLITLETTERS(greeting_ch)
results_ch = CONVERTTOUPPER(letters_ch.flatten())
results_ch.view{ it }
}
You should have the following in the file ./modules.nf
:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
process SPLITLETTERS {
input:
val x
output:
path 'chunk_*'
"""
printf '$x' | split -b 6 - chunk_
"""
}
process CONVERTTOUPPER {
input:
path y
output:
stdout
"""
cat $y | tr '[a-z]' '[A-Z]'
"""
}
We now have modularized processes, making the code reusable.
9.1.2. Multiple imports
If a Nextflow module script contains multiple process
definitions they can also be imported using a single include
statement as shown in the example below:
include { SPLITLETTERS; CONVERTTOUPPER } from './modules.nf'
9.1.3. Module aliases
When including a module component it is possible to specify a name alias using the as
declaration.
This allows the inclusion and the invocation of the same component multiple times using different names:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#!/usr/bin/env nextflow
params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)
include { SPLITLETTERS as SPLITLETTERS_one } from './modules.nf'
include { SPLITLETTERS as SPLITLETTERS_two } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_one } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_two } from './modules.nf'
workflow {
letters_ch1 = SPLITLETTERS_one(greeting_ch)
results_ch1 = CONVERTTOUPPER_one(letters_ch1.flatten())
results_ch1.view{ it }
letters_ch2 = SPLITLETTERS_two(greeting_ch)
results_ch2 = CONVERTTOUPPER_two(letters_ch2.flatten())
results_ch2.view{ it }
}
Exercise
Save the previous snippet as hello.2.nf
, and predict the output to screen.
Click here for the answer:
The hello.2.nf
output should look something like this:
N E X T F L O W ~ version 22.04.3 Launching `hello.2.nf` [goofy_goldstine] DSL2 - revision: 449cf82eaf executor > local (6) [e1/5e6523] process > SPLITLETTERS_one (1) [100%] 1 of 1 ✔ [14/b77deb] process > CONVERTTOUPPER_one (1) [100%] 2 of 2 ✔ [c0/115bd6] process > SPLITLETTERS_two (1) [100%] 1 of 1 ✔ [09/f9072d] process > CONVERTTOUPPER_two (2) [100%] 2 of 2 ✔ WORLD! HELLO WORLD! HELLO
You can store each process in separate files within separate sub-folders or combined in one big file (both are valid). For your own projects, it can be helpful to see examples on public repos. e.g. Seqera RNA-Seq tutorial here or within nf-core pipelines, such as nf-core/rnaseq. |
9.2. Output definition
Nextflow allows the use of alternative output definitions within workflows to simplify your code.
In the previous basic example (hello.nf
), we defined the channel names to specify the input to the next process.
e.g.
1
2
3
4
5
6
workflow {
greeting_ch = Channel.of(params.greeting)
letters_ch = SPLITLETTERS(greeting_ch)
results_ch = CONVERTTOUPPER(letters_ch.flatten())
results_ch.view{ it }
}
We have moved the greeting_ch into the workflow scope for this exercise.
|
We can also explicitly define the output of one channel to another using the .out
attribute.
1
2
3
4
5
6
workflow {
greeting_ch = Channel.of(params.greeting)
letters_ch = SPLITLETTERS(greeting_ch)
results_ch = CONVERTTOUPPER(letters_ch.out.flatten())
results_ch.view{ it }
}
We can also remove the channel definitions completely from each line if we prefer (as at each call, a channel is implied):
1
2
3
4
5
6
workflow {
greeting_ch = Channel.of(params.greeting)
SPLITLETTERS(greeting_ch)
CONVERTTOUPPER(SPLITLETTERS.out.flatten())
CONVERTTOUPPER.out.view()
}
If a process defines two or more output channels, each channel can be accessed by indexing the .out
attribute, e.g., .out[0]
, .out[1]
, etc. In our example we only have the [0]th
output:
1
2
3
4
5
6
workflow {
greeting_ch = Channel.of(params.greeting)
SPLITLETTERS(greeting_ch)
CONVERTTOUPPER(SPLITLETTERS.out.flatten())
CONVERTTOUPPER.out[0].view()
}
Alternatively, the process output
definition allows the use of the emit
statement to define a named identifier that can be used to reference the channel in the external scope.
For example, try adding the emit
statement on the CONVERTTOUPPER
process in your modules.nf
file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
process SPLITLETTERS {
input:
val x
output:
path 'chunk_*'
"""
printf '$x' | split -b 6 - chunk_
"""
}
process CONVERTTOUPPER {
input:
path y
output:
stdout emit: upper
"""
cat $y | tr '[a-z]' '[A-Z]'
"""
}
Then change the workflow scope in hello.nf to call this specific named output (notice the added .upper):
workflow {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}
9.2.1. Using piped outputs
Another way to deal with outputs in the workflow scope is to use pipes (|).
Exercise
Try changing the workflow script to the snippet below:
workflow {
    Channel.of(params.greeting) | SPLITLETTERS | flatten() | CONVERTTOUPPER | view
}
Here we use the pipe operator, which passes the output of each step as a channel to the next process.
9.3. Workflow definition
The workflow scope allows the definition of components that define the invocation of one or more processes or operators:
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'

workflow my_pipeline {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}

workflow {
    my_pipeline()
}
For example, the snippet above defines a workflow named my_pipeline, that can be invoked via another workflow definition.
Make sure that your modules.nf file is the one containing the emit on the CONVERTTOUPPER process. |
A workflow component can access any variable or parameter defined in the outer scope. In the running example, we can also access params.greeting directly within the workflow definition. |
9.3.1. Workflow inputs
A workflow component can declare one or more input channels using the take statement. For example:
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'

workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}
When the take statement is used, the body of the workflow needs to be declared within the main block. |
The input for the workflow can then be specified as an argument:
workflow {
    my_pipeline(Channel.of(params.greeting))
}
9.3.2. Workflow outputs
A workflow can declare one or more output channels using the emit statement. For example:
workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())

    emit:
    CONVERTTOUPPER.out.upper
}

workflow {
    my_pipeline(Channel.of(params.greeting))
    my_pipeline.out.view()
}
As a result, we can use the my_pipeline.out notation to access the outputs of my_pipeline in the invoking workflow.
We can also declare named outputs within the emit block:
workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())

    emit:
    my_data = CONVERTTOUPPER.out.upper
}

workflow {
    my_pipeline(Channel.of(params.greeting))
    my_pipeline.out.my_data.view()
}
The result of the above snippet can then be accessed using my_pipeline.out.my_data.
9.3.3. Calling named workflows
Within a main.nf script (called hello.nf in our example) we can also have multiple workflows. In this case we may want to call a specific workflow when running the code. For this we use the entrypoint option -entry <workflow_name>.
The following snippet has two named workflows (my_pipeline_one and my_pipeline_two):
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS as SPLITLETTERS_one } from './modules.nf'
include { SPLITLETTERS as SPLITLETTERS_two } from './modules.nf'

include { CONVERTTOUPPER as CONVERTTOUPPER_one } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_two } from './modules.nf'

workflow my_pipeline_one {
    letters_ch1 = SPLITLETTERS_one(params.greeting)
    results_ch1 = CONVERTTOUPPER_one(letters_ch1.flatten())
    results_ch1.view{ it }
}

workflow my_pipeline_two {
    letters_ch2 = SPLITLETTERS_two(params.greeting)
    results_ch2 = CONVERTTOUPPER_two(letters_ch2.flatten())
    results_ch2.view{ it }
}

workflow {
    my_pipeline_one()
    my_pipeline_two()
}
You can choose which pipeline runs by using the entry flag:
nextflow run hello.2.nf -entry my_pipeline_one
9.3.4. Parameter scopes
A module script can define one or more parameters or custom functions using the same syntax as any other Nextflow script. Consider the minimal examples below.
Module script (./modules.nf)
params.foo = 'Hello'
params.bar = 'world!'

def SAYHELLO() {
    println "$params.foo $params.bar"
}
Main script (./hello.nf)
#!/usr/bin/env nextflow

params.foo = 'Hola'
params.bar = 'mundo!'

include { SAYHELLO } from './modules.nf'

workflow {
    SAYHELLO()
}
Running hello.nf should print:
Hola mundo!
As highlighted above, the script will print Hola mundo! instead of Hello world! because parameters are inherited from the including context.
To avoid being ignored, pipeline parameters should be defined at the beginning of the script, before any include declarations. |
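As a minimal sketch of this ordering rule, reusing the SAYHELLO example above (the comments mark the behaviour being illustrated):
#!/usr/bin/env nextflow

// defined BEFORE the include: these values are visible to the module
params.foo = 'Hola'
params.bar = 'mundo!'

include { SAYHELLO } from './modules.nf'

// a parameter (re)defined here, AFTER the include statement,
// would not be picked up by the included SAYHELLO function

workflow {
    SAYHELLO()
}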
The addParams option can be used to extend the module parameters without affecting the external scope. For example:
#!/usr/bin/env nextflow

params.foo = 'Hola'
params.bar = 'mundo!'

include { SAYHELLO } from './modules.nf' addParams(foo: 'Olá')

workflow {
    SAYHELLO()
}
Executing the main script above should print:
Olá mundo!
9.4. DSL2 migration notes
To view a summary of the changes introduced when Nextflow migrated from DSL1 to DSL2 please refer to the DSL2 migration notes in the official Nextflow documentation.
10. Nextflow configuration
A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by the underlying execution platform.
This enables portable deployment without the need to modify the application code.
10.1. Configuration file
When a pipeline script is launched, Nextflow looks for a file named nextflow.config in the current directory and in the script base directory (if it is not the same as the current directory). Finally, it checks for the file: $HOME/.nextflow/config.
When more than one of the above files exists, they are merged, so that the settings in the first override the same settings that may appear in the second, and so on.
The default config file search mechanism can be extended by providing an extra configuration file using the command line option: -c <config file>.
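For example, assuming an extra configuration file named custom.config (the file name is purely illustrative):
nextflow run hello.nf -c custom.config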
10.1.1. Config syntax
A Nextflow configuration file is a simple text file containing a set of properties defined using the syntax:
name = value
Please note that string values need to be wrapped in quotation characters while numbers and boolean values (true, false) do not. Also, note that values are typed, meaning for example that 1 is different from '1', since the first is interpreted as the number one, while the latter is interpreted as a string value. |
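As a small sketch of the typing rules described in the note above (the property names are arbitrary):
my_string = 'hello'    // a string, wrapped in quotes
my_number = 1          // the number one
my_flag = true         // a boolean, no quotes
not_a_number = '1'     // a string containing the character 1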
10.1.2. Config variables
Configuration properties can be used as variables in the configuration file itself, using the usual $propertyName or ${expression} syntax.
propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"
In the configuration file it’s possible to access any variable defined in the host environment such as $PATH, $HOME, $PWD, etc. |
10.1.3. Config comments
Configuration files use the same conventions for comments used in the Nextflow script:
// comment a single line

/*
    a comment spanning
    multiple lines
*/
10.1.4. Config scopes
Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier or grouping the properties in the same scope using the curly brackets notation. This is shown in the following example:
alpha.x = 1
alpha.y = 'string value..'

beta {
    p = 2
    q = 'another string ..'
}
10.1.5. Config params
The scope params allows the definition of workflow parameters that override the values defined in the main workflow script. This is useful to consolidate one or more execution parameters in a separate file.
// config file
params.foo = 'Bonjour'
params.bar = 'le monde!'
// workflow script
params.foo = 'Hello'
params.bar = 'world!'

// print both params
println "$params.foo $params.bar"
Exercise
Save the first snippet as nextflow.config and the second one as params.nf. Then run:
nextflow run params.nf
For the result, click here:
Bonjour le monde!
Execute it again, specifying the foo parameter on the command line:
nextflow run params.nf --foo Hola
For the result, click here:
Hola le monde!
Compare the results of the two executions: parameters specified on the command line take precedence over those defined in configuration files.
10.1.6. Config env
The env scope allows the definition of one or more variables that will be exported into the environment where the workflow tasks will be executed.
env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"
Exercise
Save the above snippet as a file named my-env.config. Then save the snippet below in a file named foo.nf:
process foo {
    echo true
    '''
    env | egrep 'ALPHA|BETA'
    '''
}
Finally, execute the following command:
nextflow run foo.nf -c my-env.config
Your result should look something like the following, click here:
BETA=/home/some/path
ALPHA=some value
10.1.7. Config process
process directives allow the specification of settings for the task execution such as cpus, memory, container, and other resources in the pipeline script. This is useful when prototyping a small workflow script.
However, it’s always a good practice to decouple the workflow execution logic from the process configuration settings, i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow script.
The process configuration scope allows the setting of any process directives in the Nextflow configuration file. For example:
process {
    cpus = 10
    memory = 8.GB
    container = 'biocontainers/bamtools:v2.4.0_cv3'
}
The above config snippet defines the cpus, memory and container directives for all processes in your workflow script.
The process selector can be used to apply the configuration to a specific process or group of processes (discussed later).
Memory and time duration units can be specified either using a string-based notation in which the digit(s) and the unit can be separated by a blank or by using the numeric notation in which the digit(s) and the unit are separated by a dot character and are not enclosed by quote characters. |
String syntax | Numeric syntax | Value
---|---|---
'10 KB' | 10.KB | 10240 bytes
'500 MB' | 500.MB | 524288000 bytes
'1 min' | 1.min | 60 seconds
'1 hour 25 sec' | - | 1 hour and 25 seconds
The syntax for setting process directives in the configuration file requires = (i.e. the assignment operator), whereas it should not be used when setting the process directives within the workflow script.
For an example, click here:
process foo {
    cpus 4
    memory 2.GB
    time 1.hour
    maxRetries 3

    script:
    """
    your_command --cpus $task.cpus --mem $task.memory
    """
}
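For comparison, a sketch of the same directives expressed in the configuration file, where the = assignment operator is required (the process name foo mirrors the example above):
process {
    withName: foo {
        cpus = 4
        memory = 2.GB
        time = 1.hour
        maxRetries = 3
    }
}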
This is especially important when you want to define a config setting using a dynamic expression using a closure. For example:
process {
    withName: foo {
        memory = { 4.GB * task.cpus }
    }
}
Directives that require more than one value, e.g. pod, need to be expressed as a map object in the configuration file.
process {
    pod = [env: 'FOO', value: '123']
}
Finally, directives that are repeated in the process definition need to be defined as a list object in the configuration file. For example:
process {
    pod = [ [env: 'FOO', value: '123'],
            [env: 'BAR', value: '456'] ]
}
10.1.8. Config Docker execution
The container image to be used for the process execution can be specified in the nextflow.config file:
process.container = 'nextflow/rnaseq-nf'
docker.enabled = true
The use of unique "SHA256" docker image IDs guarantees that the image content does not change over time, for example:
process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true
10.1.9. Config Singularity execution
To run a workflow execution with Singularity, a container image file path is required in the Nextflow config file using the container directive:
process.container = '/some/singularity/image.sif'
singularity.enabled = true
The container image file must be an absolute path, i.e. it must start with a /. |
The following protocols are supported:
-
library:// download the container image from the Singularity Library service.
-
shub:// download the container image from the Singularity Hub.
-
docker:// download the container image from the Docker Hub and convert it to the Singularity format.
-
docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image file.
Singularity Hub (shub://) is no longer available as a builder service, though existing images from before 19th April 2021 will still work. |
By specifying a plain Docker container image name, Nextflow implicitly downloads and converts it to a Singularity image when the Singularity execution is enabled. For example: |
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true
The above configuration instructs Nextflow to use the Singularity engine to run your script processes. The container is pulled from the Docker registry and cached in the current directory to be used for further runs.
Alternatively, if you have a Singularity image file, its absolute path location can be specified as the container name, either using the -with-singularity option or the process.container setting in the config file.
Exercise
Try to run the script as shown below, changing the nextflow.config file to the one above that uses singularity:
nextflow run script7.nf
Nextflow will pull the container image automatically, it will require a few seconds depending on the network connection speed. |
10.1.10. Config Conda execution
The use of a Conda environment can also be configured in the configuration file by adding the following setting in the nextflow.config file:
process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"
You can specify either the path of an existing Conda environment directory or the path of a Conda environment YAML file.
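For example, a sketch of the YAML-based alternative (the file name is illustrative):
// nextflow.config
process.conda = './env.yml'
Here env.yml would be a standard Conda environment file listing the required channels and packages (e.g. salmon, fastqc).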
11. Deployment scenarios
Real-world genomic applications can spawn the execution of thousands of jobs. In this scenario a batch scheduler is commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across many compute nodes.
Nextflow has built-in support for the most commonly used batch schedulers, such as Univa Grid Engine, SLURM and IBM LSF. Check the Nextflow documentation for the complete list of supported execution platforms.
11.1. Cluster deployment
A key Nextflow feature is the ability to decouple the workflow implementation from the actual execution platform. The implementation of an abstraction layer allows the deployment of the resulting workflow on any execution platform supported by the framework.
To run your pipeline with a batch scheduler, modify the nextflow.config file specifying the target executor and, if needed, the required computing resources. For example:
process.executor = 'slurm'
11.2. Managing cluster resources
When using a batch scheduler, it is often necessary to specify the amount of resources (i.e. cpus, memory, execution time, etc.) required by each task.
This can be done using the following process directives:
queue | the cluster queue to be used for the computation
cpus | the number of cpus to be allocated for a task execution
memory | the amount of memory to be allocated for a task execution
time | the max amount of time to be allocated for a task execution
disk | the amount of disk storage required for a task execution
11.2.1. Workflow wide resources
Use the scope process to define the resource requirements for all processes in your workflow applications. For example:
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4
}
11.2.2. Configure process by name
In real-world applications, different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the withName selector followed by the process name:
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4

    withName: foo {
        cpus = 2
        memory = '20 GB'
        queue = 'short'
    }

    withName: bar {
        cpus = 4
        memory = '32 GB'
        queue = 'long'
    }
}
Exercise
Run the RNA-Seq script (script7.nf) from earlier, but specify within the nextflow.config file that the quantification process requires 2 CPUs and 5 GB of memory.
Click here for the answer:
process {
    withName: quantification {
        cpus = 2
        memory = '5 GB'
    }
}
11.2.3. Configure process by labels
When a workflow application is composed of many processes, listing all of the process names and choosing resources for each of them in the configuration file can be difficult.
A better strategy consists of annotating the processes with a label directive. Then specify the resources in the configuration file used for all processes having the same label.
The workflow script:
process task1 {
    label 'long'

    """
    first_command --here
    """
}

process task2 {
    label 'short'

    """
    second_command --here
    """
}
The configuration file:
process {
    executor = 'slurm'

    withLabel: 'short' {
        cpus = 4
        memory = '20 GB'
        queue = 'alpha'
    }

    withLabel: 'long' {
        cpus = 8
        memory = '32 GB'
        queue = 'omega'
    }
}
11.2.4. Configure multiple containers
Containers can be set for each process in your workflow. You can define their containers in a config file as shown below:
process {
    withName: foo {
        container = 'some/image:x'
    }

    withName: bar {
        container = 'other/image:y'
    }
}

docker.enabled = true
Should I use a single fat container or many slim containers? Both approaches have pros & cons. A single container is simpler to build and maintain, however when using many tools the image can become very big and tools can create conflicts with each other. Using a container for each process can result in many different images to build and maintain, especially when processes in your workflow use different tools for each task. |
Read more about config process selectors at this link.
11.3. Configuration profiles
Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile command-line option.
Configuration profiles are defined by using the special scope profiles, which groups the attributes that belong to the same profile using a common prefix. For example:
profiles {
    standard {
        params.genome = '/local/path/ref.fasta'
        process.executor = 'local'
    }

    cluster {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'sge'
        process.queue = 'long'
        process.memory = '10GB'
        process.conda = '/some/path/env.yml'
    }

    cloud {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'awsbatch'
        process.container = 'cbcrg/imagex'
        docker.enabled = true
    }
}
This configuration defines three different profiles: standard, cluster and cloud, which set different process configuration strategies depending on the target runtime platform. By convention, the standard profile is implicitly used when no other profile is specified by the user.
To enable a specific profile, use the -profile option followed by the profile name:
nextflow run <your script> -profile cluster
Two or more configuration profiles can be specified by separating the profile names with a comma character: |
nextflow run <your script> -profile standard,cloud
11.4. Cloud deployment
AWS Batch is a managed computing service that allows the execution of containerized workloads in the Amazon cloud infrastructure.
Nextflow provides built-in support for AWS Batch which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions as Batch jobs.
Once the Batch environment is configured, specifying the instance types to be used and the maximum number of CPUs to be allocated, you need to create a Nextflow configuration file like the one shown below:
process.executor = 'awsbatch'                           (1)
process.queue = 'nextflow-ci'                           (2)
process.container = 'nextflow/rnaseq-nf:latest'         (3)
workDir = 's3://nextflow-ci/work/'                      (4)
aws.region = 'eu-west-1'                                (5)
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'  (6)
1 | Set AWS Batch as the executor to run the processes in the workflow |
2 | The name of the computing queue defined in the Batch environment |
3 | The Docker container image to be used to run each job |
4 | The workflow work directory must be an AWS S3 bucket |
5 | The AWS region to be used |
6 | The path of the AWS cli tool required to download/upload files to/from the container |
The best practice is to keep this setting as a separate profile in your workflow config file. This allows the execution with a simple command. |
nextflow run script7.nf
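For example, a minimal sketch of wrapping the settings above in a profile (the profile name batch is an arbitrary choice):
profiles {
    batch {
        process.executor = 'awsbatch'
        process.queue = 'nextflow-ci'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir = 's3://nextflow-ci/work/'
        aws.region = 'eu-west-1'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
The pipeline could then be launched with nextflow run script7.nf -profile batch.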
The complete details about AWS Batch deployment are available at this link.
11.5. Volume mounts
Elastic Block Storage (EBS) volumes (or other supported storage) can be mounted in the job container using the following configuration snippet:
aws {
    batch {
        volumes = '/some/path'
    }
}
Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used to define complex volumes for which the container path is different from the host path or to specify a read-only option:
aws {
    region = 'eu-west-1'
    batch {
        volumes = ['/tmp', '/host/path:/mnt/path:ro']
    }
}
This is a global configuration that has to be specified in a Nextflow config file and will be applied to all process executions. |
Nextflow expects paths to be available. It does not handle the provision of EBS volumes or another kind of storage. |
11.6. Custom job definition
Nextflow automatically creates the Batch Job definitions needed to execute your pipeline processes. Therefore it’s not required to define them before you run your workflow.
However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration settings of a specific job (e.g. to define custom mount paths or other special settings of a Batch Job).
To use your own job definition in a Nextflow workflow, use it in place of the container image name, prefixing it with the job-definition:// string. For example:
process {
    container = 'job-definition://your-job-definition-name'
}
11.7. Custom image
Since Nextflow requires the AWS CLI tool to be accessible in the computing environment, a common solution consists of creating a custom Amazon Machine Image (AMI) with the AWS CLI installed in a self-contained manner (e.g. using the Conda package manager).
When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized Amazon Linux AMI as the base image. |
The following snippet shows how to install AWS CLI with Miniconda:
sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
The aws tool will be placed in a directory named bin in the main installation folder. The tools will not work properly if you modify this directory structure after the installation. |
Finally, specify the aws full path in the Nextflow config file as shown below:
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
11.8. Launch template
An alternative approach is to use a Launch template that installs the AWS CLI tool during the instance boot via custom user data, instead of creating a custom AMI.
In the EC2 dashboard, create a Launch template specifying the user data field:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3

## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda

--//--
Then create a new compute environment in the Batch dashboard and specify the newly created launch template in the corresponding field.
11.9. Hybrid deployments
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed on the local computer or local computing cluster, and some jobs are offloaded to the AWS Batch service.
To enable this feature use one or more process selectors in your Nextflow configuration file.
For example, apply the AWS Batch configuration only to a subset of processes in your workflow. You can try the following:
process {
    executor = 'slurm'              (1)
    queue = 'short'                 (2)

    withLabel: bigTask {            (3)
        executor = 'awsbatch'       (4)
        queue = 'my-batch-queue'    (5)
        container = 'my/image:tag'  (6)
    }
}

aws {
    region = 'eu-west-1'            (7)
}
1 | Set slurm as the default executor |
2 | Set the queue for the SLURM cluster |
3 | Setting of a process named bigTask |
4 | Set awsbatch as the executor for the bigTask process |
5 | Set the queue for the bigTask process |
6 | Set the container image to deploy for the bigTask process |
7 | Define the region for Batch execution |
12. Get started with Nextflow Tower
12.1. Basic concepts
Nextflow Tower is the centralized command post for data management and pipelines. It brings monitoring, logging and observability to distributed workflows and simplifies the deployment of pipelines on any cloud, cluster or laptop.
Nextflow Tower's core features include:
-
The launching of pre-configured pipelines with ease.
-
Programmatic integration to meet the needs of an organization.
-
Publishing pipelines to shared workspaces.
-
Management of the infrastructure required to run data analysis at scale.
12.2. Usage
You can use Tower either via the -with-tower option of the nextflow run command, through the online GUI, or through the API.
12.2.1. Via the Nextflow run command
Create an account and log in to Tower.
1. Create a new token
You can access your tokens from the Settings drop-down menu :

2. Name your token

3. Save your token safely
Copy and keep your new token in a safe place. |

4. Export your token
Once your token has been created, open a terminal and type:
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
export NXF_VER=20.10.0
Where eyxxxxxxxxxxxxxxxQ1ZTE= is the token you have just created.
Check your Nextflow version with nextflow -version. Bearer tokens require Nextflow version 20.10.0 or later; the version can be set with the second command shown above. You can change the version if necessary. |
To submit a pipeline to a Workspace using the Nextflow command-line tool, add the workspace ID to your environment. For example:
export TOWER_WORKSPACE_ID=000000000000000
The workspace ID can be found on the organization’s Workspaces overview page.
5. Run Nextflow with Tower
Run your Nextflow workflows as usual with the addition of the -with-tower option:
nextflow run hello.nf -with-tower
You will see and be able to monitor your Nextflow jobs in Tower.
To configure and execute Nextflow jobs in Cloud environments, visit the Compute environments section.
Exercise
Run the RNA-Seq script7.nf using the -with-tower flag, after correctly completing the token settings outlined above.
If you get stuck, click here:
Go to tower.nf, log in, then click the Runs tab and select the run that you just submitted. If you can’t find it, double-check that your token was entered correctly.
12.2.2. Via online GUI
To run using the GUI, there are three main steps:
1. Create an account and log in to Tower, available free of charge, at tower.nf.
2. Create and configure a new compute environment.
3. Start launching pipelines.
Configuring your compute environment
Tower uses the concept of Compute Environments to define the execution platform where a pipeline will run.
It supports the launching of pipelines into a growing number of cloud and on-premise infrastructures.

Each compute environment must be pre-configured to enable Tower to submit tasks. You can read more on how to set up each environment using the links below.
The following guides describe how to configure each of these compute environments.
Selecting a default compute environment
If you have more than one Compute Environment, you can select which one will be used by default when launching a pipeline.
-
Navigate to your compute environments.
-
Choose your default environment by selecting the Make primary button.
Congratulations!
You are now ready to launch pipelines with your primary compute environment.
Launchpad
Launchpad makes it easy for any workspace user to launch a pre-configured pipeline.

A pipeline is a repository containing a Nextflow workflow, a compute environment and pipeline parameters.
Pipeline Parameters Form
Launchpad automatically detects the presence of a nextflow_schema.json file in the root of the repository and dynamically creates a form where users can easily update the parameters.
The parameter forms view will appear if the workflow has a Nextflow schema file for the parameters. Please refer to the Nextflow Schema guide to learn more about the schema file use-cases and how to create them. |
This makes it trivial for users without any expertise in Nextflow to enter their pipeline parameters and launch.

Adding a new pipeline
Adding a pipeline to the pre-saved workspace launchpad is detailed in full on the tower webpage docs.
In brief, these are the steps you need to follow to set up a pipeline.
-
Select the Launchpad button in the navigation bar. This will open the Launch Form.
-
Select a compute environment.
-
Enter the repository of the pipeline you want to launch, e.g. github.com/nf-core/rnaseq.git
-
Select a pipeline Revision number. The Git default branch (main/master) or manifest.defaultBranch in the Nextflow configuration will be used by default.
-
Set the Work directory location of the Nextflow work directory. The location associated with the compute environment will be selected by default.
-
Enter the name(s) of each of the Nextflow Config profiles followed by the Enter key. See the Nextflow Config profiles documentation for more details.
-
Enter any Pipeline parameters in YAML or JSON format. YAML example:
reads: 's3://nf-bucket/exome-data/ERR013140_{1,2}.fastq.bz2'
paired_end: true
-
Select Launchpad to begin the pipeline execution.
Nextflow pipelines are simply Git repositories and can be hosted on any public or private Git-hosting platform. See Git Integration in the Tower docs and Pipeline Sharing in the Nextflow docs for more details. |
The credentials associated with the compute environment must be able to access the work directory. |
In the configuration, the full path to a bucket must be specified with single quotes around strings and no quotes around booleans or numbers. |
To create your own customized Nextflow Schema for your pipeline, see the examples from the nf-core workflows that have adopted this approach, for example eager and rnaseq. |
For advanced settings options check out this page.
There is also community support available if you get into trouble; you can join the Nextflow Slack by following this link.
12.2.3. API
To learn more about using the Tower API, visit the API section in this documentation.
12.3. Workspaces and Organizations
Nextflow Tower simplifies the development and execution of workflows by providing a centralized interface for users and organizations.
Each user has a unique workspace where they can interact and manage all resources such as workflows, compute environments and credentials. Details of this can be found here.
By default, each user has their own private workspace, while organizations have the ability to run and manage users through role-based access as members and collaborators.
12.3.1. Organization resources
You can create your own organization and participant workspace by following the docs at tower.
Tower allows the creation of multiple organizations, each of which can contain multiple workspaces with shared users and resources. This allows any organization to customize and organize the usage of resources while maintaining an access control layer for users associated with a workspace.
12.3.2. Organization users
Any user can be added or removed from a particular organization or a workspace and can be allocated a specific access role within that workspace.
The Teams feature provides a way for organizations to group various users and participants together into teams, for example workflow-developers or analysts, and to apply access control to all the users within a team collectively.
For further information, please refer to the User Management section.
Setting up a new organization
Organizations are the top-level structure and contain Workspaces, Members, Teams and Collaborators.
To create a new Organization:
-
Click on the dropdown next to your name and select New organization to open the creation dialog.
-
On the dialog, fill in the fields as per your organization. The Name and Full name fields are compulsory. A valid name for the organization must follow a specific pattern; please refer to the UI for further instructions.
-
The rest of the fields such as Description, Location, Website URL and Logo Url are optional.
-
Once the details are filled in, you can access the newly created organization using the organization’s page, which lists all of your organizations.
It is possible to change the values of the optional fields either using the Edit option on the organization’s page or by using the Settings tab within the organization page, provided that you are the Owner of the organization. A list of all the included Members, Teams and Collaborators can be found on the organization page.
13. Execution cache and resume
The Nextflow caching mechanism works by assigning a unique ID to each task which is used to create a separate execution directory where the tasks are executed and the results stored.
The task unique ID is generated as a 128-bit hash number obtained by composing the task input values, input files and command string.
The pipeline work directory is organized as shown below:
work/
├── 12
│   └── 1adacb582d2198cd32db0e6f808bce
│       ├── genome.fa -> /data/../genome.fa
│       └── index
│           ├── hash.bin
│           ├── header.json
│           ├── indexing.log
│           ├── quasi_index.log
│           ├── refInfo.json
│           ├── rsd.bin
│           ├── sa.bin
│           ├── txpInfo.bin
│           └── versionInfo.json
├── 19
│   └── 663679d1d87bfeafacf30c1deaf81b
│       ├── ggal_gut
│       │   ├── aux_info
│       │   │   ├── ambig_info.tsv
│       │   │   ├── expected_bias.gz
│       │   │   ├── fld.gz
│       │   │   ├── meta_info.json
│       │   │   ├── observed_bias.gz
│       │   │   └── observed_bias_3p.gz
│       │   ├── cmd_info.json
│       │   ├── libParams
│       │   │   └── flenDist.txt
│       │   ├── lib_format_counts.json
│       │   ├── logs
│       │   │   └── salmon_quant.log
│       │   └── quant.sf
│       ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│       ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│       └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index
You can create these listings using the tree command if you have it installed. On Linux, simply sudo apt install -y tree, or with Homebrew: brew install tree. |
13.1. How resume works
The -resume command-line option allows the continuation of a pipeline execution from the last step that was completed successfully:
nextflow run <script> -resume
In practical terms, the pipeline is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files.
If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.
The first task for which a new output is computed invalidates all downstream executions in the remaining DAG.
13.2. Work directory
The task work directories are created in the folder work in the launching path by default. This is supposed to be a scratch storage area that can be cleaned up once the computation is completed.
Workflow final output(s) are supposed to be stored in a different location specified using one or more publishDir directives. |
Make sure to delete your work directory occasionally, else your machine/environment may be filled with unused files. |
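For example, a minimal sketch of publishing the outputs of a process to a results folder (the process body and paths are placeholders):
process foo {
    publishDir "results", mode: 'copy'

    output:
    path 'result.txt'

    script:
    """
    your_command --here > result.txt
    """
}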
A different location for the execution work directory can be specified using the command line option -w. For example:
nextflow run <script> -w /some/scratch/dir
If you delete or move the pipeline work directory, it will prevent the use of the resume feature in subsequent runs. |
The hash code for input files is computed using:
-
The complete file path
-
The file size
-
The last modified timestamp
Therefore, just touching a file will invalidate the related task execution.
13.3. How to organize in silico experiments
It’s good practice to organize each experiment in its own folder. The main experiment input parameters should be specified using a Nextflow config file. This makes it simple to track and replicate an experiment over time.
In the same experiment, the same pipeline can be executed multiple times, however, launching two (or more) Nextflow instances in the same directory concurrently should be avoided. |
The nextflow log command lists the executions run in the current folder:
$ nextflow log
TIMESTAMP            DURATION  RUN NAME          STATUS  REVISION ID  SESSION ID                            COMMAND
2019-05-06 12:07:32  1.2s      focused_carson    ERR     a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello
2019-05-06 12:08:33  21.1s     mighty_boyd       OK      a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run rnaseq-nf -with-docker
2019-05-06 12:31:15  1.2s      insane_celsius    ERR     b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf
2019-05-06 12:31:24  17s       stupefied_euclid  OK      b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker
You can use either the session ID or the run name to recover a specific execution. For example:
nextflow run rnaseq-nf -resume mighty_boyd
13.4. Execution provenance
The log command, when provided with a run name or session ID, can return many useful bits of information about a pipeline execution that can be used to create a provenance report.
By default, it will list the work directories used to compute each task. For example:
$ nextflow log tiny_fermat
/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba
The -f (fields) option can be used to specify which metadata should be printed by the log command. For example:
$ nextflow log tiny_fermat -f 'process,exit,hash,duration'
index    0  7b/3753ff  2.0s
fastqc   0  c1/56a36d  9.3s
fastqc   0  f7/659c65  9.1s
quant    0  82/ba67e3  2.7s
quant    0  e5/2816b9  3.2s
multiqc  0  3b/3485d0  6.3s
The complete list of available fields can be retrieved with the command:
nextflow log -l
The -F option allows the specification of filtering criteria to print only a subset of tasks. For example:
$ nextflow log tiny_fermat -F 'process =~ /fastqc/'
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
This can be useful to locate specific task work directories.
Finally, the -t option enables the creation of a basic custom provenance report, by providing a template file in any format of your choice. For example:
<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>
<ul>
<li>Exit: ${exit}</li>
<li>Status: ${status}</li>
<li>Work dir: ${workdir}</li>
<li>Container: ${container}</li>
</ul>
</div>
Save the above snippet in a file named template.html. Then run this command (using the correct id for your run, e.g. not tiny_fermat):
nextflow log tiny_fermat -t template.html > prov.html
Finally, open the prov.html file with a browser.
13.5. Resume troubleshooting
If your workflow execution is not resumed as expected with one or more tasks being unexpectedly re-executed each time, these may be the most likely causes:
-
Input file changed: Make sure that there’s no change in your input file(s). Don’t forget the task unique hash is computed by taking into account the complete file path, the last modified timestamp and the file size. If any of this information has changed, the workflow will be re-executed even if the input content is the same.
-
A process modifies an input: A process should never alter input files, otherwise the resume for future executions will be invalidated for the same reason explained in the previous point.
-
Inconsistent file attributes: Some shared file systems, such as NFS, may report an inconsistent file timestamp (i.e. a different timestamp for the same file) even if it has not been modified. To prevent this problem use the lenient cache strategy (see the configuration sketch after this list).
-
Race condition in global variable: Nextflow is designed to simplify parallel programming without having to worry about race conditions and access to shared resources. One of the few cases in which a race condition can arise is when using a global variable with two (or more) operators. For example:
Channel
    .of(1,2,3)
    .map { it -> X=it; X+=2 }
    .view { "ch1 = $it" }

Channel
    .of(1,2,3)
    .map { it -> X=it; X*=2 }
    .view { "ch2 = $it" }
The problem in this snippet is that the X variable in the closure definition is defined in the global scope. Therefore, since operators are executed in parallel, the X value can be overwritten by the other map invocation.
The correct implementation requires the use of the def keyword to declare the variable local:
Channel
    .of(1,2,3)
    .map { it -> def X=it; X+=2 }
    .println { "ch1 = $it" }

Channel
    .of(1,2,3)
    .map { it -> def X=it; X*=2 }
    .println { "ch2 = $it" }
-
Not deterministic input channels: While dataflow channel ordering is guaranteed (i.e. data is read in the same order in which it's written in the channel), when a process declares as input two or more channels, each of which is the output of a different process, the overall input ordering is not consistent across different executions.
In practical terms, consider the following snippet:
process foo {
    input:
    tuple val(pair), path(reads)

    output:
    tuple val(pair), path('*.bam'), emit: bam_ch

    script:
    """
    your_command --here
    """
}

process bar {
    input:
    tuple val(pair), path(reads)

    output:
    tuple val(pair), path('*.bai'), emit: bai_ch

    script:
    """
    other_command --here
    """
}

process gather {
    input:
    tuple val(pair), path(bam)
    tuple val(pair), path(bai)

    script:
    """
    merge_command $bam $bai
    """
}
The two inputs declared in the gather process can be delivered in any order because the execution order of the processes foo and bar is not deterministic due to their parallel execution.
Therefore the input of the third process needs to be synchronized using the join operator, or a similar approach. The third process should be written as:
...

process gather {
    input:
    tuple val(pair), path(bam), path(bai)

    script:
    """
    merge_command $bam $bai
    """
}
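As referenced in the "Inconsistent file attributes" point above, a minimal sketch of enabling the lenient cache strategy for all processes in the configuration file:
// nextflow.config
process.cache = 'lenient'
The same behaviour can also be requested for a single process with the cache 'lenient' directive.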
14. Error handling and troubleshooting
14.1. Execution error debugging
When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the failing task:
ERROR ~ Error executing process > 'index'

Caused by:                              (1)
  Process `index` terminated with an error exit status (127)

Command executed:                       (2)
  salmon index --threads 1 -t transcriptome.fa -i index

Command exit status:                    (3)
  127

Command output:                         (4)
  (empty)

Command error:                          (5)
  .command.sh: line 2: salmon: command not found

Work dir:                               (6)
  /Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
1 | A description of the error cause |
2 | The command executed |
3 | The command exit status |
4 | The command standard output, when available |
5 | The command standard error |
6 | The command work directory |
Carefully review all error data as it can provide information valuable for debugging.
If this is not enough, cd into the task work directory. It contains all the files needed to replicate the issue in an isolated manner.
The task execution directory contains these files:
-
.command.sh: The command script.
-
.command.run: The command wrapper used to run the job.
-
.command.out: The complete job standard output.
-
.command.err: The complete job standard error.
-
.command.log: The wrapper execution output.
-
.command.begin: Sentinel file created as soon as the job is launched.
-
.exitcode: A file containing the task exit code.
-
Task input files (symlinks)
-
Task output files
Verify that the .command.sh file contains the expected command and all variables are correctly resolved.
Also verify the existence of the .exitcode and .command.begin files, which if absent, suggest the task was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin file exists, the job was launched but was likely killed abruptly.
You can replicate the failing execution using the command bash .command.run to verify the cause of the error.
14.2. Ignore errors
There are cases in which a process error may be expected and it should not stop the overall workflow execution.
To handle this use case, set the process errorStrategy to ignore:
process foo {
    errorStrategy 'ignore'

    script:
    """
    your_command --this --that
    """
}
If you want to ignore any error you can set the same directive in the config file as a default setting:
process.errorStrategy = 'ignore'
14.3. Automatic error fail-over
In rare cases, errors may be caused by transient conditions. In this situation, an effective strategy is re-executing the failing task.
process foo {
    errorStrategy 'retry'

    script:
    """
    your_command --this --that
    """
}
Using the retry error strategy, the task is re-executed a second time if it returns a non-zero exit status, before stopping the complete workflow execution.
The directive maxRetries can be used to set the number of attempts the task can be re-executed before declaring it failed with an error condition.
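For example, a minimal sketch combining the two directives (the command is a placeholder):
process foo {
    errorStrategy 'retry'
    maxRetries 3

    script:
    """
    your_command --this --that
    """
}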
14.4. Retry with backoff
There are cases in which the required execution resources may be temporarily unavailable (e.g. network congestion). In these cases simply re-executing the same task will likely result in an identical error. A retry with an exponential backoff delay can better recover these error conditions.
process foo {
    errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
    maxRetries 5

    script:
    '''
    your_command --here
    '''
}
14.5. Dynamic resources allocation
It’s a very common scenario that different instances of the same process have very different needs in terms of computing resources. In such situations, requesting too little memory, for example, will cause some tasks to fail, while using a limit high enough to fit all the tasks in your execution could significantly decrease the execution priority of your jobs.
To handle this use case, you can use a retry error strategy and increase the computing resources allocated by the job at each successive attempt.
process foo {
    cpus 4
    memory { 2.GB * task.attempt }  (1)
    time { 1.hour * task.attempt }  (2)
    errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }  (3)
    maxRetries 3  (4)

    script:
    """
    your_command --cpus $task.cpus --mem $task.memory
    """
}
1 | The memory is defined in a dynamic manner, the first attempt is 2 GB, the second 4 GB, and so on. |
2 | The wall execution time is set dynamically as well, the first execution attempt is set to 1 hour, the second 2 hours, and so on. |
3 | If the task returns an exit status equal to 140 it will set the error strategy to retry otherwise it will terminate the execution. |
4 | It will retry the process execution up to three times. |