Welcome to the Nextflow Training Workshop
We are excited to have you on the path to writing reproducible and scalable scientific workflows using Nextflow. This guide complements the full Nextflow documentation - if you ever have any doubts, head over to the docs located here.
By the end of this course you should:
- Be proficient in writing Nextflow pipelines
- Know the basic Nextflow concepts of Channels, Processes and Operators
- Have an understanding of containerised workflows
- Understand the different execution platforms supported by Nextflow
- Be introduced to the Nextflow community and ecosystem
Overview
To get you started with Nextflow as quickly as possible, we will walk through the following steps:
- Set up a development environment to run Nextflow.
- Explore Nextflow concepts using some basic workflows, including a multi-step RNA-Seq analysis.
- Build and use Docker containers to encapsulate all workflow dependencies.
- Dive deeper into the core Nextflow syntax, including Channels, Processes and Operators.
- Cover cluster and cloud deployment scenarios and explore Nextflow Tower capabilities.
This will give you a broad understanding of Nextflow, so that you can start writing your own pipelines. We hope you enjoy the course! This is an ever-evolving document and feedback is always welcome.
1. Environment setup
There are many ways to get started with the Nextflow training course. We recommend using Gitpod (see section 1.2), as this provides a self-contained environment with all the programs and data pre-installed.
1.1. Local installation
Nextflow can be used on any POSIX compatible system (Linux, OS X, Windows Subsystem for Linux etc.).
Requirements:
- Bash
- Java 8 (or later)
- Git
Optional requirements for this tutorial:
- Docker engine 1.10.x (or later)
- Singularity 2.5.x (or later)
- Conda 4.5 (or later)
- AWS Batch computing environment configured
1.1.1. Download Nextflow:
Enter this command in your terminal:
wget -qO- https://get.nextflow.io | bash
Or, if you prefer curl:
curl -s https://get.nextflow.io | bash
Then ensure that the downloaded binary is executable:
chmod +x nextflow
Finally, move the nextflow executable to a directory in your $PATH (e.g. /usr/local/bin).
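For example, assuming /usr/local/bin is in your PATH and you have permission to write there (a minimal sketch):
sudo mv nextflow /usr/local/bin/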
1.1.2. Training material
Download the training material by copying and pasting the following command into the terminal:
git clone https://github.com/seqeralabs/nf-training-public.git
Check that Nextflow is correctly installed by running the following command:
nextflow info
1.2. Gitpod
Use a preconfigured Nextflow development environment using Gitpod.
Requirements:
- A GitHub, GitLab or BitBucket account (gives you 50 hours of usage per month)
- A web browser (Google Chrome, Firefox)
1.2.2. Gitpod environment
To run Gitpod:
- Either: click the following URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public (which is our GitHub repository URL, prefixed with gitpod.io/#), OR install the Gitpod browser extension, following the instructions here. This will add a green Gitpod button to each Git repository for easy access to the environment. Then go to the Git repository for the training (github.com/seqeralabs/nf-training-public) and click the green button.

- Log in to your GitHub account (and allow authorisation).
You may need to close the window at each stage and click the Gitpod link again.
For BitBucket or GitLab, you may have to sign into Gitpod and change your integration on the following page: gitpod.io/integrations.
Once you have signed in, Gitpod will load (skip the prebuild, if asked):

Gitpod gives you 50 hours per month to run the environment, with up to 4 parallel workspaces at a time. So feel free to come back at any time and try the course at your own pace.
It is useful to save the URL of your Gitpod environment. Later, you can use it to return to the same workspace, so please take note of it now.
- Explore your Gitpod IDE.
You should see something similar to the following:

The sidebar allows you to customise your Gitpod environment and perform basic tasks (Copy/Paste, Open files, search, git, etc.) Click the Explorer button to see which files are in this repository.
The terminal allows you to run all the programs in the repository; for example, nextflow and docker are installed.
The main window allows you to view and edit files. Clicking on a file in the explorer will open it within the main window. We can also see the nf-training material in this window, but if you find the space too small you can open it separately in another browser window (training.seqera.io). Finally, if you accidentally close the Gitpod training window, you can reopen it by entering gp preview training.seqera.io.
- Saving files from Gitpod to your local machine.
To save a file, select the file of interest, then either use the menu (File > Save As…) or use your keyboard shortcut for "Save As" (on Mac: CMD + Shift + S), then choose Show Local. This will open an explorer where you can choose where to save the file on your local machine.
1.2.3. Reopening a Gitpod session
Any running workspace will be automatically stopped after 30 minutes. You can open the environment again by going to gitpod.io/workspaces and finding your previous environment, then clicking the three dot button to the right, and selecting Open.
If you saved the URL from your previous Gitpod environment, you can simply paste it into your browser to reopen that environment. Environments are saved for up to two weeks, but don't rely on their existence; download any important files you want to keep.
Alternatively, you can start a new workspace by clicking the green gitpod button, or following the Gitpod URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public
This tutorial provides all the scripts, so don't worry if you have lost your environment. In the nf-training and nf-training/scripts directories, you can find the main scripts and individual snippets used in the tutorial.
If you want to change your Git provider (between GitHub, GitLab and BitBucket), go to gitpod.io/integrations. You will then need to log in and deactivate the current provider.
1.2.4. Getting started
In the terminal section, you can type the following:
nextflow info
This should come up with the Nextflow information from this environment. This tells us that the environment is working. All the training material and scripts are in this environment.
Be aware that if you leave the window or are not active, your session may end after 30 minutes. You can always reactivate it by clicking the Gitpod link again. It will show whether your previous environment is still active or whether you need to open a new one.
You should see the following (or similar):
Version: 21.10.6 build 5660
Created: 21-12-2021 16:55 UTC
System: Linux 4.14.262-135.489.amzn1.x86_64
Runtime: Groovy 3.0.9 on OpenJDK 64-Bit Server VM 1.8.0_312-b07
Encoding: UTF-8 (UTF-8)
1.3. Selecting a Nextflow version
By default Nextflow will pull the latest stable version into your environment.
However, Nextflow is constantly evolving as we make improvements and fix bugs.
It is worth checking out the latest releases on GitHub (click here).
If you want or need to use a specific version of Nextflow, you can set the NXF_VER variable as follows:
export NXF_VER=21.10.0
Then run nextflow -version to make sure that the change has taken effect.
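You can also pin the version for a single invocation by prefixing the command with the variable, which is standard shell behaviour, e.g.:
NXF_VER=21.10.0 nextflow info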
2. Introduction
2.1. Basic concepts
Nextflow is a workflow orchestration engine and a domain specific language (DSL) that makes it easy to write data-intensive computational pipelines.
It is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.
Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment based on the dataflow programming model. Nextflow’s core features are:
- Workflow portability and reproducibility
- Scalability of parallelization and deployment
- Integration of existing tools, systems and industry standards
2.1.1. Processes and Channels
In practice, a Nextflow pipeline is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).
Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous first-in, first-out (FIFO) queues, called channels in Nextflow.
Any process can define one or more channels as input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

2.1.2. Execution abstraction
While a process defines what command or script has to be executed, the executor determines how that script is actually run on the target platform.
If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and testing purposes, but for real world computational pipelines, an HPC or cloud platform is often required.
In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system (or runtime). Thus it is possible to write a pipeline once and to seamlessly run it on your computer, a cluster, or the cloud, without modifying it, by simply defining the target execution platform in the configuration file.
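For example, a minimal configuration sketch that targets a SLURM cluster instead of the local machine (the queue name here is only an illustrative assumption):
process.executor = 'slurm'
process.queue = 'long'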

2.1.3. Scripting language
Nextflow implements a declarative domain specific language (DSL) that simplifies the writing of complex data analysis workflows as an extension of a general purpose programming language.
This approach makes Nextflow very flexible: it combines, in the same computing environment, the benefits of a concise DSL that handles recurrent use cases with ease, and the flexibility and power of a general-purpose programming language to handle corner cases that may be difficult to implement with a purely declarative approach.
In practical terms, Nextflow scripting is an extension of the Groovy programming language, which in turn is a super-set of the Java programming language. Groovy can be considered as Python for Java in that it simplifies the writing of code and is more approachable.
2.2. Your first script
Here you will execute your first Nextflow script (hello.nf), which we will go through line by line.
In this toy example, the script takes an input string (a parameter called params.greeting), splits it into chunks of six characters in the first process, converts the characters to upper case in the second process, and finally prints the output to screen.
2.2.1. Nextflow code (hello.nf)
#!/usr/bin/env nextflow (1)
params.greeting = 'Hello world!' (2)
greeting_ch = Channel.from(params.greeting) (3)
process splitLetters { (4)
input: (5)
val x from greeting_ch (6)
output: (7)
file 'chunk_*' into letters_ch (8)
""" (9)
printf '$x' | split -b 6 - chunk_ (10)
""" (11)
} (12)
process convertToUpper { (13)
input: (14)
file y from letters_ch.flatten() (15)
output: (16)
stdout into result_ch (17)
""" (18)
cat $y | tr '[a-z]' '[A-Z]' (19)
""" (20)
} (21)
result_ch.view{ it } (22)
1 | The code begins with a shebang, which declares Nextflow as the interpreter. |
2 | Declares a parameter greeting that is initialized with the value 'Hello world!'. |
3 | Initialises a channel labelled greeting_ch, which contains the value from params.greeting. Channels are the necessary input type for processes in Nextflow. |
4 | Begins the first process defined as splitLetters . |
5 | Input declaration for the splitLetters process. Inputs can be values (val), files (file), paths (path) or other qualifiers (see here). |
6 | Tells the process to expect an input value (val ) from the channel greeting_ch , that we assign to the variable 'x'. |
7 | Output declaration for the splitLetters process. |
8 | Tells the process to expect output file(s) (file) matching the pattern 'chunk_*' produced by the script, and sends them to the channel letters_ch. |
9 | Triple quote marks initiate the code block to execute in this process. |
10 | Code to execute, printing the input value x (called using the dollar [$] symbol prefix), splitting the string into chunks of 6 characters ("Hello " and "world!"), and saving them to the files chunk_aa and chunk_ab. |
11 | Triple quote marks the end of the code block. |
12 | End of first process block. |
13 | Begin second process defined as convertToUpper . |
14 | Input declaration for the convertToUpper process . |
15 | Tells the process to expect input file(s) (file; e.g. chunk_aa and chunk_ab) from the letters_ch channel, which we assign to the variable 'y'. |
16 | Output declaration for the convertToUpper process. |
17 | Tells the process to expect output as standard output (stdout) and direct this into the result_ch channel. |
18 | Triple quote marks initiate the code block to execute in this process. |
19 | Script to read files (cat) using the '$y' input variable, then pipe to uppercase conversion, outputting to standard output. |
20 | Triple quote marks the end of the code block. |
21 | End of the second process block. |
22 | The final output (in the result_ch ) is printed to screen using the view operator (appended onto the channel name). |
The .flatten() operator is used here to split the two files into two separate items to be put through the next process (otherwise they would be treated as a single element).
2.2.2. In practice
Please now copy the above example into your favourite text editor and save it to a file named hello.nf.
For the Gitpod tutorial, make sure you are in the folder called nf-training.
Execute the script by entering the following command in your terminal:
nextflow run hello.nf
The output will look similar to the text shown below:
N E X T F L O W ~ version 21.04.3
Launching `hello.nf` [confident_kowalevski] - revision: a3f5a9e46a
executor > local (3)
[0d/59d203] process > splitLetters (1) [100%] 1 of 1 ✔
[9f/1dd42a] process > convertToUpper (2) [100%] 2 of 2 ✔
HELLO
WORLD!
Where the standard output shows (line by line):
- 1: The Nextflow version executed.
- 2: The script name, run name and revision.
- 3: The executor used (in the above case: local).
- 4: The first process, executed once (1), starting with a unique hexadecimal identifier (see TIP below) and ending with the percentage and job completion information.
- 5: The second process, executed twice (2).
- 6-7: The result strings printed from stdout.
The hexadecimal numbers, like 0d/59d203, identify the unique process execution. These numbers are also the prefix of the directories where each process is executed. You can inspect the files produced by changing to the directory $PWD/work and using these numbers to find the process-specific execution path.
The second process runs twice, executing in two different work directories for each input file. Therefore, in the previous example the work directory [9f/1dd42a] represents just one of the two directories that were processed. To print all the relevant paths to screen, use the -ansi-log flag (e.g. nextflow run hello.nf -ansi-log false).
It’s worth noting that the process convertToUpper is executed in parallel, so there’s no guarantee that the instance processing the first split (the chunk 'Hello ') will be executed before the one processing the second split (the chunk 'world!').
Thus, it is perfectly possible that you will get the final result printed out in a different order:
WORLD!
HELLO
2.3. Modify and resume
Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are actually changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result used instead.
This helps when testing or modifying part of your pipeline without having to re-execute it from scratch.
For the sake of this tutorial, modify the convertToUpper process in the previous example, replacing the process script with the string rev $y, so that the process looks like this:
process convertToUpper {
input:
file y from letters_ch.flatten()
output:
stdout into result_ch
"""
rev $y
"""
}
Then save the file with the same name, and execute it by adding the -resume option to the command line:
nextflow run hello.nf -resume
It will print output similar to this:
N E X T F L O W ~ version 21.04.3
Launching `hello.nf` [admiring_venter] - revision: aed50861e0
executor > local (2)
[74/d48321] process > splitLetters (1) [100%] 1 of 1, cached: 1 ✔
[59/136e00] process > convertToUpper (1) [100%] 2 of 2 ✔
!dlrow
olleH
You will see that the execution of the process splitLetters is actually skipped (the process ID is the same), and its results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.
The pipeline results are cached by default in the directory $PWD/work. Depending on your script, this folder can take up a lot of disk space. If you are sure you won't resume your pipeline execution, clean this folder periodically.
2.4. Pipeline parameters
Pipeline parameters are simply declared by prepending the prefix params to a variable name, separated by a dot character. Their value can be specified on the command line by prefixing the parameter name with a double dash character, i.e. --paramName.
Now, let’s try to execute the previous example specifying a different input string parameter, as shown below:
nextflow run hello.nf --greeting 'Bonjour le monde!'
The string specified on the command line will override the default value of the parameter. The output will look like this:
N E X T F L O W ~ version 21.10.0
Launching `hello.nf` [angry_mcclintock] - revision: 073d8111fc
executor > local (4)
[6c/e6edf5] process > splitLetters (1) [100%] 1 of 1 ✔
[bc/6845ce] process > convertToUpper (2) [100%] 3 of 3 ✔
uojnoB
!edno
m el r
2.4.1. In DAG-like format
To better understand how Nextflow is dealing with the data in this pipeline, we share below a DAG-like figure to visualise all the inputs, outputs, channels and processes.
Check this out, by clicking here:

3. Simple RNA-Seq pipeline
To demonstrate a real-world biomedical scenario, we will implement a proof of concept RNA-Seq pipeline which:
- Indexes a transcriptome file.
- Performs quality controls.
- Performs quantification.
- Creates a MultiQC report.
This will be done using a series of seven scripts, each of which builds on the previous one, to give a complete workflow. You can find these in the tutorial folder (script1.nf to script7.nf).
3.1. Define the pipeline parameters
Parameters are inputs and options that can be changed when the pipeline is run.
The script script1.nf
defines the pipeline input parameters.
params.reads = "$projectDir/data/ggal/*_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
println "reads: $params.reads"
Run it by using the following command:
nextflow run script1.nf
Try to specify a different input parameter, for example:
nextflow run script1.nf --reads "/workspace/nf-training-public/nf-training/data/ggal/gut_{1,2}.fq"
Exercise
Modify script1.nf by adding a fourth parameter named outdir and set it to a default path that will be used as the pipeline output directory.
Click here for the answer:
params.reads = "$projectDir/data/ggal/*_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
params.outdir = "results"
Exercise
Modify script1.nf to print all the pipeline parameters by using a single log.info command and a multiline string statement. See an example here.
Click here for the answer:
Add the following to your script file:
log.info """\
R N A S E Q - N F P I P E L I N E
===================================
transcriptome: ${params.transcriptome_file}
reads : ${params.reads}
outdir : ${params.outdir}
"""
.stripIndent()
Recap
In this step you have learned:
- How to define parameters in your pipeline script
- How to pass parameters by using the command line
- The use of $var and ${var} variable placeholders
- How to use multiline strings
- How to use log.info to print information and save it in the log execution file
3.2. Create a transcriptome index file
Nextflow allows the execution of any command or script by using a process definition.
A process is defined by providing three main declarations: the process inputs, outputs and command script.
To add a transcriptome index process, try adding the following code block to your script (or continue from script2.nf).
/*
* define the `salmon_index` process that creates a binary index
* given the transcriptome file
*/
process index {
input:
path transcriptome from params.transcriptome_file
output:
path 'salmon_index' into index_ch
script:
"""
salmon index --threads $task.cpus -t $transcriptome -i salmon_index
"""
}
This process takes the transcriptome params file as input and creates the transcriptome index by using the salmon tool. The output is specified as a path to the output file salmon_index, and this is passed to the index_ch channel.
The input declaration defines a transcriptome path variable which is used in the script as a reference (using the dollar sign) in the Salmon command line.
Resource requirements such as CPUs and memory can change with different workflow executions and platforms. Nextflow can use $task.cpus as a variable for the number of CPUs. See the process directives documentation for more details.
Run it by using the command:
nextflow run script2.nf
The execution will fail because salmon is not installed in your environment. Add the command line option -with-docker to launch the execution through a Docker container, as shown below:
nextflow run script2.nf -with-docker
This time it works because it uses the Docker container nextflow/rnaseq-nf defined in the nextflow.config file in your current directory. If you are running this script locally, you will need to install Docker on your machine, log in to activate Docker, and allow the script to download the container containing the run scripts. You can learn more about Docker later: here.
To avoid adding -with-docker each time, add the following line to the nextflow.config file:
docker.enabled = true
Exercise
Enable Docker execution by default by adding the above setting to the nextflow.config file.
Exercise
Print the output of the index_ch channel by using the view operator.
Click here for the answer:
Add the following to your script file, at the end:
index_ch.view()
Exercise
If you have more CPUs available, try changing your script to request more resources for this process. For an example, see the directive docs. $task.cpus is already used in this script, so adding a cpus directive at the top of the process will tell Nextflow to run this job with the specified number of CPUs.
Click here for the answer:
Add cpus 2 to the top of the index process:
process index {
cpus 2
input:
...
Then check that it worked by looking at the script executed in the work directory. Look for the hexadecimal directory (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), then cat the .command.sh file.
Bonus Exercise
Use the command tree work to see how Nextflow organizes the process work directory. Check here if you need to download tree.
Click here for the answer:
It should look something like this:
work
├── 17
│   └── 263d3517b457de4525513ae5e34ea8
│       ├── index
│       │   ├── complete_ref_lens.bin
│       │   ├── ctable.bin
│       │   ├── ctg_offsets.bin
│       │   ├── duplicate_clusters.tsv
│       │   ├── eqtable.bin
│       │   ├── info.json
│       │   ├── mphf.bin
│       │   ├── pos.bin
│       │   ├── pre_indexing.log
│       │   ├── rank.bin
│       │   ├── refAccumLengths.bin
│       │   ├── ref_indexing.log
│       │   ├── reflengths.bin
│       │   ├── refseq.bin
│       │   ├── seq.bin
│       │   └── versionInfo.json
│       └── transcriptome.fa -> /workspace/Gitpod_test/data/ggal/transcriptome.fa
├── 7f
Recap
In this step you have learned:
- How to define a process executing a custom command
- How process inputs are declared
- How process outputs are declared
- How to print the content of a channel
- How to access the number of available CPUs
3.3. Collect read files by pairs
This step shows how to match read files into pairs, so they can be mapped by Salmon.
Edit the script script3.nf
and add the following statement as the last line:
read_pairs_ch.view()
Save it and execute it with the following command:
nextflow run script3.nf
It will print something similar to this:
[ggal_gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]
The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is the read pair prefix and the second is a list representing the actual files.
Try it again specifying different read files by using a glob pattern:
nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'
File paths that include one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to avoid Bash expanding the glob.
Exercise
Use the set operator in place of = assignment to define the read_pairs_ch channel.
Click here for the answer:
Channel
.fromFilePairs( params.reads )
.set { read_pairs_ch }
Exercise
Use the checkIfExists option for the fromFilePairs method to check whether the specified path contains at least one file pair.
Click here for the answer:
Channel
.fromFilePairs( params.reads, checkIfExists: true )
.set { read_pairs_ch }
Recap
In this step you have learned:
- How to use fromFilePairs to handle read pair files
- How to use the checkIfExists option to check input file existence
- How to use the set operator to define a new channel variable
3.4. Perform expression quantification
script4.nf adds the quantification process.
In this script, note how the index_ch channel, declared as output in the index process, is now used as a channel in the input section.
Also, note how the second input is declared as a tuple composed of two elements, the sample_id and the reads, in order to match the structure of the items emitted by the read_pairs_ch channel; see the sketch below.
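As a rough sketch (the exact declarations in script4.nf may differ; the channel names follow those used earlier in this tutorial), the process might look like this:
process quantification {
    input:
    path index from index_ch
    tuple val(sample_id), path(reads) from read_pairs_ch

    output:
    path(sample_id) into quant_ch

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    """
}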
Execute it by using the following command:
nextflow run script4.nf -resume
You will see the execution of the quantification process. When using the -resume option, any step that has already been processed is skipped.
Try to execute the same script with more read files as shown below:
nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'
You will notice that the quantification process is executed more than once.
Nextflow parallelizes the execution of your pipeline simply by providing multiple sets of input data to your script.
It may be useful to apply optional settings to a specific process using directives, which are provided in the process body.
Exercise
Add a tag directive to the quantification process to provide a more readable execution log.
Click here for the answer:
Add the following before the input declaration:
tag "Salmon on $sample_id"
Exercise
Add a publishDir directive to the quantification process to store the process results in a directory of your choice.
Click here for the answer:
Add the following before the input declaration in the quantification process:
publishDir params.outdir, mode:'copy'
Recap
In this step you have learned:
- How to connect two processes by using the channel declarations
- How to resume the script execution skipping cached steps
- How to use the tag directive to provide a more readable execution output
- How to use the publishDir directive to store a process's results in a path of your choice
3.5. Quality control
This step implements quality control of your input reads. The inputs are the same read pairs that are provided to the quantification step; a sketch of the fastqc process is shown below.
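A hedged sketch of what such a process might look like (channel and variable names are assumptions based on the earlier steps; see script5.nf for the actual code):
process fastqc {
    tag "FASTQC on $sample_id"

    input:
    tuple val(sample_id), path(reads) from read_pairs_ch

    output:
    path "fastqc_${sample_id}_logs" into fastqc_ch

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    """
}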
You can run it by using the following command:
nextflow run script5.nf -resume
The script will report the following error message:
Channel `read_pairs_ch` has been used twice as an input by process `fastqc` and process `quantification`
Exercise
Modify the creation of the read_pairs_ch channel by using an into operator in place of set. See an example here.
Click here for the answer:
Replace the channel creation in your script with the following:
Channel
.fromFilePairs( params.reads, checkIfExists: true )
.into { read_pairs_ch; read_pairs2_ch }
Then ensure that the second process uses read_pairs2_ch. See script6.nf.
Recap
In this step you have learned:
- How to use the into operator to create multiple copies of the same channel
In Nextflow DSL2, it is no longer a requirement to duplicate channels.
3.6. MultiQC report
This step collects the outputs from the quantification and fastqc steps to create a final report using the MultiQC tool.
Execute the next script with the following command:
nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'
It creates the final report in the results folder in the current work directory.
In this script, note the use of the mix and collect operators chained together to get all the outputs of the quantification and fastqc processes as a single input. Operators can be used to combine and transform channels.
We only want one task of MultiQC to be executed, producing one report. Therefore, we use the mix channel operator to combine the two channels, followed by the collect operator to return the complete channel contents as a single element; see the sketch below.
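A sketch of how this can be wired up (the channel names quant_ch and fastqc_ch are assumptions carried over from the previous steps; see script6.nf for the actual declaration):
process multiqc {
    publishDir params.outdir, mode:'copy'

    input:
    file('*') from quant_ch.mix(fastqc_ch).collect()

    output:
    file('multiqc_report.html')

    script:
    """
    multiqc .
    """
}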
Recap
In this step you have learned:
- How to collect many outputs into a single input with the collect operator
- How to mix two channels into a single channel
- How to chain two or more operators together
3.7. Handle completion event
This step shows how to execute an action when the pipeline completes the execution.
Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another in the order they are written in the pipeline script, as would happen in a common imperative programming language.
The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
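A minimal sketch of such a handler (the message text here is illustrative; see script7.nf for the actual one):
workflow.onComplete {
    println( workflow.success ? "\nDone! Open the following report in your browser --> $params.outdir/multiqc_report.html\n" : "Oops .. something went wrong" )
}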
Try to run it by using the following command:
nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'
3.8. Bonus!
Send a notification email when the workflow execution completes by using the -N <email address> command line option. Note: this requires the configuration of an SMTP server in the Nextflow config file. For the sake of this tutorial, add the following settings to your nextflow.config file:
mail {
from = 'info@nextflow.io'
smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
smtp.port = 587
smtp.user = "xxxxx"
smtp.password = "yyyyy"
smtp.auth = true
smtp.starttls.enable = true
smtp.starttls.required = true
}
Then execute the previous example again, specifying your email address:
nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq' -c mail.config -N <your email>
See mail documentation for details.
3.9. Custom scripts
Real world pipelines use a lot of custom user scripts (BASH, R, Python, etc.). Nextflow allows you to use and manage all these scripts in a consistent manner. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.
For example, create a file named fastqc.sh
with the following content:
#!/bin/bash
set -e
set -u
sample_id=${1}
reads=${2}
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
Save it, give it execute permission, and move it into the bin directory as shown below:
chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin
Then open the script7.nf file and replace the fastqc process' script with the following code:
script:
"""
fastqc.sh "$sample_id" "$reads"
"""
Run it as before:
nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'
Recap
In this step you have learned:
- How to write or use existing custom scripts in your Nextflow pipeline
- How to avoid the use of absolute paths by having your scripts in the bin/ folder
3.10. Metrics and reports
Nextflow is able to produce multiple reports and charts providing several runtime metrics and execution information.
Run the rnaseq-nf pipeline previously introduced as shown below:
nextflow run rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png
The -with-docker option launches each task of the execution as a Docker container run command.
The -with-report option enables the creation of the workflow execution report. Open the file report.html with a browser to see the report created with the above command.
The -with-trace option enables the creation of a tab-separated file containing runtime information for each executed task. Check the content of the file trace.txt for an example.
The -with-timeline option enables the creation of the workflow timeline report, showing how processes were executed over time. This may be useful to identify the most time-consuming tasks and bottlenecks. See an example at this link.
Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph representation. Note: this feature requires the installation of Graphviz on your computer. See here for details.
Then try running:
dot -Tpng dag.dot > graph.png
open graph.png
Note: runtime metrics may be incomplete for short running tasks, as is the case in this tutorial.
You can view the HTML files by right-clicking on the file name in the left sidebar and choosing the Preview menu item.
3.11. Run a project from GitHub
Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services eg. BitBucket and GitLab).
This simplifies the sharing and deployment of complex projects and tracking changes in a consistent manner.
The following GitHub repository hosts a complete version of the workflow introduced in this tutorial:
You can run it by specifying the project name as shown below, launching each task of the execution as a Docker container run command:
nextflow run nextflow-io/rnaseq-nf -with-docker
Nextflow automatically downloads the pipeline and stores it in the $HOME/.nextflow folder.
Use the command info to show the project information, e.g.:
nextflow info nextflow-io/rnaseq-nf
Nextflow allows the execution of a specific revision of your project by using the -r command line option. For example:
nextflow run nextflow-io/rnaseq-nf -r dev
Revisions are defined by using Git tags or branches in the project repository. This allows precise control of the changes in your project files and dependencies over time.
3.12. More resources
- Nextflow documentation - The Nextflow docs home.
- Nextflow patterns - A collection of Nextflow implementation patterns.
- CalliNGS-NF - A variant calling pipeline implementing GATK best practices.
- nf-core - A community collection of production-ready genomic pipelines.
4. Manage dependencies & containers
Computational workflows are rarely composed of a single script or tool. Most of the time they require the use of dozens of different software components or libraries.
Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in scientific applications.
To overcome these issues, we use containers, which allow the encapsulation of software dependencies, i.e. tools and libraries required by a data analysis application in one or more self-contained, ready-to-run, immutable Linux container images, that can be easily deployed in any platform supporting the container runtime.
Containers can be executed in isolation from the hosting system, having their own copy of the file system, process space, memory management, etc.
Containers were first introduced with kernel 2.6 as a Linux feature known as Control Groups or Cgroups.
4.1. Docker
Docker is a handy management tool to build, run and share container images.
These images can be uploaded and published in a centralised repository known as Docker Hub, or hosted by other parties like Quay.
4.1.1. Run a container
A container can be run using the following command:
docker run <container-name>
Try, for example, the following publicly available container (if you have Docker installed):
docker run hello-world
4.1.2. Pull a container
The pull command allows you to download a Docker image without running it. For example:
docker pull debian:stretch-slim
The above command downloads a Debian Linux image. You can check that it exists by using:
docker images
4.1.3. Run a container in interactive mode
Launching a BASH shell in the container allows you to operate in an interactive mode in the containerised operating system. For example:
docker run -it debian:stretch-slim bash
Once launched, you will notice that it is running as root (!). Use the usual commands to navigate in the file system. This is useful to check if the expected programs are present within a container.
To exit from the container, stop the BASH session with the exit command.
4.1.4. Your first Dockerfile
Docker images are created by using a so-called Dockerfile, which is a simple text file containing a list of commands to assemble and configure the image with the software packages required.
In this step you will create a Docker image containing the Salmon tool.
The Docker build process automatically copies all files located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big or many files exist. For this reason, it's important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use the .dockerignore file to select the paths to exclude from the build.
Then use your favourite editor (e.g. vim or nano) to create a file named Dockerfile and copy the following content:
FROM debian:stretch-slim
MAINTAINER <your name>
RUN apt-get update && apt-get install -y curl cowsay
ENV PATH=$PATH:/usr/games/
4.1.5. Build the image
Build the Dockerfile image by using the following command:
docker build -t my-image .
Where "my-image" is the user specified name tag for the Dockerfile, present in the current directory.
Don’t miss the dot in the above command. When it completes, verify that the image has been created listing all available images: |
docker images
You can try your new container by running this command:
docker run my-image cowsay Hello Docker!
4.1.6. Add a software package to the image
Add the Salmon package to the Docker image by adding the following snippet to the Dockerfile:
RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
Save the file and build the image again with the same command as before:
docker build -t my-image .
You will notice that it creates a new Docker image with the same name but with a different image ID.
4.1.7. Run Salmon in the container
Check that Salmon is running correctly in the container as shown below:
docker run my-image salmon --version
You can even launch a container in an interactive mode by using the following command:
docker run -it my-image bash
Use the exit
command to terminate the interactive session.
4.1.8. File system mounts
Create a genome index file by running Salmon in the container.
Try to run Salmon in the container with the following command:
docker run my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
The above command fails because Salmon cannot access the input file.
This happens because the container runs in a completely separate file system and it cannot access the hosting file system by default.
You will need to use the --volume command line option to mount the input file(s), e.g.:
docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
salmon index -t /transcriptome.fa -i transcript-index
Note that the generated transcript-index directory is still not accessible in the host file system.
An easier way is to mount a parent directory to an identical one in the container. This allows you to use the same path when running the command in the container, e.g.:
docker run --volume $PWD:$PWD --workdir $PWD my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
Or set the folder you want to mount as an environment variable, called DATA:
DATA=/workspace/nf-training-public/nf-training/data
docker run --volume $DATA:$DATA --workdir $PWD my-image \
salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index
Now check the content of the transcript-index folder by entering the command:
ls -la transcript-index
Note that the files created by the Docker execution are owned by root.
Exercise
Use the option -u $(id -u):$(id -g) to allow Docker to create files with the right permissions.
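For example (a sketch reusing the mount from above):
docker run -u $(id -u):$(id -g) --volume $PWD:$PWD --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index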
4.1.9. Upload the container in the Docker Hub (bonus)
Publish your container in the Docker Hub to share it with other people.
Create an account on the hub.docker.com website. Then, from your shell terminal, run the following command, entering the user name and password you specified when registering in the Hub:
docker login
Tag the image with your Docker user name account:
docker tag my-image <user-name>/my-image
Finally push it to the Docker Hub:
docker push <user-name>/my-image
After that anyone will be able to download it by using the command:
docker pull <user-name>/my-image
Note how after a pull and push operation, Docker prints the container digest number e.g.
Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest
This is a unique and immutable identifier that can be used to reference a container image unambiguously. For example:
docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
4.1.10. Run a Nextflow script using a Docker container
The simplest way to run a Nextflow script with a Docker image is by using the -with-docker command line option:
nextflow run script2.nf -with-docker my-image
We'll see later how to specify in the Nextflow config file which container to use, instead of having to specify it as a command line argument every time.
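As a preview, a minimal nextflow.config sketch for this would be:
process.container = 'my-image'
docker.enabled = true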
4.2. Singularity
Singularity is a container runtime designed to work in HPC data centers, where the usage of Docker is generally not allowed due to security constraints.
Singularity implements a container execution model similar to Docker, but it uses a completely different implementation design.
A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by many computing nodes managed by a batch scheduler.
4.2.1. Create a Singularity image
Singularity images are created using a Singularity file, in a similar manner to Docker but using a different syntax:
Bootstrap: docker
From: debian:stretch-slim
%environment
export PATH=$PATH:/usr/games/
%labels
AUTHOR <your name>
%post
apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/
Once you have saved the Singularity file, create the image with this command:
sudo singularity build my-image.sif Singularity
Note: the build command requires sudo permissions. A common workaround is to build the image on a local workstation and then deploy it to the cluster by simply copying the image file.
4.2.2. Running a container
Once done, you can run your container with the following command:
singularity exec my-image.sif cowsay 'Hello Singularity'
By using the shell command you can enter the container in interactive mode. For example:
singularity shell my-image.sif
Once in the container instance run the following commands:
touch hello.txt
ls -la
Note how the files on the host environment are shown. Singularity automatically mounts the host $HOME directory and uses the current work directory.
4.2.3. Import a Docker image
An easier way to create a Singularity container, without requiring sudo permission and boosting container interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:
singularity pull docker://debian:stretch-slim
The above command automatically downloads the Debian Docker image and converts it to a Singularity image file stored in the current directory.
4.2.4. Run a Nextflow script using a Singularity container
Nextflow allows the transparent usage of Singularity containers as easily as Docker ones. You simply enable the use of the Singularity engine in place of Docker, e.g. with the -with-singularity command line option:
nextflow run script7.nf -with-singularity nextflow/rnaseq-nf
As before, the Singularity container can also be specified in the Nextflow config file. We'll see later how to do it.
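For example, a minimal sketch of the corresponding settings would be:
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true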
4.2.5. The Singularity Container Library
The authors of Singularity, SyLabs, have their own repository of Singularity containers.
In the same way that we can push docker images to Docker Hub, we can upload Singularity images to the Singularity Library.
4.3. Conda/Bioconda packages
Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow pipelines to automatically create and activate the Conda environment(s), given the dependencies specified by each process.
For this Gitpod tutorial, you need to open a new terminal to ensure that conda is activated (see the + button on the terminal panel).
You should now see (base) at the beginning of your command prompt. If you wish to deactivate conda at any point, simply enter conda deactivate.
4.3.1. Using conda
A Conda environment is defined using a YAML file, which lists the required software packages. For example:
name: nf-tutorial
channels:
- defaults
- bioconda
- conda-forge
dependencies:
- salmon=1.0.0
- fastqc=0.11.5
- multiqc=1.5
- tbb=2020.2
Given the recipe file, the environment is created using the command shown below:
conda env create --file env.yml
You can check the environment was created successfully with the command shown below:
conda env list
This should look something like this:
# conda environments:
#
base /opt/conda
nf-tutorial * /opt/conda/envs/nf-tutorial
To enable the environment you can use the activate command:
conda activate nf-tutorial
Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-conda option (using the same path shown by the list command). For example:
nextflow run script7.nf -with-conda /opt/conda/envs/nf-tutorial
When a YAML recipe file is specified as the Conda environment, Nextflow automatically downloads the required dependencies, builds the environment and activates it.
This makes it easier to manage different environments for the processes in the workflow script.
See the docs for details.
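For instance, the recipe file can be passed directly (a minimal sketch, assuming the env.yml file shown above sits in the launch directory):
nextflow run script7.nf -with-conda env.yml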
4.3.2. Create and use conda-like environments using micromamba
Another way to build conda-like environments is through a Dockerfile and micromamba.
micromamba is a fast and robust tool for building small conda-based environments.
This saves having to build a conda environment each time you want to use it (as outlined in the previous sections).
To do this, you simply require a Dockerfile, such as the one in this public repo:
FROM mambaorg/micromamba
MAINTAINER Your name <your_email>
RUN \
micromamba install -y -n base -c defaults -c bioconda -c conda-forge \
salmon=1.5.1 \
fastqc=0.11.9 \
multiqc=1.10.1 \
&& micromamba clean -a -y
The above Dockerfile takes the parent image 'mambaorg/micromamba', then installs a conda environment using micromamba, including salmon, fastqc and multiqc.
Exercise
Try executing the RNA-Seq pipeline from earlier (script7.nf), but start by building your own micromamba Dockerfile (from above), save it to your Docker Hub repo, and direct Nextflow to run from this container (by changing your nextflow.config).
Building a Docker container and pushing it to your personal repo can take more than 10 minutes.
For an overview of steps to take, click here:
- Make a file called Dockerfile in the current directory (with the above code).
- Build the image: docker build -t my-image . (don't forget the '.').
- Publish the Docker image to your online Docker account, using something similar to the following, with <myrepo> replaced by your own Docker ID (without the '<' and '>' characters). "my-image" could be any name you choose, as long as it matches the name you used in the previous command.
docker login
docker tag my-image <myrepo>/my-image
docker push <myrepo>/my-image
- Add the image name to the nextflow.config file, e.g. remove the following from the nextflow.config:
process.container = 'nextflow/rnaseq-nf'
and replace it with:
process.container = '<myrepo>/my-image'
- Try running Nextflow, e.g.:
nextflow run script7.nf -with-docker
It should now work, and find salmon to run the process.
4.4. BioContainers
Another useful resource linking together Bioconda and containers is the BioContainers project. BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.
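For example, a Salmon image provided by BioContainers can be pulled straight from the Quay registry (the exact tag should be looked up on the registry; the one below is only a placeholder):
docker pull quay.io/biocontainers/salmon:<tag>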
5. Channels
Channels are a key data structure of Nextflow that allows the implementation of reactive-functional oriented computational workflows based on the Dataflow programming paradigm.
They are used to logically connect tasks to each other or to implement functional style data transformations.

5.1. Channel types
Nextflow distinguishes two different kinds of channels: queue channels and value channels.
5.1.1. Queue channel
A queue channel is an asynchronous unidirectional FIFO queue which connects two processes or operators.
- asynchronous means that operations are non-blocking.
- unidirectional means that data flows from a producer to a consumer.
- FIFO means that the data is guaranteed to be delivered in the same order as it is produced (First In, First Out).
A queue channel is implicitly created by process output definitions or using channel factories such as Channel.from or Channel.fromPath.
Try the following snippets:
ch = Channel.from(1,2,3)
println(ch) (1)
ch.view() (2)
1 | Use the built-in print line function println to print the ch channel. |
2 | Apply the view method to the ch channel, which prints each item emitted by the channel. |
Exercise
Try to execute this snippet; it will produce an error message.
ch = Channel.from(1,2,3)
ch.view()
ch.view()
A queue channel can have exactly one producer and exactly one consumer.
5.1.2. Value channels
A value channel (a.k.a. singleton channel) by definition is bound to a single value and it can be read unlimited times without consuming its contents.
ch = Channel.value('Hello')
ch.view()
ch.view()
ch.view()
It prints:
Hello
Hello
Hello
5.2. Channel factories
These are Nextflow commands for creating channels that have implicit expected inputs and functions.
5.2.1. value
The value factory method is used to create a value channel. An optional not-null argument can be specified to bind the channel to a specific value. For example:
ch1 = Channel.value() (1)
ch2 = Channel.value( 'Hello there' ) (2)
ch3 = Channel.value( [1,2,3,4,5] ) (3)
1 | Creates an empty value channel. |
2 | Creates a value channel and binds a string to it. |
3 | Creates a value channel and binds a list object to it that will be emitted as a sole emission. |
5.2.2. from
The factory Channel.from allows the creation of a queue channel with the values specified as arguments.
ch = Channel.from( 1, 3, 5, 7 )
ch.view{ "value: $it" }
The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as parameters in the from method. Thus the second line will print the following:
value: 1
value: 3
value: 5
value: 7
The method Channel.from will be deprecated and replaced by Channel.of (see below).
5.2.3. of
The method Channel.of works in a similar manner to Channel.from, though it fixes some inconsistent behavior of the latter and provides better handling for ranges of values.
For example:
Channel
.of(1..23, 'X', 'Y')
.view()
5.2.4. fromList
The method Channel.fromList creates a channel emitting the elements provided by a list object specified as an argument:
list = ['hello', 'world']
Channel
.fromList(list)
.view()
5.2.5. fromPath
The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.
Channel.fromPath( './data/meta/*.csv' )
This example creates a channel and emits as many items as there are files with the csv extension in the data/meta folder. Each element is a file object implementing the Path interface.
Two asterisks, i.e. **, work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.
Name | Description |
---|---|
glob | When true, interprets characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal characters (default: true) |
type | Type of paths returned, either file, dir or any (default: file) |
hidden | When true, includes hidden files in the resulting paths (default: false) |
maxDepth | Maximum number of directory levels to visit (default: no limit) |
followLinks | When true, symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true) |
relative | When true, returned paths are relative to the top-most common directory (default: false) |
checkIfExists | When true, throws an exception if the specified path does not exist in the file system (default: false) |
Learn more about the glob patterns syntax at this link.
Exercise
Use the Channel.fromPath method to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory, including hidden files. Then print the file names.
Click here for the answer:
Channel.fromPath( './data/ggal/**.fq' , hidden:true)
.view()
5.2.6. fromFilePairs
The fromFilePairs
method creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).
1
2
3
Channel
.fromFilePairs('./data/ggal/*_{1,2}.fq')
.view()
It will produce an output similar to the following:
[liver, [/user/nf-training/data/ggal/liver_1.fq, /user/nf-training/data/ggal/liver_2.fq]]
[gut, [/user/nf-training/data/ggal/gut_1.fq, /user/nf-training/data/ggal/gut_2.fq]]
[lung, [/user/nf-training/data/ggal/lung_1.fq, /user/nf-training/data/ggal/lung_2.fq]]
The glob pattern must contain at least one star wildcard character.
Name | Description |
---|---|
type | Type of paths returned, either file, dir or any (default: file) |
hidden | When true, includes hidden files in the resulting paths (default: false) |
maxDepth | Maximum number of directory levels to visit (default: no limit) |
followLinks | When true, symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true) |
size | Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any number of files |
flat | When true, the matching files are produced as sole elements in the emitted tuples (default: false) |
checkIfExists | When true, throws an exception if the specified path does not exist in the file system (default: false) |
Exercise
Use the fromFilePairs method to create a channel emitting all pairs of fastq reads in the data/ggal/ directory and print them. Then use the flat:true option and compare the output with the previous execution.
Click here for the answer:
Use the following, with or without 'flat:true':
Channel.fromFilePairs( './data/ggal/*_{1,2}.fq', flat:true)
.view()
Then check the square brackets around the file names to see the difference with flat.
5.2.7. fromSRA
The Channel.fromSRA method makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.
The query can be project ID or accession number(s) supported by the NCBI ESearch API.
This function now requires an API key that you can only get by logging into your NCBI account.
For help with NCBI login and key acquisition, click here:
- Go to: www.ncbi.nlm.nih.gov/
- Click the top-right button to "Sign into NCBI" and follow their instructions.
- Once in your account, click the button at the top right (usually your ID), to the left of My NCBI.
- Scroll down to the API key section and copy your key.
You also need to use the latest edge version of Nextflow. Check nextflow -version; it should say -edge. If not, download the newest Nextflow version, following the instructions linked here.
For example the following snippet will print the contents of an NCBI project ID:
params.ncbi_api_key = '<Your API key here>'
Channel
.fromSRA(['SRP073307'], apiKey: params.ncbi_api_key)
.view()
Replace <Your API key here> with your API key.
This should print:
[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]]
[SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]]
[SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]]
[SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]]
(remaining omitted)
Multiple accession IDs can be specified using a list object:
1
2
3
4
ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
.fromSRA(ids, apiKey: params.ncbi_api_key)
.view()
1
2
3
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]]
[ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]]
[ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Read pairs are implicitly managed and are returned as a list of files. |
It’s straightforward to use this channel as an input using the usual Nextflow syntax. The code below creates a channel containing 2 samples from a public SRA study and runs FASTQC on the resulting files. See:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
params.ncbi_api_key = '<Your API key here>'
params.accession = ['ERR908507', 'ERR908506']
reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)
process fastqc {
input:
tuple sample_id, file(reads_file) from reads
output:
file("fastqc_${sample_id}_logs") into fastqc_ch
script:
"""
mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
"""
}
5.2.8. Text files
The splitText
operator allows you to split multi-line strings or text file items emitted by a source channel into chunks of n lines, which are then emitted by the resulting channel. See:
Channel
    .fromPath('data/meta/random.txt') (1)
    .splitText() (2)
    .view() (3)
1 | Instructs Nextflow to make a channel from the path "data/meta/random.txt". |
2 | The splitText operator splits each item into chunks of one line by default. |
3 | View contents of the channel. |
You can define the number of lines in each chunk by using the parameter by
, as shown in the following example:
Channel
    .fromPath('data/meta/random.txt')
    .splitText( by: 2 )
    .subscribe { print it; print "--- end of the chunk ---\n" }
The subscribe operator allows you to execute a user-defined function each time a new value is emitted by the source channel.
|
An optional closure can be specified to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and convert them to uppercase:
Channel
    .fromPath('data/meta/random.txt')
    .splitText( by: 10 ) { it.toUpperCase() }
    .view()
You can also keep a count of the emitted lines:
count = 0
Channel
    .fromPath('data/meta/random.txt')
    .splitText()
    .view { "${count++}: ${it.toUpperCase().trim()}" }
Finally, you can also use the operator on plain files (outside of the channel context), like so:
def f = file('data/meta/random.txt')
def lines = f.splitText()
def count = 0
for( String row : lines ) {
    log.info "${count++} ${row.toUpperCase()}"
}
5.2.9. Comma-separated values (.csv)
The splitCsv
operator allows you to parse text items emitted by a channel that are formatted using the CSV format.
It then splits them into records, or groups them into a list of records of a specified length.
In the simplest case, just apply the splitCsv
operator to a channel emitting CSV-formatted text files or text entries. The following example shows how to view only the first and fourth columns of each row:
Channel .fromPath("data/meta/patients_1.csv") .splitCsv() // row is a list object .view { row -> "${row[0]},${row[3]}" }
When the CSV begins with a header line defining the column names, you can specify the parameter header: true
which allows you to reference each value by its name, as shown in the following example:
Channel .fromPath("data/meta/patients_1.csv") .splitCsv(header: true) // row is a list object .view { row -> "${row.patient_id},${row.num_samples}" }
Alternatively, you can provide custom header names by specifying a list of strings in the header parameter, as shown below:
Channel .fromPath("data/meta/patients_1.csv") .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'] ) // row is a list object .view { row -> "${row.col1},${row.col4}" }
You can also process multiple csv files at the same time:
Channel .fromPath("data/meta/patients_*.csv") // <-- just use a pattern .splitCsv(header:true) .view { row -> "${row.patient_id}\t${row.num_samples}" }
Notice that you can change the output format simply by adding a different delimiter. |
Finally, you can also operate on CSV files outside the channel context, like so:
def f = file('data/meta/patients_1.csv')
def lines = f.splitCsv()
for( List row : lines ) {
    log.info "${row[0]} -- ${row[2]}"
}
Exercise
Try inputting fastq reads to the RNA-Seq workflow from earlier using .splitCsv
.
Click here for the answer:
Add a CSV text file named "fastq.csv" containing the following, as example input:
1
gut,/workspace/nf-training-public/nf-training/data/ggal/gut_1.fq,/workspace/nf-training-public/nf-training/data/ggal/gut_2.fq
Then replace the input channel for the reads in script7.nf
, changing the following lines:
1
2
3
Channel
.fromFilePairs( params.reads, checkIfExists: true )
.into { read_pairs_ch; read_pairs2_ch }
To a splitCsv channel factory input:
1
2
3
4
5
Channel
.fromPath("fastq.csv")
.splitCsv()
.view () { row -> "${row[0]},${row[1]},${row[2]}" }
.into { read_pairs_ch; read_pairs2_ch }
Finally, change the cardinality of the processes that use the input data. For example, for the quantification process, change it from:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process quantification {
tag "$sample_id"
input:
path salmon_index from index_ch
tuple val(sample_id), path(reads) from read_pairs_ch
output:
path sample_id into quant_ch
script:
"""
salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
"""
}
To:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process quantification {
tag "$sample_id"
input:
path salmon_index from index_ch
tuple val(sample_id), path(reads1), path(reads2) from read_pairs_ch
output:
path sample_id into quant_ch
script:
"""
salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads1} -2 ${reads2} -o $sample_id
"""
}
Repeat for the fastqc step. Now the workflow should run from a CSV file.
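For reference, the fastqc step would change in a similar way. The sketch below assumes the fastqc process in script7.nf follows the same pattern as the RNA-Seq example from earlier (channel names such as read_pairs2_ch may differ in your copy):
process fastqc {
    tag "FASTQC on $sample_id"
    input:
    tuple val(sample_id), path(reads1), path(reads2) from read_pairs2_ch
    output:
    path "fastqc_${sample_id}_logs" into fastqc_ch
    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads1} ${reads2}
    """
}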
5.2.10. Tab separated values (.tsv)
Parsing TSV files works in a similar way; simply add the sep: '\t'
option in the splitCsv
context:
Channel .fromPath("data/meta/regions.tsv", checkIfExists:true) // use `sep` option to parse TAB separated files .splitCsv(sep:'\t') // row is a list object .view()
Exercise
Try using the tab separation technique on the file "data/meta/regions.tsv", but print just the first column, and remove the header.
Answer:
Channel .fromPath("data/meta/regions.tsv", checkIfExists:true) // use `sep` option to parse TAB separated files .splitCsv(sep:'\t', header:true ) // row is a list object .view { row -> "${row.patient_id}" }
5.3. More complex file formats
5.3.1. JSON
We can also easily parse the JSON file format using the following Groovy snippet:
import groovy.json.JsonSlurper

def f = file('data/meta/regions.json')
def records = new JsonSlurper().parse(f)
for( def entry : records ) {
    log.info "$entry.patient_id -- $entry.feature"
}
When using an older JSON version, you may need to replace parse(f) with parseText(f.text)
|
5.3.2. YAML
In a similar way, this is a way to parse YAML files:
import org.yaml.snakeyaml.Yaml

def f = file('data/meta/regions.json')
def records = new Yaml().load(f)
for( def entry : records ) {
    log.info "$entry.patient_id -- $entry.feature"
}
5.3.3. Storage of parsers into modules
The best way to store parser scripts is to keep them in a Nextflow module file.
This follows the DSL2 way of working.
See the following nextflow script:
nextflow.preview.dsl=2

include { parseJsonFile } from './modules/parsers.nf'

process foo {
    input:
    tuple val(meta), path(data_file)

    """
    echo your_command $meta.region_id $data_file
    """
}

workflow {
    Channel.fromPath('data/meta/regions*.json') \
        | flatMap { parseJsonFile(it) } \
        | map { entry -> tuple(entry, "/some/data/${entry.patient_id}.txt") } \
        | foo
}
To get this script to work, first we need to create a file called parsers.nf
and store it in the modules folder in the current directory.
This file should contain the parseJsonFile
function; Nextflow will then use it as a custom function within the workflow scope.
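As a minimal sketch based on the JSON parsing snippet shown above (the exact contents of the training repository may differ), modules/parsers.nf could look like this:
// modules/parsers.nf
import groovy.json.JsonSlurper

def parseJsonFile(json_file) {
    // parse the given JSON file and return the list of records it contains
    def records = new JsonSlurper().parse(json_file)
    return records
}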
6. Processes
In Nextflow, a process
is the basic computing primitive to execute foreign functions (i.e. custom scripts or tools).
The process
definition starts with the keyword process
, followed by the process name and finally the process body delimited by curly brackets.
A basic process
, only using the script
definition block, looks like the following:
1
2
3
4
5
6
process sayHello {
script:
"""
echo 'Hello world!'
"""
}
In more complex examples, the process body can contain up to five definition blocks:
-
Directives are initial declarations that define optional settings.
-
Input defines the expected input file/s and the channel from where to find them.
-
Output defines the expected output file/s and the channel to send the data to.
-
When is an optional clause statement that allows conditional process execution.
-
Script is a string statement that defines the command to be executed by the process.
The full process syntax is defined as follows:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
process < name > {
[ directives ] (1)
input: (2)
< process inputs >
output: (3)
< process outputs >
when: (4)
< condition >
[script|shell|exec]: (5)
"""
< user script to be executed >
"""
}
1 | Zero, one or more process directives. |
2 | Zero, one or more process inputs. |
3 | Zero, one or more process outputs. |
4 | An optional boolean conditional to trigger the process execution. |
5 | The command to be executed. |
6.1. Script
The script
block is a string statement that defines the command to be executed by the process.
A process contains one and only one script
block, and it must be the last statement when the process contains input and output declarations.
The script
block can be a single or multi-line string. The latter simplifies the writing of non-trivial scripts
composed of multiple commands spanning multiple lines. For example:
1
2
3
4
5
6
7
8
process example {
script:
"""
echo 'Hello world!\nHola mundo!\nCiao mondo!\nHallo Welt!' > file
cat file | head -n 1 | head -c 5 > chunk_1.txt
gzip -c chunk_1.txt > chunk_archive.gz
"""
}
By default the process
command is interpreted as a Bash script. However any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. For example:
1
2
3
4
5
6
7
8
9
10
process pyStuff {
script:
"""
#!/usr/bin/env python
x = 'Hello'
y = 'world!'
print ("%s - %s" % (x,y))
"""
}
Multiple programming languages can be used within the same workflow script. However, for large chunks of code, it is suggested to save them into separate files and invoke them from the process script. The specific scripts can be stored in the ./bin folder. |
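For example, assuming a hypothetical executable script named my_helper.py saved in the pipeline's bin/ folder (and marked executable), it can be invoked by name from any process script, because Nextflow automatically adds bin/ to the PATH:
process runHelper {
    script:
    """
    my_helper.py --input data.txt
    """
}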
6.1.1. Script parameters
Script parameters (params
) can be defined dynamically using variable values, like this:
1
2
3
4
5
6
7
8
params.data = 'World'
process foo {
script:
"""
echo Hello $params.data
"""
}
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information. |
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using \ character.
|
1
2
3
4
5
6
process foo {
script:
"""
echo "The current directory is \$PWD"
"""
}
Try to modify the above script using $PWD instead of \$PWD and check the difference.
|
This can be tricky when the script uses many Bash variables. A possible alternative is to use a script string delimited by single-quote characters:
1
2
3
4
5
6
process bar {
script:
'''
echo $PATH | tr : '\\n'
'''
}
However, this blocks the usage of Nextflow variables in the command script.
Another alternative is to use a shell
statement instead of script
which uses a different
syntax for Nextflow variables: !{..}
. This allows the use of both Nextflow and Bash variables in the same script.
1
2
3
4
5
6
7
8
9
params.data = 'le monde'
process baz {
shell:
'''
X='Bonjour'
echo $X !{params.data}
'''
}
6.1.2. Conditional script
The process script can also be defined in a completely dynamic manner using an if
statement or any other expression evaluating to a string value. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
params.compress = 'gzip'
params.file2compress = "$baseDir/data/ggal/transcriptome.fa"
process foo {
input:
path file from params.file2compress
script:
if( params.compress == 'gzip' )
"""
gzip -c $file > ${file}.gz
"""
else if( params.compress == 'bzip2' )
"""
bzip2 -c $file > ${file}.bz2
"""
else
throw new IllegalArgumentException("Unknown aligner $params.compress")
}
6.2. Inputs
Nextflow processes are isolated from each other but can communicate with each other by sending values through channels.
Inputs implicitly determine the dependency and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:

The input
block defines which channels the process
is expecting to receive data from. You can only define one input
block at a time, and it must contain one or more input declarations.
The input
block follows the syntax shown below:
input:
<input qualifier> <input name> from <source channel>
6.2.1. Input values
The val
qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:
1
2
3
4
5
6
7
8
9
10
num = Channel.from( 1, 2, 3 )
process basicExample {
input:
val x from num
"""
echo process job $x
"""
}
In the above example the process is executed three times, each time a value is received from the channel num and used to process the script. Thus, it results in an output similar to the one shown below:
process job 3
process job 1
process job 2
The channel guarantees that items are delivered in the same order as they have been sent - but - since the process is executed in a parallel manner, there is no guarantee that they are processed in the same order as they are received. |
6.2.2. Input files
The file
qualifier allows the handling of file values in the process execution context. This means that
Nextflow will stage it in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration.
1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )
process foo {
input:
file 'sample.fastq' from reads
script:
"""
echo your_command --reads sample.fastq
"""
}
The input file name can also be defined using a variable reference as shown below:
1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )
process foo {
input:
file sample from reads
script:
"""
echo your_command --reads $sample
"""
}
The same syntax can also handle more than one input file in the same execution, requiring only a change in the channel composition.
1
2
3
4
5
6
7
8
9
10
reads = Channel.fromPath( 'data/ggal/*.fq' )
process foo {
input:
file sample from reads.collect()
script:
"""
echo your_command --reads $sample
"""
}
When a process declares an input file , the corresponding channel elements
must be file objects, i.e. created with the file helper function or produced by the file-specific
channel factories, e.g. Channel.fromPath or Channel.fromFilePairs .
|
Consider the following snippet:
1
2
3
4
5
6
7
8
9
10
params.genome = 'data/ggal/transcriptome.fa'
process foo {
input:
file genome from params.genome
script:
"""
echo your_command --reads $genome
"""
}
The above code creates a temporary file named input.1
with the string data/ggal/transcriptome.fa
as content. That likely is not what you wanted to do.
6.2.3. Input path
As of version 19.10.0, Nextflow introduced a new path
input qualifier that simplifies
the handling of cases such as the one shown above. In a nutshell, the input path
automatically
handles string values as file objects. The following example works as expected:
1
2
3
4
5
6
7
8
9
10
params.genome = "$baseDir/data/ggal/transcriptome.fa"
process foo {
input:
path genome from params.genome
script:
"""
echo your_command --reads $genome
"""
}
The path qualifier should be preferred over file to handle process input files when using Nextflow 19.10.0 or later. |
Exercise
Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq
followed by a process that concatenates them into a single file and prints the first 20 lines.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
params.reads = "$baseDir/data/ggal/*_1.fq"
Channel
.fromPath( params.reads )
.set { read_ch }
process concat {
tag "Concat all files"
input:
path '*' from read_ch.collect()
output:
path 'top_20_lines' into concat_ch
script:
"""
cat * > concatenated.txt
head -n 20 concatenated.txt > top_20_lines
"""
}
concat_ch.view()
6.2.4. Combine input channels
A key feature of processes is the ability to handle inputs from multiple channels. However, it is important to understand how the channel contents and their semantics affect the execution of a process.
Consider the following example:
1
2
3
4
5
6
7
8
9
10
process foo {
echo true
input:
val x from Channel.from(1,2,3)
val y from Channel.from('a','b','c')
script:
"""
echo $x and $y
"""
}
Both channels emit three values, therefore the process is executed three times, each time with a different pair:
-
(1, a)
-
(2, b)
-
(3, c)
What is happening is that the process waits until there’s a complete input configuration i.e. it receives an input value from all the channels declared as input.
When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, and then repeats the same logic until one or more channels have no more content.
This means channel values are consumed serially, one after another, and the first empty channel causes the process execution to stop, even if there are other values in other channels.
So what happens when channels do not have the same cardinality (i.e. they emit a different number of elements)?
For example:
1
2
3
4
5
6
7
8
9
10
process foo {
echo true
input:
val x from Channel.from(1,2)
val y from Channel.from('a','b','c')
script:
"""
echo $x and $y
"""
}
In the above example the process is executed only two times, because when a channel has no more data to be processed it stops the process execution.
However, what happens if you provide the input x as a value
channel?
Compare the previous example with the following one :
1
2
3
4
5
6
7
8
9
10
process bar {
echo true
input:
val x from Channel.value(1)
val y from Channel.from('a','b','c')
script:
"""
echo $x and $y
"""
}
The output should look something like:
1
2
3
1 and b
1 and a
1 and c
This is because value channels can be consumed multiple times, so they do not affect process termination.
Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq
and
use the same data/ggal/transcriptome.fa
in each execution.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
Channel
.fromPath( params.reads )
.set { read_ch }
process command {
tag "Run_command"
input:
path reads from read_ch
path transcriptome from params.transcriptome_file
output:
path result into concat_ch
script:
"""
echo your_command $reads $transcriptome > result
"""
}
concat_ch.view()
6.2.5. Input repeaters
The each
qualifier allows you to repeat the execution of a process for each item in a collection, every time new data is received. For example:
1
2
3
4
5
6
7
8
9
10
11
12
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'expresso', 'psicoffee']
process alignSequences {
input:
path seq from sequences
each mode from methods
"""
echo t_coffee -in $seq -mode $mode
"""
}
In the above example every time a file of sequences is received as input by the process, it executes three tasks, each running a different alignment method, set as a mode
variable. This is useful when you need to repeat the same task for a given set of parameters.
Exercise
Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq
and repeat the same task both with salmon
and kallisto
.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
methods= ['salmon', 'kallisto']
Channel
.fromPath( params.reads )
.set { read_ch }
process command {
tag "Run_command"
input:
path reads from read_ch
path transcriptome from params.transcriptome_file
each mode from methods
output:
path result into concat_ch
script:
"""
echo $mode $reads $transcriptome > result
"""
}
concat_ch
.view { "To run : ${it.text}" }
6.3. Outputs
The output declaration block defines the channels used by the process to send out the results produced.
Only one output block can be defined containing one or more output declarations. The output block follows the syntax shown below:
output:
<output qualifier> <output name> into <target channel>[,channel,..]
6.3.1. Output values
The val
qualifier specifies a defined value output in the script context. In a common usage scenario,
this is a value, which has been defined in the input declaration block, as shown in the following example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
methods = ['prot','dna', 'rna']
process foo {
input:
val x from methods
output:
val x into receiver
"""
echo $x > file
"""
}
receiver.view { "Received: $it" }
6.3.2. Output files
The file
qualifier specifies one or more files as output, produced by the process, into the specified channel.
1
2
3
4
5
6
7
8
9
10
11
process randomNum {
output:
file 'result.txt' into numbers
'''
echo $RANDOM > result.txt
'''
}
numbers.view { "Received: " + it.text }
In the above example the process randomNum
creates a file named result.txt
containing a random number.
Since a file parameter using the same name is declared in the output block, when the task is completed that
file is sent over the numbers
channel. A downstream process
declaring the same channel as input will
be able to receive it.
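For example, a downstream process could consume the numbers channel in place of the view call above (a sketch, not part of the original example; remember that in DSL1 a queue channel can only be consumed once):
process printNum {
    input:
    file num from numbers
    script:
    """
    cat $num
    """
}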
6.3.3. Multiple output files
When an output file name contains a *
or ?
wildcard character it is interpreted as a
glob path matcher.
This allows us to capture multiple files into a list object and output them as a sole emission. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
process splitLetters {
output:
file 'chunk_*' into letters
'''
printf 'Hola' | split -b 1 - chunk_
'''
}
letters
.flatMap()
.view { "File: ${it.name} => ${it.text}" }
it prints:
File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a
Some caveats on glob pattern behavior:
-
Input files are not included in the list of possible matches.
-
Glob pattern matches against both files and directory paths.
-
When a two stars pattern
**
is used to recurse through directories, only file paths are matched, i.e. directories are not included in the result list.
Exercise
Remove the flatMap
operator and see how the output changes. The documentation
for the flatMap
operator is available at this link.
Click here for the answer:
1
File: [chunk_aa, chunk_ab, chunk_ac, chunk_ad] => [H, o, l, a]
6.3.4. Dynamic output file names
When an output file name needs to be expressed dynamically, it is possible to define it using a dynamic evaluated string, which references values defined in the input declaration block or in the script global context. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
species = ['cat','dog', 'sloth']
sequences = ['AGATAG','ATGCTCT', 'ATCCCAA']
process align {
input:
val x from species
val seq from sequences
output:
file "${x}.aln" into genomes
"""
echo align -in $seq > ${x}.aln
"""
}
genomes.view()
In the above example, each time the process is executed an alignment file is produced whose name depends
on the actual value of the x
input.
6.3.5. Composite inputs and outputs
So far we have seen how to declare multiple input and output channels, but each channel was handling only one value at a time. However, Nextflow can also handle a tuple of values.
When using a channel emitting a tuple of values, the corresponding input declaration must be declared with a tuple
qualifier followed by definition of each element in the tuple.
In the same manner, output channels emitting a tuple of values can be declared using the tuple
qualifier
followed by the definition of each tuple element.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
process foo {
input:
tuple val(sample_id), file(sample_files) from reads_ch
output:
tuple val(sample_id), file('sample.bam') into bam_ch
script:
"""
echo your_command_here --reads $sample_id > sample.bam
"""
}
bam_ch.view()
In previous versions of Nextflow tuple was called set, but it was used with exactly
the same semantics. It can still be used for backward compatibility.
|
Exercise
Modify the script of the previous exercise so that the bam file is named as the given sample_id
.
Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
process foo {
input:
tuple val(sample_id), file(sample_files) from reads_ch
output:
tuple val(sample_id), file("${sample_id}.bam") into bam_ch
script:
"""
echo your_command_here --reads $sample_id > ${sample_id}.bam
"""
}
bam_ch.view()
6.4. When
The when
declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.
It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)
process find {
input:
file fasta from proteins
val type from params.dbtype
when:
fasta.name =~ /^BB11.*/ && type == 'nr'
script:
"""
echo blastp -query $fasta -db nr
"""
}
6.5. Directives
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.
They must be entered at the top of the process body, before any other declaration blocks (i.e. input
, output
, etc.).
Directives are commonly used to define the amount of computing resources to be used, or other meta directives that provide extra information for configuration or logging purposes. For example:
1
2
3
4
5
6
7
8
9
10
process foo {
cpus 2
memory 1.GB
container 'image/name'
script:
"""
echo your_command --this --that
"""
}
The complete list of directives is available at this link.
Name | Description |
---|---|
cpus | Allows you to define the number of (logical) CPUs required by the process' task. |
time | Allows you to define how long a process is allowed to run. e.g. time '1h': 1 hour, '1s' 1 second, '1m' 1 minute, '1d' 1 day. |
memory | Allows you to define how much memory the process is allowed to use. e.g. '2 GB' is 2 GB. Can use B, KB, MB, GB and TB. |
disk | Allows you to define how much local disk storage the process is allowed to use. |
tag | Allows you to associate each process execution with a custom label, so that it will be easier to identify them in the log file or in the trace execution report. |
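As a sketch, several of the directives listed above can be combined in a single process (the values here are arbitrary examples):
process bigTask {
    cpus 4
    memory '2 GB'
    time '1h'
    disk '10 GB'
    tag "my-sample"
    script:
    """
    echo your_command --here
    """
}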
6.6. Organise outputs
6.6.1. PublishDir directive
Given each process is being executed in separate temporary work/ folders (e.g. work/f1/850698…; work/g3/239712…; etc.), we may want to save important, non-intermediary, final files into a results folder.
Remember to delete the work folder from time to time, else all your intermediate files will fill up your computer! |
To store our workflow result files, we need to explicitly mark them using the publishDir directive in the process that creates those files. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
params.outdir = 'my-results'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)
process blastSeq {
publishDir "$params.outdir/bam_files", mode: 'copy'
input:
file fasta from proteins
output:
file ('*.txt') into blast_ch
"""
echo blastp $fasta > ${fasta}_result.txt
"""
}
blast_ch.view()
The above example will copy all blast result files created by the blastSeq
task into the
directory path my-results/bam_files
.
The publish directory can be local or remote. For example, output files
could be stored in an AWS S3 bucket simply by using the s3:// prefix in the target path.
|
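For example, with a hypothetical bucket name (and assuming your AWS credentials are configured):
publishDir "s3://my-bucket/results", mode: 'copy'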
6.6.2. Manage semantic sub-directories
You can use more than one publishDir
directive to keep different outputs in separate directories. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'
Channel
.fromFilePairs(params.reads, flat: true)
.set{ samples_ch }
process foo {
publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'
input:
tuple val(sampleId), file('sample1.fq.gz'), file('sample2.fq.gz') from samples_ch
output:
file "*"
script:
"""
< sample1.fq.gz zcat > sample1.fq
< sample2.fq.gz zcat > sample2.fq
awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt
head -n 50 sample1.fq > sample1_outlook.txt
head -n 50 sample2.fq > sample2_outlook.txt
"""
}
The above example will create an output structure in the directory my-results
,
which contains a separate sub-directory for each given sample ID, each of which
contains the folders counts
and outlooks
.
7. Operators
Operators are methods that allow you to connect channels to each other, or to transform the values emitted by a channel by applying user-provided rules.
There are seven main groups, detailed in the documentation, covering:
-
Filtering operators
-
Transforming operators
-
Splitting operators
-
Combining operators
-
Forking operators
-
Maths operators
-
Other operators
7.1. Basic example
1
2
3
nums = Channel.from(1,2,3,4) (1)
square = nums.map { it -> it * it } (2)
square.view() (3)

1 | Create a queue channel emitting four values. |
2 | Create a new channel, transforming each number to its square. |
3 | Print the channel content. |
Operators can also be chained to implement custom behaviors, so the previous snippet can be written as:
1
2
3
Channel.from(1,2,3,4)
.map { it -> it * it }
.view()
7.2. Basic operators
Here we explore some of the most commonly used operators.
7.2.1. view
The view
operator prints the items emitted by a channel to the console standard output, appending a
new line character to each item. For example:
1
2
3
Channel
.from('foo', 'bar', 'baz')
.view()
It prints:
foo
bar
baz
An optional closure parameter can be specified to customize how items are printed. For example:
1
2
3
Channel
.from('foo', 'bar', 'baz')
.view { "- $it" }
It prints:
- foo
- bar
- baz
7.2.2. map
The map
operator applies a function of your choosing to every item emitted by a channel, and returns the items obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:
1
2
3
4
Channel
.from( 'hello', 'world' )
.map { it -> it.reverse() }
.view()
A map
can associate each element with a generic tuple containing any data, as needed.
1
2
3
4
Channel
.from( 'hello', 'world' )
.map { word -> [word, word.size()] }
.view { word, len -> "$word contains $len letters" }
Exercise
Use fromPath
to create a channel emitting the fastq files matching the pattern data/ggal/*.fq
,
then chain with a map to return a pair containing the file name and the path itself.
Finally print the resulting channel.
Click here for the answer:
1
2
3
Channel.fromPath('data/ggal/*.fq')
.map { file -> [ file.name, file ] }
.view { name, file -> "> $name : $file" }
7.2.3. into
The into
operator connects a source channel to two or more target channels in such a way the values emitted by the source channel are copied to the target channels. For example:
1
2
3
4
5
6
Channel
.from( 'a', 'b', 'c' )
.into{ foo; bar }
foo.view{ "Foo emits: " + it }
bar.view{ "Bar emits: " + it }
Note the use in this example of curly brackets and the ; as the channel name separator. This is needed because the actual parameter of into is a closure, which defines the target channels to which the source one is connected.
|
7.2.4. mix
The mix
operator combines the items emitted by two (or more) channels into a single channel.
1
2
3
4
5
c1 = Channel.from( 1,2,3 )
c2 = Channel.from( 'a','b' )
c3 = Channel.from( 'z' )
c1 .mix(c2,c3).view()
1
2
a
3
b
z
The items in the resulting channel have the same order as in their respective original channels;
however, there is no guarantee that the elements of the second channel are appended after the elements
of the first. Indeed, in the above example the element a has been printed before 3 .
|
7.2.5. flatten
The flatten
operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.
1
2
3
4
5
6
7
foo = [1,2,3]
bar = [4,5,6]
Channel
.from(foo, bar)
.flatten()
.view()
The above snippet prints:
1
2
3
4
5
6
7.2.6. collect
The collect
operator collects all the items emitted by a channel to a list and returns the resulting object as a sole emission.
1
2
3
4
Channel
.from( 1, 2, 3, 4 )
.collect()
.view()
It prints a single value:
[1,2,3,4]
The result of the collect operator is a value channel.
|
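As a sketch of why this matters, a collected (value) channel can be reused across all executions of a process, while a queue channel drives one execution per item (the process and variable names here are illustrative):
all_nums = Channel.from( 1, 2, 3, 4 ).collect()
letters  = Channel.from( 'a', 'b', 'c' )

process useAll {
    echo true
    input:
    val list from all_nums
    val x from letters
    script:
    """
    echo $x with $list
    """
}
The process runs three times, once per letter, and receives the full list [1, 2, 3, 4] on each run.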
7.2.7. groupTuple
The groupTuple
operator collects tuples (or lists) of values emitted by the source channel, grouping together the elements that share the same key. Finally it emits a new tuple object for each distinct key collected.
Try the following example:
1
2
3
4
Channel
.from( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
.groupTuple()
.view()
It shows:
[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]
This operator is useful to process together all elements for which there is a common property or grouping key.
Exercise
Use fromPath
to create a channel emitting all the files in the folder data/meta/
,
then use a map
to associate each file with its baseName
prefix. Finally, group together all
files having the same common prefix.
Click here for the answer:
1
2
3
4
Channel.fromPath('data/meta/*')
.map { file -> tuple(file.baseName, file) }
.groupTuple()
.view { baseName, file -> "> $baseName : $file" }
7.2.8. join
The join
operator creates a channel that joins together the items emitted by two channels for which there exists a matching key. The key is defined, by default, as the first element in each item emitted.
1
2
3
left = Channel.from(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right= Channel.from(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()
The resulting channel emits:
[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]
Notice 'P' is missing in the final result |
7.2.9. branch
The branch
operator allows you to forward the items emitted by a source channel to one or more output channels, choosing one of them at a time.
The selection criteria are defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to true, the current item is bound to the named channel corresponding to that label. For example:
1
2
3
4
5
6
7
8
9
10
Channel
.from(1,2,3,40,50)
.branch {
small: it < 10
large: it > 10
}
.set { result }
result.small.view { "$it is small" }
result.large.view { "$it is large" }
The branch operator returns a multi-channel object i.e. a variable that holds
more than one channel object.
|
7.3. More resources
Check the operators documentation on Nextflow web site.
8. Groovy basic structures and idioms
Nextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.
Here are some important Groovy syntax constructs to learn, which are commonly used in Nextflow.
8.1. Printing values
Printing something is as easy as using one of the print
or println
methods.
1
println("Hello, World!")
The only difference between the two is that the println
method implicitly appends a new line character to the printed string.
Parentheses for function invocations are optional. Therefore, the following is also valid syntax: |
1
println "Hello, World!"
8.2. Comments
Comments use the same syntax as C-family programming languages:
1
2
3
4
5
6
// comment a single line
/*
a comment spanning
multiple lines
*/
8.3. Variables
To define a variable, simply assign a value to it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
x = 1
println x
x = new java.util.Date()
println x
x = -3.1499392
println x
x = false
println x
x = "Hi"
println x
Local variables are defined using the def
keyword:
1
def x = 'foo'
It should always be used when defining variables local to a function or a closure.
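A minimal sketch:
def sayLocal() {
    def message = 'Hello'   // visible only inside this function
    println message
}
sayLocal()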
8.4. Lists
A List object can be defined by placing the list items in square brackets:
1
list = [10,20,30,40]
You can access a given item in the list with square-bracket notation (indexes start at 0
) or using the get method:
1
2
println list[0]
println list.get(0)
In order to get the length of the list use the size method:
1
println list.size()
We use the assert
keyword to test if a condition is true (similar to an if
statement). Here, Groovy will print nothing if it is correct, otherwise it will raise an AssertionError message.
1
assert list[0] == 10
This assertion should be correct, but try changing the value to an incorrect one. |
Lists can also be indexed with negative indexes and reversed ranges.
1
2
3
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()
List objects implement all methods provided by the java.util.List interface, plus the extension methods provided by Groovy API.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]
8.5. Maps
Maps are like lists, except that the key can be of an arbitrary type instead of an integer. Therefore, the syntax is very much aligned.
1
map = [a:0, b:1, c:2]
Maps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.
1
2
3
assert map['a'] == 0 (1)
assert map.b == 1 (2)
assert map.get('c') == 2 (3)
1 | Use of the square brackets. |
2 | Use a dot notation. |
3 | Use of get method. |
To add data to or modify a map, the syntax is similar to adding values to a list:
1
2
3
4
map['a'] = 'x' (1)
map.b = 'y' (2)
map.put('c', 'z') (3)
assert map == [a:'x', b:'y', c:'z']
1 | Use of the square brackets. |
2 | Use a dot notation. |
3 | Use of put method. |
Map objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy API.
8.6. String interpolation
String literals can be defined by enclosing them in either single- or double-quotation marks.
Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:
1
2
3
4
5
6
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"
x = 'Hello'
println '$x + $y'
This code prints:
1
2
The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.
|
Finally string literals can also be defined using the /
character as delimiter. They are known as
slashy strings and are useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted strings, they allow you to interpolate variables prefixed with a $
character.
Try the following to see the difference:
1
2
3
4
5
x = /tic\tac\toe/
y = 'tic\tac\toe'
println x
println y
it prints:
tic\tac\toe
tic ac oe
8.7. Multi-line strings
A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:
1
2
3
4
text = """
Hello there James.
How are you today?
"""
Finally multi-line strings can also be defined with slashy strings. For example:
1
2
3
4
5
text = /
This is a multi-line
slashy string!
It's cool, isn't it?!
/
Like before, multi-line strings inside double quotes and slash characters support variable interpolation, while single-quoted multi-line strings do not. |
8.8. If statement
The if
statement uses the same syntax common to other programming languages, such as Java, C, JavaScript, etc.
1
2
3
4
5
6
if( < boolean expression > ) {
// true branch
}
else {
// false branch
}
The else
branch is optional. Also curly brackets are optional when the branch defines just a single
statement.
1
2
3
x = 1
if( x > 10 )
println 'Hello'
null , empty strings and empty collections are evaluated to false .
|
Therefore a statement like:
1
2
3
4
5
6
7
list = [1,2,3]
if( list != null && list.size() > 0 ) {
println list
}
else {
println 'The list is empty'
}
Can be written as:
1
2
3
4
if( list )
println list
else
println 'The list is empty'
See the Groovy-Truth for details.
In some cases it can be useful to replace the if statement with a ternary expression (aka
conditional expression). For example:
|
1
println list ? list : 'The list is empty'
The previous statement can be further simplified using the Elvis operator as shown below:
1
println list ?: 'The list is empty'
8.9. For statement
The classical for
loop syntax is supported as shown here:
1
2
3
for (int i = 0; i <3; i++) {
println("Hello World $i")
}
Iteration over list objects is also possible using the syntax below:
1
2
3
4
5
list = ['a','b','c']
for( String elem : list ) {
println elem
}
8.10. Functions
It is possible to define a custom function into a script, as shown here:
1
2
3
4
5
int fib(int n) {
return n < 2 ? 1 : fib(n-1) + fib(n-2)
}
assert fib(10)==89
A function can take multiple arguments separating them with a comma. The return
keyword can be omitted and the function implicitly returns the value
of the last evaluated expression. Also, explicit types can be omitted, though not recommended:
1
2
3
4
5
def fact( n ) {
n > 1 ? n * fact(n-1) : 1
}
assert fact(5) == 120
8.11. Closures
Closures are the Swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that can be passed as an argument to a function; it can also be thought of as an anonymous function.
More formally, a closure allows the definition of functions as first class objects.
1
square = { it * it }
The curly brackets around the expression it * it
tell the script interpreter to treat this expression as code. The it
identifier is an implicit variable that represents the value that is passed to the function when it is invoked.
Once compiled, the function object is assigned to the variable square
like any other variable assignment shown previously. To invoke the closure execution, use the special method call
or just use round
parentheses to specify the closure parameter(s). For example:
1
2
assert square.call(5) == 25
assert square(9) == 81
This is not very interesting until we find that we can pass the function square
as an argument to other functions or methods.
Some built-in functions take a function like this as an argument. One example is the collect
method on lists:
1
2
x = [ 1, 2, 3, 4 ].collect(square)
println x
It prints:
[ 1, 4, 9, 16 ]
By default, closures take a single parameter called it
; to give it a different name, use the
->
syntax. For example:
1
square = { num -> num * num }
It’s also possible to define closures with multiple, custom-named parameters.
For example, the method each()
when applied to a map can take a closure with two arguments,
to which it passes the key-value pair for each entry in the map
object. For example:
1
2
3
printMap = { a, b -> println "$a with value $b" }
values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
values.each(printMap)
It prints:
Yue with value Wu
Mark with value Williams
Sudha with value Kumari
A closure has two other important features.
First, it can access and modify variables in the scope where it is defined.
Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.
As an example showing both these features, see the following code fragment:
1
2
3
4
result = 0 (1)
values = ["China": 1 , "India" : 2, "USA" : 3] (2)
values.keySet().each { result += values[it] } (3)
println result
1 | Define a global variable. |
2 | Define a map object. |
3 | Invoke the each method passing the closure object which modifies the result variable. |
Learn more about closures in the Groovy documentation.
8.12. More resources
The complete Groovy language documentation is available at this link.
A great resource to master Apache Groovy syntax is the book: Groovy in Action.
9. DSL2 and modules
9.1. Basic concepts
DSL2 is a syntax extension of DSL1, developed primarily to allow the definition of module libraries and to simplify the writing of complex data analysis pipelines.
To ensure backward compatibility with DSL1 you have to explicitly enable DSL2 by adding the following line at the beginning of each workflow script:
nextflow.enable.dsl=2
DSL2 core features include:
-
Separation of processes from their invocation.
-
Use of a
workflow
directive to execute specific processes. -
Storage and re-use of processes from module files.
-
Update of specific syntax and definitions
To demonstrate the differences to DSL1, we will convert the first script we created earlier (hello.nf
) into DSL2.
9.2. Process
9.2.1. Process definition
DSL2 separates the definition of a process from its invocation. The process definition follows the usual syntax as described in the process documentation. The only difference is that the from
and into
channel declarations now have to be omitted. By removing the preset channels from
and into
, we allow the process to accept input from any channel without needing to hard-code the channel name.
DSL1 syntax
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#!/usr/bin/env nextflow
params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)
process splitLetters {
input:
val x from greeting_ch
output:
file 'chunk_*' into letters_ch
"""
printf '$x' | split -b 6 - chunk_
"""
}
DSL2 syntax
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)
process splitLetters {
input:
val x
output:
file 'chunk_*'
"""
printf '$x' | split -b 6 - chunk_
"""
}
9.2.2. Workflow scope
To run this process in DSL2, we have to invoke it in the workflow
scope, passing the expected input channels as arguments as if it were a custom function. For example:
1
2
3
workflow {
splitLetters(greeting_ch)
}
A process can only be invoked once in the same workflow definition.
|
Next, we can define additional processes within the script and add further calls within the workflow
scope, thereby separating the processes into 'submodules'.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)
process splitLetters {
input:
val x
output:
file 'chunk_*'
"""
printf '$x' | split -b 6 - chunk_
"""
}
process convertToUpper {
input:
file y
output:
stdout
"""
cat $y | tr '[a-z]' '[A-Z]'
"""
}
workflow {
splitLetters(greeting_ch)
convertToUpper(splitLetters.out.flatten())
convertToUpper.out.view{ it }
}
In the example above, we can propagate the output of one process to another using the .out
attribute. In this way, we now have a simpler workflow scope to follow, that creates custom functions from processes defined earlier in the script.
9.2.3. Process outputs
If a process defines two or more output channels, each of them can be accessed by indexing the .out
attribute e.g. .out[0]
, .out[1]
, etc.
The process output
definition also allows the use of the emit
statement to define a named identifier that can be used to reference the channel in the external scope. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.greeting = 'Hello world!'
greeting_ch = Channel.from(params.greeting)
process splitLetters {
input:
val x
output:
file 'chunk_*'
"""
printf '$x' | split -b 6 - chunk_
"""
}
process convertToUpper {
input:
file y
output:
stdout emit: verbiage
"""
cat $y | tr '[a-z]' '[A-Z]'
"""
}
workflow {
splitLetters(greeting_ch)
convertToUpper(splitLetters.out.flatten())
convertToUpper.out.verbiage.view{ it }
}
9.3. Workflow
9.3.1. Workflow definition
The workflow
scope allows the definition of components that define the invocation of one or more processes and operators:
1
2
3
4
5
6
7
8
9
workflow my_pipeline {
splitLetters(greeting_ch)
convertToUpper(splitLetters.out.flatten())
convertToUpper.out.verbiage.view{ it }
}
workflow {
my_pipeline()
}
For example, the snippet above defines a workflow
named my_pipeline
, that can be invoked via another workflow
definition.
9.3.2. Workflow parameters
A workflow component can access any variable and parameter defined in the outer scope. In the running example, we can also access params.greeting
directly within the workflow
definition.
1
2
3
4
5
6
7
8
9
workflow my_pipeline {
splitLetters(Channel.from(params.greeting))
convertToUpper(splitLetters.out.flatten())
convertToUpper.out.verbiage.view{ it }
}
workflow {
my_pipeline()
}
9.3.3. Workflow inputs
A workflow
component can declare one or more input channels using the take
statement. For example:
1
2
3
4
5
6
7
8
9
workflow my_pipeline {
take:
greeting
main:
splitLetters(greeting)
convertToUpper(splitLetters.out.flatten())
convertToUpper.out.verbiage.view{ it }
}
When the take statement is used, the body of the workflow needs to be declared within the main block.
|
The input for the workflow
can then be specified as an argument as shown below:
1
2
3
workflow {
my_pipeline(Channel.from(params.greeting))
}
9.3.4. Workflow outputs
A workflow
can declare one or more output channels using the emit
statement. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
workflow my_pipeline {
take:
greeting
main:
splitLetters(greeting)
convertToUpper(splitLetters.out.flatten())
emit:
convertToUpper.out.verbiage
}
workflow {
my_pipeline(Channel.from(params.greeting))
my_pipeline.out.view()
}
As a result, we can use the my_pipeline.out
notation to access the outputs of my_pipeline
in the invoking workflow
.
We can also declare named outputs within the emit
block.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
workflow my_pipeline {
take:
greeting
main:
splitLetters(greeting)
convertToUpper(splitLetters.out.flatten())
emit:
my_data = convertToUpper.out.verbiage
}
workflow {
my_pipeline(Channel.from(params.greeting))
my_pipeline.out.my_data.view()
}
The result of the above snippet can then be accessed using my_pipeline.out.my_data
.
9.4. Modules
Nextflow DSL2 allows for the definition of stand-alone module scripts that can be included and shared across multiple workflows. Each module can contain its own process
or workflow
definition.
9.4.1. Importing modules
Components defined in the module script can be imported into other Nextflow scripts using the include
statement. This allows you to store these components in separate file(s) so that they can be re-used in multiple workflows.
Using the hello.nf
example, we can achieve this by:
-
Creating a file called
modules.nf
in the top-level folder containing the main Nextflow script. -
Cut and paste the two process definitions for
splitLetters
andconvertToUpper
intomodules.nf
. -
Import the processes from
modules.nf
within the main script anywhere above theworkflow
definition:
1
2
include { splitLetters } from './modules.nf'
include { convertToUpper } from './modules.nf'
In general, you would use relative paths to define the location of the module scripts using the ./ prefix.
|
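For reference, after steps 1 and 2 the ./modules.nf file would simply contain the two process definitions moved out of hello.nf, exactly as they appear in the DSL2 example above (the version shown here uses the named verbiage output):
// ./modules.nf
process splitLetters {
    input:
    val x
    output:
    file 'chunk_*'
    """
    printf '$x' | split -b 6 - chunk_
    """
}

process convertToUpper {
    input:
    file y
    output:
    stdout emit: verbiage
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}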
9.4.2. Multiple imports
If a Nextflow module script contains multiple process
definitions these can also be imported using a single include
statement as shown in the example below:
1
include { splitLetters; convertToUpper } from './modules.nf'
9.4.3. Module aliases
When including a module component it is possible to specify a name alias using the as
declaration. This allows the inclusion and the invocation of the same component multiple times in your script using different names. For example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.greeting = 'Hello world!'
include { splitLetters as splitLetters_one } from './modules.nf'
include { splitLetters as splitLetters_two } from './modules.nf'
include { convertToUpper as convertToUpper_one } from './modules.nf'
include { convertToUpper as convertToUpper_two } from './modules.nf'
workflow my_pipeline_one {
take:
greeting
main:
splitLetters_one(greeting)
convertToUpper_one(splitLetters_one.out.flatten())
emit:
my_data = convertToUpper_one.out.verbiage
}
workflow my_pipeline_two {
take:
greeting
main:
splitLetters_two(greeting)
convertToUpper_two(splitLetters_two.out.flatten())
emit:
my_data = convertToUpper_two.out.verbiage
}
workflow {
my_pipeline_one(Channel.from(params.greeting))
my_pipeline_one.out.my_data.view()
my_pipeline_two(Channel.from(params.greeting))
my_pipeline_two.out.my_data.view()
}
9.4.4. Parameter scopes
A module script can define one or more parameters or custom functions using the same syntax as with any other Nextflow script. Using the minimal examples below:
Module script (./modules.nf
)
1
2
3
4
5
6
params.foo = 'Hello'
params.bar = 'world!'
def sayHello() {
println "$params.foo $params.bar"
}
Main script (./main.nf
)
1
2
3
4
5
6
7
8
9
10
11
12
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.foo = 'Hola'
params.bar = 'mundo!'
include { sayHello } from './modules.nf'
workflow {
sayHello()
}
Running main.nf
should print:
1
Hola mundo!
As highlighted above, the script will print Hola mundo!
instead of Hello world!
because parameters are inherited from the including context.
To avoid being ignored, pipeline parameters should be defined at the beginning of the script before any include declarations.
|
The addParams
option can be used to extend the module parameters without affecting the external scope. For example:
1
2
3
4
5
6
7
8
9
10
11
12
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
params.foo = 'Hola'
params.bar = 'mundo!'
include { sayHello } from './modules.nf' addParams(foo: 'Ciao')
workflow {
sayHello()
}
Executing the main script above should print:
1
Ciao world!
9.5. DSL2 migration notes
To view a summary of the changes introduced when migrating from DSL1 to DSL2 please refer to the DSL2 migration notes in the official Nextflow documentation.
10. Get started with Nextflow Tower
10.1. Basic concepts
Nextflow Tower is the centralized command-post for the management of Nextflow data pipelines. It brings monitoring, logging & observability, to distributed workflows and simplifies the deployment of pipelines on any cloud, cluster or laptop.
Nextflow Tower core features include:
-
the launching of pre-configured pipelines with ease.
-
programmatic integration to meet the needs of organizations.
-
publishing pipelines to shared workspaces.
-
management of infrastructure required to run data analysis at scale.
10.2. Usage
You can use Tower via the -with-tower
option when using the nextflow run command, through the online GUI, or through the API.
10.2.1. Via the Nextflow run
command
Create an account and login into Tower.
1. Create a new token
You can access your tokens from the Settings drop-down menu:

2. Name your token

3. Save your token safely
Copy and keep your new token in a safe place. |

4. Export your token
Once your token has been created, open a terminal and type:
1
2
export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
export NXF_VER=20.10.0
Where eyxxxxxxxxxxxxxxxQ1ZTE=
is the token you have just created.
Check your nextflow -version . Bearer tokens require Nextflow version 20.10.0 or later, set with the second command above. Change as necessary.
|
To submit a pipeline to a Workspace using the Nextflow command line tool, add the workspace ID to your environment. For example:
export TOWER_WORKSPACE_ID=000000000000000
The workspace ID can be found on the organisation’s Workspaces overview page.
5. Run Nextflow with Tower
Run your Nextflow workflows as usual with the addition of the -with-tower option:
nextflow run hello.nf -with-tower
You will see and be able to monitor your Nextflow jobs in Tower.
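As an alternative to exporting TOWER_ACCESS_TOKEN and adding the -with-tower flag on every run, the same settings can be kept in your nextflow.config using the tower scope (a minimal sketch; the token shown is a placeholder):
tower {
    enabled     = true
    accessToken = 'eyxxxxxxxxxxxxxxxQ1ZTE='   // placeholder: use your own token
}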
To configure and execute Nextflow jobs in Cloud environments, visit the Compute environments section.
Exercise
Run the RNA-Seq script7.nf
using the -with-tower
flag, after correctly completing the token settings outlined above.
If you get stuck, click here:
Go to tower.nf, log in, then click the Runs tab and select the run that you just submitted. If you can’t find it, double check that your token was entered correctly.
10.2.2. Via online GUI
To run using the GUI, there are three main steps:
1. Create an account and log in to Tower, available free of charge, at tower.nf.
2. Create and configure a new compute environment.
3. Start launching pipelines.
Configuring your compute environment
Tower uses the concept of Compute Environments to define the execution platform where a pipeline will run.
It supports the launching of pipelines into a growing number of cloud and on-premise infrastructures.

Each compute environment must be pre-configured to enable Tower to submit tasks. The following setup guides describe how to configure each of these compute environments.
Selecting a default compute environment
If you have more than one Compute Environment, you can select which one will be used by default when launching a pipeline.
-
Navigate to your compute environments.
-
Choose your default environment by selecting the Make primary button.
Congratulations!
You are now ready to launch pipelines with your primary compute environment.
Launchpad
Launchpad makes it easy for any workspace user to launch a pre-configured pipeline.

A pipeline is a repository containing a Nextflow workflow, a compute environment and pipeline parameters.
Pipeline Parameters Form
Launchpad automatically detects the presence of a nextflow_schema.json
in the root of the repository and dynamically creates a form where users can easily update the parameters.
The parameter forms view will appear if the workflow has a Nextflow schema file for the parameters. Please refer to the Nextflow Schema guide to learn more about the use-cases and how to create them.
This makes it trivial for users without any expertise in Nextflow to enter their pipeline parameters and launch.

Adding a new pipeline
Adding a pipeline to the pre-saved workspace Launchpad is detailed in full in the Tower docs.
In brief, these are the steps you need to follow to set up a pipeline.
-
Select the Launchpad button in the navigation bar. This will open the Launch Form.
-
Select a compute environment.
-
Enter the repository of the pipeline you want to launch. e.g. github.com/nf-core/rnaseq.git
-
A Revision number can be used to select different versions of the pipeline. The Git default branch (main/master) or manifest.defaultBranch in the Nextflow configuration will be used by default.
-
The Work directory specifies the location of the Nextflow work directory. The location associated with the compute environment will be selected by default.
-
Enter the name(s) of each of the Nextflow Config profiles, followed by the Enter key. See the Nextflow Config profiles documentation for more details.
-
Enter any Pipeline parameters in YAML or JSON format. YAML example:
reads: 's3://nf-bucket/exome-data/ERR013140_{1,2}.fastq.bz2'
paired_end: true
-
Select Launchpad to begin the pipeline execution.
Nextflow pipelines are simply Git repositories and the location can be any public or private Git-hosting platform. See Git Integration in the Tower docs and Pipeline Sharing in the Nextflow docs for more details.
The credentials associated with the compute environment must be able to access the work directory.
In the configuration, the full path to a bucket must be specified, with single quotes around strings and no quotes around booleans or numbers.
To create your own customized Nextflow Schema for your pipeline, see the examples from the increasing number of nf-core workflows that have adopted this, for example eager and rnaseq.
For advanced setting options check out this page.
There is also community support available if you get into trouble, see here.
10.2.3. API
To learn more about using the Tower API, visit the API section in this documentation.
10.3. Workspaces and Organisations
Nextflow Tower simplifies the development and execution of workflows by providing a centralized interface for users and organisations.
Each user has a unique workspace where they can interact with and manage all resources such as workflows, compute environments and credentials. Details of this can be found here.
By default, each user has their own private workspace, while organisations have the ability to run and manage users through role-based access as members and collaborators.
10.3.1. Organization resources
You can create your own organisation and participant workspace by following the docs at tower.
Tower allows creation of multiple organizations, each of which can contain multiple workspaces with shared users and resources. This allows any organization to customize and organize the usage of resources while maintaining an access control layer for users associated with a workspace.
10.3.2. Organization users
Any user can be added or removed from a particular organization or a workspace and can be allocated a specific access role within that workspace.
The Teams feature provides a way for organizations to group various users and participants together into teams, for example workflow-developers or analysts, and to apply access control to all the users within this team as a whole.
For further information, please refer to the User Management section.
Setting up a new organisation
Organizations are the top-level structure and contain Workspaces, Members, Teams and Collaborators.
To create a new Organization:
-
Click on the dropdown next to your name and select New organization to open the creation dialog.
-
On the dialog, fill in the fields as per your organization. The Name and Full name fields are compulsory.
A valid name for the organization must follow a specific pattern. Please refer to the UI for further instructions.
-
The rest of the fields such as Description, Location, Website URL and Logo Url are optional.
-
Once the details are filled-in, you can access the newly created organization using the organizations page, which lists all of your organizations.
It is possible to change the values of the optional fields either using the Edit option on the organizations page or using the Settings tab within the organization page, provided that you are the Owner of the organization. A list of all the included Members, Teams and Collaborators can be found on the organization page.
11. Nextflow configuration
A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by the underlying execution platform.
This enables portable deployment without the need to modify the application code.
11.1. Configuration file
When a pipeline script is launched Nextflow looks for a file named nextflow.config
in the current directory and in the script base directory (if it is not the same as the current directory). Finally it checks for the file: $HOME/.nextflow/config
.
When more than one of the above files exists, they are merged, so that the settings in the first override the same settings that may appear in the second, and so on.
The default config file search mechanism can be extended by providing an extra configuration file using the command line option -c <config file>.
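For example (a usage sketch; the file name extra.config is illustrative):
nextflow run <script> -c extra.config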
11.1.1. Config syntax
A Nextflow configuration file is a simple text file containing a set of properties defined using the syntax:
name = value
Please note, string values need to be wrapped in quotation characters while numbers and boolean values (true , false ) do not. Also, note that values are typed, meaning for example that, 1 is different from '1' , since the first is interpreted as the number one, while the latter is interpreted as a string value.
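For example, a minimal sketch of typed values in a config file (the parameter names are illustrative):
params.label      = 'sequence data'   // string: quoted
params.max_items  = 1                 // number: unquoted, interpreted as the integer one
params.paired_end = true              // boolean: unquoted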
11.1.2. Config variables
Configuration properties can be used as variables in the configuration file itself, by using the usual $propertyName
or ${expression}
syntax.
propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"
In the configuration file it’s possible to access any variable defined in the host environment
such as $PATH , $HOME , $PWD , etc.
11.1.3. Config comments
Configuration files use the same conventions for comments used in the Nextflow script:
// comment a single line
/*
a comment spanning
multiple lines
*/
11.1.4. Config scopes
Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier or grouping the properties in the same scope using the curly brackets notation. This is shown in the following example:
alpha.x = 1
alpha.y = 'string value..'
beta {
p = 2
q = 'another string ..'
}
11.1.5. Config params
The scope params
allows the definition of workflow parameters that override the values defined
in the main workflow script.
This is useful to consolidate one or more execution parameters in a separate file.
// config file
params.foo = 'Bonjour'
params.bar = 'le monde!'
// workflow script
params.foo = 'Hello'
params.bar = 'world!'
// print both params
println "$params.foo $params.bar"
Exercise
Save the first snippet as nextflow.config
and the second one as params.nf
. Then run:
nextflow run params.nf
For the result, click here:
Bonjour le monde!
Execute it again, specifying the foo
parameter on the command line:
nextflow run params.nf --foo Hola
For the result, click here:
Hola le monde!
Compare the results of the two executions: parameters defined in the config file override those defined in the workflow script, and parameters provided on the command line override both.
11.1.6. Config env
The env
scope allows the definition of one or more variables that will be exported into the environment where the workflow tasks will be executed.
env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"
Exercise
Save the above snippet as a file named my-env.config. Then save the snippet below in a file named foo.nf:
process foo {
echo true
'''
env | egrep 'ALPHA|BETA'
'''
}
Finally, execute the following command:
nextflow run foo.nf -c my-env.config
Your result should look something like the following, click here:
BETA=/home/some/path
ALPHA=some value
11.1.7. Config process
process
directives allow the specification of specific settings for the task execution such as cpus
, memory
, container
and other resources in the pipeline script.
This is especially useful when prototyping a small workflow script.
However it’s always a good practice to decouple the workflow execution logic from the process configuration settings, i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow script.
The process
configuration scope allows the setting of any process
directives in the Nextflow configuration file. For example:
process {
cpus = 10
memory = 8.GB
container = 'biocontainers/bamtools:v2.4.0_cv3'
}
The above config snippet defines the cpus
, memory
and container
directives for all processes in your workflow script.
The process selector can be used to apply the configuration to a specific process or group of processes (discussed later).
Memory and time duration units can be specified either using a string based notation, in which the digit(s) and the unit can be separated by a blank, or by using the numeric notation, in which the digit(s) and the unit are separated by a dot character and are not enclosed by quote characters.
String syntax | Numeric syntax | Value |
---|---|---|
'10 KB' | 10.KB | 10240 bytes |
'500 MB' | 500.MB | 524288000 bytes |
'1 min' | 1.min | 60 seconds |
'1 hour 25 sec' | - | 1 hour and 25 seconds |
The syntax for setting process
directives in the
configuration file requires =
(i.e. assignment operator), whereas
it should not be used when setting the process directives within the
workflow script.
For an example, click here:
process foo {
    cpus 4
    memory 2.GB
    time 1.hour
    maxRetries 3

    script:
    """
    your_command --cpus $task.cpus --mem $task.memory
    """
}
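For comparison, a sketch of the same settings expressed in the configuration file, where the assignment operator is required (restricting them to a single process uses the selectors discussed later):
process {
    cpus       = 4
    memory     = 2.GB
    time       = 1.hour
    maxRetries = 3
}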
This is important, especially when you want to define a config setting using a dynamic expression using a closure. For example:
process foo {
memory = { 4.GB * task.cpus }
}
Directives that require more than one value, e.g. pod, need to be expressed as a map object in the configuration file.
process {
pod = [env: 'FOO', value: '123']
}
Finally, directives that can be repeated in the process definition need to be defined as a list object in the configuration file. For example:
process {
pod = [ [env: 'FOO', value: '123'],
[env: 'BAR', value: '456'] ]
}
11.1.8. Config Docker execution
The container image to be used for the process execution can be specified in the nextflow.config
file:
process.container = 'nextflow/rnaseq-nf'
docker.enabled = true
The use of unique "SHA256" docker image IDs guarantees that the image content does not change over time, for example:
process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true
11.1.9. Config Singularity execution
To run a workflow execution with Singularity, a container image file path is required in the Nextflow config file using the container directive:
process.container = '/some/singularity/image.sif'
singularity.enabled = true
The container image file must be an absolute path i.e. it must start with a / .
The following protocols are supported:
-
library:// downloads the container image from the Singularity Library service.
-
shub:// downloads the container image from the Singularity Hub.
-
docker:// downloads the container image from the Docker Hub and converts it to the Singularity format.
-
docker-daemon:// pulls the container image from a local Docker installation and converts it to a Singularity image file.
Singularity Hub (shub://) is no longer available as a builder service, though images built before 19th April 2021 still work.
By specifying a plain Docker container image name, Nextflow implicitly downloads and converts it to a Singularity image when the Singularity execution is enabled. For example:
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true
The above configuration instructs Nextflow to use the Singularity engine to run your script processes. The container is pulled from the Docker registry and cached in the current directory to be used for further runs.
Alternatively if you have a Singularity image file, its absolute path location
can be specified as the container name either using the -with-singularity
option
or the process.container
setting in the config file.
Exercise
Try to run the script as shown below, changing the nextflow.config
file to the one above using singularity
:
nextflow run script7.nf
Nextflow will pull the container image automatically; this will require a few seconds depending on the network connection speed.
11.1.10. Config Conda execution
The use of a Conda environment can also be provided in the configuration file
adding the following setting in the nextflow.config
file:
process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"
You can either specify the path of an existing Conda environment directory or the path of a Conda environment YAML file.
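For example, a hypothetical environment file (the tool versions below are illustrative assumptions) that could be referenced from the config:
# env.yml
name: nf-tutorial
channels:
  - conda-forge
  - bioconda
dependencies:
  - salmon=1.5.2
  - fastqc=0.11.9
  - multiqc=1.11
The configuration would then point to the file, e.g. process.conda = "$baseDir/env.yml".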
12. Deployment scenarios
Real world genomic applications can spawn the execution of thousands of jobs. In this scenario a batch scheduler is commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across many compute nodes.
Nextflow has built-in support for most commonly used batch schedulers, such as Univa Grid Engine, SLURM and IBM LSF. Check the Nextflow documentation for the complete list of supported execution platforms.
12.1. Cluster deployment
A key Nextflow feature is the ability to decouple the workflow implementation from the actual execution platform. The implementation of an abstraction layer allows the deployment of the resulting workflow on any execution platform supported by the framework.
To run your pipeline with a batch scheduler, modify the nextflow.config
file specifying
the target executor and the required computing resources, if needed. For example:
process.executor = 'slurm'
12.2. Managing cluster resources
When using a batch scheduler, it is often necessary to specify the amount of resources (i.e. cpus, memory, execution time, etc.) required by each task.
This can be done using the following process directives:
Directive | Description |
---|---|
queue | the cluster queue to be used for the computation |
cpus | the number of cpus to be allocated for a task execution |
memory | the amount of memory to be allocated for a task execution |
time | the max amount of time to be allocated for a task execution |
disk | the amount of disk storage required for a task execution |
12.2.1. Workflow wide resources
Use the scope process
to define the resource requirements for all processes in
your workflow applications. For example:
process {
executor = 'slurm'
queue = 'short'
memory = '10 GB'
time = '30 min'
cpus = 4
}
12.2.2. Configure process by name
In real-world applications, different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the selector withName: followed by the process name:
process {
executor = 'slurm'
queue = 'short'
memory = '10 GB'
time = '30 min'
cpus = 4
withName: foo {
cpus = 2
memory = '20 GB'
queue = 'short'
}
withName: bar {
cpus = 4
memory = '32 GB'
queue = 'long'
}
}
Exercise
Run the RNA-Seq script (script7.nf) from earlier, but specify in the nextflow.config file that the quantification process requires 2 cpus and 5 GB of memory.
Click here for the answer:
process {
withName: quantification {
cpus = 2
memory = '5 GB'
}
}
12.2.3. Configure process by labels
When a workflow application is composed of many processes, listing all the process names in the configuration file to choose resources for each of them can become difficult.
A better strategy is to annotate the processes with a label directive, then specify the resources in the configuration file for all processes having the same label.
The workflow script:
process task1 {
label 'long'
"""
first_command --here
"""
}
process task2 {
label 'short'
"""
second_command --here
"""
}
The configuration file:
process {
executor = 'slurm'
withLabel: 'short' {
cpus = 4
memory = '20 GB'
queue = 'alpha'
}
withLabel: 'long' {
cpus = 8
memory = '32 GB'
queue = 'omega'
}
}
12.2.4. Configure multiple containers
It is possible to use a different container for each process in your workflow. You can define their containers in a config file as shown below:
process {
withName: foo {
container = 'some/image:x'
}
withName: bar {
container = 'other/image:y'
}
}
docker.enabled = true
Should I use a single fat container or many slim containers? Both approaches have pros & cons. A single container is simpler to build and to maintain, however when using many tools the image can become very big and tools can conflict with each other. Using a container for each process can result in many different images to build and maintain, especially when processes in your workflow use different tools in each task.
Read more about config process selector at this link.
12.3. Configuration profiles
Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile
command line option.
Configuration profiles are defined by using the special scope profiles
which group the attributes that belong to the same profile using a common prefix. For example:
profiles {
standard {
params.genome = '/local/path/ref.fasta'
process.executor = 'local'
}
cluster {
params.genome = '/data/stared/ref.fasta'
process.executor = 'sge'
process.queue = 'long'
process.memory = '10GB'
process.conda = '/some/path/env.yml'
}
cloud {
params.genome = '/data/stared/ref.fasta'
process.executor = 'awsbatch'
process.container = 'cbcrg/imagex'
docker.enabled = true
}
}
This configuration defines three different profiles: standard
, cluster
and cloud
that set different process configuration strategies depending on the target runtime platform. By convention the standard
profile is implicitly used when no other profile is specified by the user.
To enable a specific profile use -profile
option followed by the profile name:
nextflow run <your script> -profile cluster
Two or more configuration profiles can be specified by separating the profile names with a comma character: |
nextflow run <your script> -profile standard,cloud
12.4. Cloud deployment
AWS Batch is a managed computing service that allows the execution of containerised workloads in the Amazon cloud infrastructure.
Nextflow provides built-in support for AWS Batch which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions as Batch jobs.
Once the Batch environment is configured, specifying the instance types to be used and the max number of cpus to be allocated, you need to create a Nextflow configuration file like the one showed below:
process.executor = 'awsbatch' (1)
process.queue = 'nextflow-ci' (2)
process.container = 'nextflow/rnaseq-nf:latest' (3)
workDir = 's3://nextflow-ci/work/' (4)
aws.region = 'eu-west-1' (5)
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' (6)
1 | Set AWS Batch as the executor to run the processes in the workflow |
2 | The name of the computing queue defined in the Batch environment |
3 | The Docker container image to be used to run each job |
4 | The workflow work directory must be an AWS S3 bucket |
5 | The AWS region to be used |
6 | The path of the AWS cli tool required to download/upload files to/from the container |
The best practice is to keep this setting as a separate profile in your workflow config file. This allows the execution with a simple command. |
nextflow run script7.nf
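For example, a sketch of the settings above kept in a profile (the profile name batch is an assumption):
profiles {
    batch {
        process.executor  = 'awsbatch'
        process.queue     = 'nextflow-ci'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir           = 's3://nextflow-ci/work/'
        aws.region        = 'eu-west-1'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
With this approach the pipeline could be launched on AWS Batch with:
nextflow run script7.nf -profile batch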
The complete details about AWS Batch deployment are available at this link.
12.5. Volume mounts
EBS volumes (or other supported storage) can be mounted in the job container using the following configuration snippet:
aws {
batch {
volumes = '/some/path'
}
}
Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used to define complex volumes for which the container path is different from the host path or to specify a read-only option:
aws {
region = 'eu-west-1'
batch {
volumes = ['/tmp', '/host/path:/mnt/path:ro']
}
}
This is a global configuration that has to be specified in a Nextflow config file, as such it’s applied to all process executions. |
Nextflow expects those paths to be available. It does not handle the provision of EBS volumes or other kind of storage. |
12.6. Custom job definition
Nextflow automatically creates the Batch Job definitions needed to execute your pipeline processes. Therefore it’s not required to define them before you run your workflow.
However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration settings of a specific job (e.g. to define custom mount paths or other special settings of a Batch Job).
To use your own job definition in a Nextflow workflow, use it in place of the container image name,
prefixing it with the job-definition://
string. For example:
process {
container = 'job-definition://your-job-definition-name'
}
12.7. Custom image
Since Nextflow requires the AWS CLI tool to be accessible in the computing environment, a common solution consists of creating a custom AMI and installing the AWS CLI tool in a self-contained manner (e.g. using the Conda package manager).
When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized Amazon Linux AMI as the base image. |
The following snippet shows how to install AWS CLI with Miniconda:
sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
The aws tool will be placed in a directory named bin in the main installation folder. Modifying this directory structure after the installation will cause the tool to not work properly.
Finally specify the aws
full path in the Nextflow config file as shown below:
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
12.8. Launch template
An alternative to creating a custom AMI is to use a Launch template that installs the AWS CLI tool during the instance boot via custom user-data.
In the EC2 dashboard, create a Launch template specifying the user data field:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"
--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3
## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda
--//--
Then in the Batch dashboard create a new compute environment and specify the newly created launch template in the corresponding field.
12.9. Hybrid deployments
Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed on the local computer or local computing cluster and some jobs are offloaded to the AWS Batch service.
To enable this feature use one or more process selectors in your Nextflow configuration file.
For example, to apply the AWS Batch configuration only to a subset of processes in your workflow, you can try the following:
process {
executor = 'slurm' (1)
queue = 'short' (2)
withLabel: bigTask { (3)
executor = 'awsbatch' (4)
queue = 'my-batch-queue' (5)
container = 'my/image:tag' (6)
}
}
aws {
region = 'eu-west-1' (7)
}
1 | Set slurm as the default executor |
2 | Set the queue for the SLURM cluster |
3 | Setting of a process named bigTask |
4 | Set awsbatch as the executor for the bigTask process |
5 | Set the queue for the bigTask process |
6 | Set the container image to deploy for the bigTask process |
7 | Define the region for Batch execution |
13. Execution cache and resume
The Nextflow caching mechanism works by assigning a unique ID to each task which is used to create a separate execution directory where the tasks are executed and the results stored.
The task unique ID is generated as a 128-bit hash number obtained composing the task inputs values and files and the command string.
The pipeline work directory is organized as shown below:
work/
├── 12
│ └── 1adacb582d2198cd32db0e6f808bce
│ ├── genome.fa -> /data/../genome.fa
│ └── index
│ ├── hash.bin
│ ├── header.json
│ ├── indexing.log
│ ├── quasi_index.log
│ ├── refInfo.json
│ ├── rsd.bin
│ ├── sa.bin
│ ├── txpInfo.bin
│ └── versionInfo.json
├── 19
│ └── 663679d1d87bfeafacf30c1deaf81b
│ ├── ggal_gut
│ │ ├── aux_info
│ │ │ ├── ambig_info.tsv
│ │ │ ├── expected_bias.gz
│ │ │ ├── fld.gz
│ │ │ ├── meta_info.json
│ │ │ ├── observed_bias.gz
│ │ │ └── observed_bias_3p.gz
│ │ ├── cmd_info.json
│ │ ├── libParams
│ │ │ └── flenDist.txt
│ │ ├── lib_format_counts.json
│ │ ├── logs
│ │ │ └── salmon_quant.log
│ │ └── quant.sf
│ ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│ ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│ └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index
You can visualize the work directory structure like this using the tree command if you have it installed. On Debian-based Linux, simply sudo apt install -y tree, or with Homebrew: brew install tree.
13.1. How resume works
The -resume
command line option allows the continuation of a pipeline execution from the last step that was successfully completed:
nextflow run <script> -resume
In practical terms the pipeline is executed from the beginning. However, before launching the execution of a process
Nextflow
uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the
expected output files.
If this condition is satisfied the task execution is skipped and previously computed results are used as the process
results.
The first task for which a new output is computed invalidates all downstream executions in the remaining DAG.
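For example, re-running the RNA-Seq script from the earlier sections (a usage sketch):
nextflow run script7.nf -with-docker
nextflow run script7.nf -with-docker -resume
In the second run, tasks whose inputs have not changed are reported as cached and are not executed again.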
13.2. Work directory
The task work directories are created in the folder work
in
the launching path by default. This is supposed to be a scratch
storage area that can be cleaned up once the computation is completed.
Workflow final output(s) are supposed to be stored in a different location specified using one or more publishDir directive. |
Make sure to delete your work directory occasionally, otherwise your machine/environment may fill up with unused files.
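The nextflow clean command can help with this; a hedged sketch (the run name mighty_boyd is illustrative):
nextflow clean -n -before mighty_boyd   # dry run: list the work files that would be removed
nextflow clean -f -before mighty_boyd   # delete the work files of runs executed before that run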
A different location for the execution work directory can be specified
using the command line option -w
e.g.
nextflow run <script> -w /some/scratch/dir
If you delete or move the pipeline work directory, it will prevent the use of the resume feature in following runs. |
The hash code for input files is computed using:
-
The complete file path
-
The file size
-
The last modified timestamp
Therefore, just touching a file will invalidate the related task execution.
13.3. How to organize in silico experiments
It’s good practice to organize each experiment in its own folder. The main experiment input parameters should be specified using a Nextflow config file. This makes it simple to track and replicate the experiment over time.
In the same experiment, the same pipeline can be executed multiple times; however, you should avoid launching two (or more) Nextflow instances in the same directory concurrently.
The nextflow log
command lists the executions run in the current folder:
$ nextflow log
TIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND
2019-05-06 12:07:32 1.2s focused_carson ERR a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85 nextflow run hello
2019-05-06 12:08:33 21.1s mighty_boyd OK a9012339ce 7363b3f0-09ac-495b-a947-28cf430d0b85 nextflow run rnaseq-nf -with-docker
2019-05-06 12:31:15 1.2s insane_celsius ERR b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf nextflow run rnaseq-nf
2019-05-06 12:31:24 17s stupefied_euclid OK b9aefc67b4 4dc656d2-c410-44c8-bc32-7dd0ea87bebf nextflow run rnaseq-nf -resume -with-docker
You can use either the session ID or the run name to recover a specific execution. For example:
nextflow run rnaseq-nf -resume mighty_boyd
13.4. Execution provenance
The log
command when provided with a run name or session ID can return many useful bits of information about a pipeline execution that can be used to create a provenance report.
By default, it lists the work directories used to compute each task. For example:
$ nextflow log tiny_fermat
/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba
Using the option -f
(fields) it’s possible to specify which metadata
should be printed by the log
command. For example:
$ nextflow log tiny_fermat -f 'process,exit,hash,duration'
index 0 7b/3753ff 2.0s
fastqc 0 c1/56a36d 9.3s
fastqc 0 f7/659c65 9.1s
quant 0 82/ba67e3 2.7s
quant 0 e5/2816b9 3.2s
multiqc 0 3b/3485d0 6.3s
The complete list of available fields can be retrieved with the command:
nextflow log -l
The option -F
allows the specification of a filtering criteria to
print only a subset of tasks. For example:
$ nextflow log tiny_fermat -F 'process =~ /fastqc/'
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
This can be useful to locate specific task work directories.
Finally, the -t
option allows the creation of a basic custom provenance report, using a template file in any format of your choice. For example:
<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>
<ul>
<li>Exit: ${exit}</li>
<li>Status: ${status}</li>
<li>Work dir: ${workdir}</li>
<li>Container: ${container}</li>
</ul>
</div>
Save the above snippet in a file named template.html
. Then
run this command (using the correct id for your run, e.g. not tiny_fermat):
nextflow log tiny_fermat -t template.html > prov.html
Finally open the prov.html
file with a browser.
13.5. Resume troubleshooting
If your workflow execution is not resumed as expected with one or more tasks re-executed each time, these may be the most likely causes:
-
Input file changed: Make sure that there’s no change in your input file(s). Don’t forget the task unique hash is computed taking into account the complete file path, the last modified timestamp and the file size. If any of this information changes, the workflow will be re-executed even if the input content is the same.
-
A process modifies an input: A process should never alter input files, otherwise the resume for future executions will be invalidated for the same reason explained in the previous point.
-
Inconsistent file attributes: Some shared file systems, such as NFS, may report an inconsistent file timestamp (i.e. a different timestamp for the same file) even if it has not been modified. To prevent this problem use the lenient cache strategy (see the configuration sketch after this list).
-
Race condition in a global variable: Nextflow is designed to simplify parallel programming so that you do not need to take care of race conditions and access to shared resources. One of the few cases in which a race condition can arise is when using a global variable with two (or more) operators. For example:
Channel
    .from(1,2,3)
    .map { it -> X=it; X+=2 }
    .view { "ch1 = $it" }

Channel
    .from(1,2,3)
    .map { it -> X=it; X*=2 }
    .view { "ch2 = $it" }
The problem in this snippet is that the X variable in the closure definition is defined in the global scope. Therefore, since operators are executed in parallel, the X value can be overwritten by the other map invocation.
The correct implementation requires the use of the def keyword to declare the variable local:
Channel
    .from(1,2,3)
    .map { it -> def X=it; X+=2 }
    .view { "ch1 = $it" }

Channel
    .from(1,2,3)
    .map { it -> def X=it; X*=2 }
    .view { "ch2 = $it" }
-
Non-deterministic input channels: While dataflow channel ordering is guaranteed (i.e. data is read in the same order in which it’s written in the channel), when a process declares as input two or more channels, each of which is the output of a different process, the overall input ordering is not consistent over different executions.
In practical terms, consider the following snippet:
process foo {
    input:
    set val(pair), file(reads) from ...
    output:
    set val(pair), file('*.bam') into bam_ch
    """
    your_command --here
    """
}

process bar {
    input:
    set val(pair), file(reads) from ...
    output:
    set val(pair), file('*.bai') into bai_ch
    """
    other_command --here
    """
}

process gather {
    input:
    set val(pair), file(bam) from bam_ch
    set val(pair), file(bai) from bai_ch
    """
    merge_command $bam $bai
    """
}
The two inputs declared in the gather process can be delivered in any order, because the execution order of the processes foo and bar is not deterministic due to their parallel execution.
Therefore the input of the third process needs to be synchronized using the join operator or a similar approach. The third process should be written as:
...

process gather {
    input:
    set val(pair), file(bam), file(bai) from bam_ch.join(bai_ch)
    """
    merge_command $bam $bai
    """
}
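A minimal configuration sketch enabling the lenient cache strategy mentioned in the list above; it applies to all processes in the workflow:
// nextflow.config
process.cache = 'lenient'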
14. Error handling & troubleshooting
14.1. Execution error debugging
When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the failing task:
ERROR ~ Error executing process > 'index'
Caused by: (1)
Process `index` terminated with an error exit status (127)
Command executed: (2)
salmon index --threads 1 -t transcriptome.fa -i index
Command exit status: (3)
127
Command output: (4)
(empty)
Command error: (5)
.command.sh: line 2: salmon: command not found
Work dir: (6)
/Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
1 | A description of the error cause |
2 | The command executed |
3 | The command exit status |
4 | The command standard output, when available |
5 | The command standard error |
6 | The command work directory |
Review carefully all these data, they can provide valuable information on the cause of the error.
If this is not enough, cd
into the task work directory. It contains all the files needed to replicate the issue in an isolated manner.
The task execution directory contains these files:
-
.command.sh: The command script.
-
.command.run: The command wrapper used to run the job.
-
.command.out: The complete job standard output.
-
.command.err: The complete job standard error.
-
.command.log: The wrapper execution output.
-
.command.begin: Sentinel file created as soon as the job is launched.
-
.exitcode: A file containing the task exit code.
-
Task input files (symlinks)
-
Task output files
Verify that the .command.sh
file contains the expected command to be executed and all variables are correctly resolved.
Also verify the existence of the .exitcode
and .command.begin
files, which if absent, suggest the task was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin
file exists, the job was launched but was likely killed abruptly.
You can replicate the failing execution using the command bash .command.run
and verify the cause of the error.
14.2. Ignore errors
There are cases in which a process error may be expected and it should not stop the overall workflow execution.
To handle this use case, set the process errorStrategy
to ignore
:
process foo {
errorStrategy 'ignore'
script:
"""
your_command --this --that
"""
}
If you want to ignore any error, set the same directive in the config file as a default setting:
process.errorStrategy = 'ignore'
14.3. Automatic error fail-over
In rare cases, errors may be caused by transient conditions. In this situation, an effective strategy is re-executing the failing task.
process foo {
errorStrategy 'retry'
script:
"""
your_command --this --that
"""
}
Using the retry
error strategy the task is re-executed a second time if it returns a non-zero exit status
before stopping the complete workflow execution.
The directive maxRetries can be used to set the number of attempts the task can be re-executed before declaring it failed with an error condition.
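For example, a sketch combining the retry strategy with a maximum of three attempts:
process foo {
    errorStrategy 'retry'
    maxRetries 3
    script:
    """
    your_command --this --that
    """
}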
14.4. Retry with backoff
There are cases in which the required execution resources may be temporarily unavailable (e.g. network congestion). In these cases, simply re-executing the same task will likely result in the identical error. A retry with an exponential backoff delay can better recover these error conditions.
process foo {
errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
maxRetries 5
script:
'''
your_command --here
'''
}
14.5. Dynamic resources allocation
It’s a very common scenario that different instances of the same process may have very different needs in terms of computing resources. In such situations, requesting an amount of memory that is too low will cause some tasks to fail. Instead, using a limit high enough to fit all the tasks in your execution could significantly decrease the execution priority of your jobs.
To handle this use case, you can use a retry
error strategy and increase the computing resources allocated to the job at each successive attempt.
process foo {
cpus 4
memory { 2.GB * task.attempt } (1)
time { 1.hour * task.attempt } (2)
errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' } (3)
maxRetries 3 (4)
script:
"""
your_command --cpus $task.cpus --mem $task.memory
"""
}
1 | The memory is defined in a dynamic manner, the first attempt is 2 GB, the second 4 GB, and so on. |
2 | The wall execution time is set dynamically as well, the first execution attempt is set to 1 hour, the second 2 hours, and so on. |
3 | If the task returns an exit status equal to 140 it will set the error strategy to retry otherwise it will terminate the execution. |
4 | It will retry the process execution up to three times. |