Welcome to the Nextflow Training Workshop

We are excited to have you on the path to writing reproducible and scalable scientific workflows using Nextflow. This guide complements the full Nextflow documentation - if you ever have any doubts, head over to the docs located here.

By the end of this course you should:

  1. Be proficient in writing Nextflow pipelines

  2. Know the basic Nextflow concepts of Channels, Processes and Operators

  3. Have an understanding of containerized workflows

  4. Understand the different execution platforms supported by Nextflow

  5. Be introduced to the Nextflow community and ecosystem

Overview

To get you started with Nextflow as quickly as possible, we will walk through the following steps:

  1. Set up a development environment to run Nextflow

  2. Explore Nextflow concepts using some basic workflows, including a multi-step RNA-Seq analysis

  3. Build and use Docker containers to encapsulate all workflow dependencies

  4. Dive deeper into the core Nextflow syntax, including Channels, Processes, and Operators

  5. Cover cluster and cloud deployment scenarios and explore Nextflow Tower capabilities

This will give you a broad understanding of Nextflow, to start writing your own pipelines. We hope you enjoy the course! This is an ever-evolving document — feedback is always welcome.

Summary

1. Environment setup

There are two main ways to get started with Seqera’s Nextflow training course.

The first is to install the requirements locally (Local installation), which is best if you are already familiar with Git and Docker, or working offline.

The second is to use Gitpod, which is best for first-timers as this platform contains all the programs and data required. Simply click the link and log in using your GitHub account to start the tutorial.

1.1. Local installation

Nextflow can be used on any POSIX-compatible system (Linux, macOS, Windows Subsystem for Linux, etc.).

Requirements:

Optional requirements for this tutorial:

1.1.1. Download Nextflow:

Enter this command in your terminal:

wget -qO- https://get.nextflow.io | bash

Or, if you prefer curl:

curl -s https://get.nextflow.io | bash

Then ensure that the downloaded binary is executable:

chmod +x nextflow

AND put the nextflow executable into your $PATH (e.g. /usr/local/bin or ~/bin/)

1.1.2. Docker

Ensure you have Docker Desktop running on your machine. Download Docker here.

1.1.3. Training material

You can view the training material here: training.seqera.io/

To download the material use git clone https://github.com/seqeralabs/nf-training-public.git.

Then cd into the nf-training directory.

1.1.4. Checking your installation

Check the correct installation of nextflow by running the following command:

nextflow info

This should show the current version, system and runtime.

1.2. Gitpod

A preconfigured Nextflow development environment is available using Gitpod.

Requirements:

  • A GitHub account

  • Web browser (Google Chrome, Firefox)

  • Internet connection

1.2.1. Gitpod quick start

To run Gitpod:

  • Click the following URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public
    (which is our GitHub repository URL, prefixed with gitpod.io/#).

  • Log in to your GitHub account (and allow authorization).

Once you have signed in, Gitpod should load (skip prebuild if asked).

1.2.2. Explore your Gitpod IDE

You should now see something similar to the following:

[Image: Gitpod IDE welcome screen]

The sidebar allows you to customize your Gitpod environment and perform basic tasks (copy, paste, open files, search, git, etc.). Click the Explorer button to see which files are in this repository.

The terminal allows you to run all the programs in the repository. For example, both nextflow and docker are installed and can be executed.

The main window allows you to view and edit files. Clicking on a file in the explorer will open it within the main window. You should also see the nf-training material browser (training.seqera.io/).

To test that the environment is working correctly, type the following into the terminal:

nextflow info

This should come up with the Nextflow version and runtime information:

  Version: 22.09.7-edge build 5806
  Created: 28-09-2022 09:36 UTC
  System: Linux 5.15.0-47-generic
  Runtime: Groovy 3.0.13 on OpenJDK 64-Bit Server VM 11.0.1+13-LTS
  Encoding: UTF-8 (UTF-8)

1.2.3. Gitpod resources

  • Gitpod gives you 500 free credits per month, which is equivalent to 50 hours of free environment runtime using the standard workspace (up to 4 cores, 8 GB RAM and 30GB storage).

  • There is also a large workspace option that gives you up to 8 cores, 16GB RAM, and 50GB storage for each workspace (up to 4 parallel workspaces in the free version). However, the large workspace uses your free credits quicker so your account will have fewer hours of access to this space.

  • Gitpod will time out after 30 minutes. However, changes are saved for up to 2 weeks (see the next section for reopening a timed out session).

See gitpod.io for more details.

1.2.4. Reopening a Gitpod session

You can reopen an environment from gitpod.io/workspaces. Find your previous environment in the list, then select the ellipsis (three dots icon) and select Open.

If you have saved the URL for your previous Gitpod environment, you can simply open it in your browser to resume the previous environment.

Alternatively, you can start a new workspace by following the Gitpod URL: gitpod.io/#https://github.com/seqeralabs/nf-training-public

If you have lost your environment, you can find the main scripts used in this tutorial in the nf-training directory to resume with a new environment.

1.2.5. Saving files from Gitpod to your local machine

To save any file from the explorer panel, right-click the file and select Download.

1.2.6. Training material

The training course can be accessed in your browser from training.seqera.io/.

1.3. Selecting a Nextflow version

By default, Nextflow will pull the latest stable version into your environment.

However, Nextflow is constantly evolving as we make improvements and fix bugs.

The latest releases can be viewed on GitHub here.

If you want to use a specific version of Nextflow, you can set the NXF_VER variable as shown below:

export NXF_VER=22.04.5
Most of this tutorial workshop requires NXF_VER=22.04.0 or later, to use DSL2 as default.

Run nextflow -version to confirm that the change has taken effect.

2. Introduction

2.1. Basic concepts

Nextflow is a workflow orchestration engine and domain specific language (DSL) that makes it easy to write data-intensive computational pipelines.

It is designed around the idea that the Linux platform is the lingua franca of data science. Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations.

Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment, based on the dataflow programming model. Nextflow’s core features are:

  • Workflow portability and reproducibility

  • Scalability of parallelization and deployment

  • Integration of existing tools, systems, and industry standards

2.1.1. Processes and Channels

In practice, a Nextflow pipeline is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state. The only way they can communicate is via asynchronous first-in, first-out (FIFO) queues, called channels in Nextflow.

Any process can define one or more channels as an input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.

[Image: processes connected by channels]

2.1.2. Execution abstraction

While a process defines what command or script has to be executed, the executor determines how that script is actually run in the target platform.

If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and testing purposes, but for real world computational pipelines, a High Performance Computing (HPC) or cloud platform is often required.

In other words, Nextflow provides an abstraction between the pipeline’s functional logic and the underlying execution system (or runtime). Thus, it is possible to write a pipeline which runs seamlessly on your computer, a cluster, or the cloud, without being modified. You simply define the target execution platform in the configuration file.

[Image: execution abstraction across local, cluster and cloud platforms]

2.1.3. Scripting language

Nextflow implements a declarative DSL that simplifies the writing of complex data analysis workflows as an extension of a general purpose programming language.

This approach makes Nextflow flexible — it provides the benefits of a concise DSL for the handling of recurrent use cases with ease, and the flexibility and power of a general purpose programming language to handle corner cases, in the same computing environment. This would be difficult to implement using a purely declarative approach.

In practical terms, Nextflow scripting is an extension of the Groovy programming language which, in turn, is a super-set of the Java programming language. Groovy can be thought of as "Python for Java", in that it simplifies the writing of code and is more approachable.

2.2. Your first script

Here you will execute your first Nextflow script (hello.nf), which we will go through line-by-line.

In this toy example, the script takes an input string (a parameter called params.greeting) and splits it into chunks of six characters in the first process. The second process then converts the characters to upper case. The result is finally displayed on-screen.

2.2.1. Nextflow code (hello.nf)

#!/usr/bin/env nextflow     (1)

params.greeting  = 'Hello world!' (2)
greeting_ch = Channel.of(params.greeting) (3)

process SPLITLETTERS { (4)
    input: (5)
    val x (6)

    output: (7)
    path 'chunk_*' (8)

    """ (9)
    printf '$x' | split -b 6 - chunk_  (10)
    """ (11)
} (12)

process CONVERTTOUPPER { (13)
    input:  (14)
    path y (15)

    output: (16)
    stdout (17)

    """ (18)
    cat $y | tr '[a-z]' '[A-Z]'  (19)
    """ (20)
} (21)

workflow { (22)
    letters_ch = SPLITLETTERS(greeting_ch) (23)
    results_ch = CONVERTTOUPPER(letters_ch.flatten()) (24)
    results_ch.view{ it } (25)
} (26)
1 The code begins with a shebang, which declares Nextflow as the interpreter.
2 Declares a parameter greeting that is initialized with the value 'Hello world!'.
3 Initializes a channel labelled greeting_ch, which contains the value from params.greeting. Channels are the input type for processes in Nextflow.
4 Begins the first process, defined as SPLITLETTERS.
5 Input declaration for the SPLITLETTERS process. Inputs can be values (val), files or paths (path), or other qualifiers (see here).
6 Tells the process to expect an input value (val), that we assign to the variable 'x'.
7 Output declaration for the SPLITLETTERS process.
8 Tells the process to expect one or more output files (path) matching the pattern 'chunk_*' as output from the script. The process sends the output as a channel.
9 Three double quotes initiate the code block to execute in this process.
10 Code to execute — printing the input value x (called using the dollar symbol [$] prefix), splitting the string into chunks with a length of 6 characters ("Hello " and "world!"), and saving each to a file (chunk_aa and chunk_ab).
11 Three double quotes end the code block.
12 End of the first process block.
13 Begins the second process, defined as CONVERTTOUPPER.
14 Input declaration for the CONVERTTOUPPER process.
15 Tells the process to expect an input file(s) (path; i.e. chunk_aa and chunk_ab), that we assign to the variable 'y'.
16 Output declaration for the CONVERTTOUPPER process.
17 Tells the process to expect output as standard output (stdout) and send this output as a channel.
18 Three double quotes initiate the code block to execute in this process.
19 Script to read files (cat) using the '$y' input variable, then pipe to uppercase conversion, outputting to standard output.
20 Three double quotes end the code block.
21 End of the second process block.
22 Start of the workflow scope, where each process can be called.
23 Execute the process SPLITLETTERS on the greeting_ch (aka greeting channel), and store the output in the channel letters_ch.
24 Execute the process CONVERTTOUPPER on the letters channel letters_ch, which is flattened using the operator .flatten(). This transforms the input channel in such a way that every item is a separate element. We store the output in the channel results_ch.
25 The final output (in the results_ch channel) is printed to screen using the view operator (appended onto the channel name).
26 End of the workflow scope.
The use of the operator .flatten() here is to split the two files into two separate items to be put through the next process (else they would be treated as a single element).
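
To see what flatten() does on its own, here is a minimal sketch (separate from hello.nf) using a channel that emits a single list item:

ch = Channel.of( ['chunk_aa', 'chunk_ab'] ) // one item: a list with two elements
ch.flatten()                                // now two items: 'chunk_aa' and 'chunk_ab'
  .view()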

2.2.2. In practice

Now copy the above example into your favourite text editor and save it to a file named hello.nf.

For the Gitpod tutorial, make sure you are in the folder called nf-training

Execute the script by entering the following command in your terminal:

nextflow run hello.nf

The output will look similar to the text shown below:

N E X T F L O W  ~  version 22.09.7-edge
Launching `hello.nf` [fabulous_torricelli] DSL2 - revision: 197a0e289a
executor >  local (3)
[c8/c36893] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
[1a/3c54ed] process > CONVERTTOUPPER (2) [100%] 2 of 2 ✔
HELLO
WORLD!

The standard output shows (line by line):

  • 1: The Nextflow version executed.

  • 2: The script name, the run name, the DSL version, and the revision.

  • 3: The executor used (in the above case: local).

  • 4: The first process is executed once (1). The line starts with a unique hexadecimal value (see TIP below), and ends with the percentage and job completion information.

  • 5: The second process is executed twice (2) (once for chunk_aa, once for chunk_ab).

  • 6-7: The result string from stdout is printed.

The hexadecimal numbers, like c8/c36893, identify the unique process execution. These numbers are also the prefix of the directories where each process is executed. You can inspect the files produced by changing to the directory $PWD/work and using these numbers to find the process-specific execution path.
The second process runs twice, executing in two different work directories for each input file. Therefore, in the previous example the work directory [1a/3c54ed] represents just one of the two directories that were processed. To print all the relevant paths to the screen, use the -ansi-log flag (e.g. nextflow run hello.nf -ansi-log false).

It’s worth noting that the process CONVERTTOUPPER is executed in parallel, so there’s no guarantee that the instance processing the first split (the chunk 'Hello ') will be executed before the one processing the second split (the chunk 'world!').

Thus, it is perfectly possible that your final result will be printed out in a different order:

WORLD!
HELLO

2.3. Modify and resume

Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the processes that are changed will be re-executed. The execution of the processes that are not changed will be skipped and the cached result will be used instead.

This allows for testing or modifying part of your pipeline without having to re-execute it from scratch.

For the sake of this tutorial, modify the CONVERTTOUPPER process in the previous example, replacing the process script with the string rev $y, so that the process looks like this:

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    """
    rev $y
    """
}

Then save the file with the same name, and execute it by adding the -resume option to the command line:

nextflow run hello.nf -resume

It will print output similar to this:

N E X T F L O W  ~  version 22.09.7-edge
Launching `hello.nf` [mighty_mcnulty] DSL2 - revision: 525206806b
executor >  local (2)
[c8/c36893] process > SPLITLETTERS (1)   [100%] 1 of 1, cached: 1 ✔
[77/cf83b6] process > CONVERTTOUPPER (1) [100%] 2 of 2 ✔
!dlrow
 olleH

You will see that the execution of the process SPLITLETTERS is actually skipped (the process ID is the same as in the first output) — its results are retrieved from the cache. The second process is executed as expected, printing the reversed strings.

The pipeline results are cached by default in the directory $PWD/work. Depending on your script, this folder can take a lot of disk space. If you are sure you won’t need to resume your pipeline execution, clean this folder periodically.

2.4. Pipeline parameters

Pipeline parameters are simply declared by prepending the prefix params to a variable name, separated by a dot character. Their value can be specified on the command line by prefixing the parameter name with a double dash character, i.e. --paramName.

Now, let’s try to execute the previous example specifying a different input string parameter, as shown below:

nextflow run hello.nf --greeting 'Bonjour le monde!'

The string specified on the command line will override the default value of the parameter. The output will look like this:

N E X T F L O W  ~  version 22.09.7-edge
Launching `hello.nf` [fervent_galileo] DSL2 - revision: 525206806b
executor >  local (4)
[e9/139d7d] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
[bb/fc8548] process > CONVERTTOUPPER (1) [100%] 3 of 3 ✔
m el r
!edno
uojnoB

2.4.1. In DAG-like format

To better understand how Nextflow is dealing with the data in this pipeline, below is a DAG-like figure to visualise all the inputs, outputs, channels and processes:

[Image: hello world pipeline diagram showing the channels and processes]

3. Simple RNA-Seq pipeline

To demonstrate a real-world biomedical scenario, we will implement a proof of concept RNA-Seq pipeline which:

  1. Indexes a transcriptome file

  2. Performs quality controls

  3. Performs quantification

  4. Creates a MultiQC report

This will be done using a series of seven scripts, each of which builds on the previous to create a complete workflow. You can find these in the tutorial folder (script1.nf - script7.nf).

3.1. Define the pipeline parameters

Parameters are inputs and options that can be changed when the pipeline is run.

The script script1.nf defines the pipeline input parameters.

params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"

println "reads: $params.reads"

Run it by using the following command:

nextflow run script1.nf

Try to specify a different input parameter in your execution command, for example:

nextflow run script1.nf --reads '/workspace/nf-training-public/nf-training/data/ggal/lung_{1,2}.fq'

Exercise

Modify script1.nf by adding a fourth parameter named outdir and set it to a default path that will be used as the pipeline output directory.

Click here for the answer:
params.reads = "$projectDir/data/ggal/gut_{1,2}.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"
params.multiqc = "$projectDir/multiqc"
params.outdir = "results"

Exercise

Modify script1.nf to print all of the pipeline parameters by using a single log.info command as a multiline string statement.

See an example here.
Click here for the answer:

Add the following to your script file:

log.info """\
         R N A S E Q - N F   P I P E L I N E
         ===================================
         transcriptome: ${params.transcriptome_file}
         reads        : ${params.reads}
         outdir       : ${params.outdir}
         """
         .stripIndent()

Recap

In this step you have learned:

  1. How to define parameters in your pipeline script

  2. How to pass parameters by using the command line

  3. The use of $var and ${var} variable placeholders

  4. How to use multiline strings

  5. How to use log.info to print information and save it in the log execution file

3.2. Create a transcriptome index file

Nextflow allows the execution of any command or script by using a process definition.

A process is defined by providing three main declarations: the process input, output and command script.

To add a transcriptome INDEX processing step, try adding the following code blocks to your script1.nf. Alternatively, these code blocks have already been added to script2.nf.

/*
 * define the `index` process that creates a binary index
 * given the transcriptome file
 */
process INDEX {
  input:
  path transcriptome

  output:
  path 'salmon_index'

  script:
  """
  salmon index --threads $task.cpus -t $transcriptome -i salmon_index
  """
}

Additionally, add a workflow scope containing an input channel definition and the index process:

workflow {
    index_ch = INDEX(params.transcriptome_file)
}

Here, the params.transcriptome_file parameter is used as the input for the INDEX process. The INDEX process (using the salmon tool) creates salmon_index, an indexed transcriptome that is passed as an output to the index_ch channel.

The input declaration defines a transcriptome path variable which is used in the script as a reference (using the dollar symbol) in the Salmon command line.
Resource requirements such as CPUs and memory limits can change with different workflow executions and platforms. Nextflow can use $task.cpus as a variable for the number of CPUs. See process directives documentation for more details.
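
For illustration, resources can also be set from nextflow.config rather than in the script itself. A minimal sketch (the values shown are only examples):

// nextflow.config (sketch)
process {
    withName: 'INDEX' {
        cpus   = 2
        memory = '4 GB'
    }
}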

Run it by using the command:

nextflow run script2.nf

The execution will fail because salmon is not installed in your environment.

Add the command line option -with-docker to launch the execution through a Docker container, as shown below:

nextflow run script2.nf -with-docker

This time the execution will work because it uses the Docker container nextflow/rnaseq-nf that is defined in the nextflow.config file in your current directory. If you are running this script locally then you will need to install Docker on your machine, log in and start Docker, and allow the script to download the container containing the run scripts. You can learn more about Docker here.

To avoid adding -with-docker each time you execute the script, add the following line to the nextflow.config file:

docker.enabled = true

Exercise

Enable the Docker execution by default by adding the above setting in the nextflow.config file.

Exercise

Print the output of the index_ch channel by using the view operator.

Click here for the answer:

Add the following to the end of your workflow block in your script file

index_ch.view()

Exercise

If you have more CPUs available, try changing your script to request more resources for this process. For example, see the directive docs. $task.cpus is already used in this script, so setting the cpus directive will tell Nextflow to run this job with the specified number of CPUs.

Click here for the answer:

Add cpus 2 to the top of the index process:

process INDEX {
  cpus 2
  input:
  ...

Then check that it worked by looking at the script executed in the work directory. Look for the hexadecimal directory (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), then cat the .command.sh file.

Bonus Exercise

Use the command tree work to see how Nextflow organizes the process work directory. Check here if you need to download tree.

Click here for the answer:

It should look something like this:

work
├── 17
│   └── 263d3517b457de4525513ae5e34ea8
│       ├── index
│       │   ├── complete_ref_lens.bin
│       │   ├── ctable.bin
│       │   ├── ctg_offsets.bin
│       │   ├── duplicate_clusters.tsv
│       │   ├── eqtable.bin
│       │   ├── info.json
│       │   ├── mphf.bin
│       │   ├── pos.bin
│       │   ├── pre_indexing.log
│       │   ├── rank.bin
│       │   ├── refAccumLengths.bin
│       │   ├── ref_indexing.log
│       │   ├── reflengths.bin
│       │   ├── refseq.bin
│       │   ├── seq.bin
│       │   └── versionInfo.json
│       └── transcriptome.fa -> /workspace/Gitpod_test/data/ggal/transcriptome.fa
├── 7f

Recap

In this step you have learned:

  1. How to define a process executing a custom command

  2. How process inputs are declared

  3. How process outputs are declared

  4. How to print the content of a channel

  5. How to access the number of available CPUs

3.3. Collect read files by pairs

This step shows how to match read files into pairs, so they can be mapped by Salmon.

Edit the script script3.nf by adding the following statement as the last line of the file:

read_pairs_ch.view()

Save it and execute it with the following command:

nextflow run script3.nf

It will print something similar to this:

[gut, [/.../data/ggal/gut_1.fq, /.../data/ggal/gut_2.fq]]

The above example shows how the read_pairs_ch channel emits tuples composed of two elements, where the first is the read pair prefix and the second is a list representing the actual files.

Try it again specifying different read files by using a glob pattern:

nextflow run script3.nf --reads 'data/ggal/*_{1,2}.fq'
File paths that include one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to avoid Bash expanding the glob.

Exercise

Use the set operator in place of = assignment to define the read_pairs_ch channel.

Click here for the answer:
Channel
  .fromFilePairs( params.reads )
  .set { read_pairs_ch }

Exercise

Use the checkIfExists option for the fromFilePairs method to check if the specified path contains file pairs.

Click here for the answer:
Channel
  .fromFilePairs( params.reads, checkIfExists: true )
  .set { read_pairs_ch }

Recap

In this step you have learned:

  1. How to use fromFilePairs to handle read pair files

  2. How to use the checkIfExists option to check for the existence of input files

  3. How to use the set operator to define a new channel variable

The declaration of a channel can be before the workflow scope or within it, as long as it is upstream of the process that requires the specific channel.
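
For example, a minimal sketch (MY_PROCESS is a hypothetical process used only for illustration):

// Channel declared before the workflow scope...
read_pairs_ch = Channel.fromFilePairs( params.reads, checkIfExists: true )

workflow {
    // ...or it could be declared here instead; either way it must come
    // before the process call that consumes it
    MY_PROCESS(read_pairs_ch)
}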

3.4. Perform expression quantification

script4.nf adds a gene expression QUANTIFICATION process and a call to it within the workflow scope. Quantification requires the indexed transcriptome and the RNA-Seq read pair FASTQ files.

In the workflow scope, note how the index_ch channel is assigned as output in the INDEX process.

Next, note that the first input channel for the QUANTIFICATION process is the previously declared index_ch, which contains the path to the salmon_index.

Also, note that the second input channel for the QUANTIFICATION process is the read_pairs_ch we just created. This is a tuple composed of two elements (a value, sample_id, and a list of paths to the FASTQ reads, reads), matching the structure of the items emitted by the fromFilePairs channel factory.
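
As a sketch, such a process could be declared along these lines (a non-authoritative outline; see script4.nf for the definition actually used in the course):

process QUANTIFICATION {
    input:
    path salmon_index
    tuple val(sample_id), path(reads)

    output:
    path "$sample_id"

    script:
    """
    salmon quant --threads $task.cpus --libType=U \
        -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    """
}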

Execute it by using the following command:

nextflow run script4.nf -resume

You will see the execution of the QUANTIFICATION process.

When using the -resume option, any step that has already been processed is skipped.

Try to execute the same script again with more read files, as shown below:

nextflow run script4.nf -resume --reads 'data/ggal/*_{1,2}.fq'

You will notice that the QUANTIFICATION process is executed multiple times.

Nextflow parallelizes the execution of your pipeline simply by providing multiple sets of input data to your script.

It may be useful to apply optional settings to a specific process using directives by specifying them in the process body.

Exercise

Add a tag directive to the QUANTIFICATION process to provide a more readable execution log.

Click here for the answer:

Add the following before the input declaration:

tag "Salmon on $sample_id"

Exercise

Add a publishDir directive to the QUANTIFICATION process to store the process results in a directory of your choice.

Click here for the answer:

Add the following before the input declaration in the QUANTIFICATION process:

publishDir params.outdir, mode:'copy'

Recap

In this step you have learned:

  1. How to connect two processes together by using the channel declarations

  2. How to resume the script execution and skip cached steps

  3. How to use the tag directive to provide a more readable execution output

  4. How to use the publishDir directive to store a process's results in a path of your choice

3.5. Quality control

Next, we implement a FASTQC quality control step for your input reads (using the label fastqc). The inputs are the same as the read pairs used in the QUANTIFICATION step.

You can run it by using the following command:

nextflow run script5.nf -resume

Nextflow DSL2 knows to split the read_pairs_ch into two identical channels, as they are required twice as an input for both the FASTQC and the QUANTIFICATION processes.
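
A sketch of what such a FASTQC process could look like (refer to script5.nf for the actual definition):

process FASTQC {
    tag "FASTQC on $sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    """
}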

3.6. MultiQC report

This step collects the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the MultiQC tool.

Execute the next script with the following command:

nextflow run script6.nf -resume --reads 'data/ggal/*_{1,2}.fq'

It creates the final report in the results folder in the current work directory.

In this script, note the use of the mix and collect operators chained together to gather the outputs of the QUANTIFICATION and FASTQC processes as a single input. Operators can be used to combine and transform channels.

MULTIQC(quant_ch.mix(fastqc_ch).collect())

We only want one task of MultiQC to be executed to produce one report. Therefore, we use the mix channel operator to combine the two channels followed by the collect operator, to return the complete channel contents as a single element.
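
To see the effect of the two operators on their own, here is a minimal sketch with toy channels:

ch1 = Channel.of(1, 2)
ch2 = Channel.of('a', 'b')

ch1.mix(ch2)    // emits 1, 2, 'a', 'b' (emission order is not guaranteed)
   .collect()   // gathers all items into a single list element
   .view()      // prints something like [1, 2, a, b]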

Recap

In this step you have learned:

  1. How to collect many outputs to a single input with the collect operator

  2. How to mix two channels into a single channel

  3. How to chain two or more operators together

3.7. Handle completion event

This step shows how to execute an action when the pipeline completes the execution.

Note that Nextflow processes define the execution of asynchronous tasks, i.e. they are not executed one after another as if they were written in the pipeline script in a common imperative programming language.

The script uses the workflow.onComplete event handler to print a confirmation message when the script completes.
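
A minimal sketch of such a handler (the message text is illustrative):

workflow.onComplete {
    log.info ( workflow.success ? "\nDone! The results can be found in the $params.outdir directory\n" : "Oops .. something went wrong" )
}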

Try to run it by using the following command:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

3.8. Email notifications

Send a notification email when the workflow execution completes using the -N <email address> command-line option.

Note: this requires the configuration of an SMTP server in the nextflow config file. Below is an example nextflow.config file showing the settings you would have to configure:

mail {
  from = 'info@nextflow.io'
  smtp.host = 'email-smtp.eu-west-1.amazonaws.com'
  smtp.port = 587
  smtp.user = "xxxxx"
  smtp.password = "yyyyy"
  smtp.auth = true
  smtp.starttls.enable = true
  smtp.starttls.required = true
}

See mail documentation for details.

3.9. Custom scripts

Real-world pipelines use a lot of custom user scripts (Bash, R, Python, etc.). Nextflow allows you to consistently use and manage these scripts. Simply put them in a directory named bin in the pipeline project root. They will be automatically added to the pipeline execution PATH.

For example, create a file named fastqc.sh with the following content:

#!/bin/bash
set -e
set -u

sample_id=${1}
reads=${2}

mkdir fastqc_${sample_id}_logs
fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Save it, give execute permission, and move it into the bin directory as shown below:

chmod +x fastqc.sh
mkdir -p bin
mv fastqc.sh bin

Then, open the script7.nf file and replace the FASTQC process' script with the following code:

script:
"""
fastqc.sh "$sample_id" "$reads"
"""

Run it as before:

nextflow run script7.nf -resume --reads 'data/ggal/*_{1,2}.fq'

Recap

In this step you have learned:

  1. How to write or use existing custom scripts in your Nextflow pipeline.

  2. How to avoid the use of absolute paths by having your scripts in the bin/ folder.

3.10. Metrics and reports

Nextflow can produce multiple reports and charts providing several runtime metrics and execution information.

Run the rnaseq-nf pipeline previously introduced as shown below:

nextflow run rnaseq-nf -with-docker -with-report -with-trace -with-timeline -with-dag dag.png

The -with-docker option launches each task of the execution as a Docker container run command.

The -with-report option enables the creation of the workflow execution report. Open the file report.html with a browser to see the report created with the above command.

The -with-trace option enables the creation of a tab separated file containing runtime information for each executed task. Check the trace.txt for an example.

The -with-timeline option enables the creation of the workflow timeline report showing how processes were executed over time. This may be useful to identify the most time consuming tasks and bottlenecks. See an example at this link.

Finally, the -with-dag option enables the rendering of the workflow execution directed acyclic graph (DAG) representation. Note: this feature requires the installation of Graphviz on your computer. See here for further details. Then try running:

dot -Tpng dag.dot > graph.png
open graph.png

Note: runtime metrics may be incomplete for short-running tasks, as in the case of this tutorial.

You can view the HTML files by right-clicking on the file name in the left side-bar and choosing the Preview menu item.

3.11. Run a project from GitHub

Nextflow allows the execution of a pipeline project directly from a GitHub repository (or similar services, e.g., BitBucket and GitLab).

This simplifies the sharing and deployment of complex projects and tracking changes in a consistent manner.

The following GitHub repository hosts a complete version of the workflow introduced in this tutorial: github.com/nextflow-io/rnaseq-nf

You can run it by specifying the project name and launching each task of the execution as a Docker container run command:

nextflow run nextflow-io/rnaseq-nf -with-docker

Nextflow automatically downloads the pipeline project and stores it in the $HOME/.nextflow folder.

Use the command info to show the project information:

nextflow info nextflow-io/rnaseq-nf

Nextflow allows the execution of a specific revision of your project by using the -r command line option. For example:

nextflow run nextflow-io/rnaseq-nf -r v2.1 -with-docker

Revisions are defined by using Git tags or branches defined in the project repository.

Tags enable precise control of the changes in your project files and dependencies over time.

3.12. More resources

4. Manage dependencies and containers

Computational workflows are rarely composed of a single script or tool.  Most of the time they require the usage of dozens of different software components or libraries.

Installing and maintaining such dependencies is a challenging task and the most common source of irreproducibility in scientific applications.

To overcome these issues we use containers, which encapsulate software dependencies (i.e. the tools and libraries required by a data analysis application) in one or more self-contained, ready-to-run, immutable Linux container images that can easily be deployed on any platform supporting the container runtime.

Containers run in isolation from the hosting system, each having its own copy of the file system, process space, memory management, etc.

Containers were first introduced with kernel 2.6 as a Linux feature known as Control Groups or Cgroups.

4.1. Docker

Docker is a handy management tool to build, run and share container images.

These images can be uploaded and published in a centralized repository known as Docker Hub, or hosted by other parties like Quay.

4.1.1. Run a container

A container can be run using the following command:

docker run <container-name>

Try for example the following publicly available container (if you have docker installed):

docker run hello-world

4.1.2. Pull a container

The pull command allows you to download a Docker image without running it. For example:

docker pull debian:stretch-slim

The above command downloads a Debian Linux image. You can check it exists by using:

docker images

4.1.3. Run a container in interactive mode

Launching a BASH shell in the container allows you to operate in an interactive mode in the containerized operating system. For example:

docker run -it debian:stretch-slim bash

Once launched, you will notice that it is running as root (!). Use the usual commands to navigate the file system. This is useful to check if the expected programs are present within a container.

To exit from the container, stop the BASH session with the exit command.

4.1.4. Your first Dockerfile

Docker images are created by using a so-called Dockerfile, which is a simple text file containing a list of commands to assemble and configure the image with the software packages required.

Here, you will create a Docker image containing cowsay and the Salmon tool.

The Docker build process automatically copies all files that are located in the current directory to the Docker daemon in order to create the image. This can take a lot of time when big/many files exist. For this reason, it’s important to always work in a directory containing only the files you really need to include in your Docker image. Alternatively, you can use the .dockerignore file to select paths to exclude from the build.

Use your favorite editor (e.g., vim or nano) to create a file named Dockerfile and copy the following content:

FROM debian:stretch-slim

LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"

RUN apt-get update && apt-get install -y curl cowsay

ENV PATH=$PATH:/usr/games/

4.1.5. Build the image

Build the Dockerfile image by using the following command:

docker build -t my-image .

Where "my-image" is the user-specified name tag for the Dockerfile, present in the current directory.

Don’t miss the dot in the above command.

When it completes, verify that the image has been created by listing all available images:

docker images

You can try your new container by running this command:

docker run my-image cowsay Hello Docker!

4.1.6. Add a software package to the image

Add the Salmon package to the Docker image by adding the following snippet to the Dockerfile:

RUN curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/

Save the file and build the image again with the same command as before:

docker build -t my-image .

You will notice that it creates a new Docker image with the same name but with a different image ID.

4.1.7. Run Salmon in the container

Check that Salmon is running correctly in the container as shown below:

docker run my-image salmon --version

You can even launch a container in an interactive mode by using the following command:

docker run -it my-image bash

Use the exit command to terminate the interactive session.

4.1.8. File system mounts

Create a genome index file by running Salmon in the container.

Try to run Salmon in the container with the following command:

docker run my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

The above command fails because Salmon cannot access the input file.

This happens because the container runs in a completely separate file system and it cannot access the hosting file system by default.

You will need to use the --volume command-line option to mount the input file(s) e.g.

docker run --volume $PWD/data/ggal/transcriptome.fa:/transcriptome.fa my-image \
  salmon index -t /transcriptome.fa -i transcript-index
The generated transcript-index directory is still not accessible in the host file system.

An easier way is to mount a parent directory to an identical one in the container. This allows you to use the same path when running it in the container, e.g.:
docker run --volume $PWD:$PWD --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

Or set a folder you want to mount as an environmental variable, called DATA:

DATA=/workspace/nf-training-public/nf-training/data
docker run --volume $DATA:$DATA --workdir $PWD my-image \
  salmon index -t $PWD/data/ggal/transcriptome.fa -i transcript-index

Now check the content of the transcript-index folder by entering the command:

ls -la transcript-index
Note that the files created by the Docker execution are owned by root.

4.1.9. Upload the container in the Docker Hub (bonus)

Publish your container in the Docker Hub to share it with other people.

Create an account on the hub.docker.com website. Then from your shell terminal run the following command, entering the user name and password you specified when registering in the Hub:

docker login

Tag the image with your Docker user name account:

docker tag my-image <user-name>/my-image

Finally push it to the Docker Hub:

docker push <user-name>/my-image

After that anyone will be able to download it by using the command:

docker pull <user-name>/my-image

Note how after a pull and push operation, Docker prints the container digest number e.g.

Digest: sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266
Status: Downloaded newer image for nextflow/rnaseq-nf:latest

This is a unique and immutable identifier that can be used to reference a container image in an unambiguous manner. For example:

docker pull nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266

4.1.10. Run a Nextflow script using a Docker container

The simplest way to run a Nextflow script with a Docker image is using the -with-docker command-line option:

nextflow run script2.nf -with-docker my-image

As seen in the last section, you can also configure the Nextflow config file (nextflow.config) to select which container to use instead of having to specify it as a command-line argument every time.
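
A minimal sketch of such a configuration, using the container and setting already shown in this training:

// nextflow.config (sketch)
process.container = 'nextflow/rnaseq-nf'
docker.enabled    = true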

4.2. Singularity

Singularity is a container runtime designed to work in high-performance computing data centers, where the usage of Docker is generally not allowed due to security constraints.

Singularity implements a container execution model similar to Docker. However, it uses a completely different implementation design.

A Singularity container image is archived as a plain file that can be stored in a shared file system and accessed by many computing nodes managed using a batch scheduler.

Singularity will not work with Gitpod. If you wish to try this section, please do it locally, or on an HPC.

4.2.1. Create a Singularity image

Singularity images are created using a Singularity file in a similar manner to Docker but using a different syntax.

Bootstrap: docker
From: debian:stretch-slim

%environment
export PATH=$PATH:/usr/games/

%labels
AUTHOR <your name>

%post

apt-get update && apt-get install -y locales-all curl cowsay
curl -sSL https://github.com/COMBINE-lab/salmon/releases/download/v1.0.0/salmon-1.0.0_linux_x86_64.tar.gz | tar xz \
&& mv /salmon-*/bin/* /usr/bin/ \
&& mv /salmon-*/lib/* /usr/lib/

Once you have saved the Singularity file, you can create the image with the following command:

sudo singularity build my-image.sif Singularity

Note: the build command requires sudo permissions. A common workaround consists of building the image on a local workstation and then deploying it in the cluster by copying the image file.

4.2.2. Running a container

Once done, you can run your container with the following command:

singularity exec my-image.sif cowsay 'Hello Singularity'

By using the shell command you can enter the container in interactive mode. For example:

singularity shell my-image.sif

Once in the container instance run the following commands:

touch hello.txt
ls -la
Note how the files on the host environment are shown. Singularity automatically mounts the host $HOME directory and uses the current work directory.

4.2.3. Import a Docker image

An easier way to create a Singularity container, without requiring sudo permission and boosting the containers' interoperability, is to import a Docker container image by pulling it directly from a Docker registry. For example:

singularity pull docker://debian:stretch-slim

The above command automatically downloads the Debian Docker image and converts it to a Singularity image in the current directory, named after the image and tag (e.g. debian_stretch-slim.sif).

4.2.4. Run a Nextflow script using a Singularity container

Nextflow allows the transparent usage of Singularity containers as easily as with Docker.

Simply use the -with-singularity command-line option to enable the Singularity engine in place of Docker:

nextflow run script7.nf -with-singularity nextflow/rnaseq-nf

As before, the Singularity container can also be provided in the Nextflow config file. We’ll see how to do this later.
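
As a minimal sketch (assuming the same container image), the equivalent configuration would look like:

// nextflow.config (sketch)
process.container   = 'nextflow/rnaseq-nf'
singularity.enabled = true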

4.2.5. The Singularity Container Library

The authors of Singularity, SyLabs, have their own repository of Singularity containers.

In the same way that we can push Docker images to Docker Hub, we can upload Singularity images to the Singularity Library.

4.3. Conda/Bioconda packages

Conda is a popular package and environment manager. The built-in support for Conda allows Nextflow pipelines to automatically create and activate the Conda environment(s), given the dependencies specified by each process.

In this Gitpod environment, conda is already installed.

4.3.1. Using conda

A Conda environment is defined using a YAML file, which lists the required software packages. The first thing you need to do is to initiate conda for shell interaction, and then open a new terminal by running bash.

conda init
bash

Then write your YAML file (to env.yml). For example:

name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2

Given the recipe file, the environment is created using the command shown below:

conda env create --file env.yml

You can check the environment was created successfully with the command shown below:

conda env list

This should look something like this:

# conda environments:
#
base                  *  /opt/conda
nf-tutorial              /opt/conda/envs/nf-tutorial

To enable the environment, you can use the activate command:

conda activate nf-tutorial

Nextflow is able to manage the activation of a Conda environment when its directory is specified using the -with-conda option (using the same path shown by the conda env list command). For example:

nextflow run script7.nf -with-conda /opt/conda/envs/nf-tutorial
When creating a Conda environment with a YAML recipe file, Nextflow automatically downloads the required dependencies, builds the environment and activates it.

This makes it easier to manage different environments for the processes in the workflow script.

See the docs for details.
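
Nextflow also supports a per-process conda directive; a minimal sketch (the process name is hypothetical, and the package spec is taken from the recipe above):

process SALMON_VERSION {
    conda 'bioconda::salmon=1.5.1'

    output:
    stdout

    script:
    """
    salmon --version
    """
}

workflow {
    SALMON_VERSION().view()
}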

4.3.2. Create and use conda-like environments using micromamba

Another way to build conda-like environments is through a Dockerfile and micromamba.

micromamba is a fast and robust package for building small conda-based environments.

This saves having to build a conda environment each time you want to use it (as outlined in previous sections).

To do this, you simply require a Dockerfile and you use micromamba to install the packages. However, a good practice is to have a YAML recipe file like in the previous section, so we’ll do it here too.

name: nf-tutorial
channels:
  - conda-forge
  - defaults
  - bioconda
dependencies:
  - bioconda::salmon=1.5.1
  - bioconda::fastqc=0.11.9
  - bioconda::multiqc=1.12
  - conda-forge::tbb=2020.2

Then, we can write our Dockerfile with micromamba installing the packages from this recipe file.

FROM mambaorg/micromamba:0.25.1

LABEL image.author.name "Your Name Here"
LABEL image.author.email "your@email.here"

COPY --chown=$MAMBA_USER:$MAMBA_USER micromamba.yml /tmp/env.yml

RUN micromamba create -n nf-tutorial

RUN micromamba install -y -n nf-tutorial -f /tmp/env.yml && \
    micromamba clean --all --yes

ENV PATH /opt/conda/envs/nf-tutorial/bin:$PATH

The above Dockerfile takes the parent image mambaorg/micromamba, creates a conda environment using micromamba, and installs salmon, fastqc and multiqc into it.

Exercise

Try executing the RNA-Seq pipeline from earlier (script7.nf). Start by building your own micromamba Docker image (from the Dockerfile above), push it to your Docker Hub repo, and direct Nextflow to run from this container (by changing your nextflow.config).

Building a Docker container and pushing to your personal repo can take >10 minutes.
For an overview of steps to take, click here:
  1. Make a file called Dockerfile in the current directory (with the code above).

  2. Build the image: docker build -t my-image . (don’t forget the '.').

  3. Publish the docker image to your online docker account.

    Something similar to the following, with <myrepo> replaced with your own Docker ID, without '<' and '>' characters!

    "my-image" could be replaced with any name you choose. As good practice, choose something memorable and ensure the name matches the name you used in the previous command.
    docker login
    docker tag my-image <myrepo>/my-image
    docker push <myrepo>/my-image
  4. Add the image file name to the nextflow.config file.

    e.g. remove the following from the nextflow.config:

    process.container = 'nextflow/rnaseq-nf'

    and replace with:

    process.container = '<myrepo>/my-image'
  5. Try running Nextflow, e.g.:

    nextflow run script7.nf -with-docker

Nextflow should now be able to find salmon to run the process.

4.4. BioContainers

Another useful resource linking together Bioconda and containers is the BioContainers project. BioContainers is a community initiative that provides a registry of container images for every Bioconda recipe.

So far, we’ve seen how to install packages with conda and micromamba, both locally and within containers. With BioContainers, you don’t need to create your own container image for the tools you want, and you don’t need to use conda or micromamba to install the packages. It already provides you with a Docker image containing the programs you want installed. For example, you can get the container image of fastqc using BioContainers with:

docker pull biocontainers/fastqc:v0.11.5

You can check the registry for the packages you want on the BioContainers official website.

Contrary to other registries that will pull the latest image when no tag (version) is provided, you must specify a tag when pulling BioContainers (after a colon :, e.g. fastqc:v0.11.5). Check the tags within the registry and pick the one that better suits your needs.

Exercise

During the earlier RNA-Seq tutorial (script2.nf), we created an index with the salmon tool. Given we do not have salmon installed locally in the machine provided by Gitpod, we had to either run it with -with-conda or -with-docker. Your task now is to run it again with -with-docker, but without having to create your own Docker container image. Instead, use the BioContainers image for salmon 1.7.0.

For the result, click here:
nextflow run script2.nf -with-docker quay.io/biocontainers/salmon:1.7.0--h84f40af_0

Bonus Exercise

Change the process directives in script5.nf or the nextflow.config file to make the pipeline automatically use BioContainers when using salmon, or fastqc.

HINT: Temporarily comment out the line process.container = 'nextflow/rnaseq-nf' in the nextflow.config file to make sure the processes are using the BioContainers that you set, and not the container image we have been using in this training. With these changes, you should be able to run the pipeline with BioContainers by running the following in the command line:

For the result, click here:
nextflow run script5.nf

with the following container directives for each process:

process FASTQC {
    container 'biocontainers/fastqc:v0.11.5'
    tag "FASTQC on $sample_id"
...

and

process QUANTIFICATION {
    tag "Salmon on $sample_id"
    container 'quay.io/biocontainers/salmon:1.7.0--h84f40af_0'
    publishDir params.outdir, mode:'copy'
...

Check the .command.run file in the work directory and ensure that the run line contains the correct BioContainers image.

Bonus content

Details

You can have more complex definitions within your process block by letting the appropriate container image or conda package be used depending on whether the user selected Singularity, Docker or Conda. You can click here for more information and here for an example.
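
A minimal sketch of that pattern, giving the same process both a container and a conda directive (the image tag and package version are illustrative); whichever engine is enabled in nextflow.config will be used:

process FASTQC_EXAMPLE {
    container 'quay.io/biocontainers/fastqc:0.11.9--0'
    conda 'bioconda::fastqc=0.11.9'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -q ${reads}
    """
}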

5. Channels

Channels are a key data structure of Nextflow that allow the implementation of reactive-functional oriented computational workflows, based on the Dataflow programming paradigm.

They are used to logically connect tasks to each other or to implement functional style data transformations.

[Image: a channel emitting files to a process]

5.1. Channel types

Nextflow distinguishes two different kinds of channels: queue channels and value channels.

5.1.1. Queue channel

A queue channel is an asynchronous unidirectional FIFO queue that connects two processes or operators.

  • asynchronous means that operations are non-blocking.

  • unidirectional means that data flows from a producer to a consumer.

  • FIFO means that the data is guaranteed to be delivered in the same order as it is produced. First In, First Out.

A queue channel is implicitly created by process output definitions or using channel factories such as Channel.of or Channel.fromPath.

Try the following snippets:

ch = Channel.of(1,2,3)
println(ch)     (1)
ch.view()       (2)
1 Use the built-in print line function println to print the ch channel object
2 Applying the view operator to the ch channel prints each item emitted by the channel

Exercise

Try to execute this snippet. You can do that by creating a new .nf file or by editing an already existing .nf file.

ch = Channel.of(1,2,3)
ch.view()

5.1.2. Value channels

A value channel (a.k.a. singleton channel) by definition is bound to a single value and it can be read unlimited times without consuming its contents. A value channel is created using the value factory method or by operators returning a single value, such as first, last, collect, count, min, max, reduce, and sum.
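
For instance, a minimal sketch of operators that return value channels:

ch = Channel.of(1, 2, 3)
ch.first().view()   // value channel bound to 1
ch.count().view()   // value channel bound to 3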

To better understand the difference between value and queue channels, save the snippet below as example.nf.

ch1 = Channel.of(1,2,3)
ch2 = Channel.of(1)

process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    SUM(ch1,ch2).view()
}

When you run the script, it prints only 2, as you can see below:

2

To understand why, we can inspect the queue channel. Running Nextflow with DSL1 gives us a more explicit view of what is happening behind the scenes.

ch1 = Channel.of(1)
println ch1

Then:

nextflow run example.nf -dsl1

This will print:

DataflowQueue(queue=[DataflowVariable(value=1), DataflowVariable(value=groovyx.gpars.dataflow.operator.PoisonPill@34be065a)])

We have the value 1 as the single element of our queue channel and a poison pill, which will tell the process that there’s nothing left to be consumed. That’s why we only have one output for the example above, which is 2. Let’s inspect a value channel now.

ch1 = Channel.value(1)
println ch1

Then:

nextflow run example.nf -dsl1

This will print:

DataflowVariable(value=1)

There is no poison pill, and that’s why we get a different output with the code below, where ch2 is turned into a value channel through the first operator.

ch1 = Channel.of(1,2,3)
ch2 = Channel.of(1)

process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    SUM(ch1,ch2.first()).view()
}
The output will be similar to:

4
3
2

In many situations, Nextflow will also implicitly convert variables to value channels when they are used in a process invocation. For example, when you invoke a process with a pipeline parameter (params.example) which has a string value, it is automatically cast into a value channel.
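
A minimal sketch of this implicit conversion (the SAY process is hypothetical):

params.greeting = 'Hello'

process SAY {
    input:
    val x

    output:
    stdout

    script:
    """
    echo '$x'
    """
}

workflow {
    // params.greeting is a plain string; Nextflow wraps it in a value channel
    SAY(params.greeting).view()
}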

5.2. Channel factories

Channel factories are Nextflow methods for creating channels, each with its own expected inputs and behavior.

5.2.1. value

The value factory method is used to create a value channel. An optional (non-null) argument can be specified to bind the channel to a specific value. For example:

1
2
3
ch1 = Channel.value()                 (1)
ch2 = Channel.value( 'Hello there' )  (2)
ch3 = Channel.value( [1,2,3,4,5] )    (3)
1 Creates an empty value channel
2 Creates a value channel and binds a string to it
3 Creates a value channel and binds a list object to it that will be emitted as a sole emission

5.2.2. of

The factory Channel.of allows the creation of a queue channel with the values specified as arguments.

1
2
ch = Channel.of( 1, 3, 5, 7 )
ch.view{ "value: $it" }

The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as a parameter in the of method. Thus the second line will print the following:

value: 1
value: 3
value: 5
value: 7

The method Channel.of works in a similar manner to Channel.from (which is now deprecated), fixing some inconsistent behaviors of the latter and providing better handling when specifying a range of values. For example, the following works with a range from 1 to 23:

1
2
3
Channel
  .of(1..23, 'X', 'Y')
  .view()

5.2.3. fromList

The method Channel.fromList creates a channel emitting the elements provided by a list object specified as an argument:

1
2
3
4
5
list = ['hello', 'world']

Channel
  .fromList(list)
  .view()

5.2.4. fromPath

The fromPath factory method creates a queue channel emitting one or more files matching the specified glob pattern.

Channel.fromPath( './data/meta/*.csv' )

This example creates a channel and emits as many items as there are files with a csv extension in the ./data/meta folder. Each element is a file object implementing the Path interface.

Two asterisks, i.e. **, work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.

Table 1. Available options

Name           Description
glob           When true interprets characters *, ?, [] and {} as glob wildcards, otherwise handles them as normal characters (default: true)
type           Type of path returned, either file, dir or any (default: file)
hidden         When true includes hidden files in the resulting paths (default: false)
maxDepth       Maximum number of directory levels to visit (default: no limit)
followLinks    When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true)
relative       When true returned paths are relative to the top-most common directory (default: false)
checkIfExists  When true throws an exception when the specified path does not exist in the file system (default: false)

Learn more about the glob pattern syntax at this link.
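For illustration, the sketch below (assuming the data/ggal folder shipped with this training material) combines the two-star wildcard with a curly-bracket sub-pattern to match files ending in either .fq or .fastq in data/ggal and any of its subdirectories:

Channel
  .fromPath( './data/ggal/**.{fq,fastq}' )
  .view()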

Exercise

Use the Channel.fromPath method to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory, in addition to hidden files. Then print the file names.

Click here for the answer:
1
2
Channel.fromPath( './data/ggal/**.fq' , hidden:true)
  .view()

5.2.5. fromFilePairs

The fromFilePairs method creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples, in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).

1
2
3
Channel
  .fromFilePairs('./data/ggal/*_{1,2}.fq')
  .view()

It will produce an output similar to the following:

[liver, [/user/nf-training/data/ggal/liver_1.fq, /user/nf-training/data/ggal/liver_2.fq]]
[gut, [/user/nf-training/data/ggal/gut_1.fq, /user/nf-training/data/ggal/gut_2.fq]]
[lung, [/user/nf-training/data/ggal/lung_1.fq, /user/nf-training/data/ggal/lung_2.fq]]
The glob pattern must contain at least a star wildcard character.

Table 2. Available options

Name           Description
type           Type of paths returned, either file, dir or any (default: file)
hidden         When true includes hidden files in the resulting paths (default: false)
maxDepth       Maximum number of directory levels to visit (default: no limit)
followLinks    When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true)
size           Defines the number of files each emitted item is expected to hold (default: 2). Set to -1 for any.
flat           When true the matching files are produced as sole elements in the emitted tuples (default: false)
checkIfExists  When true throws an exception if the specified path does not exist in the file system (default: false)

Exercise

Use the fromFilePairs method to create a channel emitting all pairs of fastq reads in the data/ggal/ directory and print them. Then use the flat:true option and compare the output with the previous execution.

Click here for the answer:

Use the following, with or without 'flat:true':

1
2
Channel.fromFilePairs( './data/ggal/*_{1,2}.fq', flat:true)
  .view()

Then check the square brackets around the file names, to see the difference with flat.

5.2.6. fromSRA

The Channel.fromSRA method makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.

The query can be project ID(s) or accession number(s) supported by the NCBI ESearch API.

This function now requires an API key, which you can only get by logging into your NCBI account.
For help with NCBI login and key acquisition, click here:
  1. Go to: www.ncbi.nlm.nih.gov/

  2. Click the top right "Log in" button to sign into NCBI. Follow their instructions.

  3. Once into your account, click the button at the top right, usually your ID.

  4. Go to Account settings

  5. Scroll down to the API Key Management section.

  6. Click on "Create an API Key".

  7. The page will refresh and the key will be displayed where the button was. Copy your key.

You also need to use the latest edge version of Nextflow. Check nextflow -version; it should say -edge. If not, download the newest Nextflow version following the instructions linked here.

For example, the following snippet will print the contents of an NCBI project ID:

1
2
3
4
5
params.ncbi_api_key = '<Your API key here>'

Channel
  .fromSRA(['SRP073307'], apiKey: params.ncbi_api_key)
  .view()
Replace <Your API key here> with your API key.

This should print:

[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]]
[SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]]
[SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]]
[SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]]
(remaining omitted)

Multiple accession IDs can be specified using a list object:

1
2
3
4
ids = ['ERR908507', 'ERR908506', 'ERR908505']
Channel
  .fromSRA(ids, apiKey: params.ncbi_api_key)
  .view()
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]]
[ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]]
[ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Read pairs are implicitly managed and are returned as a list of files.

It’s straightforward to use this channel as an input using the usual Nextflow syntax. The code below creates a channel containing two samples from a public SRA study and runs FASTQC on the resulting files. See:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
params.ncbi_api_key = '<Your API key here>'

params.accession = ['ERR908507', 'ERR908506']

process fastqc {
  input:
  tuple val(sample_id), path(reads_file)

  output:
  path("fastqc_${sample_id}_logs")

  script:
  """
  mkdir fastqc_${sample_id}_logs
  fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
  """
}

workflow {
  reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)
  fastqc(reads)
}

If you want to run the pipeline above and do not have fastqc installed on your machine, don’t forget what you learned in the previous section. Run this pipeline with -with-docker biocontainers/fastqc:v0.11.5, for example.

5.2.7. Text files

The splitText operator allows you to split multi-line strings or text file items emitted by a source channel into chunks containing n lines, which will be emitted by the resulting channel. See:

1
2
3
4
Channel
  .fromPath('data/meta/random.txt') (1)
  .splitText()                      (2)
  .view()                           (3)
1 Instructs Nextflow to make a channel from the path "data/meta/random.txt".
2 The splitText operator splits each item into chunks of one line by default.
3 View contents of the channel.

You can define the number of lines in each chunk by using the parameter by, as shown in the following example:

1
2
3
4
5
6
7
Channel
  .fromPath('data/meta/random.txt')
  .splitText( by: 2 )
  .subscribe {
    print it;
    print "--- end of the chunk ---\n"
  }
The subscribe operator permits the execution of user-defined functions each time a new value is emitted by the source channel.

An optional closure can be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and transform them into capital letters:

1
2
3
4
Channel
  .fromPath('data/meta/random.txt')
  .splitText( by: 10 ) { it.toUpperCase() }
  .view()

You can also keep a running count of the lines as they are emitted:

1
2
3
4
5
6
count=0

Channel
  .fromPath('data/meta/random.txt')
  .splitText()
  .view { "${count++}: ${it.toUpperCase().trim()}" }

Finally, you can also use the operator on plain files (outside of the channel context):

1
2
3
4
5
6
  def f = file('data/meta/random.txt')
  def lines = f.splitText()
  def count=0
  for( String row : lines ) {
    log.info "${count++} ${row.toUpperCase()}"
  }

5.2.8. Comma separated values (.csv)

The splitCsv operator allows you to parse CSV-formatted text items emitted by a channel.

It then splits them into records or groups them as a list of records with a specified length.

In the simplest case, just apply the splitCsv operator to a channel emitting CSV-formatted text files or text entries. For example, to view only the first and fourth columns:

1
2
3
4
5
Channel
  .fromPath("data/meta/patients_1.csv")
  .splitCsv()
  // row is a list object
  .view { row -> "${row[0]},${row[3]}" }

When the CSV begins with a header line defining the column names, you can specify the parameter header: true which allows you to reference each value by its column name, as shown in the following example:

1
2
3
4
5
Channel
  .fromPath("data/meta/patients_1.csv")
  .splitCsv(header: true)
  // row is a map object keyed by column name
  .view { row -> "${row.patient_id},${row.num_samples}" }

Alternatively, you can provide custom header names by specifying a list of strings in the header parameter as shown below:

1
2
3
4
5
Channel
  .fromPath("data/meta/patients_1.csv")
  .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'] )
  // row is a map object keyed by the given header names
  .view { row -> "${row.col1},${row.col4}" }

You can also process multiple csv files at the same time:

1
2
3
4
Channel
  .fromPath("data/meta/patients_*.csv") // <-- just use a pattern
  .splitCsv(header:true)
  .view { row -> "${row.patient_id}\t${row.num_samples}" }
Notice that you can change the output format simply by adding a different delimiter.

Finally, you can also operate on csv files outside the channel context:

1
2
3
4
5
def f = file('data/meta/patients_1.csv')
  def lines = f.splitCsv()
  for( List row : lines ) {
    log.info "${row[0]} -- ${row[2]}"
  }

Exercise

Try inputting fastq reads into the RNA-Seq workflow from earlier using .splitCsv.

Click here for the answer:

Add a csv text file named "fastq.csv" containing the following, as an example input:

gut,/workspace/nf-training-public/nf-training/data/ggal/gut_1.fq,/workspace/nf-training-public/nf-training/data/ggal/gut_2.fq

Then replace the input channel for the reads in script7.nf, changing the following lines:

1
2
3
Channel
  .fromFilePairs( params.reads, checkIfExists: true )
  .set { read_pairs_ch }

To a splitCsv channel factory input:

1
2
3
4
5
Channel
  .fromPath("fastq.csv")
  .splitCsv()
  .view () { row -> "${row[0]},${row[1]},${row[2]}" }
  .set { read_pairs_ch }

Finally, change the cardinality of the processes that use the input data. For example, for the quantification process, change it from:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process QUANTIFICATION {
  tag "$sample_id"

  input:
  path salmon_index
  tuple val(sample_id), path(reads)

  output:
  path sample_id, emit: quant_ch

  script:
  """
  salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
  """
}

To:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process QUANTIFICATION {
  tag "$sample_id"

  input:
  path salmon_index
  tuple val(sample_id), path(reads1), path(reads2)

  output:
  path sample_id, emit: quant_ch

  script:
  """
  salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads1} -2 ${reads2} -o $sample_id
  """
}

Repeat the above for the fastqc step.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process FASTQC {
  tag "FASTQC on $sample_id"

  input:
  tuple val(sample_id), path(reads1), path(reads2)

  output:
  path "fastqc_${sample_id}_logs"

  script:
  """
  mkdir fastqc_${sample_id}_logs
  fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads1} ${reads2}
  """
}

Now the workflow should run from a CSV file.

5.2.9. Tab separated values (.tsv)

Parsing tsv files works in a similar way; simply add the sep:'\t' option in the splitCsv context:

1
2
3
4
5
6
Channel
  .fromPath("data/meta/regions.tsv", checkIfExists:true)
  // use `sep` option to parse TAB separated files
  .splitCsv(sep:'\t')
  // row is a list object
  .view()

Exercise

Try using the tab separation technique on the file "data/meta/regions.tsv", but print just the first column, and remove the header.

Answer:
1
2
3
4
5
6
Channel
  .fromPath("data/meta/regions.tsv", checkIfExists:true)
  // use `sep` option to parse TAB separated files
  .splitCsv(sep:'\t', header:true )
  // row is a map object keyed by column name
  .view { row -> "${row.patient_id}" }

5.3. More complex file formats

5.3.1. JSON

We can also easily parse the JSON file format using the following Groovy snippet:

1
2
3
4
5
6
7
8
9
import groovy.json.JsonSlurper

def f = file('data/meta/regions.json')
def records = new JsonSlurper().parse(f)


for( def entry : records ) {
  log.info "$entry.patient_id -- $entry.feature"
}
When using an older version of the JSON parser, you may need to replace parse(f) with parseText(f.text)

5.3.2. YAML

A similar approach can also be used to parse YAML files:

1
2
3
4
5
6
7
8
9
import org.yaml.snakeyaml.Yaml

def f = file('data/meta/regions.yml')
def records = new Yaml().load(f)


for( def entry : records ) {
  log.info "$entry.patient_id -- $entry.feature"
}

5.3.3. Storage of parsers into modules

The best way to store parser scripts is to keep them in a Nextflow module file.

See the following Nextflow script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
include { parseJsonFile } from './modules/parsers.nf'

process foo {
  input:
  tuple val(meta), path(data_file)

  """
  echo your_command $meta.region_id $data_file
  """
}

workflow {
  Channel.fromPath('data/meta/regions*.json') \
    | flatMap { parseJsonFile(it) } \
    | map { entry -> tuple(entry,"/some/data/${entry.patient_id}.txt") } \
    | foo
}

For this script to work, a module file called parsers.nf needs to be created and stored in a modules folder in the current directory.

The parsers.nf file should contain the parseJsonFile function.

Nextflow will use this as a custom function within the workflow scope.
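As a guide, a minimal sketch of what ./modules/parsers.nf could contain, reusing the JsonSlurper approach shown above (the function body is illustrative; only the function name must match the include statement):

import groovy.json.JsonSlurper

def parseJsonFile(json_file) {
    // parse the JSON file and return the list of records it contains
    def records = new JsonSlurper().parse(json_file)
    return records
}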

You will learn more about module files later in section 9.1 of this tutorial.

6. Processes

In Nextflow, a process is the basic computing primitive to execute foreign functions (i.e., custom scripts or tools).

The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets.

The process name is commonly written in upper case by convention.

A basic process, only using the script definition block, looks like the following:

1
2
3
4
5
6
7
process SAYHELLO {

  script:
  """
  echo 'Hello world!'
  """
}

In more complex examples, the process body can contain up to five definition blocks:

  1. Directives are initial declarations that define optional settings

  2. Input defines the expected input file(s) and the channel from where to find them

  3. Output defines the expected output file(s) and the channel to send the data to

  4. When is an optional clause statement to allow conditional processes

  5. Script is a string statement that defines the command to be executed by the process

The full process syntax is defined as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
process < name > {

  [ directives ]        (1)

  input:                (2)
  < process inputs >

  output:               (3)
  < process outputs >

  when:                 (4)
  < condition >

  [script|shell|exec]:  (5)
  """
  < user script to be executed >
  """
}
1 Zero, one or more process directives
2 Zero, one or more process inputs
3 Zero, one or more process outputs
4 An optional boolean conditional to trigger the process execution
5 The command to be executed

6.1. Script

The script block is a string statement that defines the command to be executed by the process.

A process can execute only one script block. It must be the last statement when the process contains input and output declarations.

The script block can be a single or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning over multiple lines. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
process EXAMPLE {

  script:
  """
  echo 'Hello world!\nHola mundo!\nCiao mondo!\nHallo Welt!' > file
  cat file | head -n 1 | head -c 5 > chunk_1.txt
  gzip -c chunk_1.txt  > chunk_archive.gz
  """
}

workflow {
  EXAMPLE()
}

By default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
process PYSTUFF {

  script:
  """
  #!/usr/bin/env python

  x = 'Hello'
  y = 'world!'
  print ("%s - %s" % (x,y))
  """
}

workflow {
  PYSTUFF()
}
Multiple programming languages can be used within the same workflow script. However, for large chunks of code it is better to save them into separate files and invoke them from the process script. One can store the specific scripts in the ./bin/ folder.
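For example, a minimal sketch of this pattern (myscript.py is a hypothetical executable script placed in ./bin/ and made executable with chmod +x; Nextflow automatically adds the ./bin/ folder to the PATH of each task):

process RUNSCRIPT {

  input:
  path sample

  script:
  """
  myscript.py --input $sample
  """
}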

6.1.1. Script parameters

Script parameters (params) can be defined dynamically using variable values. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
params.data = 'World'

process FOO {

  script:
  """
  echo Hello $params.data
  """
}

workflow {
  FOO()
}
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information.
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character.
1
2
3
4
5
6
7
8
9
10
11
process FOO {

  script:
  """
  echo "The current directory is \$PWD"
  """
}

workflow {
  FOO()
}

It can be tricky to write a script that uses many Bash variables. One possible alternative is to use a script string delimited by single-quote characters:

1
2
3
4
5
6
7
8
9
10
11
process BAR {

  script:
  """
  echo $PATH | tr : '\\n'
  """
}

workflow {
  BAR()
}

However, this blocks the use of Nextflow variables in the command script.

Another alternative is to use a shell statement instead of script and use a different syntax for Nextflow variables, e.g., !{..}. This allows the use of both Nextflow and Bash variables in the same script.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
params.data = 'le monde'

process BAZ {

  shell:
  '''
  X='Bonjour'
  echo $X !{params.data}
  '''
}

workflow {
  BAZ()
}

6.1.2. Conditional script

The process script can also be defined in a completely dynamic manner using an if statement or any other expression for evaluating a string value. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.compress = 'gzip'
params.file2compress = "$baseDir/data/ggal/transcriptome.fa"

process FOO {

  input:
  path file

  script:
  if( params.compress == 'gzip' )
    """
    gzip -c $file > ${file}.gz
    """
  else if( params.compress == 'bzip2' )
    """
    bzip2 -c $file > ${file}.bz2
    """
  else
    throw new IllegalArgumentException("Unknown compression method $params.compress")
}

workflow {
  FOO(params.file2compress)
}

6.2. Inputs

Nextflow processes are isolated from each other but can communicate between themselves by sending values through channels.

Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:

[Image: channel process]

The input block defines which channels the process is expecting to receive data from. You can only define one input block at a time, and it must contain one or more input declarations.

The input block follows the syntax shown below:

1
2
input:
  <input qualifier> <input name>

6.2.1. Input values

The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
num = Channel.of( 1, 2, 3 )

process BASICEXAMPLE {
  debug true

  input:
  val x

  script:
  """
  echo process job $x
  """
}

workflow {
  myrun = BASICEXAMPLE(num)
}

In the above example the process is executed three times, each time a value is received from the channel num and used to process the script. Thus, it results in an output similar to the one shown below:

process job 3
process job 1
process job 2
The channel guarantees that items are delivered in the same order as they have been sent - but - since the process is executed in a parallel manner, there is no guarantee that they are processed in the same order as they are received.

6.2.2. Input files

The path qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage it in the process execution directory, and it can be accessed in the script by using the name specified in the input declaration.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )

process FOO {
  debug true

  input:
  path 'sample.fastq'

  script:
  """
  ls sample.fastq
  """
}

workflow {
  result = FOO(reads)
}

The input file name can also be defined using a variable reference as shown below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )

process FOO {
  debug true

  input:
  path sample

  script:
  """
  ls  $sample
  """
}

workflow {
  result = FOO(reads)
}

The same syntax is also able to handle more than one input file in the same execution and only requires changing the channel composition.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
reads = Channel.fromPath( 'data/ggal/*.fq' )

process FOO {
  debug true

  input:
  path sample

  script:
  """
  ls -lh $sample
  """
}

workflow {
  FOO(reads.collect())
}
In the past, the file qualifier was used for files, but the path qualifier should be preferred over file to handle process input files when using Nextflow 19.10.0 or later. When a process declares an input file, the corresponding channel elements must be file objects, i.e. created with the file helper function or by a file-specific channel factory (e.g., Channel.fromPath or Channel.fromFilePairs).

Exercise

Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq followed by a process that concatenates them into a single file and prints the first 20 lines.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
params.reads = "$baseDir/data/ggal/*_1.fq"

Channel
  .fromPath( params.reads )
  .set { read_ch }

process CONCATENATE {
  tag "Concat all files"

  input:
  path '*'

  output:
  path 'top_20_lines'

  script:
  """
  cat * > concatenated.txt
  head -n 20 concatenated.txt > top_20_lines
  """
}

workflow {
  concat_ch = CONCATENATE(read_ch.collect())
  concat_ch.view()
}

6.2.3. Combine input channels

A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to understand how channel contents and their semantics affect the execution of a process.

Consider the following example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
ch1 = Channel.of(1,2,3)
ch2 = Channel.of('a','b','c')

process FOO {
  debug true

  input:
  val x
  val y

  script:
   """
   echo $x and $y
   """
}

workflow {
  FOO(ch1, ch2)
}

Both channels emit three values, therefore the process is executed three times, each time with a different pair:

  • (1, a)

  • (2, b)

  • (3, c)

What is happening is that the process waits until there’s a complete input configuration, i.e., it receives an input value from all the channels declared as input.

When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.

This means channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even if there are other values in other channels.

So what happens when channels do not have the same cardinality (i.e., they emit a different number of elements)?

For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
input1 = Channel.of(1,2)
input2 = Channel.of('a','b','c','d')

process FOO {
  debug true

  input:
  val x
  val y

  script:
   """
   echo $x and $y
   """
}

workflow {
  FOO(input1, input2)
}

In the above example, the process is only executed twice because the process stops when a channel has no more data to be processed.

However, what happens if you replace the channel feeding x with a value channel?

Compare the previous example with the following one:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
input1 = Channel.value(1)
input2 = Channel.of('a','b','c')

process BAR {
  debug true

  input:
  val x
  val y

  script:
   """
   echo $x and $y
   """
}

workflow {
  BAR(input1, input2)
}
The output should look something like:
1 and b
1 and a
1 and c

This is because value channels can be consumed multiple times and do not affect process termination.

Exercise

Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same data/ggal/transcriptome.fa in each execution.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"

Channel
    .fromPath( params.reads )
    .set { read_ch }

process COMMAND {
  tag "Run_command"

  input:
  path reads
  path transcriptome

  output:
  path result

  script:
  """
  echo your_command $reads $transcriptome > result
  """
}

workflow {
  concat_ch = COMMAND(read_ch, params.transcriptome_file)
  concat_ch.view()
}

6.2.4. Input repeaters

The each qualifier allows you to repeat the execution of a process for each item in a collection every time new data is received. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
sequences = Channel.fromPath('data/prots/*.tfa')
methods = ['regular', 'espresso', 'psicoffee']

process ALIGNSEQUENCES {
  debug true

  input:
  path seq
  each mode

  script:
  """
  echo t_coffee -in $seq -mode $mode
  """
}

workflow {
  ALIGNSEQUENCES(sequences, methods)
}

In the above example, every time a file of sequences is received as an input by the process, it executes three tasks, each running a different alignment method set as a mode variable. This is useful when you need to repeat the same task for a given set of parameters.

Exercise

Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq and repeat the same task with both salmon and kallisto.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
params.reads = "$baseDir/data/ggal/*_1.fq"
params.transcriptome_file = "$baseDir/data/ggal/transcriptome.fa"
methods= ['salmon', 'kallisto']

Channel
    .fromPath( params.reads )
    .set { read_ch }

process COMMAND {
  tag "Run_command"

  input:
  path reads
  path transcriptome
  each mode

  output:
  path result

  script:
  """
  echo $mode $reads $transcriptome > result
  """
}

workflow {
  concat_ch = COMMAND(read_ch , params.transcriptome_file, methods)
  concat_ch
      .view { "To run : ${it.text}" }
}

6.3. Outputs

The output declaration block defines the channels used by the process to send out the results produced.

Only one output block, which can contain one or more output declarations, can be defined. The output block follows the syntax shown below:

1
2
output:
  <output qualifier> <output name> , emit: <output channel>
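The emit: option is optional. When used, it assigns a name to the output channel so it can be accessed in the workflow scope as <process_name>.out.<channel_name>. A minimal sketch (process and file names are illustrative):

process GREET {

  output:
  path 'greeting.txt', emit: greeting

  script:
  """
  echo Hello > greeting.txt
  """
}

workflow {
  GREET()
  GREET.out.greeting.view()
}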

6.3.1. Output values

The val qualifier specifies a defined value in the script context. Values are frequently defined in the input and/or output declaration blocks, as shown in the following example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
methods = ['prot','dna', 'rna']

process FOO {

  input:
  val x

  output:
  val x

  script:
  """
  echo $x > file
  """
}

workflow {
  receiver_ch = FOO(Channel.of(methods))
  receiver_ch.view { "Received: $it" }
}

6.3.2. Output files

The path qualifier specifies one or more files produced by the process into the specified channel as an output.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
process RANDOMNUM {

    output:
    path 'result.txt'

    script:
    """
    echo \$RANDOM > result.txt
    """
}


workflow {
  receiver_ch = RANDOMNUM()
  receiver_ch.view { "Received: " + it.text }
}

In the above example the process RANDOMNUM creates a file named result.txt containing a random number.

Since a file parameter using the same name is declared in the output block, the file is sent over the receiver_ch channel when the task is complete. A downstream process declaring the same channel as input will be able to receive it.
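For instance, a sketch of a hypothetical downstream process that consumes the file emitted by RANDOMNUM (replacing the workflow block shown above):

process PRINTNUM {
  debug true

  input:
  path result

  script:
  """
  cat $result
  """
}

workflow {
  receiver_ch = RANDOMNUM()
  PRINTNUM(receiver_ch)
}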

6.3.3. Multiple output files

When an output file name contains a wildcard character (* or ?) it is interpreted as a glob path matcher. This allows us to capture multiple files into a list object and output them as a sole emission. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
process SPLITLETTERS {

    output:
    path 'chunk_*'

    """
    printf 'Hola' | split -b 1 - chunk_
    """
}

workflow {
    letters = SPLITLETTERS()
    letters
        .flatMap()
        .view { "File: ${it.name} => ${it.text}" }
}

Prints the following:

File: chunk_aa => H
File: chunk_ab => o
File: chunk_ac => l
File: chunk_ad => a

Some caveats on glob pattern behavior:

  • Input files are not included in the list of possible matches

  • Glob pattern matches both files and directory paths

  • When a two stars pattern ** is used to recurse across directories, only file paths are matched i.e., directories are not included in the result list.

Exercise

Remove the flatMap operator and see how the output changes. The documentation for the flatMap operator is available at this link.

Click here for the answer:
File: [chunk_aa, chunk_ab, chunk_ac, chunk_ad] => [H, o, l, a]

6.3.4. Dynamic output file names

When an output file name needs to be expressed dynamically, it is possible to define it using a dynamic string that references values defined in the input declaration block or in the script global context. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
species = ['cat','dog', 'sloth']
sequences = ['AGATAG','ATGCTCT', 'ATCCCAA']

Channel.fromList(species)
       .set { species_ch }

process ALIGN {

  input:
  val x
  val seq

  output:
  path "${x}.aln"

  script:
  """
  echo align -in $seq > ${x}.aln
  """
}

workflow {
  genomes = ALIGN( species_ch, sequences )
  genomes.view()
}

In the above example, each time the process is executed an alignment file is produced whose name depends on the actual value of the x input.

6.3.5. Composite inputs and outputs

So far we have seen how to declare multiple input and output channels that can handle one value at a time. However, Nextflow can also handle a tuple of values.

The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process FOO {

  input:
    tuple val(sample_id), path(sample_files)

  output:
    tuple val(sample_id), path('sample.bam')

  script:
  """
    echo your_command_here --reads $sample_id > sample.bam
  """
}

workflow {
  bam_ch = FOO(reads_ch)
  bam_ch.view()
}
In previous versions of Nextflow tuple was called set but it was used the same way with the same semantics.

Exercise

Modify the script of the previous exercise so that the bam file is named as the given sample_id.

Click here for the answer:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process FOO {

  input:
    tuple val(sample_id), path(sample_files)

  output:
    tuple val(sample_id), path("${sample_id}.bam")

  script:
  """
    echo your_command_here --reads $sample_id > ${sample_id}.bam
  """
}

workflow {
  bam_ch = FOO(reads_ch)
  bam_ch.view()
}

6.4. When

The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.

It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
params.dbtype = 'nr'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)

process FIND {
  debug true

  input:
  path fasta
  val type

  when:
  fasta.name =~ /^BB11.*/ && type == 'nr'

  script:
  """
  echo blastp -query $fasta -db nr
  """
}

workflow {
  result = FIND(proteins, params.dbtype)
}

6.5. Directives

Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.

They must be entered at the top of the process body, before any other declaration blocks (i.e., input, output, etc.).

Directives are commonly used to define the amount of computing resources to be used, or other meta settings such as extra configuration or logging information. For example:

1
2
3
4
5
6
7
8
9
10
process FOO {
  cpus 2
  memory 1.GB
  container 'image/name'

  script:
  """
  echo your_command --this --that
  """
}

The complete list of directives is available at this link.

Table 3. Commonly used directives

Name      Description
cpus      Allows you to define the number of (logical) CPUs required by the process’ task.
time      Allows you to define how long a process is allowed to run (e.g., '1h': 1 hour, '1s': 1 second, '1m': 1 minute, '1d': 1 day).
memory    Allows you to define how much memory the process is allowed to use (e.g., '2 GB' is 2 GB). Can also use B, KB, MB, GB and TB.
disk      Allows you to define how much local disk storage the process is allowed to use.
tag       Allows you to associate each process execution with a custom label to make it easier to identify it in the log file or the trace execution report.
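For example, a sketch combining several of the directives from the table above (the resource values and names are illustrative only):

process BAR {
  tag "processing $sample_id"
  cpus 4
  memory '2 GB'
  time '1h'
  disk '10 GB'

  input:
  tuple val(sample_id), path(reads)

  script:
  """
  echo your_command --threads $task.cpus $reads
  """
}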

6.6. Organize outputs

6.6.1. PublishDir directive

Given that each process is executed in a separate temporary work/ folder (e.g., work/f1/850698…; work/g3/239712…; etc.), we may want to save important, non-intermediary, and/or final files in a results folder.

Remember to delete the work folder from time to time to clear your intermediate files and stop them from filling your computer!

To store our workflow result files, we need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
params.outdir = 'my-results'
params.prot = 'data/prots/*.tfa'
proteins = Channel.fromPath(params.prot)


process BLASTSEQ {
    publishDir "$params.outdir/bam_files", mode: 'copy'

    input:
    path fasta

    output:
    path ('*.txt')

    script:
    """
    echo blastp $fasta > ${fasta}_result.txt
    """
}

workflow {
  blast_ch = BLASTSEQ(proteins)
  blast_ch.view()
}

The above example will copy all of the output files created by the BLASTSEQ task into the directory path my-results/bam_files.

The publish directory can be local or remote. For example, output files could be stored using an AWS S3 bucket by using the s3:// prefix in the target path.
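For example, a sketch of the BLASTSEQ process publishing to S3 instead (my-results-bucket is a hypothetical bucket name; this assumes suitable AWS credentials are configured):

process BLASTSEQ {
    publishDir 's3://my-results-bucket/blast_files', mode: 'copy'

    input:
    path fasta

    output:
    path ('*.txt')

    script:
    """
    echo blastp $fasta > ${fasta}_result.txt
    """
}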

6.6.2. Manage semantic sub-directories

You can use more than one publishDir to keep different outputs in separate directories. For example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
params.reads = 'data/reads/*_{1,2}.fq.gz'
params.outdir = 'my-results'

samples_ch = Channel.fromFilePairs(params.reads, flat: true)

process FOO {
  publishDir "$params.outdir/$sampleId/", pattern: '*.fq'
  publishDir "$params.outdir/$sampleId/counts", pattern: "*_counts.txt"
  publishDir "$params.outdir/$sampleId/outlooks", pattern: '*_outlook.txt'

  input:
    tuple val(sampleId), path('sample1.fq.gz'), path('sample2.fq.gz')

  output:
    path "*"

  script:
  """
    < sample1.fq.gz zcat > sample1.fq
    < sample2.fq.gz zcat > sample2.fq

    awk '{s++}END{print s/4}' sample1.fq > sample1_counts.txt
    awk '{s++}END{print s/4}' sample2.fq > sample2_counts.txt

    head -n 50 sample1.fq > sample1_outlook.txt
    head -n 50 sample2.fq > sample2_outlook.txt
  """
}

workflow {
  out_channel = FOO(samples_ch)
}

The above example will create an output structure in the directory my-results, that contains a separate sub-directory for each given sample ID, each containing the folders counts and outlooks.

7. Operators

Operators are methods that allow you to connect channels, transform values emitted by a channel, or apply some user-provided rules.

There are seven main groups of operators, which are described in greater detail within the Nextflow Reference Documentation.

7.1. Basic example

1
2
3
nums = Channel.of(1,2,3,4)         (1)
square = nums.map { it -> it * it }  (2)
square.view()                        (3)
[Image: channel map]
1 Creates a queue channel emitting four values
2 Creates a new channel, transforming each number into its square
3 Prints the channel content

Operators can also be chained to implement custom behaviors, so the previous snippet can also be written as:

1
2
3
4
Channel
    .of(1,2,3,4)
    .map { it -> it * it }
    .view()

7.2. Basic operators

Here we explore some of the most commonly used operators.

7.2.1. view

The view operator prints the items emitted by a channel to the console standard output, appending a new line character to each item. For example:

1
2
3
Channel
    .of('foo', 'bar', 'baz')
    .view()

It prints:

foo
bar
baz

An optional closure parameter can be specified to customize how items are printed. For example:

1
2
3
Channel
    .of('foo', 'bar', 'baz')
    .view { "- $it" }

It prints:

- foo
- bar
- baz

7.2.2. map

The map operator applies a function of your choosing to every item emitted by a channel and returns the items obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:

1
2
3
4
Channel
    .of( 'hello', 'world' )
    .map { it -> it.reverse() }
    .view()

The map operator can also associate a generic tuple to each element, and the tuple can contain any type of data.

1
2
3
4
Channel
    .of( 'hello', 'world' )
    .map { word -> [word, word.size()] }
    .view { word, len -> "$word contains $len letters" }

Exercise

Use fromPath to create a channel emitting the fastq files matching the pattern data/ggal/*.fq, then use map to return a pair containing the file name and the path itself, and finally, use view to print the resulting channel.

Click here for the answer:
1
2
3
4
Channel
    .fromPath('data/ggal/*.fq')
    .map { file -> [ file.name, file ] }
    .view { name, file -> "> $name : $file" }

7.2.3. mix

The mix operator combines the items emitted by two (or more) channels into a single channel.

1
2
3
4
5
c1 = Channel.of( 1,2,3 )
c2 = Channel.of( 'a','b' )
c3 = Channel.of( 'z' )

c1 .mix(c2,c3).view()
1
2
a
3
b
z
The items in the resulting channel have the same order as in the respective original channels. However, there is no guarantee that the elements of the second channel are appended after the elements of the first. Indeed, in the example above, the element a has been printed before 3.

7.2.4. flatten

The flatten operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.

1
2
3
4
5
6
7
foo = [1,2,3]
bar = [4,5,6]

Channel
    .of(foo, bar)
    .flatten()
    .view()

The above snippet prints:

1
2
3
4
5
6

7.2.5. collect

The collect operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.

1
2
3
4
Channel
    .of( 1, 2, 3, 4 )
    .collect()
    .view()

It prints a single value:

[1,2,3,4]
The result of the collect operator is a value channel.

7.2.6. groupTuple

The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.

Try the following example:

1
2
3
4
Channel
    .of( [1,'A'], [1,'B'], [2,'C'], [3, 'B'], [1,'C'], [2, 'A'], [3, 'D'] )
    .groupTuple()
    .view()

It shows:

[1, [A, B, C]]
[2, [C, A]]
[3, [B, D]]

This operator is useful for processing together all of the elements that share a common property or grouping key.

Exercise

Use fromPath to create a channel emitting all of the files in the folder data/meta/, then use a map to associate the baseName prefix to each file. Finally, group all files that have the same common prefix.

Click here for the answer:
1
2
3
4
Channel.fromPath('data/meta/*')
    .map { file -> tuple(file.baseName, file) }
    .groupTuple()
    .view { baseName, file -> "> $baseName : $file" }

7.2.7. join

The join operator creates a channel that joins together the items emitted by two channels with a matching key. The key is defined, by default, as the first element in each item emitted.

1
2
3
left = Channel.of(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right = Channel.of(['Z', 6], ['Y', 5], ['X', 4])
left.join(right).view()

The resulting channel emits:

[Z, 3, 6]
[Y, 2, 5]
[X, 1, 4]
Notice 'P' is missing in the final result.

7.2.8. branch

The branch operator allows you to forward the items emitted by a source channel to one or more output channels.

The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to a true value, the item is bound to a named channel as the label identifier. For example:

1
2
3
4
5
6
7
8
9
10
Channel
    .of(1,2,3,40,50)
    .branch {
        small: it < 10
        large: it > 10
    }
    .set { result }

result.small.view { "$it is small" }
result.large.view { "$it is large" }
The branch operator returns a multi-channel object (i.e., a variable that holds more than one channel object).
In the above example, what would happen to a value of 10? To deal with this, you can also use >=.
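For example, a sketch of the same branch using >= so that a value of exactly 10 would fall into the large branch:

Channel
    .of(1,2,3,10,40,50)
    .branch {
        small: it < 10
        large: it >= 10
    }
    .set { result }

result.small.view { "$it is small" }
result.large.view { "$it is large" }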

7.3. More resources

Check the operators documentation on Nextflow web site.

8. Groovy basic structures and idioms

Nextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.

Here are some important Groovy constructs that are commonly used in Nextflow.

8.1. Printing values

Printing something is as easy as using one of the print or println methods.

println("Hello, World!")

The only difference between the two is that the println method implicitly appends a new line character to the printed string.

Parentheses for function invocations are optional. Therefore, the following syntax is also valid:
println "Hello, World!"

8.2. Comments

Comments use the same syntax as C-family programming languages:

1
2
3
4
5
6
// comment a single line

/*
   a comment spanning
   multiple lines
*/

8.3. Variables

To define a variable, simply assign a value to it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
x = 1
println x

x = new java.util.Date()
println x

x = -3.1499392
println x

x = false
println x

x = "Hi"
println x

Local variables are defined using the def keyword:

def x = 'foo'

def should always be used when defining variables local to a function or a closure.
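For example, a short sketch using def for a variable local to a function:

def sumSquares(list) {
    def total = 0                       // local to this function
    list.each { total += it * it }
    return total
}

assert sumSquares([1,2,3]) == 14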

8.4. Lists

A List object can be defined by placing the list items in square brackets:

list = [10,20,30,40]

You can access a given item in the list with square-bracket notation (indexes start at 0) or using the get method:

1
2
println list[0]
println list.get(0)

In order to get the length of a list you can use the size method:

println list.size()

We use the assert keyword to test if a condition is true (similar to an if statement). Here, Groovy will print nothing if it is correct, otherwise it will raise an AssertionError message.

assert list[0] == 10
This assertion should be correct, but try changing it to an incorrect one.

Lists can also be indexed with negative indexes and reversed ranges.

1
2
3
list = [0,1,2]
assert list[-1] == 2
assert list[-1..0] == list.reverse()
In the last assert line we are referencing the initial list and converting this with a "shorthand" range (..), to run from the -1th element (2) to the 0th element (0).

List objects implement all methods provided by the java.util.List interface, plus the extension methods provided by Groovy.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
assert [1,2,3] << 1 == [1,2,3,1]
assert [1,2,3] + [1] == [1,2,3,1]
assert [1,2,3,1] - [1] == [2,3]
assert [1,2,3] * 2 == [1,2,3,1,2,3]
assert [1,[2,3]].flatten() == [1,2,3]
assert [1,2,3].reverse() == [3,2,1]
assert [1,2,3].collect{ it+3 } == [4,5,6]
assert [1,2,3,1].unique().size() == 3
assert [1,2,3,1].count(1) == 2
assert [1,2,3,4].min() == 1
assert [1,2,3,4].max() == 4
assert [1,2,3,4].sum() == 10
assert [4,2,1,3].sort() == [1,2,3,4]
assert [4,2,1,3].find{it%2 == 0} == 4
assert [4,2,1,3].findAll{it%2 == 0} == [4,2]

8.5. Maps

Maps are like lists that have an arbitrary key instead of an integer. Therefore, the syntax is very much aligned.

map = [a:0, b:1, c:2]

Maps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.

1
2
3
assert map['a'] == 0        (1)
assert map.b == 1           (2)
assert map.get('c') == 2    (3)
1 Using square brackets.
2 Using dot notation.
3 Using the get method.

To add data or to modify a map, the syntax is similar to adding values to a list:

1
2
3
4
map['a'] = 'x'           (1)
map.b = 'y'              (2)
map.put('c', 'z')        (3)
assert map == [a:'x', b:'y', c:'z']
1 Using square brackets.
2 Using dot notation.
3 Using the put method.

Map objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy.
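For example, a minimal sketch of a few commonly used map methods:

map = [a:0, b:1, c:2]

assert map.containsKey('a')
assert map.keySet() == ['a','b','c'] as Set
assert map.values().sum() == 3

map.each { key, value -> println "$key -> $value" }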

8.6. String interpolation

String literals can be defined by enclosing them with either single- ('') or double- ("") quotation marks.

Double-quoted strings can contain the value of an arbitrary variable by prefixing its name with the $ character, or the value of any expression by using the ${expression} syntax, similar to Bash/shell scripts:

1
2
3
4
5
6
foxtype = 'quick'
foxcolor = ['b', 'r', 'o', 'w', 'n']
println "The $foxtype ${foxcolor.join()} fox"

x = 'Hello'
println '$x + $y'

This code prints:

1
2
The quick brown fox
$x + $y
Note the different use of $ and ${..} syntax to interpolate value expressions in a string literal.

Finally, string literals can also be defined using the / character as a delimiter. They are known as slashy strings and are useful for defining regular expressions and patterns, as there is no need to escape backslashes. As with double-quoted strings, they allow you to interpolate variables prefixed with a $ character.

Try the following to see the difference:

1
2
3
4
5
x = /tic\tac\toe/
y = 'tic\tac\toe'

println x
println y

It prints:

tic\tac\toe
tic    ac    oe

8.7. Multi-line strings

A block of text that spans multiple lines can be defined by delimiting it with triple single or double quotes:

1
2
3
4
5
text = """
    Hello there James.
    How are you today?
    """
println text

Finally, multi-line strings can also be defined with slashy strings. For example:

1
2
3
4
5
6
text = /
    This is a multi-line
    slashy string!
    It's cool, isn't it?!
    /
println text
Like before, multi-line strings inside double quotes and slash characters support variable interpolation, while single-quoted multi-line strings do not.

8.8. If statement

The if statement uses the same syntax common in other programming languages, such as Java, C, JavaScript, etc.

1
2
3
4
5
6
if( < boolean expression > ) {
    // true branch
}
else {
    // false branch
}

The else branch is optional. Also, the curly brackets are optional when the branch defines just a single statement.

1
2
3
x = 1
if( x > 10 )
    println 'Hello'
null, empty strings, and empty collections are evaluated to false.

Therefore a statement like:

1
2
3
4
5
6
7
list = [1,2,3]
if( list != null && list.size() > 0 ) {
  println list
}
else {
  println 'The list is empty'
}

Can be written as:

1
2
3
4
5
list = [1,2,3]
if( list )
    println list
else
    println 'The list is empty'

See the Groovy-Truth for further details.

In some cases it can be useful to replace the if statement with a ternary expression (aka conditional expression). For example:
println list ? list : 'The list is empty'

The previous statement can be further simplified using the Elvis operator, as shown below:

println list ?: 'The list is empty'

8.9. For statement

The classical for loop syntax is supported as shown here:

1
2
3
for (int i = 0; i <3; i++) {
   println("Hello World $i")
}

Iteration over list objects is also possible using the syntax below:

1
2
3
4
5
list = ['a','b','c']

for( String elem : list ) {
  println elem
}

8.10. Functions

It is possible to define a custom function into a script, as shown here:

1
2
3
4
5
int fib(int n) {
    return n < 2 ? 1 : fib(n-1) + fib(n-2)
}

assert fib(10)==89

A function can take multiple arguments, separated by commas. The return keyword can be omitted and the function implicitly returns the value of the last evaluated expression. Also, explicit types can be omitted, though this is not recommended:

1
2
3
4
5
def fact( n ) {
  n > 1 ? n * fact(n-1) : 1
}

assert fact(5) == 120

8.11. Closures

Closures are the Swiss army knife of Nextflow/Groovy programming. In a nutshell, a closure is a block of code that can be passed as an argument to a function. A closure can also be used to define an anonymous function.

More formally, a closure allows the definition of functions as first-class objects.

square = { it * it }

The curly brackets around the expression it * it tell the script interpreter to treat this expression as code. The it identifier is an implicit variable that represents the value that is passed to the function when it is invoked.

Once compiled, the function object is assigned to the variable square as any other variable assignment shown previously. To invoke the closure execution use the special method call or just use the round parentheses to specify the closure parameter(s). For example:

1
2
assert square.call(5) == 25
assert square(9) == 81

As is, this may not seem interesting, but we can now pass the square function as an argument to other functions or methods. Some built-in functions take a function like this as an argument. One example is the collect method on lists:

1
2
x = [ 1, 2, 3, 4 ].collect(square)
println x

It prints:

[ 1, 4, 9, 16 ]

By default, closures take a single parameter called it; to give it a different name use the -> syntax. For example:

square = { num -> num * num }

It’s also possible to define closures with multiple, custom-named parameters.

For example, when the method each() is applied to a map, it can take a closure with two arguments, to which it passes the key-value pair for each entry in the map object:

1
2
3
printMap = { a, b -> println "$a with value $b" }
values = [ "Yue" : "Wu", "Mark" : "Williams", "Sudha" : "Kumari" ]
values.each(printMap)

It prints:

Yue with value Wu
Mark with value Williams
Sudha with value Kumari

A closure has two other important features.

First, it can access and modify variables in the scope where it is defined.

Second, a closure can be defined in an anonymous manner, meaning that it is not given a name, and is defined in the place where it needs to be used.

As an example showing both these features, see the following code fragment:

1
2
3
4
result = 0                                       (1)
values = ["China": 1 , "India" : 2, "USA" : 3]   (2)
values.keySet().each { result += values[it] }    (3)
println result
1 Defines a global variable.
2 Defines a map object.
3 Invokes the each method passing the closure object which modifies the result variable.

Learn more about closures in the Groovy documentation.

8.12. More resources

The complete Groovy language documentation is available at this link.

A great resource to master Apache Groovy syntax is the book: Groovy in Action.

9. Modularization

The definition of module libraries simplifies the writing of complex data analysis pipelines and makes re-use of processes much easier.

Using the hello.nf example from earlier, we will convert the pipeline’s processes into modules, then call them within the workflow scope in a variety of ways.

9.1. Modules

Nextflow DSL2 allows for the definition of stand-alone module scripts that can be included and shared across multiple workflows. Each module can contain its own process or workflow definition.

9.1.1. Importing modules

Components defined in the module script can be imported into other Nextflow scripts using the include statement. This allows you to store these components in one or more separate files so that they can be re-used in multiple workflows.

Using the hello.nf example, we can achieve this by:

  • Creating a file called modules.nf in the top-level directory.

  • Cutting and pasting the two process definitions for SPLITLETTERS and CONVERTTOUPPER into modules.nf.

  • Removing the process definitions in the hello.nf script.

  • Importing the processes from modules.nf within the hello.nf script anywhere above the workflow definition:

1
2
include { SPLITLETTERS   } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'
In general, you would use relative paths to define the location of the module scripts using the ./ prefix.

Exercise

Create a modules.nf file with the previously defined processes from hello.nf. Then remove these processes from hello.nf and add the include definitions shown above.

Click here for the answer:

The hello.nf script should look similar to this:

1
2
3
4
5
6
7
8
9
10
11
12
13
#!/usr/bin/env nextflow

params.greeting  = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

include { SPLITLETTERS   } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view{ it }
}

You should have the following in the file ./modules.nf:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
process SPLITLETTERS {

    input:
    val x

    output:
    path 'chunk_*'

    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {

    input:
    path y

    output:
    stdout

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

We now have modularized processes, which makes the code reusable.

9.1.2. Multiple imports

If a Nextflow module script contains multiple process definitions they can also be imported using a single include statement as shown in the example below:

include { SPLITLETTERS; CONVERTTOUPPER } from './modules.nf'

9.1.3. Module aliases

When including a module component it is possible to specify a name alias using the as declaration. This allows the inclusion and the invocation of the same component multiple times using different names:

#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

include { SPLITLETTERS as SPLITLETTERS_one } from './modules.nf'
include { SPLITLETTERS as SPLITLETTERS_two } from './modules.nf'

include { CONVERTTOUPPER as CONVERTTOUPPER_one } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_two } from './modules.nf'

workflow {
    letters_ch1 = SPLITLETTERS_one(greeting_ch)
    results_ch1 = CONVERTTOUPPER_one(letters_ch1.flatten())
    results_ch1.view{ it }

    letters_ch2 = SPLITLETTERS_two(greeting_ch)
    results_ch2 = CONVERTTOUPPER_two(letters_ch2.flatten())
    results_ch2.view{ it }
}

Exercise

Save the previous snippet as hello.2.nf, and predict the output to screen.

Click here for the answer:

The hello.2.nf output should look something like this:

N E X T F L O W  ~  version 22.04.3
Launching `hello.2.nf` [goofy_goldstine] DSL2 - revision: 449cf82eaf
executor >  local (6)
[e1/5e6523] process > SPLITLETTERS_one (1)   [100%] 1 of 1 ✔
[14/b77deb] process > CONVERTTOUPPER_one (1) [100%] 2 of 2 ✔
[c0/115bd6] process > SPLITLETTERS_two (1)   [100%] 1 of 1 ✔
[09/f9072d] process > CONVERTTOUPPER_two (2) [100%] 2 of 2 ✔
WORLD!
HELLO
WORLD!
HELLO
You can store each process in a separate file within separate sub-folders, or combine them in one big file (both are valid). For your own projects, it can be helpful to see examples in public repositories, e.g. the Seqera RNA-Seq tutorial here or within nf-core pipelines, such as nf-core/rnaseq.

9.2. Output definition

Nextflow allows the use of alternative output definitions within workflows to simplify your code.

In the previous basic example (hello.nf), we defined the channel names to specify the input to the next process.

e.g.

workflow  {
    greeting_ch = Channel.of(params.greeting)
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view{ it }
}
We have moved the greeting_ch into the workflow scope for this exercise.

We can also refer to the output of a process explicitly using its .out attribute.

workflow  {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    results_ch.view{ it }
}

We can also remove the channel assignments completely from each line if we prefer, since each process call implicitly provides its output channel:

workflow  {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.view()
}

If a process defines two or more output channels, each channel can be accessed by indexing the .out attribute, e.g., .out[0], .out[1], etc. In our example we only have the [0]'th output:

workflow  {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out[0].view()
}
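For illustration, a hypothetical process declaring two outputs (not part of hello.nf) could have its channels accessed by index as follows:

process SPLITANDCOUNT {

    input:
    val x

    output:
    path 'chunk_*'
    stdout

    """
    printf '$x' | split -b 6 - chunk_
    printf '$x' | wc -c
    """
}

workflow {
    SPLITANDCOUNT(Channel.of('Hello world!'))
    SPLITANDCOUNT.out[0].view()  // the 'chunk_*' files
    SPLITANDCOUNT.out[1].view()  // the character count printed to stdout
}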

Alternatively, the process output definition allows the use of the emit statement to define a named identifier that can be used to reference the channel in the external scope.

For example, try adding the emit statement to the CONVERTTOUPPER process in your modules.nf file:

process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout emit: upper

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

Then change the workflow scope in hello.nf to call this specific named output (notice the added .upper):

workflow {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}

9.2.1. Using piped outputs

Another way to deal with outputs in the workflow scope is to use pipes |.

Exercise

Try changing the workflow script to the snippet below:

workflow {
    Channel.of(params.greeting) | SPLITLETTERS | flatten() | CONVERTTOUPPER | view
}

Here we use a pipe, which passes the output of each step as a channel to the next process or operator.

9.3. Workflow definition

The workflow scope allows the definition of components that define the invocation of one or more processes or operators:

#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'


workflow my_pipeline {
    greeting_ch = Channel.of(params.greeting)
    SPLITLETTERS(greeting_ch)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}

workflow {
    my_pipeline()
}

For example, the snippet above defines a workflow named my_pipeline that can be invoked from another workflow definition.

Make sure that your modules.nf file is the one containing the emit on the CONVERTTOUPPER process.
A workflow component can access any variable or parameter defined in the outer scope. In the running example, we can also access params.greeting directly within the workflow definition.

9.3.1. Workflow inputs

A workflow component can declare one or more input channels using the take statement. For example:

#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS } from './modules.nf'
include { CONVERTTOUPPER } from './modules.nf'

workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())
    CONVERTTOUPPER.out.upper.view{ it }
}
When the take statement is used, the body of the workflow needs to be declared within the main block.

The input for the workflow can then be specified as an argument:

workflow {
    my_pipeline(Channel.of(params.greeting))
}

9.3.2. Workflow outputs

A workflow can declare one or more output channels using the emit statement. For example:

workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())

    emit:
    CONVERTTOUPPER.out.upper
}

workflow {
    my_pipeline(Channel.of(params.greeting))
    my_pipeline.out.view()
}

As a result, we can use the my_pipeline.out notation to access the outputs of my_pipeline in the invoking workflow.

We can also declare named outputs within the emit block.

workflow my_pipeline {
    take:
    greeting

    main:
    SPLITLETTERS(greeting)
    CONVERTTOUPPER(SPLITLETTERS.out.flatten())

    emit:
    my_data = CONVERTTOUPPER.out.upper
}

workflow {
    my_pipeline(Channel.of(params.greeting))
    my_pipeline.out.my_data.view()
}

The result of the above snippet can then be accessed using my_pipeline.out.my_data.

9.3.3. Calling named workflows

Within a main.nf script (called hello.nf in our example) we can also have multiple workflows. In that case, we may want to call a specific workflow when running the code. For this we use the entrypoint option -entry <workflow_name>.

The following snippet has two named workflows (my_pipeline_one and my_pipeline_two):

#!/usr/bin/env nextflow

params.greeting = 'Hello world!'

include { SPLITLETTERS as SPLITLETTERS_one } from './modules.nf'
include { SPLITLETTERS as SPLITLETTERS_two } from './modules.nf'

include { CONVERTTOUPPER as CONVERTTOUPPER_one } from './modules.nf'
include { CONVERTTOUPPER as CONVERTTOUPPER_two } from './modules.nf'


workflow my_pipeline_one {
    letters_ch1 = SPLITLETTERS_one(params.greeting)
    results_ch1 = CONVERTTOUPPER_one(letters_ch1.flatten())
    results_ch1.view{ it }
}

workflow my_pipeline_two {
    letters_ch2 = SPLITLETTERS_two(params.greeting)
    results_ch2 = CONVERTTOUPPER_two(letters_ch2.flatten())
    results_ch2.view{ it }
}

workflow {
    my_pipeline_one()
    my_pipeline_two()
}

You can choose which workflow runs by using the -entry flag:

nextflow run hello.2.nf -entry my_pipeline_one

9.3.4. Parameter scopes

A module script can define one or more parameters or custom functions using the same syntax as any other Nextflow script. Consider the minimal examples below:

Module script (./modules.nf)
params.foo = 'Hello'
params.bar = 'world!'

def SAYHELLO() {
    println "$params.foo $params.bar"
}
Main script (./hello.nf)
#!/usr/bin/env nextflow

params.foo = 'Hola'
params.bar = 'mundo!'

include { SAYHELLO } from './modules.nf'

workflow {
    SAYHELLO()
}

Running hello.nf should print:

Hola mundo!

As highlighted above, the script will print Hola mundo! instead of Hello world! because parameters are inherited from the including context.

Note that pipeline parameters should be defined at the beginning of the script, before any include declarations; otherwise they will be ignored.

The addParams option can be used to extend the module parameters without affecting the external scope. For example:

#!/usr/bin/env nextflow

params.foo = 'Hola'
params.bar = 'mundo!'

include { SAYHELLO } from './modules.nf' addParams(foo: 'Olá')

workflow {
    SAYHELLO()
}

Executing the main script above should print:

Olá mundo!

9.4. DSL2 migration notes

To view a summary of the changes introduced when Nextflow migrated from DSL1 to DSL2 please refer to the DSL2 migration notes in the official Nextflow documentation.

10. Nextflow configuration

A key Nextflow feature is the ability to decouple the workflow implementation from the configuration settings required by the underlying execution platform.

This enables portable deployment without the need to modify the application code.

10.1. Configuration file

When a pipeline script is launched, Nextflow looks for a file named nextflow.config in the current directory and in the script base directory (if it is not the same as the current directory). Finally, it checks for the file: $HOME/.nextflow/config.

When more than one of the above files exists, they are merged, so that the settings in the first override the same settings that may appear in the second, and so on.

The default config file search mechanism can be extended by providing an extra configuration file by using the command line option: -c <config file>.
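As a sketch of this precedence (the file name custom.config and the parameter values are hypothetical), a setting supplied with -c takes priority over the same setting in the launch directory's nextflow.config:

// nextflow.config (in the launch directory)
params.greeting = 'Hello world!'

// custom.config, supplied with: nextflow run hello.nf -c custom.config
params.greeting = 'Bonjour le monde!'
// the pipeline will see params.greeting = 'Bonjour le monde!'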

10.1.1. Config syntax

A Nextflow configuration file is a simple text file containing a set of properties defined using the syntax:

name = value
Please note that string values need to be wrapped in quotation characters while numbers and boolean values (true, false) do not. Also, note that values are typed, meaning for example that 1 is different from '1', since the first is interpreted as the number one, while the latter is interpreted as a string value.
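A short sketch of these rules (the property names are arbitrary):

custom_string = 'hello'   // strings must be quoted
custom_number = 42        // numbers are not quoted
custom_flag   = true      // booleans are not quoted
// note: 1 is the number one, while '1' is a string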

10.1.2. Config variables

Configuration properties can be used as variables in the configuration file itself, by using the usual $propertyName or ${expression} syntax.

propertyOne = 'world'
anotherProp = "Hello $propertyOne"
customPath = "$PATH:/my/app/folder"
In the configuration file it’s possible to access any variable defined in the host environment such as $PATH, $HOME, $PWD, etc.

10.1.3. Config comments

Configuration files use the same conventions for comments used in the Nextflow script:

// comment a single line

/*
   a comment spanning
   multiple lines
 */

10.1.4. Config scopes

Configuration settings can be organized in different scopes by dot prefixing the property names with a scope identifier or grouping the properties in the same scope using the curly brackets notation. This is shown in the following example:

alpha.x  = 1
alpha.y  = 'string value..'

beta {
    p = 2
    q = 'another string ..'
}

10.1.5. Config params

The scope params allows the definition of workflow parameters that override the values defined in the main workflow script.

This is useful to consolidate one or more execution parameters in a separate file.

// config file
params.foo = 'Bonjour'
params.bar = 'le monde!'
// workflow script
params.foo = 'Hello'
params.bar = 'world!'

// print both params
println "$params.foo $params.bar"

Exercise

Save the first snippet as nextflow.config and the second one as params.nf. Then run:

nextflow run params.nf
For the result, click here:
Bonjour le monde!

Execute it again, specifying the foo parameter on the command line:

nextflow run params.nf --foo Hola
For the result, click here:
Hola le monde!

Compare the result of the two executions.

10.1.6. Config env

The env scope allows the definition of one or more variables that will be exported into the environment where the workflow tasks will be executed.

env.ALPHA = 'some value'
env.BETA = "$HOME/some/path"

Exercise

Save the above snippet as a file named my-env.config. Then save the snippet below in a file named foo.nf:

process foo {
  echo true
  '''
  env | egrep 'ALPHA|BETA'
  '''
}

Finally, execute the following command:

nextflow run foo.nf -c my-env.config
Your result should look something like the following, click here:
BETA=/home/some/path
ALPHA=some value

10.1.7. Config process

process directives allow the specification of settings for the task execution such as cpus, memory, container, and other resources in the pipeline script.

This is useful when prototyping a small workflow script.

However, it’s always a good practice to decouple the workflow execution logic from the process configuration settings, i.e. it’s strongly suggested to define the process settings in the workflow configuration file instead of the workflow script.

The process configuration scope allows the setting of any process directives in the Nextflow configuration file. For example:

process {
    cpus = 10
    memory = 8.GB
    container = 'biocontainers/bamtools:v2.4.0_cv3'
}

The above config snippet defines the cpus, memory and container directives for all processes in your workflow script.

The process selector can be used to apply the configuration to a specific process or group of processes (discussed later).

Memory and time duration units can be specified either using a string-based notation, in which the digit(s) and the unit can be separated by a blank, or using the numeric notation, in which the digit(s) and the unit are separated by a dot character and are not enclosed in quote characters.

String syntax     Numeric syntax   Value
'10 KB'           10.KB            10240 bytes
'500 MB'          500.MB           524288000 bytes
'1 min'           1.min            60 seconds
'1 hour 25 sec'   -                1 hour and 25 seconds

The syntax for setting process directives in the configuration file requires = (i.e. assignment operator), whereas it should not be used when setting the process directives within the workflow script.

For an example, click here:
process foo {
  cpus 4
  memory 2.GB
  time 1.hour
  maxRetries 3

  script:
  """
    your_command --cpus $task.cpus --mem $task.memory
  """
}
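For comparison, a sketch of equivalent settings expressed in a configuration file, where the assignment operator is required (the process name foo matches the example above):

process {
    withName: foo {
        cpus = 4
        memory = 2.GB
        time = 1.hour
        maxRetries = 3
    }
}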

This is especially important when you want to define a config setting using a dynamic expression using a closure. For example:

process foo {
    memory = { 4.GB * task.cpus }
}

Directives that require more than one value, e.g. pod, in the configuration file need to be expressed as a map object.

process {
    pod = [env: 'FOO', value: '123']
}

Finally, directives that are repeated in the process definition need to be defined as a list object in the configuration file. For example:

process {
    pod = [ [env: 'FOO', value: '123'],
            [env: 'BAR', value: '456'] ]
}

10.1.8. Config Docker execution

The container image to be used for the process execution can be specified in the nextflow.config file:

process.container = 'nextflow/rnaseq-nf'
docker.enabled = true

The use of unique "SHA256" docker image IDs guarantees that the image content does not change over time, for example:

process.container = 'nextflow/rnaseq-nf@sha256:aeacbd7ea1154f263cda972a96920fb228b2033544c2641476350b9317dab266'
docker.enabled = true

10.1.9. Config Singularity execution

To run a workflow execution with Singularity, a container image file path is required in the Nextflow config file using the container directive:

process.container = '/some/singularity/image.sif'
singularity.enabled = true
The container image file must be an absolute path i.e. it must start with a /.

The following protocols are supported:

  • library:// download the container image from the Singularity Library service.

  • shub:// download the container image from the Singularity Hub.

  • docker:// download the container image from the Docker Hub and convert it to the Singularity format.

  • docker-daemon:// pull the container image from a local Docker installation and convert it to a Singularity image file.

Singularity Hub (shub://) is no longer available as a builder service, though existing images built before 19th April 2021 will still work.
If a plain Docker container image name is specified, Nextflow implicitly downloads it and converts it to a Singularity image when Singularity execution is enabled. For example:
process.container = 'nextflow/rnaseq-nf'
singularity.enabled = true

The above configuration instructs Nextflow to use the Singularity engine to run your script processes. The container is pulled from the Docker registry and cached in the current directory to be used for further runs.

Alternatively, if you have a Singularity image file, its absolute path location can be specified as the container name either using the -with-singularity option or the process.container setting in the config file.

Exercise

Try to run the script as shown below, after changing the nextflow.config file to the Singularity configuration shown above:

nextflow run script7.nf
Nextflow will pull the container image automatically; this may take a few seconds depending on the network connection speed.

10.1.10. Config Conda execution

The use of a Conda environment can also be provided in the configuration file by adding the following setting in the nextflow.config file:

process.conda = "/home/ubuntu/miniconda2/envs/nf-tutorial"

You can specify either the path of an existing Conda environment directory or the path of a Conda environment YAML file.
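For example, both of the following forms are valid (the YAML file path is a hypothetical example):

// path of an existing Conda environment directory
process.conda = '/home/ubuntu/miniconda2/envs/nf-tutorial'

// or path of a Conda environment YAML file
process.conda = '/some/path/environment.yml'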

11. Deployment scenarios

Real-world genomic applications can spawn the execution of thousands of jobs. In this scenario a batch scheduler is commonly used to deploy a pipeline in a computing cluster, allowing the execution of many jobs in parallel across many compute nodes.

Nextflow has built-in support for the most commonly used batch schedulers, such as Univa Grid Engine, SLURM and IBM LSF. Check the Nextflow documentation for the complete list of supported execution platforms.

11.1. Cluster deployment

A key Nextflow feature is the ability to decouple the workflow implementation from the actual execution platform. The implementation of an abstraction layer allows the deployment of the resulting workflow on any execution platform supported by the framework.

nf executors

To run your pipeline with a batch scheduler, modify the nextflow.config file specifying the target executor and the required computing resources if needed. For example:

process.executor = 'slurm'

11.2. Managing cluster resources

When using a batch scheduler, it is often necessary to specify the amount of resources (i.e. cpus, memory, execution time, etc.) required by each task.

This can be done using the following process directives (a usage sketch follows the list):

queue

the cluster queue to be used for the computation

cpus

the number of cpus to be allocated for a task execution

memory

the amount of memory to be allocated for a task execution

time

the max amount of time to be allocated for a task execution

disk

the amount of disk storage required for a task execution
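For instance, a sketch of a process setting these directives directly in the workflow script (the values and the command are placeholders):

process align {
  queue 'short'
  cpus 4
  memory '8 GB'
  time '2h'
  disk '10 GB'

  script:
  """
  your_command --threads $task.cpus
  """
}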

11.2.1. Workflow wide resources

Use the scope process to define the resource requirements for all processes in your workflow applications. For example:

process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4
}

11.2.2. Configure process by name

In real-world applications, different tasks need different amounts of computing resources. It is possible to define the resources for a specific task using the withName selector followed by the process name:

process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4

    withName: foo {
        cpus = 2
        memory = '20 GB'
        queue = 'short'
    }

    withName: bar {
        cpus = 4
        memory = '32 GB'
        queue = 'long'
    }
}

Exercise

Run the RNA-Seq script (script7.nf) from earlier, but specify that the quantification process requires 2 CPUs and 5 GB of memory, within the nextflow.config file.

Click here for the answer:
process {
    withName: quantification {
        cpus = 2
        memory = '5 GB'
    }
}

11.2.3. Configure process by labels

When a workflow application is composed of many processes, listing all of the process names and choosing resources for each of them in the configuration file can be difficult.

A better strategy consists of annotating the processes with a label directive, then specifying the resources in the configuration file for all processes having the same label.

The workflow script:

process task1 {
  label 'long'

  """
  first_command --here
  """
}

process task2 {
  label 'short'

  """
  second_command --here
  """
}

The configuration file:

process {
    executor = 'slurm'

    withLabel: 'short' {
        cpus = 4
        memory = '20 GB'
        queue = 'alpha'
    }

    withLabel: 'long' {
        cpus = 8
        memory = '32 GB'
        queue = 'omega'
    }
}

11.2.4. Configure multiple containers

Containers can be set for each process in your workflow. You can define their containers in a config file as shown below:

process {
  withName: foo {
    container = 'some/image:x'
  }
  withName: bar {
    container = 'other/image:y'
  }
}

docker.enabled = true
Should I use a single fat container or many slim containers? Both approaches have pros & cons. A single container is simpler to build and maintain, however when using many tools the image can become very big and tools can create conflicts with each other. Using a container for each process can result in many different images to build and maintain, especially when processes in your workflow use different tools for each task.

Read more about config process selectors at this link.

11.3. Configuration profiles

Configuration files can contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile command-line option.

Configuration profiles are defined by using the special scope profiles which group the attributes that belong to the same profile using a common prefix. For example:

profiles {

    standard {
        params.genome = '/local/path/ref.fasta'
        process.executor = 'local'
    }

    cluster {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'sge'
        process.queue = 'long'
        process.memory = '10GB'
        process.conda = '/some/path/env.yml'
    }

    cloud {
        params.genome = '/data/stared/ref.fasta'
        process.executor = 'awsbatch'
        process.container = 'cbcrg/imagex'
        docker.enabled = true
    }

}

This configuration defines three different profiles: standard, cluster and cloud that set different process configuration strategies depending on the target runtime platform. By convention, the standard profile is implicitly used when no other profile is specified by the user.

To enable a specific profile, use the -profile option followed by the profile name:

nextflow run <your script> -profile cluster
Two or more configuration profiles can be specified by separating the profile names with a comma character:
nextflow run <your script> -profile standard,cloud

11.4. Cloud deployment

AWS Batch is a managed computing service that allows the execution of containerized workloads in the Amazon cloud infrastructure.

Nextflow provides built-in support for AWS Batch which allows the seamless deployment of a Nextflow pipeline in the cloud, offloading the process executions as Batch jobs.

Once the Batch environment is configured (specifying the instance types to be used and the maximum number of CPUs to be allocated), you need to create a Nextflow configuration file like the one shown below:

process.executor = 'awsbatch'                          (1)
process.queue = 'nextflow-ci'                          (2)
process.container = 'nextflow/rnaseq-nf:latest'        (3)
workDir = 's3://nextflow-ci/work/'                     (4)
aws.region = 'eu-west-1'                               (5)
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' (6)
1 Set AWS Batch as the executor to run the processes in the workflow
2 The name of the computing queue defined in the Batch environment
3 The Docker container image to be used to run each job
4 The workflow work directory must be an AWS S3 bucket
5 The AWS region to be used
6 The path of the AWS cli tool required to download/upload files to/from the container
The best practice is to keep this setting as a separate profile in your workflow config file. This allows the execution with a simple command.
nextflow run script7.nf
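For instance, a minimal sketch of such a profile (the profile name batch is an assumption; the settings mirror the snippet above), which could then be selected at launch with -profile batch:

profiles {
    batch {
        process.executor = 'awsbatch'
        process.queue = 'nextflow-ci'
        process.container = 'nextflow/rnaseq-nf:latest'
        workDir = 's3://nextflow-ci/work/'
        aws.region = 'eu-west-1'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}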

The complete details about AWS Batch deployment are available at this link.

11.5. Volume mounts

Elastic Block Storage (EBS) volumes (or other supported storage) can be mounted in the job container using the following configuration snippet:

aws {
  batch {
      volumes = '/some/path'
  }
}

Multiple volumes can be specified using comma-separated paths. The usual Docker volume mount syntax can be used to define complex volumes for which the container path is different from the host path or to specify a read-only option:

aws {
  region = 'eu-west-1'
  batch {
      volumes = ['/tmp', '/host/path:/mnt/path:ro']
  }
}
This is a global configuration that has to be specified in a Nextflow config file and will be applied to all process executions.
Nextflow expects paths to be available. It does not handle the provisioning of EBS volumes or other kinds of storage.

11.6. Custom job definition

Nextflow automatically creates the Batch Job definitions needed to execute your pipeline processes. Therefore it’s not required to define them before you run your workflow.

However, you may still need to specify a custom Job Definition to provide fine-grained control of the configuration settings of a specific job (e.g. to define custom mount paths or other special settings of a Batch Job).

To use your own job definition in a Nextflow workflow, use it in place of the container image name, prefixing it with the job-definition:// string. For example:

process {
    container = 'job-definition://your-job-definition-name'
}

11.7. Custom image

Since Nextflow requires the AWS CLI tool to be accessible in the computing environment, a common solution consists of creating a custom Amazon Machine Image (AMI) with the AWS CLI installed in a self-contained manner (e.g. using the Conda package manager).

When creating your custom AMI for AWS Batch, make sure to use the Amazon ECS-Optimized Amazon Linux AMI as the base image.

The following snippet shows how to install AWS CLI with Miniconda:

sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
The aws tool will be placed in a directory named bin in the main installation folder. The tool will not work properly if you modify this directory structure after the installation.

Finally, specify the aws full path in the Nextflow config file as shown below:

aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'

11.8. Launch template

An alternative to creating a custom AMI is to use a Launch template that installs the AWS CLI tool during instance boot via custom user data.

In the EC2 dashboard, create a Launch template specifying the user data field:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/sh
## install required deps
set -x
export PATH=/usr/local/bin:$PATH
yum install -y jq python27-pip sed wget bzip2
pip install -U boto3

## install awscli
USER=/home/ec2-user
wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $USER/miniconda
$USER/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
chown -R ec2-user:ec2-user $USER/miniconda

--//--

Then create a new compute environment in the Batch dashboard and specify the newly created launch template in the corresponding field.

11.9. Hybrid deployments

Nextflow allows the use of multiple executors in the same workflow application. This feature enables the deployment of hybrid workloads in which some jobs are executed on the local computer or local computing cluster, and some jobs are offloaded to the AWS Batch service.

To enable this feature use one or more process selectors in your Nextflow configuration file.

For example, apply the AWS Batch configuration only to a subset of processes in your workflow. You can try the following:

process {
    executor = 'slurm'  (1)
    queue = 'short'     (2)

    withLabel: bigTask {          (3)
      executor = 'awsbatch'       (4)
      queue = 'my-batch-queue'    (5)
      container = 'my/image:tag'  (6)
  }
}

aws {
    region = 'eu-west-1'    (7)
}
1 Set slurm as the default executor
2 Set the queue for the SLURM cluster
3 Apply the following settings to processes annotated with the bigTask label
4 Set awsbatch as the executor for processes with the bigTask label
5 Set the queue for processes with the bigTask label
6 Set the container image to deploy for processes with the bigTask label
7 Define the region for Batch execution

12. Get started with Nextflow Tower

12.1. Basic concepts

Nextflow Tower is the centralized command post for data management and pipelines. It brings monitoring, logging and observability to distributed workflows and simplifies the deployment of pipelines on any cloud, cluster or laptop.

Nextflow Tower's core features include:

  • The launching of pre-configured pipelines with ease.

  • Programmatic integration to meet the needs of an organization.

  • Publishing pipelines to shared workspaces.

  • Management of the infrastructure required to run data analysis at scale.

Sign up to try Tower for free or request a demo for deployments in your own on-premise or cloud environment.

12.2. Usage

You can use Tower either via the -with-tower option of the nextflow run command, through the online GUI, or through the API.

12.2.1. Via the Nextflow run command

Create an account and log in to Tower.

1. Create a new token

You can access your tokens from the Settings drop-down menu:

usage create token

2. Name your token

usage name token

3. Save your token safely

Copy and keep your new token in a safe place.
usage token

4. Export your token

Once your token has been created, open a terminal and type:

export TOWER_ACCESS_TOKEN=eyxxxxxxxxxxxxxxxQ1ZTE=
export NXF_VER=20.10.0

Where eyxxxxxxxxxxxxxxxQ1ZTE= is the token you have just created.

Check your Nextflow version with nextflow -version. Bearer tokens require Nextflow version 20.10.0 or later, which can be set with the second command shown above. You can change the version if necessary.

To submit a pipeline to a Workspace using the Nextflow command-line tool, add the workspace ID to your environment. For example:

export TOWER_WORKSPACE_ID=000000000000000

The workspace ID can be found on the organization’s Workspaces overview page.

5. Run Nextflow with Tower

Run your Nextflow workflows as usual with the addition of the -with-tower option:

nextflow run hello.nf -with-tower

You will see and be able to monitor your Nextflow jobs in Tower.

To configure and execute Nextflow jobs in Cloud environments, visit the Compute environments section.

Exercise

Run the RNA-Seq script7.nf using the -with-tower flag, after correctly completing the token settings outlined above.

If you get stuck, click here:

Go to tower.nf, log in, then click the Runs tab and select the run that you just submitted. If you can’t find it, double-check that your token was entered correctly.

12.2.2. Via online GUI

To run using the GUI, there are three main steps:

1. Create an account and log in to Tower, available free of charge, at tower.nf.

2. Create and configure a new compute environment.

Configuring your compute environment

Tower uses the concept of Compute Environments to define the execution platform where a pipeline will run.

It supports the launching of pipelines into a growing number of cloud and on-premise infrastructures.

compute env platforms

Each compute environment must be pre-configured to enable Tower to submit tasks. You can read more on how to set up each environment using the links below.

The following guides describe how to configure each of these compute environments.

Selecting a default compute environment

If you have more than one Compute Environment, you can select which one will be used by default when launching a pipeline.

  1. Navigate to your compute environments.

  2. Choose your default environment by selecting the Make primary button.

Congratulations!

You are now ready to launch pipelines with your primary compute environment.

Launchpad

Launchpad makes it easy for any workspace user to launch a pre-configured pipeline.

overview launch

A pipeline is a repository containing a Nextflow workflow, a compute environment and pipeline parameters.

Pipeline Parameters Form

Launchpad automatically detects the presence of a nextflow_schema.json in the root of the repository and dynamically creates a form where users can easily update the parameters.

The parameter forms view will appear if the workflow has a Nextflow schema file for the parameters. Please refer to the Nextflow Schema guide to learn more about the schema file use-cases and how to create them.

This makes it trivial for users without any expertise in Nextflow to enter their pipeline parameters and launch.

launch rnaseq nextflow schema
Adding a new pipeline

Adding a pipeline to the workspace Launchpad is detailed in full in the Tower docs.

In brief, these are the steps you need to follow to set up a pipeline.

  1. Select the Launchpad button in the navigation bar. This will open the Launch Form.

  2. Select a compute environment.

  3. Enter the repository of the pipeline you want to launch. e.g. github.com/nf-core/rnaseq.git

  4. Select a pipeline Revision number. The Git default branch (main/master) or manifest.defaultBranch in the Nextflow configuration will be used by default.

  5. Set the Work directory location of the Nextflow work directory. The location associated with the compute environment will be selected by default.

  6. Enter the name(s) of each of the Nextflow Config profiles followed by the Enter key. See the Nextflow Config profiles documentation for more details.

  7. Enter any Pipeline parameters in YAML or JSON format. YAML example:

    reads: 's3://nf-bucket/exome-data/ERR013140_{1,2}.fastq.bz2'
    paired_end: true
  8. Select Launchpad to begin the pipeline execution.

Nextflow pipelines are simply Git repositories and can be changed to any public or private Git-hosting platform. See Git Integration in the Tower docs and Pipeline Sharing in the Nextflow docs for more details.
The credentials associated with the compute environment must be able to access the work directory.
In the configuration, the full path to a bucket must be specified with single quotes around strings and no quotes around booleans or numbers.
To create your own customized Nextflow Schema for your pipeline, see the examples from the nf-core workflows that have adopted this approach. For example, eager and rnaseq.

For advanced settings options check out this page.

There is also community support available if you get into trouble, join the Nextflow Slack by following this link.

12.2.3. API

To learn more about using the Tower API, visit the API section in this documentation.

12.3. Workspaces and Organizations

Nextflow Tower simplifies the development and execution of workflows by providing a centralized interface for users and organizations.

Each user has a unique workspace where they can interact and manage all resources such as workflows, compute environments and credentials. Details of this can be found here.

By default, each user has their own private workspace, while organizations have the ability to run and manage users through role-based access as members and collaborators.

12.3.1. Organization resources

You can create your own organization and participant workspace by following the docs at tower.

Tower allows the creation of multiple organizations, each of which can contain multiple workspaces with shared users and resources. This allows any organization to customize and organize the usage of resources while maintaining an access control layer for users associated with a workspace.

12.3.2. Organization users

Any user can be added or removed from a particular organization or a workspace and can be allocated a specific access role within that workspace.

The Teams feature provides a way for organizations to group various users and participants together into teams (for example, workflow-developers or analysts) and to apply access control to all the users within a team collectively.

For further information, please refer to the User Management section.

Setting up a new organization

Organizations are the top-level structure and contain Workspaces, Members, Teams and Collaborators.

To create a new Organization:

  1. Click on the dropdown next to your name and select New organization to open the creation dialog.

  2. On the dialog, fill in the fields as per your organization. The Name and Full name fields are compulsory.

    A valid name for the organization must follow a specific pattern. Please refer to the UI for further instructions.
  3. The rest of the fields such as Description, Location, Website URL and Logo Url are optional.

  4. Once the details are filled in, you can access the newly created organization using the organization’s page, which lists all of your organizations.

    It is possible to change the values of the optional fields either using the Edit option on the organization’s page or by using the Settings tab within the organization page, provided that you are the Owner of the organization.
    A list of all the included Members, Teams and Collaborators can be found on the organization page.

13. Execution cache and resume

The Nextflow caching mechanism works by assigning a unique ID to each task, which is used to create a separate execution directory where the task is executed and its results stored.

The task unique ID is generated as a 128-bit hash number obtained by composing the task input values, input files and command string.

The pipeline work directory is organized as shown below:

work/
├── 12
│   └── 1adacb582d2198cd32db0e6f808bce
│       ├── genome.fa -> /data/../genome.fa
│       └── index
│           ├── hash.bin
│           ├── header.json
│           ├── indexing.log
│           ├── quasi_index.log
│           ├── refInfo.json
│           ├── rsd.bin
│           ├── sa.bin
│           ├── txpInfo.bin
│           └── versionInfo.json
├── 19
│   └── 663679d1d87bfeafacf30c1deaf81b
│       ├── ggal_gut
│       │   ├── aux_info
│       │   │   ├── ambig_info.tsv
│       │   │   ├── expected_bias.gz
│       │   │   ├── fld.gz
│       │   │   ├── meta_info.json
│       │   │   ├── observed_bias.gz
│       │   │   └── observed_bias_3p.gz
│       │   ├── cmd_info.json
│       │   ├── libParams
│       │   │   └── flenDist.txt
│       │   ├── lib_format_counts.json
│       │   ├── logs
│       │   │   └── salmon_quant.log
│       │   └── quant.sf
│       ├── ggal_gut_1.fq -> /data/../ggal_gut_1.fq
│       ├── ggal_gut_2.fq -> /data/../ggal_gut_2.fq
│       └── index -> /data/../asciidocs/day2/work/12/1adacb582d2198cd32db0e6f808bce/index
You can create these listings using the tree command if you have it installed. On Debian/Ubuntu, simply sudo apt install -y tree, or with Homebrew on macOS: brew install tree

13.1. How resume works

The -resume command-line option allows the continuation of a pipeline execution from the last step that was completed successfully:

nextflow run <script> -resume

In practical terms, the pipeline is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files.

If this condition is satisfied the task execution is skipped and previously computed results are used as the process results.

The first task for which a new output is computed invalidates all downstream executions in the remaining DAG.

13.2. Work directory

The task work directories are created in the folder work in the launching path by default. This is supposed to be a scratch storage area that can be cleaned up once the computation is completed.

Workflow final output(s) are supposed to be stored in a different location specified using one or more publishDir directives.
Make sure to delete your work directory occasionally, else your machine/environment may be filled with unused files.
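A minimal sketch of publishing outputs to a results folder (the directory name and the process are placeholders):

process foo {
  publishDir 'results/', mode: 'copy'

  output:
  path 'data.txt'

  script:
  """
  your_command > data.txt
  """
}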

A different location for the execution work directory can be specified using the command line option -w. For example:

nextflow run <script> -w /some/scratch/dir
If you delete or move the pipeline work directory, it will prevent the use of the resume feature in subsequent runs.

The hash code for input files is computed using:

  • The complete file path

  • The file size

  • The last modified timestamp

Therefore, just touching a file will invalidate the related task execution.

13.3. How to organize in silico experiments

It’s good practice to organize each experiment in its own folder. The main experiment input parameters should be specified using a Nextflow config file. This makes it simple to track and replicate an experiment over time.

In the same experiment, the same pipeline can be executed multiple times; however, launching two (or more) Nextflow instances in the same directory concurrently should be avoided.

The nextflow log command lists the executions run in the current folder:

$ nextflow log

TIMESTAMP            DURATION  RUN NAME          STATUS  REVISION ID  SESSION ID                            COMMAND
2019-05-06 12:07:32  1.2s      focused_carson    ERR     a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello
2019-05-06 12:08:33  21.1s     mighty_boyd       OK      a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run rnaseq-nf -with-docker
2019-05-06 12:31:15  1.2s      insane_celsius    ERR     b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf
2019-05-06 12:31:24  17s       stupefied_euclid  OK      b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker

You can use either the session ID or the run name to recover a specific execution. For example:

nextflow run rnaseq-nf -resume mighty_boyd

13.4. Execution provenance

The log command, when provided with a run name or session ID, can return many useful bits of information about a pipeline execution that can be used to create a provenance report.

By default, it will list the work directories used to compute each task. For example:

$ nextflow log tiny_fermat

/data/.../work/7b/3753ff13b1fa5348d2d9b6f512153a
/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310
/data/.../work/82/ba67e3175bd9e6479d4310e5a92f99
/data/.../work/e5/2816b9d4e7b402bfdd6597c2c2403d
/data/.../work/3b/3485d00b0115f89e4c202eacf82eba

The -f (fields) option can be used to specify which metadata should be printed by the log command. For example:

$ nextflow log tiny_fermat -f 'process,exit,hash,duration'

index    0   7b/3753ff  2.0s
fastqc   0   c1/56a36d  9.3s
fastqc   0   f7/659c65  9.1s
quant    0   82/ba67e3  2.7s
quant    0   e5/2816b9  3.2s
multiqc  0   3b/3485d0  6.3s

The complete list of available fields can be retrieved with the command:

nextflow log -l

The -F option allows the specification of filtering criteria to print only a subset of tasks. For example:

$ nextflow log tiny_fermat -F 'process =~ /fastqc/'

/data/.../work/c1/56a36d8f498c99ac6cba31e85b3e0c
/data/.../work/f7/659c65ef60582d9713252bcfbcc310

This can be useful to locate specific task work directories.

Finally, the -t option enables the creation of a basic custom provenance report, showing a template file in any format of your choice. For example:

<div>
<h2>${name}</h2>
<div>
Script:
<pre>${script}</pre>
</div>

<ul>
    <li>Exit: ${exit}</li>
    <li>Status: ${status}</li>
    <li>Work dir: ${workdir}</li>
    <li>Container: ${container}</li>
</ul>
</div>

Save the above snippet in a file named template.html. Then run this command (using the correct id for your run, e.g. not tiny_fermat):

nextflow log tiny_fermat -t template.html > prov.html

Finally, open the prov.html file with a browser.

13.5. Resume troubleshooting

If your workflow execution is not resumed as expected with one or more tasks being unexpectedly re-executed each time, these may be the most likely causes:

  • Input file changed: Make sure that there’s no change in your input file(s). Don’t forget the task unique hash is computed by taking into account the complete file path, the last modified timestamp and the file size. If any of this information has changed, the workflow will be re-executed even if the input content is the same.

  • A process modifies an input: A process should never alter input files, otherwise the resume for future executions will be invalidated for the same reason explained in the previous point.

  • Inconsistent file attributes: Some shared file systems, such as NFS, may report an inconsistent file timestamp (i.e. a different timestamp for the same file) even if it has not been modified. To prevent this problem use the lenient cache strategy.

  • Race condition in global variable: Nextflow is designed to simplify parallel programming without having to worry about race conditions and access to shared resources. One of the few cases in which a race condition can arise is when using a global variable with two (or more) operators. For example:

    Channel
        .of(1,2,3)
        .map { it -> X=it; X+=2 }
        .view { "ch1 = $it" }
    
    Channel
        .of(1,2,3)
        .map { it -> X=it; X*=2 }
        .view { "ch2 = $it" }

    The problem in this snippet is that the X variable in the closure definition is defined in the global scope. Therefore, since operators are executed in parallel, the X value can be overwritten by the other map invocation.

    The correct implementation requires the use of the def keyword to declare the variable local.

    Channel
        .of(1,2,3)
        .map { it -> def X=it; X+=2 }
        .println { "ch1 = $it" }
    
    Channel
        .of(1,2,3)
        .map { it -> def X=it; X*=2 }
        .println { "ch2 = $it" }
  • Not deterministic input channels: While dataflow channel ordering is guaranteed (i.e. data is read in the same order in which it’s written in the channel), when a process declares as input two or more channels, each of which is the output of a different process, the overall input ordering is not guaranteed to be consistent across executions.

    In practical terms, consider the following snippet:

    process foo {
      input:
      tuple val(pair), path(reads)
    
      output:
      tuple val(pair), path('*.bam'), emit: bam_ch
    
      script:
      """
      your_command --here
      """
    }
    
    process bar {
      input:
      tuple val(pair), path(reads)
    
      output:
      tuple val(pair), path('*.bai'), emit: bai_ch
    
      script:
      """
      other_command --here
      """
    }
    
    process gather {
      input:
      tuple val(pair), path(bam)
      tuple val(pair), path(bai)
    
      script:
      """
      merge_command $bam $bai
      """
    }

    The two tuple inputs declared in the gather process can be delivered in any order because the execution order of the foo and bar processes is not deterministic due to their parallel execution.

    Therefore the input of the third process needs to be synchronized using the join operator, or a similar approach. The third process should be written as:

    ...
    
    process gather {
      input:
      tuple val(pair), path(bam), path(bai)
    
      script:
      """
      merge_command $bam $bai
      """
    }
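    A sketch of how the upstream outputs could be combined with join before invoking gather (this wiring is illustrative and assumes the emit names declared above):

    workflow {
      reads_ch = Channel.of( ['sample1', file('sample1.fq')] )  // hypothetical input

      foo(reads_ch)
      bar(reads_ch)

      // join matches the tuples by their first element (the pair key)
      gather( foo.out.bam_ch.join(bar.out.bai_ch) )
    }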

14. Error handling and troubleshooting

14.1. Execution error debugging

When a process execution exits with a non-zero exit status, Nextflow stops the workflow execution and reports the failing task:

ERROR ~ Error executing process > 'index'

Caused by:           (1)
  Process `index` terminated with an error exit status (127)

Command executed:    (2)

  salmon index --threads 1 -t transcriptome.fa -i index

Command exit status: (3)
  127

Command output:      (4)
  (empty)

Command error:       (5)
  .command.sh: line 2: salmon: command not found

Work dir:            (6)
  /Users/pditommaso/work/0b/b59f362980defd7376ee0a75b41f62
1 A description of the error cause
2 The command executed
3 The command exit status
4 The command standard output, when available
5 The command standard error
6 The command work directory

Carefully review all error data as it can provide information valuable for debugging.

If this is not enough, cd into the task work directory. It contains all the files to replicate the issue in an isolated manner.

The task execution directory contains these files:

  • .command.sh: The command script.

  • .command.run: The command wrapper used to run the job.

  • .command.out: The complete job standard output.

  • .command.err: The complete job standard error.

  • .command.log: The wrapper execution output.

  • .command.begin: Sentinel file created as soon as the job is launched.

  • .exitcode: A file containing the task exit code.

  • Task input files (symlinks)

  • Task output files

Verify that the .command.sh file contains the expected command and all variables are correctly resolved.

Also verify the existence of the .exitcode and .command.begin files, which, if absent, suggest the task was never executed by the subsystem (e.g. the batch scheduler). If the .command.begin file exists, the job was launched but was likely killed abruptly.

You can replicate the failing execution using the command bash .command.run to verify the cause of the error.

14.2. Ignore errors

There are cases in which a process error may be expected and it should not stop the overall workflow execution.

To handle this use case, set the process errorStrategy to ignore:

process foo {
  errorStrategy 'ignore'
  script:
  """
    your_command --this --that
  """
}

If you want to ignore any error you can set the same directive in the config file as a default setting:

process.errorStrategy = 'ignore'

14.3. Automatic error fail-over

In rare cases, errors may be caused by transient conditions. In this situation, an effective strategy is re-executing the failing task.

process foo {
  errorStrategy 'retry'
  script:
  """
    your_command --this --that
  """
}

With the retry error strategy, a task that returns a non-zero exit status is re-executed a second time before the complete workflow execution is stopped.

The directive maxRetries can be used to set the number of attempts the task can be re-executed before declaring it failed with an error condition.
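For example, a sketch combining the two directives so that the task is retried up to three times before being declared failed:

process foo {
  errorStrategy 'retry'
  maxRetries 3
  script:
  """
    your_command --this --that
  """
}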

14.4. Retry with backoff

There are cases in which the required execution resources may be temporarily unavailable (e.g. network congestion). In these cases simply re-executing the same task will likely result in an identical error. A retry with an exponential backoff delay can better recover these error conditions.

process foo {
  errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
  maxRetries 5
  script:
  '''
  your_command --here
  '''
}

14.5. Dynamic resources allocation

It’s a very common scenario that different instances of the same process may have very different needs in terms of computing resources. In such situations, requesting, for example, too little memory will cause some tasks to fail. Instead, using a higher limit that fits all the tasks in your execution could significantly decrease the execution priority of your jobs.

To handle this use case, you can use a retry error strategy and increase the computing resources allocated by the job at each successive attempt.

process foo {
  cpus 4
  memory { 2.GB * task.attempt }   (1)
  time { 1.hour * task.attempt }   (2)
  errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }   (3)
  maxRetries 3   (4)

  script:
  """
    your_command --cpus $task.cpus --mem $task.memory
  """
}
1 The memory is defined in a dynamic manner, the first attempt is 2 GB, the second 4 GB, and so on.
2 The wall execution time is set dynamically as well, the first execution attempt is set to 1 hour, the second 2 hours, and so on.
3 If the task returns an exit status equal to 140 it will set the error strategy to retry otherwise it will terminate the execution.
4 It will retry the process execution up to three times.