4. Part 2: Data Transformation & Analysis
In this part, we'll build a multi-step workflow that classifies images using machine learning, manages computational resources, and groups results intelligently.
4.1 Classification
Let's get to the fun part - the cat sorting!
We have a small classification script - classify.py - provided in the .stuff directory.
In your own research you'll often have small accessory scripts like this that support your pipelines.
We're using a Python script in this workshop example, but the same pattern holds for scripts written in Perl, Ruby, R, ClojureScript, or any other interpreted language.
4.1.1 Set up the classification script
Let's pull the file out into a new bin directory:
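One way to do this is with a couple of shell commands. This is only a sketch - the first line creates a placeholder so the snippet is self-contained, whereas in the workshop .stuff/classify.py already exists:

```shell
# In the workshop, classify.py is already provided in .stuff;
# this line only creates a placeholder so the sketch is self-contained.
mkdir -p .stuff && touch .stuff/classify.py

mkdir -p bin                 # a new bin directory next to main.nf
cp .stuff/classify.py bin/   # pull the script out of .stuff
chmod +x bin/classify.py     # make sure it is executable
```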
The script requires some dependencies. Again, we'll do this the slow, painful way once before demonstrating how to use containers to encapsulate software dependencies.
We'll grab one more file from our .stuff directory - a pyproject.toml file, which is one way of describing software dependencies for Python projects.
This is unrelated to Nextflow; it's just one example of the (many) ways in which different languages and frameworks install software.
You can install the dependencies and activate the environment with:
which you can then run via its path, e.g. ./bin/classify.py --help, to see its usage:

```
usage: classify.py [-h] [--model-path MODEL_PATH] [--labels LABELS [LABELS ...]] [--json] image

Classify a single image using MetaCLIP

positional arguments:
  image                 Path to the image file to classify

options:
  -h, --help            show this help message and exit
  --model-path MODEL_PATH
                        Path to MetaCLIP model weights (default: data/models/b32_400m.pt)
  --labels LABELS [LABELS ...]
                        Labels for classification (default: ["good cat", "bad cat"])
  --json                Output result as JSON to stdout
  --architecture {ViT-B-32-quickgelu,ViT-B-16-quickgelu,ViT-L-14-quickgelu,ViT-H-14-quickgelu}
                        Model architecture (auto-detected from filename if not specified)
```
4.1.2 Download the classification model
The script takes images, a model, and a set of labels and classifies each of the images according to the labels. To run the script outside of Nextflow, we'll need to download one of the models. Do so with:
```bash
mkdir -p data/models
(cd data/models && wget https://dl.fbaipublicfiles.com/MMPT/metaclip/b32_400m.pt)
```
4.1.3 Create the Classify process
Now let's create a Classify process that will take two channels - one channel of images and one channel that supplies the model:
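Sketched out, such a process might look like this (the output shape and script invocation are assumptions - here the predicted label is captured from stdout, though the script also offers --json):

```nextflow
process Classify {
    input:
    tuple val(meta), path(image)
    path(model)

    output:
    tuple val(meta), stdout

    script:
    """
    classify.py --model-path ${model} ${image}
    """
}
```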
Note here that we're calling the classify.py script directly, even though we couldn't do that from our own shell (there we had to provide a relative or absolute path).
This is because Nextflow automatically adds the bin directory (relative to main.nf) to the $PATH for all Nextflow tasks.
This is a very convenient way to bundle accessory scripts and snippets with your workflow.
4.1.4 Understanding queue vs. value channels
Processes can have multiple channels as input or as output. A process will continue to emit tasks as long as it can pull an item from each of the input channels. We could create a new channel for the model, and define a sensible default:
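Sketched out, that might look like this (the metadata construction is illustrative - `images` stands for the [metadata, image] channel built earlier):

```nextflow
params.model = 'data/models/b32_400m.pt'   // a sensible default

workflow {
    images = channel.fromPath('data/pics/*.{png,gif,jpg}')
        .map { p -> [[id: p.baseName], p] }    // illustrative metadata

    model = channel.fromPath(params.model)     // a queue channel with one item

    Classify(images, model)
}
```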
What happens when you run the workflow? Given what we know about channels, what might be happening?
Answer: The Classify process only spawns a single task. This is because after pulling the model path from the second input channel on the first iteration, the channel is empty, so no more Classify tasks can be submitted for execution.
There are two types of channels in Nextflow - queue channels and value channels. Queue channels are exhaustible: they hold a set number of items, and each process can take each item only once. A value channel, by contrast, holds a single item, which is emitted without ever being exhausted.
4.1.5 Using value channels
Some operators always return a value channel - for example first, collect, and count.
We could also create a value channel using the channel.value factory:
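For example (a sketch - `images` is the image channel from before):

```nextflow
model = channel.value(file(params.model))   // a value channel: never exhausted
Classify(images, model)
```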
Note here that we're wrapping the params.model value (a String) in the file() function, which turns an ordinary String into an object that Nextflow can use as a path.
We've not needed to use this until now because the channel.fromPath factory necessarily returns paths, so it automatically does this conversion for us.
4.1.6 Implicit value channels
An even simpler solution is to provide the path object directly when calling the process. Any non-channel object will automatically be converted into a value channel for you:
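A sketch of the call site (any non-channel argument is wrapped in a value channel for us):

```nextflow
Classify(images, file(params.model))
```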
Add the Classify process definition to your workflow and run it:
You might find that the process errors out with a 137 exit code. This generally means that we've run out of RAM because we're running too many of these classification jobs at the same time. Let's talk about how we tell Nextflow that a particular process requires more resources.
4.1.7 Takeaway
Understanding queue channels vs. value channels is crucial for controlling how data flows through multi-input processes.
4.1.8 What's next?
Let's learn how to manage computational resources for our processes.
4.2 Resources
Our processes are currently composed of the input:, output:, and script: blocks.
In addition to these blocks, processes can use "process directives" - optional annotations that modify the behaviour of a process.
There are many directives (documentation), but we can introduce the concept with two important ones: memory and cpus.
4.2.1 Understanding executors
So far, we've been using the local executor to run Nextflow - running on the local machine. There are many other executors targeting different backends, from HPC executors like SLURM and PBS to cloud executors like AWS Batch, Google Batch, and Azure Batch. There are more than a dozen supported executors (documentation).
Each of these has a concept of the resources a particular task will require - cpus, memory, gpus, disk, and so on.
4.2.2 Resource defaults and management
If not otherwise specified, the default is to request 1 CPU, 1 GB of RAM, and 0 GPUs for each task.
When using the local executor, Nextflow scans the machine it is running on to determine how many CPUs and how much RAM the system has. It then ensures that the running tasks (given the resources specified, or the defaults applied) never exceed the available limits. If the system has 16 GB of RAM, for example, and a particular process requires 6 GB of RAM, Nextflow will ensure that at most 2 of those tasks are running at any one time. As a task finishes, Nextflow begins the next task in line.
4.2.3 Add resource directives
Update your Classify process to request more memory:
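For example (the exact figure is an assumption - size it to your model and machine):

```nextflow
process Classify {
    memory '4 GB'

    // input:, output:, and script: blocks unchanged
}
```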
Now run the workflow again:
4.2.4 Takeaway
Process directives like memory and cpus communicate resource requirements to Nextflow executors, enabling proper scheduling and preventing resource exhaustion.
4.2.5 What's next?
Let's learn how to combine related data using the join and groupTuple operators.
4.3 Grouping
Now we want to combine our classification results with our resized images.
We can use the join operator, which finds pairs of items (one from each channel) that share a key.
By default, the join operator uses the first element of each item in the channel as the key.
In our case, that first element is the image metadata, which occupies the first position in both the Classify process output and the Resize process output.
4.3.1 Join classification results with images
Update your workflow to join the channels:
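A sketch (assuming, as above, that both processes put the metadata in the first position of their output tuples):

```nextflow
joined = Classify.out.join(Resize.out)
```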
This produces a channel like:
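With illustrative values, each item in the joined channel might look like:

```
[[id: 'cat1'], 'good cat', cat1_resized.png]
[[id: 'cat2'], 'bad cat', cat2_resized.png]
```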
4.3.2 Group items by label
In order to make a picture of just the good cats and a second picture of just the bad cats, we'll need to group the items in the channel based on the label.
We can do this with the groupTuple operator.
Normally, groupTuple expects the grouping key to be the first element of each item in the channel.
In our case it is the second element - index 1, counting from 0.
To ask Nextflow to group on the element at index 1, we add a by: 1 argument to the operator:
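A sketch (assuming the joined channel from the previous step is called `joined`):

```nextflow
grouped = joined.groupTuple(by: 1)   // group on the label at index 1
```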
This produces a channel of the form:
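With illustrative values, one item per label:

```
[[[id: 'cat1'], [id: 'cat3']], 'good cat', [cat1_resized.png, cat3_resized.png]]
[[[id: 'cat2']], 'bad cat', [cat2_resized.png]]
```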
4.3.3 Takeaway
The join and groupTuple operators allow you to match related items and collect them by common attributes.
4.3.4 What's next?
Let's create visual collages for each group of classified images.
4.4 Collage
Let's create a Collage process that takes this channel and produces a collage of all of the images for each label.
The script block here is a little involved, but it uses ImageMagick's montage command to arrange images into a grid.
4.4.1 Create the Collage process
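A sketch of such a process (the montage flags, output naming, and exact tuple shapes are assumptions; the real script block is a little more involved):

```nextflow
process Collage {
    input:
    tuple val(metas), val(label), path(images)

    output:
    tuple val(label), path('collage_*.png')

    script:
    """
    montage -geometry +4+4 ${images} "collage_${label.replaceAll(' ', '_')}.png"
    """
}
```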
4.4.2 Connect to the workflow
We can then hook this into our channel chain:
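For example, chaining the operators from the previous section (a sketch):

```nextflow
Collage(
    Classify.out
        .join(Resize.out)
        .groupTuple(by: 1)
)
```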
4.4.3 Optimize with resized images
Those collage tasks are taking a little too long - likely because we're collaging the original full-sized images rather than our resized images.
Because the images channel and the output channel from the Resize process both have the same shape, we can simply replace them in the workflow:
4.4.4 Combine all collages
For our final process, let's combine these two collages together into a single final image.
We'll create a process that takes a collection of images (we don't care what they are called) and produces a final collage_all.png image:
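A sketch, again using ImageMagick's montage (the flags are illustrative):

```nextflow
process CombineImages {
    input:
    path(images)   // we don't care what the files are called

    output:
    path('collage_all.png')

    script:
    """
    montage -geometry +4+4 ${images} collage_all.png
    """
}
```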
4.4.5 Transform the channel
The channel coming from the Collage process looks like:
but we need it to look like:
So we'll drop the labels and collect all images:
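Sketched with map and collect (channel names are illustrative):

```nextflow
// Collage.out items look like [label, collage image];
// CombineImages just wants one flat collection of images.
all_collages = Collage.out
    .map { label, image -> image }   // drop the label
    .collect()                       // gather into a single "wide" item

CombineImages(all_collages)
```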
The collect operator takes all the items in a channel and then emits them as a single "wide" collection.
Run the complete workflow:
4.4.6 Scaling up without code changes
One of Nextflow's key strengths is automatic scalability. Let's see this in action by adding more data to our analysis!
While your workflow is still running (or right after it completes), open a new terminal and add more cat images:
This brings our total from 4 cats to 24 cats.
Now run the workflow again with -resume:
Notice what happens in the output:
- Tasks for the original 4 images show as [cached] in gray
- Only the 20 new images are processed through Resize and Classify
- The groupTuple, Collage, and CombineImages steps run again (because their inputs changed)
- The final collage now includes all 24 cats
You didn't change a single line of code - the workflow automatically:
- Detected the new input files via the glob pattern data/pics/*.{png,gif,jpg}
- Processed only the new images that hadn't been seen before
- Reused cached results for the original 4 images
- Scaled the grouping and collage operations to handle more data
This is the power of Nextflow's declarative approach: you describe what you want to do, and Nextflow figures out how to do it efficiently, whether you have 4 files or 4,000 files.
Scalability in practice
This same pattern works at any scale:
- Local development: Test with 4 samples
- Pilot study: Scale to 24 samples with no code changes
- Production: Process thousands of samples with the same workflow
- HPC/Cloud: Nextflow automatically distributes tasks across available resources
4.4.7 Takeaway
You can chain together multiple processes and operators to build sophisticated multi-step workflows that transform and aggregate data. Nextflow automatically scales your workflow as your data grows, without requiring any code changes.
4.4.8 What's next?
Now that we have a working workflow, let's learn how to publish the results in an organized way.