Observability in genomic workflows: Part 1

Introduction
Processing large amounts of genomic data, whether for clinical or research purposes, requires complex technical infrastructure with many moving pieces. We need to account for the considerable size of genomic data, the networking bandwidth required to transfer it between systems, the processing power and memory needed to perform analyses, as well as the myriad software components utilized in various genomic workflows. The infrastructure itself is often dynamic, such as in cloud deployments where compute resources are created and configured on-demand, then removed once idle.
A significant challenge when designing any complex computing infrastructure is ensuring the system has sufficient observability. Observability refers to how effectively one can reason about the internal operation of a system based on its outputs. Generally, the more observable a system is, the easier it is to troubleshoot when something goes wrong. And, as anyone in engineering knows, things can always go wrong!
In this post, I’m going to walk through improving the observability of a typical system one might use in the processing of genomic data: a Slurm high-performance computing cluster running Nextflow workflows. This will set us up for my follow-up post, where we will leverage our new and improved observable computing cluster to do something really cool! And yes, it will involve AI.
Let’s get started.
The cluster
In preparing for this post, I did a bit of research into ways to easily experiment with a minimal Slurm cluster. I found an excellent resource in Giovanni Torres’s slurm-docker-cluster GitHub repo. It’s a fully working Slurm cluster running in Docker, including a controller and two compute nodes. Perfect for creating a sandbox that we can run on our local system.
I forked the repo and made a few changes to the Docker image (a rough sketch of the Dockerfile additions follows the list):
- Installed Java 17 (a dependency of Nextflow)
- Installed Nextflow
- Added a slurmuser account for running jobs, instead of using the root account
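For reference, the image changes amount to a few extra Dockerfile lines, roughly along these lines (package names and install commands are illustrative and depend on the upstream base image; the fork has the exact details):
# Sketch of the additions to the slurm-docker-cluster image

# Java 17 is required by recent Nextflow releases
RUN dnf install -y java-17-openjdk && dnf clean all

# Install Nextflow via its self-installing script and put it on the PATH
RUN curl -s https://get.nextflow.io | bash \
    && mv nextflow /usr/local/bin/nextflow \
    && chmod +x /usr/local/bin/nextflow

# Non-root account for submitting and running jobs
RUN useradd -m slurmuser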
Once the cluster is built and running, the make shell shortcut will log us in to the controller:
$ make shell
docker exec -it -u slurmuser -w /home/slurmuser slurmctld bash --login
[slurmuser@slurmctld ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 2 idle c[1-2]
[slurmuser@slurmctld ~]$ nextflow -version
N E X T F L O W
version 25.10.0 build 10289
created 22-10-2025 16:26 UTC
cite doi:10.1038/nbt.3820
http://nextflow.io
Now we are ready to submit jobs to the cluster.
The workflow
Let’s define a basic exome workflow using Nextflow. Because we are running it on a tiny cluster inside Docker, it isn’t actually going to do anything useful. Instead, I’ll just have the various component processes create dummy outputs and status messages. Here’s the workflow’s structure without any process definitions:
workflow {
    ch_aligned_bam = ALIGN(
        params.sample_id,
        file(params.reads1),
        file(params.reads2),
        file(params.ref_fa)
    )
    ch_vcf = CALL_VARIANTS(
        params.sample_id,
        ch_aligned_bam,
        file(params.ref_fa)
    )
    COVERAGE_QC(
        params.sample_id,
        ch_aligned_bam
    )
    ch_annotated_vcf = ANNOTATE_VCF(
        params.sample_id,
        ch_vcf
    )
    SUBMIT_TO_DB(
        params.sample_id,
        ch_aligned_bam,
        ch_vcf,
        ch_annotated_vcf
    )
}
The full workflow is available here. Each of the processes prints status messages in addition to its output files. Let’s run the workflow on our cluster:
nextflow run \
-config nextflow.config \
-profile slurm \
-name exome_TESTSAMPLE01 \
exome.nf \
--sample_id TESTSAMPLE01 \
--reads1 r1_fastq.gz \
--reads2 r2_fastq.gz \
--ref_fa reference.fa
N E X T F L O W ~ version 25.10.0
Launching `exome.nf` [exome_TESTSAMPLE01] DSL2 - revision: 738622dda5
executor > slurm (5)
[f7/5dc093] ALIGN (TESTSAMPLE01) [100%] 1 of 1 ✔
[fe/8b04e7] CALL_VARIANTS (TESTSAMPLE01) [100%] 1 of 1 ✔
[9a/7dbb22] COVERAGE_QC (TESTSAMPLE01) [100%] 1 of 1 ✔
[a2/aaddf0] ANNOTATE_VCF (TESTSAMPLE01) [100%] 1 of 1 ✔
[34/eff6c8] SUBMIT_TO_DB (TESTSAMPLE01) [100%] 1 of 1 ✔
Great! Our workflow executed successfully on our Slurm cluster. Next, we’ll explore the outputs generated by Nextflow and our workflow processes.
Workflow outputs
By default, Nextflow stages and runs every task under a directory named work in the launch directory:
$ find work -type f
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.begin
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.run
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.exitcode
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.log
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.sh
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/TESTSAMPLE01.db_payload.json
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.err
work/34/eff6c8e95d8bfd3e1eec2252b1baaf/.command.out
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.begin
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.run
work/a2/aaddf0235fb6db20c3bf409901bd6c/.exitcode
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.log
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.sh
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.err
work/a2/aaddf0235fb6db20c3bf409901bd6c/.command.out
work/a2/aaddf0235fb6db20c3bf409901bd6c/TESTSAMPLE01.annotated.vcf.gz
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.begin
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.run
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.exitcode
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.log
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.sh
work/fe/8b04e7a4cd491e5ef0d42c304c54af/TESTSAMPLE01.variants.vcf
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.err
work/fe/8b04e7a4cd491e5ef0d42c304c54af/.command.out
work/fe/8b04e7a4cd491e5ef0d42c304c54af/TESTSAMPLE01.variants.vcf.gz
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.begin
work/9a/7dbb22ee642fe117d1e7c9cf389fea/TESTSAMPLE01.coverage_qc.txt
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.run
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.exitcode
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.log
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.sh
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.err
work/9a/7dbb22ee642fe117d1e7c9cf389fea/.command.out
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.begin
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.run
work/f7/5dc0939716b76cf43b3bc534aa5719/.exitcode
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.log
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.sh
work/f7/5dc0939716b76cf43b3bc534aa5719/TESTSAMPLE01.aligned.bam
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.err
work/f7/5dc0939716b76cf43b3bc534aa5719/.command.out
Each process in the workflow was assigned its own working directory, and log messages can be found in the .command.err (standard error) and .command.out (standard output) files. For example, in the output of the nextflow run command, we see the identifier [f7/5dc093] next to the ALIGN process. That refers to the working directory work/f7/5dc0939716b76cf43b3bc534aa5719. So the output messages generated by the alignment process will be in this .command.out file:
head work/f7/5dc0939716b76cf43b3bc534aa5719/.command.out
Starting alignment for sample: TESTSAMPLE01
R1: r1_fastq.gz
R2: r2_fastq.gz
Ref: reference.fa
Aligning... 0% complete
Aligning... 1% complete
Aligning... 2% complete
Aligning... 3% complete
Aligning... 4% complete
Aligning... 5% complete
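The bracketed identifier that nextflow run prints is simply a prefix of the full task directory, so a shell glob is enough to locate it:
# [f7/5dc093] from the run output expands to the full work subdirectory
ls -d work/f7/5dc093*
# work/f7/5dc0939716b76cf43b3bc534aa5719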
You are probably beginning to see that this is a bit complicated to review. If this workflow were to fail, we’d need to know a lot of little details to begin troubleshooting:
- Where can we find the work directory associated with this workflow execution?
- Which process failed in the workflow?
- What is the name of the working directory for that process?
- What are the contents of the standard output and standard error logs for the failed process?
We might also want to know the specific Slurm job IDs for each submitted process, say to identify the exact compute node used for execution. Those are not in the work directory, but within the corresponding .nextflow.log file as lines like this (line breaks added for readability):
[Task submitter] DEBUG nextflow.executor.GridTaskHandler -
[SLURM] submitted process ALIGN (TESTSAMPLE01) > jobId: 29;
workDir: /home/slurmuser/exome/work/f7/5dc0939716b76cf43b3bc534aa5719
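If all you need are the Slurm job IDs, a quick grep against the run’s .nextflow.log pulls out every submission line:
# Each Nextflow run writes a .nextflow.log in the launch directory;
# the submission lines include the process name, Slurm jobId, and workDir
grep 'submitted process' .nextflow.log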
While everything we need to troubleshoot a workflow failure is indeed available, it is certainly not easily available, and as a result the observability of this system suffers. What we want is for the cluster to output this information in a way that is clear and accessible. An obvious first step would be to get all of these logs into one spot instead of scattered throughout various Nextflow work directories.
Nextflow configuration improvements
Our current Nextflow configuration file looks like this:
profiles {
    slurm {
        process {
            executor = 'slurm'
            cpus = 1
            queue = 'normal'
        }
    }
}
We can adjust it so that all workflows executed using this configuration file use the same work directory. This is supported by Nextflow, as processes will be assigned unique working directories automatically under this base workDir path:
profiles {
    slurm {
        process {
            executor = 'slurm'
            cpus = 1
            queue = 'normal'
        }
        workDir = '/nextflow/work'
    }
}
Now, our log centralization task is a bit easier, as most of the Nextflow logs we care about can be found in a central spot. I did mention above that some additional information, such as the Slurm job ID for each process, is outside the work directory and in individual .nextflow.log files generated by nextflow run. We will come back to those later.
Generally, log centralization doesn’t mean that we just write all of our logs to one location in the filesystem. It typically means that we configure multiple systems to send their logs to a centralized logging server. To accomplish this, we are going to need another component: a log forwarder.
Log forwarding
A log forwarder is a process that continuously monitors for new log entries and sends them to a central log server. This allows for the monitoring and reviewing of log data from one location. In this case, forwarding the logs to a central server will allow us to avoid having to manually connect to the Slurm cluster and navigate to the /nextflow/work directory in order to explore log messages from our workflows.
I selected Fluent Bit as the log forwarder for this exercise, as it has a useful feature that we will take advantage of shortly: dynamic log record enrichment. This is the ability to add extra metadata to our log entries as they are streamed to the central log server, which will be helpful when we need to search through our logs.
Since Fluent Bit is available as a Docker image, we can add it as another service in our Docker-based Slurm cluster:
fluentbit:
  image: fluent/fluent-bit:latest
  container_name: fluentbit
  command: ["-c", "/fluent/fluent-bit.conf"]
  volumes:
    - ./volumes/nextflow:/nextflow:ro
    - ./volumes/fluent:/fluent
  networks:
    - slurm-network
  restart: unless-stopped
Now we can adjust its configuration to process our Nextflow logs, since they are always stored under directories in /nextflow/work/. You can view the Fluent Bit configuration files in the GitHub repo. This configuration does the following (a rough sketch of its shape follows the list):
- Monitors /nextflow/work for .command.err and .command.out Nextflow process logs
- Attempts to parse each line as JSON, otherwise as a plain log line
- Runs each log record (whether JSON or plain) through a custom Lua enrich_nextflow_metadata() function
- Sends the processed log records to Loki, which we will discuss later in this post
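To make that concrete, here is a rough sketch of the shape such a Fluent Bit configuration might take. It is illustrative only: the paths, the Lua script name, and several option choices are my assumptions, and the files in the repo are the source of truth.
# Sketch only -- see the repo for the real fluent-bit.conf
[SERVICE]
    # Assumes the stock parsers file (which defines a "json" parser) is loaded
    Parsers_File  parsers.conf

[INPUT]
    Name            tail
    Path            /nextflow/work/*/*/.command.out,/nextflow/work/*/*/.command.err
    Path_Key        file
    Tag             nextflow.logs
    Read_from_Head  true

# Try to parse each line as JSON; non-JSON lines pass through unchanged
[FILTER]
    Name          parser
    Match         nextflow.*
    Key_Name      log
    Parser        json
    Reserve_Data  On

# Enrich each record with the metadata remembered for its source file
[FILTER]
    Name    lua
    Match   nextflow.*
    Script  /fluent/enrich_nextflow_metadata.lua
    Call    enrich_nextflow_metadata

# Ship to Loki, promoting only low-cardinality fields to labels
[OUTPUT]
    Name        loki
    Match       nextflow.*
    Host        loki
    Port        3100
    Labels      source=nextflow
    Label_Keys  $stream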
The JSON parsing might seem confusing, since the processes in our Nextflow workflow emit only plain-text status messages, not JSON. But with this Fluent Bit configuration, we can add JSON output to the workflow and have Fluent Bit attach it to our log entries as metadata. That matters because, by default, Fluent Bit only sends the log’s filename alongside each entry, which doesn’t give us much to search on. Here’s how it works (a sketch of the enrichment logic in Lua follows the list):
- Each process will first output a custom JSON line to both standard output and standard error
- Fluent Bit will parse the JSON and assign the metadata it contains to the log file
- Subsequent non-JSON log lines from the same file will be assigned the metadata during ingestion
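Here is a minimal sketch of that per-file enrichment idea in Lua. This is not the repo’s actual enrich_nextflow_metadata() function; it assumes the tail input records the source path under a file key (e.g. via Path_Key) and that a parser stage has already turned the JSON header line into record fields.
-- Sketch of per-file metadata enrichment (Fluent Bit Lua filter callback).
-- Fluent Bit calls the function with (tag, timestamp, record) and expects
-- (code, timestamp, record) back: 1 = record modified, -1 = drop the record.

local meta_by_file = {}

function enrich_nextflow_metadata(tag, timestamp, record)
    local file = record["file"] or "unknown"

    -- Derive the stream (stdout/stderr) from the file name so it can be
    -- promoted to a Loki label later
    if string.find(file, ".command.err", 1, true) then
        record["stream"] = "stderr"
    else
        record["stream"] = "stdout"
    end

    if record["nf_meta"] then
        -- JSON header line: remember its fields for this file, drop the line itself
        meta_by_file[file] = record
        return -1, timestamp, record
    end

    -- Plain log line: attach any metadata previously seen for the same file
    local meta = meta_by_file[file]
    if meta then
        for k, v in pairs(meta) do
            if k ~= "nf_meta" and record[k] == nil then
                record[k] = v
            end
        end
    end

    return 1, timestamp, record
end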
Now, we can control the metadata assigned to our Nextflow process logs from within our Nextflow workflow itself. We just need to ensure that the first line outputted by each process is a JSON string. Let’s walk through that next.
Enriching Nextflow log output
Here are some details that would be helpful to associate with each log entry:
- The source Nextflow workflow name
- The workflow’s -name parameter value
- The workflow process it belongs to
- Whether it is standard output or standard error
- The user running the workflow
- The Slurm job ID for the process
- The Slurm node name
- The sample name that was being processed
All of these details can be obtained either from environment variables during execution of the process, or from Nextflow itself. We are going to collect them into a single JSON line that we can output at the start of every process in our workflow. First, we’ll create a reusable function in our workflow:
def jsonHeader(task, sample_id) {
    def process = task.process
    def attempt = task.attempt
    """\
    printf '{"nf_meta":true,"workflow":"%s","nextflow_run":"%s","process":"%s","sample":"%s","attempt":"%s","user":"%s","slurm_job_id":"%s","slurm_node":"%s"}\\n' \
        "\${WORKFLOW_NAME:-unknown}" \
        "\${RUN_NAME:-unknown}" \
        "${process}" \
        "${sample_id}" \
        "${attempt}" \
        "\${USER:-unknown}" \
        "\${SLURM_JOB_ID:-none}" \
        "\${SLURM_NODELIST:-none}" | tee /dev/stderr
    """.stripIndent()
}
Now we can call the function at the start of each script: block for our processes:
process ALIGN {
    tag "${sample_id}"

    input:
    val(sample_id)
    path(reads1)
    path(reads2)
    path(ref_fa)

    output:
    path("${sample_id}.aligned.bam")

    script:
    """
    ${jsonHeader(task, sample_id)}
    echo "Starting alignment for sample: ${sample_id}"
    echo "R1: ${reads1}"
    echo "R2: ${reads2}"
    echo "Ref: ${ref_fa}"
    # script continues ...
    """
}
See exome-fluent.nf for the fully modified workflow. I encountered one small issue with this approach: the WORKFLOW_NAME and RUN_NAME variables are not set by Nextflow and are not even accessible from a process script: block. I fixed this by placing a beforeScript directive in the Nextflow configuration file that defines the two variables:
process {
    beforeScript = { "export RUN_NAME=${workflow.runName}; export WORKFLOW_NAME=${workflow.scriptName}" }
}
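For reference, with both changes applied, the slurm profile in our configuration file now looks roughly like this (assuming the beforeScript directive sits alongside the other process settings):
profiles {
    slurm {
        process {
            executor     = 'slurm'
            cpus         = 1
            queue        = 'normal'
            beforeScript = { "export RUN_NAME=${workflow.runName}; export WORKFLOW_NAME=${workflow.scriptName}" }
        }
        workDir = '/nextflow/work'
    }
}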
Now, our workflow will prepend a metadata line in JSON format to every .command.out and .command.err log file. Let’s run the exome-fluent.nf workflow to see:
nextflow run \
-config nextflow.config \
-profile slurm \
-name exome_TESTSAMPLE02 \
exome-fluent.nf \
--sample_id TESTSAMPLE02 \
--reads1 r1_fastq.gz \
--reads2 r2_fastq.gz \
--ref_fa reference.fa
N E X T F L O W ~ version 25.10.0
Launching `exome-fluent.nf` [exome_TESTSAMPLE02] DSL2 - revision: dddc8b9e6b
executor > slurm (5)
[2e/c3167c] ALIGN (TESTSAMPLE02) [100%] 1 of 1 ✔
[2f/51de0b] CALL_VARIANTS (TESTSAMPLE02) [100%] 1 of 1 ✔
[3e/0c67c2] COVERAGE_QC (TESTSAMPLE02) [100%] 1 of 1 ✔
[b1/331833] ANNOTATE_VCF (TESTSAMPLE02) [100%] 1 of 1 ✔
[2b/ac660c] SUBMIT_TO_DB (TESTSAMPLE02) [100%] 1 of 1 ✔
$ head -n 5 /nextflow/work/2e/c3167c0ad5be07340f41ac0298aa07/.command.out
{"nf_meta":true,"workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","process":"ALIGN","sample":"TESTSAMPLE02","attempt":"1","user":"slurmuser","slurm_job_id":"67","slurm_node":"c1"}
Starting alignment for sample: TESTSAMPLE02
R1: r1_fastq.gz
R2: r2_fastq.gz
Ref: reference.fa
Great! Our JSON header shows up. These values will be parsed by Fluent Bit and sent as extra metadata attached to all log lines from the same file. We have just two additional components left to configure. We’ve already mentioned the first one - the central logging server.
Centralized log storage
Fluent Bit doesn’t store logs; it only forwards log records to a configurable destination. The [OUTPUT] section of our fluent-bit.conf file points to a host called loki. Loki is a database for storing and querying log data and metadata. Like Fluent Bit, it is available as a Docker image, allowing us to easily add it alongside our Slurm cluster:
loki:
  image: grafana/loki:latest
  container_name: loki
  command: ["-config.file=/etc/loki/config.yaml"]
  ports:
    - "3100:3100"
  expose:
    - "3100"
  volumes:
    - ./volumes/loki/loki-config.yaml:/etc/loki/config.yaml:ro
    - ./volumes/loki/data:/loki
  networks:
    - slurm-network
  restart: unless-stopped
Loki has a lot of options for where and how to store log data. I’ve selected the simplest option for our sandbox environment: local storage to a bind mount in the Docker container. See the Loki configuration file for more details.
Once our Loki container is up and running, our Fluent Bit configuration will send all of our Nextflow logs to Loki with added metadata. Now, we just need a way to view and query our logs in Loki.
Accessing Loki logs
The simplest way to retrieve our logs from Loki is to use the logcli command line tool. We can run this tool outside of our Slurm cluster, as the Docker configuration above exposes Loki’s port 3100. Loki’s query language is called LogQL, and is relatively straightforward. But before we try out some queries, I need to explain how logs are stored in Loki. Loki uses labels to index our log entries for rapid querying. You might assume that all of the JSON fields in our metadata become Loki labels, but that would unfortunately make Loki very inefficient. Loki’s labels need to have low cardinality—meaning they should have a small number of unique values. Nearly all of our JSON metadata fields will have many unique values over time: sample names, Slurm job IDs, Nextflow run and process names, etc. The Fluent Bit configuration determines which of our metadata JSON fields get “promoted” to Loki labels, and I’ve specified only two: source (which is always set to “nextflow”) and stream, which is either stdout or stderr.
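If you want to confirm which labels actually made it into Loki, logcli can list them directly (using the same LOKI_ADDR setting as the queries below):
export LOKI_ADDR=http://localhost:3100

# List the indexed label names, then the values of the stream label
logcli labels
logcli labels stream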
We can still filter and query our logs using the extra JSON fields, but when we structure a LogQL query, we always need to filter on the labels first. Let’s look at some examples. If we want to see all the logs (standard error and standard output) for a Nextflow run named “exome_TESTSAMPLE02”, we could retrieve the JSON entries with this logcli command:
export LOKI_ADDR=http://localhost:3100
logcli query '{ source="nextflow" } | json | nextflow_run="exome_TESTSAMPLE02"'
You’ll see the log entries as JSON records with timestamps:
2025-11-22T12:39:09-05:00 {file="/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out", log="Annotated VCF: TESTSAMPLE02.annotated.vcf.gz", process="SUBMIT_TO_DB", slurm_job_id="81", slurm_node="c1"} {"user":"slurmuser","log":"Annotated VCF: TESTSAMPLE02.annotated.vcf.gz","workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","slurm_job_id":"81","process":"SUBMIT_TO_DB","sample":"TESTSAMPLE02","slurm_node":"c1","attempt":"1","file":"/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out"}
2025-11-22T12:39:09-05:00 {file="/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out", log="VCF: TESTSAMPLE02.variants.vcf.gz", process="SUBMIT_TO_DB", slurm_job_id="81", slurm_node="c1"} {"user":"slurmuser","log":"VCF: TESTSAMPLE02.variants.vcf.gz","workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","slurm_job_id":"81","process":"SUBMIT_TO_DB","sample":"TESTSAMPLE02","slurm_node":"c1","attempt":"1","file":"/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out"}
2025-11-22T12:39:09-05:00 {file="/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out", log="BAM: TESTSAMPLE02.aligned.bam", process="SUBMIT_TO_DB", slurm_job_id="81", slurm_node="c1"} {"user":"slurmuser","log":"BAM: TESTSAMPLE02.aligned.bam","workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","slurm_job_id":"81","process":"SUBMIT_TO_DB","sample":"TESTSAMPLE02","slurm_node":"c1","attempt":"1","file":"/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out"}
2025-11-22T12:39:09-05:00 {file="/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out", log="Sample: TESTSAMPLE02", process="SUBMIT_TO_DB", slurm_job_id="81", slurm_node="c1"} {"user":"slurmuser","log":"Sample: TESTSAMPLE02","workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","slurm_job_id":"81","process":"SUBMIT_TO_DB","sample":"TESTSAMPLE02","slurm_node":"c1","attempt":"1","file":"/nextflow/work/2b/ceb8fbb62f7249593646b592c86f79/.command.out"}
2025-11-22T12:39:04-05:00 {file="/nextflow/work/9d/c7f4f445da221091c93b26ba71a97b/.command.out", log="VCF: TESTSAMPLE02.variants.vcf.gz", process="ANNOTATE_VCF", slurm_job_id="80", slurm_node="c1"} {"user":"slurmuser","log":"VCF: TESTSAMPLE02.variants.vcf.gz","workflow":"exome-fluent.nf","nextflow_run":"exome_TESTSAMPLE02","slurm_job_id":"80","process":"ANNOTATE_VCF","sample":"TESTSAMPLE02","slurm_node":"c1","attempt":"1","file":"/nextflow/work/9d/c7f4f445da221091c93b26ba71a97b/.command.out"}
We can improve the formatting using some additional logcli parameters and a LogQL line_format expression:
logcli query \
--quiet \
--limit 10 \
--no-labels \
--forward \
'{ source="nextflow" }
| json
| nextflow_run="exome_TESTSAMPLE02"
| line_format " {{.process}} {{.stream}}: {{.log}} "'
2025-11-22T12:38:44-05:00 ALIGN stdout: Starting alignment for sample: TESTSAMPLE02
2025-11-22T12:38:44-05:00 ALIGN stdout: R1: r1_fastq.gz
2025-11-22T12:38:44-05:00 ALIGN stdout: R2: r2_fastq.gz
2025-11-22T12:38:44-05:00 ALIGN stdout: Ref: reference.fa
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 0% complete
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 1% complete
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 2% complete
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 3% complete
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 4% complete
2025-11-22T12:38:44-05:00 ALIGN stdout: Aligning... 5% complete
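The same approach works well for troubleshooting: filter on the labels first, then narrow with the JSON fields. For example, to look at only the standard error stream of the ALIGN process from this run (the run and process names here match the example above):
logcli query \
    --no-labels \
    '{ source="nextflow", stream="stderr" }
    | json
    | nextflow_run="exome_TESTSAMPLE02"
    | process="ALIGN"
    | line_format "{{.slurm_node}} job {{.slurm_job_id}}: {{.log}}"'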
This is a much easier way of reviewing our Nextflow logs compared to hunting through the work directory! And now that our logs are in Loki, we can use Grafana to build custom query interfaces and dashboards to help us monitor our Nextflow workflows.
Adding Grafana
Grafana is a web interface for querying and visualizing logs and other data. Just like Fluent Bit and Loki, it is available as a Docker image that we can drop into our existing Slurm cluster:
grafana:
  image: grafana/grafana:latest
  container_name: grafana
  depends_on:
    - loki
  ports:
    - "3000:3000"
  expose:
    - "3000"
  networks:
    - slurm-network
  restart: unless-stopped
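As an aside, if you would rather not add the data source by hand, Grafana can provision it from a file at startup. A minimal sketch, assuming the file is mounted at Grafana’s standard provisioning path:
# e.g. mounted into the container at /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true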
Once the container is running, we can navigate to http://localhost:3000 and log in with the default credentials (username admin, password admin). After adding a new Loki data source using the URL http://loki:3100 (or provisioning one as sketched above), we can immediately start building queries using the Explore section:

Grafana is a powerful tool, and I encourage you to learn more about it.
What’s next
Now that we have configured a Slurm cluster with modern observability enhancements targeting our Nextflow workflows, my next post will explore how we might use an AI agent to perform a first-pass analysis of the logs when a workflow fails. This could save valuable time by rapidly interpreting the logs and surfacing likely causes of the failure. Stay tuned!