Execute groovy inside nextflow process using process inputs - regex

I have a Nextflow script with a channel for paired file inputs. I am trying to extract a substring from the file inputs to use as part of the shell call. I am using Groovy's regex matching to extract the substring, but since it is based on an input value, I am having trouble executing the match. An alternative would be to perform the regex in bash as part of the process shell call, but I am interested in figuring out how to manipulate inputs within a process anyway, as it would be useful for other things too. How can I run intermediate Groovy code on the process inputs prior to the shell call?
process alignment {
    input:
    val files

    output:
    stdout

    def matcher = "${files[1][0]}" =~ /.+\/bcl2fastq_out\/([^\/]+)\/.+/
    // this is the culprit, as if I hardcode the first string it works
    def project = matcher.findAll()[0][1]

    """
    echo ${project}
    """
}
workflow {
    files = Channel
        .fromFilePairs("${params.out_dir}/**{_R1,_R2}_00?.fastq.gz", checkIfExists:true, size: 2)

    alignment(files)
}
When I execute this, I get the error:
No such variable: files
An example input string would look like extractions/test/bcl2fastq_out/project1/example_L001_R1_001.fastq.gz, where I'm trying to extract the project1 substring.

As you've already discovered, you can declare variables in the script block, before the command string. For example:
process alignment {
    input:
    tuple val(sample), path(fastq_files)

    output:
    tuple val(sample), path(output_file)

    script:
    output_file = "${sample}.bam"

    """
    >&2 echo "Aligning ${sample}..."
    touch "${output_file}"
    """
}
Note that these are global (within the process scope) unless you define them using the def keyword. If you don't need to access them elsewhere in your process definition, a local variable (using def) is usually preferable. If, however, you need to access them in your output declaration, for example, as above, then they will need to be global.
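For instance, a minimal sketch (the trim_reads process and trimmed_prefix variable are hypothetical) where the value is only needed inside the script block, so def keeps it local:

process trim_reads {
    input:
    tuple val(sample), path(fastq_files)

    output:
    stdout

    script:
    // local to the script block; not visible to the output declaration
    def trimmed_prefix = "${sample}.trimmed"

    """
    echo "Would write ${trimmed_prefix}_R1.fastq.gz and ${trimmed_prefix}_R2.fastq.gz"
    """
}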
Note that the fromFilePairs factory method emits a tuple, where the first element is a group key and the second element is a list of files. The problem with just using val to declare the inputs is that the files in the second element will not be localized to the working directory when your script is run. To fix this, you can just change your input definition to something like:
input:
tuple val(sample), path(fastq_files)
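As a quick sketch of the shape involved (the paths here are hypothetical), you can inspect what fromFilePairs emits with view():

workflow {
    Channel
        .fromFilePairs( './path/to/files/**_R{1,2}_001.fastq.gz' )
        .view()
}

// prints items shaped roughly like:
// [sample1, [/path/to/files/project1/sample1_R1_001.fastq.gz, /path/to/files/project1/sample1_R2_001.fastq.gz]]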
The problem with this approach is that we're unable to extract the parent directory name from the localized filenames, since the staged files sit directly in the task working directory. So you will need to pass the project name in somehow. Usually, you can just get the parent name from the first file in the tuple, before staging, using a map operator:
params.input_dir = './path/to/files'
params.pattern = '**_R{1,2}_00?.fastq.gz'

process alignment {
    debug true

    input:
    tuple val(sample), val(project), path(fastq_files)

    """
    echo "${sample}: ${project}: ${fastq_files}"
    """
}

workflow {
    Channel
        .fromFilePairs( "${params.input_dir}/${params.pattern}" )
        .map { sample, reads ->
            def project = reads[0].parent.name
            tuple( sample, project, reads )
        }
        .set { reads }

    alignment( reads )
}
But since the glob pattern has an additional wildcard, i.e. _00?, you may not necessarily get the results you expect. For example:
$ mkdir -p path/to/files/project{1,2,3}
$ touch path/to/files/project1/sample1_R{1,2}_00{1,2,3,4}.fastq.gz
$ touch path/to/files/project2/sample2_R{1,2}_00{1,2,3,4}.fastq.gz
$ touch path/to/files/project3/sample3_R{1,2}_00{1,2,3,4}.fastq.gz
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [determined_roentgen] DSL2 - revision: f80ab33ac8
executor > local (12)
[a8/9235cc] process > alignment (12) [100%] 12 of 12 ✔
sample2: project2: sample2_R1_001.fastq.gz sample2_R1_004.fastq.gz
sample1: project1: sample1_R1_003.fastq.gz sample1_R2_001.fastq.gz
sample1: project1: sample1_R1_004.fastq.gz sample1_R2_003.fastq.gz
sample3: project3: sample3_R1_001.fastq.gz sample3_R2_001.fastq.gz
sample1: project1: sample1_R1_001.fastq.gz sample1_R2_004.fastq.gz
sample1: project1: sample1_R1_002.fastq.gz sample1_R2_002.fastq.gz
sample2: project2: sample2_R1_002.fastq.gz sample2_R2_002.fastq.gz
sample2: project2: sample2_R2_001.fastq.gz sample2_R2_004.fastq.gz
sample2: project2: sample2_R1_003.fastq.gz sample2_R2_003.fastq.gz
sample3: project3: sample3_R2_002.fastq.gz sample3_R2_004.fastq.gz
sample3: project3: sample3_R1_003.fastq.gz sample3_R1_004.fastq.gz
sample3: project3: sample3_R1_002.fastq.gz sample3_R2_003.fastq.gz
Fortunately, we can supply a custom file pair grouping strategy using a closure. This uses the readPrefix helper function:
workflow {
    Channel
        .fromFilePairs( "${params.input_dir}/${params.pattern}" ) { file ->
            prefix = Channel.readPrefix(file, params.pattern)
            suffix = file.simpleName.tokenize('_').last()

            "${file.parent.name}/${prefix}_${suffix}"
        }
        .map { key, reads ->
            def (project, sample) = key.tokenize('/')
            tuple( sample, project, reads )
        }
        .set { reads }

    alignment( reads )
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [loving_cantor] DSL2 - revision: 5a76ac712f
executor > local (12)
[f4/74edbc] process > alignment (12) [100%] 12 of 12 ✔
sample1_003: project1: sample1_R1_003.fastq.gz sample1_R2_003.fastq.gz
sample2_002: project2: sample2_R1_002.fastq.gz sample2_R2_002.fastq.gz
sample1_002: project1: sample1_R1_002.fastq.gz sample1_R2_002.fastq.gz
sample2_003: project2: sample2_R1_003.fastq.gz sample2_R2_003.fastq.gz
sample2_004: project2: sample2_R1_004.fastq.gz sample2_R2_004.fastq.gz
sample2_001: project2: sample2_R1_001.fastq.gz sample2_R2_001.fastq.gz
sample1_001: project1: sample1_R1_001.fastq.gz sample1_R2_001.fastq.gz
sample1_004: project1: sample1_R1_004.fastq.gz sample1_R2_004.fastq.gz
sample3_001: project3: sample3_R1_001.fastq.gz sample3_R2_001.fastq.gz
sample3_004: project3: sample3_R1_004.fastq.gz sample3_R2_004.fastq.gz
sample3_002: project3: sample3_R1_002.fastq.gz sample3_R2_002.fastq.gz
sample3_003: project3: sample3_R1_003.fastq.gz sample3_R2_003.fastq.gz

I figured it out: instead of jumping straight into the shell script with the triple quotes, you can start the process execution script with script: and then run Groovy code using the process inputs:
process alignment {
    input:
    val files

    output:
    stdout

    script:
    test = (files[1][0] =~ '.+/test/([^/]+)/.+').findAll()[0][1]

    """
    echo $test
    """
}

Related

I use perl in a bash script that pipes to a regular expression. How do I also set a variable in the piped stream?

I have created a bash script that runs on several machines containing different local git repositories. It tests many conditions and tells me if a repository has uncommitted files or untracked files, and one test in particular tells me whether the local is ahead of or behind the remote by some number of commits. The problem with the script is that it doesn't set the 'ok' flag that I use to echo the "OK" message when everything is in sync. So I get the message that it's ahead or behind, but then also get the "OK" message. Here is the portion of the script that does the ahead-or-behind check; I can't see how to get it to set ok=false.
git fetch >/dev/null && git branch -v |
    perl -wlne'
        print "$ENV{reponame} [$1] --> $3 $2"
            if /^..(\S+)\s+([a-f0-9]+)\s+(\[(?:ahead|behind)\s+\d+\])/
    ' |
    while IFS= read -r MOD; do
        ok=false
        printf ' %s\n' "$MOD" # Replace with code that uses $MOD
    done

if $ok; then
    echo " OK --> $reponame [$br] $rev"
fi
I copied this from another script and don't really understand the IFS= read -r MOD section, which I thought might set the flag, but the assignment doesn't take effect.
This is the output I get:
bin [develop] --> [behind 1] 026e1ad
OK --> bin [develop] 026e1ad
OK --> notes [develop] 4cd077f
OK --> indecks [develop] e6b4293
OK --> queue [develop] 5469679
OK --> frameworks [master] 05fedb6
OK --> dashboard [isolate] f8b1101
OK --> nodejs [develop] 5af2ea7
OK --> perl-forth [master] 45cc837
OK --> blog [master] c19edfd
Note that for bin I get:
bin [develop] --> [behind 1] 026e1ad
OK --> bin [develop] 026e1ad
I'd rather not get that OK after the behind 1! Another script checks for any non-OK in the left column and sends me an email.
With the perl and all the piping, how could I set the ok variable before it prints?
In most shell implementations, all processes in a pipeline are run in subshells. In this case, you're running a while loop at the end of a pipeline, so it (and it alone) is in a subshell. Whether you set ok to false or not, it has no effect on the if block, because that's run in the main shell, which doesn't inherit variables from the subshell.
zsh and AT&T ksh (but not other ksh implementations) execute the last command in the main shell and not a subshell. POSIX permits either behavior, but the bash behavior is far more common among shells.
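A minimal demonstration of the effect in bash:

ok=true
echo "one line" | while IFS= read -r line; do
    ok=false    # assigned in a subshell, so it dies with the subshell
done
echo "$ok"      # prints 'true' in bash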
The easiest way to handle this is to wrap the entire section that uses ok in a single subshell:
git fetch >/dev/null && git branch -v |
    perl -wlne'
        print "$ENV{reponame} [$1] --> $3 $2"
            if /^..(\S+)\s+([a-f0-9]+)\s+(\[(?:ahead|behind)\s+\d+\])/
    ' |
    (while IFS= read -r MOD; do
        ok=false
        printf ' %s\n' "$MOD" # Replace with code that uses $MOD
    done

    if $ok; then
        echo " OK --> $reponame [$br] $rev"
    fi)
This puts both parts using the ok variable in the same subshell, so you can modify it and it will have an effect.
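As an aside (not part of the original answer): in bash 4.2+ you could instead enable the lastpipe option, which runs the last element of a pipeline in the current shell, so the assignment sticks. A sketch:

#!/bin/bash
shopt -s lastpipe   # only takes effect when job control is off, as in scripts
ok=true
printf 'a\nb\n' | while IFS= read -r MOD; do
    ok=false
done
echo "$ok"          # now prints 'false'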

name input/output files in snakemake according to variable (not wildcard) in config.yaml

I am trying to edit and run a Snakemake pipeline. In a nutshell, the pipeline calls a default genome aligner (minimap) and produces output files named after it. I am trying to add an aligner variable to config.yaml to specify the aligner I want to call. Also (and this is where I am actually stuck), the output files should carry the name of the aligner specified in config.yaml.
My config.yaml looks like this:
# this config.yaml is passed to Snakefile in pipeline-structural-variation subfolder.
# Snakemake is run from this pipeline-structural-variation folder; it is necessary to
# pass an appropriate path to the input-files (the ../ prefix is sufficient for this demo)
aligner: "ngmlr" # THIS IS THE VARIABLE I AM ADDING TO THIS FILE. VALUES COULD BE minimap or ngmlr
# FASTQ file or folder containing FASTQ files
# check if this has to be gzipped
input_fastq: "/nexusb/Gridion/20190917PGD2staal2/PD170815/PD170815_cat_all.fastq.gz" # original is ../RawData/GM24385_nf7_chr20_af.fastq.gz
# FASTA file containing the reference genome
# note that the original reference sequence contains only the sequence of chr20
reference_fasta: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta" # original is ../ReferenceData/human_g1k_v37_chr20_50M.fasta
# Minimum SV length
min_sv_length: 300000 # original value was 40
# Maximum SV length
max_sv_length: 1000000 # original value was 1000000. Note that the value I used to run the pipeline for the sample PD170677 was 100000000000, which will be coerced to NA in the R script (/home/bhinckel/ont_tutorial_sv/ont_tutorial_sv.R)
# Min read length. Shorter reads will be discarded
min_read_length: 1000
# Min mapping quality. Reads will lower mapping quality will be discarded
min_read_mapping_quality: 20
# Minimum read support required to call a SV (auto for auto-detect)
min_read_support: 'auto'
# Sample name
sample_name: "PD170815" # original value was GM24385.nf7.chr20_af. Note that this can be a list
I am posting below the sections of my Snakefile which generate output files with the extension _minimap2.bam, which I would like to replace by either _minimap2.bam or _ngmlr.bam, depending on the aligner set in config.yaml:
# INPUT BAM folder
bam = None
if "bam" in config:
    bam = os.path.join(CONFDIR, config["bam"])

# INPUT FASTQ folder
FQ_INPUT_DIRECTORY = []
if not bam:
    if not "input_fastq" in config:
        print("\"input_fastq\" not specified in config file. Exiting...")
    FQ_INPUT_DIRECTORY = os.path.join(CONFDIR, config["input_fastq"])
    if not os.path.exists(FQ_INPUT_DIRECTORY):
        print("Could not find {}".format(FQ_INPUT_DIRECTORY))
    MAPPED_BAM = "{sample}/alignment/{sample}_minimap2.bam" # Original
    #MAPPED_BAM = "{sample}/alignment/{sample}_{alignerName}.bam" # this did not work
    #MAPPED_BAM = f"{sample}/alignment/{sample}_{config['aligner']}.bam" # this did not work either
else:
    MAPPED_BAM = find_file_in_folder(bam, "*.bam", single=True)
...
if config['aligner'] == 'minimap':
    rule index_minimap2:
        input:
            REF = FA_REF
        output:
            "{sample}/index/minimap2.idx"
        threads: config['threads']
        conda: "env.yml"
        shell:
            "minimap2 -t {threads} -ax map-ont --MD -Y {input.REF} -d {output}"

    rule map_minimap2:
        input:
            FQ = FQ_INPUT_DIRECTORY,
            IDX = rules.index_minimap2.output,
            SETUP = "init"
        output:
            BAM = "{sample}/alignment/{sample}_minimap2.bam",
            BAI = "{sample}/alignment/{sample}_minimap2.bam.bai"
        conda: "env.yml"
        threads: config["threads"]
        shell:
            "cat_fastq {input.FQ} | minimap2 -t {threads} -K 500M -ax map-ont --MD -Y {input.IDX} - | samtools sort -@ {threads} -O BAM -o {output.BAM} - && samtools index -@ {threads} {output.BAM}"
else:
    print(f"Aligner is {config['aligner']} - skipping indexing step for minimap2")

    rule map_ngmlr:
        input:
            REF = FA_REF,
            FQ = FQ_INPUT_DIRECTORY,
            SETUP = "init"
        output:
            BAM = "{sample}/alignment/{sample}_minimap2.bam",
            BAI = "{sample}/alignment/{sample}_minimap2.bam.bai"
        conda: "env.yml"
        threads: config["threads"]
        shell:
            "cat_fastq {input.FQ} | ngmlr -r {input.REF} -t {threads} -x ont - | samtools sort -@ {threads} -O BAM -o {output.BAM} - && samtools index -@ {threads} {output.BAM}"
I initially tried to create an alignerName parameter, similar to the sample parameter, as shown below:
# Parameter: sample_name
sample = "sv_sample01"
if "sample_name" in config:
    sample = config['sample_name']

###############
#
# code below created by me
#
###############

# Parameter: aligner_name
alignerName = "defaultAligner"
if "aligner" in config:
    alignerName = config['aligner']
Then I tried to use {alignerName} wherever I have minimap2 in my input/output files (see the commented-out MAPPED_BAM definitions above), but this throws an error. I guess Snakemake interprets {alignerName} as a wildcard, whereas what I want is simply to pass the variable defined in config['aligner'] into the input/output file names. I also tried an f-string (MAPPED_BAM = f"{sample}/alignment/{sample}_{config['aligner']}.bam"), but that did not work either.
You are close!
The way wildcards work in Snakemake is that they get interpreted last, while f-strings get interpreted first. To keep a literal curly brace in an f-string, you can escape it with another curly brace, like so:
print(f"{{keep curly}}")
>>> {keep curly}
So all we need to do is
MAPPED_BAM = f"{{sample}}/alignment/{{sample}}_{config['aligner']}.bam"
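To illustrate with a quick standalone sketch (the config dict here stands in for the parsed config.yaml): the doubled braces survive the f-string, so only the config lookup is substituted and sample is left for Snakemake to fill in as a wildcard:

config = {'aligner': 'ngmlr'}  # stand-in for the parsed config.yaml

MAPPED_BAM = f"{{sample}}/alignment/{{sample}}_{config['aligner']}.bam"
print(MAPPED_BAM)
# {sample}/alignment/{sample}_ngmlr.bam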

if condition to folder branch in Jenkinsfile

I have a branch folder "feature-set"; under this folder there are multiple branches.
I need to run the below script in my Jenkinsfile, with a condition: if this build runs from any branch under the "feature-set" folder (i.e. "feature-set/*"), then run the script.
the script is:
sh """
if [ ${env.BRANCH_NAME} = "feature-set*" ]
then
echo ${env.BRANCH_NAME}
branchName='${env.BRANCH_NAME}' | cut -d'\\/' -f 2
echo \$branchName
npm install
ng build --aot --output-hashing none --sourcemap=false
fi
"""
With the current code, the condition never matches:
[ feature-set/swat5 = feature-set* ]
any help?
I would rewrite this to be primarily Jenkins/Groovy syntax and only drop to shell when required.
Based on the info you provided, I assume your env.BRANCH_NAME always looks like feature-set/<branch>.
// Echo first so we can see value if condition fails
echo(env.BRANCH_NAME)

// startsWith better than contains() based on current usecase
if ( (env.BRANCH_NAME).startsWith('feature-set') ) {
    // Split branch string into list based on delimiter
    List<String> parts = (env.BRANCH_NAME).tokenize('/')
    /**
     * Grab everything minus the first part
     * This handles branches that include additional '/' characters
     * e.g. 'feature-set/feat/my-feat'
     */
    branchName = parts[1..-1].join('/')
    echo(branchName)
    sh('npm install && ng build --aot --output-hashing none --sourcemap=false')
}
This seems to be more on the shell side. Since you are planning to use a shell if condition, the below worked for me:

Administrator1@XXXXXXXX:
$ if [[ ${BRANCH_NAME} = feature-set* ]]; then echo "Success"; fi
Success

Remove the quotes around the pattern and use double brackets instead of single ones: [[ ]] performs shell pattern (glob) matching on the right-hand side of =, which plain [ ] does not.

Implementing a 'for loop' within user_data provisioner file

I am currently using a template_file to provision user_data into an aws_launch_configuration, like so:
resource "aws_launch_configuration" "launch_config" {
...
user_data = "${data.template_file.init.rendered}"
}
data "template_file" "init" {
template = "${file("router-init.sh.tpl")}"
vars {
hub_ip_addresses = "${join(",", aws_instance.gridHub.*.private_ip)}"
}
}
I am feeding in a variable (i.e. hub_ip_addresses) into the router-init.sh.tpl file, and in this file I am making use of the argument like so:
#!/bin/sh
...
IFS=',' read -r -a array <<< "$hub_ip_addresses"
for element in "${array[@]}"
do
    #do stuff with $element
done
Basically, I am splitting the string based on a delimiter, and then looping through each ip address in the array.
This bash script works fine when I run it on my local machine; however, when Terraform executes it, it throws an error:
* data.template_file.init: data.template_file.init: failed to render : parse error at 13:25: expected expression but found invalid sequence "@"
I'm supposing the '@' symbol is causing an issue. Is there a reason why this is so? Do I need to escape it with a '\'?
EDIT: Not sure if it's related to this issue, but in the preceding line of the bash script, IFS=',' read -r -a array <<< "$hub_ip_addresses", the <<< seems to make everything that follows look as if it were inside a comment (i.e. greyed out as if it were within a quotation mark).
You need to escape the $ characters in your template by doubling them up or Terraform will attempt to interpolate them as the input variables to the template.
The template docs cover this briefly although the example given is for inline templates rather than for all templates, including those that are loaded with the file() function.
So something like:
#!/bin/sh
...
IFS=',' read -r -a array <<< "$hub_ip_addresses"
for element in "$${array[@]}"
do
    #do stuff with $$element
done
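For reference, a sketch of the rendered result (with hypothetical IP addresses): the doubled dollar signs collapse back to single ones, so the instance receives the script exactly as originally written:

#!/bin/sh
# rendered user_data as the instance would receive it
IFS=',' read -r -a array <<< "10.0.1.5,10.0.1.6"
for element in "${array[@]}"
do
    #do stuff with $element
done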

How to pass input parameters to make? [duplicate]

How to pass argument to Makefile from command line?
I understand I can do
$ make action VAR="value"
value
with Makefile
VAR = "default"
action:
#echo $(VAR)
How do I get the following behavior?
$ make action value
value
How about
$make action value1 value2
value1 value2
You probably shouldn't do this; you're breaking the basic pattern of how Make works. But here it is:
action:
	@echo action $(filter-out $@,$(MAKECMDGOALS))
%:          # thanks to chakrit
	@:      # thanks to William Pursell
EDIT:
To explain the first command,
$(MAKECMDGOALS) is the list of "targets" spelled out on the command line, e.g. "action value1 value2".
$@ is an automatic variable for the name of the target of the rule, in this case "action".
filter-out is a function that removes some elements from a list. So $(filter-out bar, foo bar baz) returns foo baz (it can be more subtle, but we don't need subtlety here).
Put these together and $(filter-out $@,$(MAKECMDGOALS)) returns the list of targets specified on the command line other than "action", which might be "value1 value2".
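So with the rules above, a run would look something like this (the catch-all %: rule silently swallows value1 and value2 as goals):

$ make action value1 value2
action value1 value2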
Here is a generic working solution based on @Beta's.
I'm using GNU Make 4.1 with SHELL=/bin/bash at the top of my Makefile, so YMMV!
This allows us to accept extra arguments (by doing nothing when we get a job that doesn't match, rather than throwing an error).
%:
	@:
And this is a macro which gets the args for us:
args = `arg="$(filter-out $@,$(MAKECMDGOALS))" && echo $${arg:-${1}}`
Here is a job which might call this one:
test:
	@echo $(call args,defaultstring)
The result would be:
$ make test
defaultstring
$ make test hi
hi
Note! You might be better off using a "Taskfile", which is a bash pattern that works similarly to make, only without the nuances of Maketools. See https://github.com/adriancooney/Taskfile
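A minimal sketch of that pattern (based on the linked repo, not part of make itself): an executable bash file where each task is a function and the last line dispatches to whichever task name you pass in:

#!/bin/bash
# ./Taskfile -- run as: ./Taskfile action value1 value2

function action {
    echo "action $@"
}

function default {
    action "$@"
}

"${@:-default}"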
A much easier approach. Consider a task:
provision:
	ansible-playbook -vvvv \
	-i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory \
	--private-key=.vagrant/machines/default/virtualbox/private_key \
	--start-at-task="$(AT)" \
	-u vagrant playbook.yml
Now when I want to call it I just run something like:
AT="build assets" make provision
or just:
make provision, in which case AT is an empty string.
A few years later: for exactly this, I want to suggest just: https://github.com/casey/just
action v1 v2=default:
    @echo 'take action on {{v1}} and {{v2}}...'
You will be better off defining variables and calling your make instead of using parameters:
Makefile
action: ## My action helper
	@echo $$VAR_NAME
Terminal
> VAR_NAME="Hello World" make action
Hello World
Don't try to do this:
$ make action value1 value2
Instead, create a script:
#!/bin/sh
# rebuild if necessary
make
# do action with arguments
action "$@"
and do this:
$ ./buildthenaction.sh value1 value2
For more explanation of why to do this, plus the caveats of Makefile hackery, read my answer to another very similar (but seemingly not duplicate) question: Passing arguments to "make run".