Bash script: how to stop looping after a match was found?

This must be very basic, but I can't find a way to solve it.
I have a script like this:
#!/bin/bash
seqFolder="/raw_data/data"
seqmode="paired"
Input=$(basename ${seqFolder});
if [ $seqmode = paired ]; then
    for x in $seqFolder/*; do
        if [[ "$x" =~ .*\.fastq.gz$ ]]; then
            z=$(basename $x 1_001.fastq.gz)
            echo $z
            echo "file of this iteration $z"1_001.fastq.gz" $z"2_001.fastq.gz""
        fi
    done
fi
When I run this script I get this:
MG-AB-17_S17_R
file of this iteration MG-AB-17_S17_R1_001.fastq.gz MG-AB-17_S17_R2_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz
file of this iteration MG-AB-17_S17_R2_001.fastq.gz1_001.fastq.gz MG-AB-17_S17_R2_001.fastq.gz2_001.fastq.gz
MG-AB-81_S74_R
file of this iteration MG-AB-81_S74_R1_001.fastq.gz MG-AB-81_S74_R2_001.fastq.gz
MG-AB-81_S74_R2_001.fastq.gz
file of this iteration MG-AB-81_S74_R2_001.fastq.gz1_001.fastq.gz MG-AB-81_S74_R2_001.fastq.gz2_001.fastq.gz
Files in: /raw_data/data are these 4 (this is just example):
MG-AB-17_S17_R1_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz
MG-AB-81_S74_R1_001.fastq.gz
MG-AB-81_S74_R2_001.fastq.gz
The issue is that I don't want my variable $z to be MG-AB-17_S17_R2_001.fastq.gz or MG-AB-81_S74_R2_001.fastq.gz because files like these:
MG-AB-17_S17_R2_001.fastq.gz1_001.fastq.gz
MG-AB-17_S17_R2_001.fastq.gz2_001.fastq.gz
...
don't exist in the directory /raw_data/data.
I was thinking that .fastq.gz$ in "$x" =~ .*\.fastq.gz$ would ensure that, but it seems that's not the case. Can you please advise?

You're looking for the break statement.
$ cat /tmp/so3651.sh
#!/usr/bin/env bash
for i in 1 2 3 4 5; do
    echo "i: $i"
    if [[ -f /tmp/foo$i ]]; then
        echo "found it!"
        break
    fi
done
If none of the /tmp/foo* files exist, it produces:
$ /tmp/so3651.sh
i: 1
i: 2
i: 3
i: 4
i: 5
But when a file does exist, it stops early:
$ touch /tmp/foo4
$ /tmp/so3651.sh
i: 1
i: 2
i: 3
i: 4
found it!
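Applied to a file-matching loop like the one in the question, a minimal sketch (the filenames here are made up for illustration): break leaves the loop at the first name the regex matches, so later names are never tested.

```shell
# Hedged sketch with hypothetical filenames: stop at the first regex match.
first=""
for x in sample_R1_001.fastq.gz sample_R2_001.fastq.gz notes.txt; do
  if [[ $x =~ \.fastq\.gz$ ]]; then
    first=$x
    break    # stop looping; remaining names are never examined
  fi
done
echo "first match: $first"
```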


Regular Expression to search for a number between two numbers

I am not very familiar with Regular Expressions.
I have a requirement to extract all lines containing an 8-digit number between two given numbers (for example 20200628 and 20200630) using a regular expression. The boundary numbers are not fixed, but need to be parameterized.
In case you are wondering, this number is a timestamp, and I am trying to extract information between two dates.
HHHHH,E.164,20200626113247
HHHHH,E.164,20200627070835
HHHHH,E.164,20200628125855
HHHHH,E.164,20200629053139
HHHHH,E.164,20200630125855
HHHHH,E.164,20200630125856
HHHHH,E.164,20200626122856
HHHHH,E.164,20200627041046
HHHHH,E.164,20200628125856
HHHHH,E.164,20200630115849
HHHHH,E.164,20200629204531
HHHHH,E.164,20200630125857
HHHHH,E.164,20200630125857
HHHHH,E.164,20200626083628
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629122857
HHHHH,E.164,20200630122237
HHHHH,E.164,20200630122351
HHHHH,E.164,20200630122858
HHHHH,E.164,20200630122857
HHHHH,E.164,20200630084722
Assuming the above data is stored in a file named data.txt, the idea is to sort it on the 3rd comma-delimited column (i.e. sort -t, -nk3), and then pass the sorted output through a perl filter, as demonstrated by this find_dates.sh script:
#!/bin/bash
[ $# -ne 3 ] && echo "Expects 3 args: YYYYmmdd start, YYYYmmdd end, and data filename" && exit 1
DATE1=$1
DATE2=$2
FILE=$3
echo "$DATE1" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: First date is invalid - $DATE1" && exit 1
echo "$DATE2" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: Second date is invalid - $DATE2" && exit 1
[ ! -r "$FILE" ] && echo "ERROR: File not found - $FILE" && exit 1
sort -t, -nk3 "$FILE" | perl -ne '
    BEGIN { $date1 = shift; $date2 = shift }
    print if /164,$date1/ .. /164,$date2/;
    print if /164,$date2/;
' $DATE1 $DATE2 | sort -u
Running the command find_dates.sh 20200627 20200629 data.txt will produce the result:
HHHHH,E.164,20200627041046
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627070835
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628125855
HHHHH,E.164,20200628125856
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629053139
HHHHH,E.164,20200629122857
HHHHH,E.164,20200629204531
For the example you gave, between 20200628 and 20200630, you may try:
\b202006(?:2[89]|30)
I might be tempted to make the general comment that regex is not very suitable for finding numerical ranges (whereas application programming languages are). However, in the case of parsing a text log file, regex is what would be easily available.
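As a quick sanity check of the alternation idea, a hedged sketch with a few sample records piped through grep -E (the \b word boundary is a GNU grep extension to ERE):

```shell
# Keep only records dated 20200628-20200630 using the alternation regex.
matches=$(printf '%s\n' \
    'HHHHH,E.164,20200627070835' \
    'HHHHH,E.164,20200628125855' \
    'HHHHH,E.164,20200630084722' \
  | grep -E '\b202006(2[89]|30)')
echo "$matches"
```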

Shell: Checking if argument exists and matches expression

I'm new to shell scripting and am trying to check whether an argument exists and whether it matches an expression. I'm not sure how to write expressions, so this is what I have so far:
#!/bin/bash
if [[ -n "$1"] && [${1#*.} -eq "tar.gz"]]; then
    echo "Passed";
else
    echo "Missing valid argument"
fi
To run the script, I would type this command:
# script.sh YYYY-MM.tar.gz
I believe what I have is:
if YYYY-MM.tar.gz is not given after script.sh, it will echo "Missing valid argument", and
if the file does not end in .tar.gz, it echoes the same error.
However, I want to also check if the full file name is in YYYY-MM.tar.gz format.
if [[ -n "$1" ]] && [[ "${1#*.}" == "tar.gz" ]]; then
-eq: (equal) for arithmetic tests
==: to compare strings
See: help test
You can also use:
case "$1" in
*.tar.gz) ;; #passed
*) echo "wrong/missing argument $1"; exit 1;;
esac
echo "ok arg: $1"
As long as the file is in the correct YYYY-MM.tar.gz format, it obviously is non-empty and ends in .tar.gz as well. Check with a regular expression:
if ! [[ $1 =~ ^[0-9]{4}-[0-9]{1,2}\.tar\.gz$ ]]; then
    echo "Argument 1 not in correct YYYY-MM.tar.gz format"
    exit 1
fi
Obviously, the regular expression above is still too general, allowing names like 0193-67.tar.gz. You can adjust it to be as specific as you need it to be for your application, though. I might recommend
^[1-9][0-9]{3}-([1-9]|10|11|12)\.tar\.gz$
to allow only 4-digit years starting with 1000 (support for the first millennium CE seems unnecessary) and only months 1-12 (no leading zero).
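Wrapped as a function, the stricter check can be exercised directly; a sketch (the filenames are invented for illustration):

```shell
# Hedged sketch: validate names against the stricter anchored pattern.
valid_name() {
  [[ $1 =~ ^[1-9][0-9]{3}-([1-9]|10|11|12)\.tar\.gz$ ]]
}
valid_name 2023-7.tar.gz   && echo "2023-7.tar.gz: ok"
valid_name 0193-67.tar.gz  || echo "0193-67.tar.gz: rejected"
```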

How to reach a specific section of a text file and then search

I have a text file like
Apples
Big 7
Small 6

Apples
Good 5
Bad 3

Oranges
Big 4
Small 2
Good 1
Bad 5
How do I get to a specific section of this file and then do a grep? For example, if I need to find how many Good Oranges there are, how do I do it from the command line with this file as input, using say awk?
You could use the range operator like this:
awk '/Apples/,/^$/ { if (/Good/) print $2}' file
would print how many good apples there are:
5
The range operator , will evaluate to true when the first condition is satisfied and remain true until the second condition. The second pattern /^$/ matches a blank line. This means that only the relevant sections will be tested for the property Good, Bad, etc.
I'm assuming that your original input file wasn't double-spaced? If it was, the method above can be patched to skip every other line:
awk 'NR%2==0{next} /Oranges/,/^$/ { if (/Good/) print $2}' file
When the record number NR is even, NR%2 is 0, so the pattern NR%2==0 is true and every other line is skipped. (A bare !NR%2 would not work here: ! binds more tightly than % in awk, so it parses as (!NR)%2, which is always 0.)
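To see the range operator in action, a self-contained sketch with the sections written to a temp file (blank-line-separated, as the answer assumes):

```shell
# Hedged demo: print the "Good" count from the Oranges section only.
tmp=$(mktemp)
printf 'Apples\nGood 5\nBad 3\n\nOranges\nGood 1\nBad 5\n' > "$tmp"
result=$(awk '/Oranges/,/^$/ { if (/Good/) print $2 }' "$tmp")
echo "$result"
rm -f "$tmp"
```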
You could use Bash to read from the file line by line in a loop.
while read -a fruit; do
    [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
    case $name in
        Oranges) [ "${fruit[0]}" = "Good" ] && echo ${fruit[1]};;
    esac
done < file
You could also make this a function and pass it arguments to get trait information about any fruit.
read_fruit (){
    while read -a fruit; do
        [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
        case $name in
            $1) [ "${fruit[0]}" = "$2" ] && echo ${fruit[1]};;
        esac
    done < file
}
Use:
read_fruit Apples Small
result:
6
When you have name/value pairs, it's usually best to first build an array indexed by the name and containing the value, then you can just print whatever you're interested in using the appropriate name(s) to index the array:
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Oranges","Good"]}' file
1
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Apples","Bad"]}' file
3
or if you're looking for the starting point to implement a more complete/complex set of requirements here's one way:
$ awk '
NF {
    if (NF==1) {
        key=$1
        keys[key]
    }
    else {
        val[key,$1]=$2
        names[$1]
    }
}
END {
    for (key in keys)
        for (name in names)
            print key, name, val[key,name]
}
' file
Apples Big 7
Apples Bad 3
Apples Good 5
Apples Small 6
Oranges Big 4
Oranges Bad 5
Oranges Good 1
Oranges Small 2
To test @JohnB's theory that a shell script would be faster than an awk script if there were thousands of files, I copied the OP's input file 5,000 times into a tmp directory, then ran these two equivalent scripts on them (the bash one based on John's answer in this thread, and an awk one that does the same as the bash one):
$ cat tst.sh
for file in "$@"; do
    while read -r field1 field2 ; do
        [ -z "$field2" ] && name="$field1"
        case $name in
            Oranges) [ "$field1" = "Good" ] && echo "$field2";;
        esac
    done < "$file"
done
$ cat tst.awk
NF==1 { fruit=$1 }
fruit=="Oranges" && $1=="Good" { print $2 }
and here are the results of running both on those 5,000 files:
$ time ./tst.sh tmp/* > bash.out
real 0m6.490s
user 0m2.792s
sys 0m3.650s
$ time awk -f tst.awk tmp/* > awk.out
real 0m2.262s
user 0m0.311s
sys 0m1.934s
The 2 output files were identical.

Make reference to a file in a regular expression

I have two files. One is a SALESORDERLIST, which goes like this
ProductID;ProductDesc
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
4,bottles of beer 40 gal
(ProductID;ProductDesc) header is actually not in the file, so disregard it.
In another file, POSSIBLEUNITS, I have -you guessed- the possible units, and their equivalencies:
u;u.;un;un.;unit
k;k.;kg;kg.,kilograms
This is my first day with regular expressions, and I would like to know how I can get the entries in SALESORDERLIST whose units appear in the POSSIBLEUNITS file. In my example, I would like to exclude entry 4, since 'gal' is not listed in POSSIBLEUNITS.
I say regex since I have a further criterion that needs to be matched:
egrep "^[0-9]+;{1}[^; ][a-zA-Z ]+" SALESORDERLIST
From those resultant entries, I want to get those ending in valid units.
Thanks!
One way of achieving what you want is:
egrep "\b(u|u\.|un|un\.|unit|k|k\.|kg|kg\.|kilograms)\b" SALESORDERLIST
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
The metacharacter \b is an anchor that lets you perform a "whole words only" search using a regular expression of the form \bword\b.
http://www.regular-expressions.info/wordboundaries.html
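A small illustration of the word-boundary behavior, with two sample lines inlined: only standalone unit tokens match, so "gal" (and letters buried inside longer words) is not picked up.

```shell
# Hedged \b demo: "k" matches as a whole word, "gal" never matches.
kept=$(printf '4,bottles of beer 40 gal\n2,tomatoes 2 k\n' \
  | grep -E '\b(u|un|unit|k|kg|kilograms)\b')
echo "$kept"
```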
One way would be to create a bash script, say called findunit.sh:
while read line
do
    match=$(grep -E "^[0-9]+,{1}[^, ][a-zA-Z ]+" <<< $line)
    name=${match##* }
    # echo "$name..."
    found=$(grep "$name" /pathtofile/units.txt)
    # echo "xxx$found"
    [ -n "$found" ] && echo $line
done < $1
Then run with:
findunit.sh SALESORDERLIST
My output from this is:
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
An example of doing it completely in bash:
declare -A units
while read line; do
    while [ -n "$line" ]; do
        i=`expr index $line ";"`
        if [[ $i == 0 ]]; then
            units[$line]=1
            break
        fi
        units[${line:0:$((i-1))}]=1
        line=${line#*;}
    done
done < POSSIBLEUNITS

while read line; do
    unit=${line##* }
    if [[ ${units[$unit]} == 1 ]]; then
        echo $line
    fi
done < SALESORDERLIST
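A hedged alternative to the inner expr-index loop: let read itself split each POSSIBLEUNITS line on ";" via IFS (the unit data is inlined here for the sketch):

```shell
# Build the same associative "units" table by IFS-splitting each line.
declare -A units
while IFS=';' read -ra parts; do
  for p in "${parts[@]}"; do
    units[$p]=1
  done
done <<'EOF'
u;u.;un;un.;unit
k;k.;kg;kg.;kilograms
EOF
echo "kg known: ${units[kg]:-0}"
```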

remove single elements in a text file in bash

Basically what I have is a text file (file.txt), which contains lines of numbers (lines aren't necessarily the same length) e.g.
1 2 3 4
5 6 7 8
9 10 11 12 13
What I need to do is write new files with each of these numbers deleted, one at a time, with replacement e.g. the first new file will contain
2 3 4 <--- 1st element removed
5 6 7 8
9 10 11 12 13
and the 7th file will contain
1 2 3 4
5 6 8 <--- 7th element removed here
9 10 11 12 13
To generate these, I'm looping through each line, and then each element in each line. E.g. for the 7th file, where I remove the third element of the second line, I'm trying to do this by reading in the line, removing the appropriate element, then reinserting this new line
$lineNo is 2 (second line)
$line is 5 6 7 8
with cut, I remove the third number, making $newline 5 6 8
Then I try to replace the line $lineNo in file.txt with $newline using sed:
sed -n '$lineNo s/.*/'$newline'/' > file.txt
This is totally not working. I get an error
sed: can't read 25.780000: No such file or directory
(where 25.780000 is a number in my text file. It looks like it's trying to use $newline to read files or something)
I have reason to suspect my way of stating which line to replace isn't working either :(
My question is: a) is there a better way to do this than sed, and b) if sed is the way to go, what am I doing wrong?
Thanks!!
filename=file.txt
i=1
while [[ -s $filename ]]; do
    new=file_$i.txt
    awk 'NR==1 {if (NF==1) next; else sub(/^[^ ]+ /, "")} 1' $filename > $new
    ((i++))
    filename=$new
done
This leaves a space at the beginning of the first line of each new file, and when a line becomes empty it is removed. The loop ends when the last generated file is empty.
Update due to requirement clarification:
words=$(wc -w < file.txt)
for ((i=1; i<=words; i++)); do
    awk -v n=$i '
        words < n && n <= words+NF { $(n-words) = "" }
        { words += NF; print }
    ' file.txt > file_$i.txt
done
Unless I misunderstood the question, the following should work, although it will be pretty slow if your files are large:
#! /bin/bash
remove_by_value()
{
    local TO_REMOVE=$1
    while read line; do
        out=
        for word in $line; do
            [ "$word" = "$TO_REMOVE" ] || out="$out $word"
        done
        echo "${out/ }"
    done < $2
}

remove_by_position()
{
    local NTH=$1
    while read line; do
        out=
        for word in $line; do
            ((--NTH == 0)) || out="$out $word"
        done
        echo "${out/ }"
    done < $2
}

FILE=$1
shift
for number; do
    echo "Removing $number"
    remove_by_position $number "$FILE"
done
This will dump all the output to stdout, but it should be trivial to change it so the output for each removed number is redirected (e.g. with remove_by_position $number $FILE > $FILE.$$ && mv $FILE.$$ $FILE.$number and proper quoting). Run it as, say,
$ bash script.sh file.txt $(seq 11)
I have to admit that I'm a bit surprised how short the other solutions are.
#!/bin/bash
#
file=$1
lines=$(wc -l < $file)
out=0

dropFromLine () {
    file=$1
    row=$2
    to=$((row-1))
    from=$((row+1))
    linecontent=($(sed -n "${row}p" $file))
    # echo " linecontent: " ${linecontent[@]}
    linelen=${#linecontent[@]}
    # echo " linelength: " $linelen
    for n in $(seq 0 $linelen)
    do
        (
            if [[ $row > 1 ]] ; then sed -n "1,${to}p" $file ; fi
            for i in $(seq 0 $linelen)
            do
                if [[ $n != $i ]]
                then
                    echo -n ${linecontent[$i]}" "
                fi
            done
            echo
            # echo "mod - drop " ${linecontent[$n]}
            sed -n "$from,${lines}p" $file
        ) > outfile-${out}.txt
        out=$((out+1))
    done
}

for row in $(seq 1 $lines)
do
    dropFromLine $file $row
done
invocation:
./dropFromRow.sh num.dat
num.dat:
1 2 3 4
5 6 7 8
9 10 11
result:
outfile-0 outfile-10 outfile-12 outfile-2 outfile-4 outfile-6 outfile-8
outfile-1 outfile-11 outfile-13 outfile-3 outfile-5 outfile-7 outfile-9
samples:
asux:~/proj/mini/forum > cat outfile-0
2 3 4
5 6 7 8
9 10 11
asux:~/proj/mini/forum > cat outfile-1
1 3 4
5 6 7 8
9 10 11
One way using perl:
Content of file.txt:
1 2 3 4
5 6 7 8
9 10 11 12 13
Content of script.pl:
use warnings;
use strict;

## Read all input to a scalar variable as a single string.
my $str;
{
    local $/ = undef;
    $str = <>;
}

## Loop for each number found.
while ( $str =~ m/(\d+)(?:\h*)?/g ) {
    ## Open file for writing. The name of the file will be
    ## the number matched in the previous regexp.
    open my $fh, q[>], ($1 . q[.txt]) or
        die qq[Couldn't create file $1.txt\n];
    ## Print everything prior to the matched string plus everything
    ## after the matched string.
    printf $fh qq[%s%s], $`, $';
    ## Close file.
    close $fh;
}
Run it like:
perl script.pl file.txt
Show files created:
ls [0-9]*.txt
With output:
10.txt 11.txt 12.txt 13.txt 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt
Show content of one of them:
cat 9.txt
Output:
1 2 3 4
5 6 7 8
10 11 12 13