Setting stdout from awk as input for "if" - if-statement

As part of my script I need to search through some files, and check if the value of a certain column is equal to or greater than a given number.
In this simplified example I want to see if the value in column 3 of the first line is greater than 10:
head -1 examplefile | awk '{print $3}' | if [?? > 10 ]; then print "YES"; fi
The problem is how to use the stdout from awk (which is the number I want) as input to the if command (the ?? above).
Should be simple enough, but I guess I'm just stupid... ;)
Cheers,
Martin

Why not set the output of awk to a variable ?
myVar=$(head ... | awk '{print $3}')
and test that in the if statement.
if [ "$myVar" -gt 10 ]; then
echo "YES"
fi
(I think you'll need the -gt operator).
Or why not use awk's conditional statements and avoid this altogether ?
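For illustration, here is a minimal sketch of both suggestions; the three-column layout of examplefile is an assumption for the demo:

```shell
# Assumed sample input: three whitespace-separated columns
printf 'a b 42\nc d 5\n' > examplefile

# Option 1: capture awk's output in a variable, then test it
val=$(head -1 examplefile | awk '{print $3}')
if [ "$val" -gt 10 ]; then
    echo "YES"
fi

# Option 2: let awk do the comparison itself, no shell if needed
head -1 examplefile | awk '$3 > 10 { print "YES" }'
```

Both variants print YES here, since column 3 of the first line is 42.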

awk syntax error in utilizing a shell variable with if/then statement

I'm trying to get the following code to work but I keep on getting syntax errors in the awk portion of the script.
Briefly, I want to calculate a cutoff value and store it as a floating-point number in a variable (e.g., cutoff). I want to pass this variable to the awk script, which I attempt below, but I still run into syntax problems, with errors stating:
awk: syntax error at source line 3
context is
>>> <<<
Here is a sample of the sequences; the following could be the first four lines of file Spl-129-run10_xx.fa:
>Spl-129_TTCAGTGG_80
CAGACATAGTCATCTATCAATACATaGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAACTGAGACAACATCTGTTGAGGTGGGGATTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAGGACAGCTGGACTGTCAATGACATACAGA
>Spl-129_TGGGGACC_80
CAGACATAGTCATCTATCAATACATaGATGATTTGTATGTAGGATCTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAACTGAGACAACATCTGTTGAGGTGGGGATTTACCACACCAGACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAGGACAGCTGGACTGTCAATGACATACAGA
and now the code:
for file in *fa; do
name=`echo $file | cut -d'.' -f1`;
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' $file | tail -n+2 | sed 's/_/\t/g' >tmp;
m=`cut -f3 tmp | sort -nr | head -n1`;
cutoff=`echo "(-1.24*10^-21*$m^6)+(3.53*10^-17*$m^5)-(3.90*10^-13*$m^4)+(2.12*10^-9*$m^3)-(6.06*10^-6*$m^2)+(0.018*$m)+3.15" | bc`;
echo "$name\t$cutoff";
awk -v c="$cutoff" -v n="$name" '{
if (c < 4)
awk '$3 > 2' tmp >n"_CUT.txt";
else awk '$3 > c' tmp >n"_CUT.txt";
}';
done
The expected output should be a tab-delimited file (e.g., "Spl-129-run10_CUT.txt") in the example form of
>Spl-129 TGGGGACC 80 sequence
At the end of the day, I want to utilize the calculated cutoff variable above to filter out sequences less than the cutoff (using the value in the third field for comparison), with the condition that if the cutoff is less than 4, then a cutoff of 2 will be used.
Any help that you could provide would be much appreciated. Thanks!
The awk snippet has multiple issues. In particular, there seem to be unnecessary calls to awk from within the awk script:
awk -v c="$cutoff" -v n="$name" '{
if (c < 4)
awk '$3 > 2' tmp >n"_CUT.txt";
else awk '$3 > c' tmp >n"_CUT.txt";
}';
Without additional details (and sample input), it's hard to provide a working solution. Consider posting sample input and output. In particular, it's not clear how the calculation of cutoff should be performed from the input.
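That said, one way to remove the nested awk calls is to resolve the effective cutoff in the shell first and then run awk once. This is only a sketch: the values of name, cutoff and the tmp file below are made up, and the comparison uses awk instead of bc for the cutoff check:

```shell
name="Spl-129-run10"   # stands in for $name from the loop
cutoff=3.5             # stands in for the bc-computed cutoff
printf 'x\ty\t1\nx\ty\t5\n' > tmp

# If the computed cutoff is below 4, fall back to a fixed cutoff of 2
c=$(awk -v x="$cutoff" 'BEGIN { print (x < 4) ? 2 : x }')

# Run awk exactly once; c+0 forces a numeric comparison
awk -v c="$c" '$3 > c+0' tmp > "${name}_CUT.txt"
```

With these made-up values the cutoff falls back to 2, so only the line whose third field is 5 survives into Spl-129-run10_CUT.txt.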

Bash - numbers of multiple lines matching regex (possible oneliner?)

I'm not very fluent in bash but actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got a following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get the numbers of the lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want the numbers of all lines matching the regex /\s*X./ that occur right after a match of another regex /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: Is there any smart way, any tool I don't know of, with the functionality to do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
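For reference, reproducing the sample data from the question and running the one-liner:

```shell
# Recreate the sample input from the question
cat > data.txt <<'EOF'
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
EOF

# Flag lines after the marker, reset on non-X lines, print line numbers
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' data.txt
```

This prints 5, 6, 7 and 15, one per line; pipe through `tr '\n' ' '` if you want them on a single line.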
The following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above will also print the lines which contain I want following letters:; in case you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To add line numbers to the output, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little bit:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n "$entry" "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here's some comments:
No need to use cat file|grep ... => grep ... file
Do not use the syntax for i in $(command); it's often the cause of multiple bugs, and there's always a smarter solution.
No need to use a tmp file either
And then, there's a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7

grep line with exact pattern in first column

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that in foo.txt, when I have for instance Contig1, the script will return column 2 of the file my_annot even if the pattern found is Contig12 and not just Contig1!
I tried with $ at the end of the pattern, but the problem is that $ matches the end of the line, while the expression I'm searching for is in column 1 and therefore not at the end of the line.
How can I tell to search this EXACT pattern and not those that contain this pattern?
####### ANSWER:
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' $1 $annot > out
It allows me to give the list of expressions I want to find as the first argument, doing ./myscript.sh mylist
And I redirect the result into a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
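A quick check with the Contig1/Contig12 sample data from the other answer (the second columns hugo and paul are that answer's made-up values):

```shell
# Recreate the guessed data layout
printf 'Contig1 hugo\nContig12 paul\n' > my_annot
printf 'Contig1\n' > foo.txt

# First pass loads foo.txt keys; second pass matches column 1 exactly
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
```

This prints only hugo, whereas a plain grep for Contig1 would also have matched Contig12 and printed paul.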
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read line; do grep -w "$line" my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.

awk: replace second column if not zero

I'm trying to use awk to check the second column of a three column set of data and replace its value if it's not zero. I've found this regex to find the non-zero numbers, but I can't figure out how to combine gsub with print to replace the contents and output it to a new file. I only want to run the gsub on the second column, not the first or third. Is there a simple awk one-liner to do this? Or am I looking at doing something more complex? I've even tried doing an expression to check for zero, but I'm not sure how to do an if/else statement in awk.
The command that I had semi-success with was:
awk '$2 != 0 {print $1, 1, $3}' input > output
The problem is that it didn't print out the row if the second column was zero. This is where I thought either gsub or an if/else statement would work, but I can't figure out the awk syntax. Any guidance on this would be appreciated.
Remember that in awk, a field evaluates to true unless it's 0 or empty. So:
awk '$2 { $2 = 1; print }' input > output
The $2 evaluates to true if it's not 0. The rest is obvious. This replicates your script.
If you want to print all lines, including the ones with a zero in $2, I'd go with this:
awk '$2 { $2 = 1 } 1' input > output
This does the same replacement as above, but the 1 at the end is short-hand for "true". And without a statement, the default statement of {print} is run.
Is this what you're looking for?
In action, it looks like this:
[ghoti@pc ~]$ printf 'none 0 nada\none 1 uno\ntwo 2 tvo\n'
none 0 nada
one 1 uno
two 2 tvo
[ghoti@pc ~]$ printf 'none 0 nada\none 1 uno\ntwo 2 tvo\n' | awk '$2 { $2 = 1 } 1'
none 0 nada
one 1 uno
two 1 tvo
[ghoti@pc ~]$
Is this what you want?
awk '$2 != 0 {print $1, 1, $3} $2 == 0 {print}' input > output
or with sed:
sed 's/\([^ ]*\) [0-9]*[1-9][0-9]* /\1 1 /' input > output

Extract substring from rows with regex and remove rows with duplicate substring

I have a text file with some rows in the following form
*,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
I would like to remove duplicate rows that have the same value for * (case-insensitive), i.e. anything left of ,[anything, even blanks],[dog|log|frog],[dog|log|frog],[0|1],[0|1],[0|1]
For example here's a sample text file
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
The resulting text file should have the duplicate foo removed (order does not matter to me so long as the duplicates are removed, leaving 1 unique)
test,bar,log,dog,0,0,0
one
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
What's the simplest bash command I could do to achieve this?
awk -F, '!seen[tolower($1)]++' file
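Run against the sample file from the question, this keeps the first occurrence of each key. Note that it also collapses the two plain one lines into one, which differs slightly from the sample output above:

```shell
# Recreate the sample file from the question
cat > file <<'EOF'
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0
one
FOO,,frog,frog,1,1,1
EOF

# Print a line only the first time its lowercased first field is seen
awk -F, '!seen[tolower($1)]++' file
```

Output:
test,bar,log,dog,0,0,0
one
foo,bar,log,dog,0,0,0
/^test$/,bar,log,dog,0,0,0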
You can do this with awk like this (since you don't care which of the duplicates gets kept):
awk -F, '{lines[tolower($1)]=$0}END{for (l in lines) print lines[l]}'
If you wanted to keep the first instead:
awk -F, '{if (lines[tolower($1)]!=1) { print; lines[tolower($1)]=1 } }'
Search for
(?:(?<=\n)|^)(.*)((?:,(?:d|l|fr)og){2}(?:,[01]){3})(?=\n)([\s\S]*)(?<=\n).*\2(?:\n|$)
...and replace with
$1$2$3
#!/bin/bash
for line in $(cat $1)
do
key=$( echo ${line%%,*} | awk '{print tolower($0)}')
found=0
for k in "${keys[@]}" ; do [[ "$k" == "$key" ]] && found=1 && break ; done
(( found )) && continue
echo $line
keys=( "${keys[@]}" "$key" )
done
This uses an array instead of an associative array (hash), which is less performant, but it seems to work.
This might work for you (GNU sed):
cat -n file |
sort -fk2,2 |
sed -r ':a;$!N;s/^.{7}([^,]*),[^,]*(,(d|l|fr)og){2}(,[01]){3}\n(.{7}\1,[^,]*(,(d|l|fr)og){2}(,[01]){3})$/\5/i;ta;P;D' |
sort -n |
sed -r 's/^.{7}//'
Number each line.
Sort by the first key (ignoring case)
Remove duplicates (based on specific criteria)
Sort reduced file back into original order
Remove line numbers