awk logical conditions confusion - if-statement

Can someone please explain why does this work as expected:
echo "one\ntwo\nthree\n" | awk '{if (gsub(/one/,"")) { print } else {print $0}}'
two
three
echo "one\ntwo\nthree\n" | awk '{if (gsub(/four/,"")) { print } else {print $0}}'
one
two
three
but this doesn't?
echo "one\ntwo\nthree\n" | awk '{if (gsub(/one/,"")) { print }}'
Similarly, I am trying to chain multiple substitutions so that the altered result is printed only if every one of them makes at least one replacement, and the original content is printed otherwise:
echo "one\ntwo\nthree\n" | awk '{if (gsub(/one/,"") && gsub(/two/,"")) { print } else {print $0}}'
I am getting:
two
three
where I'd expect:
three
What am I missing here? Coming from any other programming language, I would expect this to "just work". Note that I observe the same behavior in BSD and GNU awk.
EDIT:
I gather this has something to do with how awk processes multiline input:
echo "one\ntwo\nthree\n" | awk '{if (gsub(/one/,"")) print "found"; else print "not found" }'
found
not found
not found
not found

Your first script:
printf 'one\ntwo\nthree\n' | awk '{if (gsub(/one/,"")) { print } else {print $0}}'
can be reduced to:
printf 'one\ntwo\nthree\n' | awk '{gsub(/one/,""); print}'
as it just removes one, if present, from every line and prints every line.
On the other hand your failing script:
printf 'one\ntwo\nthree\n' | awk '{if (gsub(/one/,"")) { print }}'
which can be reduced to:
printf 'one\ntwo\nthree\n' | awk 'gsub(/one/,"") { print }'
removes one, if present, from every line but then it only prints those lines for which gsub() returned a non-zero number, i.e. removed at least 1 one.
The other script you posted:
printf 'one\ntwo\nthree\n' |
awk '{if (gsub(/one/,"") && gsub(/two/,"")) { print } else {print $0}}'
can be reduced to:
printf 'one\ntwo\nthree\n' |
awk 'gsub(/one/,""){ gsub(/two/,"") } { print }'
so it tries to remove every one and, only if that succeeds, tries to remove every two. It will therefore never try to remove a two from a line that didn't also contain a one, and since no line of your input contains both, the second gsub() never removes anything. In the end it prints every line regardless of what else happened.
If you wanted to always remove both ones and twos and print every line then that'd be:
printf 'one\ntwo\nthree\n' |
awk '{gsub(/one/,""); gsub(/two/,""); print }'
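To make the per-line behaviour visible, here is a minimal sketch that prints gsub()'s return value next to each record after the substitution. The `n:[record]` output format is made up for illustration; it uses printf rather than echo because echo's handling of \n varies between shells.

```shell
# awk runs the program once per input line (record), so gsub() is
# called three times here and returns a separate count each time.
out=$(printf 'one\ntwo\nthree\n' |
  awk '{ n = gsub(/one/, ""); printf "%d:[%s]\n", n, $0 }')
printf '%s\n' "$out"
# 1:[]
# 0:[two]
# 0:[three]
```

Only the first record yields a non-zero count, which is why an `if (gsub(...))` without an else branch prints just that (now empty) line.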

OK, so two things:
I need awk to treat the whole input as a single record, which I can do by adding BEGIN {FS="\n"; RS=""}.
I was confusing print and print $0, thinking that the former printed the modified input while the latter printed the original. In fact both print the current (modified) value of $0.
So the solution to my last problem is:
echo "one\ntwo\nthree\n" | awk 'BEGIN {FS="\n"; RS=""}{orig=$0;if (gsub(/one/,"") && gsub(/two/,"")) { print } else {print orig}}'
three
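As a rough check of the all-or-nothing behaviour, the sketch below wraps the same program in a shell function (the name all_or_nothing is hypothetical) and feeds it one input where both patterns occur and one where only the first does. It assumes the input contains no blank lines, since RS="" splits records on them.

```shell
# Hypothetical helper: remove "one" and "two" only if both occur,
# otherwise print the input unchanged.
all_or_nothing() {
  awk 'BEGIN { RS = "" }   # paragraph mode: the whole input is one record
       {
         orig = $0         # gsub() edits $0 in place, so keep a copy first
         if (gsub(/one/, "") && gsub(/two/, "")) print
         else print orig   # the first gsub() may already have changed $0
       }'
}

both=$(printf 'one\ntwo\nthree\n' | all_or_nothing)
partial=$(printf 'one\nthree\n' | all_or_nothing)
printf '%s\n' "$both"      # two empty lines, then: three
printf '%s\n' "$partial"   # one, then three (original restored)
```

The saved copy matters because && short-circuits only the second call: by the time gsub(/two/,"") returns 0, the first substitution has already been applied to $0.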

Related

How should I use if-else statement in awk?

I'm writing a parser in bash where I have a text with one ":" in each line, and I need to output the part after a colon if the part before the colon matches the word "txt".
So I divided the text's lines by ":" and then tried to use if-statement in awk.
Command that I've tried:
echo "txt:hello" | awk -F: '{if [[ $1="txt" ]] then print $2 fi}'
But that resulted in a syntax error in the if-statement, so I wonder if the awk's if-else construction differs from basic bash's?
use if-statement in awk.
AWK is not Bash. AWK syntax more closely resembles C.
awk -F: '{if ($1 == "txt") print $2}'
Or just:
awk -F: '$1 == "txt"{print $2}'
See https://www.gnu.org/software/gawk/manual/gawk.html#Getting-Started and https://www.gnu.org/software/gawk/manual/gawk.html#Very-Simple .
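Since the question asks about if-else specifically, here is a small sketch with an else branch as well. The second input line and the "skipped:" message are made up for illustration.

```shell
# C-style if/else inside an awk action; note == for comparison
# (a single = would assign to $1 instead of testing it).
out=$(printf 'txt:hello\nbin:world\n' |
  awk -F: '{ if ($1 == "txt") { print $2 } else { print "skipped: " $1 } }')
printf '%s\n' "$out"
# hello
# skipped: bin
```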

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
I am trying to print the last letter of each word to make a string, using awk:
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
If I don't know how many characters a word contains, what is the correct command to print the last character of a column? And instead of repeating the substr call, how can I use it only once to print specific characters from different columns?
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i)) - iterate over all fields in the current record, i is the field ID, $i is the field value, and all last chars of each field (retrieved with substr($i,length($i))) are appended to r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try the following awk program, written and tested with the shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set the record separator to any character followed by whitespace or the end of input. For each record, RT holds the text that matched RS; strip the whitespace from it and append it to val. In the END block, once all of Input_file has been read, print val.
2nd solution: Set RS to the empty string (paragraph mode), then repeatedly call match() with the regex (.[[:space:]]+)|(.$) to find the last letter of each word. Append each match to a variable, and in the END block strip the whitespace from it and print it.
awk -v RS= '
{
  while(match($0,/(.[[:space:]]+)|(.$)/)){
    val=val substr($0,RSTART,RLENGTH)
    $0=substr($0,RSTART+RLENGTH)
  }
}
END{
  gsub(/[[:space:]]+/,"",val)
  print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words to lines, reverse so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Tell GNU AWK to treat any alphabetic character at the end of a word as a field, and use the empty string as the output field separator. $1=$1 triggers a rebuild of the record using the specified OFS. To learn more about start/end-of-word matching, read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatically:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG
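Note that the first failure shown above comes from length($1), which reuses the first word's length for every field. As a sketch using only POSIX awk features, the same loop with length($i) appears to handle single-letter words as well:

```shell
s2='SINGLE X LETTER Z'
# substr($i, length($i)) starts at the last character of each field,
# so a one-character field yields that character itself.
out=$(printf '%s\n' "$s2" |
  awk '{ r = ""; for (i = 1; i <= NF; i++) r = r substr($i, length($i)); print r }')
printf '%s\n' "$out"
# EXRZ
```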

Bash AWK and Regex Apply on specific Column

I have the following dataset
Name,quantity,unit
car,6,6
plane,7,5
ship,2,3.44
bike,8,7.66
I want to print only the names whose unit is a whole number.
I have tried the following, which does not give the expected result:
#!/bin/bash
awk 'BEGIN {
FS=","
}
/^[0-9]*$/ {
print "Has Whole numbers: " $1
}
' file.csv
The result should be
Has Whole numbers: car
Has Whole numbers: plane
Added a couple of lines to your test data:
Name,quantity,unit
car,6,6
plane,7,5
ship,2,3.44
bike,8,7.66
Starship,1,1.0
Super Heavy,2,0
null,0,
And awk:
$ awk -F, 'int($3)==$3 ""' file
Output:
car,6,6
plane,7,5
Super Heavy,2,0
int($3) makes an integer of $3 and $3 "" turns $3 to a string.
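As a sketch of how this comparison could produce the output format the question asks for, the example below runs it over the extended test data. The NR > 1 guard for the header and the explicit parentheses around $3 "" are additions for clarity, not part of the answer above.

```shell
data='Name,quantity,unit
car,6,6
plane,7,5
ship,2,3.44
bike,8,7.66
Starship,1,1.0
Super Heavy,2,0
null,0,'

# int($3) is a number and ($3 "") is a string, so awk compares them as
# strings: "6" equals "6", but "1" differs from "1.0" and "0" from "".
out=$(printf '%s\n' "$data" |
  awk -F, 'NR > 1 && int($3) == ($3 "") { print "Has Whole numbers: " $1 }')
printf '%s\n' "$out"
# Has Whole numbers: car
# Has Whole numbers: plane
# Has Whole numbers: Super Heavy
```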
If you are sure 3rd column is a number:
awk -F, '(NR != 1 && $3 !~ /\./){print "Has Whole numbers:", $1}' file.csv
Or, actually, it's better the way you did it, with the regex applied to the third field and + so that multi-digit numbers also match:
awk -F, '$3 ~ /^[0-9]+$/{print "Has Whole numbers:", $1}' file
In your attempt, try changing /^[0-9]*$/ to $3 ~ /^[0-9]+$/ && $3 != 0 and it should then work. A bare regex is tested against the whole line rather than the third field.
In case you DO NOT want to hard code field number and want to find out unit field number automatically then try following.
awk -F',' -v field_val="unit" '
FNR==1{
  for(j=1;j<=NF;j++){
    if($j==field_val){
      field_number=j
      next
    }
  }
}
$field_number ~ /^[0-9]+$/ && $field_number!=0{
  print "Has whole numbers: " $1
}' Input_file

grep line with exact pattern in first column

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that in foo.txt, when I have for instance Contig1, the script will return the column 2 of the file my_annot even if the pattern found is Contig12 and not Contig1 only!
I tried with $ at the end of the pattern but the problem is that it corresponds to end of line while this expression I search is in column 1 and therefore not end of line.
How can I tell to search this EXACT pattern and not those that contain this pattern?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' "$1" "$annot" > out
It allows me to give the list of expression I want to find as first argument doing ./myscript.sh mylist
And I redirect the result in a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
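Here is a self-contained sketch of the two-file idiom, assuming my_annot has the two-column layout guessed in the other answer. The temporary directory and the sample contents are made up for illustration.

```shell
dir=$(mktemp -d)
printf 'Contig1\n' > "$dir/foo.txt"
printf 'Contig1 hugo\nContig12 paul\n' > "$dir/my_annot"

# NR == FNR is true only while reading the first file: store each line
# as an array key. For the second file, "$1 in want" is an exact
# whole-string lookup, so Contig12 does not match Contig1.
out=$(awk 'NR == FNR { want[$0]; next } $1 in want { print $2 }' \
  "$dir/foo.txt" "$dir/my_annot")
printf '%s\n' "$out"
# hugo

rm -r "$dir"
```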
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read -r line; do grep -w "$line" my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.

awk print matching line and line before the matched

Following is what I am trying to do using awk. Get the line that matches the regex and the line immediately before the matched and print. I can get the line that matched the regex but not the line immediately before that:
awk '{if ($0!~/^CGCGGCTGCTGG/) print $0}'
In this case you could easily solve it with grep:
grep -B1 foo file
However, if you need to use awk:
awk '/foo/{if (a && a !~ /foo/) print a; print} {a=$0}' file
/abc/{if(a!="")print a;print;a="";next}
{a=$0}
use more straightforward pattern search
gawk '{if (/^abc$/) {print x; print $0};x=$0}' file1 > file2
I created the following awk script. Prints the matching line as well as the previous 2 lines. You can make it more flexible from this idea.
search.awk
{
    a[0]=$0;
    for(i=0;i<2;i++)
    {
        getline;
        if(i==0){
            a[1]=$0;
        }
        if(i==1){
            if($0 ~ /message received/){
                print a[0];
                print a[1];
                print $0;
            }
        }
    }
}
Usage:
awk -f search.awk LogFile.log
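The script above reads records in fixed groups of three, so it appears to report a match only when the matching line happens to be the third line of a group. A more general sketch keeps the last n lines in a small ring buffer indexed by NR % n. The variable n and the sample log lines are made up for illustration.

```shell
log='boot
connect
handshake
message received: hello
idle'

# buf holds the previous n lines; on a match, print them oldest first,
# then the matching line itself. Lines before the start of input
# (i <= 0) are skipped.
out=$(printf '%s\n' "$log" | awk -v n=2 '
  /message received/ {
    for (i = NR - n; i < NR; i++) if (i > 0) print buf[i % n]
    print
  }
  { buf[NR % n] = $0 }')
printf '%s\n' "$out"
# connect
# handshake
# message received: hello
```

If matches fall within n lines of each other, the context lines are printed more than once; grep -B handles that case by merging overlapping context.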
Why not use grep -EB1 '^CGCGGCTGCTGG'
The awk to do the same thing is very long-winded, see Marco's answer.
Maybe slightly off-topic, but I used the answer from belisarius to create my own variation of the above solution, that searches for the Nth entry, and returns that and the previous line.
awk -v count=1 '/abc/{{i++};if(i==count){print a;print;exit}};{a=$0}' file