Bash AWK and Regex Apply on specific Column - regex

I have the following dataset
Name,quantity,unit
car,6,6
plane,7,5
ship,2,3.44
bike,8,7.66
I want to print only the names which has unit in whole numbers.
I have done the following which does not give out the result
#!/bin/bash
awk 'BEGIN {
FS=","
}
/^[0-9]*$/ {
print "Has Whole numbers: " $1
}
' file.csv
The result should be
Has Whole numbers: car
Has Whole numbers: plane

Added a couple of lines to your test data:
Name,quantity,unit
car,6,6
plane,7,5
ship,2,3.44
bike,8,7.66
Starship,1,1.0
Super Heavy,2,0
null,0,
And awk:
$ awk -F, 'int($3)==$3 ""' file
Output:
car,6,6
plane,7,5
Super Heavy,2,0
int($3) makes an integer of $3 and $3 "" turns $3 to a string.

If you are sure 3rd column is a number:
awk -F, '(NR != 1 && $3 !~ /\./){print "Has Whole numbers:", $1}' file.csv
or well actually its better the way you did it:
awk -F, '$3 ~ /^[0-9]$/{print "Has Whole numbers:", $1}' file

Try changing /^[0-9]*$/ to $3 ~ /^[0-9]*$/ && $3 != 0 once in your tried attempt it should work then.
In case you DO NOT want to hard code field number and want to find out unit field number automatically then try following.
awk -F="," -v field_val="unit" '
FNR==1{
for(j=1;j<=NF;j++){
if($j==field_val){
field_number=j
next
}
}
}
$field_number ~ /[0-9]*$/ && $field_number!=0{
print "Has whole numbers: " $1
}' Input_file

Related

print the last letter of each word to make a string using `awk` command

I have this line
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
i am trying to print the last letter of each word to make a string using awk command
awk '{ print substr($1,6) substr($2,6) substr($3,6) substr($4,6) substr($5,6) substr($6,6) }'
In case I don't know how many characters a word contains, what is the correct command to print the last character of $column, and instead of the repeding substr command, how can I use it only once to print specific characters in different columns
If you have just this one single line to handle you can use
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($i))} END{print r}' file
If you have multiple lines in the input:
awk '{r=""; for (i=1;i<=NF;i++) r = r "" substr($i,length($i)); print r}' file
Details:
{for (i=1;i<=NF;i++) r = r "" substr($i,length($i)) - iterate over all fields in the current record, i is the field ID, $i is the field value, and all last chars of each field (retrieved with substr($i,length($i))) are appended to r variable
END{print r} prints the r variable once awk script finishes processing.
In the second solution, r value is cleared upon each line processing start, and its value is printed after processing all fields in the current record.
See the online demo:
#!/bin/bash
s='UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS'
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s"
Output:
GMUCHOS
Using GNU awk and gensub:
$ gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' file
Output:
GMUCHOS
1st solution: With GNU awk you could try following awk program, written and tested eith shown samples.
awk -v RS='.([[:space:]]+|$)' 'RT{gsub(/[[:space:]]+/,"",RT);val=val RT} END{print val}' Input_file
Explanation: Set record separator as any character followed by space OR end of value/line. Then as per OP's requirement remove unnecessary newline/spaces from fetched value; keep on creating val which has matched value of RS, finally when awk program is done with reading whole Input_file print the value of variable then.
2nd solution: Using record separator as null and using match function on values to match regex (.[[:space:]]+)|(.$) to get last letter values only with each match found, keep adding matched values into a variable and at last in END block of awk program print variable's value.
awk -v RS= '
{
while(match($0,/(.[[:space:]]+)|(.$)/)){
val=val substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
}
END{
gsub(/[[:space:]]+/,"",val)
print val
}
' Input_file
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^ ]*\([^ ]\) */\1/g' file
GMUCHOS
using many tools
$ tr -s ' ' '\n' <file | rev | cut -c1 | paste -sd'\0'
GMUCHOS
separate the words to lines, reverse so that we can pick the first char easily, and finally paste them back together without a delimiter. Not the shortest solution but I think the most trivial one...
I would harness GNU AWK for this as follows, let file.txt content be
UDACBG UYAZAM DJSUBU WJKMBC NTCGCH DIDEVO RHWDAS
then
awk 'BEGIN{FPAT="[[:alpha:]]\\>";OFS=""}{$1=$1;print}' file.txt
output
GMUCHOS
Explanation: Inform AWK to treat any alphabetic character at end of word and use empty string as output field seperator. $1=$1 is used to trigger line rebuilding with usage of specified OFS. If you want to know more about start/end of word read GNU Regexp Operators.
(tested in gawk 4.2.1)
Another solution with GNU awk:
awk '{$0=gensub(/[^[:space:]]*([[:alpha:]])/, "\\1","g"); gsub(/\s/,"")} 1' file
GMUCHOS
gensub() gets here the characters and gsub() removes the spaces between them.
or using patsplit():
awk 'n=patsplit($0, a, /[[:alpha:]]\>/) { for (i in a) printf "%s", a[i]} i==n {print ""}' file
GMUCHOS
An alternate approach with GNU awk is to use FPAT to split by and keep the content:
gawk 'BEGIN{FPAT="\\S\\>"}
{ s=""
for (i=1; i<=NF; i++) s=s $i
print s
}' file
GMUCHOS
Or more tersely and idiomatic:
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' file
GMUCHOS
(Thanks Daweo for this)
You can also use gensub with:
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' file
GMUCHOS
The advantage here of both is that single letter "words" are handled properly:
s2='SINGLE X LETTER Z'
gawk 'BEGIN{FPAT="\\S\\>";OFS=""}{$1=$1}1' <<< "$s2"
EXRZ
gawk '{print gensub(/\S*(\S\>)\s*/,"\\1","g")}' <<< "$s2"
EXRZ
Where the accepted answer and most here do not:
awk '{for (i=1;i<=NF;i++) r = r "" substr($i,length($1))} END{print r}' <<< "$s2"
ER # WRONG
gawk '{print gensub(/([^ ]+)([^ ])( |$)/,"\\2","g")}' <<< "$s2"
EX RZ # WRONG

grep line with exact pattern in first column

I have this script :
while read line; do grep $line my_annot | awk '{print $2}' ; done < foo.txt
But it doesn't return what I want.
The problem is that in foo.txt, when I have for instance Contig1, the script will return the column 2 of the file my_annot even if the pattern found is Contig12 and not Contig1 only!
I tried with $ at the end of the pattern but the problem is that it corresponds to end of line while this expression I search is in column 1 and therefore not end of line.
How can I tell to search this EXACT pattern and not those that contain this pattern?
####### ANSWER :
My script is :
annot='/home/mu/myannot'
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' $1 $annot > out
It allows me to give the list of expression I want to find as first argument doing ./myscript.sh mylist
And I redirect the result in a file called out.
Thank you guys !!!!
You should use awk to do the whole thing:
awk 'NR == FNR { line[$0]; next } $1 in line { print $2 }' foo.txt my_annot
This reads each line of foo.txt, setting a key in the array line, then prints the second column of any lines whose first column exactly matches one of the keys in the array.
Of course I have made a guess that the format of your data is the same as in the other answer.
So you have a file like
Contig1 hugo
Contig12 paul
right?
Then this will help:
awk '$1~/^Contig1$/ {print $2}' foo.txt
I think this is what you want
while read line; do grep -w $line my_annot | awk '{print $2}' ; done < foo.txt
But it's not 100% clear (because of a lack of example data) whether it will work in all cases.

how to use awk to match the complete word

I have this script which takes the value of variable $userId as input & uses it in below awk command.
gzip -cd "input.csv.gz"|/usr/xpg4/bin/awk -v search="$userId" -F, 'BEGIN{ OFS=","} { if( match($4, search)) print $0 }' >>$outputFileNameUser
format of input.csv.gz is like below:
gzip -cd input.csv.gz|head -4
Circle,Date,Time,SubscriberId,OperatorId,VoucherNumber,Status
UPE,01-JUN-15,20:23:39,9936596081,,1161504025632821,Used
UPE,01-JUN-15,20:23:39,7755802655,,1161504038349788,Used
UPE,01-JUN-15,20:23:39,9793948511,,1161504027670339,Used
This awk command is matching 4th field of input file (i.e. SubscriberId) with variable $UserId. In the first row value of $4 is 9936596081.
So even if I provide "6596" (which is a part of 9936596081 ) as $UserId it will match the first row. I want to match the complete number(9936596081) not any part of the number.. I tried like this..
gzip -cd "${outputFileName}"|/usr/xpg4/bin/awk -v search="$userId" -F, 'BEGIN{ OFS=","} { if( $4 == "search")) print $0 }' >>$outputFileNameUser
but didn't work.. could you please help me on this?
Try it like this:
awk -v search="$userId" -F, '$4==search'

Using html input pattern in awk

echo 8d07\'54.520\"W | awk '{ if ($1 ~ /[-+]?[0-9]*[.]?[0-9]+/) print $1; else print "erro" }'
I'm trying to check if it's a number, but it's no working... I use this same regex in a html
input text, and it works.
In this case I was expecting "erro". It's not working.
My final goal is to apply 3 different pattern match to 3 fields $1 $2 $3...
Not 100% sure of the requirement but you probably need to put anchors.
$ echo 8d07\'54.520\"W | awk '{ if ($1 ~ /^[-+]?[0-9]+[.]?[0-9]+/) print $1; else print "erro" }'
erro

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
thanks to kurumi i managed to write something like this:
awk '{
for(i=1; i<=NF; i++) {
tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
if(tmp) {
print $i
}
}
}' $1
and this is what i needed :) thanks a lot!
This is the very basic
awk '/pattern/{ print $0 }' file
ask awk to search for pattern using //, then print out the line, which by default is called a record, denoted by $0. At least read up the documentation.
If you only want to get print out the matched word.
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another nice solution using a lookbehind regex in grep instead of awk. This solution has lower requirements to your installation:
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Off topic, this can be done using the grep also, just posting it here in case if anyone is looking for grep solution
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace line with matched group "yyy" from line):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
echo "abc123def" | awk '
function MATCH(haystack, needle, ltrim, rtrim)
{
if(ltrim == 0 && !length(ltrim))
ltrim = 0;
if(rtrim == 0 && !length(rtrim))
rtrim = 0;
return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
print $0 " - " MATCH($0, "123"); # 123
print $0 " - " MATCH($0, "[0-9]*d", 0, 1); # 123
print $0 " - " MATCH($0, "1234"); # Nothing printed
}'