awk, a field doesn't match but it should match - regex

I have a file structured as a list of records, where the field separator is \t.
I want to extract only records where the second field is a number from 1 to 9, but my awk script doesn't work.
The awk script is
cat file |awk -v FS="\t" '$2 ~ /[0-9]{1}/ {print $0;}'
or this
cat file |awk -v FS="\t" '$2 ~ /.{1}/ {print $0;}' #because the second fields of my file have all second fields as number
Why these sscript don't work? Isn't regex a good regex?

Update
Even with the interval {1}, you are still going to match a field like 23 because the 2 matches a single number. What you really want to use are anchors and forget about intervals:
awk '$2 ~ /^[0-9]$/{print}' FS="\t" file
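For example, with a made-up tab-separated line whose second field is 23, the unanchored pattern still matches while the anchored one does not:
printf 'a\t23\n' | awk -F'\t' '$2 ~ /[0-9]/'    # prints the line: 23 contains a digit
printf 'a\t23\n' | awk -F'\t' '$2 ~ /^[0-9]$/'  # prints nothing: 23 is not a single digit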
The problem is the use of the interval {1}. awk versions earlier than 4 don't support intervals; gawk, on the other hand, will if you add the --re-interval flag.
Try this:
awk --re-interval '$2 ~ /[0-9]{1}/{print}' FS="\t" file
Some other things to note (all three are applied in the example below):
Built-in variables such as FS can be assigned after the script, without the need for -v
You can use just print rather than print $0, since printing $0 is its default behavior
Useless use of cat: awk can take a file name as an argument, so use that instead
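Putting those three points together (and dropping the redundant {1}, so no interval support is needed), a minimal version of the command could look like:
awk '$2 ~ /[0-9]/ {print}' FS="\t" file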

If you want to ensure the 2nd field is a single-digit number, you don't really need a regex:
awk '1 <= $2 && $2 <= 9 {print}'
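A quick check with a small made-up tab-separated sample (the default FS already splits on tabs, so $2 is still the second field):
printf 'a\t5\tx\nb\t23\ty\n' | awk '1 <= $2 && $2 <= 9 {print}'   # prints only the line whose second field is 5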

How can I use bash variable in awk with regexp?

I have a file like this (this is sample):
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
...
I have iplist.txt like this:
71.13.55.
12.33.23.
8.8.
4.2.
...
I need to print lines where the 3rd column starts with one of the prefixes in iplist.txt.
Like this:
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
I tried:
for ip in $(cat iplist.txt); do
awk -v var="$ip" -F '|' '{if ($3 ~ /^$var/) print $0;}' text.txt
done
But the bash variable is not expanded inside the /^.../ regex block. How can I do that?
First, you can build the regular expression by concatenating strings; it doesn't have to be a regex literal. You can say:
'{if ($3 ~ "^" var) print $0;}'
Second, note above that you don't use a $ with variables inside awk. $ is only used to refer to fields by number (as in $3, or $somevar where somevar has a field number as its value).
Third, you can do everything in awk in which case you can avoid the shell loop and don't need the var:
awk -F'|' 'NR==FNR {a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
EDIT
As rightly pointed out in the comments, the dots in the patterns will match any character, not just a literal dot. Thus we need to escape them before doing the match:
awk -F'|' 'NR==FNR {gsub(/\./,"\\."); a["^" $0]; next} { for (i in a) if ($3 ~ i) print }' iplist.txt r.txt
I'm assuming that you only want to output a given line once, even if it matches multiple patterns from iplist.txt. If you want to output a line multiple times for multiple matches (as your version would have done), remove the next from {print;next}.
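An alternative that sidesteps the escaping issue entirely is a literal prefix test with index() instead of a regex match (a sketch under the same two-file setup):
awk -F'|' 'NR==FNR {p[$0]; next} {for (i in p) if (index($3, i) == 1) {print; next}}' iplist.txt r.txt
index() returns the position of the first occurrence, so a value of 1 means the prefix sits at the start of $3, and the dots are treated literally.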
Use var directly instead of /^$var/ (adding ^ to the variable first):
awk -v var="^$ip" -F '|' '$3 ~ var' text.txt
By the way, the default action for a true condition is to print the current record, so {if (test) {print $0}} can often be contracted to just test.
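For example, these two are equivalent:
awk -v var="^$ip" -F '|' '{if ($3 ~ var) print $0}' text.txt
awk -v var="^$ip" -F '|' '$3 ~ var' text.txt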
Here is a way with bash, sed and grep; it's straightforward and I think it may be a bit cleaner than awk in this case:
IFS=$(echo -en "\n\b") && for ip in $(sed 's/\./\\&/g' iplist.txt); do
grep "^[^|]*|[^|]*|${ip}" r.txt
done

Is there a way to obtain the current pattern searched in an AWK script?

The basic idea is this. Suppose that you want to search a file for multiple patterns coming from a pipe with awk:
... | awk -f - '{...}' someFile.txt
* '...' is just short for some code
* '-f -' indicates the pattern is taken from pipe
Is there a way to know which pattern is being searched at each instant within the awk script?
(Just as you know $1 is the first field, is there something like $PATTERN that contains the current pattern
being searched, or a way to get something like it?)
More Elaboration:
if I have 2 files:
someFile.txt containing:
1
2
4
patterns.txt containing:
1
2
3
4
running this command:
cat patterns.txt |awk -f - '{...}' someFile.txt
What should I put between the braces so that only the patterns in patterns.txt that
are not matched in someFile.txt are printed? (In this case, the number 3 in patterns.txt is not matched.)
Under the requirements that patterns.txt be supplied as stdin and that the processing be done with awk:
$ cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
3
This was tested using GNU awk.
Explanation
We want to remove from patterns.txt anything that matches a line in someFile.txt. To do this, we first read in someFile.txt and create patterns from it. Next, we print only the lines from patterns.txt that do not match any of the patterns from someFile.txt.
FNR==NR{p=p "\n" $0;next;}
NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: someFile.txt. We save all such lines in the newline-separated variable p. We then tell awk to skip the remaining commands and jump to the next line.
p !~ $0
If we got here, then we are now reading the second named file on the command line, which is - for stdin. This boolean condition evaluates to either true or false. If it is true, the line is printed. If not, it is skipped. In other words, the above is awk's cryptic shorthand for:
p !~ $0 {print $0}
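The same command, spread over several lines with comments (purely a readability sketch):
cat patterns.txt | awk '
    FNR == NR { p = p "\n" $0; next }   # first file (someFile.txt): collect its lines into p
    p !~ $0                             # stdin (patterns.txt): print patterns not found in p
' someFile.txt -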
cmd | awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($0 ~ p) delete pats[p]} END{ for (p in pats) print p }' - someFile.txt
Another way in awk
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -

Print matched pattern with AWK

For example, I have this data:
/home/test/dat1.txt
/home/test/dat2.txt
/home/test/test1/dat3.txt
/home/test/test2/dat4.txt
/home/test/test3/test4/dat5.txt
I need to print only the name and extension; the output should be:
dat1.txt
dat2.txt
dat3.txt
dat4.txt
dat5.txt
I need to use the awk command... can anyone help?
I use this regular expression: /\/*\.txt/{print ???}
If you are going to use awk, you do not need a regex for this purpose.
You can just tell awk to print the last field, using a field separator of /.
awk -F'/' '{print $NF}' Input.txt
As hd1's comment already noted, NF is the number of fields on the current input record (in this case line). Since awk starts indexing fields at $1, $NF gives you the last field.
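For instance, applied to the last sample line:
echo '/home/test/test3/test4/dat5.txt' | awk -F'/' '{print $NF}'   # prints dat5.txt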
You could use this short awk one-liner:
awk -F/ '$0=$NF' Input.txt
If the last field can be empty (or 0) and you still want those lines printed, use
awk -F/ '{$0=$NF}1' Input.txt
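The difference only shows up when the last field is empty, e.g. for a hypothetical path ending in a slash:
printf '/home/test/\n' | awk -F/ '$0=$NF'      # prints nothing: the assignment's value is empty, so the condition is false
printf '/home/test/\n' | awk -F/ '{$0=$NF}1'   # prints an empty line instead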

Awk to skip the blank lines

The output of my script is tab-delimited, produced with awk as:
awk -v variable=$bashvariable '{print variable"\t single\t" $0"\t double"}' myinfile.c
The awk command is run in a while loop which updates the variable value and the file myinfile.c for every cycle.
I am getting the expected results with this command.
But if myinfile.c contains a blank line (which it can), the command prints a line with no relevant information. Can I tell awk to ignore blank lines?
I know it can be done by removing the blank lines from myinfile.c before passing it to awk.
I know about the sed and tr ways, but I want awk to handle it within the command above itself, not as a separate or piped solution like these:
sed '/^$/d' myinfile.c
tr -s "\n" < myinfile.c
Thanks in advance for your suggestions and replies.
There are two approaches you can try to filter out blank lines:
awk 'NF' data.txt
and
awk 'length' data.txt
Just put these at the start of your command, i.e.,
awk -v variable=$bashvariable 'NF { print variable ... }' myinfile
or
awk -v variable=$bashvariable 'length { print variable ... }' myinfile
Both of these act as gatekeepers/if-statements.
The first approach works by only printing out lines where the number of fields (NF) is not zero (i.e., greater than zero).
The second method looks at the line length and only acts if the length is not zero (i.e., greater than zero).
You can pick the approach that is most suitable for your data/needs.
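The two differ on lines that contain only whitespace, for example:
printf ' \n' | awk 'NF'        # prints nothing: a space-only line has no fields
printf ' \n' | awk 'length'    # prints the space-only line: its length is 1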
You could just add
/^\s*$/ {next;}
to the front of your script. It will match blank lines (including whitespace-only ones) and skip the rest of the awk rules for them. Putting it all together:
awk -v variable=$bashvariable '/^\s*$/ {next;} {print variable"\t single\t" $0"\t double"}' myinfile.c
Maybe you could try this out:
awk -v variable=$bashvariable '$0{print variable"\t single\t" $0"\t double"}' myinfile.c
Try this:
awk -v variable=$bashvariable '/^.+$/{print variable"\t single\t" $0"\t double"}' myinfile.c
I haven't seen this solution yet: awk '!/^\s*$/{print $1}' will run the block only for non-blank lines.
The \s metacharacter is not available in all awk implementations, but you can also write !/^[ \t]*$/.
https://www.gnu.org/software/gawk/manual/gawk.html
\s Matches any space character as defined by the current locale. Think of it as shorthand for ‘[[:space:]]’.
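If portability matters, the same idea applied to the command from the question might look like this (a sketch; the shell variable is also quoted here):
awk -v variable="$bashvariable" '!/^[ \t]*$/ {print variable"\t single\t" $0"\t double"}' myinfile.c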
Based on Levon's answer, you may just add | awk 'length { print }' to the end of the command.
So change
awk -v variable=$bashvariable '{ whatever }' myinfile.c
to
awk -v variable=$bashvariable '{ whatever }' myinfile.c | awk 'length { print }'
In case this doesn't work, use | awk 'NF { print }' instead.
Another awk way to trim out only truly zero-length lines, while keeping the ones that contain only spaces or tabs, is this:
awk 8 RS=
Just doing awk NF trims out both line 3 (zero length) and line 5 (spaces and tabs). Given this numbered input:
1 abc
2 def
3
4 3591952
5
6 93253
awk NF gives:
1 abc
2 def
3 3591952
4 93253
but the RS= approach keeps line 5 for you:
1 abc
2 def
3 3591952
4
5 93253
** Lines containing only \013 (\v, VT), \014 (\f, FF), or \015 (\r, CR) aren't skipped by NF with the default FS = " ", even though those characters also belong to POSIX [[:space:]].

how to get sub-expression value of regExp in awk?

I was analyzing logs that contain information like the following:
y1e","email":"","money":"100","coi
I want to fetch the value of money. I used awk like this:
grep pay action.log | awk '/"money":"([0-9]+)"/'
How can I get the value matched by the sub-expression ([0-9]+)?
If you have GNU AWK (gawk):
awk '/pay/ {match($0, /"money":"([0-9]+)"/, a); print substr($0, a[1, "start"], a[1, "length"])}' action.log
If not:
awk '/pay/ {match($0, /"money":"([0-9]+)"/); split(substr($0, RSTART, RLENGTH), a, /[":]/); print a[5]}' action.log
The result of either is 100. And there's no need for grep.
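As a gawk-only side note, after the three-argument match() the text captured by the first group is also available directly as a[1], so the same result can be printed a bit more briefly:
awk '/pay/ && match($0, /"money":"([0-9]+)"/, a) {print a[1]}' action.log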
Offered as an alternative, assuming the data format stays the same once the lines are grep'ed, this will extract the money field without using a regular expression:
awk -v FS=\" '{print $9}' data.txt
assuming data.txt contains
y1e","email":"","money":"100","coin.log
yielding:
100
I.e., your field separator is set to " and you print out field 9
You need to reference group 1 of the regex
I'm not fluent in awk but here are some other relevant questions
awk extract multiple groups from each line
GNU awk: accessing captured groups in replacement text
Hope this helps
If money can come in at different positions, then maybe it would not be a good idea to hard-code the field number.
You can try something like this:
$ awk -v FS=[,:\"] '{ for (i=1;i<=NF;i++) if($i~/money/) print $(i+3)}' inputfile
grep pay action.log | awk -F "\n" 'm=gensub(/.*money":"([0-9]+)".*/, "\\1", "g", $1) {print m}'