how to get sub-expression value of regExp in awk? - regex

I was analyzing logs contains information like the following:
y1e","email":"","money":"100","coi
I want to fetch the value of money, i used 'awk' like :
grep pay action.log | awk '/"money":"([0-9]+)"/' ,
then how can i get the sub-expression value in ([0-9]+) ?

If you have GNU AWK (gawk):
awk '/pay/ {match($0, /"money":"([0-9]+)"/, a); print substr($0, a[1, "start"], a[1, "length"])}' action.log
If not:
awk '/pay/ {match($0, /"money":"([0-9]+)"/); split(substr($0, RSTART, RLENGTH), a, /[":]/); print a[5]}' action.log
The result of either is 100. And there's no need for grep.

Offered as an alternative, assuming the data format stays the same once the lines are grep'ed, this will extract the money field, not using a regular expression:
awk -v FS=\" '{print $9}' data.txt
assuming data.txt contains
y1e","email":"","money":"100","coin.log
yielding:
100
I.e., your field separator is set to " and you print out field 9

You need to reference group 1 of the regex
I'm not fluent in awk but here are some other relevant questions
awk extract multiple groups from each line
GNU awk: accessing captured groups in replacement text
Hope this helps

If you have money coming in at different places then may be it would not be a good idea to hard code the positional parameter.
You can try something like this -
$ awk -v FS=[,:\"] '{ for (i=1;i<=NF;i++) if($i~/money/) print $(i+3)}' inputfile

grep pay action.log | awk -F "\n" 'm=gensub(/.*money":"([0-9]+)".*/, "\\1", "g", $1) {print m}'

Related

Edited: Grep/Awk- Print specific info from table

(This example is edited, following a user's recommendation, considering a mistake in my table display)
I have a .csv table from where I need certain info. My table looks like this:
Name, Birth
James,2001/02/03 California
Patrick,2001/02/03 Texas
Sarah,2000/03/01 Alabama
Sean,2002/02/01 New York
Michael,2002/02/01 Ontario
From here, I would need to print only the unique birthdates, in an ascending order, like this:
2000/03/01
2001/02/03
2002/02/01
I have thought of a regular expression to identify the dates, such as:
awk '/[0-9]{4}/[0-9]{2}/[0-9]/{2}/' students.csv
However, I'm getting a syntax error in the regex, and I wouldn't know how to follow from this step.
Any hints?
Use cut and sort with -u option to print unique values:
cut -d' ' -f2 students.csv | sort -u > out_file
You can also use grep instead of cut:
grep -Po '\d\d\d\d/\d\d/\d\d' students.csv | sort -u > out_file
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
SEE ALSO:
perlre - Perl regular expressions
Here is a gnu awk solution to get this done in a single command:
awk 'NF > 2 && !seen[$2]++{} END {
PROCINFO["sorted_in"]="#ind_str_asc"; for (i in seen) print i}' file
2000/03/01
2001/02/03
2002/02/01
Using any awk and whether your names have 1 word or more and whether blank chars exist after the commas or not:
$ awk -F', *' 'NR>1{sub(/ .*/,"",$2); print $2}' file | sort -u
2000/03/01
2001/02/03
2002/02/01
With your shown samples, could you please try following. Written and tested in GNU awk, should work in any awk though.
awk '
match($0,/[0-9]{4}(\/[0-9]{2}){2}/){
arrDate[substr($0,RSTART,RLENGTH)]
}
END{
for(i in arrDate){
print i
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(\/[0-9]{2}){2}/){ ##using match function to match regex to match only date format.
arrDate[substr($0,RSTART,RLENGTH)] ##Creating array arrDate which has index as sub string of matched one.
}
END{ ##Starting END block of this awk program from here.
for(i in arrDate){ ##Traversing through arrDate here.
print i ##Printing index of array here.
}
}
' Input_file ##Mentioning Input_file name here.

AWK: how to match a comma

I want to return lines from awk with a pattern "C," or ".,C" or ".,C,.*".
For example:
Valid
C,G
G,C
G,C,A
Invalid
G,CC
My code is below:
echo G,CC | awk '$0 ~ /^C,+.*|.*,C,*.*/ {print $0}'
output:
G,CC
I hope it returns nothing to me. Unfortunately, it returns "G,CC" to me.
How do I solve this problem?
Edit:
Based on the answers from #Emma and #perreal. I used a shorter command line to solve my question:
awk '$0 ~ /^C,.*|.*,C,.*|.*,C$/ {print $0}'
Until now, it works well. Thanks for your help!!
Could you please try following.
awk '!/CC/ && /^C,+.*|.*,C,*.*/' Input_file
The + is not necessary in ^C,+.*, since you already match the comma and also match whatever comes after.
The * right after the second comma is not correct in .*,C,*.*. It makes the comma optional so it can also match G,CC (.*, matches G, and C,* matches CC).
This should work:
awk '$0 ~ /^[GCA](,[GCA])*$/ && /C/ {print $0}'
My guess is that maybe this would also work:
awk '$0 ~ /^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/ {print $0}'
Demo
Advice
Mr. Rankin is advising that:
It is equivalent to awk '/^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/'. Output
with print is the default operation along with the match against the
record.
$ awk '/(^|,)C(,|$)/' file
C,G
G,C
G,C,A
More alternatives
In other words, you want to select lines with "C" as word? If yes, here are 2 solutions:
grep -w C
grep -E '\<C\>'
The first one advises grep to match only whole words. The second line uses begin-word and end-word patterns. These pattern can be used with awk too:
awk '/\<C\>/ {print}'
A complete different solution (and different form other answers too) is to add commas at both ends before comparing ,C,:
awk '"," $0 "," ~ /,C,/ {print}

Print matched pattern with AWK

For example i have this data:
/home/test/dat1.txt
/home/test/dat2.txt
/home/test/test1/dat3.txt
/home/test/test2/dat4.txt
/home/test/test3/test4/dat5.txt
I need to print only the name and extension, that output should be:
dat1.txt
dat2.txt
dat3.txt
dat4.txt
dat5.txt
I need to use the awk command... anyone can help?
I use this regular expression: '/\/*\.txt/{print ???}
If you are going to use awk, you do not need a regex for this purpose.
You can just tell awk to print the last field, using a field separator of /.
awk -F'/' '{print $NF}' Input.txt
As hd1's comment already noted, NF is the number of fields on the current input record (in this case line). Since awk starts indexing fields at $1, $NF gives you the last field.
You could use this short awk
awk -F/ '$0=$NF' Input.txt
If you need empty line use
awk -F/ '{$0=$NF}1' Input.txt

Using awk to grab only numbers from a string

Background:
I have a column that should get user input in form of "Description text ref12345678". I have existing scripts that grab the reference number but unfortunately some users add it incorrectly so instead of "ref12345678" it can be "ref 12345678", "RF12345678", "abcd12345678" or any variation. Naturally the wrong formatting breaks some of the triggered scripts.
For now I can't control the user input to this field, so I want to make the scripts later in the pipeline just to get the number.
At the moment I'm stripping the letters with awk '{gsub(/[[:alpha:]]/, "")}; 1', but substitution seems like an inefficient solution. (I know I can do this also with sed -n 's/.*[a-zA-Z]//p' and tr -d '[[:alpha:]]' but they are essentially the same and I want awk for additional programmability).
The question is, is there a way to set awk to either print only numbers from a string, or set delimits to numeric items in a string? (or is substitution really the most efficient solution for this problem).
So in summary: how do I use awk for $ echo "ref12345678" to print only "12345678" without substitution?
if awk is not a must:
grep -o '[0-9]\+'
example:
kent$ echo "ref12345678"|grep -o '[0-9]\+'
12345678
with awk for your example:
kent$ echo "ref12345678"|awk -F'[^0-9]*' '$0=$2'
12345678
You can also try the following with awk assuming there will be only one number in a string:
awk '{print ($0+0)}'
This converts your entire string to numeric, and the way that awk is implemented only the values that fit the numeric description will be left. Thus for example:
echo "19 trees"|awk '{print ($0+0)}'
will produce:
19
In AWK you can specify multiple conditions like:
($3~/[[:digit:]+]/ && $3 !~/[[:alpha:]]/ && $3 !~/[[:punct:]]/ ) {print $3}
will display only digit without any alphabet and punctuation.
with !~ means not contain any.
grep works perfectly :
$ echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'
300
9
1.1
Step by step explanation:
-E
Use extended regex.
-o
Return only the matches, not the context
[+-]?[0-9]+([.][0-9]+)?+
Match numbers which are identified as:
[+-]?
An optional leading sign
[0-9]+
One or more numbers
([.][0-9]+)?
An optional period followed by one or more numbers.
it is convenient to put the output in an array
arr=($(echo "../Tin=300_maxl=9_rdx=1.1" | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'))
and then use it like this
Tin=${arr[0]}
maxl=${arr[1]}
etc..
Another option (assuming GNU awk) involves specifying a non-numeric regular expression as a separator
awk -F '[^0-9]+' '{OFS=" "; for(i=1; i<=NF; ++i) if ($i != "") print($i)}'

awk, a field doesn't match but it should match

I have a file structured as record list, where field separator is \t.
I want to extract only records where the second field is a number from 1 to 9, but my awk script doesn't work.
The awk script is
cat file |awk -v FS="\t" '$2 ~ /[0-9]{1}/ {print $0;}'
or this
cat file |awk -v FS="\t" '$2 ~ /.{1}/ {print $0;}' #because the second fields of my file have all second fields as number
Why these sscript don't work? Isn't regex a good regex?
Update
Even with the interval {1}, you are still going to match a field like 23 because the 2 matches a single number. What you really want to use are anchors and forget about intervals:
awk '$2 ~ /^[0-9]$/{print}' FS="\t" file
The problem is the use of intervals {1}. awk less than version 4 doesn't support intervals. gawk on the other hand will if you add the following flag: --re-interval
Try this:
awk --re-interval '$2 ~ /[0-9]{1}/{print}' FS="\t" file
Some other things to note:
Built in vars such as FS can be assigned at the end without the need for -v
You can use just print rather than print $0 as that is its default behavior
Useless use of cat. awk can take a file as an argument, use that instead
If you want to ensure the 2nd field is a single-digit number, you don't really need a regex:
awk '1 <= $2 && $2 <= 9 {print}'