AWK: how to match a comma - regex

I want to return lines from awk with a pattern "C," or ".,C" or ".,C,.*".
For example:
Valid
C,G
G,C
G,C,A
Invalid
G,CC
My code is below:
echo G,CC | awk '$0 ~ /^C,+.*|.*,C,*.*/ {print $0}'
output:
G,CC
I expected it to return nothing, but it returns "G,CC".
How do I solve this problem?
Edit:
Based on the answers from @Emma and @perreal, I used a shorter command to solve my problem:
awk '$0 ~ /^C,.*|.*,C,.*|.*,C$/ {print $0}'
Until now, it works well. Thanks for your help!!

Could you please try following.
awk '!/CC/ && /^C,+.*|.*,C,*.*/' Input_file

The + is not necessary in ^C,+.*, since you already match the comma and also match whatever comes after.
The * right after the second comma is not correct in .*,C,*.*. It makes the comma optional so it can also match G,CC (.*, matches G, and C,* matches CC).
This should work:
awk '$0 ~ /^[GCA](,[GCA])*$/ && /C/ {print $0}'
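The point about the optional comma can be checked directly. The sketch below is only an illustration: it runs the question's pattern and an anchored pattern, (^|,)C(,|$), over the four sample lines from the question. The variable names are made up for the example.

```shell
# Sample lines from the question; only G,CC should be rejected.
input='C,G
G,C
G,C,A
G,CC'

# Original pattern: ",*" makes the comma after C optional, so G,CC slips through.
loose=$(printf '%s\n' "$input" | awk '/^C,+.*|.*,C,*.*/')

# Anchored pattern: C must be bounded on each side by a comma or a line edge.
strict=$(printf '%s\n' "$input" | awk '/(^|,)C(,|$)/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern keeps all four lines, while the anchored one drops G,CC.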

My guess is that maybe this would also work:
awk '$0 ~ /^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/ {print $0}'
Advice
Mr. Rankin is advising that:
It is equivalent to awk '/^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/'. Output
with print is the default operation along with the match against the
record.

$ awk '/(^|,)C(,|$)/' file
C,G
G,C
G,C,A

More alternatives
In other words, you want to select lines that contain "C" as a whole word? If so, here are 2 solutions:
grep -w C
grep -E '\<C\>'
The first one tells grep to match only whole words. The second uses the begin-word and end-word patterns. These patterns can be used with GNU awk too (\< and \> are a gawk extension):
awk '/\<C\>/ {print}'
A completely different solution (and different from the other answers too) is to add commas at both ends of the line before matching against ,C,:
awk '("," $0 ",") ~ /,C,/ {print}'
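A minimal sketch of the comma-padding idea, run on the sample lines from the question. The second command is not taken from any answer above; it is an alternative that compares each comma-separated field to "C" exactly, shown only for comparison.

```shell
input='C,G
G,C
G,C,A
G,CC'

# Pad the record with commas so every field, including the first and last,
# is surrounded by commas; a plain ",C," test is then enough.
padded=$(printf '%s\n' "$input" | awk '("," $0 ",") ~ /,C,/')

# Alternative (an assumption, not from the answers): exact field comparison.
fields=$(printf '%s\n' "$input" |
  awk -F, '{ for (i = 1; i <= NF; i++) if ($i == "C") { print; next } }')

printf '%s\n' "$padded"
```

Both variants should select C,G, G,C and G,C,A and skip G,CC.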

Related

Print everything before relevant symbol and keep 1 character after relevant symbol

I'm trying to find a one-liner to print everything before the relevant symbol and keep just 1 character after it:
Input:
thisis#atest
thisisjust#anothertest
just#testing
Desired output:
thisis#a
thisisjust#a
just#t
awk -F"#" '{print $1 "#" }' will almost give me what I want but I need to find a way to print the second character as well. Any ideas?
You can substitute what's after the first character after # with nothing with sed:
sed 's/\(#.\).*/\1/'
You could use grep:
$ grep -o '[^#]*#.' infile
thisis#a
thisisjust#a
just#t
This matches a sequence of characters other than #, followed by # and any character. The -o option retains only the match itself.
With the special RT variable in GNU's awk, you can do:
awk 'BEGIN{RS="#.|\n"}RT!="\n"{print $0 RT}'
Get the index of the '#', then pull out the substring.
$ awk '{print substr($0,1,index($0,"#")+1);}' in.txt
thisis#a
thisisjust#a
just#t
1st Solution: Could you please try following.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH)}' Input_file
The above prints only the lines that contain #, trimmed as requested, and skips lines without it. If you also want lines without # printed unchanged, use the following instead.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH);next} 1' Input_file
2nd solution:
awk 'BEGIN{FS=OFS="#"} {print $1,substr($2,1,1)}' Input_file
Some small variation of Ravindes 2nd example
awk -F# '{print $1"#"substr($2,1,1)}' file
awk -F# '{print $1FS substr($2,1,1)}' file
Another grep variation (shortest posted so far):
grep -oP '.+?#.' file
o print only the matching part
P Perl regex (needed for +?)
. any character
+ one or more times
? but as few as possible, stopping at:
#
. plus one more character
If we do not add the ?, the line test#one#two becomes test#one#t instead of test#o, due to the greedy +.
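The greedy-versus-lazy behaviour can also be reproduced without PCRE. POSIX awk regular expressions have no lazy quantifier, so this sketch swaps in a negated bracket expression, [^#]*, which cannot run past the first #. The sample line is the one from the explanation above.

```shell
line='test#one#two'

# Greedy: .+ consumes as much as possible, so the match ends after the last '#'.
greedy=$(printf '%s\n' "$line" |
  awk 'match($0, /.+#./) { print substr($0, RSTART, RLENGTH) }')

# [^#]* cannot cross a '#', so the match stops one character after the first '#'.
first=$(printf '%s\n' "$line" |
  awk 'match($0, /[^#]*#./) { print substr($0, RSTART, RLENGTH) }')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```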
If you want to use awk, the cleanest way to do this is with index, which finds the position of a character:
awk 'n=index($0,"#") { print substr($0,1,n+1) }' file
There are, however, shorter and more dedicated tools for this. See the other answers.
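One detail of the index-based approach is what happens on a line with no # at all. The sketch below is illustrative, with a made-up middle input line. It compares an unguarded substr call with a version that uses the index result as the pattern, so lines without # are skipped.

```shell
input='thisis#atest
nohashhere
just#testing'

# Unguarded: index() returns 0 when '#' is absent, so substr($0, 1, 1)
# prints the first character of that line.
unguarded=$(printf '%s\n' "$input" |
  awk '{ print substr($0, 1, index($0, "#") + 1) }')

# Guarded: the assignment doubles as the pattern; 0 is false, so the line is skipped.
guarded=$(printf '%s\n' "$input" |
  awk '(n = index($0, "#")) { print substr($0, 1, n + 1) }')

printf 'unguarded:\n%s\nguarded:\n%s\n' "$unguarded" "$guarded"
```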

How can I use bash variable in awk with regexp?

I have a file like this (this is sample):
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
...
I have iplist.txt like this:
71.13.55.
12.33.23.
8.8.
4.2.
...
I need to select the lines whose 3rd column starts with one of the prefixes in iplist.txt.
Like this:
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
I tried:
for ip in $(cat iplist.txt); do
awk -v var="$ip" -F '|' '{if ($3 ~ /^$var/) print $0;}' text.txt
done
But bash variable does not work in /^ / regex block. How can I do that?
First, you can use a concatenation of strings for the regular expression, it doesn't have to be a regex block. You can say:
'{if ($3 ~ "^" var) print $0;}'
Second, note above that you don't use a $ with variables inside awk. $ is only used to refer to fields by number (as in $3, or $somevar where somevar has a field number as its value).
Third, you can do everything in awk in which case you can avoid the shell loop and don't need the var:
awk -F'|' 'NR==FNR {a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
EDIT
As rightly pointed out in the comments, the .s in the patterns will match any character, not just a literal .. Thus we need to escape them before doing the match:
awk -F'|' 'NR==FNR {gsub(/\./,"\\."); a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
I'm assuming that you only want to output a given line once, even if it matches multiple patterns from iplist.txt. If you want to output a line multiple times for multiple matches (as your version would have done), remove the next from {print;next}.
Use var directly instead of writing /^$var/ (add the ^ to the variable first):
awk -v var="^$ip" -F '|' '$3 ~ var' text.txt
By the way, the default action for a true condition is to print the current record, so, {if (test) {print $0}} can often be contracted to just test.
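To see why the unescaped dots matter, here is a small sketch comparing the dynamic-regex match with a literal prefix test based on index(). The index() variant is a swapped-in alternative, not something proposed in the answers. The last data line is invented to trigger a false positive.

```shell
data='71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
1.1.1.1|2.2.2.2|71x13y55z9|8.8.8.8'
prefix='71.13.55.'

# Dynamic regex: each "." matches any character, so 71x13y55z9 also matches.
regex_hits=$(printf '%s\n' "$data" | awk -v var="^$prefix" -F'|' '$3 ~ var')

# Literal prefix test: index() returns 1 only when $3 starts with the prefix.
literal_hits=$(printf '%s\n' "$data" | awk -v p="$prefix" -F'|' 'index($3, p) == 1')

printf 'regex:\n%s\nliteral:\n%s\n' "$regex_hits" "$literal_hits"
```

With a literal comparison there is nothing to escape, which sidesteps the gsub step entirely.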
Here is a way with bash, sed and grep. It's straightforward and I think it may be a bit cleaner than awk in this case:
IFS=$(echo -en "\n\b") && for ip in $(sed 's/\./\\&/g' iplist.txt); do
grep "^[^|]*|[^|]*|${ip}" r.txt
done

Print matched pattern with AWK

For example, I have this data:
/home/test/dat1.txt
/home/test/dat2.txt
/home/test/test1/dat3.txt
/home/test/test2/dat4.txt
/home/test/test3/test4/dat5.txt
I need to print only the name and extension, that output should be:
dat1.txt
dat2.txt
dat3.txt
dat4.txt
dat5.txt
I need to use the awk command... anyone can help?
I use this regular expression: '/\/*\.txt/{print ???}
If you are going to use awk, you do not need a regex for this purpose.
You can just tell awk to print the last field, using a field separator of /.
awk -F'/' '{print $NF}' Input.txt
As hd1's comment already noted, NF is the number of fields on the current input record (in this case line). Since awk starts indexing fields at $1, $NF gives you the last field.
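As a quick illustration of NF and $NF, the sketch below prints the field count next to the last field for two of the sample paths. The leading / produces an empty first field, which is why the counts are one higher than the number of path components.

```shell
paths='/home/test/dat1.txt
/home/test/test3/test4/dat5.txt'

# With "/" as separator the text before the first slash is an empty field $1.
out=$(printf '%s\n' "$paths" | awk -F/ '{ print NF, $NF }')

printf '%s\n' "$out"
```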
You could use this short awk
awk -F/ '$0=$NF' Input.txt
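The version above skips lines whose last field is empty, because the assignment is used as the condition. If you need those printed as empty lines, use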
awk -F/ '{$0=$NF}1' Input.txt
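The difference between the two short forms only shows up when the last field is empty. In this sketch the middle input line ends in a slash; that line is an invented test case, not part of the question's data.

```shell
input='/home/test/dat1.txt
/home/test/
/home/dat2.txt'

# Assignment as pattern: an empty $NF is false, so that record is not printed.
short=$(printf '%s\n' "$input" | awk -F/ '$0=$NF')

# Assignment inside the action, followed by the always-true pattern 1.
full=$(printf '%s\n' "$input" | awk -F/ '{$0=$NF}1')

printf 'short:\n%s\nfull:\n%s\n' "$short" "$full"
```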

Regex, get what's after the second occurrence of a string

I have a string of the following format:
TEXT####TEXT####SPECIALTEXT
I need to get the SPECIALTEXT, basically what is after the second occurrence of the ####. I can't get it done. Thanks
The regex (?:.*?####){2}(.*) contains what you're looking for in its first group.
If you are using shell and can use awk for it:
From a file:
awk 'BEGIN{FS="####"} {print $3}' input_file
From a variable:
awk 'BEGIN{FS="####"} {print $3}' <<< "$input_variable"
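Printing $3 returns only the third field, so it cuts the result short if the text after the second #### contains the separator again. A possible sketch for that case, assuming such input can occur, strips everything up to and including the second separator with index() and substr(). The sample string is invented.

```shell
s='TEXT####TEXT####SPECIAL####TEXT'

# Field-based: stops at the next separator.
third=$(printf '%s\n' "$s" | awk 'BEGIN{FS="####"} {print $3}')

# Position-based: cut off the first two separators, keep the rest verbatim.
rest=$(printf '%s\n' "$s" | awk '{
  sep = "####"
  for (k = 0; k < 2; k++) {
    p = index($0, sep)
    if (p == 0) next            # fewer than two separators: print nothing
    $0 = substr($0, p + length(sep))
  }
  print
}')

printf 'third: %s\nrest:  %s\n' "$third" "$rest"
```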

how to get sub-expression value of regExp in awk?

I am analyzing logs that contain information like the following:
y1e","email":"","money":"100","coi
I want to fetch the value of money. I used awk like this:
grep pay action.log | awk '/"money":"([0-9]+)"/'
How can I get the value of the sub-expression ([0-9]+)?
If you have GNU AWK (gawk):
awk '/pay/ {match($0, /"money":"([0-9]+)"/, a); print substr($0, a[1, "start"], a[1, "length"])}' action.log
If not:
awk '/pay/ {match($0, /"money":"([0-9]+)"/); split(substr($0, RSTART, RLENGTH), a, /[":]/); print a[5]}' action.log
The result of either is 100. And there's no need for grep.
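Another portable way to pull out the number, sketched here as an assumption rather than taken from the answer, uses RSTART and RLENGTH with fixed offsets. The literal "money":" prefix is 9 characters long and the closing quote is 1, hence the + 9 and - 10. The sample line is the one from the question.

```shell
line='y1e","email":"","money":"100","coi'

# match() sets RSTART/RLENGTH for the whole match; skip the 9-character
# prefix "money":" and drop the trailing quote to keep only the digits.
money=$(printf '%s\n' "$line" | awk '
  match($0, /"money":"[0-9]+"/) {
    print substr($0, RSTART + 9, RLENGTH - 10)
  }')

printf '%s\n' "$money"
```

The offsets are hard-coded for this key, so they would need adjusting for a key of a different length.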
Offered as an alternative, assuming the data format stays the same once the lines are grep'ed, this will extract the money field, not using a regular expression:
awk -v FS=\" '{print $9}' data.txt
assuming data.txt contains
y1e","email":"","money":"100","coin.log
yielding:
100
I.e., your field separator is set to " and you print out field 9
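To see where field 9 comes from, this sketch numbers every field that the double-quote separator produces for the sample line. The output format is made up for the illustration.

```shell
line='y1e","email":"","money":"100","coin.log'

# Print each field with its index; punctuation between quotes becomes its own field.
fields=$(printf '%s\n' "$line" |
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

Field 7 holds the key money and field 9 its value, which is why a change in the line layout would break the fixed position.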
You need to reference group 1 of the regex
I'm not fluent in awk but here are some other relevant questions
awk extract multiple groups from each line
GNU awk: accessing captured groups in replacement text
Hope this helps
If money can appear at different positions in the line, it is probably not a good idea to hard-code the field number.
You can try something like this -
$ awk -v FS=[,:\"] '{ for (i=1;i<=NF;i++) if($i~/money/) print $(i+3)}' inputfile
grep pay action.log | awk -F "\n" 'm=gensub(/.*money":"([0-9]+)".*/, "\\1", "g", $1) {print m}'