How can I use bash variable in awk with regexp?

I have a file like this (this is sample):
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
...
I have iplist.txt like this:
71.13.55.
12.33.23.
8.8.
4.2.
...
I need to print the lines whose 3rd column starts with one of the prefixes in iplist.txt.
Like this:
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
I tried:
for ip in $(cat iplist.txt); do
awk -v var="$ip" -F '|' '{if ($3 ~ /^$var/) print $0;}' text.txt
done
But the bash variable is not expanded inside the /^ / regex block. How can I do that?

First, you can use a concatenation of strings for the regular expression, it doesn't have to be a regex block. You can say:
'{if ($3 ~ "^" var) print $0;}'
Second, note above that you don't use a $ with variables inside awk. $ is only used to refer to fields by number (as in $3, or $somevar where somevar has a field number as its value).
Third, you can do everything in awk in which case you can avoid the shell loop and don't need the var:
awk -F'|' 'NR==FNR {a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
EDIT
As rightly pointed out in the comments, the .s in the patterns will match any character, not just a literal .. Thus we need to escape them before doing the match:
awk -F'|' 'NR==FNR {gsub(/\./,"\\."); a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
I'm assuming that you only want to output a given line once, even if it matches multiple patterns from iplist.txt. If you want to output a line multiple times for multiple matches (as your version would have done), remove the next from {print;next}.
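As a minimal, self-contained sketch of the one-pass approach above: the scratch directory, the file names and the extra sample rows (including the 71x13y55z12 line, which an unescaped dot would wrongly accept) are assumptions made up for the demo, not part of the question.

```shell
# Hypothetical scratch directory and sample files, used only for this demo.
tmp=$(mktemp -d)
cat > "$tmp/iplist.txt" <<'EOF'
71.13.55.
8.8.
EOF
cat > "$tmp/text.txt" <<'EOF'
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
11.11.11.11|212.152.22.12|71x13y55z12|8.8.8.8
10.0.0.1|212.152.22.12|8.8.4.4|1.1.1.1
EOF

# First file: escape the dots and store the anchored patterns as array keys.
# Second file: print a line once if its 3rd field matches any stored pattern.
matched=$(awk -F'|' '
    NR == FNR { gsub(/\./, "\\."); pats["^" $0]; next }
    { for (p in pats) if ($3 ~ p) { print; next } }
' "$tmp/iplist.txt" "$tmp/text.txt")

printf '%s\n' "$matched"
rm -rf "$tmp"
```

With this sample data only the first and the last row are printed; the 71x13y55z12 row is rejected because the dots were escaped.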

Use var directly instead of writing /^$var/ (add the ^ to the variable first):
awk -v var="^$ip" -F '|' '$3 ~ var' text.txt
By the way, the default action for a true condition is to print the current record, so, {if (test) {print $0}} can often be contracted to just test.
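Note that when var is used as a pattern, the dots in the prefix are still regex metacharacters. If a literal "starts with" test is all that is needed, one possible alternative (a swapped-in technique, not what the answer above does) is index(), which compares plain strings. The sample files and the variable name prefix below are assumptions for illustration.

```shell
# Hypothetical sample files for the demo.
tmp=$(mktemp -d)
printf '%s\n' '71.13.55.' '8.8.' > "$tmp/iplist.txt"
printf '%s\n' \
    '71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8' \
    '21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8' \
    '11.11.11.11|212.152.22.12|71x13y55z12|8.8.8.8' > "$tmp/text.txt"

prefix_hits=$(
    while IFS= read -r ip; do
        # index() returns 1 only when $3 begins with the literal prefix,
        # so no regex escaping is required.
        awk -F'|' -v prefix="$ip" 'index($3, prefix) == 1' "$tmp/text.txt"
    done < "$tmp/iplist.txt"
)

printf '%s\n' "$prefix_hits"
rm -rf "$tmp"
```

Here only the first row is printed. The loop still starts one awk process per prefix, so for large lists the single-pass form is probably the better fit.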

Here is a way with bash, sed and grep. It's straightforward and I think it may be a bit cleaner than awk in this case:
IFS=$(echo -en "\n\b") && for ip in $(sed 's/\./\\&/g' iplist.txt); do
grep "^[^|]*|[^|]*|${ip}" r.txt
done
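A possible variation on the loop above, sketched under the assumption that a temporary pattern file is acceptable: let sed build one anchored pattern per prefix and hand the whole list to a single grep -f call. The file names are made up for the demo.

```shell
# Hypothetical sample files for the demo.
tmp=$(mktemp -d)
printf '%s\n' '71.13.55.' '8.8.' > "$tmp/iplist.txt"
printf '%s\n' \
    '71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8' \
    '21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8' \
    '10.0.0.1|212.152.22.12|8.8.4.4|1.1.1.1' > "$tmp/text.txt"

# Escape the dots, then prepend "two fields and their separators" so that
# each prefix is anchored at the start of the 3rd field.
sed -e 's/\./\\./g' -e 's/^/^[^|]*|[^|]*|/' "$tmp/iplist.txt" > "$tmp/patterns.txt"

# One grep call reads all patterns from the generated file.
grep_hits=$(grep -f "$tmp/patterns.txt" "$tmp/text.txt")

printf '%s\n' "$grep_hits"
rm -rf "$tmp"
```

This prints the first and third rows. The second row is skipped even though its 4th field is 8.8.8.8, because [^|]* cannot cross a field separator.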

Related

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line with multiple domains names. Removing any domain name that has a set certain pattern of 4 letters ex: ozar.
This will be used in a bash script so the number of domain names can range, I will save this to a csv later on but right now returning a string is fine.
I tried multiple commands, loops, and if statements but sending the output to variable I can use further in the script proved to be another difficult task.
Example file
$ cat file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org >ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com

If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[@]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
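Since the question asks for the result in a variable, the bash loop above could be adapted roughly as follows. This is only a sketch: the sample line is modeled on the question, and the names words, kept and domains are illustrative.

```shell
s=ozar
# Sample line modeled on the question's file.txt.
line='ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com'

# Split the line into an array on whitespace.
read -r -a words <<< "$line"

# Keep only the words that do not contain the substring.
kept=()
for w in "${words[@]}"; do
    [[ $w == *"$s"* ]] || kept+=("$w")
done

# "${kept[*]}" joins the elements with the first character of IFS (a space).
domains="${kept[*]}"
printf '%s\n' "$domains"
```

This leaves win.ad.win.edu ap.allk.org allk.org website.com in domains, ready for use later in the script.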

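If you want to delete everything from the pattern up to the next space delimiter (note that any text before the pattern, such as win_fl., is left behind):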
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com
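To drop the whole space-delimited word instead of only its tail, the expression can be widened to also consume the non-space characters before the pattern. A minimal sketch, assuming the domains are separated by single spaces as in the question; the cleanup substitutions only tidy the leftover blanks.

```shell
# Sample line modeled on the question's file.txt.
line='ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org ozarkz.com website.com'

cleaned=$(printf '%s\n' "$line" |
    sed -e 's/[^ ]*ozar[^ ]*//g' \
        -e 's/  */ /g' \
        -e 's/^ //' \
        -e 's/ $//')

printf '%s\n' "$cleaned"
```

The first expression removes every word containing ozar, the second squeezes runs of spaces, and the last two trim the ends, which gives win.ad.win.edu ap.allk.org allk.org website.com.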

AWK: how to match a comma

I want to return lines from awk with a pattern "C," or ".,C" or ".,C,.*".
For example:
Valid
C,G
G,C
G,C,A
Invalid
G,CC
My code is below:
echo G,CC | awk '$0 ~ /^C,+.*|.*,C,*.*/ {print $0}'
output:
G,CC
I hope it returns nothing to me. Unfortunately, it returns "G,CC" to me.
How do I solve this problem?
Edit:
Based on the answers from @Emma and @perreal, I used a shorter command line to solve my question:
awk '$0 ~ /^C,.*|.*,C,.*|.*,C$/ {print $0}'
Until now, it works well. Thanks for your help!!

Could you please try following.
awk '!/CC/ && /^C,+.*|.*,C,*.*/' Input_file

The + is not necessary in ^C,+.*, since you already match the comma and also match whatever comes after.
The * right after the second comma is not correct in .*,C,*.*. It makes the comma optional so it can also match G,CC (.*, matches G, and C,* matches CC).
This should work:
awk '$0 ~ /^[GCA](,[GCA])*$/ && /C/ {print $0}'
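The effect of the optional comma described above can be checked side by side. This sketch runs the question's original pattern and the version with a mandatory comma (the one from the question's edit) over the four sample lines; the variable names are illustrative.

```shell
samples='C,G
G,C
G,C,A
G,CC'

# Original pattern: ",*" makes the comma optional, so ",C" followed by
# anything (including another C) is accepted.
loose=$(printf '%s\n' "$samples" | awk '/^C,+.*|.*,C,*.*/')

# Comma made mandatory, plus an alternative for a trailing ",C".
strict=$(printf '%s\n' "$samples" | awk '/^C,.*|.*,C,.*|.*,C$/')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The loose pattern prints all four lines, while the strict one prints only C,G, G,C and G,C,A.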

My guess is that maybe this would also work:
awk '$0 ~ /^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/ {print $0}'
Advice
Mr. Rankin is advising that:
It is equivalent to awk '/^([A-Z],C,[A-Z]|[A-Z],C|C,[A-Z])$/'. Output
with print is the default operation along with the match against the
record.

$ awk '/(^|,)C(,|$)/' file
C,G
G,C
G,C,A

More alternatives
In other words, you want to select lines that contain "C" as a whole word? If so, here are 2 solutions:
grep -w C
grep -E '\<C\>'
The first one tells grep to match only whole words. The second one uses the begin-word and end-word patterns. These patterns can be used with awk too (in GNU awk, at least):
awk '/\<C\>/ {print}'
A completely different solution (and different from the other answers too) is to add commas at both ends of the line before matching against ,C,:
awk '("," $0 ",") ~ /,C,/ {print}'
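A quick check of the comma-padding idea, as a sketch: the first four inputs are the question's samples, and CC,G is an extra line added here as an assumption to cover a doubled C at the start.

```shell
# Pad each line with commas so that every field, including the first and
# the last, is surrounded by commas; then a plain ,C, test is enough.
padded=$(printf '%s\n' 'C,G' 'G,C' 'G,C,A' 'G,CC' 'CC,G' |
    awk '("," $0 ",") ~ /,C,/')

printf '%s\n' "$padded"
```

Only C,G, G,C and G,C,A are printed. The parentheses are not strictly required, since concatenation binds tighter than ~ in awk, but they make the intent explicit.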

How to create awk regex to match only one "space" between two words?

I have a sentence of form 2016-23-12 90-34-23 want to create an awk script to match it.
a.awk
$1 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}/{
ts = $1 " " $2
print
}
Run using:
awk -f a.awk --posix
2016-23-12 90-34-23
Output:
Nothing

I assume your intention is to match the whole line, in which case $1 is incorrect: with the default field separator, $1 holds only the first whitespace-delimited field (2016-23-12), so it can never contain the space your regex requires. Match against $0 instead.
You can also keep the pattern in a string and use it as a dynamic regexp, in which case the surrounding // is not needed. Set the string in a BEGIN block (a bare assignment outside any block acts as a pattern and prints every line), so your script becomes:
BEGIN {
    dynamicRegex = "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}"
}
$0 ~ dynamicRegex {
    print "match success"
}
and now running the script as
echo "2016-23-12 90-34-23"| awk -f a.awk --posix
match success
Quoting from the page,
[..]The righthand side of a ~ or !~ operator need not be a regexp constant (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated and converted to a string if necessary; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp [..]
Another way would be to use the normal Regular Expression syntax over the POSIX character classes as a regexp constant as below,
$0 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}-[0-9]{2}-[0-9]{2}$/ {
print "match success"
}
Remember that with the above regex your script is no longer POSIX compatible, and running it with --posix won't work, because \s is a GNU Awk specific construct. Running it as
echo "2016-23-12 90-34-23"| awk -f a.awk
match success
To also print the line on a successful match, just add
print $1 FS $2
after the earlier print command.
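The difference between testing $1 and testing $0 can be seen in a small experiment. This is a sketch under a few assumptions: the digit groups are spelled out instead of using {n} intervals so that the result does not depend on interval support in the installed awk, and the variable names are illustrative.

```shell
line='2016-23-12 90-34-23'

# Build the pattern from two-digit pieces; exactly one space in the middle.
d='[0-9][0-9]'
re="^$d$d-$d-$d $d-$d-$d\$"

# $1 is only "2016-23-12", so a pattern that needs a space cannot match it.
first_field=$(printf '%s\n' "$line" |
    awk -v re="$re" '{ print (($1 ~ re) ? "match" : "no match") }')

# $0 is the whole line, which is what the pattern describes.
whole_line=$(printf '%s\n' "$line" |
    awk -v re="$re" '{ print (($0 ~ re) ? "match" : "no match") }')

# Two spaces between the parts must be rejected.
two_spaces=$(printf '%s\n' '2016-23-12  90-34-23' |
    awk -v re="$re" '{ print (($0 ~ re) ? "match" : "no match") }')

printf '%s / %s / %s\n' "$first_field" "$whole_line" "$two_spaces"
```

This prints no match / match / no match.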

Try this -
$ echo "2016-23-12 90-34-23" | awk '{if($0 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}$/) {print $0}}'
2016-23-12 90-34-23
$ echo "2016-23-121 190-34-23" | awk '{if($0 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}$/) {print $0}}'
##### No result

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After

Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
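The zero-or-one-occurrence requirement from the question is also covered by cut, because without the -s option it passes lines that contain no delimiter through unchanged. A small sketch over three inputs (the middle one is an added sample):

```shell
cut_out=$(
    for s in 'After-u-math-how-however' 'After-u' 'After'; do
        # Keep fields 1 and 2; lines without a dash are printed as they are.
        printf '%s\n' "$s" | cut -d- -f1,2
    done
)

printf '%s\n' "$cut_out"
```

The three results are After-u, After-u and After.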

This might work for you (GNU sed):
sed 's/-[^-]*//2g' file

You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u

@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1",1)}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
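The non-GNU variants mentioned above might look roughly like this. It is a sketch, not necessarily what the answerer had in mind: the sed version uses basic regular expressions with \( \) groups, and the awk version uses match() with RSTART and RLENGTH.

```shell
# Basic-regex sed: capture "first part, dash, second part" and drop the rest.
# Lines without a dash do not match and pass through unchanged.
sed_out=$(printf '%s\n' 'After-u-math-how-however' 'After' |
    sed 's/^\([^-]*-[^-]*\).*/\1/')

# Plain awk: match() sets RSTART and RLENGTH; substr() keeps just that part.
awk_out=$(printf '%s\n' 'After-u-math-how-however' 'After' |
    awk 'match($0, /^[^-]*-[^-]*/) { $0 = substr($0, RSTART, RLENGTH) } 1')

printf '%s\n' "$sed_out" "$awk_out"
```

Both commands yield After-u for the first input and leave After untouched.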

awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
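The NF-based command explained above can be exercised on a few inputs. In this sketch the third input, a-0-b, is an added sample chosen to show that testing NF (rather than the value of the second field) keeps a field of 0 intact.

```shell
nf_out=$(printf '%s\n' 'After-u-math-how-however' 'After' 'a-0-b' |
    awk -F - '{ print $1 (NF > 1 ? FS $2 : "") }')

printf '%s\n' "$nf_out"
```

The output is After-u, After and a-0, one per line.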

This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
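One thing to watch with the commands above is that IFS=- stays in effect for the rest of the shell session. A possible way to contain it, sketched with a hypothetical helper name first_two, is to make IFS local to a function:

```shell
# first_two is an illustrative name, not part of the original answer.
first_two() {
    # IFS is local, so the caller's word splitting is not affected.
    local IFS=- parts
    read -ra parts <<< "$1"
    # "${parts[*]:0:2}" joins the first two elements with "-" (first IFS char).
    printf '%s\n' "${parts[*]:0:2}"
}

first_two 'After-u-math-how-however'
first_two 'After'
```

The two calls print After-u and After.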

awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After

This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'

awk, a field doesn't match but it should match

I have a file structured as record list, where field separator is \t.
I want to extract only records where the second field is a number from 1 to 9, but my awk script doesn't work.
The awk script is
cat file |awk -v FS="\t" '$2 ~ /[0-9]{1}/ {print $0;}'
or this
cat file |awk -v FS="\t" '$2 ~ /.{1}/ {print $0;}' # because every second field in my file is a number
Why don't these scripts work? Isn't the regex a good one?

Update
Even with the interval {1}, you are still going to match a field like 23, because the 2 on its own matches a single digit. What you really want is to use anchors and forget about intervals:
awk '$2 ~ /^[0-9]$/{print}' FS="\t" file
The problem is the use of the interval {1}. gawk versions before 4.0 do not enable interval expressions by default; they are supported if you add the following flag: --re-interval
Try this:
awk --re-interval '$2 ~ /[0-9]{1}/{print}' FS="\t" file
Some other things to note:
Built in vars such as FS can be assigned at the end without the need for -v
You can use just print rather than print $0 as that is its default behavior
Useless use of cat. awk can take a file as an argument, use that instead

If you want to ensure the 2nd field is a single-digit number, you don't really need a regex:
awk '1 <= $2 && $2 <= 9 {print}'
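The three tests discussed in this thread behave differently on edge cases, which a small experiment can show. The tab-separated rows below are made-up sample data; the anchored pattern uses [1-9] to follow the question's "from 1 to 9" wording, which is an assumption about what is wanted for 0.

```shell
# Hypothetical rows: a label, a tab, then the value to test.
rows=$(printf 'a\t5\nb\t23\nc\t0\nd\t9\ne\t1.5')

# Unanchored: any field that contains a digit somewhere.
unanchored=$(printf '%s\n' "$rows" | awk -F'\t' '$2 ~ /[0-9]/ { print $1 }')

# Anchored: the field must be exactly one digit from 1 to 9.
anchored=$(printf '%s\n' "$rows" | awk -F'\t' '$2 ~ /^[1-9]$/ { print $1 }')

# Numeric comparison: any number in the range, including non-integers.
numeric=$(printf '%s\n' "$rows" | awk -F'\t' '1 <= $2 && $2 <= 9 { print $1 }')

printf 'unanchored: %s\n' "$(echo $unanchored)"
printf 'anchored:   %s\n' "$(echo $anchored)"
printf 'numeric:    %s\n' "$(echo $numeric)"
```

The unanchored test selects every row, the anchored test selects a and d, and the numeric comparison selects a, d and e, since 1.5 also lies between 1 and 9.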

We Keep Coding
