Search regex on a specific field using awk - regex

In awk I can search a field for a value like:
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2=="eaae" {print $0};'
dd,eaae,ff
And I can search by regular expressions like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; /[a]{2}/ {print $0};'
aa,bb,cc
dd,eaae,ff
Can I force awk to apply the regexp search to a specific field? I'm looking for something like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2==/[a]{2}/ {print $0};'
expecting result:
dd,eaae,ff
Anyone know how to do it using awk?
Accepted response - Operator "~" (thanks to hek2mgl):
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2 ~ /[a]{2}/ {print $0};'

You can use:
$2 ~ /REGEX/ {ACTION}
if the regex should apply to the second field only (for example).
In your case this would lead to:
awk -F, '$2 ~ /[a]{2}/' <<< $'aa,bb,cc\ndd,eaae,ff'
You may wonder why the awk program contains only the pattern and no print. Your action is print $0, which prints the current line, and that is awk's default action when a pattern has no action block.
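A minimal sketch of the field-restricted match, using the sample data from the question. The variable names are illustrative, and the simpler regex /aa/ stands in for /[a]{2}/ so that the example also runs on awk builds without interval-expression support.

```shell
# Sample input from the question, one record per line.
input='aa,bb,cc
dd,eaae,ff'

# $2 ~ /aa/ tests the regex against field 2 only; with no action block,
# awk falls back to its default action and prints the matching line.
matched=$(printf '%s\n' "$input" | awk -F, '$2 ~ /aa/')
echo "$matched"

# !~ is the negated operator: lines whose field 2 does not match.
unmatched=$(printf '%s\n' "$input" | awk -F, '$2 !~ /aa/')
echo "$unmatched"
```

The first command should print only dd,eaae,ff, since "aa" in field 1 of the other line is never tested.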

Related

Awk regex expression to select for a certain number of delimiters in a field

I am trying to select for a field which has exactly a certain number of commas. For example, I can select for 1 comma in a field as follows:
$ echo jkl,abc | awk '$1 ~ /[a-z],[a-z]/{print $0}'
jkl,abc
The expected output, "jkl,abc", is seen.
However, when I try for 2 commas it doesn't work.
$ echo jkl,abc,xyz | awk '$1 ~ /[a-z],[a-z],[a-z]/{print $0}'
(no output)
Any thoughts?
Thanks!
/[a-z],[a-z],[a-z]/ doesn't match jkl,abc,xyz because it has no quantifiers: each [a-z] matches exactly one letter, and there are three letters between the commas. A regex that works would be /^[a-z]+,[a-z]+,[a-z]+$/, e.g.
awk '/^[a-z]+,[a-z]+,[a-z]+$/' <<< 'jkl,abc,xyz'
However, to validate the number of commas, it is better to set FS to "," and compare the number of fields (N commas give N+1 fields):
$ awk -F, 'NF == 2' <<< 'jkl,abc'
jkl,abc
$ awk -F, 'NF == 3' <<< 'jkl,abc,xyz'
jkl,abc,xyz
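The field-count idea can be sketched as a small helper. The function name count_commas is hypothetical, introduced here only for illustration.

```shell
# Hypothetical helper: print the number of commas in the string given as $1.
count_commas() {
  # With FS=",", a non-empty line with N commas has N+1 fields.
  # (An empty line has NF == 0, so this sketch would print -1 for it.)
  printf '%s\n' "$1" | awk -F, '{ print NF - 1 }'
}

one=$(count_commas 'jkl,abc')
two=$(count_commas 'jkl,abc,xyz')
echo "$one $two"
```

Comparing the result against the wanted count then replaces the regex entirely.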
It should be like:
echo jkl,abc,xyz | awk '/[a-z]+,[a-z]+,[a-z]+/{print $0}'
OR
echo jkl,abc,xyz | awk '/[a-z]+,[a-z]+,[a-z]+/'
Why OP's code doesn't work:
The regex allows only a single [a-z] character on each side of a comma, but the line has more than one character between the commas, so it doesn't match. With your input, $1 is not required: there is no whitespace, so $1 is the whole line, and I have removed the $1 part from the solution.
If you have multiple fields (separated by spaces) and want to check the condition on the first one, you could go with:
echo "jkl,abc,xyz blabla" | awk '$1 ~ /[a-z]+,[a-z]+,[a-z]+/'
The middle segment of your regexp didn't account for more than one letter between the commas, so you should have made just that part [a-z]* or [a-z]+, depending on how you want to handle the case of zero letters.
Some approaches to consider to find 2 or more commas in a field:
$ echo jkl,abc,xyz | awk '$1 ~ /[a-z],[a-z]*,[a-z]/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /([a-z]*,){2,}/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /[^,],[^,]*,[^,]/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /([^,]*,){2,}/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk 'gsub(/,/,"&",$1) > 1'
jkl,abc,xyz
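The approaches above test for two or more commas. If the field must contain exactly two, one possible sketch counts the pieces that split() produces. The sample records and the array name parts are made up for illustration, and split() is used instead of gsub() so that $0 is left unmodified.

```shell
# Three sample records; only the first has exactly two commas in field 1.
# split() returns the number of pieces, i.e. the number of commas plus one.
exact=$(printf '%s\n' 'jkl,abc,xyz blabla' 'a,b foo' 'a,b,c,d bar' |
  awk 'split($1, parts, ",") - 1 == 2')
echo "$exact"
```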

How to use sed to identify a string in brackets?

I want to find the string that is placed within the brackets. How do I use sed to pull the string out?
# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
I'm not getting the exact result
# cat /sys/block/sdb/queue/scheduler | sed 's/\[*\]//'
noop anticipatory deadline [cfq
I'm expecting an output
cfq
It can be easier with grep, especially if the position of the bracketed text changes from line to line:
$ grep -Po '(?<=\[)[^]]*' file
cfq
This is a look-behind: wherever a [ is found, it fetches all the characters that follow, up to (but not including) the next ].
See another example:
$ cat a
noop anticipatory deadline [cfq]
hello this [is something] we want to [enclose] yeah
$ grep -Po '(?<=\[)[^]]*' a
cfq
is something
enclose
You can also use awk for this, in case it is always in the same position:
$ awk -F'[][]' '{print $2}' file
cfq
This sets the field separator to either [ or ] and prints the second field. The separator is quoted so that the shell does not treat [][] as a glob pattern.
And with sed:
$ sed 's/[^[]*\[\([^]]*\).*/\1/g' file
cfq
It is a bit messy, but basically it looks for the block of text in between [] and prints it back.
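A slightly different sed sketch, assuming you only want output for lines that actually contain brackets: -n suppresses the default printing and the p flag prints only when the substitution succeeded. The second sample line is made up for illustration. Because the leading .* is greedy, a line with several bracketed groups would yield the last one.

```shell
# Sample input: one line with brackets, one without.
sched=$(printf '%s\n' 'noop anticipatory deadline [cfq]' 'no brackets here' |
  sed -n 's/.*\[\([^]]*\)\].*/\1/p')
echo "$sched"
```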
I found one possible solution-
cut -d "[" -f2 | cut -d "]" -f1
so the exact solution is
# cat /sys/block/sdb/queue/scheduler | cut -d "[" -f2 | cut -d "]" -f1
Another potential solution is awk:
s='noop anticipatory deadline [cfq]'
awk -F'[][]' '{print $2}' <<< "$s"
cfq
Another way with GNU grep:
grep -Po "\[\K[^]]*" file
With pure bash:
while read -r line; do [[ "$line" =~ \[([^]]*)\] ]] && echo "${BASH_REMATCH[1]}"; done < file
Another awk:
echo 'noop anticipatory deadline [cfq]' | awk '{gsub(/.*\[|\].*/,x)}8'
cfq
perl -lne 'print $1 if(/\[([^\]]*)\]/)'

regex to search for a string between two slashes

I have a question about bash shell scripting. I want to extract a string between two slashes. The slash is the delimiter here.
Let's say the string is /one/two/; I want to be able to pick up just one.
How can I achieve this in a shell script? Any pointers are greatly appreciated.
Use the -F flag of awk to set the delimiter to /. Because the string starts with /, field $1 is empty, so the first and second segments are $2 and $3.
$ cat /my/file
/one/two/
$ awk -F/ '{print $2}' /my/file
one
$ awk -F/ '{print $3}' /my/file
two
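To see why the segments land in $2 and $3, a small sketch can print every field awk produces for the sample string. The output format (index=[value]) is just an illustrative choice.

```shell
# Print each field with its index; a leading or trailing "/" produces an empty field.
fields=$(printf '%s\n' '/one/two/' |
  awk -F/ '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
echo "$fields"
```

With this input there should be four fields, the first and last of them empty.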
If the string is in a variable, you can pipe it to awk.
#!/bin/bash
var=/one/two/
echo "$var" | awk -F/ '{print $2}'
echo "$var" | awk -F/ '{print $3}'
path="/one/two/"
path=${path#/} # Remove leading /
path=${path%%/*} # Remove everything after first /
echo "$path" # Is now "one"
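The two parameter expansions can be wrapped in a function for reuse. This is only a sketch: the name first_segment and the variable seg are hypothetical, and seg is a global variable here because plain POSIX sh has no local.

```shell
# Hypothetical helper: print the first path segment of the string given as $1.
first_segment() {
  seg=${1#/}      # drop one leading slash, if present
  seg=${seg%%/*}  # drop everything from the next slash onward
  printf '%s\n' "$seg"
}

a=$(first_segment /one/two/)
b=$(first_segment one/two)
echo "$a $b"
```

No external process is started, which can matter when this runs inside a loop.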
Using a bash regular expression:
$ str="/one/two/"
$ re="/([^/]*)/[^/]*/"
$ [[ $str =~ $re ]] && echo "${BASH_REMATCH[1]}"
one
$
Using cut:
$ str="/one/two/"
$ echo "$str" | cut -d/ -f2
one
$
Convert your string to an array, delimited by /, and read the necessary element:
$ str="/one/two/"
$ IFS='/' read -ra a <<< "$str"; echo "${a[1]}"
one
$
And a couple more:
> cut -f 2 -d "/" <<< "/one/two"
one
> awk -F "/" '{print $2}' <<< "/one/two"
one
> oldifs="$IFS"; IFS="/"; var="/one/two/"; set -- $var; echo "$2"; IFS="$oldifs"
one

awk - How to get only the matching portion of a regex

I have code like this
echo abc | awk '$0 ~ "a\(b\)c" {print $0}'
What if I only wanted what's in the parentheses instead of the whole line? This is obviously very simplified, and there is really a lot of awk code so I don't want to switch to sed or grep or something. Thanks
As far as I know you cannot do it in the pattern part; you must do it in the action part with the match() function (the three-argument form that fills an array is a GNU awk extension):
echo abc | awk '{ if ( match($0, /a(b)c/, a) > 0 ) { print a[1] } }'
It yields:
b
With GNU awk:
$ echo abc | awk '{print gensub(/a(b)c/,"\\1",1)}'
b
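Both answers above rely on GNU awk. A portable sketch, assuming the wanted text sits between a known one-character prefix and suffix, uses the two-argument match() together with RSTART and RLENGTH, which POSIX awk provides. The second sample line is made up for illustration.

```shell
# match() sets RSTART (start of the match) and RLENGTH (its length);
# substr() then trims the one-character prefix "a" and suffix "c".
inner=$(printf '%s\n' 'abc' 'xxaHELLOcyy' |
  awk 'match($0, /a[^c]*c/) { print substr($0, RSTART + 1, RLENGTH - 2) }')
echo "$inner"
```

Unlike a capture group, this only works when the lengths of the surrounding delimiters are known in advance.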

Substitute a regex pattern using awk

I am trying to write a regex to replace each run of one or more '+' symbols in a file with a space. I tried the following:
echo This++++this+++is+not++done | awk '{ sub(/\++/, " "); print }'
This this+++is+not++done
Expected:
This this is not done
Any ideas why this did not work?
Use gsub which does global substitution:
echo This++++this+++is+not++done | awk '{gsub(/\++/," ");}1'
The sub function replaces only the first match; to replace all matches, use gsub.
Or the tr command:
echo This++++this+++is+not++done | tr -s '+' ' '
The idiomatic awk solution would be just to translate the input field separator to the output separator (the separator is written as [+]+ so that it is a well-defined regex in any awk):
$ echo This++++this+++is+not++done | awk -F'[+]+' '{$1=$1}1'
This this is not done
Try this
echo "This++++this+++is+not++done" | sed -re 's/(\+)+/ /g'
You could use sed too.
echo This++++this+++is+not++done | sed -e 's/+\{1,\}/ /g'
This matches one or more + and replaces it with a space.
For this case I recommend sed: it is powerful for substitution and has a short syntax.
Solution sed:
echo This++++this+++is+not++done | sed -En 's/\++/ /gp'
Result:
This this is not done
For awk:
You must use the gsub function for global substitution (more than one substitution per line).
The syntax:
gsub(regexp, replacement [, target])
If the third parameter is omitted, $0 is the target.
The target must be a variable, field, or array element. gsub works in place, overwriting the target with the result.
Solution awk:
echo This++++this+++is+not++done | awk 'gsub(/\++/," ")'
Result:
This this is not done
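To illustrate the optional target argument, here is a small sketch that substitutes in the second field only and prints the count that gsub returns. The sample line and the variable n are made up for illustration.

```shell
# gsub() edits $2 in place and returns the number of replacements made;
# the "++" in field 1 is left alone because the target is $2.
out=$(printf '%s\n' 'a++b c+++d' |
  awk '{ n = gsub(/[+]+/, " ", $2); print n ": " $0 }')
echo "$out"
```

Note that assigning to a field makes awk rebuild $0 with the output field separator, so runs of whitespace between fields would be normalized to a single space.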
echo "This++++this+++is+not++done" | sed 's/++*/ /g'
If you have access to node on your computer, you can do it by installing rexreplace
npm install -g rexreplace
and then run
rexreplace '\++' ' ' myfile.txt
Or if you have more files in a dir data you can do
rexreplace '\++' ' ' data/*.txt