awk: fatal: Invalid regular expression when setting multiple field separators - regex

I was trying to solve "Grep regex to select only 10 character" using awk. The question involves a string like XXXXXX[YYYYY--ZZZZZ, and the OP wants to print the text between the single [ and the -- in it.
If it were just one -, I would use [-[] as the field separator (FS). This sets the FS to be either - or [:
$ echo "XXXXXXX[YYYYY-ZZZZ" | awk -F[-[] '{print $2}'
YYYYY
The tricky point is that [ also has a special meaning: it opens a character class. So, for it to be interpreted as one of the possible separators, I avoided writing it in the first position and used [-[] instead, which matches either - or [.
However, in this case there is not one hyphen but two: I want to say either -- or [. I cannot write [--[], because the hyphen also defines a range inside a character class.
What I can do is use -F"one pattern|another pattern", like this:
$ echo "XXXXXXXaaYYYYYbbZZZZ" | awk -F"aa|bb" '{print $2}'
YYYYY
But if I try this with -- and [, I do not get a proper result:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[" '{print $2}'
awk: fatal: Invalid regular expression: /--|[/
In fact, it fails whenever [ is one of the terms, even without the hyphens:
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[" '{print $2}'
awk: fatal: Invalid regular expression: /bb|[/
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|\[" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Invalid regular expression: /bb|[/
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"(bb|\[)" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Unmatched [ or [^: /(bb|[)/
As you can see, I tried escaping the [ and enclosing the expression in parentheses, and neither worked.
So: how can I set the field separator to either -- or [? Is it possible at all?

IMHO this is best explained by first looking at a regexp used by the split() function, since that explicitly shows what happens when a string is split into fields using a literal vs. a dynamic regexp. Then we can relate that to field separators.
This uses a literal regexp (delimited by /s):
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/\[|--/); print f[2]}'
YYYYY
and so requires the [ to be escaped so it is taken literally since [ is a regexp metacharacter.
These use a dynamic regexp (one stored as a string):
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,"\\[|--"); print f[2]}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk 'BEGIN{re="\\[|--"} {split($0,f,re); print f[2]}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re='\\[|--' '{split($0,f,re); print f[2]}'
YYYYY
and so require the [ to be escaped twice, since awk has to convert the string holding the regexp (a variable named re in the last 2 examples) to a regexp (which uses up one backslash) before it's used as the separator in the split() call (which uses up the second backslash).
This:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="\\\[|--" '{split($0,f,re); print f[2]}'
YYYYY
exposes the variable contents to the shell for its evaluation and so requires the [ to be escaped 3 times, since the shell parses the string first to try to expand shell variables etc. (which uses up one backslash) and then awk has to convert the string holding the regexp to a regexp (which uses up a second backslash) before it's used as the separator in the split() call (which uses up the third backslash).
A Field Separator is just a regexp stored as a variable named FS (like re above) with some extra semantics, so all of the above applies to it too, hence:
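To see the shell layer in isolation, here is a small sketch. The variable names single and double are mine, purely for illustration. It prints what awk actually receives in each quoting style, then feeds one of them to split():

```shell
# What awk receives from the shell in each quoting style.
single='\\[|--'     # single quotes: the shell passes the text through untouched
double="\\\[|--"    # double quotes: the shell turns \\ into \ and leaves \[ alone

printf '%s\n' "$single"   # prints \\[|--
printf '%s\n' "$double"   # prints \\[|--

# Both spellings hand awk the same string, so either works as the regexp.
printf '%s\n' 'XXXXXXX[YYYYY--ZZZZ' |
    awk -v re="$single" '{ split($0, f, re); print f[2] }'   # prints YYYYY
```

Since the two strings are identical by the time awk sees them, the only difference between the single-quoted and double-quoted forms is the extra backslash the shell consumes.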
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '\\[|--' '{print $2}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "\\\[|--" '{print $2}'
YYYYY
Note that we could have used a bracket expression instead of escaping it to have the [ treated literally:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/[[]|--/); print f[2]}'
YYYYY
and then we don't have to worry about escaping the escapes as we add layers of parsing:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "[[]|--" '{print $2}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '[[]|--' '{print $2}'
YYYYY

You need to use a double backslash to escape regex metacharacters inside a double-quoted string so that the backslash reaches the regex; if you use a single backslash, it is treated as a string escape sequence instead.
$ echo 'XXXXXXX[YYYYYbbZZZZ' | awk -v FS="bb|\\[" '{print $2}'
YYYYY

This works with GNU Awk 3.1.7:
echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[[]" '{print $2}'
echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[[]" '{print $2}'


shell multiline selection from word to character

.textexpandrc
[yoro] よろしくお願いします。
[ohayo] おはようございます。
元気ですか?
[otsu] お疲れさまでします。
Looking for
$ KEY=ohayo; awk "???" ~/.textexpandrc
おはようございます。
元気ですか?
awk or sed is fine, but I'd like to avoid using a mix of awk/sed/perl/tr/cut etc because I'm under the impression that awk is robust enough to handle this on its own.
The best I could find on my own was
$ KEY=ohayo; awk "/\[${KEY}/,/\[otsu/" ~/.textexpandrc | sed "s/\[${KEY}\] //" | grep -v otsu
おはようございます。
元気ですか?
But then I need to know the next key in advance (not impossible, but ugly). Strangely, if I ask awk to search up to the next square bracket, it fails to select more than one line:
$ KEY=ohayo; awk "/\[${KEY}/,/\[/" ~/.textexpandrc
[ohayo] おはようございます。
I am currently using a single-line parser solution, as follows:
#!/usr/bin/env bash
CONFIG=${HOME}/.textexpandrc
ALL_KEYS=$(sed 's/\].*/]/' ${CONFIG} | tr -d '[]')
KEY=$(echo $ALL_KEYS | rofi -sep ' ' -dmenu -p "autocomplete")
grep "\[${KEY}\]" $CONFIG | sed "s/\[${KEY}\] //" | xsel -ib # ← HERE
xdotool key ctrl+shift+v
If you set up the RS and FS variables to match [ and ], this works quite well:
awk 'BEGIN{ RS="\["; FS="\] " }; $1 ~ key { print $2 }' key=ohayo tmp.txt
You pass in the key you're searching for using key=... on the command line instead of interpolating a shell variable into the program. This makes it much easier to write the awk script within single quotes.
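As for why /\[ohayo/,/\[/ stops after one line: the end pattern of a range is also tested against the line that started it, and that line contains a [, so the range closes immediately. If a regexp RS is not available in your awk, a line-by-line version is a possible alternative. This is only a sketch: expand_key is a made-up name, and it assumes every entry starts with "[key] " at the beginning of a line, as in the sample file.

```shell
# Hypothetical helper: print the entry stored under [key] in a textexpand-style file.
# Usage: expand_key KEY FILE
expand_key() {
    awk -v key="$1" '
        /^\[/ { inside = 0 }                  # any "[...]" header closes the previous entry
        index($0, "[" key "] ") == 1 {        # literal prefix test, so key is never a regexp
            inside = 1
            $0 = substr($0, length(key) + 4)  # drop the leading "[key] "
        }
        inside                                # print lines that belong to the entry
    ' "$2"
}
```

With the sample file from the question, expand_key ohayo ~/.textexpandrc prints the two lines of the ohayo entry without needing to know which key comes next.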

Bash: regular expressions within backticks

I have a file called "align_summary.txt" which looks like this:
Left reads:
Input : 26410324
Mapped : 21366875 (80.9% of input)
of these: 451504 ( 2.1%) have multiple alignments (4372 have >20)
...more text....
... and several more lines of text....
I want to pull out the % of multiple alignments among all left aligned reads (in this case it's 2.1) in bash shell.
If I use this:
pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p
It promptly gives me the output: 2.1
However, if I enclose the same expression in backticks like this:
leftmultiple=`pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p`
I receive an error:
awk: syntax error in regular expression ( at
input record number 1, file
source line number 1
As I understand it, enclosing this expression in backticks affects the interpretation of the regular expression that includes "(" symbol, despite the fact that it is escaped by backslashes.
Why does this happen, and how can I avoid this error?
I would be grateful for any input and suggestions.
Many thanks,
Just use awk:
leftmultiple=$(awk '/these:.*multiple/{sub(" ","",$2);print $2}' FS='[(%]' align_summary.txt )
Always use $(...) instead of backticks but more importantly, just use awk alone:
$ leftmultiple=$( gawk -v RS='^$' 'match($0,/Left reads.\s*\n\s+.+\n\s+Mapped.+.\n.\s+of these[^(]+[(]\s*([^)%]+)/,a) { print a[1] }' align_summary.txt )
$ echo "$leftmultiple"
2.1
The above uses GNU awk 4.* and assumes you do need the complicated regexp that you were using to avoid false matches elsewhere in your input file. If that's not the case then the script can of course get much simpler.
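On the "why": inside backticks, the shell strips one level of backslashes before it even looks at the double quotes, while $(...) parses its contents like an ordinary command line. A minimal sketch (the variable names old and new are just for illustration) that captures the same -F argument both ways:

```shell
# The same double-quoted text, captured once with backticks and once with $(...).
old=`printf '%s' "\\\( "`     # backticks eat one backslash before the quotes are parsed
new=$(printf '%s' "\\\( ")    # no extra backslash processing here

printf 'backticks: [%s]\n' "$old"   # prints backticks: [\( ]
printf 'dollar:    [%s]\n' "$new"   # prints dollar:    [\\( ]
```

awk then removes one more backslash when it converts the string to a regexp, so the backtick version ends up with a bare ( and the regexp is invalid, which matches the error in the question.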

How to remove/strip double or single quote from a string?

I have a file with some lines like these:
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
I want to extract the parts after the = but without the surrounding quotes. I tried with gsub like this:
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"|'/, "", $2); print $2}'
This ends up with a -bash: syntax error near unexpected token ')' error. It works just fine when matching a single character, /"/ or /'/, but not when I try to match either one. What am I doing wrong?
If you are just trying to remove the punctuation, then you can do it as below:
# remove all punctuation
awk -F= '{print $2}' n.dat | tr -d '[:punct:]'
# only remove single and double quotes
awk -F= '{print $2}' n.dat | tr -d "'\""
Explanation:
tr -d "'\"" deletes any single and double quotes.
tr -d '[:punct:]' deletes every character in the punctuation class (including the dot in mykey.pem).
Sample output from the 2nd command above (without quotes):
myenv
mydomain.net
mykey.pem
The problem is not with awk, but with bash. The single quote inside the gsub closes the opening quote, so bash ends the awk program right after gsub(/"| and then trips over an unmatched close paren in what follows. Try replacing the single quote with '"'"' so that bash terminates the string, adds a literal single quote, and then reopens the string.
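For instance, a minimal sketch of that quoting trick, wrapped in a helper whose name (strip_quotes) is made up for illustration. It assumes the value itself contains no = sign, since everything after a second = would land in $3:

```shell
# Hypothetical helper: print the value after "=" with any quote characters removed.
# The sequence '"'"' closes the single-quoted program, adds a literal single quote,
# then reopens the program, so awk sees the bracket expression ["'].
strip_quotes() {
    awk -F= '{ gsub(/["'"'"']/, "", $2); print $2 }'
}

printf '%s\n' 'ENVIRONMENT="myenv"' "ENV_DOMAIN='mydomain.net'" 'LOGIN_KEY=mykey.pem' |
    strip_quotes
# prints:
# myenv
# mydomain.net
# mykey.pem
```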
Is awk really a requirement? If not, why don't you use a simple sed command:
sed -rn -e "s/^[^#]+='(.*)'$/\1/p" \
-e "s/^[^#]+=\"(.*)\"$/\1/p" \
-e "s/^[^#]+=(.*)/\1/p" data
This might seem over-engineered, but it works properly with embedded quotes:
sh$ cat data
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
PASSWD="good ol'passwd"
sh$ sed -rn -e "s/^[^#]+='(.*)'/\1/p" -e "s/^[^#]+=\"(.*)\"/\1/p" -e "s/^[^#]+=(.*)/\1/p" data
myenv
mydomain.net
mykey.pem
good ol'passwd
You can use awk like this:
awk -F "=['\"]?|['\"]" '{print $2}' file
myenv
mydomain.net
mykey.pem
This will work with your awk:
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"/,"",$2);gsub(q,"",$2); print $2}' q=\' file
It is the single quote in the expression that creates the problem. Put it in a variable and it will work.
I did the following:
awk -F"=\"|='|'|\"|=" '{print $2}' file
myenv
mydomain.net
mykey.pem
This tells awk to use either =", =', ' or " as field separator.
This is because the awk program must be enclosed in single quotes when run as a command line program. The program can be tripped up if a single quote is contained inside the script. Special tricks can be made to use single quotes as strings inside the program. See Shell-Quoting Issues in the GNU Awk Manual.
One trick is to save the match string as a variable:
awk -F\= -v s="'|\"" '{gsub(s, "", $2); print $2}' file
Output:
myenv
mydomain.net
mykey.pem

Search regex on a specific field using awk

In awk I can search a field for a value like:
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2=="eaae" {print $0};'
dd,eaae,ff
And I can search by regular expressions like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; /[a]{2}/ {print $0};'
aa,bb,cc
dd,eaae,ff
Can I force awk to apply the regexp search to a specific field? I'm looking for something like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2==/[a]{2}/ {print $0};'
expecting result:
dd,eaae,ff
Anyone know how to do it using awk?
Accepted response - Operator "~" (thanks to hek2mgl):
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2 ~ /[a]{2}/ {print $0};'
You can use:
$2 ~ /REGEX/ {ACTION}
if the regex should apply to the second field (for example) only.
In your case this would lead to:
awk -F, '$2 ~ /[a]{2}/' <<< $'aa,bb,cc\ndd,eaae,ff'
You may wonder why I've just used the regex in the awk program and no print. This is because your action is print $0 - printing the current line - which is the default action in awk.
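A few variations on the same input, as a quick sketch. I use /aa/ instead of /[a]{2}/ here only because some older awk builds do not enable interval expressions by default:

```shell
# Match anywhere inside the second field.
printf 'aa,bb,cc\ndd,eaae,ff\n' | awk -F, '$2 ~ /aa/'      # prints dd,eaae,ff

# Negated match: lines whose second field does NOT contain "aa".
printf 'aa,bb,cc\ndd,eaae,ff\n' | awk -F, '$2 !~ /aa/'     # prints aa,bb,cc

# Anchored match: the second field must be exactly "aa", so nothing is printed.
printf 'aa,bb,cc\ndd,eaae,ff\n' | awk -F, '$2 ~ /^aa$/'
```

The anchors ^ and $ apply to the field being tested, not to the whole line, which is why the last command prints nothing for this input.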

Get An Specified Match Under a String

I'm trying to match the contents of a string that contains sequences of quotes using a shell script. So far, the furthest I got was this:
et="\"He\" \"llo\""
echo $et | sed -e '/\"(.*?)\"/g'
Which returns this:
"He" "llo"
But I don't want the quote marks to appear on the result, also how can I echo only the first, or the second, or the third, etc. match?
sed -e 's/"\([^"]*\)"/\1/g' will remove quotes around balanced " quotes. To only show the first, second match etc with sed you probably have to make different capture groups.
$ echo '"1" "2" "3"' | sed -e 's/"\([^"]*\)" "\([^"]*\)" "\([^"]*\)"/\2/g'
2
$
Provided that what is wanted is only the text between the first pair of quotes, here is a solution with perl:
echo $et | perl -ne '/"[^"]+"/ and print "$&\n";'
This will also handle quotes within quotes if they are preceded by a backslash:
echo $et | perl -ne '/"[^"\\]+(\\.[^"]*)*"/ and print "$&\n";'
This is much simpler with awk since you can specify the double-quote to be the field separator.
$ et='"He" "llo"'
$ awk -F'"' '{print $2}' <<<$et
He
$ awk -F'"' '{print $4}' <<<$et
llo
Note: This also scales to more strings, since the quoted strings always land in the even-numbered fields, i.e. $2, $4, $6, etc.
You can also do something like this (the three-argument form of match() is a GNU awk extension):
$ echo "\"He\" \"llo\"" | awk '{ match($0, /([A-Za-z]+)[" ]+([A-Za-z]+)/, a); print a[1] "," a[2] }'
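Building on that, here is a small sketch that either lists every quoted string or picks the nth one. nth_quoted is a hypothetical helper name, and it assumes the input has no escaped quotes inside the strings:

```shell
# Print every quoted string, one per line, by walking the even-numbered fields.
printf '%s\n' '"He" "llo"' |
    awk -F'"' '{ for (i = 2; i <= NF; i += 2) print $i }'
# prints:
# He
# llo

# Hypothetical helper: print only the nth quoted string (n = 1, 2, 3, ...).
nth_quoted() {
    awk -F'"' -v n="$1" '{ print $(2 * n) }'
}

printf '%s\n' '"He" "llo"' | nth_quoted 2   # prints llo
```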
He,llo