I have a file called "align_summary.txt" which looks like this:
Left reads:
Input : 26410324
Mapped : 21366875 (80.9% of input)
of these: 451504 ( 2.1%) have multiple alignments (4372 have >20)
...more text....
... and several more lines of text....
I want to pull out the % of multiple alignments among all left aligned reads (in this case it's 2.1) in bash shell.
If I use this:
pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p
It promptly gives me the output: 2.1
However, if I enclose the same expression in backticks like this:
leftmultiple=`pcregrep -M "Left reads.\n..+.\n.\s+Mapped.+.\n.\s+of these" align_summary.txt | awk -F"\\\( " '{print $2}' | awk -F"%" '{print $1}' | sed -n 4p`
I receive an error:
awk: syntax error in regular expression ( at
input record number 1, file
source line number 1
As I understand it, enclosing the expression in backticks changes how the regular expression containing the "(" symbol is interpreted, even though it is escaped with backslashes.
Why does this happen, and how can I avoid the error?
I would be grateful for any input and suggestions.
Many thanks,
Just use awk:
leftmultiple=$(awk '/these:.*multiple/{sub(" ","",$2);print $2}' FS='[(%]' align_summary.txt )
Always use $(...) instead of backticks, but more importantly, just use awk alone:
$ leftmultiple=$( gawk -v RS='^$' 'match($0,/Left reads.\s*\n\s+.+\n\s+Mapped.+.\n.\s+of these[^(]+[(]\s*([^)%]+)/,a) { print a[1] }' align_summary.txt )
$ echo "$leftmultiple"
2.1
The above uses GNU awk 4.* and assumes you do need the complicated regexp that you were using to avoid false matches elsewhere in your input file. If that's not the case then the script can of course get much simpler.
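As for why the backticks themselves break the original pipeline: inside `...` the shell does an extra round of backslash processing that $(...) does not, so one layer of the escaping in -F"\\\( " is consumed before awk ever runs; after awk's own string processing the field-separator regexp is left with a bare (, which is the unmatched-parenthesis error you see. A quick way to see the difference (a sketch using printf and a throwaway variable x, not from the original post):
$ printf '%s\n' "\\\("                     # what awk receives from a plain command or $(...)
\\(
$ x=`printf '%s\n' "\\\("`; printf '%s\n' "$x"    # backticks eat one backslash layer
\(
$ x=$(printf '%s\n' "\\\("); printf '%s\n' "$x"   # $(...) does not
\\(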
Related
.textexpandrc
[yoro] よろしくお願いします。
[ohayo] おはようございます。
元気ですか?
[otsu] お疲れさまでします。
I'm looking for something like:
$ KEY=ohayo; awk "???" ~/.textexpandrc
おはようございます。
元気ですか?
awk or sed is fine, but I'd like to avoid using a mix of awk/sed/perl/tr/cut etc because I'm under the impression that awk is robust enough to handle this on its own.
The best I could find on my own was
$ KEY=ohayo; awk "/\[${KEY}/,/\[otsu/" ~/.textexpandrc | sed "s/\[${KEY}\] //" | grep -v otsu
おはようございます。
元気ですか?
But I need to know the next key in advance (not impossible, but ugly). Strangely, if I ask awk to search up to the next opening square bracket, it fails to select the multiline block:
$ KEY=ohayo; awk "/\[${KEY}/,/\[/" ~/.textexpandrc
[ohayo] おはようございます。
I'm currently using a single-line parser solution, as follows:
#!/usr/bin/env bash
CONFIG=${HOME}/.textexpandrc
ALL_KEYS=$(sed 's/\].*/]/' ${CONFIG} | tr -d '[]')
KEY=$(echo $ALL_KEYS | rofi -sep ' ' -dmenu -p "autocomplete")
grep "\[${KEY}\]" $CONFIG | sed "s/\[${KEY}\] //" | xsel -ib # ← HERE
xdotool key ctrl+shift+v
If you set up the RS and FS variables to match [ and ], this works quite well:
awk 'BEGIN{ RS="\["; FS="\] " }; $1 ~ key { print $2 }' key=ohayo tmp.txt
You pass in the parameter you're searching for using key=.... on the command line instead of setting a variable. This makes it much easier to write the awk script within single quotes.
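For the record, here is the same idea run against the file from the question, with one tweak: trimming the trailing newline from the captured field so the output is exactly the two lines asked for (a sketch along the same lines, assuming the sample ~/.textexpandrc shown above):
$ KEY=ohayo; awk 'BEGIN{ RS="["; FS="] " } $1 ~ key { sub(/\n$/, "", $2); print $2 }' key="$KEY" ~/.textexpandrc
おはようございます。
元気ですか?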
I have a list of file names (name plus extension) and I want to extract the name only without the extension.
I'm using
ls -l | awk '{print $9}'
to list the file names and then
ls -l | awk '{print $9}' | awk /(.+?)(\.[^.]*$|$)/'{print $1}'
But I get an error complaining about the (:
-bash: syntax error near unexpected token `('
The regex (.+?)(\.[^.]*$|$) to isolate the name has a capture group and I think it is correct; what I don't get is why it is not working within awk.
My files are named like ABCDEF.ext and sit in the root folder.
Your specific error is caused by the fact that your awk command is incorrectly quoted. The single quotes should go around the whole command, not just the { action } block.
However, you cannot use capture groups like that in awk. $1 refers to the first field, as defined by the input field separator (which in this case is the default: one or more "blank" characters). It has nothing to do with the parentheses in your regex.
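For what it's worth, GNU awk's match() does accept an optional third array argument that captures groups, much like the gawk/match() answer near the top of this page; a gawk-only sketch with made-up file names:
$ printf '%s\n' ABCDEF.ext archive.tar.gz | gawk 'match($0, /^(.+)\.[^.]*$/, m) { print m[1] }'
ABCDEF
archive.tar
But that is a gawk extension; plain POSIX awk has no way to refer to capture groups.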
Furthermore, you shouldn't start from ls -l to process your files. I think that in this case your best bet would be to use a shell loop:
for file in *; do
printf '%s\n' "${file%.*}"
done
This uses the shell's built-in capability to expand * to the list of everything in the current directory and removes the .* from the end of each name using a standard parameter expansion.
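For example, with the sample name from the question (file here is just a throwaway variable):
$ file=ABCDEF.ext; printf '%s\n' "${file%.*}"
ABCDEF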
If you really really want to use awk for some reason, and all your files have the same extension .ext, then I guess you could do something like this:
printf '%s\0' * | awk -v RS='\0' '{ sub(/\.ext$/, "") } 1'
This prints all the paths in the current directory, and uses awk to remove the suffix. Each path is followed by a null byte \0 - this is the safe way to pass lists of paths, which in principle could contain any other character.
Slightly less robust but probably fine in most cases would be to trust that no filenames contain a newline, and use \n to separate the list:
printf '%s\n' * | awk '{ sub(/\.ext$/, "") } 1'
Note that the standard tool for simple substitutions like this one would be sed:
printf '%s\n' * | sed 's/\.ext$//'
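For instance, with a couple of made-up names, only the .ext suffix is removed:
$ printf '%s\n' ABCDEF.ext README.md | sed 's/\.ext$//'
ABCDEF
README.md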
(.+?) is a PCRE construct. awk uses EREs, not PCREs. Also you have the opening script delimiter ' in the middle of the script AFTER the condition instead of where it belongs, before the start of the script.
The syntax for any command (awk, sed, grep, whatever) is command 'script', so this should be awk 'condition{action}', not awk condition'{action}'.
But, in any case, as mentioned by @Aaron in the comments, don't parse the output of ls; see http://mywiki.wooledge.org/ParsingLs
Try this.
ls -l | awk '{ s=""; for (i=9;i<=NF;i++) { s = s" "$i }; sub(/\.[^.]+$/,"",s); print s}'
Notes:
Reading the ls -l output is weird.
It doesn't check what the items are (files? directories? ...); it strips extensions everywhere.
Read the other answers :D
If the extension always follows the same pattern, try a sed replacement:
ls -l | awk '{print $9}' | sed 's/\.ext$//'
I was trying to solve Grep regex to select only 10 character using awk. That question involves a string like XXXXXX[YYYYY--ZZZZZ, and the OP wants to print the text between the unique [ and -- markers within it.
If it were just one -, I would say use [-[] as the field separator (FS). This sets FS to either - or [:
$ echo "XXXXXXX[YYYYY-ZZZZ" | awk -F[-[] '{print $2}'
YYYYY
The tricky point is that [ also has a special meaning as the start of a bracket expression, so it has to be written carefully inside the brackets for it to be taken as one of the possible separators. Writing [-[] does the job, so we match either - or [.
However, in this case it is not one but two hyphens: I want to say either -- or [. I cannot say [--[] because inside a bracket expression the hyphen is also used to define a range.
What I can do is use -F"one pattern|another pattern", like:
$ echo "XXXXXXXaaYYYYYbbZZZZ" | awk -F"aa|bb" '{print $2}'
YYYYY
So if I try to use this with -- and [, I cannot get a proper result:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[" '{print $2}'
awk: fatal: Invalid regular expression: /--|[/
And in fact the two hyphens are not the problem; it fails even with a plain string as the other term:
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[" '{print $2}'
awk: fatal: Invalid regular expression: /bb|[/
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|\[" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Invalid regular expression: /bb|[/
$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"(bb|\[)" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Unmatched [ or [^: /(bb|[)/
You can see I tried escaping the [ and enclosing the alternation in parentheses, and nothing worked.
So: what can I do to set the field separator to either -- or [? Is it possible at all?
IMHO this is best explained by starting with a regexp used in the split() function, since that shows explicitly what happens when a string is split into fields using a literal vs. a dynamic regexp; then we can relate that to field separators.
This uses a literal regexp (delimited by /s):
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/\[|--/); print f[2]}'
YYYYY
and so requires the [ to be escaped so it is taken literally since [ is a regexp metacharacter.
These use a dynamic regexp (one stored as a string):
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,"\\[|--"); print f[2]}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk 'BEGIN{re="\\[|--"} {split($0,f,re); print f[2]}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re='\\[|--' '{split($0,f,re); print f[2]}'
YYYYY
and so require the [ to be escaped 2 times since awk has to convert the string holding the regexp (a variable named re in the last 2 examples) to a regexp (which uses up one backslash) before it's used as the separator in the split() call (which uses up the second backslash).
This:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="\\\[|--" '{split($0,f,re); print f[2]}'
YYYYY
exposes the variable contents to the shell for its evaluation and so requires the [ to be escaped 3 times, since the shell parses the string first to try to expand shell variables etc. (which uses up one backslash), then awk has to convert the string holding the regexp to a regexp (which uses up a second backslash) before it's used as the separator in the split() call (which uses up the third backslash).
A field separator is just a regexp stored in a variable named FS (like re above) with some extra semantics, so all of the above applies to it too, hence:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '\\[|--' '{print $2}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "\\\[|--" '{print $2}'
YYYYY
Note that we could have used a bracket expression instead of escaping it to have the [ treated literally:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/[[]|--/); print f[2]}'
YYYYY
and then we don't have to worry about escaping the escapes as we add layers of parsing:
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "[[]|--" '{print $2}'
YYYYY
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '[[]|--' '{print $2}'
YYYYY
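The same holds when the separator comes from a shell variable: since the bracket expression contains no backslashes, there is nothing extra to escape as another parsing layer is added (a small sketch, not part of the original examples):
$ sep='[[]|--'
$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "$sep" '{print $2}'
YYYYY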
You need to use a double backslash to escape a regex metacharacter inside the string that awk turns into a dynamic regexp; with a single backslash it is consumed as a string escape sequence instead. Use single shell quotes so that both backslashes actually reach awk:
$ echo 'XXXXXXX[YYYYYbbZZZZ' | awk -v FS='bb|\\[' '{print $2}'
YYYYY
This works with GNU Awk 3.1.7:
echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[[]" '{print $2}'
echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[[]" '{print $2}'
I have a file with some lines like these:
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
I want to extract the parts after the = but without the surrounding quotes. I tried with gsub like this:
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"|'/, "", $2); print $2}'
Which ends up with a -bash: syntax error near unexpected token ')' error. It works just fine for a single match, /"/ or /'/, but doesn't work when I try to match either one. What am I doing wrong?
If you are just trying to remove the punctuation then you can do it as below....
# remove all punctuation
awk -F= '{print $2}' n.dat | tr -d [[:punct:]]
# only remove single and double quotes
awk -F= '{print $2}' n.dat | tr -d \''"\'
explanation:
tr -d \''"\' deletes any single and double quotes.
tr -d [[:punct:]] deletes every character in the punctuation class.
Sample output from the 2nd command above (quotes removed):
myenv
mydomain.net
mykey.pem
The problem is not with awk, but with bash. The single quote inside the gsub is closing the open quote so that bash is trying to parse the command awk with arguments !/^...gsub(/"|/,, ,, $2 and then an unmatched close paren. Try replacing the single quote with '"'"' (so that bash will properly terminate the string, then apply a single quote, then reopen another string.)
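Applied to the command from the question, that splice looks like this (a sketch, assuming the sample lines are saved in a file named file):
$ awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"|'"'"'/, "", $2); print $2}' file
myenv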
Is awk really a requirement? If not, why don't you use a simple sed command:
sed -rn -e "s/^[^#]+='(.*)'$/\1/p" \
-e "s/^[^#]+=\"(.*)\"$/\1/p" \
-e "s/^[^#]+=(.*)/\1/p" data
This might seem over-engineered, but it works properly with embedded quotes:
sh$ cat data
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
PASSWD="good ol'passwd"
sh$ sed -rn -e "s/^[^#]+='(.*)'/\1/p" -e "s/^[^#]+=\"(.*)\"/\1/p" -e "s/^[^#]+=(.*)/\1/p" data
myenv
mydomain.net
mykey.pem
good ol'passwd
You can use awk like this:
awk -F "=['\"]?|['\"]" '{print $2}' file
myenv
mydomain.net
mykey.pem
This will work with your awk
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"/,"",$2);gsub(q,"",$2); print $2}' q=\' file
It is the single quote in the expression that creates problems. Pass it in as a variable and it will work.
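With the sample lines saved as file, this prints just the one value the /^ENVIRONMENT=/ condition targets:
$ awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"/,"",$2);gsub(q,"",$2); print $2}' q=\' file
myenv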
I did the following:
awk -F"=\"|='|'|\"|=" '{print $2}' file
myenv
mydomain.net
mykey.pem
This tells awk to use =", =', ', " or a bare = as the field separator.
This is because the awk program must be enclosed in single quotes when run as a command line program. The program can be tripped up if a single quote is contained inside the script. Special tricks can be made to use single quotes as strings inside the program. See Shell-Quoting Issues in the GNU Awk Manual.
One trick is to save the match string as a variable:
awk -F\= -v s="'|\"" '{gsub(s, "", $2); print $2}' file
Output:
myenv
mydomain.net
mykey.pem
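Another way to dodge the quoting problem entirely is to write the single quote as the octal escape \047 inside the regexp, so it never appears literally in the program (a GNU-awk-friendly sketch against the same file):
$ awk -F= '{gsub(/\047|"/, "", $2); print $2}' file
myenv
mydomain.net
mykey.pem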
I tried the following awk command on my Linux box and it throws an error.
awk -F "VALUES(" '{print $2}'
Error :
awk: fatal: Unmatched ( or \(: /VALUES(/
I also tried with a backslash; it's also not working.
awk -F "VALUES\(" '{print $2}'
Error :
awk: warning: escape sequence `\(' treated as plain `('
awk: fatal: Unmatched ( or \(: /VALUES(/
Please let me know how to include ( in the awk field-separator string.
If the value of -F is longer than one character, it is treated as a regex, so you need to do something like the following.
regex character class:
kent$ echo "a foo( b"|awk -F"foo[(]" '{print $1,$2}'
a b
escape the (
if you really want to escape the (, you need:
kent$ echo "a foo( b"|awk -F"foo\\\\(" '{print $1,$2}'
a b
or
kent$ echo "a foo( b"|awk -F'foo\\(' '{print $1,$2}'
a b
You're using Linux, so you're likely using gawk. From the gawk man page on my system:
-F fs
--field-separator fs
    Use fs for the input field separator (the value of the FS predefined variable).
and:
Fields
    As each input record is read, gawk splits the record into fields, using the value of the FS variable as the field separator. If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression.
Since you've got multiple characters in your -F, it's being interpreted as a regular expression, and you have an opening parenthesis without a closing one.
You may be able to solve this either by escaping the parenthesis (e.g. -F 'VALUES\\(', doubling the backslash so that one survives awk's own string processing), or by putting it in a bracket expression ("VALUES[(]"). I recommend the latter, because escaping things with backslashes can be ugly and unpredictable, and the bracket expression helps point out that this is a regex, not a string, when you re-read this script a few months from now.
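For instance, against a made-up INSERT statement (hypothetical input, not from the question), both forms behave the same:
$ echo "INSERT INTO t VALUES(1,2,3);" | awk -F 'VALUES[(]' '{print $2}'
1,2,3);
$ echo "INSERT INTO t VALUES(1,2,3);" | awk -F 'VALUES\\(' '{print $2}'
1,2,3);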