Complex regex - works in Powershell, not in Bash - regex

The below code is a small portion of my code for Solarwinds to parse the output of a Netbackup command. This is fine for our Windows boxes but some of our boxes are RHEL.
I'm trying to convert the below code into something useable on RHEL 4.X but I'm running into a wall with parsing the regex. Obviously the below code has some of the characters escaped for use with Powershell, I have unescaped those characters for use with Shell.
I'm not great with Shell yet, but I will post a portion of my Shell code below the Powershell code.
$output = ./bpdbjobs
$Results = #()
$ColumnName = #()
foreach ($match in $OUTPUT) {
$matches = $null
$match -match "(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+`-\w+`-\w+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+`_\w+)|(\w+`_\w+`_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?\s+(?<FATPipe>\b[^\d\W]+\b)?"
$Results+=$matches
}
The below is a small portion of Shell code I've written (which is clearly very wrong, learning as I go here). I'm just using this to test the Regex and see if it functions in Shell - (Spoiler alert) it does not.
#!/bin/bash
#
backups=bpdbjobs
results=()
for results in $backups; do
[[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]
done
$results
Below are the errors I get.
./netbackupsolarwinds.sh: line 9: syntax error in conditional expression: unexpected token `('
./netbackupsolarwinds.sh: line 9: syntax error near `/(?'
./netbackupsolarwinds.sh: line 9: ` [[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]'

From man bash:
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).
Meaning that the expression is parsed as a POSIX extended regular expression, which AFAIK does not support either named capturing groups ((?<name>...)) or character escapes (\d, \w, \s, ...).
If you want to use [[ $var =~ expr ]] you need to rewrite the regular expression. Otherwise use grep (which supports PCRE):
grep -P '(?<jobID>\d+)?\s+...' <<<$results

Updated answer, after comments exchange.
The best way to perform your migration quickly is to use the --perl-regexp Perl compatibility option of Grep, like eventually suggested in another answer.
If you still want to perform this operation with pure Bash, you need to rewrite the regular expression accordingly, following the documentation.

Thanks all for the answers. I swapped to Grep -P to no avail, turns out the named capture groups were the problem for Grep -P.
I was also unable to figure out a way to use Grep to output the capture group matches to individual variables.
This lead me to swap over to using perl, as follows, with alterations to my regex.
bpdbjobs | perl -lne 'print "$1" if /(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g'
With $<num> referring to the capture group number. I can now list, display and (the important part) count the number of matches within an individual group, corresponding to the data found in each column.

Related

Bash Regex to extract everything between the last occurrence of a string (release-) and some characters (--)

I have multiple strings, where I want to extract everything between the last occurrence of a string (release-) and some characters (--). More specifically, for a sting like the following:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE
I want to have the following output:
PI_4.1-Sprint-3.1a
I created a regex online, which you can find here. There regex is the following:
.*release-(.*)--.*
However, when I am trying to use this script into a bash script, it wont work. Here is an example.
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
[[ "$artifactoryVersion" =~ (.*release-(.*)--.*) ]]
echo $BASH_REMATCH[0]
echo $BASH_REMATCH[1]
Will return:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[0]
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[1]
Do you have any ideas about how can I accomplish my goal in bash?
You may use:
s='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
rx='.*-release-(.*)--'
[[ $s =~ $rx ]] && echo "${BASH_REMATCH[1]}"
PI_4.1-Sprint-3.1a
Code Demo
Your regex appears correct but make sure to use "${BASH_REMATCH[1]}" to extract first capture group in the result.
You need to use the following:
#!/bin/bash
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
if [[ "$artifactoryVersion" =~ .*release-(.*)-- ]]; then
echo ${BASH_REMATCH[1]};
fi
See the online demo
Output:
PI_4.1-Sprint-3.1a
With your shown samples please try following BASH code with regex. I have also mentioned comments before executing each statement to understand each statement here.
##Shell variable named var being created here.
var="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
##Mentioning regex which needs to be checked on later in program.
regex="(.*release-release)-(.*)--"
##Check condition on var variable with regex if match found then print 2nd capturing group value.
[[ $var =~ $regex ]] && echo "${BASH_REMATCH[2]}"
Explanation of regex: Following is the detailed explanation for used regex.
regex="(.*release-release)-(.*)--": Creating shell variable named regex in which putting regular expression (.*release-release)-(.*)--.
Where regex is creating 2 capturing groups.
First matching everything till release-release(with greedy match), which is followed by a -(not captured anywhere).
Which is followed by a greedy match, which will basically match everything before -- to get the exactly needed value.
You can also do it with shell parameter expansions (it's slower than a bash regex but it's standard):
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
result=${artifactoryVersion##*-release-}
result=${result%%--*}
printf %s\\n "$result"
PI_4.1-Sprint-3.1a
Or directly with a bash parameter expansion and extended globing:
#!/bin/bash
shopt -s extglob
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
echo "${artifactoryVersion//#(*-release-|--*)}"
PI_4.1-Sprint-3.1a

How to replace consecutive and identical characters in Perl?

I have a string like
XXXXYYYYZZZYYZZZYYYY which needs to be converted to
XXXXAAAYZZZAYZZZAAAY
$s =~ s/Y{2}+/AY/g;
this has 2 problems, {2}+ will get YYYY to AYAY; and AY is not the same length as YYYY (expecting AAAY)
How to get this done in perl?
Use a "look-ahead":
$s =~ s/Y(?=Y+)/A/g;
(?=Y+) means "followed by one or more Y characters", so any Y character that is followed by another Y character will be replaced with an A.
More info from perlretut
There's always more than one way to do it. My suggestion is to grab all the Ys except the last one, and then use that to create a string of As of the same length. The e modifier tells perl to execute the code in the replacement side instead of using it directly, and the r modifier tells =~ to return the result of the substitution instead of modifying the input text directly (useful for these one-liner tests, among other places).
$ perl -E 'say shift =~ s/(Y+)(?=Y)/"A"x length$1/gre' XXXXYYYYZZZYYZZZYYYY
XXXXAAAYZZZAYZZZAAAY
$s =~ s/Y{2}+/AY/g
RHS Pattern is ambiguously obscure pattern: Y{2}+, that's very rarely used regex pattern except if {}+ very rarely is available in few advanced regex engine, including perl maybe, as a regex feature called 'atomic grouping'.
You might have meant (Y{2})+ which is (YY)+ or Y{2,} which is YY+
in perl it's no brainer simple and easy as it supports lookaround feature
perl -e '$s=XXXXYYYYZZZYYZZZYYYY ;$s =~ s/Y(?=Y)/A/g;print $s'
actually lower regex engine such sed still can do it albeit in cumbersome, uneasy way
echo XXXXYYYYZZZYYZZZYYYY |sed -E 's/YY+/&\n/g;s/Y/A/g;s/A\n/Y/g'

is there any named regular expression capture for grep?

i'd like to know if its possible to get named regular expression with grep -P(linux bash) from a non formatted string? well.. from any string
For example:
John Smith www.website.com john#website.com jan-01-2001
to capture as
$name
$website
$email
$date
but it seems I cant pass any variables from output?
echo "www.website.com" | grep -Po '^(www\.)?(?<domain>.+)$' | echo $domain
has no output
no. grep is a process. you are talking about environment propagation from child to parent. that's forbidden.
instead, you can do
DATA=($your_line)
then take name=DATA[0] so and forth.
or another way using awk:
eval "`echo $your_line | awk '
function escape(s)
{
gsub(/'\''/,"'\''\"'\''\"'\''", s);
s = "'\''"s"'\''";
return s;
}
{
print "name="escape($1);
print "family_name="escape($2);
print "website="escape($3);
print "email="escape($4);
print "date="escape($5);
}'`"
the sense here is to propagate the info via stdout and eval it in the parent environment.
notice that, here, escape function will escape any string correctly such that nothing will be interpreted wrongly(like the evil of quotes).
following is the output from my jessie:
name='John'
family_name='Smith'
website='www.website.com'
email='john#website.com'
date='jan-01-2001'
if the family name is O'Reilly, the eval result will still be correct:
name='John'
family_name='O'"'"'Reilly'
website='www.website.com'
email='john#website.com'
date='jan-01-2001'
Grep is an independent command-line utility; it does not run inside of bash. So it couldn't create bash variables even if it wanted to.
However, bash has a regular expression matcher built-in. It's not a perl-compatible regex matcher, so it doesn't implement named captures. (To be precise, it matches Posix extended regular expressions, the same as grep -E.) But it does implement numbered captures.
You do regular expression matches with the =~ operator inside of the [[ ... ]] compound command syntax. If the regular expression matches, then the expression succeeds, and the captures are inserted into the array variable BASH_REMATCH. ${BASH_REMATCH[0]} will be the entire matched substring, and the remaining elements, starting with ${BASH_REMATCH[1]}, will be the individual captures in order.
For example:
$ url=www.example.com
$ [[ $url =~ ^(www\.)?(.*) ]]
$ echo "${BASH_REMATCH[1]}"
www.
$ echo "${BASH_REMATCH[2]}"
example.com

Bash double bracket regex comparison using negative lookahead error return 2

On Bash 4.1 machine,
I'm trying to use "double bracket" [[ expression ]] to do REGEX comparison using "NEGATIVE LOOKAHEAD".
I did "set +H" to disable BASH variable'!' expansion to command history search.
I want to match to "any string" except "arm-trusted-firmware".
set +H
if [[ alsa =~ ^(?!arm-trusted-firmware).* ]]; then echo MATCH; else echo "NOT MATCH"; fi
I expect this to print "MATCH" back,
but it prints "NOT MATCH".
After looking into the return code of "double bracket",
it returns "2":
set +H
[[ alsa =~ ^(?!arm-trusted-firmware).* ]]
echo $?
According to bash manual,
the return value '2' means "the regular expression is syntactically incorrect":
An additional binary operator, =~, is available,
with the same precedence as == and !=.
When it is used,
the string to the right of the operator is considered
an extended regular expression and matched accordingly (as in regex(3)).
The return value is 0 if the string matches the pattern, and 1 otherwise.
If the regular expression is syntactically incorrect,
the conditional expression's return value is 2.
What did I do wrong?
In my original script,
I'm comparing against to a list of STRINGs.
When it matches, I trigger some function calls;
when it doesn't match, I skip my actions.
So, YES, from this example,
I'm comparing literally the STRING between 'alsa' and 'arm-trusted-firmware'.
By default bash POSIX standard doesn't supports PCRE. (source: Wiki Bash Hackers)
As workaround, you'll need to enable extglob. This will enable some extended globing patterns:
$ shopt -s extglob
Check Wooledge Wiki for reading more about extglob.
Then you'll be able to use patterns like that:
?(pattern-list) Matches zero or one occurrence of the given patterns
*(pattern-list) Matches zero or more occurrences of the given patterns
+(pattern-list) Matches one or more occurrences of the given patterns
#(pattern-list) Matches one of the given patterns
!(pattern-list) Matches anything except one of the given patterns
More about extended BASH globbing at Wiki Bash Hackers and LinuxJournal.
Thanks for the answer from #Barmar
BASH doesn't support "lookaround" (lookahead and lookbehind)
bash doesn't use PCRE, and doesn't support lookarounds.
Respectfully, aren't you over-complicating things?
if [ "$alsa" = arm-trusted-firmware ]
then
echo 'MATCH'
else
echo 'NOT MATCH'
fi
If you have a good reason for wanting to use the Bashism [[, it would serve
you better to provide an example that justifies it.
Bashism

Matching optional parameters with non-capturing groups in Bash regular expression

I want to parse strings similar to the following into separate variables using regular expressions from within Bash:
Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";
or
Category: resource;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Resource";rel="http://schemas.ogf.org/occi/core#entity";attributes="occi.core.summary";
The first part before "title" is common to all strings, the parts title and attributes are optional.
I managed to extract the mandatory parameters common to all strings, but I have trouble with optional parameters not necessarily present for all strings. As far as I found out, Bash doesn't support Non-capturing parentheses which I would use for this purpose.
Here is what I achieved thus far:
CATEGORY_REGEX='Category:\s*([^;]*);scheme="([^"]*)";class="([^"]*)";'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[3]}
The regular expression I would like to use (and which is working for me in Ruby) would be:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(?:title="([^"]*)";)?\s*(?:rel="([^"]*)";)?\s*(?:location="([^"]*)";)?\s*(?:attributes="([^"]*)";)?\s*(?:actions="([^"]*)";)?'
Is there any other solution to parse the string with command line tools without having to fall back on perl, python or ruby?
I don't think non-capturing groups exist in bash regex, so your options are to use a scripting language or to remove the ?: from all of the (?:...) groups and just be careful about which groups you reference, for example:
CATEGORY_REGEX='Category:\s*([^;]*);\s*scheme="([^"]*)";\s*class="([^"]*)";\s*(title="([^"]*)";)?\s*(rel="([^"]*)";)?\s*(location="([^"]*)";)?\s*(attributes="([^"]*)";)?\s*(actions="([^"]*)";)?'
category_string='Category: entity;scheme="http://schemas.ogf.org/occi/core#";class="kind";title="Entity";attributes="occi.core.id occi.core.title";'
[[ $category_string =~ $CATEGORY_REGEX ]]
echo "full: ${BASH_REMATCH[0]}"
echo "category: ${BASH_REMATCH[1]}"
echo "scheme: ${BASH_REMATCH[2]}"
echo "class: ${BASH_REMATCH[3]}"
echo "title: ${BASH_REMATCH[5]}"
echo "rel: ${BASH_REMATCH[7]}"
echo "location: ${BASH_REMATCH[9]}"
echo "attributes: ${BASH_REMATCH[11]}"
echo "actions: ${BASH_REMATCH[13]}"
Note that starting with the optional parameters we need to skip a group each time, because the even numbered groups from 4 on contain the parameter name as well as the value (if the parameter is present).
You can emulate non-matching groups in bash using a little bit of regexp magic:
_2__ _4__ _5__
[[ "fu#k" =~ ((.+)#|)((.+)/|)(.+) ]];
echo "${BASH_REMATCH[2]:--} ${BASH_REMATCH[4]:--} ${BASH_REMATCH[5]:--}"
# Output: fu - k
Characters # and / are parts of string we parse.
Regexp pipe | is used for either left or right (empty) part matching.
For curious, ${VAR:-<default value>} is variable expansion with default value in case $VAR is empty.