preg_match_all equivalent for BASH? - regex

I have a string like this
foo:collection:indexation [options] [--] <text> <text_1> <text_2> <text_3> <text_4>
And i want to use bash regex to get an array or string that I can split to get this in order to check if the syntax is correct
["text", "text_1", "text_2", "text_3", "text_4"]
I have tried to do this :
COMMAND_OUTPUT=$($COMMAND_HELP)
# get the output of the help
# regex
ARGUMENT_REGEX="<([^>]+)>"
GOOD_REGEX="[a-z-]"
# get all the arguments
while [[ $COMMAND_OUTPUT =~ $ARGUMENT_REGEX ]]; do
ARGUMENT="${BASH_REMATCH[1]}"
# bad syntax
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done
But the while does not seem to be appropriate since I always get the first match.
How can I get all the matches for this regex ?
Thanks !

The loop doesn't work because every time you're just testing the same input string against the regexp. It doesn't know that it should start scanning after the match from the previous iteration. You'd need to remove the part of the string up to and including the previous match before doing the next test.
A simpler way is to use grep -o to get all the matches.
$COMMAND_HELP | grep -o "$ARGUMENT_REGEX" | while read ARGUMENT; do
if [[ ! $ARGUMENT =~ $GOOD_REGEX ]]; then
echo "Invalid argument '$ARGUMENT' for the command $FILE"
echo "Must only use characters [a-z:-]"
exit 5
fi
done

Bash doesn't have this directly, but you can achieve a similar effect with a slight modification.
string='foo...'
re='<([^>]+)>'
while [[ $string =~ $re(.*) ]]; do
string=${BASH_REMATCH[2]}
# process as before
done
This matches the regex we want and also everything in the string after the regex. We keep shortening $string by assigning only the after-our-regex portion to it on every iteration. On the last iteration, ${BASH_REMATCH[2]} will be empty so the loop will terminate.

Related

Bash Regex to extract everything between the last occurrence of a string (release-) and some characters (--)

I have multiple strings, where I want to extract everything between the last occurrence of a string (release-) and some characters (--). More specifically, for a sting like the following:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE
I want to have the following output:
PI_4.1-Sprint-3.1a
I created a regex online, which you can find here. There regex is the following:
.*release-(.*)--.*
However, when I am trying to use this script into a bash script, it wont work. Here is an example.
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
[[ "$artifactoryVersion" =~ (.*release-(.*)--.*) ]]
echo $BASH_REMATCH[0]
echo $BASH_REMATCH[1]
Will return:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[0]
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[1]
Do you have any ideas about how can I accomplish my goal in bash?
You may use:
s='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
rx='.*-release-(.*)--'
[[ $s =~ $rx ]] && echo "${BASH_REMATCH[1]}"
PI_4.1-Sprint-3.1a
Code Demo
Your regex appears correct but make sure to use "${BASH_REMATCH[1]}" to extract first capture group in the result.
You need to use the following:
#!/bin/bash
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
if [[ "$artifactoryVersion" =~ .*release-(.*)-- ]]; then
echo ${BASH_REMATCH[1]};
fi
See the online demo
Output:
PI_4.1-Sprint-3.1a
With your shown samples please try following BASH code with regex. I have also mentioned comments before executing each statement to understand each statement here.
##Shell variable named var being created here.
var="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
##Mentioning regex which needs to be checked on later in program.
regex="(.*release-release)-(.*)--"
##Check condition on var variable with regex if match found then print 2nd capturing group value.
[[ $var =~ $regex ]] && echo "${BASH_REMATCH[2]}"
Explanation of regex: Following is the detailed explanation for used regex.
regex="(.*release-release)-(.*)--": Creating shell variable named regex in which putting regular expression (.*release-release)-(.*)--.
Where regex is creating 2 capturing groups.
First matching everything till release-release(with greedy match), which is followed by a -(not captured anywhere).
Which is followed by a greedy match, which will basically match everything before -- to get the exactly needed value.
You can also do it with shell parameter expansions (it's slower than a bash regex but it's standard):
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
result=${artifactoryVersion##*-release-}
result=${result%%--*}
printf %s\\n "$result"
PI_4.1-Sprint-3.1a
Or directly with a bash parameter expansion and extended globing:
#!/bin/bash
shopt -s extglob
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
echo "${artifactoryVersion//#(*-release-|--*)}"
PI_4.1-Sprint-3.1a

Implementing Regex In Bash

I am trying to throw an error if my text file lines have any combination of 5 [A-Z 0-9] chars followed by a comma and nothing else, like this:
WH3Y4,
H7UF5,
but my code is showing the error even when the text lines look like this, with spaces and words after the comma:
WH3Y4, my test
H7UF5, your test
The regex I am using below should work, if I understand how this is done:
^ to indicate the beginning of the text line
[A-Z0-9]{5} to indicate 5 chars of either cap letter or numbers
, to indicate they are followed by a comma
$ to indicate the end of the text line
So in theory, when it encounters any text after the comma on the same line, it should not produce the error, yet that's what's happening:
if ! [[ $myText =~ ^[A-Z0-9]{5},$ ]]; then
echo "Error"
continue
fi
Similarly, if I want to error when the text looks like this:
WH3Y4 test
H7UF5 test
this should work, but it doesn't either:
if ! [[ $myText =~ ^[A-Z0-9]{5} *[A-Za-z]$ ]]; then
echo "Error"
continue
fi
And when I try this as suggested in the comments:
[[ "$myText" =~ ^[A-Z0-9]{5},\$ ]]
it produces an error for this as it should:
WH3Y4,
H7UF5,
but also produces an error for this as it shouldn't:
WH3Y4, my test
H7UF5, your test
I thought the idea of the $ is to indicate the end of the line, but if the line continues with more chars then it should not match the error condition.
It seems that your code is not implementing what you want. You say
I am trying to throw an error if my text file lines have any combination
of 5 [A-Z 0-9] chars followed by a comma and nothing else
but then your code says
if ! [[ $myText =~ ^[A-Z0-9]{5},$ ]]; then
echo "Error"
continue
fi
The presence of the ! - the negation operator - means that it will print "Error" if the string does not match the regex, so it will accept the two sample strings your gave - WH3Y4, and H7UF5, - and will reject anything else. I think what you wanted here is
if [[ $myText =~ ^[A-Z0-9]{5},$ ]] ; then
echo "Error"
continue
fi
or in other words, just get rid of the !.
In the second case
if ! [[ $myText =~ ^[A-Z0-9]{5} *[A-Za-z]$ ]]; then
echo "Error"
continue
fi
the problem is that your regular expression doesn't match your data. I suggest that you use
if [[ $myText =~ ^[A-Z0-9]{5},[\ A-Za-z]+$ ]]; then
echo "Error"
continue
fi
Note that I dropped the ! in the this case as well.

Bash only get the first matched result when use regex

There's a string example
"j2sdk/1.8.0_25-static j2sdk/1.8.0_45 j2sdk/1.8.0_p120 j2sdk/1.8.0_40 j2sdk/1.8.0_51"
I want to find the ones matched with format j2sdk/1.8.0_xxx, but xxx only with digits, here, I want below strings be matched
j2sdk/1.8.0_45
j2sdk/1.8.0_40
j2sdk/1.8.0_51
I wrote below code, but when run, it only get the first matched j2sdk/1.8.0_45, anything wrong with my code?
avail_versions="j2sdk/1.8.0_25-static j2sdk/1.8.0_45 j2sdk/1.8.0_p120 j2sdk/1.8.0_40 j2sdk/1.8.0_51"
patern='j2sdk\/1\.8\.0_[0-9]+\s+'
if [[ $avail_versions =~ $patern ]];then
echo matched
echo ${BASH_REMATCH[0]}
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
fi
The results is that BASH_REMATCH[0] is j2sdk/1.8.0_45, BASH_REMATCH[1] and [2] are empty
I expected I can get them in BASH_REMATH[1],BASH_REMATH[2],BASH_REMATH[3].
Is there other way in Bash I can get expected matches.
Thanks
I split the input at spaces and add back the space after each word.
for s in $avail_versions ; do
s="$s "
if [[ $s =~ $patern ]];then
echo ${BASH_REMATCH[0]}
fi
done
j2sdk/1.8.0_45
j2sdk/1.8.0_40
j2sdk/1.8.0_51

Bash Regex comparison not working

keyFileName=$1;
for fileExt in "${validTypes[#]}"
do
echo $fileExt;
if [[ $keyFileName == *.$fileExt ]]; then
keyStatus="true";
fi
done;
I am trying to check the file extension of a file passed in against an array of multiple file extensions. However it doesn't seem to be working properly. Any help?
validTypes=(".txt" ".mp3")
keyFileName="$1"
for fileExt in "${validTypes[#]}"
do
echo $fileExt;
if [[ $keyFileName =~ ^.*$fileExt$ ]]; then
keyStatus="true";
echo "Yes"
fi
done;
Effectively, you could change your if statement to either:
if [[ $keyFileName == ?*$fileExt ]] # Glob pattern case, ? denotes single char
or:
if [[ $keyFileName =~ .*$fileExt ]] # Regex case, . denotes single char
Looping over the array to do a regex match on each element seems rather inefficient. You're using regex; it's easy to combine the expressions and avoid looping at all.
Mangling the array into a valid regex is not entirely trivial, though. Here's my attempt:
validTypes=('\.txt' '\.mp3')
fileExtRe=$(printf '|%s' "${validTypes[#]}"
# Trim off the first alternation, add parens and anchor
fileExtRe="(${fileExtRe#?})$"
if [[ $keyFileName =~ $fileExtRe ]]; then
:
Notice how the elements in validTypes are regular expressions now, with the dot escaped to only match a literal dot.

Bash scripting, regex in if statement

I'm pretty new to bash scripting and regexp and have a question.
I want to check to see if my variable $name starts with a-d, e-h, i-l etc and do some stuff accordingly. If the string starts with "the." or "The." it should check the first letter after the period.
My problem is that if $name consists of "the.anchor" both the a-d0-9 and q-t will be true. Do you guys have any idea what's wrong?
if [[ $name =~ ^([tT]he\.)?[a-dA-D0-9]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[e-hE-H]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[i-lI-L]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[m-pM-P]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[q-tQ-T]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[u-wU-W]+ ]]; then
do some stuff
fi
if [[ $name =~ ^([tT]he\.)?[x-zX-Z]+ ]]; then
do some stuff
fi
Thanks in advance!
Your first part it optional:
([tT]he\.)?
So the.anchor matches the pattern ^([tT]he\.)?[a-dA-D0-9]+ because the the. matches `^([tT]he\.)? and the a matches [a-dA-D0-9]+. It matches ^([tT]he\.)?[q-tQ-T]+ because ^([tT]he\.)? is optional an t matches [q-tQ-T]+. Note not the whole input is consumed by the second pattern, in fact only the first character is grabbed.
You can verify this by having bash echo the match:
echo "${BASH_REMATCH[0]}"
Which should print the.anchor in the first case and t in the second.
You do not have an end anchor on the pattern so only part of the input needs to be matched. If you made the second pattern ^([tT]he\.)?[q-tQ-T]+$ then it would not match.
Alternatively you could make the the first part possessive - ^([tT]he\.)?+. This will mean that if the engine matches the first expression it will not be unmatched. In the latter case ^([tT]he\.)?+ will grab the the. and then not release it when [q-tQ-T]+ fails; this will cause the match to fail.
I figured out a way to fix my problem by using elif statements and putting the q-t part as the last one
I think the ? can be removed as the if statement is already doing the test. The + matches the preceding item at least once and would only be needed if you want to match more than one instance of the letters.
You can do it like this:
if [[ $name =~ ^[tT]he\.[a-dA-D0-9] ]]; then
do some stuff
fi
The condition will only return true if the first character after ^[tT]he\. is [a-dA-D0-9].
However, I tend to think case is a cleaner solution than if statements when matching lists of characters against variables.
case $name in
[tT]he\.[a-dA-D0-9]*)
do some stuff
;;
esac