Return RegEx match in bash - regex

Using bash, I can check to see if the value of a variable matches a regular expression. However, I cannot find a way of returning the part that matched. Is this possible?
For example take $test as test="123456-name-goes-here.1.2.3-something.zip" The part I'd like to return is 1.2.3-something.
With the code below, I can successfully match $test, but I don't know where to go from here.
[[ $test =~ ([0-9]\.[0-9](\.[0-9])?(\.[0-9])?)(-[a-z-]*)? ]] && echo "matched"

${BASH_REMATCH[0]} will contain the value you need:
test="123456-name-goes-here.1.2.3-something.zip"
reg='[0-9]\.[0-9](\.[0-9])?(\.[0-9])?(-[a-z-]*)?'
if [[ $test =~ $reg ]]; then
    echo "${BASH_REMATCH[0]}"
fi
See this cheatsheet:
Regular expression captures will be available in $BASH_REMATCH, ${BASH_REMATCH[1]}, ${BASH_REMATCH[2]}, etc.
That means the whole match is stored in ${BASH_REMATCH[0]}, and the subsequent elements contain the submatches captured with (...) (capturing groups).
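As a rough sketch of how the indices line up, the snippet below reuses the question's original pattern (including its outer parentheses) and copies the pieces into variables. The names re, whole, version and suffix are made up for illustration.

```bash
#!/usr/bin/env bash
test="123456-name-goes-here.1.2.3-something.zip"
# Keeping the pattern in a single-quoted variable avoids quoting surprises.
re='([0-9]\.[0-9](\.[0-9])?(\.[0-9])?)(-[a-z-]*)?'

if [[ $test =~ $re ]]; then
    whole=${BASH_REMATCH[0]}    # entire matched text
    version=${BASH_REMATCH[1]}  # 1st opening parenthesis: the dotted version
    suffix=${BASH_REMATCH[4]}   # 4th opening parenthesis: the optional "-name" part
    printf '%s\n' "$whole" "$version" "$suffix"
fi
```

Groups are numbered by the position of their opening parenthesis, so the two inner (\.[0-9])? groups take indices 2 and 3.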

Related

Why does bash "=~" operator ignore the last part of the pattern specified?

I am trying to compare a string in bash to a regex pattern and have found something odd. For starters, I am using GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu). This is within WSL.
For example here is sample program demonstrating the problem:
#!/bin/env bash
name="John"
if [[ "${name}" =~ "John"* ]]; then
echo "found"
else
echo "not found"
fi
exit
As expected, this will echo found, since the name "John" matches the regex pattern described. What I find odd is that if I drop the n in John, it still echoes found. IMO "Joh" does not match the pattern "John"*.
If you drop the "hn" and just set $name to "Jo", then it echoes not found. It seems to only affect the last character in the regex pattern (aside from the wildcard).
I am converting an old csh script to bash and this behavior is not happening in csh. What is causing bash to do this?
You're mixing up syntax for shell patterns and regular expressions. Your regular expression, after stripping the quoting, is John*: Joh followed by any number of n, including 0. Matches Joh, John, Johnn, Johnnn, ...
It's not anchored, so it also matches any string containing one of the matches above.
Since it's not anchored, depending on what you want, you could do any of these:
Any string containing John should match:
Regex: [[ $name =~ John ]]
Shell pattern: [[ $name == *John* ]]
Any string that begins with John should match:
Regex: [[ $name =~ ^John ]]
Shell pattern: [[ $name == John* ]]
Notice that shell patterns, unlike the regular expressions, must match the entire string.
A note on quoting: within [[ ... ]], the left-hand side doesn't have to be quoted; on the right-hand side, quoted parts are interpreted literally. For regular expressions, it's a good practice to define it in a separate variable:
re='^John'
if [[ $name =~ $re ]]; then
This avoids a few edge cases with special characters in the regex.
The =~ operator compares using regular expression syntax, not glob syntax. The * isn't a shell wildcard, it means, "the previous character, 0 or more times".
The string Joh matches the regular expression John* because it contains Joh followed by zero n characters.
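To see the two syntaxes side by side, here is a small sketch that runs the same text, John*, once as a regular expression and once as a shell pattern. The helper names regex_match and glob_match are hypothetical and exist only for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: succeed if $1 matches the ERE / shell pattern in $2.
regex_match() { local re=$2; [[ $1 =~ $re ]]; }
glob_match()  { [[ $1 == $2 ]]; }   # unquoted right-hand side is a pattern

for name in Jo Joh John Johnny; do
    regex_match "$name" 'John*' && r=yes || r=no
    glob_match  "$name" 'John*' && g=yes || g=no
    printf '%-7s regex: %-3s  glob: %s\n' "$name" "$r" "$g"
done
```

Only Joh is treated differently: the regex accepts it (Joh plus zero n characters), while the shell pattern requires the full John prefix.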

Bash Regex to extract everything between the last occurrence of a string (release-) and some characters (--)

I have multiple strings, where I want to extract everything between the last occurrence of a string (release-) and some characters (--). More specifically, for a string like the following:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE
I want to have the following output:
PI_4.1-Sprint-3.1a
I created a regex online, which you can find here. The regex is the following:
.*release-(.*)--.*
However, when I try to use this regex in a bash script, it won't work. Here is an example.
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
[[ "$artifactoryVersion" =~ (.*release-(.*)--.*) ]]
echo $BASH_REMATCH[0]
echo $BASH_REMATCH[1]
Will return:
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[0]
inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE[1]
Do you have any ideas about how can I accomplish my goal in bash?
You may use:
s='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
rx='.*-release-(.*)--'
[[ $s =~ $rx ]] && echo "${BASH_REMATCH[1]}"
PI_4.1-Sprint-3.1a
Your regex appears correct but make sure to use "${BASH_REMATCH[1]}" to extract first capture group in the result.
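The braces matter because of how bash parses array references. The sketch below, using illustrative variable names wrong and right, shows what each spelling expands to after a successful match.

```bash
#!/usr/bin/env bash
s='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
rx='.*release-(.*)--'

if [[ $s =~ $rx ]]; then
    # Without braces, bash expands $BASH_REMATCH (element 0, the whole match)
    # and then appends the literal text "[1]".
    wrong="$BASH_REMATCH[1]"
    # With braces, the subscript is part of the expansion.
    right="${BASH_REMATCH[1]}"
    printf 'wrong: %s\nright: %s\n' "$wrong" "$right"
fi
```

This is why the question's output showed the input string followed by [0] and [1].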
You need to use the following:
#!/bin/bash
artifactoryVersion="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
if [[ "$artifactoryVersion" =~ .*release-(.*)-- ]]; then
    echo "${BASH_REMATCH[1]}"
fi
Output:
PI_4.1-Sprint-3.1a
With your shown samples, please try the following Bash code with a regex. I have added a comment before each statement to explain what it does.
##Shell variable named var being created here.
var="inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE"
##Mentioning regex which needs to be checked on later in program.
regex="(.*release-release)-(.*)--"
##Check condition on var variable with regex if match found then print 2nd capturing group value.
[[ $var =~ $regex ]] && echo "${BASH_REMATCH[2]}"
Explanation of the regex:
regex="(.*release-release)-(.*)--": creates a shell variable named regex holding the regular expression (.*release-release)-(.*)--.
The regex creates 2 capturing groups.
The first group greedily matches everything up to and including release-release, which is followed by a - (not captured anywhere).
The second group is a greedy match that captures everything before --, which is exactly the needed value.
You can also do it with shell parameter expansions (it's slower than a bash regex but it's standard):
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
result=${artifactoryVersion##*-release-}
result=${result%%--*}
printf %s\\n "$result"
PI_4.1-Sprint-3.1a
Or directly with a bash parameter expansion and extended globbing:
#!/bin/bash
shopt -s extglob
artifactoryVersion='inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
echo "${artifactoryVersion//@(*-release-|--*)}"
PI_4.1-Sprint-3.1a
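One difference from the =~ approach is that parameter expansions never "fail": if a marker is missing, the string is simply left as it was. A possible way to handle that, assuming you want a non-zero status when the markers are absent, is a small wrapper like the one below. The function name extract_release is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the text between the last "-release-" and the
# first "--" that follows it; return 1 if the markers are not both present.
extract_release() {
    local s=$1
    # Cheap guard: "-release-" must appear with a "--" somewhere after it.
    [[ $s == *-release-*--* ]] || return 1
    s=${s##*-release-}            # drop everything through the last "-release-"
    printf '%s\n' "${s%%--*}"     # drop the first "--" and everything after it
}

extract_release 'inte_integration-abc-abcde-abcdefg-release-release-PI_4.1-Sprint-3.1a--1.0.2-RELEASE'
extract_release 'no-markers-here' || echo "no match"
```

The guard is deliberately simple; it does not cover inputs where the only -- comes before the last -release-.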

Regex/Shell - how to match all except those with specific pattern

I need a regex in shell to match all strings except those with specific pattern.
My specific pattern can vary: any string containing (i|I)[two-digit number](u|U)[two-digit number] should not match.
For example :
Some.text.1234.text => should match
Some.text.1234.i10u20.text => shouldn't match
Some.text.1234.I01U02.text => shouldn't match
Some.text.1234.i83U23.text => shouldn't match
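You can try this (it uses a negative lookahead, so it needs a PCRE-capable tool such as grep -P; Bash's =~ uses POSIX ERE, which has no lookaheads):
^(?!.*[iI]\d{2}[uU]\d{2}).*$
Explanation:
^ start of a line
(?!...) negative lookahead
.*[iI]\d{2}[uU]\d{2} checks whether i or I, two digits, u or U, and two more digits occur anywhere in the line
.*$ if the lookahead found nothing, match the entire string up to the end of string $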
The Bash script checking if a string matches a regex or not can look like
f='It_is_your_string_to_check';
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
Here, "${f^^}" turns the string into uppercase (so as not to use (U|u) and (I|i)), and then =~ operator triggers a regex check here since the pattern on the right side is not quoted. You may play it safe and define the regex pattern with a separate single-quoted string variable and use
rx='I[0-9]{2}U[0-9]{2}'
if [[ "${f^^}" =~ $rx ]]; then ...
See a Bash demo online:
s='Some.text.1234.text
Some.text.1234.i10u20.text
Some.text.1234.I01U02.text
Some.text.1234.i83U23.text'
for f in $s; do
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
echo "$f is invalid";
else
echo "$f is valid"
fi;
done;
Output:
Some.text.1234.text is valid
Some.text.1234.i10u20.text is invalid
Some.text.1234.I01U02.text is invalid
Some.text.1234.i83U23.text is invalid
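The ${f^^} expansion needs bash 4 or later. A variant that avoids it, sketched here under the assumption that the forbidden pattern is exactly i/I, two digits, u/U, two digits, spells out both cases in bracket expressions. The helper name is_valid and the array names are hypothetical.

```bash
#!/usr/bin/env bash
# Assumed pattern: i or I, two digits, u or U, two digits, anywhere in the string.
rx='[iI][0-9]{2}[uU][0-9]{2}'

# Hypothetical helper: succeeds when the string does NOT contain the pattern.
is_valid() { ! [[ $1 =~ $rx ]]; }

names=(
    Some.text.1234.text
    Some.text.1234.i10u20.text
    Some.text.1234.I01U02.text
    Some.text.1234.i83U23.text
)
for f in "${names[@]}"; do
    if is_valid "$f"; then echo "$f is valid"; else echo "$f is invalid"; fi
done
```

Using an array instead of an unquoted multi-line string also keeps the loop from depending on word splitting.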

Using Bash regex match to test membership

Below, I use ALLOWED as container to test a token.
I am using a Bash regex match syntax =~ where the right hand side should be an extended regular expression.
In Bash's regular expression matching, using the operator =~, the left-hand operand is matched against the extended regular expression (ERE) on the right-hand side.
But I can't see str1 as a regex, and I don't know why ALLOWED matches a string that is present inside it. Even though this works in this case, using a plain string (str1) as the regex leaves it open to tricky bugs in the future.
export ALLOWED="str0 str1 strn"
export STR1="str1"
export STR2="str2"
if [[ $ALLOWED =~ ${STR1} ]]; then
echo "how does it this work?"
fi
if [[ $ALLOWED =~ ${STR2} ]]; then
:
else
echo "does not work."
fi
Questions:
Why/how does this work?
What's a better way to test for an element in a list in bash?
The syntax is content =~ regex. For example, think about how this simple phone number validation works:
$ phone="555-443-2321"; if [[ $phone =~ [0-9-]+ ]]; then echo PASS; fi
As in your example, the right-hand side is the regular expression and the left-hand side is the content.
Your regex can be a string literal; the check is then whether the content contains that substring:
$ phone="555-443-2321"; if [[ $phone =~ "555" ]]; then echo PASS; fi
If it makes it easier, think of that as the regex .*555.*
If I understand right, the confusion is because $a =~ $b checks whether there's a match for $b in $a, not whether $a as a whole matches. [[ "str0 str1 strn" =~ str1 ]] succeeds because there's a match for the (trivial) regex str1 somewhere in "str0 str1 strn".
If you want to check for a match to the entire string, you need to anchor the regex with a ^ at the beginning, and $ at the end: [[ $ALLOWED =~ ^${STR1}$ ]]
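For the second question (testing whether a token is an element of the list), one common approach is to compare whole words rather than substrings. The sketch below assumes the list is space-separated and that tokens contain no spaces; the helper name in_list is hypothetical.

```bash
#!/usr/bin/env bash
ALLOWED="str0 str1 strn"

# Hypothetical helper: succeeds if $1 is one of the space-separated words in $2.
# Padding both sides with spaces makes every word, including the first and
# last, appear as " word ", so partial words cannot match.
in_list() { [[ " $2 " == *" $1 "* ]]; }

for token in str1 str2 str; do
    if in_list "$token" "$ALLOWED"; then
        echo "$token: allowed"
    else
        echo "$token: not allowed"
    fi
done
```

Note that str is rejected here, whereas [[ $ALLOWED =~ str ]] would succeed because str occurs inside str0.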

shell script odd regex

I have some regexes that behave oddly in my shell script. I keep them in variables, and I have tried every which way to get them to behave, but they don't seem to do any regex matching, even though I know regex quite well thanks to regex101. Here is what a sample looks like:
fname="direcheck"
FIND="*"
if [[ $fname =~ $FIND ]]; then
echo "no quotes"
fi
if [[ "$fname" =~ "$FIND" ]]; then
echo "with quotes"
fi
right now it will display nothing
if i change find to
FIND="[9]*"
then it prints no quotes
if i say
FIND="[a-z]*"
then it prints no quotes
if i say
FIND="dircheck"
then nothing prints
if i say
FIND="*ck"
then nothing prints
I don't get how this regex is working
how do i use these variables, and what is the proper syntax?
* and *ck are invalid regular expressions. It would work (with no quotes) if you were comparing with ==, not =~. If you want to use the same functionality that you get in == for them, the equivalent regexps are .* and .*ck.
[9]* is any number (including zero) of 9 characters. There are zero 9 characters in your direcheck, so it matches (as an empty match).
dircheck is not found in direcheck, so not printing anything is hardly surprising.
[a-z]* is any number of characters that are between a and z (i.e. any number of lowercase letters). This will match, assuming it's not quoted.
I finally figured it out, and why it was working so oddly
[a-z]*, [9]* and [anythinghere]* all match because * means zero or more times, so "direcheck" contains [9] zero or more times.
so
if [[ "$fname" =~ $FIND ]]; then
or
if [[ $fname =~ $FIND ]]; then
are both correct, and
if [[ "$fname" =~ "$FIND" ]]; then
matches only when $fname contains the value of $FIND literally, because the quoted $FIND is matched as a literal string, not as a regex
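A short sketch of that difference, using a regex that is valid for the sample file name (the variable names re and lit are only for illustration):

```bash
#!/usr/bin/env bash
fname="direcheck"

re='^dir.*ck$'
# Unquoted: $re is interpreted as a regular expression.
[[ $fname =~ $re ]]   && echo "unquoted: regex matches"
# Quoted: the characters ^ . * $ are taken literally, so nothing matches.
[[ $fname =~ "$re" ]] || echo "quoted: no literal '^dir.*ck\$' in the name"

lit='check'
# Quoted plain text still matches when it occurs as a substring.
[[ $fname =~ "$lit" ]] && echo "quoted: literal substring found"
```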