matching regex not working properly in zsh - regex

I am getting unexpected results here. I don't know why; can someone help me understand what's happening?
$ [[ example.com/something =~ .*\.mp4\?.* ]] && echo matched2
matched2
My regex, ^.*\.mp4\?.*, should only match something like example.com/file.mp4?size=large, so how come it matches here when the string contains no such pattern?
I am using zsh
$ zsh --version
zsh 5.7.1 (x86_64-pc-linux-gnu)

The backslashes aren't part of the regular expression; the shell performs quote removal to generate the regular expression .*.mp4?.*, which matches any string containing 1 or more arbitrary characters, followed by mp and an optional 4. You need to escape the backslashes as well.
[[ example.com/something =~ .*\\.mp4\\?.* ]] && echo matched2
This produces the desired regular expression .*\.mp4\?.*.
(Note that regular expressions aren't anchored to the beginning or end of the input string, so \\.mp4\\? or '\.mp4\?' would suffice.)
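One way to sidestep the escaping question is to keep the regex in a single-quoted variable, so the shell never strips the backslashes. The following is a minimal sketch written for bash (the variable approach is expected to carry over to zsh, but that is an assumption here); the helper name has_mp4_query is made up for illustration.

```shell
#!/usr/bin/env bash
# Single quotes keep both backslashes, so the regex engine sees \.mp4\?
re='\.mp4\?'

# has_mp4_query is a hypothetical helper, not part of the original answer.
has_mp4_query() {
  [[ $1 =~ $re ]]
}

for url in 'example.com/file.mp4?size=large' 'example.com/something'; do
  if has_mp4_query "$url"; then
    echo "matched:  $url"
  else
    echo "no match: $url"
  fi
done
```

Because the regex is unanchored, no leading or trailing .* is needed.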

Related

Difference between grep -E regex and Bash regex in conditional expression

For the same regex applied to the same string, why does grep -E match, but the Bash =~ operator in [[ ]] does not?
$ D=Dw4EWRwer
$ echo $D|grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$' || echo wrong pattern
$ [[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]] || echo wrong pattern
wrong pattern
Update: I confirm this worked:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]\ _-]{1,22}$ ]] || echo wrong pattern
The problem (for both versions of the code) is on this character class:
[[:alnum:]_-\ ]
In the grep version, because the regex is enclosed in single quotes, the backslash doesn't escape anything and the character range received by grep is exactly how it is represented above.
In the bash version, the backslash (\) escapes the space that follows it and the actual character class used by [[ ]] to test is [[:alnum:]_- ].
Because the underscore (_) comes after both the space ( ) and the backslash (\) in the ASCII table, the range that starts at _ runs backwards in both cases, so neither of these character classes is valid.
For the bash version you can use:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]]; echo $?
to verify its outcome. If the regex is incorrect, the exit code is 2.
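The three possible exit statuses of a [[ =~ ]] test can be compared side by side. This is a small sketch for bash; regex_status is a hypothetical helper, and [c-a] is used only as an obviously reversed range.

```shell
#!/usr/bin/env bash
# regex_status is a hypothetical helper: it prints the exit status
# of matching STRING ($1) against REGEX ($2).
regex_status() {
  local re=$2
  [[ $1 =~ $re ]]
  echo $?
}

regex_status 'abc' '^[a-c]+$'   # 0: the string matches
regex_status 'xyz' '^[a-c]+$'   # 1: valid regex, but no match
regex_status 'abc' '[c-a]'      # 2: the regex itself is invalid (reversed range)
```

Checking for status 2 separately helps tell a broken pattern apart from an ordinary non-match.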
If you want to put a dash (-) into a character class, you have to put it either as the first character in the class (just after [ or [^ if it is a negating class) or as the last character in the class (right before the closing ]).
The grep version of the code should be (there is no need to escape anything inside a string enclosed in single quotes):
$ echo $D | grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' || echo wrong pattern
The bash version of your code should be:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] || echo wrong pattern
Based on your comment, you want the bracket expression to contain alphanumeric characters, spaces, underscores and dashes, so the dash is not supposed to indicate a range. To add a hyphen to a bracket expression, it has to be the first or last character in it. Additionally, you don't have to escape things in bracket expressions, so you can drop the backslash. Your grep regex includes a literal \ in the bracket expression:
$ grep -q '[\]' <<< '\' && echo "Match"
Match
In the Bash regex, the space has to be escaped because the string is first read by the shell, but see below how to avoid that.
First, fixing your regex:
^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$
The backslash is gone, and the hyphen is moved to the end. Using this with grep works fine:
$ D=Dw4EWRwer
$ grep -E '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
To use the regex within [[ ]] directly, the space has to be escaped:
$ [[ $D =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] && echo "Match"
Match
I would make the following changes:
Use character classes where possible: [A-Z] is [[:upper:]], [A-Za-z0-9] is [[:alnum:]]
Store the regex in a variable for usage in [[ ]]; this has two advantages: no escaping characters special to the shell, and compatibility with older Bash versions, as the quoting requirements changed between 3.1 and 3.2 (see the Patterns article in the BashGuide).
The regex would then become this for grep:
$ grep -E '^[[:upper:]][[:alnum:]]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
and this in Bash:
$ re='^[[:upper:]][[:alnum:]]{1,2}[[:alnum:]_ -]{1,22}$'
$ [[ $D =~ $re ]] && echo "Match"
Match
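To see what the bracket expression accepts and rejects, the regex can be run against a few sample values. This is only a sketch: is_valid_name is a hypothetical helper, the sample strings other than Dw4EWRwer are invented, and the {1,2} quantifier from the question's pattern is kept.

```shell
#!/usr/bin/env bash
# Regex kept in a single-quoted variable: no shell escaping needed.
re='^[[:upper:]][[:alnum:]]{1,2}[[:alnum:]_ -]{1,22}$'

# is_valid_name is a hypothetical helper name.
is_valid_name() {
  [[ $1 =~ $re ]]
}

# Samples: plain, with space/underscore/dash, with a backslash, lowercase start.
for candidate in 'Dw4EWRwer' 'Ab c_d-e' 'Ab\c' 'dw4EWRwer'; do
  if is_valid_name "$candidate"; then
    echo "ok:  $candidate"
  else
    echo "bad: $candidate"
  fi
done
```

The backslash sample is rejected because, with the hyphen at the end of the class, the class no longer contains a backslash.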

Matching groups in bash regex [duplicate]

In bash I have the following:
REGEX="(1.0.0|2.0.0)"
declare -a arr=("A:1.0.0" "B:1.0.0" "C:2.0.0" "D:2.0.1")
for i in "${arr[@]}"
do
echo "Found: $i"
if [[ "$i"=~"${REGEX}" ]]; then
echo "$i matches: ${REGEX}"
else
echo "$i DOES NOT match: ${REGEX}"
fi
done
I would assume that for D:2.0.1 it would print ...DOES NOT match... but instead it prints
Found: A:1.0.0
A:1.0.0 matches: (1.0.0|2.0.0)
Found: B:1.0.0
B:1.0.0 matches: (1.0.0|2.0.0)
Found: C:2.0.0
C:2.0.0 matches: (1.0.0|2.0.0)
Found: D:2.0.1
D:2.0.1 matches: (1.0.0|2.0.0)
So what is wrong with my REGEX group pattern? Specifying a group pattern like that works fine in other languages - e.g. like groovy.
To start with, you have a typo in the regex match expression:
if [[ "$i"=~"${REGEX}" ]]; then
should have been written as just
if [[ $i =~ ${REGEX} ]]; then
Without spaces around =~, the test sees a single non-empty word, which is always true. You should also never quote your regex: for bash to treat | as the alternation operator of its extended regular expression (ERE) support, the pattern must be unquoted, because bash treats a double-quoted pattern as a literal string.
Not recommended
But if you still want to quote your regex strings - bash 3.2 introduced a compatibility option compat31 (under New Features in Bash 1.l) which reverts bash regular expression quoting behavior back to 3.1 which supported quoting of the regex string. You just need to enable it via an extended shell option
shopt -s compat31
and run it with quotes now
if [[ $i =~ "${REGEX}" ]]; then
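For comparison, here is a sketch of the loop in the recommended form, with spaces around =~ and an unquoted regex variable. Escaping the dots and anchoring with $ are additions made here so that . cannot match an arbitrary character; the lowercase variable names and the matched array are also assumptions for illustration.

```shell
#!/usr/bin/env bash
# Dots escaped so they match literally; $ anchors the version at the end.
regex='(1\.0\.0|2\.0\.0)$'
arr=("A:1.0.0" "B:1.0.0" "C:2.0.0" "D:2.0.1")
matched=()   # illustrative: collects the entries that match

for i in "${arr[@]}"; do
  if [[ $i =~ $regex ]]; then
    matched+=("$i")
    echo "$i matches: version ${BASH_REMATCH[1]}"
  else
    echo "$i DOES NOT match"
  fi
done
```

With this form, D:2.0.1 falls through to the else branch.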

Regex stored in a shell variable doesn't work between double brackets

The below is a small part of a bigger script I'm working on, but it is giving me a lot of pain and causes part of the bigger script to malfunction. The intention is to check whether the variable holds a string matching red hat or Red Hat. If it does, the variable's value should be changed to redhat. But the string doesn't match the regex I've used.
getos="red hat"
rh_reg="[rR]ed[:space:].*[Hh]at"
if [ "$getos" =~ "$rh_reg" ]; then
getos="redhat"
fi
echo $getos
Any help will be greatly appreciated.
There are multiple things to fix here:
bash supports regex pattern matching within its [[ extended test operator, not within the POSIX standard [ test operator.
Never quote your regex match string. bash 3.2 introduced a compatibility option compat31 (under New Features in Bash 1.l) which reverts bash regular expression quoting behavior back to 3.1 which supported quoting of the regex string.
Fix the regex to use [[:space:]] instead of just [:space:].
So just do
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ $rh_reg ]]; then
getos="redhat"
fi;
echo "$getos"
or enable the compat31 option from the extended shell option
shopt -s compat31
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ "$rh_reg" ]]; then
getos="redhat"
fi
echo "$getos"
shopt -u compat31
But instead of messing with those shell options just use the extended test operator [[ with an unquoted regex string variable.
There are two issues:
First, replace:
rh_reg="[rR]ed[:space:].*[Hh]at"
With:
rh_reg="[rR]ed[[:space:]]*[Hh]at"
A character class like [:space:] only works when it is in square brackets. Also, it appears that you wanted to match zero or more spaces and that is [[:space:]]* not [[:space:]].*. The latter would match a space followed by zero or more of anything at all.
Second, replace:
[ "$getos" =~ "$rh_reg" ]
With:
[[ "$getos" =~ $rh_reg ]]
Regex matching requires bash's extended test: [[...]]. The POSIX standard test, [...], does not have the feature. Also, in bash, a regular expression is only treated as a regex if it is unquoted; quoted parts are matched as literal text.
Examples:
$ rh_reg='[rR]ed[[:space:]]*[Hh]at'
$ getos="red Hat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat
$ getos="RedHat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat
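The difference between [[:space:]]* and [[:space:]].* can be checked directly. A small sketch, where the helper matches and the sample string 'red x hat' are made up for illustration:

```shell
#!/usr/bin/env bash
zero_or_more_spaces='[rR]ed[[:space:]]*[Hh]at'
space_then_anything='[rR]ed[[:space:]].*[Hh]at'

# matches is a hypothetical helper: matches STRING REGEX
matches() {
  local re=$2
  [[ $1 =~ $re ]]
}

for s in 'RedHat' 'red hat' 'red x hat'; do
  if matches "$s" "$zero_or_more_spaces"; then
    echo "'$s' matches the [[:space:]]* form"
  fi
  if matches "$s" "$space_then_anything"; then
    echo "'$s' matches the [[:space:]].* form"
  fi
done
```

RedHat only matches the first form, since the second requires at least one whitespace character, while 'red x hat' only matches the second.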

Bash: Regex for finding pattern having double quotes

I am trying to use regex in my shell script to find a substring.
Original string:
"relative-to="jboss.server.base.dir" scan-enabled="true" scan-interval="0""
Trying to find following substring:
"scan-enabled="true""
Code:
str="relative-to=\"jboss.server.base.dir\" scan-enabled=\"true\" scan-interval=\"0\""
reg='scan-enabled.*"'
[[ "$str" =~ $reg ]] && echo $BASH_REMATCH
but it is returning,
scan-enabled="true" scan-interval="0"
Can someone please help on how to search for a pattern involving double quotes using regex?
Bash version: 4.1.2(1)-release
If you want to match the entire expression scan-enabled="true" or scan-enabled="false" then you can try this:
reg='(scan-enabled="[^"]*")'
[[ "$str" =~ $reg ]] && echo "${BASH_REMATCH[1]}"
Inside single quotes the double quotes need no backslash. The variable ${BASH_REMATCH[1]} holds the text matched by the first capture group in the regular expression. In this case, the entire regular expression is enclosed in parentheses, so the whole match is the first capture group.
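The greedy pattern from the question and a bounded one can be compared on the same string. In this sketch the variable names greedy, bounded, greedy_match, bounded_match, and value are illustrative, and the second capture layout (group around the value only) is a variation, not the answer's exact regex.

```shell
#!/usr/bin/env bash
str='relative-to="jboss.server.base.dir" scan-enabled="true" scan-interval="0"'

greedy='scan-enabled.*"'           # .* runs on to the last double quote
bounded='scan-enabled="([^"]*)"'   # [^"]* stops at the next double quote

if [[ $str =~ $greedy ]]; then
  greedy_match=${BASH_REMATCH[0]}
fi

if [[ $str =~ $bounded ]]; then
  bounded_match=${BASH_REMATCH[0]}   # whole match
  value=${BASH_REMATCH[1]}           # first capture group: the attribute value
fi

echo "greedy:  $greedy_match"
echo "bounded: $bounded_match"
echo "value:   $value"
```

ERE as used by bash has no non-greedy quantifier, so the negated bracket expression is the usual way to stop at the first closing quote.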

Bash script wont match on regular expression

I have the following bash script which should be producing the output TEST
#!/bin/bash
test="TEST:THING - OBJECT_X"
if [[ $test =~ ^([a-zA-Z0-9]+)\:([a-zA-Z0-9]+)[A-Z\s\-_]+$ ]]; then
echo ${BASH_REMATCH[1]}
fi
In my regex tester the regular expression seems to be matching and capturing on the first and second groups:
https://regex101.com/r/kR1jM7/1
Any idea what's causing this?
\s is a PCRE construct that is not meaningful in ERE. Use [:space:] inside the bracket expression instead. Also, instead of escaping the dash as \-, move the - to the very end of the bracket expression. The colon does not need a backslash either.
The following works:
[[ $test =~ ^([a-zA-Z0-9]+):([a-zA-Z0-9]+)[A-Z[:space:]_-]+$ ]]
That said, for compatibility with a wider range of bash releases, move the regex into a variable:
re='^([a-zA-Z0-9]+):([a-zA-Z0-9]+)[A-Z[:space:]_-]+$'
[[ $test =~ $re ]]
To use POSIX character classes more aggressively (and thus make your code more likely to work correctly across languages and locales), also consider:
re='^([[:alnum:]]+):([[:alnum:]]+)[[:upper:][:space:]_-]+$'
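Putting the pieces together, a sketch of the complete script with the regex in a variable might look as follows. The variable names test_str, first, and second are chosen here for illustration.

```shell
#!/usr/bin/env bash
re='^([[:alnum:]]+):([[:alnum:]]+)[[:upper:][:space:]_-]+$'
test_str='TEST:THING - OBJECT_X'

if [[ $test_str =~ $re ]]; then
  first=${BASH_REMATCH[1]}    # text before the colon
  second=${BASH_REMATCH[2]}   # alphanumeric run right after the colon
  echo "$first"
  echo "$second"
else
  echo "no match" >&2
fi
```

${BASH_REMATCH[0]} would hold the entire matched string, while indices 1 and 2 hold the two parenthesized groups.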