Bash regex gotcha - regex

I have a small problem that I really can't understand:
bash -c 'if [[ "hello" =~ ^[a-zA-Z0-9]\{1,\}\\.$ ]] ; then echo "OK" ; else echo "KO" ; fi'
I think this should give me KO, but it gives me OK...
I would like to match strings with at least one character that end with a dot...
I finally noticed that it works with bash version 4.1.5 but not with version 3.2.25.
How should I proceed with this version?
EDIT:
I found a workaround that works, but I don't know why I had to put the escaped dot between brackets:
bash -c 'if [[ "hello" =~ ^[a-zA-Z0-9]{1,}[\.]$ ]] ; then echo "OK" ; else echo "KO" ; fi'

You did not escape the dot, so it is used as a wildcard and matches any character. Replace the . with \. Also, instead of {1,}, use +, because they are equivalent.

. is special in regular expressions ("match any single character"). Escape it as \.
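For illustration, here is a minimal sketch of the fix described above. The function name ends_with_dot is hypothetical. The regex is kept in a variable so that the shell does not get a chance to reinterpret the backslash, and + stands in for {1,}.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the argument consists of one or more
# alphanumeric characters followed by a literal dot.
ends_with_dot() {
    local re='^[a-zA-Z0-9]+\.$'   # \. is a literal dot; + means "one or more"
    [[ $1 =~ $re ]]
}

for s in "hello" "hello." "."; do
    if ends_with_dot "$s"; then echo "$s: OK"; else echo "$s: KO"; fi
done
```

This should print KO for "hello", OK for "hello." and KO for a lone ".".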

Related

Correct way to filter results with if statement in bash loop

I'm trying to work out a loop that will let me ignore some matches. So far I have:
for d in /home/chambres/web/x.org/public_html/2018/js/lib/*.js ; do
if [[ $d =~ /*.min.js/ ]];
then
echo "ignore $d"
else
filename="${d##*/}"
echo "$d"
#echo "$filename"
fi
done
However when I run it, they still seem to get included. What am I doing wrong?
/home/chambres/web/x.org/public_html/2018/js/lib/underscore.js.min.js
/home/chambres/web/x.org/public_html/2018/js/lib/tiny-slider.js
/home/chambres/web/x.org/public_html/2018/js/lib/tiny-slider.js.min.js
/home/chambres/web/x.org/public_html/2018/js/lib/underscore.js
BTW I'm a bit of a newbie with bash, so please be kind ;)
In Bash, regular expressions are not enclosed in /, so you should change your test to:
if [[ $d =~ \.min\.js$ ]]
As well as removing the enclosing /, I have escaped the . (otherwise they would match any character) and added a $ to match the end of the string.
But in fact you can use a simpler (and marginally faster) glob match in this case:
if [[ $d = *.min.js ]]
This matches any string that ends in .min.js.
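Here is a small sketch that puts the two tests side by side. The function names and the sample paths are made up for the example; only the two patterns come from the answer above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; both succeed when the path ends in ".min.js".
is_min_regex() {
    local re='\.min\.js$'     # escaped dots, anchored at the end of the string
    [[ $1 =~ $re ]]
}

is_min_glob() {
    [[ $1 == *.min.js ]]      # an unquoted right-hand side is a glob pattern
}

# Sample paths (assumed for the demo, not taken from a real directory).
for f in lib/underscore.js lib/underscore.js.min.js; do
    if is_min_glob "$f"; then echo "ignore $f"; else echo "$f"; fi
done
```

With these two sample paths, only the second one is reported as ignored.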

Difference between grep -E regex and Bash regex in conditional expression

For the same regex applied to the same string, why does grep -E match, but the Bash =~ operator in [[ ]] does not?
$ D=Dw4EWRwer
$ echo $D|grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$' || echo wrong pattern
$ [[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]] || echo wrong pattern
wrong pattern
Update: I confirm this worked:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]\ _-]{1,22}$ ]] || echo wrong pattern
The problem (for both versions of the code) is on this character class:
[[:alnum:]_-\ ]
In the grep version, because the regex is enclosed in single quotes, the backslash doesn't escape anything and the character range received by grep is exactly how it is represented above.
In the bash version, the backslash (\) escapes the space that follows it and the actual character class used by [[ ]] to test is [[:alnum:]_- ].
Because the underscore (_) comes after both space ( ) and backslash (\) in the ASCII table, the dash in each class describes a range whose end comes before its start, so neither of these character classes is correct.
For the bash version you can use:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_-\ ]{1,22}$ ]]; echo $?
to verify its outcome. If the regex is incorrect, the exit code is 2.
If you want to put a dash (-) into a character class, you have to put it either as the first character in the class (just after [, or after [^ if it is a negating class) or as the last character in the class (right before the closing ]).
The grep version of the code should be (there is no need to escape anything inside a string enclosed in single quotes):
$ echo $D | grep -qE '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' || echo wrong pattern
The bash version of your code should be:
[[ "${D}" =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] || echo wrong pattern
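As a rough sketch of the exit-code check mentioned above, the helper below (regex_status is a hypothetical name) maps the three possible results of [[ =~ ]] to a message. The deliberately broken pattern '[a-z' is just an easy way to trigger status 2; it is not taken from the question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report how [[ =~ ]] judged a regex against a string.
regex_status() {
    local re=$1 str=$2
    [[ $str =~ $re ]]
    case $? in
        0) echo "match" ;;
        1) echo "no match" ;;
        2) echo "invalid regex" ;;
    esac
}

# The dash is the last character in the class, so it is a literal dash.
regex_status '^[[:alnum:]_ -]+$' 'Dw4 EW-Rwer'
regex_status '^[[:alnum:]_ -]+$' 'Dw4/EWR'
# Unterminated bracket expression: the regex does not compile.
regex_status '[a-z' 'abc'
```

Checking for 2 separately helps tell "the string does not match" apart from "the pattern itself is broken".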
Based on your comment, you want the bracket expression to contain alphanumeric characters, spaces, underscores and dashes, so the dash is not supposed to indicate a range. To add a hyphen to a bracket expression, it has to be the first or last character in it. Additionally, you don't have to escape things in bracket expressions, so you can drop the backslash. Your grep regex includes a literal \ in the bracket expression:
$ grep -q '[\]' <<< '\' && echo "Match"
Match
In the Bash regex, the space has to be escaped because the string is first read by the shell, but see below how to avoid that.
First, fixing your regex:
^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$
The backslash is gone, and the hyphen is moved to the end. Using this with grep works fine:
$ D=Dw4EWRwer
$ grep -E '^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
To use the regex within [[ ]] directly, the space has to be escaped:
$ [[ $D =~ ^[A-Z][A-Za-z0-9]{1,2}[[:alnum:]_\ -]{1,22}$ ]] && echo "Match"
Match
I would make the following changes:
Use character classes where possible: [A-Z] is [[:upper:]], [A-Za-z0-9] is [[:alnum:]]
Store the regex in a variable for usage in [[ ]]; this has two advantages: no escaping characters special to the shell, and compatibility with older Bash versions, as the quoting requirements changed between 3.1 and 3.2 (see the Patterns article in the BashGuide).
The regex would then become this for grep:
$ grep -E '^[[:upper:]][[:alnum:]][[:alnum:]_ -]{1,22}$' <<< "$D"
Dw4EWRwer
and this in Bash:
$ re='^[[:upper:]][[:alnum:]][[:alnum:]_ -]{1,22}$'
$ [[ $D =~ $re ]] && echo "Match"
Match

Regular expression in bash s character

I have very strange issue with s character.
This works:
[[ "import scala" =~ ^import\s*.+cala$ ]] && echo "yes"
but this doesn't work:
[[ "import scala" =~ ^import\s*scala$ ]] && echo "yes"
I tried to escape the s, but it didn't work.
How to solve this issue?
\s doesn't work with bash regex. Use [[:blank:]] instead to match a space or tab character:
[[ "import scala" =~ ^import[[:blank:]].*scala$ ]] && echo "yes"
yes
PS: [[:space:]] is the equivalent of \s; it also matches \n.
Also note that you must use .* instead of .+ before scala, to match 0 or more characters instead of 1 or more, because the space has already been matched by [[:blank:]].
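A variation on the above, as a sketch only: if the goal is "import, some whitespace, then scala", the POSIX class can be combined with + and kept in a variable. The function name is_scala_import is made up, and the stricter regex is my assumption about the intent, not something the question states.

```bash
#!/usr/bin/env bash
# Hypothetical helper: "import", one or more whitespace characters, "scala".
is_scala_import() {
    local re='^import[[:space:]]+scala$'   # [[:space:]] is the POSIX spelling of \s
    [[ $1 =~ $re ]]
}

for s in "import scala" $'import\tscala' "importscala"; do
    if is_scala_import "$s"; then echo "yes"; else echo "no"; fi
done
```

This should print yes, yes, no: the space and the tab are accepted, the string without whitespace is not.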
An unquoted \s loses its special meaning in the shell (the backslash is removed, leaving a plain s). Try storing the regex in a variable, as the Bash manual suggests:
ex='^import\s+scala$'; [[ "import scala" =~ $ex ]] && echo "yes"
This works on my machine.

How to match repeated characters using regular expression operator =~ in bash?

I want to know whether a string contains a letter repeated 6 times or more, using the =~ operator.
a="aaaaaaazxc2"
if [[ $a =~ ([a-z])\1{5,} ]];
then
echo "repeated characters"
fi
The code above does not work.
Bash's regex flavor, i.e. ERE, doesn't support backreferences. ksh93 and zsh support them, though.
As an alternate solution, you can do it using extended regex option in grep:
a="aaaaaaazxc2"
grep -qE '([a-zA-Z])\1{5}' <<< "$a" && echo "repeated characters"
repeated characters
EDIT: Some ERE implementations support backreference as an extension. For example Ubuntu 14.04 supports it. See snippet below:
$> echo $BASH_VERSION
4.3.11(1)-release
$> a="aaaaaaazxc2"
$> re='([a-z])\1{5}'
$> [[ $a =~ $re ]] && echo "repeated characters"
repeated characters
[[ $var =~ $regex ]] parses a regular expression in POSIX ERE syntax.
See the POSIX regex standard, emphasis added:
BACKREF - Applicable only to basic regular expressions. The character string consisting of a <backslash> character followed by a single-digit numeral, '1' to '9'.
Backreferences are not formally specified by the POSIX standard for ERE, so they are not guaranteed to be available in bash's native regex syntax (support depends on platform-specific libc extensions). Portable scripts therefore need external tools (awk, grep, etc.).
You do not need the full power of backreferences for this specific case of single-character repeats. You could just build a regex that checks for a repeat of every single lowercase letter:
regex="a{6}"
for x in {b..z} ; do regex="$regex|$x{6}" ; done
if [[ "$a" =~ ($regex) ]] ; then echo "repeated characters" ; fi
The regex built with the above for loop looks like
> echo "$regex" | fold -w60
a{6}|b{6}|c{6}|d{6}|e{6}|f{6}|g{6}|h{6}|i{6}|j{6}|k{6}|l{6}|
m{6}|n{6}|o{6}|p{6}|q{6}|r{6}|s{6}|t{6}|u{6}|v{6}|w{6}|x{6}|
y{6}|z{6}
This regular expression behaves as you would expect
> if [[ "abcdefghijkl" =~ ($regex) ]] ; then \
echo "repeated characters" ; else echo "no repeat detected" ; fi
no repeat detected
> if [[ "aabbbbbbbbbcc" =~ ($regex) ]] ; then \
echo "repeated characters" ; else echo "no repeat detected" ; fi
repeated characters
Updated following the comment from #sln: replaced the bound expression {6,} with a simple {6}.
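For comparison, here is a sketch of a different technique from the ones in the answers above: no regex at all, just a loop that counts consecutive identical letters. The function name has_run and its optional second argument are hypothetical, and non-letters are assumed to reset the count.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if $1 contains the same letter at least
# $2 times in a row (default 6). Non-letters break a run.
has_run() {
    local s=$1 min=${2:-6} prev='' count=0 i c
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        if [[ $c != [[:alpha:]] ]]; then
            prev=''                 # not a letter: start over
            count=0
        elif [[ $c == "$prev" ]]; then
            count=$(( count + 1 ))  # same letter as before: extend the run
        else
            prev=$c                 # new letter: run of length 1
            count=1
        fi
        if (( count >= min )); then
            return 0
        fi
    done
    return 1
}

a="aaaaaaazxc2"
if has_run "$a"; then echo "repeated characters"; fi
```

This avoids any dependence on whether the platform's regex library supports backreferences, at the cost of being slower on long strings.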

What does this match : bash regex

if [[ "$len" -lt "$MINLEN" && "$line" =~ \[*\.\] ]]
This is from the Advanced Bash-Scripting Guide, "Example 10-1. Inserting a blank line between paragraphs in a text file".
As I understand it, this matches "any string or a dot character". Right?
It matches zero or more open bracket characters (\[*), followed by a period and a close square bracket (\.\]). Note that it only requires that a match exist somewhere in "$line", not that the whole string match. Here's a demo:
$ showmatch() { [[ "$1" =~ \[*\.\] ]] && echo "matched: '${BASH_REMATCH[0]}'" || echo "no match"; }
$ showmatch "abc[.]def"
matched: '[.]'
$ showmatch "abc.]def"
matched: '.]'
$ showmatch "abc[[[[[[[.]def"
matched: '[[[[[[[.]'
$ showmatch "abc[[[[[[[xyz.]def"
matched: '.]'
$ showmatch "abc[[[[[[[.xyz]def"
no match
...and I'm pretty sure that's not what it's supposed to be doing in that example script.
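To make the "somewhere in the string" point concrete, here is a sketch comparing the pattern with an anchored version of it. The function names are made up, and the anchored form is only an illustration, not a claim about what the guide intended. The regexes live in variables, where a ] outside a bracket expression needs no backslash.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: the same pattern, unanchored and anchored.
matches_anywhere() { local re='\[*\.]';   [[ $1 =~ $re ]]; }
matches_whole()    { local re='^\[*\.]$'; [[ $1 =~ $re ]]; }

for s in "abc[.]def" "[[.]" ".]"; do
    if matches_anywhere "$s"; then a=yes; else a=no; fi
    if matches_whole "$s"; then w=yes; else w=no; fi
    echo "$s anywhere=$a whole=$w"
done
```

Only the first string differs between the two: it contains a match, but the string as a whole does not match.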
It means any string ending with a dot inside brackets, for example: [.]
[abc.]
Update: +1 to Gordon Davisson, who has summed it up pretty well... so I've redacted my original post
In brief: You can test the result of a bash regex match like this:
[[ "[*.]" =~ \[*\.\] ]] ; echo "${BASH_REMATCH[0]}"