Why is zsh not executing my regular expression correctly?

I started using zsh a while back (installed using brew as described here).
Everything's working great but I've noticed that the regular expression operator =~ doesn't really work.
For example, if I want to extract the file name of a JSON file from a path, I get the correct result in bash but not in zsh.
I.e.
bash -s
[[ "/a/b/c/file.json" =~ ([[:alnum:]\-]+)\.json$ ]] && echo ${BASH_REMATCH[1]}
works and yields file but the same thing in zsh just prints an empty line.
Does anyone know why and how to fix this? Do I have to enable regex support somehow?

Never mind, found it. The zsh docs clarify that matches are stored in the array match rather than in BASH_REMATCH as in Bash.
So, obtaining the match like this
[[ "/a/b/c/file.json" =~ ([[:alnum:]\-]+)\.json$ ]] && echo ${match[1]}
works as expected.
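If the same script has to run under both shells, one option is to branch on which shell is running. This is only a sketch: first_group is a hypothetical helper name, and it assumes default zsh options (zsh also documents a BASH_REMATCH option that makes =~ fill BASH_REMATCH instead, which is not used here).

```shell
# Hypothetical helper: print the first capture group under bash or zsh.
first_group() {
  # Keep the regex in a variable so both shells parse it the same way.
  local re='([[:alnum:]-]+)\.json$'
  if [[ $1 =~ $re ]]; then
    if [ -n "${ZSH_VERSION:-}" ]; then
      printf '%s\n' "${match[1]}"         # zsh stores groups in match
    else
      printf '%s\n' "${BASH_REMATCH[1]}"  # bash stores them in BASH_REMATCH
    fi
  fi
}

first_group "/a/b/c/file.json"
```

Called as above, it should print file in either shell.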

Related

BASH_REMATCH empty

I'm trying to capture parts of some input with a regex in Bash, but BASH_REMATCH comes up empty:
#!/usr/bin/env /bin/bash
INPUT=$(cat input.txt)
TASK_NAME="MailAccountFetch"
MATCH_PATTERN="(${TASK_NAME})\s+([0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2})"
while read -r line; do
if [[ $line =~ $MATCH_PATTERN ]]; then
TASK_RESULT=${BASH_REMATCH[3]}
TASK_LAST_RUN=${BASH_REMATCH[2]}
TASK_EXECUTION_DURATION=${BASH_REMATCH[4]}
fi
done <<< "$INPUT"
My input is:
MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00
By debugging the script (VS Code+Bash ext) I can see the INPUT string matches as the code goes inside the IF but BASH_REMATCH is not populated with my two capture groups.
I'm on:
GNU bash, version 4.4.0(1)-release (x86_64-pc-linux-gnu)
What could be the issue?
LATER EDIT
Accepted Answer
Accepting most explanatory answer.
What finally resolved the issue:
The bashdb/VS Code environment was causing the empty BASH_REMATCH. The code works fine when run on its own.
As Cyrus shows in his answer, a simplified version of your code - with the same input - does work on Linux in principle.
That said, your code references capture groups 3 and 4, whereas your regex only defines 2.
In other words: ${BASH_REMATCH[3]} and ${BASH_REMATCH[4]} are empty by definition.
Note, however, that if =~ signals success, BASH_REMATCH is never fully empty: at the very least - in the absence of any capture groups - ${BASH_REMATCH[0]} will be defined.
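To make groups 3 and 4 exist, the regex has to define them. Here is a rough sketch against the sample input line from the question. The two extra groups (a word for the result and a number for the duration) are my guess at what the script wants, and the variable names are made up for illustration.

```shell
line='MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00'
task_name='MailAccountFetch'

# Timestamp sub-pattern, written with POSIX classes only (no \s).
ts='[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]][0-9]{2}:[0-9]{2}:[0-9]{2}'

# Four capture groups: task name, last run, result, duration (assumed layout).
re="(${task_name})[[:space:]]+(${ts})[[:space:]]+([[:alpha:]]+)[[:space:]]+([0-9.]+)"

if [[ $line =~ $re ]]; then
  task_last_run=${BASH_REMATCH[2]}
  task_result=${BASH_REMATCH[3]}
  task_duration=${BASH_REMATCH[4]}
  printf '%s | %s | %s\n' "$task_last_run" "$task_result" "$task_duration"
fi
```

With that input, the three variables should hold 2017-03-29 19:00:00, Success and 5.0.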
There are some general points worth making:
Your shebang line reads #!/usr/bin/env /bin/bash, which is effectively the same as #!/bin/bash.
/usr/bin/env is typically used if you want a version other than /bin/bash to execute, one you've installed later and put in the PATH (too):
#!/usr/bin/env bash
ghoti points out that another reason for using #!/usr/bin/env bash is to also support less common platforms such as FreeBSD, where bash, if installed, is located in /usr/local/bin rather than the usual /bin.
In either scenario it is less predictable which bash binary will be executed, because it depends on the effective $PATH value at the time of invocation.
=~ is one of the few Bash features that are platform-dependent: it uses the particular regex dialect implemented by the platform's regex libraries.
\s is a character class shortcut that is not available on all platforms, notably not on macOS; the POSIX-compliant equivalent is [[:space:]].
(In your particular case, \s should work, however, because your Bash --version output suggests that you are on a Linux distro.)
It's better not to use all-uppercase shell variable names such as INPUT, so as to avoid conflicts with environment variables and special shell variables.
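If you are unsure which bash a #!/usr/bin/env bash shebang ends up running, a tiny script like this one (just a sketch) reports it:

```shell
#!/usr/bin/env bash
# Report which bash binary was resolved through PATH, and its version.
printf 'running: %s\n' "$BASH"
printf 'version: %s\n' "$BASH_VERSION"
```

The output depends on the machine and on the PATH in effect at the time of invocation.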
Bash uses system libraries to parse regular expressions, and different parsers implement different features. You've come across a place where regex class shorthand strings do not work. Note the following:
$ s="one12345 two"
$ [[ $s =~ ^([a-z]+[0-9]{4})\S*\s+(.*) ]] && echo yep; declare -p BASH_REMATCH
declare -ar BASH_REMATCH=()
$ [[ $s =~ ^([a-z]+[0-9]{4})[^[:space:]]*[[:space:]]+(.*) ]] && echo yep; declare -p BASH_REMATCH
yep
declare -ar BASH_REMATCH=([0]="one12345 two" [1]="one1234" [2]="two")
I'm doing this on macOS as well, but I get the same behaviour on FreeBSD.
Simply replace \s with [[:space:]], \d with [[:digit:]], etc, and you should be good to go. If you avoid using RE shortcuts, your expressions will be more widely understood.
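If you want to see what the local regex library does, you can probe it at run time. This is a sketch only: the probe string and variable names are invented for the example, and the first message will differ by platform.

```shell
# Probe whether this platform's regex library accepts \s (an extension).
probe='^a\sb$'
if [[ 'a b' =~ $probe ]]; then
  printf '%s\n' 'backslash-s is supported here'
else
  printf '%s\n' 'backslash-s is not supported here; use [[:space:]]'
fi

# The POSIX classes should behave the same on every platform.
ws='[[:space:]]'
re="^([a-z]+[0-9]{4})[^[:space:]]*${ws}+(.*)"
s='one12345 two'
[[ $s =~ $re ]] && printf '%s|%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
```

The last line should print one1234|two, matching the declare -p output shown earlier in this answer.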

Regular Expression : bash 3 vs bash 4

The following code, which performs a regular expression check, does not produce the same result in bash 3 and bash 4:
TESTCASE="testcase0"
[[ ${TESTCASE} =~ "^testcase[0-9\.]*$" ]]
echo $?
echo ${BASH_REMATCH}
bash 3.2 outputs a successful regular expression check:
0
testcase0
bash 4.1 fails the regular expression check:
1
<empty line>
I can't identify where in my regex pattern the expression fails. I need code that is compatible with both versions of bash.
Does anyone have a clue what my problem is?
Thanks!
In older versions of Bash (3.1), a quoted regular expression in a test was still treated as a regex. In newer versions, quoting any part of the pattern forces that part to be matched as a literal string, so the match fails.
The solution is to remove the quotes.
The recommended way to use regular expressions is this:
re='^testcase[0-9\.]*$' # single quotes around variable
[[ ${TESTCASE} =~ $re ]] # unquoted variable used in test
This syntax should work in all versions of bash that support regular expressions. The variable isn't strictly necessary but it improves readability. See the regular expressions section of Greg's wiki for more details.
Regarding the use of a variable (from the link above):
For cross-compatibility (to avoid having to escape parentheses, pipes and so on) use a variable to store your regex, e.g. re='^\*( >| *Applying |.*\.diff|.*\.patch)'; [[ $var =~ $re ]] This is much easier to maintain since you only write ERE syntax and avoid the need for shell-escaping, as well as being compatible with all 3.x BASH versions.
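By the way, there's no need to escape the . inside the bracket expression: a dot is literal there, and in POSIX bracket expressions the backslash is literal too, so [0-9\.] also matches a backslash.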
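Putting those points together, a short sketch with the regex in a variable, used unquoted, and without the backslash in the bracket expression (the test strings are made up):

```shell
re='^testcase[0-9.]*$'   # single-quoted when assigned

for t in testcase0 testcase1.2 testcaseX; do
  if [[ $t =~ $re ]]; then   # unquoted when used
    echo "$t: match"
  else
    echo "$t: no match"
  fi
done
```

The first two strings should match and testcaseX should not.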

Regex compatibility across multiple shells

I'm checking for the correct date format in my bash script with the following code:
if [[ $variable == [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9] ]]
The format should be: YYYY-MM-DD
This works well in bash, but I have problems when trying to run it in dash or sh. Could you help me rewrite this so it's compatible with dash and sh? Or, alternatively, suggest a different solution that can be used in all shells?
Thanks in advance !
case should work in dash, too:
case 0015-18-32 in
[0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9] ) echo yes ;;
esac
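Wrapped in a function, the same case test can be reused in an if. A sketch, with has_date_shape as a hypothetical name; note that it checks only the shape of the string, so 0015-18-32 passes even though it is not a real date.

```shell
# Hypothetical helper: succeeds if $1 looks like YYYY-MM-DD (shape only).
has_date_shape() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

variable=2017-03-29
if has_date_shape "$variable"; then
  echo yes
else
  echo no
fi
```

This uses nothing beyond POSIX sh syntax, so it should run in dash, sh and bash alike.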

Why is Bash pattern match for ?(*[[:class:]])foobar slow?

I have a text file foobar.txt which is around 10KB, not that long. Yet the following match search command takes about 10 seconds on a high-performance Linux machine.
bash>shopt -s extglob
bash>[[ `cat foobar.txt` == ?(*[[:print:]])foobar ]]
There is no match: all the characters in foobar.txt are printable but there is no string "foobar".
The search should try two alternatives, neither of which will match:
"foobar": that's instantaneous.
*[[:print:]]foobar: this should scan the file character by character in one pass, each time checking whether the next characters are [[:print:]]foobar. This should also be fast; there is no way it should take a millisecond per character.
In fact, if I drop ?, that is, do
bash>[[ `cat foobar.txt` == *[[:print:]]foobar ]]
this is instantaneous. But this is simply the second alternative above, without the first.
So why does this take so long?
The glob matcher in bash is just not optimized. See, for example, this bug-bash thread, during which bash maintainer Chet Ramey says:
It's not a regexp engine, it's an interpreted-on-the-fly matcher.
Since bash includes a regexp engine as well (use =~ instead of == inside [[ ]]), there's probably not much motivation to do anything about it.
On my machine, the equivalent regexp (^(.*[[:print:]])?foobar$) suffered considerably from locale-aware [[:print:]]; for some reason, that didn't affect the glob matcher. Setting LANG=C made the regexp work fine.
However, for a string that size, I'd use grep.
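A rough sketch of the grep route. It assumes the goal is to test whether the last line of the file ends in foobar, which is not exactly the same as the glob when the file has several lines. The sample file is a stand-in created just for the example.

```shell
# Hypothetical sample file standing in for the question's foobar.txt.
tmpfile=$(mktemp)
printf 'some printable text\nmore text ending in xfoobar\n' > "$tmpfile"

# LC_ALL=C keeps [[:print:]] from being evaluated in a locale-aware way.
if tail -n 1 "$tmpfile" | LC_ALL=C grep -Eq '(^|[[:print:]])foobar$'; then
  ends_with_foobar=yes
else
  ends_with_foobar=no
fi
echo "$ends_with_foobar"

rm -f "$tmpfile"
```

For this sample file the result should be yes.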
As others have noted, you're probably better off using grep.
That said, if you wanted to stick with a [[ conditional - combining @konsolebox's and @rici's advice - you'd get:
[[ $(<foobar.txt) =~ (^|[[:print:]])foobar$ ]]
Edit: Regex updated to match the OP's requirements - thanks, @rici.
Generally speaking, it is preferable to use regular expressions for string matching (via the =~ operator, in this case), rather than globbing patterns (via the == operator), whose primary purpose is matching file and folder names.
Part of the cost is simply that you fork twice (once for the subshell and once for the cat command), and the cat binary also has to be loaded before it can run.
[[ `cat foobar.txt` == *[[:print:]]foobar ]]
This form would be faster:
[[ $(<foobar.txt) == *[[:print:]]foobar ]]
Or
IFS= read -r LINE < foobar.txt && [[ $LINE == *[[:print:]]foobar ]]
If that doesn't make a difference, the speed of pattern matching could be related to the version of Bash you're using.
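One caveat when choosing between the two forms: $(<file) reads the whole file, while read -r takes only the first line. A small sketch with a made-up two-line file:

```shell
tmpfile=$(mktemp)
printf 'first line\nsecond line xfoobar\n' > "$tmpfile"

whole=$(<"$tmpfile")               # entire file, no external process
IFS= read -r first < "$tmpfile"    # first line only

[[ $whole == *[[:print:]]foobar ]] && echo "whole file: match"
[[ $first == *[[:print:]]foobar ]] || echo "first line: no match"

rm -f "$tmpfile"
```

So the read variant is only equivalent when the file has a single line.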

Match unicode character in zsh regex

I want to make sure that a variable does not contain a specific character (in this case an 'α'), but the following code fails (returns 1):
FOO="test" && [[ $FOO =~ '^[^α]*$' ]]
Edit: Changed the pattern based on feedback from stema below to require matching only “non-'α'” characters from start to end.
Replacing 'α' with e.g. 'x' works as expected. Why does it fail with an 'α', and how can I make this work?
System info:
$ zsh --version
zsh 4.3.11 (i386-apple-darwin11.0)
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL="en_GB.UTF-8"
Edit 2: I now tested on a Linux machine running Ubuntu 11.10 with zsh 4.3.11 with identical locale settings, and there it works – i.e. FOO="test" && [[ $FOO =~ '^[^α]*$' ]] returns success. I'm running Mac OS X 10.7.2.
With the regex .*[^α].* you can't test that α is not in the string. What it tests is whether there is at least ONE character in the string that is not an α.
If you want to check that this character does not occur in the string, do this:
FOO="test" && [[ $FOO =~ '^[^α]*$' ]]
this will check if the complete string from the start to the end consists of non "α" characters.
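If the regex library turns out to be the problem, a glob test avoids it altogether. This is a sketch that swaps in case pattern matching for the =~ operator; contains_alpha is a hypothetical helper name, and I have not verified it on zsh 4.3.11 on OS X.

```shell
# Hypothetical helper: succeeds if $1 contains the character α.
contains_alpha() {
  case $1 in
    *α*) return 0 ;;
    *)   return 1 ;;
  esac
}

FOO="test"
if contains_alpha "$FOO"; then
  echo "contains α"
else
  echo "no α"
fi
```

For FOO="test" this should take the else branch.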
In a regex engine that supports look-aheads (PCRE, for example; POSIX extended regular expressions, which zsh's =~ uses by default, do not), the simplest way of expressing this is with a negative look-ahead anchored at the start:
^(?!.*α)
This is saying "when looking forward from the start, I shouldn't be able to see α anywhere".
The advantage of using look-aheads is that they are non-capturing, so you can combine them with other capturing regexes. E.g., to find groups of digits in quotes in input that doesn't contain an α, use this: ^(?!.*α)"(\d+)"
For some reason I ran into a similar problem on my build system, which has ZSH 4.3.17, while my notebook has ZSH 5.0.2 (where Unicode works as expected). It seems that ZSH 5 does not have this problem with Unicode characters in regular expression patterns.
Specifically, parsing the key/value pair:
[[ "revision/author=Ľudovít Lučenič" =~ '^([^=]+)=(.*)$' ]]
echo "$match[1]:$match[2]"
renders
: # ZSH 4.3.17
revision/author:Ľudovít Lučenič # ZSH 5.0.2
I also suspect a more general shortcoming in ZSH 4's Unicode support.
Update: after some investigation, I have found out that the dot in regexp does not match the letter 'č' in ZSH 4. Once I updated the pattern to:
[[ "revision/author=Ľudovít Lučenič" =~ '^([^=]+)=((.|č)*)$' ]]
echo "$match[1]:$match[2]"
I am getting the same result in both ZSH versions. I do not know, though, why exactly this letter is the problem here. However, it may help somebody work around this shortcoming.