BASH_REMATCH empty - regex

I'm trying to capture parts of some input with a regex in Bash, but BASH_REMATCH comes back empty.
#!/usr/bin/env /bin/bash
INPUT=$(cat input.txt)
TASK_NAME="MailAccountFetch"
MATCH_PATTERN="(${TASK_NAME})\s+([0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2})"
while read -r line; do
if [[ $line =~ $MATCH_PATTERN ]]; then
TASK_RESULT=${BASH_REMATCH[3]}
TASK_LAST_RUN=${BASH_REMATCH[2]}
TASK_EXECUTION_DURATION=${BASH_REMATCH[4]}
fi
done <<< "$INPUT"
My input is:
MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00
By debugging the script (VS Code+Bash ext) I can see the INPUT string matches as the code goes inside the IF but BASH_REMATCH is not populated with my two capture groups.
I'm on:
GNU bash, version 4.4.0(1)-release (x86_64-pc-linux-gnu)
What could be the issue?
LATER EDIT
Accepted Answer
Accepting most explanatory answer.
What finally resolved the issue:
The bashdb/VS Code environment is causing the empty BASH_REMATCH. The code works fine when run on its own.

As Cyrus shows in his answer, a simplified version of your code - with the same input - does work on Linux in principle.
That said, your code references capture groups 3 and 4, whereas your regex only defines 2.
In other words: ${BASH_REMATCH[3]} and ${BASH_REMATCH[4]} are empty by definition.
Note, however, that if =~ signals success, BASH_REMATCH is never fully empty: at the very least - in the absence of any capture groups - ${BASH_REMATCH[0]} will be defined.
There are some general points worth making:
Your shebang line reads #!/usr/bin/env /bin/bash, which is effectively the same as #!/bin/bash.
/usr/bin/env is typically used if you want a version other than /bin/bash to execute, one you've installed later and put in the PATH (too):
#!/usr/bin/env bash
ghoti points out that another reason for using #!/usr/bin/env bash is to also support less common platforms such as FreeBSD, where bash, if installed, is located in /usr/local/bin rather than the usual /bin.
In either scenario it is less predictable which bash binary will be executed, because it depends on the effective $PATH value at the time of invocation.
=~ is one of the few Bash features that are platform-dependent: it uses the particular regex dialect implemented by the platform's regex libraries.
\s is a character class shortcut that is not available on all platforms, notably not on macOS; the POSIX-compliant equivalent is [[:space:]].
(In your particular case, \s should work, however, because your Bash --version output suggests that you are on a Linux distro.)
It's better not to use all-uppercase shell variable names such as INPUT, so as to avoid conflicts with environment variables and special shell variables.
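Putting these points together, a minimal sketch of the loop might look like the following. It assumes the pattern is extended with a third and fourth capture group for the result and the duration (the original regex defines only two); those extra groups, the helper variable ts, and the lowercase variable names are illustrative and not the asker's code.

```shell
#!/usr/bin/env bash
task_name="MailAccountFetch"
# Timestamp sub-pattern, using POSIX [[:space:]] instead of \s for portability.
ts='[0-9]{4}-[0-9]{2}-[0-9]{2}[[:space:]][0-9]{2}:[0-9]{2}:[0-9]{2}'
# Groups: 1 = task name, 2 = last run, 3 = result, 4 = duration.
# Groups 3 and 4 are an assumed extension of the original two-group pattern.
match_pattern="(${task_name})[[:space:]]+(${ts})[[:space:]]+([[:alpha:]]+)[[:space:]]+([0-9.]+)"
input='MailAccountFetch 2017-03-29 19:00:00 Success 5.0 Second(s) 2017-03-29 19:03:00'

while IFS= read -r line; do
  if [[ $line =~ $match_pattern ]]; then
    task_last_run=${BASH_REMATCH[2]}
    task_result=${BASH_REMATCH[3]}
    task_execution_duration=${BASH_REMATCH[4]}
  fi
done <<< "$input"

printf '%s\n' "$task_last_run" "$task_result" "$task_execution_duration"
```

Because the loop reads from a here-string rather than a pipe, the variables assigned inside it are still set after the loop ends.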

Bash uses system libraries to parse regular expressions, and different parsers implement different features. You've come across a place where regex class shorthand strings do not work. Note the following:
$ s="one12345 two"
$ [[ $s =~ ^([a-z]+[0-9]{4})\S*\s+(.*) ]] && echo yep; declare -p BASH_REMATCH
declare -ar BASH_REMATCH=()
$ [[ $s =~ ^([a-z]+[0-9]{4})[^[:space:]]*[[:space:]]+(.*) ]] && echo yep; declare -p BASH_REMATCH
yep
declare -ar BASH_REMATCH=([0]="one12345 two" [1]="one1234" [2]="two")
I'm doing this on macOS as well, but I get the same behaviour on FreeBSD.
Simply replace \s with [[:space:]], \d with [[:digit:]], etc, and you should be good to go. If you avoid using RE shortcuts, your expressions will be more widely understood.

Related

Regular Regex is not working as expected in Bitbucket client side hooks

I was trying to enforce a regex match to keep a consistent naming convention when creating a branch in bitbucket.
My regex requirements:
starts with feature/ or release/
must be followed by ABCD-
have any number after that (at least one digit is mandatory)
must end with only numbers, no other character is allowed
what I am trying are these alternatives:
^(feature|release)\/(ABCD-\d+$).*
^(feature|release)\/(ABCD-\d+$)
It should be working fine, I checked on regex101
and my example feature/ABCD-99 is matching in this site.
But when I am using the same regex in a .git/hooks/pre-push, my source tree is failing even when I supply the tested example. i.e. feature/ABCD-99 is failing.
Please help me understand: do regexes behave differently at different sites/vendors?
How can I correct my regex so that it works with Atlassian Bitbucket (I am using SourceTree)?
Adding my pre-push script here:
#!/bin/sh
echo "inside pre-push hook"
local_branch="$(git rev-parse --abbrev-ref HEAD)"
echo "local branch is: $local_branch "
branch_regex="^(feature|release)\/(ABCD-\d+$).*"
if [[ ! $local_branch =~ $branch_regex ]]
then
echo "Bad naming convention of branch. Rejected"
exit 1
fi
exit 0
Adding output when giving ABCD-99 (valid scenario regex) in source tree:
Pushing to git
inside pre-push hook
local branch is: feature/ABCD-99
Bad naming convention of branch. Rejected
The problem you're seeing is because you're using bash and the =~ match requires a POSIX extended regular expression. That means that you can use character classes like [0-9], but you cannot use Perl-compatible notation like \d. In general, Perl-compatible regexes are very powerful, but they require an additional library whereas POSIX extended regexes are available on almost every Unix system, so they're a better fit for a portable shell like bash.
So your regex should be something like this: ^(feature|release)\/ABCD-[0-9]+$.
Do note that the [[ command and the =~ operator to it are bash extensions and are not found in other shells. Using them is fine, but if you want to use bash, your shebang needs to be #!/bin/bash (or #!/usr/bin/env bash), since /bin/sh is not bash on all systems (e.g., Debian, Ubuntu, the BSDs, and some macOS systems). The POSIX way to write this check is this:
if ! echo "$local_branch" | grep -qsE "$branch_regex"
then
echo "Bad naming convention of branch. Rejected"
exit 1
fi
On Linux systems, the regex(7) manual page describes the syntax allowed for POSIX extended (modern) and basic (obsolete) regular expressions.
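For the Bash route, the check could be sketched as below. The function name is_valid_branch is hypothetical, and the sample branch names are made up for illustration; only the regex and the bash shebang come from the advice above.

```shell
#!/usr/bin/env bash
# POSIX ERE: [0-9] instead of \d, anchored at both ends.
branch_regex='^(feature|release)/ABCD-[0-9]+$'

# Hypothetical helper: succeeds if the branch name follows the convention.
is_valid_branch() {
  [[ $1 =~ $branch_regex ]]
}

for name in feature/ABCD-99 release/ABCD-1 feature/ABCD-99x bugfix/ABCD-7; do
  if is_valid_branch "$name"; then
    echo "$name: ok"
  else
    echo "$name: rejected"
  fi
done
```

In an actual pre-push hook, the loop would be replaced by a single call on the value of $local_branch, exiting with status 1 when the check fails.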
After a lot of brainstorming and trial and error, I found out that Atlassian/Bitbucket Git/Sourcetree somehow does not accept '\d' to match digits in a regex; only '[0-9]' is allowed.
Changed above code from
^(feature|release)\/(ABCD-\d+$) to ^(feature|release)\/(ABCD-[0-9]+$) to make it work.

bash 2.0 string matching

I'm on GNU bash, version 2.05b.0(1)-release (2002). I'd like to determine whether the value of $1 is a path in one of those /path/*.log rules in, say, /etc/logrotate.conf. It's not my box so I can't upgrade it.
Edit: my real goal is given /path/actual.log answer whether it is already governed by logrotate or if all the current rules miss it. I wonder then if my script should just run logrotate -d /etc/logrotate.conf and see if /path/actual.log is in the output. This seems simpler and covers all the cases as opposed to this other approach.
But I still want to know how to approach string matching in Bash 2.0 in general...
the line itself can start with some white space or none
it's not a match if it is in a commented line (comments are lines where the first non white space char is #)
there can be one or more paths on the same line to the left of $1
like if $1 is /my/path/*.log and the line in question is
/other/path*.log /yet/another.log /my/path/*.log {
there can be one or more paths to the right as well
the line itself can end with { and even more white space or not
paths can be contained in double-quotes or not
it can be assumed that the file is a valid logrotate conf file.
I have something that seems to work in Bash 4 but not in Bash 2.05. Where can I go to read what Bash 2.0 supports? How would this matching be checked in Bash 2.0?
You can find a terse bash changelog here.
You'll see that =~, the regex-matching operator, didn't get introduced until version 3.0.
Thus, your best bet is to use a utility to perform the regex matching for you; e.g.:
if grep -Eq '<your-extended-regex>' <<<"$1"; then ...
grep -Eq '<your-extended-regex>' <<<"$1":
IS like [[ $1 =~ <your-extended-regex> ]] in Bash 3.0+ in that its exit code indicates whether the literal value of $1 matches the extended regex <your-extended-regex>
Note that Bash 3.1 changed the interpretation of the RHS to treat quoted (sub)strings as literals.
Also note that grep -E may support a slightly different regular-expression dialect.
is NOT like it in that the grep solution cannot return capture groups; by contrast, Bash 3.0+ provides the overall match and capture groups via special array variable ${BASH_REMATCH[@]}.
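Applied to the logrotate question, a utility-only check might be sketched as follows. It relies on grep and tr alone, so it does not depend on =~. The function name rule_lists_path is hypothetical, and the sketch makes a simplifying assumption: every whitespace-separated word on a non-comment line is treated as a candidate path, so paths with embedded spaces are not handled and lines inside a rule body are not distinguished from rule headers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a logrotate-style config on stdin and succeeds
# if the literal path in $1 appears as a word on a non-comment line.
rule_lists_path() {
  grep -v '^[[:space:]]*#' |      # drop comment lines
    tr -d '"{' |                  # strip double quotes and opening braces
    tr -s '[:space:]' '\n' |      # put each word on its own line
    grep -Fxq -- "$1"             # literal, whole-line comparison
}

sample_conf='# /commented/out.log {
/other/path*.log "/yet/another.log" /my/path/*.log {
    weekly
}'

if printf '%s\n' "$sample_conf" | rule_lists_path '/my/path/*.log'; then
  echo "listed"
fi
if ! printf '%s\n' "$sample_conf" | rule_lists_path '/commented/out.log'; then
  echo "not listed"
fi
```

The -F flag matters here: the path contains *, which should be compared literally rather than interpreted as a regex operator.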

Why is Bash pattern match for ?(*[[:class:]])foobar slow?

I have a text file foobar.txt which is around 10KB, not that long. Yet the following match search command takes about 10 seconds on a high-performance Linux machine.
bash>shopt -s extglob
bash>[[ `cat foobar.txt` == ?(*[[:print:]])foobar ]]
There is no match: all the characters in foobar.txt are printable but there is no string "foobar".
The search should try two alternatives, neither of which will match:
"foobar": that's instantaneous.
*[[:print:]]foobar: this should scan the file character by character in one pass, checking each time whether the next characters are [[:print:]]foobar. This should also be fast; there is no way it should take a millisecond per character.
In fact, if I drop ?, that is, do
bash>[[ `cat foobar.txt` == *[[:print:]]foobar ]]
this is instantaneous. But this is simply the second alternative above, without the first.
So why does this take so long?
The glob matcher in bash is just not optimized. See, for example, this bug-bash thread, during which bash maintainer Chet Ramey says:
It's not a regexp engine, it's an interpreted-on-the-fly matcher.
Since bash includes a regexp engine as well (use =~ instead of == inside [[ ]]), there's probably not much motivation to do anything about it.
On my machine, the equivalent regexp (^(.*[[:print:]])?foobar$) suffered considerably from locale-aware [[:print:]]; for some reason, that didn't affect the glob matcher. Setting LANG=C made the regexp work fine.
However, for a string that size, I'd use grep.
As others have noted, you're probably better off using grep.
That said, if you wanted to stick with a [[ conditional - combining @konsolebox and @rici's advice - you'd get:
[[ $(<foobar.txt) =~ (^|[[:print:]])foobar$ ]]
Edit: Regex updated to match the OP's requirements - thanks, @rici.
Generally speaking, it is preferable to use regular expressions for string matching (via the =~ operator, in this case), rather than globbing patterns (via the == operator), whose primary purpose is matching file and folder names.
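As a self-contained sketch of the regex form, the check can be wrapped in a function and tried against a temporary file. The function name content_ends_with_foobar and the sample file contents are made up for illustration; the regex is stored in a variable so that the shell does not have to parse it inline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file's content (minus trailing newlines)
# is exactly "foobar" or ends in a printable character followed by "foobar".
content_ends_with_foobar() {
  local content re='(^|[[:print:]])foobar$'
  content=$(<"$1")    # read the file without forking cat
  [[ $content =~ $re ]]
}

tmp=$(mktemp)
printf 'some text foobar\n' > "$tmp"
content_ends_with_foobar "$tmp" && echo "match"
printf 'foobar and more\n' > "$tmp"
content_ends_with_foobar "$tmp" || echo "no match"
rm -f "$tmp"
```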
Simply because you fork bash several times (once for the subshell and once for the cat command), and the cat binary also has to be read in order to execute it.
[[ `cat foobar.txt` == *[[:print:]]foobar ]]
This form would be faster:
[[ $(<foobar.txt) == *[[:print:]]foobar ]]
Or
IFS= read -r LINE < foobar.txt && [[ $LINE == *[[:print:]]foobar ]]
If it doesn't make a difference, the speed of pattern matching could be related to the version of Bash you're using.

Create directory based on part of filename

First of all, I'm not a programmer — just trying to learn the basics of shell scripting and trying out some stuff.
I'm trying to create a function for my bash script that creates a directory based on a version number in the filename of a file the user has chosen in a list.
Here's the function:
lav_mappe () {
shopt -s failglob
echo "[--- Choose zip file, or x to exit ---]"
echo ""
echo ""
select zip in $SRC/*.zip
do
[[ $REPLY == x ]] && . $HJEM/build
[[ -z $zip ]] && echo "Invalid choice" && continue
echo
grep ^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$ $zip; mkdir -p $MODS/out/${ver}
done
}
I've tried messing around with some other commands too:
for ver in $zip; do
grep "^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$" $zip; mkdir -p $MODS/out/${ver}
done
And also find | grep — but I'm doing it wrong :(
But it ends up saying "no match" for my regex pattern.
I'm trying to take the filename the user has selected, then grep it for the version number (ALWAYS x.xx.x somewhere in the filename), and finally create a directory with just that.
Could someone give me some pointers what the command chain should look like? I'm very unsure about the structure of the function, so any help is appreciated.
EDIT:
Ok, this is what the complete function looks like now: (Please note, the sed(1) commands apart from the directory creation were not written by me, just used in my code.)
Pastebin (Long code.)
I've got news for you. You are writing a Bash script, you are a programmer!
Your Regular Expression (RE) is of the "wrong" type. Vanilla grep uses a form known as "Basic Regular Expressions" (BRE), but your RE is in the form of an Extended Regular Expression (ERE). BREs are used by vanilla grep, vi, more, etc. EREs are used by just about everything else: awk, Perl, Python, Java, .Net, etc. The other problem is that you are trying to look for that pattern in the file's contents, not in the filename!
There is an egrep command, or you can use grep -E, so:
echo $zip|grep -E '^[0-9]\.[0-9]{1,2}\.[0-9]{1,2}$'
(note that single quotes are safer than double). By the way, you use ^ at the front and $ at the end, which means the filename ONLY consists of a version number, yet you say the version number is "somewhere in the filename". You don't need the {1} quantifier, that is implied.
BUT, you don't appear to be capturing the version number either.
You could use sed (we also need the -E):
ver=$(echo $zip| sed -E 's/.*([0-9]\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
The \1 on the right means "replace everything (that's why we have the .* at front and back) with what was matched in the parentheses group".
That's a bit clunky, I know.
Now we can do the mkdir (there is no merit in putting everything on one line, and it makes the code harder to maintain):
mkdir -p "$MODS/out/$ver"
${ver} is unnecessary in this case, but it is a good idea to enclose path names in double quotes in case any of the components have embedded white-space.
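The sed step can be sketched as a small function. The name extract_version and the sample file names are hypothetical; the sketch also swaps in sed -n with the p flag, so that nothing is printed when no version is found (with the plain s command, an unmatched file name would be passed through unchanged).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version number embedded in a file name,
# or nothing at all if the name contains no version.
extract_version() {
  printf '%s\n' "$1" | sed -nE 's/.*([0-9]\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/p'
}

ver=$(extract_version "mymod-1.23.4.zip")
echo "version: $ver"

if [[ -z $(extract_version "noversion.zip") ]]; then
  echo "no version found"
fi
```

In the select loop, the result would be assigned with ver=$(extract_version "$zip") and checked for emptiness before the mkdir.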
So, good effort for a "non-programmer", particularly in generating that RE.
Now for Lesson 2
Be careful about using this solution in a general loop. Your question specifically uses select, so we cannot predict which files will be used. But what if we wanted to do this for every file?
Using the solution above in a for or while loop would be inefficient. Calling external processes inside a loop is always bad. There is nothing we can do about the mkdir without using a different language like Perl or Python. But sed, by its nature, is iterative, and we should use that feature.
One alternative would be to use shell pattern matching instead of sed. This particular pattern would not be impossible in the shell, but it would be difficult and raise other questions. So let's stick with sed.
A problem we have is that echo output places a space between each field. That gives us a couple of issues. sed delimits each record with a newline "\n", so echo on its own won't do here. We could replace each space with a new-line, but that would be an issue if there were spaces inside a filename. We could do some trickery with IFS and globbing, but that leads to unnecessary complications. So instead we will fall back to good old ls. Normally we would not want to use ls, shell globbing is more efficient, but here we are using the feature that it will place a new-line after each filename (when used redirected through a pipe).
while IFS= read -r ver
do
mkdir "$ver"
done < <(ls "$SRC"/*.zip|sed -E 's/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
Here I am using process substitution, and this loop will only call ls and sed once. BUT, it calls the mkdir program n times.
Lesson 3
Sorry, but that's still inefficient. We are creating a child process for each iteration; creating a directory needs only one kernel API call, yet we are creating a process just for that? Let's use a more sophisticated language like Perl:
#!/usr/bin/perl
use warnings;
use strict;
my $SRC = '.';
for my $file (glob("$SRC/*.zip"))
{
$file =~ s/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/$1/;
mkdir $file or die "Unable to create $file; $!";
}
You might like to note that your RE has made it through to here! But now we have more control, and no child processes (mkdir in Perl is a built-in, as is glob).
In conclusion, for small numbers of files, the sed loop above will be fine. It is simple, and shell based. Calling Perl just for this from a script will probably be slower since perl is quite large. But shell scripts which create child processes inside loops are not scalable. Perl is.

Match unicode character in zsh regex

I want to make sure that a variable does not contain a specific character (in this case an 'α'), but the following code fails (returns 1):
FOO="test" && [[ $FOO =~ '^[^α]*$' ]]
Edit: Changed the pattern based on feedback from stema below to require matching only “non-'α'” characters from start to end.
Replacing 'α' with e.g. 'x' works as expected. Why does it fail with an 'α', and how can I make this work?
System info:
$ zsh --version
zsh 4.3.11 (i386-apple-darwin11.0)
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL="en_GB.UTF-8"
Edit 2: I now tested on a Linux machine running Ubuntu 11.10 with zsh 4.3.11 with identical locale settings, and there it works – i.e. FOO="test" && [[ $FOO =~ '^[^α]*$' ]] returns success. I'm running Mac OS X 10.7.2.
With this regex, .*[^α].*, you can't test that α is not in the string. What it tests is whether there is ONE character in the string that is not an α.
If you want to check that this character does not occur in the string, do this:
FOO="test" && [[ $FOO =~ '^[^α]*$' ]]
this will check if the complete string from the start to the end consists of non "α" characters.
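If the regex engine itself is the unreliable part, one way to sidestep it is a plain glob comparison instead of =~. This is a different technique from the regex above, sketched here in Bash; the function name lacks_alpha is hypothetical, and whether the same comparison behaves correctly on the affected zsh 4.3.11 setup is an untested assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the argument contains no 'α'.
# Uses glob matching (!= with *α*) rather than a regular expression.
lacks_alpha() {
  [[ $1 != *α* ]]
}

for s in "test" "tαst"; do
  if lacks_alpha "$s"; then
    echo "$s: no alpha"
  else
    echo "$s: contains alpha"
  fi
done
```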
The simplest way of expressing this is with a negative look-ahead anchored at the start:
^(?!.*α)
This is saying "when looking forward from the start, I shouldn't be able to see α anywhere".
The advantage of using look-aheads is that they are non-capturing, so you can combine them with other capturing regexes, e.g. to find groups of numbers in quotes in input that doesn't contain an α, use this: ^(?!.*α)"(\d+)"
I ran into a similar problem on my build system, which has ZSH 4.3.17, while my notebook has ZSH 5.0.2 (where Unicode works as expected). It seems to me that ZSH 5 does not have the problem with Unicode characters in regular expression patterns.
Specifically, parsing the key/value pair:
[[ "revision/author=Ľudovít Lučenič" =~ '^([^=]+)=(.*)$' ]]
echo "$match[1]:$match[2]"
renders
: # ZSH 4.3.17
revision/author:Ľudovít Lučenič # ZSH 5.0.2
I also suspect a more general shortcoming in ZSH 4's Unicode support.
Update: after some investigation, I have found out that the dot in regexp does not match the letter 'č' in ZSH 4. Once I updated the pattern to:
[[ "revision/author=Ľudovít Lučenič" =~ '^([^=]+)=((.|č)*)$' ]]
echo "$match[1]:$match[2]"
I am getting the same result in both ZSH versions. I do not know, though, why exactly this letter is the problem here. However, it may help somebody to work around this shortcoming.