Why do Bash regular expressions seem to fail on simple matches? [duplicate] - regex

My question is about the Bash binary operator =~ about which the Bash manual page says the following:
When it is used, the string to the right of the operator is considered a POSIX extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise.
Under the heading Compound Command the manual says of an expression in the form:
[[ expression ]]
Return a status of 0 or 1 depending on the evaluation of the conditional expression expression. Expressions are composed of the primaries described below under CONDITIONAL EXPRESSIONS...[and] An additional binary operator, =~, is available...
Which seems to indicate that the =~ operator is available within a compound command of the form
[[ <string> =~ <string> ]]
Indeed, the following expression invoked at the Bash command-line prompt:
[[ 'x' =~ 'x' ]]
exits with a return value of 0 which, according to the manual page, indicates the pattern matched. However:
[[ 'x' =~ '.' ]]
returns 1 indicating the pattern does not match. And
[[ 'x' =~ '^' ]]
also returns 1. I have tried this on GNU bash version 5.0.18(1)-release on Debian Linux, and 5.0.17(1)-release on Apple Darwin.
The entry for "regex" in section 7 of the Debian manual (and "re_format" on the Apple machine) begins by indicating that it describes "Regular expressions ("RE"s), as defined in POSIX.2" of which one form is "modern REs (roughly those of egrep; POSIX.2 calls these 'extended' REs)." If the POSIX.2 mentioned in the regex page is the same as the POSIX mentioned in the bash page, then that would mean that the "modern REs" described in the regex page are the same as the "POSIX extended regular expressions" that Bash considers the string to the right of the =~ to be.
The regex manual entry says further:
"A (modern) RE is one or more nonempty branches"
"A branch is one or more pieces"
"A piece is an atom"
"An atom is [inter alia] '.' (matching any single character) [or] '^' (matching the null string at the beginning of a line..."
As noted above, this expression:
[[ 'x' =~ '.' ]]
returns a value 1 indicating no match. Yet if Bash considers the string to the right of the =~ operator to be a POSIX regular expression, and if the single character '.' can be a POSIX regular expression that matches any single character, and 'x' is a single character, then ought not the string '.' to the right of the =~ operator to match the single character 'x' that is to the left of the =~ operator in the above expression? If so, then why is the return value 1?
Similarly, if '^' matches the null string at the beginning of a line, then ought not the string '^' to the right of the =~ operator to match the string 'x' to the left of the =~ operator in the above expression? If so, then why does the expression [[ 'x' =~ '^' ]] return 1?
Post-solution Update
chepner's answer (and the comments) provide the working solution. The following is the relevant excerpt from the bash manual page that I had overlooked:
Any part of the pattern may be quoted to force the quoted portion to be matched as a string. Bracket expressions in regular expressions must be treated carefully, since normal quoting characters lose their meanings between brackets. If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched as a string.

Quoted characters in a regular expression are treated literally, not as regex metacharacters. [[ 'x' =~ '.' ]] therefore tests whether 'x' contains a literal dot, much like [[ 'x' == *.* ]], and so it fails.
Dropping the quotes works as expected:
$ [[ 'x' =~ . ]] && echo works
works
For this reason, you often use an unquoted parameter expansion to specify a regular expression.
$ regex=. # or regex='.'
$ [[ 'x' =~ $regex ]] && echo works
works
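To make the difference concrete, here is a small sketch that runs the same one-character regex both unquoted and quoted. The variable names (s, re, unquoted, quoted, literal) are just illustrative, and it assumes bash 3.2 or later with default compatibility settings.

```shell
#!/usr/bin/env bash
# Compare how quoting the right-hand side of =~ changes the match.

s='x'
re='.'    # the quotes only protect the assignment; the stored value is a bare dot

# Unquoted expansion: the dot is a regex metacharacter (any single character).
if [[ $s =~ $re ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted expansion: the dot is a literal character, and 'x' contains no dot.
if [[ $s =~ "$re" ]]; then quoted=match; else quoted=nomatch; fi

# A string that really contains a dot does match the quoted form.
if [[ 'a.b' =~ "$re" ]]; then literal=match; else literal=nomatch; fi

echo "unquoted: $unquoted, quoted: $quoted, literal dot: $literal"
```

Only the unquoted expansion treats the dot as "any character"; the quoted one behaves like a substring search for a literal dot.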

Related

Why does bash "=~" operator ignore the last part of the pattern specified?

I am trying to compare a string in bash to a regex pattern and have found something odd. For starters I am using GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu). This is within WSL.
For example here is sample program demonstrating the problem:
#!/bin/env bash
name="John"
if [[ "${name}" =~ "John"* ]]; then
echo "found"
else
echo "not found"
fi
exit
As expected this will echo found since the name "John" matches the regex pattern described. Now what I find odd is that if I drop the n in John, it still echoes found. Imo "Joh" does not match the pattern of "John"*.
If you drop the "hn" and just set $name to "Jo" then it echoes not found. It seems to only affect the last character in the regex pattern (aside from the wildcard).
I am converting an old csh script to bash and this behavior is not happening in csh. What is causing bash to do this?
You're mixing up syntax for shell patterns and regular expressions. Your regular expression, after stripping the quoting, is John*: Joh followed by any number of n, including 0. Matches Joh, John, Johnn, Johnnn, ...
It's not anchored, so it also matches any string containing one of the matches above.
Since it's not anchored, depending on what you want, you could do any of these:
Any string containing John should match:
Regex: [[ $name =~ John ]]
Shell pattern: [[ $name == *John* ]]
Any string that begins with John should match:
Regex: [[ $name =~ ^John ]]
Shell pattern: [[ $name == John* ]]
Notice that shell patterns, unlike the regular expressions, must match the entire string.
A note on quoting: within [[ ... ]], the left-hand side doesn't have to be quoted; on the right-hand side, quoted parts are interpreted literally. For regular expressions, it's a good practice to define it in a separate variable:
re='^John'
if [[ $name =~ $re ]]; then
This avoids a few edge cases with special characters in the regex.
The =~ operator compares using regular expression syntax, not glob syntax. The * isn't a shell wildcard, it means, "the previous character, 0 or more times".
The string Joh matches the regular expression John* because it contains Joh followed by zero n characters.
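As a rough illustration of the two answers above, the following sketch runs a few names through both the regex John* and the shell pattern John*. The helper name check and its output format are made up for this example.

```shell
#!/usr/bin/env bash
# Contrast the regex John* with the shell pattern John* on a few names.

re='John*'   # as a regex: "Joh" followed by zero or more "n", unanchored

check() {
    local name=$1 regex_result=no glob_result=no
    [[ $name =~ $re ]] && regex_result=yes      # regular expression match
    [[ $name == John* ]] && glob_result=yes     # shell pattern (glob) match
    echo "$name regex=$regex_result glob=$glob_result"
}

check John
check Joh
check Jo
check Johnny
```

Only Joh is treated differently by the two forms: the regex accepts it because the trailing n is optional, while the shell pattern requires the literal prefix John.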

Does bash operator =~ respect locale?

Does bash operator =~ as described in the Conditional Constructs section of the bash manual respect locale?
The documentation alludes to it using POSIX extended regular expressions:
the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex3)
The POSIX extended regular expression manpage man 7 regex describes that they are locale dependent. Specifically concerning bracket expressions it says:
If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, for example, "[0-9]" in ASCII matches any decimal digit. ... Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
All of this suggests to me that the regular expressions used with the bash =~ operator should respect locale; however my testing does not seem to bear this out:
$ export LANG=en_US
$ export LC_COLLATE=en_US
$ [[ B =~ [A-M] ]] && echo matched || echo unmatched
matched
$ [[ b =~ [A-M] ]] && echo matched || echo unmatched
unmatched
I would expect the last command to also echo matched as the collating sequence for en_US is aAbBcCdD... as opposed to the ABCD...abcd... sequence in the C (ASCII) locale.
Do I set my locale incorrectly? Is bash not setting up the locale correctly for POSIX extended regular expressions to use the locale?
Some more experimentation based on Marcos's answer:
When in en_US locale, [a-M] apparently matches any lower case character a through z and any uppercase character A through M. That would suggest a collating order of abcd...ABCD... instead of aAbBcCdD.... Switching to the C locale using [a-M] will result in an exit code of 2 from the Conditional Construct instead of 0 or 1. This indicates an invalid regular expression, which makes sense as in the C locale a comes after M in the collating order.
So, locale is definitely being used in the POSIX extended regular expressions. However the bracket expression does not follow the collating order I would expect. Do bracket expressions perhaps use something other than the collating order?
edit1: updated to use the actual correct en_US collating sequence.
edit2: added further findings.
It's actually aAbB... and not AaBb.
Try this: touch {a..z}; touch {A..Z}; ls -1 | sort.
See?
So
$ [[ a =~ [a-M] ]] && echo matched || echo unmatched
matched
$ [[ A =~ [a-M] ]] && echo matched || echo unmatched
matched
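Since the man page warns that ranges depend on the collating sequence, one way to sidestep the question is to avoid the range entirely and list the members of the bracket expression. This is only a sketch of that workaround, not an explanation of how en_US collation orders letters; the function name in_a_to_m is hypothetical.

```shell
#!/usr/bin/env bash
# Match uppercase A through M without relying on a range's collating order.

re='^[ABCDEFGHIJKLM]$'   # every member listed explicitly, so no range is involved

in_a_to_m() {
    [[ $1 =~ $re ]]
}

for c in B b M N; do
    if in_a_to_m "$c"; then echo "$c: matched"; else echo "$c: unmatched"; fi
done
```

Because no range appears in the bracket expression, the result should not change between the C locale and en_US.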

Regular expression for positive integer [duplicate]

What is a regular expression for a positive integer? I need it in an if clause in a bash script and I tried [[ $myvar == [1-9][0-9]* ]] and I don't get why it says, for instance, that 6 is not an integer and 20O0O0 is.
The == operator performs pattern matching, not regular expression matching. [1-9][0-9]* matches a string that starts with 1-9, followed by a digit in the range 0-9, followed by anything, including an empty string. * is not an operator, but a wildcard. As such, basic pattern matching is not sufficient.
You can use extended pattern matching, which can be enabled explicitly, or (in the case of newer versions of bash) is assumed to be enabled for the argument to == and !=.
shopt -s extglob # may not be necessary
if [[ $myvar == [1-9]*([0-9]) ]]; then
The pattern *([0-9]) will match zero or more occurrences of the pattern enclosed in parentheses.
If you want to use a regular expression instead, use the =~ operator. Note that you now need to anchor your regular expression to the beginning and end of the string you are matching; patterns do so automatically.
if [[ $myvar =~ ^[1-9][0-9]*$ ]]; then
Note that some of the confusion stems from the fact that [...] is both a legal regular expression and pattern, and that characters like * are used in both but with slightly different meanings. Also note that extended patterns are equivalent in power to regular expressions (anything you can match with one you can match with the other), but I leave the proof of that as an exercise to the reader.
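Putting the regex variant into a function, a minimal sketch might look like this. The name is_positive_int is hypothetical, and "positive integer" is assumed here to mean no sign and no leading zeros.

```shell
#!/usr/bin/env bash
# Test for a positive integer with no sign and no leading zeros.

is_positive_int() {
    local re='^[1-9][0-9]*$'   # first digit 1-9, then any number of digits
    [[ $1 =~ $re ]]
}

for v in 6 2000 20O0O0 0 0123 -5 ''; do
    if is_positive_int "$v"; then echo "'$v': yes"; else echo "'$v': no"; fi
done
```

With the anchors in place, 6 and 2000 are accepted while 20O0O0 (with letter O's), 0, 0123, -5 and the empty string are rejected.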
There is no need to use a regex to check for a positive integer. Just use a [[ ... ]] construct like this:
isInt() {
# do sanity check for argument if needed
local n="$1"
[[ $n == [1-9]* && $n -gt 0 ]] 2>/dev/null && echo '+ve integer' || echo 'nope'
}
Then use it as:
isInt '-123'
nope
isInt 'abc'
nope
isInt '.123'
nope
isInt '0'
nope
isInt '789'
+ve integer
isInt '0123'
nope
foo=1234
isInt 'foo'
nope
[[ $myvar =~ ^[+]*[[:digit:]]*$ ]] && echo "Positive Integer"
Shouldn't that do it?
If a 0 is not a positive number in your description and you are not ready to accept leading zeros or plus, then do
[[ $myvar =~ ^[1-9]+[[:digit:]]*$ ]] && echo "Positive Integer"

bash regex not working

So I have this code
function test(){
local output="ASD[test]"
if [[ "$output" =~ ASD\[(.*?)\] ]]; then
echo "success";
else
echo "fail"
fi;
}
And as you can see it's supposed to echo success since the string matches that regular expression. However this ends up returning fail. What did I do wrong?
The ? in ASD\[(.*?)\] doesn't belong there. It looks like you're trying to apply a non-greedy modifier to the *, which is *? in Perl-compatible syntax, but Bash doesn't support that. In fact, if you examine $? after the test, you'll see that it's not 1 (the normal "string didn't match" result) but 2, which indicates a syntax error in the regular expression.
If you use the simpler pattern ASD\[(.*)\], then the match will succeed. However, if you use that regex on a string which might have later instances of brackets, too much will get captured by the parentheses. For example:
output='ASD[test1],ASD[test2]'
[[ $output =~ ASD\[(.*)\] ]] && echo "first subscript is '${BASH_REMATCH[1]}'"
#=> first subscript is 'test1],ASD[test2'
In languages that support the *? syntax, it makes the matching "non-greedy" so that it will match the smallest string it can that makes the overall match succeed; without the ?, such expressions always match the longest possible instead. Since Bash doesn't have non-greediness, your best bet is probably to use a character class that matches everything except a close bracket, making it impossible for the match to move past the first one:
[[ $output =~ ASD\[([^]]*)\] ]] && echo "first subscript is '${BASH_REMATCH[1]}'"
#=> first subscript is 'test1'
Note that this breaks if there are any nested layers of bracket pairs within the subscript brackets - but then, so does the *? version.
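Wrapped in a function, the negated-bracket approach could be sketched as follows. The function name first_subscript is hypothetical; the regex is kept in a variable so that the backslashes and brackets survive shell parsing unchanged.

```shell
#!/usr/bin/env bash
# Extract the first bracketed subscript using a negated bracket expression.

first_subscript() {
    local re='ASD\[([^]]*)\]'   # [^]]* = any run of characters other than ']'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"   # text captured by the parentheses
    else
        return 1
    fi
}

first_subscript 'ASD[test1],ASD[test2]'
first_subscript 'ASD[test]'
```

The first call prints test1 and the second prints test; a string without an ASD[...] part makes the function return 1 and print nothing.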

Must a leading curly brace be escaped in a regular expression?

regex="\{foo"; string="{foo"; [[ $string =~ $regex ]] && echo "true"
This is a bash script that works in Bash 3.x and 4.x. If the "\" is removed then it stops working in Bash 4.x. Is this behavior expected and/or a bug? The regex(7) man page suggests the escape is not required. Do other flavors of regex require that a curly brace be escaped?
The opening brace needs to be escaped, because it denotes the start of the quantifier {m,n}. I haven't used any regex flavour where it works without escaping {, though I can't speak for all of them. The reason is quite logical.
For the same reason, you would need to escape the opening bracket - [, because it denotes the start of a character class.
The manual says (emphasis mine):
An additional binary operator, ‘=~’, is available, with the same
precedence as ‘==’ and ‘!=’. When it is used, the string to the right
of the operator is considered an extended regular expression and
matched accordingly.
As such, the { in the regex needs to be escaped.
However, you can force bash to perform a string comparison by quoting the string on the rhs of the =~ operator.
$ regex="{foo"; string="{foo"; [[ $string =~ "$regex" ]] && echo "true"
true
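Both options side by side, as a small sketch (the variable names escaped, literal, by_escape and by_quote are made up for illustration):

```shell
#!/usr/bin/env bash
# Two ways to match a literal "{" with =~.

string='{foo'

# 1. Escape the brace inside the regex and expand the variable unquoted.
escaped='\{foo'
if [[ $string =~ $escaped ]]; then by_escape=match; else by_escape=nomatch; fi

# 2. Quote the expansion so the whole value is compared as a literal string.
literal='{foo'
if [[ $string =~ "$literal" ]]; then by_quote=match; else by_quote=nomatch; fi

echo "escaped: $by_escape, quoted: $by_quote"
```

Both forms report a match. The escaped form keeps the rest of the pattern available for regex syntax, while the quoted form turns the whole value into a plain substring test.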