Does bash operator =~ respect locale?

Does bash operator =~ as described in the Conditional Constructs section of the bash manual respect locale?
The documentation alludes to it using POSIX extended regular expressions:
the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3))
The POSIX extended regular expression manpage man 7 regex describes that they are locale dependent. Specifically concerning bracket expressions it says:
If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, for example, "[0-9]" in ASCII matches any decimal digit. ... Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
All of this suggests to me that the regular expressions used with the bash =~ operator should respect locale; however my testing does not seem to bear this out:
$ export LANG=en_US
$ export LC_COLLATE=en_US
$ [[ B =~ [A-M] ]] && echo matched || echo unmatched
matched
$ [[ b =~ [A-M] ]] && echo matched || echo unmatched
unmatched
I would expect the last command to also echo matched as the collating sequence for en_US is aAbBcCdD... as opposed to the ABCD...abcd... sequence in the C (ASCII) locale.
Do I set my locale incorrectly? Is bash not setting up the locale correctly for POSIX extended regular expressions to use the locale?
Some more experimentation based on Marcos's answer:
When in en_US locale, [a-M] apparently matches any lower case character a through z and any uppercase character A through M. That would suggest a collating order of abcd...ABCD... instead of aAbBcCdD.... Switching to the C locale using [a-M] will result in an exit code of 2 from the Conditional Construct instead of 0 or 1. This indicates an invalid regular expression, which makes sense as in the C locale a comes after M in the collating order.
So, locale is definitely being used in the POSIX extended regular expressions. However the bracket expression does not follow the collating order I would expect. Do bracket expressions perhaps use something other than the collating order?
edit1: updated to use the actual correct en_US collating sequence.
edit2: added further findings.

It's actually aAbB... and not AaBb.
Try this: touch {a..z}; touch {A..Z}; ls -1 | sort.
See?
So
$ [[ a =~ [a-M] ]] && echo matched || echo unmatched
matched
$ [[ A =~ [a-M] ]] && echo matched || echo unmatched
matched
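If the goal is a range that behaves the same regardless of the collating sequence, one option is to force the C locale for the duration of the match, or to avoid ranges and use a POSIX character class. The following is only a sketch: the function names is_upper_a_to_m and is_upper are made up for illustration, and it assumes a bash that applies a function-local LC_ALL assignment to the regex library.

```bash
#!/usr/bin/env bash

# Hypothetical helper: match a single uppercase letter A through M.
# LC_ALL is set locally so the range follows ASCII order; the setting is
# meant to end when the function returns.
is_upper_a_to_m() {
    local LC_ALL=C
    local re='^[A-M]$'
    [[ $1 =~ $re ]]
}

# Hypothetical helper: a character class sidesteps collation-dependent
# ranges altogether.
is_upper() {
    local re='^[[:upper:]]$'
    [[ $1 =~ $re ]]
}

is_upper_a_to_m B && echo "B: matched"
is_upper_a_to_m b || echo "b: unmatched"
```

Keeping the regex in a variable and expanding it unquoted also avoids surprises from the shell's own parsing of the right-hand side.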

Related

Why does bash "=~" operator ignore the last part of the pattern specified?

I am trying to compare a string in bash to a regex pattern and have found something odd. For starters I am using GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu). This is within WSL.
For example here is sample program demonstrating the problem:
#!/usr/bin/env bash
name="John"
if [[ "${name}" =~ "John"* ]]; then
echo "found"
else
echo "not found"
fi
exit
As expected this will echo found since the name "John" matches the regex pattern described. Now what I find odd is if I drop the n in John, it still echoes found. In my opinion, "Joh" does not match the pattern "John"*.
If you drop the "hn" and just set $name to "Jo" then it echos not found. It seems to only affect the last character in the Regex pattern (aside from the wildcard).
I am converting an old csh script to bash and this behavior is not happening in csh. What is causing bash to do this?
You're mixing up syntax for shell patterns and regular expressions. Your regular expression, after stripping the quoting, is John*: Joh followed by any number of n, including 0. Matches Joh, John, Johnn, Johnnn, ...
It's not anchored, so it also matches any string containing one of the matches above.
Since it's not anchored, depending on what you want, you could do any of these:
Any string containing John should match:
Regex: [[ $name =~ John ]]
Shell pattern: [[ $name == *John* ]]
Any string that begins with John should match:
Regex: [[ $name =~ ^John ]]
Shell pattern: [[ $name == John* ]]
Notice that shell patterns, unlike the regular expressions, must match the entire string.
A note on quoting: within [[ ... ]], the left-hand side doesn't have to be quoted; on the right-hand side, quoted parts are interpreted literally. For regular expressions, it's a good practice to define it in a separate variable:
re='^John'
if [[ $name =~ $re ]]; then
This avoids a few edge cases with special characters in the regex.
The =~ operator compares using regular expression syntax, not glob syntax. The * isn't a shell wildcard, it means, "the previous character, 0 or more times".
The string Joh matches the regular expression John* because it contains Joh followed by zero n characters.
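To make the difference concrete, here is a small sketch that runs the same names through an anchored regex and through a shell pattern. The function names are invented for this example; the patterns themselves are the ones suggested above.

```bash
#!/usr/bin/env bash

# Regex test: the pattern lives in a variable and is expanded unquoted.
starts_with_john_regex() {
    local re='^John'
    [[ $1 =~ $re ]]
}

# Glob test: a shell pattern must cover the whole string, hence the trailing *.
starts_with_john_glob() {
    [[ $1 == John* ]]
}

for name in John Johnny Joh "Big John"; do
    if starts_with_john_regex "$name"; then r=yes; else r=no; fi
    if starts_with_john_glob "$name"; then g=yes; else g=no; fi
    printf '%-10s regex=%s glob=%s\n' "$name" "$r" "$g"
done
```

Both forms accept John and Johnny and reject Joh and "Big John", which is the behavior the csh script presumably relied on.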

Why do Bash regular expressions seem to fail on simple matches? [duplicate]

My question is about the Bash binary operator =~ about which the Bash manual page says the following:
When it is used, the string to the right of the operator is considered a POSIX extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise.
Under the heading Compound Command the manual says of an expression in the form:
[[ expression ]]
Return a status of 0 or 1 depending on the evaluation of the conditional expression expression. Expressions are composed of the primaries described below under CONDITIONAL EXPRESSIONS...[and] An additional binary operator, =~, is available...
Which seems to indicate that the =~ operator is available within a compound command of the form
[[ <string> =~ <string> ]]
Indeed, the following expression invoked at the Bash command-line prompt:
[[ 'x' =~ 'x' ]]
exits with a return value of 0 which, according to the manual page, indicates the pattern matched. However:
[[ 'x' =~ '.' ]]
returns 1 indicating the pattern does not match. And
[[ 'x' =~ '^' ]]
also returns 1. I have tried this on GNU bash version 5.0.18(1)-release on Debian Linux, and 5.0.17(1)-release on Apple Darwin.
The entry for "regex" in section 7 of the Debian manual (and "re_format" on the Apple machine) begins by indicating that it describes "Regular expressions ("RE"s), as defined in POSIX.2" of which one form is "modern REs (roughly those of egrep; POSIX.2 calls these 'extended' REs)." If the POSIX.2 mentioned in the regex page is the same as the POSIX mentioned in the bash page, then that would mean that the "modern REs" described in the regex page are the same as the "POSIX extended regular expressions" that Bash considers the string to the right of the =~ to be.
The regex manual entry says further:
"A (modern) RE is one or more nonempty branches"
"A branch is one or more pieces"
"A piece is an atom"
"An atom is [inter alia] '.' (matching any single character) [or] '^' (matching the null string at the beginning of a line..."
As noted above, this expression:
[[ 'x' =~ '.' ]]
returns a value 1 indicating no match. Yet if Bash considers the string to the right of the =~ operator to be a POSIX regular expression, and if the single character '.' can be a POSIX regular expression that matches any single character, and 'x' is a single character, then ought not the string '.' to the right of the =~ operator to match the single character 'x' that is to the left of the =~ operator in the above expression? If so, then why is the return value 1?
Similarly, if '^' matches the null string at the beginning of a line, then ought not the string '^' to the right of the =~ operator to match the string 'x' to the left of the =~ operator in the above expression? If so then why does the expression [[ 'x' =~ '^' ]] return 1?
Post-solution Update
chepner's answer (and the comments) provide the working solution. The following is the relevant excerpt from the bash manual page that I had overlooked:
Any part of the pattern may be quoted to force the quoted portion to be matched as a string. Bracket expressions in regular expressions must be treated carefully, since normal quoting characters lose their meanings between brackets. If the pattern is stored in a shell variable, quoting the variable expansion forces the entire pattern to be matched as a string.
Quoted characters in a regular expression are treated literally, not as regex metacharacters. [[ 'x' =~ '.' ]] is equivalent to [[ 'x' = . ]].
Dropping the quotes works as expected:
$ [[ 'x' =~ . ]] && echo works
works
For this reason, you often use an unquoted parameter expansion to specify a regular expression.
$ regex=. # or regex='.'
$ [[ 'x' =~ $regex ]] && echo works
works
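The effect of quoting can be seen side by side with a short sketch. The two function names are hypothetical, and the behavior shown assumes Bash 3.2 or later without the compat31 option set.

```bash
#!/usr/bin/env bash

# Unquoted expansion: "." keeps its regex meaning (any single character).
matches_as_regex() {
    local re=$2
    [[ $1 =~ $re ]]
}

# Quoted expansion: every character of the pattern is matched literally.
matches_as_literal() {
    local re=$2
    [[ $1 =~ "$re" ]]
}

matches_as_regex   x   . && echo "regex: x matches ."
matches_as_literal x   . || echo "literal: x does not contain a dot"
matches_as_literal a.b . && echo "literal: a.b contains a dot"
```

Note that the quoted form is still an unanchored search, so it succeeds whenever the literal text occurs anywhere in the string.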

Mix of regex and non-regex in bash if-statement

Inside of my $foo variable I have this data (please pay close attention to the .s and ,s):
,example.com,de.wikipedia.org,reddit,stackoverflow.com.,amazon.,
I am trying to write an if statement in bash that basically works like this:
if [[ "${foo}" =~ *','[a-z0-9]','* || "${foo}" =~ *','[a-z0-9]'.,'* ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
It would echo Invalid input detected since reddit and amazon. are in $foo.
If I change the contents of $foo to be:
,example.com,de.wikipedia.org,www.reddit.com,stackoverflow.com.,amazon.com,
Then it would echo OK.
I am using bash 3.2.57(1)-release on OS X 10.11.6 El Capitan.
Try:
if [[ $foo =~ ,[a-z0-9]*, || $foo =~ ,[a-z0-9]*\., ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
Notes:
=~ is a regular expression operator. The right-hand-side needs to be a regular expression, not a glob.
, is not a shell-active character. Thus, it does not need any special quoting.
[a-z0-9] matches exactly one alphanumeric. Since we want to allow any number of them, use [a-z0-9]*
In regular expressions, ','* matches zero or more commas. This is not what you want. One might write ,.* which, because . is a wildcard, matches a comma followed by zero or more of anything. Since the regex is not anchored to the end, adding a final .* makes no difference.
Inside of [[...]] there is no word splitting. So shell variables do not need the double-quoting that they need elsewhere.
Note that, in [a-z0-9], the exact characters that match a-z or 0-9 depend on the collation order in the locale.
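As a sketch of how the check could be packaged, the two alternatives can be folded into one extended regular expression by making the trailing dot optional with \.?. The function name has_invalid_entry is invented here, and forcing the C locale inside it is an assumption made to keep a-z and 0-9 to their ASCII ranges.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds when the comma-delimited list contains an
# entry made only of [a-z0-9], optionally followed by one trailing dot.
has_invalid_entry() {
    local LC_ALL=C                 # assumption: ASCII ranges are wanted
    local re=',[a-z0-9]*\.?,'
    [[ $1 =~ $re ]]
}

foo=',example.com,de.wikipedia.org,reddit,stackoverflow.com.,amazon.,'
if has_invalid_entry "$foo"; then
    echo "Invalid input detected"
else
    echo "OK"
fi
```

Like the original two-part test, this also flags an empty entry (two adjacent commas), which may or may not be desired.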

Bash double bracket regex comparison using negative lookahead error return 2

On Bash 4.1 machine,
I'm trying to use "double bracket" [[ expression ]] to do REGEX comparison using "NEGATIVE LOOKAHEAD".
I ran "set +H" to disable Bash's '!' history expansion.
I want to match to "any string" except "arm-trusted-firmware".
set +H
if [[ alsa =~ ^(?!arm-trusted-firmware).* ]]; then echo MATCH; else echo "NOT MATCH"; fi
I expect this to print "MATCH" back,
but it prints "NOT MATCH".
After looking into the return code of "double bracket",
it returns "2":
set +H
[[ alsa =~ ^(?!arm-trusted-firmware).* ]]
echo $?
According to bash manual,
the return value '2' means "the regular expression is syntactically incorrect":
An additional binary operator, =~, is available,
with the same precedence as == and !=.
When it is used,
the string to the right of the operator is considered
an extended regular expression and matched accordingly (as in regex(3)).
The return value is 0 if the string matches the pattern, and 1 otherwise.
If the regular expression is syntactically incorrect,
the conditional expression's return value is 2.
What did I do wrong?
In my original script,
I'm comparing against a list of STRINGs.
When it matches, I trigger some function calls;
when it doesn't match, I skip my actions.
So, YES, from this example,
I'm comparing literally the STRING between 'alsa' and 'arm-trusted-firmware'.
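Bash's =~ operator uses POSIX extended regular expressions, not PCRE, so lookaheads are not available. (source: Wiki Bash Hackers)
As a workaround, you can enable extglob. This enables some extended globbing patterns, which are shell patterns for use with == and !=, not regular expressions for =~: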
$ shopt -s extglob
Check Wooledge Wiki for reading more about extglob.
Then you'll be able to use patterns like that:
?(pattern-list) Matches zero or one occurrence of the given patterns
*(pattern-list) Matches zero or more occurrences of the given patterns
+(pattern-list) Matches one or more occurrences of the given patterns
@(pattern-list) Matches one of the given patterns
!(pattern-list) Matches anything except one of the given patterns
More about extended BASH globbing at Wiki Bash Hackers and LinuxJournal.
Thanks for the answer from @Barmar
BASH doesn't support "lookaround" (lookahead and lookbehind)
bash doesn't use PCRE, and doesn't support lookarounds.
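The three possible results of [[ ... =~ ... ]] can be told apart by looking at the exit status, which helps when a pattern silently fails to compile. This is a minimal sketch; regex_status is a made-up helper, and the unbalanced parenthesis is just one example of a pattern that the regex library is expected to reject.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print how [[ =~ ]] classified the pattern.
regex_status() {
    local string=$1 re=$2 status=0
    [[ $string =~ $re ]] || status=$?
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        2) echo "invalid regex" ;;
        *) echo "unexpected status $status" ;;
    esac
}

regex_status alsa '^al'
regex_status alsa '^arm'
regex_status alsa '(unclosed'
```

Treating status 2 separately from status 1 avoids mistaking a broken pattern for an ordinary non-match.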
Respectfully, aren't you over-complicating things?
if [ "$alsa" = arm-trusted-firmware ]
then
echo 'MATCH'
else
echo 'NOT MATCH'
fi
If you have a good reason for wanting to use the Bashism [[, it would serve
you better to provide an example that justifies it.
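If a regex is still wanted, for example because the excluded names come from a list of patterns, the usual substitute for a negative lookahead is to negate the whole match. A minimal sketch, with is_not_excluded as an invented name:

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed for every name except the excluded one.
# The regex is anchored at both ends, and the match is negated with !.
is_not_excluded() {
    local re='^arm-trusted-firmware$'
    ! [[ $1 =~ $re ]]
}

for pkg in alsa arm-trusted-firmware arm-trusted-firmware-tools; do
    if is_not_excluded "$pkg"; then
        echo "$pkg: MATCH"
    else
        echo "$pkg: NOT MATCH"
    fi
done
```

Anchoring with ^ and $ matters here: without it, any name that merely contains the excluded string would be skipped as well.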

Does bash support word boundary regular expressions?

I am trying to match on the presence of a word in a list before adding that word again (to avoid duplicates). I am using bash 4.2.24 and am trying the below:
[[ $foo =~ \bmyword\b ]]
also
[[ $foo =~ \<myword\> ]]
However, neither seem to work. They are mentioned in the bash docs example: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.html.
I presume I am doing something wrong but I am not sure what.
tl;dr
To be safe, do not use a regex literal with =~.
Instead, use:
either: an auxiliary variable - see @Eduardo Ivancec's answer.
or: a command substitution that outputs a string literal - see @ruakh's comment on @Eduardo Ivancec's answer
Note that both must be used unquoted as the =~ RHS.
Whether \b and \< / \> are supported at all depends on the host platform, not Bash:
they DO work on Linux,
but NOT on BSD-based platforms such as macOS; there, use [[:<:]] and [[:>:]] instead, which, in the context of an unquoted regex literal, must be escaped as [[:\<:]] and [[:\>:]]; the following works as expected, but only on BSD/macOS:
[[ ' myword ' =~ [[:\<:]]myword[[:\>:]] ]] && echo YES # OK
The problem wouldn't arise - on any platform - if you limited your regex to the constructs in the POSIX ERE (extended regular expression) specification.
Unfortunately, POSIX EREs do not support word-boundary assertions, though you can emulate them - see the last section.
As on macOS, no \-prefixed constructs are supported, so that handy character-class shortcuts such as \s and \w aren't available either.
However, the up-side is that such ERE-compliant regexes are then portable (work on both Linux and macOS, for instance)
=~ is the rare case (the only case?) of a built-in Bash feature whose behavior is platform-dependent: It uses the regex libraries of the platform it is running on, resulting in different regex flavors on different platforms.
Thus, it is generally non-trivial and requires extra care to write portable code that uses the =~ operator.
Sticking with POSIX EREs is the only robust approach, which means that you have to work around their limitations - see bottom section.
If you want to know more, read on.
On Bash v3.2+ (unless the compat31 shopt option is set), the RHS (right-hand side operand) of the =~ operator must be unquoted in order to be recognized as a regex (if you quote the right operand, =~ performs regular string comparison instead).
More accurately, at least the special regex characters and sequences must be unquoted, so it's OK and useful to quote those substrings that should be taken literally; e.g., [[ '*' =~ ^'*' ]] matches, because ^ is unquoted and thus correctly recognized as the start-of-string anchor, whereas *, which is normally a special regex char, matches literally due to the quoting.
However, there appears to be a design limitation in (at least) bash 3.x that prevents use of \-prefixed regex constructs (e.g., \<, \>, \b, \s, \w, ...) in a literal =~ RHS; the limitation affects Linux, whereas BSD/macOS versions are not affected, due to fundamentally not supporting any \-prefixed regex constructs:
# Linux only:
# PROBLEM (see details further below):
# Seen by the regex engine as: <word>
# The shell eats the '\' before the regex engine sees them.
[[ ' word ' =~ \<word\> ]] && echo MATCHES # !! DOES NOT MATCH
# Causes syntax error, because the shell considers the < unquoted.
# If you used \\bword\\b, the regex engine would see that as-is.
[[ ' word ' =~ \\<word\\> ]] && echo MATCHES # !! BREAKS
# Using the usual quoting rules doesn't work either:
# Seen by the regex engine as: \\<word\\> instead of \<word\>
[[ ' word ' =~ \\\<word\\\> ]] && echo MATCHES # !! DOES NOT MATCH
# WORKAROUNDS
# Aux. variable.
re='\<word\>'; [[ ' word ' =~ $re ]] && echo MATCHES # OK
# Command substitution
[[ ' word ' =~ $(printf %s '\<word\>') ]] && echo MATCHES # OK
# Change option compat31, which then allows use of '...' as the RHS
# CAVEAT: Stays in effect until you reset it, may have other side effects.
# Using (...) around the command confines the effect to a subshell.
(shopt -s compat31; [[ ' word ' =~ '\<word\>' ]] && echo MATCHES) # OK
The problem:
Tip of the hat to Fólkvangr for his input.
A literal RHS of =~ is by design parsed differently than unquoted tokens as arguments, in an attempt to allow the user to focus on escaping characters just for the regex, without also having to worry about the usual shell escaping rules in unquoted tokens.
For instance,
[[ 'a[b' =~ a\[b ]] && echo MATCHES # OK
matches, because the \ is passed through to the regex engine (that is, the regex engine too sees literal a\[b), whereas if you used the same unquoted token as a regular argument, the usual shell expansions applied to unquoted tokens would "eat" the \, because it is interpreted as a shell escape character:
$ printf %s a\[b
a[b # '\' was removed by the shell.
However, in the context of =~ this exceptional passing through of \ is only applied before characters that are regex metacharacters by themselves, as defined by the ERE (extended regular expressions) POSIX specification (in order to escape them for the regex, so that they're treated as literals):
\ ^ $ [ { . ? * + ( ) |
Conversely, these regex metacharacters may exceptionally be used unquoted - and indeed must be left unquoted to have their special regex meaning - even though most of them normally require \-escaping in unquoted tokens to prevent the shell from interpreting them.
Yet, a subset of the shell metacharacters do still need escaping, for the shell's sake, so as not to break the syntax of the [[ ... ]] conditional:
& ; < > space
Since these characters aren't also regex metacharacters, there is no need to also support escaping them on the regex side, so that, for instance, the regex engine seeing \& in the RHS as just & works fine.
For any other character preceded by \, the shell removes the \ before sending the string to the regex engine (as it does during normal shell expansion), which is unfortunate, because then even characters that the shell doesn't consider special cannot be passed as \<char> to the regex engine, because the shell invariably passes them as just <char>.
E.g, \b is invariably seen as just b by the regex engine.
It is therefore currently impossible to use a (by definition non-POSIX) regex construct in the form \<char> (e.g., \<, \>, \b, \s, \w, \d, ...) in a literal, unquoted =~ RHS, because no form of escaping can ensure that these constructs are seen by the regex engine as such, after parsing by the shell:
Since neither <, >, nor b are regex metacharacters, the shell removes the \ from \<, \>, \b (as happens in regular shell expansion). Therefore, passing \<word\>, for instance, makes the regex engine see <word>, which is not the intent:
[[ '<word>' =~ \<word\> ]] && echo YES matches, because the regex engine sees <word>.
[[ 'boo' =~ ^\boo ]] && echo YES matches, because the regex engine sees ^boo.
Trying \\<word\\> breaks the command, because the shell treats each \\ as an escaped \, which means that metacharacter < is then considered unquoted, causing a syntax error:
[[ ' word ' =~ \\<word\\> ]] && echo YES causes a syntax error.
This wouldn't happen with \\b, but \\b is passed through (due to the \ preceding a regex metachar, \), which also doesn't work:
[[ '\boo' =~ ^\\boo ]] && echo YES matches, because the regex engine sees \\boo, which matches literal \boo.
Trying \\\<word\\\> - which by normal shell expansion rules results in \<word\> (try printf %s \\\<word\\\>) - also doesn't work:
What happens is that the shell eats the \ in \< (ditto for \b and other \-prefixed sequences), and then passes the preceding \\ through to the regex engine as-is (again, because \ is preserved before a regex metachar):
[[ ' \<word\> ' =~ \\\<word\\\> ]] && echo YES matches, because the regex engine sees \\<word\\>, which matches literal \<word\>.
In short:
Bash's parsing of =~ RHS literals was designed with single-character regex metacharacters in mind, and does not support multi-character constructs that start with \, such as \<.
Because POSIX EREs support no such constructs, =~ works as designed if you limit yourself to such regexes.
However, even within this constraint the design is somewhat awkward, due to the need to mix regex-related and shell-related \-escaping (quoting).
Fólkvangr found the official design rationale in the Bash FAQ here, which, however, neither addresses said awkwardness nor the lack of support for (invariably non-POSIX) \<char> regex constructs; it does mention using an aux. variable as a workaround, however, although only with respect to making it easier to represent whitespace.
All these parsing problems go away if the string that the regex engine should see is provided via a variable or via the output from a command substitution, as demonstrated above.
Optional reading: A portable emulation of word-boundary assertions with POSIX-compliant EREs (extended regular expressions):
(^|[^[:alnum:]_]) instead of \< / [[:<:]]
([^[:alnum:]_]|$) instead of \> / [[:>:]]
Note: \b can't be emulated with a SINGLE expression - use the above in the appropriate places.
The potential caveat is that the above expressions will also capture the non-word character being matched, whereas true assertions such as \< / [[:<:]] and \> / [[:>:]] do not.
foo='myword'
[[ $foo =~ (^|[^[:alnum:]_])myword([^[:alnum:]_]|$) ]] && echo YES
The above matches, as expected.
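The emulation can be wrapped in a small function and used for the original goal of avoiding duplicates in a list. This is only a sketch: contains_word is a hypothetical helper, and it assumes the word contains no regex metacharacters.

```bash
#!/usr/bin/env bash

# Hypothetical helper: does the list in $1 contain $2 as a whole word?
# Assumes $2 contains no regex metacharacters.
contains_word() {
    local list=$1 word=$2
    local re="(^|[^[:alnum:]_])${word}([^[:alnum:]_]|\$)"
    [[ $list =~ $re ]]
}

# Append each word only if it is not already present as a whole word.
list="alpha beta"
for w in beta gamma; do
    contains_word "$list" "$w" || list+=" $w"
done
echo "$list"
```

Because the regex is built in a variable and expanded unquoted, none of the literal-RHS parsing issues described above apply.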
Yes, all the listed regex extensions are supported but you'll have better luck putting the pattern in a variable before using it. Try this:
re=\\bmyword\\b
[[ $foo =~ $re ]]
Digging around I found this question, whose answers seem to explain why the behaviour changes when the regex is written inline as in your example.
Editor's note: The linked question does not explain the OP's problem; it merely explains how starting with Bash version 3.2 regexes (or at least the special regex chars.) must by default be unquoted to be treated as such - which is exactly what the OP attempted.
However, the workarounds in this answer are effective.
You'll probably have to rewrite your tests so as to use a temporary variable for your regexes, or use the 3.1 compatibility mode:
shopt -s compat31
Not exactly "\b", but for me more readable (and portable) than the other suggestions:
[[ $foo =~ (^| )myword($| ) ]]
The accepted answer focuses on using auxiliary variables to deal with the syntax oddities of regular expressions in Bash's [[ ... ]] expressions. Very good info.
However, the real answer is:
\b \< and \> do not work on OS X 10.11.5 (El Capitan) with bash version 4.3.42(1)-release (x86_64-apple-darwin15.0.0).
Instead, use [[:<:]] and [[:>:]].
Tangential to your question, but if you can use grep -E (or egrep, its effective, but obsolescent alias) in your script:
if grep -q -E "\b${myword}\b" <<<"$foo"; then
I ended up using this after flailing with bash's =~.
Note that while regex constructs \<, \>, and \b are not POSIX-compliant, both the BSD (macOS) and GNU (Linux) implementations of grep -E support them, which makes this approach widely usable in practice.
Small caveat (not an issue in the case at hand): By not using =~, you lose the ability to inspect capturing subexpressions (capture groups) via ${BASH_REMATCH[@]} later.
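For readers unfamiliar with what is lost, here is a short sketch of reading capture groups after a successful =~ match. The helper name parse_assignment and the key=value input format are made up for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper: split "key=value" and print the two captured parts.
parse_assignment() {
    local re='^([[:alpha:]_][[:alnum:]_]*)=(.*)$'
    if [[ $1 =~ $re ]]; then
        # Index 0 holds the whole match; 1 and up hold the parenthesized groups.
        printf 'key=%s value=%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

parse_assignment 'editor=vim'
```

With grep the same split would need a second tool or a second pass, which is the trade-off the caveat refers to.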
I've used the following to match word boundaries on older systems. The key is to wrap $foo with spaces since [^[:alpha:]] will not match words at the beginning or end of the list.
[[ " $foo " =~ [^[:alpha:]]myword[^[:alpha:]] ]]
Tweak the character class as needed based on the expected contents of myword; otherwise this may not be a good solution.
This worked for me
bar='\<myword\>'
[[ $foo =~ $bar ]]
You can use grep, which is more portable than bash's regexp, like this:
if echo "$foo" | grep -q '\<myword\>'; then
echo "MATCH";
else
echo "NO MATCH";
fi