Weird behavior of BASH glob/regex ranges

Weird behavior of BASH glob/regex ranges - regex

I'm seeing BASH bracket ranges (e.g. [A-Z]) behaving in an unexpected way.Is there's an explanation for such behavior, or it is a bug?
Let's say I have a variable, from which I want to strip all uppercase letters:
$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"
The result I get is this:
a0123
If I do it with sed, I get an expected result:
$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123
The same seems to be the case for BASH built-in regex match:
$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0
If I check all lowercase letters from 'a' to 'z', it seems that only 'a' is an exception:
$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a
I do not have case-insensitive matching enabled, and even if I did, it should not make letter 'a' behave differently:
$ shopt -p nocasematch
shopt -u nocasematch
For the reference, I'm using Cygwin, and I don't see this behavior on any other machine:
$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
EDIT:
I've found the exact same issue reported here:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So, I guess it's a bug(?) of "en_GB.UTF-8" collation, but not BASH itself.
Setting LC_COLLATE=C indeed solves this.

It certainly had to do with setting of your locale. An excerpt from the GNU bash man page under Pattern Matching
[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option.[..]
Use the POSIX character-classess, [[:upper:]] in this case or change your locale setting LC_ALL or LC_COLLATE to C as mentioned above.
LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
Also, your negative test to do upper-case check will fail for all the lower case letters when setting this locale hence printing the letters,
LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
Also, under the above locale setting
[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1
but will be true for all lower-case ranges,
[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0
Said this, all these can be avoided by using the POSIX specified character classes, under a new shell without any locale setting,
echo "${var//[[:upper:]]/}"
abcd0123
and
for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done

Related

Regex stored in a shell variable doesn't work between double brackets

The below is a small part of a bigger script I'm working on, but the below is giving me a lot of pain which causes a part of the bigger script to not function properly. The intention is to check if the variable has a string value matching red hat or Red Hat. If it is, then change the variable name to redhat. But it doesn't quite match the regex I've used.
getos="red hat"
rh_reg="[rR]ed[:space:].*[Hh]at"
if [ "$getos" =~ "$rh_reg" ]; then
getos="redhat"
fi
echo $getos
Any help will be greatly appreciated.

There are a multiple things to fix here
bash supports regex pattern matching within its [[ extended test operator and not within its POSIX standard [ test operator
Never quote our regex match string. bash 3.2 introduced a compatibility option compat31 (under New Features in Bash 1.l) which reverts bash regular expression quoting behavior back to 3.1 which supported quoting of the regex string.
Fix the regex to use [[:space:]] instead of just [:space:]
So just do
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ $rh_reg ]]; then
getos="redhat"
fi;
echo "$getos"
or enable the compat31 option from the extended shell option
shopt -s compat31
getos="red hat"
rh_reg="[rR]ed[[:space:]]*[Hh]at"
if [[ "$getos" =~ "$rh_reg" ]]; then
getos="redhat"
fi
echo "$getos"
shopt -u compat31
But instead of messing with those shell options just use the extended test operator [[ with an unquoted regex string variable.

There are two issues:
First, replace:
rh_reg="[rR]ed[:space:].*[Hh]at"
With:
rh_reg="[rR]ed[[:space:]]*[Hh]at"
A character class like [:space:] only works when it is in square brackets. Also, it appears that you wanted to match zero or more spaces and that is [[:space:]]* not [[:space:]].*. The latter would match a space followed by zero or more of anything at all.
Second, replace:
[ "$getos" =~ "$rh_reg" ]
With:
[[ "$getos" =~ $rh_reg ]]
Regex matches requires bash's extended test: [[...]]. The POSIX standard test, [...], does not have the feature. Also, in bash, regular expressions only work if they are unquoted.
Examples:
$ rh_reg='[rR]ed[[:space:]]*[Hh]at'
$ getos="red Hat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat
$ getos="RedHat"; [[ "$getos" =~ $rh_reg ]] && getos="redhat"; echo $getos
redhat

How to match repeated characters using regular expression operator =~ in bash?

I want to know if a string has repeated letter 6 times or more, using the =~ operator.
a="aaaaaaazxc2"
if [[ $a =~ ([a-z])\1{5,} ]];
then
echo "repeated characters"
fi
The code above does not work.

BASH regex flavor i.e. ERE doesn't support backreference in regex. ksh93 and zsh support it though.
As an alternate solution, you can do it using extended regex option in grep:
a="aaaaaaazxc2"
grep -qE '([a-zA-Z])\1{5}' <<< "$a" && echo "repeated characters"
repeated characters
EDIT: Some ERE implementations support backreference as an extension. For example Ubuntu 14.04 supports it. See snippet below:
$> echo $BASH_VERSION
4.3.11(1)-release
$> a="aaaaaaazxc2"
$> re='([a-z])\1{5}'
$> [[ $a =~ $re ]] && echo "repeated characters"
repeated characters

[[ $var =~ $regex ]] parses a regular expression in POSIX ERE syntax.
See the POSIX regex standard, emphasis added:
BACKREF - Applicable only to basic regular expressions. The character string consisting of a character followed by a single-digit numeral, '1' to '9'.
Backreferences are not formally specified by the POSIX standard for ERE; thus, they are not guaranteed to be available (subject to platform-specific libc extensions) in bash's native regex syntax, thus mandating the use of external tools (awk, grep, etc).

You do not need the full power of backreferences for this specific case of one character repeats. You could just build the regex that would check for a repeat of every single lower case letter
regex="a{6}"
for x in {b..z} ; do regex="$regex|$x{6}" ; done
if [[ "$a" =~ ($regex) ]] ; then echo "repeated characters" ; fi
The regex built with the above for loop looks like
> echo "$regex" | fold -w60
a{6}|b{6}|c{6}|d{6}|e{6}|f{6}|g{6}|h{6}|i{6}|j{6}|k{6}|l{6}|
m{6}|n{6}|o{6}|p{6}|q{6}|r{6}|s{6}|t{6}|u{6}|v{6}|w{6}|x{6}|
y{6}|z{6}
This regular expression behaves as you would expect
> if [[ "abcdefghijkl" =~ ($regex) ]] ; then \
echo "repeated characters" ; else echo "no repeat detected" ; fi
no repeat detected
> if [[ "aabbbbbbbbbcc" =~ ($regex) ]] ; then \
echo "repeated characters" ; else echo "no repeat detected" ; fi
repeated characters
Updated following the comment from #sln replaced bound {6,} expression with a simple {6}.

retrieve a word after a regular expression in shell script

I am trying to retrieve specific fields from a text file which has a metadata as follows:
project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN
And I have the following script for retrieving the field 'cell'
while read line
do
cell="$(echo $line | cut -d";" -f7 )"
echo $cell
fi
done < files.txt
However the following script retrieves the whole field as cell=ABC , whereas I just want the value 'ABC' from the field, how do I retrieve the value after the regex, in the same line of code?

If extracting one value (or, generally, a non-repeating set of values captured by distinct capture groups) is enough and you're running bash, ksh, or zsh, consider using the regex-matching operator, =~: [[ string =~ regex ]]:
Tip of the hat to #Adrian Frühwirth for the gist of the ksh and zsh solutions.
Sample input string:
string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
Shell-specific use of =~ is discussed next; a multi-shell implementation of the =~ functionality via a shell function can be found at the end.
bash
The special BASH_REMATCH array variable receives the results of the matching operation: element 0 contains the entire match, element 1 the first capture group's (parenthesized subexpression's) match, and so on.
bash 3.2+:
[[ $string =~ \ cell=([^;]+) ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
bash 4.x:
While the specific command above works, using regex literals in bash 4.x is buggy, notably when involving word-boundary assertions \< and \> on Linux; e.g., [[ a =~ \<a ]] inexplicably doesn't match; workaround: use an intermediate variable (unquoted!): re='\a'; [[ a =~ $re ]] works (also on bash 3.2+).
bash 3.0 and 3.1 - or after setting shopt -s compat31:
Quote the regex to make it work:
[[ $string =~ ' cell=([^;]+)' ]] && cell=${BASH_REMATCH[1]} # -> $cell == 'ABC'
ksh
The ksh syntax is the same as in bash, except:
the name of the special array variable that contains the matched strings is .sh.match (you must enclose the name in {...} even when just implicitly referring to the first element with ${.sh.match}):
[[ $string =~ \ cell=([^;]+) ]] && cell=${.sh.match[1]} # -> $cell == 'ABC'
zsh
The zsh syntax is also similar to bash, except:
The regex literal must be quoted - for simplicity as a whole, or at least some shell metacharacters, such as ;.
you may, but needn't double-quote a regex provided as a variable value.
Note how this quoting behavior differs fundamentally from that of bash 3.2+: zsh requires quoting only for syntax reasons and always treats the resulting string as a whole as a regex, whether it or parts of it were quoted or not.
There are 2 variables containing the matching results:
$MATCH contains the entire matched string
array variable $match contains only the matches for the capture groups (note that zsh arrays start with index 1 and that you don't need to enclose the variable name in {...} to reference array elements)
[[ $string =~ ' cell=([^;]+)' ]] && cell=$match[1] # -> $cell == 'ABC'
Multi-shell implementation of the =~ operator as shell function reMatch
The following shell function abstracts away the differences between bash, ksh, zsh with respect to the =~ operator; the matches are returned in array variable ${reMatches[#]}.
As #Adrian Frühwirth notes, to write portable (across zsh, ksh, bash) code with this, you need to execute setopt KSH_ARRAYS in zsh so as to make its arrays start with index 0; as a side effect, you also have to use the ${...[]} syntax when referencing arrays, as in ksh and bash).
Applied to our example we'd get:
# zsh: make arrays behave like in ksh/bash: start at *0*
[[ -n $ZSH_VERSION ]] && setopt KSH_ARRAYS
reMatch "$string" ' cell=([^;]+)' && cell=${reMatches[1]}
Shell function:
# SYNOPSIS
# reMatch string regex
# DESCRIPTION
# Multi-shell implementation of the =~ regex-matching operator;
# works in: bash, ksh, zsh
#
# Matches STRING against REGEX and returns exit code 0 if they match.
# Additionally, the matched string(s) is returned in array variable ${reMatch[#]},
# which works the same as bash's ${BASH_REMATCH[#]} variable: the overall
# match is stored in the 1st element of ${reMatch[#]}, with matches for
# capture groups (parenthesized subexpressions), if any, stored in the remaining
# array elements.
# NOTE: zsh arrays by default start with index *1*.
# EXAMPLE:
# reMatch 'This AND that.' '^(.+) AND (.+)\.' # -> ${reMatch[#]} == ('This AND that.', 'This', 'that')
function reMatch {
typeset ec
unset -v reMatch # initialize output variable
[[ $1 =~ $2 ]] # perform the regex test
ec=$? # save exit code
if [[ $ec -eq 0 ]]; then # copy result to output variable
[[ -n $BASH_VERSION ]] && reMatch=( "${BASH_REMATCH[#]}" )
[[ -n $KSH_VERSION ]] && reMatch=( "${.sh.match[#]}" )
[[ -n $ZSH_VERSION ]] && reMatch=( "$MATCH" "${match[#]}" )
fi
return $ec
}
Note:
function reMatch (as opposed to reMatch()) is used to declare the function, which is required for ksh to truly create local variables with typeset.

I would not use cut, since you cannot specify more than one delimiter.
If your grep supports PCRE, then you can do:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ grep -oP '(?<=cell=)[^;]+' <<< "$string"
ABC
You can use sed, which in simple terms can be done as -
$ sed -r 's/.*cell=([^;]+).*/\1/' <<< "$string"
ABC
Another option is to use awk. With that you can do the following by specifying list of delimiters you want to consider as field separators:
$ awk -F'[;= ]' '{print $5}' <<< "$string"
ABC
You can certainly put more checks by iterating over the line so that you don't have to hard-code to print 5th field.
Note that if your shell does not support here-string notation <<< then you can echo the variable and pipe it to the command.
$ echo "$string" | cmd

Here's a native shell solution:
$ string='project=XYZ; cell=ABC; strain=C3H; sex=F; age=PQR; treatment=None; id=MLN'
$ cell=${string#*cell=}
$ cell=${cell%%;*}
$ echo "${cell}"
ABC
This removes the shortest leading match up to including cell= from the string, then removes the longest trailing match up to including the ; leaving you with ABC.
Here's another solution which uses read to split the strings:
$ cat t.sh
#!/bin/bash
while IFS=$'; \t' read -ra attributes; do
for foo in "${attributes[#]}"; do
IFS='=' read -r key value <<< "${foo}"
[ "${key}" = cell ] && echo "${value}"
done
done <<EOF
foo=X; cell=ABC; quux=Z;
foo=X; cell=DEF; quux=Z;
EOF
.
$ ./t.sh
ABC
DEF
For solutions using external tools see #jaypal's excellent answer.

match leading dots in bash if using regex

Say I want to match the leading dot in a string ".a"
So I type
[[ ".a" =~ ^\. ]] && echo "ha"
ha
[[ "a" =~ ^\. ]] && echo "ha"
ha
Why am I getting the same result here?

You need to escape the dot it has meaning beyond just a period - it is a metacharacter in regex.
[[ "a" =~ ^\. ]] && echo "ha"
Make the change in the other example as well.
Check your bash version - you need 4.0 or higher I believe.

There's some compatibility issues with =~ between Bash versions after 3.0. The safest way to use =~ in Bash is to put the RE pattern in a var:
$ pat='^\.foo'
$ [[ .foo =~ $pat ]] && echo yes || echo no
yes
$ [[ foo =~ $pat ]] && echo yes || echo no
no
$
For more details, see E14 on the Bash FAQ page.

Probably it's because bash tries to treat "." as a \ character, like \n \r etc.
In order to tell \ & . as 2 separate characters, try
[[ "a" =~ ^\\. ]] && echo ha

use regular expression in if-condition in bash

I wonder the general rule to use regular expression in if clause in bash?
Here is an example
$ gg=svm-grid-ch
$ if [[ $gg == *grid* ]] ; then echo $gg; fi
svm-grid-ch
$ if [[ $gg == ^....grid* ]] ; then echo $gg; fi
$ if [[ $gg == ....grid* ]] ; then echo $gg; fi
$ if [[ $gg == s...grid* ]] ; then echo $gg; fi
$
Why the last three fails to match?
Hope you could give as many general rules as possible, not just for this example.

When using a glob pattern, a question mark represents a single character and an asterisk represents a sequence of zero or more characters:
if [[ $gg == ????grid* ]] ; then echo $gg; fi
When using a regular expression, a dot represents a single character and an asterisk represents zero or more of the preceding character. So ".*" represents zero or more of any character, "a*" represents zero or more "a", "[0-9]*" represents zero or more digits. Another useful one (among many) is the plus sign which represents one or more of the preceding character. So "[a-z]+" represents one or more lowercase alpha character (in the C locale - and some others).
if [[ $gg =~ ^....grid.*$ ]] ; then echo $gg; fi

Use
=~
for regular expression check Regular Expressions Tutorial Table of Contents

if [[ $gg =~ ^....grid.* ]]

Adding this solution with grep and basic sh builtins for those interested in a more portable solution (independent of bash version; also works with plain old sh, on non-Linux platforms etc.)
# GLOB matching
gg=svm-grid-ch
case "$gg" in
*grid*) echo $gg ;;
esac
# REGEXP
if echo "$gg" | grep '^....grid*' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep '....grid*' >/dev/null ; then echo $gg ; fi
if echo "$gg" | grep 's...grid*' >/dev/null ; then echo $gg ; fi
# Extended REGEXP
if echo "$gg" | egrep '(^....grid*|....grid*|s...grid*)' >/dev/null ; then
echo $gg
fi
Some grep incarnations also support the -q (quiet) option as an alternative to redirecting to /dev/null, but the redirect is again the most portable.

#OP,
Is glob pettern not only used for file names?
No, "glob" pattern is not only used for file names. you an use it to compare strings as well. In your examples, you can use case/esac to look for strings patterns.
gg=svm-grid-ch
# looking for the word "grid" in the string $gg
case "$gg" in
*grid* ) echo "found";;
esac
# [[ $gg =~ ^....grid* ]]
case "$gg" in ????grid*) echo "found";; esac
# [[ $gg =~ s...grid* ]]
case "$gg" in s???grid*) echo "found";; esac
In bash, when to use glob pattern and when to use regular expression? Thanks!
Regex are more versatile and "convenient" than "glob patterns", however unless you are doing complex tasks that "globbing/extended globbing" cannot provide easily, then there's no need to use regex.
Regex are not supported for version of bash <3.2 (as dennis mentioned), but you can still use extended globbing (by setting extglob ). for extended globbing, see here and some simple examples here.
Update for OP: Example to find files that start with 2 characters (the dots "." means 1 char) followed by "g" using regex
eg output
$ shopt -s dotglob
$ ls -1 *
abg
degree
..g
$ for file in *; do [[ $file =~ "..g" ]] && echo $file ; done
abg
degree
..g
In the above, the files are matched because their names contain 2 characters followed by "g". (ie ..g).
The equivalent with globbing will be something like this: (look at reference for meaning of ? and * )
$ for file in ??g*; do echo $file; done
abg
degree
..g

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Weird behavior of BASH glob/regex ranges - regex

Related

Regex stored in a shell variable doesn't work between double brackets

How to match repeated characters using regular expression operator =~ in bash?

retrieve a word after a regular expression in shell script

match leading dots in bash if using regex

use regular expression in if-condition in bash

Categories

Resources