bash regular expression different formats - regex

I have used a regular expression in my code like this: .*[^0-9].*
But recently I have seen some functions implemented like this: *[!0-9]* for the same purpose as the first example, that is, detecting values that are not integers.
So I am confused about which is the true form of a regex and what the difference between them is.
Can anybody help me with this issue?

There is only one regular expression - the first one. The second one is a glob pattern.
See regex(7) for the description of POSIX extended regular expressions supported by Bash:
http://man7.org/linux/man-pages/man7/regex.7.html
See Bash manual for the description of glob patterns: http://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Bash uses regular expressions only in the [[…]] command, with the =~ operator: http://www.gnu.org/software/bash/manual/html_node/Conditional-Constructs.html
Bash uses glob patterns for everything else.
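As a rough illustration of that split, the sketch below tests the same string once with a regular expression (=~ inside [[…]]) and twice with a glob pattern (== inside [[…]], and case). The function names are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: each succeeds when $1 contains at least one non-digit.

has_non_digit_regex() {
    # =~ takes a POSIX ERE; an unanchored regex may match anywhere in the string.
    [[ $1 =~ [^0-9] ]]
}

has_non_digit_glob() {
    # == takes a glob, which must match the whole string, hence the two *.
    [[ $1 == *[!0-9]* ]]
}

has_non_digit_case() {
    # case also uses glob patterns and is available in any POSIX shell.
    case $1 in
        *[!0-9]*) return 0 ;;
        *)        return 1 ;;
    esac
}

for s in 12345 12a45; do
    if has_non_digit_regex "$s"; then
        echo "$s: has a non-digit"
    else
        echo "$s: digits only"
    fi
done
```

All three helpers should agree on any input; the regex form needs no surrounding .* because it is not anchored.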

POSIX defines:
1) two types of regular expressions: BREs and EREs. These are used by utilities / built-ins.
BREs are more restricted and exist for backwards compatibility and for typing less in an interactive session. Avoid them if possible and use EREs instead, which are more flexible and Perl-like.
Some utilities allow you to choose between both types of regular expressions.
For example, grep matches BREs by default (backwards compatibility...), but you can make it match EREs with -E.
You usually must quote those before passing them to utilities, or the shell will perform filename expansion on them.
.*[^0-9].* could be either a BRE or an ERE. In both cases it means the same as the Perl regex, which is equivalent to the glob *[!0-9]*.
The main difference between BREs and EREs is that EREs add more useful Perl-like special characters such as (a|b), a{m,n}, a+, a?. Examples:
echo a | grep '(a|b)'
# output:
echo a | grep -E '(a|b)'
# output: a
echo a | grep 'a{1,2}'
# output:
echo a | grep -E 'a{1,2}'
# output: a
2) Patterns Used for Filename Expansion, also known as globs (used by the POSIX glob C function). These are usually expanded by the shell before going to the utilities and expand to match filenames. If you quote them, they no longer expand.
*[!0-9]* must be a glob, since BREs and EREs use ^ instead of ! to negate a bracket expression.
echo *[!0-9]*
# output: filenames that contain at least one non-digit character
echo '*[!0-9]*'
# output: *[!0-9]*
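To complement the grep examples above, here is a small sketch of how the interval operator is spelled in each regex type: in a BRE the braces must be backslash-escaped, while in an ERE they are special as written. The helper names matches_bre, matches_ere and report are made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: succeed when string $2 matches regex $1.
matches_bre() { printf '%s\n' "$2" | grep -q -- "$1"; }    # BRE (grep default)
matches_ere() { printf '%s\n' "$2" | grep -Eq -- "$1"; }   # ERE (grep -E)

report() {  # report <label> <exit status>
    if [ "$2" -eq 0 ]; then echo "$1: match"; else echo "$1: no match"; fi
}

matches_bre 'a{1,2}' 'a';    report 'BRE a{1,2}' $?      # braces are literal in a BRE
matches_bre 'a\{1,2\}' 'a';  report 'BRE a\{1,2\}' $?    # escaped braces form an interval
matches_ere 'a{1,2}' 'a';    report 'ERE a{1,2}' $?      # braces form an interval in an ERE
```

The regexes are single-quoted throughout so that the shell passes them to grep untouched.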

Related

regex quantifiers in bash --simple vs extended matching {n} times

I'm using the bash shell and trying to list files in a directory whose names match regex patterns. Some of these patterns work, while others don't. For example, the * wildcard is fine:
$ ls FILE_*
FILE_123.txt FILE_2345.txt FILE_789.txt
And the range pattern captures the first two of these with the following:
$ ls FILE_[1-3]*.txt
FILE_123.txt FILE_2345.txt
but not the filename with the "7" character after "FILE_", as expected. Great. But now I want to count digits:
$ ls FILE_[0-9]{3}.txt
ls: FILE_[0-9]{3}.txt: No such file or directory
Shouldn't this give me the filenames with three numeric digits following "FILE_" (i.e. FILE_123.txt and FILE_789.txt, but not FILE_2345.txt)? Can someone tell me how I should be using the {n} quantifier (i.e. "match this pattern n times")?
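The shell expands the argument as a glob pattern before ls runs, and glob patterns have no {3} quantifier. You have to use FILE_[0-9][0-9][0-9].txt. Or you could use the following command, which filters the listing with an anchored extended regular expression (note the escaped dot).
ls | grep -E '^FILE_[0-9]{3}\.txt$'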
Edit:
Or you can also use the find command (the -regextype option is specific to GNU find).
find . -regextype egrep -regex '.*/FILE_[0-9]{3}\.txt'
The .*/ prefix is needed to match a complete path. On Mac OS X:
find -E . -regex ".*/FILE_[0-9]{3}\.txt"
Bash filename expansion does not use regular expressions. It uses glob pattern matching, which is distinctly different. What you're trying with FILE_[0-9]{3}.txt goes through brace expansion followed by filename expansion; {3} is not a valid brace expression, so it stays as literal text that no filename matches. Even bash's extended globbing feature doesn't have an equivalent to regular expression's {N}, so as already mentioned you have to use FILE_[0-9][0-9][0-9].txt.
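If the digit count needs to stay a regex quantifier, one option is to let a broad glob pick the candidates and filter them with =~ in a loop. This is only a sketch, assuming a Bash with regex support; the function name list_three_digit_files is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print names of the form FILE_<three digits>.txt found in directory $1.
list_three_digit_files() {
    local dir=$1 path name
    # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
    local re='^FILE_[0-9]{3}\.txt$'
    for path in "$dir"/FILE_*.txt; do      # the glob selects candidate files
        name=${path##*/}                   # strip the directory part
        if [[ $name =~ $re ]]; then        # the regex enforces exactly three digits
            printf '%s\n' "$name"
        fi
    done
}
```

Calling list_three_digit_files . in the directory from the question would be expected to print FILE_123.txt and FILE_789.txt. If no file matches the glob, the loop sees the unexpanded pattern once, and the regex rejects it.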

grep only certain expressions involving quotation marks

I have a txt file from which I want to get only the expressions of the type
'USA_word*' where * is whatever ( I don't want the whole line, only the expressions )
I try the command
grep -oP ''USA_word*''
But I get a list :
USA_word
USA_word
USA_word
.....
without the part signified by the *.
You may use
grep -o 'USA_word[^[:blank:]]*'
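The [^[:blank:]]* part matches zero or more characters that are not blanks (spaces or tabs). In your attempt, word* means "wor" followed by zero or more "d" characters, so nothing after USA_word is ever matched.
Besides, this does not use the -P (PCRE) option; it is a plain POSIX BRE, which makes it compatible with the majority of grep implementations.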
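A quick way to try this out, using sample text invented for the illustration (the real file contents are not shown in the question) and a hypothetical wrapper named extract_tokens:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each USA_word... token from stdin on its own line.
extract_tokens() {
    grep -o 'USA_word[^[:blank:]]*'
}

# Sample text invented for illustration.
printf '%s\n' 'id=7 USA_word_alpha and USA_word42' 'no token here' 'USA_word' |
    extract_tokens
```

Note that a quote character directly after the token would be included in the match, since it is not a blank; adding it to the bracket expression would exclude it.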

How do I reference a shell variable and arbitrary digits inside a grep regex?

I am looking to translate this regular expression into grep flavour:
I am trying to filter all lines that contain refs/changes/\d+/$VAR/
Example of line that should match, assuming that VAR=285900
b3fb1e501749b98c69c623b8345a512b8e01c611 refs/changes/00/285900/9
Current code:
VAR=285900
grep 'refs/changes/\d+/$VAR/' sample.txt
I am trying to filter all lines that contain refs/changes/\d+/$VAR/
That would be
grep "refs/changes/[[:digit:]]\{1,\}/$VAR/"
or
grep -E "refs/changes/[[:digit:]]+/$VAR/"
Note that the \d+ notation is a Perl thing. Some overfeatured greps might support it with an option, but I don't recommend it for portability reasons.
inside simple quotes I cannot use variable expansion
You can mix and match quotes:
foo=not; echo 'single quotes '"$foo"' here'
with double quotes it does not match anything.
It's not clear what you're doing, so we can't say why it doesn't work. It should work. There is no need to escape forward slashes for grep, they don't have any special meaning.
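For reference, a minimal sketch of the double-quoted form in action. The first input line comes from the question; the second is invented as a non-matching case, and the function name filter_change is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print lines from stdin containing refs/changes/<digits>/<id>/
filter_change() {
    # Double quotes let the shell expand $1 before grep sees the regex.
    grep -E "refs/changes/[[:digit:]]+/$1/"
}

VAR=285900
printf '%s\n' \
    'b3fb1e501749b98c69c623b8345a512b8e01c611 refs/changes/00/285900/9' \
    '0000000000000000000000000000000000000000 refs/changes/01/285901/1' |
    filter_change "$VAR"
```

This assumes the variable holds only digits. If it could contain regex metacharacters, they would be interpreted by grep and would need escaping first.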

How to write conditional code depending on Bash version / features?

I'm using the =~ in one of my Bash scripts. However, I need to make the script compatible with Bash versions that do not support that operator, in particular the version of Bash that ships with msysgit, which is not built against libregex. I have a work-around that uses expr match instead, but that again does not work on Mac OS X for some reason.
Thus, I want to use the expr match only if [ -n "$MSYSTEM" -a ${BASH_VERSINFO[0]} -lt 4 ] and use =~ otherwise. The problem is that Bash always seems to parse the whole script, and even if msysgit Bash would execute the work-around at runtime, it still stumbles upon the =~ when parsing the script.
Is it possible to conditionally execute code in the same script depending on the Bash version, or should I look into another way to address this issue?
In your case, you can replace the regular expression with an equivalent pattern match.
[[ $foo = \[+([0-9])\][[:space:]]* ]]
Some explanations:
Patterns are matched against the entire string. The following regexes and patterns are equivalent:
^foo$ and foo
^foo and foo*
foo$ and *foo
foo and *foo*
+(...) matches one or more occurrences of the enclosed pattern, which in this case is [0-9]. That is, if $pattern and $regex match the same string, then so do +($pattern) and ($regex)+.
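The correspondences listed above can be checked with a short sketch like the following. It assumes extglob is available and uses grep -E for the regex side, so that it does not itself depend on =~; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables +( ) and the other extended pattern operators

# Hypothetical helpers for comparing the two notations.
matches_regex()   { printf '%s\n' "$1" | grep -Eq -- "$2"; }   # ERE via grep -E
matches_pattern() { [[ $1 == $2 ]]; }                           # unquoted $2 is treated as a glob

for s in foo foobar barfoo 123 12a; do
    matches_regex   "$s" '^foo'     && r1=yes || r1=no
    matches_pattern "$s" 'foo*'     && p1=yes || p1=no
    matches_regex   "$s" '^[0-9]+$' && r2=yes || r2=no
    matches_pattern "$s" '+([0-9])' && p2=yes || p2=no
    echo "$s: regex ^foo=$r1, glob foo*=$p1, regex ^[0-9]+\$=$r2, glob +([0-9])=$p2"
done
```

For every test string, each regex column should agree with the glob column next to it.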
My current solution is to use grep -q on all platforms instead. This avoids any conditionals or complicated code constructs.
Probably using eval to parse the code containing =~ only at runtime would have worked, too, but then again that would have made the code more complicated to read.
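A sketch of how the two ideas could be combined: probe for =~ inside eval so the parser never sees the operator directly, and fall back to grep -Eq when the probe fails. This is untested on msysgit and assumes that a syntax error inside eval only makes the eval itself fail; the function name regex_match is hypothetical.

```shell
#!/usr/bin/env bash
# The =~ operator appears only inside quoted strings, so a Bash built
# without regex support does not encounter it while parsing the script.
if eval '[[ a =~ a ]]' 2>/dev/null; then
    # Native matching: $1 is the string, $2 the ERE (expanded when eval runs).
    regex_match() { eval '[[ $1 =~ $2 ]]'; }
else
    # Fallback: let grep -E do the matching.
    regex_match() { printf '%s\n' "$1" | grep -Eq -- "$2"; }
fi

if regex_match '[12] some text' '^\[[0-9]+\]'; then
    echo "matched"
fi
```

The two branches are not identical in every case: grep works line by line, so strings containing newlines may behave differently from the native operator.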
For this particular pattern, an equivalent but portable case statement can be written. It needs a fairly substantial number of different glob patterns to enumerate all the corner cases, though.
case $foo in
    [![]* | \[[!0-9]* | *[!][0-9[:space:]]* | *[!0-9[:space:]] | \
    *\]*[![:space:]] | *[!0-9]\]* | \[*\[* | *\]*\]* )
        return 1;; # false
    \[*[0-9]*\]* )
        return 0;; # true
    *)
        return 1;; # false
esac

Bash string replacement with regex repetition

I have a file: filename_20130214_suffix.csv
I'd like replace the yyyymmdd part in bash. Here is what I intend to do:
file=`ls -t /path/filename_* | head -1`
file2=${file/20130214/20130215}
#this will not work
#file2=${file/[0-9]{8}/20130215/}
The problem is that parameter expansion does not use regular expressions, but patterns or globs (compare the difference between the regular expression "filename_.*.csv" and the glob "filename_*.csv"). Globs have no quantifier for repeating a sub-pattern a fixed number of times.
However, you can enable extended patterns in bash, which should be close enough to what you want.
shopt -s extglob # Turn on extended pattern support
file2=${file/+([0-9])/20130215}
There is no quantifier for exactly 8 digits, but the +(...) lets you match one or more of the pattern inside the parentheses, which should be sufficient for your use case.
Since all you want to do in this case is replace everything between the _ characters, you could also simply use
file2=${file/_*_/_20130215_}
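Alternatively, in a Bash that supports =~, a regular expression match with BASH_REMATCH can require the eight digits explicitly:
[[ $file =~ ^([^_]+_)[0-9]{8}(_.*) ]] && file2="${BASH_REMATCH[1]}20130215${BASH_REMATCH[2]}"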
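Another option that stays within plain glob patterns is to write the bracket expression out eight times, since parameter expansion accepts that without extglob. A minimal sketch, with a hypothetical function name replace_date:

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the first run of eight consecutive digits in $1 with $2.
replace_date() {
    local name=$1 new=$2
    # Eight bracket expressions in a row stand in for the regex [0-9]{8}.
    printf '%s\n' "${name/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/$new}"
}

replace_date filename_20130214_suffix.csv 20130215
# prints: filename_20130215_suffix.csv
```

A name without eight consecutive digits is returned unchanged. A longer digit run would have only its first eight digits replaced, so this is not a strict "exactly eight" check.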