regex quantifiers in bash --simple vs extended matching {n} times

I'm using the bash shell and trying to list files in a directory whose names match regex patterns. Some of these patterns work, while others don't. For example, the * wildcard is fine:
$ls FILE_*
FILE_123.txt FILE_2345.txt FILE_789.txt
And the range pattern captures the first two of these with the following:
$ls FILE_[1-3]*.txt
FILE_123.txt FILE_2345.txt
but not the filename with the "7" character after "FILE_", as expected. Great. But now I want to count digits:
$ls FILE_[0-9]{3}.txt
ls: FILE_[0-9]{3}.txt: No such file or directory
Shouldn't this give me the filenames with three numeric digits following "FILE_" (i.e. FILE_123.txt and FILE_789.txt, but not FILE_2345.txt)? Can someone tell me how I should be using the {n} quantifier (i.e. "match this pattern n times")?

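The shell expands the argument to ls as a glob pattern, not a regex, so you cannot use {3}. You have to use FILE_[0-9][0-9][0-9].txt. Or you could use the following command, with the dot escaped and the pattern anchored so that only complete names match:
ls | grep -E '^FILE_[0-9]{3}\.txt$'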
Edit:
Or you can also use the find command.
find . -regextype egrep -regex '.*/FILE_[0-9]{3}\.txt'
The .*/ prefix is needed to match a complete path. On Mac OS X :
find -E . -regex ".*/FILE_[0-9]{3}\.txt"

Bash filename expansion does not use regular expressions. It uses glob pattern matching, which is distinctly different. In FILE_[0-9]{3}.txt the shell first attempts brace expansion and then filename expansion; {3} is not a quantifier in either, so it stays literal text and the pattern only matches names such as FILE_1{3}.txt. Even bash's extended globbing feature has no equivalent to the regular expression {N}, so, as already mentioned, you have to use FILE_[0-9][0-9][0-9].txt.
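A minimal sketch of the two approaches, run in a scratch directory so that no real files are touched. The file names come from the question; the temporary directory and the variable names (glob_matches, regex_matches, re) are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory populated with the question's file names.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch FILE_123.txt FILE_2345.txt FILE_789.txt

# Glob: repeat the bracket expression once per required digit.
glob_matches=(FILE_[0-9][0-9][0-9].txt)

# Regex: expand a loose glob, then filter each name with [[ =~ ]],
# which understands the {3} quantifier.
re='^FILE_[0-9]{3}\.txt$'
regex_matches=()
for f in FILE_*; do
    if [[ $f =~ $re ]]; then
        regex_matches+=("$f")
    fi
done

printf 'glob:  %s\n' "${glob_matches[@]}"
printf 'regex: %s\n' "${regex_matches[@]}"
```

Both arrays should end up holding FILE_123.txt and FILE_789.txt. Filtering in the shell also avoids parsing the output of ls.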

Related

regex in bash-script to exclude certain word

I want to exclude "cgs" and "CGS" but select all other data.
Testdata:
exclude this-->
CSP999_20151204080019_0054236_000_CGS.csv
CSP999_20151204080019_0054236_000_cgs.csv
accept all other.
I tried something like this .*([Cc][Gg][Ss]).* to select the cgs, but I don't understand the exclude thing =) It must be a filename_pattern without grep.
Kind Regards,
Bobby
Does it have to be a regexp? You can easily do it with a glob pattern, if you set in your script
shopt -s extglob
to enable extended globbing. You would then use the pattern
!(*[Cc][Gg][Ss]*)
to generate all entries which do NOT have CGS in their name.
grep --invert-match --ignore-case cgs < filenames_list
extglob bash option
Try this:
ls -ld "$path"/!(*[cC][gG][sS].csv)
And have a look at
man -Pless\ +/extglob bash
If the extglob shell option is enabled using the shopt builtin, several
extended pattern matching operators are recognized. In the following
description, a pattern-list is a list of one or more patterns separated
by a |. Composite patterns may be formed using one or more of the
following sub-patterns:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
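As a rough illustration of the operators listed in the manual excerpt, the sketch below tests strings against extended globs with [[ == ]]. The helper name matches and the second sample file name (data_20151204.csv) are hypothetical; the first file name is taken from the question.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enable ?(), *(), +(), @() and !() in patterns

# Succeed if the whole string $1 matches the extended glob $2.
matches() {
    [[ $1 == $2 ]]   # an unquoted right-hand side is treated as a pattern
}

matches colour 'colo?(u)r'  && echo '?(u): zero or one "u"'
matches aaa    '*(a)'       && echo '*(a): zero or more "a"'
matches 2345   '+([0-9])'   && echo '+([0-9]): one or more digits'
matches dog    '@(cat|dog)' && echo '@(cat|dog): exactly one alternative'

# The negated pattern applied to one name from the question and one
# hypothetical name that does not contain "cgs".
for name in CSP999_20151204080019_0054236_000_CGS.csv data_20151204.csv; do
    if matches "$name" '!(*[Cc][Gg][Ss]*)'; then
        echo "kept: $name"
    fi
done
```

Only the hypothetical data_20151204.csv should be reported as kept, because the negation rejects any name containing cgs in any letter case.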
Following may help you:
ls |grep -iEv "cgs"
Using invert match of grep:
grep -v 'cgs\|CGS' <filelist
Or,
ls | grep -v 'cgs\|CGS'

Sed command and regular expressions

I need to change 'bind a.b.c.d:80' to 'bind x.b.f.d:80'. I wrote the command below for this purpose, but it is not working and I don't know why.
sed -i 's,bind *:35357,bind x.y.z.a:35357,' haproxy-sample.cfg
Any help would be appreciated.
Thanks in advance.
Executed from bash
$ echo bind 12.54.36.165:35357 | sed "s/bind [^:]\+:35357/bind a.y.z.a:35357/"
bind a.y.z.a:35357
I think you need something like this:
sed -r 's/^(bind\s+)([^.]+)\.([^.]+)\.([^.]+)\.([^.:]+)(:80(\s+.*)?)$/\1x.\3.f.\5\6/g' haproxy-sample.cfg
where you replace the literal x. and .f. in the replacement part of the s command with the strings you need.
It captures the parts of the line (bind, the four numbers of the address, and everything from :80 to the end of the line) into \1 to \6 and recombines the captured strings with the x and f.
The problem is your pattern contains a lone * -- you're probably thinking about shell globbing patterns where * matches 0 or more of any characters. sed does not speak shell patterns, it speaks regular expressions.
Where you have just *, you need .* -- in a regular expression the . character is a wildcard that matches any one character, and * is a quantifier that matches zero or more of the preceding thing.
sed -i 's/\(bind \).*\(:35357\)/\1x.y.z.a\2/' haproxy-sample.cfg
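To see the difference without editing a file, the following sketch pipes one sample line through both patterns. The sample address is the one used in an earlier answer; the variable names are illustrative.

```shell
#!/usr/bin/env bash
line='bind 12.54.36.165:35357'

# Glob-style thinking: in a regex, "*" only repeats the preceding space,
# so "bind *:35357" needs ":" right after the spaces and does not match.
unchanged=$(printf '%s\n' "$line" | sed 's,bind *:35357,bind x.y.z.a:35357,')

# Regex: ".*" means "any run of characters", so the address is matched.
changed=$(printf '%s\n' "$line" | sed 's,bind .*:35357,bind x.y.z.a:35357,')

printf '%s\n%s\n' "$unchanged" "$changed"
```

The first substitution leaves the line as it was, while the second rewrites it to bind x.y.z.a:35357.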

Regular Expression in Find command

I want to list the files whose names start with a number and end with the ".c" extension. The following is the find command I used, but it does not give the expected output.
Command:
find -type f -regex "^[0-9].*\\.c$"
It's because the regex option works with the full path and you specified only the file name. From man find:
-regex pattern
File name matches regular expression pattern. This is a match on the whole
path, not a search. For example, to match a file named './fubar3', you can use
the regular expression '.*bar.' or '.*b.*3', but not 'f.*r3'.
The regular expressions understood by find are by default Emacs Regular
Expressions, but this can be changed with the -regextype option.
Try this:
find -type f -regex ".*/[0-9][^/]*\.c$"
Here the leading .*/ matches everything up to the final slash, and the rest of the regex describes the file name itself.
UPDATE: I corrected the regex. I changed .* in the filename part to [^/]* because, after "any string that terminates with a slash", another slash must not match: that part would then be a directory, not a filename.
NOTE: An unrestricted .* can be harmful here, because it also matches slashes and so lets the pattern span several path components.
Just use the -name option. It accepts a pattern for the last component of the path name, as the documentation says:
-name pattern
True if the last component of the pathname being examined matches
pattern. Special shell pattern matching characters (``['',
``]'', ``*'', and ``?'') may be used as part of pattern. These
characters may be matched explicitly by escaping them with a
backslash (``\'').
So:
$ find -type f -name "[0-9]*.c"
should work.
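The two approaches can be compared on a scratch tree. This is only a sketch: the directory layout and file names below are made up for illustration, and the regex sticks to syntax that both GNU and BSD find are expected to accept by default.

```shell
#!/usr/bin/env bash
# Scratch tree: one matching file, one non-matching file, and a directory
# whose name starts with a digit, to show that only the last component counts.
root=$(mktemp -d)
mkdir "$root/9dir"
touch "$root/1main.c" "$root/util.c" "$root/9dir/notes.c"

# -regex must match the whole path: the leading ".*/" absorbs the directories
# and "[^/]*" keeps the rest of the match inside the final component.
by_regex=$(find "$root" -type f -regex '.*/[0-9][^/]*\.c')

# -name applies a glob to the last path component only.
by_name=$(find "$root" -type f -name '[0-9]*.c')

printf 'regex: %s\nname:  %s\n' "$by_regex" "$by_name"
```

Both commands should report only 1main.c. The file 9dir/notes.c is skipped even though its parent directory starts with a digit.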

bash regular expression different formats

I have used a regular expression in my code like this: .*[^0-9].*
But recently I have seen some functions written like this: *[!0-9]* for the same purpose as the first example, that is, detecting values that are not integers.
So I am confused about which is the true form of a regex and what the difference between them is.
Can anybody help me with this?
There is only one regular expression - the first one. The second one is a glob pattern.
See regex(7) for the description of POSIX extended regular expressions supported by Bash:
http://man7.org/linux/man-pages/man7/regex.7.html
See Bash manual for the description of glob patterns: http://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Bash itself uses regular expressions only in the [[…]] command, with the =~ operator: http://www.gnu.org/software/bash/manual/html_node/Conditional-Constructs.html
Bash uses glob patterns for everything else.
POSIX defines:
1) two types of regular expressions: BREs and EREs. These are used by utilities / built-ins.
BREs are more restricted and exist for backwards compatibility and for less typing in an interactive session. Avoid them if possible and use EREs instead, which are more flexible and Perl-like.
Some utilities allow you to choose between both types of regular expressions.
For example, grep matches BREs by default (backwards compatibility...), but you can make it match EREs with -E.
You usually must quote them before passing them to utilities, or the shell may apply filename expansion to them.
.*[^0-9].* is valid as either a BRE or an ERE. In both cases it means the same as the Perl regex, which is equivalent to the glob *[!0-9]*.
The main difference between BRE and ERE is that EREs add more useful Perl-like special characters such as (a|b), a{m,n}, a+, a?. Examples:
echo a | grep '(a|b)'
# output:
echo a | grep -E '(a|b)'
# output: a
echo a | grep 'a{1,2}'
# output:
echo a | grep -E 'a{1,2}'
# output: a
2) Patterns Used for Filename Expansion, also known as globs (used by the POSIX glob C function). The shell usually expands these to the matching filenames before the utility sees them. If you quote them, they are no longer expanded.
*[!0-9]* must be a glob, since BREs and EREs use ^ instead of ! for negation inside brackets.
echo *[!0-9]*
# output: filenames that contain at least one non-digit character
echo '*[!0-9]*'
# output: *[!0-9]*
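Inside bash itself, the same check can be written in either notation. The sketch below is illustrative only: the function names are hypothetical, and the sample values are made up.

```shell
#!/usr/bin/env bash
# Succeed if $1 contains at least one non-digit character (regex form).
has_non_digit_regex() {
    [[ $1 =~ [^0-9] ]]      # =~ searches, so no surrounding .* is needed
}

# The same test as a glob; == must match the whole string, hence the *s.
has_non_digit_glob() {
    [[ $1 == *[!0-9]* ]]
}

for value in 12345 12a45 3.14; do
    if has_non_digit_regex "$value" && has_non_digit_glob "$value"; then
        echo "$value: contains a non-digit"
    else
        echo "$value: digits only"
    fi
done
```

The regex form negates with ^ and needs no wildcards around the bracket because =~ matches anywhere in the string. The glob form negates with ! and needs the surrounding * because == compares against the whole string.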

Bash string replacement with regex repetition

I have a file: filename_20130214_suffix.csv
I'd like to replace the yyyymmdd part in bash. Here is what I intend to do:
file=`ls -t /path/filename_* | head -1`
file2=${file/20130214/20130215}
#this will not work
#file2=${file/[0-9]{8}/20130215/}
The problem is that parameter expansion does not use regular expressions but patterns, or globs (compare the regular expression "filename_.*.csv" with the glob "filename_*.csv"). Globs have no quantifier for repeating a subpattern a fixed number of times.
However, you can enable extended patterns in bash, which should be close enough to what you want.
shopt -s extglob # Turn on extended pattern support
file2=${file/+([0-9])/20130215}
You can't match exactly 8 digits this way, but +(...) lets you match one or more occurrences of the pattern inside the parentheses, which should be sufficient for your use case.
Since all you want to do in this case is replace everything between the _ characters, you could also simply use
file2=${file/_*_/_20130215_}
[[ $file =~ ^([^_]+_)[0-9]{8}(_.*) ]] && file2="${BASH_REMATCH[1]}20130215${BASH_REMATCH[2]}"
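The one-liner above relies on bash's =~ operator, which does support {8}, and on the BASH_REMATCH array that holds the captured groups. A spelled-out sketch of the same idea follows, using the bare file name from the question. Keeping the regex in a variable named re and falling back to the unchanged name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
file='filename_20130214_suffix.csv'

# Keeping the regex in a variable avoids quoting surprises inside [[ ]].
# Group 1: everything up to and including the first "_";
# then exactly eight digits; group 2: the rest, starting at the next "_".
re='^([^_]+_)[0-9]{8}(_.*)$'

if [[ $file =~ $re ]]; then
    file2="${BASH_REMATCH[1]}20130215${BASH_REMATCH[2]}"
else
    file2=$file   # assumption: leave the name unchanged when nothing matches
fi

echo "$file2"
```

Because group 1 is [^_]+_, this form expects no underscore before the one that precedes the date. A directory path containing underscores would need a different first group.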