Regular Expressions for file name matching - regex

In Bash, how does one match a regular expression with multiple criteria against a file name?
For example, I'd like to match against all the files with .txt or .log endings.
I know how to match one type of criteria:
for file in *.log
do
echo "${file}"
done
What's the syntax for a logical or to match two or more types of criteria?

Bash does not support regular expressions per se when globbing (filename matching). Its globbing syntax, however, can be quite versatile. For example:
for i in A*B.{log,txt,r[a-z][0-9],c*} Z[0-5].c; do
...
done
will apply the loop contents on all files that start with A and end in a B, then a dot and any of the following extensions:
log
txt
r followed by a lowercase letter followed by a single digit
c followed by pretty much anything
It will also apply the loop commands to any file starting with Z, followed by a digit in the 0-5 range and then the .c extension.
If you really want/need to, you can enable extended globbing with the shopt builtin:
shopt -s extglob
which then allows significantly more features while matching filenames, such as sub-patterns etc.
See the Bash manual for more information on supported expressions:
http://www.gnu.org/software/bash/manual/bash.html#Pattern-Matching
EDIT:
If an expression does not match a filename, bash by default will substitute the expression itself (e.g. it will echo *.txt) rather than an empty string. You can change this behaviour by setting the nullglob shell option:
shopt -s nullglob
With nullglob set, a *.txt that matches no files expands to nothing at all (zero words, not an empty string), so a for loop over it simply runs zero times.
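As a rough illustration of the difference (a sketch only; the scratch directory from mktemp and the array names `without` and `with` are assumptions made for the demo), the following compares how an unmatched *.txt expands with nullglob off and on:

```shell
#!/usr/bin/env bash
# Work in an empty scratch directory so that no *.txt file can exist.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Default behaviour: the unmatched pattern is passed through literally.
shopt -u nullglob
without=( *.txt )
echo "nullglob off: ${#without[@]} word(s): ${without[*]}"

# With nullglob the unmatched pattern expands to nothing at all.
shopt -s nullglob
with=( *.txt )
echo "nullglob on:  ${#with[@]} word(s)"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Collecting the expansion in an array makes the word count visible; in a for loop the same effect shows up as one iteration over the literal pattern versus no iterations at all.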
EDIT 2:
I suggest that you also check out the shopt builtin and its options, since quite a few of them affect filename pattern matching, as well as other aspects of the shell:
http://www.gnu.org/software/bash/manual/bash.html#The-Shopt-Builtin

Do it the same way you'd invoke ls. You can specify multiple wildcards one after the other:
for file in *.log *.txt

for file in *.{log,txt} ..

for f in $(find . -regex ".*\.log")
do
echo "$f"
done
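If find is used for this, a somewhat more defensive variant may be worth considering, since looping over the output of a command substitution splits names on whitespace. The following is only a sketch: the scratch directory, the sample file names, and the `matches` array are made up for the demonstration, and `sort -z` is assumed to be available (it is in GNU and BSD sort).

```shell
#!/usr/bin/env bash
# Scratch directory with a few sample files (names invented for the demo).
workdir=$(mktemp -d)
touch "$workdir/a.log" "$workdir/b.txt" "$workdir/c d.log" "$workdir/e.csv"

matches=()
# -print0 together with read -d '' keeps names containing spaces intact.
while IFS= read -r -d '' file; do
    matches+=( "${file##*/}" )   # store only the base name
done < <(find "$workdir" -type f \( -name '*.log' -o -name '*.txt' \) -print0 | sort -z)

printf '%s\n' "${matches[@]}"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The two -name tests joined with -o play the role of the "logical or" from the question; the file with a space in its name survives as a single entry.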

You simply add the other conditions to the end:
for VARIABLE in 1 2 3 4 5 .. N
do
command1
command2
commandN
done
So in your case:
for file in *.log *.txt
do
echo "${file}"
done

You can also do this:
shopt -s extglob
for file in *.+(log|txt)
which could be easily extended to more alternatives:
for file in *.+(log|txt|mp3|gif|foo)
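One detail that may matter here: +(...) means "one or more" of the alternatives, whereas @(...) means "exactly one". A small sketch of the difference, using an invented set of files in a scratch directory (the file names and the arrays `plus` and `at` are assumptions for illustration):

```shell
#!/usr/bin/env bash
# extglob has to be enabled before the lines using the patterns are parsed.
shopt -s extglob

workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch app.log notes.txt odd.loglog photo.gif

# One or more repetitions of an alternative: also matches odd.loglog.
plus=( *.+(log|txt) )
# Exactly one of the alternatives.
at=( *.@(log|txt) )

echo "+(...): ${plus[*]}"
echo "@(...): ${at[*]}"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

For plain extension matching, @(...) is probably the closer equivalent of a regex group such as (log|txt).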

Related

Match X or Y in grep regular expression

I'm trying to run a fairly simple regular expression to clear out some home directories. For background: I'm trying to ask users on my system to clear out their unnecessary files to clear up space on their home directories, so I want to inform users with scripts such as Anaconda / Miniconda installation scripts that they can clear that out.
To generate a list of users who might need such an email, I'm trying to run a simple regular expression to list all homedirs that contain such an installation script. So my assumption would be that the following should suffice:
for d in $(ls -d /home/); do
if $(ls $d | grep -q "(Ana|Mini)conda[23].*\.sh"); then
echo $d;
fi;
done;
But after running this, it resulted in nothing at all, sadly. After a while looking, I noticed that grep does not interpret regular expressions as I would expect it to. The following:
echo "Lorem ipsum dolor sit amet" | grep "(Lorem|Ipsum) ipsum"
results in no matches at all. Which would then explain why the above forloop wouldn't work either.
My question then is: is it possible to match the specified regular expression (Ana|Mini)conda[23].*\.sh, in the same way it matches strings in https://regex101.com/r/yxN61p/1? Or is there some other way to find all users who have such a file in their homedir using a simple for-loop in bash?
Short answer: grep defaults to Basic Regular Expressions (BRE), but unescaped () and | are part of Extended Regular Expressions (ERE). GNU grep, as an extension, supports alternation in BRE mode (which isn't technically part of BRE), but you have to escape the parentheses and the pipe with backslashes:
grep -q "\(Ana\|Mini\)conda[23].*\.sh"
Or you can indicate that you want to use ERE:
grep -Eq "(Ana|Mini)conda[23].*\.sh"
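To see the BRE/ERE difference without touching any home directory, the same pattern can be tried against a few sample names. This is a sketch only; the file names in `names` are hypothetical stand-ins for a directory listing.

```shell
#!/usr/bin/env bash
# Hypothetical file names standing in for the contents of a home directory.
names=$'Anaconda3-2019.10-Linux-x86_64.sh\nMiniconda2-latest.sh\nnotes.txt'

# BRE: unescaped ( | ) are literal characters, so nothing matches.
bre_count=$(printf '%s\n' "$names" | grep -c '(Ana|Mini)conda[23].*\.sh')

# ERE: ( | ) act as grouping and alternation.
ere_count=$(printf '%s\n' "$names" | grep -Ec '(Ana|Mini)conda[23].*\.sh')

echo "BRE matches: $bre_count, ERE matches: $ere_count"
```

grep -c prints the number of matching lines, which makes the two modes easy to compare side by side.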
Longer answer: this all being said, you don't need grep, and parsing the output of ls comes with a lot of pitfalls. Instead, you can use globs:
printf '%s\n' /home/*/*{Ana,Mini}conda[23]*.sh
should do it, if I understand the intention correctly.
This uses the fact that printf just repeats its formatting string if supplied with more parameters than formatting directives, printing each file on a separate line.
/home/*/*{Ana,Mini}conda[23]*.sh uses brace expansion, i.e., it first expands to
/home/*/*Anaconda[23]*.sh /home/*/*Miniconda[23]*.sh
and each of those is then expanded with filename expansion. [23] works the same way as in a regular expression; * is "zero or more of any character except /".
If you don't know how deep in the directory tree the files you're looking for are, you could use globstar and **:
shopt -s globstar
printf '%s\n' /home/**/*{Ana,Mini}conda[23]*.sh
** matches all files and zero or more subdirectories.
Finally, if you want to handle the case where nothing matches, you could set either shopt -s nullglob (expand to nothing if nothing matches) or shopt -s failglob (error if nothing matches).
Shell patterns are described in the Pattern Matching section of the Bash manual.
You don't need ls or grep at all for this:
shopt -s extglob
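for f in /home/*/@(Ana|Mini)conda[23]*.sh; do
echo "$f"
done
With extglob enabled, @(Ana|Mini) matches either Ana or Mini. In a glob, * alone matches any run of characters, so no preceding dot is needed as it would be in a regex.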

regular expression for "11th to 16th letter"

I am new to regular expression. Need help for reading files in unix system. I want to apply regular expression on ls command.
I have below files :
DLERMS08001708161708209683.csv.gz
DLERMS13001708161330170816.csv.gz
DLERMS13001708171330170816.csv.gz
and would like to list only the files that have 170816 in the 11th to 16th characters of the name.
I tried the command ls *170816*.gz. However, I am getting all 3 filenames instead of two. I want only the first two filenames. Could you please help?
I should add that the third filename, DLERMS13001708171330170816.csv.gz, also contains 170816, but at the end. I want to exclude it from my ls output.
Using bash parameter-expansion alone,
for file in *.csv.gz; do
[ -e "$file" ] || continue
[ "${file:10:6}" == "170816" ] && printf "%s\n" "$file"
done
${PARAMETER:OFFSET:LENGTH}
This one can expand only a part of a parameter's value, given a position to start and maybe a length. If LENGTH is omitted, the parameter will be expanded up to the end of the string. If LENGTH is negative, it's taken as a second offset into the string, counting from the end of the string
Based on the comments below, the OP apparently wants to copy the matching files to another path, in which case the printf should be replaced with cp and the necessary arguments:
[ "${file:10:6}" == "170816" ] && cp -- "$file" path/to/destination
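The offset arithmetic can be checked without any files on disk. In this sketch the three names from the question are kept in an array (the array names `names` and `selected` are assumptions for the demo), and offset 10 with length 6 picks out characters 11 to 16:

```shell
#!/usr/bin/env bash
# File names taken from the question; no files need to exist for this demo.
names=(
    DLERMS08001708161708209683.csv.gz
    DLERMS13001708161330170816.csv.gz
    DLERMS13001708171330170816.csv.gz
)

selected=()
for name in "${names[@]}"; do
    # Offsets are zero-based: offset 10 is the 11th character.
    if [[ ${name:10:6} == 170816 ]]; then
        selected+=( "$name" )
    fi
done

printf '%s\n' "${selected[@]}"
```

Only the first two names are selected; the third has 170817 at that position and 170816 only at the end.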
Firstly, be careful not to confuse regular expressions with shell glob patterns (which is what you want here).
Your glob could be:
??????????170816*.gz
Which matches 10 unknown characters followed by the sequence you specified.
Depending on your next step, you might not need to use ls at all, for example you can loop over these files like this:
for file in ??????????170816*.gz; do
something_with "$file"
done
Or output the files that match using one of the following:
echo ??????????170816*.gz
printf '%s\n' ??????????170816*.gz
If there is a possibility that no files match, then you may wish to consider enabling nullglob (using shopt -s nullglob), which would expand to nothing in that case.
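The same glob can also be tested against a string rather than the file system, which can be handy when the names come from somewhere else. A minimal sketch, where the function name `matches_date` is hypothetical:

```shell
#!/usr/bin/env bash
# Return success if the given name has 170816 in characters 11-16
# and ends in .gz. case uses the same pattern syntax as globbing.
matches_date() {
    case $1 in
        ??????????170816*.gz) return 0 ;;
        *)                    return 1 ;;
    esac
}

for name in DLERMS08001708161708209683.csv.gz DLERMS13001708171330170816.csv.gz; do
    if matches_date "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Because case performs no filename expansion, the result does not depend on which files exist in the current directory.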
If you want to use globbing, it's not the same as using regular expression.
In your example you can use "?" as a placeholder for matching a single character:
Hence to achieve what you want as output, use ls with pattern below -
ls ??????????170816*
You want to use the wildcard (not regex) ?, which matches any single character, the appropriate number of times.
ls DLERMS????170816*.csv.gz
Regexes are much more flexible/powerful and overkill for this simple use case.
But as far as I know, ls does not support them, so you would have to go via other bash tools to identify the files in case you ever need to actually use regexes for anything.
I also kept what appears to be another common part of your filenames, the DLERMS prefix; if that prefix is NOT common to all files, replace those letters with ? as well.
Try this:
ls ??????????170816*
A solution with find and regex
find . -regextype egrep -regex "^.{12}170816.*\.gz"
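find prints each path as ./name and -regex is matched against that whole path, so .{12} covers the leading ./ plus the first ten characters of the file name. 170816 therefore has to occupy characters 13 to 18 of the path, which are characters 11 to 16 of the file name itself.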
I don't think you can use regular expressions with ls directly, but with egrep, it works fine.
ls * | egrep "DLERMS[0-9]{4}170816[0-9]{10}\.csv\.gz"
[0-9]{4} - any digit, four times.
[0-9]{10} - any digit, ten times.
Instead of egrep you can also use grep -E; the -E option enables extended regular expressions, so characters such as {, | and ( act as operators without needing a backslash.

Search for files in a git repository by extensions

I have a string like this *.{jpg,png} for example, but the string could also be just *.scss - in fact it is an editorconfig.
Then I want to search for every file of this extension which is tracked by my git repository.
I've tried several methods but didn't find any sufficient solution.
The closest one I've got is:
git ls-tree -r master --name-only | grep -E ".*\.jpg"
But this is only working for single file extensions not for something like this git ls-tree -r master --name-only | grep -E ".*\.{jpg,png}".
Anyone could help me?
Try this:
git ls-tree -r master --name-only | grep -E '.*\.(jpg|png)'
The expression you tried to pass via the -E option is interpreted as any characters (.*), a dot (\.), and the literal string {jpg,png}. I guess you are confusing Bash brace expansion with alternation (|) inside a regular expression group (the parentheses).
Consider using the end-of-line anchor: '.*\.(jpg|png)$'.
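The effect of the anchor can be seen on a small list of paths. This is only a sketch; the paths in `paths` are invented and stand in for the output of git ls-tree.

```shell
#!/usr/bin/env bash
# Invented paths standing in for the output of git ls-tree --name-only.
paths=$'img/logo.png\nimg/photo.jpg\ndocs/photo.jpg.md\nsrc/main.scss'

# Without the anchor, .jpg anywhere in the path is enough to match.
unanchored=$(printf '%s\n' "$paths" | grep -Ec '\.(jpg|png)')

# With $ the extension has to be at the very end of the line.
anchored=$(printf '%s\n' "$paths" | grep -Ec '\.(jpg|png)$')

echo "unanchored: $unanchored, anchored: $anchored"
```

The unanchored pattern also counts docs/photo.jpg.md, which is probably not wanted when searching by extension.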
Without grep
As @0andriy pointed out, you can pass patterns to git ls-files as follows:
git ls-files '*.jpg' '*.png'
Note that you should quote the arguments to prevent the shell from performing filename expansion (globbing) on them before git sees them. In a glob, the asterisk (*) character
matches any string, including the empty string, so an unquoted *.jpg would be replaced by the matching files in the current directory only.
But this obviously will work only for the simple git patterns. For a slightly more complicated case such as "extension matching N characters from a given set" you will need a regular expression (and grep, for example).

how to change pattern in file's line

I have file with one line:
22:50133-MM:MM1,52-MM:MM2;23:254940-MM:MM1,63-MM:MM2;24:15574-MM:MM1,65-MM:MM2;
I need find this part of line 24:15574-MM and then replace the number 15574 to another one. The number can be any length.
I want to use bash for it, but I have no idea how to do it.
How can I do it? Please help.
Since you said "I want to use bash for it", here is an attempt using only native Bash features: regex matching with the =~ operator (supported from bash 3.0 onwards).
Assuming your file has only one single line in it, you can do the following steps,
The commands below can be run directly on the command line or wrapped in a shell script with the bash shebang (#!/bin/bash).
The file contents are first captured with $(<file), which stores the entire file in a variable, and are then matched against the regex.
fileContent=$(<file)
[[ $fileContent =~ .*24:([[:digit:]]+)-MM.* ]] && replacement="${BASH_REMATCH[1]}"
replaceValue=5555
printf "%s\n" "${fileContent/24:$replacement-MM/24:$replaceValue-MM}"
For your input file, the commands produce a result
22:50133-MM:MM1,52-MM:MM2;23:254940-MM:MM1,63-MM:MM2;24:5555-MM:MM1,65-MM:MM2;
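A variation on the same idea, offered as a sketch rather than a definitive solution: capturing the text before and after the number avoids a second search for the old value. The variable names `line`, `re`, `new_number` and `result` are assumptions, and the sample line is kept in a variable instead of being read from a file.

```shell
#!/usr/bin/env bash
# Sample line from the question, kept in a variable instead of a file.
line='22:50133-MM:MM1,52-MM:MM2;23:254940-MM:MM1,63-MM:MM2;24:15574-MM:MM1,65-MM:MM2;'
new_number=5555   # hypothetical replacement value

# Group 1: everything up to and including the key "24:" (at the start
# of the line or right after a semicolon). Group 2: "-MM" and the rest.
re='^(.*;24:|24:)[0-9]+(-MM.*)$'

if [[ $line =~ $re ]]; then
    result="${BASH_REMATCH[1]}${new_number}${BASH_REMATCH[2]}"
else
    result=$line   # leave the line untouched when the key is absent
fi

printf '%s\n' "$result"
```

Keeping the regex in a variable and expanding it unquoted on the right-hand side of =~ sidesteps the quoting rules that apply to literal patterns there.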
It can be easily achieved using sed command with -i option:
new_number=11111
sed -i "s/24:\(15574\)-MM/24:$new_number-MM/" /tmp/test.txt
/tmp/test.txt - replace with your current filepath
new_number - is a variable for replacement number
To match a number of any length, use a regular expression together with the -E option (extended regular expressions mode):
sed -i -E "s/24:[0-9]+-MM/24:$new_number-MM/" /tmp/test.txt
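Before editing a file in place with -i, it may be worth previewing the substitution on a copy of the line. A minimal sketch, assuming the sample line from the question and a made-up replacement value:

```shell
#!/usr/bin/env bash
# Sample line from the question; nothing is written to disk here.
line='22:50133-MM:MM1,52-MM:MM2;23:254940-MM:MM1,63-MM:MM2;24:15574-MM:MM1,65-MM:MM2;'
new_number=11111   # hypothetical replacement value

# [0-9]+ matches a number of any length after the key 24:
result=$(printf '%s\n' "$line" | sed -E "s/24:[0-9]+-MM/24:${new_number}-MM/")

printf '%s\n' "$result"
```

Note that this simple pattern would also match a key that merely ends in 24 (for example 124:), so a stricter prefix may be needed for other data.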

Grep or in part of a string

Good day All,
A filename can either be
abc_source_201501.csv Or,
abc_source2_201501.csv
Is it possible to do something like grep abc_source|source2_201501.csv without fully listing out filename as the filenames I'm working with are much longer than examples given to get both options?
Thanks for assistance here.
Use the extended regex flag in grep.
For example:
grep -E 'abc_source.?_201501\.csv'
would match both lines in your example. The quotes keep the shell from treating ? as a glob character. You can think of other regex patterns that would suit your data better.
You can use Bash globbing to grep in several files at once.
For example, to grep for the string "hello" in all files with a filename that starts with abc_source and ends with 201501.csv, issue this command:
grep hello abc_source*201501.csv
You can also use the -r flag, to recursively grep in all files below a given folder - for example the current folder (.).
grep -r hello .
If you are asking about patterns for file name matching in the shell, the extended globbing facility in Bash lets you say
shopt -s extglob
grep stuff abc_source@(|2)_201501.csv
to search through both files with a single glob expression.
The simplest possibility is to use brace expansion:
grep pattern abc_{source,source2}_201501.csv
That's exactly the same as:
grep pattern abc_source{,2}_201501.csv
You can use several brace patterns in a single word:
grep pattern abc_source{,2}_2015{01..04}.csv
expands to
grep pattern abc_source_201501.csv abc_source_201502.csv \
abc_source_201503.csv abc_source_201504.csv \
abc_source2_201501.csv abc_source2_201502.csv \
abc_source2_201503.csv abc_source2_201504.csv
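Since brace expansion is purely textual and happens whether or not the files exist, the generated names can be inspected before handing them to grep. A small sketch (the array name `files` is an assumption; the zero-padded range {01..04} assumes Bash 4 or later):

```shell
#!/usr/bin/env bash
# Brace expansion generates the words even if no such files exist.
files=( abc_source{,2}_2015{01..04}.csv )

echo "generated ${#files[@]} names:"
printf '%s\n' "${files[@]}"
```

This also means grep will report an error for every generated name that does not exist, unlike a glob, which only expands to files that are actually present.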