Match X or Y in grep regular expression - regex

I'm trying to run a fairly simple regular expression to clear out some home directories. For background: I'm trying to ask users on my system to clear out their unnecessary files to clear up space on their home directories, so I want to inform users with scripts such as Anaconda / Miniconda installation scripts that they can clear that out.
To generate a list of users who might need such an email, I'm trying to run a simple regular expression to list all homedirs that contain such an installation script. So my assumption would be that the following should suffice:
for d in $(ls -d /home/); do
if $(ls $d | grep -q "(Ana|Mini)conda[23].*\.sh"); then
echo $d;
fi;
done;
But after running this, it resulted in nothing at all, sadly. After a while looking, I noticed that grep does not interpret regular expressions as I would expect it to. The following:
echo "Lorem ipsum dolor sit amet" | grep "(Lorem|Ipsum) ipsum"
results in no matches at all, which would explain why the for loop above doesn't work either.
My question then is: is it possible to match the specified regular expression (Ana|Mini)conda[23].*\.sh, in the same way it matches strings in https://regex101.com/r/yxN61p/1? Or is there some other way to find all users who have such a file in their homedir using a simple for-loop in bash?

Short answer: grep defaults to Basic Regular Expressions (BRE), but unescaped () and | are part of Extended Regular Expressions (ERE). GNU grep, as an extension, supports alternation in BRE (where it isn't technically part of the syntax), but you have to escape the parentheses and the pipe with backslashes:
grep -q "\(Ana\|Mini\)conda[23].*\.sh"
Or you can indicate that you want to use ERE:
grep -Eq "(Ana|Mini)conda[23].*\.sh"
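To see the difference between the two dialects, here is a minimal sketch that runs the same pattern with and without -E against a sample installer name. The file name is a made-up example, not one taken from the question.

```shell
# Hypothetical installer name, used only for illustration.
sample='Anaconda3-5.2.0-Linux-x86_64.sh'

# ERE: ( ) and | are grouping and alternation.
if printf '%s\n' "$sample" | grep -Eq '(Ana|Mini)conda[23].*\.sh'; then
    ere=match
else
    ere=nomatch
fi

# BRE: the same unescaped characters are literals, so grep looks for
# the literal text "(Ana|Mini)conda..." and finds nothing.
if printf '%s\n' "$sample" | grep -q '(Ana|Mini)conda[23].*\.sh'; then
    bre=match
else
    bre=nomatch
fi

echo "ERE: $ere, BRE: $bre"
```

The -E form is also the more portable of the two fixes, since \| in BRE is a GNU extension.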
Longer answer: this all being said, you don't need grep, and parsing the output of ls comes with a lot of pitfalls. Instead, you can use globs:
printf '%s\n' /home/*/*{Ana,Mini}conda[23]*.sh
should do it, if I understand the intention correctly.
This uses the fact that printf just repeats its formatting string if supplied with more parameters than formatting directives, printing each file on a separate line.
/home/*/*{Ana,Mini}conda[23]*.sh uses brace expansion, i.e., it first expands to
/home/*/*Anaconda[23]*.sh /home/*/*Miniconda[23]*.sh
and each of those is then expanded with filename expansion. [23] works the same way as in a regular expression; * is "zero or more of any character except /".
If you don't know how deep in the directory tree the files you're looking for are, you could use globstar and **:
shopt -s globstar
printf '%s\n' /home/**/*{Ana,Mini}conda[23]*.sh
** matches all files and zero or more subdirectories.
Finally, if you want to handle the case where nothing matches, you could set either shopt -s nullglob (expand to nothing if nothing matches) or shopt -s failglob (error if nothing matches).
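Since the goal was a list of users rather than a list of files, the glob can also be applied per home directory. The following is only a sketch: it builds a throwaway directory tree as a stand-in for /home, and the user names and file names in it are invented for the example.

```shell
#!/usr/bin/env bash
shopt -s nullglob   # unmatched globs expand to nothing

# Hypothetical stand-in for /home so the sketch is safe to run anywhere.
home_root=$(mktemp -d)
mkdir -p "$home_root/alice" "$home_root/bob" "$home_root/carol"
touch "$home_root/alice/Anaconda3-5.2.0-Linux-x86_64.sh" \
      "$home_root/bob/Miniconda2-latest-Linux-x86_64.sh" \
      "$home_root/carol/notes.txt"

users=()
for dir in "$home_root"/*/; do
    # Brace expansion yields two glob patterns; each is then expanded
    # against the files in this one home directory.
    matches=("$dir"*{Ana,Mini}conda[23]*.sh)
    if (( ${#matches[@]} > 0 )); then
        users+=("$(basename "$dir")")
    fi
done

printf '%s\n' "${users[@]}"
rm -rf "$home_root"
```

On a real system, replacing "$home_root" with /home should give one user name per line for every home directory that contains a matching script.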
Shell patterns are described in the "Pattern Matching" section of the Bash manual.

You don't need ls or grep at all for this:
shopt -s extglob
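for f in /home/*/@(Ana|Mini)conda[23]*.sh; do
echo "$f"
done
With extglob enabled, @(Ana|Mini) matches either Ana or Mini. Note that in a glob, unlike in a regular expression, * on its own already means "any run of characters", so no dot is needed in front of it.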

Related

How to list all files with a given extension? [duplicate]

I want to search a filename which may contain kavi or kabhi.
I wrote command in the terminal:
ls -l *ka[vbh]i*
Between ka and i there may be v or bh .
The code I wrote isn't correct. What would be the correct command?
A nice way to do this is to use extended globs. They are not regular expressions, but they give Bash patterns regex-like features such as alternation.
To start you have to enable the extglob feature, since it is disabled by default:
shopt -s extglob
Then, write a pattern with the required condition: stuff + ka + either v or bh + i + stuff. All together:
ls -l *ka@(v|bh)i*
The syntax is a bit different from normal regular expressions, so you need to read in Extended Globs that...
@(list): Matches one of the given patterns.
Test
$ ls
a.php AABB AAkabhiBB AAkabiBB AAkaviBB s.sh
$ ls *ka@(v|bh)i*
AAkabhiBB AAkaviBB
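The same check can be reproduced without creating any files, because extglob patterns also work in [[ ... == pattern ]] comparisons. This sketch reuses the file names from the test listing above and collects the ones that match:

```shell
#!/usr/bin/env bash
shopt -s extglob

matched=()
for name in a.php AABB AAkabhiBB AAkabiBB AAkaviBB s.sh; do
    # @(v|bh) matches exactly one of the alternatives: "v" or "bh".
    if [[ $name == *ka@(v|bh)i* ]]; then
        matched+=("$name")
    fi
done

printf '%s\n' "${matched[@]}"
```

AAkabiBB is rejected because a lone b between ka and i is neither v nor bh.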
A slightly longer command line could use find, grep and xargs. It has the advantage of being easy to extend to different search terms (either by extending the grep statement or by using additional find options), it is a bit more readable (in my opinion), and it lets you run specific commands on the files that are found:
find . | grep -e "kabhi" -e "kavi" | xargs ls -l
You can get what you want by using curly braces in bash:
ls -l *ka{v,bh}i*
Note: this is not a regular expression question so much as a "shell globbing" question. Shell "glob patterns" are different from regular expressions, though they are similar in many ways.

How do I grep multiple possible extensions recursively

This question is different from other grep pattern matching questions because we're looking for a large number of file extensions, and thus the following from this question will be too long and tedious to type:
grep -r -i --include '*.ade' --include '*.adp' ... CP_Image ~/path[12345]
I was trying to email the backup of a static site when Google blocked my attachment upload for security reasons. Their support page says:
You can't send or receive the following file types:
.ade, .adp, .bat, .chm, .cmd, .com, .cpl, .exe, .hta, .ins, .isp, .jar, .jse, .lib, .lnk, .mde, .msc, .msp, .mst, .pif, .scr, .sct, .shb, .sys, .vb, .vbe, .vbs, .vxd, .wsc, .wsf, .wsh
I converted and tested the following Regular Expression here:
/.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)/gi
And tried running it with:
ls -lahR | grep '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
It doesn't work. I don't think grep interprets the or (|) symbol properly, because ls -lahR | grep '.*\.html' works.
Plain grep uses Basic Regular Expressions (BRE). In BRE, capturing groups are written \(...\), and GNU grep writes the alternation operator as \|:
grep '.*\.\(ade\|adp\|bat\|chm\|cmd\|com\|cpl\|exe\|hta\|ins\|isp\|jar\|jse\|lib\|lnk\|mde\|msc\|msp\|mst\|pif\|scr\|sct\|shb\|sys\|vb\|vbe\|vbs\|vxd\|wsc\|wsf\|wsh\)'
OR
grep -E '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
The -E option (long form --extended-regexp) enables extended regular expressions.
Reference
Add the flag -E to indicate it's an extended regular expression. From GNU Grep 2.1: The default is "basic regular expression", and
[i]n basic regular expressions the meta-characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning.
I'm recursively trying to find files with the specified extensions.
Better to use find with -iregex option:
find . -regextype posix-egrep -iregex '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
On OSX use:
find -E . -iregex '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
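If the regex flavour differences between GNU and BSD find are a concern, one alternative is to avoid -iregex altogether and build a chain of -iname tests from a list of extensions. This is a sketch of that different technique, not what the answer above uses; the extension list is shortened and the test files are invented. It assumes a find that supports -iname, which both GNU and BSD find do.

```shell
#!/usr/bin/env bash
# Throwaway directory with a few hypothetical files.
work=$(mktemp -d)
mkdir -p "$work/sub"
touch "$work/a.EXE" "$work/b.txt" "$work/sub/c.bat"

exts=(ade adp bat exe)   # shortened list, extend as needed

# Build: -o -iname '*.ade' -o -iname '*.adp' ...
tests=()
for ext in "${exts[@]}"; do
    tests+=(-o -iname "*.$ext")
done

# "${tests[@]:1}" drops the leading -o so the expression is well-formed.
found=$(cd "$work" && find . -type f \( "${tests[@]:1}" \) | LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$work"
```

With the files above, this should list ./a.EXE and ./sub/c.bat and skip b.txt.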
A bash method to exclude the given extensions: use extended globbing
shopt -s extglob nullglob
ls *.!(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)

Grep or in part of a string

Good day All,
A filename can either be
abc_source_201501.csv Or,
abc_source2_201501.csv
Is it possible to do something like grep abc_source|source2_201501.csv without fully listing out filename as the filenames I'm working with are much longer than examples given to get both options?
Thanks for assistance here.
Use extended regex flag in grep.
For example:
grep -E 'abc_source.?_201501\.csv'
would match both lines in your example. You can think of other regex patterns that would suit your data better.
You can use Bash globbing to grep in several files at once.
For example, to grep for the string "hello" in all files with a filename that starts with abc_source and ends with 201501.csv, issue this command:
grep hello abc_source*201501.csv
You can also use the -r flag, to recursively grep in all files below a given folder - for example the current folder (.).
grep -r hello .
If you are asking about patterns for file name matching in the shell, the extended globbing facility in Bash lets you say
shopt -s extglob
grep stuff abc_source@(|2)_201501.csv
to search through both files with a single glob expression.
The simplest possibility is to use brace expansion:
grep pattern abc_{source,source2}_201501.csv
That's exactly the same as:
grep pattern abc_source{,2}_201501.csv
You can use several brace patterns in a single word:
grep pattern abc_source{,2}_2015{01..04}.csv
expands to
grep pattern abc_source_201501.csv abc_source_201502.csv \
abc_source_201503.csv abc_source_201504.csv \
abc_source2_201501.csv abc_source2_201502.csv \
abc_source2_201503.csv abc_source2_201504.csv
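One thing worth keeping in mind is that brace expansion generates words whether or not the files exist, so grep will complain about any name that is missing. A quick way to preview what a brace pattern produces is to expand it into an array first. This sketch uses a slightly different range, 20150{1..3}, so that it does not depend on the zero-padded ranges of newer Bash versions:

```shell
#!/usr/bin/env bash
# Expand the braces into an array; no files need to exist for this.
names=(abc_source{,2}_20150{1..3}.csv)

echo "${#names[@]} names generated:"
printf '  %s\n' "${names[@]}"
```

The first brace group varies slowest, so all abc_source_ names come before the abc_source2_ names, as in the expansion shown above.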

Create directory based on part of filename

First of all, I'm not a programmer — just trying to learn the basics of shell scripting and trying out some stuff.
I'm trying to create a function for my bash script that creates a directory based on a version number in the filename of a file the user has chosen in a list.
Here's the function:
lav_mappe () {
shopt -s failglob
echo "[--- Choose zip file, or x to exit ---]"
echo ""
echo ""
select zip in $SRC/*.zip
do
[[ $REPLY == x ]] && . $HJEM/build
[[ -z $zip ]] && echo "Invalid choice" && continue
echo
grep ^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$ $zip; mkdir -p $MODS/out/${ver}
done
}
I've tried messing around with some other commands too:
for ver in $zip; do
grep "^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$" $zip; mkdir -p $MODS/out/${ver}
done
And also find | grep — but I'm doing it wrong :(
But it ends up saying "no match" for my regex pattern.
I'm trying to take the filename the user has selected, then grep it for the version number (ALWAYS x.xx.x somewhere in the filename), and finally create a directory with just that.
Could someone give me some pointers what the command chain should look like? I'm very unsure about the structure of the function, so any help is appreciated.
EDIT:
Ok, this is how the complete function looks like now: (Please note, the sed(1) commands besides the directory creation is not created by me, just implemented in my code.)
Pastebin (Long code.)
I've got news for you. You are writing a Bash script, you are a programmer!
Your Regular Expression (RE) is of the "wrong" type. Vanilla grep uses a form known as "Basic Regular Expressions" (BRE), but your RE is in the form of an Extended Regular Expression (ERE). BRE's are used by vanilla grep, vi, more, etc. EREs are used by just about everything else, awk, Perl, Python, Java, .Net, etc. Problem is, you are trying to look for that pattern in the file's contents, not in the filename!
There is an egrep command, or you can use grep -E, so:
echo $zip|grep -E '^[0-9]\.[0-9]{1,2}\.[0-9]{1,2}$'
(note that single quotes are safer than double). By the way, you use ^ at the front and $ at the end, which means the filename ONLY consists of a version number, yet you say the version number is "somewhere in the filename". You don't need the {1} quantifier, that is implied.
BUT, you don't appear to be capturing the version number either.
You could use sed (we also need the -E):
ver=$(echo $zip| sed -E 's/.*([0-9]\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
The \1 on the right means "replace everything (that's why we have the .* at front and back) with what was matched in the parentheses group".
That's a bit clunky, I know.
Now we can do the mkdir (there is no merit in putting everything on one line, and it makes the code harder to maintain):
mkdir -p "$MODS/out/$ver"
${ver} is unnecessary in this case, but it is a good idea to enclose path names in double quotes in case any of the components have embedded white-space.
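As an aside, Bash can also do this extraction itself with the [[ string =~ regex ]] operator, which uses ERE syntax and stores parenthesised groups in the BASH_REMATCH array. This is a different technique from the sed command above, shown only as a sketch; the file name is invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical file name; in the function this would be the selected $zip.
zip='mod_pack_1.12.2_final.zip'

# Keep the regex in a variable so it is not subject to quoting surprises.
re='([0-9]\.[0-9]{1,2}\.[0-9]{1,2})'

if [[ $zip =~ $re ]]; then
    ver=${BASH_REMATCH[1]}   # text matched by the first group
    echo "version: $ver"
else
    echo "no version number found in $zip" >&2
fi
```

Inside the select loop, the mkdir -p "$MODS/out/$ver" line would then go in the branch where the match succeeded.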
So, good effort for a "non-programmer", particularly in generating that RE.
Now for Lesson 2
Be careful about using this solution in a general loop. Your question specifically uses select, so we cannot predict which files will be used. But what if we wanted to do this for every file?
Using the solution above in a for or while loop would be inefficient. Calling external processes inside a loop is always bad. There is nothing we can do about the mkdir without using a different language like Perl or Python. But sed, by its nature, is iterative, and we should use that feature.
One alternative would be to use shell pattern matching instead of sed. This particular pattern would not be impossible in the shell, but it would be difficult and raise other questions. So let's stick with sed.
A problem we have is that echo output places a space between each field. That gives us a couple of issues. sed delimits each record with a newline "\n", so echo on its own won't do here. We could replace each space with a new-line, but that would be an issue if there were spaces inside a filename. We could do some trickery with IFS and globbing, but that leads to unnecessary complications. So instead we will fall back to good old ls. Normally we would not want to use ls, shell globbing is more efficient, but here we are using the feature that it will place a new-line after each filename (when used redirected through a pipe).
while read ver
do
mkdir "$ver"
done < <(ls $SRC/*.zip|sed -E 's/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
Here I am using process substitution, and this loop will only call ls and sed once. BUT, it calls the mkdir program n times.
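One possible way to reduce the number of mkdir calls while staying in the shell is to collect the directory names first and pass them all to a single mkdir -p, which accepts several operands. The sketch below swaps ls and sed for a glob and Bash's =~ operator, so it is an alternative approach rather than the loop above; the source and output directories and the file names are temporary stand-ins invented for the example.

```shell
#!/usr/bin/env bash
shopt -s nullglob

# Hypothetical stand-ins for $SRC and the output directory.
src=$(mktemp -d)
out=$(mktemp -d)
touch "$src/pack_1.12.2.zip" "$src/pack_1.7.10.zip" "$src/readme.zip"

re='([0-9]\.[0-9]{1,2}\.[0-9]{1,2})'
dirs=()
for zip in "$src"/*.zip; do
    # Match against the base name only, so digits in the path are ignored.
    if [[ ${zip##*/} =~ $re ]]; then
        dirs+=("$out/${BASH_REMATCH[1]}")
    fi
done

# One mkdir process for all directories (skipped if nothing matched).
if (( ${#dirs[@]} > 0 )); then
    mkdir -p -- "${dirs[@]}"
fi

created=("$out"/*)
echo "created ${#created[@]} directories"
rm -rf "$src" "$out"
```

For very large numbers of files the argument list could exceed the system limit, in which case batching the calls would be needed.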
Lesson 3
Sorry, but that's still inefficient. We are creating a child process for each iteration, to create a directory needs only one kernel API call, yet we are creating a process just for that? Let's use a more sophisticated language like Perl:
#!/usr/bin/perl
use warnings;
use strict;
my $SRC = '.';
for my $file (glob("$SRC/*.zip"))
{
$file =~ s/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/$1/;
mkdir $file or die "Unable to create $file; $!";
}
You might like to note that your RE has made it through to here! But now we have more control, and no child processes (mkdir in Perl is a built-in, as is glob).
In conclusion, for small numbers of files, the sed loop above will be fine. It is simple, and shell based. Calling Perl just for this from a script will probably be slower since perl is quite large. But shell scripts which create child processes inside loops are not scalable. Perl is.

Regular Expressions for file name matching

In Bash, how does one match a regular expression with multiple criteria against a file name?
For example, I'd like to match against all the files with .txt or .log endings.
I know how to match one type of criteria:
for file in *.log
do
echo "${file}"
done
What's the syntax for a logical or to match two or more types of criteria?
Bash does not support regular expressions per se when globbing (filename matching). Its globbing syntax, however, can be quite versatile. For example:
for i in A*B.{log,txt,r[a-z][0-9],c*} Z[0-5].c; do
...
done
will apply the loop contents on all files that start with A and end in a B, then a dot and any of the following extensions:
log
txt
r followed by a lowercase letter followed by a single digit
c followed by pretty much anything
It will also apply the loop commands to any file starting with Z, followed by a digit in the 0-5 range and then by the .c extension.
If you really want/need to, you can enable extended globbing with the shopt builtin:
shopt -s extglob
which then allows significantly more features while matching filenames, such as sub-patterns etc.
See the Bash manual for more information on supported expressions:
http://www.gnu.org/software/bash/manual/bash.html#Pattern-Matching
EDIT:
If an expression does not match a filename, bash by default will substitute the expression itself (e.g. it will echo *.txt) rather than an empty string. You can change this behaviour by setting the nullglob shell option:
shopt -s nullglob
This will replace a *.txt that has no matching files with an empty string.
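The effect is easy to observe by expanding the same pattern in an empty directory with the option off and then on. A small sketch, using a temporary directory so that no .txt files can match:

```shell
#!/usr/bin/env bash
empty_dir=$(mktemp -d)   # guaranteed to contain no .txt files

shopt -u nullglob
default=("$empty_dir"/*.txt)        # the pattern itself is kept as one word

shopt -s nullglob
with_nullglob=("$empty_dir"/*.txt)  # expands to nothing at all

echo "default:  ${#default[@]} word(s)"
echo "nullglob: ${#with_nullglob[@]} word(s)"
rmdir "$empty_dir"
```

This is why a for loop over an unmatched glob runs once with the literal pattern by default, and not at all with nullglob set.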
EDIT 2:
I suggest that you also check out the shopt builtin and its options, since quite a few of them affect filename pattern matching, as well as other aspects of the shell:
http://www.gnu.org/software/bash/manual/bash.html#The-Shopt-Builtin
Do it the same way you'd invoke ls. You can specify multiple wildcards one after the other:
for file in *.log *.txt
for file in *.{log,txt} ..
for f in $(find . -regex ".*\.log")
do
echo "$f"
done
You simply add the other conditions to the end:
for VARIABLE in 1 2 3 4 5 .. N
do
command1
command2
commandN
done
So in your case:
for file in *.log *.txt
do
echo "${file}"
done
You can also do this:
shopt -s extglob
for file in *.+(log|txt)
which could be easily extended to more alternatives:
for file in *.+(log|txt|mp3|gif|foo)
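A small caveat: +(list) means "one or more occurrences", so it also accepts repeated alternatives, whereas @(list) accepts exactly one. The sketch below compares the two operators on a few made-up file names using [[ ]] pattern matching:

```shell
#!/usr/bin/env bash
shopt -s extglob

plus=()
at=()
for name in a.log b.txt c.loglog d.md; do
    # +(...) : one or more of the alternatives, so "loglog" is accepted.
    if [[ $name == *.+(log|txt) ]]; then plus+=("$name"); fi
    # @(...) : exactly one of the alternatives.
    if [[ $name == *.@(log|txt) ]]; then at+=("$name"); fi
done

echo "+(...): ${plus[*]}"
echo "@(...): ${at[*]}"
```

For ordinary extension lists the two behave the same in practice; @(log|txt) is the stricter choice if names like c.loglog should be excluded.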