How do I grep multiple possible extensions recursively - regex

This question is different from other grep pattern matching questions because we're looking for a large number of file extensions, and thus the following from this question will be too long and tedious to type:
grep -r -i --include '*.ade' --include '*.adp' ... CP_Image ~/path[12345]
I was trying to email the backup of a static site when Google blocked my attachment upload for security reasons. Their support page says:
You can't send or receive the following file types:
.ade, .adp, .bat, .chm, .cmd, .com, .cpl, .exe, .hta, .ins, .isp, .jar, .jse, .lib, .lnk, .mde, .msc, .msp, .mst, .pif, .scr, .sct, .shb, .sys, .vb, .vbe, .vbs, .vxd, .wsc, .wsf, .wsh
I converted and tested the following Regular Expression here:
/.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)/gi
And tried running it with:
ls -lahR | grep '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
It doesn't work. I don't think grep interprets the alternation (|) symbol properly, because ls -lahR | grep '.*\.html' works.

Plain grep uses Basic Regular Expressions (BRE). In BRE, capturing groups are written \(...\) and, in GNU grep, the alternation operator is written \|:
grep '.*\.\(ade\|adp\|bat\|chm\|cmd\|com\|cpl\|exe\|hta\|ins\|isp\|jar\|jse\|lib\|lnk\|mde\|msc\|msp\|mst\|pif\|scr\|sct\|shb\|sys\|vb\|vbe\|vbs\|vxd\|wsc\|wsf\|wsh\)'
OR
grep -E '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
The -E parameter (long form: --extended-regexp) enables Extended Regular Expressions.

Add the flag -E to indicate it's an extended regular expression. From GNU Grep 2.1: The default is "basic regular expression", and
[i]n basic regular expressions the meta-characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning.
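To make the difference concrete, here is a minimal sketch that counts matches for the same alternation written three ways. The file names are made up for illustration, and the escaped \| form assumes GNU grep.

```shell
# Hypothetical file names, one per line.
names=$(printf '%s\n' setup.exe notes.txt run.bat index.html)

# BRE with bare ( | ): they are literal characters, so nothing matches.
# "|| true" keeps the script going, since grep exits 1 when the count is 0.
bre_plain=$(printf '%s\n' "$names" | grep -c '\.(exe|bat)$' || true)

# BRE with backslash escapes (GNU extension for \|): grouping and alternation.
bre_escaped=$(printf '%s\n' "$names" | grep -c '\.\(exe\|bat\)$' || true)

# ERE via -E: unescaped ( | ) are metacharacters.
ere=$(printf '%s\n' "$names" | grep -Ec '\.(exe|bat)$' || true)

echo "BRE plain: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"
```

The first count should be zero while the other two agree, which is the behaviour the question ran into.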

I'm recursively trying to find files with the specified extensions.
Better to use find with -iregex option:
find . -regextype posix-egrep -iregex '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
On OSX use:
find -E . -iregex '.*\.(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)'
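A quick way to try the GNU find variant safely is to run it against a throwaway directory. This is only a sketch: the directory layout and file names are invented, the extension list is shortened, and it assumes GNU find (for -regextype).

```shell
# Throwaway tree with hypothetical files.
demo=$(mktemp -d)
mkdir -p "$demo/site/js"
touch "$demo/site/index.html" "$demo/site/SETUP.EXE" "$demo/site/js/tool.jar"

# -iregex matches the whole path, case-insensitively, so SETUP.EXE is found too.
blocked=$(find "$demo" -type f -regextype posix-egrep -iregex '.*\.(exe|jar|bat)' | sort)

printf '%s\n' "$blocked"
rm -rf "$demo"
```

Because the regex has to match the entire path, the leading .* is required; without it nothing would match.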

A bash method to exclude the given extensions: use extended globbing
shopt -s extglob nullglob
ls *.!(ade|adp|bat|chm|cmd|com|cpl|exe|hta|ins|isp|jar|jse|lib|lnk|mde|msc|msp|mst|pif|scr|sct|shb|sys|vb|vbe|vbs|vxd|wsc|wsf|wsh)

Related

Match X or Y in grep regular expression

I'm trying to run a fairly simple regular expression to clear out some home directories. For background: I'm trying to ask users on my system to clear out their unnecessary files to clear up space on their home directories, so I want to inform users with scripts such as Anaconda / Miniconda installation scripts that they can clear that out.
To generate a list of users who might need such an email, I'm trying to run a simple regular expression to list all homedirs that contain such an installation script. So my assumption would be that the following should suffice:
for d in $(ls -d /home/); do
if $(ls $d | grep -q "(Ana|Mini)conda[23].*\.sh"); then
echo $d;
fi;
done;
But after running this, it resulted in nothing at all, sadly. After a while looking, I noticed that grep does not interpret regular expressions as I would expect it to. The following:
echo "Lorem ipsum dolor sit amet" | grep "(Lorem|Ipsum) ipsum"
results in no matches at all, which would explain why the for loop above doesn't work either.
My question then is: is it possible to match the specified regular expression (Ana|Mini)conda[23].*\.sh, in the same way it matches strings in https://regex101.com/r/yxN61p/1? Or is there some other way to find all users who have such a file in their homedir using a simple for-loop in bash?
Short answer: grep defaults to Basic Regular Expressions (BRE), but unescaped () and | are part of Extended Regular Expressions (ERE). GNU grep, as an extension, supports alternation in BRE (which isn't technically part of BRE), but you have to escape the parentheses and the pipe with a backslash:
grep -q "\(Ana\|Mini\)conda[23].*\.sh"
Or you can indicate that you want to use ERE:
grep -Eq "(Ana|Mini)conda[23].*\.sh"
Longer answer: this all being said, you don't need grep, and parsing the output of ls comes with a lot of pitfalls. Instead, you can use globs:
printf '%s\n' /home/*/*{Ana,Mini}conda[23]*.sh
should do it, if I understand the intention correctly.
This uses the fact that printf just repeats its formatting string if supplied with more parameters than formatting directives, printing each file on a separate line.
/home/*/*{Ana,Mini}conda[23]*.sh uses brace expansion, i.e., it first expands to
/home/*/*Anaconda[23]*.sh /home/*/*Miniconda[23]*.sh
and each of those is then expanded with filename expansion. [23] works the same way as in a regular expression; * is "zero or more of any character except /".
If you don't know how deep in the directory tree the files you're looking for are, you could use globstar and **:
shopt -s globstar
printf '%s\n' /home/**/*{Ana,Mini}conda[23]*.sh
** matches all files and zero or more subdirectories.
Finally, if you want to handle the case where nothing matches, you could set either shopt -s nullglob (expand to nothing if nothing matches) or shopt -s failglob (error if nothing matches).
Shell patterns are described here.
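The glob approach can be tried without touching real home directories. The sketch below uses a temporary directory standing in for /home; the user names and file names are hypothetical.

```shell
# Temporary stand-in for /home with two hypothetical users.
demo=$(mktemp -d)
mkdir -p "$demo/alice" "$demo/bob"
touch "$demo/alice/Miniconda3-latest-Linux-x86_64.sh" "$demo/bob/notes.txt"

# nullglob: a pattern with no match expands to nothing instead of itself.
shopt -s nullglob
installers=( "$demo"/*/*{Ana,Mini}conda[23]*.sh )
shopt -u nullglob

echo "found ${#installers[@]} installer(s)"
for f in "${installers[@]}"; do
  echo "${f#"$demo"/}"   # print the path relative to the stand-in directory
done
rm -rf "$demo"
```

Collecting the matches in an array keeps file names with spaces intact, which parsing ls output would not.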
You don't need ls or grep at all for this:
shopt -s extglob
for f in /home/*/@(Ana|Mini)conda[23]*.sh; do
echo "$f"
done
With extglob enabled, @(Ana|Mini) matches either Ana or Mini.

Why do my results appear to differ between ag and grep?

I'm having trouble correctly (and safely) executing the right regex searches with grep. I seem to be able to do what I want using ag
What I want to do in plain english:
Search my current directory (recursively?) for files that have lines containing both the words "nested" and "merge"
Successful attempt with ag:
$ ag --depth=2 -l "nested.*merge|merge.*nested" .
scratch.md
scratch.rb
Unsuccessful attempt with grep:
$ grep -elr 'nested.*merge|merge.*nested' .
grep: nested.*merge|merge.*nested: No such file or directory
grep: .: Is a directory
What am I missing? Also, could either approach be improved?
Thanks!
You probably want -E, not -e (or just egrep).
man grep explains the error: -e takes the text that follows it as the pattern, so in -elr the pattern becomes lr, and 'nested.*merge|merge.*nested' and . are then treated as files to search.
You can use grep -lr 'nested.*merge\|merge.*nested' or grep -Elr 'nested.*merge|merge.*nested' for your case.
In the latter, -E selects ERE syntax. grep uses BRE by default, where | matches a literal | character and \| means "or".
For more detail about ERE and BRE, you can read this article
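Here is a small, self-contained check of the -Elr form. The file names and contents are invented for the demo.

```shell
# Two hypothetical files: only the first has both words on the same line.
demo=$(mktemp -d)
printf 'deep nested hash merge\n' > "$demo/scratch.rb"
printf 'merge only\nnested only\n' > "$demo/other.rb"

# -E: ERE alternation, -l: print file names only, -r: recurse into the directory.
hits=$(grep -Elr 'nested.*merge|merge.*nested' "$demo")

echo "${hits##*/}"
rm -rf "$demo"
```

other.rb is not listed because grep matches line by line, and neither of its lines contains both words.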

Grep or in part of a string

Good day All,
A filename can either be
abc_source_201501.csv Or,
abc_source2_201501.csv
Is it possible to do something like grep abc_source|source2_201501.csv without fully listing out filename as the filenames I'm working with are much longer than examples given to get both options?
Thanks for assistance here.
Use extended regex flag in grep.
For example:
grep -E 'abc_source.?_201501\.csv'
would match both lines in your example. You can adapt the pattern to suit your data.
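If the only variation is an optional 2, a stricter pattern avoids accidental matches. A minimal sketch, with a third made-up name added to show what gets rejected:

```shell
# Candidate names, one per line; the third is a hypothetical near-miss.
names=$(printf '%s\n' abc_source_201501.csv abc_source2_201501.csv abc_sourceXY_201501.csv)

# "2?" means an optional literal 2; ^ and $ anchor the whole line.
matched=$(printf '%s\n' "$names" | grep -E '^abc_source2?_201501\.csv$')

printf '%s\n' "$matched"
```

The pattern is quoted so the shell does not try to expand ? as a glob character before grep sees it.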
You can use Bash globbing to grep in several files at once.
For example, to grep for the string "hello" in all files with a filename that starts with abc_source and ends with 201501.csv, issue this command:
grep hello abc_source*201501.csv
You can also use the -r flag, to recursively grep in all files below a given folder - for example the current folder (.).
grep -r hello .
If you are asking about patterns for file name matching in the shell, the extended globbing facility in Bash lets you say
shopt -s extglob
grep stuff abc_source@(|2)_201501.csv
to search through both files with a single glob expression.
The simplest possibility is to use brace expansion:
grep pattern abc_{source,source2}_201501.csv
That's exactly the same as:
grep pattern abc_source{,2}_201501.csv
You can use several brace patterns in a single word:
grep pattern abc_source{,2}_2015{01..04}.csv
expands to
grep pattern abc_source_201501.csv abc_source_201502.csv \
abc_source_201503.csv abc_source_201504.csv \
abc_source2_201501.csv abc_source2_201502.csv \
abc_source2_201503.csv abc_source2_201504.csv

regex match either string in linux "find" command

I'm trying the following to recursively look for files ending in either .py or .py.server:
$ find -name "stub*.py(|\.server)"
However this does not work.
I've tried variations like:
$ find -name "stub*.(py|py\.server)"
They do not work either.
A simple find -name "*.py" does work so how come this regex does not?
Say:
find . \( -name "*.py" -o -name "*.py.server" \)
Saying so would result in file names matching *.py and *.py.server.
From man find:
expr1 -o expr2
Or; expr2 is not evaluated if expr1 is true.
EDIT: If you want to specify a regex, use the -regex option:
find . -type f -regex ".*\.\(py\|py\.server\)"
Find can take a regular expression pattern:
$ find . -regextype posix-extended -regex '.*[.]py([.]server)?$' -print
Options:
-regex pattern
File name matches regular expression pattern. This is a match on the whole path, not a search. For example, to match a file named ./fubar3, you can use the regular expression .*bar. or
.*b.*3, but not f.*r3. The regular expressions understood by find are by default Emacs
Regular Expressions, but this can be changed with the -regextype option.
-print True;
print the full file name on the standard output, followed by a newline. If you are piping
the output of find into another program and there is the faintest possibility that the files which
you are searching for might contain a newline, then you should seriously consider using the
-print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how
unusual characters in filenames are handled.
-regextype type
Changes the regular expression syntax understood by -regex and -iregex tests which occur later on
the command line. Currently-implemented types are emacs (this is the default), posix-awk, posix-
basic, posix-egrep and posix-extended.
That is a clearer description of the options. Don't forget that all of this information can be found by reading man find or info find.
find -name does not use regexp, here's an extract from the man page on Ubuntu 12.04
-name pattern
Base of file name (the path with the leading directories
removed) matches shell pattern pattern. The metacharacters
(`*', `?', and `[]') match a `.' at the start of the base name
(this is a change in findutils-4.2.2; see section STANDARDS CON‐
FORMANCE below). To ignore a directory and the files under it,
use -prune; see an example in the description of -path. Braces
are not recognised as being special, despite the fact that some
shells including Bash imbue braces with a special meaning in
shell patterns. The filename matching is performed with the use
of the fnmatch(3) library function. Don't forget to enclose
the pattern in quotes in order to protect it from expansion by
the shell.
So the pattern that -name takes is a shell glob, not a regexp.
To filter by regexp, I'd do something like:
find . -type f -print | grep -E 'stub[^/]*\.py(\.server)?$'
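To compare the -name/-o approach with the -regex approach side by side, the following sketch runs both against a temporary directory. The file names are hypothetical, and -regextype assumes GNU find.

```shell
# Hypothetical files: two should match, one should not.
demo=$(mktemp -d)
touch "$demo/stub_a.py" "$demo/stub_b.py.server" "$demo/stub_c.txt"

# Shell globs combined with -o (logical OR); parentheses are escaped for the shell.
by_name=$(find "$demo" -type f \( -name 'stub*.py' -o -name 'stub*.py.server' \) | sort)

# A single ERE matched against the whole path.
by_regex=$(find "$demo" -type f -regextype posix-extended -regex '.*/stub[^/]*\.py(\.server)?' | sort)

if [ "$by_name" = "$by_regex" ]; then
  echo "both approaches agree"
fi
printf '%s\n' "$by_name"
rm -rf "$demo"
```

The [^/]* in the regex plays the role of * in the glob: it keeps the match inside the file name rather than spanning directories.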

Regex to get delimited content with egrep

I would like to get the parameter (without parentheses) of a function call with a regular expression.
I am using egrep in a bash script with cygwin.
This is what I got so far (with parentheses):
$ echo "require(catch.me)" | egrep -o '\((.*?)\)'
(catch.me)
What would be the right regex here?
http://www.greenend.org.uk/rjk/2002/06/regexp.html
What you are looking for is a regular expression with lookbehind and lookahead.
egrep cannot do that; grep with Perl regex support (-P) can.
from man grep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly experimental and grep -P may warn of unimplemented features.
So
$> echo "require(catch.me)" | grep -o -P '(?<=\().*?(?=\))'
catch.me
If you can use sed then the following would work -
echo "require(catch.me)" | sed 's/.*[^(](\(.*\))/\1/'
You can modify your existing regex to this
echo "require(catch.me)" | egrep -o 'c.*e'
Even though egrep offers this (from the man page)
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
It isn't really the correct utility. SED and AWK are masters at this. You will have much more control using either SED or AWK. :)
From the manual :
grep, egrep, fgrep - print lines matching a pattern
Basically, grep is meant to print whole matching lines, so it is not the best fit here.
You should use another tool, such as perl, for this kind of extraction.
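For systems where grep -P is unavailable, here are two possible alternatives, sketched under the assumption that the line contains a single pair of parentheses: sed with a capture group (-E is accepted by GNU and BSD sed), and plain shell parameter expansion.

```shell
line='require(catch.me)'

# sed: \( and \) are literal parentheses in ERE, ( ) is the capture group.
# -n with the p flag prints only lines where the substitution succeeded.
arg_sed=$(printf '%s\n' "$line" | sed -nE 's/^[^(]*\(([^)]*)\).*$/\1/p')

# Pure shell: strip everything up to the first "(", then everything from ")" on.
tmp=${line#*\(}
arg_sh=${tmp%%\)*}

echo "$arg_sed"
echo "$arg_sh"
```

The parameter-expansion version avoids spawning a process, which can matter inside a loop over many lines.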