Why does find -regex not accept my regex?

I want to select the files whose names match a regular expression.
Files are for example:
4510-88aid-50048-INA.txt
4510-88nid-50048-INA.txt
xxxx-05xxx-xxxxx-INA.txt
I want all files that match this regex:
.*[\w]{4}-05(?!aid)[\w]{3}-[\w]{5}-INA\.txt
As far as I can tell, the only match in the list above should be xxxx-05xxx-xxxxx-INA.txt.
In a tool like RegexTester, everything works perfectly.
With find -regex run from bash, it doesn't seem to work.
My question is: why?
I can't figure it out. I am using:
find /some/path -regex ".*[\w]{4}-05(?!aid)[\w]{3}-[\w]{5}-INA\.txt" -exec echo {} \;
But nothing is printed... Any ideas?
$ uname -a
Linux debmu838 2.6.5-7.321-smp #1 SMP Mon Nov 9 14:29:56 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux

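With bash 4+ and Perl:
ls /some/path/**/*.txt | perl -nle 'print if /(?:^|\/)\w{4}-05(?!aid)\w{3}-\w{5}-INA\.txt$/'
For ** to recurse you need shopt -s globstar, for example in your .profile. Because ls prints full paths here, the pattern is anchored at a slash rather than at the start of the line.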

According to the find man page, find's -regex uses Emacs regular expressions by default. According to http://www.regular-expressions.info/refflavors.html, that flavour is GNU ERE, which does not support lookarounds.
You can try a different -regextype as @l0b0 suggested, but the POSIX flavours do not seem to support this feature either.

I pretty much agree with the other answers: find's -regex switch can't emulate everything in Perl's regex syntax. However, here's something you can try.
Take a look at the find2perl command. It takes a typical find statement and gives you an equivalent Perl program. I don't believe find2perl recognizes -regex (it's not in standard Unix find, only in GNU find), but you can simply use -name and then look at the program it generates. From there, you can modify the program to use the Perl expressions you want. In the end, you'll get a small Perl script that does the directory search you want.
Otherwise, try -regextype posix-extended, which covers much of Perl's basic regex syntax. You can't use lookarounds, but you can probably find something that does work.
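One thing that might work in place of the negative lookahead is a second, negated -regex test. The sketch below assumes GNU find and a throwaway directory made with mktemp. The file 1234-05aid-56789-INA.txt is a hypothetical extra, added only to show the exclusion. Since [\w] inside a bracket expression is not a word class in POSIX syntax, [[:alnum:]_] is used instead.

```shell
# Throwaway sandbox; the first three names come from the question,
# the fourth is a hypothetical file that the lookahead should reject.
dir=$(mktemp -d)
touch "$dir/4510-88aid-50048-INA.txt" \
      "$dir/4510-88nid-50048-INA.txt" \
      "$dir/xxxx-05xxx-xxxxx-INA.txt" \
      "$dir/1234-05aid-56789-INA.txt"

# First -regex: the positive part of the pattern, matched against the whole path.
# Second -regex, negated with !: stands in for (?!aid).
matches=$(find "$dir" -regextype posix-extended \
    -regex '.*/[[:alnum:]_]{4}-05[[:alnum:]_]{3}-[[:alnum:]_]{5}-INA\.txt' \
    ! -regex '.*/[^/]{4}-05aid-[^/]*')

printf '%s\n' "$matches"
```

Both tests are implicitly ANDed, so a file is printed only when it matches the first expression and does not match the second.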

What you've got looks like a Perl regex. Try with a different -regextype, and tweak the regex accordingly:
Changes the regular expression syntax
understood by -regex and -iregex
tests which occur later on the command
line. Currently-implemented types are
emacs (this is the default),
posix-awk, posix-basic, posix-egrep
and posix-extended.

Try this:
ls ????-??aid-?????-INA.txt

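Try a simple script like this:
#!/bin/bash
# Print *INA.txt files whose second field starts with 05 but is not 05aid.
for file in *INA.txt
do
    stem=${file%INA.txt}
    match=$(printf '%s\n' "$stem" | sed -r 's/^\w{4}-05\w{3}-\w{5}-$/found/')
    [ "$match" = "found" ] && [[ $stem != ????-05aid-* ]] && echo "$file"
done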

Related

Find files with multiple matches of a pattern in the filename in linux [duplicate]

I have some images named with generated UUID1 strings, for example 81397018-b84a-11e0-9d2a-001b77dc0bed.jpg. I want to find all of these images using the find command:
find . -regex "[a-f0-9\-]\{36\}\.jpg"
But it doesn't work. Is something wrong with the regex? Could someone help me with this?
find . -regextype sed -regex ".*/[a-f0-9\-]\{36\}\.jpg"
Note that you need to specify .*/ in the beginning because find matches the whole path.
Example:
susam@nifty:~/so$ find . -name "*.jpg"
./foo-111.jpg
./test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
./81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
susam@nifty:~/so$
susam@nifty:~/so$ find . -regextype sed -regex ".*/[a-f0-9\-]\{36\}\.jpg"
./test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
./81397018-b84a-11e0-9d2a-001b77dc0bed.jpg
My version of find:
$ find --version
find (GNU findutils) 4.4.2
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Built using GNU gnulib version e5573b1bad88bfabcda181b9e0125fb0c52b7d3b
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION FTS() CBO(level=0)
susam@nifty:~/so$
susam@nifty:~/so$ find . -regextype foo -regex ".*/[a-f0-9\-]\{36\}\.jpg"
find: Unknown regular expression type `foo'; valid types are `findutils-default', `awk', `egrep', `ed', `emacs', `gnu-awk', `grep', `posix-awk', `posix-basic', `posix-egrep', `posix-extended', `posix-minimal-basic', `sed'.
The -regex find expression matches the whole name, including the relative path from the current directory. For find . this always starts with ./, then any directories.
Also, these are emacs regular expressions, which have other escaping rules than the usual egrep regular expressions.
If these are all directly in the current directory, then
find . -regex '\./[a-f0-9\-]\{36\}\.jpg'
should work. (I'm not really sure - I can't get the counted repetition to work here.) You can switch to egrep expressions by -regextype posix-egrep:
find . -regextype posix-egrep -regex '\./[a-f0-9\-]{36}\.jpg'
(Note that everything said here is for GNU find, I don't know anything about the BSD one which is also the default on Mac.)
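The whole-path behaviour can be checked in a scratch directory. This is a minimal sketch assuming GNU find. The directory layout mirrors the example files shown earlier in the thread, and the variable names are only illustrative.

```shell
# Scratch directory with the same three files as in the earlier listing.
dir=$(mktemp -d)
mkdir "$dir/test"
touch "$dir/foo-111.jpg" \
      "$dir/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg" \
      "$dir/test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg"

# Without a path prefix the regex cannot match "./<name>", so nothing is found.
no_prefix=$(cd "$dir" && find . -regextype posix-extended -regex '[a-f0-9-]{36}\.jpg')

# With ".*/" the regex covers the directory part of the path as well.
with_prefix=$(cd "$dir" && find . -regextype posix-extended -regex '.*/[a-f0-9-]{36}\.jpg' | sort)

printf 'without prefix: [%s]\n' "$no_prefix"
printf 'with prefix:\n%s\n' "$with_prefix"
```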
Judging from other answers, it seems this might be find's fault.
However you can do it this way instead:
find . | grep -P '/[a-f0-9-]{36}\.jpg$'
You might have to tweak the grep a bit and use different options depending on what you want but it works.
On Mac OS X (BSD find), this has the same effect as the accepted answer:
$ find -E . -regex ".*/[a-f0-9\-]{36}\.jpg"
man find says -E enables extended regular expressions.
NOTE: the .*/ prefix is needed because the regex has to match the complete path.
For comparison purposes, here's the GNU/Linux version:
$ find . -regextype sed -regex ".*/[a-f0-9\-]\{36\}\.jpg"
Simple way - you can specify .* in the beginning because find matches the whole path.
$ find . -regextype egrep -regex '.*[a-f0-9\-]{36}\.jpg$'
find version
$ find --version
find (GNU findutils) 4.6.0
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION
FTS(FTS_CWDFD) CBO(level=2)
Try to use single quotes (') to avoid shell escaping of your string. Remember that the expression needs to match the whole path, i.e. needs to look like:
find . -regex '\./[a-f0-9-]*\.jpg'
Apart from that, it seems that my find (GNU 4.4.2) only knows basic regular expressions, especially not the {36} syntax. I think you'll have to make do without it.
The regular expression has to match the whole path as find prints it, including the leading ./ of the starting directory.
In your example, the
find . -regex "[a-f0-9\-]\{36\}\.jpg"
should be changed into
find . -regex "\./[a-f0-9\-]\{36\}\.jpg"
On most Linux systems the default regex type does not recognize every construct, so you may have to select one explicitly with -regextype, like
find . -regextype posix-extended -regex "\./[a-f0-9-]{36}\.jpg"
If you want to maintain cross-platform compatibility, I could find no built-in regex search option that works across different versions of find in a consistent way.
Combine with grep
As suggested by @yarian, you could run an over-inclusive find and then run the output through grep:
find . | grep -E '<POSIX regex>'
This is likely to be slow but will give you cross-platform regex search if you need to use a full regular expression and can't reformat your search as a glob
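As a concrete instance of the find-then-grep approach, here is a small sketch for the UUID file names from the question. It assumes a scratch directory created with mktemp, and relies only on options that both GNU and BSD find and grep should accept.

```shell
# Scratch directory with one non-matching and two matching names.
dir=$(mktemp -d)
mkdir "$dir/test"
touch "$dir/foo-111.jpg" \
      "$dir/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg" \
      "$dir/test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg"

# find lists every regular file; grep -E keeps paths whose last component
# is 36 hex-or-dash characters followed by .jpg.
uuid_jpgs=$(find "$dir" -type f | grep -E '/[a-f0-9-]{36}\.jpg$' | sort)

printf '%s\n' "$uuid_jpgs"
```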
Rewrite as a glob
The -name option is compatible with globs which will provide limited (but cross-platform) pattern matching.
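You can use the same wildcards the shell offers for filename matching: *, ? and bracket expressions such as [a-f0-9]. Brace expansion ({}) and ** are shell features and are not interpreted by -name. Although globs are not as powerful as full regexes, you might be able to reformulate your search as a glob, depending on your use case.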
Internet search for globs - many tutorials detailing full functionality are available online
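For the UUID file names from the question, a glob version could look like the sketch below. It assumes the 8-4-4-4-12 layout of the example name and uses ? for each character, so it is looser than the regex (any character is accepted, not only hex digits).

```shell
# Scratch directory with the example names used elsewhere in this thread.
dir=$(mktemp -d)
mkdir "$dir/test"
touch "$dir/foo-111.jpg" \
      "$dir/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg" \
      "$dir/test/81397018-b84a-11e0-9d2a-001b77dc0bed.jpg"

# -name matches only the last path component, so no ".*/" prefix is needed.
# The quotes keep the shell from expanding the pattern before find sees it.
by_glob=$(find "$dir" -type f -name '????????-????-????-????-????????????.jpg' | sort)

printf '%s\n' "$by_glob"
```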
One thing I don't see covered is how to combine regular expressions with regular find syntax.
For example, to find core dump files on Linux, change to the root of the tree you want to scan (e.g. cd /) and then execute:
find . \( -path "./dev" -o -path "./sys" -o -path "./proc" \) -prune -o -type f -regextype sed -regex ".*\.core$" -exec du -h {} \; 2> /dev/null
This uses -prune to exclude several system directories before applying the regular expression to the remaining files. Any error output (stderr) is discarded. Note that -regextype is specific to GNU find; BSD find uses -E instead.
The important part is to put the ordinary find tests first, then OR (-o) them with the regular expression.

How to scrub emails from all CSVs in a directory?

I have this regex that works fine enough for my purposes for identifying emails in CSVs within a directory using grep on Mac OS X:
grep --no-filename -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" *
I've tried to get this working with sed so that I can replace the emails with foo@bar.baz:
sed -E -i '' -- 's/\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b/foo@bar.baz/g' *
However, I can't seem to get it to work. Admittedly, sed and regex are not my strong points. Any ideas?
The sed bundled with OS X is BSD sed, which does not support GNU extensions such as \b. You can install GNU sed with Homebrew (run brew without sudo):
brew install gnu-sed
and use it for the substitution. Homebrew normally installs it under the name gsed:
gsed -E -i 's/\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b/foo@bar.baz/g' *
You seem to assume that grep and sed support the same regex dialect, but that is not necessarily, or even usually, the case.
If you want a portable solution, you could easily use Perl for this, although it supports yet another regex dialect. Note that a literal @ has to be escaped in Perl, otherwise @bar is interpolated as an array:
perl -i -p -e 's/\b[a-zA-Z0-9.-]+\@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b/foo\@bar.baz/g' *
For a bit of an overview of regex dialects, see https://stackoverflow.com/a/11857890/874188
Your regex kind of sucks, but I understand that is sort of beside the point here.
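To try the Perl substitution without touching real data, something like the following sketch can be used. The file users.csv and the addresses in it are made up for illustration. The @ is written as \@ so that Perl does not treat @bar as an array.

```shell
# Hypothetical sample CSV in a scratch directory.
dir=$(mktemp -d)
printf 'id,email\n1,alice@example.com\n2,bob.smith@mail.example.org\n' > "$dir/users.csv"

# -i edits the files in place, -p loops over the lines and prints each one.
perl -i -p -e 's/\b[a-zA-Z0-9.-]+\@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b/foo\@bar.baz/g' "$dir"/*.csv

scrubbed=$(cat "$dir/users.csv")
printf '%s\n' "$scrubbed"
```

Running it on a copy first is a cheap way to confirm that the pattern catches the addresses you expect before editing the real CSVs in place.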

Why isn't this regex working: find ./ -regex '.*\(m\|h\)$'

Why isn't this regex working?
find ./ -regex '.*\(m\|h\)$'
I noticed that the following works fine:
find ./ -regex '.*\(m\)$'
But when I add the "or a h at the end of the filename" by adding \|h it doesn't work. That is, it should pick up all my *.m and *.h files, but I am getting nothing back.
I am on Mac OS X.
On Mac OS X, you can't use \| in a basic regular expression, which is what find uses by default.
re_format man page
[basic] regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality.
The easiest fix in this case is to change \(m\|h\) to [mh], e.g.
find ./ -regex '.*[mh]$'
Or you could add the -E option to tell find to use extended regular expressions instead.
find -E ./ -regex '.*(m|h)$'
Unfortunately -E isn't portable.
Also note that if you only want to list files ending in .m or .h, you have to escape the dot, e.g.
find ./ -regex '.*\.[mh]$'
If you find this confusing (me too), there's a great reference table that shows which features are supported on which systems.
Regex Syntax Summary [Google Cache]
A more efficient solution is to use the -o flag:
find . -type f \( -name "*.m" -o -name "*.h" \)
but if you want the regex use:
find . -type f -regex ".*\.[mh]$"
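The two commands above should select the same files. A small sketch to check that, using made-up file names in a scratch directory:

```shell
# Scratch directory: two files that should match, two that should not.
# "them" ends in m but has no dot, so neither form should pick it up.
dir=$(mktemp -d)
touch "$dir/main.m" "$dir/main.h" "$dir/main.c" "$dir/them"

# Glob form: two -name tests joined with -o inside escaped parentheses.
by_name=$(find "$dir" -type f \( -name '*.m' -o -name '*.h' \) | sort)

# Regex form: a bracket expression avoids the \| alternation operator,
# which basic regular expressions on Mac OS X do not provide.
by_regex=$(find "$dir" -type f -regex '.*\.[mh]$' | sort)

printf '%s\n' "$by_name"
```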
Okay, this is a little hacky, but if you don't want to wrangle the regex limitations of find on OS X, you can just pipe find's output to grep:
find . | grep -E '\.(m|h)$'
What’s wrong with
find . -name '*.[mh]' -type f
If you want fancy patterns, then use find2perl and hack the pattern.

Using non-consuming matches in Linux find regex

Here's my problem in a simplified scenario.
Create some test files:
touch /tmp/test.xml
touch /tmp/excludeme.xml
touch /tmp/test.ini
touch /tmp/test.log
I have a find expression that returns me all the XML and INI files:
[root@myserver] ~> find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)'
/tmp/test.ini
/tmp/test.xml
/tmp/excludeme.xml
I now want a way of modifying this -regex to exclude the excludeme.xml file from being included in the results.
I thought this should be possible by using/combining a non-consuming regex (?=expr) with a negated match (?!expr). Unfortunately I can't quite get the format of the command right, so my attempts result in no matches being returned. Here was one of my attempts (I've tried many different forms of this with different escaping!):
find /tmp -name -prune -o -regex '\(?=.*excludeme\.xml\).*\.\(xml\|ini\)'
I can't break down the command into multiple steps (e.g. piping through grep -v) as the find command is assumed as input into other parts of our tool.
This does what you want on linux:
find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)' \! -regex '.*excludeme\.xml'
The "!" operator is part of standard (POSIX) find, so it is not unique to GNU find; -regex itself is the non-standard part.
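The same idea can be tried on the test files from the question. This sketch assumes GNU find with its default regex type and uses a scratch directory instead of /tmp itself.

```shell
# Recreate the question's test files in a scratch directory.
dir=$(mktemp -d)
touch "$dir/test.xml" "$dir/excludeme.xml" "$dir/test.ini" "$dir/test.log"

# The first -regex selects .xml and .ini files; the negated second -regex
# drops excludeme.xml, which is the effect the negative lookahead was meant to have.
kept=$(find "$dir" -type f -regex '.*\.\(xml\|ini\)' ! -regex '.*/excludeme\.xml' | sort)

printf '%s\n' "$kept"
```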
Not sure about what escapes you need or if lookarounds work, but these work for Perl:
/^(?!.*\/excludeme\.).*\.(xml|ini)$/
/(?<!\/excludeme)\.(xml|ini)$/
Edit: I just checked the find command. The best you can do with find is to change the regex type with -regextype posix-extended, but that doesn't do things like lookarounds. The only way around this seems to be GNU-specific: either as @unholygeek suggests with find, or by piping find into GNU grep with the -P (Perl) option. You can use the regexes above verbatim (without the surrounding slashes) if you go with GNU grep. Something like find .... -print | grep -P ...
Sorry, that's the best I can do.
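A sketch of the grep -P variant, assuming a GNU grep built with PCRE support and the test files from the question placed in a scratch directory. The pattern is the first Perl regex above with its delimiters removed.

```shell
# Recreate the question's test files in a scratch directory.
dir=$(mktemp -d)
touch "$dir/test.xml" "$dir/excludeme.xml" "$dir/test.ini" "$dir/test.log"

# find prints every file path; grep -P applies the lookahead to each path,
# so grep filters the names and never opens the files themselves.
filtered=$(find "$dir" -type f -print | grep -P '^(?!.*/excludeme\.).*\.(xml|ini)$' | sort)

printf '%s\n' "$filtered"
```

Note that this is a pipeline, so it only helps when the surrounding tool can accept something other than a single find command.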