does find command ignore anchors - regex

Let's say I have a file which has the path ./bar2.txt
and content of bar2.txt is
./bar2.txt
output of
grep "\bbar2\b" *
is
bar2.txt:./bar2.txt
as expected, whereas
find . -regextype posix-extended -regex "\bbar2\b"
doesn't find anything.
I know I should change the regex to
".*/bar2.*"
since find matches against the full path. So, does this mean that find ignores \b, which specifies a word boundary?
Thanks in advance.

grep and find use different regular expression engines. Notably, POSIX regular expressions (which find uses) don't define "\b" as a word boundary, so it is treated the same as a plain "b". On OSX, for instance, these have the same result:
find . -regex '.*bar2.txt' -print
and
find . -regex '.*\bar2.txt' -print
Double-check the man page to make sure that -regex is doing what you think. With my find, the regular expression must match the entire path, e.g. this doesn't find any files:
find . -regex 'bar' -print
but this one does:
find . -regex '.*bar.*' -print
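The whole-path behaviour described above can be checked in a scratch directory. This is a minimal sketch; the variable names (tmp, partial, whole) are only illustrative, and it assumes a find that supports -regex (GNU or BSD).

```shell
# Work in a throwaway directory so nothing in the real tree is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch bar2.txt

# The pattern covers only part of the path "./bar2.txt", so nothing is printed.
partial=$(find . -type f -regex 'bar2')

# Padding with .* lets the pattern span the whole path.
whole=$(find . -type f -regex '.*bar2.*')

printf 'partial: [%s]\n' "$partial"
printf 'whole:   [%s]\n' "$whole"

# Clean up the scratch directory.
cd / && rm -rf "$tmp"
```

The first command prints nothing because -regex behaves as if the pattern were anchored at both ends of the path.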

However, what I am curious about is why it ignores \b, which specifies a word boundary.
From my understanding \b is part of Perl regular expressions and would not be available to find, as find does not support the Perl regextype as grep does.

Related

Find using regex with alternatives

I am trying to use find to match several alternative file patterns that differ by certain numbers in the middle, but it returns an empty list. My actual pattern has a fixed beginning and variable numbers in the middle.
Reproducible example. Create a list of files
touch a10a a24b b12c a45d
Select a10a and a45d from the list using the following regex, which produces empty output
find . -regex '.*/a(10|45).*'
I expect that the issue is easy to solve, but I could not find a solution and could not figure it out. What did I miss?
System: Ubuntu 16.04
The idea is right, but you need to specify the type of regex for find to use. Since you have the alternation operator | here, you need to enable ERE (Extended Regular Expressions) support, which you can do as below. The -regextype option lets you specify the regex flavor you need. Also, the / part is optional if you use a greedy match .*
find . -type f -regextype posix-extended -regex '.*/a(10|45).*'
The man page for my version of GNU findutils says:
-regextype type
Changes the regular expression syntax understood by -regex and -iregex tests which occur later on the command line. Currently-implemented types are emacs (this is the default), posix-awk, posix-basic, posix-egrep and posix-extended.
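The difference between the default syntax and posix-extended can be seen with the question's own file list. This is a sketch that assumes GNU find (-regextype is a GNU extension); the variable names are only illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch a10a a24b b12c a45d

# Default (emacs) syntax: ( | ) are literal characters, so nothing matches.
default_out=$(find . -type f -regex '.*/a(10|45).*' | LC_ALL=C sort)

# posix-extended: ( ) groups and | alternates.
ere_out=$(find . -type f -regextype posix-extended -regex '.*/a(10|45).*' | LC_ALL=C sort)

# Default syntax again, using the backslash-escaped operators it expects.
emacs_out=$(find . -type f -regex '.*/a\(10\|45\).*' | LC_ALL=C sort)

printf 'default:\n%s\n' "$default_out"
printf 'posix-extended:\n%s\n' "$ere_out"
printf 'emacs, escaped:\n%s\n' "$emacs_out"

cd / && rm -rf "$tmp"
```

Both working variants select ./a10a and ./a45d; which one to use is mostly a matter of which escaping style you find easier to read.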
Try specifying -regextype awk instead:
find . -regextype awk -regex '.*/a(10|45).*'
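Another pattern that happens to produce output for these file names uses square brackets instead of round ones:
find . -regex '.*/a[10|45].*'
Be aware that square brackets form a character class, not an alternation: [10|45] matches any single one of the characters 1, 0, |, 4 and 5, so this pattern also matches names such as a1x or a5.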
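To see what the square-bracket pattern actually matches, here is a small sketch with one extra, hypothetical file name (a5x) added to the question's list. It assumes a find whose default -regex syntax treats [...] as a bracket expression, as GNU find does.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch a10a a24b b12c a45d a5x

# [10|45] is a character class: exactly one of the characters 1 0 | 4 5.
class_out=$(find . -type f -regex '.*/a[10|45].*' | LC_ALL=C sort)

printf '%s\n' "$class_out"

cd / && rm -rf "$tmp"
```

The output includes ./a5x alongside ./a10a and ./a45d, which shows that the brackets test a single character after the a rather than the alternatives 10 and 45.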

linux find files with optional character in their name

Suppose I have two files: ac and abc. I want a regex that matches both files. I would expect the following to work, but it never does:
find ./ -name ab?c
I have tried both escaping and not escaping the question mark, and neither works. In the regex documentation I have found, ? means "previous character repeated 0 or 1 times", but find doesn't seem to understand this.
I have tried this on two different find versions: GNU find version 4.2.31 and
find (GNU findutils) 4.6.0
PS: this works with *, but I specifically would like to match just one optional character.
find ./ -name a*c
gives
./ac
./abc
The expression passed to -name is not a regex, it is a glob expression. A (single) glob expression can't be used for your use case but you can use regular expressions using -regex:
find -regex '.*/ab?c'
Btw, the default regular expression language is Emacs Style as explained here : https://www.emacswiki.org/emacs/RegularExpression . You can change the regex language using -regextype.
To match names with only one optional character, try using the -or option:
touch abc ac abbc
find . -name "a?c" -or -name "ac"
This gives you only the names abc and ac.
Generally, you can build pretty complex find queries using the -or and -and options =)
The find -name option uses a glob pattern, which is not the same as a regex. For globs, ? means any single character. If you want a character to be optional, you need to use two patterns:
find ./ -name abc -o -name ac
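The glob and regex meanings of ? can be compared side by side. This is a sketch assuming GNU find, whose default -regex syntax treats ? as "zero or one of the preceding character"; the file and variable names are only illustrative.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch ac abc abbc

# Glob: ? stands for exactly one arbitrary character, so only the
# four-character name matches.
glob_out=$(find . -type f -name 'ab?c' | LC_ALL=C sort)

# Regex: b? makes the b optional; the pattern must cover the whole path.
regex_out=$(find . -type f -regex '.*/ab?c' | LC_ALL=C sort)

# Two literal names joined with -o, grouped so -type f applies to both.
names_out=$(find . -type f \( -name abc -o -name ac \) | LC_ALL=C sort)

printf 'glob:\n%s\nregex:\n%s\nnames:\n%s\n' "$glob_out" "$regex_out" "$names_out"

cd / && rm -rf "$tmp"
```

The glob finds only ./abbc, while the regex and the two-name form both find ./abc and ./ac.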
The other answers give working solutions, but it is essential to know that find's -regex option matches against the whole entry. So you can't just do a partial match:
-regex 'ab?c'
You have to use one or two dot-stars:
-regex '.*ab?c.*'
Also without wildcards this would be possible using grep:
ls . | grep 'ab\?c'

regex match either string in linux "find" command

I'm trying the following to recursively look for files ending in either .py or .py.server:
$ find -name "stub*.py(|\.server)"
However this does not work.
I've tried variations like:
$ find -name "stub*.(py|py\.server)"
They do not work either.
A simple find -name "*.py" does work so how come this regex does not?
Use:
find . \( -name "*.py" -o -name "*.py.server" \)
This matches file names of the form *.py and *.py.server.
From man find:
expr1 -o expr2
Or; expr2 is not evaluated if expr1 is true.
EDIT: If you want to specify a regex, use the -regex option:
find . -type f -regex ".*\.\(py\|py\.server\)"
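Both approaches above can be tried against a few sample files. This is a sketch assuming GNU find with its default (emacs) regex syntax; the file names are made up for illustration.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch stub_a.py stub_b.py.server stub_c.pyc notes.txt

# Two glob tests joined with -o; the parentheses keep -type f applied to both.
names_out=$(find . -type f \( -name '*.py' -o -name '*.py.server' \) | LC_ALL=C sort)

# One regex in the default syntax, where grouping is \( \) and alternation is \|.
regex_out=$(find . -type f -regex '.*\.\(py\|py\.server\)' | LC_ALL=C sort)

printf 'names:\n%s\nregex:\n%s\n' "$names_out" "$regex_out"

cd / && rm -rf "$tmp"
```

Both commands select ./stub_a.py and ./stub_b.py.server and skip the .pyc and .txt files, because the glob and the regex each have to match through to the end of the name.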
Find can take a regular expression pattern:
$ find . -regextype posix-extended -regex '.*[.]py([.]server)?$' -print
Options:
-regex pattern
File name matches regular expression pattern. This is a match on the whole path, not a search. For example, to match a file named ./fubar3, you can use the regular expression .*bar. or .*b.*3, but not f.*r3. The regular expressions understood by find are by default Emacs Regular Expressions, but this can be changed with the -regextype option.
-print
True; print the full file name on the standard output, followed by a newline. If you are piping the output of find into another program and there is the faintest possibility that the files which you are searching for might contain a newline, then you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
-regextype type
Changes the regular expression syntax understood by -regex and -iregex tests which occur later on the command line. Currently-implemented types are emacs (this is the default), posix-awk, posix-basic, posix-egrep and posix-extended.
That gives a clearer description of the options. Don't forget that all of this information can be found by reading man find or info find.
find -name does not use regexps. Here's an extract from the man page on Ubuntu 12.04:
-name pattern
Base of file name (the path with the leading directories removed) matches shell pattern pattern. The metacharacters (`*', `?', and `[]') match a `.' at the start of the base name (this is a change in findutils-4.2.2; see section STANDARDS CONFORMANCE below). To ignore a directory and the files under it, use -prune; see an example in the description of -path. Braces are not recognised as being special, despite the fact that some shells including Bash imbue braces with a special meaning in shell patterns. The filename matching is performed with the use of the fnmatch(3) library function. Don't forget to enclose the pattern in quotes in order to protect it from expansion by the shell.
So the pattern that -name takes is more like a shell glob and not at all like a regexp.
If I wanted to find by regexp, I'd do something like
find . -type f -print | grep -E 'stub.*\.py(\.server)?$'

repetition in GNU find regexp

I am trying to find all the files whose name contains exactly 14 digits (I'm trying to match a timestamp in the filename). I'm not sure how to get the GNU find regexp syntax for repetitions right.
I've tried find -regex ".*[0-9]{14}" and find -regex ".*[0-9]\{14\}", but neither of these turns up any results. Can you help me with the syntax?
Remember, GNU find's -regex matches a whole path. Anyway, you can use a combination of find and grep to do the task, e.g. to find exactly 14 digits with no other characters
find . -type f -printf "%f\n" | grep -E "\b[0-9]{14}\b"
Modify to suit your needs.
Try changing the -regextype parameter to find.
Changes the regular expression syntax understood by -regex and -iregex
tests which occur later on the command line. Currently-implemented
types are emacs (this is the default), posix-awk, posix-basic,
posix-egrep and posix-extended.
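As an illustration of the -regextype suggestion, the sketch below uses posix-extended, where {14} is an interval without backslashes. It assumes GNU find; the file names are hypothetical, and the pattern as written requires a non-digit on both sides of the 14 digits, so a name that ends in the digits would need a different pattern.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# 14 digits, 15 digits, and 4 digits respectively.
touch ts_20240101123456.log ts_202401011234567.log ts_2024.log

# A run of exactly 14 digits bounded by non-digits on each side;
# the surrounding .* lets the pattern cover the whole path.
digits_out=$(find . -type f -regextype posix-extended \
    -regex '.*[^0-9][0-9]{14}[^0-9].*' | LC_ALL=C sort)

printf '%s\n' "$digits_out"

cd / && rm -rf "$tmp"
```

Only ./ts_20240101123456.log is printed; the 15-digit name is rejected because no stretch of exactly 14 digits in it has a non-digit on both sides.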
Strange, I just gave it a try and I could not get this to work. Here's a workaround anyway (matching exactly 2 consecutive digits):
$ ls
a123.txt a1b2c3.txt a45.txt b123.txt
$ find -regex '.*[^0-9][0-9][0-9][^0-9].*'
./a45.txt

regular expression to exclude filetypes from find

When using the find command in Linux, one can add a -regex flag that uses Emacs regular expressions to match.
I want find to look for all files except .jar files and .ear files. What would be the regular expression in this case?
Thanks
You don't need a regex here. You can use find with the -name and -not options:
find . -not -name "*.jar" -not -name "*.ear"
A more concise (but less readable) version of the above is:
find . ! \( -name "*.jar" -o -name "*.ear" \)
EDIT: New approach:
Since POSIX regexes don't support lookaround, you need to negate the match result:
find . -not -regex ".*\.[je]ar"
The previously posted answer uses lookbehind and thus won't work here, but here it is for completeness' sake:
.*(?<!\.[je]ar)$
find . -regextype posix-extended -not -regex ".*\\.(jar|ear)"
This will do the job, and I personally find it a bit clearer than some of the other solutions. Unfortunately the -regextype is required (cluttering up an otherwise simple command) to make the capturing group work.
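The negation approaches shown above can be compared on a few sample files. This is a sketch assuming GNU find (-not is accepted there; ! is the portable spelling); the file names are made up, and -type f is added so that the directory . itself is not listed.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch a.jar b.ear c.txt d.war

# Two negated glob tests.
not_name=$(find . -type f -not -name '*.jar' -not -name '*.ear' | LC_ALL=C sort)

# One negated group of glob tests.
not_group=$(find . -type f ! \( -name '*.jar' -o -name '*.ear' \) | LC_ALL=C sort)

# A negated regex that has to match the whole path.
not_regex=$(find . -type f -not -regex '.*\.[je]ar' | LC_ALL=C sort)

printf '%s\n---\n%s\n---\n%s\n' "$not_name" "$not_group" "$not_regex"

cd / && rm -rf "$tmp"
```

All three list ./c.txt and ./d.war, so the choice between them is mostly about readability.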
Using a regular expression in this case sounds like overkill (you could just check whether the name ends with something). I'm not sure about Emacs syntax, but something like this should be generic enough to work:
\.(?!((jar$)|(ear$)))
i.e. find a dot (.) not followed by ending ($) "jar" or (|) "ear".