find command with 'regex' match not working

I am trying to do a simple file-name match with the regex below, which I tested to be working from this page for a sample file name ABC_YYYYMMDDHHMMSS.sha1:
ABC_20[0-9]{2}(0[1-9]|1[0-2])([0-2][0-9]|3[0-1])([0-2][0-3])([0-5][0-9])([0-5][0-9])\.sha1
When I pass this to the -regex test of find, like
find . -type f -regex "ABC_20[0-9]{2}(0[1-9]|1[0-2])([0-2][0-9]|3[0-1])([0-2][0-3])([0-5][0-9])([0-5][0-9])\.sha1"
the command does not identify the file present in the path (e.g. ABC_20161231225950.sha1). I am aware of the many existing regex types from this page; I realized my type is posix-extended and tried as below,
find . -type f -regextype posix-extended -regex 'ABC_20[0-9]{2}(0[1-9]|1[0-2])([0-2][0-9]|3[0-1])([0-2][0-3])([0-5][0-9])([0-5][0-9])\.sha1'
and still got no result. I searched for similar questions, but they involved a wrong regex leading to files not being found. In my case the regex appears to be correct. I need to know what I am missing here, and also, if possible, how to debug non-matching issues when using -regex in find.
Note: I could optimize the capturing groups in the regex, but that is outside the scope of this question.

Add .* at the start of your regex: -regex is matched against the whole path, and with find . every path starts with something like ./
find . -type f -regextype posix-extended -regex '.*ABC_20[0-9]{2}(0[1-9]|1[0-2])([0-2][0-9]|3[0-1])([0-2][0-3])([0-5][0-9])([0-5][0-9])\.sha1'
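One way to debug a non-matching -regex is to look at the exact path strings find tests, because the pattern has to match the whole path and not just the file name. A minimal sketch, assuming GNU find; the scratch directory and the variable names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical scratch directory holding the sample file from the question.
demo_dir=$(mktemp -d)
touch "$demo_dir/ABC_20161231225950.sha1"
cd "$demo_dir" || exit 1

re='ABC_20[0-9]{2}(0[1-9]|1[0-2])([0-2][0-9]|3[0-1])([0-2][0-3])([0-5][0-9])([0-5][0-9])\.sha1'

# Show the strings that -regex is compared against: note the leading "./".
seen=$(find . -type f)
echo "find sees: $seen"

# Without a prefix the pattern must match from the first character of the path.
unanchored=$(find . -type f -regextype posix-extended -regex "$re")

# Allowing any leading directory part lets the whole path match.
anchored=$(find . -type f -regextype posix-extended -regex ".*/$re")

echo "without prefix: [$unanchored]"
echo "with prefix:    [$anchored]"
```

Printing the plain find output first makes it easy to see which prefix the pattern has to account for.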

Related

Issue using RegEx with Linux find

I'm writing a script to find and list files on my video drive that aren't already .mkv format, as well as listing any multi-episode files so that I can eventually convert and split these files properly.
Examples of files that should match:
Path/to/FilE332.1/Series Title/Season 01/Series - S01E03 - Episode Name Bluray-2160p.mkv
/Series - S01E103 - Episode Name WEBDL-1080p.mkv
Examples of files that shouldn't match:
Path/to/FilE332.1/Series Title/Season 01/Series - S01E04E05 - Episode Name SDTV.mkv
/Series - S01E04E05 - Episode Name SDTV.mkv
Here's the command I came up with:
find /path/to/files -type f ! -regex ".*- S\d{2}E(?:\d{3}|\d{2}) -.*\.mkv"
This regex seems to be working properly when tested on regex101's website, so I'm pretty confident that the regex string is correct: https://regex101.com/r/iyUbh6/1
I've tried adding the -regextype flag to no avail:
find /path/to/files -type f ! -regextype posix-egrep -regex ".*- S\d{2}E(?:\d{3}|\d{2}) -.*\.mkv"
find /path/to/files -type f ! -regextype posix-basic -regex ".*- S\d{2}E(?:\d{3}|\d{2}) -.*\.mkv"
find /path/to/files -type f ! -regextype egrep -regex ".*- S\d{2}E(?:\d{3}|\d{2}) -.*\.mkv"
I also read some stuff about \d not working properly, so I tried changing it to [[:digit:]]. That didn't work either.
find /path/to/files -type f ! -regextype posix-basic -regex ".*- S[[:digit:]]{2}E(?:[[:digit:]]{3}|[[:digit:]]{2}) -.*\.mkv"
find /path/to/files -type f ! -regextype posix-extended -regex ".*- S[[:digit:]]{2}E([[:digit:]]{3}|[[:digit:]]{2}) -.*\.mkv"
I don't really know where to go from here, so hopefully someone with more experience has some insight on this issue.
Note: The following assumes you're using GNU find, which, since you mention Linux, is a safe bet.
The default regular expression syntax does not understand \d (use [0-9] or [[:digit:]] instead), and alternation is written \|. I don't think it supports repetition ranges; they're not documented. POSIX Basic Regular Expression syntax doesn't understand \d either, has no alternation (though some GNU implementations accept \| as an extension), and requires many other things like groups and repetition ranges to be escaped. None of the supported flavors supports non-capturing groups ((?:...)).
Since your alternating group tests for either two or three digits, it can be turned into a single range when using one of the RE flavors that supports them.
So, something like:
find /path/to/files -regextype posix-extended -type f ! -regex ".*- S[0-9]{2}E[0-9]{2,3} -.*\.mkv"
is probably the cleanest approach.
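To check the command above against names like the ones in the question, a small sketch along these lines may help. It assumes GNU find; the scratch directory, the extra .avi file and the variable name are invented for the test:

```shell
#!/bin/sh
# Hypothetical scratch directory with file names modelled on the question's examples.
demo_dir=$(mktemp -d)
touch "$demo_dir/Series - S01E03 - Episode Name Bluray-2160p.mkv" \
      "$demo_dir/Series - S01E103 - Episode Name WEBDL-1080p.mkv" \
      "$demo_dir/Series - S01E04E05 - Episode Name SDTV.mkv" \
      "$demo_dir/Series - S01E06 - Episode Name SDTV.avi"

# -regextype is placed before the negated -regex test it applies to.
# Prints every file that is NOT a single-episode .mkv.
needs_work=$(find "$demo_dir" -regextype posix-extended -type f \
    ! -regex '.*- S[0-9]{2}E[0-9]{2,3} -.*\.mkv' -printf '%f\n' | sort)

printf '%s\n' "$needs_work"
```

Only the multi-episode .mkv and the non-.mkv file should be listed; the two single-episode .mkv files are filtered out by the negated test.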
I just pipe find to grep -v to do the filtering out (quote the pattern so the backslash reaches grep, and anchor it to the end of the name):
find path -type f | grep -v '\.mkv$'

Find using regex with alternatives

I am trying to use find to match several alternative file patterns distinguished by certain numbers in the middle, but it returns an empty list. My actual pattern has a fixed beginning and variable numbers in the middle.
Reproducible example: create a list of files
touch a10a a24b b12c a45d
Selecting a10a and a45d from the list with the following regex produces empty output
find . -regex '.*/a(10|45).*'
I expect the issue is easy to solve, but I could not find a solution or figure it out myself. What did I miss?
System: Ubuntu 16.04
The idea is right, but you need to specify the type of regex that find should use. Since you use the alternation operator | here, you need to enable ERE (Extended Regular Expressions) support, which you can do as below. The -regextype option lets you specify the regex flavor you need. Also, the / part is optional because the greedy .* already matches it.
find . -type f -regextype posix-extended -regex '.*/a(10|45).*'
The man page from my version of GNU findutils says:
-regextype type
Changes the regular expression syntax understood by -regex and -iregex tests which occur later on the command line. Currently-implemented types are emacs (this is the default), posix-awk, posix-basic, posix-egrep and posix-extended.
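The difference between the flavors can be seen side by side on the files from the question. This is only a sketch, assuming GNU find and a made-up scratch directory; it relies on the default (emacs) syntax writing grouping and alternation with backslashes:

```shell
#!/bin/sh
# Recreate the files from the question in a hypothetical scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a10a a24b b12c a45d

# Default (emacs) syntax: unescaped ( | ) are literal characters, so nothing matches.
plain=$(find . -type f -regex '.*/a(10|45).*' | sort)

# Default (emacs) syntax with backslash-escaped grouping and alternation.
escaped=$(find . -type f -regex '.*/a\(10\|45\).*' | sort)

# Extended syntax: ( | ) are operators and need no escaping.
extended=$(find . -type f -regextype posix-extended -regex '.*/a(10|45).*' | sort)

echo "plain:    [$plain]"
echo "escaped:  $escaped"
echo "extended: $extended"
```

Either of the last two forms selects a10a and a45d; which one to prefer is mostly a matter of readability.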
Try specifying -regextype awk instead:
find . -regextype awk -regex '.*/a(10|45).*'
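Do not be tempted to swap the round brackets for square ones. A command such as
find . -regex '.*/a[10|45].*'
uses a bracket expression, which matches exactly one character out of 1, 0, |, 4 and 5. It would also select names like a1x or a5, so it does not express the alternatives 10 and 45. Keep the parentheses and pick a regex type that treats ( | ) as operators.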

Regular expression error

I have a regular expression that seems to work in JavaScript, but doesn't work with the Linux find command. The purpose is to gather a list of files that have been updated in the last 90 days, excluding particular directories (for example, assume I want to include the directory /data/safe/23/test, but not /data/safe/23/skip1). Here is the regex:
^/data/safe/\d{1,4}/(?:(?!skip1|skip2).*)
And here is the find command (notice I'm using posix-extended; that may be the problem):
find /data/safe -regextype posix-extended -regex '^/data/safe/\d{1,4}/(?:(?!skip1|skip2).*)' -mtime -90
And finally this is the error that is generated:
find: Invalid preceding regular expression
Any help is greatly appreciated!
I think that posix-extended does not support "?:" and "?!".
Anyway, with find, it would be easier to use something like:
find /data/safe -regextype posix-extended -regex '^/data/safe/[0-9]{1,4}/.*' ! -regex '^/data/safe/[0-9]{1,4}/(skip1|skip2).*' -mtime -90
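The include-then-exclude idea can be tried on a throwaway tree. The sketch below assumes GNU find and uses a made-up relative directory named safe in place of /data/safe; the exclusion pattern ends in (/.*)? instead of .* as a small variation, so that a directory such as skip10 would not be dropped by accident:

```shell
#!/bin/sh
# Hypothetical scratch tree standing in for /data/safe.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir -p safe/23/test safe/23/skip1 safe/7/skip2
touch safe/23/test/keep.txt safe/23/skip1/drop.txt safe/7/skip2/drop.txt

# The first -regex selects paths below a 1-4 digit directory;
# the negated second -regex drops skip1, skip2 and everything beneath them.
recent=$(find safe -regextype posix-extended -type f \
    -regex 'safe/[0-9]{1,4}/.*' \
    ! -regex 'safe/[0-9]{1,4}/(skip1|skip2)(/.*)?' \
    -mtime -90 | sort)

echo "$recent"
```

Splitting the condition into a positive and a negated test avoids the need for lookahead, which these regex flavors do not offer.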
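The curly-bracket interval {1,4} is valid in posix-extended and can stay as it is. The error comes from the ? that directly follows an opening parenthesis in (?: and (?!: in this syntax ? is a repetition operator, and there is nothing before it to repeat. \d is not understood as a digit class either, so write [0-9] or [[:digit:]] instead. No piping through other commands is needed.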

repetition in GNU find regexp

I am trying to find all the files whose name contains exactly 14 digits (I'm trying to match a timestamp in the filename). I'm not sure how to get the GNU find regexp syntax for repetitions right.
I've tried find -regex ".*[0-9]{14}" and find -regex ".*[0-9]\{14\}"; neither of these turns up any results. Can you help me with the syntax?
Remember, GNU find's -regex matches the whole path. Anyway, you can combine find and grep to do the task, e.g. to find a run of exactly 14 digits delimited by word boundaries:
find . -type f -printf "%f\n" | grep -E "\b[0-9]{14}\b"
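Modify to suit your needs; note that \b treats _ as a word character, so a name like ABC_20161231225950.sha1 would not match as written.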
Try changing the -regextype parameter to find.
Changes the regular expression syntax understood by -regex and -iregex
tests which occur later on the command line. Currently-implemented
types are emacs (this is the default), posix-awk, posix-basic,
posix-egrep and posix-extended.
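As a sketch of that suggestion, assuming GNU find: with -regextype posix-extended the interval {14} can be written without backslashes. The file names and the stricter "exactly 14" pattern below are illustrative, not taken from the question:

```shell
#!/bin/sh
# Hypothetical scratch directory: one 14-digit name, one 15-digit name, one short name.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch ABC_20161231225950.sha1 x123456789012345.txt a45.txt

# Any path containing at least 14 consecutive digits.
at_least=$(find . -type f -regextype posix-extended -regex '.*[0-9]{14}.*' | sort)

# File name containing a run of exactly 14 digits: the run must be bounded
# by a non-digit or by the start or end of the file name.
exactly=$(find . -type f -regextype posix-extended \
    -regex '.*/([^/]*[^0-9/])?[0-9]{14}([^0-9/][^/]*)?' | sort)

echo "at least 14: $at_least"
echo "exactly 14:  $exactly"
```

The first pattern also accepts the 15-digit name, which is why the second one pins down what may appear on either side of the digit run.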
Strange, I just gave it a try and could not get this to work. Here's a workaround anyway (matching exactly 2 consecutive digits):
$ ls
a123.txt a1b2c3.txt a45.txt b123.txt
$ find -regex '.*[^0-9][0-9][0-9][^0-9].*'
./a45.txt