Linux $FIND and hex notation of Unicode characters' range? - regex

I am unable to get Unicode hex notation working with the Linux find utility and its -regex option. Here is my case.
Given a folder with 5 files such as:
./cmn-我.flac
./cmn-的.flac
./cmn-三.flac
./cmn-a.flac
./cmn-b.flac
To find the files with CJK characters, I tried the following :
find ./ -regex "./cmn-.\.flac" #Finds *ALL* the "cmn-?.flac" files, not what I want.
find ./ -regex "./cmn-[\x4e00-\x9fa5]\.flac" #fails
find ./ -regex "./cmn-[\u4e00-\u9fa5]\.flac" #fails
find ./ -regex "./cmn-[\x{4e00}-\x{9fa5}]\.flac" #fails
find ./ -regex "./cmn-[\u{4e00}-\u{9fa5}]\.flac" #fails
find ./ -regex "./cmn-[\U0004e00-\U0009fa5]\.flac" #fails
without success.
How can I find the files with CJK characters using find ./ -regex "[myRegEx]" and a regex in Unicode hex notation?

As I explained in "What regex to find files with CJK characters using find command?", find uses POSIX regexes, which don't support this kind of pattern.
Explanation
Looking at the -regextype option, I only see these regular expression types: emacs (default), posix-awk, posix-basic, posix-egrep and posix-extended.
None of them supports hex code point ranges (compare Perl with POSIX).
Solution
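But grep does have an experimental -P or --perl-regexp option where you can use this kind of pattern. In PCRE a code point above FF has to be written as \x{...} (a bare \x4e00 is read as \x4e followed by the literal characters 00), and grep has to run in a UTF-8 locale:
find . -name 'cmn-*.flac' -print | grep -P '[\x{4e00}-\x{9fa5}]'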
see command explanation.
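The pipeline can be tried out in a scratch directory. This is only a sketch under stated assumptions: GNU grep built with PCRE support, a C.UTF-8 locale installed on the system, and a throwaway directory named demo_dir (a made-up name) holding the five files from the question.

```shell
# Sketch: keep only names whose variable part is a code point in U+4E00..U+9FA5.
# Assumptions: GNU grep with -P support and an installed C.UTF-8 locale.
demo_dir=$(mktemp -d)
touch "$demo_dir/cmn-我.flac" "$demo_dir/cmn-的.flac" "$demo_dir/cmn-三.flac" \
      "$demo_dir/cmn-a.flac" "$demo_dir/cmn-b.flac"

# find only pre-selects by glob; grep -P does the code point range test.
cjk_files=$(find "$demo_dir" -name 'cmn-*.flac' -print \
  | LC_ALL=C.UTF-8 grep -P 'cmn-[\x{4e00}-\x{9fa5}]\.flac$')

printf '%s\n' "$cjk_files"
# Count the non-empty lines that survived the filter.
cjk_count=$(printf '%s\n' "$cjk_files" | grep -c .)
rm -rf "$demo_dir"
```

If no UTF-8 locale is active, grep -P rejects \x{4e00} as too large, which is why the locale is set explicitly on the grep side of the pipe.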

Related

How to delete files based on the extension in MacOS terminal using regex?

I need to delete a huge amount of .zip and .apk files from my project's root folder I'd like to do it using the bash terminal (MacOS X).
So far I've successfully made it with two commands:
$ find . -name \*.zip -delete
$ find . -name \*.apk -delete
But I want to do it in one using regex:
$ find . -regex '\w*.(apk|zip)' -delete
But this regular expression doesn't seem to work, because it isn't deleting anything... what am I doing wrong?
MORE INFO:
An example of what I want to delete is android~1~1~sampleproject.zip.
$ find -E . -regex './[~a-zA-Z0-9]+\.(apk|zip)' -delete
find tries to match the whole path, so the regex has to start with ./
I believe find doesn't support \w, \d, etc., so replace them with a bracket expression. The + quantifier and the (apk|zip) alternation are not available in basic regular expressions either, so you need to add -E to enable extended regular expressions.
-E Interpret regular expressions followed by -regex and -iregex primaries as extended (modern) regular expressions rather than basic regular expressions (BRE's). The re_format(7) manual page fully describes both formats.
Example
For example consider the following commands
$ ls *.json
bower.json composer.json package.json
$ find -E . -regex "\./[a-zA-Z0-9]+\.(json)"
./bower.json
./composer.json
./package.json
Note: The above answer is specifically for BSD find. GNU find does not support the -E option; it supports -regextype posix-extended instead. The above example can be rewritten as
$ find . -regextype posix-extended -regex "\./\w+\.(json)"
I would use:
find . -type f \( -name "*.zip" -o -name "*.apk" \) -delete
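Since -delete is irreversible, it can help to run the same expression with -print first. The following is a minimal sketch in a scratch directory; demo_dir and the file names app.apk and notes.txt are made up for illustration, and -delete is assumed to be available (it is in both GNU and BSD find).

```shell
# Sketch: preview with -print, then delete with the identical expression.
demo_dir=$(mktemp -d)
touch "$demo_dir/android~1~1~sampleproject.zip" "$demo_dir/app.apk" "$demo_dir/notes.txt"

# Dry run: list what would be removed.
find "$demo_dir" -type f \( -name '*.zip' -o -name '*.apk' \) -print

# Real run: the parentheses make -delete apply to both alternatives.
find "$demo_dir" -type f \( -name '*.zip' -o -name '*.apk' \) -delete

# Only the file that matched neither pattern should be left.
remaining=$(ls "$demo_dir")
echo "$remaining"
rm -rf "$demo_dir"
```

Without the parentheses, -delete would bind only to the last -name test, because the implicit AND has higher precedence than -o.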

Find command with regex for multiple file extension

I was wondering if this is possible with the find command. I am trying to find all files with the following extensions and then run sed on them. Here's my current command:
find . -regex '.*\.(sh|ini|conf|vhost|xml|php)$' | xargs sed -i -e 's/%%MEFIRST%%/mefirst/g'
Unfortunately, I'm not that familiar with regex, but something like this is what I need.
I see in the comments that you found out how to escape it with GNU basic regexes via the nonstandard flag find -regex RE, but you can also specify a type of regex that supports it without any escapes, making it a bit more legible:
In GNU findutils (Linux), use -regextype posix-extended:
find . -regextype posix-extended -regex '.*\.(sh|ini|conf|vhost|xml|php)$' | …
In BSD find (FreeBSD find or Mac OS X find), use -E, which has to come before the path:
find -E . -regex '.*\.(sh|ini|conf|vhost|xml|php)$' | …
The POSIX spec for find does not support regular expressions at all, but it does support wildcard globbing and -o (the -or spelling is an extension), so you could be fully portable with this verbose monster:
find . -name '*.sh' -o -name '*.ini' -o -name '*.conf' \
-o -name '*.vhost' -o -name '*.xml' -o -name '*.php' | …
Be sure those globs are quoted, otherwise the shell will expand them prematurely and you'll get syntax errors (since e.g. find -name a.sh b.sh … doesn't insert a -o -name between the two matched files and it won't expand to files in subdirectories).
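One caveat if the chain of -name tests feeds an action such as -exec instead of a pipe: the implicit AND binds tighter than -o, so the alternatives need to be grouped. The sketch below assumes GNU sed (for -i without a suffix argument) and uses a scratch directory and file names that are invented for the demonstration.

```shell
# Sketch: quoted globs, grouped with \( \), feeding sed through -exec.
demo_dir=$(mktemp -d)
printf '%s\n' 'name=%%MEFIRST%%' > "$demo_dir/app.ini"
printf '%s\n' 'name=%%MEFIRST%%' > "$demo_dir/run.sh"
printf '%s\n' 'name=%%MEFIRST%%' > "$demo_dir/readme.txt"

# Group the alternatives so that -exec applies to every match.
# sed -i with no suffix argument is GNU sed syntax (assumption).
find "$demo_dir" -type f \( -name '*.sh' -o -name '*.ini' \) \
  -exec sed -i -e 's/%%MEFIRST%%/mefirst/g' {} +

# The .ini file is rewritten; the .txt file is not selected and stays as is.
ini_line=$(cat "$demo_dir/app.ini")
txt_line=$(cat "$demo_dir/readme.txt")
echo "$ini_line / $txt_line"
rm -rf "$demo_dir"
```

Using -exec … {} + instead of piping into xargs also sidesteps problems with file names that contain spaces.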

Wording a regex with bash find and boolean match

I'm not at all sure why this doesn't work. Other posts here suggest that it should. I just want a regex on find to locate all files that match ___orig.png and ___DIFF.png. This will find the first:
find . -type f -regex '.*_____orig\.png'
But this finds nothing:
find . -type f -regex '.*_____(orig|DIFF)\.png'
What is the correct way to phrase the regex to match both? (Yes, I know I can use -or to get a much longer and less maintainable command...)
You need to escape both parens and the pipe, use:
find . -type f -regex '.*_____\(orig\|DIFF\)\.png'
GNU find's -regex uses emacs flavour by default, which I'm not very familiar with. You can change the regex used with -regextype. With -regextype posix-extended your current -regex should work.
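The difference between the two syntaxes can be checked in a scratch directory. This sketch assumes GNU find (for -regextype); the directory demo_dir and the file names are made up for the test.

```shell
# Sketch: the same alternation under the default and the posix-extended syntax.
demo_dir=$(mktemp -d)
touch "$demo_dir/a_____orig.png" "$demo_dir/a_____DIFF.png" "$demo_dir/a_____other.png"

# Default (emacs) syntax: unescaped ( | ) are literal characters.
plain=$(find "$demo_dir" -type f -regex '.*_____(orig|DIFF)\.png' | wc -l)

# Default syntax with escaped grouping and alternation.
escaped=$(find "$demo_dir" -type f -regex '.*_____\(orig\|DIFF\)\.png' | wc -l)

# posix-extended: the unescaped form works as written.
# -regextype must appear before the -regex test it should affect.
extended=$(find "$demo_dir" -type f -regextype posix-extended \
  -regex '.*_____(orig|DIFF)\.png' | wc -l)

echo "plain=$plain escaped=$escaped extended=$extended"
rm -rf "$demo_dir"
```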
The portable way is to use two -name operators.
find . -type f \( -name "*_____orig.png" -o -name "*_____DIFF.png" \) -print
Or, with bash 4.0 or newer, you can use globstar and extglob instead of find
shopt -s globstar extglob
for file in ./**/*_____@(orig|DIFF).png; do
echo "$file"
done

why isn't this regex working : find ./ -regex '.*\(m\|h\)$

Why isn't this regex working?
find ./ -regex '.*\(m\|h\)$'
I noticed that the following works fine:
find ./ -regex '.*\(m\)$'
But when I add the "or a h at the end of the filename" by adding \|h it doesn't work. That is, it should pick up all my *.m and *.h files, but I am getting nothing back.
I am on Mac OS X.
On Mac OS X, you can't use \| in a basic regular expression, which is what find uses by default.
re_format man page
[basic] regular expressions differ in several respects. | is an ordinary character and there is no equivalent for its functionality.
The easiest fix in this case is to change \(m\|h\) to [mh], e.g.
find ./ -regex '.*[mh]$'
Or you could add the -E option to tell find to use extended regular expressions instead.
find -E ./ -regex '.*(m|h)$'
Unfortunately -E isn't portable.
Also note that if you only want to list files ending in .m or .h, you have to escape the dot, e.g.
find ./ -regex '.*\.[mh]$'
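The effect of the escaped dot is easy to see with a file whose name merely ends in m. A small sketch follows; the scratch directory and the file names (main.m, main.h, alarm, notes.txt) are invented for the demonstration.

```shell
# Sketch: compare the pattern with and without the escaped dot.
demo_dir=$(mktemp -d)
touch "$demo_dir/main.m" "$demo_dir/main.h" "$demo_dir/alarm" "$demo_dir/notes.txt"

# Without \. any path ending in m or h matches, including "alarm".
loose=$(find "$demo_dir" -type f -regex '.*[mh]$' | wc -l)

# With \. only names ending in .m or .h match.
strict=$(find "$demo_dir" -type f -regex '.*\.[mh]$' | wc -l)

echo "loose=$loose strict=$strict"
rm -rf "$demo_dir"
```

Both patterns use only a bracket expression, so they do not depend on \| or on -E.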
If you find this confusing (me too), there's a great reference table that shows which features are supported on which systems.
Regex Syntax Summary [Google Cache]
A more efficient solution is to use the -o flag:
find . -type f \( -name "*.m" -o -name "*.h" \)
but if you want the regex use:
find . -type f -regex ".*\.[mh]$"
Okay, this is a little hacky, but if you don't want to wrangle the regex limitations of find on OSX, you can just pipe find's output to grep:
find . | grep '\.[mh]$'
What’s wrong with
find . -name '*.[mh]' -type f
If you want fancy patterns, then use find2perl and hack the pattern.

Using non-consuming matches in Linux find regex

Here's my problem in a simplified scenario.
Create some test files:
touch /tmp/test.xml
touch /tmp/excludeme.xml
touch /tmp/test.ini
touch /tmp/test.log
I have a find expression that returns me all the XML and INI files:
[root@myserver] ~> find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)'
/tmp/test.ini
/tmp/test.xml
/tmp/excludeme.xml
I now want a way of modifying this -regex to exclude the excludeme.xml file from being included in the results.
I thought this should be possible by using/combining a non-consuming regex (?=expr) with a negated match (?!expr). Unfortunately I can't quite get the format of the command right, so my attempts result in no matches being returned. Here was one of my attempts (I've tried many different forms of this with different escaping!):
find /tmp -name -prune -o -regex '\(?=.*excludeme\.xml\).*\.\(xml\|ini\)'
I can't break down the command into multiple steps (e.g. piping through grep -v) as the find command is assumed as input into other parts of our tool.
This does what you want on linux:
find /tmp -name -prune -o -regex '.*\.\(xml\|ini\)' \! -regex '.*excludeme\.xml'
I'm not sure if the "!" operator is unique to gnu find.
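For what it's worth, the ! operator is part of the POSIX specification for find; it is -regex that is an extension. The negation can be tried with the test files from the question, recreated here in a scratch directory (demo_dir is a made-up name, and the \| alternation relies on GNU find's default regex syntax).

```shell
# Sketch: select by one -regex, then exclude by a negated second -regex.
demo_dir=$(mktemp -d)
touch "$demo_dir/test.xml" "$demo_dir/excludeme.xml" "$demo_dir/test.ini" "$demo_dir/test.log"

# The first -regex keeps .xml/.ini; "! -regex" drops excludeme.xml again.
kept=$(find "$demo_dir" -type f -regex '.*\.\(xml\|ini\)' \
  ! -regex '.*/excludeme\.xml' | sort)

printf '%s\n' "$kept"
# Count the non-empty result lines.
kept_count=$(printf '%s\n' "$kept" | grep -c .)
rm -rf "$demo_dir"
```

Because both tests stay inside one find invocation, this fits the constraint that the command cannot be split into several pipeline steps.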
Not sure about what escapes you need or if lookarounds work, but these work for Perl:
/^(?!.*\/excludeme\.).*\.(xml|ini)$/
/(?<!\/excludeme)\.(xml|ini)$/
Edit: I just checked the find command. The best you can do with find is to change the regex type with -regextype posix-extended, but that doesn't support look-arounds. The only way around this seems to be using GNU tools, either as @unholygeek suggests with find, or by piping find's output into GNU grep with the -P (Perl) option. You can use the above regexes verbatim (minus the surrounding / delimiters) if you go with GNU grep. Something like find .... -print | grep -P ...
Sorry, that's the best I can do.
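As a rough illustration of the grep -P route, here is the lookahead pattern applied to find's output. It assumes GNU grep built with PCRE support; the scratch directory demo_dir is a made-up name and the files mirror the ones from the question.

```shell
# Sketch: let find list the files and let grep -P apply the look-around.
demo_dir=$(mktemp -d)
touch "$demo_dir/test.xml" "$demo_dir/excludeme.xml" "$demo_dir/test.ini" "$demo_dir/test.log"

# The negative lookahead rejects any path containing "/excludeme." before
# the rest of the pattern requires an .xml or .ini suffix.
filtered=$(find "$demo_dir" -type f -print \
  | grep -P '^(?!.*/excludeme\.).*\.(xml|ini)$' | sort)

printf '%s\n' "$filtered"
# Count the non-empty result lines.
filtered_count=$(printf '%s\n' "$filtered" | grep -c .)
rm -rf "$demo_dir"
```

The pattern is kept in single quotes so that the shell passes (?! through to grep untouched.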