Find file names using find command and regex, functioning improperly

We have a Samba server that is backing up to an S3 bucket. It turns out that a large number of file names contain inappropriate characters, and the AWS CLI won't transfer those files. Using the "worst offender", I built a quick regex check and tested it in Rubular against another file name, to try to generate a list of files that need to be fixed:
([中文网页我们的团队孙é¹â€“¦]+)
The command I'm running is:
find . -regextype awk -regex ".*/([中文网页我们的团队孙é¹â€“¦]+)"
This returns only a small list of files whose names contain the above string, in order, rather than files with any of the individual characters anywhere in the name. This leads me to believe that either my regextype is incorrect or something is wrong with the formatting of the list of characters. I've tried the emacs and egrep types, as they seem most similar to the regex I've used outside of a Unix environment, with no luck.
My test file name is: this-is-my€™s'-test-_ folder-name. which, according to my rubular tests, should be returned but isn't. Any help would be greatly appreciated.

find's -regex has to match the whole path, not just part of it. Your regex therefore expects everything after the last slash to consist only of those special characters, and your test file name contains plenty of other characters as well.
You might try something more like .*[中文网页我们的团队孙é¹â€“¦]+.* instead.
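The whole-path behaviour can be sketched in Python, where re.fullmatch plays the role of find -regex. This is only an illustration: BAD_CHARS is an abbreviated subset of the character class from the question, and the sample paths other than the test file name are made up.

```python
import re

# Abbreviated subset of the character class from the question (assumption:
# shortened for readability; the real list is longer).
BAD_CHARS = "中文é€"

# find -regex must match the WHOLE path, which fullmatch mimics here.
ORIGINAL = re.compile(rf".*/([{BAD_CHARS}]+)")
SUGGESTED = re.compile(rf".*[{BAD_CHARS}]+.*")

paths = [
    "./this-is-my€™s'-test-_ folder-name.",  # bad character in the middle
    "./中文",                                 # name made only of bad characters
    "./clean-name.txt",                      # nothing to fix
]

def matches(pattern, path):
    """Return True when the pattern matches the entire path."""
    return pattern.fullmatch(path) is not None

original_hits = [p for p in paths if matches(ORIGINAL, p)]
suggested_hits = [p for p in paths if matches(SUGGESTED, p)]

print(original_hits)
print(suggested_hits)
```

The original pattern only catches the name that is made up entirely of listed characters, while the suggested one catches any name containing at least one of them.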

Related

IF in a Regex expresion? (replace something in file names ONLY if file type is xxx)

OK, I know some regex, but this one is fooling me...
Every month I usually manage hundreds of submitted files and have to run some checks and replacements before making them available again on our intranet.
I'm doing it locally on a hard drive on Windows, through a file renamer program that supports PCRE but only runs one line at a time, so everything has to be in the same regex.
The problem is that I would like to do replacements only if the file type is xxx.
For example, replace all spaces with underscores ONLY if the extension is jpg|jpeg|jpe
so
this is a test.jpg => this_is_a_test.jpg
this is a another test.jpe => this_is_a_another_test.jpe
this is a test.docx => this is a test.docx
Jpg is just an EXAMPLE; I do different replacements for each extension, and not for all extensions, so anything that replaces the spaces in the .docx above would be wrong...
Is it possible?
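You need to find spaces, and then look ahead to see the extension:
 (?=.*\.jp(?:e?g|e)$)
Note the leading space: the pattern starts with a literal space character, which is the only thing that actually gets replaced. The alternation covers jpg, jpeg and jpe.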
Try it here: https://regex101.com/r/ZgDv7S/1
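For anyone who wants to check the lookahead idea outside the renamer tool, here is a small Python sketch. The function name fix_name is made up for illustration, and the extension list is spelled out in full rather than abbreviated.

```python
import re

# A literal space, followed by a lookahead that checks (without consuming)
# that the name ends in .jpg, .jpeg or .jpe.
SPACE_BEFORE_JPEG = re.compile(r" (?=.*\.(?:jpg|jpeg|jpe)$)")

def fix_name(name):
    """Replace spaces with underscores only for the listed extensions."""
    return SPACE_BEFORE_JPEG.sub("_", name)

for name in ["this is a test.jpg",
             "this is a another test.jpe",
             "this is a test.docx"]:
    print(name, "=>", fix_name(name))
```

Each extension that needs a different replacement would get its own pattern of the same shape.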
You said you were on Windows, but do you have WSL (Bash for Linux) or something similar available? If so, here is a quick way to do this from the Linux command line:
rename 's/ /_/g;' *.jpg *.jpe
Explanation: rename runs the given Perl script on the name of the specified files.

Regex: Identify file name with "string" but exclude if has .filepart extension

I have a requirement to search through a directory to identify specific files with a string contained in the file name. But I want to exclude part loaded files with a ".filepart" extension.
This must be done through Regex due to tool limitations.
The file names can be in multiple formats, and we must identify them from the "file identifier" string that we pass into the Regex.
I have read some very good articles within SO and other websites but I am struggling to nail down the correct syntax.
I have saved a page on regex101.com to provide a more detailed explanation of what I am trying to achieve. The "FILETYPE" can be considered the string we pass into the Regex.
https://regex101.com/r/zTrbyX/4
Thanks,
K
Your original regex:
.*FILETYPE.*\.[[:alnum:]]*(?!filepart)
in practice gives the same result as:
.*FILETYPE.*
because the negative lookahead is only checked after [[:alnum:]]* has already consumed the extension, so it never rejects anything.
Instead you could use the following regex (similar to CAustin's solution in the comments):
.*FILETYPE.*(?<!filepart)$
This will match every line that contains FILETYPE and does not end with filepart. Here $ denotes the end of the line. On regex101.com you need to activate the m flag for $ to be recognized as end of line.
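A quick way to compare the two patterns is a Python sketch like the one below. The file names are invented for the test, and [[:alnum:]] is swapped for [A-Za-z0-9] because Python's re module has no POSIX classes.

```python
import re

# Lookahead version from the question (with [[:alnum:]] rewritten for Python).
ORIGINAL = re.compile(r".*FILETYPE.*\.[A-Za-z0-9]*(?!filepart)")
# Lookbehind version: the name must not END in "filepart".
FIXED = re.compile(r".*FILETYPE.*(?<!filepart)$")

# Hypothetical file names, only for illustration.
names = [
    "ABC_FILETYPE_20200101.csv",
    "ABC_FILETYPE_20200101.csv.filepart",
    "ABC_OTHER_20200101.csv",
]

original_hits = [n for n in names if ORIGINAL.match(n)]
fixed_hits = [n for n in names if FIXED.match(n)]

print(original_hits)
print(fixed_hits)
```

The lookahead version still lets the .filepart name through, while the lookbehind anchored with $ keeps only the fully loaded file.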

grep complete resource url within a file

I have to search a file and extract addresses like these:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.it-dz.5373092572.05373092574.12739786322/enclosure.m4v
There are 38 links, and only the last series of digits changes.
I tried this regexp:
grep -io 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,-+]*[.m4v]'
It extracts all the URLs in the file that point to an m4v file, but instead of the complete URL I get a partial one, as follows:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.
Where am I wrong?
I can't figure out why it happens.
Thanks a lot for your effort.
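Your regex and your extracted URL do not match. The URL that you list does not begin with:
http://ex
which your regex requires. Note also that [.m4v] is a bracket expression that matches a single character (., m, 4 or v), not the literal extension. You could change your regex to something more like this, which would match your URL (use grep -E so that the group and + are understood, and note the dot inside the brackets, since the host and path segments contain dots):
'http://([a-z0-9.+-]+/)*[a-z0-9.+-]+\.m4v'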
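To sanity-check a pattern of this kind before running it over the real file, a Python sketch along these lines may help. It assumes a variant of the pattern with a dot added to the character class so that host names and dotted path segments match; the surrounding HTML in sample is made up.

```python
import re

# Any number of "segment/" parts, then a final segment ending in a literal .m4v.
# The dot inside the class is an addition so that deimos.apple.com and
# Core.woa are accepted.
URL_RE = re.compile(
    r"http://(?:[a-z0-9.+-]+/)*[a-z0-9.+-]+\.m4v",
    re.IGNORECASE,  # same effect as grep -i
)

# Hypothetical snippet of the input file.
sample = (
    '<a href="http://deimos.apple.com/WebObjects/Core.woa/'
    'DownloadRedirectedTrackPreview/'
    'unina.it-dz.5373092572.05373092574.12739786322/enclosure.m4v">lesson</a>'
)

# findall returns whole matches here, like grep -o.
urls = URL_RE.findall(sample)
for url in urls:
    print(url)
```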
Sorry Jonathan, that was a typo in my post: my actual regex correctly used dei, not ex as written.
But the problem persisted.
Marc opened my mind.
I knew how the address starts, so I tried
grep -io 'http://dei/.m4v'
no success :-(
fedorqui gave the last hint, maybe the problem was a dot
so I tried
grep -io 'http://deimos./.m4v' :-D
and it did the trick!
Now I have the file to give to wget to automate multiple file downloads without soiling my system with emulators and proprietary software.
The files are podcasts of law lectures, released free as in freedom, but easy to get only for those who buy Apple or Microsoft products (iTunes).
Thanks to all indeed!!

sed regexp matching in a long line

I have an XML file from which I wish to extract all occurrences of some tag AB. The file is one long line with ~500,000 chars.
Now I do know about regexp and such, but when I try it with sed and try to extract only the characters within the tags I am totally lost regarding the result :).
Here's my command:
sed -r 's/(.*)<my_tag>([A-Z][A-Z])<\/my_tag>(.*)/hello\2/g' myfile.out
This replaces the entire file with just "helloAB", for example, while the expected output should contain at least 100+ matches.
So I'm thinking about concepts like greedy matching, but not getting anywhere. Maybe awk is a better idea?
If you have Python (2.6+), this should be fairly trivial:
import xml.dom.minidom as MD

tree = MD.parse("yourfile.xml")
# Print every <AB> element found anywhere in the document
for e in tree.getElementsByTagName("AB"):
    print(e.toprettyxml())
In general, trying to parse XML by hand should be avoided as there are much simpler solutions like these. Not to mention, these kinds of libraries will give you easy access to attributes and values without further parsing.
Thank you for your answers.
I tried @MannyD's suggestion, but unfortunately the XML wasn't well formed, so the parsing failed. Since I can't count on getting only well-formed XML, I went with a grep solution, which does the job.
grep -o "<my_tag>[A-Z][A-Z]</my_tag>" myfile.out | sort -u
The -o flag prints each match on its own line; from there I just sort and print the unique matches from the file.
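For comparison, the same two behaviours can be reproduced in Python: the greedy substitution that collapses the whole line, and the match-by-match extraction that grep -o does. The input line below is a tiny made-up stand-in for the real 500,000-character file.

```python
import re

# Hypothetical one-line input, standing in for the real file.
line = "<r><my_tag>AB</my_tag><x/><my_tag>CD</my_tag><my_tag>AB</my_tag></r>"

# Same shape as the sed command: the greedy (.*) groups swallow everything
# around a single tag, so the whole line collapses to one replacement.
greedy_result = re.sub(
    r"(.*)<my_tag>([A-Z][A-Z])</my_tag>(.*)", r"hello\2", line
)

# Rough equivalent of `grep -o ... | sort -u`: collect every match,
# then de-duplicate and sort.
unique_tags = sorted(set(re.findall(r"<my_tag>[A-Z][A-Z]</my_tag>", line)))

print(greedy_result)
for tag in unique_tags:
    print(tag)
```

Because the pattern without the surrounding (.*) groups matches only the tag itself, every occurrence is found instead of just one.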

Replace exact part of text in a string (fstab) using sed

I'm in the process of migrating some data between 2 servers. The data is held in the same folder structure on each server.
Once the data has been moved I want to update the fstab file on all of the affected Linux machines. I have a bash script that rsyncs the data between the servers and then logs on to each machine in a list and updates the fstab with the new IP address using sed.
sed "s/\(172.16.0.30\)\(.*\)\(${share}\)\(.*\)/172.16.0.35\2\3\4/"
This has worked absolutely fine in the past, however this time I'm migrating a folder which has a name very similar to a few others, let's say $share is 'home':
home
home-old
home-ancient
The problem I'm having is that this regex is picking up all of the shares with the text contained in $share and not just the one I want.
Is there a way to adjust the regex so that it will only replace the IP on the single line that I want? I've looked at the \b word-boundary anchor but can't seem to get it to work; unfortunately, regular expressions usually confuse me!
\b is a GNU extension, and it won't help in this case: it matches a word boundary, and both the space and - are non-word characters, so there is a boundary after home in all three names and all of them will match. One simple option is to require a space (or end of line) after $share, like:
sed "s/\(172.16.0.30\)\(.*\)\(${share}\)\( \(.*\)\|$\)/172.16.0.35\2\3\4/"
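The same "share name followed by whitespace or end of line" rule can be tried out in Python before touching a real fstab. This is only a sketch: update_line is a made-up helper, and the NFS-style fstab lines are assumed examples, since the question doesn't show the actual entries.

```python
import re

OLD_IP = "172.16.0.30"
NEW_IP = "172.16.0.35"

def update_line(line, share):
    """Swap the IP only when `share` is followed by whitespace or end of line."""
    # re.escape keeps the dots in the IP (and any odd characters in the
    # share name) literal instead of acting as regex metacharacters.
    pattern = re.compile(
        rf"{re.escape(OLD_IP)}(.*{re.escape(share)})(\s.*|$)"
    )
    return pattern.sub(rf"{NEW_IP}\g<1>\g<2>", line)

# Hypothetical fstab entries, for illustration only.
lines = [
    "172.16.0.30:/export/home /mnt/home nfs defaults 0 0",
    "172.16.0.30:/export/home-old /mnt/home-old nfs defaults 0 0",
    "172.16.0.30:/export/home-ancient /mnt/home-ancient nfs defaults 0 0",
]

updated = [update_line(l, "home") for l in lines]
for l in updated:
    print(l)
```

Only the entry for home itself gets the new address; home-old and home-ancient are left alone because home is followed by a hyphen there.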