using find and rename for their intended use - regex

Now before you facepalm and vote to close as a duplicate or the like, read on: this question is both theoretical and practical.
From the title it is pretty obvious what I am trying to do: find some files, then rename them. The problem is that there are so many ways to do this that I finally decided to pick one and try to figure it out, theoretically.
Let me set the stage:
Let's say I have 100 files, all named like Image_200x200_nnn_AlphaChars.jpg, where nnn is an incrementing number and AlphaChars is some descriptive text, e.g.:
Image_200x200_001_BlueHat.jpg
Image_200x200_002_RedHat.jpg
...
Image_200x200_100_MyCat.jpg
Enter find. With a simple one-liner I can find all the image files in this directory (I'm not sure how to do this case-insensitively):
find . -type f -name "*.jpg"
Enter rename. On its own, rename expects the following:
rename <search> <replace> <haystack>
When I try to combine the two with -print0 and xargs and some regular expressions, I get stuck, and I am almost sure it's because rename is looking for the haystack or the search part... (Please do explain what happens after the pipe, if you understand it.)
find . -type f -name "*.jpg" -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
So the goal is to get find to hand rename the original image name, and to replace everything before the last underscore with img.
Yes, I know that duplicates will cause a problem; yes, I know that spaces in the names will also make my life hell; and don't even start on subdirectories and the like. To keep it simple, we are talking about a single directory, and all filenames are unique and without special characters.
I need to understand the fundamental basics, before getting to the hardcore stuff. Anybody out there feel like helping?

Another approach is to avoid using rename -- bash is capable enough:
find ... -print0 | while read -r -d '' filename; do
    mv "$filename" "img_${filename##*_}"
done
The ${filename##*_} expansion removes all leading characters up to and including the last underscore from the value.
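For completeness, here is a minimal runnable sketch with the find invocation filled in, assuming the single flat directory the question stipulates:
find . -maxdepth 1 -type f -name '*.jpg' -print0 |
while IFS= read -r -d '' filename; do
    # strip everything up to and including the last underscore, then prefix with img_
    mv -- "$filename" "img_${filename##*_}"
done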

If you don't need -print0 (i.e. you are sure your filenames contain no whitespace, quotes, or newlines, all of which plain xargs would mangle), you can just do:
find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
Which works for me:
~/tmp$ touch Image_200x200_001_BlueHat.jpg
~/tmp$ touch Image_200x200_002_RedHat.jpg
~/tmp$ touch Image_200x200_100_MyCat.jpg
~/tmp$ find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
~/tmp$ ls
img_BlueHat.jpg img_MyCat.jpg img_RedHat.jpg
What's happening after the pipe is that xargs is parsing the output of find and passing that in reasonable chunks to a rename command, which is executing a regex on the filename and renaming the file to the result.
update: I didn't try your version with the null-terminators at first, but it also works for me. Perhaps you tested with a different regex?

What's happening after the pipe:
find ... -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
xargs is reading the filenames produced by the find command, and executing the rename command repeatedly, appending a few filenames at a time. The net effect will be something like:
rename '...' file001 file002 file003 file004 file005 file006 file007 file008 file009 file010
rename '...' file011 file012 file013 file014 file015 file016 file017 file018 file019 file020
rename '...' file021 file022 file023 file024 file025 file026 file027 file028 file029 file030
...
rename '...' file091 file092 file093 file094 file095 file096 file097 file098 file099 file100
The find -print0 | xargs -0 is a handy combination for more safely handling files that may contain whitespace.
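If you want to preview exactly what xargs will run, a dry run with echo and an artificially small chunk size makes the batching visible (my suggestion, not part of the original answer):
find . -type f -name "*.jpg" -print0 | xargs -0 -n 10 echo rename 's/Image_200x200_(\d{3})/img/'
The -n 10 caps each invocation at ten filenames, and the leading echo prints the command lines instead of executing them.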

Related

For the love of BASH, regex, locate & find - contains A not B

Goal: Regex pattern for use with find and locate that "Contains A but not B"
So I have a bash script that manipulates a few video files.
In its current form, I create a variable to act on later with a for loop that works well:
if [ "$USE_FIND" = true ]; then
vid_files=$(find "${DIR}" -type f -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)")
else
vid_files=$(locate -ir "${DIR}.*\.\(mkv\|avi\|ts\|mp4\|m2ts\)")
fi
So "contains A" is any one of the listed extensions.
I'd like to add a condition where, if a certain string (B) is contained in the path, the file isn't added to the array (B can appear in a directory name or a filename).
I've spent some time with lookaheads trying to implement this, to no avail. So, taking "Robot" as an example of "not contains B", I've used different forms of .*(?!Robot).*,
e.g. ".*\(\?\!Robot\).*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" for find, but it doesn't work.
I've pretty much exhausted regex101.com, the terminal, and chmod +x at this point and would welcome some help. I suspect it's the fact that it's called through a bash script that is causing me the difficulty.
One of my many sources of reference in trying to sort this:
Ref: Is there a regex to match a string that contains A but does not contain B
(As an aside, find's default regex dialect doesn't support Perl-style lookaheads, which is why the .*(?!Robot).* attempts fail.) You may want to avoid using find inside a command substitution to build a list of files because, while this is admittedly rare, filenames can contain newlines.
You could use an array, which will handle file names without issues (assuming the array is later expanded properly).
declare -a vid_files=()
while IFS= read -r -d '' file
do
    [[ "$file" =~ Robot ]] && continue   # skip any path containing "Robot"
    vid_files+=("$file")
done < <(find "${DIR}" -type f -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" -print0)
The -print0 option of find generates a null byte to separate the file names, and the -d '' option of read allows a null byte to be used as a record separator (both obviously go together).
You can get the list of files using "${vid_files[@]}" (the double quotes are important to prevent word splitting). You can also iterate over the list easily:
for file in "${vid_files[@]}"
do
echo "$file"
done
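As an aside (my own suggestion, not something the answer relies on), find can also perform the exclusion itself with its ! operator, which saves the bash-side test:
find "${DIR}" -type f -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" ! -path '*Robot*' -print0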

Bash: Batch Rename Appending File Extension

I have a bunch of temperature logger data files in .csv format. The proprietary temp-logger software saves them with weird useless names. I want to name the files by their serial numbers (S/N). The S/N can be found in each of the files (in several places).
So, I need to extract the S/N and change the name of the file to {S/N}.csv.
I'm almost there, but can't figure out how to get the ".csv" file extension onto the end.
Here's my code:
for i in *.csv; do grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1 | xargs mv "$i" ; done
Note the "cut" and "head" commands are necessary to get just the S/N number from the regular expression return, and to take only one (the S/N is listed several times in the file).
If anyone has a more elegant solution, I'd love to see it. All I really need though is to get that ".csv" onto the end of my new file names.
Thanks!
You can do it with xargs, but it's simpler to skip it and call mv directly. (You're only renaming one file per call to xargs anyway.)
for i in *.csv; do
    ser_num=$(grep -Eo "S\/N\: [0-9]+" "$i" | cut -c 6- | head -1)
    mv "$i" "$ser_num.csv"
done
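One caveat, and this is my addition rather than part of the original answer: if a file happens to contain no S/N line, $ser_num ends up empty and the mv would rename the file to ".csv". A small guard avoids that:
for i in *.csv; do
    ser_num=$(grep -Eo "S/N: [0-9]+" "$i" | cut -c 6- | head -1)
    [ -n "$ser_num" ] || continue   # skip files where no serial number was found
    mv "$i" "$ser_num.csv"
done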

single repeating command with input and output files

I have been trying to learn how to run a single command multiple times from the command line. Although I have learned how to do this for a command with no input and output files, it gets more complicated when the command needs them.
The cp command requires these, so let's use it as an example. I look for all images with the .png extension and copy them. The way I came up with after using Google is:
find -regex ".*\.\(png\)" -exec cp {} {}3 \;
The only problem with that is that I have to rename the file with some character after the name, so it gets renamed to something like file.png3 instead of file.png. I can't figure out how to do it differently, as I can't put the new characters before the name; that doesn't seem to work.
Is there a better way to do this or am I going about it completely the wrong way?
I'm not sure how you might do that in a single find command, but you could split it out. First, find the files with find. Then use sed to strip the .png extension. Finally, use xargs to run the copy on each file. Like this:
find -regex ".*\.\(png\)" | sed 's/\.png$//' | xargs -I {} cp {}.png {}_copy.png
If you didn't know, the pipe "|" will send the output of one program into the next.
Alternatively, you could just modify the beginning of the filename (so 3img.png instead of img.png3) or copy to a new folder.
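If you do want to stay inside a single find command after all, here's a sketch of one way (my own suggestion, under the same assumptions): hand the matches to a small inline shell, which can put the new part before the basename rather than after it:
find . -regex ".*\.\(png\)" -exec sh -c '
    for f do
        # copy each match alongside the original, prefixed with copy_
        cp "$f" "$(dirname "$f")/copy_$(basename "$f")"
    done' sh {} +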

Disk usage of files whose names match a regex, in Linux?

So, in many situations I wanted a way to know how much of my disk space is used by what, so I know what to get rid of, convert to another format, store elsewhere (such as data DVDs), move to another partition, etc. In this case I'm looking at a Windows partition from a SliTaz Linux bootable media.
In most cases, what I want is the size of files and folders, and for that I use the ncurses-based ncdu.
But in this case, I want a way to get the size of all files matching a regex. An example regex for .bak files:
.*\.bak$
How do I get that information, considering a standard Linux with core GNU utilities or BusyBox?
Edit: The output is intended to be parseable by a script.
I suggest something like: find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
Some notes:
The -print0 option for find and --files0-from for du are there to avoid issues with whitespace in file names
The regular expression is matched against the whole path, e.g. ./dir1/subdir2/file.bak, not just file.bak, so if you modify it, take that into account
I used the -h flag for du to produce "human-readable" output, but if you want to parse the output you may be better off with -k (always use kilobytes); see the parseable variant sketched after these notes
If you remove the tail command, you will additionally see the sizes of particular files and directories
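For example, a script-friendly variant (my sketch, assuming GNU du) that prints just the total in kilobytes:
find . -regex '.*\.bak' -print0 | du --files0-from=- -ck | tail -1 | cut -f1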
Sidenote: a nice GUI tool for finding out who ate your disk space is FileLight. It doesn't do regexes, but is very handy for finding big directories or files clogging your disk.
du is my favorite answer. If you have a fixed filesystem structure, you can use:
du -hc *.bak
If you need to add subdirs, just add:
du -hc *.bak **/*.bak **/**/*.bak
etc etc
However, this isn't a very useful command, so using your find:
TOTAL=0; for I in $(find . -name '*.bak'); do TOTAL=$((TOTAL + $(du -k "$I" | awk '{print $1}'))); done; echo $TOTAL
That will echo the total size of all the files found, in kilobytes rather than bytes (du -k reports 1 KiB blocks). Note that the unquoted $(find ...) still breaks on filenames containing whitespace.
Hope that helps.
Run this in bash to declare a function that calculates the sum of the sizes of all the files matching a regex pattern in the current directory (the $'\n' quoting is a bashism, so a plain Bourne shell won't do):
sizeofregex() {
    IFS=$'\n'
    for x in $(find . -regex "$1" 2> /dev/null); do
        du -sk "$x" | cut -f1
    done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'
    unset IFS
}
(Alternatively, you can put it in a script.)
Usage:
cd /where/to/look
sizeofregex 'myregex'
The result will be a number (in KiB), including 0 (if there are no files that match your regex).
If you do not want it to look in other filesystems (say you want to look for all .so files under /, which is a mount of /dev/sda1, but not under /home, which is a mount of /dev/sdb1), add the -xdev option to find in the function above.
The previous solutions didn't work properly for me (I had trouble piping du) but the following worked great:
find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1
The -iregex test matches a case-insensitive regular expression; use -regex if you want it case-sensitive.
If you aren't comfortable with regular expressions, you can use the -iname or -name tests (the former being case-insensitive):
find path/to/directory -iname "*.bak" -exec du -csh '{}' + | tail -1
In case you want the size of every match (rather than just the combined total), simply leave out the piped tail command:
find path/to/directory -iname "*.bak" -exec du -csh '{}' +
These approaches avoid the subdirectory problem in @MaddHackers' answer.
Hope this helps others in the same situation (in my case, finding the size of all DLL's in a .NET solution).
If you're OK with glob-patterns and you're only interested in the current directory:
stat -c "%s" *.bak | awk '{sum += $1} END {print sum}'
or
sum=0
while read size; do (( sum += size )); done < <(stat -c "%s" *.bak)
echo $sum
The %s directive to stat gives bytes not kilobytes.
If you want to descend into subdirectories, with bash version 4, you can shopt -s globstar and use the pattern **/*.bak
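Spelled out, that recursive variant looks like this (bash 4+ only):
shopt -s globstar
stat -c "%s" **/*.bak | awk '{sum += $1} END {print sum}'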
The accepted reply suggests using
find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
but that doesn't work on my system, as my du doesn't know the --files0-from option. Only GNU du has that option; it's neither part of the POSIX standard (so you won't find it on FreeBSD or macOS), nor will you find it on BusyBox-based Linux systems (e.g. most embedded Linux systems) or on any other Linux system that doesn't use the GNU version of du.
Then there's a reply suggesting to use:
find path/to/directory -iregex '.*\.bak$' -exec du -csh '{}' + | tail -1
This solution works as long as there aren't too many files found. The + means that find will call du with as many hits as possible in a single invocation. However, there is a maximum number of arguments (N) a system supports, and if there are more hits than that, find will call du multiple times, splitting the hits into groups of at most N items each; in that case du prints several totals, and the final tail -1 only shows the total of the last du call, so the result will be wrong.
Finally, there is an answer using stat and awk, which is a nice way to do it, but it relies on shell globbing in a way that only bash 4.x or later supports; it will not work with older versions, and whether it works with other shells is unpredictable.
A more portable solution (it works on macOS and the BSD variants) that doesn't suffer from either limitation and will surely work with every shell is:
find . -regex '.*\.bak' -exec stat -f "%z" {} \; | awk '{s += $1} END {print s}'
Note that stat itself is not specified by POSIX, and the -f "%z" syntax shown here is the BSD/macOS form.
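On a GNU/Linux system, the equivalent uses GNU stat's -c syntax instead:
find . -regex '.*\.bak' -exec stat -c "%s" {} \; | awk '{s += $1} END {print s}'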

find and replace within file

I have a requirement to search for a pattern which is something like :
timeouts = {default = 3.0; };
and replace it with
timeouts = {default = 3000.0;.... };
i.e. multiply the timeout by a factor of 1000.
Is there any way to do this for all files in a directory?
EDIT:
Please note that some of the files in the directory are symlinks. Is there any way to get this done for the symlinks as well?
Please note that "timeouts" also exists as a substring elsewhere in the files, so I want to make sure that only this line gets replaced. Any solution using sed, awk, or perl is acceptable.
Give this a try:
for f in *
do
    sed -i 's/\(timeouts = {default = [0-9]\+\)\(\.[0-9]\+;\)\( };\)/\1000\2....\3/' "$f"
done
It will make the replacements in place for each file in the current directory. Some versions of sed require a backup extension after the -i option. You can supply one like this:
sed -i .bak ...
Some versions don't support in-place editing. You can do this:
sed '...' "$f" > tmpfile && mv tmpfile "$f"
Note that this is obviously not actually multiplying by 1000, so if the number is 3.1 it would become "3000.1" instead of 3100.0.
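To check the substitution on a sample line (a quick demo of my own, not part of the original answer), it should produce exactly the requested output:
$ echo 'timeouts = {default = 3.0; };' | sed 's/\(timeouts = {default = [0-9]\+\)\(\.[0-9]\+;\)\( };\)/\1000\2....\3/'
timeouts = {default = 3000.0;.... };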
You can do this:
perl -pi -e 's/(timeouts\s*=\s*\{default\s*=\s*)([0-9.-]+)/$1 . $2 * 1000/e' *
One suggestion for whichever solution above you decide to use - it may be worth it to think through how you could refactor to avoid having to modify all of these files for a change like this again.
Do all of these scripts have similar functionality?
Can you create a module that they would all use for shared subroutines?
In the module, could you have a single line that would allow you to have a multiplier?
For me, anytime I need to make similar changes in more than one file, it's the perfect time to be lazy to save myself time and maintenance issues later.
$ perl -pi.bak -e 's/\w+\s*=\s*{\s*\w+\s*=\s*\K(-?[0-9.]+)/sprintf "%0.1f", 1000 * $1/eg' *
Notes:
The regex matches just the number (see \K in perlre)
The /e means the replacement is evaluated
I include a sprintf in the replacement just in case you need finer control over the formatting
Perl's -i can operate on a bunch of files
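A quick sanity check of that one-liner on a sample line (dropping -i so nothing is modified):
$ echo 'timeouts = {default = 3.0; };' | perl -pe 's/\w+\s*=\s*{\s*\w+\s*=\s*\K(-?[0-9.]+)/sprintf "%0.1f", 1000 * $1/eg'
timeouts = {default = 3000.0; };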
EDIT
It has been pointed out that some of the files are symbolic links. Given that this process is not idempotent (running it twice on the same file multiplies the value again), you had better generate a unique list of files, in case one of the links points to a file that also appears elsewhere in the list. Here is an example with find, though the code for a pre-existing list should be obvious:
$ find -L . -type f -exec realpath {} \; | sort -u | xargs -d '\n' perl ...
(Assumes none of your filenames contain a newline!)
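With GNU coreutils you can make the same pipeline NUL-safe (a sketch on my part; realpath -z, sort -z, and xargs -0 are GNU extensions):
$ find -L . -type f -exec realpath -z {} + | sort -zu | xargs -0 perl ...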