unix find filenames that are lexicographically less than a given filename - regex

I have a list of files in a directory that are automatically generated by a system with the date in the filename. Some examples are: audit_20111020, audit_20111021, audit_20111022, etc.
I want to clean up files older than 18 months, so I want to put together a unix find command that will find files whose names sort before audit_20100501 and delete them.
Does anyone know how to use lexicographical order as a criterion in the find command?

Another Perl variant:
perl -E'while(<audit_*>) { say if /(\d{8})/ && $1 < 20100501}'
Replace say with unlink if it prints the expected filenames.
Note: < performs numerical comparison; use lt if you want string comparison.
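For comparison, bash's [[ ]] test can do the lexicographic (string) version of this directly; a minimal sketch that only prints what it would delete (drop the echo once the output looks right):
for f in audit_*; do
    [[ "$f" < "audit_20100501" ]] && echo rm "$f"
done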

With Perl it's easy. Type perl and:
for (glob "*")
{
    my($n) = /(\d+)/;
    unlink if ($n < 20100501);
}
^D
Test before using. Note that I'm assuming the filenames have a fixed format and that the directory only contains these files.

It is possible to sort find's result using the sort command:
find . -name "audit*" | sort -n
... then find a way to split this list.
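One way to do that splitting, since the date is the last underscore-separated field, is to let awk compare it; a sketch that only prints the candidates so you can review them before piping to xargs rm (this assumes no whitespace in the paths):
find . -name 'audit_*' | awk -F_ '$NF < 20100501'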
But for what you want to do, i.e. delete files older than a certain date (18 months is ~547 days), you could use this instead:
find . -ctime +547 -type f | xargs -I{} rm -rf {}
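A variant that skips xargs entirely and is safe for any filename (both GNU and BSD find support -exec ... +):
find . -type f -ctime +547 -exec rm -f {} +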

Related

Bash: find and concatenate filenames with two digits

I'm trying to find all instances of csv files in a set of directories and concatenate them into one csv file.
The catch is that the directories are numbered. I only want directories that end in two digits. For example, I want directories RUN11, RUN12, etc, but not RUN1, RUN2.
If I didn't care about having two-digit numbers, I'd do this (from here)
find $(pwd)/RUN* -name '*csv' |xargs cat > big_cat_file.csv
I tried this:
find $(pwd)/RUN[!0-9]{2} -name '*csv' |xargs cat > big_cat_file.csv
But it says no such file or directory.
How can I grab csv files from directories with names like RUN11, RUN12, but not RUN1, RUN2?
You are trying to use regular expression syntax where you need to use a glob.
You just need to specify the range twice, rather than using {2}:
find "$PWD"/RUN[0-9][0-9] -name '*csv' |xargs cat > big_cat_file.csv
(Note that [!0-9] matches any single character except a digit.)
To accommodate any legal filename that might match *csv, you should use the -exec primary instead of xargs. (Consider what would happen if a file name contains whitespace, or in the worst case, a newline.)
find "$PWD"/RUN[0-9][0-9] -name '*csv' -exec cat {} + > big_cat_file.csv
This not only works with any valid file name, but minimizes the number of calls to cat that are required.
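If you prefer the xargs pipeline, it can be made safe for arbitrary filenames by using NUL delimiters (assuming GNU find and xargs):
find "$PWD"/RUN[0-9][0-9] -name '*csv' -print0 | xargs -0 cat > big_cat_file.csv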

Extract unique lines from files (with a pattern) recursively from directory/subdirectories

I have a huge java codebase (more than 10,000 java classes) that makes extensive use of CORBA (no documentation available on its usage though).
As first step to figure out the CORBA usage, I decided to scan entire codebase and extract/print unique lines which contain the pattern "org.omg.CORBA". These are usually in the import statements (e.g. import org.omg.CORBA.x.y.z).
I am a newbie to Perl and want to know if there is a way I can extract these details on Windows. I need to be able to scan all folders (and sub-folders) that contain java classes.
You can use File::Find in a one-liner:
perl -MFile::Find -lwe "
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } },'.');
while(<>) { /org.omg.CORBA/ && $seen{$_}++; };
print for keys %seen;"
Note that this one-liner is using the double quotes required for Windows.
This will search the current directory recursively for files with extension .java and add them to the @ARGV array. Then we use the diamond operator to open the files and search for the string org.omg.CORBA, and if it is found, that line is added as a key to the %seen hash, which effectively removes duplicates. The last statement prints out all the unique keys in the hash.
In script form it looks like this:
use strict;
use warnings;
use File::Find;
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } },'.');
my %seen;
while(<>) {
/org.omg.CORBA/ && $seen{$_}++;
}
print "$_\n" for keys %seen;"
Just for fun, a perl one-liner to do this:
perl -lne '/org.omg.CORBA/ and (++$seen{$_}>1 or print)' *
This first checks if a line matches and then if it has not seen it before prints out the line. That is done for all files specified (in this case '*').
I don't mean to be contrarian, but I'm not sure Perl is the best solution here. nhahtdh's suggestion of using Cygwin is a good one: grep or find is really what you want. Using Perl in this instance would involve File::Find and then opening a filehandle on every file. That's certainly doable, but, if possible, I'd suggest using the right tool for the job.
find . -name "*.java" -type f | xargs grep -h 'org.omg.CORBA' | sort | uniq
If you really must use Perl for this job, we can work up the File::Find code.
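If GNU grep is available (it is under Cygwin), it can also do the recursion and the filename filtering on its own; a sketch:
grep -rh --include='*.java' 'org\.omg\.CORBA' . | sort -u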

Disk usage of files whose names match a regex, in Linux?

So, in many situations I wanted a way to know how much of my disk space is used by what, so I know what to get rid of, convert to another format, store elsewhere (such as data DVDs), move to another partition, etc. In this case I'm looking at a Windows partition from a SliTaz Linux bootable media.
In most cases, what I want is the size of files and folders, and for that I use the NCurses-based ncdu.
But in this case, I want a way to get the size of all files matching a regex. An example regex for .bak files:
.*\.bak$
How do I get that information, considering a standard Linux with core GNU utilities or BusyBox?
Edit: The output is intended to be parseable by a script.
I suggest something like:
find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
Some notes:
The -print0 option for find and --files0-from for du are there to avoid issues with whitespace in file names
The regular expression is matched against the whole path, e.g. ./dir1/subdir2/file.bak, not just file.bak, so if you modify it, take that into account
I used the h flag for du to produce a "human-readable" format, but if you want to parse the output, you may be better off with k (always use kilobytes); see the example after these notes
If you remove the tail command, you will additionally see the sizes of particular files and directories
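For instance, since the output is meant to be parsed by a script, this variant prints just the total in KiB (still assuming GNU find and du):
find . -regex '.*\.bak' -print0 | du --files0-from=- -ck | tail -1 | cut -f1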
Sidenote: a nice GUI tool for finding out who ate your disk space is FileLight. It doesn't do regexes, but is very handy for finding big directories or files clogging your disk.
du is my favorite answer. If you have a fixed filesystem structure, you can use:
du -hc *.bak
If you need to add subdirs, just add:
du -hc *.bak **/*.bak **/**/*.bak
etc etc
However, this isn't a very useful command, so using your find:
TOTAL=0;for I in $(find . -name \*.bak); do TOTAL=$((TOTAL+$(du $I | awk '{print $1}'))); done; echo $TOTAL
That will echo the total size of all the files it finds, in du's default block size (kilobytes, not bytes). Note that this simple loop will break on filenames containing whitespace.
Hope that helps.
Run this in bash to declare a function that calculates the sum of sizes of all the files matching a regex pattern in the current directory (the $'\n' assignment to IFS is a bash/ksh feature, not plain Bourne shell):
sizeofregex() { IFS=$'\n'; for x in $(find . -regex "$1" 2> /dev/null); do du -sk "$x" | cut -f1; done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'; unset IFS; }
(Alternatively, you can put it in a script.)
Usage:
cd /where/to/look
sizeofregex 'myregex'
The result will be a number (in KiB), including 0 (if there are no files that match your regex).
If you do not want it to look in other filesystems (say you want to look for all .so files under /, which is a mount of /dev/sda1, but not under /home, which is a mount of /dev/sdb1), add a -xdev parameter to the find call in the function above.
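For example, with -xdev added, the find invocation inside the function becomes:
find . -xdev -regex "$1" 2> /dev/null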
The previous solutions didn't work properly for me (I had trouble piping du) but the following worked great:
find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1
The -iregex option takes a case-insensitive regular expression; use -regex if you want it to be case-sensitive.
If you aren't comfortable with regular expressions, you can use the -iname or -name tests (the former being case-insensitive):
find path/to/directory -iname "*.bak" -exec du -csh '{}' + | tail -1
In case you want the size of every match (rather than just the combined total), simply leave out the piped tail command:
find path/to/directory -iname "*.bak" -exec du -csh '{}' +
These approaches avoid the subdirectory problem in @MaddHackers' answer.
Hope this helps others in the same situation (in my case, finding the size of all DLL's in a .NET solution).
If you're OK with glob-patterns and you're only interested in the current directory:
stat -c "%s" *.bak | awk '{sum += $1} END {print sum}'
or
sum=0
while read size; do (( sum += size )); done < <(stat -c "%s" *.bak)
echo $sum
The %s directive to stat gives bytes, not kilobytes.
If you want to descend into subdirectories, with bash version 4, you can shopt -s globstar and use the pattern **/*.bak
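Put together, a recursive version of the same idea (bash 4+ with globstar, GNU stat assumed):
shopt -s globstar
stat -c "%s" **/*.bak | awk '{sum += $1} END {print sum}'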
The accepted reply suggests using
find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
but that doesn't work on my system, as my du doesn't know the --files0-from option. Only GNU du has that option; it's not part of the POSIX standard (so you won't find it in FreeBSD or macOS), nor will you find it on BusyBox-based Linux systems (e.g. most embedded Linux systems) or any other Linux system that does not use the GNU du version.
Then there's a reply suggesting to use:
find path/to/directory -iregex '.*\.bak$' -exec du -csh '{}' + | tail -1
This solution works as long as there aren't too many files found: + means that find will try to call du with as many hits as possible in a single call. However, there is a maximum number of arguments (N) a system supports, and if there are more hits than that, find will call du multiple times, splitting the hits into groups of at most N items each; in that case the result will be wrong and only show the size of the last du call.
Finally there is an answer using stat and awk, which is a nice way to do it, but it relies on shell globbing in a way that only Bash 4.x or later supports. It will not work with older versions, and whether it works with other shells is unpredictable.
A more portable solution that avoids these limitations and works with any shell (this uses the BSD/macOS stat syntax; see the GNU/Linux form below) would be:
find . -regex '.*\.bak' -exec stat -f "%z" {} \; | awk '{s += $1} END {print s}'
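On Linux with GNU coreutils, stat spells the format option differently (-c "%s" instead of -f "%z"), so the equivalent there would be:
find . -regex '.*\.bak' -exec stat -c "%s" {} \; | awk '{s += $1} END {print s}'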

Bash go through list of dirs and generate md5

What would be the bash script that:
Goes through a directory, and puts all the sub-directories in an array
For each dir, generate an md5 sum of a file inside that dir
Also, the file whose md5sum has to be generated doesn't always have the same name and path. However, the pattern is always the same:
/var/mobile/Applications/{ the dir name here is taken from the array }/{some name}.app/{ binary, whose name is the same as its parent dir, but without the .app extension }
I've never worked with bash before (and have never needed to) so this may be something really simple and nooby. Anybody got an idea? As can be seen by the path, this is designed to be run on an iDevice.
for dir in /var/mobile/Applications/*; do
    for app in "$dir"/*.app; do
        appdirname=${app##*/}
        appname=${appdirname%.app}
        binary="$app/$appname"
        if [ -f "$binary" ]; then
            echo "I: dir=$dir appname=$appname binary=$binary"
            # replace the echo with your md5 command, e.g. md5sum "$binary"
        fi
    done
done
Try this; I hope the code is straightforward. The two things worth explaining (illustrated with a small example below) are:
${app##*/}, which uses the ## operator to strip off the longest prefix matching the expression */.
${appdirname%.app}, which uses the % operator to strip off the shortest suffix matching the expression .app. (You could have also used %% (strip longest suffix) instead of %, since the pattern .app is always four characters long.)
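A quick illustration of those two expansions with a hypothetical path:
app=/var/mobile/Applications/ABC123/Example.app
appdirname=${app##*/}        # strips everything up to the last slash -> Example.app
appname=${appdirname%.app}   # strips the .app suffix -> Example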
Try something like:
ls -1 /Applications/*/Contents/Info.plist | while read name; do md5 -r "$name"; done
the above will show md5 checksum for all Info.plist files for all applications, like:
d3bde2b76489e1ac081b68bbf18a7c29 /Applications/Address Book.app/Contents/Info.plist
6a093349355d20d4af85460340bc72b2 /Applications/Automator.app/Contents/Info.plist
f1c120d6ccc0426a1d3be16c81639ecb /Applications/Calculator.app/Contents/Info.plist
Bash is very easy, but you need to know the CLI tools of your system.
To print the md5 hash of all files in a directory recursively:
find /yourdirectory/ -type f | xargs md5sum
If you only want to list the tree of directories:
find /tmp/ -type d
You can generate a list with:
MYLIST=$( find /tmp/ -type d )
Use "for" for iterate the list:
for i in $MYLIST; do
echo $i;
done
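If you specifically want the directories in a bash array (as the question asks) rather than a whitespace-split string, a minimal sketch:
dirs=( /var/mobile/Applications/*/ )   # one element per subdirectory
for d in "${dirs[@]}"; do
    echo "$d"
done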
If you are a newbie in bash:
http://tldp.org/LDP/Bash-Beginners-Guide/html/
http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html

find and replace within file

I have a requirement to search for a pattern which is something like :
timeouts = {default = 3.0; };
and replace it with
timeouts = {default = 3000.0;.... };
i.e multiply the timeout by factor of 1000.
Is there any way to do this for all files in a directory?
EDIT :
Please note that some of the files in the directory are symlinks. Is there any way to get this done for symlinks also?
Please note that timeouts also exists as a substring elsewhere in the files, so I want to make sure that only this line gets replaced. Any solution using sed, awk, or perl is acceptable.
Give this a try:
for f in *
do
sed -i 's/\(timeouts = {default = [0-9]\+\)\(\.[0-9]\+;\)\( };\)/\1000\2....\3/' "$f"
done
It will make the replacements in place for each file in the current directory. Some versions of sed require a backup extension after the -i option. You can supply one like this:
sed -i .bak ...
Some versions don't support in-place editing. You can do this:
sed '...' "$f" > tmpfile && mv tmpfile "$f"
Note that this is obviously not actually multiplying by 1000, so if the number is 3.1 it would become "3000.1" instead of 3100.0.
You can do this:
perl -pi -e 's/(timeouts\s*=\s*\{default\s*=\s*)([0-9.-]+)/$1 . $2*1000/e' *
One suggestion for whichever solution above you decide to use - it may be worth it to think through how you could refactor to avoid having to modify all of these files for a change like this again.
Do all of these scripts have similar functionality?
Can you create a module that they would all use for shared subroutines?
In the module, could you have a single line that would allow you to have a multiplier?
For me, any time I need to make similar changes in more than one file, it's the perfect opportunity to be lazy and save myself time and maintenance issues later.
$ perl -pi.bak -e 's/\w+\s*=\s*{\s*\w+\s*=\s*\K(-?[0-9.]+)/sprintf "%0.1f", 1000 * $1/eg' *
Notes:
The regex matches just the number (see \K in perlre)
The /e means the replacement is evaluated
I include a sprintf in the replacement just in case you need finer control over the formatting
Perl's -i can operate on a bunch of files
EDIT
It has been pointed out that some of the files are symbolic links. Given that this process is not idempotent (running it twice on the same file is bad), you had better generate a unique list of files, in case one of the links points to a file that appears elsewhere in the list. Here is an example with find, though the code for a pre-existing list should be obvious.
$ find -L . -type f -exec realpath {} \; | sort -u | xargs -d '\n' perl ...
(Assumes none of your filenames contain a newline!)