Disk usage of files whose names match a regex, in Linux? - regex

So, in many situations I have wanted a way to know how much of my disk space is used by what, so I know what to get rid of, convert to another format, store elsewhere (such as data DVDs), move to another partition, etc. In this case I'm looking at a Windows partition from a SliTaz Linux bootable medium.
In most cases, what I want is the size of files and folders, and for that I use the NCurses-based ncdu.
But in this case, I want a way to get the size of all files matching a regex. An example regex for .bak files:
.*\.bak$
How do I get that information, considering a standard Linux with core GNU utilities or BusyBox?
Edit: The output is intended to be parseable by a script.

I suggest something like: find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
Some notes:
The -print0 option for find and --files0-from for du are there to avoid issues with whitespace in file names
The regular expression is matched against the whole path, e.g. ./dir1/subdir2/file.bak, not just file.bak, so if you modify it, take that into account
I used the -h flag for du to produce a "human-readable" format, but if you want to parse the output you may be better off with -k (always use kilobytes); see the variant sketched after these notes
If you remove the tail command, you will additionally see the sizes of particular files and directories
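For example, a script-friendly variant of the same pipeline might look like this (a sketch; -k makes du report plain KiB and the final cut keeps just the number from the "total" line):
find . -regex '.*\.bak' -print0 | du --files0-from=- -ck | tail -1 | cut -f1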
Sidenote: a nice GUI tool for finding out who ate your disk space is FileLight. It doesn't do regexes, but is very handy for finding big directories or files clogging your disk.

du is my favorite answer. If you have a fixed filesystem structure, you can use:
du -hc *.bak
If you need to add subdirs, just add:
du -hc *.bak **/*.bak **/**/*.bak
etc etc
However, this isn't a very useful command, so using your find:
TOTAL=0;for I in $(find . -name \*.bak); do TOTAL=$((TOTAL+$(du $I | awk '{print $1}'))); done; echo $TOTAL
That will echo the total size of all of the files you find, in du's default units (1 KiB blocks with GNU du), not bytes.
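Note that the unquoted $(find ...) breaks on file names containing whitespace; a safer sketch of the same idea (bash-specific, using process substitution, with -k for explicit KiB) would be:
TOTAL=0
while IFS= read -r -d '' f; do
  size=$(du -k "$f" | cut -f1)          # size of this file in KiB
  TOTAL=$((TOTAL + size))
done < <(find . -name '*.bak' -print0)  # NUL-separated names survive spaces
echo "$TOTAL"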
Hope that helps.

Run this in bash, ksh, or zsh (the $'\n' quoting below is not available in a plain Bourne shell) to declare a function that calculates the sum of sizes of all the files matching a regex pattern in the current directory:
sizeofregex() { IFS=$'\n'; for x in $(find . -regex "$1" 2> /dev/null); do du -sk "$x" | cut -f1; done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'; unset IFS; }
(Alternatively, you can put it in a script.)
Usage:
cd /where/to/look
sizeofregex 'myregex'
The result will be a number (in KiB), including 0 (if there are no files that match your regex).
If you do not want it to look in other filesystems (say you want to look for all .so files under /, which is a mount of /dev/sda1, but not under /home, which is a mount of /dev/sdb1), add a -xdev parameter to the find call in the function above.
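For example, a variant restricted to a single filesystem might look like this (a sketch; sizeofregex_xdev is just an illustrative name and the .so pattern is only an example):
sizeofregex_xdev() { IFS=$'\n'; for x in $(find . -xdev -regex "$1" 2> /dev/null); do du -sk "$x" | cut -f1; done | awk '{s+=$1} END {print s}' | sed 's/^$/0/'; unset IFS; }
cd /
sizeofregex_xdev '.*\.so$'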

The previous solutions didn't work properly for me (I had trouble piping du) but the following worked great:
find path/to/directory -iregex ".*\.bak$" -exec du -csh '{}' + | tail -1
The -iregex option takes a case-insensitive regular expression; use -regex if you want it to be case-sensitive.
If you aren't comfortable with regular expressions, you can use the -iname or -name flags (the former being case-insensitive):
find path/to/directory -iname "*.bak" -exec du -csh '{}' + | tail -1
In case you want the size of every match (rather than just the combined total), simply leave out the piped tail command:
find path/to/directory -iname "*.bak" -exec du -csh '{}' +
These approaches avoid the subdirectory problem in @MaddHackers' answer.
Hope this helps others in the same situation (in my case, finding the size of all DLLs in a .NET solution).

If you're OK with glob-patterns and you're only interested in the current directory:
stat -c "%s" *.bak | awk '{sum += $1} END {print sum}'
or
sum=0
while read size; do (( sum += size )); done < <(stat -c "%s" *.bak)
echo $sum
The %s directive to stat gives bytes not kilobytes.
If you want to descend into subdirectories, with bash version 4, you can shopt -s globstar and use the pattern **/*.bak
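For example (bash 4 or later; sizes are in bytes, as above):
shopt -s globstar   # make ** match files in subdirectories recursively
stat -c "%s" **/*.bak | awk '{sum += $1} END {print sum}'
(If there are no matches at all, the pattern is passed to stat literally unless nullglob is also set.)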

The accepted reply suggests using
find . -regex '.*\.bak' -print0 | du --files0-from=- -ch | tail -1
but that doesn't work on my system, as my du doesn't know the --files0-from option. Only GNU du knows that option; it's not part of the POSIX standard (so you won't find it on FreeBSD or macOS), nor will you find it on BusyBox-based Linux systems (e.g. most embedded Linux systems) or on any other Linux system that does not use the GNU du version.
Then there's a reply suggesting to use:
find path/to/directory -iregex '.*\.bak$' -exec du -csh '{}' + | tail -1
This solution will work as long as there aren't too many files found: + means that find will pass as many hits as possible to a single du call. However, there is a maximum number of arguments (N) a system supports, and if there are more hits than that, find will call du multiple times, splitting the hits into groups of at most N items each; in that case the result will be wrong, because tail -1 only shows the total of the last du call.
Finally, there is an answer using stat and awk, which is a nice way to do it, but it relies on shell globbing, and the recursive **/*.bak form requires Bash 4.x or later; it will not work with older versions, and whether it works with other shells is unpredictable.
A solution that works with every shell and doesn't suffer from these limitations is to have find run stat once per file and sum the results with awk. Note that stat itself is not standardized and its options differ between implementations; the following is the BSD/macOS form:
find . -regex '.*\.bak' -exec stat -f "%z" {} \; | awk '{s += $1} END {print s}'
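On Linux with GNU coreutils (and on BusyBox builds whose find includes -regex), stat takes -c instead of -f, so the equivalent would be:
find . -regex '.*\.bak' -exec stat -c "%s" {} \; | awk '{s += $1} END {print s}'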

Related

Using grep for listing files by owner/read perms

The rest of my bash script works, just having trouble using grep. On each file I am using the following command:
ls -l $filepath | grep "^.r..r..r.*${2}$"
How can I properly use the second argument in the regular expression? What I am trying to do is print the file if it can be read by anyone and the owner is who is passed by the second argument.
Using:
ls -l $filepath | grep "^.r..r..r"
Will print the information successfully based on the read permissions. What I am trying to do is print based on... [read permission][any characters in between][ending with the owner's name]
The immediate problem with your attempt is the final $ which anchors the search to the end of the line, which is the end of the file name, not the owner field. A better solution would replace grep with Awk instead, which has built-in support for examining only specific fields. But actually don't use ls for this, or really in scripts at all.
Unfortunately, the stat command's options are not entirely portable, but for Linux, try
case $(stat -c %a:%u "$filepath") in
[4-7][4-7][4-7]:"$2") ls -l "$filepath";;
esac
or maybe more portably
find "$filepath" -user "$2" -perm /444 -ls
Sadly, the -perm /444 predicate is not entirely portable, either.
Paradoxically, the de facto most portable replacement for stat to get a file's permissions might actually be
perl -le '@s = stat($ARGV[0]); printf "%03o\n", $s[2] & 07777' "$filepath"
The stat call returns a list of fields; if you want the owner, too, the numeric UID is in $s[4] and getpwuid($s[4]) gets the user name.
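Putting those pieces together, a hedged sketch (not from the original answer; it assumes, as in the question, that $filepath holds the file and $2 the expected owner name) could look like:
perl -le '
    @s = stat($ARGV[0]) or exit 1;           # mode is in $s[2], numeric UID in $s[4]
    $readable = (($s[2] & 0444) == 0444);    # read bit set for owner, group, and other
    $owner_ok = (getpwuid($s[4]) eq $ARGV[1]);
    exit(($readable && $owner_ok) ? 0 : 1);
' "$filepath" "$2" && ls -l "$filepath"
This mirrors the case statement above: the listing is printed only when the file is readable by owner, group, and others, and is owned by the given user.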

One parameter for multiple patterns - grep

I'm trying to search PDF files from the terminal. My approach is to provide the search string on the command line; it can be one word, multiple words (combined with AND/OR), or an exact phrase. I would like to keep a single parameter for all search queries. I'll save the following command as a shell script and call the script via an alias from .aliases in zsh or bash.
Following from sjr's answer, here: search multiple pdf files.
I've used sjr's answer like this:
find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - |
grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \;
$1 takes path
$2 limits the number of results
$3 is the context parameter (it accepts -A, -B, or -C, either individually or jointly)
$4 takes search string
The issue I am facing is with the $4 value. As I said earlier, I want this parameter to pass my search string, which can be a phrase, one word, or multiple words in an AND/OR relation.
I am not able to get the desired results: I was not getting any results for phrase searches until I followed Robin Green's comment, but even now the phrase results are not accurate.
Edit: Text from judgments:
The original rule was that you could not claim for psychiatric injury in
negligence. There was no liability for psychiatric injury unless there was also
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried
both about fraudulent claims and that if they allowed claims, the floodgates would
open.
The claimant was 15 metres away behind a tram and did not see the accident but
later saw blood on the road. She suffered nervous shock and had a miscarriage. She
sued for negligence. The court held that it was not reasonably foreseeable that
someone so far away would suffer shock and no duty of care was owed.
White v Chief Constable of South Yorkshire [1998] The claimants were police
officers who all had some part in helping victims at Hillsborough and suffered
psychiatric injury. The House of Lords held that rescuers did not have a special
position and had to follow the normal rules for primary and secondary victims.
They were not in physical danger and not therefore primary victims. Neither could
they establish they had a close relationship with the injured so failed as
secondary victims. It is necessary to define `nervous shock' which is the rather
quaint term still sometimes used by lawyers for various kinds of
psychiatric injury...rest of para
word1 can be: shock, (nervous shock)
word2 can be: psychiatric
exact phrase: (nervous shock)
Commands
alias s='sh /path/shell/script.sh'
export p='path/pdf/files'
In terminal:
s "$p" 10 -5 "word1/|word2" #for OR search
s "$p" 10 -5 "word1.*word2.*word3" #for AND search
s "$p" 10 -5 ""exact phrase"" #for phrase search
Second Test Sample:
An example PDF file, since the command runs on PDF documents: Test-File. It's 4 pages (part of a 361-page file).
If we run the following command on it, as the solution mentions:
s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt
we'll get the relevant text and avoid going through the entire file. I thought it would be a cool way to read what we want rather than taking the traditional approach.
You need to:
pass a double-quoted command string to sh -c in order for the embedded shell-variable references to be expanded (which then requires escaping embedded " instances as \").
quote the regex with printf %q for safe inclusion in the command string - note that this requires bash, ksh, or zsh as the shell.
dir=$1
numMatches=$2
context=$3
regexQuoted=$(printf %q "$4")
find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - |
grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \;
The 3 invocation scenarios would then be:
s "$p" 10 -5 'word1|word2' #for OR search
s "$p" 10 -5 'word1.*word2.*word3' #for AND search
s "$p" 10 -5 'exact phrase' #for phrase search
Note that there's no need to escape | and no need to add an extra layer of double quotes around exact phrase.
Also note that I've replaced --line-buffered with --with-filename, as I assume that's what you meant (to have the matching lines prefixed with the PDF file path).
Note that with the above approach a shell instance must be created for every input path, which is inefficient, so consider rewriting your command as follows, which also obviates the need for printf %q (assume regex=$4):
find "${dir}" -type f -name '*.pdf' |
while IFS= read -r file; do
pdftotext "$file" - |
grep -E -m${numMatches} --with-filename --label="$file" ${context} "${regex}"
done
The above assumes that your filenames have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to solve the problem (one is sketched below).
An additional advantage of this solution is that it uses only POSIX-compliant shell features, but note that the grep command uses nonstandard options.
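If embedded newlines ever do become a concern, one way to handle them (a sketch using NUL-delimited names; read -d '' is a bash/ksh93/zsh feature, so this trades away strict POSIX compliance) would be:
find "${dir}" -type f -name '*.pdf' -print0 |
  while IFS= read -r -d '' file; do
    pdftotext "$file" - |
      grep -E -m${numMatches} --with-filename --label="$file" ${context} "${regex}"
  done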

using find and rename for their intended use

Now, before you facepalm and click on duplicate entry or the like, read on; this question is both theoretical and practical.
From the title it is pretty obvious what I am trying to do: find some files, then rename them. The problem is that there are so many ways to do this that I finally decided to pick one and try to figure it out, theoretically.
Let me set the stage:
Let's say I have 100 files, all named like Image_200x200_nnn_AlphaChars.jpg, where nnn is an incrementing number and AlphaChars is some descriptive text, i.e.:
Image_200x200_001_BlueHat.jpg
Image_200x200_002_RedHat.jpg
...
Image_200x200_100_MyCat.jpg
Enter the stage: find. Now with a simple one-liner I can find all the image files in this directory. (Not sure how to do this case-insensitively.)
find . -type f -name "*.jpg"
Enter the stage: rename. On its own, rename expects you to do the following:
rename <search> <replace> <haystack>
When I try to combine the two with -print0 and xargs and some regular expressions I get stuck, and I am almost sure it's because rename is looking for the haystack or the search part... (Please do explain if you understand what happens after the pipe)
find . -type f -name "*.jpg" -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
So the goal is to get the find to give rename the original image name, and replace everything before the last underscore with img
Yes, I know that duplicates will be a problem, and yes, I know that spaces in the names will also make my life hell, and don't even get started on subdirectories and the like. To keep it simple, we are talking about a single directory, and all filenames are unique and without special characters.
I need to understand the fundamental basics, before getting to the hardcore stuff. Anybody out there feel like helping?
Another approach is to avoid using rename -- bash is capable enough:
find ... -print0 | while read -r -d '' filename; do
mv "$filename" "img_${filename##*_}"
done
The ${filename##*_} expansion removes all leading characters up to and including the last underscore from the value.
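For example, with one of the sample names from the question:
$ filename=./Image_200x200_001_BlueHat.jpg
$ echo "img_${filename##*_}"
img_BlueHat.jpg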
If you don't need -print0 (i.e. you are sure your filenames don't contain newlines), you can just do:
find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
Which works for me:
~/tmp$ touch Image_200x200_001_BlueHat.jpg
~/tmp$ touch Image_200x200_002_RedHat.jpg
~/tmp$ touch Image_200x200_100_MyCat.jpg
~/tmp$ find . -type f -name "*.jpg" | xargs rename 's/Image_200x200_(\d{3})/img/'
~/tmp$ ls
img_BlueHat.jpg img_MyCat.jpg img_RedHat.jpg
What's happening after the pipe is that xargs is parsing the output of find and passing that in reasonable chunks to a rename command, which is executing a regex on the filename and renaming the file to the result.
update: I didn't try your version with the null-terminators at first, but it also works for me. Perhaps you tested with a different regex?
What's happening after the pipe:
find ... -print0 | xargs -0 rename "s/Image_200x200_(\d{3})/img/"
xargs is reading the filenames produced by the find command, and executing the rename command repeatedly, appending a few filenames at a time. The net effect will be something like:
rename '...' file001 file002 file003 file004 file005 file006 file007 file008 file009 file010
rename '...' file011 file012 file013 file014 file015 file016 file017 file018 file019 file020
rename '...' file021 file022 file023 file024 file025 file026 file027 file028 file029 file030
...
rename '...' file091 file092 file093 file094 file095 file096 file097 file098 file099 file100
The find -print0 | xargs -0 is a handy combination for more safely handling files that may contain whitespace.

find and replace within file

I have a requirement to search for a pattern which is something like:
timeouts = {default = 3.0; };
and replace it with
timeouts = {default = 3000.0;.... };
i.e. multiply the timeout by a factor of 1000.
Is there any way to do this for all files in a directory?
EDIT:
Please note that some of the files in the directory are symlinks. Is there any way to get this done for symlinks also?
Please note that timeouts also exists as a substring elsewhere in the files, so I want to make sure that only this line gets replaced. Any solution using sed, awk, or perl is acceptable.
Give this a try:
for f in *
do
sed -i 's/\(timeouts = {default = [0-9]\+\)\(\.[0-9]\+;\)\( };\)/\1000\2....\3/' "$f"
done
It will make the replacements in place for each file in the current directory. Some versions of sed require a backup extension after the -i option. You can supply one like this:
sed -i .bak ...
Some versions don't support in-place editing. You can do this:
sed '...' "$f" > tmpfile && mv tmpfile "$f"
Note that this is obviously not actually multiplying by 1000, so if the number is 3.1 it would become "3000.1" instead of 3100.0.
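For example, a quick check of the substitution on the line from the question (demo.conf is a hypothetical file used only for this test):
$ echo 'timeouts = {default = 3.0; };' > demo.conf
$ sed -i 's/\(timeouts = {default = [0-9]\+\)\(\.[0-9]\+;\)\( };\)/\1000\2....\3/' demo.conf
$ cat demo.conf
timeouts = {default = 3000.0;.... };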
You can do this:
perl -pi -e 's/(timeouts\s*=\s*\{default\s*=\s*)([0-9.-]+)/$1 . $2*1000/e' *
One suggestion for whichever solution above you decide to use - it may be worth it to think through how you could refactor to avoid having to modify all of these files for a change like this again.
Do all of these scripts have similar functionality?
Can you create a module that they would all use for shared subroutines?
In the module, could you have a single line that would allow you to have a multiplier?
For me, anytime I need to make similar changes in more than one file, it's the perfect time to be lazy to save myself time and maintenance issues later.
$ perl -pi.bak -e 's/\w+\s*=\s*\{\s*\w+\s*=\s*\K(-?[0-9.]+)/sprintf "%0.1f", 1000 * $1/eg' *
Notes:
The regex matches just the number (see \K in perlre)
The /e means the replacement is evaluated
I include a sprintf in the replacement just in case you need finer control over the formatting
Perl's -i can operate on a bunch of files
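For example, a quick check on the line from the question (demo.conf is a hypothetical test file; a demo.conf.bak backup is left behind by -i.bak):
$ echo 'timeouts = {default = 3.0; };' > demo.conf
$ perl -pi.bak -e 's/\w+\s*=\s*\{\s*\w+\s*=\s*\K(-?[0-9.]+)/sprintf "%0.1f", 1000 * $1/eg' demo.conf
$ cat demo.conf
timeouts = {default = 3000.0; };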
EDIT
It has been pointed out that some of the files are symbolic links. Given that this process is not idempotent (running it twice on the same file is bad), you had better generate a unique list of files in case one of the links points to a file that appears elsewhere in the list. Here is an example with find, though the code for a pre-existing list should be obvious.
$ find -L . -type f -exec realpath {} \; | sort -u | xargs -d '\n' perl ...
(Assumes none of your filenames contain a newline!)

How to grep for a line containing at most one forward slash?

I'm trying to get the size of the top-level directories under the current directory (on Solaris). So I'm piping du to grep and want to match only those lines that have a single forward slash, i.e. the top-level directories.
Something like:
du -h | grep -e <your answer here>
but nothing I try works. Help appreciated!
grep -e '^[^/]*/[^/]*$'
Note that this matches lines that have exactly one (not at most one) slash, but that should be OK for your usage.
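That is, plugged into the command from the question:
du -h | grep -e '^[^/]*/[^/]*$'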
You could also probably do something with the -s switch
du -hs */
You can also match the things you do not want with the -v option:
ptimac:Tools pti$ du | grep -v '/.*/'
22680 ./960-Grid-System
137192 ./apache-activemq-5.3.0
23896 ./apache-camel-2.0.0
386816 ./apache-servicemix-3.3.1
251480 ./apache-solr-1.4.0
345288 ./Community Edition-IC-96.SNAPSHOT.app
(I checked the Solaris man page first this time ;-)
There are other ways on GNU systems to skin that cat without using a regex:
find . -mindepth 1 -maxdepth 1
which lists all files/folders at a depth of exactly 1,
and a command I use often when cleaning up a disk is:
du -d1
or (and this should work on Solaris too)
du | sort -n
which shows me the largest directories wherever they are below the current directory.
This doesn't answer your question exactly, but why don't you ask gdu to do that for you?
gdu --max-depth=1
If you really want to go the grep way, how about this?
du -h| grep -v '/.*/'
This will filter out lines with two or more slashes, leaving you with those that have one or zero.
du --max-depth=1 -h