Unix: fast 'remove directory' for cleaning up daily builds

Is there a faster way to remove a directory than simply running
rm -r -f *directory*
? I am asking because our daily cross-platform builds are really huge (e.g. 4 GB per build), so the hard disks on some of the machines frequently run out of space.
This is notably the case on our AIX and Solaris platforms.
Maybe there are 'special' commands for removing a directory on these platforms?
EDIT (moved my own separate answer into the question):
I am generally wondering why 'rm -r -f' is so slow. Doesn't 'rm' just need to modify the '..' or '.' entries to de-allocate the filesystem entries?
something like
mv *directory* /dev/null
would be nice.

For deleting a directory from a filesystem, rm is your fastest option.
On Linux, we sometimes do our builds (a few GB) in a ramdisk, and it has a really impressive delete speed :) You could also try different filesystems, but on AIX/Solaris you may not have many options...
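A minimal sketch of the ramdisk approach on Linux, assuming you have enough RAM and the build tree fits in the chosen size (both the size and the mount point are placeholders):
mount -t tmpfs -o size=6G tmpfs /path/to/build
Removing the build is then just deleting files that only ever lived in RAM, or unmounting the tmpfs altogether.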
If your goal is to have the directory $dir empty now, you can rename it, and delete it later from a background/cron job:
mv "$dir" "$dir.old"
mkdir "$dir"
# later
rm -r -f "$dir.old"
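If you don't want a cron job, a variant is to kick off the delayed delete in the background right away, detached from the shell:
nohup rm -rf "$dir.old" >/dev/null 2>&1 &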
Another trick is to create a separate filesystem for $dir; when you want to delete it, you simply re-create the filesystem. Something like this:
# initialization
mkfs.something /dev/device
mount /dev/device "$dir"
# when you want to delete it:
umount "$dir"
# re-init
mkfs.something /dev/device
mount /dev/device "$dir"

I forgot the source of this trick but it works:
EMPTYDIR=$(mktemp -d)
rsync -r --delete "$EMPTYDIR"/ dir_to_be_emptied/

On AIX at least, you should be using LVM, the logical volume manager. All our systems bundle all the physical hard drives into a single volume group and then create one big honkin' file system out of that.
That way, you can add physical devices to your machine at will and increase the size of your file system to whatever you need.
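For example, growing the build filesystem on AIX after adding a disk might look roughly like this (a hedged sketch; the volume group name, disk name and mount point are placeholders for whatever your system uses):
extendvg buildvg hdisk2
chfs -a size=+10G /build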
One other solution I've seen is to allocate a trash directory on each file system and use a combination of mv and a find cron job to tackle the space problem.
Basically, have a cron job that runs every ten minutes and executes:
rm -rf /trash/*
rm -rf /filesys1/trash/*
rm -rf /filesys2/trash/*
Then, when you want your specific directory on that file system recycled, use something like:
mv /filesys1/overnight /filesys1/trash/overnight
and, within the next ten minutes your disk space will start being recovered. The filesys1/overnight directory will immediately be available for use even before the trashed version has started being deleted.
It's important that the trash directory be on the same filesystem as the directory you want to get rid of, otherwise you have a massive copy/delete operation on your hands rather than a relatively quick move.
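For reference, a crontab entry for the ten-minute cleanup could look like the following (classic cron field syntax, since */10 is not supported by every Unix cron; the trash paths are the examples from above):
0,10,20,30,40,50 * * * * rm -rf /trash/* /filesys1/trash/* /filesys2/trash/*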

rm -r directory works by recursing depth-first down through directory, deleting files, and deleting the directories on the way back up. It has to, since you cannot delete a directory that is not empty.
Long, boring details: each file system object is represented by an inode, and the file system keeps a flat, file-system-wide array of inodes.[1] If you just deleted directory without first deleting its children, the children would remain allocated but without any pointers to them. (fsck checks for that kind of thing when it runs, since it represents file system damage.)
[1] That may not be strictly true for every file system out there, and there may be a file system that works the way you describe. It would possibly require something like a garbage collector. However, all the common ones I know of act like fs objects are owned by inodes, and directories are lists of name/inode number pairs.
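You can see the name/inode-number pairs that a directory holds by listing it with inode numbers; note that '.' and '..' are just two more entries:
ls -ai somedir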

If rm -rf is slow, perhaps you are using a "sync" option or similar, which is writing to the disk too often. On Linux ext3 with normal options, rm -rf is very quick.
One option for fast removal which would work on Linux and presumably also on various Unixen is to use a loop device, something like:
hole temp.img $[5*1024*1024*1024] # create a 5Gb "hole" file
mkfs.ext3 temp.img
mkdir -p mnt-temp
sudo mount temp.img mnt-temp -o loop
The "hole" program is one I wrote myself to create a large empty file using a "hole" rather than allocated blocks on the disk, which is much faster and doesn't use any disk space until you really need it. http://sam.nipl.net/coding/c-examples/hole.c
I just noticed that GNU coreutils contains a similar program "truncate", so if you have that you can use this to create the image:
truncate --size=$[5*1024*1024*1024] temp.img
Now you can use the mounted image under mnt-temp for temporary storage, for your build. When you are done with it, do this to remove it:
sudo umount mnt-temp
rm temp.img
rmdir mnt-temp
I think you will find that removing a single large file is much quicker than removing lots of little files!
If you don't care to compile my "hole.c" program, you can use dd, but this is much slower:
dd if=/dev/zero of=temp.img bs=1024 count=$[5*1024*1024] # create a 5Gb allocated file
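With GNU dd you can also create the image as a sparse file without writing any data, by seeking past the end; this is roughly equivalent to the "hole" approach (assumes GNU dd's size suffixes):
dd if=/dev/zero of=temp.img bs=1 count=0 seek=5G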

I don't think there is anything faster than the "rm -rf" you quoted for deleting your directories.
To avoid doing it manually over and over, you can schedule a daily cron job running a script that recursively deletes all the build directories under your build root directory if they are "old enough", with something like:
find <buildRootDir>/* -prune -mtime +4 -exec rm -rf {} \;
(here mtime +4 means "any file older than 4 days")
Another way would be to configure your builder (if it allows such things) to overwrite the previous build with the current one.
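A crontab entry scheduling that cleanup nightly might look like this (the 02:00 time and the build root path are placeholders; adjust to your setup):
0 2 * * * find /path/to/buildroot/* -prune -mtime +4 -exec rm -rf {} \;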

I was looking into this as well.
I had a dir with 600,000+ files.
rm * would fail, because there are too many entries.
find . -exec rm {} \; was nice, deleting ~750 files every 5 seconds (I was checking the rm rate from another shell).
So instead I wrote a short script to rm many files at once, which managed about 1,000 files every 5 seconds. The idea is to put as many files into one rm command as you can, to increase efficiency.
#!/usr/bin/ksh
# Batch file names from 'filelist' into groups of 40 and pass each group
# to a single rm invocation (assumes the file names contain no whitespace).
string=""
count=0
for i in $(cat filelist); do
    string="$string $i"
    count=$((count + 1))
    if [[ $count -eq 40 ]]; then
        rm $string
        string=""
        count=0
    fi
done
# remove any leftover files from the last (partial) batch
if [[ -n $string ]]; then
    rm $string
fi
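To generate the filelist the script reads, something like this would do (the path is a placeholder):
find /dir/to/clean -type f > filelist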

On Solaris, this is the fastest way I have found.
find /dir/to/clean -type f|xargs rm
If you have file names with odd characters (spaces, quotes and the like), be aware that plain xargs can mis-split them. The original suggestion was
find /dir/to/clean -type f|while read line; do echo "$line";done|xargs rm
but the extra while/echo loop does not actually make the splitting any safer.
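Where GNU find and xargs are available (stock Solaris find may not support -print0), a null-delimited pipeline handles such names robustly:
find /dir/to/clean -type f -print0 | xargs -0 rm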

Use this Perl one-liner, which unlinks every (non-hidden) file in the current directory:
perl -e 'for(<*>){((stat)[9]<(unlink))}'
For background and benchmarks, see:
http://www.slashroot.in/which-is-the-fastest-method-to-delete-files-in-linux

I needed to delete 700 GB from dozens of directories on a 1 TB AWS EBS disk (ext3) before copying the remainder to a new 200 GB XFS volume. Deleting with rm was taking hours and leaving the volume at 100% wa. Since disk I/O and server time are not free, I instead hid each directory by mounting an empty volume over it, which took only a fraction of a second per directory:
# /dev/sdb is an empty volume of any size
directory_to_delete=/ebs/var/tmp/
mount /dev/sdb "$directory_to_delete"
# then copy the remaining data to the new volume in the background
nohup rsync -avh /ebs/ /ebs2/ &

I coded a small Java application, RdPro (Recursive Directory Purge tool), which is faster than rm. It can also remove target directories the user specifies under a root. It works on both Linux/Unix and Windows, and has both a command-line version and a GUI version.
https://github.com/mhisoft/rdpro

I had to delete more than 300,000 files on Windows. I had Cygwin installed. Luckily I had all the top-level directories in a database, so I created a loop over those entries and deleted each one with rm -rf.
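A minimal sketch of that loop, assuming the directory paths have been exported from the database into a hypothetical dirlist.txt (one path per line):
while IFS= read -r dir; do
    rm -rf "$dir"
done < dirlist.txt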

I just use find ./ -delete in the folder I want to empty; it deleted 620,000 files and directories (about 100 GB in total) in around 10 minutes.
Source: a comment on this page: https://www.slashroot.in/comment/1286#comment-1286

Related

Cannot delete file with special characters under Linux

When trying to run a PHP script under Linux, my command fails and I end up with a new file in the folder.
The file is called ");? ?for ($j=0;$j".
It is impossible to delete with rm and impossible to move.
Any idea, please?
Just an untested idea:
Maybe you can try to delete the whole directory with rm -R folder_name.
You could also add -f: rm -R -f folder_name
Of course, don't forget to save the other files beforehand, but that should be easy as there are just a few.
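Another common way out, kept here as a hedged sketch, is to delete the file by its inode number so the shell never has to parse the awkward name (the inode number below is a placeholder; read the real one from ls -li first):
ls -li
find . -maxdepth 1 -inum 1234567 -delete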

GCP : gsutil -m cp -rn /Downloads gs://my-bucket - How to skip files already copied

When I resume copying files, it works in two phases: 1) skipping already-copied files and 2) copying new files.
Because of this it takes a long time.
Is there any way to skip the already-copied files during the process?
Yes, this can be achieved by using the gsutil rsync command.
The gsutil rsync command synchronises the contents of the source directory and the destination directory by copying only the missing files. It's therefore much more efficient than the standard vanilla cp command.
However, if you're using the -n switch with the cp command, this forces the cp command to skip the files already copied. So whether or not using gsutil rsync is faster than gsutil cp -n is open to debate, and maybe depends on different scenarios.
To use gsutil rsync you can run something like this (the -r flag makes the command recursive):
gsutil rsync -r source gs://mybucket/
For more details on the gsutil rsync command, please take a look here.
I understand you have some concerns about the time it takes for both commands to calculate which files need to be copied. As both the gsutil cp -n and gsutil rsync commands need to make a comparison between the source and destination directories, there is always going to be a certain amount of overhead/delay on top of the copying process, especially with very large collections.
If you want to cut out this part of the process altogether, and you only want to copy files less than a certain age, you could specify this at source and use a standard gsutil copy command to see if that is faster. However, doing so would remove some of the benefits of gsutil cp -n and gsutil rsync as there would no longer be a direct comparison between the source and destination directories.
For example, you could generate a variable at the source of files which have been modified recently, for example, within the last day. You could then use a standard gsutil cp command to only copy these files.
For example, to create a variable containing a list of files modified within the last day:
modified="$(find . -mtime -1)"
Then use the variable as the target for the copy command.
gsutil -m cp $modified gs://publicobject/
You need to work out whether or not this would work for your use case, as although there is chance it may be faster, some of the advantages of the other two methods are lost (automatic syncing of directories).
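As a side note on the example above, if the list of modified files is long or contains names with spaces, gsutil cp can read the list from stdin instead of taking it as arguments (the -I flag; the destination bucket is the same placeholder as above):
find . -type f -mtime -1 | gsutil -m cp -I gs://publicobject/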

How to delete thousands of png files in a directory based on a string in the middle of the filename?

I have a directory with 150,000 png files. I need to delete about 70,000 of them.
The files I need to delete have a string "&zoom=9&" in the middle of the file name, like this:
Historical_Min_Temp_of_coldest_Month&zoom=9&x=129&y=377.png
I want to keep all the other files in the directory (with zoom levels 0-8). I'm on a Mac.
I have tried:
ls *zoom=9*
grep '^\./zoom-9'
find -P | grep 'zoom=9'
But I'm obviously missing some core concepts. Any help would be appreciated.
If you have several subdirectories you can try this:
find . -name "*&zoom=9&*" -delete
or (less preferred)
find . -name "*&zoom=9&*" -exec rm {} +
The first version removes the files internally, so no additional external executables are launched. Closing the line with + instead of the usual \; adds as many found files as fit on the command line, reducing the number of external calls (similar to the external xargs utility).
(I do not have a Mac; this is the Linux version, but I assume these features are basic ones and supported by OS X.)
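Before deleting anything, it may be worth previewing how many files the pattern will match, as a kind of dry run:
find . -name "*&zoom=9&*" | wc -l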
I would like to propose something: do not store 100,000+ files in one directory. This can slow down your system, and deleting the files will not solve the problem by itself, because the directory's inode stays large. To shrink it, you have to (hard)link all remaining files under a new directory and remove the old directory.
Is rm *zoom=9*.png not working?

Use INotify to watch a file with multiple symlinks

So I set up some code to watch a config file for edits, which worked until I used Vim to edit the file; then I had to also watch the directory for renames and creations. Then I discovered that didn't catch renames higher in the path hierarchy. Then I looked into symlinks... gaaahhhh!
First, a made-up example showing one (of many) tricky symlink scenarios:
mkdir config1
touch config1/config
ln -s config1 machine1
mkdir config2
touch config2/config
ln -s config2 machine2
ln -s machine1 active
Now, given a filename like active/config that I want to watch, I can see how to get an inotify watch descriptor for:
config1/ -> watch active/ follow symlinks (watches inode for config1)
active/ -> watch active/, don't follow symlinks (watches inode for the active symlink)
active/config -> watch active/config (watches inode for config1/config)
How do I add a watch on the machine1 symlink? Do I need to find some way to manually walk each symlink adding watches for each along the way? How?
The purpose is to allow:
mkdir config3
touch config3/config
ln -s -f -n config3 machine1
And have inotify warn that active/config has been redirected. At the moment it looks like I'll have to add a watch for:
- target file inode
- every directory inode leading to the file (to detect moves/renames of directories)
- every symlink inode involved in reaching any of the above
There must be an easier way to just watch one file? Have I strayed from the path or is this really the way?
My answer is a straight "yes, you are doing it right".
After carefully reading the inotify syscall manpage, I cannot see any way short of watching every step of a (possibly symlinked) path-to-a-file in order to detect any and all changes to the full path.
This just seems to be the way inotify works: it only looks at specific files or folders, and it does not do recursion on its own. That, plus having to explicitly follow symlinks, seems to match your 3-step plan to the letter.
Selected quotes from the manpage:
The following further bits can be specified in mask when calling inotify_add_watch(2):
IN_DONT_FOLLOW (since Linux 2.6.15)
Don't dereference pathname if it is a symbolic link.
[...]
Limitations and caveats
Inotify monitoring of directories is not recursive: to monitor subdirectories under a directory, additional watches must be created. This can take a significant amount of time for large directory trees. [...]
This FAQ also lends support for your strategy re symlinks:
Q: What about the IN_ONLYDIR and IN_DONT_FOLLOW flags?
IN_ONLYDIR ensures that the event occur only on a directory. If you create such watch on a file it will not emit events. IN_DONT_FOLLOW forbids following symbolic links (these ones will be monitored themselves and not the files they point to).
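Using the userspace inotify-tools as an illustration (a hedged sketch, not the syscall API): watching the containing directory is what catches the machine1 symlink being re-pointed, because the new link shows up as a create/move event in '.', while a symlink passed directly to inotifywait is dereferenced to its target:
# watch the parent dir (for symlink retargets) and the resolved config file (for edits)
inotifywait -m -e create,delete,move,modify,attrib . active/config |
while read watched events name; do
    echo "event: $watched $events $name -- re-resolve active/config here"
done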

Is there a "watch" / "monitor" / "guard" program for Makefile dependencies?

I've recently been spoiled by using nodemon in a terminal window, to run my Node.js program whenever I save a change.
I would like to do something similar with some C++ code I have. My actual project has lots of source files, but if we assume the following example, I would like to run make automatically whenever I save a change to sample.dat, program.c or header.h.
test: program sample.dat
./program < sample.dat
program: program.c header.h
gcc program.c -o program
Is there an existing solution which does this?
(Without firing up an IDE. I know lots of IDEs can do a project rebuild when you change files.)
If you are on a platform that supports inotifywait (to my knowledge, only Linux; but since you asked about Make, it seems there's a good chance you're on Linux; for OS X, see this question), you can do something like this:
inotifywait --exclude '.*\.swp|.*\.o|.*~' --event MODIFY -q -m -r . |
while read
do make
done
Breaking that down:
inotifywait
Listen for file system events.
--exclude '.*\.swp|.*\.o|.*~'
Exclude files that end in .swp, .o or ~ (you'll probably want to add to this list).
--event MODIFY
Only report modify events; for each one, inotifywait prints a line naming the file concerned.
-q
Do not print startup messages (so make is not prematurely invoked).
-m
Listen continuously.
-r .
Listen recursively on the current directory.
The output is then piped into a simple loop which invokes make for every line read.
Tailor it to your needs. You may find inotifywait --help and the manpage helpful.
Here is a more detailed script. I haven't tested it much, so use with discernment. It is meant to keep the build from happening again and again needlessly, such as when switching branches in Git.
#!/bin/sh
datestampFormat="%Y%m%d%H%M%S"
lastrun=$(date +$datestampFormat)
inotifywait --exclude '.*\.swp|.*\.o|.*~' \
--event MODIFY \
--timefmt $datestampFormat \
--format %T \
-q -m -r . |
while read modified; do
if [ $modified -gt $lastrun ]; then
make
lastrun=$(date +$datestampFormat)
fi
done