Quick file access in a directory with 500,000 files - c++

I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).
How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).
The other option is to just not worry about it and hope the FS does a good job of looking up files in a directory. If that is the best option, which filesystems would work best? Do certain filename patterns look up faster than others, e.g. 01234.txt vs foo.txt?
BTW this is all on Linux.

Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.
If the directory is indexed, your file naming scheme shouldn't matter.
http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/

A couple of ideas:
a) If you can control the directory layout then put the files into subdirectories.
b) If you can't move the files around, then you might try different filesystems. I think xfs might be good for directories with lots of entries.

If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time. I have successfully done this with 100,000 files; 500,000 should work as well.
If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access by name. Accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache and introduce an additional performance hit. A stock 2.6 kernel keeps up to 256 entries per MB of RAM in that hash at a time, so with 2 GB of memory (256 * 2048 = 524,288 entries) you should be okay for a little over 500,000 files.
Of course, make sure you do the appropriate profiling to determine whether this really is a bottleneck.
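The ulimit suggestion above can also be applied from inside the process. The following is a minimal sketch, assuming Linux/POSIX and using getrlimit/setrlimit on RLIMIT_NOFILE; the function name raise_open_file_limit is made up for this example. It can only raise the soft limit up to the hard limit, so going beyond the hard limit still needs ulimit or administrator configuration.

```cpp
#include <cassert>
#include <sys/resource.h>

// Hypothetical helper: try to raise the soft limit on open file descriptors
// to `wanted`, capped at the hard limit. Returns the soft limit in effect
// afterwards, or 0 if the limits could not be read.
rlim_t raise_open_file_limit(rlim_t wanted) {
    assert(wanted > 0);
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;

    rlim_t target = wanted;
    if (lim.rlim_max != RLIM_INFINITY && target > lim.rlim_max)
        target = lim.rlim_max;  // an unprivileged process cannot exceed the hard limit

    if (target > lim.rlim_cur) {
        lim.rlim_cur = target;
        setrlimit(RLIMIT_NOFILE, &lim);  // best effort; on failure the old limit stays
    }

    // Read the limit back so the caller sees what is really in effect.
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;
    return lim.rlim_cur;
}
```

Calling raise_open_file_limit(500000) at startup and comparing the result with the number of files you need tells you whether keeping everything open is feasible on that machine.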

The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.
As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
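As a rough illustration of the hashed-subdirectory layout described above, here is a small sketch that maps a file name to its sharded path. The function name sharded_path and the default prefix length of 2 are assumptions chosen to match the 01/2345678 example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: split a file name into "<prefix>/<rest>", where the
// prefix is the first `prefix_len` characters. With hexadecimal names and
// prefix_len == 2 this yields at most 256 subdirectories.
std::string sharded_path(const std::string& name, std::size_t prefix_len = 2) {
    assert(prefix_len > 0);
    // Names too short to split stay in the top-level directory.
    if (name.size() <= prefix_len) return name;
    return name.substr(0, prefix_len) + "/" + name.substr(prefix_len);
}
```

For a second level, the same function can be applied again to the part after the slash.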

Another question is how much data is in the files? Is an SQL back end an option?

Related

concatenate/append/merge files in c++ (windows) without copying

How can I concatenate a few large files (total size ~3 TB) into one file using C/C++ on Windows?
I can't copy the data, because it takes too much time, so I can't use:
cmd copy
Appending One File to Another File (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx)
and so on (stream::readbuf(), ...)
I just need to represent a few files as one.
If this is only inside your own program, then you can create a class that virtually glues the files together, so that you can read across them and make them appear as a single file.
If you want to physically have a single file, then no, that is not possible without copying: it requires opening file 1 and appending the others, or creating a new file and appending all the files.
Neither the C/C++ library nor the Windows API has a means to concatenate files in place.
Even if such an API were available, it would be restrictive in that the first file would have to be of a size that is a multiple of the disk allocation size.
Going really low level, and assuming the multiple-of-allocation-size condition is fulfilled: yes, if you unmount the drive, physically override the file system and mess around with the file system structures, you could "stitch" the files together, but that would be a challenge for FAT and near impossible for NTFS.
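The "class that virtually glues the files together" could look roughly like the sketch below. It is read-only and sequential, uses only standard C++ streams, and the class name ConcatReader is made up for this example. It is one possible shape, not the only way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class: presents several files as one read-only byte stream
// without copying them on disk. Only one underlying file is open at a time.
class ConcatReader {
public:
    explicit ConcatReader(std::vector<std::string> paths)
        : paths_(std::move(paths)) {}

    // Reads up to `n` bytes into `buf`, moving on to the next file whenever
    // the current one is exhausted. Returns the number of bytes read;
    // 0 means the end of the last file has been reached.
    std::size_t read(char* buf, std::size_t n) {
        assert(buf != nullptr);
        std::size_t total = 0;
        while (total < n) {
            if (!current_.is_open()) {
                if (next_ >= paths_.size()) break;  // no more files
                current_.clear();                   // reset EOF state of the previous file
                current_.open(paths_[next_], std::ios::binary);
                if (!current_.is_open())
                    throw std::runtime_error("cannot open " + paths_[next_]);
                ++next_;
            }
            current_.read(buf + total, static_cast<std::streamsize>(n - total));
            std::size_t got = static_cast<std::size_t>(current_.gcount());
            total += got;
            // A short or empty read means this file is finished.
            if (got == 0 || current_.eof()) current_.close();
        }
        return total;
    }

private:
    std::vector<std::string> paths_;
    std::size_t next_ = 0;   // index of the next file to open
    std::ifstream current_;  // file currently being read
};
```

Random access would additionally need the size of each file, so that a global offset can be translated into a (file, local offset) pair.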

hardlink multiple file to one file

I have many files in a folder. I want to concatenate all these files into a single file, for example cat * > final_file;
But this will increase disk usage. Is there a way to hardlink all the files to final_file instead? For example ln * final_file.
This is not possible using links.
If you really need this kind of feature and cannot afford to create one large file, you could go for a custom file system driver. FUSE allows you to write a simple file system driver which runs in user space and lets you access the files as if they were one large file.
You could also write a custom block device (e.g. by emulating the NBD "Network Block Device" protocol) which combines two or more files into one large block device.
Getting to know the concrete use case would help to give a better answer.
No. Hardlinking links 2 files, nothing more. The filesystem does not support that at an underlying level.

Is there a way to completely remove an inode when the Link count is 2?

Currently my data is organised in a volume which has a cache directory (where all the files are first created or transferred). After that there are suitable directories on the volume which in their subdirs, contain files hardlinked to files in the cache.
This is done so that the same inode (file) can be hardlinked multiple times in multiple directories.
Now, when trying to clean up the volume, I recursively go through the dirs (not the cache) and, based on certain criteria, unlink the files (which reduces the link count of the cache entry's inode by 1). Is there a way for me to delete the cache entry directly when I am deleting the last hardlink outside the cache (that is, bringing the count down from 2 to 1)? That way I would not have to walk the whole cache directory to clear out any inodes that have a link count of just 1.
I have gone through unlink/remove functions, and could not find anything specific of use. Is there some purging algorithm that internally takes care of this, then I can try to implement that.
Any help on this would be highly appreciated. In anticipation of a prompt reply.
I saw this and a few other places which show how to delete all hardlinks from the shell (use find -samefile and call remove on each file). You could call it via system(), although that might be frowned on by some people.
No, there isn't anything that does what you want out of the box.
It might be useful to do the deletion when unlinking the hardlink and noticing that the link count is 1, since at that point the inode should be in the page cache; this of course is dependent on knowing the name of the file in the cache directory.
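The idea of checking the link count at unlink time can be sketched as follows, assuming POSIX stat/unlink and that the caller already knows the name of the matching cache entry. The function name unlink_and_purge is made up for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: removes `link_path`. If `cache_path` names the same
// inode and the link count was 2 (the link plus the cache entry), the cache
// entry is removed as well. Returns true only if the cache entry was removed.
bool unlink_and_purge(const std::string& link_path, const std::string& cache_path) {
    assert(link_path != cache_path);
    struct stat link_st, cache_st;
    if (stat(link_path.c_str(), &link_st) != 0 ||
        stat(cache_path.c_str(), &cache_st) != 0) {
        std::perror("stat");
        return false;
    }
    // Refuse to touch the cache entry unless both names refer to the same file.
    if (link_st.st_dev != cache_st.st_dev || link_st.st_ino != cache_st.st_ino)
        return false;

    if (unlink(link_path.c_str()) != 0) {
        std::perror("unlink");
        return false;
    }
    // A count of 2 before the unlink means only the cache entry is left now.
    if (cache_st.st_nlink == 2)
        return unlink(cache_path.c_str()) == 0;
    return false;  // other hard links still exist; keep the cache entry
}
```

The check and the two unlink calls are not atomic, so this assumes nothing else is creating links to the same inode while the cleanup runs.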

Out of Core Implementation of a Quadtree

I am trying to build a quadtree data structure (or let's just say a tree) in secondary memory (hard disk).
I have a C++ program to do so and I use fopen to create the files. Also, I am using tesseral coding to store each cell in a file named with its corresponding code to store it on the disk in one directory.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can create further files manually in that directory, but using C++ it can not create any further files.
I know about the limit on ext3 filesystems, which is (from Wikipedia) 32,000, but my count is way below that. Also note that I can create files manually on the disk, just not through fopen.
Also, I would really appreciate any ideas regarding the best way to store a very dynamic quadtree on disk (I need the nodes to be in separate files, and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will slow down the performance because of following the links on the filesystem to access the file.
Thanks,
Nima
What's the errno value of the failed fopen() call?
Do you keep the files you have created open? If yes you are most probably exceeding the maximum number of open files per process.
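To find out what errno says, something along these lines can wrap the failing call. This is only a sketch and the function name open_or_report is made up. If the per-process descriptor limit has been reached, the reported error will be EMFILE.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: opens `path` with `mode`. On failure it prints the
// reason to stderr, stores the errno value in `*error_out` (if given) and
// returns a null pointer.
std::FILE* open_or_report(const char* path, const char* mode, int* error_out = nullptr) {
    assert(path != nullptr && mode != nullptr);
    errno = 0;
    std::FILE* f = std::fopen(path, mode);
    if (!f) {
        int err = errno;  // save it before another call can overwrite it
        std::fprintf(stderr, "fopen(%s) failed: %s (errno %d)\n",
                     path, std::strerror(err), err);
        if (error_out) *error_out = err;
    }
    return f;
}
```

Every FILE* that is returned still has to be closed with fclose once the node has been written, otherwise the open descriptors keep accumulating.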
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you've exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
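One way to combine several tree levels into one directory is sketched below. It assumes the cell code is a string with one character per level, root first (as a tesseral code would be); the function name cell_path, the ".node" suffix and the "root.node" name are made up for this example. The suffix keeps a node file from clashing with a directory of the same name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a cell code to a relative path in which each
// directory covers `levels_per_dir` tree levels. The last chunk of the code
// becomes the file name.
std::string cell_path(const std::string& code, std::size_t levels_per_dir = 4) {
    assert(levels_per_dir > 0);
    if (code.empty()) return "root.node";  // the root has no code of its own
    std::string path;
    for (std::size_t i = 0; i < code.size(); i += levels_per_dir) {
        if (i > 0) path += '/';
        path += code.substr(i, levels_per_dir);  // substr clamps at the end of the string
    }
    return path + ".node";
}
```

With four children per node and four levels per directory, a directory holds at most 4^4 = 256 subdirectories, and a tree of depth 50 needs 12 levels of nesting instead of 49.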
The limitation could come from:
stdio (C library): at most 256 handles. This can be increased to 1024 (in VC, call _setmaxstdio).
The OS kernel's limit on file handles per process (usually 1024).

How can I quickly enumerate directories on Win32?

I'm trying to speed up directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of its time in the FindFirstFile/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster because there is an app that does: Everything. It enumerates my entire drive in seconds.
How might I accomplish something like this?
I realize this is an old post, but there is a project on SourceForge that does exactly what you are asking, and the source code is available.
You can find the project here: NTFS-Search
"Everything" builds an index in the background, so queries are against the index not the file system itself.
There are a few improvements to be made, at least over the straightforward algorithm:
First, breadth-first search over depth-first search. That is, enumerate and process all files in a single folder before recursing into the subfolders you found. This improves locality, usually a lot.
On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic, the main speedup being omitting the short file name on NTFS file systems where this is enabled.
Separate threads help if you enumerate different physical disks (not just drives). For the same disk it only helps if it's an SSD ("zero seek time"), or you spend significant time processing a file name (compared to the time spent on disk access).
[edit] Wikipedia actually has some comments -
Basically, they are skipping the file system abstraction layer and accessing NTFS directly. This way, they can batch calls and skip expensive services of the file system, such as checking ACLs.
A good starting point would be the NTFS Technical Reference on MSDN.
"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.
I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).
A couple of other utilities that do similar things are:
FindOnClick by 2Brightsparks
Search GT
A little reverse engineering with a debugger on these tools might give you some insight on the techniques they use.
Don't recurse immediately, save a list of directories you find and dive into them when finished. You want to do linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.
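The "save a list of directories and dive into them when finished" approach can be sketched with a queue. For portability this example uses C++17 std::filesystem instead of the FindFirstFile/FindNextFile loop from the question, so it only illustrates the traversal order, not the Win32 calls. The function name list_entries_breadth_first is made up.

```cpp
#include <cassert>
#include <deque>
#include <filesystem>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns every entry (files and directories) below
// `root`, reading each directory completely before descending into any of
// its subdirectories.
std::vector<fs::path> list_entries_breadth_first(const fs::path& root) {
    assert(!root.empty());
    std::vector<fs::path> entries;
    std::deque<fs::path> pending;  // directories found but not yet read
    pending.push_back(root);
    while (!pending.empty()) {
        fs::path dir = std::move(pending.front());
        pending.pop_front();
        std::error_code ec;
        fs::directory_iterator it(dir, ec), end;
        // Unreadable directories are skipped silently.
        for (; !ec && it != end; it.increment(ec)) {
            entries.push_back(it->path());
            std::error_code type_ec;
            // Queue real subdirectories; symlinks are not followed to avoid cycles.
            if (it->is_directory(type_ec) && !it->is_symlink(type_ec))
                pending.push_back(it->path());
        }
    }
    return entries;
}
```

The same queue structure carries over to a FindFirstFileEx-based loop: push each subdirectory found onto the queue and only start a new find handle once the current one has been closed.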
If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.
If you are doing this on NTFS, here's a lib for low level access: NTFSLib.
You can enumerate through all file records in $MFT, each representing a real file on disk. You can get all file attributes from the record, including $DATA.
This may be the fastest way to enumerate all files/directories on NTFS volumes: 200k~300k files per minute in my tests.