Fast way to do big directory listings - C++

I want to know how to obtain a fast directory listing of, e.g., a whole hard disk drive.
In C++ I have been using dirent.h with a recursive directory listing. This works fine, of course, but the more you list, the longer it takes, and a 500 GB drive can easily take an hour to scan.
This has to do with fragmentation: while one file or directory entry may sit on the outer part of the disk, the next may be at the innermost. The worst case would be a seek from the innermost to the outermost track on every read access.
Of course the hard disk will always be a bottleneck, but I have used applications that delivered a directory listing of the drive in about 2 to 5 minutes.
How did they do it?
Some sources on google said something about accessing the "Master File Table" of the file system.
Alright, but how would I do that? And what about NTFS, FAT, and ext4? They certainly use different kinds of tables, don't they?
Well, this is a very broad question, I know, so for the sake of clarification I narrow it down to this:
Could someone please explain what these "Master File Tables" are and how the file systems mentioned above use them?
Is using these tables the right approach to this task?
How would I access such a table in C++? It surely has no path like /root/.filesystem/master_file_table.
Any explanation, resource or push in the right direction is welcome, thank you in advance!

Related

How exactly "Everything Search" can give me immediately searchable list of 2bln files on my 4TB HDD in less than 10 seconds?

There's a Windows program, "Everything Search" (http://www.voidtools.com/), that reads the file names of an NTFS volume faster than I assumed was possible by recursive descent (it reads the names of almost 2 billion files on a 4 TB HDD in less than 10 seconds).
I know that it probably reads the NTFS folder structure directly off the volume in bulk, and makes sense of it without calling OS filesystem functions.
How exactly can it be done? What system functions should I call to get that information about an NTFS volume that fast, and how can I parse it into file and directory names? Are there any libraries in any language that help with that?
If you are not sure what I am asking, there are more details in my previous question (I was asked to rephrase it): Can I read the whole NTFS directory tree into RAM at once?
The NTFS volume has a low-visibility structure it relies on called the Master File Table. There are APIs for querying this table directly, but they require certain privileges to invoke, because you have to get a handle to the volume. The main function for querying the Master File Table is DeviceIoControl, and the control code is FSCTL_ENUM_USN_DATA.
The control code appears to be a USN-related code, which is a touch misleading in this particular case, but it gives the basic flavor of the call and the related structures. You get back an enumeration of records that look like USN records, but they are thin wrappers around Master File Table entries.
Each record has a FileName, an ID, and a parent ID. The FileName is the "local" name of the file or folder; to get the full name, you would traverse the parent IDs up the table structure.
It is lightning fast - way faster than recursing through the file system. You'll get back (and will have to filter out) things that aren't exposed in any of the normal file APIs - things you definitely don't want to expose to users, for example.
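For illustration, here is a minimal sketch of that call sequence (the drive letter and buffer size are arbitrary choices, error handling is omitted, and it must run elevated):

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Open the raw volume (drive letter is an arbitrary example).
        // This requires administrator rights.
        HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                                  OPEN_EXISTING, 0, nullptr);
        if (hVol == INVALID_HANDLE_VALUE) return 1;

        // MFT_ENUM_DATA_V0 is called just MFT_ENUM_DATA in older SDKs.
        MFT_ENUM_DATA_V0 med = {};
        med.StartFileReferenceNumber = 0;
        med.LowUsn = 0;
        med.HighUsn = MAXLONGLONG;

        std::vector<BYTE> buf(1 << 16);  // 64 KB output buffer per call
        DWORD bytes = 0;
        while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                               buf.data(), (DWORD)buf.size(), &bytes, nullptr)) {
            // The first 8 bytes of the output hold the next start reference;
            // USN_RECORD structures follow until the end of the buffer.
            auto* rec = (USN_RECORD*)(buf.data() + sizeof(USN));
            auto* end = (USN_RECORD*)(buf.data() + bytes);
            while (rec < end) {
                wprintf(L"%.*s (id=%llu, parent=%llu)\n",
                        (int)(rec->FileNameLength / sizeof(WCHAR)),
                        (WCHAR*)((BYTE*)rec + rec->FileNameOffset),
                        (unsigned long long)rec->FileReferenceNumber,
                        (unsigned long long)rec->ParentFileReferenceNumber);
                rec = (USN_RECORD*)((BYTE*)rec + rec->RecordLength);
            }
            med.StartFileReferenceNumber = *(DWORDLONG*)buf.data();
        }
        CloseHandle(hVol);  // the loop ends with ERROR_HANDLE_EOF when done
        return 0;
    }

To reconstruct full paths, you would store each record's FileReferenceNumber together with its parent ID and name in a hash map, then walk the parent chain up to the root for each file.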

Structure for storing data from thousands of files on a mobile device

I have more than 32,000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0 to 400 KB. I need to be able to access the content of these files randomly and at various points in time. I don't like the idea of having 32,000+ separate data files installed on a mobile device (even though the total size is < 100 MB). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions on the best way to do this. Any suggestions should have C/C++ libs for accessing the data and a liberal license that allows inclusion in commercial, closed-source applications without issue.
The only thing I've thought of so far is storing everything in an SQLite database, though I'm not sure this is the best method, or what considerations I need to take into account for storing blob data with quick lookup times (i.e., what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
That's maybe 30 lines of C++ code. Invest a good 10 minutes designing a clean interface for it so you can replace the implementation if and when the need arises.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
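To make that concrete, here is a minimal sketch of the read side of such a bundle. The on-disk layout (a record count, then name/offset/size entries, then the raw blobs) is an illustrative choice, not an established format:

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Entry {
        uint64_t offset;  // where the blob starts in the bundle
        uint64_t size;    // blob length in bytes
    };

    // Loads only the index into memory; blobs are read on demand.
    class Bundle {
        FILE* f_ = nullptr;
        std::map<std::string, Entry> index_;
    public:
        bool open(const char* path) {
            f_ = std::fopen(path, "rb");
            if (!f_) return false;
            uint32_t count = 0;
            if (std::fread(&count, sizeof(count), 1, f_) != 1) return false;
            for (uint32_t i = 0; i < count; ++i) {
                uint16_t len = 0;                 // name length prefix
                std::fread(&len, sizeof(len), 1, f_);
                std::string name(len, '\0');
                std::fread(&name[0], 1, len, f_);
                Entry e{};
                std::fread(&e, sizeof(e), 1, f_);
                index_[name] = e;
            }
            return true;
        }
        bool read(const std::string& name, std::vector<char>& out) {
            auto it = index_.find(name);
            if (it == index_.end()) return false;
            out.resize((size_t)it->second.size);
            // fseek with long is fine here since the bundle is < 100 MB.
            std::fseek(f_, (long)it->second.offset, SEEK_SET);
            return std::fread(out.data(), 1, out.size(), f_) == out.size();
        }
    };

The writer is symmetric: since all file sizes are known up front, the offsets can be computed before writing the index, and keeping the index at the front means opening the bundle only reads a few kilobytes.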

How can I quickly enumerate directories on Win32?

I'm trying to speed up directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of its time in the FindFirstFile/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster, because there is an app that does: Everything. It enumerates my entire drive in seconds.
How might I accomplish something like this?
I realize this is an old post, but there is a project on SourceForge that does exactly what you are asking, and the source code is available.
You can find the project here: NTFS-Search
"Everything" builds an index in the background, so queries are against the index not the file system itself.
There are a few improvements to be made, at least over the straightforward algorithm:
First, breadth-first search over depth-first search. That is, enumerate and process all files in a single folder before recursing into the subfolders you found. This improves locality, usually a lot.
On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic, the main speedup being the omission of the short file name on NTFS file systems where it is enabled. (A sketch combining both ideas follows below.)
Separate threads help if you enumerate different physical disks (not just drives). For the same disk, it only helps if it's an SSD ("zero seek time"), or if you spend significant time processing a file name (compared to the time spent on disk access).
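A minimal sketch combining breadth-first traversal with FindExInfoBasic might look like this (FindExInfoBasic and FIND_FIRST_EX_LARGE_FETCH require Windows 7 or later; the starting path is an arbitrary example):

    #include <windows.h>
    #include <cstdio>
    #include <deque>
    #include <string>

    int wmain() {
        std::deque<std::wstring> dirs;          // FIFO queue -> breadth-first
        dirs.push_back(L"C:\\");
        while (!dirs.empty()) {
            std::wstring dir = dirs.front();
            dirs.pop_front();
            WIN32_FIND_DATAW fd;
            // FindExInfoBasic skips the 8.3 short-name lookup.
            HANDLE h = FindFirstFileExW((dir + L"*").c_str(), FindExInfoBasic,
                                        &fd, FindExSearchNameMatch, nullptr,
                                        FIND_FIRST_EX_LARGE_FETCH);
            if (h == INVALID_HANDLE_VALUE) continue;
            do {
                if (wcscmp(fd.cFileName, L".") == 0 ||
                    wcscmp(fd.cFileName, L"..") == 0)
                    continue;
                if ((fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) &&
                    !(fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)) {
                    // Queue subfolders instead of recursing immediately;
                    // skip reparse points to avoid cycles.
                    dirs.push_back(dir + fd.cFileName + L"\\");
                } else if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
                    wprintf(L"%s%s\n", dir.c_str(), fd.cFileName);
                }
            } while (FindNextFileW(h, &fd));
            FindClose(h);
        }
        return 0;
    }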
[edit] Wikipedia actually has some comments on how "Everything" works:
Basically, they skip the file system abstraction layer and access NTFS directly. This way, they can batch calls and skip expensive services of the file system, such as checking ACLs.
A good starting point would be the NTFS Technical Reference on MSDN.
"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.
I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).
A couple of other utilities that do similar things are:
FindOnClick by 2Brightsparks
Search GT
A little reverse engineering with a debugger on these tools might give you some insight on the techniques they use.
Don't recurse immediately; save a list of the directories you find and dive into them when you're finished. You want linear access within each directory, to take advantage of locality of reference and any caching the OS is doing.
If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.
If you are doing this on NTFS, here's a lib for low level access: NTFSLib.
You can enumerate through all file records in $MFT, each representing a real file on disk. You can get all file attributes from the record, including $DATA.
This may be the fastest way to enumerate all files/directories on NTFS volumes; I measured 200k to 300k files per minute in my tests.

How to create a program which works similarly to RAID1 (mirroring)?

I want to create a simple program that works very similarly to RAID1. It should work like this:
First I give it the primary HDD's drive letter, and then the secondary one. I will only write to the primary HDD! If any new data is copied to the primary HDD, it should automatically be copied to the secondary one.
I need some help with where to start. How do I monitor the data written to the primary HDD? There are obviously many ways to do what I want (I think), but I need the simplest one.
If this isn't too complicated, then how do I handle the case where the primary HDD has two or more partitions? In that case I should check the secondary HDD's partitions too, and create/resize them if necessary.
Thanks in advance!
kampi
The concept of mirroring disk writes to another disk in real time is the basis for high availability, and implementing such schemes is not trivial.
The company I work for makes DoubleTake, which does real time mirroring & replication of file based IO to local or remote volumes. This is a little different than what you are describing, which appears to be block based disk/volume replication, but many of the concepts are similar.
For file-based replication, there are quite a few nasty scenarios; I'll describe a couple:
Synchronizing the contents of one volume to another volume, keeping in mind that changes can occur while you are doing this. I suppose you could simplify this by requiring that volumes start out freshly formatted, but for people who already have data, that will not be a good solution!
Keeping up with disk changes: what if the volume you are mirroring to is slower than the source volume? Where do you buffer? To disk? To memory?
Anyways we use a kernel mode file system filter driver to capture the disk IO, and then our user mode service grabs this IO and forwards it to a local or remote disk.
If you want to learn about file system filtering, one of the best books (it's old but good) is Windows NT File System Internals by Rajeev Nagar. It's a must-read for doing any serious work with file system filters.
Also take a look at the file system filter samples in the Windows 7 WDK; it's free, and the FileMon-style examples will get you seeing disk changes pretty quickly.
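As a user-mode approximation (not a filter driver, so it sees file-level changes rather than raw disk IO), ReadDirectoryChangesW can watch a tree for writes; the watched path below is hypothetical:

    #include <windows.h>
    #include <cstdio>

    int wmain() {
        // "C:\\primary" is a hypothetical path standing in for the mirrored drive.
        HANDLE hDir = CreateFileW(L"C:\\primary", FILE_LIST_DIRECTORY,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                                  nullptr, OPEN_EXISTING,
                                  FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        if (hDir == INVALID_HANDLE_VALUE) return 1;

        BYTE buf[64 * 1024];
        DWORD bytes = 0;
        // Synchronous loop; TRUE = watch the whole subtree.
        while (ReadDirectoryChangesW(hDir, buf, sizeof(buf), TRUE,
                                     FILE_NOTIFY_CHANGE_FILE_NAME |
                                     FILE_NOTIFY_CHANGE_LAST_WRITE,
                                     &bytes, nullptr, nullptr)) {
            auto* info = (FILE_NOTIFY_INFORMATION*)buf;
            for (;;) {
                wprintf(L"change: %.*s (action %lu)\n",
                        (int)(info->FileNameLength / sizeof(WCHAR)),
                        info->FileName, info->Action);
                // A mirroring tool would copy the affected file to the
                // secondary drive here.
                if (!info->NextEntryOffset) break;
                info = (FILE_NOTIFY_INFORMATION*)((BYTE*)info + info->NextEntryOffset);
            }
        }
        CloseHandle(hDir);
        return 0;
    }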
Good Luck!

Change file order in a Windows Directory in C

Like when you drag a file on top of another one and it changes the order; something like that.
I'm going to assume you're asking about how to rearrange the order in which files are displayed in a folder. I'm not exactly sure how to do it, but you'll want to use the various functions from the Windows shell to accomplish this. See the Shell Developer's Guide.
There is no way to do this (except maybe by hacking the directory structures on the disk using raw, sector-based APIs). The order of files on the disk is managed by the file system according to its design and needs.
For what it's worth, FAT directory entries are stored in the order in which they are added. NTFS actually indexes its directory entries, but I thought creation order still played some role in the order in which they're retrieved. Maybe not. Nearly every UI that lists files does some kind of sorting on display, though, usually alphabetical.
Bottom line: if it's not application-sorted and it's not creation time, then there's nothing you can do.