How to get a file path from an NTFS index number? - C++

I have dwVolumeSerialNumber, nFileIndexHigh, nFileIndexLow values obtained from a call to GetFileInformationByHandle. How can I get the file path from these values?

Because of hard links, there may be multiple paths that map to the given VolumeSerialNumber and FileIndex. To find all such paths:
1. Iterate the volumes to find the one whose root directory matches dwVolumeSerialNumber.
2. Recursively enumerate all directories on the volume, skipping symbolic links and reparse points, to find all files with matching nFileIndexHigh and nFileIndexLow.
This can be quite time-consuming. If you really need to do this as fast as possible and your filesystem is NTFS, you can raw read the entire MFT into a buffer and parse it yourself. This will get all directories that fit inside an MFT entry in one fell swoop. The rest of the directories can be read through the OS or also through raw reads, depending on the amount of work you want to do. But any way you look at it, this is a lot of work and doesn't even apply to FAT, FAT32 or any other filesystem.
A better solution is probably to hang onto the original path if at all possible.
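A minimal sketch of step 1, locating the volume by serial number, assuming Unicode Win32 (the helper name is mine, and error handling is pared down):

    #include <windows.h>
    #include <string>

    // Find the root path (e.g. \\?\Volume{...}\) of the mounted volume whose
    // serial number matches dwVolumeSerialNumber from BY_HANDLE_FILE_INFORMATION.
    // Returns an empty string if no mounted volume matches.
    std::wstring FindVolumeBySerial(DWORD serial)
    {
        wchar_t volumeName[MAX_PATH];
        HANDLE hFind = FindFirstVolumeW(volumeName, MAX_PATH);
        if (hFind == INVALID_HANDLE_VALUE) return L"";

        std::wstring result;
        do {
            DWORD thisSerial = 0;
            // GetVolumeInformationW wants a trailing backslash, which
            // FindFirstVolumeW already provides.
            if (GetVolumeInformationW(volumeName, nullptr, 0, &thisSerial,
                                      nullptr, nullptr, nullptr, 0) &&
                thisSerial == serial) {
                result = volumeName;
                break;
            }
        } while (FindNextVolumeW(hFind, volumeName, MAX_PATH));

        FindVolumeClose(hFind);
        return result;
    }

From there, step 2 would be a standard recursive FindFirstFileW/FindNextFileW walk, opening each file and comparing the nFileIndexHigh/nFileIndexLow pair returned by GetFileInformationByHandle.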

This MSDN article shows how to get the path from a file handle.
You use OpenFileById to open a file given its file ID, but you also need an open handle to some other file on the same volume; I assume this is how the call identifies the volume.
This blog posting raises an interesting issue: you need to pass in 24 for the structure size (worked out by looking at the assembly code).
I leave it as an interesting exercise (I couldn't find an easy answer) how you go from a dwVolumeSerialNumber to a valid handle open on that volume or on a file on that volume, but maybe you already have enough information for your case. One possibility is to iterate all mounted volumes, calling GetVolumeInformation to find the one with the matching serial number.
Note: If you don't have the file open, then you may not be able to rely on the nFileIndexHigh/Low combo (aka the file ID), as described in the BY_HANDLE_FILE_INFORMATION structure notes, which warn that it can change on FAT systems; in the NTFS file system, however, a file keeps the same file ID until it is deleted.
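Putting those pieces together, here is a hedged sketch; it assumes Vista or later (for OpenFileById and GetFinalPathNameByHandleW) and that you already have some open handle on the right volume to pass as the hint:

    #include <windows.h>
    #include <string>

    // Resolve a path from a file ID, given any open handle on the same
    // volume (hint). Error handling is omitted for brevity.
    std::wstring PathFromFileId(HANDLE hint, DWORD indexHigh, DWORD indexLow)
    {
        FILE_ID_DESCRIPTOR desc = {};
        desc.dwSize = sizeof(desc);        // this is the 24 mentioned above
        desc.Type = FileIdType;
        desc.FileId.QuadPart =
            (static_cast<LONGLONG>(indexHigh) << 32) | indexLow;

        HANDLE h = OpenFileById(hint, &desc, GENERIC_READ,
                                FILE_SHARE_READ | FILE_SHARE_WRITE |
                                FILE_SHARE_DELETE,
                                nullptr, 0);
        if (h == INVALID_HANDLE_VALUE) return L"";

        wchar_t buf[MAX_PATH];
        DWORD len = GetFinalPathNameByHandleW(h, buf, MAX_PATH,
                                              FILE_NAME_NORMALIZED);
        CloseHandle(h);
        return (len > 0 && len < MAX_PATH) ? std::wstring(buf, len) : L"";
    }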

Note: The original question had an error in it. Now that the question has been fixed this answer no longer applies.
In general you can't. The information you retrieved just tells you what disk the file is on and how big it is. It does not provide enough information to identify the actual file. Specifically:
dwVolumeSerialNumber identifies the volume, and
nFileSizeHigh and nFileSizeLow give you the size of the file
If the file happens to be the only file on that volume that is that exact size, you could search the volume for a file of that size. But in general this is both expensive and unreliable, so I don't recommend it.

Related

Clusters the file is occupying

I need to get any information about where the file is physically located on an NTFS disk. An absolute offset, a cluster ID... anything.
I need to scan the disk twice: once to get the allocated files, and a second time opening the partition directly in raw mode to try to find the rest of the data (from deleted files). I need a way to recognize that data I find in the raw scan is the same data I've already handled as a file. Since I'm scanning the disk in raw mode, the offset of the data I find could presumably be converted to an offset within a file (given information about the disk geometry). Is there any way to do this? Other solutions are accepted as well.
Now I'm playing with FSCTL_GET_NTFS_FILE_RECORD, but I can't make it work at the moment and I'm not really sure it will help.
UPDATE
I found the following function
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364952(v=vs.85).aspx
It returns a structure that contains the nFileIndexHigh and nFileIndexLow members.
The documentation says:
The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.
I don't really understand what this is. I can't connect it to the physical location of the file. Is it possible to later extract this file ID from the MFT?
UPDATE
Found this:
This identifier and the volume serial number uniquely identify a file. This number can change when the system is restarted or when the file is opened.
This doesn't satisfy my requirements, because I'm going to open the file, and the fact that the ID might change doesn't make me happy.
Any ideas?
Use the Defragmentation IOCTLs. For example, FSCTL_GET_RETRIEVAL_POINTERS will tell you the extents which contain file data.
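For illustration, a rough sketch of that IOCTL; the fixed-size output buffer is a simplification (production code should loop, growing the buffer or advancing the starting VCN while the call fails with ERROR_MORE_DATA):

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    // Print the extents (VCN -> LCN runs) of an already-open file. LCNs are
    // volume-relative cluster numbers; multiply by the cluster size to get a
    // byte offset on the partition.
    void PrintExtents(HANDLE hFile)
    {
        STARTING_VCN_INPUT_BUFFER in = {};        // start from VCN 0
        alignas(8) BYTE outBuf[1024];             // room for a handful of extents
        DWORD bytes = 0;

        if (!DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS,
                             &in, sizeof(in), outBuf, sizeof(outBuf),
                             &bytes, nullptr))
            return;                               // see ERROR_MORE_DATA note above

        auto* rp = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(outBuf);
        LONGLONG vcn = rp->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rp->ExtentCount; ++i) {
            std::printf("VCN %lld..%lld -> LCN %lld\n",
                        vcn, rp->Extents[i].NextVcn.QuadPart - 1,
                        rp->Extents[i].Lcn.QuadPart);
            vcn = rp->Extents[i].NextVcn.QuadPart;
        }
    }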

How exactly "Everything Search" can give me immediately searchable list of 2bln files on my 4TB HDD in less than 10 seconds?

There's a Windows program, "Everything Search" (http://www.voidtools.com/), that reads the file names off an NTFS volume faster than I assume is possible by recursive descent (it reads the filenames of almost 2 billion files on a 4TB HDD in less than 10 seconds).
I know that it probably reads the NTFS folder structure directly off the volume in bulk, and makes sense of it without calling OS filesystem functions.
How exactly can this be done? What system functions should I call to get that information about an NTFS volume that fast, and how can I parse it into file and directory names? Are there any libraries in any language that help with that?
If you are not sure what I am asking, there are more details in my previous question (I was asked to rephrase it): Can I read whole NTFS directory tree into RAM at once?
The NTFS volume has a low-visibility structure it relies on called the Master File Table (MFT). There are APIs for querying this table directly, but they require some privileges to invoke, because you have to get a handle to the volume. The main function to query the master file table is DeviceIoControl and the control code is FSCTL_ENUM_USN_DATA.
The control code appears to be a USN-related code, which is a touch misleading in this particular case, but it gives the basic flavor of the call and the related structures. You get back an enumeration of records that look like USN records, but they're thin wrappers around master file table entries.
Each record has a FileName, an ID, and a parent ID. The FileName is the "local" name of the file or folder; to get the full name, you have to traverse the table structure yourself.
It is lightning fast - way faster than recursing through the file system. You'll get back (and will have to filter out) things that aren't exposed in any of the normal file APIs - things you definitely don't want to expose to users, for example.
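A condensed sketch of that call sequence; it assumes administrator rights, the C: volume, and a recent SDK where the input structure is named MFT_ENUM_DATA_V0. It just dumps IDs and local names; building full paths means joining the records on parent IDs afterwards:

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    int main()
    {
        HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE,
                                  nullptr, OPEN_EXISTING, 0, nullptr);
        if (hVol == INVALID_HANDLE_VALUE) return 1;

        MFT_ENUM_DATA_V0 in = {};            // start at record 0
        in.HighUsn = MAXLONGLONG;            // accept any USN
        alignas(8) BYTE buf[64 * 1024];
        DWORD bytes;

        while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &in, sizeof(in),
                               buf, sizeof(buf), &bytes, nullptr)) {
            // The first 8 bytes of the output say where to resume next call.
            in.StartFileReferenceNumber = *reinterpret_cast<DWORDLONG*>(buf);
            auto* rec = reinterpret_cast<USN_RECORD*>(buf + sizeof(DWORDLONG));
            while (reinterpret_cast<BYTE*>(rec) < buf + bytes) {
                std::wprintf(L"%016llx parent %016llx  %.*ls\n",
                             rec->FileReferenceNumber,
                             rec->ParentFileReferenceNumber,
                             static_cast<int>(rec->FileNameLength / sizeof(WCHAR)),
                             reinterpret_cast<WCHAR*>(
                                 reinterpret_cast<BYTE*>(rec) + rec->FileNameOffset));
                rec = reinterpret_cast<USN_RECORD*>(
                    reinterpret_cast<BYTE*>(rec) + rec->RecordLength);
            }
        }
        CloseHandle(hVol);                   // the loop ends at ERROR_HANDLE_EOF
        return 0;
    }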

Truncating a file in C++

I was writing a program in C++ and wonder if anyone can help me with the situation explained here.
Suppose I have a log file about 30MB in size, and I have copied the last 2MB of it to a buffer within the program.
I delete the file (or clear its contents) and then write the 2MB back to the file.
Everything works fine up to here. But the concern is the cost: I read the last 2MB, clear the whole 30MB file, and then write the 2MB back.
Too much time will be needed in a scenario where I am copying the last 300MB of a 1GB file.
Does anyone have an idea for making this process simpler?
With a large log file, the following points should be considered.
Disk space: Log files are uncompressed plain text and consume large amounts of space. Typical compression reduces the file size by 10:1. However, a file cannot be compressed while it is in use (locked), so a log file must be rotated out of use.
System resources: Opening and closing a file regularly consumes system resources and reduces the performance of the server.
File size: Small files are easier to back up and restore in case of a failure.
I just do not want to copy, clear, and re-write the last specific lines to a file. Just a simpler process... :-)
EDIT: Not making any in-house process to support log rotation.
logrotate is the tool.
I would suggest a slightly different approach:
1. Create a new temporary file.
2. Copy the required data from the original file to the temporary file.
3. Close both files.
4. Delete the original file.
5. Rename the temp file to the same name as the original file.
To improve the performance of the copy, copy the data in chunks; you can play around with the chunk size to find the optimal value.
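A minimal sketch of that approach in portable C++; the temp-file name and the 1 MiB chunk size are arbitrary choices:

    #include <cstdio>
    #include <fstream>
    #include <vector>

    // Keep only the last `keep` bytes of `path`, via a temporary file.
    bool KeepTail(const char* path, long long keep)
    {
        std::ifstream src(path, std::ios::binary | std::ios::ate);
        if (!src) return false;
        long long size = src.tellg();
        src.seekg(size > keep ? size - keep : 0);

        const char* tmp = "logfile.tmp";     // hypothetical temp name
        std::ofstream dst(tmp, std::ios::binary | std::ios::trunc);
        std::vector<char> chunk(1 << 20);    // 1 MiB chunks: the tunable
        while (src) {
            src.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            dst.write(chunk.data(), src.gcount());
        }
        src.close();
        dst.close();

        // Swap the temp file into place. std::rename fails if the target
        // still exists on Windows, hence the remove first.
        return std::remove(path) == 0 && std::rename(tmp, path) == 0;
    }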
If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said: read in the section you want (+), delete/clear the file (as with fopen(..., "wb") or something similar), and write out the bit you want (+).
Anything more complicated requires OS-specific help, and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail-like operation you're requesting.
Such an operation would be difficult to implement, as varying block sizes on filesystems (if the filesystem has a block size) would cause trouble. At best, you'd be limited to cutting on block-size boundaries, and even this would be hairy. This is such a rare case that it is probably why such a procedure is not directly supported.
A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.
If you can control the writing process, what you probably want to do here is to write to the file like a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
Even if you can't control the writing process, if you can at least control what file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program at the end of this named pipe that writes to a circular buffer as discussed.
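For illustration, a bare-bones sketch of the circular-buffer idea; it keeps the write cursor only in memory, whereas a real implementation would persist it (e.g. in a small header) so the log survives restarts:

    #include <cstdio>

    // A fixed-size log file with a wrapping write cursor. Reading the log
    // back in order means: from the cursor to the end, then from the start
    // to the cursor.
    class CircularLog {
        FILE* f_;
        long  cap_, pos_ = 0;
    public:
        CircularLog(const char* path, long cap) : cap_(cap) {
            f_ = std::fopen(path, "r+b");        // reuse an existing file
            if (!f_) f_ = std::fopen(path, "w+b");
        }
        ~CircularLog() { if (f_) std::fclose(f_); }

        void write(const char* data, long n) {
            while (n > 0) {
                long room = cap_ - pos_;
                long take = n < room ? n : room;
                std::fseek(f_, pos_, SEEK_SET);
                std::fwrite(data, 1, take, f_);
                pos_ = (pos_ + take) % cap_;     // wrap at capacity
                data += take;
                n -= take;
            }
        }
    };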

How to get the latest file from a directory

This is specific to creating log files. When I am connecting to a server using my application, it writes the details to a log file. When the log file reaches a specific size, let's say 1MB, I create another file named LOG2.log.
Now, while writing back to a log file, there are two or even more log files, and I want to pick up the latest one. I do not want to traverse through all the files in that directory to pick the file, as this will take processing time. Is there any other way to get the last created log file in the directory?
Your best bet is to rotate log files, which is what gets done in Unix normally (generally via cron.)
One possible implementation is to keep 10 (or however many) old log files around: if your program detects that Log.log is over 1MB, move Log09.log to Log10.log, Log08.log to Log09.log, 7 to 8, 6 to 7, ..., 2 to 3, and then Log.log to Log02.log. Finally, create a new Log.log file and continue recording.
This way you'll always write to Log.log and there's no filesystem mystery. In theory, this approach is scalable to ridiculous numbers of log files (more than you would ever reasonably need) and is more standard than writing to Log3023.log. Plus, one would always know where to find the current log.
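A minimal sketch of that shuffle; the names and the ten-file cap are just the example values from above:

    #include <cstdio>

    // Shift Log09.log -> Log10.log, ..., Log02.log -> Log03.log, then
    // retire the current Log.log to Log02.log. The caller reopens a fresh
    // Log.log afterwards and keeps writing.
    void RotateLogs()
    {
        std::remove("Log10.log");                 // drop the oldest
        for (int i = 9; i >= 2; --i) {
            char from[16], to[16];
            std::snprintf(from, sizeof(from), "Log%02d.log", i);
            std::snprintf(to,   sizeof(to),   "Log%02d.log", i + 1);
            std::rename(from, to);                // missing files fail quietly
        }
        std::rename("Log.log", "Log02.log");
    }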
I believe the answer is "stiff". You have to iterate and find the most recent one yourself, as the OS won't keep indices for each possible sort order around on the off chance someone may want them.
Are you able to modify the server? If so, perhaps introduce a LASTLOG.log file that either contains the name of the latest log file, or the actual contents of it.
Otherwise, Tony's right.. No real way to do it other than iterate through yourself.
How about the elegant:
ls -t | head -n 1
The most efficient way is to use a specialized function that goes through all entries (since neither NTFS nor FAT indexes by time) but ignores what you don't need. For that, call FindFirstFileEx with info level FindExInfoBasic, which skips 8.3 short-name resolution.
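A small sketch of that call, assuming Windows 7 or later for FindExInfoBasic; the search pattern is a hypothetical example:

    #include <windows.h>
    #include <string>

    // Newest file matching `pattern` (e.g. L"C:\\logs\\*.log") by last-write
    // time. FindExInfoBasic skips filling in the 8.3 alternate name.
    std::wstring NewestFile(const wchar_t* pattern)
    {
        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileExW(pattern, FindExInfoBasic, &fd,
                                    FindExSearchNameMatch, nullptr, 0);
        if (h == INVALID_HANDLE_VALUE) return L"";

        std::wstring best;
        FILETIME bestTime = {};
        do {
            if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) &&
                CompareFileTime(&fd.ftLastWriteTime, &bestTime) > 0) {
                bestTime = fd.ftLastWriteTime;
                best = fd.cFileName;
            }
        } while (FindNextFileW(h, &fd));
        FindClose(h);
        return best;
    }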

Out of Core Implementation of a Quadtree

I am trying to build a quadtree data structure (or let's just say a tree) in secondary memory (hard disk).
I have a C++ program to do so, and I use fopen to create the files. Also, I am using tesseral coding, storing each cell in a file named with its corresponding code, all in one directory on the disk.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can create further files manually in that directory, but from C++ it cannot create any further files.
I know about the max limit of inodes on the ext3 filesystem, which is (from Wikipedia) 32,000, but mine is way less than that; also note that I can create files manually on the disk, just not through fopen.
Also, I would really appreciate any ideas regarding the best way to store a very dynamic quadtree on disk (I need the nodes to be in separate files, and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will slow down performance because of following the links on the filesystem to access the file.
Thanks,
Nima
What's the errno value of the failed fopen() call?
Do you keep the files you have created open? If yes, you are most probably exceeding the maximum number of open files per process.
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you've exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
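As a sketch of that layout, here is one way to turn a tesseral cell code into a path, grouping four quadtree levels per directory; the `tree` root and the group size are arbitrary choices:

    #include <string>

    // Map a tesseral cell code (digits 0-3, one per level) to a nested path.
    // With group = 4, a depth-50 code yields at most 13 directory levels and
    // each directory holds at most 4^4 = 256 subdirectories.
    std::string CellPath(const std::string& code, std::size_t group = 4)
    {
        std::string path = "tree";
        for (std::size_t i = 0; i < code.size(); i += group)
            path += "/" + code.substr(i, group);
        return path;             // e.g. "01233021..." -> "tree/0123/3021/..."
    }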
The limitation could come from:
stdio (the C library): at most 256 handles by default. This can be increased to 1024 (in VC, call _setmaxstdio).
The OS kernel's limit on file handles per process (usually 1024).
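A small sketch tying the two together: raise the CRT limit where available, and report errno when fopen still fails (the file name is hypothetical):

    #include <cstdio>
    #include <cerrno>
    #include <cstring>

    int main()
    {
    #ifdef _MSC_VER
        _setmaxstdio(1024);      // raise the CRT FILE* limit (MSVC only)
    #endif
        FILE* f = std::fopen("cell_0123.bin", "wb");
        if (!f) {
            // EMFILE here means the per-process handle limit was hit, which
            // matches the ~1,100-file failure described in the question.
            std::fprintf(stderr, "fopen failed: %s\n", std::strerror(errno));
            return 1;
        }
        std::fclose(f);
        return 0;
    }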