Detecting metadata-only read requests in a Windows filesystem - C++

I'm developing a kind of filesystem driver. All read requests that Windows makes to my filesystem go through the driver implementation.
I would like to distinguish between "normal" read requests and those that only want to get the metadata from the file (Windows reads the first 4 KB of the file and then stops reading).
Does Windows mark these metadata reads in some way? It would be very useful to be able to treat the two kinds of operation differently.
In a typical CreateFile call we have the AccessMode, ShareMode, CreationDisposition and FlagsAndAttributes parameters (each a DWORD); I'm not sure whether it's possible to extract some clue about the requested operation from them.
Thanks for reading :)

I'd advise you to get the SysInternals file monitoring tool. It captures stack traces for each call, and since it understands PDBs it can even show you function names. That should allow you to figure out many details of this particular call.

On rereading, it appears that the question is looking at the wrong place for an optimization. Why not treat every request for the first 4KB as a request for metadata? There is very little harm in that assumption.
An assumption the other way around would be harmful: you'd be doing 100 MB of real I/O when you really only needed 4 KB. But if you do need the 100 MB, a small optimization for the first 4 KB causes at most a one-time small hiccup in an inherently lengthy operation.
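A minimal sketch of that heuristic, assuming a user-mode read callback; the function name and the handler outlined in the trailing comment are hypothetical placeholders for whatever read entry point your driver framework exposes:

    // A heuristic check, not tied to any particular driver framework; the
    // handler sketched in the trailing comment is hypothetical.
    #include <cstddef>
    #include <cstdint>

    constexpr std::uint64_t kMetadataWindow = 4096;  // the 4 KB prefix Explorer typically reads

    bool LooksLikeMetadataRead(std::uint64_t offset, std::size_t length)
    {
        // Any read confined to the first 4 KB is treated as a probable
        // metadata scan; a real data read of a large file will soon follow
        // with offsets beyond this window.
        return offset == 0 && length <= kMetadataWindow;
    }

    // Hypothetical read entry point:
    //   if (LooksLikeMetadataRead(offset, length))
    //       serve the request from a cached/cheap copy of the file header;
    //   else
    //       perform the full (expensive) read.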

It's not Windows itself but Windows Explorer that scans files to extract metadata. Moreover, you will also see reads made to create thumbnails.
Reporting the drive to Windows as a remote/network drive will make Explorer read less information and reduce the load on the file system, but unfortunately there seems to be no way to block such reading completely.
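If the driver runs in kernel mode, one way to present the volume that way is to answer the FileFsDeviceInformation volume query with a network device type; the snippet below is only a sketch of that idea, and user-mode frameworks usually expose an equivalent "network drive" option instead:

    // Sketch only: a hypothetical helper called from the handler for
    // IRP_MJ_QUERY_VOLUME_INFORMATION / FileFsDeviceInformation.
    #include <ntifs.h>

    VOID FillDeviceInfo(PFILE_FS_DEVICE_INFORMATION Info)
    {
        // Present the volume as a remote/network file system so that
        // Explorer throttles its metadata and thumbnail scanning.
        Info->DeviceType = FILE_DEVICE_NETWORK_FILE_SYSTEM;
        Info->Characteristics = FILE_REMOTE_DEVICE;
    }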

Related

Minifilter Driver - Which IRP request should I filter for realtime virus protection?

I looked into the MS example (Scanner File System Minifilter Driver) and noticed that they use only IRP_MJ_CREATE, IRP_MJ_WRITE and IRP_MJ_CLEANUP. Will that be enough for realtime protection?
All file writes go through IRP_MJ_WRITE, so if you scan the file/data in this path, you can be fairly sure that the new file/data does not contain a virus.
But make sure you filter ALL write paths (such as writes to memory-mapped files); it is better, in fact required, to filter paging I/O too.
Once it's assured that the files on disk don't contain viruses, scanning them during reads is redundant.
However, with this approach, if a virus file is already present on the system it will not be caught. But I don't think that is required for realtime protection.
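For illustration, a minimal sketch of how those three callbacks are wired up with the Filter Manager; the Pre*/Post* routine names are placeholders you would implement yourself, and this shows the registration pattern rather than the Scanner sample's exact code:

    #include <fltKernel.h>

    // Placeholder callbacks - the real routines would hand the file
    // contents to a scanning service.
    FLT_PREOP_CALLBACK_STATUS  PreCreate (PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects, PVOID *CompletionContext);
    FLT_PREOP_CALLBACK_STATUS  PreWrite  (PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects, PVOID *CompletionContext);
    FLT_PREOP_CALLBACK_STATUS  PreCleanup(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects, PVOID *CompletionContext);
    FLT_POSTOP_CALLBACK_STATUS PostCreate(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects, PVOID CompletionContext, FLT_POST_OPERATION_FLAGS Flags);

    const FLT_OPERATION_REGISTRATION Callbacks[] = {
        // No FLTFL_OPERATION_REGISTRATION_SKIP_PAGING_IO flag here, so paging
        // writes (e.g. flushes of memory-mapped files) are also seen, as
        // recommended above.
        { IRP_MJ_CREATE,  0, PreCreate,  NULL },
        { IRP_MJ_WRITE,   0, PreWrite,   NULL },
        { IRP_MJ_CLEANUP, 0, PreCleanup, NULL },
        { IRP_MJ_OPERATION_END }
    };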

Get notified about changes to raw data in hard disk sectors - file change notification

I'm trying to make software that backs up my entire hard drive.
I've managed to write code for reading the raw data from hard disk sectors. However, I want to have incremental backups. For that I need to know the changes made to OS settings, file changes, everything.
My question is -
Using FileSystemWatcher and inotify, will I be able to know about every change made to every sector on the hard drive? (OS settings etc.)
I'm coding it in C++ for Linux and Windows.
(I saw this question on Stack Overflow, which gave me some ideas.)
Inotify detects changes while your program is running; I'm guessing that FileSystemWatcher is similar.
One way to solve this is to keep a checksum for each sector or group of sectors; when making a backup, compare the checksums to the list you have and only back up blocks that have changed.
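A sketch of that block-checksum idea, in C++; the FNV-1a hash is only an illustration (a real tool would use a stronger digest such as SHA-256), and the block size and raw device path are arbitrary assumptions:

    #include <cstdint>
    #include <fstream>
    #include <vector>

    constexpr std::size_t kBlockSize = 1 << 20;  // 1 MiB per block (arbitrary choice)

    // FNV-1a, used here only to keep the example short.
    std::uint64_t Fnv1a(const char* data, std::size_t n)
    {
        std::uint64_t h = 1469598103934665603ULL;
        for (std::size_t i = 0; i < n; ++i) {
            h ^= static_cast<unsigned char>(data[i]);
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Hash the raw device block by block; compare the result against the
    // list saved with the previous backup and copy only blocks that differ.
    std::vector<std::uint64_t> ChecksumBlocks(const char* rawDevicePath)
    {
        std::ifstream dev(rawDevicePath, std::ios::binary);
        std::vector<char> buf(kBlockSize);
        std::vector<std::uint64_t> sums;
        while (dev.read(buf.data(), static_cast<std::streamsize>(buf.size())) || dev.gcount() > 0)
            sums.push_back(Fnv1a(buf.data(), static_cast<std::size_t>(dev.gcount())));
        return sums;
    }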
The MS Windows FileSystemWatcher mechanism is more limited than Linux's Inotify, but both probably will do what you need. The Linux mechanism provides (optional) notification for file reads, which causes the "access timestamp" to be updated.
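As a small illustration of the Linux side, inotify can report reads via IN_ACCESS, the notification that accompanies that access-timestamp update; the watched path below is just an example and error handling is omitted:

    #include <sys/inotify.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        int fd = inotify_init1(IN_CLOEXEC);

        // IN_ACCESS reports reads, IN_MODIFY writes, IN_ATTRIB metadata
        // changes. "/home/user/data" is only an example path.
        inotify_add_watch(fd, "/home/user/data", IN_ACCESS | IN_MODIFY | IN_ATTRIB);

        alignas(inotify_event) char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));  // blocks until events arrive
        for (char* p = buf; p < buf + n;) {
            auto* ev = reinterpret_cast<inotify_event*>(p);
            std::printf("mask=0x%x name=%s\n", ev->mask, ev->len ? ev->name : "");
            p += sizeof(inotify_event) + ev->len;
        }
        close(fd);
    }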
However, the weakness from your application's perspective is that file modifications made from system boot until your program gets loaded (and from its unload until shutdown) will not be monitored. Your application might need to look through the modification timestamps of many files to identify changed files, depending on the level of monitoring you are targeting.
Both platforms maintain a timestamp for each file tracking when it was last accessed. If updates to that timestamp are supposed to trigger a backup notification, the Windows mechanism, which lacks such a notification, will behave differently from the Linux one. Windows' mechanism can also drop notifications due to buffer size limitations. Here is a real gem from the documentation:
Note that a FileSystemWatcher does not raise an Error event when an event is missed or when the buffer size is exceeded, due to dependencies with the Windows operating system. To keep from missing events, follow these guidelines:
Increasing the buffer size with the InternalBufferSize property can prevent missing file system change events.
Avoid watching files with long file names. Consider renaming using shorter names.
Keep your event handling code as short as possible.
At least you can control two out of three of these....
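Since the question is about C++, note that the Win32 call underneath FileSystemWatcher is ReadDirectoryChangesW, where the buffer size is simply whatever you pass in. A minimal synchronous sketch; the watched directory is just an example:

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // FILE_FLAG_BACKUP_SEMANTICS is required to open a directory handle.
        HANDLE dir = CreateFileW(L"C:\\watched", FILE_LIST_DIRECTORY,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                                 nullptr, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        if (dir == INVALID_HANDLE_VALUE) return 1;

        alignas(DWORD) static BYTE buffer[64 * 1024];  // generous buffer to reduce dropped events
        DWORD bytes = 0;
        while (ReadDirectoryChangesW(dir, buffer, sizeof(buffer), TRUE,
                                     FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_LAST_WRITE,
                                     &bytes, nullptr, nullptr)) {
            auto* info = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(buffer);
            for (;;) {
                std::wprintf(L"action %lu: %.*ls\n", info->Action,
                             static_cast<int>(info->FileNameLength / sizeof(WCHAR)),
                             info->FileName);
                if (info->NextEntryOffset == 0) break;
                info = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(
                    reinterpret_cast<BYTE*>(info) + info->NextEntryOffset);
            }
        }
        CloseHandle(dir);
    }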

Reading kernel information in Linux with C/C++

It sometimes happens that I need to retrieve some system data like CPU usage, process information, etc., which I commonly find under /proc/.... What I do from C/C++ is read the relevant file in /proc/..., parse it and extract the information. This is quite bothersome and somewhat kernel-version dependent. Is this the correct way to go?
Unfortunately, the Linux kernel doesn't offer any system calls which can be used to retrieve the kind of system information that's exposed via /proc. Your best bet in that case is to keep using that file system.
If it makes you feel any better, tools like top, ps and htop all use the /proc filesystem. You should check out their sources if you're having trouble using it.
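As a minimal example of that approach, here is a sketch that scans /proc/meminfo for a couple of fields; the same read-and-parse pattern applies to /proc/stat, /proc/<pid>/status and friends:

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line)) {
            std::istringstream fields(line);
            std::string key;
            long long kb = 0;
            fields >> key >> kb;
            // Lines look like "MemTotal:       16311544 kB".
            if (key == "MemTotal:" || key == "MemAvailable:")
                std::cout << key << ' ' << kb << " kB\n";
        }
    }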

Device browsing problem

I’m writing file browsing software and I want it to work correctly with all portable devices, such as cameras, smart phones and so on. My program shows thumbnails, so I need to read the content of each file.
Now I’m facing some problems:
With both my photo cameras I can open only one IStream from the device. For every additional stream I get an ERROR_BUSY error. This is inconvenient, as I generate thumbnails in several background threads.
I can open multiple streams from my smart phone, but I cannot seek in those streams! As a workaround I have to copy the entire stream to a temporary file system location and process it there.
I wonder what it depends on. Device file system? Driver implementation? Or anything else?
Those seem like very reasonable restrictions on file access to a peripheral with very limited memory (limited fast volatile memory and code EEPROM are more of a concern than the size of the flash card).
It's not the file system (which is almost universally FAT or FAT32 for these kinds of devices) or even a limitation in the Windows driver (although the limits are probably enforced there to avoid confusing the device), but a limited number of file descriptors in the device's embedded file access code.
As a result, you'll probably have to have workarounds for these and other unsupported driver features.
On a related note, multiple threads usually aren't the right way to do background I/O operations. If your devices support OVERLAPPED operation then you can use that along with events and MsgWaitForMultipleObjects (which replaces PeekMessage or GetMessage in the classic GetMessage/TranslateMessage/DispatchMessage main event loop). By keeping everything on one thread you avoid synchronization issues, most race conditions, and prevent the following problem:
Your customer wants to select and use one of the files on her device, but oh no, the only IStream is being used on a thread reading thumbnails. Too bad, have to wait for that thread to finish its current file.
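A sketch of that single-threaded pattern, assuming one outstanding OVERLAPPED read whose event handle is passed in; the thumbnail-specific steps are left as comments:

    #include <windows.h>

    // ioEvent is the event handle stored in the OVERLAPPED structure of the
    // read currently in flight; all thumbnail work stays on this one thread.
    void RunLoop(HANDLE ioEvent)
    {
        for (;;) {
            DWORD r = MsgWaitForMultipleObjects(1, &ioEvent, FALSE, INFINITE, QS_ALLINPUT);
            if (r == WAIT_OBJECT_0) {
                // The overlapped read completed: call GetOverlappedResult,
                // decode the thumbnail, then issue the next ReadFile with a
                // fresh OVERLAPPED for the following file.
            } else if (r == WAIT_OBJECT_0 + 1) {
                MSG msg;
                while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE)) {
                    if (msg.message == WM_QUIT) return;
                    TranslateMessage(&msg);
                    DispatchMessage(&msg);
                }
            }
        }
    }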

DLL Injection/IPC question

I'm working on a build tool that launches thousands of processes (compiles, links, etc.). It also distributes executables to remote machines so that the build can be run across hundreds of slave machines. I'm implementing DLL injection to monitor the child processes of my build process so that I can see that they opened/closed the resources I expected them to. That way I can tell if my users aren't specifying dependency information correctly.
My question is:
I've got the DLL injection working, but I'm not all that familiar with Windows programming. What would be the best/fastest way to call back to the parent build process with all the millions of file I/O reports that the children will be generating? I've thought about having them write to a non-blocking socket, but I've been wondering whether pipes, shared memory or maybe COM would be better.
First, since you're apparently dealing with communication between machines, not just within one machine, I'd rule out shared memory immediately.
I'd think hard about trying to minimize the amount of data instead of worrying a lot about how fast you can send it. Instead of sending millions of file I/O reports, I'd batch together a few kilobytes of that data (or something on that order) and send a hash of that packet. With a careful choice of packet size, you should be able to reduce your data transmission to the point that you can simply use whatever method you find most convenient, rather than trying to pick the one that's the fastest.
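A sketch of the batching part of that suggestion: the injected DLL accumulates fixed-size report records and only sends once a few kilobytes have been buffered. The record layout and SendBuffer() are hypothetical stand-ins for whatever format and transport you pick:

    #include <cstddef>
    #include <vector>

    struct FileIoReport {        // hypothetical record layout
        unsigned pid;
        unsigned op;             // open / read / write / close
        char     path[256];
    };

    class ReportBatcher {
    public:
        explicit ReportBatcher(std::size_t flushBytes = 8 * 1024) : flushBytes_(flushBytes) {}

        void Add(const FileIoReport& r)
        {
            const char* p = reinterpret_cast<const char*>(&r);
            buf_.insert(buf_.end(), p, p + sizeof(r));
            if (buf_.size() >= flushBytes_)
                Flush();
        }

        void Flush()
        {
            if (buf_.empty()) return;
            SendBuffer(buf_.data(), buf_.size());  // pipe/socket write goes here
            buf_.clear();
        }

    private:
        static void SendBuffer(const char*, std::size_t) {}  // transport stand-in
        std::size_t       flushBytes_;
        std::vector<char> buf_;
    };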
If you stay in the Windows world (none of your machines is Linux or whatever), named pipes are a good choice, because they are fast and can be accessed across machine boundaries (a minimal sketch follows after the list below). I think shared memory is out of the race, because it can't cross a machine boundary. Distributed COM allows you to formulate the contract in IDL, but I think XML messages via pipes are also fine. XML messages have the benefit of being completely independent of the channel: if you need Linux later, you can switch to a TCP/IP transport and send the same XML messages.
Some additional techniques with limitations:
Another forgotten but hot candidate is RPC (remote procedure calls). Lots of Windows services rely on it, but I think RPC is hard to program.
If you are on the same machine and you only need to send some status information, you can register a window message via RegisterWindowMessage() and send messages via SendMessage().
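A minimal sketch of the named-pipe idea on the receiving (parent build process) side; the pipe name, buffer sizes and message mode are illustrative only, and the injected DLLs would connect as ordinary clients via CreateFile (locally as \\.\pipe\..., from remote machines as \\server\pipe\...):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // The parent build process owns the pipe and reads report batches.
        HANDLE pipe = CreateNamedPipeW(L"\\\\.\\pipe\\BuildIoReports",
                                       PIPE_ACCESS_INBOUND,
                                       PIPE_TYPE_MESSAGE | PIPE_READMODE_MESSAGE | PIPE_WAIT,
                                       PIPE_UNLIMITED_INSTANCES,
                                       64 * 1024, 64 * 1024, 0, nullptr);
        if (pipe == INVALID_HANDLE_VALUE) return 1;

        if (ConnectNamedPipe(pipe, nullptr) || GetLastError() == ERROR_PIPE_CONNECTED) {
            static char buf[64 * 1024];
            DWORD read = 0;
            while (ReadFile(pipe, buf, sizeof(buf), &read, nullptr))
                std::printf("received %lu bytes of I/O reports\n", read);
        }
        CloseHandle(pipe);
    }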
Apart from all the suggestions from thomas, you might also just use a common database to store the results. And if that is too slow, use one of the more modern (and fast) key/value databases (like Tokyo Cabinet, memcachedb, etc.).
This sounds like a lot of overkill for the task of verifying the files used in a build. How about just scanning the build files, or capturing the output from the build tools?