How to compare 2 volumes and list modified files? - c++

I have 2 hard-disk volumes (one is a backup image of the other). I want to compare the volumes and list all the modified files, so that the user can select the ones he/she wants to roll back.
Currently I'm recursing through the new volume and comparing each file's time-stamps to the old volume's files (if they are in the old volume). Obviously this is a brute-force approach: it's time-consuming and error-prone!
Is there an efficient way to do it?
EDIT:
- I'm using FindFirstFile and friends to recurse the volume and gather info on each file (not very slow, just a few minutes).
- I'm using Volume Shadow Copy to backup.
- The backup-volume is remote so I cannot continuously monitor the actual volume.

Part of this depends upon how the two volumes are duplicated; if they are 'true' copies from the file system's point of view (e.g. shadow copies or other block-level copies), you can do a few tricky little things with respect to USN, which is the general technology others are suggesting you look into. You might want to look at an API like FSCTL_READ_FILE_USN_DATA, for example. That API will let you compare two different copies of a file (again, assuming they are the same file with the same file reference number from block-level backups). If you wanted to be largely stateless, this and similar APIs would help you a lot here. My algorithm would look something like this:
foreach (file in backup_volume) {
    file_still_exists = try_open_by_id(modified_volume)
    if (file_still_exists) {
        usn_result = compare_usn_values_of_files(file, file_in_modified_volume)
        if (usn_result == equal_to) {
            // file hasn't changed at all
        } else {
            // file has changed (somehow)
        }
    } else {
        // file was deleted (possibly deleted and recreated)
    }
}
// we still don't know about files new in modified_volume
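For reference, here is a minimal, hedged sketch (my own illustration, not part of the answer above) of fetching a single file's current USN with FSCTL_READ_FILE_USN_DATA via DeviceIoControl; GetFileUsn is a hypothetical helper name:

#include <windows.h>
#include <winioctl.h>

// Returns the file's current USN, or 0 on failure (error handling kept minimal).
USN GetFileUsn(const wchar_t* path)
{
    HANDLE hFile = CreateFileW(path, GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               nullptr, OPEN_EXISTING,
                               FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (hFile == INVALID_HANDLE_VALUE)
        return 0;

    // Room for a USN_RECORD plus its variable-length file name.
    alignas(8) BYTE buffer[sizeof(USN_RECORD) + MAX_PATH * sizeof(WCHAR)];
    DWORD bytesReturned = 0;
    USN usn = 0;
    if (DeviceIoControl(hFile, FSCTL_READ_FILE_USN_DATA,
                        nullptr, 0,                 // no input buffer required
                        buffer, sizeof(buffer),
                        &bytesReturned, nullptr))
    {
        usn = reinterpret_cast<const USN_RECORD*>(buffer)->Usn;
    }
    CloseHandle(hFile);
    return usn;
}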
All of that said, my experience leads me to believe that this will be more complicated than my off-the-cuff explanation hints at. This might be a good starting place, though.
If the volumes are not block-level copies of one another, then it will be very difficult to compare USN numbers and file IDs, if not impossible. Instead, you may very well be going by file name, which will be difficult if not impossible to do without opening every file (times can be modified by apps, sizes and times can be out of date in the findfirst/next queries, and you have to handle deleted-then-recreated cases, rename cases, etc.).
So knowing how much control you have over the environment is pretty important.

Instead of waiting until after changes have happened, and then scanning the whole disk to find the (usually few) files that have changed, I'd set up a program to use ReadDirectoryChangesW to monitor changes as they happen. This will let you build a list of files with a minimum of fuss and bother.
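As a rough sketch (my own, not the answer's code) of the synchronous form of that API, assuming the directory is opened with FILE_LIST_DIRECTORY and FILE_FLAG_BACKUP_SEMANTICS:

#include <windows.h>
#include <cwchar>

void WatchDirectory(const wchar_t* dir)
{
    HANDLE hDir = CreateFileW(dir, FILE_LIST_DIRECTORY,
                              FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                              nullptr, OPEN_EXISTING,
                              FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (hDir == INVALID_HANDLE_VALUE)
        return;

    alignas(DWORD) BYTE buffer[64 * 1024];
    DWORD bytesReturned = 0;
    // Blocks until at least one change is reported, then walks the records.
    while (ReadDirectoryChangesW(hDir, buffer, sizeof(buffer),
                                 TRUE,  // watch the whole subtree
                                 FILE_NOTIFY_CHANGE_FILE_NAME |
                                 FILE_NOTIFY_CHANGE_LAST_WRITE |
                                 FILE_NOTIFY_CHANGE_SIZE,
                                 &bytesReturned, nullptr, nullptr))
    {
        const FILE_NOTIFY_INFORMATION* info =
            reinterpret_cast<const FILE_NOTIFY_INFORMATION*>(buffer);
        for (;;)
        {
            wprintf(L"change: %.*ls\n",
                    static_cast<int>(info->FileNameLength / sizeof(WCHAR)),
                    info->FileName);
            if (info->NextEntryOffset == 0)
                break;
            info = reinterpret_cast<const FILE_NOTIFY_INFORMATION*>(
                reinterpret_cast<const BYTE*>(info) + info->NextEntryOffset);
        }
    }
    CloseHandle(hDir);
}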

Assuming you're not comparing each file on the new volume to every file in the snapshot, that's the only way you can do it. How are you going to find which files aren't modified without looking at all of them?

I am not a Windows programmer.
However, shouldn't you have a stat function to retrieve the modified time of a file?
Sort the files based on modification time.
The files whose modification time is greater than your last backup time are the ones of interest.
For the first run you can iterate over the backup volume to figure out the maximum modification and creation times of your set of interest.
I am assuming the directories of interest don't get modified in the backup volume.
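A minimal sketch of that idea (POSIX-flavoured, as this answer assumes; files_modified_since and last_backup_time are hypothetical names):

#include <sys/stat.h>
#include <algorithm>
#include <ctime>
#include <string>
#include <utility>
#include <vector>

// Returns (mtime, path) pairs for files changed after last_backup_time,
// sorted by modification time as suggested above.
std::vector<std::pair<time_t, std::string>>
files_modified_since(const std::vector<std::string>& paths, time_t last_backup_time)
{
    std::vector<std::pair<time_t, std::string>> modified;
    for (const std::string& p : paths) {
        struct stat st;
        if (stat(p.c_str(), &st) == 0 && st.st_mtime > last_backup_time)
            modified.emplace_back(st.st_mtime, p);   // changed since the last backup
    }
    std::sort(modified.begin(), modified.end());     // oldest change first
    return modified;
}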

Without knowing more details about what you're trying to do here, it's hard to say. However, some tips about what I think you're trying to achieve:
If you're only concerned about NTFS volumes, I suggest looking into the USN / change journal APIs. They have been around since Windows 2000. This way, after the initial inventory you only need to look at changes from that point on. A good starting point for this, though a very old article, is here: http://www.microsoft.com/msj/0999/journal/journal.aspx
Also, utilizing the USN APIs, you could omit the hash step and just record information from the journal yourself (this will become clearer when/if you look into said APIs).
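A hedged sketch of reading the change journal (my own illustration; it assumes a recent Windows SDK where the _V0 structure names exist, administrative rights to open the volume, and an active journal):

#include <windows.h>
#include <winioctl.h>
#include <cwchar>

// Prints every record in the volume's change journal from startUsn onward.
void DumpChangeJournal(const wchar_t* volume /* e.g. L"\\\\.\\C:" */, USN startUsn)
{
    HANDLE hVol = CreateFileW(volume, GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              nullptr, OPEN_EXISTING, 0, nullptr);
    if (hVol == INVALID_HANDLE_VALUE)
        return;

    USN_JOURNAL_DATA_V0 journal = {};
    DWORD bytes = 0;
    if (DeviceIoControl(hVol, FSCTL_QUERY_USN_JOURNAL, nullptr, 0,
                        &journal, sizeof(journal), &bytes, nullptr))
    {
        READ_USN_JOURNAL_DATA_V0 readData = {};
        readData.StartUsn = startUsn;            // in practice: the USN saved at backup time
        readData.ReasonMask = 0xFFFFFFFF;        // all change reasons
        readData.UsnJournalID = journal.UsnJournalID;

        alignas(8) BYTE buffer[64 * 1024];
        if (DeviceIoControl(hVol, FSCTL_READ_USN_JOURNAL, &readData, sizeof(readData),
                            buffer, sizeof(buffer), &bytes, nullptr))
        {
            // The first 8 bytes are the USN to resume from; the records follow.
            const BYTE* p = buffer + sizeof(USN);
            while (p < buffer + bytes)
            {
                const USN_RECORD* rec = reinterpret_cast<const USN_RECORD*>(p);
                wprintf(L"%.*ls (reason 0x%08lx)\n",
                        static_cast<int>(rec->FileNameLength / sizeof(WCHAR)),
                        reinterpret_cast<const WCHAR*>(
                            reinterpret_cast<const BYTE*>(rec) + rec->FileNameOffset),
                        rec->Reason);
                p += rec->RecordLength;
            }
        }
    }
    CloseHandle(hVol);
}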
The first time through comparing a drive's contents, utilize a hash such as SHA-1 or MD5.
Store hashes and other such information in a database of some sort. For example, SQLite3. Note that this can take up a huge amount of space itself. A quick look at my audio folder with 40k+ files would result in ~750 megs of MD5 information.
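If you go the database route, a hedged sketch of recording one entry with the sqlite3 C API might look like this (the files table and its column names are hypothetical):

#include <sqlite3.h>

// Inserts or updates one row in a hypothetical
// files(path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER, hash TEXT) table.
bool store_entry(sqlite3* db, const char* path, long long size,
                 long long mtime, const char* hash)
{
    const char* sql =
        "INSERT OR REPLACE INTO files(path, size, mtime, hash) VALUES(?,?,?,?);";
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return false;
    sqlite3_bind_text(stmt, 1, path, -1, SQLITE_TRANSIENT);
    sqlite3_bind_int64(stmt, 2, size);
    sqlite3_bind_int64(stmt, 3, mtime);
    sqlite3_bind_text(stmt, 4, hash, -1, SQLITE_TRANSIENT);
    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}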

Related

FatFS - can I create multiple seek locations?

I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; the number of buffers depends on how fast I acquire the data: log a batch of one buffer ... do other stuff ... log a batch of five buffers ... do other stuff ... etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_seek to ping-pong between header-location and data-location?
I am open to modifying FatFS.
The problem, at the low-level, is simpler because the header only shares one 512 byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_seek to ping-pong between header-location and data-location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
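A minimal sketch of that wrapper (not part of FatFS; it assumes only the standard f_tell, f_lseek and f_write calls, and that the header lives at offset 0 as described in the question):

#include "ff.h"

/* Writes 'buff' at the start of the file (where the 24-byte header lives),
   then restores the previous file position so data writes can continue. */
FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw)
{
    FSIZE_t saved_pos = f_tell(fp);        /* remember the current data position */

    FRESULT res = f_lseek(fp, 0);          /* jump to the header at offset 0 */
    if (res != FR_OK) return res;

    res = f_write(fp, buff, btw, bw);      /* rewrite the header */
    if (res != FR_OK) return res;
    if (*bw != btw) return FR_DISK_ERR;    /* treat a short write as an error */

    return f_lseek(fp, saved_pos);         /* go back to where the data ends */
}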
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly: you need to save the buffer and the header (footer?), update the directory entry to reflect the new file size, and update the file allocation table to account for the sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them, such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in the previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).

Remove beginning of file without rewriting the whole file

I have an embedded Linux system, that stores data in a very large file, appending new data to the end. As the file size grows near filling available storage space, I need to remove oldest data.
Problem is, I can't really accept the disruption it would take to move the massive bulk of data "up" the file, like normal - lock the file for an extended period of time just to rewrite it (plus this being a flash medium, it would cause unnecessary wear to the flash).
Probably the easiest way would be to split the file into multiple smaller ones, but this has several downsides related to how the data is handled and processed - all the 'client end' software expects single file. OTOH it can handle 'corruption' of having the first record cut in half, so the file doesn't need to be trimmed at record offsets, just 'somewhere up there', e.g. first few iNodes freed. Oldest data is obsolete anyway so even more severe corruption of the beginning of the file is completely acceptable, as long as the 'tail' remains clean, and liberties can be taken with how much exactly is removed - 'roughly several first megabytes' is okay, no need for 'first 4096KB exactly' precision.
Is there some method, API, trick, hack to truncate beginning of file like that?
You can achieve this with Linux kernel v3.15 or above on an ext4/xfs file system.
int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096);
See here: Truncating the first 100MB of a file in linux
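Expanding that one-line call into a self-contained, hedged sketch (assumptions: Linux >= 3.15, ext4/xfs, a 4096-byte filesystem block size, and drop_file_head is my own helper name):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

// Drops (roughly) the first bytes_to_drop bytes of the file in place.
// Assumes a 4096-byte block size; COLLAPSE_RANGE demands block alignment.
int drop_file_head(const char *path, off_t bytes_to_drop)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    off_t len = bytes_to_drop & ~(off_t)4095;   // round down to a block boundary
    if (len == 0) { close(fd); return 0; }      // nothing block-aligned to drop

    int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
    if (ret != 0) perror("fallocate");
    close(fd);
    return ret;
}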
The easiest solution for your old applications would be a FUSE filesystem which gives them access to the underlying file, but with the offset cyclically shifted. This would allow you to implement a ringbuffer at the physical level. The FUSE layer would be fairly trivial as it only needs to adjust all filepositions by a constant, modulo filesize.
What about setting up a separate process that renames the output file when it reaches a predefined size (for instance by appending the Linux epoch time to the file name)?
This would allow you to keep the old data and the main process will recreate the output file the next time it writes to it.
Another cron job may remove the old file every now and then.

FAT, optimize performance when retrieving a file

I have a database implementation with one file per record, and I have about 10000 records.
I'm trying to optimize the performance of file access, and I have a little doubt.
Is splitting the files into folders better than keeping them all in a single folder, for quick access to the files? E.g. files 0 to 999 in folder 0, 1000 to 1999 in the next folder, etc.
What is better for this, FAT16 or FAT32?
If you are accessing the files directly, then you won't have any performance drop. If you are searching for a particular file on the disk, it would be faster to store them in folders. This way folders would emulate DB indexes. But as @blow mentioned, why don't you use something like SQLite?
When you retrieve a file by filename you most likely do a linear search in the directory containing that file, you skip all directory entries until you find the one that matches the given filename.
This search operation may be slow if you do it every time for every file, there are many files in the directory and reads are slow (if your CPU is slow you lose even more).
You may want to build some sort of an index, a compact array of pairs filename+location sorted by filename, which you can keep in memory to quickly find files w/o rereading the directory entries.
Things can be greatly simplified if there's a constant number of files and they have the same length or are padded to the same length. In that case you don't need any search as you can calculate the location of each file directly from the filename, provided, of course, that the order of the files is fixed.
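A minimal sketch of such an in-memory index (my own illustration; the entry layout is hypothetical): a vector kept sorted by file name and searched with std::lower_bound.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::string name;       // file name of the record
    std::uint32_t location; // where its data starts on the medium
};

// Built once by scanning the directory and kept sorted by name.
std::vector<IndexEntry> g_index;

const IndexEntry* find_record(const std::string& name)
{
    auto it = std::lower_bound(g_index.begin(), g_index.end(), name,
                               [](const IndexEntry& e, const std::string& n) {
                                   return e.name < n;
                               });
    if (it != g_index.end() && it->name == name)
        return &*it;
    return nullptr;         // not found
}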
The only practical difference between FAT1x and FAT32 in this context is the size of the file allocation table, that set of linked lists/chains that tells you which clusters are free or occupied by file/directory data and tells you which cluster is the next in a file/directory after the given one. In FAT32, the cluster chain elements are 32-bit, 2 times larger than on FAT16. If the number of used clusters is small (less than ~64K), you are going to read twice as much data from FAT32 while traversing the cluster chains compared with FAT16. Also, finding a free cluster on FAT32 (when you create a new file/dir or grow an existing one) can be slow if there are many clusters on the disk (and there can be up to 2^28 on FAT32 AFAIR vs 2^16 of FAT16). You don't want to start searching for a free cluster from the beginning of the FAT every time. You want to keep somewhere a pointer to the last place you stopped the search and the next time search from there and then go to the beginning of the FAT when you've reached the FAT's end.
Split them across directories (the split count depending on your cluster size) and do not use LFN (Long File Name) entries if you can, because they will slow down your operation. I also work on embedded systems. I did not have to access thousands of files like you, but I avoided LFN (especially for royalty reasons).

What portable data backends are there which have fast append and random access?

I'm working on a Qt GUI for visualizing 'live' data which is received via a TCP/IP connection. The issue is that the data is arriving rather quickly (a few dozen MB per second) - it's coming in faster than I'm able to visualize it even though I don't do any fancy visualization - I just show the data in a QTableView object.
As if that's not enough, the GUI also allows pressing a 'Freeze' button which will suspend updating the GUI (but it will keep receiving data in the background). As soon as the Freeze option was disabled, the data which has been accumulated in the background should be visualized.
What I'm wondering is: since the data is coming in so quickly, I can't possibly hold all of it in the memory. The customer might even keep the GUI running over night, so gigabytes of data will accumulate. What's a good data storage system for writing this data to disk? It should have the following properties:
It shouldn't be too much work to use it on a desktop system
It should be fast at appending new data at the end. I never need to touch previously written data anymore, so writing into anywhere but the end is not needed.
It should be possible to randomly access records in the data. This is because scrolling around in my GUI will make it necessary to quickly display the N to N+20 (or whatever the height of my table is) entries in the data stream.
The data which is coming in can be separated into records, but unfortunately the records don't have a fixed size. I'd rather not impose a maximum size on them (at least not if it's possible to get good performance without doing so).
Maybe some SQL database, or something like CouchDB? It would be great if somebody could share his experience with such scenarios.
I think that sqlite might do the trick. It seems to be fast. Unfortunately, I have no data flow like yours, but it works well as a backend for a log recorder. I have a GUI where you can view the n, n+k logs.
You can also try SOCI as a C++ database access API; it seems to work fine with sqlite (I have not used it yet but plan to).
my2c
I would recommend a simple file based solution.
If you can use fixed-size records: if you get the data continuously with a constant sample rate, random access to the data is easy and very fast when you know the time stamp of the first data point and the sample rate. If the sample rate varies, then write a time stamp with each data point. Now random access requires a binary search, but it is still fast enough.
If you have variable-size records: write the variable-size data to one file, and to another file write indexes (which are fixed size) into the data file. And if the sample rate varies, write time stamps too. Now you can do fast random access using the index file.
If you are using Qt to implement this kind of solution, you need two sets of QFile and QDataStream instances, one for writing and one for reading.
And a note about performance: don't flush the file after every data point write. But remember to flush the file before doing any random access to it.
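A rough sketch of that data-file-plus-index-file layout using QFile and QDataStream (my own illustration; both files are assumed to be open in ReadWrite mode, and the record framing is hypothetical):

#include <QByteArray>
#include <QDataStream>
#include <QFile>

// Appends one variable-size record to the data file and its start offset
// (one qint64 per record) to the index file.
void appendRecord(QFile& dataFile, QFile& indexFile, const QByteArray& record)
{
    qint64 offset = dataFile.size();
    dataFile.seek(offset);
    dataFile.write(record);

    QDataStream idx(&indexFile);
    indexFile.seek(indexFile.size());
    idx << offset;
}

// Random access: record n starts at index[n] and ends at index[n+1]
// (or at the end of the data file for the last record).
QByteArray readRecord(QFile& dataFile, QFile& indexFile, qint64 n)
{
    QDataStream idx(&indexFile);
    indexFile.seek(n * qint64(sizeof(qint64)));
    qint64 start = 0, end = 0;
    idx >> start;
    if ((n + 1) * qint64(sizeof(qint64)) < indexFile.size())
        idx >> end;
    else
        end = dataFile.size();

    dataFile.seek(start);
    return dataFile.read(end - start);
}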

What's the best way to write to more files than the kernel allows open at a time?

I have a very large binary file and I need to create separate files based on the id within the input file. There are 146 output files and I am using cstdlib and fopen and fwrite. FOPEN_MAX is 20, so I can't keep all 146 output files open at the same time. I also want to minimize the number of times I open and close an output file.
How can I write to the output files effectively?
I also must use the cstdlib library due to legacy code.
The executable must also be UNIX and windows cross-platform compatible.
A couple possible approaches you might take:
keep a cache of opened output file handles that's smaller than FOPEN_MAX - if a write needs to occur on a file that's already open, then just do the write. Otherwise, close one of the handles in the cache and open the output file. If the data for a particular set of files tends to be clumped together in the input file, this should work nicely with an LRU policy for the file-handle cache (see the sketch after this answer).
Handle the output buffering yourself instead of letting the library do it for you: keep your own set of 146 (or however many you might need) output buffers and buffer the output to those, and perform an open/flush/close when a particular output buffer gets filled. You could even combine this with the above approach to really minimize the open/close operations.
Just be sure you test well for the edge conditions that can happen on filling or nearly filling an output buffer.
It may also be worth scanning the input file, making a list of each output id and sorting it so that you write all the file1 entries first, then all the file2 entries etc..
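A hedged sketch of the file-handle cache from the first approach (the class and its eviction policy are my own; set the capacity safely below FOPEN_MAX):

#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class FileHandleCache {
public:
    explicit FileHandleCache(std::size_t capacity) : capacity_(capacity) {}

    ~FileHandleCache() {
        for (auto& entry : lru_) std::fclose(entry.second);
    }

    // Returns an open handle for 'path' (nullptr if fopen fails), evicting the
    // least recently used handle when the cache is full.
    std::FILE* get(const std::string& path) {
        auto it = map_.find(path);
        if (it != map_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);   // mark as most recent
            return it->second->second;
        }
        if (lru_.size() >= capacity_) {                    // evict the oldest handle
            std::fclose(lru_.back().second);
            map_.erase(lru_.back().first);
            lru_.pop_back();
        }
        std::FILE* f = std::fopen(path.c_str(), "ab");     // append: reopening loses nothing
        if (!f) return nullptr;
        lru_.emplace_front(path, f);
        map_[path] = lru_.begin();
        return f;
    }

private:
    using Entry = std::pair<std::string, std::FILE*>;
    std::size_t capacity_;
    std::list<Entry> lru_;
    std::unordered_map<std::string, std::list<Entry>::iterator> map_;
};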
If you cannot increase the max FOPEN_MAX somehow, you can create a simple queue of requests and then close and re-open files as needed.
You can also keep track of the last write-time for each file, and try to keep the most recently written files open.
The solution seems obvious - open N files, where N is somewhat less than FOPEN_MAX. Then read through the input file and extract the contents of the first N output files. Then close the output files, rewind the input, and repeat.
First of all, I hope you are running as much in parallel as possible. There is no reason why you can't write to multiple files at the same time. I'd recommend doing what thomask said and queue requests. You can then use some thread synchronization to wait until the entire queue is flushed before allowing the next round of writes to go through.
You haven't mentioned if it's critical to write to these outputs in "real-time", or how much data is being written. Subject to your constraints, one option might be to buffer all the outputs and write them at the end of your software run.
A variant of this is to setup internal buffers of a fixed size, once you hit the internal buffer limit, open the file, append, and close, then clear the buffer for more output. The buffers reduce the number of open/close cycles and give you bursts of writes which the file system is usually setup to handle nicely. This would be for cases where you need somewhat real-time writes, and/or data is bigger than available memory, and file handles exceed some max in your system.
You can do it in 2 steps.
1) Write the first 19 ids to one file, the next 19 ids to the next file and so on. So you need 8 output files (and the input file) opened in parallel for this step.
2) For every file so created, create 19 new files (only 13 for the last one) and write the ids to them.
Independent of how large the input file is and how many id-datasets it contains, you always need to open and close 163 files. But you need to write the data twice, so it may only worth it, if the id-datasets are really small and randomly distributed.
I think in most cases it is more efficient to open and close the files more often.
The safest method is to open a file and flush after writing, then close if no more recent writing will take place. Many things outside your program's control can corrupt the content of your file. Keep this in mind as you read on.
I suggest keeping an std::map or std::vector of FILE pointers. The map allows you to access file pointers by an ID. If the ID range is small, you could create a vector, reserving elements, and using the ID as an index. This will allow you to keep a lot of files open at the same time. Beware the concept of data corruption.
The limit on simultaneously open files is set by the operating system. For example, if your OS has a maximum of 10, you will have to make arrangements when the 11th file is requested.
Another trick is reserve buffers in dynamic memory for each file. When all the data is processed, open a file (or more than one), write the buffer (using one fwrite), close and move on. This may be faster since you are writing to memory during the data processing rather than a file. An interesting side note is that your OS may also page the buffers to the hard drive as well. The size and quantities of buffers is an optimization issue that is platform dependent (you'll have to adjust and test to get a good combination). Your program will slow down if the OS pages the memory to the disk.
Well, if I was writing it with your listed constraints in the OP, I would create 146 buffers and plop the data into them, then at the end, sequentially walk through the buffers and close/open a single file-handle.
You mentioned in a comment that speed was a major concern and that the naive approach is too slow.
There are a few things that you can start considering. One is a reorganizing of the binary file into sequential strips, which would allow parallel operations. Another is a least-recently used approach to your filehandle collection. Another approach might be to fork out to 8 different processes, each outputting to 19-20 files.
Some of these approaches will be more or less practical to write depending on the binary organization (highly fragmented vs. highly sequential).
A major constraint is the size of your binary data. Is it bigger than cache? bigger than memory? streamed out of a tape deck? Continually coming off a sensor stream and only existing as a 'file' in memory? Each of those presents a different optimization strategy...
Another question is usage patterns. Are you doing occasional spike writes to the files, or are you having massive chunks written only a few times? That determines the effectiveness of the different caching/paging strategies of filehandles.
Assuming you are on a *nix system, the limit is per process, not system-wide. So that implies you could launch multiple processes, each responsible for a subset of the id's you are filtering for. Each could keep within the FOPEN_MAX for its process.
You could have one parent process reading the input file then sending the data to various 'write' processes through pipe special files.
"Fewest File Opens" Strategy:
To achieve a minimum number of file opens and closes, you will have to read through the input multiple times. Each time, you pick a subset of the ids that need sorting, and you extract only those records into the output files.
Pseudocode for each thread:
Run through the file, collect all the unique ids.
fseek() back to the beginning of the input.
For every group of 19 IDs:
Open a file for each ID.
Run through the input file, appending matching records to the corresponding output file.
Close this group of 19 output files.
fseek() to the beginning of the input.
This method doesn't work quite as nicely with multiple threads, because eventually the threads will be reading totally different parts of the file. When that happens, it's difficult for the file cache to be efficient. You could use barriers to keep the threads more-or-less in lock-step.
"Fewest File Operations" Strategy
You could use multiple threads and a large buffer pool to make only one run-through of the input. This comes at the expense of more file opens and closes (probably). Each thread would, until the whole file was sorted:
Choose the next unread page of the input.
Sort that input into 2-page buffers, one buffer for each output file. Whenever one buffer page is full:
Mark the page as unavailable.
If this page has the lowest page-counter value, append it to the file using fwrite(). If not, wait until it is the lowest (hopefully, this doesn't happen much).
Mark the page as available, and give it the next page number.
You could change the unit of flushing output files to disk. Maybe you have enough RAM to collect 200 pages at a time, per output file?
Things to be careful about:
Is your data page-aligned? If not, you'll have to be clever about reading "the next page".
Make sure you don't have two threads fwrite()'ing to the same output file at the same time. If that happens, you might corrupt one of the pages.