Mahout - FileDataModel: Delete file(s) after refresh? - refresh

I use the FileDataModel as the DataModel for Recommendations in Mahout. I first generate the base file (e.g. prefs.txt). From time to time, there are some changes, which are written to update files (prefs.1.txt, prefs.2.txt, ...).
Am I allowed to delete the old update-Files after loading them to the model? When I try deleting them (in Windows), the Explorer say's that the file is currently used by Java. Why is it the case that it's not allowed to delete the original file? I believe that the data is now stored in memory and thus Mahout doesn't need the file anymore.

It doesn't re-read the old files, only the new update data. However it does expect that the "main" data file is always there since it keeps looking for its last modified time.
The general idea is that you sometimes push a full copy of the data file, and in between, more frequently, push small update files. If you do that it should work as expected, and, yes, you can delete update files once they've been read.
(OF course, if you rebooted your server, it would have to start over from the last main data file and whatever update files were left, which could be incomplete or inconsistent. I'd only delete updates after you've pushed a new main data file.)
I don't know why you can't delete them as they are never held open. Maybe it's a weird Windows thing.

Related

Fastest way to erase part of file in C++

I wonder which is the fastest way to erase part of a file in c++.
I know the way of write a second file and skip the part you want. But i think is slow when you work with big files.
What about database system, how they remove records so fast?
A database keeps an index, with metadata listing which parts of the file are valid and which aren't. To delete data, just the index is updated to mark that section invalid, and the main file content doesn't have to be changed at all.
Database systems typically just mark deleted records as deleted, without physically recovering the unused space. They may later reuse the space occupied by deleted records. That's why they can delete parts of a database quickly.
The ability to quickly delete a portion of a file depends on the portion of the file you wish to delete. If the portion of the file that you are deleting is at the end of the file, you can simply truncate the file, using OS calls.
Deleting a portion of a file from the middle is potentially time consuming. Your choice is to either move the remainder of the file forward, or to copy the entire file to a new location, skipping the deleted portion. Either way could be time consuming for a large file.
The fastest way I know is to open data file as a Persisted memory-mapped file and simple move over the part you don't need. Would be faster than moving to second file but still not too fast with big files.

Writing to the middle of the file (without overwriting data)

In windows is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by other applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiences which would speed things up. You'd probably still have to rewrite to the end of the file but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a complete unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in fact, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
Id could be data offset in file.
At the start of the file you could
have a header with a list of ids so
that you can read records in
order.
At the end of 'list of ids' you could point to another location in the file (and id/offset) that stores another list of ids
Something similar to filesystem.
To add new data you append them at the end and update index (add id to the list).
You have to figure out how to handle delete record and update.
If records are of the same size then to delete you can just mark it empty and next time reuse it with appropriate updates to index table.
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If using .NET 4 try a memory-mapped file if you have an editor-like application - might jsut be the ticket. Something like this (I didn't type it into VS so not sure if I got the syntax right):
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
new FileStream(#"C:\bigfile.dat", FileMode.Create),
"BigFileMemMapped",
1024 * 1024,
MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = MemoryMapped.CreateViewAccessor();
int offset = 1000000000;
view.Write<ObjectType>(offset, ref MyObject);
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternative data stream. Non-trivial of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to both insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP but I personally landed here searching for a Linix-specific answer. (There is no "Windows" word in the question, so web search engine saw no problem with sending me here.

Writing to an xml file with xmllite?

I have an xml file which holds a set of "game" nodes (which contain details about saved gameplay, as you'd save your game on any console game). All of this is contained within a "games" root node. I'm implementing save functionality to this xml file and wish to be able to append or overwrite a "game" node and its child nodes within the "games" root node.
How can this be accomplished with xmllite.dll?
You can't physicaly "rewrite in place" any text file (including an XML file) except in the rare case you can guarantee that you're overwriting exactly as many bytes as were there. What you always need to do is to write a new file (which has parts from the old one and parts that are new), then rename the old file (e.g. add a .bak extension to it, after removing any older .bak that might have been left hanging around), rename the new file to the old name, and only at this point remove the old file. This approach guarantees that a computer or disk crash in the middle of your work won't be a disaster -- either the old or the new data will be around (at worst you'll need a rename if the crash is just between the two renames).
To write a new file, with mods and lots from the old one, in xmlfile, use the reader functionality documented here and the writer functionality documented here. For a small file, you can first build a tree of objects in-memory via the reader, then write it all out via the writer; but that can take a lot of memory. The alternative is an incremental parsing approach such as the one the MSDN docs call a "pull programming model".

Atomic delete for large amounts of files

I am trying to delete 10000+ files at once, atomically e.g. either all need to be deleted at once, or all need to stay in place.
Of course, the obvious answer is to move all the files into a temporary directory, and delete it recursively on success, but that doubles the amount of I/O required.
Compression doesn't work, because 1) I don't know which files will need to be deleted, and 2) the files need to be edited frequently.
Is there anything out there that can help reduce the I/O cost? Any platform will do.
EDIT: let's assume a power outage can happen anytime.
Kibbee is correct: you're looking for a transaction. However, you needn't depend on either databases or special file system features if you don't want to. The essence of a transaction is this:
Write out a record to a special file (often called the "log") that lists the files you are going to remove.
Once this record is safely written, make sure your application acts just as if the files have actually been removed.
Later on, start removing the files named in the transaction record.
After all files are removed, delete the transaction record.
Note that, any time after step (1), you can restart your application and it will continue removing the logically deleted files until they're finally all gone.
Please note that you shouldn't pursue this path very far: otherwise you're starting to reimplement a real transaction system. However, if you only need a very few simple transactions, the roll-your-own approach might be acceptable.
On *nix, moving files within a single filesystem is a very low cost operation, it works by making a hard link to the new name and then unlinking the original file. It doesn't even change any of the file times.
If you could move the files into a single directory, then you could rename that directory to get it out of the way as a truly atomic op, and then delete the files (and directory) later in a slower, non-atomic fashion.
Are you sure you don't just want a database? They all have transaction commit and rollback built-in.
I think what you are really looking for is the ability to have a transaction. Because the disc can only write one sector at a time, you can only delete the files one at a time. What you need is the ability to roll back the previous deletions if one of the deletes doesn't happen successfully. Tasks like this are usually reserved for databases. Whether or not your file system can do transactions depends on which file system and OS you are using. NTFS on Windows Vista supports Transactional NTFS. I'm not too sure on how it works, but it could be useful.
Also, there is something called shadow copy for Windows, which in the Linux world is called an LVM Snapshot. Basically it's a snapshot of the disc at a point in time. You could take a snapshot directly before doing the delete, and on the chance that it isn't successfully, copy the files back out of the snapshot. I've created shadow copies using WMI in VBScript, I'm sure that similar apis exist for C/C++ also.
One thing about Shadow Copy and LVM Snapsots. The work on the whole partition. So you can't take a snapshot of just a single directory. However, taking a snapshot of the whole disk takes only a couple seconds. So you would take a snapshot. Delete the files, and then if unsucessful, copy the files back out of the snapshot. This would be slow, but depending on how often you plan to roll back, it might be acceptable. The other idea would be to restore the entire snapshot. This may or may not be good as it would roll back all changes on the entire disk. Not good if your OS or other important files are located there. If this partition only contains the files you want to delete, recovering the entire snapshot may be easier and quicker.
Instead of moving the files, make symbolic links into the temporary directory. Then if things are OK, delete the files. Or, just make a list of the files somewhere and then delete them.
Couldn't you just build the list of pathnames to delete, write this list out to a file to_be_deleted.log, make sure that file has hit the disk (fsync()), then start doing the deletes. After all the deletes have been done, remove the to_be_deleted.log transaction log.
When you start up the application, it should check for the existence of to_be_deleted.log, and if it's there, replay the deletes in that file (ignoring "does not exist" errors).
The basic answer to your question is "No.". The more complex answer is that this requires support from the filesystem and very few filesystems out there have that kind of support. Apparently NT has a transactional FS which does support this. It's possible that BtrFS for Linux will support this as well.
In the absence of direct support, I think the hardlink, move, remove option is the best you're going to get.
I think the copy-and-then-delete method is pretty much the standard way to do this. Do you know for a fact that you can't tolerate the additional I/O?
I wouldn't count myself an export at file systems, but I would imagine that any implementation for performing a transaction would need to first attempt to perform all of the desired actions, and then it would need to go back and commit those actions. I.E. you can't avoid performing more I/O than doing it non-atomically.
Do you have an abstraction layer (e.g. database) for reaching the files? (If your software goes direct to the filesystem then my proposal does not apply).
If the condition is "right" to delete the files, change the state to "deleted" in your abstraction layer and begin a background job to "really" delete them from the filesystem.
Of course this proposal incurs a certain cost at opening/closing of the files but saves you some I/O on symlink creation etc.
On Windows Vista or newer, Transactional NTFS should do what you need:
HANDLE txn = CreateTransaction(NULL, 0, 0, 0, 0, NULL /* or timeout */, TEXT("Deleting stuff"));
if (txn == INVALID_HANDLE_VALUE) {
/* explode */
}
if (!DeleteFileTransacted(filename, txn)) {
RollbackTransaction(txn); // You saw nothing.
CloseHandle(txn);
die_horribly();
}
if (!CommitTransaction(txn)) {
CloseHandle(txn);
die_horribly();
}
CloseHandle(txn);

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day and the ten to fifteen minutes when it searches my files for modifications, slowing down everything considerably, start getting on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anyhing in the manpage and was yet unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends how your applications handle this: Do they rewrite the file (not creating a new one) or do they create a temp file and rename that when all data has been written (as they should).
In the first case, there is little you can do: If two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will eventually finish before that. Reschedule the file if it is changes again within this time limit.
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc).
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.