C++ Boost Object Serialization - Periodic Saving to Protect Data

I have a program that uses boost serialization that loads on program start up and saves on shutdown.
Every once in a while, the program will crash due to this or that, and I expect that to be fairly normal. The problem is that when the program crashes, often the objects are not saved at all. Other times, some will be missing or the data will be corrupted. This could be disastrous if a user loses months and months of data. In a perfect world, everyone would back up their data and could just roll back the data file.
My first solution is to periodically save the objects to a different, temporary data file during run time. That way, if the program crashes, the user can revert to the temporary data file with minimal data loss. My concern is the effect on performance. As far as I understand (correct me if I am wrong), once you save an object, it can't be used anymore? If that is the case, then the periodic save routine would involve saving and deleting my pointers, then loading them up again.
My second solution is to simply make a copy of the data file during program start up. The user's loss of data would be limited to that session. However, this may not be sufficient as some users may run the program for days and days.
Any input would be appreciated.
Thanks in advance.

If you save an object graph with boost serialization, that object graph is still available and can be saved again without necessarily reading anything from disk.
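For example, here is a minimal sketch of such a periodic save (the Document type and file names are illustrative, not from the question); it writes a snapshot to a temporary file and swaps it in, and the live object graph stays fully usable afterwards:

    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>
    #include <cstdio>   // std::rename
    #include <fstream>
    #include <string>
    #include <vector>

    // Stand-in for the asker's object graph.
    struct Document {
        std::vector<std::string> items;
        template <class Archive>
        void serialize(Archive& ar, unsigned /*version*/) { ar & items; }
    };

    // Serializing does not consume the object, so this can run on a timer
    // without deleting and reloading anything.
    void periodic_save(const Document& doc, const std::string& path)
    {
        const std::string tmp = path + ".tmp";
        {
            std::ofstream ofs(tmp, std::ios::binary);
            boost::archive::binary_oarchive oa(ofs);
            oa << doc;
        } // archive flushed and file closed here
        std::rename(tmp.c_str(), path.c_str()); // replaces atomically on POSIX
    }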
If you want to go high-tech and introduce a lot more complexity, you can use the Boost Interprocess library with a managed_shared_memory segment. This lets you work transparently on a disk file (really, on memory pages backed by file blocks). This introduces another issue, though: how to prevent changes from hitting the disk too frequently.
Gratuitous advice:
I think the best of all worlds would be if your object graph is (e.g.) a Composite pattern where all nodes are shared immutables. Now serialization is "free" (with Boost), you can easily handle multiple versions of the program state (often a "document" or "database", logically) and efficiently save/load them with Boost Serialization. This pattern facilitates undo/redo, concurrent operations, transactional commit ¹ etc.
¹ (! not without extra work, but in principle)
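A minimal sketch of what such a node might look like (the names are illustrative; the nodes are immutable by convention, so a whole program state is just a root pointer and a snapshot is a cheap pointer copy):

    #include <boost/serialization/shared_ptr.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>
    #include <memory>
    #include <string>
    #include <vector>

    // Composite node, never mutated after construction. "Editing" builds new
    // nodes that share the unchanged children, so undo/redo costs only the
    // changed path. Boost's pointer tracking serializes each shared node once.
    struct Node {
        std::string value;
        std::vector<std::shared_ptr<Node>> children; // shared, treated as immutable

        template <class Archive>
        void serialize(Archive& ar, unsigned /*version*/)
        {
            ar & value;
            ar & children;
        }
    };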

Related

Extreme performance difference when reading the same files a second time with C++

I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after first running a different dummy program that does almost the same thing, the next readings take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So, my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my file-cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you preload files before the user runs the software?
Not directly. How would your program know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms to provide faster and better-aimed access to your data. This is helpful if you only need small chunks of the information at a time.
Attempt to parallelize the loading of the data, so even if it does not really get faster, the user has the impression it does, because they can already start working with the data they have while the rest is fetched in the background.
Have a helper tool start up with the OS and pre-fetch everything, so you already have it in memory when required (a minimal sketch follows below). Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on the implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is figuring out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem, though, it is impossible to provide more specific solutions.
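As an illustration of the third option (and of simple pre-fetching in general), here is a minimal sketch of a warm-up pass that reads the file once so the OS file cache gets populated; the file name and chunk size are illustrative:

    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    // Read the whole file once and discard it, so that later reads hit the
    // OS file cache. This only helps if the machine has enough free RAM to
    // keep the file cached until the real read happens.
    void warm_cache(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(1 << 20); // 1 MiB chunks
        while (in.read(buf.data(), static_cast<std::streamsize>(buf.size()))
               || in.gcount() > 0) {}
    }

    int main()
    {
        std::thread(warm_cache, "big_file.bin").detach(); // start early, in the background
        // ... rest of the program ...
    }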

How to flush memory-mapped files using Boost's `mapped_file_sink` class?

I am using the Boost libraries version 1.62.0 and the mapped_file_sink class from Boost.IOStreams.
I want to flush the written data to disk at will, but there is no mapped_file_sink::flush() member function.
My questions are:
How can I flush the written data when using mapped_file_sink?
If the above can't be done, why not, considering that msync() and FlushViewOfFile() are available for a portable implementation?
If you look at the mapped file support for proposed Boost.AFIO v2 at https://ned14.github.io/boost.afio/classboost_1_1afio_1_1v2__xxx_1_1map__handle.html, you'll notice a lack of ability to flush mapped file views as well.
The reason is that it's redundant on modern unified page-cache kernels, where the mapped view is identical in every way to the page-cached buffers for that file. msync() is therefore a no-op on such kernels, because dirty pages are already queued for writing out to storage as and when the system decides it is appropriate. You can block your process until the system has finished writing out all the dirty pages for that file using good old fsync().
All the above does not apply where (a) your kernel is not a unified page-cache design (QNX, NetBSD, etc.) or (b) your file resides on a networked file system. If you are in situation (a), it's best to simply avoid memory-mapped I/O altogether and just do read() and write(); such kernels are a small percentage of OSs nowadays, so let them suffer the poor performance. For situation (b), you are strongly advised never to use memory-mapped I/O with networked file systems. There is an argument for read-only maps of immutable files only; otherwise, just don't do it unless you know what you're doing. Fall back to read() and write(); it's safer and less likely to surprise.
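That said, if you really want to flush a mapped_file_sink view yourself, a sketch along these lines should work; this assumes POSIX (on Windows, the analogous calls are FlushViewOfFile() followed by FlushFileBuffers()), and the helper name is mine, not Boost's:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <stdexcept>
    #include <sys/mman.h> // msync

    // Block until the dirty pages of the mapped view have been written out.
    // The mapping starts page-aligned, so the whole view can be synced.
    void flush_view(boost::iostreams::mapped_file_sink& sink)
    {
        if (msync(sink.data(), sink.size(), MS_SYNC) != 0)
            throw std::runtime_error("msync failed");
    }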
Finally, you linked to a secure file deletion program. Those programs don't work reliably any more with recent file systems because of delayed extent allocation or copy on write allocation. In other words, when you rewrite a section of an existing file, it doesn't modify the original data on storage but actually allocates new storage and points the extents list for the file at the new linked list. This allows a consistent file system to be recovered after unexpected data loss easily. To securely delete data on recent file systems you usually need to use special OS APIs, though deleting all the files and then filling the free space with random data may securely delete most of the data in question most of the time. Note copy on write filing systems may not release freed extents back to the free space pool for new allocation for many days or weeks until the next time a garbage collection routine fires or a snapshot is deleted. In this situation, filling free space with randomness will not securely delete the files in question. If all this is a problem, use FAT32 as your filing system, it's very simple and rewriting data on it really does rewrite the same data on storage (though note that some storage media e.g. SSDs are highly likely to also not rewrite data, these also write modifications to new storage and garbage collect freed extents later).

Thread Optimization [duplicate]

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads that have separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's noncopyable). So, how do I handle this?
Immediately I can think of two ways,
Construct a new ifstream for the second thread, open it on the same file.
Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>), seeking to the file offset the current thread is interested in whenever that thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.
Thanks.
Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
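A minimal sketch of that approach (file name, offsets, and sizes are illustrative): each thread opens its own ifstream on the same file, so no locking is needed and each stream keeps its own file position:

    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    // Read `bytes` bytes starting at `offset`, using a private ifstream.
    void read_range(const std::string& path, std::streamoff offset, std::size_t bytes)
    {
        std::ifstream in(path, std::ios::binary);
        in.seekg(offset);
        std::vector<char> buf(bytes);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // ... process buf ...
    }

    int main()
    {
        std::thread t1(read_range, "input.dat", 0, 1 << 20);       // first chunk
        std::thread t2(read_range, "input.dat", 1 << 20, 1 << 20); // second chunk
        t1.join();
        t2.join();
    }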
Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view of it, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, the raw pointer or reference is fine.
Finally, note that on the vast majority of hardware, the disk is the bottleneck when loading large files, not the CPU. Using two threads will make this worse, because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100 MB/s sequentially, but top out at 3 or 4 MB/s on random access.
Other option:
Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).
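A minimal sketch of that idea, using boost::iostreams::mapped_file_source (the file name is illustrative; istrstream lives in the deprecated but still available <strstream> header and wraps existing memory without copying, which is exactly why it fits here):

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <strstream> // std::istrstream: wraps memory without copying

    int main()
    {
        boost::iostreams::mapped_file_source file("input.dat");

        // Two independent streams over the same mapping; each keeps its own
        // read position, so each thread can use its own istream.
        std::istrstream a(file.data(), static_cast<std::streamsize>(file.size()));
        std::istrstream b(file.data(), static_cast<std::streamsize>(file.size()));
        b.seekg(static_cast<std::streamoff>(file.size() / 2)); // second reader's offset
    }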
It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.
It might be worth experimenting to see how read-ahead works on your system: open the file, then read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it, then read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed-length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting the test; under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)
That will give you some ideas, but to be really certain, the best solution would be to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor), and constantly seeking, won, but you never know.
I'd also recommend system-specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of it at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)
My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Cross-platform atomic writes/renames without a transactional FS in C++

I'm working on an app that needs to ensure the consistency of the data it saves to disk. I need to guarantee that the data never gets corrupted when dumped to disk, i.e., a reboot or app shutdown could happen while the data is being saved.
I know the steps that need to be done:
http://blogs.msdn.com/b/adioltean/archive/2005/12/28/507866.aspx
But I was wondering whether there's already an implementation of this, preferably in a cross-platform way. I presume boost::filesystem guarantees atomic rename (on Windows and POSIX), so I'm wondering if I missed this functionality somewhere in Boost. Thanks
UPD: I had hopes for boost::interprocess::message_queue, but it just hangs on reading the queue if the process is killed in the middle of adding to it. Also, the memory-mapped file takes up its maximum size on disk, which is the expected worst case anyway.
You may see a performance decrease and/or lose all app data if you use renaming. Maybe a better way is to store some key information (a record ID and fingerprint, for example) after each record, and to seek the last valid key information when the application starts?
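For reference, a minimal sketch of the write-to-temp-then-rename pattern described in the linked article; this assumes POSIX rename semantics (on Windows, std::rename will not replace an existing file, so ReplaceFile or MoveFileEx with MOVEFILE_REPLACE_EXISTING plays that role):

    #include <cstdio>  // std::rename
    #include <fstream>
    #include <stdexcept>
    #include <string>

    // Write the new contents to a temporary file in the same directory, then
    // rename it over the old file. On POSIX, rename() replaces the target
    // atomically, so readers see either the old or the new file, never a
    // half-written one. (For full durability you would also fsync the file
    // and its directory, which std::ofstream cannot express portably.)
    void atomic_write(const std::string& path, const std::string& contents)
    {
        const std::string tmp = path + ".tmp";
        {
            std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
            out << contents;
            out.flush();
            if (!out)
                throw std::runtime_error("write failed: " + tmp);
        }
        if (std::rename(tmp.c_str(), path.c_str()) != 0)
            throw std::runtime_error("rename failed: " + path);
    }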

Fastest small datastore on Windows

My app keeps track of the state of about 1000 objects. Those objects are read from and written to a persistent store (serialized) in no particular order.
Right now the app uses the registry to store each object's state. This is nice because:
It is simple
It is very fast
Individual object's state can be read/written without needing to read some larger entity (like pulling out a snippet from a large XML file)
There is a decent editor (RegEdit) which allow easily manipulating individual items
Having said that, I'm wondering if there is a better way. SQLite seems like a possibility, but you don't have the same level of multiple-reader/multiple-writer that you get with the registry, and no simple way to edit existing entries.
Any better suggestions? A bunch of flat files?
If what you mean by 'multiple-reader/multiple-writer' is that you keep a lot of threads writing to the store concurrently, SQLite is threadsafe (you can have concurrent SELECTs, and concurrent writes are handled transparently). See the FAQ [1] and grep for 'threadsafe'.
[1]: http://www.sqlite.org/faq.html
If you do begin to experiment with SQLite, you should know that "out of the box" it might not seem as fast as you would like, but it can quickly be made to be much faster by applying some established optimization tips:
SQLite optimization
Depending on the size of the data and the amount of RAM available, one of the best performance gains will occur by setting sqlite to use an all-in-memory database rather than writing to disk.
For in-memory databases, pass NULL as the filename argument to sqlite3_open and make sure that TEMP_STORE is defined appropriately.
On the other hand, if you tell SQLite to use the hard disk, then you will get a benefit similar to your current use of RegEdit to manipulate the program's data "on the fly."
The way you could simulate your current RegEdit technique with sqlite would be to use the sqlite command-line tool to connect to the on-disk database. You can run UPDATE statements on the sql data from the command-line while your main program is running (and/or while it is paused in break mode).
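As a concrete starting point, here is a minimal sketch of that kind of tuning (the file name and schema are illustrative; the pragmas are real SQLite pragmas):

    #include <sqlite3.h>

    int main()
    {
        sqlite3* db = nullptr;
        // Use ":memory:" here instead for a pure in-memory database.
        if (sqlite3_open("objects.db", &db) != SQLITE_OK) return 1;

        // Trade some durability for speed; OFF is fastest but unsafe on crash.
        sqlite3_exec(db, "PRAGMA synchronous = NORMAL", nullptr, nullptr, nullptr);

        // Batch writes into one transaction instead of one fsync per statement.
        sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS state(id INTEGER PRIMARY KEY, obj BLOB)",
            nullptr, nullptr, nullptr);
        sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

        sqlite3_close(db);
    }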
I doubt any sane person would go this route these days; however, some of what you describe could be done with Windows' Structured/Compound Storage. I only mention this since you're asking about Windows, and this is/was an official Windows way to do it.
This is how DOC files were put together (but not the new DOCX format). From MSDN it'll appear really complicated, but I've used it; it isn't the worst API in Win32.
It is not simple.
It is fast; I would guess it's faster than the registry.
Individual objects' state can be read/written without needing to read some larger entity.
There is no decent editor; there are only some really basic tools (VC++ 6.0 had the "DocFile Viewer" under Tools; yeah, that's what that thing did). I found a few more online.
You get a file instead of registry keys.
You gain some old-school Windows developer geek cred.
Other random thoughts:
I think XML is the way to go (despite the random-access issue). Heck, INI files may work. The registry gives you very fine-grained security if you need it; people seem to forget this when they claim files are better. An embedded DB seems like overkill if I'm understanding what you're doing.
Do you need to persist the objects on each change event, or can you keep them in memory and store them on shutdown? If the latter, just load them up at the start and serialize them at the end; assuming your app runs for a long time (and you don't share that state with another program), in-memory is going to be the winner.
If you've got fixed-size structures, you could consider just using a memory-mapped file and allocating memory from that.
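A minimal sketch of that memory-mapped idea with Boost.IOStreams (the record layout and file name are illustrative):

    #include <boost/iostreams/device/mapped_file.hpp>

    // Illustrative fixed-size record; a real layout would match your objects.
    struct Record {
        int  id;
        char payload[60];
    };

    int main()
    {
        boost::iostreams::mapped_file_params params;
        params.path = "state.bin";
        params.new_file_size = 1000 * sizeof(Record); // only used when creating
        params.flags = boost::iostreams::mapped_file::readwrite;

        boost::iostreams::mapped_file file(params);
        Record* records = reinterpret_cast<Record*>(file.data());
        records[42].id = 42; // update one object's state in place
    }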
If the only thing you do is serialize/deserialize individual objects (no fancy queries), then use a B-tree database, for example Berkeley DB. It is very fast at storing and retrieving chunks of data by key (I assume your objects have some ID that can be used as a key), and access by multiple processes is supported.
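A minimal sketch with the classic Berkeley DB C API (the key/value layout is illustrative):

    #include <db.h> // Berkeley DB
    #include <cstring>

    int main()
    {
        DB* dbp = nullptr;
        if (db_create(&dbp, nullptr, 0) != 0) return 1;
        if (dbp->open(dbp, nullptr, "objects.db", nullptr,
                      DB_BTREE, DB_CREATE, 0664) != 0) return 1;

        // Store one serialized object under its ID.
        int id = 42;
        const char payload[] = "serialized object bytes";
        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data  = &id;
        key.size  = sizeof id;
        data.data = const_cast<char*>(payload);
        data.size = sizeof payload;
        dbp->put(dbp, nullptr, &key, &data, 0);

        dbp->close(dbp, 0);
    }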