Losing Power While Writing to a File [closed] - c++

I am working on a project for a device that must constantly write information to a storage device. The device will need to be able to lose power but accurately retain the information that it collects up until the time that power is lost.
I've been looking for answers for what would happen if power was lost on a system like this. Are there any issues with losing power and not closing the file? Is data corruption a possibility?
Thank you

The whole subject of "safely storing data when power may be cut" is quite hard to solve in a generic way - the exact solution depends on the type of data, the rate at which data is stored, and so on.
To retain information "while power is off", the data needs to be stored in non-volatile memory (flash, EEPROM or battery-backed RAM). Again, this is a hardware solution.
Can you "lose data written to a file"? Yes, it's entirely possible that the file may not be correctly written if the power to the file-storage device is lost while the system is in the middle of writing.
The answer to this really depends on how much freedom you have to build/customise the hardware to cope with this situation. Systems that are designed for high reliability will have a way to detect power-cuts and still run for several seconds (sometimes a lot more) after a power-cut, and when the power-cut happens, they go into "save all data, and shut down nicely" mode. Typically, this is done with an uninterruptible power supply (UPS), which has an alarm mechanism that signals that the external power is gone; when the system receives this signal, it starts an emergency shutdown.
If you don't have any way to connect a UPS and shut down in an orderly fashion, then there are other options, such as a journaling filesystem, that can give you a good set of data, but they are not guaranteed to give you complete data. You also need to design your file format such that "cut-off data" doesn't completely ruin the file - the classic example is a zip file, which stores the "directory" (list of contents) at the very end of the file, so you can have 99.9% of the file complete, yet the missing 0.1% is exactly what you need to decode all the content.
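As an illustration of that point, here is a minimal sketch of one way to make a file format tolerate truncation: each record is written as <length><crc32><payload>, so a reader can simply stop at the first record that was cut off by a power loss. The record layout and CRC routine are illustrative, not taken from the answer above.

#include <cstdint>
#include <cstdio>
#include <vector>

// Simple CRC-32 (reflected, polynomial 0xEDB88320); any checksum would do.
static uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

// Append one self-describing record; after a power cut the file is still
// decodable up to the last record that was written completely.
bool append_record(std::FILE* f, const std::vector<uint8_t>& payload) {
    uint32_t len = static_cast<uint32_t>(payload.size());
    uint32_t crc = crc32(payload.data(), payload.size());
    return std::fwrite(&len, sizeof len, 1, f) == 1
        && std::fwrite(&crc, sizeof crc, 1, f) == 1
        && std::fwrite(payload.data(), 1, len, f) == len
        && std::fflush(f) == 0;   // hand the data to the OS as soon as possible
}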

Yes, data corruption is definitely a possibility.
However there are a few guidelines to minimize it in a purely software way:
Use a journalling filesystem and put it in its maximum journal mode (e.g. for ext3/ext4 use data=journal, no less).
Avoid software buffers. If you don't have a choice, flush them ASAP.
Synchronize the filesystem ASAP (either through the sync/syncfs/fsync system calls, or using the sync mount option).
Never overwrite existing data, just append new data to existing files.
Be prepared to deal with incomplete data records.
This way, even if you lose data it will only be the last few bytes written, and the filesystem in general won't be corrupt.
You'll notice that I assumed a Unix-y OS. As far as I know, Windows doesn't give you enough control to enforce those kinds of constraints on the filesystem.
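As a rough illustration of those guidelines on a POSIX system (the file name and record handling are placeholders, not a definitive implementation): open the file in append-only mode, skip user-space buffering by calling write() directly, and fsync() after every record so that at most the record in flight is lost.

#include <fcntl.h>
#include <unistd.h>
#include <string>

bool append_and_sync(const char* path, const std::string& record) {
    // O_APPEND: never overwrite existing data, only add to the end of the file.
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;

    // write() goes straight to the kernel; there is no stdio buffer to lose.
    ssize_t n = write(fd, record.data(), record.size());
    bool ok = (n == static_cast<ssize_t>(record.size()));

    // Ask the kernel to push the data (and metadata) down to the storage device.
    if (ok && fsync(fd) != 0) ok = false;

    close(fd);
    return ok;
}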

Related

How to overwrite all the free disk space with 0x00? [closed]

How can I overwrite all free disk space with zeros, like the cipher command on Windows? For example:
cipher /w:c:\
This will overwrite the free disk space in three passes. How can I do this in C or C++? (I want to do this in one pass and as fast as possible.)
You can create a set of files and write random bytes to them until the available disk space is filled. These files should be removed before exiting the program.
The files must be created on the device you wish to clean.
Multiple files may be required on some file systems, due to file size limitations.
It is important to use different, non-repeating random sequences in these files to avoid file system compression and deduplication strategies that may reduce the amount of disk space actually written.
Note also that the OS may have quota systems that will prevent you from filling available disk space and may also show erratic behavior when disk space runs out for other processes.
Removing the files may cause the OS to skip the cache flushing mechanism, causing some blocks to not be written to disk. A sync() system call or equivalent might be required. Furthermore, syncing at the hardware level might be delayed, so waiting for some time before removing the files may be necessary.
Repeating this process with a different random seed further reduces the odds of recovering the previous contents through surface analysis with advanced forensic tools. These tools are not perfect (especially when recovery would be a life saver for the owner of a lost Bitcoin wallet), but they may prove effective in other, more problematic circumstances.
Using random bytes has a double purpose:
prevent some file systems from optimizing the blocks by compressing or sharing them instead of writing them to the media, which would leave the existing data in place instead of overwriting it.
increase the difficulty of recovering previously written data with advanced hardware recovery tools, just like those security envelopes that have random patterns printed on the inside to prevent the contents of the letter from being read by simply holding the envelope up to a strong light.
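A hedged, one-pass sketch of that approach (file name, block size and random generator are just placeholders): keep appending pseudo-random blocks to a scratch file on the volume you want to clean until a write fails because the disk is full, flush, then delete the file.

#include <cstdio>
#include <random>
#include <vector>

int main() {
    const size_t kBlock = 1 << 20;                    // 1 MiB per write
    std::vector<unsigned char> buf(kBlock);
    std::mt19937_64 rng(std::random_device{}());      // non-repeating filler data

    std::FILE* f = std::fopen("wipe.bin", "wb");      // must be created on the target volume
    if (!f) return 1;

    for (;;) {
        for (auto& b : buf)
            b = static_cast<unsigned char>(rng());
        if (std::fwrite(buf.data(), 1, buf.size(), f) != buf.size())
            break;                                    // short write: the disk is (nearly) full
    }
    std::fflush(f);                                   // then sync()/FlushFileBuffers as appropriate
    std::fclose(f);
    std::remove("wipe.bin");                          // give the space back
    return 0;
}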

Fast, binary database alternative [closed]

I want to implement a fast database alternative that only needs to handle binary data.
To specify, I want something close to a database that will be securely stored even in case of a forced termination (task manager) during execution, whilst also being accessed directly from memory in C++. Like a vector of structs that is mirrored to the hard disk. It should be able to handle hundreds of thousands of read accesses and at least 1000 write accesses per second. In case of a forced termination, at most the last command can be lost. It does not need to support multithreading and the database file will only be accessed by a single instance of the program. Only needs to run on Windows. These are the solutions I've thought of so far:
SQL Databases
Advantages
Easy to implement, since lots of libraries are available
Disadvantages
Server runs in a different process, therefore possibly slow inter-process communication
Necessity of parsing SQL queries
Built for multithreaded environments, so lots of unnecessary synchronization
Rows can't be directly accessed using pointers but need to be copied at least twice per change
Unnecessary delays on the UPDATE query, since the whole table needs to be searched and the WHERE clause checked
These are just a few off the top of my head; there might be a lot more
Memory Mapped Files
Advantages
Direct memory mapping, so direct pointer access possible
Very fast compared to databases
Disadvantages
Forceful termination could lead to a whole page not being written
Lots of code (I don't actually mind that)
No forced synchronization possible
Increasing file size might take a lot of time
C++ vector*
Advantages
Direct pointer access possible; however, the caller needs to manually notify of changes
Very fast compared to databases
Total programming freedom
Disadvantages
Possibly slow because of many calls to WriteFile
Lots of code (I don't actually mind that)
C++ vector with complete write every few seconds
Advantages
Direct pointer access possible
Very fast compared to databases
Total programming freedom
Disadvantages
Lots of unchanged data being rewritten to file, alternatively lots of RAM wasted on preventing unnecessary writes
Inaccessibility during writes, or lots of RAM wasted on the copy
Could lose multiple seconds worth of data
Multiple threads and therefore synchronization needed
*Basically, a wrapper class that only exposes per-row read/write functionality of a vector OR allows direct writes to memory but relies on the caller to notify it of changes; all reads are done from a copy in memory, while all writes go both to the copy in memory and to the file itself on a per-command basis
Also, is it possible to write to different parts of a file without flushing, and then flush all changes at once with a guarantee that the file will be written either completely or not at all, even in case of a forced termination during the write? All I can think of is the following workflow:
Duplicate target file on startup, then for every set of data:
Write all changes to duplicate -> Flush by replacing original with duplicate
However, I feel like this would be a horrible waste of hard disk space for big files.
Thanks in advance for any input!
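Since the question is Windows-only, here is a hedged sketch of the duplicate-and-swap workflow described above (the file names are illustrative and error handling is minimal): write the complete data set to a temporary copy, flush it to disk, then swap it in with ReplaceFile so readers only ever see the old file or the complete new one.

#include <windows.h>
#include <vector>

bool save_snapshot(const char* path, const std::vector<char>& data) {
    const char* tmp = "data.tmp";   // the duplicate, on the same volume as the original

    HANDLE h = CreateFileA(tmp, GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    DWORD written = 0;
    bool ok = WriteFile(h, data.data(), static_cast<DWORD>(data.size()), &written, nullptr)
              && written == data.size();
    if (ok) ok = FlushFileBuffers(h) != 0;   // make sure the duplicate really is on disk
    CloseHandle(h);
    if (!ok) return false;

    // Swap the duplicate in; note that ReplaceFile expects the original to exist already.
    return ReplaceFileA(path, tmp, nullptr, 0, nullptr, nullptr) != 0;
}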

Want to create an application storing data in memory, but I don't want the data to be lost even if my app crashes [closed]

I want to create an application storing data in memory, but I don't want the data to be lost even if my app crashes.
What concept should I use?
Should I use shared memory, or is there some other concept that suits my requirement better?
You are asking for persistence (or even orthogonal persistence) and/or for application checkpointing.
This is not possible (at least through portable C++ code) in the general case for some arbitrary existing C++ code, e.g. because of ASLR, because of pointers on (or to) the local call stack, because of multi-threading, because of external resources (sockets, opened files, ...), and because the current continuation cannot be accessed, restored and handled in standard C++.
However, you might design your application with persistence in mind. This is a strong architectural requirement. You could for instance have every class contain some dumping method and its load factory function. Beware of shared pointers, and take into account that you could have cyclic references. Study garbage collection algorithms (e.g. in the GC Handbook), which are similar to those needed for persistence (a copying GC is quite similar to a checkpointing algorithm).
Look also into serialization libraries (like libs11n). You might also consider persisting into a textual format (e.g. JSON), perhaps inside some Sqlite database (or some real database like PostgreSQL or MongoDB...). I am doing this (in C) in my monimelt software.
You might also consider checkpointing libraries like BLCR.
The important thing is to think about persistence & checkpointing very early at design time. Thinking of your application as some specialized bytecode interpreter or VM might help (notably if you want to persist continuations, or some form of "call stack").
You could fork your process (assuming you are on Linux or POSIX) before persisting. Hence, persistence time does not matter that much (e.g. if you persist every hour or every ten minutes).
Some language implementations are able to persist their entire state (notably their heap), e.g. SBCL (a good Common Lisp implementation) with its save-lisp-and-die, or Poly/ML -an ML dialect- with its SaveState, or Squeak (a Smalltalk implementation).
See also this answer & that one. J.Pitrat's blog has a related entry: CAIA as a sleeping beauty.
Persistence of data together with code (e.g. vtables of objects, function pointers) might be technically difficult. dladdr(3) (with dlsym) might help (and, if you are able to code machine-specific things, consider the old getcontext(3), but I don't recommend that). Avoid name mangling (for dlsym) by declaring all code related to persistence extern "C". If you want to persist some data and be able to restart from it with a slightly modified program (e.g. a small bugfix), things are much more complex.
More pragmatically, you could have a class representing your entire persistable state, and implement methods to persist (and reload it). You would then persist only at certain steps of your algorithm (e.g. if you have a main loop or an event loop, at start of that loop). You probably don't want to persist too often (e.g. because of the time and disk space required to persist), e.g. perhaps every ten minutes. You might perhaps consider some transaction log if it fits in the overall picture of your application.
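As a hedged illustration of that pragmatic suggestion (class name, fields and text format are made up for the example), the persistable state can live in one class that knows how to dump and reload itself, and the main loop calls save() at a safe point:

#include <fstream>
#include <string>
#include <vector>

// All state worth keeping lives in one place, with explicit dump/reload methods.
struct AppState {
    long iteration = 0;
    std::vector<std::string> items;

    bool save(const std::string& path) const {
        std::ofstream out(path, std::ios::trunc);
        out << iteration << '\n';
        for (const auto& s : items) out << s << '\n';
        return static_cast<bool>(out);
    }

    bool load(const std::string& path) {
        std::ifstream in(path);
        if (!in) return false;
        items.clear();
        in >> iteration;
        in.ignore();                                   // skip the rest of the first line
        for (std::string line; std::getline(in, line); ) items.push_back(line);
        return true;
    }
};

// In the main/event loop, for example: if (state.iteration % 1000 == 0) state.save("state.txt");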
Use memory-mapped files - mmap (https://en.wikipedia.org/wiki/Mmap) - and allocate all your structures inside the mapped memory region. The system will properly save the mapped file to disk when your app crashes.
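A minimal sketch of that mmap approach, assuming Linux/POSIX (the struct and file name are illustrative): a plain, pointer-free struct lives directly in a file-backed shared mapping, so the kernel writes the pages back even if the process dies.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Keep the persistent data plain: no pointers, no std:: containers.
struct PersistentState {
    long counter;
    char note[256];
};

PersistentState* open_state(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, sizeof(PersistentState)) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, sizeof(PersistentState),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                  // the mapping keeps the file alive
    if (p == MAP_FAILED) return nullptr;

    // Mutate the struct in place; call msync(p, sizeof(PersistentState), MS_SYNC)
    // if you want to force a write-back at a known-good point.
    return static_cast<PersistentState*>(p);
}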

Fastest/Best way to serialize and deserialize data from database [closed]

In a few months I will start to write my bachelor-thesis. Although we only discussed the topic of my thesis very roughly, the main problem will be something like this:
A program written in C++ (more or less an HTTP server, but I guess it doesn't matter here) has to be executed to fulfill its task. There are several instances of this program running at the same time, and a load balancer takes care of equal distribution of HTTP requests between all instances. Every time the program's code is changed to enhance it, or to get rid of bugs, all instances have to be restarted. This can take up to 40 minutes for one instance. As there are more than ten instances running, the restart process can take up to one work day. This is way too slow.
The presumed bottleneck is the access to the database during startup to load all necessary data (I guess it will be a MySQL database). The idea of the team leader to decrease the amount of time needed for the startup process is to serialize the content of the database to a file, and read from this file instead of reading from the database. That would be my task. Of course the problem is to check if there is new data in the database that is not in the file. I guess write processes are still applied to the database, not to the serialized file. My first idea is to use Apache Thrift for serialization and deserialization, as I already worked with it and it is fast, as far as I know (maybe I'll write some small Python program to take care of this). However, I have some basic questions regarding this problem:
Is it a good solution to read from a file instead of reading from the database? Is there any chance this will save time?
Would Thrift work well in this scenario, or is there some faster way for serialization/deserialization?
As I am only reading, not writing, I don't have to take care of consistency, right?
Can you recommend some books or online literature that is worth reading regarding this topic?
If I'm missing information, just ask. Thanks in advance. I just want to be well informed and prepared before I start with the thesis, which is why I ask.
Kind regards
Michael
Cache is king
As a general recommendation: Cache is king, but don't use files.
Cache? What cache?
The cache I'm talking about is of course an external cache. There are plenty of systems available, and a lot of them are able to form a cache cluster with cached items spread across multiple machines' RAM. If you are doing it cleverly, the cost of serializing/deserializing into memory will make your algorithms shine compared to the cost of grinding the database. And on top of that, you get nice features like TTL for cached data, a cache that persists even if your business logic crashes, and much more.
What about consistency?
As I am only reading, not writing, I don't have to take care of consistency, right?
Wrong. The issue is not who writes to the database. It is about whether or not someone writes to the database, how often this happens, and how up-to-date your data needs to be.
Even if you cache your data into a file as planned in your question, you have to be aware that this produces a redundant data duplicate, disconnected from the original data source. So the real question you have to answer (I can't do this for you) is what the optimum update frequency should be. Do you need immediate updates in near-time? Is a certain time lag acceptable?
This is exactly the purpose of the TTL (time to live) value that you can put onto your cached data. If you need more frequent updates, set a short TTL. If you are ok with updates in a slower frequency, set the TTL accordingly or have a scheduled task/thread/process running that does the update.
Ok, understood. Now what?
Check out Redis, or the "oldtimer" Memcached. You didn't say much about your platform, but there are Linux and Windows versions available for both (and especially on Windows you will have a lot more fun with Redis).
PS: Oh yes, Thrift serialization can be used for the serialization part.
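As a hedged sketch of the read-through cache with TTL described above, assuming the hiredis C client and a Redis server on localhost (the key names, TTL value and the load_from_database() helper are placeholders):

#include <hiredis/hiredis.h>
#include <string>

std::string load_from_database(const std::string& key);   // the slow path (placeholder)

std::string get_cached(redisContext* c, const std::string& key, int ttl_seconds) {
    // Try the cache first.
    redisReply* r = static_cast<redisReply*>(redisCommand(c, "GET %s", key.c_str()));
    if (r && r->type == REDIS_REPLY_STRING) {
        std::string value(r->str, r->len);
        freeReplyObject(r);
        return value;                                      // cache hit, database untouched
    }
    if (r) freeReplyObject(r);

    // Cache miss (or expired TTL): hit the database, then repopulate the cache.
    std::string value = load_from_database(key);
    std::string ttl = std::to_string(ttl_seconds);
    redisReply* w = static_cast<redisReply*>(
        redisCommand(c, "SETEX %s %s %s", key.c_str(), ttl.c_str(), value.c_str()));
    if (w) freeReplyObject(w);
    return value;
}

// Usage: redisContext* c = redisConnect("127.0.0.1", 6379); ... get_cached(c, "user:42", 600);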

Writing your own partition recovery [closed]

I realise that the question I'm asking isn't a simple "O, that's easy! Do a simple this and that and voilà!" Fact is, without thinking one night I deleted the wrong partition. I tried a few Windows and Linux tools (Partition disk doctor, Easeus, Test disk, etc) but none of them worked. And I think it's because of the way I deleted the partition.
I have written my own boot sector creators / backup tools in C++ before as well as one or two kernels in C and Assembler (albeit fairly useless kernels...) so I think I have sufficient knowledge to at the very least TRY to recover it manually.
My drive was set up as follows:
Size: 1.82TB
part0 100MB (redundant windows recovery partition)
part1 ~1760MB (my data partition)
How I broke it:
In Windows 7, I deleted the first partition. I then extended the second to take up the first's free space, which meant I still had 2 partitions, now acting as one dynamic partition. I rebooted into my Ubuntu OS and realised I could no longer read it. I rebooted back into Windows, deleted the first partition, then thought, wait... I shouldn't have done that. Needless to say, it's dead now.
What I would like is some advice / good links on where to start, what not to do, and what not to expect. I'm hoping that if the journals are still intact I'll be able to recover the drive.
Edit:
This is an NTFS drive. After posting this question, I was wondering: given that I know the approximate location of where my partition was located, is there a way to easily identify the journals? Maybe I can reconstruct some of the other drive / partition info myself and write it to the disk.
The first step, I think, is to figure out how exactly those "dynamic partitions", as you call them, work in Windows 7. From your description, it sounds as if you created a kind of logical volume from two physical partitions. My guess is that the second partition now contains some kind of header for that volume, which is why recovery tools unfamiliar with that format fail to function.
If you figure out what Windows 7 did exactly when you merged the two partitions, you should be able to write an application which extracts an image of the logical volume.
Or, you could check out NTFS-3G, the FUSE implementation of NTFS, at http://www.tuxera.com/community/ntfs-3g-download/. By studying that code, I bet that you can find a way to locate the NTFS filesystem on your borked disk. Once you have that, try extracting everything from the beginning of the filesystem to the end of the disk into an image, and run some NTFS filesystem checker on it. With a little luck, you'll get a mountable filesystem back.
If you're wondering how to access the disk, just open the corresponding device in Linux as if it were a regular file. You might need to align your reads to 512 bytes, though (or whatever the sector size of your disk is; 512 and, to a lesser extent, 4096 are common values), otherwise read() might return an error.
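To make that concrete, a hedged sketch of scanning a raw disk on Linux (the device path, chunk size and signature check are illustrative; run it with sufficient privileges, strictly read-only):

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const char* device = "/dev/sdb";          // the disk that held the lost partition
    const size_t kSector = 512;               // common sector size; some disks use 4096

    int fd = open(device, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    std::vector<uint8_t> buf(1024 * kSector); // read sector-aligned chunks
    off_t offset = 0;
    for (;;) {
        ssize_t n = pread(fd, buf.data(), buf.size(), offset);
        if (n <= 0) break;                    // end of device or read error
        // Look for an NTFS boot sector: the OEM id "NTFS    " sits at byte 3 of the sector.
        for (size_t s = 0; s + kSector <= static_cast<size_t>(n); s += kSector)
            if (std::memcmp(buf.data() + s + 3, "NTFS    ", 8) == 0)
                std::printf("possible NTFS boot sector at byte offset %lld\n",
                            static_cast<long long>(offset + s));
        offset += n;
    }
    close(fd);
    return 0;
}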