I am using MPI together with C++. I want to read information from one file, modify it by some rule, and then write modified content in the same file. I am using temporary file which where I store modified content and at the end I overwrite it by these commands:
temp_file.open("temporary.txt",ios::in);
ofstream output_file(output_name,ios::out);
output_file<<temp_file.rdbuf();
output_file.flush();
temp_file.close();
output_file.close();
remove("temporary.txt");
This function which modify the file is executed by MPI process with rank 0. After exiting from function, MPI_Barrier(MPI_COMM_WORLD); is called to ensure synchronization.
And then, all MPI processes should read modified file and perform some computations. The problem is that, since file is too big, data are not completely written to file when execution of function is finished, and I get wrong results. I also tried to put sleep() command, but sometimes it works, sometimes it doesn't (it depends on the node where I perform computations). Is there general way to solve this problem?
I put MPI as a tag, but I think this problm is inherently connected with c++ standard and manipulating with storage. How to deal with this latency between writing in buffer aand writing in file on storage medium?
Fun topic. You are dealing with two or maybe three consistency semantics here.
POSIX consistency says essentially when a byte is written to a file, it's visible.
NFS consistency says "woah, that's way too hard. you write to this file and I'll make it visible whenever I feel like it. "
MPI-IO consistency semantics (which you aren't using, but are good to know) say that data is visible after specific synchronization events occur. Those two events are "close a file and reopen it" or "sync file, barrier, sync file again".
If you are using NFS, give up now. NFS is horrible. There are a lot of good parallel file systems you can use, several of which you can set up entirely in userspace (such as PVFS).
If you use MPI-IO here, you'll get more well-defined behavior, but the MPI-IO routines are more like C system calls than C++ iostream operators, so think more like open(2) read(2) write(2) and close(2). Text files are usually a headache to deal with but in your case where modifications are appended to file, that shouldn't be too bad.
My application continuously calculates strings and outputs them into a file. This is being run for almost an entire day. But writing to the file is slowing my application. Is there a way I can improve the speed ? Also I want to extend the application so that I can send the results to an another system after some particular amount of time.
Thanks & Regards,
Mousey
There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance by using Boost.Asio. This way your application does not have to wait for expensive I/O-operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings to a certain size, and then write them to disk (or the network) in big batches. Few big writes are usually faster than many small ones.
If you want to make it really good C++, meaning STL-comliant, make your algorithm a template-function that takes and output-iterator as argument. This way you can easily have it write to files, the network, memory or the console by providing appropriate iterators.
How if you write the results to a socket, instead of file. Another program Y, will read the socket, open a file, write on it and close it, and after the specified time will transfer the results to another system.
I mean the process of file handling is handled by other program. Original program X just sends the output to the socket. It does not concern it self with flushing the file stream.
Also I want to extend the application
so that I can send the results to an
another system after some particular
amount of time.
If you just want to transfer the file to other system, then I think a simple script will be enough for that.
Use more than one file for the logging. Say, after your file reaches size of 1 MB, change its name to something contains the date and the time and start to write to a new one, named as the original file name.
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
You can buffer the result of different computations in memory and only write to the file when buffer is full. For example, your can design your application in such a way that, it computes result for 100 calculations and writes all those 100 results at once in a file. Then computes another 100 and so on.
Writing file is obviously slow, but you can buffered data and initiate the separate thread for writhing on file. This can improve speed of your application.
Secondly you can use ftp for transfer files to other system.
I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.
I am writing some binary data into a binary file through fwrite and once i am through with writing i am reading back the same data thorugh fread.While doing this i found that fwrite is taking less time to write whole data where as fread is taking more time to read all data.
So, i just want to know is it fwrite always takes less time than fread or there is some issue with my reading portion.
Although, as others have said, there are no guarantees, you'll typically find that a single write will be faster than a single read. The write will be likely to copy the data into a buffer and return straight away, while the read will be likely to wait for the data to be fetched from the storage device. Sometimes the write will be slow if the buffers fill up; sometimes the read will be fast if the data has already been fetched. And sometimes one of the many layers of abstraction between fread/fwrite and the storage hardware will decide to go off into its own little world for no apparent reason.
The C++ language makes no guarantees on the comparative performance of these (or any other) functions. It is all down to the combination of hardware and operating system, the load on the machine and the phase of the moon.
These functions interact with the operating system's file system cache. In many cases it is a simple memory-to-memory copy. Write could indeed be marginally faster if you run your program repeatedly. It just needs to find a hole in the cache to dump its data. Flushing that data to the disk happens at a time you can't see or measure.
More work is usually needed to read. At a minimum it needs to traverse the cache structure to discover if the disk data is already cached. If not, it is going to have to block on a disk driver request to retrieve the data from the disk, that takes many milliseconds.
The standard trap with profiling this behavior is taking measurements from repeated runs of your program. They are not at all representative for the way your program is going to behave in the wild. The odds that the disk data is already cached are very good on the second run of your program. They are very poor in real life, reads are likely to be very slow, especially the first one. An extra special trap exists for a write, at some point (depending on the behavior of other programs too), the cache is not going to be able to buffer the write request. Write performance is then going to fall of a cliff as your program gets blocked until enough data is flushed to the disk.
Long story short: don't ever assume disk read/write performance measurements are representative for how your program will behave in production. And perhaps more to the point: there isn't anything you can do to solve disk I/O perf problems in your code.
You are seeing some effect of the buffer/cache systems as other have said, however, if you use async API (as you said your suing fread/write you should look at aio_read/aio_write) you can experiment with some other methods for I/O which are likely more well optimized for what your doing.
One suggestion is that if you are read/update/write/reading a file a lot, you should, by way of an ioctl or DeviceIOControl, request to the OS to provide you the geometry of the disk your code is running on, then determine the size of a disk cylander so you may be able to determine if you can do your read/write operations buffered inside of a single cylinder. This way, the drive head will not move for your read/write and save you a fair amount of run time.
Using C or C++, After I decrypt a file to disk- how can I guarantee it is deleted if the application crashes or the system powers off and can't clean it up properly? Using C or C++, on Windows and Linux?
Unfortunately, there's no 100% foolproof way to insure that the file will be deleted in case of a full system crash. Think about what happens if the user just pulls the plug while the file is on disk. No amount of exception handling will protect you from that (the worst) case.
The best thing you can do is not write the decrypted file to disk in the first place. If the file exists in both its encrypted and decrypted forms, that's a point of weakness in your security.
The next best thing you can do is use Brian's suggestion of structured exception handling to make sure the temporary file gets cleaned up. This won't protect you from all possibilities, but it will go a long way.
Finally, I suggest that you check for temporary decrypted files on start-up of your application. This will allow you to clean up after your application in case of a complete system crash. It's not ideal to have those files around for any amount of time, but at least this will let you get rid of them as quickly as possible.
Don't write the file decrypted to disk at all.
If the system is powerd off the file is still on disk, the disk and therefore the file can be accessed.
Exception would be the use of an encrypted file system, but this is out of control of your program.
I don't know if this works on Windows, but on Linux, assuming that you only need one process to access the decrypted file, you can open the file, and then call unlink() to delete the file. The file will continue to exist as long as the process keeps it open, but when it is closed, or the process dies, the file will no longer be accessible.
Of course the contents of the file are still on the disk, so really you need more than just deleting it, but zeroing out the contents. Is there any reason that the decrypted file needs to be on disk (size?). Better would just to keep the decrypted version in memory, preferably marked as unswappable, so it never hits the disk.
Try to avoid it completely:
If the file is sensitive, the best bet is to not have it written to disk in a decrypted format in the first place.
Protecting against crashes: Structured exception handling:
However, you could add structured exception handling to catch any crashes.
__try and __except
What if they pull the plug?:
There is a way to protect against this...
If you are on windows, you can use MoveFileEx and the option MOVEFILE_DELAY_UNTIL_REBOOT with a destination of NULL to delete the file on the next startup. This will protect against accidental computer shutdown with an undeleted file. You can also ensure that you have an exclusively opened handle to this file (specify no sharing rights such as FILE_SHARE_READ and use CreateFile to open it). That way no one will be able to read from it.
Other ways to avoid the problem:
All of these are not excuses for having a decrypted file on disk, but:
You could also consider writing to a file that is larger than MAX_PATH via file syntax of \\?\. This will ensure that the file is not browsable by windows explorer.
You should set the file to have the temporary attribute
You should set the file to have the hidden attribute
In C (and so, I assume, in C++ too), as long as your program doesn't crash, you could register an atexit() handler to do the cleanup. Just avoid using _exit() or _Exit() since those bypass the atexit() handlers.
As others pointed out, though, it is better to avoid having the decrypted data written to disk. And simply using unlink() (or equivalent) is not sufficient; you need to rewrite some other data over the original data. And journalled file systems make that very difficult.
A process cannot protect or watch itself. Your only possibility is to start up a second process as a kind of watchdog, which regularly checks the health of the decrypting other process. If the other process crashes, the watchdog will notice and delete the file itself.
You can do that using hearth-beats (regular polling of the other process to see whether it's still alive), or using interrupts sent from the other process itself, which will trigger a timeout if it has crashed.
You could use sockets to make the connection between the watchdog and your app work, for example.
It's becoming clear that you need some locking mechanism to prevent swapping to the pagefile / swap-partition. On Posix Systems, this can be done by the m(un)lock* family of functions.
There's a problem with deleting the file. It's not really gone.
When you delete files off your hard drive (not counting the recycle bin) the file isn't really gone. Just the pointer to the file is removed.
Ever see those spy movies where they overwrite the hard drive 6, 8,24 times and that's how they know that it's clean.. Well they do that for a reason.
I'd make every effort to not store the file's decrypted data. Or if you must, make it small amounts of data. Even, disjointed data.
If you must, then they try catch should protect you a bit.. Nothing can protect from the power outage though.
Best of luck.
Check out tmpfile().
It is part of BSD UNIX not sure if it is standard.
But it creates a temporary file and automatically unlinks it so that it will be deleted on close.
Writing to the file system (even temporarily) is insecure.
Do that only if you really have to.
Optionally you could create an in-memory file system.
Never used one myself so no recommendations but a quick google found a few.
In C++ you should use an RAII tactic:
class Clean_Up_File {
std::string filename_;
public Clean_Up_File(std::string filename) { ... } //open/create file
public ~Clean_Up_File() { ... } //delete file
}
int main()
{
Clean_Up_File file_will_be_deleted_on_program_exit("my_file.txt");
}
RAII helps automate a lot of cleanup. You simply create an object on the stack, and have that object do clean up at the end of its lifetime (in the destructor which will be called when the object falls out of scope). ScopeGuard even makes it a little easier.
But, as others have mentioned, this only works in "normal" circumstances. If the user unplugs the computer you can't guarantee that the file will be deleted. And it may be possible to undelete the file (even on UNIX it's possible to "grep the harddrive").
Additionally, as pointed out in the comments, there are some cases where objects don't fall out of scope (for instance, the std::exit(int) function exits the program without leaving the current scope), so RAII doesn't work in those cases. Personally, I never call std::exit(int), and instead I either throw exceptions (which will unwind the stack and call destructors; which I consider an "abnormal exit") or return an error code from main() (which will call destructors and which I also consider an "abnormal exit"). IIRC, sending a SIGKILL also does not call destructors, and SIGKILL can't be caught, so there you're also out of luck.
This is a tricky topic. Generally, you don't want to write decrypted files to disk if you can avoid it. But keeping them in memory doesn't always guarentee that they won't be written to disk as part of a pagefile or otherwise.
I read articles about this a long time ago, and I remember there being some difference between Windows and Linux in that one could guarentee a memory page wouldn't be written to disk and one couldn't; but I don't remember clearly.
If you want to do your due diligence, you can look that topic up and read about it. It all depends on your threat model and what you're willing to protect against. After all, you can use compressed air to chill RAM and pull the encryption key out of that (which was actually on the new Christian Slater spy show, My Own Worst Enemy - which I thought was the best use of cutting edge, accurate, computer security techniques in media yet)
on Linux/Unix, use unlink as soon as you created the file. The file will be removed as soon as you program closes the file descriptor or exits.
Better yet, the file will be removed even if the whole system crashes - because it is basically removed as soon as you unlink it.
The data will not be physically deleted from the disk, of course, so it still may be available for hacking.
Remember that the computer could be powered down at any time. Then, somebody you don't like could boot up with a Linux live CD, and examine your disk in any level of detail desired without changing a thing. No system that writes plaintext to the disk can be secure against such attacks, and they aren't hard to do.
You could set up a function that will overwrite the file with ones and zeros repeatedly, preferably injecting some randomness, and set it up to run at end of program, or at exit. This will work, provided there are no hardware or software glitches, power failures, or other interruptions, and provided the file system writes to only the sectors it claims to be using (journalling file systems, for example, may leave parts of the file elsewhere).
Therefore, if you want security, you need to make sure no plaintext is written out, and that also means it cannot be written to swap space or the equivalent. Find out how to mark memory as unswappable on all platforms you're writing for. Make sure decryption keys and the like are treated the same way as plaintext: never written to the disk under any circumstances, and kept in unswappable memory.
Then, your system should be secure against attacks short of hostiles breaking in, interrupting you, and freezing your RAM chips before powering down, so they don't lose their contents before being transferred for examination. Or authorities demanding your key, legally (check your local laws here) or illegally.
Moral of the story: real security is hard.
The method that I am going to implement will be to stream the decryption- so that the only part that is in memory is the part that is decrypted during the read as the data is being used. Here is a diagram of the pipeline:
This will be a streamed implementation, so the only data that is in memory is the data that I am consuming in the application at any given point. This makes some things tricky- considering a lot of traditional file tricks are no longer available, but since the implementation will be stream based i will still be able to seek to different points of the file which would be translated to the crypt stream to decrypt at different sections.
Basically, it will be encrypting blocks of the file at a time - so then if I try to seek to a certain point it will decrypt that block to read. When I read past a block it decrypts the next block and releases the previous (within the crypt stream).
This implementation does not require me to decrypt to a file or to memory and is compatible with other stream consumers and providers (fstream).
This is my 'plan'. I have not done this type of work with fstream before and I will likely be posting a question as soon as I am ready to work on this.
Thanks for all the other answers- it was very informative.
Assume that write operation throws an exception half-way. Is there any data written into the file, or is no data written in the file?
Since you have no view of the internals of CFile (or shouldn't, if it's encapsulated properly), you need to rely on the 'contract' of the API. In other words, unless the documentation tells you specifically what happens in certain cases, you can't rely on it.
Even if you had the source code and could figure it out, the API specification is the contract and anything not specified there can change at any time. This is one reason why some software developers are wary of publishing internals as it then can be seen to lock them in to supporting that forever.
If you really want to ensure that your file will be in a known state after the exception, you will need to code around the behavior. This can be something like:
backing up the file at program start-up (simple) ; or
backing it up before each save operation (still relatively simple); or
backing it up before any write operation (complex and slow).
Short answer: Most likely some data will be written to the file, unless the disk is full at the start of the write operation.
Longer answer: It will depend on what CFileException is thrown from the Write call.
http://msdn.microsoft.com/en-us/library/as5cs056(VS.80).aspx