Windows filesystem operations become unreliable under load - C++

I work on a relatively large C++ Windows application. When running multiple instances of the application in parallel (at 100% CPU usage), for example during execution of our test suite, several file-related operations start failing randomly. Some examples include:
extracting a zip archive and then loading one of the extracted files
writing / copying a bunch of files and then adding them all to a zip archive
reading files that were just written by a child process
I have read that filesystem operations aren't guaranteed to be immediately visible upon closing a file handle, so I created a little helper function that waits until a specified file becomes accessible and sprinkled it around the code at the points where I noticed the failures, but:
It doesn't seem to be 100% reliable
It feels a bit ugly, especially since I have to put it at every point where the problem might present itself
So, my question is whether anyone has any ideas for a better or more reliable solution to this or any additional insight into the problem.
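A helper along the lines described above might look roughly like the sketch below (a minimal sketch; the name, timeout, and error handling are assumptions, not the asker's actual code):

```cpp
#include <windows.h>
#include <string>

// Keep trying to open `path` for reading until it succeeds or `timeoutMs`
// elapses. Returns true if the file became accessible.
bool WaitForFileAccess(const std::wstring& path, DWORD timeoutMs = 5000)
{
    const ULONGLONG start = GetTickCount64();
    for (;;)
    {
        HANDLE h = CreateFileW(path.c_str(), GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h != INVALID_HANDLE_VALUE)
        {
            CloseHandle(h);                 // it exists and we could open it
            return true;
        }

        const DWORD err = GetLastError();   // typically ERROR_FILE_NOT_FOUND or
                                            // ERROR_SHARING_VIOLATION while the
                                            // writer still holds the file
        if (err != ERROR_FILE_NOT_FOUND &&
            err != ERROR_SHARING_VIOLATION &&
            err != ERROR_ACCESS_DENIED)
            return false;                   // some other, unexpected failure

        if (GetTickCount64() - start > timeoutMs)
            return false;                   // gave up waiting
        Sleep(50);                          // back off and retry
    }
}
```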

Related

Do Memory Mapped Files need Mutex when they are read only?

Recently, something odd started happening with our Windows C/C++ applications.
We use a DLL to map files into the page file, and our applications read these shared files through memory mapping.
Everything is OK when we run just a single instance of the application.
Sometimes we get nothing (just zeros) from the mapped memory when we run 24 instances at the same time -- but no error or exception.
This problem seems to happen more often on slower storage devices.
If the files are stored on a slower device (say, AWS EFS), about 6 of 24 instances hit the problem every time.
But if we move the files to AWS EBS, only about 1 or 2 of 24 instances are affected, and not every time.
My guess is that there may be some conflict during massive concurrent access?
Do I need a mutex for these read-only files?
A mutex is just for protecting writable objects, am I right?
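For context, the reader side of such a shared mapping presumably looks something like this sketch (the mapping name and the struct are invented for illustration; the DLL's real code is not shown in the question):

```cpp
#include <windows.h>

// Invented placeholder for the structs that describe the file layout.
struct FileDescription
{
    unsigned magic;
    unsigned recordCount;
    // ...
};

// Open an existing mapping created by the writer/loader process and map a
// read-only view of it. Returns nullptr if no shared data is visible to us.
const FileDescription* OpenSharedDescription()
{
    HANDLE mapping = OpenFileMappingW(FILE_MAP_READ, FALSE, L"Local\\MyAppFileDesc");
    if (mapping == nullptr)
        return nullptr;

    // Many processes can read a view like this concurrently without a mutex;
    // synchronization only becomes necessary if somebody writes to it.
    return static_cast<const FileDescription*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
}
```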
More information:
Everything happened INSIDE that DLL.
EXEs just use this DLL to get TRUE or FALSE.
The DLL is used to judge whether some given data belongs to a certain file.
Some structs describe the layout of the files; the problem is that a certain struct just contains 0 when it should not, but not every time.
I logged the parameters inside the DLL; they are passed to the DLL correctly, every time.
I still don't know how and why this happened, but I found that I can avoid the problem simply by adding a RETRY to that judging function.
I still think this is a kind of I/O problem, because a RETRY avoids it, but I have no further evidence.
And maybe the title doesn't fit this problem very well, so I think it's time to close it.
Finally, I figured it out.
This is NOT a memory-mapped-file problem; it is a LOGICAL problem.
Our DLL did not have enough permissions, so when we shared our data in memory, NOBODY could see it!
And our applications are designed to load the data themselves if they cannot find any shared data, which is where the difference between EFS and EBS comes from!
These applications are very old, there is no documentation left, and nobody knows how they work, so I had to dig the information out of the source code ...
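For what it's worth, the "not enough authority" explanation is consistent with how named mappings behave on Windows: whether readers can see the object depends on the namespace it is created in and on its security descriptor. A sketch of the creating side (names and sizes are invented):

```cpp
#include <windows.h>

// Create a pagefile-backed, named shared-memory object for the readers to open.
// If this is created as "Global\\..." without the SeCreateGlobalPrivilege, or
// with a security descriptor the reader processes cannot satisfy, the readers'
// OpenFileMappingW() calls fail, they silently fall back to loading the data
// themselves, and the EFS/EBS timing difference described above appears.
HANDLE CreateSharedDescription(DWORD sizeBytes)
{
    HANDLE mapping = CreateFileMappingW(
        INVALID_HANDLE_VALUE,        // back the mapping with the page file
        nullptr,                     // default security descriptor
        PAGE_READWRITE,
        0, sizeBytes,
        L"Local\\MyAppFileDesc");    // session-local; "Global\\..." needs extra privilege

    if (mapping == nullptr)
        return nullptr;              // e.g. ERROR_ACCESS_DENIED

    // ... map a view, fill it with the file description data, and keep the
    // handle alive for as long as readers may need it ...
    return mapping;
}
```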

Extreme performance difference when reading the same files a second time with C

I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after running a different dummy program that does almost the same thing beforehand, the next readings take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So, my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my file cache assumption? Is there another (better) explanation for these different loading times?
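The kind of read being timed presumably looks something like this sketch (the file name is invented; the timing simply makes the cold-cache vs. warm-cache difference visible):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Read a whole binary file into a char buffer and report how long it took.
// Note: for files of 2 GB and larger on Windows, _fseeki64/_ftelli64 should
// be used instead of fseek/ftell.
std::vector<char> ReadWholeFile(const char* path)
{
    const auto start = std::chrono::steady_clock::now();

    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return {};

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<char> buffer(static_cast<size_t>(size));
    std::fread(buffer.data(), 1, buffer.size(), f);
    std::fclose(f);

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start).count();
    std::printf("read %zu bytes in %lld ms\n", buffer.size(), (long long)ms);
    return buffer;
}

int main()
{
    ReadWholeFile("big_file.bin");   // cold cache: slow (disk-bound)
    ReadWholeFile("big_file.bin");   // warm cache: much faster (served from RAM)
}
```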
Educated guess here:
You most likely are right with your file cache assumption.
Can you pre-load the files before the user runs the software?
Not directly. How is your program supposed to know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms that provide faster and better-aimed access to your data. This is helpful if you only need small chunks of the data at a time.
Attempt to parallelize the loading of the data, so even if it does not actually get faster, the user has the impression it does, because they can already start working with the data they have while the rest is fetched in the background (see the sketch after this list).
Have a helper tool start up with the OS and pre-fetch everything, so you already have it in memory when required. Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on the implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
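A rough sketch of the second option (file names and the processing step are placeholders; any loading routine would do in place of the simple one used here):

```cpp
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Minimal loader for the sketch: slurp a whole file into memory.
static std::vector<char> ReadWholeFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Placeholder for whatever work is done on the loaded data.
static void Process(const std::vector<char>& data) { (void)data; }

int main()
{
    // Kick off the later files in the background...
    auto second = std::async(std::launch::async, ReadWholeFile, std::string("data2.bin"));
    auto third  = std::async(std::launch::async, ReadWholeFile, std::string("data3.bin"));

    // ...and let the user start working on the first one right away.
    Process(ReadWholeFile("data1.bin"));

    // By the time these are needed they may already be loaded, or at least
    // warm in the OS file cache.
    Process(second.get());
    Process(third.get());
}
```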
You can also try to combine the first two options. The key to faster data availability is to figure out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem, it is impossible to provide more specific solutions, though.

With what API do you perform a read-consistent file operation in OS X, analogous to the Windows Volume Shadow Copy Service

We're writing a C++/Objective-C app, runnable on OS X versions 10.7 through the present (10.11).
Under Windows, there is the concept of a shadow copy, which allows you to read a file as it exists at a certain point in time, without having to worry about other processes writing to that file in the interim.
However, I can't find any documentation or online articles discussing a similar feature in OS X. I know that OS X will not lock a file when it's being written to, so is it necessary to do something special to make sure I don't pick up a file that is in the middle of being modified?
Or does the Journaled Filesystem make any special handling unnecessary? I'm concerned that if I have one process that is creating or modifying files (within a single context of, say, an fopen call - obviously I can't be guaranteed of "completeness" if the writing process is opening and closing a file repeatedly during what should be an atomic operation), that a reading process will end up getting a "half-baked" file.
And if JFS does guarantee that readers only see "whole" files, does this extend to FAT32 volumes that may be mounted as external drives?
A few things:
On Unix, once you open a file, if it is replaced (as opposed to modified), your file descriptor continues to access the file you opened, not its replacement.
Many apps will replace rather than modify files, using things like -[NSData writeToFile:atomically:] with YES for atomically:.
Cocoa and the other high-level frameworks do, in fact, lock files when they write to them, but that locking is advisory, not mandatory, so other programs also have to opt in to the advisory locking system to be affected by it.
The modern approach is File Coordination. Again, this is a voluntary system that apps have to opt in to.
There is no feature quite like what you described on Windows. If the standard approaches aren't sufficient for your needs, you'll have to build something custom. For example, you could make a copy of the file that you're interested in and, after your copy is complete, compare it to the original to see if it was being modified as you were copying it. If the original has changed, you'll have to start over with a fresh copy operation (or give up). You can use File Coordination to at least minimize the possibility of contention from cooperating programs.
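A rough sketch of that copy-and-compare approach (using std::filesystem purely for brevity; checking size and modification time is a heuristic, not a guarantee):

```cpp
#include <filesystem>
#include <optional>

namespace fs = std::filesystem;

// Copy `source` to `snapshot` and accept the copy only if the source looks
// unchanged across the copy. Retries a few times, then gives up.
std::optional<fs::path> SnapshotFile(const fs::path& source,
                                     const fs::path& snapshot,
                                     int maxAttempts = 3)
{
    for (int attempt = 0; attempt < maxAttempts; ++attempt)
    {
        const auto sizeBefore = fs::file_size(source);
        const auto timeBefore = fs::last_write_time(source);

        fs::copy_file(source, snapshot, fs::copy_options::overwrite_existing);

        if (fs::file_size(source) == sizeBefore &&
            fs::last_write_time(source) == timeBefore)
            return snapshot;    // nothing observably changed while copying
    }
    return std::nullopt;        // the file kept changing; the caller decides what to do
}
```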

Databases vs Files (performance)

My C++ program has to read information about 256 images just once. The information is simple: the path and some floats per image.
I don't need any kind of concurrent access. Also, I don't care about writing, deleting or updating the information and I don't have to do any kind of complex query. This is my pipeline:
Read information about one image.
Store that information in an object.
Do some calculation with the information.
Delete the object.
Next image.
I can use 256 files (one per image, each with the same kind of information), a single file with all the information, or a PostgreSQL database. Which will be faster?
Your question 'which will be faster' is tricky, as performance depends on so many different factors, including the OS, whether the database or file system is on the same machine as your application, the size of the images, etc. I would guess that you could find some combination that would make any of your options the fastest if you try hard enough.
Having said that, if everything is running on the same machine, then a file-based approach would intuitively seem faster than a database, simply because a database generally provides more functionality and hence does more work (not just serving requests but also background tasks), so it uses more of your computing power.
Similarly, it seems intuitive that a single file will be more efficient than multiple files, as it saves the open (and, if necessary, close) operations associated with multiple files. But, again, giving an absolute answer is hard, as opening and closing multiple files may be a common use case that certain OSes have optimised, making it as fast as (or even faster than) just using a single file.
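As an illustration of the single-file variant, a plain-text file with one line per image (path followed by its floats) could be consumed like this (the format and file name are assumptions):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct ImageInfo
{
    std::string path;
    std::vector<float> values;
};

int main()
{
    // One line per image: <path> <float> <float> ...
    std::ifstream in("images.txt");
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        ImageInfo info;
        fields >> info.path;
        for (float v; fields >> v; )
            info.values.push_back(v);

        // ... do the per-image calculation here, then let `info` go away ...
        std::cout << info.path << ": " << info.values.size() << " floats\n";
    }
}
```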
If performance is very important for your solution, it is hard to avoid having to do some comparative testing with your target deployment systems.

Is it necessary to close files after reading (only) in any programming language?

I read that a program should close files after writing to them in case there is still data in the write buffer not yet physically written to it. I also read that some languages such as Python automatically close all files that go out of scope, such as when the program ends.
But if I'm merely reading a file and not modifying it in any way, except perhaps the OS updating its last-access date, is there ever a need to close it (even if the program never terminates, such as a daemon that monitors a log file)?
(Why is it necessary to close a file after using it? asks about file access in general, not only for reading.)
In general, you should always close a file after you are done using it.
Reason number 1: There is not an unlimited supply of file descriptors (or, in Windows, the conceptually similar HANDLEs).
Every time you access a file resource, even for reading, you are reducing the number of handles (or FDs) available to other processes.
Every time you close a handle, you release it and make it available to other processes.
Now consider the consequences of a loop that opens a file, reads it, but doesn't close it...
http://en.wikipedia.org/wiki/File_descriptor
https://msdn.microsoft.com/en-us/library/windows/desktop/aa364225%28v=vs.85%29.aspx
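To make that loop example concrete in C++ (a sketch; the exact point at which opens start failing depends on the OS's per-process limit):

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Opens the same file over and over without ever closing it. Eventually
// fopen() starts returning NULL because the process has exhausted its
// file descriptors / handles.
void LeakyLoop(const std::string& path)
{
    for (int i = 0; i < 100000; ++i)
    {
        std::FILE* f = std::fopen(path.c_str(), "rb");
        if (!f)
        {
            std::printf("ran out of file descriptors at iteration %d\n", i);
            return;
        }
        // ... read something, but "forget" to call fclose(f) ...
    }
}

// The RAII version: std::ifstream's destructor closes the file at the end
// of every iteration, so the loop can run indefinitely.
void WellBehavedLoop(const std::string& path)
{
    for (int i = 0; i < 100000; ++i)
    {
        std::ifstream in(path, std::ios::binary);
        // ... read something ...
    }
}
```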
Reason number 2: If you are doing anything other than reading a file, there can be race conditions when multiple processes or threads access the same file.
To avoid this, you may find file locks in place.
http://en.wikipedia.org/wiki/File_locking
If you are reading a file and not closing it afterward, other applications that try to obtain a lock on that file are denied access.
Oh, and the file can't be deleted by anyone who doesn't have the rights to kill your process.
Reason number 3: There is absolutely no reason to leave a file unclosed, in any language, which is why Python helps lazy programmers and automatically closes a handle that drops out of scope, in case the programmer forgot.
Yes, it's better to close the file after reading is complete.
That's necessary because other software might request exclusive access to that file. If the file is still open, such a request will fail.
Not closing a file results in unnecessary resources being held from the system (file descriptors on Unix and handles on Windows). This becomes especially important when a bug occurs in some sort of loop or when a system is never turned off. Some languages close forgotten files themselves when, for example, they go out of scope; others don't, or only do so at some unpredictable time when it happens to be checked (like the garbage collector in Java).
Imagine you have some sort of system that needs to run forever, for example a server. Unclosed files can then consume more and more resources, until ultimately all available descriptors are used up by unclosed files.
In order to read a file you have to open it, so independent of what you do with the file, resources will be reserved for it. So far I have tried to explain the importance of closing a file in terms of resources; it's also important that you as a programmer know when an object (file) can be closed because no further use of it is required. I think it's bad practice not to be at least aware of unclosed files, and good practice to close files when they are no longer needed.
Some applications also require exclusive access to a file, meaning no other application may have it open. For example, this happens on Windows when you're trying to empty your recycle bin or move a file that you still have open (this is referred to as file locking). While you still have the file open, Windows won't let you delete or move it. This is just an example of when it is annoying that a file is open when it really shouldn't be. It happens to me daily.