Execution time overhead at index 2^21 - C++

What do I want to do?
I have written a program which reads data from binary files and does calculations based on the values it reads. Execution time is the most important metric for this program. To validate that my program operates within the specified time limits, I log every calculation by storing the entries in a std::vector<std::string>. After the time-critical execution is done, I write this vector to a file.
What is stored inside the vector?
In the vector I store the elapsed execution time (std::chrono::steady_clock::now()) and the current wall-clock time (std::chrono::system_clock::now(), formatted with Howard Hinnant's date.h).
What did I observe?
While analyzing the results I stumbled over the following pattern. Independent of the input data, the mean execution time of 0.003 ms per operation explodes to ~20 ms for a single operation at one specific, reproducible index. After this, the execution time of all operations goes back to 0.003 ms. The index of the spike is 2097151 every time. Since 2^21 equals 2097152, something happens at 2^21 that slows down the entire program. The same effect can be observed at 2^22 and 2^23. Even more interesting is that the lag roughly doubles each time (2^21 = ~20 ms, 2^22 = ~43 ms, 2^23 = ~81 ms). I googled this specific number and the only thing I found was some Node.js stuff which uses C++ under the hood.
What do I suspect?
At index 2^21 a memory area must be expanded, and that is why the delay occurs.
Questions
Is my assumption correct, i.e. is the growth of the vector the problem?
How can I debug such a phenomenon, to be certain that the vector really is the problem?
Can I allocate enough memory beforehand to avoid the reallocation?
What could I use instead of a std::vector that supports more than 10,000,000,000 elements?

I was able to solve my problem by reserving memory with std::vector::reserve() before the time-critical part of my program. Thanks to all the comments.
Here the working code I used:
std::vector<std::string> myLogVector;
myLogVector.reserve(12000000);
//...do time critical stuff, without reallocating storage
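For context: std::vector grows geometrically, so when its size hits the current capacity (a power of two in many implementations) it allocates a larger block and moves every existing element over, which is why the pause roughly doubles each time it occurs. A minimal sketch (illustrative only, names hypothetical) that makes the reallocations visible by logging each capacity change:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> log;
    std::size_t lastCapacity = log.capacity();
    for (std::size_t i = 0; i < (std::size_t(1) << 23); ++i)
    {
        auto t0 = std::chrono::steady_clock::now();
        log.emplace_back("dummy log entry");
        auto t1 = std::chrono::steady_clock::now();
        if (log.capacity() != lastCapacity) // a reallocation just happened
        {
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            std::printf("index %zu: capacity %zu -> %zu, push took %lld us\n",
                        i, lastCapacity, log.capacity(), us);
            lastCapacity = log.capacity();
        }
    }
}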

Related

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computationally intensive function that takes 2 arrays as input and produces one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array is seldom cached, because it keeps being evicted to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce the prefetches;
Trying to process smaller chunks of data. However, this would work only if the problem allows it;
Disabling hardware prefetchers where possible, to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise away the second (unnecessary) fetch of output[], have you considered using SSE2/3/4 registers to hold your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements in output[] are strictly write-only. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and never read, you can use streaming stores, i.e., write instructions with a no-read hint, so the lines will not be fetched into the cache.
You can use prefetching with the non-temporal (NTA) hint for the inputs. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache, 1/8th per thread.
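A minimal sketch of both ideas (the function, array names, and prefetch distance are hypothetical; intrinsics from <xmmintrin.h>):
#include <cstddef>
#include <xmmintrin.h> // _mm_prefetch, _mm_stream_ps, _mm_sfence

// Hypothetical kernel: out[i] = a[i] + b[i], with out write-only.
// Streaming stores keep 'out' from occupying cache lines; NTA prefetch
// hints limit how much cache the input streams consume.
void add_streaming(float* out, const float* a, const float* b, std::size_t n)
{
    // Assumes out is 16-byte aligned and n is a multiple of 4.
    for (std::size_t i = 0; i < n; i += 4)
    {
        _mm_prefetch(reinterpret_cast<const char*>(a + i + 64), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const char*>(b + i + 64), _MM_HINT_NTA);
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_stream_ps(out + i, _mm_add_ps(va, vb)); // non-temporal store
    }
    _mm_sfence(); // make the streaming stores globally visible
}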
I guess the solution is hidden inside the algorithm employed, the L1 cache size, and the cache line size.
Though I am not sure how much performance improvement we would see with this.
We could probably introduce artificial reads which cleverly dodge the compiler's optimizer and, during execution, do not hurt the computations either. A single artificial read should fill as many cache lines as are needed to accommodate one page. Therefore, the algorithm should be modified to compute blocks of the output array, something like the blocking used in matrix multiplication of huge matrices on GPUs: they compute and write the result in blocks of the matrices.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial reads, we should initialize the output array at the right places, once in each block, probably with 0 or 1.
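A rough sketch of the blocking idea (the block size and the computation are hypothetical stand-ins): compute the output in small tiles so all writes to a region of output[] happen back to back while its lines are resident.
#include <cstddef>

const std::size_t BLOCK = 1024; // hypothetical; pick so a tile fits in L1

void compute_blocked(float* output, const float* in1, const float* in2, std::size_t n)
{
    for (std::size_t base = 0; base < n; base += BLOCK)
    {
        const std::size_t end = (base + BLOCK < n) ? base + BLOCK : n;
        // Every update to output[base..end) happens within this tile,
        // so its cache lines are not evicted between updates.
        for (std::size_t i = base; i < end; ++i)
            output[i] = in1[i] * in2[i]; // stand-in for the real computation
    }
}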

Threaded reading of files in C++

I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0->61, thread1 gets paths 62->123, and so on, and then runs the remaining files serially at the end.
I have implemented timers throughout the code to see where it bottlenecks. Run serially, each file takes around 3.5 s; for 8 files in parallel the time taken is around 21 s (a reduction from the 28 s that 8 x 3.5 s would take serially, but not by much).
The problematic section of my code is below
if (DIAG_timers) {readTimer = timerNow();}
for (yindex = 0; yindex < ycells; yindex++)
{
    for (xindex = 0; xindex < xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).
convertToNumber takes the value in alphaStringValue and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40 MB each (just over 5120000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1). I have 16 GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (1 per thread) should be open at once. I am unsure if mmap would do this better? Several threads on Stack Overflow argue about the merits of mmap vs. more straightforward read operations, in particular for sequential access, so I don't know whether it would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files was leading to slow IO via vaguely random access, but that just reduced the process to roughly serial speed.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
template<class T> inline void convertToNumber(std::string const& s, T& result)
{
    std::istringstream i(s);
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions? Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
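TonyD's exact suggestion isn't quoted above; a typical replacement in this spirit (a sketch, not the original code) parses with strtod and avoids constructing a stringstream per line:
#include <cstdlib>
#include <string>

// Returns false instead of throwing when nothing could be parsed.
inline bool convertToNumberFast(const std::string& s, double& result)
{
    char* endptr = 0;
    result = std::strtod(s.c_str(), &endptr);
    return endptr != s.c_str();
}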
The files to be read are approximately 40 MB each (just over 5120000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1). I have 16 GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time, unless they were already read by another process shortly before.
since only 8 (1 per thread) should be open at once.
Since any files that were not already in memory before the process started will have to be loaded, and that loading counts towards the process's wall-clock time, it does not matter how many are open at once. Any file that is not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system-call overhead (with read you pay a system call for each chunk you read, while with mmap pages brought in by read-ahead won't cause further page faults). But it will not save you the time needed to read the files from disk if they are not already cached.
mmap does not load anything into memory. The kernel loads data from disk into internal buffers, the page cache. read copies the data from there to your application buffer, while mmap exposes parts of the page cache directly in your address space. In either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, the next process will get them faster. But if it's the first access after a long time, the data will have to be read from disk, and this affects read and mmap in exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). That way you can confirm how much time was spent on I/O.
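A minimal sketch of that check (POSIX getrusage; note RUSAGE_THREAD is Linux-specific, use RUSAGE_SELF for the whole process):
#include <cstdio>
#include <sys/resource.h>

// Call at the end of a thread: if user+sys CPU time is far below the
// wall-clock time the thread ran for, it was mostly waiting on I/O.
void reportCpuTime()
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0)
    {
        std::printf("user %ld.%06ld s, sys %ld.%06ld s\n",
                    (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                    (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    }
}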
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does however make the code slightly more complex, since you can't directly use the file I/O functions on an mmapped region (and the main benefit is sort of lost if you use the "m" mode of the stdio functions, as you are then getting at least one copy). From past experiments I have made, mmap beats all other file-reading variants by some amount. How much depends on what proportion of the overall time is spent waiting for the disk, and how much is spent actually processing the file content.
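For reference, a minimal mmap sketch (POSIX; the file name is hypothetical): map the whole file read-only and parse it in place, with no read() copies:
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char* path = "alpha.dat"; // hypothetical data file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);
    long lines = 0; // e.g., count lines; real code would parse values here
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::printf("%ld lines\n", lines);
    munmap(p, st.st_size);
    close(fd);
}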

C++ program stability after millions of executions

I have a program in C++ that performs mainly matrix multiplications, additions and so on.
The problem is that an EXC_BAD_ACCESS happens when the calculation has run about 3 million times.
Are there any problems that can arise when a program executes millions of times over several hours?
Details of the program:
The program simply runs calculations on different ranges of values, so it executes on 6 threads at the same time. There is no resource sharing between the threads.
There seems be no evident problem in the program since:
there is no memory leak; I've confirmed this using Instruments, and the memory footprint of the program is stable.
the program can execute at least 2 million times on each thread without any problem, but it is almost guaranteed that the EXC_BAD_ACCESS exception arises at some point, on some thread (the exception happened in both of my 2 runs of the program).
About the matrix multiplication:
Sometimes the matrices are about 2x2 multiplied by 2x1000.
The elements of the matrices are a custom complex-number class.
The values of the elements are randomly generated by rand() and converted to float.
the structure is like this:
class Complex
{
private:
    float _real, _imag;
public:
    // getters, setters and overloaded operators
};

class Matrix
{
private:
    Complex **_values;
    int _row, _col;
public:
    // getters, setters and overloaded operators
};
Thank you very much!
Any possible reason for the crash is greatly welcomed!
EXC_BAD_ACCESS means that you dereferenced a pointer which doesn't point into your process's current memory space. This is a bug in your code. Run it under a debugger until it fails and then have a look at the variable values in the statement where it fails. It could be simple or exceedingly subtle.
There's too little information in your post to make a decisive answer. However, it might be that no information available to you now would change it, and you need to debug the case more carefully. Here's what I'd do.
To debug, you want repeatability. But… you say that you're using random numbers. It seems, though, that what your program does is some scientific-ish computation. In most cases you don't actually need "true" randomness, but "repeatable" randomness: randomness which passes statistical tests, but where you have enough data to reset the random number generator so that it will produce exactly the same results as in a previous run. For that, you can just write down the current RNG state (e.g., the seed) every time you start a new block of computation.
Now, write some code that stores all the state necessary to restart the computations (including the RNG) once every few minutes, and run the program. This way, if your code crashes, you will be able to restart the computations with the exact same state and get to the point where it crashed without waiting for millions of iterations. I am making a strong assumption here that, except for the RNG, your code does not depend on any other kind of external state (like network activity, IO, or the process scheduler making certain choices when scheduling your threads).
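A minimal sketch of the RNG part (assuming a <random> engine; the question only mentions rand(), whose state cannot be portably saved): C++11 engines can be streamed out and back in, so a run can be resumed with an identical random sequence.
#include <fstream>
#include <random>

int main()
{
    std::mt19937 rng(12345);
    { std::ofstream out("rng_state.txt"); out << rng; }    // checkpoint
    std::mt19937 restored;
    { std::ifstream in("rng_state.txt"); in >> restored; } // restore
    // Both engines now produce exactly the same sequence.
    return rng() == restored() ? 0 : 1;
}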
With this kind of data it will be easier to test whether the problem is due to a machine fault (overheating, bad memory, etc.). Simply restart the computation from the last state before the crash, preferably after letting the machine cool down, maybe after restarting it… If you encounter another crash (and it happens every time you restart the code), it's quite certain the crash is due to a bug in your code.
If not, we still cannot say that it's a machine fault: your code might (by pure accident or a mistake in the code) crash due to undefined behavior which depends on factors out of your control. Examples include using an uninitialized pointer in a rarely-taken code path: it might throw a bad access sometimes, and go unnoticed if by pure luck the pointer points to memory you allocated. Try valgrind; it is probably the best tool to check for memory problems… except that it slows down execution so much that you'll again prefer to rerun the computations from a state known to be suspicious (the last state before the crash) instead of waiting for millions of iterations. I've seen slowdowns of 5x to 100x.
In the meantime, try running your code on another machine. If you also get crashes after a similar number of iterations (to be sure, wait for at least 3 times more iterations than it took to crash on the original machine), then it's quite probable that it's a bug in your code.
Happy hacking!
Calculations with finite precision that fail after a few million iterations? That could be accumulated round-off error. The problem is, such errors usually exhibit themselves as division by zero or other mathematical errors, and EXC_BAD_ACCESS is not one of those. However, there is one case in which it can happen: when you use the mathematical result as an array index.
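A contrived illustration of that failure mode (values hypothetical): accumulated rounding drift pushes a computed index outside the array.
#include <vector>

int main()
{
    std::vector<int> table(1000, 0);
    float x = 0.0f;
    for (int i = 0; i < 10000000; ++i)
        x += 0.0001f;              // exact sum is 1000; the float sum drifts
    int idx = static_cast<int>(x); // may land outside [0, 999] after drift
    // Unchecked, table[idx] would be exactly the kind of bad access described.
    return (idx >= 0 && idx < (int)table.size()) ? table[idx] : 1;
}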

Function calling performance

I have called snprintf a few times consecutively with different arguments, and measured the time needed for each call. I found that the first call to snprintf takes the longest; after that, the time needed to call the same function decreases until it converges. What is the reason for that? I have tried other functions and they exhibit the same behavior.
I am asking because it relates to testing code performance. Normally, in the main program, the function would only be called periodically. However, when I test the function separately, in a loop, it runs faster, resulting in an inaccurate measurement of its performance.
The first call takes 4000+ ns, the second call takes 1700 ns, the third call takes 800 ns; after around 10 calls it is reduced to 130 ns.
snprintf(buffer, 32, "%d", randomGeneratedNumber1);
snprintf(buffer, 32, "%d", randomGeneratedNumber2);
.
.
.
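For reference, a sketch of the kind of measurement described (the numbers will vary by machine):
#include <chrono>
#include <cstdio>

int main()
{
    char buffer[32];
    for (int i = 0; i < 16; ++i)
    {
        auto t0 = std::chrono::steady_clock::now();
        std::snprintf(buffer, sizeof buffer, "%d", i * 12345);
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("call %2d: %lld ns\n", i, ns);
    }
}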
The most likely explanation is that the function's code ends up in the instruction cache after the first call, just as the input data (if there is any) ends up in the data cache. Furthermore, some branches may be predicted correctly the second time around.
So, all in all, "things have been cached".
Your program may be dynamically linked to the library containing snprintf(). The first-call delay would then be the time needed to load the library and resolve the symbol.
Look into the TLB and caches. The short answer: for small code like this, the caches dominate the execution time. For larger programs, besides the caches, memory pages may be swapped out and later swapped back in from the hard disk. So when a part of the code is used very often, it is not swapped out, and its execution time improves.

C++ map performance - Linux (30 sec) vs Windows (30 mins)!

I need to process a list of files. The processing action must not be repeated for the same file. The code I am using for this is:
using namespace std;

vector<File*> gInputFileList;          // Can contain duplicates; File has member sFilename
map<string, File*> gProcessedFileList; // Using map to avoid linear search costs

void processFile(File* pFile)
{
    File* pProcessedFile = gProcessedFileList[pFile->sFilename];
    if (pProcessedFile != NULL)
        return; // Already processed
    foo(pFile); // foo() is the action to do for each file
    gProcessedFileList[pFile->sFilename] = pFile;
}

int main()
{
    size_t n = gInputFileList.size(); // Array syntax (iterator syntax gives identical performance)
    for (size_t i = 0; i < n; i++) {
        processFile(gInputFileList[i]);
    }
}
The code works correctly, but...
My problem is that when the input size is 1000, it takes 30 minutes - HALF AN HOUR - on Windows/Visual Studio 2008 Express. For the same input, it takes only 40 seconds to run on Linux/gcc!
What could be the problem? The action foo() takes only a very short time to execute, when used separately. Should I be using something like vector::reserve for the map?
EDIT, EXTRA INFORMATION
What foo() does is:
1. it opens the file
2. reads it into memory
3. closes the file
4. the contents of the file in memory is parsed
5. it builds a list of tokens; I'm using a vector for that.
Whenever I break into the program (while running it with the 1000+ file input set), the call stack shows that the program is in the middle of a std::vector add.
In Microsoft Visual Studio, there is a global lock when accessing the Standard C++ Library, to protect from multithreading issues in Debug builds. This can cause big performance hits. For instance, our full test code runs in 50 minutes on Linux/gcc, whereas it needs 5 hours on Windows/VC++2008. Note that this performance hit does not exist when compiling in Release mode, using the non-debug Visual C++ runtime.
I would approach it like any performance problem. This means: profiling. MSVC has a built-in profiler, by the way, so it may be a good chance to get familiar with it.
Break into the program using the debugger at a random time, and the chances are very high that the stack trace will tell you where it's spending the time.
I very very strongly doubt that your performance problem is coming from the STL containers.
Try to eliminate (comment out) the call to foo(pFile) or any other method which touches the filesystem. Although running foo(pFile) once may appear fast, running it on 1000 different files (especially on Windows filesystems, in my experience) could turn out to be much slower (e.g. because of filesystem cache behaviour.)
EDIT
Your initial post claimed that BOTH debug and release builds were affected. Now you are withdrawing that claim.
Be aware that in DEBUG builds:
the STL implementation performs extra checks and assertions;
heap operations (memory allocation etc.) perform extra checks and assertions; moreover, under debug builds the low-fragmentation heap is disabled (up to a 10x overall slowdown in memory allocation);
no code optimizations are performed, which may result in further STL performance degradation (the STL often relies heavily on inlining, loop unrolling, etc.).
With 1000 iterations you are probably not affected by the above (at the outer loop level at least), unless you use the STL or the heap heavily INSIDE foo().
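If you do need to measure Debug builds, the checked-container machinery can be dialled down with these macros (they exist in VC++2008; verify the effect on your exact toolset, and define them consistently across all translation units):
// Must be defined before any standard library header is included.
#define _SECURE_SCL 0             // disable checked iterators
#define _HAS_ITERATOR_DEBUGGING 0 // disable iterator-debugging bookkeeping
#include <map>
#include <string>
#include <vector>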
I would be astounded if the performance issues you are seeing have anything at all to do with the map class. Doing 1000 lookups and 1000 insertions should take a combined time on the order of microseconds. What is foo() doing?
Without knowing how the rest of the code fits in, I think the overall idea of caching processed files is a little flaky.
Try removing duplicates from your vector first, then process them all.
Try commenting out each block or major operation to determine which part actually causes the difference in execution time between Linux and Windows. I also don't think it is because of the STL map. The problem may be inside foo(); file operations are the only thing I can think of that would be costly here.
You can insert clock() calls between operations to get an idea of the execution times.
You say that when you break, you find yourself inside vector::add. You don't have a vector::add in the code you've shown us, so I suspect it's inside the foo function. Without seeing that code, it's going to be difficult to say what's up.
You might have inadvertently created a Shlemiel the Painter algorithm.
You can improve things somewhat if you ditch your map and partition your vector instead. This implies reordering the input files list. It also means you have to find a way of quickly determining if a file has been processed already, possibly by holding a flag in the File class. If it's ok to reorder the files list and if you can store that dirty flag in the File object then you can improve performance from O(n log m) to O(n), for n total files and m processed files.
#include <algorithm>
#include <functional>
// ...
vector<File*>::iterator end(partition(inputfiles.begin(), inputfiles.end(),
                                      not1(mem_fun(&File::is_processed))));
for_each(inputfiles.begin(), end, processFile);
If you can't reorder the files list, or if you can't change the File object, then you can replace the map with a vector and shadow each file in the input list with a flag in a second vector at the same index. This costs O(n) space but gives an O(1) check for dirty state.
vector<File*> processed(inputfiles.size(), 0);
for (vector<File*>::size_type i(0); i != inputfiles.size(); ++i) {
    if (processed[i] != 0) continue; // O(1); skip already-processed files
    // ...
    processed[i] = inputfiles[i];    // O(1)
}
But be careful: You're dealing with two distinct pointers pointing at the same address, and that's the case for each pair of pointers in the two containers. Make sure one and only one pointer owns the pointee.
I don't expect either of these to yield a solution for that performance hit, but nevertheless.
If you are doing most of your work on Linux, then I strongly suggest you only ever compile in Release mode on Windows. That makes life much easier, especially considering all the inflexible Windows library-handling headaches.