In an ACM example, I had to build a big table for dynamic programming. I had to store two integers in each cell, so I decided to go for a std::pair<int, int>. However, allocating a huge array of them took 1.5 seconds:
std::pair<int, int> table[1001][1001];
Afterwards, I have changed this code to
struct Cell {
int first;
int second;
}
Cell table[1001][1001];
and the allocation took 0 seconds.
What explains this huge difference in time?
std::pair<int, int>::pair() constructor initializes the fields with default values (zero in case of int) and your struct Cell doesn't (since you only have an auto-generated default constructor that does nothing).
Initializing requires writing to each field which requires a whole lot of memory accesses that are relatively time consuming. With struct Cell nothing is done instead and doing nothing is a bit faster.
The answers so far don't explain the full magnitude of the problem.
As sharptooth has pointed out, the pair solution initializes the values to zero. As Lemurik pointed out, the pair solution isn't just initializing a contiguous block of memory, instead it is calling the pair constructor for every element in the table. However, even that doesn't account for it taking 1.5 seconds. Something else is happening.
Here's my logic:
Assuming you were on an ancient machine, say running at 1.33ghz, then 1.5 seconds is 2e9 clock cycles. You've got 2e6 pairs to construct, so somehow each pair constructor is taking 1000 cycles. It doesn't take 1000 cycles to call a constructor that just sets two integers to zero. I can't see how cache misses would make it take that long. I would believe it if the number was less than 100 cycles.
I thought it would be interesting to see where else all these CPU cycles are going. I used the crappiest oldest C++ compiler I could find to see if I could attain the level of wastage required. That compiler was VC++ v6. In debug mode, it does something I don't understand. It has a big loop that calls the pair constructor for each item in the table - fair enough. That constructor sets the two values to zero - fair enough. But just before doing that, it sets all the bytes in a 68 byte region to 0xcc. That region is just before the start of the the big table. It then overwrites the last element of that region with 0x28F61200. Every call of the pair constructor repeats this. Presumably this is some kind of book keeping by the compiler so it knows which regions are initialized when checking for pointer errors at run time. I'd love to know exactly what this is for.
Anyway, that would explain where the extra time is going. Obviously another compiler may not be this bad. And certainly an optimized release build wouldn't be.
These are all very good guesses, but as everyone knows, guesses are not reliable.
I would say just randomly pause it within that 1.5 seconds, but you'd have to be pretty quick. If you increased each dimension by a factor of about 3, you could make it take more like 10+ seconds, so it would be easier to pause.
Or, you could get it under a debugger, break it in the pair constructor code, and then single step to see what it is doing.
Either way, you would get a firm answer to the question, not just a guess.
My guess it's the way std::pair is created. There is more overhead during invoking a pair constructor 1001x1001 times than when you just allocate a memory range.
Related
I have a huge 98306 by 98306 2D array initialized. I created a kernel function that counts the total number of elements below a certain threshold.
#pragma omp parallel for reduction(+:num_below_threshold)
for(row)
for(col)
index = get_corresponding_index(row, col);
if (array[index] < threshold)
num_below_threshold++;
For benchmark purpose I measured the execution time of the kernel executing when the number of thread is set to 1. I noticed that the first time the kernel executes it took around 11 seconds. The next call to the kernel executing on the same array with one thread only took around 3 seconds. I thought it might be a problem related to cache but it doesn't seem to be related. What is the possible reasons that caused this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
array[i] = rand() % 10;
}
This same kernel is applied to this array twice and the second execution time is much faster than the first kernel. I though of lazy allocation on Linux but that shouldn't be a problem because of the initialization function. Any explanations will be helpful. Thanks!
Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636 which is way larger than the maximum value a signed 32 bit integer can store (which is 2,147,483,647). Therefore, the upper limit of your for initialization loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior so strictly speaking, anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop does a good job in going over the various elements of the array, and since for about 89% of it, it's for the fist time, the OS does the actual memory allocation there and then, which takes some significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
So that's what I believe happened. That said, many other parameters can enter into play here, such as:
Swapping: the array you try to allocate represents about 36GB of memory. If your machine doesn't have that much memory available, then your code might swap, which will potentially make a big mess of whatever performance measurement you can come up with
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences
Compiler optimization: you didn't mention which compiler you used and which level of optimization you requested. Depending on that, you'd be amazed on how shortened your code could become. For example, the compiler could totally remove the second loop as it does the same thing as the first and becomes useless as the result will be the same... And many other interesting and unexpected things which render your benchmark meaningless
I have an option to either create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A OR fixed the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
std::vector<double> my_vec(5, 0.0);
for ( int i = 0; i < my_vec.size(); i++) {
my_vec.push_back(i);
// do something
}
}
while (condition) {
func();
}
Example B:
void func (std::vector<double>& my_vec)
{
for ( int i = 0; i < my_vec.size(); i++) {
my_vec[i] = i;
// do something
}
}
while (condition) {
std::vector<double> my_vec(5, 0.0);
func(myVec);
}
Which of the two would be computationally inexpensive. The size of the array won't be more than 10.
I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<10,double>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.
When I profile my program I get these results (see lower half):
Functions that I wrote seams to take about the right amount of costs. But I don't know why operator new, mallocand even operator delete as well as free are so expensive.
For example Cabin(const&): it copies a couple of things - including four small vectors:
Cabin::Cabin(const Cabin& c)
: m_entrance(c.m_entrance), //just 3 int's on the stack
m_seatContainer(c.m_seatContainer), //vector of 120 objects
m_rows(c.m_rows), //vector of 20 int's
m_cols(c.m_cols) //vector of 6 int's
{
m_seats.reserve(c.m_seats.size());
for(auto& s : m_seatContainer)
m_seats.emplace_back(&s); //vector of 120 pointers
}
The valgrind output for Cabin(const&) looks like this:
Is it normal for new and malloc to take so much time? What more information is needed? Also I'm not to much familiar with valgrind and profiling in general.
As the top half shows, the Cabin copy ctor is called a million times. You may have only 3778 "real" Cabin objects, but you're passing them around by value as if copies are free.
That said, since m_rows contains 20 ints and m_cols contains 6 ints, you should be using std::array<>. Even for the 120 pointers that may be a better choice.
You should look not just at the percentage of time spent, but also the number of calls column. And I see six million calls to operator new. Well, what did you expect, from that?
This is typically what happens when either the wrong container, or the wrong algorithm gets used. Let's take a look at the small code snippet you showed.
It initializes a vector in a loop that invokes emplace_back(). Well, if that's how you initialize a vector of a moderate-to-large size, you can expect the entire process to involve a series of repeated allocations and reallocations, as the vector grows in size, with each emplace_back()/push_back() call. Not to mention the addition CPU cycles burned moving the partially-grown vector from one buffer to another, each time the vector goes on a growth spurt.
To do this smartly, use reserve() once, to reserve sufficient amount of entries in the vector for the elements you are about to emplace_back() or push_back(). Just one allocation call. You state that this vector is expected to be initialized with a hundred, or so pointers. Most vector implementatins grow the internal vector buffer logarithmically, so that's six or seven allocations. So using reserve(), strategically, should result in one sixths the number of calls to operator new, at this point.
The other thing that the constructor is doing is copying several vectors. This shouldn't take a lot of time, but you also have to ask yourself, is this really necessary in the first place. Perhaps it's sufficient to put all these vectors in a std::shared_ptr-managed reference-counted object, so the only thing that happens here is a pointer copy and a reference count increment, rather than allocating a bunch of new vectors. This is something that only you would know the answer to.
Allocating a buffer for four new vectors, doesn't seem like much. But when you've profiled six millions calls to operator new, and if this is a hot code path, and your profile run indicates this is where you're burning up all your time, then getting rid of even a few unneeded buffer allocations could make quite a bit of a difference.
Well I am really curious as to what practice is better to keep, I know it (probably?) does not make any performance difference at all (even in performance critical applications?) but I am more curious about the impact on the generated code with optimization in mind (and for the sake of completeness, also "performance", if it makes any difference).
So the problem is as following:
element indexes range from A to B where A > 0 and B > A (eg, A = 1000 and B = 2000).
To store information about each element there are a few possible solutions, two of those which use plain arrays include direct index access and access by manipulating the index:
example 1
//declare the array with less memory, "just" 1000 elements, all elements used
std::array<T, B-A> Foo;
//but make accessing by index slower?
//accessing index N where B > N >= A
Foo[N-A];
example 2
//or declare the array with more memory, 2000 elements, 50% elements not used, not very "efficient" for memory
std::array<T, B> Foo;
//but make accessing by index faster?
//accessing index N where B > N >= A
Foo[N];
I'd personally go for #2 because I really like performance, but I think in reality:
the compiler will take care of both situations?
What is the impact on optimizations?
What about performance?
does it matter at all?
Or is this just the next "micro optimization" thing that no human being should worry about?
Is there some Tradeoff ratio between memory usage : speed which is recommended?
Accessing any array with an index involves adding an index multiplied by element size and adding it to the base-address of the array itself.
Since we are already adding one number to another, making the adjustment for foo[N-A] could easily be done by adjusting the base-address down by N * sizeof(T) before adding A * sizeof(T), rather than actually calculating (A-N)*sizeof(T).
In other words, any decent compiler should comletely hide this subtraction, assuming it is a constant value.
If it's not a constant [say you are using std::vector instread of std::array, then you will indeed subtract A from N at some point in the code. It is still pretty cheap to do this. Most modern processors can do this in one cycle with no latency for the result, so at worst adds a single clock-cycle to the access.
Of course, if the numbers are 1000-2000, probably makes really little difference in the whole scheme of things - either the total time to process that is nearly nothing, or it's a lot becuase you do complicated stuff. But if you were to make it a million elements, offset by half a million, it may make the difference between a simple or complex method of allocating them, or some such.
Also, as Hans Passant implies: Modern OS's with virutal memory handling, memory that isn't actually used doesn't get populated with "real memory". At work I was investigating a strange crash on a board that has 2GB of RAM, and when viewing the memory usage, it showed that this one applciation had allocated 3GB of virtual memory. This board does not have a swap-disk (it's an embedded system). It turns out that some code was simply allocating large chunks of memory that wasn't filled with anything, and it only stopped working when it reached 3GB (32-bit processor, 3+1GB memory split between user/kernel space). So even for LARGE lumps of memory, if you only have half of it, it won't actually take up any RAM, if you do not actually access it.
As ALWAYS when it comes to performance, compilers and such, if it's important, do not trust "the internet" to tell you the answer. Set up a test with the code you actually intend to use, using the actual compiler(s) and processor type(s) that you plan to produce your code with/for, and run benchmarks. Some compiler may well have a misfeature (on processor type XYZ9278) that makes it produce horrible code for a case that most other compilers do this "with no overhead at all".
I have a linked list of structures. Lets say I insert x million nodes into the linked list,
then I iterate trough all nodes to find a given value.
The strange thing is (for me at least), if I have a structure like this:
struct node
{
int a;
node *nxt;
};
Then I can iterate trough the list and check the value of a ten times faster compared to when I have another member in the struct, like this:
struct node_complex
{
int a;
string b;
node_complex *nxt;
};
I also tried it with C style strings (char array), the result was the same: just because I had another member (string), the whole iteration (+ value check) was 10 times slower, even if I did not even touched that member ever! Now, I do not know how the internals of structures work, but it looks like a high price to pay...
What is the catch?
Edit:
I am a beginner and this is the first time I use pointers, so chances are, the mistake is on my part. I will post the code ASAP (not being at home now).
Update:
I checked the values again, and I know see a much smaller difference: 2x instead of 10x.
It is much more reasonable for sure.
While it is certainly possible it was the case yesterday too and I was just so freaking tired last night I could not divide two numbers, I have just made more tests and the results are mind blowing.
The times for a the same number of nodes is:
One int and a pointer the time to iterate trough is 0.101
One int and a string: 0.196
One int and 2 strings: 0.274
One int and 3 strings: 0.147 (!!!)
For two ints it is: 0.107
Look what happens when there is more than two strings in the structure! It gets faster! Did somebody drop LSD into my coffee? No! I do not drink coffee.
It is way too fckd up for my brain at the mo' so I think I will just figure it out on my own instead of draining public resources here at SO.
(Ad: I do not think my profiling class is buggy, and anyway I can see the time difference with my own eyes).
Anyhow, thanks for the help.
Cheers.
I must be related to memory access. You speak of a million linked elements. With just an int and a pointer in the node, it takes 8 bytes (assuming 32 bits pointers). This takes up 8 MB memory, which is around the size of cache memory sizes.
When you add other members, you increase the overall size of your data. It does not fit anymore entirely in the cache memory. You revert to plain memory accesses that are much slower.
This may also be caused because during the iteration you may create a copy of your structures. That is:
node* pHead;
// ...
for (node* p = pHead; p; p = p->nxt)
{
node myNode = *p; // here you create a copy!
// ...
}
Copying a simple structure very fast. But the member you've added is a string, which is a complex object. Copying it is a relatively complex operation, with heap access.
Most likely, the issue is that your larger struct no longer fits inside a single cache line.
As I recall, mainstream CPUs typically use a cache line of 32 bytes. This means that data is read into the cache in chunks of 32 bytes at a time, and if you move past these 32 bytes, a second memory fetch is required.
Looking at your struct, it starts with an int, accounting for 4 bytes (usually), and then std::string (I assume, even though the namespace isn't specified), which in my standard library implementation (from VS2010) takes up 28 bytes, which gives us 32 bytes total. Which means that the initial int and the the next pointer will be placed in different cache lines, using twice as much cache space, and requiring twice as many memory accesses if both members are accessed during iteration.
If only the pointer is accessed, this shouldn't make a difference, though, as only the second cache line then has to be retrieved from memory.
If you always access the int and the pointer, and the string is required less often, reordering the members may help:
struct node_complex
{
int a;
node_complex *nxt;
string b;
};
In this case, the next pointer and the int are located next to each others, on the same cache line, so they can be read without requiring additional memory reads. But then you incur the additional cost once you need to read the string.
Of course, it's also possible that your benchmarking code includes creation of the nodes, or (intentional or otherwise) copies being created of the nodes, which would obviously also affect performance.
I'm not a spacialist at all, but the "cache miss" problem rings in my head while reading your problem.
When you had a member, as it makes the size of the structure get bigger, it also might cache misses when going throught the linked list (that is naturally cache-unfriendly if you don't have nodes allocated in one bloc and not far from each other in memory).
I can't find another explaination.
However, we don't have the creation and the loop provided so it's still hard to guess if you're not just having code that don't perform the list exploration in an efficient way.
Perhaps a solution would be a linked list of pointers to your object. It may make things more complicated (unless you use smart pointers, ect.) but it may increase search time.