What is Big O amortization? [duplicate] - c++

What is meant by "Constant Amortized Time" when talking about time complexity of an algorithm?

Amortised time explained in simple terms:
If you do an operation say a million times, you don't really care about the worst-case or the best-case of that operation - what you care about is how much time is taken in total when you repeat the operation a million times.
So it doesn't matter if the operation is very slow once in a while, as long as "once in a while" is rare enough for the slowness to be diluted away. Essentially amortised time means "average time taken per operation, if you do many operations". Amortised time doesn't have to be constant; you can have linear and logarithmic amortised time or whatever else.
Let's take mats' example of a dynamic array, to which you repeatedly add new items. Normally adding an item takes constant time (that is, O(1)). But each time the array is full, you allocate twice as much space, copy your data into the new region, and free the old space. Assuming allocations and frees run in constant time, this enlargement process takes O(n) time where n is the current size of the array.
So each time you enlarge, you take about twice as much time as the last enlarge. But you've also waited twice as long before doing it! The cost of each enlargement can thus be "spread out" among the insertions. This means that in the long term, the total time taken for adding m items to the array is O(m), and so the amortised time (i.e. time per insertion) is O(1).
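A minimal sketch of that doubling strategy (hand-rolled rather than std::vector, just to make the regrow step explicit; assumes T is default-constructible and copyable, and skips error handling):
#include <cstddef>

// Toy dynamic array: push_back is O(1) amortized because the O(n) regrow
// happens only once per n cheap pushes, so its cost is spread over them.
template <typename T>
class DynArray {
    T* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
public:
    void push_back(const T& value) {
        if (size_ == capacity_) {                    // full: regrow (rare, O(n))
            std::size_t newCap = capacity_ ? capacity_ * 2 : 1;
            T* newData = new T[newCap];
            for (std::size_t i = 0; i < size_; ++i)  // copy existing elements over
                newData[i] = data_[i];
            delete[] data_;
            data_ = newData;
            capacity_ = newCap;
        }
        data_[size_++] = value;                      // common case: O(1)
    }
    std::size_t size() const { return size_; }
    ~DynArray() { delete[] data_; }
};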

It means that over time, the worst-case scenario averages out to O(1), or constant time. A common example is the dynamic array. If we have already allocated memory for a new entry, adding it will be O(1). If we haven't, we allocate, say, twice the current amount. This particular insertion will not be O(1), but rather something else.
What is important is that the algorithm guarantees that after a sequence of operations the expensive operations will be amortised, thereby rendering each operation O(1) on average.
Or in more strict terms:
There is a constant c, such that for every sequence of operations (also one ending with a costly operation) of length L, the time is not greater than c*L. (Thanks Rafał Dowgird)

To develop an intuitive way of thinking about it, consider insertion of elements into a dynamic array (for example std::vector in C++). Let's plot a graph that shows the number of operations (Y) needed to insert N elements into the array:
The vertical parts of the black graph correspond to reallocations of memory in order to expand the array. Here we can see that this dependency can be roughly represented as a line, with equation Y = C*N + b (C is a constant, b = 0 in our case). Therefore we can say that we need to spend C*N operations on average to add N elements to the array, or C*1 operations to add one element (amortized constant time).
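If you want to see that the total work really is a straight line Y = C*N, you can count element copies instead of timing them; a rough sketch that simulates a doubling array (the exact constant C depends on the growth factor):
#include <cstddef>
#include <cstdio>

int main() {
    long long copies = 0;                      // element copies caused by reallocations
    std::size_t capacity = 0, size = 0;
    const std::size_t N = 1000000;
    for (std::size_t i = 0; i < N; ++i) {
        if (size == capacity) {                // full: simulate a doubling reallocation
            copies += size;                    // all existing elements get copied over
            capacity = capacity ? capacity * 2 : 1;
        }
        ++size;                                // the insertion itself: 1 operation
    }
    long long total = copies + (long long)N;   // Y for X = N
    std::printf("total ops ~ %lld, ops per insert ~ %.3f\n",
                total, (double)total / N);
}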

I found the Wikipedia explanation below useful, after reading it three times:
Source: https://en.wikipedia.org/wiki/Amortized_analysis#Dynamic_Array
"Dynamic Array
Amortized Analysis of the Push operation for a Dynamic Array
Consider a dynamic array that grows in size as more elements are added to it
such as an ArrayList in Java. If we started out with a dynamic array
of size 4, it would take constant time to push four elements onto it.
Yet pushing a fifth element onto that array would take longer as the
array would have to create a new array of double the current size (8),
copy the old elements onto the new array, and then add the new
element. The next three push operations would similarly take constant
time, and then the subsequent addition would require another slow
doubling of the array size.
In general if we consider an arbitrary number of pushes n to an array
of size n, we notice that push operations take constant time except
for the last one, which takes O(n) time to perform the size-doubling
operation. Since there were n operations in total, we can take the average
and find that pushing an element onto the dynamic array
takes: O(n)/n = O(1), constant time."
To my understanding, as a simple story:
Assume you have a lot of money.
And, you want to stack it up in a room.
And, you have long hands and legs, as long as you need now or in the future.
And, you have to fill it all in one room, so it is easy to lock.
So, you go right to the end/corner of the room and start stacking.
As you stack, the room will slowly run out of space.
However, as you fill it, it is easy to stack. Got the money, put the money. Easy. It's O(1). We don't need to move any previous money.
Once the room runs out of space, we need another room, which is bigger.
Here there is a problem: since we can have only 1 room, we can have only 1 lock, so we need to move all the existing money in that room into the new, bigger room.
So, move all the money from the small room to the bigger room. That is, stack all of it again.
So, we DO need to move all the previous money.
So, it is O(N). (Assuming N is the total count of the previous money.)
In other words, it was easy up to N: only 1 operation each. But when we needed to move to a bigger room, we did N operations. So, if we average it out, it is 1 insert at the beginning, plus 1 more move while moving to another room.
A total of 2 operations: one insert, one move.
Assuming N is large, like 1 million, even in the small room, the 2 operations compared to N (1 million) are not really comparable, so it is considered constant, or O(1).
Assume we do all of the above in the bigger room, and again need to move. It is still the same.
Say N2 (say, 1 billion) is the new count of money in the bigger room.
So, we have N2 (which includes the N from before, since we moved everything from the small room to the bigger room).
We still need only 2 operations: one insert into the bigger room, then one move operation to move to an even bigger room.
So, even for N2 (1 billion), it is 2 operations each, which is nothing again. So, it is constant, or O(1).
So, as N increases from N to N2 or beyond, it does not matter much. It is still constant, or O(1), operations required for each of the N.
Now assume you have N as 1, very small: the count of money is small, and you have a very small room, which will fit only 1 count of money.
As soon as you put the money in the room, the room is filled.
When you go to the bigger room, assume it can fit only one more money, for a total of 2 count of money: the previously moved money plus 1 more. And again it is filled.
This way, N grows slowly, and it is no longer constant O(1), since we are moving all the money from the previous room but can fit only 1 more money.
After 100 times, the new room fits the 100 count of money from before plus 1 more it can accommodate. This is O(N), since O(N+1) is O(N); that is, the degree of 100 or 101 is the same, both are hundreds, as opposed to the previous story of ones to millions and ones to billions.
So, this is an inefficient way of allocating rooms (or memory/RAM) for our money (variables).
So, a good way is allocating bigger rooms in powers of 2.
1st room size = fits 2 count of money
2nd room size = fits 4 count of money
3rd room size = fits 8 count of money
4th room size = fits 16 count of money
5th room size = fits 32 count of money
6th room size = fits 64 count of money
7th room size = fits 128 count of money
8th room size = fits 256 count of money
9th room size = fits 512 count of money
10th room size = fits 1,024 count of money
11th room size = fits 2,048 count of money
...
16th room size = fits 65,536 count of money
...
32nd room size = fits 4,294,967,296 count of money
...
64th room size = fits 18,446,744,073,709,551,616 count of money
Why is this better? Because it grows slowly at the beginning and faster later, in keeping with the amount of memory in our RAM.
This is helpful because, in the first case, although it is good that the total amount of work per money is fixed (2) and not comparable to the room size (N), the room that we took in the initial stages might be too big (1 million), and we may not use it fully if we never get that much money to save in the first place.
However, in the last case, powers of 2, it grows within the limits of our RAM. And so, increasing in powers of 2, the amortized analysis remains constant and it is friendly to the limited RAM that we have today.
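To put rough numbers on the story, here is a small sketch that counts how many "money moves" each room strategy costs for N items (grow by one slot at a time vs. doubling):
#include <cstdio>

int main() {
    const long long N = 1000000;        // total "count of money" to store

    // Strategy 1: each new room fits only one more item -> move everything each time.
    long long movesLinear = 0;
    for (long long roomSize = 1; roomSize < N; ++roomSize)
        movesLinear += roomSize;        // move all previous items into the new room
    // ~ N*(N-1)/2 moves in total, i.e. O(N) amortized per item

    // Strategy 2: each new room is twice as big -> move everything only at powers of 2.
    long long movesDoubling = 0;
    for (long long roomSize = 1; roomSize < N; roomSize *= 2)
        movesDoubling += roomSize;      // move all previous items into the new room
    // < 2*N moves in total, i.e. O(1) amortized per item

    std::printf("grow-by-one: %lld moves (%.1f per item)\n",
                movesLinear, (double)movesLinear / N);
    std::printf("doubling   : %lld moves (%.2f per item)\n",
                movesDoubling, (double)movesDoubling / N);
}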

I created this simple Python script to demonstrate the amortized complexity of the append operation in a Python list. We keep adding elements to the list and time every operation. During this process, we notice that some specific append operations take much longer. These spikes are due to a new memory allocation being performed. The important point to note is that as the number of append operations increases, the spikes become higher but their spacing increases. The increase in spacing is because a larger memory block (usually double the previous) is reserved every time the initial memory hits an overflow. Hope this helps, I can improve it further based on suggestions.
import matplotlib.pyplot as plt
import time
a = []
N = 1000000
totalTimeList = [0]*N
timeForThisIterationList = [0]*N
for i in range(1, N):
    startTime = time.time()
    a.append([0]*500) # every iteration, we append a value (which is a list, so that it takes more time)
    timeForThisIterationList[i] = time.time() - startTime
    totalTimeList[i] = totalTimeList[i-1] + timeForThisIterationList[i]
max_1 = max(totalTimeList)
max_2 = max(timeForThisIterationList)
plt.plot(totalTimeList, label='cumulative time')
plt.plot(timeForThisIterationList, label='time taken per append')
plt.legend()
plt.title('List-append time per operation showing amortised linear complexity')
plt.show()
This produces the following plot

The explanations above apply to Aggregate Analysis, the idea of taking "an average" over multiple operations.
I am not sure how they apply to the Banker's method or the Physicist's method of amortized analysis.
Now, I am not exactly sure of the correct answer.
But it would have to do with the principal condition of both the Physicist's and Banker's methods:
(Sum of amortized cost of operations) >= (Sum of actual cost of operations).
The chief difficulty that I face is that, given that the amortized asymptotic costs of operations differ from the normal asymptotic costs, I am not sure how to rate the significance of amortized costs.
That is, when somebody gives me an amortized cost, I know it's not the same as the normal asymptotic cost. What conclusions am I to draw from the amortized cost then?
Since we have the case of some operations being overcharged while other operations are undercharged, one hypothesis could be that quoting amortized costs of individual operations would be meaningless.
For example: for a Fibonacci heap, quoting the amortized cost of just Decrease-Key as O(1) is meaningless, since the costs are reduced by "work done by earlier operations in increasing the potential of the heap."
OR
We could have another hypothesis that reasons about the amortized-costs as follows:
I know that the expensive operation is going to be preceded by MULTIPLE LOW-COST operations.
For the sake of analysis, I am going to overcharge some low-cost operations, SUCH THAT THEIR ASYMPTOTIC COST DOES NOT CHANGE.
With these overcharged low-cost operations, I can PROVE THAT THE EXPENSIVE OPERATION HAS A SMALLER ASYMPTOTIC COST.
Thus I have improved/decreased the ASYMPTOTIC BOUND of the cost of n operations.
Thus amortized-cost analysis and amortized-cost bounds are now applicable only to the expensive operations. The cheap operations have the same asymptotic amortized cost as their normal asymptotic cost.

The performance of any function can be averaged by dividing the "total time taken for all those calls made" by the "total number of function calls". Even functions that take longer and longer for each call can still be averaged in this way.
So, the essence of a function that performs at Constant Amortized Time is that this "average time" reaches a ceiling that does not get exceeded as the number of calls continues to be increased. Any particular call may vary in performance, but over the long run this average time won't keep growing bigger and bigger.
This is the essential merit of something that performs at Constant Amortized Time.

Amortized Running Time:
This refers to the calculation of the algorithmic complexity in terms of time or memory used per operation.
It's used when the operation is mostly fast, but on some occasions the operation of the algorithm is slow.
Thus a sequence of operations is studied to learn more about the amortized time.

Related

Role of std::unordered_set::reserve for container memory requirement?

Suppose I use a std::unordered_set<MyClass>, and suppose sizeof(MyClass) is large, i.e. much larger than sizeof(size_t) and sizeof(void*). I will add a large number numberOfElementsToBeAdded of elements to the unordered set.
From https://stackoverflow.com/a/25438497/4237 , I find:
"each memory allocation may be rounded up to a size convenient for the memory allocation library's management - e.g. the next power of two, which approaches 100% worst case inefficiency and 50% average, so let's add 50%, just for the list nodes as the buckets are likely powers of two given size_t and a pointer: 50% * size() * (sizeof(void*) + sizeof((M::value_type))"
This drives me to conclude that actual memory consumption will be between
1*numberOfElements*sizeof(MyClass) and (1+1)*numberOfElements*sizeof(MyClass), modulo some additional memory, which is of little interest here because it is of the order sizeof(size_t).
However, I know the number of elements to be added in advance, so I call:
std::unordered_set<MyClass> set;
set.reserve(numberOfElementsToBeAdded);
//Insert elements
Thinking about the parallel of calling std::vector::reserve, I would guess that this avoids the potential overhead, and would thus guarantee memory consumption to be approximately 1*numberOfElements*sizeof(MyClass) (modulo some additional memory, again of order sizeof(size_t)).
Could I rely on implementations of the standard library to guarantee that calling reserve like this will avoid the 0-100% (50% average) overhead mentioned in the above answer?
From this std::unordered_set::reserve reference
Sets the number of buckets to the number needed to accommodate at least count elements ...
There's some more, but the gist is that std::unordered_set::reserve doesn't actually work like std::vector::reserve. For an unordered set it allocates buckets for the hash table, not memory for the actual elements themselves.
The set could put one element per bucket, but it's not guaranteed.
Quoting the standard:
Sets the number of buckets to the number needed to accommodate at least
count elements without exceeding maximum load factor and rehashes the
container, i.e. puts the elements into appropriate buckets considering
that total number of buckets has changed
The important part is "at least". I do not think you can rely on the implementation to guarantee that it uses less memory. My guess is that, regardless of whether you call reserve, the memory usage for numberOfElements elements will be similar.
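A small sketch to see what reserve actually changes here: the bucket count jumps up front (so no rehashing happens during the inserts), but each insert still performs its own node allocation; the exact numbers are implementation-dependent:
#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    std::printf("buckets before reserve: %zu\n", s.bucket_count());

    s.reserve(1000);   // enough buckets for 1000 elements at the current max_load_factor
    std::printf("buckets after reserve : %zu\n", s.bucket_count());

    for (int i = 0; i < 1000; ++i)
        s.insert(i);   // no rehash happens, but every insert still allocates a node

    std::printf("buckets after inserts : %zu, load factor %.2f\n",
                s.bucket_count(), s.load_factor());
}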

boost::flat_map and its performance compared to map and unordered_map

It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?
Thanks!
I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.
Benchmarking
On the web, we rarely find (if ever) a well-engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).
1) You need to consider cache warming
Most people running benchmarks are afraid of timer discrepancy, therefore they run their stuff thousands of times and take the whole time; they are just careful to take the same number of thousands for every operation, and then consider that comparable.
The truth is, in the real world it makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time things by calling them once only.
Intel has published a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).
2) RDTSC accuracy measure
I also recommend doing this:
#include <algorithm>   // std::min, std::max
#include <climits>     // LLONG_MAX
#include <cstdio>      // printf
#ifdef _MSC_VER
#include <intrin.h>    // __cpuid, __rdtsc
#endif

typedef unsigned long long u64;

u64 g_correctionFactor; // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;
static u64 const errormeasure = ~((u64)0);
#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // flush OOO instruction pipeline
    return __rdtsc();
}
inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000); // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);
    // measure the measurer overhead with the measurer (crazy he..)
    u64 minDiff = LLONG_MAX;
    u64 maxDiff = 0; // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1); // make many takes, take the smallest that ever comes.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;
    printf("Correction factor %llu clocks\n", g_correctionFactor);
    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif
This is a discrepancy measurer, and it will take the minimum of all measured values, to avoid occasionally getting a value around -10^18 (the first negative values of a 64-bit counter).
Notice the use of intrinsics and not inline assembly. Inline assembly is rarely supported by compilers nowadays, but, much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem for benchmarking real-world code, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
3) parameters
The last problem is that people usually test too few variations of the scenario.
A container's performance is affected by:
1. the allocator
2. the size of the contained type
3. the cost of the copy, assignment, move, and construction operations of the contained type
4. the number of elements in the container (the size of the problem)
5. whether the type has trivial point-3 operations
6. whether the type is POD
Point 1 is important because containers do allocate from time to time, and it matters a lot if they allocate using the CRT "new" or some user-defined operation, like pool allocation or freelist or other...
(for people interested about pt 1, join the mystery thread on gamedev about system allocator performance impact)
Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.
Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.
Point 4 is a question of big O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like map vs. vector, because the vector's cache locality is good, while map fragments the memory). And then at some crossing point they will lose, because the contained overall size starts to "leak" into main memory and cause cache misses, plus the fact that the asymptotic complexity can start to be felt.
Point 5 is about compilers being able to elide things that are empty or trivial at compile time. This can greatly optimize some operations because the containers are templated, therefore each type will have its own performance profile.
Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers can have a specific implementation for these cases, using partial template specializations, or SFINAE to select algorithms according to traits of T.
About the flat map
Apparently, the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
This is still an ordered container. Most people usually don't need the ordering part, therefore the existence of unordered...
Have you considered that maybe you need a flat_unorderedmap? Which would be something like google::sparse_map or something like that: an open-address hash map.
The problem of open address hash maps is that at the time of rehash they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage of course is that the memory is fragmented like hell.
The criterion of a rehash in an open address hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.
A typical load factor is 0.8; therefore, you need to care about that. If you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this will give you a guarantee of never having to spuriously rehash and recopy everything during filling.
The advantage of closed address maps (std::unordered..) is that you don't have to care about those parameters.
But the boost::flat_map is an ordered vector; therefore, it will always have a log(N) asymptotic complexity, which is less good than the open address hash map (amortized constant time). You should consider that as well.
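For reference, a minimal boost::container::flat_map usage sketch (assuming Boost.Container is available); since it is vector-backed, reserving up front avoids the reallocate-and-move cost while filling, and lookups are binary searches over contiguous memory:
#include <cstdio>
#include <boost/container/flat_map.hpp>

int main() {
    boost::container::flat_map<int, double> m;
    m.reserve(1000);                 // vector-backed: one allocation up front

    for (int i = 0; i < 1000; ++i)
        m[i] = i * 0.5;              // keys arrive in order, so they land at the back

    auto it = m.find(500);           // binary search over contiguous storage
    if (it != m.end())
        std::printf("found %d -> %.1f\n", it->first, it->second);
}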
Benchmark results
This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.
tested types information:
typeid=__int64 . sizeof=8 . ispod=yes
typeid=struct MediumTypePod . sizeof=184 . ispod=yes
Insertion
EDIT:
My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results later down this page because they are interesting.
This is the correct test:
I have checked the implementation: there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly; therefore this benchmark exhibits the asymptotic tendencies:
map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)
Warning: hereafter the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):
We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near the absolute optimum for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector. That gives us 3 million cycles; we observe 4.8M here for the ordered insertion into the flat_map (therefore 160% of the optimum).
Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.
Random search of 3 elements (clocks renormalized to 1): charts for container sizes 100 and 10000.
Iteration over container sizes 100 and 10000 (MediumPod type only): charts.
Final grain of salt
In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment I have been doing on the performance of an open-address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.
From the docs it seems this is analogous to Loki::AssocVector which I'm a fairly heavy user of. Since it's based on a vector it has the characteristics of a vector, that is to say:
Iterators get invalidated whenever the size grows beyond the capacity.
When it grows beyond capacity it needs to reallocate and move objects over, i.e. insertion isn't guaranteed constant time, except for the special case of inserting at the end when capacity > size.
Lookup is faster than std::map due to cache locality; it is a binary search, which otherwise has the same performance characteristics as std::map.
Uses less memory because it isn't a linked binary tree.
It never shrinks unless you forcibly tell it to (since that triggers reallocation).
The best use is when you know the number of elements in advance (so you can reserve upfront), or when insertion/removal is rare but lookup is frequent. Iterator invalidation makes it a bit cumbersome in some use cases, so they're not interchangeable in terms of program correctness.

Time complexity and growth strategy of dynamic arrays

I was reading in More Exceptional C++ by Herb Sutter about growth strategy to be chosen when implementing a string. He lists the following:
1) Exact growth. In this strategy, the new buffer is made exactly as large as required by the current operation
Advantage: no wasted space.
Disadvantage: poor performance. This strategy requires O(N) allocations and an average of O(N) copy operations per char, but with a high constant factor in the worst case...
2) Fixed-increment growth. The new buffer should be a fixed amount larger than the current buffer
Advantage: little wasted space. The amount of unused space in the buffer is bounded by the increment size, and does not vary with the length of the string.
Disadvantage: moderate performance. This strategy requires both O(N) allocations and an average of O(N) copy operations per char. That is, both the number of allocations and the average number of times a given char is copied vary linearly with the length of the string.
However, control of the constant factor is in the hands of the String implementer.
Note: characters are added to the string one by one.
Question 1: How is the constant factor controlled in both? I didn't understand Herb's point here.
Question 2: How can the fixed-increment strategy be O(N)? Wouldn't it depend on what the fixed size used is? If, say, it's 100 chars, then after the first resize the next 99 insertions would be O(1), so why is O(N) considered?
The string implementer gets to choose how big the fixed-size increment f is. So he has control over the constant factor in 2), but not in 1). Note that there is no claim of control over the constant factor in 1).
The cost per char would be O(N/f). I believe what Herb means is that f is fixed by the implementation, and hence is essentially a constant factor in the big-oh notation (i.e. it's dropped). However, the larger the f the smaller the big-oh constant factor is, and hence better performance (at the cost of more wasted space). So the implementer has to weigh those two factors when choosing f.
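A quick counting sketch for question 2: with a fixed increment f, building a string of length N one char at a time triggers about N/f reallocations, and the k-th reallocation copies about k*f chars, so the total is roughly N^2/(2f) copies, i.e. O(N) copies per char for any fixed f (f only changes the constant):
#include <cstdio>

int main() {
    const long long N = 100000;                  // final string length, appended one char at a time
    const long long increments[] = {1, 100, 4096};

    for (long long f : increments) {             // fixed-increment growth with step f
        long long copies = 0, capacity = 0;
        for (long long size = 0; size < N; ++size) {
            if (size == capacity) {              // buffer full: grow by the fixed increment
                copies += size;                  // copy all existing chars into the new buffer
                capacity += f;
            }
        }
        std::printf("f = %5lld: %12lld copies total, %8.1f copies per char\n",
                    f, copies, (double)copies / N);
    }
}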

Amortized analysis of std::vector insertion

How do we do the analysis of insertion at the back (push_back) in a std::vector? Its amortized time is O(1) per insertion. In particular, in a video on Channel 9 by Stephan T. Lavavej, and in this (17:42 onwards), he says that for optimal performance Microsoft's implementation of this method increases the capacity of the vector by a factor of around 1.5.
How is this constant determined?
Assuming you mean push_back and not insertion, I believe that the important part is multiplying by some constant (as opposed to grabbing N more elements each time), and as long as you do this you'll get amortized constant time. Changing the factor changes the average-case and worst-case performance.
Concretely:
If your constant factor is too large, you'll have good average-case performance, but bad worst-case performance, especially as the arrays get big. For instance, imagine doubling (2x) a 10000-size vector just because you have the 10001st element pushed. EDIT: As Michael Burr indirectly pointed out, the real cost here is probably that you'll grow your memory much larger than you need it to be. I would add to this that there are cache issues that affect speed if your factor is too large. Suffice it to say that there are real costs (memory and computation) if you grow much larger than you need.
However if your constant factor is too small, say (1.1x), then you're going to have good worst-case performance, but bad average performance, because you're going to have to incur the cost of reallocating too many times.
Also, see Jon Skeet's answer to a similar question previously. (Thanks @Bo Persson)
A little more about the analysis: Say you have n items you are pushing back and a multiplication factor of M. Then the number of reallocations will be roughly log base M of n (log_M(n)). And the ith reallocation will cost proportional to M^i (M to the ith power). Then the total time of all the pushbacks will be M^1 + M^2 + ... M^(log_M(n)). The number of pushbacks is n, and thus you get this series (which is a geometric series, and reduces to roughly (nM)/(M-1) in the limit) divided by n. This is roughly a constant, M/(M-1).
For large values of M you will overshoot a lot and allocate much more than you need reasonably often (which I mentioned above). For small values of M (close to 1) this constant M/(M-1) becomes large. This factor directly affects the average time.
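You can check that M/(M-1) figure numerically with a rough sketch: count the element copies done by all reallocations for a given growth factor M and divide by the number of push_backs (the measured average oscillates below the M/(M-1) bound, depending on where N happens to fall relative to the last reallocation):
#include <cstdio>

int main() {
    const double factors[] = {1.1, 1.5, 2.0, 3.0};
    const long long N = 10000000;                 // number of push_backs

    for (double M : factors) {
        long long copies = 0;
        long long capacity = 1;
        for (long long size = 0; size < N; ++size) {
            if (size == capacity) {               // buffer full: grow by factor M
                copies += size;                   // every existing element is moved
                capacity = (long long)(capacity * M) + 1;
            }
        }
        std::printf("M = %.1f: %.2f copies per push (bound M/(M-1) = %.2f)\n",
                    M, (double)copies / N, M / (M - 1));
    }
}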
You can do the math to try to figure how this kind of thing works.
A popular method for working with amortized analysis is the banker's method. What you do is mark up all your operations with an extra cost, "saving" it up to pay for an expensive operation later on.
Let's make some dumb assumptions to simplify the math:
Writing into an array costs 1. (Same for inserting and moving between arrays)
Allocating a larger array is free.
And our algorithm looks like:
function insert(x){
    if n_elements >= maximum array size:
        move all elements to a new array that
        is K times larger than the current size
    add x to array
    n_elements += 1
}
Obviously, the "worst case" happens when we have to move the elements to the new array. Let's try to amortize this by adding a constant markup of d to the insertion cost, bringing it to a total of (1 + d) per operation.
Just after an array has been resized, we have (1/K) of it filled up and no money saved.
By the time we fill the array up, we can be sure to have at least d * (1 - 1/K) * N saved up. Since this money must be able to pay for all N elements being moved, we can figure out a relation between K and d:
d*(1 - 1/K)*N = N
d*(K-1)/K = 1
d = K/(K-1)
A helpful table:
K      d      1+d (total insertion cost)
1.0    inf    inf
1.1    11.0   12.0
1.5    3.0    4.0
2.0    2.0    3.0
3.0    1.5    2.5
4.0    1.3    2.3
inf    1.0    2.0
So from this you can get a rough mathematician's idea of how the time/memory tradeoff works for this problem. There are some caveats, of course: I didn't go over shrinking the array when it gets fewer elements, this only covers the worst case where no elements are ever removed, and the time costs of allocating extra memory weren't accounted for.
They most likely ran a bunch of experimental tests to figure this out in the end, making most of what I wrote irrelevant, though.
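As a sanity check of the table, here is a sketch that simulates the banker's scheme under the same toy assumptions (moves cost 1, allocation is free): charge (1 + d) per insert with d = K/(K-1) and verify the saved-up balance never goes negative when each resize is paid for:
#include <cstdio>

int main() {
    const double K = 2.0;                  // growth factor
    const double d = K / (K - 1.0);        // markup per insert, from d = K/(K-1)

    double bank = 0.0;                     // saved-up credit
    double minBank = 0.0;                  // lowest balance seen
    long long capacity = 1, size = 0;

    for (long long i = 0; i < 1000000; ++i) {
        if (size == capacity) {            // resize: pay 1 per element moved
            bank -= (double)size;
            if (bank < minBank) minBank = bank;
            capacity = (long long)(capacity * K);
        }
        bank += d;                         // each insert deposits its markup d
        ++size;
    }
    std::printf("lowest bank balance: %.1f (never negative, so d = %.1f suffices for K = %.1f)\n",
                minBank, d, K);
}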
Uhm, the analysis is really simple when you're familiar with number systems, such as our usual decimal one.
For simplicity, then, assume that each time the current capacity is reached, a new 10x as large buffer is allocated.
If the original buffer has size 1, then the first reallocation copies 1 element, the second (where now the buffer has size 10) copies 10 elements, and so on. So with five reallocations, say, you have 1+10+100+1000+10000 = 11111 element copies performed. Multiply that by 9, and you get 99999; now add 1 and you have 100000 = 10^5. Or in other words, doing that backwards, the number of element copies performed to support those 5 reallocations has been (10^5-1)/9.
And the buffer size after 5 reallocations, 5 multiplications by 10, is 10^5. Which is roughly a factor of 9 larger than the number of element copy operations. Which means that the time spent on copying is roughly linear in the resulting buffer size.
With base 2 instead of 10 you get (2^5-1)/1 = 2^5-1.
And so on for other bases (or factors to increase buffer size by).
Cheers & hth.

Efficient implementation of binary heaps

I'm looking for information on how to implement binary heaps efficiently. I feel like there should be a nice article somewhere about implementing heaps efficiently, but I haven't found one. In fact I've been unable to find any resources on the matter of efficient implementation beyond the basics like storing the heap in an array. I'm looking for techniques for making a fast binary heap beyond the ones I describe below.
I've already written a C++ implementation that is faster than Microsoft Visual C++'s and GCC's std::priority_queue or using std::make_heap, std::push_heap and std::pop_heap. The following are the techniques I've already got covered in my implementation. I only came up with the last 2 myself, though I doubt that those are new ideas:
(Edit: added section on memory optimization)
Start indexes at 1
Look at the Wikipedia implementation notes for binary heaps. If the heap's root is placed at index 0, then the formulas for parent, left-child and right-child of the node at index n are respectively (n-1)/2, 2n+1 and 2n+2. If you use a 1-based array then the formulas become the simpler n/2, 2n and 2n + 1. So parent and left-child are more efficient when using a 1-based array. If p points to a 0-based array and q = p - 1 then we can access p[0] as q[1] so there is no overhead in using a 1-based array.
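A small sketch of the 1-based layout in its simplest form: keep an unused slot at index 0 (which, as noted at the end of this page, is handy as a spare anyway), so the parent/child arithmetic stays the simple n/2, 2n, 2n+1:
#include <cstdio>
#include <utility>
#include <vector>

// 1-based max-heap indexing over a buffer whose slot 0 is left unused.
inline std::size_t parent(std::size_t n) { return n / 2; }
inline std::size_t left(std::size_t n)   { return 2 * n; }
inline std::size_t right(std::size_t n)  { return 2 * n + 1; }

int main() {
    std::vector<int> heap(1);               // index 0 is the spare slot; elements start at [1]
    const int values[] = {5, 9, 3, 7, 1};
    for (int v : values) {                  // push with the usual sift-up
        heap.push_back(v);
        std::size_t i = heap.size() - 1;
        while (i > 1 && heap[parent(i)] < heap[i]) {
            std::swap(heap[parent(i)], heap[i]);
            i = parent(i);
        }
    }
    std::printf("max = %d, left child of max = %d\n", heap[1], heap[left(1)]);  // 9 and 7
}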
Make pop/removal move element to bottom of heap before replacement with leaf
Pop on a heap is frequently described by replacing the top element by the left-most bottom leaf and then moving it down until the heap property is restored. This requires 2 comparisons per level that we go by, and we are likely to go far down the heap since we moved a leaf to the top of the heap. So we should expect a little less than 2 log n comparisons.
Instead, we can leave a hole in the heap where the top element was. Then we move that hole down the heap by iteratively moving the larger child up. This requires only 1 comparison per level that we go past. In this way the hole will become a leaf. At this point we can move the right-most bottom leaf into the position of the hole and move that value up until the heap property is restored. Since the value we moved was a leaf we don't expect it to move very far up the tree. So we should expect a little more than log n comparisons, which is better than before.
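A rough sketch of that pop strategy on a 1-based max-heap stored in a vector (heap[0] unused, as above): the hole walks down taking the larger child with one comparison per level, then the right-most bottom leaf drops into the hole and is sifted up:
#include <cstdio>
#include <vector>

// Pop the maximum of a 1-based max-heap (heap[0] is an unused spare slot).
// Precondition: the heap holds at least one element.
int popMax(std::vector<int>& heap) {
    int top = heap[1];
    std::size_t n = heap.size() - 1;          // elements live in heap[1..n]
    std::size_t hole = 1;                     // removing the root leaves a hole
    while (2 * hole <= n) {                   // walk the hole down to a leaf:
        std::size_t child = 2 * hole;         // 1 comparison per level to pick
        if (child + 1 <= n && heap[child + 1] > heap[child])
            ++child;                          // the larger child, which moves up
        heap[hole] = heap[child];
        hole = child;
    }
    if (hole != n) {                          // fill the hole with the right-most
        int value = heap[n];                  // bottom leaf and sift it up; being
        while (hole > 1 && heap[hole / 2] < value) {  // a leaf, it rarely climbs far
            heap[hole] = heap[hole / 2];
            hole /= 2;
        }
        heap[hole] = value;
    }
    heap.pop_back();
    return top;
}

int main() {
    std::vector<int> heap = {0, 9, 7, 3, 5, 1};   // slot 0 unused; a valid max-heap
    int popped = popMax(heap);
    std::printf("popped %d, new max %d\n", popped, heap[1]);  // 9, then 7
}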
Support replace-top
Suppose you want to remove the max element and also insert a new element. Then you can do either of the removal/pop implementations described above, but instead of moving the right-most bottom leaf, you use the new value you wish to insert/push. (When most operations are of this kind I've found that a tournament tree is better than a heap, but otherwise the heap is slightly better.)
Make sizeof(T) a power of 2
The parent, left-child and right-child formulas work on indexes and they cannot be made to work directly on pointer values. So we are going to be working with indexes and that implies looking up values p[i] in an array p from an index i. If p is a T* and i is an integer, then
&(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i
and the compiler has to perform this computation to get p[i]. sizeof(T) is a compile-time constant, and the multiplication can be done more efficiently if sizeof(T) is a power of two. My implementation got faster by adding 8 padding bytes to increase sizeof(T) from 24 to 32. Reduced efficiency of the cache probably means that this is not a win for sufficiently large data sets.
Pre-multiply indexes
This was a 23% performance increase on my data set. The only thing we ever do with an index other than finding parent, left-child and right-child is to look the index up in an array. So if we keep track of j = sizeof(T) * i instead of an index i, then we could do a lookup p[i] without the multiplication that is otherwise implicit in evaluating p[i] because
&(p[i]) == reinterpret_cast<char*>(p) + sizeof(T) * i == reinterpret_cast<char*>(p) + j
Then the left-child and right-child formulas for j-values become respectively 2*j and 2*j + sizeof(T). The parent formula is a little more tricky, and I haven't found a way to do it other than converting the j-value to an i-value and back like so:
parentOnJ(j) = parent(j/sizeof(T))*sizeof(T) == (j/(2*sizeof(T)))*sizeof(T)
If sizeof(T) is a power of 2 then this will compile to 2 shifts. That is 1 operation more than the usual parent using indexes i. However we then save 1 operation on lookup. So the net effect is that finding the parent takes the same amount of time this way, while lookup of left-child and right-child becomes faster.
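A sketch of the j-value arithmetic with a made-up 32-byte Node type (the struct and its padding are just for illustration; the point is that element access becomes a plain byte offset, with no multiply):
#include <cstddef>
#include <cstdio>

struct Node { long long key; char pad[24]; };  // sizeof(Node) == 32, a power of two

// Pre-multiplied indexes: j = i * sizeof(Node), so p[i] becomes a byte offset.
inline Node* atJ(Node* p, std::size_t j) {
    return reinterpret_cast<Node*>(reinterpret_cast<char*>(p) + j);
}
inline std::size_t leftJ(std::size_t j)   { return 2 * j; }
inline std::size_t rightJ(std::size_t j)  { return 2 * j + sizeof(Node); }
inline std::size_t parentJ(std::size_t j) { return (j / (2 * sizeof(Node))) * sizeof(Node); }

int main() {
    Node heap[8] = {};                     // 1-based heap: heap[0] unused
    std::size_t j = 3 * sizeof(Node);      // j-value of index i = 3
    atJ(heap, j)->key = 42;
    std::printf("i=3 key=%lld, parent j=%zu (i=1), left j=%zu (i=6), right j=%zu (i=7)\n",
                heap[3].key, parentJ(j), leftJ(j), rightJ(j));
}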
Memory optimization
The answers of TokenMacGuy and templatetypedef point out memory based optimizations that reduce cache misses. For very large data sets or infrequently used priority queues, parts of the queue can be swapped out to disk by the OS. In that case it is worth it to add a lot of overhead to make optimal use of the cache because swapping in from disk is very slow. My data easily fits in memory and is in continuous use, so no part of the queue will likely be swapped to disk. I suspect that this is the case for most uses of priority queues.
There are other priority queues that are designed to make better use of the CPU cache. For example a 4-heap should have fewer cache misses and the amount of extra overhead is not that much. LaMarca and Ladner report in 1996 that they get 75% performance improvement from going to aligned 4-heaps. However, Hendriks reports in 2010 that:
The improvements to the implicit heap suggested by LaMarca and Ladner [17] to improve data locality and reduce cache misses were also tested. We implemented a four-way heap, that indeed shows a slightly better consistency than the two-way heap for very skewed input data, but only for very large queue sizes. Very large queue sizes are better handled by the hierarchical heap.
Question
Are there more techniques than these?
An interesting paper/article on this topic considers the behavior of caching/paging on the overall layout of the heap; the idea being that it's vastly more costly to pay for a cache miss or a page-in than for nearly any other part of a data structure's implementation. The paper discusses a heap layout that addresses this.
You're Doing It Wrong
by Poul-Henning Kamp
As an elaboration on @TokenMacGuy's post, you might want to look into cache-oblivious data structures. The idea is to build data structures that, for arbitrary caching systems, minimize the number of cache misses. They're tricky, but they actually might be useful from your perspective since they perform well even when dealing with multi-layer cache systems (for example, registers / L1 / L2 / VM).
There's actually a paper detailing an optimal cache-oblivious priority queue that might be of interest. This data structure would have all sorts of advantages in terms of speed, since it would try to minimize the number of cache misses at every level.
On the first point: even having a "spare spot" for your array based implementation isn't a waste. Many operations need a temporary element anyway. Rather than initializing a new element each time, having a dedicated element at index [0] is handy.