Find hit/miss in cache - C++

I'm struggling with my homework. It asks me to read a trace file where each line has a reference type and an address in hex. For example, the first line in the file has address 0x4ef1200231, with type instruction. It also asks me to check whether each address is a hit or a miss in the cache (in both L1 and L2). I'm not quite sure how to write the C++ (I'm very new) to check whether it is a hit or a miss.
I'm picturing a function, say address(long int), so that if I call address(0x4ef1200231) the console can tell me whether this address is a hit or a miss in L1, and if it is a miss, another function checks this address in L2. Is this too naive? Please help. Thanks.
---few lines in the trace---
4ef1200231 Int
2ff1e0122234 WR
82039ef9a3 R
Comment: Int means instruction, WR means data write, R means data read. The question is: after reading the whole trace file, how many hits and misses are there in total? Thanks.

This question is probably too advanced for a C++ beginner, but here's some explanation of how to implement a solution....
First, you need a container that mimics the logic used by each level of cache: the simplest (and likely adequate) such container is a Least Recently Used (LRU) data structure. It records a fixed maximum number of in-cache elements; when an element is accessed, it is searched for in the list. If it's found, it's moved to the top/front of the list, shifting the elements it passed down by one to fill the gap it left behind. If it's not in the list, it's also added at the top, with all other elements shifted down to make room, and the last element removed if the list is at its maximum size. To implement an LRU efficiently, you need to find elements quickly by value while also inserting and removing elements quickly mid-list. That is best done with a combination of an unordered_map and a list, but implementing that alone is more than you can reasonably be expected to do as a C++ beginner. You could start by using only a list - the searching will be slow (O(n), i.e. linear / brute force), but you can get it working functionally.
Given such an LRU class, you can set the sizes of two instances to represent the number of pages held in the L1 and L2 caches. Then, for each address in the input file, you seek that address's page (say, for 4 KiB pages, you could divide the address by 4096, bitwise-AND it with the negation of 4095, bitwise-OR it with 4095, or shift it right by 12 bits - each of these reduces every address in a page to one value identifying that page) in L1, falling back on L2 if necessary. The "is it already in the cache" code can keep the hit/miss counters, as sketched below.
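For concreteness, here is a hedged sketch of those page computations (assuming 4 KiB pages; each expression maps every address within a page to a single value that uniquely labels that page):

size_t page_of(size_t address) // page number, assuming 4 KiB pages
{
    return address >> 12; // same as address / 4096
}
// Alternatives that also uniquely label the page:
//   address / 4096          - the page number
//   address & ~size_t(4095) - the address of the page's first byte
//   address | 4095          - the address of the page's last byte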
Here's some example code to get you started:
#include <algorithm>
#include <cstddef>
#include <list>

template <typename T>
class Dumb_LRU
{
  public:
    Dumb_LRU(size_t max_size) : n_(max_size) { }

    // Returns true on a hit, false on a miss (and records the access).
    bool operator()(const T& t)
    {
        typename std::list<T>::iterator i = std::find(l_.begin(), l_.end(), t);
        if (i == l_.end()) // miss: insert at the front, evict from the back
        {
            l_.push_front(t);
            if (l_.size() > n_)
                l_.pop_back();
            return false;
        }
        if (i != l_.begin()) // hit: move to the front if not already there
        {
            l_.erase(i);
            l_.push_front(t);
        }
        return true;
    }

  private:
    std::list<T> l_;
    size_t n_;
};
You can then do your simulation like this:
#include <iostream>
#include <string>

static const size_t l1_cache_pages = 256;
static const size_t l2_cache_pages = 2048;
static const size_t page_size = 4096;

int main()
{
    Dumb_LRU<size_t> l1(l1_cache_pages);
    Dumb_LRU<size_t> l2(l2_cache_pages);

    size_t address;
    std::string doing_what;
    int l1_hits = 0, l1_misses = 0, l2_hits = 0, l2_misses = 0;

    // The trace addresses are hex without a 0x prefix, hence std::hex.
    while (std::cin >> std::hex >> address >> doing_what)
    {
        if (l1(address / page_size))
            ++l1_hits;
        else
        {
            ++l1_misses;
            if (l2(address / page_size))
                ++l2_hits;
            else
                ++l2_misses;
        }
    }

    // print out hits/misses
    std::cout << "L1: " << l1_hits << " hits, " << l1_misses << " misses\n"
              << "L2: " << l2_hits << " hits, " << l2_misses << " misses\n";
}
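Since the homework reads from a trace file rather than standard input, here is a minimal sketch of the same loop with std::ifstream (the filename is an assumption):

#include <fstream>

std::ifstream trace("trace.txt"); // hypothetical filename
size_t address;
std::string type;
while (trace >> std::hex >> address >> type)
{
    // ...same hit/miss logic as in the loop above...
}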

Related

Is there a fast way to 'move' all vector values by 1 position?

I want to implement an algorithm that basically moves every value (besides the first one) one place to the left, as in the second element becomes the first element, and so on.
I have already implemented it like this:
for (int i = 0; i < vct.size() - 1; i++) {
    vct[i] = vct[i + 1];
}
which works, but I was just wondering if there is a faster, optionally shorter way to achieve the same result?
EDIT: I have made a mistake where I said that I wanted it to move to the right, where in reality I wanted it to go left, so sorry for the confusion and thanks to everyone for pointing that out. I also checked if the vector isn't empty beforehand, just didn't include it in the snippet.
As a comment (or more than one?) has pointed out, the obvious choice here would be to just use a std::deque.
Another possibility would be to use a circular buffer. In this case, you'll typically have an index (or pointer) to the first and last items in the collection. Removing an item from the beginning consists of incrementing that index/pointer (and wrapping it around to the beginning if you've reached the end of the buffer). That's quite fast, and constant time, regardless of collection size. The downside is that every time you add an item, remove an item, or look at an item, you need to do a tiny bit of extra math - but it's usually just one addition, so the overhead is pretty minimal. Circular buffers work well, but they have a fair number of corner cases, so getting them just right is often kind of a pain. Worse, many obvious implementations waste one slot's worth of storage for a data item (though that often doesn't matter a lot).
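Here is a minimal sketch of such a circular buffer (illustrative, not production code; it deliberately wastes one slot so that head == tail unambiguously means "empty", as mentioned above):

#include <cstddef>
#include <vector>

template <typename T>
class RingBuffer {
    std::vector<T> buf;
    std::size_t head = 0, tail = 0; // head = first item, tail = one past the last
public:
    explicit RingBuffer(std::size_t capacity) : buf(capacity + 1) {}
    bool empty() const { return head == tail; }
    bool full() const { return (tail + 1) % buf.size() == head; }
    void push_back(const T &t) { // caller must check full() first
        buf[tail] = t;
        tail = (tail + 1) % buf.size();
    }
    void pop_front() { head = (head + 1) % buf.size(); } // O(1) at any size
    T &front() { return buf[head]; }
    // the "tiny bit of extra math" on every access:
    T &operator[](std::size_t i) { return buf[(head + i) % buf.size()]; }
};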
A slightly simpler possibility would reduce overhead by some constant factor. To do this, use a little wrapper that keeps track of the first item in the collection, along with your vector of items. When you want to remove the first item, just increment the variable that keeps track of the first item. When you reach some preset limit, you erase the first N elements all at once. This reduces the time spent shifting items by a factor of N.
#include <cstddef>
#include <utility>
#include <vector>

template <class T>
class pseudo_queue {
    std::vector<T> data;
    std::size_t b = 0; // index of the current first item

    // adjust as you see fit:
    static const std::size_t max_slop = 20;

    void shift() {
        data.erase(data.begin(), data.begin() + b);
        b = 0; // the first item is at the front of the vector again
    }
public:
    void push_back(T &&t) { data.push_back(std::move(t)); }
    void pop_back() { data.pop_back(); }
    T &back() { return data.back(); }
    T &front() { return data[b]; }
    void pop_front() {
        if (++b > max_slop) shift();
    }
    typename std::vector<T>::iterator begin() { return data.begin() + b; }
    typename std::vector<T>::iterator end() { return data.end(); }
    T &operator[](std::size_t index) { return data[index + b]; }
};
If you want to get really tricky, you can change this a bit, so you compute max_slop as a percentage of the size of data in the collection. In this case, you can change the computational complexity involved, rather than just leaving it linear but with a larger constant factor than you currently have. But I have no idea how much (if at all) you care about that--it's only likely to matter much if you deal with a wide range of sizes.
Assuming you really meant moving the data to the right (i.e. your code has a bug), you have std::move_backward from <algorithm>, and its sibling std::move, to do that, though accessing data backwards may be inefficient:
std::move_backward(vct.begin(), vct.end() - 1, vct.end());
If you actually meant moving the data to the left, you can use std::move:
std::move(vct.begin() + 1, vct.end(), vct.begin());
You can also use std::copy and std::copy_backward instead if your object is trivially copyable; the syntax is exactly the same, and for trivially copyable types they will be at least as fast (a move of such a type is just a copy).
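A minimal runnable demo of the left shift with std::move (the values are illustrative):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5};
    std::move(v.begin() + 1, v.end(), v.begin()); // v is now 2 3 4 5 5
    v.pop_back();                                 // drop the stale last element, if desired
    for (int x : v) std::cout << x << ' ';        // prints: 2 3 4 5
}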
You can also just do a normal loop to the right, assuming vct.size() is bigger than 1:
int temp1 = vct[0]; // assume a vector of int
int temp2;
for (std::size_t i = 0; i < vct.size() - 1; i++) {
    temp2 = vct[i + 1];
    vct[i + 1] = temp1;
    temp1 = temp2;
}
And your version is the right approach if you are moving to the left.
Also, as noted in the comments, you should check that your vector is not empty if you are doing the loop version.

How to lock-free update index to maximum over a range?

The problem is best explained with some simple code.
struct foo
{
    static constexpr auto N = 8;
    double data[N]; // initialised at construction
    int max;        // index of maximum: data[max] is the largest value

    // if value < data[index]:
    // - update data[index] = value
    // - update max
    void update(int index, double value)
    {
        if (value >= data[index])
            return;
        data[index] = value;
        if (index == max) // max unaffected if index != max
            for (index = 0; index != N; ++index)
                if (data[index] > data[max])
                    max = index;
    }
};
Now, I want to make foo::update() thread safe, i.e. allow concurrent calls from different threads, where participating threads cannot call with the same index. One way is to add a mutex or simple spinlock (the contention can be presumed low) to foo:
#include <atomic>

struct foo
{
    static constexpr auto N = 8;
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    double data[N];
    int max;

    // index is unique to each thread
    // if value < data[index]:
    // - update data[index] = value
    // - update max
    void update(int index, double value)
    {
        if (value >= data[index])
            return;
        while (lock.test_and_set(std::memory_order_acquire)); // acquire spinlock
        data[index] = value;
        if (index == max)
            for (index = 0; index != N; ++index)
                if (data[index] > data[max])
                    max = index;
        lock.clear(std::memory_order_release); // release spinlock
    }
};
However, how can I implement foo::update() lock-free (you may consider data and max to be atomic)?
NOTE: this is a simpler version of the original post, without relation to the tree structure.
So, IIUC, the array only gets new values if they are lower than what is already there (and I won't worry about how the initial values got there), and if the current max is lowered, find a new max.
Some of this is not too hard.
But some is... harder.
So the "if value < data[index] then write data" needs to be in a CAS-loop. Something like:
auto oldval = data[index].load(std::memory_order_relaxed);
do
{
    if (value >= oldval) return; // only ever store lower values
} while (!data[index].compare_exchange_weak(oldval, value));
// (note that oldval is updated to data[index] each time comp-exch fails)
So now data[index] has the new lower value. Awesome. And relatively easy.
Now about max.
First question - Is it OK for max to ever be wrong? Because it may currently be wrong (in our scenario, where we update data[index] before handling max).
Can it be wrong in some ways, not others? ie let's say our data is just two entries:
double data[2] = { 3, 7 };
And we want to do update(1, 2) ie change the 7 to a 2. (And thus update max!)
Scenario A: set data first, then max:
data[1] = 2;
pause(); // ie scheduler pauses this thread
max = 0; // data[0]==3 is now max
If another thread comes in at pause(), then data[max] is wrong: 2 instead of 3 :-(
Scenario B: set max first:
max = 0; // it will be "shortly"?
pause();
data[1] = 2;
Now a thread could read data[max] as 3 while 7 is still in data. But 7 is going to become 2 "soon", so is that OK? Is it "less wrong" than scenario A? Depends on usage? (ie if the important thing is "which is max" we have that right. But if max was the only important thing, why store all the data at all?)
It seems odd to ask "is wrong OK", but in some lock-free situations it actually is a valid question. To me B has a chance to be OK for some uses, whereas A doesn't.
Also, and this is important:
data[max] is always wrong, even in the perfect algorithm
By this I mean you need to realize that data[max], as soon as you read it is already out of date - if you are living in a lockfree world. Because it may have changed as soon as you read it. (Also because data and max change independently. But even if you had a function getMaxValue() it would be out of date as soon as it returns.)
Is that OK? Because, if not, you obviously need a lock. But if it is OK, we can use it to our advantage - we might be able to return an answer which we know is somewhat incorrect / out-of-date, but no more incorrect than what you could tell from the outside.
If neither scenario is OK, then you must update max and data[index] at the same time. Which is hard, since together they don't fit into a single lock-free-sized chunk.
So instead you can add a layer of indirection:
struct DataAndMax { double data[N]; int max; };
DataAndMax * ptr;
Whenever you need to update max, you need to make a whole new DataAndMax struct (ie allocate a new one), somehow fill it all out nicely, and then atomically swap ptr to the new struct.
And if some other thread changed ptr while you were preparing the new data, then you would need to start over, since you need their new data in your data.
And if ptr has changed twice, then it may look like it hasn't changed, when it really has: let's say ptr currently has value 0xA000, and a 2nd thread allocates a new DataAndMax at 0xB000, sets ptr to 0xB000, and frees the old one at 0xA000. Now yet another thread (3rd) comes in, allocates yet another DataAndMax - and lo and behold, the allocator gives you back 0xA000 (why not, it was just freed!). So this 3rd thread sets ptr to 0xA000.
And this all happens while you are trying to set ptr to 0xC000. All you see is that ptr was 0xA000, and later still is 0xA000, so you think it (and its data) hasn't changed. Yet it has - it went from 0xA000 to 0xB000 (when you weren't looking) back to 0xA000 - the address is the same, but the data is different. This is called the ABA problem.
Now, if you knew the max number of threads, you could pre-allocate:
DataAndMax dataBufs[NUM_THREADS];
DataAndMax * ptr; // current DataAndMax
And then never allocate/delete and never have ABA problems. Or there's other ways to avoid ABA.
Let's go back, and think about how we are going to - no matter what - return a max value that is potentially out of date. Can we use that?
So you come in, and first check if the index you are about to write to is the important one or not:
if (index != max) {
    // we are not touching max,
    // so nothing fancy here!
    data[index] = value;
    return;
}
// else do it the hard way:
// ...
But this is already wrong. After the if and before the set, max may have changed. Does every set need to update max!?!?
So, if N is small, you could just linear search for max. It may be wrong if someone makes an update while searching, but remember - it could also be wrong if someone makes an update right after searching or right after "insert magic here". So searching, other than possibly being slow, is as correct as any algorithm. You will find something that was, for a moment, max.
If N == 8, I would use searching. Definitely.
You can search 8 entries using memory_order_relaxed possibly, and that will be faster than trying to maintain anything using stronger atomic ops.
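Putting those pieces together, here is a minimal sketch of that approach - the CAS-loop for the store plus a relaxed linear search - assuming data and max are atomic, as the question allows (illustrative only, with all the staleness caveats discussed above):

#include <atomic>

struct foo
{
    static constexpr int N = 8;
    std::atomic<double> data[N];
    std::atomic<int> max;

    void update(int index, double value)
    {
        double oldval = data[index].load(std::memory_order_relaxed);
        do
        {
            if (value >= oldval) return; // only ever store lower values
        } while (!data[index].compare_exchange_weak(oldval, value));

        if (index == max.load(std::memory_order_relaxed))
        {
            // linear re-scan; the result may already be stale when stored
            int best = 0;
            double bestval = data[0].load(std::memory_order_relaxed);
            for (int i = 1; i != N; ++i)
            {
                double v = data[i].load(std::memory_order_relaxed);
                if (v > bestval) { best = i; bestval = v; }
            }
            max.store(best, std::memory_order_relaxed);
        }
    }
};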
I have other ideas:
More bookkeeping? Store maxValue separately?
double data[N];
double maxValue;
int indexOfMax;

bool wasMax = false;
if (index == indexOfMax)
    wasMax = true;
data[index] = value;
if (wasMax || index == indexOfMax)
    findMax(&indexOfMax, &maxValue); // linear search
That would probably need a CAS-loop somewhere. Still linear search, but maybe less often?
Maybe you need extra data at each entry? Not sure yet.
Hmmmm.
This is not simple. Thus if there is a correct algorithm (and I think there is, within some constraints), any implementation of it is still likely to have bugs: a correct algorithm might actually exist, but what you find instead may be an algorithm that merely looks correct.

Sorted data structure for in-order iteration, ordered push, and removal (N elements only from top)

What is considered an optimal data structure for pushing something in order (inserts at any position, able to find the correct position), in-order iteration, and popping N elements off the top (the N smallest elements, with N determined by comparisons against a threshold value)? The push and pop need to be particularly fast (they run every iteration of a loop), while the full in-order iteration happens at a variable rate, likely an order of magnitude less often. The data can't be purged by the full iteration; it needs to remain unchanged. Everything that is pushed will eventually be popped, but since a pop can remove multiple elements there can be more pushes than pops. The amount of data in the structure at any one time could go up to hundreds or low thousands of elements.
I'm currently using a std::deque and binary search to insert elements in ascending order. Profiling shows it taking up the majority of the time, so something has got to change. std::priority_queue doesn't allow iteration, and hacks I've seen to do it won't iterate in order. Even on a limited test (no full iteration!), the std::set class performed worse than my std::deque approach.
None of the classes I'm messing with seem to be built with this use case in mind. I'm not averse to making my own class, if there's a data structure not to be found in STL or boost for some reason.
edit:
There are two major functions right now, push and prune. push uses 65% of the time, prune uses 32%. Most of the time used in push is due to insertion into the deque (64% out of the 65%). Only 1% comes from the binary search to find the position.
template<typename T, size_t Axes>
void Splitter<T, Axes>::SortedData::push(const Data& data) // 65% of processing
{
    size_t index = find(data.values[(axis * 2) + 1]);
    this->data.insert(this->data.begin() + index, data); // 64% of all processing happens here
}
template<typename T, size_t Axes>
void Splitter<T, Axes>::SortedData::prune(T value) // 32% of processing
{
    auto top = data.begin(), end = data.end(), it = top;
    for (; it != end; ++it)
    {
        Data& data = *it;
        if (data.values[(axis * 2) + 1] > value) break;
    }
    data.erase(top, it);
}
template<typename T, size_t Axes>
size_t Splitter<T, Axes>::SortedData::find(T value)
{
    size_t start = 0;
    size_t end = this->data.size();
    if (!end) return 0;
    size_t diff;
    while ((diff = (end - start) >> 1))
    {
        size_t mid = diff + start;
        if (this->data[mid].values[(axis * 2) + 1] <= value)
        {
            start = mid;
        }
        else
        {
            end = mid;
        }
    }
    return this->data[start].values[(axis * 2) + 1] <= value ? end : start;
}
With your requirements, a hybrid data structure tailored to your needs will probably perform best. As others have said, contiguous memory is very important, but I would not recommend keeping the array sorted at all times. I propose you use 3 buffers (1 std::array and 2 std::vectors):
1 (constant-size) Buffer for the "insertion heap". Needs to fit into the cache.
2 (variable-sized) Buffers (A+B) to maintain and update sorted arrays.
When you push an element, you add it to the insertion heap via std::push_heap. Since the insertion heap is constant size, it can overflow; when that happens, you std::sort it backwards and std::merge it with the already-sorted buffer (A) into the third buffer (B), resizing them as needed. B becomes the new sorted buffer, and the old one's contents can be discarded, i.e. you swap A and B for the next bulk operation. When you need the sorted sequence for iteration, you do the same. When you remove elements, you compare the top element in the heap with the last element in the sorted sequence and remove whichever is smaller (which is why you sort backwards: so that you can pop_back instead of pop_front).
For reference, this idea is loosely based on sequence heaps.
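Here is a minimal sketch of that scheme (the class name, interface, and heap capacity are assumptions; the sorted buffers are kept in descending order so the smallest element sits at the back, as described):

#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

template <typename T, std::size_t HeapCapacity = 128>
class SequenceHeap {
    std::vector<T> heap;    // min-heap of recent pushes (std::greater comparator)
    std::vector<T> sorted;  // descending; smallest element at the back
    std::vector<T> scratch; // merge target, swapped with `sorted` afterwards

    void flush() { // sort the heap backwards and merge it into `sorted`
        std::sort(heap.begin(), heap.end(), std::greater<T>());
        scratch.clear();
        scratch.reserve(sorted.size() + heap.size());
        std::merge(sorted.begin(), sorted.end(), heap.begin(), heap.end(),
                   std::back_inserter(scratch), std::greater<T>());
        sorted.swap(scratch);
        heap.clear();
    }

public:
    void push(const T &v) {
        if (heap.size() == HeapCapacity) flush();
        heap.push_back(v);
        std::push_heap(heap.begin(), heap.end(), std::greater<T>());
    }

    // Pop every element <= threshold (both buffers can hold such elements).
    void prune(const T &threshold) {
        while (!heap.empty() && heap.front() <= threshold) {
            std::pop_heap(heap.begin(), heap.end(), std::greater<T>());
            heap.pop_back();
        }
        while (!sorted.empty() && sorted.back() <= threshold)
            sorted.pop_back();
    }

    // Full in-order iteration: flush, then walk `sorted` in reverse for ascending order.
    const std::vector<T> &descending() { flush(); return sorted; }
};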
Have you tried messing around with std::vector? As weird as it may sound, it could actually be pretty fast because it uses contiguous memory. If I remember correctly, Bjarne Stroustrup was talking about this at GoingNative 2012 (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style, but I'm not 100% sure that it's in this video).
You save time with the binary search, but the insertion at random positions in the deque is slow. I would suggest a std::map instead.
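For instance, a sketch with std::multimap (since keys can repeat), keyed on the value used for ordering - push becomes an O(log n) insert with no element shifting, and prune becomes a range erase:

#include <iostream>
#include <map>

int main() {
    std::multimap<int, int> sorted; // key = sort coordinate, value = payload id
    sorted.insert({5, 1});
    sorted.insert({2, 2});
    sorted.insert({8, 3});
    sorted.erase(sorted.begin(), sorted.upper_bound(4)); // prune keys <= 4
    for (const auto &kv : sorted)
        std::cout << kv.first << ':' << kv.second << ' '; // prints: 5:1 8:3
}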
From your edit, it sounds like the delay is in copying - is it a complex object? Can you heap-allocate and store pointers in the structure so each entry is created only once? You'll need to provide a custom comparator that takes pointers, since the objects' operator<() wouldn't be called. (The custom comparator can simply call operator<().)
EDIT:
Your own figures show it's the insertion that takes the time, not the 'sorting'. While some of that insertion time is creating a copy of your object, some (possibly most) is creation of the internal structure that will hold your object - and I don't think that will change between list/map/set/queue etc. If you can predict the likely eventual/maximum size of your data set, can write or find your own sorting algorithm, and the time is being lost in allocating objects, then vector might be the way to go.

How to zero out array in O(1)?

Is there a way to zero out an array with time complexity O(1)? It's obvious that this can be done with a for-loop or memset, but their time complexity is not O(1).
Yes
However not any array. It takes an array that has been crafted for this to work.
#include <cstddef>
#include <stdexcept>
#include <utility>

template <typename T, size_t N>
class Array {
public:
    Array() : generation(0) {}

    void clear() {
        // FIXME: deal with overflow
        ++generation;
    }

    T get(std::size_t i) const {
        if (i >= N) { throw std::runtime_error("out of range"); }
        TimedT const& t = data[i];
        return t.second == generation ? t.first : T{};
    }

    void set(std::size_t i, T t) {
        if (i >= N) { throw std::runtime_error("out of range"); }
        data[i] = std::make_pair(t, generation);
    }

private:
    typedef std::pair<T, unsigned> TimedT;

    TimedT data[N];
    unsigned generation;
};
The principle is simple:
we define an epoch using the generation attribute
when an item is set, the epoch in which it has been set is recorded
only items of the current epoch can be seen
clearing is thus equivalent to incrementing the epoch counter
The method has two issues:
storage increase: for each item we store an epoch
generation counter overflow: there is something as a maximum number of epochs
The latter can be thwarted by using a really big integer (uint64_t, at the cost of more storage).
The former is a natural consequence; one possible solution is to use buckets to downplay the issue, by having for example up to 64 items associated with a single counter and a bitmask identifying which are valid within this bucket.
EDIT: just wanted to get back on the buckets idea.
The original solution has an overhead of 8 bytes (64 bits) per element (if already 8-byte aligned). Depending on the elements stored, it might or might not be a big deal.
If it is a big deal, the idea is to use buckets; of course like all trade-off it slows down access even more.
#include <cassert>
#include <cstddef>
#include <cstdint>

template <typename T>
class BucketArray {
public:
    BucketArray() : generation(0), mask(0) {}

    T get(std::size_t index, std::size_t gen) const {
        assert(index < 64);
        return gen == generation and (mask & (std::uint64_t(1) << index)) ?
            data[index] : T{};
    }

    void set(std::size_t index, T t, std::size_t gen) {
        assert(index < 64);
        if (generation < gen) { mask = 0; generation = gen; }
        mask |= std::uint64_t(1) << index; // 64-bit shift, as index can exceed 31
        data[index] = t;
    }

private:
    std::uint64_t generation;
    std::uint64_t mask;
    T data[64];
};
Note that this small array of a fixed number of elements (we could actually template this and statically check that it's less than or equal to 64) has only 16 bytes of overhead. This means we have an overhead of 2 bits per element.
template <typename T, size_t N>
class Array {
    typedef BucketArray<T> Bucket;
public:
    Array() : generation(0) {}

    void clear() { ++generation; }

    T get(std::size_t i) const {
        if (i >= N) { throw std::runtime_error("out of range"); }
        Bucket const& bucket = data[i / 64];
        return bucket.get(i % 64, generation);
    }

    void set(std::size_t i, T t) {
        if (i >= N) { throw std::runtime_error("out of range"); }
        Bucket& bucket = data[i / 64];
        bucket.set(i % 64, t, generation);
    }

private:
    std::uint64_t generation;
    Bucket data[N / 64 + 1];
};
We got the space overhead down by a factor of... 32. Now the array can even be used to store char, for example, whereas before it would have been prohibitive. The cost is that access got slower, since we pay a division and a modulo (when will we get a standardized operation that returns both results in one shot?).
You can't modify n locations in memory in less than O(n) time (even if your hardware, for sufficiently small n, may allow a constant-time operation to zero certain nicely-aligned blocks of memory, as for example flash memory does).
However, if the object of the exercise is a bit of lateral thinking, then you can write a class representing a "sparse" array. The general idea of a sparse array is that you keep a collection (perhaps a map, although depending on usage that might not be all there is to it), and when you look up an index, if it's not in the underlying collection then you return 0.
If you can clear the underlying collection in O(1), then you can zero out your sparse array in O(1). Clearing a std::map isn't usually constant-time in the size of the map, because all those nodes need to be freed. But you could design a collection that can be cleared in O(1) by moving the whole tree over from "the contents of my map", to "a tree of nodes that I have reserved for future use". The disadvantage would just be that this "reserved" space is still allocated, a bit like what happens when a vector gets smaller.
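A minimal sketch of the sparse-array lookup convention (note that unordered_map::clear() is not O(1), so this only illustrates the read-as-zero idea, not the constant-time clear):

#include <cstddef>
#include <unordered_map>

class SparseArray {
    std::unordered_map<std::size_t, int> m;
public:
    int get(std::size_t i) const {
        auto it = m.find(i);
        return it == m.end() ? 0 : it->second; // absent indices read as zero
    }
    void set(std::size_t i, int v) { m[i] = v; }
    void clear() { m.clear(); } // O(size), as discussed above
};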
It's certainly possible to zero out an array in O(1) as long as you accept a very large constant factor:
#include <cstddef>
#include <limits>

void zero_out_array_in_constant_time(void* a, size_t n)
{
    char* p = (char*) a;
    for (size_t i = 0; i < std::numeric_limits<size_t>::max(); ++i)
    {
        p[i % n] = 0;
    }
}
This will always take the same number of steps, regardless of the size of the array, hence it's O(1).
No.
You can't visit every member of an N-element collection in anything less than O(N) time.
You might, as Mike Kwan has observed, shift the cost from run- to compile-time, but that doesn't alter the computational complexity of the operation.
It's clearly not possible to initialize an arbitrarily sized array in a fixed length of time. However, it is entirely possible to create an array-like ADT which amortizes the cost of initializing the array across its use. The usual construction for this takes upwards of 3x the storage, however. To wit:
#include <cstddef>
#include <vector>

template <typename T, size_t arr_size>
class NoInitArray
{
    std::vector<T> storage;

    // Note that 'lookup' and 'check' are not initialized, and may contain
    // arbitrary garbage.
    size_t lookup[arr_size];
    size_t check[arr_size];

public:
    T& operator[](size_t pos)
    {
        // Find out where we actually stored the entry for (*this)[pos].
        // This could be garbage.
        size_t storage_loc = lookup[pos];

        // Check to see that the storage_loc we found is valid.
        if (storage_loc < storage.size() && check[storage_loc] == pos)
        {
            // Everything checks out; return the reference.
            return storage[storage_loc];
        }
        else
        {
            // Storage hasn't yet been allocated/initialized for (*this)[pos].
            // Allocate storage:
            storage_loc = storage.size();
            storage.push_back(T());

            // Put entries in lookup and check so we can find
            // the proper spot later:
            lookup[pos] = storage_loc;
            check[storage_loc] = pos;

            // Everything's set up; return the appropriate reference:
            return storage.back();
        }
    }
};
One could add a clear() member to empty the contents of such an array fairly easily if T is some type that doesn't require destruction, at least in concept.
I like Eli Bendersky's webpage http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time, with a solution that he attributes to the famous book Design and Analysis of Computer Algorithms by Aho, Hopcroft and Ullman. This is genuinely O(1) time complexity for initialization, rather than O(N). The space demands are O(N) additional storage, but allocating this space is also O(1), since the space is full of garbage. I enjoyed this for theoretical reasons, but I think it might also be of practical value for implementing some algorithms, if you need to repeatedly initialize a very large array, and each time you access only a relatively small number of positions in the array. Bendersky provides a C++ implementation of the algorithm.
A very pure theorist might start worrying about the fact that N needs O(log(N)) digits, but I have ignored that detail, which would presumably require looking carefully at the mathematical model of the computer. Volume 1 of The Art of Computer Programming probably gives Knuth's view of this issue.
It is impossible at runtime to zero out an array in O(1). This is intuitive given that there is no language mechanism which allows the value-setting of arbitrarily sized blocks of memory in fixed time. The closest you can do is:
int blah[100] = {0};
That allows the initialisation to happen at compile time. At runtime, memset is generally the fastest, but it is still O(N). However, there are problems associated with using memset on particular array types.
While still O(N), implementations that map to hardware-assisted operations like clearing whole cache lines or memory pages can run at <1 cycle per word.
Actually, riffing on Steve Jessop's idea...
You can do it if you have hardware support to clear an arbitrarily large amount of memory all at the same time. If you posit an arbitrarily large array, then you can also posit an arbitrarily large memory with hardware parallelism so that a single reset pin will simultaneously clear every register at once. That line will have to be driven by an arbitrarily large logic gate (that dissipates arbitrarily large power), and the circuit traces will have to be arbitrarily short (to overcome R/C delay) (or superconducting), but these things are quite common in extradimensional spaces.
It is possible to do it with O(1) time, and even with O(1) extra space.
I explained the O(1) time solution David mentioned in another answer, but it uses 2n extra memory.
A better algorithm exists, which requires only 1 bit of extra memory.
See the article I just wrote about that subject.
It also explains the algorithm David mentioned, some more, and the state-of-the-art algorithm today. It also features an implementation of the latter.
In a short explanation (as I'll just be repeating the article) it cleverly takes the algorithm presented in David's answer and does it all in-place, while using only a (very) little extra memory.

Help with algorithm for merging vectors

I need a very fast algorithm for the following task. I have already implemented several algorithms that complete it, but they're all too slow for the performance I need. It should be fast enough that the algorithm can be run at least 100,000 times a second on a modern CPU. It will be implemented in C++.
I am working with spans/ranges, a structure that has a start and an end coordinate on a line.
I have two vectors (dynamic arrays) of spans and I need to merge them. One vector is src and the other dst. The vectors are sorted by span start coordinates, and the spans do not overlap within one vector.
The spans in the src vector must be merged with the spans in the dst vector, such that the resulting vector is still sorted and has no overlaps. I.e. if overlaps are detected during the merging, the two spans are merged into one. (Merging two spans is just a matter of changing the coordinates in the structure.)
Now, there is one more catch, the spans in the src vector must be "widened" during the merge. This means that a constant will be added to the start and another (larger) constant to the end coordinate of every span in src. This means that after the src spans are widened they might overlap.
What I have arrived at so far is that it cannot be done fully in-place, some kind of temporary storage is needed. I think it should be doable in linear time over the number of elements of src and dst summed.
Any temporary storage can probably be shared between multiple runs of the algorithm.
The two primary approaches I have tried, which are too slow, are:
Append all elements of src to dst, widening each element before appending it. Then run an in-place sort. Finally iterate over the resulting vector using a "read" and "write" pointer, with the read pointer running ahead of the write pointer, merging spans as they go. When all elements have been merged (the read pointer reaches end) dst is truncated.
Create a temporary work-vector. Do a naive merge as described above by repeatedly picking the next element from either src or dst and merging into the work-vector. When done, copy the work-vector to dst, replacing it.
The first method has the problem that sorting is O((m+n)*log(m+n)) instead of O(m+n), and it has some overhead. It also means the dst vector has to grow much larger than it really needs to.
The second's primary problem is a lot of copying around and, again, allocation/deallocation of memory.
The data structures used for storing/managing the spans/vectors can be altered if you think that's needed.
Update: Forgot to say how large the datasets are. The most common cases are between 4 and 30 elements in either vector, and either dst is empty or there is a large amount of overlap between the spans in src and dst.
We know that the absolute best-case runtime is O(m+n); this is due to the fact that you at least have to scan over all of the data in order to merge the lists. Given this, your second method should give you that type of behavior.
Have you profiled your second method to find out what the bottlenecks are? It is quite possible that, depending on the amount of data you are talking about it is actually impossible to do what you want in the specified amount of time. One way to verify this is to do something simple like sum up all the start and end values of the spans in each vector in a loop, and time that. Basically here you are doing a minimal amount of work for each element in the vectors. This will provide you with a baseline for the best performance you can expect to get.
Beyond that you can avoid copying the vectors element by element by using the stl swap method, and you can preallocate the temp vector to a certain size in order to avoid triggering the expansion of the array when you are merging the elements.
You might consider using 2 vectors in your system and whenever you need to do a merge you merge into the unused vector, and then swap (this is similar to double buffering used in graphics). This way you don't have to reallocate the vectors every time you do the merge.
However, you are best off profiling first and finding out what your bottleneck is. If the allocations are minimal compared to the actual merging process than you need to figure out how to make that faster.
Some possible additional speedups could come from accessing the vectors' raw data directly, which avoids the bounds checks on each access.
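A sketch of the preallocate-and-swap advice above (Span and the merge itself are placeholders):

#include <vector>

struct Span { int start, end; };

void merge_into(std::vector<Span> &dst, const std::vector<Span> &src,
                std::vector<Span> &tmp) // tmp lives across calls
{
    tmp.clear();                          // keeps its capacity
    tmp.reserve(src.size() + dst.size()); // no growth during the merge
    // ... merge src and dst into tmp ...
    dst.swap(tmp);                        // O(1) pointer swap, no copying
}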
The sort you mention in Approach 1 can be reduced to linear time (from log-linear as you describe it) because the two input lists are already sorted. Just perform the merge step of merge-sort. With an appropriate representation for the input span vectors (for example singly-linked lists) this can be done in-place.
http://en.wikipedia.org/wiki/Merge_sort
How about the second method without repeated allocation--in other words, allocate your temporary vector once, and never allocate it again? Or, if the input vectors are small enough (But not constant size), just use alloca instead of malloc.
Also, in terms of speed, you may want to make sure that your code is using CMOV for the sorting, since if the code is actually branching for every single iteration of the mergesort:
if (src1[x] < src2[x])
    dst[x] = src1[x];
else
    dst[x] = src2[x];
The branch prediction will fail 50% of the time, which will have an enormous hit on performance. A conditional move will likely do much much better, so make sure the compiler is doing that, and if not, try to coax it into doing so.
I don't think a strictly linear solution is possible, because widening the src vector spans may in the worst case cause all of them to overlap (depending on the magnitude of the constant that you are adding).
The problem may be in the implementation, not in the algorithm; I would suggest profiling the code of your prior solutions to see where the time is spent.
Reasoning:
For a truly "modern" CPU like the Intel Core 2 Extreme QX9770 running at 3.2 GHz, one can expect about 59,455 MIPS.
For 100,000 runs per second, you would have to process each run in 594,550 instructions. That's a LOT of instructions.
ref: wikipedia MIPS
In addition, note that adding a constant to the src vector spans does not de-sort them, so you can normalize the src vector spans independently, then merge them with the dst vector spans; this should reduce the workload of your original algorithm.
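A sketch of that normalization step - widen each src span and coalesce overlaps in one linear pass, which works because widening preserves the sort order (the names are illustrative):

#include <algorithm>
#include <vector>

struct Span { int start, end; };

void widen_and_coalesce(std::vector<Span> &spans, int add_start, int add_end)
{
    std::size_t out = 0;
    for (const Span &s : spans) {
        Span w{ s.start + add_start, s.end + add_end };
        if (out > 0 && w.start <= spans[out - 1].end)
            spans[out - 1].end = std::max(spans[out - 1].end, w.end); // overlap: merge
        else
            spans[out++] = w; // no overlap: keep as a new span
    }
    spans.resize(out);
}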
1 is right out - a full sort is slower than merging two sorted lists.
So you're looking at tweaking 2 (or something entirely new).
If you change the data structures to doubly linked lists, then you can merge them in constant working space.
Use a fixed-size heap allocator for the list nodes, both to reduce memory use per node and to improve the chance that the nodes are close together in memory, reducing page misses.
You might be able to find code online or in your favourite algorithm book to optimise a linked list merge. You'll want to customise this in order to do span coalescing at the same time as the list merge.
To optimise the merge, first note that for each run of values coming off the same side without one coming from the other side, you can insert the whole run into the dst list in one go, instead of inserting each node in turn. And you can save one write per insertion over a normal list operation, by leaving the end "dangling", knowing that you'll patch it up later. And provided that you don't do deletions anywhere else in your app, the list can be singly-linked, which means one write per node.
As for 10 microsecond runtime - kind of depends on n and m...
I would always keep my vector of spans sorted. That makes implementing algorithms a LOT easier -- and possible to do in linear time.
OK, so I'd sort the spans based on:
span minimum in increasing order
then span maximum in decreasing order
You need to create a function to do that.
Then I'd use std::set_union to merge the vectors (you can merge more than one before continuing).
Then for each consecutive sets of spans with identical minimums, you keep the first and remove the rest (they're sub-spans of the first span).
Then you need to merge your spans. That should be pretty doable now and feasible in linear time.
OK, here's the trick now. Don't try to do this in-place. Use one or more temporary vectors (and reserve enough space ahead of time). Then at the end, call std::vector::swap to put the results in the input vector of your choice.
I hope that's enough to get you going.
What is your target system? Is it multi-core? If so, you could consider multithreading this algorithm.
I wrote a new container class just for this algorithm, tailored to the needs. This also gave me a chance to adjust other code around my program which got a little speed boost at the same time.
This is significantly faster than the old implementation using STL vectors, but which was otherwise basically the same thing. But while it's faster it's still not really fast enough... unfortunately.
Profiling doesn't reveal what the real bottleneck is any longer. The MSVC profiler seems to sometimes place the "blame" on the wrong calls (supposedly identical runs are assigned widely different running times), and most calls are getting coalesced into one big chunk.
Looking at a disassembly of the generated code shows that there's a very large amount of jumps in the generated code, I think that might be the main reason behind the slowness now.
class SpanBuffer {
private:
    int *data;
    size_t allocated_size;
    size_t count;

    inline void EnsureSpace()
    {
        if (count == allocated_size)
            Reserve(count * 2);
    }

public:
    struct Span {
        int start, end;
    };

public:
    SpanBuffer()
        : data(0)
        , allocated_size(24)
        , count(0)
    {
        data = new int[allocated_size];
    }

    SpanBuffer(const SpanBuffer &src)
        : data(0)
        , allocated_size(src.allocated_size)
        , count(src.count)
    {
        data = new int[allocated_size];
        memcpy(data, src.data, sizeof(int) * count);
    }

    ~SpanBuffer()
    {
        delete [] data;
    }

    inline void AddIntersection(int x)
    {
        EnsureSpace();
        data[count++] = x;
    }

    inline void AddSpan(int s, int e)
    {
        assert((count & 1) == 0);
        assert(s >= 0);
        assert(e >= 0);
        EnsureSpace();
        data[count] = s;
        data[count + 1] = e;
        count += 2;
    }

    inline void Clear()
    {
        count = 0;
    }

    inline size_t GetCount() const
    {
        return count;
    }

    inline int GetIntersection(size_t i) const
    {
        return data[i];
    }

    inline const Span * GetSpanIteratorBegin() const
    {
        assert((count & 1) == 0);
        return reinterpret_cast<const Span *>(data);
    }

    inline Span * GetSpanIteratorBegin()
    {
        assert((count & 1) == 0);
        return reinterpret_cast<Span *>(data);
    }

    inline const Span * GetSpanIteratorEnd() const
    {
        assert((count & 1) == 0);
        return reinterpret_cast<const Span *>(data + count);
    }

    inline Span * GetSpanIteratorEnd()
    {
        assert((count & 1) == 0);
        return reinterpret_cast<Span *>(data + count);
    }

    inline void MergeOrAddSpan(int s, int e)
    {
        assert((count & 1) == 0);
        assert(s >= 0);
        assert(e >= 0);
        if (count == 0)
        {
            AddSpan(s, e);
            return;
        }
        int *lastspan = data + count - 2;
        if (s > lastspan[1])
        {
            AddSpan(s, e);
        }
        else
        {
            if (s < lastspan[0])
                lastspan[0] = s;
            if (e > lastspan[1])
                lastspan[1] = e;
        }
    }

    inline void Reserve(size_t minsize)
    {
        if (minsize <= allocated_size)
            return;
        int *newdata = new int[minsize];
        memcpy(newdata, data, sizeof(int) * count);
        delete [] data;
        data = newdata;
        allocated_size = minsize;
    }

    inline void SortIntersections()
    {
        assert((count & 1) == 0);
        std::sort(data, data + count, std::less<int>());
        assert((count & 1) == 0);
    }

    inline void Swap(SpanBuffer &other)
    {
        std::swap(data, other.data);
        std::swap(allocated_size, other.allocated_size);
        std::swap(count, other.count);
    }
};
struct ShapeWidener {
    // How much to widen in the X direction
    int widen_by;
    // Half of the width difference of src and dst (width of the border being produced)
    int xofs;

    // Temporary storage for OverlayScanline, so it doesn't need to reallocate for each call
    SpanBuffer buffer;

    inline void OverlayScanline(const SpanBuffer &src, SpanBuffer &dst);

    ShapeWidener(int _xofs) : xofs(_xofs) { }
};

inline void ShapeWidener::OverlayScanline(const SpanBuffer &src, SpanBuffer &dst)
{
    if (src.GetCount() == 0) return;
    if (src.GetCount() + dst.GetCount() == 0) return;
    assert((src.GetCount() & 1) == 0);
    assert((dst.GetCount() & 1) == 0);
    assert(buffer.GetCount() == 0);

    dst.Swap(buffer);

    const int widen_s = xofs - widen_by;
    const int widen_e = xofs + widen_by;

    size_t resta = src.GetCount() / 2;
    size_t restb = buffer.GetCount() / 2;
    const SpanBuffer::Span *spa = src.GetSpanIteratorBegin();
    const SpanBuffer::Span *spb = buffer.GetSpanIteratorBegin();

    while (resta > 0 || restb > 0)
    {
        if (restb == 0)
        {
            dst.MergeOrAddSpan(spa->start + widen_s, spa->end + widen_e);
            --resta, ++spa;
        }
        else if (resta == 0)
        {
            dst.MergeOrAddSpan(spb->start, spb->end);
            --restb, ++spb;
        }
        else if (spa->start < spb->start)
        {
            dst.MergeOrAddSpan(spa->start + widen_s, spa->end + widen_e);
            --resta, ++spa;
        }
        else
        {
            dst.MergeOrAddSpan(spb->start, spb->end);
            --restb, ++spb;
        }
    }

    buffer.Clear();
}
If your most recent implementation still isn't fast enough, you might end up having to look at alternative approaches.
What are you using the outputs of this function for?