I need a very fast algorithm for the following task. I have already implemented several algorithms that complete it, but they're all too slow for the performance I need. It should be fast enough that the algorithm can be run at least 100,000 times a second on a modern CPU. It will be implemented in C++.
I am working with spans/ranges, a structure that has a start and an end coordinate on a line.
I have two vectors (dynamic arrays) of spans and I need to merge them. One vector is src and the other dst. The vectors are sorted by span start coordinates, and the spans do not overlap within one vector.
The spans in the src vector must be merged with the spans in the dst vector, such that the resulting vector is still sorted and has no overlaps. Ie. if overlaps are detected during the merging, the two spans are merged into one. (Merging two spans is just a matter of changing the coordinates in the structure.)
Now, there is one more catch, the spans in the src vector must be "widened" during the merge. This means that a constant will be added to the start and another (larger) constant to the end coordinate of every span in src. This means that after the src spans are widened they might overlap.
What I have arrived at so far is that it cannot be done fully in-place, some kind of temporary storage is needed. I think it should be doable in linear time over the number of elements of src and dst summed.
Any temporary storage can probably be shared between multiple runs of the algorithm.
The two primary approaches I have tried, which are too slow, are:
Append all elements of src to dst, widening each element before appending it. Then run an in-place sort. Finally iterate over the resulting vector using a "read" and "write" pointer, with the read pointer running ahead of the write pointer, merging spans as they go. When all elements have been merged (the read pointer reaches end) dst is truncated.
Create a temporary work-vector. Do a naive merge as described above by repeatedly picking the next element from either src or dst and merging into the work-vector. When done, copy the work-vector to dst, replacing it.
The first method has the problem that sorting is O((m+n)*log(m+n)) instead of O(m+n) and has somewhat overhead. It also means the dst vector has to grow much larger than it really needs.
The second has the primary problem of a lot of copying around and again allocation/deallocation of memory.
The data structures used for storing/managing the spans/vectors can be altered if you think that's needed.
Update: Forgot to say how large the datasets are. The most common cases are between 4 and 30 elements in either vector, and either dst is empty or there is a large amount of overlap between the spans in src and dst.
We know that the absolute best case runtime is O(m+n) this is due to the fact that you at least have to scan over all of the data in order to be able to merge the lists. Given this, your second method should give you that type of behavior.
Have you profiled your second method to find out what the bottlenecks are? It is quite possible that, depending on the amount of data you are talking about it is actually impossible to do what you want in the specified amount of time. One way to verify this is to do something simple like sum up all the start and end values of the spans in each vector in a loop, and time that. Basically here you are doing a minimal amount of work for each element in the vectors. This will provide you with a baseline for the best performance you can expect to get.
Beyond that you can avoid copying the vectors element by element by using the stl swap method, and you can preallocate the temp vector to a certain size in order to avoid triggering the expansion of the array when you are merging the elements.
You might consider using 2 vectors in your system and whenever you need to do a merge you merge into the unused vector, and then swap (this is similar to double buffering used in graphics). This way you don't have to reallocate the vectors every time you do the merge.
However, you are best off profiling first and finding out what your bottleneck is. If the allocations are minimal compared to the actual merging process than you need to figure out how to make that faster.
Some possible additional speedups could come from accessing the vectors raw data directly which avoids the bounds checks on each access the data.
The sort you mention in Approach 1 can be reduced to linear time (from log-linear as you describe it) because the two input lists are already sorted. Just perform the merge step of merge-sort. With an appropriate representation for the input span vectors (for example singly-linked lists) this can be done in-place.
http://en.wikipedia.org/wiki/Merge_sort
How about the second method without repeated allocation--in other words, allocate your temporary vector once, and never allocate it again? Or, if the input vectors are small enough (But not constant size), just use alloca instead of malloc.
Also, in terms of speed, you may want to make sure that your code is using CMOV for the sorting, since if the code is actually branching for every single iteration of the mergesort:
if(src1[x] < src2[x])
dst[x] = src1[x];
else
dst[x] = src2[x];
The branch prediction will fail 50% of the time, which will have an enormous hit on performance. A conditional move will likely do much much better, so make sure the compiler is doing that, and if not, try to coax it into doing so.
i don't think a strictly linear solution is possible, because widening the src vector spans may in the worst-case cause all of them to overlap (depending on the magnitude of the constant that you are adding)
the problem may be in the implementation, not in the algorithm; i would suggest profiling the code for your prior solutions to see where the time is spent
reasoning:
for a truly "modern" CPU like the Intel Core 2 Extreme QX9770 running at 3.2GHz, one can expect about 59,455 MIPS
for 100,000 vectors, you would have to process each vector in 594,550 instuctions. That's a LOT of instructions.
ref: wikipedia MIPS
in addition, note that adding a constant to the src vector spans does not de-sort them, so you can normalize the src vector spans independently, then merge them with the dst vector spans; this should reduce the workload of your original algorithm
1 is right out - a full sort is slower than merging two sorted lists.
So you're looking at tweaking 2 (or something entirely new).
If you change the data structures to doubly linked lists, then you can merge them in constant working space.
Use a fixed-size heap allocator for the list nodes, both to reduce memory use per node and to improve the chance that the nodes are close together in memory, reducing page misses.
You might be able to find code online or in your favourite algorithm book to optimise a linked list merge. You'll want to customise this in order to do span coalescing at the same time as the list merge.
To optimise the merge, first note that for each run of values coming off the same side without one coming from the other side, you can insert the whole run into the dst list in one go, instead of inserting each node in turn. And you can save one write per insertion over a normal list operation, by leaving the end "dangling", knowing that you'll patch it up later. And provided that you don't do deletions anywhere else in your app, the list can be singly-linked, which means one write per node.
As for 10 microsecond runtime - kind of depends on n and m...
I would always keep my vector of spans sorted. That makes implementing algorithms a LOT easier -- and possible to do in linear time.
OK, so I'd sort the spans based on:
span minimum in increasing order
then span maximum in decreasing order
You need to create a function to do that.
Then I'd use std::set_union to merge the vectors (you can merge more than one before continuing).
Then for each consecutive sets of spans with identical minimums, you keep the first and remove the rest (they're sub-spans of the first span).
Then you need to merge your spans. That should be pretty doable now and feasible in linear time.
OK, here's the trick now. Don't try to do this in-place. Use one or more temporary vectors (and reserve enough space ahead of time). Then at the end, call std::vector::swap to put the results in the input vector of your choice.
I hope that's enough to get you going.
What is your target system? Is it multi-core? If so you could consider multithreading this algorithm
I wrote a new container class just for this algorithm, tailored to the needs. This also gave me a chance to adjust other code around my program which got a little speed boost at the same time.
This is significantly faster than the old implementation using STL vectors, but which was otherwise basically the same thing. But while it's faster it's still not really fast enough... unfortunately.
Profiling doesn't reveal what is the real bottleneck any longer. The MSVC profiler seems to sometimes place the "blame" on the wrong calls (supposedly identical runs assign widely different running times) and most calls are getting coalesced into one big chink.
Looking at a disassembly of the generated code shows that there's a very large amount of jumps in the generated code, I think that might be the main reason behind the slowness now.
class SpanBuffer {
private:
int *data;
size_t allocated_size;
size_t count;
inline void EnsureSpace()
{
if (count == allocated_size)
Reserve(count*2);
}
public:
struct Span {
int start, end;
};
public:
SpanBuffer()
: data(0)
, allocated_size(24)
, count(0)
{
data = new int[allocated_size];
}
SpanBuffer(const SpanBuffer &src)
: data(0)
, allocated_size(src.allocated_size)
, count(src.count)
{
data = new int[allocated_size];
memcpy(data, src.data, sizeof(int)*count);
}
~SpanBuffer()
{
delete [] data;
}
inline void AddIntersection(int x)
{
EnsureSpace();
data[count++] = x;
}
inline void AddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
EnsureSpace();
data[count] = s;
data[count+1] = e;
count += 2;
}
inline void Clear()
{
count = 0;
}
inline size_t GetCount() const
{
return count;
}
inline int GetIntersection(size_t i) const
{
return data[i];
}
inline const Span * GetSpanIteratorBegin() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data);
}
inline Span * GetSpanIteratorBegin()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data);
}
inline const Span * GetSpanIteratorEnd() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data+count);
}
inline Span * GetSpanIteratorEnd()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data+count);
}
inline void MergeOrAddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
if (count == 0)
{
AddSpan(s, e);
return;
}
int *lastspan = data + count-2;
if (s > lastspan[1])
{
AddSpan(s, e);
}
else
{
if (s < lastspan[0])
lastspan[0] = s;
if (e > lastspan[1])
lastspan[1] = e;
}
}
inline void Reserve(size_t minsize)
{
if (minsize <= allocated_size)
return;
int *newdata = new int[minsize];
memcpy(newdata, data, sizeof(int)*count);
delete [] data;
data = newdata;
allocated_size = minsize;
}
inline void SortIntersections()
{
assert((count & 1) == 0);
std::sort(data, data+count, std::less<int>());
assert((count & 1) == 0);
}
inline void Swap(SpanBuffer &other)
{
std::swap(data, other.data);
std::swap(allocated_size, other.allocated_size);
std::swap(count, other.count);
}
};
struct ShapeWidener {
// How much to widen in the X direction
int widen_by;
// Half of width difference of src and dst (width of the border being produced)
int xofs;
// Temporary storage for OverlayScanline, so it doesn't need to reallocate for each call
SpanBuffer buffer;
inline void OverlayScanline(const SpanBuffer &src, SpanBuffer &dst);
ShapeWidener(int _xofs) : xofs(_xofs) { }
};
inline void ShapeWidener::OverlayScanline(const SpanBuffer &src, SpanBuffer &dst)
{
if (src.GetCount() == 0) return;
if (src.GetCount() + dst.GetCount() == 0) return;
assert((src.GetCount() & 1) == 0);
assert((dst.GetCount() & 1) == 0);
assert(buffer.GetCount() == 0);
dst.Swap(buffer);
const int widen_s = xofs - widen_by;
const int widen_e = xofs + widen_by;
size_t resta = src.GetCount()/2;
size_t restb = buffer.GetCount()/2;
const SpanBuffer::Span *spa = src.GetSpanIteratorBegin();
const SpanBuffer::Span *spb = buffer.GetSpanIteratorBegin();
while (resta > 0 || restb > 0)
{
if (restb == 0)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else if (resta == 0)
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
else if (spa->start < spb->start)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
}
buffer.Clear();
}
If your most recent implementation still isn't fast enough, you might end up having to look at alternative approaches.
What are you using the outputs of this function for?
Related
I want to implement an algorithm that basically moves every value(besides the last one) one place to the left, as in the first element becomes the second element, and so on.
I have already implemented it like this:
for(int i = 0; i < vct.size() - 1; i++){
vct[i] = vct[i + 1];
}
which works, but I was just wondering if there is a faster, optionally shorter way to achieve the same result?
EDIT: I have made a mistake where I said that I wanted it to move to the right, where in reality I wanted it to go left, so sorry for the confusion and thanks to everyone for pointing that out. I also checked if the vector isn't empty beforehand, just didn't include it in the snippet.
As a comment (or more than one?) has pointed out, the obvious choice here would be to just use a std::deque.
Another possibility would be to use a circular buffer. In this case, you'll typically have an index (or pointer) to the first and last items in the collection. Removing an item from the beginning consists of incrementing that index/pointer (and wrapping it around to the beginning of you've reached the end of the buffer). That's quite fast, and constant time, regardless of collection size. There is a downside that every time you add an item, remove an item, or look at an item, you need to do a tiny bit of extra math to do it. But it's usually just one addition, so overhead is pretty minimal. Circular buffers work well, but they have a fair number of corner cases, so getting them just right is often kind of a pain. Worse, many obvious implementations waste one the slot for one data item (though that often doesn't matter a lot).
A slightly simpler possibility would reduce overhead by some constant factor. To do this, use a little wrapper that keeps track of the first item in the collection, along with your vector of items. When you want to remove the first item, just increment the variable that keeps track of the first item. When you reach some preset limit, you erase the first N elements all at once. This reduces the time spent shifting items by a factor of N.
template <class T>
class pseudo_queue {
std::vector<T> data;
std:size_t b;
// adjust as you see fit:
static const int max_slop = 20;
void shift() {
data.erase(data.begin(), data.begin() + b);
}
public:
void push_back(T &&t) { data.push_back(std::move(t); }
void pop_back() { data.pop_back(); }
T &back() { return data.back(); }
T &front() { return data[b]; }
void pop_front() {
if (++b > max_slop) shift();
}
std::vector<T>::iterator begin() { return data.begin() + b; }
std::vector<T>::iterator end() { return data.end(); }
T &operator[](std::size_t index) { return data[index + b]; }
};
If you want to get really tricky, you can change this a bit, so you compute max_slop as a percentage of the size of data in the collection. In this case, you can change the computational complexity involved, rather than just leaving it linear but with a larger constant factor than you currently have. But I have no idea how much (if at all) you care about that--it's only likely to matter much if you deal with a wide range of sizes.
assuming you really meant moving data to the right, and that your code has a bug,
you have std::move_backwards from <algorithms> and its sibling std::move to do that, but accessing data backwards may be inefficient.
std::move_backward(vct.begin(),vct.end()-1,vct.end());
if you actually meant move data to the left, you can use std::move
std::move(vct.begin()+1,vct.end(),vct.begin())
you can also use std::copy and std::copy_backward instead if your object is trivially copiable, the syntax is exactly the same and it will be faster for trivially copyable objects (stack objects).
you can just do a normal loop to the right, assuming vct.size() is bigger than 1.
int temp1 = vct[0]; // assume vct of int
int temp2; // assume vct of int
for(int i = 0;i<vct.size() - 1;i++){
temp2 = vct[i+1];
vct[i+1] = temp1;
temp1 = temp2;
}
and your version is what's to do if you are moving to the left.
also as noted in the comments, you should check that your list is not empty if you are doing the loop version.
I am a bit curiuous about vector optimization and have couple questions about it. (I am still a beginner in programing)
example:
struct GameInfo{
EnumType InfoType;
// Other info...
};
int _lastPosition;
// _gameInfoV is sorted beforehand
std::vector<GameInfo> _gameInfoV;
// The tick function is called every game frame (in "perfect" condition it's every 1.0/60 second)
void BaseClass::tick()
{
for (unsigned int i = _lastPosition; i < _gameInfoV.size(); i++{
auto & info = _gameInfoV[i];
if( !info.bhasbeenAdded ){
if( DoWeNeedNow() ){
_lastPosition++;
info.bhasbeenAdded = true;
_otherPointer->DoSomething(info.InfoType);
// Do something more with "info"....
}
else return; //Break the cycle since we don't need now other "info"
}
}
}
The _gameInfoV vector size can be between 2000 and 5000.
My main 2 questions are:
Is it better to leave the way how it is or it's better to make smaller chunks of it, which is checked for every different GameInfo.InfoType
Is it worth the hassle of storing the last start position index of the vector instead of iterating from the beginning.
Note that if using smaller vectors there will be like 3 to 6 of them
The third thing is probably that I am not using vector iterators, but is it safe to use then like this?
std::vector<GameInfo>::iterator it = _gameInfoV.begin() + _lastPosition;
for (it = _gameInfoV.begin(); it != _gameInfoV.end(); ++it){
//Do something
}
Note: It will be used in smartphones, so every optimization will be appreciated, when targeting weaker phones.
-Thank you
Don't; except if you frequently move memory around
It is no hassle if you do it correctly:
std::vector<GameInfo>::const_iterator _lastPosition(gameInfoV.begin());
// ...
for (std::vector<GameInfo>::iterator info=_lastPosition; it!=_gameInfoV.end(); ++info)
{
if (!info->bhasbeenAdded)
{
if (DoWeNeedNow())
{
++_lastPosition;
_otherPointer->DoSomething(info->InfoType);
// Do something more with "info"....
}
else return; //Break the cycle since we don't need now other "i
}
}
Breaking one vector up into several smaller vectors in general doesn't improve performance. It could even slightly degrade performance because the compiler has to manage more variables, which take up more CPU registers etc.
I don't know about gaming so I don't understand the implication of GameInfo.InfoType. Your processing time and CPU resource requirements are going to increase if you do more total iterations through loops (where each loop iteration performs the same type of operation). So if separating the vectors causes you to avoid some loop iterations because you can skip entire vectors, that's going to increase performance of your app.
iterators are the most secure way to iterate through containers. But for a vector I often just use the index operator [] and my own indexer (a plain old unsigned integer).
I call a library function that accepts pointer to std::set and processes elements of it.
However, it processes only certain number of elements (let's say 100) and if the set has more elements it simply throws an exception. However, I receive set of much larger size. So I need efficient way to get subset of std::set.
Currently, I am copying 100 elements to temporary set and pass it to the function.
struct MyClass
{
// Class having considerably large size instance
};
// Library function that processes set having only 100 elements at a time
void ProcessSet (std::set<MyClass>* ptrMyClassObjectsSet);
void FunctionToProcessLargeSet (std::set<MyClass>& MyClassObjSet)
{
std::set<MyClass> MyClass100ObjSet;
// Cannot pass MyClassObject as it is to ProcessSet as it might have large number of elements
// So create set of 100 elements and pass it to the function
std::set<MyClass>::iterator it;
for (it = MyClassObjSet.begin(); it != MyClassObjSet.end(); ++it)
{
MyClass100ObjSet.insert (*it);
if (MyClass100ObjSet.size() == 100)
{
ProcessSet (&MyClass100ObjSet);
MyClass100ObjSet.clear();
}
}
// Prrocess remaining elments
ProcessSet (&MyClass100ObjSet);
MyClass100ObjSet.clear();
}
But it's impacting performance. Can anyone please suggest better ways to do this?
Since it looks like you are locked into having to use a subset. I have tweaked your code a little and I think it might be faster for you. It is still an O(n) operation but there is no branching in the for loop which should increase performance.
void FunctionToProcessLargeSet(std::set<MyClass>& MyClassObjSet)
{
int iteration = MyClassOgjSet.size() / 100; // get number of times we have collection of 100
auto it = MyClassObjSet.begin();
auto end = MyClassObjSet.begin();
for (; iteration == 0; --iteration)
{
std::advance(end, 100); // move end 100 away
std::set<MyClass> MyClass100ObjSet(it, std::advance(it, end)); // construct with iterator range
std::advance(it, 100); // advace it to end pos
ProcessSet(&MyClass100ObjSet); // process subset
}
if (MyClassOgjSet.size() % 100 != 0) // get last subset
{
std::set<MyClass> MyClass100ObjSet(it, MyClassObjSet.end());
// Prrocess remaining elments
ProcessSet(&MyClass100ObjSet);
}
}
Let me know if this runs faster for you.
Well, that sounds like a bad library design, but if you have to work with what you have then:
If library can accept a pair of iterators - that's the easy way to go using std::advance
If it's templated and can accept std::set<T>, then copying a part of your set to std::set<std::reference_wrapper<T>> might perform better if copying T is slow (see here to see that no copies are created)
If it only acceptsstd::set<ParticularObjectType>, I don't see a way around copying the data.
Hope this helps,
Rostislav.
What is considered an optimal data structure for pushing something in order (so inserts at any position, able to find correct position), in-order iteration, and popping N elements off the top (so the N smallest elements, N determined by comparisons with threshold value)? The push and pop need to be particularly fast (run every iteration of a loop), while the in-order full iteration of the data happens at a variable rate but likely an order of magnitude less often. The data can't be purged by the full iteration, it needs to be unchanged. Everything that is pushed will eventually be popped, but since a pop can remove multiple elements there can be more pushes than pops. The scale of data in the structure at any one time could go up to hundreds or low thousands of elements.
I'm currently using a std::deque and binary search to insert elements in ascending order. Profiling shows it taking up the majority of the time, so something has got to change. std::priority_queue doesn't allow iteration, and hacks I've seen to do it won't iterate in order. Even on a limited test (no full iteration!), the std::set class performed worse than my std::deque approach.
None of the classes I'm messing with seem to be built with this use case in mind. I'm not averse to making my own class, if there's a data structure not to be found in STL or boost for some reason.
edit:
There's two major functions right now, push and prune. push uses 65% of the time, prune uses 32%. Most of the time used in push is due to insertion into the deque (64% out of 65%). Only 1% comes from the binary search to find the position.
template<typename T, size_t Axes>
void Splitter<T, Axes>::SortedData::push(const Data& data) //65% of processing
{
size_t index = find(data.values[(axis * 2) + 1]);
this->data.insert(this->data.begin() + index, data); //64% of all processing happens here
}
template<typename T, size_t Axes>
void Splitter<T, Axes>::SortedData::prune(T value) //32% of processing
{
auto top = data.begin(), end = data.end(), it = top;
for (; it != end; ++it)
{
Data& data = *it;
if (data.values[(axis * 2) + 1] > value) break;
}
data.erase(top, it);
}
template<typename T, size_t Axes>
size_t Splitter<T, Axes>::SortedData::find(T value)
{
size_t start = 0;
size_t end = this->data.size();
if (!end) return 0;
size_t diff;
while (diff = (end - start) >> 1)
{
size_t mid = diff + start;
if (this->data[mid].values[(axis * 2) + 1] <= value)
{
start = mid;
}
else
{
end = mid;
}
}
return this->data[start].values[(axis * 2) + 1] <= value ? end : start;
}
With your requirements, a hybrid data-structure tailored to your needs will probably perform best. As others have said, continuous memory is very important, but I would not recommend keeping the array sorted at all times. I propose you use 3 buffers (1 std::array and 2 std::vectors):
1 (constant-size) Buffer for the "insertion heap". Needs to fit into the cache.
2 (variable-sized) Buffers (A+B) to maintain and update sorted arrays.
When you push an element, you add it to the insertion heap via std::push_heap. Since the insertion heap is constant size, it can overflow. When that happens, you std::sort it backwards and std::merge it with the already sorted-sequence buffer (A) into the third (B), resizing them as needed. That will be the new sorted buffer and the old one can be discarded, i.e. you swap A and B for the next bulk operation. When you need the sorted sequence for iteration, you do the same. When you remove elements, you compare the top element in the heap with the last element in the sorted sequence and remove that (which is why you sort it backwards, so that you can pop_back instead of pop_front).
For reference, this idea is loosely based on sequence heaps.
Have you tried messing around with std::vector? As weird as it may sound it could be actually pretty fast because it uses continuous memory. If I remember correctly Bjarne Stroustrup was talking about this at Going Native 2012 (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style but I'm not 100% sure that it's in this video).
You save time with the binary search, but the insertion in random positions of the deque is slow. I would suggest an std::map instead.
From your edit, it sounds like the delay is in copying - is it a complex object? Can you heap allocate and store pointers in the structure so each entry is created once only; you'll need to provide a custom comparitor that takes pointers, as the objects operator<() wouldn't be called. (The custom comparitor can simply call operator<())
EDIT:
Your own figures show it's the insertion that takes the time, not the 'sorting'. While some of that insertion time is creating a copy of your object, some (possibly most) is creation of the internal structure that will hold your object - and I don't think that will change between list/map/set/queue etc. IF you can predict the likely eventual/maximum size of your data set, and can write or find your own sorting algorithm, and the time is being lost in allocating objects, then vector might be the way to go.
Is there an way to zero out an array in with time complexsity O(1)? It's obvious that this can be done by for-loop, memset. But their time complexity are not O(1).
Yes
However not any array. It takes an array that has been crafted for this to work.
template <typename T, size_t N>
class Array {
public:
Array(): generation(0) {}
void clear() {
// FIXME: deal with overflow
++generation;
}
T get(std::size_t i) const {
if (i >= N) { throw std::runtime_error("out of range"); }
TimedT const& t = data[i];
return t.second == generation ? t.first : T{};
}
void set(std::size_t i, T t) {
if (i >= N) { throw std::runtime_error("out of range"); }
data[i] = std::make_pair(t, generation);
}
private:
typedef std::pair<T, unsigned> TimedT;
TimedT data[N];
unsigned generation;
};
The principle is simple:
we define an epoch using the generation attribute
when an item is set, the epoch in which it has been set is recorded
only items of the current epoch can be seen
clearing is thus equivalent to incrementing the epoch counter
The method has two issues:
storage increase: for each item we store an epoch
generation counter overflow: there is something as a maximum number of epochs
The latter can be thwarted using a real big integer (uint64_t at the cost of more storage).
The former is a natural consequence, one possible solution is to use buckets to downplay the issue by having for example up to 64 items associated to a single counter and a bitmask identifying which are valid within this counter.
EDIT: just wanted to get back on the buckets idea.
The original solution has an overhead of 8 bytes (64 bits) per element (if already 8-bytes aligned). Depending on the elements stored it might or might not be a big deal.
If it is a big deal, the idea is to use buckets; of course like all trade-off it slows down access even more.
template <typename T>
class BucketArray {
public:
BucketArray(): generation(0), mask(0) {}
T get(std::size_t index, std::size_t gen) const {
assert(index < 64);
return gen == generation and (mask & (1 << index)) ?
data[index] : T{};
}
void set(std::size_t index, T t, std::size_t gen) {
assert(index < 64);
if (generation < gen) { mask = 0; generation = gen; }
mask |= (1 << index);
data[index] = t;
}
private:
std::uint64_t generation;
std::uint64_t mask;
T data[64];
};
Note that this small array of a fixed number of elements (we could actually template this and statically check it's inferior or equal to 64) only has 16 bytes of overhead. This means we have an overhead of 2 bits per element.
template <typename T, size_t N>
class Array {
typedef BucketArray<T> Bucket;
public:
Array(): generation(0) {}
void clear() { ++generation; }
T get(std::size_t i) const {
if (i >= N) { throw ... }
Bucket const& bucket = data[i / 64];
return bucket.get(i % 64, generation);
}
void set(std::size_t i, T t) {
if (i >= N) { throw ... }
Bucket& bucket = data[i / 64];
bucket.set(i % 64, t, generation);
}
private:
std::uint64_t generation;
Bucket data[N / 64 + 1];
};
We got the space overhead down by a factor of... 32. Now the array can even be used to store char for example, whereas before it would have been prohibitive. The cost is that access got slower, as we get a division and modulo (when we will get a standardized operation that returns both results in one shot ?).
You can't modify n locations in memory in less than O(n) (even if your hardware, for sufficiently small n, maybe allows a constant-time operation to zero certain nicely-aligned blocks of memory, like for example flash memory does).
However, if the object of the exercise is a bit of lateral thinking, then you can write a class representing a "sparse" array. The general idea of a sparse array is that you keep a collection (perhaps a map, although depending on usage that might not be all there is to it), and when you look up an index, if it's not in the underlying collection then you return 0.
If you can clear the underlying collection in O(1), then you can zero out your sparse array in O(1). Clearing a std::map isn't usually constant-time in the size of the map, because all those nodes need to be freed. But you could design a collection that can be cleared in O(1) by moving the whole tree over from "the contents of my map", to "a tree of nodes that I have reserved for future use". The disadvantage would just be that this "reserved" space is still allocated, a bit like what happens when a vector gets smaller.
It's certainly possible to zero out an array in O(1) as long as you accept a very large constant factor:
void zero_out_array_in_constant_time(void* a, size_t n)
{
char* p = (char*) a;
for (size_t i = 0; i < std::numeric_limits<size_t>::max(); ++i)
{
p[i % n] = 0;
}
}
This will always take the same number of steps, regardless of the size of the array, hence it's O(1).
No.
You can't visit every member of an N-element collection in anything less than O(N) time.
You might, as Mike Kwan has observed, shift the cost from run- to compile-time, but that doesn't alter the computational complexity of the operation.
It's clearly not possible to initialize an arbitrarily sized array in a fixed length of time. However, it is entirely possible to create an array-like ADT which amortizes the cost of initializing the array across its use. The usual construction for this takes upwards of 3x the storage, however. To whit:
template <typename T, size_t arr_size>
class NoInitArray
{
std::vector<T> storage;
// Note that 'lookup' and 'check' are not initialized, and may contain
// arbitrary garbage.
size_t lookup[arr_size];
size_t check[arr_size];
public:
T& operator[](size_t pos)
{
// find out where we actually stored the entry for (*this)[pos].
// This could be garbage.
size_t storage_loc=lookup[pos];
// Check to see that the storage_loc we found is valid
if (storage_loc < storage.size() && check[storage_loc] == pos)
{
// everything checks, return the reference.
return storage[storage_loc];
}
else
{
// storage hasn't yet been allocated/initialized for (*this)[pos].
// allocate storage:
storage_loc=storage.size();
storage.push_back(T());
// put entries in lookup and check so we can find
// the proper spot later:
lookup[pos]=storage_loc;
check[storage_loc]=pos;
// everything's set up, return appropriate reference:
return storage.back();
}
}
};
One could add a clear() member to empty the contents of such an array fairly easily if T is some type that doesn't require destruction, at least in concept.
I like Eli Bendersky's webpage http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time, with a solution that he attributes to the famous book Design and Analysis of Computer Algorithms by Aho, Hopcroft and Ullman. This is genuinely O(1) time complexity for initialization, rather than O(N). The space demands are O(N) additional storage, but allocating this space is also O(1), since the space is full of garbage. I enjoyed this for theoretical reasons, but I think it might also be of practical value for implementing some algorithms, if you need to repeatedly initialize a very large array, and each time you access only a relatively small number of positions in the array. Bendersky provides a C++ implementation of the algorithm.
A very pure theorist might start worrying about the fact that N needs O(log(N)) digits, but I have ignored that detail, which would presumably require looking carefully at the mathematical model of the computer. Volume 1 of The Art of Computer Programming probably gives Knuth's view of this issue.
It is impossible at runtime to zero out an array in O(1). This is intuitive given that there is no language mechanism which allows the value setting of arbitrary size blocks of memory in fixed time. The closest you can do is:
int blah[100] = {0};
That will allow the initialisation to happen at compiletime. At runtime, memset is generally the fastest but would be O(N). However, there are problems associated with using memset on particular array types.
While still O(N), implementations that map to hardware-assisted operations like clearing whole cache lines or memory pages can run at <1 cycle per word.
Actually, riffing on Steve Jessop's idea...
You can do it if you have hardware support to clear an arbitrarily large amount of memory all at the same time. If you posit an arbitrarily large array, then you can also posit an arbitrarily large memory with hardware parallelism so that a single reset pin will simultaneously clear every register at once. That line will have to be driven by an arbitrarily large logic gate (that dissipates arbitrarily large power), and the circuit traces will have to be arbitrarily short (to overcome R/C delay) (or superconducting), but these things are quite common in extradimensional spaces.
It is possible to do it with O(1) time, and even with O(1) extra space.
I explained the O(1) time solution David mentioned in another answer, but it uses 2n extra memory.
A better algorithm exists, which requires only 1 bit of extra memory.
See the Article I just wrote about that subject.
It also explains the algorithm David mentioned, some more, and the state-of-the-art algorithm today. It also features an implementation of the latter.
In a short explanation (as I'll just be repeating the article) it cleverly takes the algorithm presented in David's answer and does it all in-place, while using only a (very) little extra memory.