How to zero out array in O(1)? - c++

Is there an way to zero out an array in with time complexsity O(1)? It's obvious that this can be done by for-loop, memset. But their time complexity are not O(1).

Yes
However not any array. It takes an array that has been crafted for this to work.
template <typename T, size_t N>
class Array {
public:
Array(): generation(0) {}
void clear() {
// FIXME: deal with overflow
++generation;
}
T get(std::size_t i) const {
if (i >= N) { throw std::runtime_error("out of range"); }
TimedT const& t = data[i];
return t.second == generation ? t.first : T{};
}
void set(std::size_t i, T t) {
if (i >= N) { throw std::runtime_error("out of range"); }
data[i] = std::make_pair(t, generation);
}
private:
typedef std::pair<T, unsigned> TimedT;
TimedT data[N];
unsigned generation;
};
The principle is simple:
we define an epoch using the generation attribute
when an item is set, the epoch in which it has been set is recorded
only items of the current epoch can be seen
clearing is thus equivalent to incrementing the epoch counter
The method has two issues:
storage increase: for each item we store an epoch
generation counter overflow: there is something as a maximum number of epochs
The latter can be thwarted using a real big integer (uint64_t at the cost of more storage).
The former is a natural consequence, one possible solution is to use buckets to downplay the issue by having for example up to 64 items associated to a single counter and a bitmask identifying which are valid within this counter.
EDIT: just wanted to get back on the buckets idea.
The original solution has an overhead of 8 bytes (64 bits) per element (if already 8-bytes aligned). Depending on the elements stored it might or might not be a big deal.
If it is a big deal, the idea is to use buckets; of course like all trade-off it slows down access even more.
template <typename T>
class BucketArray {
public:
BucketArray(): generation(0), mask(0) {}
T get(std::size_t index, std::size_t gen) const {
assert(index < 64);
return gen == generation and (mask & (1 << index)) ?
data[index] : T{};
}
void set(std::size_t index, T t, std::size_t gen) {
assert(index < 64);
if (generation < gen) { mask = 0; generation = gen; }
mask |= (1 << index);
data[index] = t;
}
private:
std::uint64_t generation;
std::uint64_t mask;
T data[64];
};
Note that this small array of a fixed number of elements (we could actually template this and statically check it's inferior or equal to 64) only has 16 bytes of overhead. This means we have an overhead of 2 bits per element.
template <typename T, size_t N>
class Array {
typedef BucketArray<T> Bucket;
public:
Array(): generation(0) {}
void clear() { ++generation; }
T get(std::size_t i) const {
if (i >= N) { throw ... }
Bucket const& bucket = data[i / 64];
return bucket.get(i % 64, generation);
}
void set(std::size_t i, T t) {
if (i >= N) { throw ... }
Bucket& bucket = data[i / 64];
bucket.set(i % 64, t, generation);
}
private:
std::uint64_t generation;
Bucket data[N / 64 + 1];
};
We got the space overhead down by a factor of... 32. Now the array can even be used to store char for example, whereas before it would have been prohibitive. The cost is that access got slower, as we get a division and modulo (when we will get a standardized operation that returns both results in one shot ?).

You can't modify n locations in memory in less than O(n) (even if your hardware, for sufficiently small n, maybe allows a constant-time operation to zero certain nicely-aligned blocks of memory, like for example flash memory does).
However, if the object of the exercise is a bit of lateral thinking, then you can write a class representing a "sparse" array. The general idea of a sparse array is that you keep a collection (perhaps a map, although depending on usage that might not be all there is to it), and when you look up an index, if it's not in the underlying collection then you return 0.
If you can clear the underlying collection in O(1), then you can zero out your sparse array in O(1). Clearing a std::map isn't usually constant-time in the size of the map, because all those nodes need to be freed. But you could design a collection that can be cleared in O(1) by moving the whole tree over from "the contents of my map", to "a tree of nodes that I have reserved for future use". The disadvantage would just be that this "reserved" space is still allocated, a bit like what happens when a vector gets smaller.

It's certainly possible to zero out an array in O(1) as long as you accept a very large constant factor:
void zero_out_array_in_constant_time(void* a, size_t n)
{
char* p = (char*) a;
for (size_t i = 0; i < std::numeric_limits<size_t>::max(); ++i)
{
p[i % n] = 0;
}
}
This will always take the same number of steps, regardless of the size of the array, hence it's O(1).

No.
You can't visit every member of an N-element collection in anything less than O(N) time.
You might, as Mike Kwan has observed, shift the cost from run- to compile-time, but that doesn't alter the computational complexity of the operation.

It's clearly not possible to initialize an arbitrarily sized array in a fixed length of time. However, it is entirely possible to create an array-like ADT which amortizes the cost of initializing the array across its use. The usual construction for this takes upwards of 3x the storage, however. To whit:
template <typename T, size_t arr_size>
class NoInitArray
{
std::vector<T> storage;
// Note that 'lookup' and 'check' are not initialized, and may contain
// arbitrary garbage.
size_t lookup[arr_size];
size_t check[arr_size];
public:
T& operator[](size_t pos)
{
// find out where we actually stored the entry for (*this)[pos].
// This could be garbage.
size_t storage_loc=lookup[pos];
// Check to see that the storage_loc we found is valid
if (storage_loc < storage.size() && check[storage_loc] == pos)
{
// everything checks, return the reference.
return storage[storage_loc];
}
else
{
// storage hasn't yet been allocated/initialized for (*this)[pos].
// allocate storage:
storage_loc=storage.size();
storage.push_back(T());
// put entries in lookup and check so we can find
// the proper spot later:
lookup[pos]=storage_loc;
check[storage_loc]=pos;
// everything's set up, return appropriate reference:
return storage.back();
}
}
};
One could add a clear() member to empty the contents of such an array fairly easily if T is some type that doesn't require destruction, at least in concept.

I like Eli Bendersky's webpage http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time, with a solution that he attributes to the famous book Design and Analysis of Computer Algorithms by Aho, Hopcroft and Ullman. This is genuinely O(1) time complexity for initialization, rather than O(N). The space demands are O(N) additional storage, but allocating this space is also O(1), since the space is full of garbage. I enjoyed this for theoretical reasons, but I think it might also be of practical value for implementing some algorithms, if you need to repeatedly initialize a very large array, and each time you access only a relatively small number of positions in the array. Bendersky provides a C++ implementation of the algorithm.
A very pure theorist might start worrying about the fact that N needs O(log(N)) digits, but I have ignored that detail, which would presumably require looking carefully at the mathematical model of the computer. Volume 1 of The Art of Computer Programming probably gives Knuth's view of this issue.

It is impossible at runtime to zero out an array in O(1). This is intuitive given that there is no language mechanism which allows the value setting of arbitrary size blocks of memory in fixed time. The closest you can do is:
int blah[100] = {0};
That will allow the initialisation to happen at compiletime. At runtime, memset is generally the fastest but would be O(N). However, there are problems associated with using memset on particular array types.

While still O(N), implementations that map to hardware-assisted operations like clearing whole cache lines or memory pages can run at <1 cycle per word.
Actually, riffing on Steve Jessop's idea...
You can do it if you have hardware support to clear an arbitrarily large amount of memory all at the same time. If you posit an arbitrarily large array, then you can also posit an arbitrarily large memory with hardware parallelism so that a single reset pin will simultaneously clear every register at once. That line will have to be driven by an arbitrarily large logic gate (that dissipates arbitrarily large power), and the circuit traces will have to be arbitrarily short (to overcome R/C delay) (or superconducting), but these things are quite common in extradimensional spaces.

It is possible to do it with O(1) time, and even with O(1) extra space.
I explained the O(1) time solution David mentioned in another answer, but it uses 2n extra memory.
A better algorithm exists, which requires only 1 bit of extra memory.
See the Article I just wrote about that subject.
It also explains the algorithm David mentioned, some more, and the state-of-the-art algorithm today. It also features an implementation of the latter.
In a short explanation (as I'll just be repeating the article) it cleverly takes the algorithm presented in David's answer and does it all in-place, while using only a (very) little extra memory.

Related

Is there a fast way to 'move' all vector values by 1 position?

I want to implement an algorithm that basically moves every value(besides the last one) one place to the left, as in the first element becomes the second element, and so on.
I have already implemented it like this:
for(int i = 0; i < vct.size() - 1; i++){
vct[i] = vct[i + 1];
}
which works, but I was just wondering if there is a faster, optionally shorter way to achieve the same result?
EDIT: I have made a mistake where I said that I wanted it to move to the right, where in reality I wanted it to go left, so sorry for the confusion and thanks to everyone for pointing that out. I also checked if the vector isn't empty beforehand, just didn't include it in the snippet.
As a comment (or more than one?) has pointed out, the obvious choice here would be to just use a std::deque.
Another possibility would be to use a circular buffer. In this case, you'll typically have an index (or pointer) to the first and last items in the collection. Removing an item from the beginning consists of incrementing that index/pointer (and wrapping it around to the beginning of you've reached the end of the buffer). That's quite fast, and constant time, regardless of collection size. There is a downside that every time you add an item, remove an item, or look at an item, you need to do a tiny bit of extra math to do it. But it's usually just one addition, so overhead is pretty minimal. Circular buffers work well, but they have a fair number of corner cases, so getting them just right is often kind of a pain. Worse, many obvious implementations waste one the slot for one data item (though that often doesn't matter a lot).
A slightly simpler possibility would reduce overhead by some constant factor. To do this, use a little wrapper that keeps track of the first item in the collection, along with your vector of items. When you want to remove the first item, just increment the variable that keeps track of the first item. When you reach some preset limit, you erase the first N elements all at once. This reduces the time spent shifting items by a factor of N.
template <class T>
class pseudo_queue {
std::vector<T> data;
std:size_t b;
// adjust as you see fit:
static const int max_slop = 20;
void shift() {
data.erase(data.begin(), data.begin() + b);
}
public:
void push_back(T &&t) { data.push_back(std::move(t); }
void pop_back() { data.pop_back(); }
T &back() { return data.back(); }
T &front() { return data[b]; }
void pop_front() {
if (++b > max_slop) shift();
}
std::vector<T>::iterator begin() { return data.begin() + b; }
std::vector<T>::iterator end() { return data.end(); }
T &operator[](std::size_t index) { return data[index + b]; }
};
If you want to get really tricky, you can change this a bit, so you compute max_slop as a percentage of the size of data in the collection. In this case, you can change the computational complexity involved, rather than just leaving it linear but with a larger constant factor than you currently have. But I have no idea how much (if at all) you care about that--it's only likely to matter much if you deal with a wide range of sizes.
assuming you really meant moving data to the right, and that your code has a bug,
you have std::move_backwards from <algorithms> and its sibling std::move to do that, but accessing data backwards may be inefficient.
std::move_backward(vct.begin(),vct.end()-1,vct.end());
if you actually meant move data to the left, you can use std::move
std::move(vct.begin()+1,vct.end(),vct.begin())
you can also use std::copy and std::copy_backward instead if your object is trivially copiable, the syntax is exactly the same and it will be faster for trivially copyable objects (stack objects).
you can just do a normal loop to the right, assuming vct.size() is bigger than 1.
int temp1 = vct[0]; // assume vct of int
int temp2; // assume vct of int
for(int i = 0;i<vct.size() - 1;i++){
temp2 = vct[i+1];
vct[i+1] = temp1;
temp1 = temp2;
}
and your version is what's to do if you are moving to the left.
also as noted in the comments, you should check that your list is not empty if you are doing the loop version.

Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
// Insert one entry with value x
void Insert(int x);
// Erase one entry with value x, if one exists
void Erase(int x);
// Erase all entries, x, from <= x < to
void Erase(int from, int to);
// Return the count of all entries, x, from <= x < to
size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
std::vector<int> integerCollection;
};
void IntegerCollection::Insert(int x) {
/* using lower_bound to find the right place for x to be inserted
keeps the vector sorted and makes life much easier */
auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
integerCollection.insert(it, x);
}
void IntegerCollection::Erase(int x) {
// find the location of the first element containing x and delete if it exists
auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
if (it != integerCollection.end()) {
integerCollection.erase(it);
}
}
void IntegerCollection::Erase(int from, int to) {
if (integerCollection.empty()) return;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
/* std::vector::erase deletes entries between the two pointers
fromBound (included) and toBound (not indcluded) */
integerCollection.erase(fromBound, toBound);
}
size_t IntegerCollection::Count(int from, int to) const {
if (integerCollection.empty()) return 0;
int count = 0;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
// increment pointer until fromBound == toBound (we don't count elements of value = to)
while (fromBound != toBound) {
++count; ++fromBound;
}
return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without concise reasoning I wouldn't want to work for them anyway. Moreover, we dont really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the first you can either
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.
Let's compare only isertion and erasure assuming the container needs to stay sorted:
operation std::set std::vector
insert log(N) N
erase log(N) N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far away in memory (std::set) leading to many cache misses or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference 500000/log(500000) is roughly 26410 while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for considerably small container sizes, but at some point the log(N) wins over cache. The exact location of this turning point depends on many factors and can only reliably determined by profiling and measuring.
Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what is the most fuel-efficient design for a car engine possible. People are always innovating on the car engines. They make more efficient ones all the time. However, I would recommend a splay tree. The time required for a insertion or deletion is a splay tree is not constant. Some insertions take a long time and some take only a very a short time. However, the average time per insertion/deletion is always guaranteed to be be O(log n), where n is the number of items being stored in the splay tree. logarithmic time is extremely efficient. It should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and and deleting large ranges could be sped up from there, by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];
u_int32 hashIndex = (u_int32)value / 0x10000;
u_int32int valueInTable = (u_int32)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count for example could be implemented as so, if each hash table kept count of the number of elements it contained:
indexStart = startRange / 0x10000;
indexEnd = endRange / 0x10000;
int countTotal = 0;
for (int i = indexStart; i<=indexEnd; ++i) {
countTotal += containers[i].count();
}
Not sure if using sorting really is a requirement for removing the range. It might be based on position. Anyway, here is a link with some hints which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector maybe a good choice, but it does a lot of re allocation, as you know. I prefer deque instead, as it doesn't require big chunk of memory to allocate all items. For such requirement as you had, list probably fit better.
Basic solution for this problem might be std::map<int, int>
where key is the integer you are storing and value is the number of occurences.
Problem with this is that you can not quickly remove/count ranges. In other words complexity is linear.
For quick count you would need to implement your own complete binary tree where you can know the number of nodes between 2 nodes(upper and lower bound node) because you know the size of tree, and you know how many left and right turns you took to upper and lower bound nodes. Note that we are talking about complete binary tree, in general binary tree you can not make this calculation fast.
For quick range remove I do not know how to make it faster than linear.

Any way to "factor out" common fields to save space?

I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead make a separate Item[] for each distinct a, because that will lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
#define A_INDEX_BITS 3
struct Item {
size_t a_index : A_INDEX_BITS;
size_t b : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.
Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a* values. #Frank's answer employes DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.
This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just used a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; }
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; }
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to it's kind member as a proxy:
typedef uint8_t ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK P_SIZE - 1U
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
return vpMask(vp--) ? vp - P_SIZE * P_SIZE : vp;
}
// Simple indexed access from a Chunk pointer
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.
I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search in log(n) time to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
The caveat here is that the dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)
An Array-of-Structures approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing, for example to support 4-values of A, we simple change our function with...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html

Sort objects of dynamic size

Problem
Suppose I have a large array of bytes (think up to 4GB) containing some data. These bytes correspond to distinct objects in such a way that every s bytes (think s up to 32) will constitute a single object. One important fact is that this size s is the same for all objects, not stored within the objects themselves, and not known at compile time.
At the moment, these objects are logical entities only, not objects in the programming language. I have a comparison on these objects which consists of a lexicographical comparison of most of the object data, with a bit of different functionality to break ties using the remaining data. Now I want to sort these objects efficiently (this is really going to be a bottleneck of the application).
Ideas so far
I've thought of several possible ways to achieve this, but each of them appears to have some rather unfortunate consequences. You don't necessarily have to read all of these. I tried to print the central question of each approach in bold. If you are going to suggest one of these approaches, then your answer should respond to the related questions as well.
1. C quicksort
Of course the C quicksort algorithm is available in C++ applications as well. Its signature matches my requirements almost perfectly. But the fact that using that function will prohibit inlining of the comparison function will mean that every comparison carries a function invocation overhead. I had hoped for a way to avoid that. Any experience about how C qsort_r compares to STL in terms of performance would be very welcome.
2. Indirection using Objects pointing at data
It would be easy to write a bunch of objects holding pointers to their respective data. Then one could sort those. There are two aspects to consider here. On the one hand, just moving around pointers instead of all the data would mean less memory operations. On the other hand, not moving the objects would probably break memory locality and thus cache performance. Chances that the deeper levels of quicksort recursion could actually access all their data from a few cache pages would vanish almost completely. Instead, each cached memory page would yield only very few usable data items before being replaced. If anyone could provide some experience about the tradeoff between copying and memory locality I'd be very glad.
3. Custom iterator, reference and value objects
I wrote a class which serves as an iterator over the memory range. Dereferencing this iterator yields not a reference but a newly constructed object to hold the pointer to the data and the size s which is given at construction of the iterator. So these objects can be compared, and I even have an implementation of std::swap for these. Unfortunately, it appears that std::swap isn't enough for std::sort. In some parts of the process, my gcc implementation uses insertion sort (as implemented in __insertion_sort in file stl_alog.h) which moves a value out of the sequence, moves a number items by one step, and then moves the first value back into the sequence at the appropriate position:
typename iterator_traits<_RandomAccessIterator>::value_type
__val = _GLIBCXX_MOVE(*__i);
_GLIBCXX_MOVE_BACKWARD3(__first, __i, __i + 1);
*__first = _GLIBCXX_MOVE(__val);
Do you know of a standard sorting implementation which doesn't require a value type but can operate with swaps alone?
So I'd not only need my class which serves as a reference, but I would also need a class to hold a temporary value. And as the size of my objects is dynamic, I'd have to allocate that on the heap, which means memory allocations at the very leafs of the recusrion tree. Perhaps one alternative would be a vaue type with a static size that should be large enough to hold objects of the sizes I currently intend to support. But that would mean that there would be even more hackery in the relation between the reference_type and the value_type of the iterator class. And it would mean I would have to update that size for my application to one day support larger objects. Ugly.
If you can think of a clean way to get the above code to manipulate my data without having to allocate memory dynamically, that would be a great solution. I'm using C++11 features already, so using move semantics or similar won't be a problem.
4. Custom sorting
I even considered reimplementing all of quicksort. Perhaps I could make use of the fact that my comparison is mostly a lexicographical compare, i.e. I could sort sequences by first byte and only switch to the next byte when the firt byte is the same for all elements. I haven't worked out the details on this yet, but if anyone can suggest a reference, an implementation or even a canonical name to be used as a keyword for such a byte-wise lexicographical sorting, I'd be very happy. I'm still not convinced that with reasonable effort on my part I could beat the performance of the STL template implementation.
5. Completely different algorithm
I know there are many many kinds of sorting algorithms out there. Some of them might be better suited to my problem. Radix sort comes to my mind first, but I haven't really thought this through yet. If you can suggest a sorting algorithm more suited to my problem, please do so. Preferrably with implementation, but even without.
Question
So basically my question is this:
“How would you efficiently sort objects of dynamic size in heap memory?”
Any answer to this question which is applicable to my situation is good, no matter whether it is related to my own ideas or not. Answers to the individual questions marked in bold, or any other insight which might help me decide between my alternatives, would be useful as well, particularly if no definite answer to a single approach turns up.
The most practical solution is to use the C style qsort that you mentioned.
template <unsigned S>
struct my_obj {
enum { SIZE = S; };
const void *p_;
my_obj (const void *p) : p_(p) {}
//...accessors to get data from pointer
static int c_style_compare (const void *a, const void *b) {
my_obj aa(a);
my_obj bb(b);
return (aa < bb) ? -1 : (bb < aa);
}
};
template <unsigned N, typename OBJ>
void my_sort (const char (&large_array)[N], const OBJ &) {
qsort(large_array, N/OBJ::SIZE, OBJ::SIZE, OBJ::c_style_compare);
}
(Or, you can call qsort_r if you prefer.) Since STL sort inlines the comparision calls, you may not get the fastest possible sorting. If all your system does is sorting, it may be worth it to add the code to get custom iterators to work. But, if most of the time your system is doing something other than sorting, the extra gain you get may just be noise to your overall system.
Since there are only 31 different object variations (1 to 32 bytes), you could easily create an object type for each and select a call to std::sort based on a switch statement. Each call will get inlined and highly optimized.
Some object sizes might require a custom iterator, as the compiler will insist on padding native objects to align to address boundaries. Pointers can be used as iterators in the other cases since a pointer has all the properties of an iterator.
I'd agree with std::sort using a custom iterator, reference and value type; it's best to use the standard machinery where possible.
You worry about memory allocations, but modern memory allocators are very efficient at handing out small chunks of memory, particularly when being repeatedly reused. You could also consider using your own (stateful) allocator, handing out length s chunks from a small pool.
If you can overlay an object onto your buffer, then you can use std::sort, as long as your overlay type is copyable. (In this example, 4 64bit integers). With 4GB of data, you're going to need a lot of memory though.
As discussed in the comments, you can have a selection of possible sizes based on some number of fixed size templates. You would have to have pick from these types at runtime (using a switch statement, for example). Here's an example of the template type with various sizes and example of sorting the 64bit size.
Here's a simple example:
#include <vector>
#include <algorithm>
#include <iostream>
#include <ctime>
template <int WIDTH>
struct variable_width
{
unsigned char w_[WIDTH];
};
typedef variable_width<8> vw8;
typedef variable_width<16> vw16;
typedef variable_width<32> vw32;
typedef variable_width<64> vw64;
typedef variable_width<128> vw128;
typedef variable_width<256> vw256;
typedef variable_width<512> vw512;
typedef variable_width<1024> vw1024;
bool operator<(const vw64& l, const vw64& r)
{
const __int64* l64 = reinterpret_cast<const __int64*>(l.w_);
const __int64* r64 = reinterpret_cast<const __int64*>(r.w_);
return *l64 < *r64;
}
std::ostream& operator<<(std::ostream& out, const vw64& w)
{
const __int64* w64 = reinterpret_cast<const __int64*>(w.w_);
std::cout << *w64;
return out;
}
int main()
{
srand(time(NULL));
std::vector<unsigned char> buffer(10 * sizeof(vw64));
vw64* w64_arr = reinterpret_cast<vw64*>(&buffer[0]);
for(int x = 0; x < 10; ++x)
{
(*(__int64*)w64_arr[x].w_) = rand();
}
std::sort(
w64_arr,
w64_arr + 10);
for(int x = 0; x < 10; ++x)
{
std::cout << w64_arr[x] << '\n';
}
std::cout << std::endl;
return 0;
}
Given the enormous size (4GB), I would seriously consider dynamic code generation. Compile a custom sort into a shared library, and dynamically load it. The only non-inlined call should be the call into the library.
With precompiled headers, the compilation times may actually be not that bad. The whole <algorithm> header doesn't change, nor does your wrapper logic. You just need to recompile a single predicate each time. And since it's a single function you get, linking is trivial.
#define OBJECT_SIZE 32
struct structObject
{
unsigned char* pObject;
bool operator < (const structObject &n) const
{
for(int i=0; i<OBJECT_SIZE; i++)
{
if(*(pObject + i) != *(n.pObject + i))
return (*(pObject + i) < *(n.pObject + i));
}
return false;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
std::vector<structObject> vObjects;
unsigned char* pObjects = (unsigned char*)malloc(10 * OBJECT_SIZE); // 10 Objects
for(int i=0; i<10; i++)
{
structObject stObject;
stObject.pObject = pObjects + (i*OBJECT_SIZE);
*stObject.pObject = 'A' + 9 - i; // Add a value to the start to check the sort
vObjects.push_back(stObject);
}
std::sort(vObjects.begin(), vObjects.end());
free(pObjects);
To skip the #define
struct structObject
{
unsigned char* pObject;
};
struct structObjectComparerAscending
{
int iSize;
structObjectComparerAscending(int _iSize)
{
iSize = _iSize;
}
bool operator ()(structObject &stLeft, structObject &stRight)
{
for(int i=0; i<iSize; i++)
{
if(*(stLeft.pObject + i) != *(stRight.pObject + i))
return (*(stLeft.pObject + i) < *(stRight.pObject + i));
}
return false;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
int iObjectSize = 32; // Read it from somewhere
std::vector<structObject> vObjects;
unsigned char* pObjects = (unsigned char*)malloc(10 * iObjectSize);
for(int i=0; i<10; i++)
{
structObject stObject;
stObject.pObject = pObjects + (i*iObjectSize);
*stObject.pObject = 'A' + 9 - i; // Add a value to the start to work with something...
vObjects.push_back(stObject);
}
std::sort(vObjects.begin(), vObjects.end(), structObjectComparerAscending(iObjectSize));
free(pObjects);

Help with algorithm for merging vectors

I need a very fast algorithm for the following task. I have already implemented several algorithms that complete it, but they're all too slow for the performance I need. It should be fast enough that the algorithm can be run at least 100,000 times a second on a modern CPU. It will be implemented in C++.
I am working with spans/ranges, a structure that has a start and an end coordinate on a line.
I have two vectors (dynamic arrays) of spans and I need to merge them. One vector is src and the other dst. The vectors are sorted by span start coordinates, and the spans do not overlap within one vector.
The spans in the src vector must be merged with the spans in the dst vector, such that the resulting vector is still sorted and has no overlaps. Ie. if overlaps are detected during the merging, the two spans are merged into one. (Merging two spans is just a matter of changing the coordinates in the structure.)
Now, there is one more catch, the spans in the src vector must be "widened" during the merge. This means that a constant will be added to the start and another (larger) constant to the end coordinate of every span in src. This means that after the src spans are widened they might overlap.
What I have arrived at so far is that it cannot be done fully in-place, some kind of temporary storage is needed. I think it should be doable in linear time over the number of elements of src and dst summed.
Any temporary storage can probably be shared between multiple runs of the algorithm.
The two primary approaches I have tried, which are too slow, are:
Append all elements of src to dst, widening each element before appending it. Then run an in-place sort. Finally iterate over the resulting vector using a "read" and "write" pointer, with the read pointer running ahead of the write pointer, merging spans as they go. When all elements have been merged (the read pointer reaches end) dst is truncated.
Create a temporary work-vector. Do a naive merge as described above by repeatedly picking the next element from either src or dst and merging into the work-vector. When done, copy the work-vector to dst, replacing it.
The first method has the problem that sorting is O((m+n)*log(m+n)) instead of O(m+n) and has somewhat overhead. It also means the dst vector has to grow much larger than it really needs.
The second has the primary problem of a lot of copying around and again allocation/deallocation of memory.
The data structures used for storing/managing the spans/vectors can be altered if you think that's needed.
Update: Forgot to say how large the datasets are. The most common cases are between 4 and 30 elements in either vector, and either dst is empty or there is a large amount of overlap between the spans in src and dst.
We know that the absolute best case runtime is O(m+n) this is due to the fact that you at least have to scan over all of the data in order to be able to merge the lists. Given this, your second method should give you that type of behavior.
Have you profiled your second method to find out what the bottlenecks are? It is quite possible that, depending on the amount of data you are talking about it is actually impossible to do what you want in the specified amount of time. One way to verify this is to do something simple like sum up all the start and end values of the spans in each vector in a loop, and time that. Basically here you are doing a minimal amount of work for each element in the vectors. This will provide you with a baseline for the best performance you can expect to get.
Beyond that you can avoid copying the vectors element by element by using the stl swap method, and you can preallocate the temp vector to a certain size in order to avoid triggering the expansion of the array when you are merging the elements.
You might consider using 2 vectors in your system and whenever you need to do a merge you merge into the unused vector, and then swap (this is similar to double buffering used in graphics). This way you don't have to reallocate the vectors every time you do the merge.
However, you are best off profiling first and finding out what your bottleneck is. If the allocations are minimal compared to the actual merging process than you need to figure out how to make that faster.
Some possible additional speedups could come from accessing the vectors raw data directly which avoids the bounds checks on each access the data.
The sort you mention in Approach 1 can be reduced to linear time (from log-linear as you describe it) because the two input lists are already sorted. Just perform the merge step of merge-sort. With an appropriate representation for the input span vectors (for example singly-linked lists) this can be done in-place.
http://en.wikipedia.org/wiki/Merge_sort
How about the second method without repeated allocation--in other words, allocate your temporary vector once, and never allocate it again? Or, if the input vectors are small enough (But not constant size), just use alloca instead of malloc.
Also, in terms of speed, you may want to make sure that your code is using CMOV for the sorting, since if the code is actually branching for every single iteration of the mergesort:
if(src1[x] < src2[x])
dst[x] = src1[x];
else
dst[x] = src2[x];
The branch prediction will fail 50% of the time, which will have an enormous hit on performance. A conditional move will likely do much much better, so make sure the compiler is doing that, and if not, try to coax it into doing so.
i don't think a strictly linear solution is possible, because widening the src vector spans may in the worst-case cause all of them to overlap (depending on the magnitude of the constant that you are adding)
the problem may be in the implementation, not in the algorithm; i would suggest profiling the code for your prior solutions to see where the time is spent
reasoning:
for a truly "modern" CPU like the Intel Core 2 Extreme QX9770 running at 3.2GHz, one can expect about 59,455 MIPS
for 100,000 vectors, you would have to process each vector in 594,550 instuctions. That's a LOT of instructions.
ref: wikipedia MIPS
in addition, note that adding a constant to the src vector spans does not de-sort them, so you can normalize the src vector spans independently, then merge them with the dst vector spans; this should reduce the workload of your original algorithm
1 is right out - a full sort is slower than merging two sorted lists.
So you're looking at tweaking 2 (or something entirely new).
If you change the data structures to doubly linked lists, then you can merge them in constant working space.
Use a fixed-size heap allocator for the list nodes, both to reduce memory use per node and to improve the chance that the nodes are close together in memory, reducing page misses.
You might be able to find code online or in your favourite algorithm book to optimise a linked list merge. You'll want to customise this in order to do span coalescing at the same time as the list merge.
To optimise the merge, first note that for each run of values coming off the same side without one coming from the other side, you can insert the whole run into the dst list in one go, instead of inserting each node in turn. And you can save one write per insertion over a normal list operation, by leaving the end "dangling", knowing that you'll patch it up later. And provided that you don't do deletions anywhere else in your app, the list can be singly-linked, which means one write per node.
As for 10 microsecond runtime - kind of depends on n and m...
I would always keep my vector of spans sorted. That makes implementing algorithms a LOT easier -- and possible to do in linear time.
OK, so I'd sort the spans based on:
span minimum in increasing order
then span maximum in decreasing order
You need to create a function to do that.
Then I'd use std::set_union to merge the vectors (you can merge more than one before continuing).
Then for each consecutive sets of spans with identical minimums, you keep the first and remove the rest (they're sub-spans of the first span).
Then you need to merge your spans. That should be pretty doable now and feasible in linear time.
OK, here's the trick now. Don't try to do this in-place. Use one or more temporary vectors (and reserve enough space ahead of time). Then at the end, call std::vector::swap to put the results in the input vector of your choice.
I hope that's enough to get you going.
What is your target system? Is it multi-core? If so you could consider multithreading this algorithm
I wrote a new container class just for this algorithm, tailored to the needs. This also gave me a chance to adjust other code around my program which got a little speed boost at the same time.
This is significantly faster than the old implementation using STL vectors, but which was otherwise basically the same thing. But while it's faster it's still not really fast enough... unfortunately.
Profiling doesn't reveal what is the real bottleneck any longer. The MSVC profiler seems to sometimes place the "blame" on the wrong calls (supposedly identical runs assign widely different running times) and most calls are getting coalesced into one big chink.
Looking at a disassembly of the generated code shows that there's a very large amount of jumps in the generated code, I think that might be the main reason behind the slowness now.
class SpanBuffer {
private:
int *data;
size_t allocated_size;
size_t count;
inline void EnsureSpace()
{
if (count == allocated_size)
Reserve(count*2);
}
public:
struct Span {
int start, end;
};
public:
SpanBuffer()
: data(0)
, allocated_size(24)
, count(0)
{
data = new int[allocated_size];
}
SpanBuffer(const SpanBuffer &src)
: data(0)
, allocated_size(src.allocated_size)
, count(src.count)
{
data = new int[allocated_size];
memcpy(data, src.data, sizeof(int)*count);
}
~SpanBuffer()
{
delete [] data;
}
inline void AddIntersection(int x)
{
EnsureSpace();
data[count++] = x;
}
inline void AddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
EnsureSpace();
data[count] = s;
data[count+1] = e;
count += 2;
}
inline void Clear()
{
count = 0;
}
inline size_t GetCount() const
{
return count;
}
inline int GetIntersection(size_t i) const
{
return data[i];
}
inline const Span * GetSpanIteratorBegin() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data);
}
inline Span * GetSpanIteratorBegin()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data);
}
inline const Span * GetSpanIteratorEnd() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data+count);
}
inline Span * GetSpanIteratorEnd()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data+count);
}
inline void MergeOrAddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
if (count == 0)
{
AddSpan(s, e);
return;
}
int *lastspan = data + count-2;
if (s > lastspan[1])
{
AddSpan(s, e);
}
else
{
if (s < lastspan[0])
lastspan[0] = s;
if (e > lastspan[1])
lastspan[1] = e;
}
}
inline void Reserve(size_t minsize)
{
if (minsize <= allocated_size)
return;
int *newdata = new int[minsize];
memcpy(newdata, data, sizeof(int)*count);
delete [] data;
data = newdata;
allocated_size = minsize;
}
inline void SortIntersections()
{
assert((count & 1) == 0);
std::sort(data, data+count, std::less<int>());
assert((count & 1) == 0);
}
inline void Swap(SpanBuffer &other)
{
std::swap(data, other.data);
std::swap(allocated_size, other.allocated_size);
std::swap(count, other.count);
}
};
struct ShapeWidener {
// How much to widen in the X direction
int widen_by;
// Half of width difference of src and dst (width of the border being produced)
int xofs;
// Temporary storage for OverlayScanline, so it doesn't need to reallocate for each call
SpanBuffer buffer;
inline void OverlayScanline(const SpanBuffer &src, SpanBuffer &dst);
ShapeWidener(int _xofs) : xofs(_xofs) { }
};
inline void ShapeWidener::OverlayScanline(const SpanBuffer &src, SpanBuffer &dst)
{
if (src.GetCount() == 0) return;
if (src.GetCount() + dst.GetCount() == 0) return;
assert((src.GetCount() & 1) == 0);
assert((dst.GetCount() & 1) == 0);
assert(buffer.GetCount() == 0);
dst.Swap(buffer);
const int widen_s = xofs - widen_by;
const int widen_e = xofs + widen_by;
size_t resta = src.GetCount()/2;
size_t restb = buffer.GetCount()/2;
const SpanBuffer::Span *spa = src.GetSpanIteratorBegin();
const SpanBuffer::Span *spb = buffer.GetSpanIteratorBegin();
while (resta > 0 || restb > 0)
{
if (restb == 0)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else if (resta == 0)
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
else if (spa->start < spb->start)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
}
buffer.Clear();
}
If your most recent implementation still isn't fast enough, you might end up having to look at alternative approaches.
What are you using the outputs of this function for?