Create matrix of random numbers in C++ without looping - c++

I need to create a multidimensional matrix of randomly distributed numbers using a Gaussian distribution, and am trying to keep the program as optimized as possible. Currently I am using Boost matrices, but I can't seem to find anything that accomplishes this without manually looping. Ideally, I would like something similar to Python's numpy.random.randn() function, but this must be done in C++. Is there another way to accomplish this that is faster than manually looping?

You're going to have to loop anyway, but you can eliminate the array lookup inside your loop. True N-dimensional array indexing is going to be expensive, so your best option is any library (or one you write yourself) that also gives you access to an underlying linear data store.
You can then loop over the entire n-dimensional array as if it was linear, avoiding many multiplications of the indexes by the dimensions.
Another optimization is to do away with the index altogether: take a pointer to the first element, then iterate the pointer itself. This does away with a whole variable in the CPU, which can give the compiler more room for other things. For example, if you had 1000 elements in a vector:
vector<int> data;
data.resize(1000);
int *intPtr = &data[0];
int *endPtr = &data[0] + 1000;
while(intPtr != endPtr)
{
    (*intPtr) = rand_function(); // store the next random value
    ++intPtr;
}
Two tricks are at work here: the end condition is pre-calculated outside the loop (this avoids calling a function such as vector::size() 1000 times), and the loop works with pointers to the data in memory rather than indexes. An index gets internally converted to a pointer every time it's used to access the array. By storing the "current pointer" and adding 1 to it each time, the cost of calculating the pointers from indexes 1000 times is eliminated.
This can be faster but it depends on the implementation. Compilers can do some of the same hand-optimizations, but not all of them. The rand_function should also be inline to avoid the function call overhead.
A warning, however: if you use std::vector with the pointer trick, it's not thread safe. If another thread changes the vector's length during the loop, the vector can get reallocated to a different place in memory, leaving your pointers dangling. Don't do pointer tricks unless you'd be perfectly comfortable writing your own vector, array, and table classes as needed.
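For the Gaussian part of the question, here is a minimal sketch of the "linear data store" idea using the standard <random> facilities instead of a hand-written pointer loop. The helper name make_gaussian_matrix and its parameters are made up for illustration; the matrix is simply a flat row-major buffer, so element (i, j) lives at data[i * cols + j].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns rows*cols normally distributed values
// stored contiguously in row-major order.
std::vector<double> make_gaussian_matrix(std::size_t rows, std::size_t cols,
                                         unsigned seed,
                                         double mean = 0.0, double stddev = 1.0)
{
    assert(stddev > 0.0);
    std::mt19937 engine(seed);                        // seeded for repeatability
    std::normal_distribution<double> dist(mean, stddev);

    std::vector<double> data(rows * cols);
    // One linear pass over the flat storage; no per-element
    // multi-dimensional index arithmetic is needed.
    std::generate(data.begin(), data.end(), [&] { return dist(engine); });
    return data;
}
```

There is still a loop inside std::generate, but it walks the buffer with iterators, which is the same access pattern as the pointer loop above. Whether it beats a manual loop depends on the compiler, so measure before relying on it.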

Related

What are the Disadvantages of Nested Vectors?

I'm still fairly new to C++ and have a lot left to learn, but something that I've become quite attached to recently is using nested (multidimensional) vectors. So I may typically end up with something like this:
std::vector<std::vector<std::string> > table;
Which I can then easily access elements of like this:
std::string data = table[3][5];
However, recently I've been getting the impression that it's better (in terms of performance) to have a single-dimensional vector and then just use "index arithmetic" to access elements correspondingly. I assume this performance impact is significant for much larger or higher dimensional vectors, but I honestly have no idea and haven't been able to find much information about it so far.
While, intuitively, it kind of makes sense that a single vector would have better performance than a higher dimensional one, I honestly don't understand the actual reasons why. Furthermore, if I were to just use single-dimensional vectors, I would lose the intuitive syntax I have for accessing elements of multidimensional ones. So here are my questions:
Why are multidimensional vectors inefficient? If I were to only use a single-dimensional vector instead (to represent data in higher dimensions), what would be the best, most intuitive way to access its elements?
It depends on the exact conditions. I'll talk about the case where the nested version is a true 2D table (i.e., all rows have equal length).
A 1D vector will usually be faster for every usage pattern. Or, at least, it won't be slower than the nested version.
Nested version can be considered worse, because:
it needs to allocate number-of-rows times, instead of one.
accessing an element takes an additional indirection, so it is slower (additional indirection is usually slower than the multiply needed in the 1D case)
if you process your data sequentially, then it could be much slower, if the 2D data is scattered around the memory. It is because there could be a lot of cache misses, depending how the memory allocator returns memory areas of different rows.
So, if you're going for performance, I'd recommend creating a 2D wrapper class around a 1D vector. This way you get an API as simple as the nested version, and you get the best performance too. And if, for some reason, you later decide to use the nested version instead, you only need to change the internal implementation of this wrapper class.
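A minimal sketch of such a wrapper might look like the following. The class name Grid2D and its members are hypothetical, and it uses operator()(x, y) because operator[] cannot take two arguments before C++23. Note that it would not work as written for T = bool, since std::vector<bool> does not hand out real references.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view over a single contiguous 1D vector (row-major).
template <class T>
class Grid2D
{
public:
    Grid2D(std::size_t width, std::size_t height, const T& init = T())
        : width_(width), height_(height), data_(width * height, init) {}

    // Element access by column x and row y; maps to y*width+x.
    T& operator()(std::size_t x, std::size_t y)
    {
        assert(x < width_ && y < height_);
        return data_[y * width_ + x];
    }
    const T& operator()(std::size_t x, std::size_t y) const
    {
        assert(x < width_ && y < height_);
        return data_[y * width_ + x];
    }

    std::size_t width() const { return width_; }
    std::size_t height() const { return height_; }

    // Read-only access to the flat storage, e.g. for linear iteration.
    const std::vector<T>& raw() const { return data_; }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<T> data_;
};
```

With this, table(5, 3) reads almost as naturally as table[3][5], and the whole table is one allocation.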
The most intuitive way to access 1D elements is y*width+x. But, if you know your access patterns, you can choose a different one. For example, in a painting program, a tile based indexing could be better for storing and manipulating the image. Here, data can be indexed like this:
int tileMask = (1<<tileSizeL)-1; // tileSizeL is log of tileSize
int tileX = x>>tileSizeL;
int tileY = y>>tileSizeL;
int tileIndex = tileY*numberOfTilesInARow + tileX;
int index = (tileIndex<<(tileSizeL*2)) + ((y&tileMask)<<tileSizeL) + (x&tileMask);
This method has better spatial locality in memory (pixels near each other tend to have nearby memory addresses). The index calculation is slower than a simple y*width+x, but this method can have far fewer cache misses, so in the end it could be faster.
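To make the tiled layout concrete, here is the same calculation wrapped in a function. The name tiled_index and the use of std::size_t are my own choices; the arithmetic is the one shown above. It assumes the image width is a whole number of tiles.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper wrapping the tiled index computation.
// tileSizeL is log2 of the tile edge length; tilesPerRow is the
// image width measured in tiles.
std::size_t tiled_index(std::size_t x, std::size_t y,
                        unsigned tileSizeL, std::size_t tilesPerRow)
{
    const std::size_t tileMask = (std::size_t(1) << tileSizeL) - 1;
    const std::size_t tileX = x >> tileSizeL;          // which tile column
    const std::size_t tileY = y >> tileSizeL;          // which tile row
    assert(tileX < tilesPerRow);
    const std::size_t tileIndex = tileY * tilesPerRow + tileX;
    // Tiles are stored one after another; inside a tile, pixels are row-major.
    return (tileIndex << (tileSizeL * 2))
         + ((y & tileMask) << tileSizeL)
         + (x & tileMask);
}
```

For example, with 4x4 tiles (tileSizeL = 2) and two tiles per row, pixel (4, 0) is the first pixel of the second tile and maps to index 16, while pixel (1, 1) stays inside the first tile at index 5.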

Came up with an algorithm for sorting an array of large sized objects; can anyone tell me what this algorithm is called? (couldn't find it on Google)

I needed to sort an array of large sized objects and it got me thinking: could there be a way to minimize the number of swaps?
So I used quicksort (but any other fast sort should work here too) to sort indices to the elements in the array; indices are cheap to swap. Then I used those indices to swap the actual objects into their places. Unfortunately this uses O(n) additional space to store the indices. The code below illustrates the algorithm (which I'm calling IndexSort), and in my tests, appears to be faster than plain quicksort for arrays of large sized objects.
template <class Itr>
void IndexSort(Itr begin, Itr end)
{
    const size_t count = end - begin;
    // Create indices
    vector<size_t> ind(count);
    iota(ind.begin(), ind.end(), 0);
    // Sort indices
    sort(ind.begin(), ind.end(), [&begin] (const size_t i, const size_t j)
    {
        return begin[i] < begin[j];
    });
    // Create indices to indices. This provides
    // constant time search in the next step.
    vector<size_t> ind2(count);
    for(size_t i = 0; i < count; ++i)
        ind2[ind[i]] = i;
    // Swap the objects into their final places
    for(size_t i = 0; i < count; ++i)
    {
        if( ind[i] == i )
            continue;
        swap(begin[i], begin[ind[i]]);
        const size_t j = ind[i];
        swap(ind[i], ind[ind2[i]]);
        swap(ind2[i], ind2[j]);
    }
}
Now I have measured the swaps (of the large sized objects) done by both, quicksort, and IndexSort, and found that quicksort does a far greater number of swaps. So I know why IndexSort could be faster.
But can anyone with a more academic background explain why/how does this algorithm actually work? (it's not intuitive to me, although I somehow came up with it).
Thanks!
Edit: The following code was used to verify the results of IndexSort
// A class whose objects will be large
struct A
{
    int id;
    char data[1024];
    // Use the id to compare less than ordering (for simplicity)
    bool operator < (const A &other) const
    {
        return id < other.id;
    }
    // Copy assign all data from another object
    void operator = (const A &other)
    {
        memcpy(this, &other, sizeof(A));
    }
};
int main()
{
    const size_t arrSize = 1000000;
    // Create an array of objects to be sorted
    vector<A> randArray(arrSize);
    for( auto &item: randArray )
        item.id = rand();
    // arr1 will be sorted using quicksort
    vector<A> arr1(arrSize);
    copy(randArray.begin(), randArray.end(), arr1.begin());
    // arr2 will be sorted using IndexSort
    vector<A> arr2(arrSize);
    copy(randArray.begin(), randArray.end(), arr2.begin());
    {
        // Measure time for this
        sort(arr1.begin(), arr1.end());
    }
    {
        // Measure time for this
        IndexSort(arr2.begin(), arr2.end());
    }
    // Check if IndexSort yielded the same result as quicksort
    if( memcmp(arr1.data(), arr2.data(), sizeof(A) * arr1.size()) != 0 )
        cout << "sort failed" << endl;
    return 0;
}
Edit: Made the test less pathological; reduced the size of the large object class to just 1024 bytes (plus one int), and increased the number of objects to be sorted to one million. This still results in IndexSort being significantly faster than quicksort.
Edit: This requires more testing for sure. But it makes me think, what if std::sort could, at compile time, check the object size, and (depending on some size threshold) choose either the existing quicksort implementation or this IndexSort implementation.
Also, IndexSort could be described as an "in-place tag sort" (see samgak's answer and my comments below).
It seems to be a tag sort:
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".
As described above, tag sort can be used to sort a large array of data that cannot fit into memory. However even when it can fit in memory, it still requires less memory read-write operations for arrays of large objects, as illustrated by your solution, because entire objects are not being copied each time.
Implementation detail: while your implementation sorts just the indices, and refers back to the original array of objects via the index when doing comparisons, another way of implementing it is to store index/sort key pairs in the sort buffer, using the sort keys for comparisons. This means that you can do the sort without having the entire array of objects in memory at once.
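The index/sort key variant can be sketched roughly like this. Record and tag_sort_order are hypothetical names, and the key is assumed to be a plain int; only the small (key, index) pairs are moved during the sort, and the large records are never touched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record type: a small key plus a large payload.
struct Record
{
    int key;
    char payload[1024];
};

// Returns the order in which to visit records to see them sorted by key.
std::vector<std::size_t> tag_sort_order(const std::vector<Record>& records)
{
    // Build the sort buffer of (key, original index) pairs.
    std::vector<std::pair<int, std::size_t>> tags;
    tags.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); ++i)
        tags.emplace_back(records[i].key, i);

    // std::pair compares the key first and the index second,
    // so records with equal keys keep their original relative order.
    std::sort(tags.begin(), tags.end());

    std::vector<std::size_t> order;
    order.reserve(tags.size());
    for (const auto& tag : tags)
        order.push_back(tag.second);
    assert(order.size() == records.size());
    return order;
}
```

Once the tags are built, the comparisons never dereference the original array, so the records themselves could live on disk while the sort runs.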
One example of a tag sort is the LINQ to Objects sorting algorithm in .NET:
The sort is somewhat flexible in that it lets you supply a comparison delegate. It does not, however, let you supply a swap delegate. That’s okay in many cases. However, if you’re sorting large structures (value types), or if you want to do an indirect sort (often referred to as a tag sort), a swap delegate is a very useful thing to have. The LINQ to Objects sorting algorithm, for example uses a tag sort internally. You can verify that by examining the source, which is available in the .NET Reference Source. Letting you pass a swap delegate would make the thing much more flexible.
I wouldn't exactly call that an algorithm so much as an indirection.
The reason you're doing fewer swaps of the larger objects is because you have the sorted indices (the final result, implying no redundant intermediary swaps). If you counted the number of index swaps in addition to object swaps, then you'd get more swaps total with your index sorting.
Nevertheless, you're not necessarily bound by algorithmic complexity all the time. Spending the expensive sorting time swapping cheap little indices around saves more time than it costs.
So you have a higher number of total swaps with the index sort, but the bulk of them are cheaper and you're doing far fewer of the expensive swaps of the original object.
The reason it's faster is because your original objects are larger than indices but perhaps inappropriate for a move constructor (not necessarily storing dynamically-allocated data).
At this level, the cost of the swap is going to be bound more by the structure size of the elements you're sorting, and this will be practical efficiency rather than theoretical algorithmic complexity. And if you get into the hardware details, that's going to boil down to things like more fitting in a cache line.
With sorting, the amount of computation done over the same data set is substantial. We're doing at optimal O(NLogN) compares and swaps, often more in practice. So when you use indices, you make both the swapping and comparison potentially cheaper (in your case, just the swapping since you're still using a comparator predicate to compare the original objects).
Put another way, std::sort is O(NLogN). Your index sort is O(N+NLogN). Yet you're making the bigger NLogN work much cheaper using indices and an indirection.
In your updated test case, you're using a very pathological case of enormous objects. So your index sorting is going to pay off big time there. More commonly, you don't have objects of type T where sizeof(T) spans 100 kilobytes. Typically if an object stores data of such size, it's going to store a pointer to it elsewhere and a move constructor to simply shallow copy the pointers (making it about as cheap to swap as an int). So most of the time you won't necessarily get such a big pay off sorting things indirectly this way, but if you do have enormous objects like that, this kind of index or pointer sort will be a great optimization.
Edit: This requires more testing for sure. But it makes me think, what if std::sort could, at compile time, check the object size, and (depending on some size threshold) choose either the existing quicksort implementation or this IndexSort implementation.
I think that's not a bad idea. At least making it available might be a nice start. Yet I would suggest against the automatic approach. The reason I think it might be better to leave that to the side as a potential optimization the developer can opt into when appropriate is because there are sometimes cases where memory is more valuable than processing. The indices are going to seem trivial if you create like 1 kilobyte objects, but there are a lot of iffy scenarios, borderline cases, where you might be dealing with something more like 32-64 bytes (ex: a list of 32-byte, 4-component double-precision mathematical vectors). In those borderline cases, this index sort method may still be faster, but the extra temporary memory usage of 2 extra indices per element may actually become a factor (and may occasionally cause a slowdown at runtime depending on the physical state of the environment). Consider the attempt to specialize cases with vector<bool>: it often creates more harm than good. At the time it seemed like a great idea to treat vector<bool> as a bitset, now it often gets in the way. So I'd suggest leaving it to the side and letting people opt into it, but having it available might be a welcome addition.

Can I check in C(++) if an array is all 0 (or false)?

Can I check in C(++) if an array is all 0 (or false) without iterating/looping over every single value and without allocating a new array of the same size (to use memcmp)?
I'm abusing an array of bools to have arbitrary large bitsets at runtime and do some bitflipping on it
You can use the following condition:
(myvector.end() == std::find(myvector.begin(), myvector.end(), true))
Obviously, internally, this loops over all values.
The alternative (which really should avoid looping) is to override all write-access functions, and keep track of whether true has ever been written to your vector.
UPDATE
Lie Ryan's comments below describe a more robust method of doing this, based on the same principle.
If it's not sorted, no. How would you plan on accomplishing that? You would need to inspect every element to see if it's 0 or not! memcmp, of course, would also check every element. It would just be much more expensive since it reads another array as well.
Of course, you can early-out as soon as you hit a non-0 element.
Your only option would be to use SIMD (which technically still checks every element, but using fewer instructions), but you generally don't do that in a generic array.
(Btw, my answer assumes that you have a simple static C/C++ array. If you can specify what kind of array you have, we could be more specific.)
If you know that this is going to be a requirement, you could build a data structure consisting of an array (possibly dynamic) and a count of currently non-zero cells. Obviously the setting of cells must be abstracted through an interface, but that is natural in C++ with overloading, and you can use an opaque type in C.
Assume that you have an array of N elements; you can do a bit check against a set of base vectors.
For example, you have a 15-element array you want to test.
You can test it against an 8-element zero array, a 4-element zero array, a 2-element zero array and a 1-element zero array.
You only have to allocate these elements once given that you know the maximum size of arrays you want to test. Furthermore, the test can be done in parallel (and with assembly intrinsic if necessary).
Further improvement in terms of memory allocation can be made by using only an 8-element array, since a 4-element zero array is simply the first half of the 8-element zero array.
Consider using boost::dynamic_bitset instead. It has a none member and several other std::bitset-like operations, but its length can be set at runtime.
No, you can compare arrays with memcmp, but you can't compare one value against a block of memory.
What you can do is use algorithms in C++ but that still involves a loop internally.
You don't have to iterate over the entire thing, just stop looping on the first non-zero value.
I can't think of any way to check a set of values other than inspecting them each in turn - you could play games with checking the underlying memory as something larger than bool (__int64 say) but alignment is then an issue.
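One way to "check the memory as something larger" without running into the alignment problem is to copy each chunk into a 64-bit integer with memcpy, which compilers typically turn into a plain load. This is only a sketch; the function name all_zero is made up, and it still inspects every byte, just eight at a time with an early exit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true if every byte in [data, data + size) is zero.
bool all_zero(const void* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::size_t i = 0;

    // Inspect eight bytes at a time; memcpy sidesteps alignment requirements.
    for (; i + sizeof(std::uint64_t) <= size; i += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, p + i, sizeof word);
        if (word != 0)
            return false;              // early out on the first non-zero chunk
    }
    // Handle the remaining tail bytes one by one.
    for (; i < size; ++i)
        if (p[i] != 0)
            return false;
    return true;
}
```

For an array of bools this would be called as all_zero(flags, sizeof flags).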
EDIT:
You could keep a separate count of set bits, and check that is non-zero. You'd have to be careful about maintenance of this, so that setting a set bit did not ++ it and so on.
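A rough sketch of that bookkeeping, with the "setting a set bit must not increment" case handled explicitly. CountedBits and its member names are hypothetical; it stores one byte per flag to stay close to the array-of-bools setup in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flag array that tracks how many flags are currently set,
// so "is everything false?" is answered without scanning.
class CountedBits
{
public:
    explicit CountedBits(std::size_t size) : bits_(size, 0), set_count_(0) {}

    void set(std::size_t i, bool value)
    {
        assert(i < bits_.size());
        const bool old = bits_[i] != 0;
        if (old == value)
            return;                    // no change, so the count stays as it is
        bits_[i] = value ? 1 : 0;
        if (value)
            ++set_count_;
        else
            --set_count_;
    }

    void flip(std::size_t i) { set(i, !test(i)); }

    bool test(std::size_t i) const
    {
        assert(i < bits_.size());
        return bits_[i] != 0;
    }

    bool none() const { return set_count_ == 0; }   // constant time
    std::size_t count() const { return set_count_; }

private:
    std::vector<unsigned char> bits_;
    std::size_t set_count_;
};
```

All writes have to go through set() or flip(); handing out raw pointers or references to the storage would let the count drift out of sync.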
knittl,
I don't suppose you have access to some fancy DMA hardware on the target computer? Sometimes DMA hardware supports exactly the operation you require, i.e. "Is this region of memory all-zero?" This sort of hardware-accelerated comparison is a common solution when dealing with large bit-buffers. For example, some RAID controllers use this mechanism for parity checking.

Choice of the most performant container (array)

This is my little big question about containers, in particular, arrays.
I am writing a physics code that mainly manipulates a big (> 1 000 000) set of "particles" (with 6 double coordinates each). I am looking for the best way (in terms of performance) to implement a class that will contain a container for these data and that will provide manipulation primitives for these data (e.g. instantiation, operator[], etc.).
There are a few restrictions on how this set is used:
its size is read from a configuration file and won't change during execution
it can be viewed as a big two dimensional array of N (e.g. 1 000 000) lines and 6 columns (each one storing the coordinate in one dimension)
the array is manipulated in a big loop, each "particle / line" is accessed and computation takes place with its coordinates, and the results are stored back for this particle, and so on for each particle, and so on for each iteration of the big loop.
no new elements are added or deleted during the execution
First conclusion, as the access on the elements is essentially done by accessing each element one by one with [], I think that I should use a normal dynamic array.
I have explored a few things, and I would like to have your opinion on the one that can give me the best performances.
As I understand there is no advantage to use a dynamically allocated array instead of a std::vector, so things like double** array2d = new ..., loop of new, etc are ruled out.
So is it a good idea to use std::vector<double> ?
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > my_array that can be indexed like my_array[i][j], or is it a bad idea and it would be better to use std::vector<double> other_array and access it with other_array[6*i+j]?
Maybe this can give better performance, especially as the number of columns is fixed and known from the beginning.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
Another option, the one that I am using so far is to use Blitz, in particular blitz::Array:
typedef blitz::Array<double,TWO_DIMENSIONS> store_t;
store_t my_store;
Where my elements are accessed like that: my_store(line, column);.
I think there is not much advantage in using Blitz in my case, because I am accessing each element one by one; Blitz would be interesting if I were operating directly on whole arrays (like matrix multiplication), which I am not.
Do you think that Blitz is OK, or is it useless in my case ?
These are the possibilities I have considered so far, but maybe the best one is still another one, so don't hesitate to suggest other things.
Thanks a lot for your help on this problem !
Edit:
From the very interesting answers and comments below, a good solution seems to be the following:
Use a structure particle (containing 6 doubles) or a static array of 6 doubles (this avoids the use of two dimensional dynamic arrays)
Use a vector or a deque of this particle structure or array. It is then good to traverse them with iterators, and that will allow to change from one to another later.
In addition I can also use a Blitz::TinyVector<double,6> instead of a structure.
So is it a good idea to use std::vector<double> ?
Usually, a std::vector should be the first choice of container. You could use either std::vector<>::reserve() or std::vector<>::resize() to avoid reallocations while populating the vector. Whether any other container is better can be found by measuring. And only by measuring. But first measure whether anything the container is involved in (populating, accessing elements) is worth optimizing at all.
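As a small illustration of the reserve() point: after reserve(n), the vector is still empty but has room for n elements, so filling it does not move the buffer. The function below is a made-up check of exactly that, not a benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: returns true if pushing n elements after reserve(n)
// never relocated the underlying buffer.
bool fill_without_reallocation(std::size_t n)
{
    std::vector<double> v;
    v.reserve(n);                          // capacity grows, size stays 0
    assert(v.empty() && v.capacity() >= n);

    const double* before = v.data();
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<double>(i));

    // No reallocation is allowed while size() <= capacity().
    return v.data() == before && v.size() == n;
}
```

resize(n) differs in that it actually creates n value-initialized elements, which you then overwrite with operator[] instead of push_back.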
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > [...]?
No. IIUC, you are accessing your data per particle, not per row. If that's the case, why not use a std::vector<particle>, where particle is a struct holding six values? And even if I understood incorrectly, you should rather write a two-dimensional wrapper around a one-dimensional container. Then align your data either in rows or columns - what ever is faster with your access patterns.
Do you think that Blitz is OK, or is it useless in my case?
I have no practical knowledge about blitz++ and the areas it is used in. But isn't blitz++ all about expression templates to unroll loop operations and optimizing away temporaries when doing matrix manipulations? ICBWT.
First of all, you don't want to scatter the coordinates of one given particle all over the place, so I would begin by writing a simple struct:
struct Particle { /* coords */ };
Then we can make a simple one dimensional array of these Particles.
I would probably use a deque, because that's the default container, but you may wish to try a vector. It's just that 1,000,000 particles means a single contiguous chunk of roughly 48 MB (6 doubles of 8 bytes each). It should hold, but it might strain your system if this ever grows, while the deque will allocate several chunks.
WARNING:
As Alexandre C remarked, if you go the deque road, refrain from using operator[] and prefer to use iteration style. If you really need random access and it's performance sensitive, the vector should prove faster.
The first rule when choosing from containers is to use std::vector. Then, only after your code is complete and you can actually measure performance, you can try other containers. But stick to vector first. (And use reserve() from the start)
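Then, you shouldn't use a std::vector<std::vector<double> >. You know the size of your data: it's 6 doubles. No need for it to be dynamic. It is constant and fixed. You can define a struct to hold your particle members (the six doubles), or you can simply typedef it. Note that a raw array typedef such as typedef double particle[6] does not work as a std::vector element type, because built-in arrays cannot be copied or assigned; typedef std::array<double, 6> particle (C++11) does work. Then, use a vector of particles: std::vector<particle>.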
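A minimal sketch of that layout, assuming C++11 and using std::array for the fixed six doubles. The names Particle, coord and advance are invented for the example, and the "update" is only a placeholder for whatever the physics loop really does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Six coordinates per particle, stored inline (no per-particle allocation).
using Particle = std::array<double, 6>;

// Hypothetical accessor mirroring the question's other_array[6*i+j] indexing.
double& coord(std::vector<Particle>& particles, std::size_t i, std::size_t j)
{
    assert(i < particles.size() && j < 6);
    return particles[i][j];
}

// Hypothetical update step: advance the first three coordinates
// by the last three, scaled by dt.
void advance(std::vector<Particle>& particles, double dt)
{
    for (Particle& p : particles)          // sequential, cache-friendly pass
        for (std::size_t k = 0; k < 3; ++k)
            p[k] += dt * p[k + 3];
}
```

The outer vector is sized once from the configuration file, e.g. std::vector<Particle> particles(n), and never resized afterwards.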
Furthermore, as your program uses the particle data contained in the vector sequentially, you will take advantage of the modern CPU cache read-ahead feature at its best performance.
You could go several ways. But in your case, don't declare a std::vector<std::vector<double> >. You'd be allocating a vector (and copying it around) for every 6 doubles. That's way too costly.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
(other_array[i,j] won't work too well: before C++23, i,j employs the comma operator, which evaluates "i", discards that value, then evaluates and returns "j", so it's equivalent to other_array[j].)
You will need to use one of:
other_array[i][j]
other_array(i, j) // if other_array implements operator()(int, int),
// but std::vector<> et al don't.
other_array[i].identifier // identifier is a member variable
other_array[i].identifier() // member function getting value
other_array[i].identifier(double) // member function setting value
You may or may not prefer to put get_ and set_ or similar on the last two functions should you find them useful, but from your question I think you won't: functions are preferred in APIs between parts of large systems involving many developers, or when the data items may vary and you want the algorithms working on the data to be independent thereof.
So, a good test: if you find yourself writing code like other_array[i][3] where you've decided "3" is the double with the speed in it, and other_array[i][5] because "5" is the acceleration, then stop doing that and give them proper identifiers so you can say other_array[i].speed and .acceleration. Then other developers can read and understand it, and you're much less likely to make accidental mistakes. On the other hand, if you are iterating over those 6 elements doing exactly the same things to each, then you probably do want Particle to hold a double[6], or to provide an operator[](int). There's no problem doing both:
struct Particle
{
    double x[6];
    double& speed() { return x[3]; }
    double speed() const { return x[3]; }
    double& acceleration() { return x[5]; }
    ...
};
BTW, the reason that vector<vector<double> > may be too costly is that each set of 6 doubles will be allocated on the heap, and for fast allocation and deallocation many heap implementations use fixed-size buckets, so your small request will be rounded up to the next size: that may be a significant overhead. The outer vector will also need to record an extra pointer to that memory. Further, heap allocation and deallocation is relatively slow. In your case you'd only be doing it at startup and shutdown, but there's no particular point in making your program slower for no reason. Even more importantly, the areas on the heap may jump around in memory, so your operator[] may have cache faults pulling in more distinct memory pages than necessary, slowing the entire program. Put another way, vectors store elements contiguously, but the pointed-to vectors may not be contiguous.

The fastest way to iterate through a collection of objects

First, to give you some background: I have some research code which performs a Monte Carlo simulation. Essentially, I iterate through a collection of objects, compute a number of vectors from their surface, then for each vector I iterate through the collection of objects again to see if the vector hits another object (similar to ray tracing). The pseudocode would look something like this:
for each object {
    for a number of vectors {
        do some computations
        for each object {
            check if vector intersects
        }
    }
}
As the number of objects can be quite large and the number of rays is even larger, I thought it would be wise to optimise how I iterate through the collection of objects. I created some test code which tests arrays, lists and vectors, and for my first test cases found that vector iterators were around twice as fast as arrays. However, when I implemented a vector in my code it was somewhat slower than the array I was using before.
So I went back to the test code and increased the complexity of the object function each loop was calling (a dummy function equivalent to 'check if vector intersects'), and I found that as the complexity of the function increases, the execution time gap between arrays and vectors shrinks until eventually the array is quicker.
Does anyone know why this occurs? It seems strange that execution time inside the loop should affect the outer loop run time.
What you are measuring is the difference in overhead between accessing an element of an array and an element of a vector (as well as their creation, modification, etc., depending on the operation you are doing).
EDIT: It will vary depending on the platform/os/library you are using.
It probably depends on the implementation of vector iterators. Some implementations are better than others. (Visual C++ — at least older versions — I'm looking at you.)
I think the time difference I was witnessing was actually due to an error in the pointer handling code. After making a few modifications to make the code more readable, the iterations were taking around the same time (give or take 1%) regardless of the container. Which makes sense, as all the containers have the same access mechanism.
However, I did notice the vector runs a bit slower under OpenMP; this is probably due to the overhead of each thread maintaining its own copy of the iterator.