cuda array sorting with thrust, not enough memory - c++

I'm trying to sort an array using Thrust, but it doesn't work if the array is too big. (I have a GTX 460 with 1 GB of memory.)
I'm using CUDA with C++ integration on VS2012. Here is my code:
my .cpp
extern "C" void thrust_sort(uint32_t *data, int n);
int main(int argc, char **argv){
int n = 2<<26;
uint32_t * v = new uint32_t[n];
srand(time(NULL));
for (int i = 0; i < n; ++i) {
v[i] = rand()%n;
}
thrust_sort(v, n);
delete [] v;
return 0;
}
my .cu
extern "C"
void thrust_sort(uint32_t *data, int n){
thrust::device_vector<uint32_t> d_data(data, data + n);
thrust::stable_sort(d_data.begin(), d_data.end());
thrust::copy(d_data.begin(), d_data.end(), data);
}
The program stops working at the start of stable_sort().
How much more memory does stable_sort() need ?
Is there a way to fix this ? (even if it makes it a bit slower or whatever)
Is there another sorting algorithm that doesn't require more memory than the original array ?
Thanks for your help :)

There are techniques in the literature for sorting data that is too big to fit in RAM, such as spilling partial results to files, and so on. An example: Sorting a million 32-bit integers in 2MB of RAM using Python.
Your problem is less complicated, since your input fits in RAM but is too big for your GPU. You can solve it with the strategy Parallel Sorting by Regular Sampling (PSRS). You can see here an example of this technique applied to quicksort.
Long story short, you divide the array into smaller sub-arrays that fit in the memory of the GPU. Then you sort each of the sub-arrays, and in the end you merge the results based on the premises of the Regular Sampling approach.
You can use a hybrid approach, sorting some of the sub-arrays on the CPU by assigning each one to a different core (using multi-threading) while sending other sub-arrays to the GPU. You can even spread this work over different processors using a message passing interface such as MPI. Or you can simply sort each sub-array one by one on the GPU and do the final merge step on the CPU, taking advantage (or not) of its multiple cores.
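If PSRS feels like overkill, a simpler chunked variant along the same lines is sketched below: sort GPU-sized chunks with the thrust_sort wrapper from the question, then merge the sorted runs on the CPU. The chunk size of 1 << 24 elements (64 MB of uint32_t) is an assumption you would tune to your card.

#include <algorithm>
#include <cstdint>

extern "C" void thrust_sort(uint32_t *data, int n);  // sorts one chunk on the GPU

// Sort n elements that fit in host RAM but not in GPU memory:
// sort each chunk on the GPU, then merge pairs of sorted runs on the CPU.
void chunked_sort(uint32_t *data, int n, int chunk = 1 << 24)
{
    for (int begin = 0; begin < n; begin += chunk)
        thrust_sort(data + begin, std::min(chunk, n - begin));

    for (int width = chunk; width < n; width *= 2)
        for (int begin = 0; begin + width < n; begin += 2 * width)
            std::inplace_merge(data + begin,
                               data + begin + width,
                               data + std::min(begin + 2 * width, n));
}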

Related

Efficient coding with memory management

I've recently switched from Matlab to C++ in order to run simulations faster, but it still runs slowly. I'm pretty sure there is much to improve in terms of memory usage.
Consider the following code; it shows an example of two array/vector declarations that I use in a simulation.
One has a known fixed size (array01) and the other an unknown size (array02) that changes during the run.
The question is: what is the best/proper/efficient way of declaring these variables (for both array types) in terms of memory usage and performance?
#include <iostream>
#include <vector>
#include <ctime>
#include <algorithm>
using namespace std;

const int n = 1000;
const int m = 100000;

int main()
{
    srand((unsigned)time(NULL));
    vector<double> array02;
    vector<vector<double>> array01(n, vector<double>(m));
    for (unsigned int i = 0; i < n; i++)
    {
        for (unsigned int j = 0; j < m; j++)
        {
            array02.clear();
            int rr = rand() % 10;
            for (int l = 0; l < rr; l++)
            {
                array02.push_back(l);
            }
            // perform some calculation with array01 and array02
        }
    }
}
You should consider defining your own Matrix class with a void resize(unsigned width, unsigned height) member function, and a double get(unsigned i, unsigned j) inlined member function and/or a double& at(unsigned i, unsigned j) inlined member function (both giving the Mi,j element). The matrix's internal data could be a one-dimensional array or vector of doubles. Using a vector of vectors (all of the same size) is not the best (or fastest) way to represent a matrix.
#include <cassert>
#include <vector>

class Matrix {
    std::vector<double> data;  // row-major storage
    unsigned width, height;
public:
    Matrix() : data(), width(0), height(0) {}
    ~Matrix() = default;
    /// etc..., see rule of five
    void resize(unsigned w, unsigned h) {
        data.resize(w * h);
        width = w;
        height = h;
    }
    double get(unsigned i, unsigned j) const {  // element M(i,j): i is the row, j the column
        assert(i < height && j < width);
        return data[i * width + j];
    }
    double& at(unsigned i, unsigned j) {
        assert(i < height && j < width);
        return data[i * width + j];
    }
}; // end class Matrix
Read also about the rule of five.
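For illustration, a short usage sketch of the Matrix class above (the sizes are arbitrary):

#include <cassert>
// ... Matrix as defined above ...

int main()
{
    Matrix m;
    m.resize(100, 50);       // 100 columns wide, 50 rows high
    m.at(3, 7) = 2.5;        // write element M(3,7): row 3, column 7
    double x = m.get(3, 7);  // read it back
    assert(x == 2.5);
    return 0;
}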
You could also try Scilab (it is free software). It is similar to Matlab and its performance might differ. Don't forget to use a recent version.
BTW, there are tons of existing C++ numerical libraries dealing with matrices. Consider using one of them. If performance is of paramount importance, don't forget to ask your compiler to optimize your code after you have debugged it.
Assuming you are on Linux (which I recommend for numerical computations; it is significant that most supercomputers run Linux), compile using g++ -std=c++11 -Wall -Wextra -g during the debugging phase, then use g++ -std=c++11 -Wall -Wextra -mtune=native -O3 during benchmarking. Don't forget to profile, and remember that premature optimization is evil (you first need to make your program correct).
You might even spend weeks, months, or perhaps years of work using techniques like OpenMP, OpenCL, MPI, pthreads or std::thread for parallelization (a difficult subject that takes years to master).
If your matrix is big and/or has additional properties (sparse, triangular, symmetric, etc.), there is a great deal of mathematical and computer-science knowledge to master to improve the performance. You could do a PhD on it and spend your entire life on the subject. So go to your university library and read some books on numerical analysis and linear algebra.
For random numbers C++11 gives you <random>; BTW use C++11 or C++14, not some earlier version of C++.
Read also http://floating-point-gui.de/ and a good book about C++ programming.
PS. I don't claim any particular expertise in numerical computation. I much prefer symbolic computation.
First of all, don't try to reinvent the wheel :) Try to use a heavily optimized numerical library, for example
Intel MKL (Fastest and most used math library for Intel and compatible processors)
LAPACK++ (library for high performance linear algebra)
Boost (not only numerical, but solves almost any problem)
Second: if you need a matrix in a very simple program, use a flat vector with the vector[i + width * j] indexing convention. It's faster than a vector of vectors because you save the extra memory allocations (and indirections).
Your example doesn't even compile. I tried to rewrite it a little:
#include <cstdlib>
#include <ctime>
#include <vector>

int main()
{
    const int rowCount = 1000;
    const int columnCount = 1000;
    srand(time(nullptr));
    // Declare matrix
    std::vector<double> matrix;
    // Preallocate elements (faster insertion later)
    matrix.reserve(rowCount * columnCount);
    // Insert elements
    for (size_t i = 0; i < rowCount * columnCount; ++i) {
        matrix.push_back(rand() % 10);
    }
    // perform some calculation with matrix
    // For example, this is the matrix element at (1, 3), using the i + rowCount * j layout:
    double element_1_3 = matrix[1 + 3 * rowCount];
    return EXIT_SUCCESS;
}
Now the speed depends on rand() (which is slow).
As people have said:
Prefer a 1D array to a 2D array for matrices.
Don't reinvent the wheel; use an existing library. I think the Eigen library is the best suited for you, judging from your code. It also has very well optimized generated code, since it uses compile-time computation via C++ templates whenever possible.
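If you go the Eigen route, a minimal sketch of the same fill-then-compute pattern might look like this; the 1000x1000 size and the modulo-10 fill just mirror the question and are not a tuned example:

#include <Eigen/Dense>
#include <cstdlib>
#include <ctime>

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    // 1000 x 1000 matrix of doubles, stored contiguously by Eigen
    Eigen::MatrixXd m = Eigen::MatrixXd::Zero(1000, 1000);
    for (int i = 0; i < m.rows(); ++i)
        for (int j = 0; j < m.cols(); ++j)
            m(i, j) = std::rand() % 10;
    // perform some calculation with m
    return 0;
}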

c++: passing Eigen-defined matrices to functions, and using them - best practice

I have a function which requires me to pass a fairly large matrix (which I created using Eigen), with dimensions ranging from 200x200 to 1000x1000. The function is more complex than this, but the bare bones of it are:
#include <Eigen/Dense>
using Eigen::MatrixXi;

int main()
{
    MatrixXi mIndices = MatrixXi::Zero(1000, 1000);
    MatrixXi* pMatrix = &mIndices;
    MatrixXi mTest;
    for (int i = 0; i < 10000; i++)
    {
        mTest = pMatrix[0];
        // Then do stuff to the copy
    }
}
Is the reason it takes much longer to run with a larger matrix that it takes longer to find available space in RAM for the array when I assign it to mTest? When I switch to a sparse matrix, this seems to be quite a lot quicker.
If I need to pass around large matrices, and I want to minimise the incremental effect of matrix size on runtime, then what is best practice here? At the moment, the same program runs slower in C++ than it does in Matlab, and obviously I would like to speed it up!
Best,
Ben
In the code you show, you are copying a 1,000,000-element matrix 10,000 times. The assignment in the loop creates a copy.
Generally if you're passing an Eigen matrix to another function, it can be beneficial to accept the argument by reference.
It's not really clear from your code what you're trying to achieve however.
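A minimal sketch of that difference; the function names are made up for illustration:

#include <Eigen/Dense>
using Eigen::MatrixXi;

// copies the whole matrix on every call: expensive for 1000x1000
int sumByValue(MatrixXi m) { return m.sum(); }

// no copy: the caller's matrix is used directly
int sumByConstRef(const MatrixXi& m) { return m.sum(); }

int main()
{
    MatrixXi mIndices = MatrixXi::Zero(1000, 1000);
    int a = sumByValue(mIndices);     // pays for a 1,000,000-element copy
    int b = sumByConstRef(mIndices);  // does not
    return a == b ? 0 : 1;
}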

efficiency when handling consecutive vs. non-consecutive blocks of memory

I have a struct
struct A
{
    int v[10000000];
};
If I have A a[2]; and wish to calculate the total sum of values, which of these two methods is faster?
int method_1(const A *a, int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
        for (int j = 0; j < 10000000; j++)
            total += a[i].v[j];
    return total;
}

int method_2(const A *a, int length)
{
    int total = 0;
    for (int j = 0; j < 10000000; j++)
        for (int i = 0; i < length; i++)
            total += a[i].v[j];
    return total;
}
a[2] is declared as two consecutive blocks of struct A, like so:
----a[0]---- /--- a[1]----
[][][][][][][][]/[][][][][][][][]
So I might be tempted to say that method_1 is faster, based on the intuition that the blocks are consecutive and the iteration through each block's v is also consecutive.
What I am really interested in is how the memory is actually accessed and what the most efficient way to access it is.
EDIT
I have changed the v size from 32 to 10000000, because apparently it wasn't understood that I was referring to a general case.
Each time a piece of memory is read, a whole cache line is read from main memory into the CPU cache; cache lines are typically 64 bytes on current CPUs (32 bytes on some older ones). Mostly because of this, reading consecutive memory blocks is fast.
Now, there is more than one cache line...
In your case both variants may have similar performance, since both arrays will most probably not collide into the same cache line and so may sit in the cache on different lines; I suspect performance will be similar.
One related thing you may consider to change the performance is NOT using the [] operators and instead iterating with pointers ("iterators"), like this:
int method_1(const A *a, int length)
{
    int total = 0;
    for (const A *aIt = a; aIt < a + length; ++aIt)
        for (const int *vIt = aIt->v; vIt < aIt->v + 10000000; ++vIt)
            total += *vIt;
    return total;
}
This way you avoid the double [], which is essentially a multiplication by the sizeof of an array element (it may be optimized away, but if it is not, it becomes costly when executed millions of times). Your compiler may be smart enough to optimize the code into exactly what I've shown, using only additions, but it very well may not be. I've seen this make a big difference when the operation performed on each element is as trivial as an increment; you're best off measuring how these options work out in your environment.
Accessing elements in the order they appear in memory will improve performance in most cases since it allows the prefetcher to load data before you even use it. Besides, if you use data in a non-contiguous way, you might load and discard the same cache line many times and this has a cost.
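A quick way to see the prefetcher at work is to time a contiguous walk against a strided walk over the same buffer; the sizes below are arbitrary and results will vary by machine:

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int rows = 4096, cols = 4096;
    std::vector<int> v(static_cast<size_t>(rows) * cols, 1);

    auto t0 = std::chrono::steady_clock::now();
    long long rowOrder = 0;
    for (int i = 0; i < rows; ++i)       // contiguous: friendly to the prefetcher
        for (int j = 0; j < cols; ++j)
            rowOrder += v[static_cast<size_t>(i) * cols + j];
    auto t1 = std::chrono::steady_clock::now();

    long long colOrder = 0;
    for (int j = 0; j < cols; ++j)       // strided: touches a new cache line each step
        for (int i = 0; i < rows; ++i)
            colOrder += v[static_cast<size_t>(i) * cols + j];
    auto t2 = std::chrono::steady_clock::now();

    std::printf("row-order %lld in %lld ms, column-order %lld in %lld ms\n",
                rowOrder,
                static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()),
                colOrder,
                static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()));
    return 0;
}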
The data size (in the original example) is small enough to fit completely in a single cache line on modern CPUs. I'm not sure whether the compiler can vectorize this code.
I don't think method_2 is slower than method_1. The chunk of memory will be brought from main memory into the CPU cache, and then accessing a[0] and a[1] will both take the same time.
To be on the safe side, method_1 can always be considered better than method_2.

If I want maximum speed, should I just use an array over a std::vector? [closed]

I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no black-box unknowns that I may have to suffer through. Yet so often I see people criticized for using arrays over vectors on the internet that I can't help but wonder if I am missing some more information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
It all depends on how you implement your algorithms. std::vector is such a general container concept that it gives us flexibility but leaves us with the freedom and responsibility of structuring the implementation of an algorithm deliberately. Most of the efficiency overhead observed with std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I ran the std::vector vs. array test described here,
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }
    ~TestTimer()
    {
        using namespace std;
        using namespace boost;
        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;
        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }
private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }
    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }
    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);
        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
        free(pixels);
    }
}

void UseVectorCtor()
{
    TestTimer t("UseConstructor");
    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

int main()
{
    TestTimer t1("The whole thing");
    UseArray();
    UseVector();
    UseVectorCtor();
    UseVectorPushBack();
    return 0;
}
and here are results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
Clearly push_back wasn't a good choice here; using the constructor is still more than twice as slow as the array.
Now, providing Pixel with an empty copy constructor:
Pixel(const Pixel&) {}
gives us following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm; otherwise, perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask, hoping you know better.
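If you do end up with a custom wrapper around new[]/delete[], a minimal RAII sketch might look like the following; the class name and fixed-size, non-copyable design are just illustration, not a full vector replacement:

#include <cstddef>

// minimal fixed-size buffer owning memory obtained from new[]/delete[]
template <typename T>
class Buffer {
    T* data_;
    std::size_t size_;
public:
    explicit Buffer(std::size_t n) : data_(new T[n]), size_(n) {}
    ~Buffer() { delete[] data_; }
    Buffer(const Buffer&) = delete;             // no accidental copies
    Buffer& operator=(const Buffer&) = delete;
    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }
};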
If you have only just started with vectors, it might be surprising how they behave; for example, this code:
#include <iostream>
#include <vector>
using namespace std;

class U {
    int i_;
public:
    U() {}
    U(int i) : i_(i) { cout << "consting " << i_ << endl; }
    U(const U& ot) : i_(ot.i_) { cout << "copying " << i_ << endl; }
};

int main(int argc, char** argv)
{
    std::vector<U> arr(2, U(3));
    arr.resize(4);
    return 0;
}
results with:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
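A minimal sketch of that point (the element count is arbitrary):

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 1000000;
    std::vector<double> points;
    points.reserve(n);                 // one allocation up front
    for (std::size_t i = 0; i < n; ++i)
        points.push_back(i * 0.5);     // no reallocation happens here
    return 0;
}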
Vectors are usually preferred over arrays because they allow bounds checking when accessing elements with .at(). That means accessing indices outside the vector doesn't cause undefined behavior as it does with an array. This bounds checking does, however, require additional CPU cycles. When you use the [] operator to access elements, no bounds checking is done and access should be as fast as with an array. This, however, risks undefined behavior when your code is buggy.
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go for either arrays or vectors, because I think that for your needs they may not be a perfect fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline and make your array nice and tidy to be loaded into memory at application start-up time, and either a vector or an array would work (provided you make the reserve call up front for the vector, since the default growth scheme doubles the size of the underlying array whenever it gets full, and you wouldn't want to use 16 GB of memory for only 9 GB worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array shifts all the successor elements by one place. Of course, an index like the kd-tree you mention will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array into pages (either linked in a list or indexed in an array) and store the data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.
Note that if pages are always full (except for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the supplementary element.
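A very rough sketch of the paged idea described above; the page capacity, the int payload, and the linear page search are all simplifying assumptions, not a tuned design:

#include <algorithm>
#include <cstddef>
#include <vector>

// each page is a small sorted chunk; a shift on insert never spans more than one page
struct Page {
    std::vector<int> data;   // kept sorted inside the page
};

class PagedSet {
    static const std::size_t kPageCapacity = 1024;
    std::vector<Page> pages_;
public:
    void insert(int value) {
        if (pages_.empty()) pages_.push_back(Page());
        // find the first page whose last element is >= value (linear search for simplicity)
        std::size_t p = 0;
        while (p + 1 < pages_.size() && !pages_[p].data.empty() &&
               pages_[p].data.back() < value)
            ++p;
        Page& page = pages_[p];
        // the shift is limited to this one page, not the whole dataset
        page.data.insert(std::lower_bound(page.data.begin(), page.data.end(), value),
                         value);
        if (page.data.size() > kPageCapacity) {          // split a full page in two
            Page right;
            right.data.assign(page.data.begin() + kPageCapacity / 2, page.data.end());
            page.data.resize(kPageCapacity / 2);
            pages_.insert(pages_.begin() + p + 1, right);
        }
    }
};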
Some things to keep in mind:
array and vector allocations have upper limits, which are OS dependent (and these limits may differ)
On my 32-bit system, the maximum allowed allocation for a vector of 3D points is around 180 million entries, so for larger datasets one would have to find a different solution. Granted, on a 64-bit OS that amount can be significantly larger (on 32-bit Windows, the maximum memory space for a process is 2 GB; I think they added some tricks in more advanced versions of the OS to extend that amount). Admittedly, memory will be even more problematic for solutions like mine.
resizing a vector requires allocating a new chunk of the required size on the heap and copying the elements from the old memory chunk to the new one.
So, to add just one element to the sequence, you may need twice the memory during the resize. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on Unices, for instance, although as far as I know that function doesn't guarantee that the same memory chunk will be reused). The problem could be avoided with a vector as well by using a custom allocator built on those same functions.
C++ doesn't make any assumption about the underlying memory architecture.
Vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and they wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OSes, that memory is actually cut into pages, which are mapped in and out of physical memory. So my solution is, in a way, to reproduce that mechanism at the process level. To make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS-dependent code will be necessary. On the other hand, this is not a concern at all for a vector- or array-based solution.
So in essence my answer is concerned with the efficiency of updating the dataset in a manner which favors clustering points close to each other. It assumes that such clustering is possible. If that is not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std::vector, most container implementations like this are slower than raw arrays because they allocate memory when they are resized, normally doubling the current capacity (although this is not always the case).
So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvement by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operations that are common to vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size), they perform exactly the same, because, by specification, they are the same.
Vector access methods are so trivial that the simplest compiler optimization will wipe them out.
If you know the size of a vector in advance, just construct it by specifying the size, or call resize, and you will get the same thing you would get with new[].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you pay no penalty on push_back, since all the required memory is already allocated.
In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that reallocations of large amounts become less and less frequent.
Also, if you know everything about sizes, need no dynamic memory, and want the same vector interface, consider also std::array.
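And a minimal std::array sketch for that fully static case (the size and fill are chosen arbitrarily):

#include <array>
#include <numeric>

int main()
{
    // size fixed at compile time, storage on the stack, same iterator interface as vector
    std::array<double, 1000> a{};
    std::iota(a.begin(), a.end(), 0.0);                       // fill with 0, 1, 2, ...
    double sum = std::accumulate(a.begin(), a.end(), 0.0);    // use it like any container
    return sum > 0.0 ? 0 : 1;
}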
Sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really, really want to make sure it's not re-allocating under the hood,
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.

Performance using IDs and arrays (vectors)

I was taught at school to use databases with integer IDs, and I want to know whether this is also a good approach in C/C++. I'm making a game using Ogre3D, so I'd like my game code to use as few cycles as possible.
This is not the exact code (I'm using vectors, and it's about characters and abilities and such), but I'm curious to know whether the line where I access the weight is going to cause a bottleneck or not, since I'm doing several array subscripts.
struct item
{
    float weight;
    int mask;
    item() : mask(0) {}
} items[2000];

struct shipment
{
    int item_ids[20];
} shipments[10000];

struct order
{
    int shipment_ids[20];
} orders[3000];

int main()
{
    unsigned s = 0;
    // if I want to access an item's data of a certain order, I do:
    for (int i = 0; i < 3000; ++i)
    {
        if (items[shipments[orders[4].shipment_ids[5]].item_ids[5]].weight > 23.0)
            s |= (1u << 31);
    }
}
I have heard that putting data into arrays is the best way to gain performance when looping over data repeatedly; I just want to know your opinion on this code...
A good optimizer should be able to compute the exact memory offset of each of those items. There is no dependency between loop iterations, so the loop should be amenable to unrolling (and SIMD processing). Looks great, IMHO. If you can avoid floats, that will also help you.
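To make that loop-invariance concrete, here is a hedged sketch that hoists the repeated part of the subscript chain out of the loop. It assumes the structs from the question and iterates over the 20 items of the chosen shipment, which is presumably what the loop is after:

// assumes the item/shipment/order arrays declared in the question
unsigned check_order_items()
{
    unsigned s = 0;
    // orders[4].shipment_ids[5] does not depend on the loop variable: look it up once
    const shipment &sh = shipments[orders[4].shipment_ids[5]];
    for (int i = 0; i < 20; ++i)
    {
        if (items[sh.item_ids[i]].weight > 23.0)
            s |= (1u << 31);
    }
    return s;
}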