I started experimenting with data-oriented design. I initially wrote some OOP code and found that parts of it were extremely slow, and I couldn't figure out why. Here is one example:
I have a game object:
class GameObject
{
public:
float m_Pos[2];
float m_Vel[2];
float m_Foo;
void UpdateFoo(float f){
float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
m_Foo += mag * f;
}
};
Then I create 1,000,000 objects with new and loop over them, calling UpdateFoo():
for (unsigned i=0; i<OBJECT_NUM; ++i)
{
v_objects[i]->UpdateFoo(10.0);
}
The loop takes about 20 ms to finish. Strange things happen when I comment out float m_Pos[2], so that the object looks like this:
class GameObject
{
public:
//float m_Pos[2];
float m_Vel[2];
float m_Foo;
void UpdateFoo(float f){
float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
m_Foo += mag * f;
}
};
Suddenly the loop takes about 150 ms to finish. And if I put anything before m_Vel, it is much faster again. I tried putting padding between m_Vel and m_Foo and in other places; everywhere except the spot before m_Vel is slow.
I tested on VS2008 and VS2010 in release builds, on an i7-4790.
Any idea how this difference could happen? Is it related to some cache behavior?
Here is the whole sample:
#include <iostream>
#include <cstdio> // for printf (used below)
#include <math.h>
#include <vector>
#include <Windows.h>
using namespace std;
class GameObject
{
public:
//float m_Pos[2];
float m_Velocity[2];
float m_Foo;
void UpdateFoo(float f)
{
float mag = sqrtf(m_Velocity[0] * m_Velocity[0] + m_Velocity[1] * m_Velocity[1]);
m_Foo += mag * f;
}
};
#define OBJECT_NUM 1000000
int main(int argc, char **argv)
{
vector<GameObject*> v_objects;
for (unsigned i=0; i<OBJECT_NUM; ++i)
{
GameObject * pObject = new GameObject;
v_objects.push_back(pObject);
}
LARGE_INTEGER nFreq;
LARGE_INTEGER nBeginTime;
LARGE_INTEGER nEndTime;
QueryPerformanceFrequency(&nFreq);
QueryPerformanceCounter(&nBeginTime);
for (unsigned i=0; i<OBJECT_NUM; ++i)
{
v_objects[i]->UpdateFoo(10.0);
}
QueryPerformanceCounter(&nEndTime);
double dWasteTime = (double)(nEndTime.QuadPart - nBeginTime.QuadPart) / (double)nFreq.QuadPart * 1000;
printf("finished: %f", dWasteTime);
// for (unsigned i=0; i<OBJECT_NUM; ++i)
// {
// delete(v_objects[i]);
// }
}
then I create 1,000,000 of objects using new, and then loop over calling UpdateFoo()
There's your problem right there. Don't allocate a million teeny things individually with a general-purpose allocator when they're going to be processed repeatedly.
Try storing the objects contiguously, or in contiguous chunks. An easy solution is to store them all in one big std::vector. To remove an element in constant time, swap it with the last element and pop the back (see the sketch below). If you need stable indices, you can leave a hole behind to be reclaimed on insertion (using a free list or a stack of holes). If you need stable pointers that don't invalidate, a deque combined with the "holes" idea (a free list or a separate stack of indices to reclaim/overwrite) might be an option.
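For illustration, a minimal sketch of the swap-and-pop idea, assuming the GameObject class from the question (remove_at and update_all are hypothetical helper names, not the poster's code):

#include <cstddef>
#include <utility>
#include <vector>

// Remove element i in O(1) by overwriting it with the last element.
// Note: element order is not preserved.
void remove_at(std::vector<GameObject> &v, std::size_t i)
{
    std::swap(v[i], v.back());
    v.pop_back();
}

// The hot loop now walks one contiguous buffer instead of chasing
// a million individually allocated pointers.
void update_all(std::vector<GameObject> &v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i].UpdateFoo(10.0f);
}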
You can also use a free list allocator and placement new against it, being careful to free through the same allocator and to invoke the destructor manually, but that gets messy fast and takes more practice to do well than the data structure approach. I recommend instead simply storing your game objects in one big container, so that you regain control over where everything resides in memory and get the spatial locality that results.
I tested on vs2008 and vs2010 in release build, i7-4790. Any idea how this difference could happen? Is it related to any cache coherent behavior.
Assuming you are benchmarking and building the project properly, perhaps the allocator fragments memory more when GameObject is smaller, in which case you incur more cache misses. That seems the most likely explanation, but it is difficult to know for sure without a good profiler.
That said, instead of analyzing it further, I recommend the above solution so that you don't have to worry about where the allocator puts every teeny thing in memory.
Related
I'm struggling with a small class I created. It is supposed to hold an array of data that has to be deleted when the class object is deleted. Hence:
#include <stdio.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <fstream>
#include <math.h> // for sin/cos
#include <algorithm> // for std::swap
class Grid
{
public:
unsigned int dimX_inner, dimY_inner;
double *x_vector1;
unsigned int ghostzone;
int dimX_total, dimY_total;
double h_x, h_y;
Grid(const unsigned int dimX_inner, const unsigned int dimY_inner, const unsigned int ghostzone = 0);
~Grid();
};
Grid::Grid(unsigned int gridsize_x, unsigned int gridsize_y, unsigned int ghostzoneWidth) : dimX_inner(gridsize_x),
dimY_inner(gridsize_y),
x_vector1(new double[(gridsize_x + 2 * ghostzoneWidth) * (gridsize_y + 2 * ghostzoneWidth)])
{
ghostzone = ghostzoneWidth;
dimX_total = gridsize_x + 2 * ghostzoneWidth;
dimY_total = gridsize_y + 2 * ghostzoneWidth;
h_x = (double)1.0 / (gridsize_x - 1);
h_y = (double)1.0 / (gridsize_y - 1);
}
Grid::~Grid()
{
delete[] x_vector1;
}
That's all I need for now. I added a little main:
int main()
{
int problemsize = 5;
Grid *grid1 = new Grid(problemsize, problemsize);
delete[] grid1;
return 0;
}
I've written a similar class before that ran without problems; granted, most of it was taken from online exercises/examples.
But in this case I get a segmentation fault when running the program (at the delete line). I guess it's the constructor that somehow constitutes the problem. What I'd like to ask:
What might be the cause of the segmentation fault?
Is the constructor in the code "acceptable" (coding style wise)?
Would I have to make any additional considerations when adding another data object to the class (e.g. an array *x_vector2)?
Short remark: I'm running this under WSL (I don't know whether this is relevant).
What might be the cause of the segmentation fault?
You cannot delete[] what you created via new. new/new[] must be matched with their respective counterparts delete/delete[]; mixing them results in undefined behavior. Actually, you should avoid any use of new or delete where possible. For example, your main should look like this:
int main()
{
int problemsize = 5;
Grid grid1(problemsize, problemsize); // parentheses: braces would reject the narrowing int -> unsigned conversion
}
The destructor of grid1 will be called when it goes out of scope.
Is the constructor in the code "acceptable" (coding style wise)?
No. You use the member initializer list for some members, but why not for all? For fundamental types the difference is negligible, but style-wise you should always prefer initialization over initialization plus assignment in the constructor body, e.g. as sketched below.
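A sketch of the same constructor with every member in the initializer list (keeping the question's raw pointer for now; members are listed in their declaration order, which is the order they are actually initialized in):

Grid::Grid(unsigned int gridsize_x, unsigned int gridsize_y, unsigned int ghostzoneWidth)
    : dimX_inner(gridsize_x),
      dimY_inner(gridsize_y),
      x_vector1(new double[(gridsize_x + 2 * ghostzoneWidth) * (gridsize_y + 2 * ghostzoneWidth)]),
      ghostzone(ghostzoneWidth),
      dimX_total(gridsize_x + 2 * ghostzoneWidth),
      dimY_total(gridsize_y + 2 * ghostzoneWidth),
      h_x(1.0 / (gridsize_x - 1)),
      h_y(1.0 / (gridsize_y - 1))
{
}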
Would I have to make any additional considerations when adding another data object to the class (e.g. an array *x_vector2)?
You need not even add another member: the class is already broken because it ignores the rule of 3/5 (https://en.cppreference.com/w/cpp/language/rule_of_three). Instead of manually managing the memory, you should delegate that to containers and/or smart pointers. In your case a std::vector seems to fit (a sketch follows below). The destructor can then be ~Grid() = default; (cf. the rule of zero).
Managing a resource is already enough for a class to do. Consequently a class that does something should not in addition manage memory. In extremely rare cases you need to write a custom type that manages memory, but most of the time a standard container or a smart pointer should suffice.
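For instance, a sketch of the rule-of-zero version with std::vector (assuming, as in the question, that the grid data fits one flat buffer):

#include <vector>

class Grid
{
public:
    unsigned int dimX_inner, dimY_inner;
    std::vector<double> x_vector1; // owns and releases its memory itself
    unsigned int ghostzone;
    int dimX_total, dimY_total;
    double h_x, h_y;

    Grid(unsigned int gridsize_x, unsigned int gridsize_y,
         unsigned int ghostzoneWidth = 0)
        : dimX_inner(gridsize_x),
          dimY_inner(gridsize_y),
          x_vector1((gridsize_x + 2 * ghostzoneWidth) * (gridsize_y + 2 * ghostzoneWidth)),
          ghostzone(ghostzoneWidth),
          dimX_total(gridsize_x + 2 * ghostzoneWidth),
          dimY_total(gridsize_y + 2 * ghostzoneWidth),
          h_x(1.0 / (gridsize_x - 1)),
          h_y(1.0 / (gridsize_y - 1))
    {
    }
    ~Grid() = default; // rule of zero: copy, move and destruction just work
};

Copying a Grid now deep-copies the vector, so the double-free that the rule-of-3/5 violation would eventually cause cannot happen.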
My application consists of calling dozens of functions millions of times. In each of those functions, one or a few temporary std::vector containers of POD (plain old data) types are initialized, used, and then destructed. By profiling my code, I found that the allocations and deallocations cause a huge overhead.
A lazy solution is to rewrite all the functions as functors containing those temporary buffer containers as class members. However, this would blow up memory consumption, as the functions are many and the buffer sizes are not trivial.
A better way is to analyze the code, gather all the buffers, premeditate how to maximally reuse them, and feed a minimal set of shared buffer containers to the functions as arguments. But this can be too much work.
I want to solve this problem once and for all for my future development, whenever temporary POD buffers become necessary, without much premeditation. My idea is to implement a container port and pass a reference to it into every function that may need temporary buffers. Inside those functions, one should be able to fetch containers of any POD type from the port, and the port should auto-recall the containers before the functions return.
// Port of vectors of POD types.
struct PODvectorPort
{
std::size_t Nlent; // Number of dispatched containers.
std::vector<std::vector<std::size_t> > X; // Container pool.
PODvectorPort() { Nlent = 0; }
};
// Functor that manages the port.
struct PODvectorPortOffice
{
std::size_t initialNlent; // Number of already-dispatched containers
// when the office is set up.
PODvectorPort *p; // Pointer to the port.
PODvectorPortOffice(PODvectorPort &port)
{
p = &port;
initialNlent = p->Nlent;
}
template<typename X, typename Y>
std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
{
// return *((std::vector<X>*)(&y)); // UB although works
std::vector<X> *rst = nullptr;
std::memcpy(&rst, &y, std::min(
sizeof(std::vector<X>*), sizeof(std::vector<Y>*)));
return *rst; // guess it makes no difference. Should still be UB.
}
template<typename T>
std::vector<T> & lend()
{
++p->Nlent;
// Ensure sufficient container pool size:
while (p->X.size() < p->Nlent) p->X.push_back( std::vector<size_t>(0) );
return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
}
void recall() { p->Nlent = initialNlent; }
~PODvectorPortOffice() { recall(); }
};
struct ArbitraryPODstruct
{
char a[11]; short b[7]; int c[5]; float d[3]; double e[2];
};
// Example f1():
// f2(), f3(), ..., f50() are similarly defined.
// All functions are called a few million times in certain
// order in main().
// port is defined in main().
void f1(other arguments..., PODvectorPort &port)
{
PODvectorPortOffice portOffice(port);
// Oh, I need a buffer of chars:
std::vector<char> &tmpchar = portOffice.lend<char>();
tmpchar.resize(789); // Trivial if container already has sufficient capacity.
// ... do things
// Oh, I need a buffer of shorts:
std::vector<short> &tmpshort = portOffice.lend<short>();
tmpshort.resize(456); // Trivial if container already has sufficient capacity.
// ... do things.
// Oh, I need a buffer of ArbitraryPODstruct:
std::vector<ArbitraryPODstruct> &tmpArb = portOffice.lend<ArbitraryPODstruct>();
tmpArb.resize(123); // Trivial if container already has sufficient capacity.
// ... do things.
// Oh, I need a buffer of integers, but also tmpArb is no longer
// needed. Why waste it? Cache hot.
std::vector<int> &tmpint = portOffice.repaint<int>(tmpArb);
tmpint.resize(300); // Trivial.
// ... do things.
}
Although the code compiles with both gcc-8.3 and MSVS 2019 at -O2 through -Ofast, and passes extensive tests for all options, I expect criticism due to the hacky nature of PODvectorPortOffice::repaint(), which "casts" the vector type in place.
A set of sufficient but not necessary conditions for the correctness and efficiency of the above code are:
std::vector<T> stores 3 pointers: to the underlying buffer's &[0], to &[0] + .size(), and to &[0] + .capacity().
std::vector<T>'s allocator calls malloc().
malloc() returns an 8-byte (or sizeof(std::size_t)) aligned address.
So, if this is unacceptable to you, what would be the modern, proper way of addressing my need? Is there a way of writing a manager that achieves what my code does, only without violating the Standard?
Thanks!
Edits: a little more context for my problem. Those functions mainly compute some simple statistics of the inputs. The inputs are data streams of financial parameters of various types and sizes. To compute the statistics, the data need to be altered and rearranged first, hence the buffers for temporary copies. Computing the statistics is cheap, so the allocations and deallocations can become relatively expensive. Why do I want a manager for arbitrary POD types? Because two weeks from now I may start receiving a data stream of a different type, which could be a bunch of primitive types zipped in a struct, or a struct of the composite types encountered so far. I would, of course, like the upstream to just send separate flows of primitive types, but I have no control over that aspect.
More edits: after tons of reading and code experiments regarding the strict aliasing rule, the answer should be: do not try what I put up there --- it works, for now, but don't do it. Instead, I'll be diligent and stick to my previous code-as-you-go style: just add a vector<vector<myNewType> > to the port once a new type comes up, and manage it in a similar way. The accepted answer also offers a nice alternative.
Even more edits: conceived a stronger class that has better chance to thwart potential optimizations under the strict aliasing rule. DO NOT USE IT WITHOUT TESTING AND THOROUGH UNDERSTANDING OF THE STRICT ALIASING RULE.
// -std=c++17
#include <cstring>
#include <cstddef>
#include <iostream>
#include <vector>
#include <chrono>
// POD: plain old data.
// Idea: design a class that can let you maximally reuse temporary
// containers during a program.
// Port of vectors of POD types.
template <std::size_t portsize = 42>
class PODvectorPort
{
static constexpr std::size_t Xsize = portsize;
std::size_t signature;
std::size_t Nlent; // Number of dispatched containers.
std::vector<std::size_t> X[portsize]; // Container pool.
PODvectorPort(const PODvectorPort &);
PODvectorPort & operator=( const PODvectorPort& );
public:
std::size_t Ndispatched() { return Nlent; }
std::size_t showSignature() { return signature; }
PODvectorPort() // Permuted random number generator.
{
std::size_t state = std::chrono::high_resolution_clock::now().time_since_epoch().count();
state ^= (uint64_t)(&std::memmove);
signature = ((state >> 18) ^ state) >> 27;
std::size_t rot = state >> 59;
signature = (signature >> rot) | (state << ((-rot) & 31));
Nlent = 0;
}
template<typename podvecport>
friend class PODvectorPortOffice;
};
// Functor that manages the port.
template<typename podvecport>
class PODvectorPortOffice
{
// Number of already-dispatched containers when the office is set up.
std::size_t initialNlent;
podvecport *p; // Pointer to the port.
PODvectorPortOffice( const PODvectorPortOffice& ); // non construction-copyable
PODvectorPortOffice& operator=( const PODvectorPortOffice& ); // non copyable
constexpr void check()
{
while (__cplusplus < 201703)
{
std::cerr << "PODvectorPortOffice: C++ < 17, Stall." << std::endl;
}
// Check if allocation will be 8-byte (or more) aligned.
// Intend it not to work on machine < 64-bit.
constexpr std::size_t aln = alignof(std::max_align_t);
while (aln < 8)
{
std::cerr << "PODvectorPortOffice: Allocation is not at least 8-byte aligned, Stall." <<
std::endl;
}
while ((aln & (aln - 1)) != 0)
{
std::cerr << "PODvectorPortOffice: Alignment is not a power of 2 bytes. Stall." << std::endl;
}
// Random checks to see if sizeof(vector<S>) != sizeof(vector<T>).
if(true)
{
std::size_t vecHeadSize[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
vecHeadSize[0] = sizeof(std::vector<char>(0));
vecHeadSize[1] = sizeof(std::vector<short>(1));
vecHeadSize[2] = sizeof(std::vector<int>(2));
vecHeadSize[3] = sizeof(std::vector<long>(3));
vecHeadSize[4] = sizeof(std::vector<std::size_t>(5));
vecHeadSize[5] = sizeof(std::vector<float>(7));
vecHeadSize[6] = sizeof(std::vector<double>(11));
vecHeadSize[7] = sizeof(std::vector<std::vector<char> >(13));
vecHeadSize[8] = sizeof(std::vector<std::vector<int> >(17));
vecHeadSize[9] = sizeof(std::vector<std::vector<double> >(19));
struct tmpclass1 { char a; short b; };
struct tmpclass2 { char a; float b; };
struct tmpclass3 { char a; double b; };
struct tmpclass4 { int a; char b; };
struct tmpclass5 { double a; char b; };
struct tmpclass6 { double a[5]; char b[3]; short c[3]; };
vecHeadSize[10] = sizeof(std::vector<tmpclass1>(23));
vecHeadSize[11] = sizeof(std::vector<tmpclass2>(29));
vecHeadSize[12] = sizeof(std::vector<tmpclass3>(31));
vecHeadSize[13] = sizeof(std::vector<tmpclass4>(37));
vecHeadSize[14] = sizeof(std::vector<tmpclass4>(41));
vecHeadSize[15] = sizeof(std::vector<tmpclass4>(43));
std::size_t notSame = 0;
for(int i = 0; i < 16; ++i)
notSame += vecHeadSize[i] != sizeof(std::size_t) * 3;
while (notSame)
{
std::cerr << "sizeof(std::vector<S>) != sizeof(std::vector<T>), \
PODvectorPortOffice cannot handle. Stall." << std::endl;
}
}
}
void recall() { p->Nlent = initialNlent; }
public:
PODvectorPortOffice(podvecport &port)
{
check();
p = &port;
initialNlent = p->Nlent;
}
template<typename X, typename Y>
std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
// AFTER A VECTOR IS REPAINTED, DO NOT USE THE OLD VECTOR AGAIN !!
{
while (std::is_same<bool, X>::value)
{
std::cerr << "PODvectorPortOffice: Cannot repaint the vector to \
std::vector<bool>. Stall." << std::endl;
}
std::vector<X> *x;
std::vector<Y> *yp = &y;
std::memcpy(&x, &yp, sizeof(x));
return *x; // Not compliant with strict aliasing rule.
}
template<typename T>
std::vector<T> & lend()
{
while (p->Nlent >= p->Xsize)
{
std::cerr << "PODvectorPortOffice: No more containers. Stall." << std::endl;
}
++p->Nlent;
return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
}
~PODvectorPortOffice()
{
// Because p->signature can only be known at runtime, an aggressive,
// compliant compiler (ACC) will never remove this
// branch. Volatile might do, but trustworthiness?
if(p->signature == 0)
{
constexpr std::size_t sizeofvec = sizeof(std::vector<std::size_t>);
char dummy[sizeofvec * p->Xsize];
std::memcpy(dummy, p->X, p->Nlent * sizeofvec);
std::size_t ticketNum = 0;
char *xp = (char*)(p->X);
for(int i = 0, iend = p->Nlent * sizeofvec; i < iend; ++i)
{
xp[i] &= xp[iend - i - 1] * 5;
ticketNum += xp[i] ^ ticketNum;
}
std::cerr << "Congratulations! After the port office was decommissioned, \
you found a winning lottery ticket. The odds is less than 2.33e-10. Your \
ticket number is " << ticketNum << std::endl;
std::memcpy(p->X, dummy, p->Nlent * sizeofvec);
// According to the strict aliasing rule, a char* can point to any memory
// block pointed by another pointer of any type T*. Thus given an ACC,
// the writes to that block via the char* must be fully acknowledged in
// time by T*, namely, for reading contents from T*, a reload instruction
// will be kept in the assembly code to achieve a sort of
// "register-cache-memory coherence" (RCMC).
// We also do not care about the renters' (who received the reference via
// .lend()) RCMC, because PODvectorPortOffice never accesses the contents
// of those containers.
}
recall();
}
};
Any adversarial test case that breaks it, especially on GCC >= 8.3 or MSVS >= 2019, is welcome!
Let me frame this by saying I don't think there's an "authoritative" answer to this question. That said, you've provided enough constraints that a suggested path is at least worthwhile. Let's review the requirements:
Solution must use std::vector. This is in my opinion the most unfortunate requirement for reasons I won't get into here.
Solution must be standards compliant and not resort to rule violations, like the strict aliasing rule.
Solution must either reduce the number of allocations performed, or reduce the overhead of allocations to the point of being negligible.
In my opinion this is definitely a job for a custom allocator. There are a couple of off-the-shelf options that come close to doing what you want, for example the Boost Pool Allocators. The one you're most interested in is boost::pool_allocator. This allocator will create a singleton "pool" for each distinct object size (note: not object type), which grows as needed, but never shrinks until you explicitly purge it.
The main difference between this and your solution is that you'll have distinct pools of memory for objects of different sizes, which means it will use more memory than your posted solution, but in my opinion this is a reasonable trade-off. To be maximally efficient, you could simply start a batch of operations by creating vectors of each needed type with an appropriate size. All subsequent vector operations which use these allocators will do trivial O(1) allocations and deallocations. Roughly in pseudo-code:
// Be careful with this (you would probably want [[nodiscard]]);
// this code is just rough guidance:
void force_pool_sizes(void)
{
std::vector<int, boost::pool_allocator<int>> size_int_vect;
std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> size_16_vect;
...
size_int_vect.resize(100); // probably makes malloc calls
size_16_vect.resize(200); // probably makes malloc calls
...
// on return, objects go out of scope, but singleton pools
// with allocated blocks of memory remain for future use
// until explicitly purged.
}
void expensive_long_running(void)
{
force_pool_sizes();
std::vector<int, boost::pool_allocator<int>> data1;
... do stuff, malloc/free will never be called...
std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> data2;
... do stuff, malloc/free will never be called...
// free everything:
boost::singleton_pool<boost::pool_allocator_tag, sizeof(int)>::release_memory();
}
If you want to take this a step further on being memory efficient, if you know for a fact that certain pool sizes are mutually exclusive, you could modify the boost pool_allocator to use a slightly different singleton backing store which allows you to move a memory block from one block size to another. This is probably out of scope for now, but the boost code itself is straightforward enough, if memory efficiency is critical, it's probably worthwhile.
It's worth pointing out that there's probably some confusion about the strict aliasing rule, especially when it comes to implementing your own memory allocators. There are lots and lots of SO questions about strict aliasing and what it does and doesn't mean. This one is a good place to start.
The key takeaway is that it's perfectly ordinary and acceptable in low-level C++ code to take an array of memory and cast it to some object type. If this were not the case, std::allocator wouldn't exist. You also wouldn't have much use for things like std::aligned_storage. Look at the example use case for std::aligned_storage on cppreference: an STL-like static_vector class is created which keeps an array of aligned_storage objects that get recast to a concrete type. Nothing about this is "unacceptable" or "illegal", but it does require some additional knowledge and care in handling. A simplified sketch of that pattern follows below.
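This is a minimal sketch in the spirit of the cppreference static_vector example, not a complete container (no bounds checks, no copy control); static_vector_sketch is a made-up name:

#include <cstddef>
#include <new>
#include <type_traits>

template <class T, std::size_t N>
class static_vector_sketch
{
    // Raw, properly aligned bytes; no T objects exist here yet.
    std::aligned_storage_t<sizeof(T), alignof(T)> data[N];
    std::size_t n = 0;
public:
    void push_back(const T &v)
    {
        new (&data[n]) T(v); // placement new constructs a T in the raw storage
        ++n;
    }
    T &operator[](std::size_t i)
    {
        // std::launder (C++17) sanctions accessing the storage as a T.
        return *std::launder(reinterpret_cast<T *>(&data[i]));
    }
    ~static_vector_sketch()
    {
        for (std::size_t i = 0; i < n; ++i)
            (*this)[i].~T(); // manual destructor calls for the constructed Ts
    }
};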
The reason your solution is especially going to enrage the code lawyers is that you're taking pointers of one non-char object type and casting them to different non-char object types. This is a particularly offensive violation of the strict aliasing rule, but also not really necessary given some of your other options.
Also keep in mind that it's not an error to alias memory, it's a warning. I'm not saying go crazy with aliasing, but I am saying that as with all things C and C++, there are justifiable cases to break rules, when you have very thorough knowledge and understanding of both your compiler and the machine you're running on. Just be prepared for some very long and painful debug sessions if it turns out you didn't in fact know those two things as well as you thought you did.
I am working on solving identical ordinary differential equations with different initial conditions in parallel. I have solved this problem with OpenMP, and now I want to implement similar code on the GPU. Specifically, I want to allocate device memory for floats in the class constructor and then deallocate it in the destructor. It doesn't work for me: my executable is "terminated by signal SIGSEGV (Address boundary error)". Is it possible to use classes, constructors and destructors in CUDA?
By the way, I am a newbie in CUDA and not very experienced in C++ either.
I attach the code in case I have described my problem poorly.
#include <cmath>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <random>
#include <string>
#include <chrono>
#include <ctime>
using namespace std;
template<class ode_sys>
class solver: public ode_sys
{
public:
int *nn;
float *t,*tt,*dt,*x,*xx,*m0,*m1,*m2,*m3;
using ode_sys::rhs_sys;
__host__ solver(int n): ode_sys(n)
{ // Here I try to allocate memory. It works with malloc() and doesn't with cudaMalloc().
size_t size=sizeof(float)*n;
cudaMalloc((void**)&nn,sizeof(int));
*nn=n;
cudaMalloc((void**)&t,sizeof(float));
cudaMalloc((void**)&tt,sizeof(float));
cudaMalloc((void**)&dt,sizeof(float));
cudaMalloc((void**)&x,size);
cudaMalloc((void**)&xx,size);
cudaMalloc((void**)&m0,size);
cudaMalloc((void**)&m1,size);
cudaMalloc((void**)&m2,size);
cudaMalloc((void**)&m3,size);
}
__host__ ~solver()
{
cudaFree(nn);
cudaFree(t);
cudaFree(tt);
cudaFree(dt);
cudaFree(x);
cudaFree(xx);
cudaFree(m0);
cudaFree(m1);
cudaFree(m2);
cudaFree(m3);
}
__host__ __device__ void rk4()
{//this part is not important now.
}
};
class ode
{
private:
int *nn;
public:
float *eps,*d;
__host__ ode(int n)
{
cudaMalloc((void**)&nn,sizeof(int));
*nn=n;
cudaMalloc((void**)&eps,sizeof(float));
size_t size=sizeof(float)*n;
cudaMalloc((void**)&d,size);
}
__host__ ~ode()
{
cudaFree(nn);
cudaFree(eps);
cudaFree(d);
}
__host__ __device__ float f(float x_,float y_,float z_,float d_)
{
return d_+*eps*(sinf(x_)+sinf(z_)-2*sinf(y_));
}
__host__ __device__ void rhs_sys(float *t,float *dt,float *x,float *dx)
{
}
};
//const float pi=3.14159265358979f;
__global__ void solver_kernel(int m,int n,solver<ode> *sys_d)
{
int index = threadIdx.x;
int stride = blockDim.x;
//actually ode numerical evaluation should be here
for (int l=index;l<m;l+=stride)
{//this is just to check that i can run kernel
printf("%d Hello \n", l);
}
}
int main ()
{
auto start = std::chrono::system_clock::now();
std::time_t start_time = std::chrono::system_clock::to_time_t(start);
cout << "started computation at " << std::ctime(&start_time);
int m=128,n=4,l;// i want to run 128 threads, n is dimension of ode
size_t size=sizeof(solver<ode>(n));
solver<ode> *sys_d; //an array of objects
cudaMalloc(&sys_d,size*m); //nvprof shows that this array is allocated
for (l=0;l<m;l++)
{
new (sys_d+l) solver<ode>(n); //it doesn't work as it meant to
}
solver_kernel<<<1,m>>>(m,n,sys_d);
for (l=0;l<m;l++)
{
(sys_d+l)->~solver<ode>(); //it doesn't work as it meant to
}
cudaFree(sys_d); //it works
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::time_t end_time = std::chrono::system_clock::to_time_t(end);
std::cout << "finished computation at " << std::ctime(&end_time) << "elapsed time: " << elapsed_seconds.count() << "s\n";
return 0;
}
//end of file
Distinguish host-side and device-side memory
As the other answers also state:
GPU (global) memory you allocate with cudaMalloc() is not accessible by code running on the CPU; and
System memory (aka host memory) you allocate in plain C++ (with std::vector, with std::make_unique, with new, etc.) is not accessible by code running on the GPU.
So, you need to allocate both host-side and device-side memory. For a simple example of working with both device-side and host-side memory see the CUDA vectorAdd sample program.
(Actually, you can also make a special kind of allocation which is accessible from both the device and the host; this is Unified Memory. But let's ignore that for now since we're dealing with the basics.)
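For instance, a minimal sketch of that pattern (error checks omitted for brevity; real code should check every return value, as discussed below):

#include <cuda_runtime.h>
#include <vector>

void example(int n)
{
    std::vector<float> h_x(n, 1.0f);              // host-side memory
    float *d_x = nullptr;
    cudaMalloc((void**)&d_x, n * sizeof(float));  // device-side memory
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);           // host -> device
    // ... launch kernels that operate on d_x ...
    cudaMemcpy(h_x.data(), d_x, n * sizeof(float),
               cudaMemcpyDeviceToHost);           // device -> host
    cudaFree(d_x);
}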
Don't live in the kingdom of nouns
Specifically, I want to allocate memory on device for floats in class constructor and then deallocate it in destructor.
I'm not sure you really want to do that. You seem to be taking a more Java-esque approach, in which everything you do is noun-centric, i.e. classes are used for everything: You don't solve equations, you have an "equation solver". You don't "do X", you have an "XDoer" class etc. Why not just have a (templated) function which solves an ODE system, returning the solution? Are you using your "solver" in any other way?
(this point is inspired by Steve Yegge's blog post, Execution in the Kingdom of Nouns.)
Try to avoid allocating and de-allocating yourself
In well-written modern C++, we try to avoid direct, manual allocation of memory (that's a link to the C++ Core Guidelines, by the way). Now, it's true that you free your memory in the destructor, so it's not all that bad, but I'd really consider using std::unique_ptr on the host and something equivalent on the device (like cuda::memory::unique_ptr from my Modern-C++ CUDA API wrapper library, cuda-api-wrappers); or a GPU-oriented container class like thrust's device vector.
Check for errors
You really must check for errors after you call CUDA API functions. And this is doubly necessary after you launch a kernel. When you call a C++ standard library code, it throws an exception on error; CUDA's runtime API is C-like, and doesn't know about exceptions. It will just fail and set some error variable you need to check.
So, either you write error checks, like in the vectorAdd() sample I linked to above, or you get some library to exhibit more standard-library-like behavior. cuda-api-wrappers and thrust will both do that - on different levels of abstraction; and so will other libraries/frameworks.
You need an array on the host side and one on the device side.
Initialize the host array, then copy it to the device array with cudaMemcpy. The destruction has to be done on the host side again.
An alternative would be to initialize the array from the device; you would need to put __device__ in front of your constructor and then just use malloc.
You cannot dereference a pointer to device memory in host code:
__host__ ode(int n)
{
cudaMalloc((void**)&nn,sizeof(int));
*nn=n; // !!! ERROR
cudaMalloc((void**)&eps,sizeof(float));
size_t size=sizeof(float)*n;
cudaMalloc((void**)&d,size);
}
You will have to copy the values with cudaMemcpy.
(Or use the parameters of a __global__ function.)
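For example, a sketch of the same constructor with the dereference replaced by a copy (everything else left as in the question):

__host__ ode(int n)
{
    cudaMalloc((void**)&nn,sizeof(int));
    cudaMemcpy(nn,&n,sizeof(int),cudaMemcpyHostToDevice); // copy instead of *nn=n
    cudaMalloc((void**)&eps,sizeof(float));
    size_t size=sizeof(float)*n;
    cudaMalloc((void**)&d,size);
}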
I'm currently trying to work with vectors/deques of structures. A simple example of the structure...
struct job {
int id;
int time;
};
I want to be able to search through the structure to find the job that matches the time, remove it from the structure, and continue to check for other ids in that structure. Sample code...
std::vector<job> jobs;
std::deque<job> started;
for (unsigned int i = 0; i < jobs.size(); i++)
{
if (jobs.at(i).time == time)
{
started.push_back(jobs.at(i));
jobs.erase(jobs.begin() + i);
i--;
}
}
time++;
This works how I want it to, but it also seems very hacky, since I'm adjusting the index whenever I delete, and I think that's simply because I'm not as knowledgeable as I should be about data structures. Is anyone able to give me some advice?
NOTE - I don't think this is a duplicate of what this post has been tagged as, since I'm not looking to do something more efficiently with what I already have. To me, it seems efficient enough, considering I'm reducing the size of the deque each time I get what I need from it. What I was hoping for is some advice on figuring out the best data structure for what I'm attempting to do with deques, which are likely not meant to be handled as I'm handling them.
I could also be wrong, and my usage is fine; it just seems off to me.
Well, I always knew that this talk would come in handy! The message here is "know your STL algorithms". With that, let me introduce you to std::stable_partition.
One thing you can do is use just one single vector, as follows:
using namespace std;
vector<job> jobs;
// fill the vector with jobs
auto startedJobsIter = stable_partition(begin(jobs), end(jobs),
    [time](job const &_job) { return _job.time == time; });
Now, everything between begin(jobs) and startedJobsIter satisfies the condition, while everything from startedJobsIter to end(jobs) does not. A sketch of how to move the first range into your deque follows below.
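For instance, a sketch of that follow-up step, assuming the jobs vector and started deque from the question (add #include <iterator> for std::make_move_iterator):

// Append the partitioned-off jobs to `started`, then drop them from `jobs`:
started.insert(started.end(),
               std::make_move_iterator(begin(jobs)),
               std::make_move_iterator(startedJobsIter));
jobs.erase(begin(jobs), startedJobsIter);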
Edit
If you don't care about the relative ordering of the items, you could use std::partition instead, which can be even more performant because it does not preserve the relative order of the elements in the original vector, but it will still divide them into the two parts.
Edit 2
Here's an adaptation for older C++ standards:
struct job_time_predicate {
public:
job_time_predicate(int time) : time_(time) { }
bool operator()(job const &the_job) { return the_job.time == time_; }
private:
int time_;
};
int main()
{
using namespace std;
int time = 10;
vector<job> jobs;
// fill that vector
vector<job>::iterator startedJobsIter =
stable_partition(jobs.begin(), jobs.end(), job_time_predicate(time));
}
My question is pretty simple. I have a vector of values (threads here, but that's irrelevant) and I want to iterate through them. However, there are two versions of the code that look the same to me, but only the second one works. I want to know why.
Version 1 (Does not compile)
int main(){
int someValue = 5;
vector<std::thread *> threadVector;
threadVector.resize(20);
for (int i = 0; i < 20; i++) {
threadVector[i] = new std::thread(foo, std::ref(someValue));
}
for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
*it->join(); // *********Notice this Line*********
}
system("pause"); // I know I shouldn't be using this
}
Version 2 (Does work)
int main(){
int someValue = 5;
vector<std::thread *> threadVector;
threadVector.resize(20);
for (int i = 0; i < 20; i++) {
threadVector[i] = new std::thread(foo, std::ref(someValue));
}
for (std::vector<std::thread *>::iterator it = threadVector.begin(); it != threadVector.end(); ++it) {
(*it)->join(); // *********Notice this Line*********
}
system("pause"); // I know I shouldn't be using this
}
This is an issue with operator precedence.
*it->join();
is parsed as:
*(it->join());
because -> binds more tightly than the unary *. Since the elements are std::thread*, it->join() tries to call join() on a pointer, which doesn't compile (and even if it did, join() returns void, which cannot be dereferenced). Writing (*it)->join() first dereferences the iterator to get the std::thread* and then calls join() through that pointer.
Taking it as a challenge, I just dipped my toes into C++11 for the first time. I found that you can achieve the same thing without any dynamic allocation of std::thread objects:
#include <iostream>
#include <thread>
#include <vector>
void function()
{
std::cout << "thread function\n";
}
int main()
{
std::vector<std::thread> ths;
ths.push_back(std::move(std::thread(&function)));
ths.push_back(std::move(std::thread(&function)));
ths.push_back(std::move(std::thread(&function)));
while (!ths.empty()) {
std::thread th = std::move(ths.back());
ths.pop_back();
th.join();
}
}
This works because std::thread has a constructor and an assignment operator taking an rvalue reference, making it movable. Further, all containers have gained support for storing movable objects, and they move rather than copy them on reallocation. Read some online articles about this C++11 feature; it's too broad to explain here, and I also don't know it well enough.
About the concern you raised that threads have a memory cost: I don't think your approach is an optimization. Rather, the dynamic allocation itself has overhead, both in memory and in performance. For small objects, the memory overhead of one or two pointers plus possibly some padding is enormous. I wouldn't be surprised if a std::thread object were the size of a single pointer, giving you an overhead of more than 100%.
Note that this only concerns the std::thread object. The memory required for the actual thread, in particular its stack, is a different issue. However, std::thread objects and actual threads don't have a 1:1 relation, and dynamic allocation of the std::thread object doesn't change anything there either.
If you're still afraid that the reallocations are too expensive, you could reserve a suitable size up front to avoid them (see the sketch below). However, if that really is an issue, then you are creating and terminating threads far too often, and that will dwarf the overhead of shifting a few small objects around. Consider using a thread pool.
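For completeness, a sketch of the same program with reserve() and emplace_back() (both standard since C++11), which avoids reallocation and the redundant std::move of a temporary:

#include <thread>
#include <vector>

void function(); // same thread function as defined above

int main()
{
    std::vector<std::thread> ths;
    ths.reserve(3);                  // no reallocation while pushing
    for (int i = 0; i < 3; ++i)
        ths.emplace_back(&function); // construct each thread in place
    for (std::thread &th : ths)
        th.join();
}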