I'm implementing animations in my model loader, which uses Assimp, with C++/OpenGL for rendering. I've been following this tutorial closely: http://ogldev.atspace.co.uk/www/tutorial38/tutorial38.html. I did not follow it completely, as there were some bits that I disagreed with code-wise, so I adapted it. I also don't use any of the maths components the author uses; I use glm instead. The problem is that my program only runs some of the time: sometimes it starts and crashes instantly, and at other times it simply runs as normal.
A few things to take into account:
Before animations/bone loading were added, the model loader worked completely fine and models were loaded without any crash whatsoever;
Models with NO bones still load just fine; it only becomes a problem when models with bones are being loaded.
Please note that NOTHING from the bones is being rendered. I haven't even started allocating the bones to vertex attributes; not even the shaders are modified for this.
Everything is being run on a single thread; there is no multi-threading... yet.
So, naturally I took to this bit of code which actually loaded the bones. I've debugged the application and found that the problems lie mostly around here:
Mesh* processMesh(uint meshIndex, aiMesh *mesh)
{
    vector<VertexBoneData> bones;
    // .. getting other mesh data
    if (pAnimate)
    {
        for (uint i = 0; i < mesh->mNumBones; i++)
        {
            uint boneIndex = 0;
            string boneName(mesh->mBones[i]->mName.data);
            auto it = pBoneMap.find(boneName);
            if (it == pBoneMap.end())
            {
                boneIndex = pNumBones;
                BoneInfo bi;
                auto tempMat = mesh->mBones[i]->mOffsetMatrix;
                pBoneInfo[boneIndex].boneOffset = to_glm_mat4(tempMat);
                pBoneMap[boneName] = boneIndex;
            }
            else boneIndex = pBoneMap[boneName];
            for (uint j = 0; j < mesh->mBones[i]->mNumWeights; j++)
            {
                uint vertexID = mesh->mBones[i]->mWeights[j].mVertexId;
                float weit = mesh->mBones[i]->mWeights[j].mWeight;
                bones.at(vertexID).addBoneData(boneIndex, weit);
            }
        }
    }
}
In the last line the author used the [] operator to access elements, but I decided to use .at() for range-checking. The function to_glm_mat4 is defined thus:
glm::mat4 to_glm_mat4(const aiMatrix4x4 &m)
{
    glm::mat4 to;
    to[0][0] = m.a1; to[1][0] = m.a2;
    to[2][0] = m.a3; to[3][0] = m.a4;
    to[0][1] = m.b1; to[1][1] = m.b2;
    to[2][1] = m.b3; to[3][1] = m.b4;
    to[0][2] = m.c1; to[1][2] = m.c2;
    to[2][2] = m.c3; to[3][2] = m.c4;
    to[0][3] = m.d1; to[1][3] = m.d2;
    to[2][3] = m.d3; to[3][3] = m.d4;
    return to;
}
I also had to change VertexBoneData since it used raw arrays which I thought flawed:
struct VertexBoneData
{
    vector<uint> boneIDs;
    vector<float> weights;
    void reset(); // body not shown
    void addBoneData(unsigned int boneID, float weight)
    {
        for (uint i = 0; i < boneIDs.size(); i++)
        {
            if (weights.at(i) == 0.0) // SEG FAULT HERE
            {
                boneIDs.at(i) = boneID;
                weights.at(i) = weight;
            }
        }
    }
};
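Two things are worth checking in the snippets above. If boneIDs and weights start out empty, the loop in addBoneData never runs. And bones.at(vertexID) can only succeed if bones was sized to the vertex count beforehand. Whether either applies depends on code that isn't shown. A minimal sketch of a fixed-capacity alternative follows, assuming four influences per vertex; kMaxBonesPerVertex and makeBoneData are names made up for this example, not taken from the tutorial or from Assimp:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical limit: four bone influences per vertex (an assumption of this sketch).
constexpr std::size_t kMaxBonesPerVertex = 4;

struct VertexBoneData
{
    std::array<unsigned int, kMaxBonesPerVertex> boneIDs{}; // zero-initialised
    std::array<float, kMaxBonesPerVertex> weights{};        // zero-initialised

    // Stores the influence in the first free slot; returns false if all slots are taken.
    bool addBoneData(unsigned int boneID, float weight)
    {
        for (std::size_t i = 0; i < weights.size(); ++i)
        {
            if (weights[i] == 0.0f)
            {
                boneIDs[i] = boneID;
                weights[i] = weight;
                return true;
            }
        }
        return false;
    }
};

// Sizes the per-vertex container up front so that every vertex ID is a valid index.
std::vector<VertexBoneData> makeBoneData(std::size_t numVertices)
{
    return std::vector<VertexBoneData>(numVertices);
}
```

With this layout, makeBoneData(mesh->mNumVertices) would give one entry per vertex before any weights are added, and std::array keeps the fixed size without raw arrays.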
Now, I'm not entirely sure what is causing the crash, but what baffles me most is that sometimes the program runs (implying that the code isn't necessarily the culprit). So I decided to do a debug-smashdown which involved me inspecting each bone (I skipped some; there are loads of bones!) and found that AFTER all the bones have been loaded I would get this very strange error:
No source available for "drm_intel_bo_unreference() at 0x7fffec369ed9"
and sometimes I would get this error:
Error in '/home/.../: corrupted double-linked list (not small): 0x00000 etc ***
and sometimes I would get a seg fault from glm regarding a vec4 instantiation;
and sometimes... my program runs without ever crashing!
To be fair, implementing animations may just be too harsh on my laptop, so maybe it's a CPU/GPU problem: it's unable to process so much data in one gulp, which results in this crash. My theory is that since it's unable to process that much data, the data is never allocated to the vectors.
I'm not using any multi-threading whatsoever, but it has crossed my mind. I figure that the CPU may be unable to process so much data at once, hence the chance runs. Perhaps I should implement threading so that the bone loading is done on another thread, or better, use a mutex: I found that when I step through the application slowly in the debugger, the program runs, which makes sense if each task is being broken down into chunks, and that is technically what a mutex does.
For the sake of the argument, and no mockery avowed, my technical specs:
Ubuntu 15.04 64-bit
Intel i5 dual-core
Intel HD 5500
Mesa 10.5.9 (OpenGL 3.3)
Programming on Eclipse Mars
I thus ask, what the hell is causing these intel_drm errors?
I've reproduced this issue and found it may have been a problem with the lack of multi-threading when it comes to loading bones. I decided to move the bone-loading code into its own function, as prescribed in the aforesaid tutorial. What I later did was:
if (pAnimate) {
    std::thread t1([&] {
        loadBones(meshIndex, mesh, bones);
    });
    t1.join(); // a std::thread must be joined or detached before it is destroyed
}
The lambda above uses [&] to capture everything by reference, so no copies are created. To prevent any external forces from 'touching' the data within the loadBones(..) function, I've installed a mutex within the function like so:
void ModelLoader::loadBones(uint meshIndex, const aiMesh *mesh, std::vector<VertexBoneData> &bones)
{
    std::mutex mut;
    std::lock_guard<std::mutex> lock(mut);
    // load bones
}
This is only a quick and dirty fix. It might not work for everyone, and there's no guarantee the program will run crash-less.
Here are some testing results:
Sans threading & mutex: program runs 0 out of 3 times in a row
With threading; sans mutex: program runs 2 out of 3 times in a row
With threading & mutex: program runs 3 out of 3 times in a row
If you're using Linux, remember to link pthread as well as including <thread> and <mutex>. Suggestions on thread-optimisation are welcome!
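One point on the mutex placement: a std::mutex declared inside the function is a fresh object on every call, so two threads calling loadBones at the same time would each lock their own mutex and would not exclude each other. A minimal sketch follows, assuming the goal is to serialise writes to data shared between calls. BoneStore, its members and loadOnTwoThreads are hypothetical names and not part of the loader above:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the loader: the mutex is a member, so every call shares it.
class BoneStore
{
public:
    // Appends `count` weights; safe to call from several threads at once.
    void loadBones(std::size_t count, float weight)
    {
        std::lock_guard<std::mutex> lock(mut_); // same mutex for all callers
        for (std::size_t i = 0; i < count; ++i)
            weights_.push_back(weight);
    }

    std::size_t size()
    {
        std::lock_guard<std::mutex> lock(mut_);
        return weights_.size();
    }

private:
    std::mutex mut_;
    std::vector<float> weights_;
};

// Runs two loader threads and joins both before returning the element count.
std::size_t loadOnTwoThreads(std::size_t perThread)
{
    BoneStore store;
    std::thread t1([&] { store.loadBones(perThread, 0.5f); });
    std::thread t2([&] { store.loadBones(perThread, 0.25f); });
    t1.join(); // a std::thread must be joined (or detached) before it is destroyed
    t2.join();
    return store.size();
}
```

If only one thread ever touches the data, as in the single t1 above, the lock has nothing to guard against and can be dropped.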
When trying to fill holes in a mesh with a border that is highly complex, the application takes 20 minutes in the hole filling call.
It can be any of the calls shown here.
The code I'm using is this:
int main()
{
    std::ifstream input("V:/tobehealed2.off");
    Triangle_mesh mesh;
    input >> mesh;
    std::vector<std::vector<int>> indices(mesh.num_faces());
    std::vector<Point_3> vertices(mesh.num_vertices());
    int i = 0;
    for (auto& p : mesh.points()) {
        vertices[i++] = p;
    }
    i = 0;
    for (auto& f : mesh.faces()) {
        std::vector<int> triangle(3);
        int j = 0;
        for (auto v : mesh.vertices_around_face(mesh.halfedge(f))) {
            triangle[j++] = v;
        }
        indices[i++] = triangle;
    }
    CGAL::Polygon_mesh_processing::repair_polygon_soup(vertices, indices);
    CGAL::Polygon_mesh_processing::orient_polygon_soup(vertices, indices);
    CGAL::Polygon_mesh_processing::polygon_soup_to_polygon_mesh(vertices, indices, mesh);
    CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
    bool hasHoles = true;
    std::vector<face_descriptor> face_out;
    std::vector<vertex_descriptor> vertex_out;
    while (hasHoles) {
        hasHoles = false;
        for (auto& hh : mesh.halfedges()) {
            if (mesh.is_border(hh)) {
                hasHoles = true;
                CGAL::Polygon_mesh_processing::triangulate_and_refine_hole(mesh, hh, std::back_inserter(face_out), std::back_inserter(vertex_out));
            }
        }
    }
    CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
    CGAL::Surface_mesh_simplification::Count_stop_predicate<Triangle_mesh> stop(60000);
    int r = CGAL::Surface_mesh_simplification::edge_collapse(mesh, stop);
    std::ofstream out2("V:/healed.off");
    out2 << mesh;
}
Application takes over 20 minutes in the call to triangulate_and_refine_hole.
Tested model is available for download here:
My goal is just to be able to check beforehand if the model has a hole so complex the closing of it will take several minutes, so I can skip the hole filling attempt. Also, if there is a way to exit the function call after some threshold time it would be nice.
The size of the model doesn't matter so much. If I use a mesh 3 times larger, it can fill a not-so-complex-hole in just a few seconds.
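One cheap proxy for "how complex is this hole" is the number of edges on its border, which can be counted by walking the border cycle before calling the fill routine. The sketch below assumes that fill time grows with border size, which matches the observation above but is not verified here. It uses a plain array in place of the mesh: nextOnBorder, borderLength, shouldFillHole and maxBorderEdges are made-up names, and the threshold is a value to tune, not a CGAL parameter. If Triangle_mesh is a CGAL::Surface_mesh, the walk would presumably follow mesh.next(h) from a border halfedge until it returns to the start.

```cpp
#include <cstddef>
#include <vector>

// Counts the edges of the border cycle that contains `start`.
// nextOnBorder[h] is the index of the halfedge following h along the same border
// (a stand-in for a mesh query; the layout is an assumption of this sketch).
std::size_t borderLength(std::size_t start, const std::vector<std::size_t>& nextOnBorder)
{
    std::size_t length = 0;
    std::size_t h = start;
    do
    {
        ++length;
        h = nextOnBorder[h];
        // The second condition stops the walk on malformed input that never closes.
    } while (h != start && length <= nextOnBorder.size());
    return length;
}

// Decides whether a hole is small enough to be worth filling.
bool shouldFillHole(std::size_t start, const std::vector<std::size_t>& nextOnBorder,
                    std::size_t maxBorderEdges)
{
    return borderLength(start, nextOnBorder) <= maxBorderEdges;
}
```

Holes that fail the check could be skipped or logged, leaving the expensive call for borders below the chosen threshold.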
Also, if there is a way to exit the function call after some threshold time it would be nice.
What always works is to start the task in another thread and monitor that thread, killing it if necessary after some time.
It really wouldn't be a pretty solution, though. CGAL is still a maintained library, so I'm fairly sure you are just providing it with some incorrect data. Or are you sure that it actually hangs?
36 MB is quite a solid size for a model, and hole-fixing is a task that grows with model complexity, if I recall correctly.
Ah well, if you can generally afford to wait 20 minutes for a repaired model, threading is the way to go. It will just run in the background.
If you cannot afford for it to take that long, then it is not so easy. Either you find a significantly better implementation (not that likely), or you will have to make some trade-offs: either simplify the model or live with less correct hole fixes (assuming there are some heuristic algorithms for this task).
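The "run it in another thread and monitor it" idea can be sketched with the standard library. Standard C++ has no portable way to kill a thread, so this version only stops waiting after the timeout and leaves the worker running in the background. runWithTimeout is a hypothetical helper, not a CGAL facility:

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Runs `task` on a detached worker thread and waits at most `timeout` for it.
// Returns true if the task finished in time. On timeout the worker is NOT killed;
// it keeps running in the background until it finishes on its own.
bool runWithTimeout(std::function<void()> task, std::chrono::milliseconds timeout)
{
    std::packaged_task<void()> packaged(std::move(task));
    std::future<void> done = packaged.get_future();
    std::thread(std::move(packaged)).detach();
    return done.wait_for(timeout) == std::future_status::ready;
}
```

Because the worker may outlive the caller, the task should own its data, for example a copy of the mesh, rather than capture local variables by reference. The result would then be used only when the function returns true.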
Reason for re-posting:
Originally I got only one reply, that only pointed out that the title was exaggerated. Hence trying again, maybe more people will see this question this time as I really do not know where else to look... I will make sure to delete the original question to avoid duplication, and keep this new one instead. I'm not trying to spam the forum.
Feel free to remove the text above upon editing, I just wanted to explain why I'm re-posting - but it's not really a part of the question.
So, the original question was:
I have a few functions in my program that run extremely slowly in Debug mode in Visual Studio Community 2015. They are functions to "index" the verts of 3D models.
Normally, I'm prepared for Debug mode to be a little slower, maybe 2-3 times slower. But...
In Release mode, the program starts and indexes the models in about 2-3 seconds. Perfect.
In Debug mode, however, it takes over 7 MINUTES for my program to actually respond, start rendering and take input. It is stuck indexing one model for over seven minutes. During this time the program is completely frozen.
The same model loads and indexes in "Release" mode in less than 3 seconds. How is it possible that it takes so unbelievably long in Debug?
Both Debug & Release modes are the standard out of the box modes. I don't recall changing any of the settings in either of them.
Here's the code that's slowing the program down in Debug mode:
// Main Indexer Function
void indexVBO_TBN(
std::vector<glm::vec3> &in_vertices,
std::vector<glm::vec2> &in_uvs,
std::vector<glm::vec3> &in_normals,
std::vector<glm::vec3> &in_tangents,
std::vector<glm::vec3> &in_bitangents,
std::vector<unsigned short> & out_indices,
std::vector<glm::vec3> &out_vertices,
std::vector<glm::vec2> &out_uvs,
std::vector<glm::vec3> &out_normals,
std::vector<glm::vec3> &out_tangents,
std::vector<glm::vec3> &out_bitangents){
    int count = 0;
    // For each input vertex
    for (unsigned int i = 0; i < in_vertices.size(); i++) {
        // Try to find a similar vertex in out_vertices, out_uvs, out_normals, out_tangents & out_bitangents
        unsigned int index;
        bool found = getSimilarVertexIndex(in_vertices[i], in_uvs[i], in_normals[i], out_vertices, out_uvs, out_normals, index);
        if (found) {
            // A similar vertex is already in the VBO, use it instead !
            out_indices.push_back((unsigned short)index);
            // Average the tangents and the bitangents
            out_tangents[index] += in_tangents[i];
            out_bitangents[index] += in_bitangents[i];
        } else {
            // If not, it needs to be added in the output data.
            out_vertices.push_back(in_vertices[i]);
            out_uvs.push_back(in_uvs[i]);
            out_normals.push_back(in_normals[i]);
            out_tangents.push_back(in_tangents[i]);
            out_bitangents.push_back(in_bitangents[i]);
            out_indices.push_back((unsigned short)out_vertices.size() - 1);
        }
    }
}
And then the 2 little "helper" functions it uses (is_near() and getSimilarVertexIndex()):
// Returns true if v1 can be considered equal to v2
bool is_near(float v1, float v2){
    return fabs( v1-v2 ) < 0.01f;
}
bool getSimilarVertexIndex( glm::vec3 &in_vertex, glm::vec2 &in_uv, glm::vec3 &in_normal,
    std::vector<glm::vec3> &out_vertices, std::vector<glm::vec2> &out_uvs, std::vector<glm::vec3> &out_normals,
    unsigned int &result){
    // Lame linear search
    for (unsigned int i = 0; i < out_vertices.size(); i++) {
        if (is_near(in_vertex.x, out_vertices[i].x) &&
            is_near(in_vertex.y, out_vertices[i].y) &&
            is_near(in_vertex.z, out_vertices[i].z) &&
            is_near(in_uv.x, out_uvs[i].x) &&
            is_near(in_uv.y, out_uvs[i].y) &&
            is_near(in_normal.x, out_normals[i].x) &&
            is_near(in_normal.y, out_normals[i].y) &&
            is_near(in_normal.z, out_normals[i].z)
        ) {
            result = i;
            return true;
        }
    }
    return false;
}
All credit for the functions above goes to:
Could this be a:
Visual Studio Community 2015 issue?
VSC15 Debug Mode issue?
Slow Code? (But it's only slow in Debug?!)
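Whatever the build mode, the "lame linear search" makes the indexer quadratic in the number of vertices, and any per-access overhead in Debug is multiplied by that. A minimal sketch of a map-based lookup follows. It is a different technique from the code above: it snaps coordinates to a 0.01 grid, so two values less than 0.01 apart can still land in different cells, unlike is_near. It also handles positions only; UVs and normals would have to be added to the key. Vec3, VertexKey, makeKey and indexPositions are names made up for this example:

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

struct Vec3 { float x, y, z; }; // stand-in for glm::vec3 to keep the sketch self-contained

// Key built by snapping each coordinate to a 0.01 grid (assumed tolerance).
using VertexKey = std::tuple<long, long, long>;

VertexKey makeKey(const Vec3& v)
{
    const float cell = 0.01f;
    return VertexKey(std::lround(v.x / cell), std::lround(v.y / cell), std::lround(v.z / cell));
}

// Deduplicates positions in O(n log n): each vertex costs one map lookup
// instead of a scan over everything emitted so far.
void indexPositions(const std::vector<Vec3>& in,
                    std::vector<unsigned int>& outIndices,
                    std::vector<Vec3>& outVertices)
{
    std::map<VertexKey, unsigned int> seen;
    for (const Vec3& v : in)
    {
        const VertexKey key = makeKey(v);
        auto it = seen.find(key);
        if (it != seen.end())
        {
            outIndices.push_back(it->second); // vertex already emitted, reuse its index
        }
        else
        {
            const unsigned int newIndex = static_cast<unsigned int>(outVertices.size());
            outVertices.push_back(v);
            seen.emplace(key, newIndex);
            outIndices.push_back(newIndex);
        }
    }
}
```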
There are multiple things that will/might be optimized:
iterating a vector with indices [] is slower than using iterators; in debug, this certainly is not optimized away but in release it might
additionally, accessing a vector via [] is slow because of runtime checks and debugging features when being in debug mode; this can be fairly easily seen when you go to the implementation of operator[]
push_back and size might also have some additional checks that fall away when using release mode
So, my main guess would be that you use [] too much. It might be even faster in release if you change the iteration to use real iterators. So, instead of:
for (unsigned int i = 0; i < in_vertices.size(); i++) {
you could write:
for(auto& vertex : in_vertices)
This indirectly uses iterators. You could also explicitly write:
for(auto vertexIt = in_vertices.begin(); vertexIt != in_vertices.end(); ++vertexIt)
auto& vertex = *vertexIt;
Obviously, this is longer code that seems less readable and has no practical advantage, unless you need the iterator for some other functions.
I am trying to implement a Binary Search in a compute shader with HLSL. It's not a classic Binary Search as the search key as well as the array values are float. If there is no matching array value to the search key, the search is supposed to return the last index (minIdx and maxIdx match at this point). This is the worst case for classic Binary Search as it takes the maximum number of operations, I am aware of this.
So here's my problem:
My implementation looks like this:
uint BinarySearch (Texture2D<float> InputTexture, float key, uint minIdx, uint maxIdx)
{
    uint midIdx = 0;
    while (minIdx <= maxIdx)
    {
        midIdx = minIdx + ((maxIdx + 1 - minIdx) / 2);
        if (InputTexture[uint2(midIdx, 0)] == key)
        {
            // this might be a very rare case
            return midIdx;
        }
        // determine which subarray to search next
        else if (InputTexture[uint2(midIdx, 0)] < key)
        {
            // as we have a decreasingly sorted array, we need to change the
            // max index here instead of the min
            maxIdx = midIdx - 1;
        }
        else if (InputTexture[uint2(midIdx, 0)] > key)
        {
            minIdx = midIdx;
        }
    }
    return minIdx;
}
This leads to my video driver crashing on program execution. I don't get a compile error.
However, if I use an if instead of the while I can execute it and the first iteration works as expected.
I already did a couple of searches and I suspect this might have something to do with dynamic looping in a compute shader. But I have no prior experience with compute shaders and only a little experience with HLSL, which is why I feel kind of lost.
I am compiling this with cs_5_0.
Could anyone please explain what I am doing wrong or at least point me to some documentation/explanation? Anything that can get me started on solving and understanding this would be super-appreciated!
DirectCompute shaders are still subject to the Timeout Detection & Recovery (TDR) behavior in the drivers. This basically means if your shader takes more than 2 seconds, the driver assumes the GPU has hung and resets it. This can be challenging with DirectCompute where you intentionally want the shader to run a long while (much longer than rendering usually would). In this case it may be a bug, but it's something to be aware of.
With Windows 8.0 or later, you can allow long-running shaders by using D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT when you create the device. This will, however, apply to all shaders not just DirectCompute so you should be careful about using this generally.
For special-purpose systems, you can also use registry keys to disable TDRs.
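On the possible bug: with while (minIdx <= maxIdx) and minIdx = midIdx, the range stops shrinking once minIdx equals maxIdx and the value there is greater than the key, so the loop never exits. That alone would be enough to trigger a TDR. Below is a CPU-side C++ sketch of the same search that always terminates. A std::vector stands in for the texture row, and binarySearchDescending is a name made up here. The HLSL version would presumably need the same two properties: loop only while minIdx < maxIdx, and keep the upper midpoint.

```cpp
#include <cstddef>
#include <vector>

// Searches a DESCENDING sorted array. Returns the index of `key` if present,
// otherwise the index where the range collapses (minIdx == maxIdx).
// Precondition: values is non-empty and minIdx <= maxIdx < values.size().
std::size_t binarySearchDescending(const std::vector<float>& values, float key,
                                   std::size_t minIdx, std::size_t maxIdx)
{
    while (minIdx < maxIdx) // strict: stop as soon as one candidate is left
    {
        // Upper midpoint, so midIdx > minIdx and `minIdx = midIdx` always makes progress.
        const std::size_t midIdx = minIdx + (maxIdx - minIdx + 1) / 2;
        if (values[midIdx] == key)
            return midIdx;
        if (values[midIdx] < key)
            maxIdx = midIdx - 1; // safe: midIdx >= minIdx + 1 >= 1, so no unsigned wrap
        else
            minIdx = midIdx;
    }
    return minIdx;
}
```

Each iteration either returns or strictly narrows the range, so the iteration count is bounded by the range length even for keys that are not in the array.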
I just wonder how to convert the following OpenMP program to an OpenCL program.
The parallel section of the algorithm implemented using OpenMP looks like this:
#pragma omp parallel
int thread_id = omp_get_thread_num();
//double mt_probThreshold = mt_nProbThreshold_;
double mt_probThreshold = nProbThreshold;
int mt_nMaxCandidate = mt_nMaxCandidate_;
double mt_nMinProb = mt_nMinProb_;
int has_next = 1;
std::list<ScrBox3d> mt_detected;
ScrBox3d sample;
while(has_next) {
#pragma omp critical
{ // '{' is very important and define the block of code that needs lock.
// Don't remove this pair of '{' and '}'.
if(piter_ == box_.end()) {
has_next = 0;
} else{
sample = *piter_;
} // '}' is very important and define the block of code that needs lock.
this->SetSample(&sample, thread_id);
//UpdateSample(sample, thread_id); // May be necesssary for more sophisticated features
sample._prob = (float)this->Prob( true, thread_id, mt_probThreshold);
//sample._prob = (float)_clf->LogLikelihood( thread_id);
InsertCandidate( mt_detected, sample, mt_probThreshold, mt_nMaxCandidate, mt_nMinProb );
#pragma omp critical
{ // '{' is very important and define the block of code that needs lock.
// Don't remove this pair of '{' and '}'.
if(mt_detected_.size()==0) {
mt_detected_ = mt_detected;
//mt_nProbThreshold_ = mt_probThreshold;
nProbThreshold = mt_probThreshold;
} else {
for(std::list<ScrBox3d>::iterator it = mt_detected.begin();
it!=mt_detected.end(); ++it)
InsertCandidate( mt_detected_, *it, /*mt_nProbThreshold_*/nProbThreshold,
mt_nMaxCandidate_, mt_nMinProb_ );
} // '}' is very important and define the block of code that needs lock.
}//parallel section end
My question is: can this section be implemented with OpenCL?
I followed a series of OpenCL tutorials and understood how it works. I was writing the code in .cu files (I had previously installed the CUDA toolkit), but in this case the situation is more complicated, because a lot of header files, template classes and object-oriented programming are used.
How could I convert this section implemented in OpenMP to OpenCL?
Should I create a new .cu file?
Any advice could help.
Thanks in advance.
Using the VS profiler I noticed that most of the execution time is spent in the InsertCandidate() function, so I'm thinking about writing a kernel to execute this function on the GPU. The most expensive operation in this function is a for loop. But as can be seen, each iteration contains 3 if statements, and this can lead to divergence, resulting in serialization, even if executed on the GPU.
for( iter = detected.begin(); iter != detected.end(); iter++ )
if( nCandidate == nMaxCandidate-1 )
nProbThreshold = iter->_prob;
if( box._prob >= iter->_prob )
if( nCandidate >= nMaxCandidate && box._prob <= nMinProb )
nCandidate ++;
In conclusion, can this program be converted to OpenCL?
It may be possible to convert your sample code to OpenCL; however, I spotted a couple of issues with doing so.
There doesn't seem to be much parallel execution to begin with. More workers may not help at all.
Adding work to process during execution is a fairly recent feature in OpenCL. You would have to either use OpenCL 2.0, or know in advance how much work will be added and pre-allocate memory to store the new data structures. The calls to InsertCandidate may be the part which "can't" be converted to OpenCL.
If the function is large enough, you may be able to port the calls to this->Prob(...) instead. You need to be able to cache up 'a bunch' of calls by storing the parameters in a suitable data structure. By 'a bunch' I mean at least hundreds but ideally thousands or more. Again, this is only worth it if this->Prob() is constant for all calls, and complex enough to be worth the round-trip to the OpenCL device and back.
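The "cache up a bunch of calls" idea can be sketched on the host side without any OpenCL code. Everything here is hypothetical: ProbRequest and its fields stand in for whatever Prob() actually needs, and the evaluate callback stands in for the step that would copy the buffer to the device, run a kernel and read the results back.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical parameter record for one Prob() call; the real fields depend on ScrBox3d.
struct ProbRequest
{
    float x, y, z;   // assumed box position
    float threshold; // assumed per-call threshold
};

// Collects requests until there are enough to justify one device round-trip.
class ProbBatcher
{
public:
    explicit ProbBatcher(std::size_t batchSize) : batchSize_(batchSize) {}

    // Queues one request; returns true when the batch is full and should be flushed.
    bool add(const ProbRequest& request)
    {
        pending_.push_back(request);
        return pending_.size() >= batchSize_;
    }

    // Hands all queued requests to `evaluate` in one call and clears the queue.
    template <typename Evaluate>
    std::vector<float> flush(Evaluate evaluate)
    {
        std::vector<float> results = evaluate(pending_);
        pending_.clear();
        return results;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t batchSize_;
    std::vector<ProbRequest> pending_;
};
```

The results come back in request order, so the host could then run InsertCandidate over them sequentially, keeping the list handling on the CPU.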
I have read the question posted earlier that seemed to have the same error I am getting when using WaitForMultipleObjects, but I believe that mine is different. I am using several threads to compute different parts of a Mandelbrot set. The program compiles and produces the correct result about 3 out of 5 times, but sometimes I get an error that says "Access violation when writing to ... (some memory location that is different every time)". Like I said, sometimes it works, sometimes it doesn't. I put breakpoints before and after the WaitForMultipleObjects call and have concluded that it must be the culprit. I just don't know why. Here is the code...
int max = size();
if (max == 0) //Return false if there are no threads
    return false;
for(int i=0;i<max;++i) //Resume all threads
    ResumeThread(threads[i]);
HANDLE *first = &threads[0]; //Create a pointer to the first thread
WaitForMultipleObjects(max,first,TRUE,INFINITE);//Wait for all threads to finish
Update: I tried using a for loop and WaitForSingleObject and the problem still persisted.
Update 2: Here is the thread function. It looks kind of ugly with all of the pointers.
unsigned MandelbrotSet::tfcn(void* obj)
{
    funcArg *args = (funcArg*) obj;
    int count = 0;
    vector<int> dummy;
    while(args->set->counts.size() <= args->row)
        args->set->counts.push_back(dummy);
    for(int y = 0; y < args->set->nx; ++y)
    {
        complex<double> c(args->set->zCorner.real() + (y * args->set->dx), args->set->zCorner.imag() + (args->row * args->set->dy));
        count = args->set->iterate(c);
    }
    return 0;
}
Resolved: Alright everyone, I found the issue. You were right. It was in the thread itself. The problem was that all of the threads were trying to add rows to my 2D vector of counts (counts.push_back(dummy)). I guess the race condition was taking effect and each thread assumed it should add more rows even when it wasn't necessary. Thanks for the help.
I solved the problem. I edited the question and stated what was wrong, but I will do it again here. I was encountering a race condition when I tried to push a vector onto the 2D vector in my thread function. This is controlled by the while loop, and when each thread is executed, it believes that it needs to push more vectors onto the 2D vector called counts. I moved this loop to the constructor and simply push all of the necessary vectors onto counts upon creation. Thanks for helping me look in a different direction!
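The fix described above, allocating every row before any thread starts, can be sketched with std::thread. computeRows is a hypothetical name, and the value written per cell is a placeholder for the result of iterate(c):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Fills a rows x cols grid. Every row is allocated BEFORE any thread starts,
// so workers only write to their own row and never resize the outer vector.
std::vector<std::vector<int>> computeRows(std::size_t rows, std::size_t cols)
{
    std::vector<std::vector<int>> counts(rows, std::vector<int>(cols, 0));
    std::vector<std::thread> workers;
    workers.reserve(rows);
    for (std::size_t row = 0; row < rows; ++row)
    {
        workers.emplace_back([&counts, row, cols] {
            for (std::size_t x = 0; x < cols; ++x)
                counts[row][x] = static_cast<int>(row * cols + x); // placeholder for iterate(c)
        });
    }
    for (std::thread& worker : workers)
        worker.join(); // wait for all rows before the grid is read
    return counts;
}
```

Since no thread calls push_back on the shared outer vector, there is no reallocation that could invalidate another thread's row while it is being written.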