When trying to fill holes in a mesh whose hole border is highly complex, the application takes 20 minutes in the hole-filling call.
It can be any of the calls shown here.
The code I'm using is this:
// Headers and typedefs are not shown in the original post; something like the
// following is assumed (exact header names can vary between CGAL versions):
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Surface_mesh.h>
#include <CGAL/boost/graph/graph_traits_Surface_mesh.h>
#include <CGAL/Polygon_mesh_processing/repair_polygon_soup.h>
#include <CGAL/Polygon_mesh_processing/orient_polygon_soup.h>
#include <CGAL/Polygon_mesh_processing/polygon_soup_to_polygon_mesh.h>
#include <CGAL/Polygon_mesh_processing/connected_components.h>
#include <CGAL/Polygon_mesh_processing/triangulate_hole.h>
#include <CGAL/Polygon_mesh_processing/repair.h>
// ...plus whichever header declares remove_self_intersections in your CGAL version
#include <CGAL/Surface_mesh_simplification/edge_collapse.h>
#include <CGAL/Surface_mesh_simplification/Policies/Edge_collapse/Count_stop_predicate.h>
#include <fstream>
#include <vector>
typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::Point_3 Point_3;
typedef CGAL::Surface_mesh<Point_3> Triangle_mesh;
typedef boost::graph_traits<Triangle_mesh>::face_descriptor face_descriptor;
typedef boost::graph_traits<Triangle_mesh>::vertex_descriptor vertex_descriptor;
int main()
{
std::ifstream input("V:/tobehealed2.off");
Triangle_mesh mesh;
input >> mesh;
//////////////
std::vector<std::vector<int>> indices(mesh.num_faces());
std::vector<Point_3> vertices(mesh.num_vertices());
int i = 0;
for (auto& p : mesh.points()) {
vertices[i++] = p;
}
i = 0;
for (auto& f : mesh.faces()) {
std::vector<int> triangle(3);
int j = 0;
for (auto v : mesh.vertices_around_face(mesh.halfedge(f))) {
triangle[j++] = v;
}
indices[i++] = triangle;
}
mesh.clear();
CGAL::Polygon_mesh_processing::repair_polygon_soup(vertices, indices);
CGAL::Polygon_mesh_processing::orient_polygon_soup(vertices, indices);
CGAL::Polygon_mesh_processing::polygon_soup_to_polygon_mesh(vertices, indices, mesh);
CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
bool hasHoles = true;
std::vector<face_descriptor> face_out;
std::vector<vertex_descriptor> vertex_out;
while (hasHoles) {
hasHoles = false;
for (auto& hh : mesh.halfedges()) {
if (mesh.is_border(hh)) {
hasHoles = true;
CGAL::Polygon_mesh_processing::triangulate_and_refine_hole(mesh, hh, std::back_inserter(face_out), std::back_inserter(vertex_out));
break;
}
}
face_out.clear();
vertex_out.clear();
}
CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
CGAL::Polygon_mesh_processing::remove_isolated_vertices(mesh);
CGAL::Polygon_mesh_processing::remove_self_intersections(mesh);
CGAL::Surface_mesh_simplification::Count_stop_predicate<Triangle_mesh> stop(60000);
int r = CGAL::Surface_mesh_simplification::edge_collapse(mesh, stop);
mesh.collect_garbage();
std::ofstream out2("V:/healed.off");
out2 << mesh;
}
The application takes over 20 minutes in the call to triangulate_and_refine_hole.
The tested model is available for download here:
https://drive.google.com/file/d/1t2dwJBs5vNpg2jLOVprHEY-8tCHvCErK/view?usp=sharing
My goal is just to be able to check beforehand whether the model has a hole so complex that closing it will take several minutes, so I can skip the hole-filling attempt. Also, if there is a way to exit the function call after some threshold time, that would be nice.
The size of the model doesn't matter so much: with a mesh 3 times larger, it can fill a not-so-complex hole in just a few seconds.
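(For reference, a rough sketch of one possible pre-check, assuming Triangle_mesh is a CGAL::Surface_mesh as in the code above: walk each border cycle once, count its halfedges, and skip the filling attempt when a hole's boundary exceeds some heuristic limit.)
#include <set>
bool has_overly_complex_hole(const Triangle_mesh& mesh, std::size_t max_border_edges)
{
    std::set<Triangle_mesh::Halfedge_index> visited;
    for (auto h : mesh.halfedges()) {
        if (!mesh.is_border(h) || visited.count(h))
            continue;
        // Walk this border cycle once and count its halfedges.
        std::size_t cycle_length = 0;
        auto hc = h;
        do {
            visited.insert(hc);
            ++cycle_length;
            hc = mesh.next(hc);
        } while (hc != h);
        if (cycle_length > max_border_edges)   // purely heuristic threshold
            return true;
    }
    return false;
}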
What always works is to start the task in another thread and monitor that thread, killing it if necessary after some time.
It really wouldn't be a pretty solution, though. CGAL is still a maintained library, so I'm fairly sure you are just providing it some incorrect data. Or are you sure at all that it actually hangs?
36 MB is quite a solid size for a model, and hole fixing is a task that grows with model complexity, if I recall correctly.
EDIT:
Ah well, if you can generally afford to wait 20 minutes before having a repaired model, threading is the way to go. It will just run in the background.
If you cannot afford for it to take that long, well, then it is not so easy. Either you find a significantly better implementation (not that likely), or you will have to make some trade-offs: either simplify the model or live with less correct hole fixes (assuming there are heuristic algorithms for this task).
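For illustration, a rough sketch of the threading approach with std::async; fill_all_holes() is an assumed helper wrapping the hole-filling loop from the question, and note that the running CGAL call itself cannot be cancelled, you can only stop waiting for it.
#include <chrono>
#include <future>
// Returns true if hole filling finished within the time budget.
// Caveat: on timeout the worker keeps running, and the future's destructor
// will still block until it finishes, so in practice the future would be
// handed off to something long-lived rather than destroyed here.
bool fill_holes_with_budget(Triangle_mesh& mesh, std::chrono::seconds budget)
{
    auto task = std::async(std::launch::async, [&mesh] { fill_all_holes(mesh); });
    return task.wait_for(budget) == std::future_status::ready;
}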
Imagine I have something like this:
void color(int a)
{
if (a > 10)
{
return;
}
square[a].red();
sleep(1second);
color(a+1);
}
while (programIsRunning())
{
color(1);
updateProgram();
}
but with something that actually requires a recursive function.
How can I call this recursive function so that it colors the squares one by one? On its own it is too fast, and since the program is updated every frame, the squares get colored instantly when I want them to be colored one by one (with a delay).
sleep() blocks the current thread. That makes it a bad candidate for producing human-perceptible delays from the main thread.
You "could" have a thread that only handles that process, but threads are expensive, and creating/managing one just to color squares in a sequence is completely overkill.
Instead, you could do something along these lines: every time the program updates, check whether it's the appropriate time to color the next square.
const std::chrono::duration<double> color_delay{0.1};
auto last_color_time = std::chrono::steady_clock::now();
bool coloring_squares = true;
while (programIsRunning()) {
if (coloring_squares) {
auto now = std::chrono::steady_clock::now();
// This will "catch up" as needed.
while (now - last_color_time >= color_delay) {
last_color_time += color_delay;
coloring_squares = color_next_square();
}
}
updateProgram();
}
How color_next_square() works is up to you. You could possibly "pre-bake" a list of squares to color using your recursive function, and iterate through it.
Also, obviously, this example just uses the code you posted. You'll want to organise all this as part of updateProgram(), possibly in some sort of class SquareAnim {}; stateful wrapper.
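For illustration, a minimal sketch of that pre-baking idea, reusing the square[] array and red() call from the question (all other names are placeholders):
#include <cstddef>
#include <vector>
std::vector<int> baked;          // square indices in the order they should be colored
std::size_t next_square = 0;
void bake(int a)                 // same recursion as color(), but records instead of coloring
{
    if (a > 10) return;
    baked.push_back(a);
    bake(a + 1);
}
bool color_next_square()         // returns false once the animation is finished
{
    if (next_square >= baked.size()) return false;
    square[baked[next_square++]].red();
    return next_square < baked.size();
}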
N.B. If your program has little jitter, i.e. it has consistent time between updates, and the delay is low, using the following instead can lead to a slightly smoother animation:
if (now - last_color_time >= color_delay) {
last_color_time = now;
// ...
Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then, I remembered threading. I thought I could make the loop run even faster if I split it in 4 parts, one for each core of my machine. So this is what I tried to do:
void doYourThing(int size,int threadNumber,int numOfThreads) {
int start = (threadNumber - 1) * size / numOfThreads;
int end = threadNumber * size / numOfThreads;
for (int i = start; i < end; i++) {
//Calculations...
}
}
int main(void) {
int size = 100000;
int numOfThreads = 4;
int start = 0;
int end = size / numOfThreads;
std::thread coreB(doYourThing, size, 2, numOfThreads);
std::thread coreC(doYourThing, size, 3, numOfThreads);
std::thread coreD(doYourThing, size, 4, numOfThreads);
for (int i = start; i < end; i++) {
//Calculations...
}
coreB.join();
coreC.join();
coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1) Do my threads really run on different cores? If that's true, I would expect a greater increase in speed. More specifically, I assumed it would take close to 1/4 of the initial time.
2) If they don't, should I use even more threads to split the work? Will that make my loop faster or slower?
(1).
The question @François Andrieux asked is a good one. In the original code there is a well-structured for-loop, and if you use -O3 optimization the compiler might be able to vectorize the computation. That vectorization alone will give you a speedup.
Also, it depends on what the critical path in your computation is. According to Amdahl's law, the possible speedup is limited by the non-parallelisable part. You might also check whether the computation touches some variable protected by a lock; in that case time could also be spent spinning on the lock.
(2). To find out the total number of cores and threads on your computer you can use the lscpu command, which shows the core and thread information for your machine/server.
(3). It is not necessarily true that more threads will give better performance.
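As a small aside (not part of the answer above): the thread count can also be queried from C++ itself via std::thread::hardware_concurrency() instead of hard-coding 4 threads. A sketch reusing doYourThing() from the question:
#include <thread>
#include <vector>
int main()
{
    int size = 100000;
    unsigned hw = std::thread::hardware_concurrency(); // may return 0 if unknown
    int numOfThreads = hw ? static_cast<int>(hw) : 4;  // fall back to a guess
    std::vector<std::thread> workers;
    for (int t = 2; t <= numOfThreads; ++t)            // threads 2..N
        workers.emplace_back(doYourThing, size, t, numOfThreads);
    doYourThing(size, 1, numOfThreads);                // main thread handles chunk 1
    for (auto& w : workers)
        w.join();
}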
There is a header-only library on GitHub which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results into another vector. In that case, all you need to do is say
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
for (int i = from; i < to; ++i) {
// Calculate vecOut[i]
}
}
int main(void) {
int size = 100000;
MyVector vecIn(size), vecOut(size);
// Load vecIn vector with input data...
std::vector<std::pair<int, int>> ranges {{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}};
Lazy::runForAll(ranges,
[&](auto indexPair) {
doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
});
// Now the results are in vecOut
}
README.md gives further examples on parallel execution which you might find useful.
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) in the end.
So I come up with below approach:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); final result is immediately in the array; performant.
Reading up on various approaches, it seems atomic is a good candidate, but I am not sure which settings will be most performant in my case. I'm not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties, and give each one a different random seed. Then you could pool these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
std::vector<SimResult> results(N_runs);
#pragma omp parallel for
for(auto i = 0; i < N_runs; i++)
{
auto sim = Simulator(seed + i);
results[i] = sim.GetResult();
}
return results;
}
Edit: With OpenMP you can choose different scheduling models, which allow you, for example, to dynamically split the work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector. For example, see the Simple Version below.
Update
To accommodate vastly varying calculation times, you should keep your current code, but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
The problem with this is that synchronisation is quite expensive. But it's probably not that expensive compared to the simulation you run, so it shouldn't be too bad.
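If the lock ever did show up as a bottleneck, one alternative sketch (not from this answer) is to claim indices with a std::atomic counter instead of a mutex, reusing vec, Max and Simulator from the question:
#include <atomic>
std::atomic<int> NextIndex{0};   // takes over the role of LastAdded in this sketch
void fill_atomic(int RandSeed)
{
    Simulator sim{RandSeed};
    while (true)
    {
        // fetch_add atomically claims one slot; no two threads get the same index.
        int workingIndex = NextIndex.fetch_add(1);
        if (workingIndex >= Max)
            break;
        vec[workingIndex] = sim.GetResult();
    }
}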
Version 2:
To reduce the amount of synchronisation required, you could acquire blocks to work on instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1, 0, Max / 2);
auto fut2 = std::async(fill,2, Max / 2, Max);
// ...
}
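For completeness, a sketch of the corresponding main() for the lock_guard version from the Update above (a lambda, or std::ref, is needed to pass the mutex reference through std::async):
#include <future>
#include <mutex>
int main()
{
    std::mutex nextItemMutex;
    auto fut1 = std::async(std::launch::async, [&] { fill(1, nextItemMutex); });
    auto fut2 = std::async(std::launch::async, [&] { fill(2, nextItemMutex); });
    fut1.get();
    fut2.get();
    // results are in vec
}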
I have a few functions in my program that run extremely slow in Debug mode, in Visual Studio Community, 2015. They are functions to "index" the verts of 3D models.
Normally, I'm prepared for Debug mode to be a little slower, maybe 2-3 times slower. But...
In Release mode, the program starts and indexes the models in about 2 - 3 seconds. Perfect.
In Debug mode, however, it takes over 7 MINUTES for my program to actually respond, start rendering and take input. It is stuck indexing one model for over seven minutes, and during this time the program is completely frozen.
The same model loads and indexes in "Release" mode in less than 3 seconds. How is it possible that it takes so unbelievably long in Debug?
Both Debug & Release modes are the standard out of the box modes. I don't recall changing any of the settings in either of them.
Here's the code that's slowing the program down in Debug mode:
// Main Indexer Function
void indexVBO_TBN(
std::vector<glm::vec3> &in_vertices,
std::vector<glm::vec2> &in_uvs,
std::vector<glm::vec3> &in_normals,
std::vector<glm::vec3> &in_tangents,
std::vector<glm::vec3> &in_bitangents,
std::vector<unsigned short> & out_indices,
std::vector<glm::vec3> &out_vertices,
std::vector<glm::vec2> &out_uvs,
std::vector<glm::vec3> &out_normals,
std::vector<glm::vec3> &out_tangents,
std::vector<glm::vec3> &out_bitangents){
int count = 0;
// For each input vertex
for (unsigned int i = 0; i < in_vertices.size(); i++) {
// Try to find a similar vertex in out_vertices, out_uvs, out_normals, out_tangents & out_bitangents
unsigned int index;
bool found = getSimilarVertexIndex(in_vertices[i], in_uvs[i], in_normals[i], out_vertices, out_uvs, out_normals, index);
if (found) {
// A similar vertex is already in the VBO, use it instead !
out_indices.push_back((unsigned short)index);
// Average the tangents and the bitangents
out_tangents[index] += in_tangents[i];
out_bitangents[index] += in_bitangents[i];
} else {
// If not, it needs to be added in the output data.
out_vertices.push_back(in_vertices[i]);
out_uvs.push_back(in_uvs[i]);
out_normals.push_back(in_normals[i]);
out_tangents.push_back(in_tangents[i]);
out_bitangents.push_back(in_bitangents[i]);
out_indices.push_back((unsigned short)out_vertices.size() - 1);
}
count++;
}
}
And then the 2 little "helper" functions it uses (isNear() and getSimilarVertexIndex()):
// Returns true if v1 can be considered equal to v2
bool is_near(float v1, float v2){
return fabs( v1-v2 ) < 0.01f;
}
bool getSimilarVertexIndex( glm::vec3 &in_vertex, glm::vec2 &in_uv, glm::vec3 &in_normal,
std::vector<glm::vec3> &out_vertices, std::vector<glm::vec2> &out_uvs, std::vector<glm::vec3> &out_normals,
unsigned int &result){
// Lame linear search
for (unsigned int i = 0; i < out_vertices.size(); i++) {
if (is_near(in_vertex.x, out_vertices[i].x) &&
is_near(in_vertex.y, out_vertices[i].y) &&
is_near(in_vertex.z, out_vertices[i].z) &&
is_near(in_uv.x, out_uvs[i].x) &&
is_near(in_uv.y, out_uvs[i].y) &&
is_near(in_normal.x, out_normals[i].x) &&
is_near(in_normal.y, out_normals[i].y) &&
is_near(in_normal.z, out_normals[i].z)
) {
result = i;
return true;
}
}
return false;
}
All credit for the functions above goes to:
http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-9-vbo-indexing/
Could this be a:
Visual Studio Community 2015 issue?
VSC15 Debug Mode issue?
Slow Code? (But it's only slow in Debug?!)
There are multiple things that will/might be optimized:
iterating a vector with indices ([]) is slower than using iterators; in debug this is certainly not optimized away, while in release it might be
additionally, accessing a vector via [] is slow in debug mode because of runtime checks and debugging features; this can be seen fairly easily if you step into the implementation of operator[]
push_back and size might also have some additional checks that fall away in release mode
So, my main guess would be that you use [] too much. It might even be faster in release if you change the iteration to use real iterators. So, instead of:
for (unsigned int i = 0; i < in_vertices.size(); i++) {
use:
for(auto& vertex : in_vertices)
This indirectly uses iterators. You could also explicitly write:
for(auto vertexIt = in_vertices.begin(); vertexIt != in_vertices.end(); ++vertexIt)
{
auto& vertex = *vertexIt;
Obviously, this is longer code that seems less readable and has no practical advantage, unless you need the iterator for some other functions.
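For illustration, here is the indexing loop from the question rewritten with a range-for as suggested; a running counter is still needed for the parallel in_* arrays, but in_vertices itself is no longer accessed through []:
unsigned int i = 0;
for (auto& vertex : in_vertices) {
    unsigned int index;
    bool found = getSimilarVertexIndex(vertex, in_uvs[i], in_normals[i],
                                       out_vertices, out_uvs, out_normals, index);
    if (found) {
        // A similar vertex is already in the VBO, reuse it.
        out_indices.push_back((unsigned short)index);
        out_tangents[index] += in_tangents[i];
        out_bitangents[index] += in_bitangents[i];
    } else {
        // Otherwise add it to the output data.
        out_vertices.push_back(vertex);
        out_uvs.push_back(in_uvs[i]);
        out_normals.push_back(in_normals[i]);
        out_tangents.push_back(in_tangents[i]);
        out_bitangents.push_back(in_bitangents[i]);
        out_indices.push_back((unsigned short)(out_vertices.size() - 1));
    }
    ++i;
}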
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each sorting one half (the first and the second, respectively) by calling merge_sort() with the corresponding parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the non-threaded run it stays at 50%, yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions; each thread will call merge_sort(..). Is this a problem, both threads calling the same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own and the program run times are as they should now.
Here's my initial merge function (not really sure if it is the initial one; I've probably modified it a few times since, and the array indices might be wrong right now, since I went back and forth between [a,b] and [a,b); this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
//create the 2 new arrays
int *st = new int[m-a+1];
int *dr = new int[c-m+1];
//copy the values
for (int i1 = 0; i1 <= m-a; i1++)
st[i1] = v[a+i1];
for (int i2 = 0; i2 <= c-(m+1); i2++)
dr[i2] = v[m+1+i2];
//merge them back together in sorted order
int is=0, id=0;
for (int i=0; i<=c-a; i++) {
if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
v[a+i] = st[is];
is++;
}
else {
v[a+i] = dr[id];
id++;
}
}
delete[] st;
delete[] dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit, some timings on my 3 GHz dual-core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for the sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.
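For reference, a minimal sketch of wall-clock timing with std::chrono, which is one portable alternative to clock():
#include <chrono>
#include <iostream>
int main()
{
    auto t_start = std::chrono::steady_clock::now();
    // ... run the threaded merge sort here ...
    auto t_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = t_end - t_start;   // wall-clock seconds
    std::cout << "Sort time (s): " << elapsed.count() << '\n';
}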