is if-statement bad for OpenACC?

is if-statement bad for OpenACC? - if-statement

I heard that OpenACC does not handle if-statement efficiently and should try to avoid using it.
For example, it is no good to do something like that (loop with a couple of if-statement) in the device/OpenACC:
for (m=0; m<polygon2.num_vertices; m++) {
polygon2Vertex1 = listOfpolygon2Vertex[m];
if ((m+1) == polygon2.num_vertices){
// last vertex, so we get last and the first vertex
polygon2Vertex2 = listOfpolygon2Vertex[0];
} else {
// not the last vertex, so we get [m] and [m+1] vertex
polygon2Vertex2 = listOfpolygon2Vertex[m+1];
}
result = doIntersect(polygon1Vertex1, polygon1Vertex2, polygon2Vertex1, polygon2Vertex2);
if (result==1){
// found that these 2 edges intersect.
// no need to further check
break;
}
}
is it true? If so, what can I do to handle if-statement in OpenACC?

The problem is with branch divergence within a CUDA warp. Since all threads in a warp execute the same instruction at the same time, if some of the threads take one branch and the rest take another, you've doubled your time. However, it's only a problem within a warp. So if all the threads within the same warp take one branch, and all the threads in another warp take a different branch, then there will be no performance impact.
If you don't know what a warp is, this in old but still good article on the CUDA threading model: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm
With this code, since only the last element takes the true case, the if statement will have very little impact.
I'd would recommend inverting your logic so that the last element case is in the else clause. This is a general optimization not specific to GPUs so that the more common case falls through rather than having to make a jump.

Related

race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
int ic=GetIndex(particles[i]);
#pragma omp atomic update
Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
if(particles[i].ip==3)
{
int ic=GetIndex(particles[i]);
if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
int cnt=0;
#pragma omp atomic capture
cnt=Counter[ic]++;
MIndex[Begin[ic]+cnt]=i;
}
}
If to remove
#pragma omp parallel for
the code works properly and the output results are always the same.
But with this pragma there is some undefined behaviour/race condition in the code, because each time it gives different output results.
How to fix this issue?
Update: The task is the following. Have lots of particles with some random coordinates. Need to output to the array MIndex the indices in the array particles of the particles, which are in each cell (cartesian cube, for example, 1×1×1 cm) of the coordinate system. So, in the beginning of MIndex there should be the indices in the array particles of the particles in the 1st cell of the coordinate system, then - in the 2nd, then - in the 3rd and so on. The order of indices within given cell in the area MIndex is not important, may be arbitrary. If it is possible, need to make this in parallel, may be using atomic operations.
There is a straight way: to traverse across all the coordinate cells in parallel and in each cell check the coordinates of all the particles. But for large number of cells and particles this seems to be slow. Is there a faster approach? Is it possible to travel across the particles array only once in parallel and fill MIndex array using atomic operations, something like written in the code piece above?

You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Count[ic] will go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for(ic < cub3) {
// perhaps do this "vertical" sum into a temporary array
// or prefix-sum within Length before combining across threads?
int pos = sum(Length[0..nthreads-1][ic-1]) + Begin[0][ic-1];
Begin[0][ic] = pos;
for (int t = 1 ; t<nthreads ; t++) {
pos += Length[t][ic]; // prefix-sum across threads for this cube bucket
Begin[t][ic] = pos;
}
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 4096 bytes apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Count[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
Don't overdo it with number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 4096 bytes (too small to parallelize the zeroing of, BTW), it's really not that much.
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for Mindex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough they cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem,
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
Re: your edit
You already loop over the whole array of points doing Length[ic]++;? Seems redundant to do the same histogramming work again with Counter[ic]++;, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 8*8*16 * sizeof(int) = 4096 bytes, just one page, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top

You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speed up if probably going to be abysmal.
EDIT the second line however is not of the right form for an atomic operation, so you have to make it critical. Which is going to make performance even worse.
Also, #Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this can not be parallelized.

Efficiency of 2 vectors versus a vector of structs

I'm working on a C++ project where I need to search through a vector ignoring those that have already been visited. If one has been visited I set its corresponding visited to 1 and ignore it. Which solution is faster?
Solution 1:
vector<string> stringsToVisit;
vector<int> stringVisited;
for (int i = 0; i < stringToVisit.size(); ++i) {
if (stringVisited[i] == 0) {
string current = stringsToVisit[i];
...Do Stuff...
stringVisited[i] = 1;
}
}
or
Solution 2:
struct StringInfo {
string myString;
int visited = 0;
}
vector<StringInfo> stringsToVisit;
for (int i = 0; i < stringsToVisit.size(); ++i) {
if (stringsToVisit[i].visited == 0) {
string current = stringsToVisit[i].myString;
...Do Stuff...
stringsToVisit[i].visited = 1;
}
}

As Bernard notes, the time and memory complexity of both proposed solutions is identical, and the slightly more complex addressing required by the second solution isn't going to slow things down on modern processors. But I disagree with his suggestion that "Solution 2 is likely to be faster." We really don't know enough to even say that it should theoretically be faster, and except perhaps in a few degenerate situations, the difference in actual performance is likely to be unmeasurable.
The first iteration of the loop would indeed likely be slower. The cache is cold, and the first solution requires two cache lines to store the first elements, while the second solution only requires one. But after that, both solutions are doing a forward linear traversal. The CPU is going to have no problem prefetching additional cache lines, so in most situations, that initial extra overhead is unlikely to actually matter too much.
On the other hand, you are writing data while in this loop, so some of the cache lines you access also get marked dirty (meaning their data needs to be written back to a shared cache or main memory eventually, and they get purged from any other cores' caches). In solution 1, depending on sizeof(string) and sizeof(int), only 5-25% of the cache lines are getting marked dirty. Solution 2, however, dirties every single one, so it may actually use more memory bandwidth.
So, some things that might make solution 2 faster are:
The list of strings being processed is extremely short
...Do Stuff... is very complicated (enough so that the cache lines holding the data get purged from L1 cache)
Some things that might make solution 1 equivalent or faster than solution 2:
The list of strings being processed is moderate to large
...Do Stuff... is not very complex, so the cache stays warm
The program is multithreaded, and another thread wants to read data from stringsToVisit at the same time.
The bottom line is, it probably doesn't matter.

First of all, you should profile your code to check if this piece of code is really the bottleneck, and accurately measure the amount of time each solution takes to run. That would give the best results.
Nevertheless, here's my answer:
The time complexity of both solutions are O(n), so we are talking only about constant-factor optimizations here.
Solution 1 requires the lookup of two different memory blocks - stringsToVisit[i] and stringVisited[i] in each loop. This isn't good for CPU caches, as compared to Solution 2, each iteration of the loop accesses a single struct stored in contiguous locations in memory. As such, Solution 2 would perform better.
Solution 2 would need a more complicated indirect memory lookup than Solution 1 to access the visited property of the struct: (base address of stringsToVisit) + (index) * (struct size) + (displacement in struct). Nevertheless, this kind of lookup fits well into most processors' SIB (scale-index-base) addressing, so it will compile to one assembly instruction only, so there would not be much slowness, if any at all. It is worth noting that an optimizing compiler might notice that you're accessing memory sequentially and do optimizations to avoid using SIB addressing totally.
Hence, Solution 2 is likely to be faster.

In GLSL, is it better to branch, or look up from a dummy texture?

To make a long story short, am I better off doing this:
if (normalMappingEnabled)
{
normal = calculateBumpedNormalFromTexture();
}
else
{
normal = somethingMuchEasierToCalculate();
}
Where normalMappingEnabled is a uniform, calculateBumpedNormalFromTexture requires a texture lookup and all of the other math required for normal mapping, and somethingMuchEasierToCalculate requires no texture lookup, and simply spits out the interpolated per-vertex normal.
or this:
normal = calculateBumpedNormalFromTexture();
Where in this case, if I don't need normal mapping, the normal texture is 1x1, containing a single texel that points straight up, producing the same result as if I had just used an interpolated per-vertex normal.
Which is faster on most modern hardware? Is there another solution I haven't considered (other than using 2 different shaders)?

If the condition is the same for all invocated fragments, then no branch divergence will happen. So AFAIK in this case, there will be no performance loss.
The problem is when some threads need to execute one branch and some threads the other branch. Since different instructions can't be executed in parallel (on one processor), both branches would be executed sequentially (some threads would get executed, the other part would wait and then the other part would get executed and the first part would wait).

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
// let the camera know to use this raytracer for probing the scene
cam.setSamplingFunc(getSamplingFunction());
int i, j;
#pragma omp parallel private(i, j)
{
// Construct a ray for each pixel.
#pragma omp for schedule(dynamic, 4)
for (i = 0; i < cam.height(); ++i) {
for (j = 0; j < cam.width(); ++j) {
cam.computePixel(i, j);
}
}
}
}
When reading this question I thought I had found my answer. It talks about the implementation of gclib rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for monte carlo sampling, so i thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops turns out I didn't test this correctly, it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection, however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
// constructors destructors
private:
// this is the array that is being written to, but not read from.
Colour* _sensor; // allocated using new at construction.
}
void Camera::computePixel(int i, int j) const {
Colour col;
// simple code to construct appropriate ray for the pixel
Ray3D ray(/* params */);
col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
_sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?

Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.

The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There is quite a few factors that could affect the performance of your code, one thing that comes to mind is the cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data come from the RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely, that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity, also plays a role. A thread will be scheduled on a particular core, and normally it will be attempted to keep this thread on the same core. If more then one of your threads are running on a single core, they will have to share the same core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?

One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.

I just looked and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see it that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

Better way to copy several std::vectors into 1? (multithreading)

Here is what I'm doing:
I'm taking in bezier points and running bezier interpolation then storing the result in a std::vector<std::vector<POINT>.
The bezier calculation was slowing me down so this is what I did.
I start with a std::vector<USERPOINT> which is a struct with a point and 2 other points for bezier handles.
I divide these up into ~4 groups and assign each thread to do 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread.In the end all the points have to be in 1 continuous vector, before I used multithreading I accessed this directly but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector, in the correct order. This works, but unfortunatly the copy part is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results to the vector. How could I do this way more efficiently?
Thanks

Have all the threads put their results into a single contiguous vector just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice) each is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need/want to ensure that the vector for the result has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).
Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.
std::vector<USERPOINT> inputs; // input data
std::vector<std::vector<POINT> > outputs; // space for output data
const int thread_count = 4;
struct work_packet { // describe the work for one thread
USERPOINT *inputs; // where to get its input
std::vector<POINT> *outputs; // where to put its output
int num_points; // how many points to process
HANDLE finished; // signal when it's done.
};
std::vector<work_packet> packets(thread_count); // storage for the packets.
std::vector<HANDLE> events(thread_count); // storage for parent's handle to events
outputs.resize(inputs.size); // can't resize output after processing starts.
for (int i=0; i<thread_count; i++) {
int offset = i * inputs.size() / thread_count;
packets[i].inputs = &inputs[0]+offset;
packets[i].outputs = &outputs[0]+offset;
packets[i].count = inputs.size()/thread_count;
events[i] = packets[i].done = CreateEvent();
threads[i].process(&packets[i]);
}
// wait for curves to be generated (Win32 style, for the moment).
WaitForMultipleObjects(&events[0], thread_count, WAIT_ALL, INFINITE);
Note that although we have to be sure that the outputs vector doesn't get resized while be operated on by multiple threads, the individual vectors of points in outputs can be, because each will only ever be touched by one thread at a time.

If the simple copy in between things is slower than before you started using Mulitthreading, it's entirely likely that what you're doing simple isn't going to scale to multiple cores. If it's something simple like bezier stuff I suspect that's going to be the case.
Remember that the overhead of making the threads and such has an impact on total run time.
Finally.. for the copy, what are you using? Is it std::copy?

Multithreading is not going to speed up your process. Processing the data in different cores, could.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js