I am carrying out a 3D matrix by 1D vector multiplication within a class in C++. All variables are contained within the class. When I create one instance of the class on a single thread and carry out the multiplication 100 times, the multiplication operation takes ~0.8ms each time.
When I create 4 instances of the class, each on a separate thread, and run the multiplication operation 25 times on each, the operation takes ~1.7ms each time. The operations on each thread are being carried out on separate data, and are running on separate cores.
As expected, however, the overall time to complete the 100 matrix multiplications is reduced with 4 threads over a single thread.
My questions are:
1) What is the cause of the slowdown in the multiplication operation when multiple threads are used?
2) Is there any way in which the operation can be sped up?
EDIT:
To clarify the problem:
The overall time to carry out 100 matrix products does decrease when I split them over 4 threads - threading does make the overall program faster.
The timing in question is the actual matrix multiplication within the already created threads (see code). This time excludes thread creation, and memory allocation & deletion. This is the time that doubles when I use 4 threads rather than 1. The overall time to carry out all multiplications halves when I use 4 threads. My question is why are the individual matrix products slower when running on 4 threads rather than 1.
Below is a code sample. It is not my actual code, but a simplified example I have written to demonstrate the problem.
Multiply.h
class Multiply
{
public:
Multiply ();
~Multiply ();
void
DoProduct ();
private:
double *a;
};
Multiply.cpp
Multiply::Multiply ()
{
a = new double[100 * 100 * 100];
std::memset(a,1,100*100*100*sizeof(double));
}
void
Multiply::DoProduct ()
{
double *result = new double[100 * 100];
double *b = new double[100];
std::memset(result,0,100*100*sizeof(double));
std::memset(b,1,100*sizeof(double));
//Timer starts here, i.e. excluding memory allocation and thread creation and the rest
auto start_time = std::chrono::high_resolution_clock::now ();
//matrix product
for (int i = 0; i < 100; ++i)
for (int j = 0; j < 100; ++j)
{
double t = 0;
for (int k = 0; k < 100; ++k)
t = t + a[k + j * 100 + i * 100 * 100] * b[k];
result[j + 100 * i] = result[j + 100 * i] + t;
}
//Timer stops here, i.e. before memory deletion
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time: " << time << std::endl;
delete []result;
delete []b;
}
Multiply::~Multiply ()
{
delete[] a;
}
Main.cpp
void
threadWork (int iters)
{
Multiply *m = new Multiply ();
for (int i = 0; i < iters; i++)
{
m->DoProduct ();
}
}
void
main ()
{
int numProducts = 100;
int numThreads = 1; //4;
std::thread t[numThreads];
auto start_time = std::chrono::high_resolution_clock::now ();
for (int i = 0; i < numThreads; i++)
t[i] = std::thread (threadWork, numProducts / numThreads);
for (int i = 0; i < n; i++)
t[i].join ();
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time total: " << time << std::endl;
}
Async and thread calls are quite time expensive compare to ordinary function calls. So pre-launch threads and create a thread pool. You push your functions as tasks and request the thread pool to tether these tasks from the prority-queue.
The tasks could be set with priorities to execute in proper order to avoid use and hence delays arising due to use of mutexes and locks
You are launching too many threads , keep it below the maximum allowed by your system to avoid bottlenecks.
Related
I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU that performs a simple vector addition repeatedly on dynamic arrays. It recreates an issue that I am having with a bit more complicated example. I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that that memory accesses are performed on data that is local to the cpu that computes its task. A control example uses a simple parallel_for with a static_partitioner to perform the computation while my intended example invokes per task_arena a task which invokes a parallel_for to compute the vector addition of the designated region, i.e. the half of the dynamic arena that was initialized before in the corresponding NUMA node. This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arenas that will invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain to me what happens and why this performance penalty is so harsh. A direction to resources would also be helpful as I am doing this for a university project.
#include <iostream>
#include <iomanip>
#include <tbb/tbb.h>
#include <vector>
int main(){
std::vector<int> numa_indexes = tbb::info::numa_nodes();
std::vector<tbb::task_arena> arenas(numa_indexes.size());
std::size_t numa_nodes = numa_indexes.size();
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
std::size_t size = 10000000;
std::size_t part_size = std::ceil((float)size/numa_nodes);
double * A = (double *) malloc(sizeof(double)*size);
double * B = (double *) malloc(sizeof(double)*size);
double * C = (double *) malloc(sizeof(double)*size);
double * D = (double *) malloc(sizeof(double)*size);
//DATA INITIALIZATION
for(unsigned k = 0; k < numa_indexes.size(); k++)
arenas[k].execute(
[&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
C[i] = D[i] = 0;
A[i] = B[i] = 1;
}, tbb::static_partitioner());
});
//PARALLEL ALGORITHM
tbb::tick_count t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++)
tbb::parallel_for(static_cast<std::size_t>(0), size,
[&](std::size_t i)
{
C[i] += A[i] + B[i];
}, tbb::static_partitioner());
tbb::tick_count t1 = tbb::tick_count::now();
std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
//TASK ARENA & PARALLEL ALGORITHM
t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++){
for(unsigned k = 0; k < numa_indexes.size(); k++){
arenas[k].execute(
[&](){
for(unsigned i=0; i<numa_indexes.size(); i++)
task_groups[i].wait();
task_groups[k].run([&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
D[i] += A[i] + B[i];
});
});
});
}
t1 = tbb::tick_count::now();
std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;
double sum1 = 0;
double sum2 = 0;
for(int i = 0; i<size; i++){
sum1 += C[i];
sum2 += D[i];
}
std::cout << sum1 << std::endl;
std::cout << sum2 << std::endl;
return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group as found in #AlekseiFedotov's suggested resources, but the issue still remains.
Part of the provided example where the work with arenas happens is not one-to-one match to the example from the docs, "Setting the preferred NUMA node" section.
Looking further into the specification of the task_arena::execute() method, we can find out that the task_arena::execute() is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run() method reveals that its method is asynchronous, i.e. returns immediately, not waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes two parallel loops within arenas one by one, in a serial manner so to say. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of the TBB, can be found here.
EDIT answer for the EDITED question:
See the comment to the question.
The waiting should happen after work is submitted, not before it. Also, no need to go to another arena's task group to do the wait within the loop, just submit the work in the NUMA loop via arena[i].execute( [i, &] { task_group[i].run( [i, &] { /*...*/ } ); } ), then, in another loop, wait for each task_group within corresponding task_arena.
Please note how I capture the NUMA loop iteration by copy. Otherwise, the code might be referring the wrong data inside the lambda body.
So I want to optimize the sum of a really big array and in order to do that I have wrote a multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2 or 3 or 4 threads...
Can someone explain me why this happens?
(Also I've only started coding in C++ this semester, until then I only knew C, so I'm sorry for possible dumb mistakes)
This is the thread code
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
*localSum += v[i];
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;
// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
v[i] = i;
// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);
// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();
vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++){
int start = i * v.size() / numThreads;
myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i],v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));
// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
globalSum += localSum[i];
cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;
exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. Vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results. Or a very large array.
2- Related, the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. for the thread to help it can't even be a hyperthread on the same core, it would need to be on a different core entirely (as the hyperthreads would both compete for the floating point units).
To test this, you can try to create all the treads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localsum variables are sharing the same cache line. Better would be to make the localsum variable on the stack and put the result into the array instead of adding directly into the array: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
If for some reason, you need to keep the sum observable to others in that array, pad the localsum vector entries like this so they don't share the same cache line:
struct localsumentry {
double sum;
char pad[56];
};
I am very new to modern C++ library, and trying to learn how to use std::async to perform some operations on a big pointer array. The sample code I have written is crashing at the point where the async task is launched.
Sample code:
#include <iostream>
#include <future>
#include <tuple>
#include <numeric>
#define maximum(a,b) (((a) > (b)) ? (a) : (b))
class Foo {
bool flag;
public:
Foo(bool b) : flag(b) {}
//******
//
//******
std::tuple<long long, int> calc(int* a, int begIdx, int endIdx) {
long sum = 0;
int max = 0;
if (!(*this).flag) {
return std::make_tuple(sum, max);
}
if (endIdx - begIdx < 100)
{
for (int i = begIdx; i < endIdx; ++i)
{
sum += a[i];
if (max < a[i])
max = a[i];
}
return std::make_tuple(sum, max);
}
int midIdx = endIdx / 2;
auto handle = std::async(&Foo::calc, this, std::ref(a), midIdx, endIdx);
auto resultTuple = calc(a, begIdx, midIdx);
auto asyncTuple = handle.get();
sum = std::get<0>(asyncTuple) +std::get<0>(resultTuple);
max = maximum(std::get<1>(asyncTuple), std::get<1>(resultTuple));
return std::make_tuple(sum, max);
}
//******
//
//******
void call_calc(int*& a) {
auto handle = std::async(&Foo::calc, this, std::ref(a), 0, 10000);
auto resultTuple = handle.get();
std::cout << "Sum = " << std::get<0>(resultTuple) << " Maximum = " << std::get<1>(resultTuple) << std::endl;
}
};
//******
//
//******
int main() {
int* nums = new int[10000];
for (int i = 0; i < 10000; ++i)
nums[i] = rand() % 10000 + 1;
Foo foo(true);
foo.call_calc(nums);
delete[] nums;
}
Can anyone help me to identify why does it crash?
Is there any better approach to apply parallelism to operations on a big pointer array?
The fundamental problem is your code wants to launch more than array size / 100 threads. That means more than 100 threads. 100 threads won't do anything good; they'll thrash. See std::thread::hardware_concurrency, and in general don't use raw async or thread in production applications; write task pools and splice together futures and the like.
That many threads is both extremely inefficient and could exhaust system resources.
The second problem is you failed to calculate the average of 2 values.
The average of begIdx and endIdx is not endIdx/2 but rather:
int midIdx = begIdx + (endIdx-begIdx) / 2;
Live example.
You'll notice I discovered the problem with your program by adding intermediate output. In particular, I had it print out the ranges it was working on, and I noticed it was repeating ranges. This is known as "printf debugging", and is pretty powerful especially when step-based debugging isn't (with this many threads, stepping through the code will be brain-numbing)
The problem with async calls is that they are not done in some universe where an infinite amount of tasks can be executed all at the exact same time.
Async calls are executed on a processor which has a certain amount of processors/cores and the async calls have to be lined up to be executed on them.
Now here is where problems of synchronization, and the problems of blocking, starvation, ... and other multithreaded issues come into play.
Your algorithm is very difficult to follow, as it is spawning tasks inside already created tasks. Something is happening, but it is difficult to follow.
I would solve this problem by:
Creating a vector of results (which will be from async threads)
In a loop execute the async calls (assigning the result to the vector)
Afterwards loop through the reuslts vector gathering the results
I currently have a project going on, in which a large dataset has to be created using HDF5. Now, the naive implementation is all nice and dandy, but very slow. The slow part is in the calculation (10x slower than write) which I cannot speed up anymore, but maybe parallelization is possible.
I guess I could use a simple #pragma omp parallel for but the dataspace.write(..) method should be squential for speed reasons (maybe it doesnt matter). See this picture for example.
It should be noted that, because of the dimensionality, the write function uses a chunked layout of the same size as the buffer (in reality around 1Mb)
/*
------------NAIVE IMPLEMENTATION-----------------
|T:<calc0><W0><calc1><W1><calc2><W2>............|
|-----------------------------------------------|
|----------PARALLEL IMPLEMENTATION--------------|
|-----------------------------------------------|
|T0:<calc0----><W0><calc4>.....<W4>.............|
|T1:<calc1---->....<W1><calc5->....<W5>.........|
|T2:<calc2--->.........<W2>calc6-->....<W6>.....|
|T3:<calc3----->...........<W3><calc7-->...<W7>.|
------------DIFFERENT IMPLEMENTATION-------------
i.e.: Queuesize=4
T0:.......<W0><W1><W2><W3><W4><W5><W6>..........|
T1:<calc0><calc3>.....<calc6>...................|
T2:<calc1>....<calc4>.....<calc7>...............|
T3:<calc2>........<calc5>.....<calc8>...........|
T Thread
<calcn---> Calculation time
<Wn> Write data n. Order *important*
. Waiting
*/
Codeexample:
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
double calculate(float *buf, const struct options *opts) {
// dummy function just to get a time reference
double res = 0;
for (size_t i = 0; i < 10000; i++)
res += std::sin(i);
return 1 / (1 + res);
}
struct options {
size_t idx[6];
};
class Dataspace {
public:
void selectHyperslab(){}; // selects region in disk space
void write(float *buf){}; // write buf to selected disk space
};
int main() {
size_t N = 6;
size_t dims[6] = {4 * N, 4 * N, 4 * N, 4 * N, 4 * N, 4 * N},
buf_offs[6] = {4, 4, 4, 4, 4, 4};
// dims: size of each dimension, multiple of 4
// buf_offs: size of buffer in each dimension
// Calcuate buffer size and allocate
// the size of the buffer is usually around 1Mb
// and not a float but a compund datatype
size_t buf_size = buf_offs[0];
for (auto off : buf_offs)
buf_size *= off;
std::unique_ptr<float[]> buf{new float[buf_size]};
struct options opts; // options parameters, passed to calculation fun
struct Dataspace dataspace; // dummy Dataspace. Supplied by HDF5
size_t i = 0;
size_t idx0, idx1, idx2, idx3, idx4, idx5;
auto t_start = std::chrono::high_resolution_clock::now();
std::cout << "[START]" << std::endl;
for (idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
for (idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
for (idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
for (idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
for (idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
for (idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
i++;
opts.idx[0] = idx0;
opts.idx[1] = idx1;
opts.idx[2] = idx2;
opts.idx[3] = idx3;
opts.idx[4] = idx4;
opts.idx[5] = idx5;
dataspace.selectHyperslab(/**/); // function from HDF5
calculate(buf.get(), &opts); // populate buf with data
dataspace.write(buf.get()); // has to be sequential
}
std::cout << "[DONE] " << i << " calls" << std::endl;
std::chrono::duration<double> diff =
std::chrono::high_resolution_clock::now() - t_start;
std::cout << "Time: " << diff.count() << std::endl;
return 0;
}
Code should work right out of the box.
I already took a quick look into OpenMP, but I can't wrap my head around yet. Can anyone give me a hint/working example? I am not good with parallelization, but wouldn't a writer-thread with a bufferqueue work? Or is using OpenMP overkill anyways and pthreads suffice?
Any help is kindly appreciated,
cheers
Your first parallel implementation idea is by far the simplest to implement. Making a queue and a dedicated I/O thread might perform better, but is significantly more difficult to implement using OpenMP.
Below is a simple example of how a parallel version could look like. The most important aspects are:
Shared data: Make sure that there is no race condition on any data that is shared among threads. For example each thread must have it's own buf and opts as they are clearly modified in parallel with no restriction. The simplest way is to define the variables locally within a parallel region. Also loop idxn, at least for the inner loops, and i must be defined locally. You cannot compute i like you did - this would create a dependency between each loop iteration and prevent parallelization.
Apply pragma omp for worksharing to the loop. Due to the small amount of iterations in each dimension, it is advisable to apply collapse. This will distribute the work of multiple nested loops. The optimal value for collapse will expose enough parallel work for your the number of threads available for your program, but not create too much overhead or hinder single-thread optimization of inner loops. You might want to try different values.
Protect writing the data with a critical section. Only one thread at a time will enter the section. This is most likely necessary for correctness (depending on how it is implemented in hdf5). Apparently selectHyperslab will control how write will operate, so it must be in the same critical section.
Put together, it could look like this:
#pragma omp parallel
{
// define EVERYTHING that is modified locally to each thread!
std::unique_ptr<float[]> buf{new float[buf_size]};
struct options opts;
// Try different values for collapse if performance is not satisfactory
#pragma omp for collapse(3)
for (size_t idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
for (size_t idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
for (size_t idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
for (size_t idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
for (size_t idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
for (size_t idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
size_t i = idx5 + idx4 * dims[5] + ...;
opts.idx[0] = idx0;
opts.idx[1] = idx1;
opts.idx[2] = idx2;
opts.idx[3] = idx3;
opts.idx[4] = idx4;
opts.idx[5] = idx5;
calculate(buf.get(), &opts); // populate buf with data
#pragma omp critical
{
// I do assume that this function selects where/how data
// will be written so you *must* protected it
// Only one thread can do this at a time.
dataspace.selectHyperslab(/**/); // function from HDF5
dataspace.write(buf.get()); // has to be sequential
}
}
}
I have the following code, which confuses me a lot:
float OverlapRate(cv::Mat& model, cv::Mat& img) {
if ((model.rows!=img.rows)||(model.cols!=img.cols)) {
return 0;
}
cv::Mat bgr[3];
cv::split(img, bgr);
int counter = 0;
float b_average = 0, g_average = 0, r_average = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_average += bgr[0].at<uchar>(i, j);
g_average += bgr[1].at<uchar>(i, j);
r_average += bgr[2].at<uchar>(i, j);
}
}
}
b_average = b_average / counter;
g_average = g_average / counter;
r_average = r_average / counter;
counter = 0;
float b_stde = 0, g_stde = 0, r_stde = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
}
}
}
b_stde = std::sqrt(b_stde / counter);
g_stde = std::sqrt(g_stde / counter);
r_stde = std::sqrt(r_stde / counter);
return (b_stde + g_stde + r_stde) / 3;
}
void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
results[index] = OverlapRate(model, img);
}
int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
int recog_value = -1;
clock_t start = clock();
std::thread threads[10];
std::map<int, float> results;
for(int i=0; i<10; i++)
{
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
}
for(int i=0; i<10; i++)
threads[i].join();
float min_score = 1000;
int min_index = -1;
for(auto& it:results)
{
if (it.second < min_score) {
min_score = it.second;
min_index = it.first;
}
}
clock_t end = clock();
clock_t t = end - start;
printf ("It took me %d clicks (%f seconds) .\n",t,((float)t)/CLOCKS_PER_SEC);
recog_value = min_index;
}
What the above code does is just simple optical character recognition. I have one optical character as an input and compare it with 0 - 9 ten standard character models to get the most similar one, and then output the recognized value.
When I execute the above code without using ten threads running at the same time, the time is 7ms. BUT, when I use ten threads, it drops down to 1 or 2 seconds for a single optical character recognition.
What is the reason?? The debug information tells that thread creation consumes a lot of time, which is this code:
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
Why? Thanks.
Running multiple threads is useful in only 2 contexts: you have multiple hardware cores (so the threads can run simultaneously) OR each thread is waiting for IO (so one thread can run while another thread is waiting for IO, like a disk load or network transfer).
Your code is not IO bound, so I hope you have 10 cores to run your code. If you don't have 10 cores, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting for 1 or 2 cores and their cache space, then the caches will be "thrashing" and give you 10-100x slower performance.
Try running benchmarking your code 10 different times, with N=1 to 10 threads and see how it performs.
(There is one more reason the have multiple threads, which is when the cores support hyper threading. The OS will"pretend" that 1 core has 2 virtual processors, but with this you don't get 2x performance. You get something between 1x and 2x. But in order to get this partial boost, you have to run 2 threads per core)
Not always is efficient to use threads. If you use threads on small problem, then managing threads cost more time and resources then solving the problem. You must have enough work for threads and good managing work over threads.
If you want to know how many threads you can use on problem or how big must be problem, find Isoeffective functions (psi1, psi2, psi3) from theory of parallel computers.