Assume we have an array or vector of length 256(can be more or less) and the number of pthreads to generate to be 4(can be more or less).
I need to figure out how to assign each pthread to a process a section of the vector.
So the following code dispatches the multiple threads.
for(int i = 0; i < thread_count; i++)
{
int *arg = (int *) malloc(sizeof(*arg));
*arg = i;
thread_err = pthread_create(&(threads[i]), NULL, &multiThread_Handler, arg);
if (thread_err != 0)
printf("\nCan't create thread :[%s]", strerror(thread_err));
}
As you can tell from the above code, each thread passes an argument value to the starting function. Where in the case of the four threads, the argument values range from 0 to 3, 5 threads = 0 to 4, and so forth.
Now the starting function does the following:
void* multiThread_Handler(void *arg)
{
int thread_index = *((int *)arg);
unsigned int start_index = (thread_index*(list_size/thread_count));
unsigned int end_index = ((thread_index+1)*(list_size/thread_count));
std::cout << "Start Index: " << start_index << std::endl;
std::cout << "End Index: " << end_index << std::endl;
std::cout << "i: " << thread_index << std::endl;
for(int i = start_index; i < end_index; i++)
{
std::cout <<"Processing array element at: " << i << std::endl;
}
}
So in the above code, the thread whose argument is 0 should process the section 0 - 63(in the case of an array size of 256 and a thread count of 4), the thread whose argument is 1 should process the section 64 - 127, and so forth. The last thread processing 192 - 256.
Each of these four sections should processed in parallel.
Also, the pthread_join() functions are present in the original main code to make sure each thread finishes before the main thread terminates.
The problem is, that the value i in the above for-loop is taking on suspiciously large values. I'm not sure why this would occur since I am fairly new to pthreads.
It seems like sometimes it works perfectly fine and other times and other times, the value of i becomes so large that it causes the program to either abort or presents a segmentation fault.
The problem is indeed a data race caused by lack of synchronization. And the shared variable being used (and modified) by multiple threads is std::cout.
When using streams such as std::cout concurrently, you need to synchronize all operations with a stream by a mutex. Otherwise, depending on the platform and your luck, you might get output from multiple threads messed together (which might sometimes look like printed values being larger than you expect), or you might get the program crashed, or have other sorts of undefined behavior.
// Incorrect Code
unsigned int start_index = (thread_index*(list_size/thread_count));
unsigned int end_index = ((thread_index+1)*(list_size/thread_count));
The above code is critical region is wrong in your above program. as there is no synchronization mechanism has been used so there is data race.This leads to the wrong calculation of start_index and end_index counters and hence we may get wrong(random garbage values) and hence the for loop variable "i" goes on the toss. So you should use the following code to synchronize the critical region of your program.
// Correct Code
s=thread_mutex_lock (&mutexhandle);
start_index = (thread_index*(list_size/thread_count));
end_index = ((thread_index+1)*(list_size/thread_count));
s=thread_mutex_unlock (&mutexhandle);
Related
In my application, I have two threads, a producer (thread 1) and a consumer (thread 2). Each thread has an input and output interface (effectively a pointer to a list) that is connected to a third thread which serves as a router.
When the producer writes, it calls memcpy to copy data into a buffer and pushes the buffer into a list. Meanwhile, the router thread is round-robin searching through all the threads that are connected to it and monitoring their interfaces to see if any thread has data to send out. When it sees that thread 1's list is non-empty, it checks to determine which thread the data is intended for. The data is spliced into the destination thread's (in this case thread 2) input list, at which point thread 2 will malloc some memory, memcpy the data into it and return the pointer to this new region.
For my test, I'm measuring throughput to see how long it takes to send 100k messages of varying sizes. Thread 1 sends data of some size, thread 2 reads it and sends back a small reply message, which thread 1 reads. This would be one complete exchange. In the first test, in thread 1, I'm sending all 100k messages, and then reading 100k replies. In the second test, in thread 1, I'm alternating sending a message and waiting for the reply and repeating 100k times. In both tests, thread 2 is in a loop reading the message and sending a reply. I would expect test 1 to have higher throughput because the threads should spend less time waiting around. However, it has markedly worse throughput than test 2. I've measured how long individual function calls (to read/write) take in the two test cases and they invariably take longer in test 1 (based on the means and medians and no delay) though the numbers are of the same order of magnitude.
When I add a loop doing nothing into thread 1's sending loop in test 1, I see dramatically improved throughput for this case as opposed to not having the delay. My only guess is that adding a delay slows down the producer so the consumer can absorb the data which prevents its input list from growing very large. I'm wondering if there may be other explanations and if so, how I can test for them.
Edit
Unfortunately, my own code is just the test I described above which calls a library that actually performs the reads/writes, creates that third thread etc. It's difficult to make a minimal example out of it because the library is complex and not mine. I provide some pseudocode to illustrate the setup in more detail.
int NUM_ITERATIONS = 100000;
int msg_reply = 2; // size of the reply message in words
int msg_size = 512; // indicates 512 64 bit words
void generate(int iterations, int size, interface* out){
std::vector<long long> vec(size);
for(int i = 0; i < size; i++)
vec[i] = (long long) i;
for(int i = 0; i < iterations; i++)
out->lib_write((char*) vec.data(), size);
}
void receive(int iterations, int size, interface* in){
for(int i = 0; i < iterations; i++)
char* data = in->lib_read(size)
void producer(interface* in, interface* out){
// test 1
start = std::chrono::high_resolution_clock::now();
// write data of size msg_size, NUM_ITERATIONS times to out
generate(NUM_ITERATIONS, msg_size, out);
// read data of size msg_reply, NUM_ITERATIONS times from in
receive(NUM_ITERATIONS, msg_reply, in);
end = std::chrono::high_resolution_clock::now();
// using NUM_ITERATIONS, msg_size and time, compute and print throughput to stdout
print_throughput(end-start, "throughput_0", msg_size);
// test 2
start = std::chrono::high_resolution_clock::now();
for(int j = 0; j < NUM_ITERATIONS; j++){
generate(1, msg_size, out);
receive(1, msg_reply, in);
}
end = std::chrono::high_resolution_clock::now();
print_throughput(end-start, "throughput_1", msg_size);
}
void consumer(interface* in, interface* out){
for(int i = 0; i < 2; i++}{
for(int j = 0; j < NUM_ITERATIONS; j++){
receive(1, msg_size, in);
generate(1, msg_reply, out);
}
}
}
The calls to lib_write() and lib_read() become fairly complex. To elaborate on the description above, the data gets memcpy'd into a buffer and then moved into a list. The interface has a condition variable member and the write calls its notify_one() method. The third thread is looping through all the interface pointers it has and checking to see if their lists are non-empty. If so, the data is spliced from one output list to the destination's input list using the splice() method in std::list. Meanwhile, the consumer calls the lib_read() which waits on the condition variable while the interface is empty, and then memcpy's the data into a new region and returns it.
// note: these will not compile as is. Undefined variables are class members
char * interface::lib_read(size_t * _size){
char * ret;
{
std::unique_lock<std::mutex> lock(mutex);
// packets is an std::list containing the incoming data
while (packets.empty()) {
cv.wait(lock);
}
curr_read_it = packets.begin();
}
size_t buff_size = curr_read_it->size;
ret = (char *)malloc(buff_size);
memcpy((char *)ret, (char *)curr_read_it->data, buff_size);
{
std::unique_lock<std::mutex> lock(mutex);
packets.erase(curr_read_it);
curr_read_it = packets.end();
}
return ret;
}
void interface::lib_write(char * data, int size){
// indicates the destination thread id
long long header = 1;
// buffer is a just an array that's max packet sized
memcpy((char *)buffer.data, &header, sizeof(long long));
memcpy((char *)buffer.data + sizeof(long long), (char *)data, size * sizeof(long long));
std::lock_guard<std::mutex> guard(mutex);
packets.push_back(std::move(buffer));
cv.notify_one();
}
// this is on thread 3
void route(){
do{
// this is a vector containing all the "out" interfaces
for(int i = 0; i < out_ptrs.size(); i++){
interface <long long> * _out = out_ptrs[i];
if(!_out->empty()){
// this just returns the header id (also locks the mutex)
long long dest= _out->get_dest();
// looks up the correct interface based on the id and splices
// a packet into from _out to the appropriate one. Locks mutex
in_ptrs[dest_map[dest]]->splice(_out);
}
}
}while(!done());
I was looking for general advice on what factors may influence multithreading performance and what to test for in order to better understand what was going on.
I talked to some other people and the advice I got that was helpful was to determine if the OS scheduling was the issue (which is what I suspected but was unsure how to test). Essentially, I used taskset and sched_affinity() to force the application to run on one core or on a subset of cores and looked at how they compared to each other and to the unrestricted case.
Based on the restrictions, I got dramatically different results and could see some trends so I'm pretty confident in saying that it's an OS scheduling issue. Different ones can yield better performance under different workloads.
I'm trying to get my head around Windows API threads and thread control. I briefly worked with threads in Java so I know the basic concepts but something that has worked in Java seems to only work halfway in C++.
What I am trying to do is as follows: 2 threads in one process, sharing a common resource(for this case, the common resource is a pair of two global variables int a, b;).
The first thread should acquire a mutex, use rand() to generate pairs of numbers from 0 to 100 until it gets a pair such that b == 2 * a, then release the mutex.
The second thread should then acquire the mutex, and check if the b == 2 * a condition is true for the given values(printing something like "incorrect" in case it is not), then release the mutex so the first thread can get it back. This process of generating an checking pairs of numbers should be repeated quite a few times, say 500/1000 times.
My code is as follows:
Main:
#define INIT_SEED time(NULL)
#define NUMBER_OF_CHECKS 250
int a = 0;
int b = 1;
HANDLE mutexHandle = CreateMutex(NULL, FALSE, NULL);
int main()
{
HANDLE thread1Handle = CreateThread(NULL, NULL, Thread1Behaviour, NULL, NULL, NULL);
Sleep(50);
HANDLE thread2Handle = CreateThread(NULL, NULL, Thread2Behaviour, NULL, NULL, NULL);
WaitForSingleObject(thread1Handle, INFINITE);
WaitForSingleObject(thread2Handle, INFINITE);
return 0;
}
Thread 1 behavior:
DWORD WINAPI Thread1Behaviour( LPVOID _ )
{
srand(INIT_SEED);
for (int i = 0; i < NUMBER_OF_CHECKS; i++)
{
WaitForSingleObject(mutexHandle, INFINITE);
do
{
b = rand() % 100;
a = rand() % 100;
}
while (b != 2 * a);
cout << i << ".\t" << b << " " << a << endl;
ReleaseMutex(mutexHandle);
Sleep(50);
}
return 0;
}
Thread 2 behavior:
DWORD WINAPI Thread2Behaviour( LPVOID _ )
{
for (int i = 0; i < NUMBER_OF_CHECKS; i++)
{
WaitForSingleObject(mutexHandle, INFINITE);
if (b == 2 * a)
cout << i << ".\t" << b << "\t=\t2 * " << a << endl;
else
cout << i << ".\t" << b << "\t=\t2 * " << a << "\tINCORRECT!!!" << endl;
ReleaseMutex(mutexHandle);
Sleep(50);
}
return 0;
}
The implementation is simple enough(I skipped over the handle validity checks to keep the code short, in case a bad handle could be the cause i can add them in, but I imagine that in case of a bad handle everything should just crash & burn, not work but with incorrect outputs).
I remember for working with threads in Java that i used to sleep for some time to make sure the same thread does not reacquire the mutex. However, when I run the above code, it mainly works as intended however, when the number of checks is big enough, somethimes the first thread gets the mutex 2 times in a row, leading to an output like this:
1. 92 46
2. 66 33
1. 66 = 2 * 33
Which means that at the end, the second thread will end up checking the same pair several times:
249. 80 40
248. 80 = 2 * 40
249. 80 = 2 * 40
I have tried changing the sleep timer value with values between 0 and 250 but this remains the case no matter how large the sleep period is. When I put at least 250 it seems to work about half the time.
Also, if I remove the cout in the first thread, the problem becomes 2-3 times worse, with more botched synchronizations.
And one more thing I noticed is that for a certain configuration of sleep timer and cout/no cout in thread 1, the number of times the mutex is immediately reacquired is the same, so this is completely reproducible(at least for me).
Using logic, I got 2 conflicting conclusions:
Since it MOSTLY works as intended, it might be a synchronization problem, with the way threads "rush" for the mutex as soon as it is available
Since I am able to reproduce this issue in a pretty "deterministic" way, it might mean that it is an issue in the logic of the code
But the above can't both be true at once, so what exactly is the problem here?
EDIT: To clarify the question: I know that mutex is not technically used for order of execution, but in this case, why does it not work as intended and what would be the fix?
Thanks in advance!
Maybe I’m confusing myself with threads, but my understanding of threading conflicts with each other.
I’ve created a program which uses POSIX pthreads. Without using these threads the program takes 0.061723 seconds to run, and with threads takes 0.081061 seconds to run.
At first I thought this is what should happen, as threads allow something to happen while other things should be able to happen. i.e. processing a lot of data on one thread while still having responsive UI on another, this would mean the processing of the data would take longer as the CPU divides its time between processing UI and processing the data.
However, surely the point of multithreading is to make the program take advantage of multiple CPUs/cores?
As you can tell I’m something of an intermediate so excuse me if it’s a simple question.
But what should I expect the program to do?
I’m running this on a mid-2012 Macbook Pro 13” base model. CPU is 22 nm "Ivy Bridge" 2.5 GHz Intel "Core i5" processor (3210M), with two independent processor "cores" on a single silicon chip
UPDATED WITH CODE
This is in main function. I didn’t add variable declaration for convenience but I’m sure you can work out what each does by its name:
// Loop through all items we need to process
//
while (totalNumberOfItemsToProcess > 0 && numberOfItemsToProcessOnEachIteration > 0 && startingIndex <= totalNumberOfItemsToProcess)
{
// As long as we have items to process...
//
// Align the index with number of items to process per iteration
//
const uint endIndex = startingIndex + (numberOfItemsToProcessOnEachIteration - 1);
// Create range
//
Range range = RangeMake(startingIndex,
endIndex);
rangesProcessed[i] = range;
// Create thread
//
// Create a thread identifier, 'newThread'
//
pthread_t newThread;
// Create thread with range
//
int threadStatus = pthread_create(&newThread, NULL, processCoordinatesInRangePointer, &rangesProcessed[i]);
if (threadStatus != 0)
{
std::cout << "Failed to create thread" << std::endl;
exit(1);
}
// Add thread to threads
//
threadIDs.push_back(newThread);
// Setup next iteration
//
// Starting index
//
// Realign the index with number of items to process per iteration
//
startingIndex = (endIndex + 1);
// Number of items to process on each iteration
//
if (startingIndex > (totalNumberOfItemsToProcess - numberOfItemsToProcessOnEachIteration))
{
// If the total number of items to process is less than the number of items to process on each iteration
//
numberOfItemsToProcessOnEachIteration = totalNumberOfItemsToProcess - startingIndex;
}
// Increment index
//
i++;
}
std::cout << "Number of threads: " << threadIDs.size() << std::endl;
// Loop through all threads, rejoining them back up
//
for ( size_t i = 0;
i < threadIDs.size();
i++ )
{
// Wait for each thread to finish before returning
//
pthread_t currentThreadID = threadIDs[i];
int joinStatus = pthread_join(currentThreadID, NULL);
if (joinStatus != 0)
{
std::cout << "Thread join failed" << std::endl;
exit(1);
}
}
The processing functions:
void processCoordinatesAtIndex(uint index)
{
const int previousIndex = (index - 1);
// Get coordinates from terrain
//
Coordinate3D previousCoordinate = terrain[previousIndex];
Coordinate3D currentCoordinate = terrain[index];
// Calculate...
//
// Euclidean distance
//
double euclideanDistance = Coordinate3DEuclideanDistanceBetweenPoints(previousCoordinate, currentCoordinate);
euclideanDistances[index] = euclideanDistance;
// Angle of slope
//
double slopeAngle = Coordinate3DAngleOfSlopeBetweenPoints(previousCoordinate, currentCoordinate, false);
slopeAngles[index] = slopeAngle;
}
void processCoordinatesInRange(Range range)
{
for ( uint i = range.min;
i <= range.max;
i++ )
{
processCoordinatesAtIndex(i);
}
}
void *processCoordinatesInRangePointer(void *threadID)
{
// Cast the pointer to the right type
//
struct Range *range = (struct Range *)threadID;
processCoordinatesInRange(*range);
return NULL;
}
UPDATE:
Here are my global variables, which, are only global for simplicity - don’t have a go!
std::vector<Coordinate3D> terrain;
std::vector<double> euclideanDistances;
std::vector<double> slopeAngles;
std::vector<Range> rangesProcessed;
std::vector<pthread_t> threadIDs;
Correct me if I’m wrong, but, I think the issue was with how the time elapsed was measured. Instead of using clock_t I’ve moved to gettimeofday() and that reports a shorter time, from non threaded time of 22.629000 ms to a threaded time of 8.599000 ms.
Does this seem right to people?
Of course, my original question was based on whether or not a multithreaded program SHOULD be faster or not, so I won’t mark this answer as the correct one for that reason.
I am performing a series of calculations on a large number of threads using C++ AMP. The last step of the calculation though is to prune the result but only for a limited number of threads. For example, if the result of the calculation is below a threshold, then set the result to 0 BUT only do this for a maximum of X threads. Essentially this is a shared counter but also a shared conditional check.
Any help is appreciated!
My understanding of your question is the following pseudo-code performed by each thread:
auto result = ...
if(result < global_threshold) // if the result of the calculation is below a threshold
if(global_counter++ < global_max) // for a maximum of X threads
result = 0; // then set the result to 0
store(result);
I then further assume that both global_threshold and global_max does not change during the computation (i.e. between parallel_for_each start and finish) - so the most elegant way to pass them is through lambda capture.
On the other hand, global_counter clearly changes value, so it must be located in modifiable memory shared across all threads, effectively being array<T,N> or array_view<T,N>. Since the threads incrementing this object are not synchronized, the operation would need to be performed using atomic operation.
The above translates to the following C++ AMP code (I'm using Visual Studio 2013 syntax, but it is easily back-portable to Visual Studio 2012):
std::vector<int> result_storage(1024);
array_view<int> av_result{ result_storage };
int global_counter_storage[1] = { 0 };
array_view<int> global_counter{ global_counter_storage };
int global_threshold = 42;
int global_max = 3;
parallel_for_each(av_result.extent, [=](index<1> idx) restrict(amp)
{
int result = (idx[0] % 50) + 1; // 1 .. 50
if(result < global_threshold)
{
// assuming less than INT_MAX threads will enter here
if(atomic_fetch_inc(&global_counter[0]) < global_max)
{
result = 0;
}
}
av_result[idx] = result;
});
av_result.synchronize();
auto zeros = count(begin(result_storage), end(result_storage), 0);
std::cout << "Total number of zeros in results: " << zeros << std::endl
<< "Total number of threads lower than threshold: " << global_counter[0]
<< std::endl;
Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = hsz + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as the std::vector<T>::operator[]()) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.