wait() for thread made via clone? - c++

I plan to rewrite this in assembly, so I can't use the C or C++ standard library. The code below runs perfectly, but I want a thread instead of a second process. If you uncomment /*CLONE_THREAD|*/ on line 25, waitpid returns -1. I would like a blocking function that resumes when my thread is complete. I couldn't work out from the man pages what it expects me to do.
#include <sys/wait.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
int globalValue=0;
static int childFunc(void*arg)
{
printf("Global value is %d\n", globalValue);
globalValue += *(int*)&arg;
return 31;
}
int main(int argc, char *argv[])
{
auto stack_size = 1024 * 1024;
auto stack = (char*)mmap(NULL, stack_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
if (stack == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
globalValue = 5;
auto pid = clone(childFunc, stack + stack_size, /*CLONE_THREAD|*/CLONE_VM|CLONE_SIGHAND|SIGCHLD, (void*)7);
sleep(1); //So main and child printf don't collide
if (pid == -1) { perror("clone"); exit(EXIT_FAILURE); }
printf("clone() returned %d\n", pid);
int status;
int waitVal = waitpid(-1, &status, __WALL);
printf("Expecting 12 got %d. Expecting 31 got %d. ID=%d\n", globalValue, WEXITSTATUS(status), waitVal);
return 0;
}

If you want to call functions asynchronously with threads, I recommend using std::async. Example:
#include <iostream>
#include <future>
#include <mutex>
#include <condition_variable>
int globalValue = 0; // could also have been std::atomic<int> but I chose a mutex (to also serialize output to std::cout)
std::mutex mtx; // to protect access to data in multithreaded applications you can use mutexes
int childFunc(const int value)
{
std::unique_lock<std::mutex> lock(mtx);
globalValue = value;
std::cout << "Global value set to " << globalValue << "\n";
return 31;
}
int getValue()
{
std::unique_lock<std::mutex> lock(mtx);
return globalValue;
}
int main(int argc, char* argv[])
{
// shared memory stuff is not needed for threads
// launch childFunc asynchronously
// using a lambda function : https://en.cppreference.com/w/cpp/language/lambda
// to call a function asynchronously : https://en.cppreference.com/w/cpp/thread/async
// note I didn't use the C++ thread class; it can launch things asynchronously too,
// however async is both a better abstraction and lets you return values (and exceptions)
// to the calling thread if you need to (which you do in this case)
std::future<int> future = std::async(std::launch::async, []
{
return childFunc(12);
});
// wait until asynchronous function call is complete
// and get its return value;
int value_from_async = future.get();
std::cout << "Expected global value 12, value = " << getValue() << "\n";
std::cout << "Expected return value from asynchronous process is 31, value = " << value_from_async << "\n";
return 0;
}
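Since the question rules out the standard library, here is a minimal sketch of one way to block until a CLONE_THREAD thread finishes using only system calls. A thread created with CLONE_THREAD is not a child that wait() can reap, which is why waitpid returns -1. The sketch assumes Linux and relies on CLONE_CHILD_CLEARTID, which asks the kernel to zero a tid word and issue a futex wake on it when the thread exits; the caller sleeps in FUTEX_WAIT on that word. The names run_clone_thread_and_wait, threadBody, sharedValue and tidWord are made up for illustration, and the thread's return value cannot be recovered this way.

```cpp
#include <linux/futex.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

static int sharedValue = 0;

// Thread body: touches only a plain global, because a thread created with raw
// clone() (no CLONE_SETTLS) has no thread-local storage of its own.
static int threadBody(void* arg) {
    sharedValue += static_cast<int>(reinterpret_cast<long>(arg));
    return 0;  // the glibc clone() wrapper exits only this thread when we return
}

// Hypothetical helper: runs threadBody via clone(CLONE_THREAD) and blocks
// until that thread has exited. Returns the final sharedValue, or -1 on error.
int run_clone_thread_and_wait(int start, int increment) {
    const std::size_t stackSize = 1024 * 1024;
    void* mem = mmap(nullptr, stackSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (mem == MAP_FAILED) return -1;
    char* stack = static_cast<char*>(mem);

    sharedValue = start;
    pid_t tidWord = 0;  // the kernel stores the tid here, then clears it on thread exit
    const int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                      CLONE_THREAD | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
    int tid = clone(threadBody, stack + stackSize, flags,
                    reinterpret_cast<void*>(static_cast<long>(increment)),
                    &tidWord, nullptr, &tidWord);
    if (tid == -1) {
        munmap(mem, stackSize);
        return -1;
    }

    // "Join": sleep on the futex until the kernel has cleared the tid word.
    for (;;) {
        pid_t current = __atomic_load_n(&tidWord, __ATOMIC_ACQUIRE);
        if (current == 0) break;   // thread has exited
        assert(current == tid);    // otherwise the word still holds the thread's tid
        // Returns immediately if the word no longer equals 'current'.
        syscall(SYS_futex, &tidWord, FUTEX_WAIT, current, nullptr, nullptr, 0);
    }

    int result = sharedValue;
    munmap(mem, stackSize);  // safe now: the thread no longer uses its stack
    return result;
}
```

Calling run_clone_thread_and_wait(5, 7) corresponds to the question's setup (globalValue = 5, argument 7). Everything here is a direct system call or a thin wrapper around one, so it should translate to assembly, but treat it as a starting point rather than a complete threading layer.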

Related

How to force terminate std::thread by TerminateThread()?

I called IMFSourceReader::ReadSample and found that it gets stuck if it cannot read data.
So I tried to terminate the thread with TerminateThread(), but it returned 0, indicating failure.
How can I terminate the stuck thread?
This is my sample code:
#include <iostream>
#include <vector>
#include <codecvt>
#include <string>
#include <thread>
#include <mutex>
#include <chrono>
#include <condition_variable>
#include <Windows.h>
using namespace std::chrono_literals;
class MyObject
{
private:
...
std::thread *t;
std::mutex m;
std::condition_variable cv;
std::thread::native_handle_type handle;
int getsample(uint8_t* data)
{
// call a ReadSample
hr = pVideoReader->ReadSample(
MF_SOURCE_READER_ANY_STREAM, // Stream index.
0, // Flags.
&streamIndex, // Receives the actual stream index.
&flags, // Receives status flags.
&llTimeStamp, // Receives the time stamp.
&pSample // Receives the sample or NULL.
);
...
return 0;
}
int myfunc_wrapper(uint8_t* data)
{
int ret = 0;
BOOL bpass = 0;
if (t == nullptr) {
t = new std::thread([this, &data, &ret]()
{
ret = this->getsample(data);
this->cv.notify_one();
});
handle = t->native_handle();
t->detach();
}
{
std::unique_lock<std::mutex> l(this->m);
if (this->cv.wait_for(l, 2500ms) == std::cv_status::timeout) {
bpass = TerminateThread(handle, 0);
if (bpass == 0) {
std::cout << "TerminateThread Fail! " << GetLastError() << std::endl;
}
throw std::runtime_error("Timeout Fail 2500 ms");
}
}
delete t;
t = nullptr;
}
public:
int my_func(uint8_t* raw_data)
{
bool timedout = false;
try {
if (myfunc_wrapper(raw_data) != 0)
return -1;
}
catch (std::runtime_error& e) {
std::cout << e.what() << std::endl;
timedout = true;
}
if (timedout)
return -1;
return 0;
}
};
int main()
{
uint8_t data[512];
MyObject* obj = new MyObject();
while (true)
{
obj->my_func(data);
}
return 0;
}
Output:
TerminateThread Fail! 6
Timeout Fail 2500 ms
TerminateThread Fail! 6
Timeout Fail 2500 ms
...
I also tried pthread_cancel, but it does not compile because of a type error:
no suitable constructor exists to convert from "std::thread::native_handle_type" to "__ptw32_handle_t"
handle = t->native_handle();
...
pthread_cancel(handle); // no suitable constructor exists to convert
The reason it failed to terminate is that the native handle is no longer valid after detaching (error 6 in your output is ERROR_INVALID_HANDLE). One way around this is to call OpenThread with the thread id to get a new handle.
To get the thread id, you could use its handle before detaching like this:
DWORD nativeId = GetThreadId(t->native_handle());
t->detach();
After that, just open a new handle to the thread to terminate it:
HANDLE hThread = OpenThread(THREAD_TERMINATE, FALSE, nativeId);
if (hThread)
{
BOOL result = TerminateThread(hThread, 0);
CloseHandle(hThread);
}
But you should not do this: TerminateThread gives the thread no chance to release locks or run cleanup code. Consider other ways to signal the thread to terminate on its own.
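As a rough illustration of signalling a thread to terminate on its own, here is a minimal sketch using a std::atomic<bool> stop flag that the worker checks between bounded units of work. PollingWorker and run_worker_for are hypothetical names, and the 1 ms sleep stands in for real work. This only helps when each unit of work returns in bounded time; a call that blocks indefinitely, like the ReadSample call in the question, would still need its own timeout or cancellation mechanism, which is not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical worker that stops cooperatively instead of being terminated.
class PollingWorker {
public:
    void start() {
        assert(!thread_.joinable());  // start() must not be called twice
        stop_.store(false);
        thread_ = std::thread([this] {
            while (!stop_.load()) {
                // Stand-in for one bounded unit of work.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                iterations_.fetch_add(1);
            }
        });
    }

    // Asks the worker to finish, waits for it, and returns the number of
    // units of work it completed.
    long stop_and_join() {
        stop_.store(true);
        if (thread_.joinable()) thread_.join();
        return iterations_.load();
    }

    ~PollingWorker() { stop_and_join(); }

private:
    std::atomic<bool> stop_{false};
    std::atomic<long> iterations_{0};
    std::thread thread_;
};

// Runs a worker for roughly the given duration, then stops it cleanly.
long run_worker_for(std::chrono::milliseconds duration) {
    PollingWorker worker;
    worker.start();
    std::this_thread::sleep_for(duration);
    return worker.stop_and_join();
}
```

Because the thread is joined rather than detached, its handle stays valid for its whole lifetime and no forced termination is needed.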

c++ asynchronous I/O in linux that waits on condition_variable, not waiting. What are we doing wrong?

In a previous question ("Trying to write asynchronous I/O in C++ using locks and condition variables. This code calls terminate on the first lock() why?"), we tried to use two mutexes to have asynchronous code that reads one block of a file into memory, then asynchronously tries to read the next block while processing the current one. Someone made a comment that using read was not the best way to do that. This is an attempt to use POSIX aio_read, but we are trying to wait on a condition_variable and do a notify on the condition variable in the callback after the I/O completes, and it's not working -- in the debugger we can see it blows right past the wait.
#include <aio.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <condition_variable>
#include <cstring>
#include <iostream>
#include <thread>
using namespace std;
using namespace std::chrono_literals;
constexpr uint32_t blockSize = 512;
mutex readMutex;
mutex procMutex;
condition_variable cv;
int fh;
int bytesRead;
void process(char* buf, uint32_t bytesRead) {
cout << "processing..." << endl;
usleep(100000);
}
void aio_completion_handler(sigval_t sigval) {
struct aiocb* req = (struct aiocb*)sigval.sival_ptr;
// check whether asynch operation is complete
if (aio_error(req) == 0) {
int ret = aio_return(req);
cout << "ret == " << ret << endl;
cout << (char*)req->aio_buf << endl;
}
cv.notify_one();
}
void thready() {
char* buf1 = new char[blockSize];
char* buf2 = new char[blockSize];
aiocb cb;
char* processbuf = buf1;
char* readbuf = buf2;
fh = open("smallfile.dat", O_RDONLY);
if (fh < 0) {
throw std::runtime_error("cannot open file!");
}
memset(&cb, 0, sizeof(aiocb));
cb.aio_fildes = fh;
cb.aio_nbytes = blockSize;
cb.aio_offset = 0;
// Fill in callback information
/*
Using SIGEV_THREAD to request a thread callback function as a notification
method
*/
cb.aio_sigevent.sigev_notify_attributes = nullptr;
cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
cb.aio_sigevent.sigev_notify_function = aio_completion_handler;
/*
The context to be transmitted is loaded into the handler (in this case, a
reference to the aiocb request itself). In this handler, we simply refer to
the arrived sigval pointer and use the AIO function to verify that the request
has been completed.
*/
cb.aio_sigevent.sigev_value.sival_ptr = &cb;
int currentBytesRead = read(fh, buf1, blockSize); // read the 1st block
unique_lock<mutex> readLock(readMutex);
while (true) {
cb.aio_buf = readbuf;
aio_read(&cb); // each next block is read asynchronously
process(processbuf, currentBytesRead); // process while waiting
cv.wait(readLock);
if (currentBytesRead < blockSize) {
break; // last time, get out
}
cout << "back from wait" << endl;
swap(processbuf, readbuf); // switch to other buffer for next time
currentBytesRead = bytesRead; // create local copy
}
delete[] buf1;
delete[] buf2;
}
int main() {
try {
thready();
} catch (std::exception& e) {
cerr << e.what() << '\n';
}
return 0;
}
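Not a diagnosis of the code above, but a sketch of the pattern normally used when a callback notifies a condition_variable: pair the condition variable with a flag protected by the same mutex and wait on a predicate. A bare cv.wait(lock) can return on a spurious wakeup, and a notify that arrives before the wait starts is lost; the flag covers both cases. Completion and demo_completion_before_wait are hypothetical names, and the plain std::thread stands in for the SIGEV_THREAD completion handler.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical completion signal shared by an I/O callback and a waiter.
class Completion {
public:
    // Called from the completion callback with the number of bytes read.
    void signal(int bytes) {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!done_ && "previous completion was not consumed");
            bytes_ = bytes;
            done_ = true;
        }
        cv_.notify_one();
    }

    // Blocks until signal() has been called, then re-arms for the next request.
    int wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return done_; });  // predicate survives spurious wakeups
        done_ = false;
        return bytes_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    int bytes_ = 0;
};

// Demonstrates that a completion arriving before the wait is not lost.
int demo_completion_before_wait() {
    Completion completion;
    std::thread callback([&completion] { completion.signal(512); });
    callback.join();           // the "I/O" finished before we start waiting
    return completion.wait();  // returns at once because the flag is already set
}
```

In the question's loop, the handler would pass the result of aio_return to signal(), and the reader would use the value returned by wait() as the next currentBytesRead.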

Measuring execution time when using threads

I would like to measure the execution time of some code. The code starts in the main() function and finishes in an event handler.
I have C++11 code that looks like this:
#include <iostream>
#include <time.h>
...
volatile clock_t t;
void EventHandler()
{
// this function being called marks the end of the part that I want to measure
t = clock() - t;
std::cout << "time in seconds: " << ((float)t)/CLOCKS_PER_SEC;
}
int main()
{
MyClass* instance = new MyClass(EventHandler); // this function starts a new std::thread
instance->start(...); // this function only passes some data to the thread working data, later the thread will call EventHandler()
t = clock();
return 0;
}
So it is guaranteed that EventHandler() will be called only once, and only after an instance->start() call.
It works and gives me some output, but it is horrible code: it uses a global variable that different threads access. However, I can't change the API being used (the constructor, or the way the thread calls EventHandler).
I would like to ask whether a better solution exists.
Thank you.
A global variable is unavoidable as long as MyClass expects a plain function and there is no way to pass a context pointer along with the function.
You could write the code in a slightly more tidy way, though:
#include <future>
#include <thread>
#include <chrono>
#include <iostream>
struct MyClass
{
typedef void (CallbackFunc)();
constexpr explicit MyClass(CallbackFunc* handler)
: m_handler(handler)
{
}
void Start()
{
std::thread(&MyClass::ThreadFunc, this).detach();
}
private:
void ThreadFunc()
{
std::this_thread::sleep_for(std::chrono::seconds(5));
m_handler();
}
CallbackFunc* m_handler;
};
std::promise<std::chrono::time_point<std::chrono::high_resolution_clock>> gEndTime;
void EventHandler()
{
gEndTime.set_value(std::chrono::high_resolution_clock::now());
}
int main()
{
MyClass task(EventHandler);
auto trigger = gEndTime.get_future();
auto startTime = std::chrono::high_resolution_clock::now();
task.Start();
trigger.wait();
std::chrono::duration<double> diff = trigger.get() - startTime;
std::cout << "Duration = " << diff.count() << " secs." << std::endl;
return 0;
}
clock() will not filter out the execution of other processes and threads that the scheduler runs in parallel with the program's event handler thread. There are alternatives such as times() and getrusage(), which report the CPU time of the process. Their documentation does not clearly describe per-thread behaviour; on Linux, threads are treated much like processes, but this has to be investigated.
clock() is the wrong tool here because it does not measure the time your operation actually needs: on POSIX systems it reports the CPU time consumed by the whole process, across all threads, while some implementations (notably Microsoft's C runtime) report elapsed wall-clock time, so time is counted even when the thread is not running at all.
Instead you have to use platform-specific APIs, such as pthread_getcpuclockid on POSIX-compliant systems (check whether _POSIX_THREAD_CPUTIME is defined), which counts the CPU time actually spent by a specific thread.
You can take a look at a benchmarking library I wrote for C++ that supports thread-aware measuring (see struct thread_clock implementation).
Or, you can use the code snippet from the man page:
/* Link with "-lrt" */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <string.h>
#include <errno.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
#define handle_error_en(en, msg) \
do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
static void *
thread_start(void *arg)
{
printf("Subthread starting infinite loop\n");
for (;;)
continue;
}
static void
pclock(char *msg, clockid_t cid)
{
struct timespec ts;
printf("%s", msg);
if (clock_gettime(cid, &ts) == -1)
handle_error("clock_gettime");
printf("%4ld.%03ld\n", ts.tv_sec, ts.tv_nsec / 1000000);
}
int
main(int argc, char *argv[])
{
pthread_t thread;
clockid_t cid;
int j, s;
s = pthread_create(&thread, NULL, thread_start, NULL);
if (s != 0)
handle_error_en(s, "pthread_create");
printf("Main thread sleeping\n");
sleep(1);
printf("Main thread consuming some CPU time...\n");
for (j = 0; j < 2000000; j++)
getppid();
pclock("Process total CPU time: ", CLOCK_PROCESS_CPUTIME_ID);
s = pthread_getcpuclockid(pthread_self(), &cid);
if (s != 0)
handle_error_en(s, "pthread_getcpuclockid");
pclock("Main thread CPU time: ", cid);
/* The preceding 4 lines of code could have been replaced by:
pclock("Main thread CPU time: ", CLOCK_THREAD_CPUTIME_ID); */
s = pthread_getcpuclockid(thread, &cid);
if (s != 0)
handle_error_en(s, "pthread_getcpuclockid");
pclock("Subthread CPU time: 1 ", cid);
exit(EXIT_SUCCESS); /* Terminates both threads */
}
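To make the difference between elapsed time and per-thread CPU time concrete, here is a small C++ sketch that measures the same sleep with both std::chrono::steady_clock and CLOCK_THREAD_CPUTIME_ID. It assumes a POSIX system that provides thread CPU-time clocks; Timing, thread_cpu_seconds and time_a_sleep are names invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <time.h>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // CPU time consumed by the calling thread
};

// Reads the CPU time consumed so far by the calling thread.
static double thread_cpu_seconds() {
    timespec ts{};
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;  // keeps release builds free of unused-variable warnings
    return static_cast<double>(ts.tv_sec) + ts.tv_nsec / 1e9;
}

// Times a sleep with both clocks. A sleeping thread uses almost no CPU,
// so cpu_seconds stays close to zero while wall_seconds covers the sleep.
Timing time_a_sleep(std::chrono::milliseconds duration) {
    auto wall_start = std::chrono::steady_clock::now();
    double cpu_start = thread_cpu_seconds();

    std::this_thread::sleep_for(duration);

    double cpu_end = thread_cpu_seconds();
    auto wall_end = std::chrono::steady_clock::now();

    Timing result;
    result.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    result.cpu_seconds = cpu_end - cpu_start;
    return result;
}
```

For the original question, where the interval runs from main() to a handler on another thread, the wall-clock measurement is the one that matches "how long did it take"; the thread CPU clock answers "how much CPU did this thread burn".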

POSIX semaphore doesn't work under high contention/load

Using C++11 on Linux kernel 4.4.0-57, I'm trying to run two busy-looping processes (say p1, p2) pinned (pthread_setaffinity_np) to the same core, and to enforce an interleaved execution order using a POSIX semaphore (semaphore.h) and sched_yield(). But it did not work out well.
Below is the parent code (parent-task) that spawns 2 processes, each of which executes the child-task code.
#include <stdio.h>
#include <cstdlib>
#include <errno.h> // errno
#include <iostream> // cout cerr
#include <semaphore.h> // semaphore
#include <fcntl.h> // O_CREAT
#include <unistd.h> // fork
#include <string> // std::string
#include <string.h> // strdup
#include <sys/types.h> //
#include <sys/wait.h> // wait()
int init_semaphore(){
std::string sname = "/SEM_CORE";
sem_t* sem = sem_open ( sname.c_str(), O_CREAT, 0644, 1 );
if ( sem == SEM_FAILED ) {
std::cerr << "sem_open failed!\n";
return -1;
}
sem_init( sem, 0, 1 );
return 0;
}
// Fork and exec child-task.
// Return pid of child
int fork_and_exec( std::string pname, char* cpuid ){
int pid = fork();
if ( pid == 0) {
// Child
char* const params[] = { "./child-task", "99", strdup( pname.c_str() ), cpuid, NULL };
execv( params[0], params );
exit(0);
}
else {
// Parent
return pid;
}
}
int main( int argc, char* argv[] ) {
if ( argc <= 1 )
printf( "Usage ./parent-task <cpuid> \n" );
char* cpuid = argv[1];
std::string pnames[2] = { "p111", "p222" };
init_semaphore();
int childid[ 2 ] = { 0 };
int i = 0;
for( std::string pname : pnames ){
childid[ i++ ] = fork_and_exec( pname, cpuid ); // advance i so each child pid gets its own slot
}
for ( i=0; i<2; i++ )
if ( waitpid( childid[i], NULL, 0 ) < 0 )
perror( "waitpid() failed.\n" );
return 0;
}
The child-task code looks like this:
#include <cstdlib>
#include <stdio.h>
#include <sched.h>
#include <pthread.h>
#include <stdint.h>
#include <errno.h>
#include <semaphore.h>
#include <iostream>
#include <sys/types.h>
#include <fcntl.h> // O_CREAT
sem_t* sm;
int set_cpu_affinity( int cpuid ) {
pthread_t current_thread = pthread_self();
cpu_set_t cpuset;
CPU_ZERO( &cpuset );
CPU_SET( cpuid, &cpuset );
return pthread_setaffinity_np( current_thread,
sizeof( cpu_set_t ), &cpuset );
}
int lookup_semaphore() {
sm = sem_open( "/SEM_CORE", O_RDWR );
if ( sm == SEM_FAILED ) {
std::cerr << "sem_open failed!" << std::endl ;
return -1;
}
return 0; // success; without this the function has no return value on this path
}
int main( int argc, char* argv[] ) {
printf( "Usage: ./child-task <PRIORITY> <PROCESS-NAME> <CPUID>\n" );
printf( "Setting SCHED_RR and priority to %d\n", atoi( argv[1] ) );
set_cpu_affinity( atoi( argv[3] ) );
lookup_semaphore();
int res;
uint32_t n = 0;
while ( 1 ) {
n += 1;
if ( !( n % 1000 ) ) {
res = sem_wait( sm );
if( res != 0 ) {
printf(" sem_wait %s. errno: %d\n", argv[2], errno);
}
printf( "Inst:%s RR Prio %s running (n=%u)\n", argv[2], argv[1], n );
fflush( stdout );
sem_post( sm );
sched_yield();
}
sched_yield();
}
sem_close( sm );
}
In the child-task code, I use if ( !( n % 1000 ) ) to experiment with reducing the contention/load from waiting on and posting the semaphore. What I observe is that with n % 1000, one of the child processes is always in the Sleep state (according to top) while the other child process executes properly. However, if I use n % 10000, i.e. less load/contention, both processes run and print their output in interleaved fashion, which is my expected outcome.
Does anyone know whether this is a limitation of semaphore.h, or whether there is a better way to ensure the execution order of processes?
Updated: I wrote a simple example with threads and a semaphore. Note that sched_yield may help avoid unnecessary wakeups of the thread whose turn it is not, but yielding is not a guarantee. I also show an example with a mutex and condvar that is guaranteed to work, with no yield necessary.
#include <stdexcept>
#include <semaphore.h>
#include <pthread.h>
#include <thread>
#include <iostream>
using std::thread;
using std::cout;
sem_t sem;
int count = 0;
const int NR_WORK_ITEMS = 10;
void do_work(int worker_id)
{
cout << "Worker " << worker_id << '\n';
}
void foo(int work_on_odd)
{
int result;
int contention_count = 0;
while (count < NR_WORK_ITEMS)
{
result = sem_wait(&sem);
if (result) {
throw std::runtime_error("sem_wait failed!");
}
if (count % 2 == work_on_odd)
{
do_work(work_on_odd);
count++;
}
else
{
contention_count++;
}
result = sem_post(&sem);
if (result) {
throw std::runtime_error("sem_post failed!");
}
result = sched_yield();
if (result < 0) {
throw std::runtime_error("yield failed!");
}
}
cout << "Worker " << work_on_odd << " terminating. Nr of redundant wakeups from sem_wait: " <<
contention_count << '\n';
}
int main()
{
int result = sem_init(&sem, 0, 1);
if (result) {
throw std::runtime_error("sem_init failed!");
}
thread t0 = thread(foo, 0);
thread t1 = thread(foo, 1);
t0.join();
t1.join();
return 0;
}
Here is one way to do it with condition variables and mutexes. Translating from C++ std threads to pthreads should be trivial. To do it between processes, you would have to use a pthread mutex type that can be shared between processes. Maybe the condvar and the mutex can both be placed in shared memory, to achieve the same thing I do below with threads.
See also the manpage pthread_condattr_setpshared (3) or
http://manpages.ubuntu.com/manpages/wily/man3/pthread_condattr_setpshared.3posix.html
On the other hand, maybe it is simpler to just use a SOCK_STREAM unix domain socket between the two worker processes, and just block on the socket until the peer worker pings you (i.e. send one char) over the socket.
#include <cassert>
#include <iostream>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <unistd.h>
using std::thread;
using std::condition_variable;
using std::mutex;
using std::unique_lock;
using std::cout;
condition_variable cv;
mutex mtx;
int count;
void dowork(int arg)
{
std::thread::id this_id = std::this_thread::get_id();
cout << "Arg: " << arg << ", thread id: " << this_id << '\n';
}
void tfunc(int work_on_odd)
{
assert(work_on_odd < 2);
// count is a global, so it cannot (and need not) be captured; work_on_odd is captured by value
auto check_can_work = [work_on_odd](){ return ((count % 2) ==
work_on_odd); };
while (count < 10)
{
unique_lock<mutex> lk(mtx);
cv.wait (lk, check_can_work);
dowork(work_on_odd);
count++;
cv.notify_one();
// Lock is unlocked automatically here, but with threads and condvars,
// it is actually better to unlock manually before notify_one.
}
}
int main()
{
count = 0;
thread t0 = thread(tfunc, 0);
thread t1 = thread(tfunc, 1);
sleep(1);
cv.notify_one();
t0.join();
t1.join();
}
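For the process case mentioned above, here is a minimal sketch of the same turn-taking between two processes, with a process-shared mutex and condition variable placed in anonymous shared memory. It assumes Linux/POSIX with support for PTHREAD_PROCESS_SHARED, and uses fork() without exec for brevity (separately exec'ed programs would need named shared memory instead). SharedTurn, worker, run_alternating_processes and kItems are names invented for the example.

```cpp
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>

constexpr int kItems = 8;

// Lives in shared memory, visible to both processes.
struct SharedTurn {
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    int count;         // index of the next work item
    int log[kItems];   // records which worker handled each item
};

// Handles the items whose index has the given parity, strictly in turn.
static void worker(SharedTurn* s, int parity) {
    assert(parity == 0 || parity == 1);
    for (;;) {
        pthread_mutex_lock(&s->mtx);
        while (s->count < kItems && s->count % 2 != parity)
            pthread_cond_wait(&s->cv, &s->mtx);
        if (s->count >= kItems) {           // all work done
            pthread_mutex_unlock(&s->mtx);
            return;
        }
        s->log[s->count] = parity;          // the "work"
        s->count++;
        pthread_cond_broadcast(&s->cv);     // hand the turn to the peer
        pthread_mutex_unlock(&s->mtx);
    }
}

// Returns true if parent (parity 0) and child (parity 1) strictly alternated.
bool run_alternating_processes() {
    void* mem = mmap(nullptr, sizeof(SharedTurn), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return false;
    SharedTurn* s = static_cast<SharedTurn*>(mem);
    s->count = 0;

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->mtx, &ma);
    pthread_mutexattr_destroy(&ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->cv, &ca);
    pthread_condattr_destroy(&ca);

    pid_t child = fork();
    if (child < 0) {
        munmap(mem, sizeof(SharedTurn));
        return false;
    }
    if (child == 0) {  // child process handles odd items
        worker(s, 1);
        _exit(0);
    }
    worker(s, 0);      // parent handles even items
    waitpid(child, nullptr, 0);

    bool ok = (s->count == kItems);
    for (int i = 0; i < kItems; ++i)
        ok = ok && (s->log[i] == i % 2);
    munmap(mem, sizeof(SharedTurn));
    return ok;
}
```

Unlike the semaphore plus sched_yield approach, the ordering here does not depend on the scheduler: a process that wakes out of turn simply goes back to waiting.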

std::thread to std::async makes HUGE performance gain. How it can be possible?

I've written a test comparing std::thread and std::async.
#include <cassert> // for assert() in WriteLog
#include <iostream>
#include <mutex>
#include <fstream>
#include <string>
#include <memory>
#include <thread>
#include <future>
#include <functional>
#include <boost/noncopyable.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/filesystem.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/asio.hpp>
namespace fs = boost::filesystem;
namespace pt = boost::posix_time;
namespace as = boost::asio;
class Log : private boost::noncopyable
{
public:
void LogPath(const fs::path& filePath) {
boost::system::error_code ec;
if(fs::exists(filePath, ec)) {
fs::remove(filePath);
}
this->ofStreamPtr_.reset(new fs::ofstream(filePath));
};
void WriteLog(std::size_t i) {
assert(*this->ofStreamPtr_);
std::lock_guard<std::mutex> lock(this->logMutex_);
*this->ofStreamPtr_ << "Hello, World! " << i << "\n";
};
private:
std::mutex logMutex_;
std::unique_ptr<fs::ofstream> ofStreamPtr_;
};
int main(int argc, char *argv[]) {
if(argc != 2) {
std::cout << "Wrong argument" << std::endl;
exit(1);
}
std::size_t iter_count = boost::lexical_cast<std::size_t>(argv[1]);
Log log;
log.LogPath("log.txt");
std::function<void(std::size_t)> func = std::bind(&Log::WriteLog, &log, std::placeholders::_1);
auto start_time = pt::microsec_clock::local_time();
////// Version 1: use std::thread //////
// {
// std::vector<std::shared_ptr<std::thread> > threadList;
// threadList.reserve(iter_count);
// for(std::size_t i = 0; i < iter_count; i++) {
// threadList.push_back(
// std::make_shared<std::thread>(func, i));
// }
//
// for(auto it: threadList) {
// it->join();
// }
// }
// pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
// std::cout << "Version 1: " << duration << std::endl;
////// Version 2: use std::async //////
start_time = pt::microsec_clock::local_time();
{
for(std::size_t i = 0; i < iter_count; i++) {
auto result = std::async(func, i);
}
}
pt::time_duration duration = pt::microsec_clock::local_time() - start_time; // declared here because the Version 1 declaration is commented out
std::cout << "Version 2: " << duration << std::endl;
////// Version 3: use boost::asio::io_service //////
// start_time = pt::microsec_clock::local_time();
// {
// as::io_service ioService;
// as::io_service::strand strand{ioService};
// {
// for(std::size_t i = 0; i < iter_count; i++) {
// strand.post(std::bind(func, i));
// }
// }
// ioService.run();
// }
// duration = pt::microsec_clock::local_time() - start_time;
// std::cout << "Version 3: " << duration << std::endl;
}
On a 4-core CentOS 7 box (gcc 4.8.5), Version 1 (using std::thread) is about 100x slower than the other implementations.
Iteration   Version1   Version2    Version3
100         0.0034s    0.000051s   0.000066s
1000        0.038s     0.00029s    0.00058s
10000       0.41s      0.0042s     0.0059s
100000      throw      0.026s      0.061s
Why is the threaded version so slow? I thought each thread wouldn't take long to complete the Log::WriteLog function.
The function may never be called. You are not passing an std::launch policy in Version 2, so you are relying on the default behavior of std::async (emphasis mine):
Behaves the same as async(std::launch::async | std::launch::deferred, f, args...). In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
Try re-running your benchmark with this minor change:
auto result = std::async(std::launch::async, func, i);
Alternatively, you could call result.wait() on each std::future in a second loop, similar to how you call join() on all of the threads in Version 1. This forces evaluation of the std::future.
Note that there is a major, unrelated, problem with this benchmark. func immediately acquires a lock for the full duration of the function call, which makes parallelism impossible. There is no advantage to using threads here - I suspect that it will be significantly slower (due to thread creation and locking overhead) than a serial implementation.
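To see the deferred behaviour in isolation, here is a small sketch that requests std::launch::deferred explicitly (so the outcome does not depend on what the implementation picks for the default policy) and records whether the task has run before and after get(). DeferredObservation and observe_deferred are names invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <future>

struct DeferredObservation {
    bool ran_before_get;  // had the task run before the future was queried?
    bool ran_after_get;   // had it run once get() returned?
};

DeferredObservation observe_deferred() {
    std::atomic<bool> ran{false};

    // With the deferred policy no thread is started; the callable is stored
    // and only invoked when the future is waited on or queried.
    std::future<int> f = std::async(std::launch::deferred, [&ran]() -> int {
        ran = true;
        return 31;
    });

    DeferredObservation obs{};
    obs.ran_before_get = ran.load();

    int value = f.get();  // runs the task synchronously on this thread
    assert(value == 31);
    (void)value;

    obs.ran_after_get = ran.load();
    return obs;
}
```

If the future in the benchmark loop ends up deferred and is destroyed without being queried, the logging function is never invoked for that iteration, which would make Version 2 look far faster than it should.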