C++ Threading Error

I am getting a C++ threading error with the below code:
//create MAX_THREADS arrays for writing data to
thread threads[MAX_THREADS];
char ** data = new char*[MAX_THREADS];
char * currentSlice;
int currentThread = 0;
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice += MAX_SLICES_PER_THREAD){
    currentThread++;
    fprintf(stderr, "Generating volume for slice %d to %d on thread %d...\n", slice,
            slice + MAX_SLICES_PER_THREAD >= convertToVoxels(ARM_LENGTH) ? convertToVoxels(ARM_LENGTH) : slice + MAX_SLICES_PER_THREAD,
            currentThread);
    try {
        //Allocate memory for the slice
        currentSlice = new char[convertToVoxels(ARM_RADIUS) * convertToVoxels(ARM_RADIUS) * MAX_SLICES_PER_THREAD];
    } catch (std::bad_alloc&) {
        cout << endl << "Bad alloc" << endl;
        exit(0);
    }
    data[currentThread] = currentSlice;
    //Spawn a thread
    threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
    //If the number of threads is maxed out or if we are on the last thread
    if (currentThread == MAX_THREADS || slice + MAX_SLICES_PER_THREAD > convertToVoxels(ARM_LENGTH)){
        fprintf(stderr, "Joining threads... \n");
        //Join all threads
        for (int i = 0; i < MAX_THREADS; i++){
            threads[i].join();
        }
        fprintf(stderr, "Writing file chunks... \n");
        FILE* fd = fopen("density.raw", "ab");
        for (int i = 0; i < currentThread; i++){
            fwrite(&data[i], sizeof(char), convertToVoxels(ARM_RADIUS) * convertToVoxels(ARM_RADIUS), fd);
            delete data[i];
        }
        fclose(fd);
        currentThread = 0;
    }
}
The goal of this code is to break a large three-dimensional array into smaller sections that can be processed in parallel for speed, but can also be stitched back together when written to a file. To this end I try to spawn n threads at a time; after spawning the nth thread, I join all existing threads, write to the file in question, then reset things and continue the process until all subproblems have been completed.
I am getting the following error:
Generating volume for slice 0 to 230 on thread 1...
Generating volume for slice 230 to 460 on thread 2...
Generating volume for slice 460 to 690 on thread 3...
Generating volume for slice 690 to 920 on thread 4...
Generating volume for slice 920 to 1150 on thread 5...
Generating volume for slice 1150 to 1380 on thread 6...
Generating volume for slice 1380 to 1610 on thread 7...
terminate called without an active exception
Aborted (core dumped)
After doing some research it seems that the issue is that I am not joining my threads before they go out of scope. However, I thought the code I wrote would do this correctly, namely this section:
//Join all threads
for (int i = 0; i < MAX_THREADS; i++){
    threads[i].join();
}
Could anyone point out my error (or errors) and explain it a little clearer so I do not repeat the same mistake?
Edit: Note that I have verified I am getting into the inner if block that is meant to join the threads. After running the program with the thread-spawning line and thread-joining line commented out, I get the following output:
Generating volume for slice 0 to 230 on thread 1...
Generating volume for slice 230 to 460 on thread 2...
Generating volume for slice 460 to 690 on thread 3...
Generating volume for slice 690 to 920 on thread 4...
Generating volume for slice 920 to 1150 on thread 5...
Generating volume for slice 1150 to 1380 on thread 6...
Generating volume for slice 1380 to 1610 on thread 7...
Joining threads and writing file chunk...

The issue: you are calling the join method on an empty thread object. You cannot do this; calling join on a non-joinable thread throws an exception.
In this line
thread threads[MAX_THREADS];
you created MAX_THREADS thread objects using the default constructor. A thread object after the default ctor is in a non-joinable state. Before calling join you should invoke the joinable method; only if it returns true can you call join.
for (int i = 0; i < MAX_THREADS; i++){
    if (threads[i].joinable())
        threads[i].join();
}
Now your code crashes at i == 0, because you increment currentThread at the beginning of your for loop:
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice += MAX_SLICES_PER_THREAD){
    currentThread++; // <---
so threads[0] is left as an empty, default-constructed object: by the time you do this assignment, currentThread is already 1:
threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
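Putting both points together, a minimal sketch of the corrected spawn/join structure (only the indexing and joining change; the allocation, file writing, and constants are the asker's own):

// Sketch: index the thread slots from 0 and only join the threads actually spawned.
int currentThread = 0;
for (int slice = 0; slice < convertToVoxels(ARM_LENGTH); slice += MAX_SLICES_PER_THREAD){
    // ... allocate currentSlice as before ...
    data[currentThread] = currentSlice;
    threads[currentThread] = thread(buildDensityModel, slice * MAX_SLICES_PER_THREAD, currentSlice);
    currentThread++;   // increment after the slot has been used

    // >= also catches the case where ARM_LENGTH divides evenly into slices
    if (currentThread == MAX_THREADS || slice + MAX_SLICES_PER_THREAD >= convertToVoxels(ARM_LENGTH)){
        for (int i = 0; i < currentThread; i++){   // only the threads actually spawned
            if (threads[i].joinable())
                threads[i].join();
        }
        // ... write out and free data[0] .. data[currentThread - 1] as before ...
        currentThread = 0;
    }
}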

Related

C++ Thread Splitting mechanism

Initially, I'm getting the number of cores in a system.
int num_cpus = (int)std::thread::hardware_concurrency();
After that I created a lambda function for the thread computation:
auto properties = [&](int _startIndex)
{
    for (int layer = _startIndex; layer < inputcount; layer += num_cpus)
    {
        ..........//body
    }
};
Then I start the threads and join them:
std::vector<std::thread> orientation_threads;
for (int i = 0; i < num_cpus; i++)
{
    orientation_threads.push_back(std::thread(properties, i));
}
for (std::thread& trd : orientation_threads)
{
    if (trd.joinable())
        trd.join();
}
I'm getting the correct result, but I want to change the thread allocation method.
Initially, 8 threads execute on 8 cores (1st thread on 1st core, and so on) until all 8 threads are allocated. Suppose the 5th thread finishes first while the remaining 7 threads are still executing; the 5th core is now free, and I want to allocate a 9th piece of work to that core. In other words, work should be allocated based on whichever core is free.
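One common way to get that behaviour is to stop assigning layers to threads up front and instead keep a pool of num_cpus threads that pull the next layer index from a shared atomic counter, so whichever thread (core) finishes early simply takes more work. A minimal sketch (inputcount and the per-layer body are placeholders standing in for the question's own):

#include <atomic>
#include <thread>
#include <vector>

int main()
{
    const int num_cpus = (int)std::thread::hardware_concurrency();
    const int inputcount = 1000;      // illustrative placeholder
    std::atomic<int> nextLayer{0};    // shared work counter

    // Each worker repeatedly claims the next unprocessed layer, so a free
    // thread is never idle while work remains.
    auto properties = [&]()
    {
        for (;;)
        {
            int layer = nextLayer.fetch_add(1);
            if (layer >= inputcount)
                break;
            // ..........//body (same per-layer work as in the question)
        }
    };

    std::vector<std::thread> orientation_threads;
    for (int i = 0; i < num_cpus; i++)
        orientation_threads.emplace_back(properties);

    for (std::thread& trd : orientation_threads)
        if (trd.joinable())
            trd.join();

    return 0;
}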

How can I create so many threads in c++ on beaglebone black

I want to create over 500 threads in C++ on a BeagleBone Black, but the program has errors.
Could you explain why the errors occur and how I can fix them?
In the thread function, call_from_thread(int tid):
void call_from_thread(int tid)
{
    cout << "thread running : " << tid << std::endl;
}
In the main function:
int main() {
    thread t[500];
    for(int i = 0; i < 500; i++) {
        t[i] = thread(call_from_thread, i);
        usleep(100000);
    }
    std::cout << "main fun start" << endl;
    return 0;
}
I expect:
...
...
thread running : 495
thread running : 496
thread running : 497
thread running : 498
thread running : 499
main fun start
but I get:
...
...
thread running : 374
thread running : 375
thread running : 376
thread running : 377
thread running : 378
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
Aborted
Could you help me?
The BeagleBone Black appears to have a maximum of 512 MB of DRAM.
The default stack size of a thread created with pthread_create() is typically 2 MB,
i.e. 2^29 / 2^21 = 2^8 = 256 threads' worth of stack. So what you're probably seeing around thread 374 is that the allocator cannot free memory fast enough to meet the demand, which is handled by throwing an exception.
If you really want to see this explode, try moving that sleep call inside your thread function. :)
You could try preallocating the stack at 1 MB or less (pthreads), but that has its
own set of problems.
The questions to really ask yourself are:
Is my application I/O bound or compute bound?
What's my memory budget to run this application? If you spend your entire physical memory
on thread stacks, you'll have nothing left for the shared program heap.
Do I really need this much parallelism to do the job? The A8 is a single core machine BTW.
Could I solve the problem using a thread pool? Or not use threads at all?
Finally, you can't set the stack size through the std::thread API, but you can with boost::thread.
Or just write a thin wrapper around pthreads (assuming Linux).
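For reference, a minimal sketch of that pthread approach with an explicitly smaller stack (assuming Linux; the 256 KB figure is arbitrary and must stay above PTHREAD_STACK_MIN):

#include <pthread.h>
#include <cstdio>

void* worker(void* arg)
{
    std::printf("thread running : %d\n", *static_cast<int*>(arg));
    return nullptr;
}

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);   // 256 KB instead of the multi-MB default

    int id = 0;
    pthread_t t;
    if (pthread_create(&t, &attr, worker, &id) != 0)
        std::perror("pthread_create");
    else
        pthread_join(t, nullptr);

    pthread_attr_destroy(&attr);
    return 0;
}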
Whenever you use threads, there are three parts.
Start the threads
Do the work
Release the thread
You're starting the threads and doing the work, but you're not releasing them.
Releasing threads. There are two options for releasing a thread.
You can join the thread (which basically waits for it to finish)
You can detach the thread, and let it execute independently.
In this particular case, you don't want the program to finish until all threads are done executing, so you should join them.
#include <iostream>
#include <thread>
#include <vector>
#include <string>

auto call_from_thread = [](int i) {
    // I create the entire message before printing it, so that there's no interleaving of messages between threads
    std::string message = "Calling from thread " + std::to_string(i) + '\n';
    // Because I only call print once, everything gets printed together
    std::cout << message;
};

using std::thread;

int main() {
    thread t[500];
    for(int i = 0; i < 500; i++) {
        // Here, I don't have to start the thread with any delay
        t[i] = thread(call_from_thread, i);
    }
    std::cout << "main fun start\n";
    // I join each thread (which waits for them to finish before closing the program)
    for(auto& item : t) {
        item.join();
    }
    return 0;
}

Qt Concurrent run member function from another member function

I would like to launch a member function in a separate thread, calling it from another member function.
Maybe the code below is clearer.
There is a button which launches the counter in a thread and it works:
void MainWindow::on_pushButton_CountNoArgs_clicked()
{
    myCounter *counter = new myCounter;
    QFuture<void> future = QtConcurrent::run(counter, &myCounter::countUpToThousand);
}
myCounter class member functions:
void myCounter::countUpToHundred()
{
    for(int i = 0; i <= 100; i++)
    {
        qDebug() << "up to 100: " << i;
    }
}

void myCounter::countUpToThousand()
{
    for(int i = 0; i <= 1000; i++)
    {
        qDebug() << "up to 1000: " << i;
        if (i == 500)
        {
            //here I want to launch myCounter::countUpToHundred() in another thread
        }
    }
}
Thanks in advance.
Assuming you want to run the 2 counters in parallel, you have 3 threads:
Thread 1: UI-Thread (or main thread)
This is where on_pushButton_CountNoArgs_clicked() runs. You should not do heavy work in this function: if you want to achieve 60 frames per second, you only have about 16 ms for all the work. Starting a new thread to run countUpToThousand() is a good idea.
Thread 2: background thread (started by QtConcurrent, running countUpToThousand)
This runs in parallel to Thread 1, and you are working with the same instance of myCounter (i.e. the same place in memory) so be careful which member variables you read and write.
Thread 3: background thread (started by QtConcurrent, running countUpToHundred)
Start it using (as hank pointed out):
void myCounter::countUpToThousand()
{
    for(int i = 0; i <= 1000; i++)
    {
        qDebug() << "up to 1000: " << i;
        if (i == 500)
        {
            QtConcurrent::run(this, &myCounter::countUpToHundred);
        }
    }
}
This will run in parallel to Thread 1 and Thread 2.
Now you might get garbled output like 988\n99\n when one counter is at 999 and the other is at 88, because Thread 2 and Thread 3 will be printing to the console at the same time and don't care about what the other thread is doing.
Also note that you must not delete counter before Thread 2 and Thread 3 are done, because if you do, they'll still try to access that memory and your application will probably crash.
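A minimal sketch of one way to handle that: keep the QFuture returned by QtConcurrent::run and wait on it before deleting (the nested countUpToHundred launch would need its own future tracked the same way, e.g. stored as a member of myCounter):

void MainWindow::on_pushButton_CountNoArgs_clicked()
{
    myCounter *counter = new myCounter;
    QFuture<void> future = QtConcurrent::run(counter, &myCounter::countUpToThousand);

    future.waitForFinished();   // blocks the UI thread here, fine only for a demo;
                                // in real code connect a QFutureWatcher's finished()
                                // signal to the cleanup instead
    delete counter;
}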

TBB task_arena & task_group usage for scaling parallel_for work

I am trying to use the Threading Building Blocks task_arena. There is a simple array full of '0'. The arena's threads put '1' in the array at the odd positions, and the main thread puts '2' at the even positions.
/* Odd-even arenas tbb test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>

using namespace std;

const int SIZE = 100;

int main()
{
    tbb::task_arena limited(1); // no more than 1 thread in this arena
    tbb::task_group tg;
    int myArray[SIZE] = {0};

    //! Main thread create another thread, then immediately returns
    limited.enqueue([&]{
        //! Created thread continues here
        tg.run([&]{
            tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
                [&](const tbb::blocked_range<int> &r)
                {
                    for(int i = 0; i != SIZE; i++)
                        if(i % 2 == 0)
                            myArray[i] = 1;
                }
            );
        });
    });

    //! Main thread do this work
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
        [&](const tbb::blocked_range<int> &r)
        {
            for(int i = 0; i != SIZE; i++)
                if(i % 2 != 0)
                    myArray[i] = 2;
        }
    );

    //! Main thread waiting for 'tg' group
    //** it does not create any threads here (doesn't it?) */
    limited.execute([&]{
        tg.wait();
    });

    for(int i = 0; i < SIZE; i++) {
        cout << myArray[i] << " ";
    }
    cout << endl;

    return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enqueue{tg.run{...}} block doesn't work.
What's the problem? Any ideas? Thank you.
You have created the limited arena for one thread only, and by default this slot is reserved for the master thread. Though enqueuing into such a serializing arena temporarily boosts its concurrency level to 2 (in order to satisfy the 'fire-and-forget' promise of enqueue()), enqueue() does not guarantee synchronous execution of the submitted task. So tg.wait() can start before tg.run() executes, and thus the program does not wait for the worker thread to be created, join the limited arena, and fill the array with '1' (BTW, the whole array is filled in each of the 100 parallel_for iterations, since the inner loops ignore the range r).
So, in order to wait for the tg.run() task to complete, use limited.execute instead. But this prevents the automatic boosting of the limited concurrency level, and the task will be deferred until tg.wait() is executed by the master thread.
If you want to see asynchronous execution, set the arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable slot reservation for the master thread: tbb::task_arena limited(1,0) (but note that this implies additional overheads for dynamic balancing of the number of threads in the arena).
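A minimal sketch of the execute-based variant of the submission (the rest of the program stays as in the question; the inner loop now also uses the range r, addressing the note above):

limited.execute([&]{
    tg.run([&]{
        tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
            [&](const tbb::blocked_range<int> &r)
            {
                for(int i = r.begin(); i != r.end(); i++)
                    if(i % 2 == 0)
                        myArray[i] = 1;
            }
        );
    });
});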
P.S. TBB has no points where threads are guaranteed to come (unlike OpenMP). Only enqueue methods guarantee creation of at least one worker thread, but it says nothing about when it will come. See local observer feature to get notification when threads are actually joining arenas.

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting that with two threads the code would run about twice as fast. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: I am now using the clock() function to measure the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);

int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;

    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if(data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D, &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(uint32_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    register double holding = *result;
    for(uint32_t i = 0; i < length; i ++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}
I can now report: the code runs with almost the same execution time as with a single thread. That is a big improvement over the 6x slowdown, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds (2 threads is now slower?)
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false sharing, why are 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
    *result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
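One common way to remove the false sharing on result is to give each partial sum its own cache line and accumulate into a local first; a minimal sketch (alignas(64) assumes a 64-byte cache line, which is typical on x86):

#include <vector>
#include <thread>

// Each partial sum sits on its own cache line, so the two threads no longer
// bounce the same line between cores while accumulating.
struct alignas(64) PaddedDouble {
    double value = 0.0;
};

void findmean(const std::vector<double>* datavec, std::size_t start,
              std::size_t length, double* result)
{
    double holding = 0.0;                    // accumulate locally
    for (std::size_t i = 0; i < length; ++i)
        holding += (*datavec)[start + i];
    *result = holding;                       // single write at the end
}

// Usage sketch (half = data.size() / 2):
//   PaddedDouble results[2];
//   std::thread a(findmean, &data, 0, half, &results[0].value);
//   std::thread b(findmean, &data, half, data.size() - half, &results[1].value);
//   a.join(); b.join();
//   double mean = (results[0].value + results[1].value) / data.size();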
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other place where this happens. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data is 128M doubles (about 1 GB), which is not a lot for modern processors to process in a single loop.