First, I get the number of cores in the system:
int num_cpus = (int)std::thread::hardware_concurrency();
After that, I created a lambda function that each thread runs:
auto properties = [&](int _startIndex)
{
    for (int layer = _startIndex; layer < inputcount; layer += num_cpus)
    {
        // ... body
    }
};
Then I launch the threads and join them:
std::vector<std::thread> orientation_threads;
for (int i = 0; i < num_cpus; i++)
{
    orientation_threads.push_back(std::thread(properties, i));
}
for (std::thread& trd : orientation_threads)
{
    if (trd.joinable())
        trd.join();
}
I'm getting the correct result, but I want to change the thread allocation method. That is:
Initially, 8 threads are executing on 8 cores (1st thread on the 1st core, and so on for all 8). Suppose the 5th thread finishes first while the remaining 7 are still executing; the 5th core is now free, and I want the 9th piece of work to run on it. In short, work should be allocated to whichever core is free.
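One common way to get that behavior with plain std::thread is to stop striping the layers across threads and instead let every thread pull the next unprocessed layer from a shared atomic counter, so a free thread immediately grabs more work. A minimal sketch, reusing inputcount and num_cpus from above:
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> next_layer{0}; // next unclaimed layer

auto properties = [&]()
{
    for (;;)
    {
        int layer = next_layer.fetch_add(1); // claim the next layer
        if (layer >= inputcount)
            break; // nothing left to do
        // ... body ...
    }
};

std::vector<std::thread> orientation_threads;
for (int i = 0; i < num_cpus; i++)
    orientation_threads.emplace_back(properties);
for (std::thread& trd : orientation_threads)
    trd.join();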
How can I run a function on a separate thread if a thread is available, assuming that I always want k threads running at the same time at any point?
Here's some pseudo-code:
For i = 1 to N
    IF numberOfRunningThreads < k
        // run foo() on another thread
    ELSE
        // run foo()
In summary, once a thread is finished, it notifies the others that a thread is available for any of them to use. I hope the description was clear.
My personal approach: just create the k threads and let them call foo repeatedly. You need a counter, protected against race conditions, that is decremented each time before foo is called by any thread. As soon as the desired number of calls has been performed, the threads exit one after the other (incomplete/pseudo code):
std::mutex global_counter_mutex;
unsigned int global_counter = 0; // set to n by runThreads() below

void fooRunner()
{
    for(;;)
    {
        {
            // claim one call of foo(); exit once all calls are taken
            std::lock_guard<std::mutex> g(global_counter_mutex);
            if(global_counter == 0)
                break;
            --global_counter;
        }
        foo();
    }
}
void runThreads(unsigned int n, unsigned int k)
{
    global_counter = n;
    std::vector<std::thread> threads(std::min(n, k - 1));
    // k - 1: the current thread can be reused, too...
    // (provided it has no other tasks to perform)
    for(auto& t : threads)
    {
        t = std::thread(&fooRunner);
    }
    fooRunner();
    for(auto& t : threads)
    {
        t.join();
    }
}
If you have data to pass to the foo function, then instead of a counter you could use e.g. a FIFO or LIFO queue, whichever appears most appropriate for the given use case. Threads then exit as soon as the buffer gets empty; you'd have to prevent the buffer from running empty prematurely, though, e.g. by prefilling all the data to be processed before starting the threads.
A variant might be a combination of both: exit if the global counter reaches 0, otherwise wait (e.g. via a condition variable) for the queue to receive new data, with the main thread continuously filling the queue while the worker threads are already running.
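A rough sketch of that queue-based variant, assuming a placeholder Item type and a process() function standing in for foo:
#include <condition_variable>
#include <mutex>
#include <queue>

struct Item { /* data to pass to the worker */ }; // placeholder type
void process(const Item&);                        // stands in for foo

std::mutex queue_mutex;
std::condition_variable queue_cv;
std::queue<Item> work_queue;
bool done = false; // set by the producer when no more data will come

void queueRunner()
{
    for (;;)
    {
        Item item;
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, []{ return done || !work_queue.empty(); });
            if (work_queue.empty())
                break; // done and drained: exit
            item = work_queue.front();
            work_queue.pop();
        }
        process(item);
    }
}

void pushItem(Item item) // producer side
{
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        work_queue.push(std::move(item));
    }
    queue_cv.notify_one();
}
// When finished producing: set done = true under the lock,
// then call queue_cv.notify_all() so idle workers can exit.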
You can use std::thread (from <thread>) and locks to do what you want, but it seems to me that your code could simply be made parallel using OpenMP, like this:
#include <omp.h> // for omp_get_thread_num()

#pragma omp parallel num_threads(k)
#pragma omp for
for (unsigned i = 0; i < N; ++i)
{
    auto t_id = omp_get_thread_num();
    if (t_id < k)
        foo();
    else
        other_foo();
}
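If the aim is specifically the "whichever thread is free takes the next iteration" behavior, OpenMP can also express that directly with a dynamic schedule; a minimal sketch under the same assumptions about foo, N and k:
#include <omp.h>

// Each idle thread grabs the next iteration as soon as it becomes free.
#pragma omp parallel for num_threads(k) schedule(dynamic)
for (unsigned i = 0; i < N; ++i)
{
    foo();
}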
I have a program that performs the same function on a large array. I break the array into equal chunks and pass them to threads. Currently the threads perform the function and return what they are supposed to, BUT the more threads I add, the longer each thread takes to run, which totally negates the purpose of concurrency. I have tried std::thread and std::async, both with the same result.
In the images below, the amount of data processed by all child threads and the main thread is the same size (main has 6 more points), but what main runs in ~12 seconds the child threads take ~12 seconds times the number of threads, as if they were running sequentially. Yet they all start at the same time, and if I output from each thread, they are running concurrently. Does this have something to do with how they are being joined? I have tried everything I can think of; any help/advice is much appreciated!
In the sample code, main doesn't run the function until after the child threads finish; if I put the join after main runs, it still doesn't run until the child threads finish. Below you can see the runtimes when run with 3 and 5 threads. These times are on a downscaled dataset for testing.
void foo(char* arg1, long arg2, std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>& ftrV) {
    std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>> Grid;
    // does stuff....
    // fills in "Grid"
    ftrV.set_value(Grid);
}
int main(){
    int thnmb = 3; // # of threads
    std::vector<long> buffers; // fill in buffers
    std::vector<char*> pointers; // fill in pointers
    std::vector<std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> PV(thnmb); // vector of promised grids
    std::vector<std::future<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> FV(thnmb); // vector of future grids
    std::vector<std::thread> th(thnmb); // vector of threads
    std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> vt1(thnmb); // vector to store thread grids
    for (int i = 0; i < thnmb; i++) {
        th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
    }
    for (int i = 0; i < thnmb; i++) {
        FV[i] = PV[i].get_future();
    }
    for (int i = 0; i < thnmb; i++) {
        vt1[i] = FV[i].get();
    }
    for (int i = 0; i < thnmb; i++) {
        th[i].join();
    }
    // main performs the same function as foo here
    // combine data
    // do other stuff..
    return(0);
}
It's hard to give a definitive answer without knowing what foo does, but you're probably running into memory access issues. Each access to your 5-dimensional array requires 5 memory lookups, and it only takes 2 or 3 memory-bound threads to saturate what a typical system can deliver.
main should perform its foo work after creating the threads but before getting the values of the promises.
And foo should probably end with ftrV.set_value(std::move(Grid)) so that a copy of that array won't have to be made.
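A minimal sketch of those two changes (Grid5D is just an alias introduced here for readability):
using Grid5D = std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>;

void foo(char* arg1, long arg2, std::promise<Grid5D>& ftrV) {
    Grid5D Grid;
    // ... fill in Grid ...
    ftrV.set_value(std::move(Grid)); // move, don't copy, the big array
}

// In main: start the threads, do main's share of the work while they
// run, and only then block on the futures and join.
for (int i = 0; i < thnmb; i++)
    th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
// ... main performs its own foo-equivalent work here, in parallel ...
for (int i = 0; i < thnmb; i++)
    vt1[i] = PV[i].get_future().get();
for (int i = 0; i < thnmb; i++)
    th[i].join();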
I was successfully testing an example using Boost's io_service:
for(x = 0; x < loops; x++)
{
    // Add work to ioService.
    for (i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data, pre_data[i]));
    }
    // Now that the ioService has work, use a pool of threads to service it.
    for (i = 0; i < number_of_threads; i++)
    {
        threadpool.create_thread(boost::bind(
            &boost::asio::io_service::run, &ioService));
    }
    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
This loops several times, and it takes rather long because the threads are re-created in every iteration.
Is there a way to create all the needed threads once, then post the work for each thread inside the loop, and after the work wait until all threads have finished?
Something like this:
// start/create threads
for (i = 0; i < number_of_threads; i++)
{
    threadpool.create_thread(boost::bind(
        &boost::asio::io_service::run, &ioService));
}
for(x = 0; x < loops; x++)
{
    // Add work to ioService.
    for (i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data, pre_data[i]));
    }
    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
The problem here is that your worker threads will finish immediately after creation, since there is no work to be done: io_service::run() just returns right away. So unless you manage to sneak in one of the post calls before all worker threads have had a chance to call run(), they will all finish immediately.
Two ways to fix this:
Use a barrier to stop the workers from calling run() right away. Only unblock them once the work has been posted.
Use an io_service::work object to prevent run() from returning. You can destroy the work object once you have posted everything (and you must do so before attempting to join the workers again).
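A minimal sketch of the work-object approach (classic pre-1.66 Boost.Asio, matching the io_service used in the question; the pool size here is an assumption):
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <memory>

int main()
{
    boost::asio::io_service ioService;
    boost::thread_group threadpool;

    // The work object keeps io_service::run() from returning while the
    // queue is empty, so the workers stay alive between posts.
    std::unique_ptr<boost::asio::io_service::work> work(
        new boost::asio::io_service::work(ioService));

    const unsigned number_of_threads = 4; // assumption: fixed pool size
    for (unsigned i = 0; i < number_of_threads; ++i)
        threadpool.create_thread(
            boost::bind(&boost::asio::io_service::run, &ioService));

    // ... post all the work here via ioService.post(...) ...

    work.reset();          // let run() return once the queue drains
    threadpool.join_all(); // now joining actually completes
    return 0;
}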
The loop wasn't really useful.
Here is a better example showing how it works.
I'm getting data in a callback:
void worker_task(uint8_t * data, uint32_t len)
{
    uint32_t pos = 0;
    while(pos < len)
    {
        pos += process_data(data + pos);
    }
}
void callback_f(uint8_t *data, uint32_t len)
{
    // split data into parts
    uint32_t number_of_data_per_thread = len / number_of_threads;
    // Add work to ioService.
    uint32_t x = 0;
    for (uint32_t i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data + x, number_of_data_per_thread));
        x += number_of_data_per_thread;
    }
    // Now that the ioService has work, use a pool of threads to service it.
    for (uint32_t i = 0; i < number_of_threads; i++)
    {
        threadpool.create_thread(boost::bind(
            &boost::asio::io_service::run, &ioService));
    }
    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
So this callback gets called very frequently by the host application (a media stream). If the len coming in is big enough, the thread pool makes sense, because the working time is higher than the time needed to initialize and start the threads.
If the len of the data is small, the advantage of the thread pool is lost, because initializing and starting the threads takes more time than processing the data.
My question now is whether it is possible to have the threads already running and waiting for data: when the callback gets called, push the data to the threads and wait for them to finish. The number of threads is constant (the CPU count).
And because it is a callback from a host application, the pointer to the data is only valid while inside the callback function. That is why I have to wait until all threads have finished their work.
A thread can start working immediately after getting its data, even before the other threads are started. There is no synchronization problem, because every thread has its own memory area of the data.
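Putting the pieces together, here is a rough sketch of a persistent pool plus a per-callback completion wait, reusing worker_task, ioService, threadpool and number_of_threads from above; the counter and condition variable are my own additions:
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <condition_variable>
#include <mutex>

std::mutex done_mutex;
std::condition_variable done_cv;
unsigned pending = 0; // chunks not yet fully processed

// Wraps worker_task so the callback can tell when all chunks are done.
void run_and_signal(uint8_t *data, uint32_t len)
{
    worker_task(data, len);
    std::lock_guard<std::mutex> lock(done_mutex);
    if (--pending == 0)
        done_cv.notify_one();
}

void callback_f(uint8_t *data, uint32_t len)
{
    // The pool threads were created once at startup and are kept alive
    // by an io_service::work object, as shown earlier.
    uint32_t chunk = len / number_of_threads;
    {
        std::lock_guard<std::mutex> lock(done_mutex);
        pending = number_of_threads;
    }
    uint32_t x = 0;
    for (uint32_t i = 0; i < number_of_threads; i++, x += chunk)
        ioService.post(boost::bind(&run_and_signal, data + x, chunk));
    // The data pointer is only valid inside this callback, so block
    // here until every chunk has been processed.
    std::unique_lock<std::mutex> lock(done_mutex);
    done_cv.wait(lock, []{ return pending == 0; });
}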
I would like to launch a member function in a separate thread, calling it from another member function.
Maybe the code below makes it clearer.
There is a button which launches the counter in a thread, and that works:
void MainWindow::on_pushButton_CountNoArgs_clicked()
{
    myCounter *counter = new myCounter;
    QFuture<void> future = QtConcurrent::run(counter, &myCounter::countUpToThousand);
}
The myCounter class member functions:
void myCounter::countUpToHundred()
{
    for(int i = 0; i <= 100; i++)
    {
        qDebug() << "up to 100: " << i;
    }
}

void myCounter::countUpToThousand()
{
    for(int i = 0; i <= 1000; i++)
    {
        qDebug() << "up to 1000: " << i;
        if (i == 500)
        {
            // here I want to launch myCounter::countUpToHundred() in another thread
        }
    }
}
Thanks in advance.
Assuming you want to run the 2 counters in parallel, you have 3 threads:
Thread 1: UI thread (or main thread)
Here runs on_pushButton_CountNoArgs_clicked(). You should not do hard work in this function: if you want to achieve 60 frames per second, you only have 16 ms for all the work, so starting a new thread to run countUpToThousand() is a good idea.
Thread 2: background thread (started by QtConcurrent, running countUpToThousand)
This runs in parallel to Thread 1, and you are working with the same instance of myCounter (i.e. the same place in memory), so be careful about which member variables you read and write.
Thread 3: background thread (started by QtConcurrent, running countUpToHundred)
Start it using (as hank pointed out):
void myCounter::countUpToThousand()
{
    for(int i = 0; i <= 1000; i++)
    {
        qDebug() << "up to 1000: " << i;
        if (i == 500)
        {
            QtConcurrent::run(this, &myCounter::countUpToHundred);
        }
    }
}
This will run in parallel to Thread 1 and Thread 2.
Now you might get crazy output like 988\n99\n, when one counter is around 988 and the other around 99, because Thread 2 and Thread 3 print to the console at the same time and don't care what the other thread is doing.
Also note that you must not delete counter before Thread 2 and Thread 3 are done, because if you do, they'll still try to access that memory and your application will probably crash.
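One way to handle that lifetime safely is to watch the future and delete the counter only once the work has finished. A sketch (note it only tracks the outer future; the inner countUpToHundred run would need its own future tracked the same way):
#include <QFutureWatcher>

void MainWindow::on_pushButton_CountNoArgs_clicked()
{
    myCounter *counter = new myCounter;
    QFuture<void> future = QtConcurrent::run(counter, &myCounter::countUpToThousand);

    // Delete 'counter' on the UI thread once the background work is done.
    auto *watcher = new QFutureWatcher<void>(this);
    connect(watcher, &QFutureWatcher<void>::finished, this, [counter, watcher]{
        delete counter;
        watcher->deleteLater();
    });
    watcher->setFuture(future);
}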
I am trying to use the Threading Building Blocks task_arena. There is a simple array filled with '0'. The arena's thread puts '1' in the odd places; the main thread puts '2' in the even places.
/* Odd-even arenas TBB test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>

using namespace std;

const int SIZE = 100;

int main()
{
    tbb::task_arena limited(1); // no more than 1 thread in this arena
    tbb::task_group tg;
    int myArray[SIZE] = {0};

    //! Main thread creates another thread, then immediately returns
    limited.enqueue([&]{
        //! Created thread continues here
        tg.run([&]{
            tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
                [&](const tbb::blocked_range<int> &r)
                {
                    for(int i = 0; i != SIZE; i++)
                        if(i % 2 == 0)
                            myArray[i] = 1;
                }
            );
        });
    });

    //! Main thread does this work
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
        [&](const tbb::blocked_range<int> &r)
        {
            for(int i = 0; i != SIZE; i++)
                if(i % 2 != 0)
                    myArray[i] = 2;
        }
    );

    //! Main thread waits for the 'tg' group
    //** it does not create any threads here (doesn't it?) */
    limited.execute([&]{
        tg.wait();
    });

    for(int i = 0; i < SIZE; i++) {
        cout << myArray[i] << " ";
    }
    cout << endl;
    return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enqueue{tg.run{...}} block doesn't seem to work.
What's the problem? Any ideas? Thank you.
You have created the limited arena for one thread only, and by default this slot is reserved for the master thread. Though enqueuing into such a serializing arena temporarily boosts its concurrency level to 2 (in order to satisfy the 'fire-and-forget' promise of enqueue), enqueue() does not guarantee synchronous execution of the submitted task. So tg.wait() can start before tg.run() executes, and thus the program does not wait while the worker thread is created, joins the limited arena, and fills the array with '1' (BTW, since the loop body ignores the blocked_range, the whole array is filled in each of the parallel_for iterations).
So, in order to wait for tg.run() to complete, use limited.execute instead. But that prevents the automatic boost of the arena's concurrency level, and the task will be deferred until tg.wait() is executed by the master thread.
If you want to see asynchronous execution, set the arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable the slot reservation for the master thread: tbb::task_arena limited(1,0) (but note that this implies additional overheads for dynamic balancing of the number of threads in the arena).
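For illustration, a sketch of the synchronous limited.execute variant; it also iterates over the blocked_range r instead of the whole array, fixing the redundant fills noted above:
limited.execute([&]{
    tg.run([&]{
        tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
            [&](const tbb::blocked_range<int> &r)
            {
                for(int i = r.begin(); i != r.end(); i++)
                    if(i % 2 == 0)
                        myArray[i] = 1;
            }
        );
    });
    tg.wait(); // completes before execute() returns
});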
P.S. TBB has no points where threads are guaranteed to arrive (unlike OpenMP). Only the enqueue methods guarantee the creation of at least one worker thread, but they say nothing about when it will arrive. See the local observer feature to get a notification when threads actually join arenas.