c++ pthread limit number of threads

I tried to use pthreads to do some tasks faster. I have thousands of files (in args) to process, and I want to create just a small number of threads many times.
Here's my code:
void callThread(){
    int nbt = 0;
    pthread_t *vp = (pthread_t*)malloc(sizeof(pthread_t)*NBTHREAD);
    for(int i=0;i<args.size();i+=NBTHREAD){
        for(int j=0;j<NBTHREAD;j++){
            if(i+j<args.size()){
                pthread_create(&vp[j],NULL,calcul,&args[i+j]);
                nbt++;
            }
        }
        for(int k=0;k<nbt;k++){
            if(pthread_join(vp[k], NULL)){
                cout<<"ERROR pthread_join()"<<endl;
            }
        }
    }
}
It returns an error, and I don't know if this is a good way to solve my problem. All the resources are in args (a vector of structs) and are independent.
Thanks for help.

You're better off making a thread pool with as many threads as the CPU has cores, then feeding the tasks to this pool and letting it do its job. You should take a look at this blog post for a great example of how to go about creating such a thread pool.
A couple of tips that are not mentioned in that post:
Use std::thread::hardware_concurrency() to get the number of cores.
Figure out a way to store the tasks; hint: std::packaged_task or something along those lines, wrapped in a class so you can track things such as when a task is done, or implement task.join().
Also, a GitHub repo with the code of his implementation, plus some extra stuff such as std::future support, can be found here.
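To make those tips concrete, here is a minimal sketch of such a pool (this is not the blog post's implementation; the class and member names are made up for illustration): a fixed set of workers pulls std::packaged_task jobs off a mutex-guarded queue, and submit() hands back a std::future so the caller can tell when a task is done.

#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

class ThreadPool {
public:
    // one worker per core by default
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    // accepts any void() callable; the returned future signals completion
    template <class F>
    std::future<void> submit(F f) {
        std::packaged_task<void()> task(std::move(f));
        std::future<void> fut = task.get_future();
        { std::lock_guard<std::mutex> lk(m_); queue_.push_back(std::move(task)); }
        cv_.notify_one();
        return fut;
    }
private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task(); // run the job outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::deque<std::packaged_task<void()>> queue_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

With something like this, the question's batched loop shrinks to one pool.submit([&args, i]{ calcul(&args[i]); }) per file, with the returned futures waited on at the end.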

You can use a semaphore to limit the number of parallel threads; here is pseudo code:
Semaphore S = MAX_THREADS_AT_A_TIME // initial semaphore value
declare handle_array[NUM_ITERS];
for(i=0 to NUM_ITERS)
{
    wait-while(S <= 0);
    Acquire-Semaphore; // S--
    handle_array[i] = Run-Thread(MyThread);
}
for(i=0 to NUM_ITERS)
{
    Join_thread(handle_array[i])
    Close_handle(handle_array[i])
}
MyThread()
{
    mutex.lock
    critical-section
    mutex.unlock
    release-semaphore // S++
}

Related

How to count completed jobs in a thread pool without sharing a single variable?

I have a thread pool that accepts jobs (function pointers + data), giving them to a worker thread to complete. Some of the jobs are given a pointer to a completion count std::atomic<uint32> which they increment when done, so the main thread creating those jobs can know how many of those jobs have finished.
The problem though, is that 12+ threads are contending on a single uint32. I've put a lot of work into separating jobs and their data along cache lines, so this is the only source of contention left I want to eliminate, but I'm not sure how best to solve this particular issue.
What would be the simplest way to gather the number of completed jobs, without sharing a single uint32 between multiple threads?
(It's okay if the main thread has to refresh its cache when it checks this count; I only want to avoid dirtying the caches of the worker threads. Also, the worker threads don't need to know the count; they only increment it, while the main thread only reads it.)
update:
I'm currently trying out the idea of not sharing a single count at all, but having a count for each worker thread that the main thread can add together when checking. The idea being that the main thread pays the main price (which is fine since it's waiting to "join" anyway).
Here's the code I have, in its ugly, cooked-in-10-minutes form:
class Promise {
    friend ThreadPool;
public:
    ~Promise() {
        // this function destroys our object's memory underneath us; no members with destructors
        m_pool->_destroyPromise(this);
    }
    void join() {
        while (isDone() == false) {
            if (m_pool->doAJob() == false) {
                // we've no jobs to steal, try to spin a little gentler
                std::this_thread::yield();
            }
        }
    }
    void setEndCount(uint32 count) {
        m_endCount = count;
    }
    bool isDone() {
        return m_endCount == getCount();
    }
    uint32 getCount() {
        uint32 count = 0;
        for (uint32 n = 0; n < m_countCount; ++n) {
            count += _getCountRef(n)->load();
        }
        return count;
    }
    uint32 getRemaining() {
        return m_endCount - getCount();
    }
private:
    // ThreadPool creates these as a factory
    Promise(ThreadPool * pool, uint32 countsToKeep, uint32 endCount, uint32 countStride, void * allocatedData)
        : m_pool(pool)
        , m_endCount(endCount)
        , m_countCount(countsToKeep)
        , m_countStride(countStride)
        , m_perThreadCount(allocatedData)
    {}
    // all worker IDs start at 1, not 0; only ThreadPool should use this directly
    std::atomic<uint32> * _getCountRef(uint32 workerID = 0) {
        return (std::atomic<uint32>*)((char*)m_perThreadCount + m_countStride * workerID);
    }
    // data
    uint32 m_endCount;
    uint32 m_countCount; // the count of how many counts we're counting
    uint32 m_countStride;
    ThreadPool * m_pool;
    void * m_perThreadCount; // an atomic count for each worker thread + one volunteer count (for non-worker threads), separated by cache line
};
update two:
Testing this, it seems to work quite well. It's a pretty large structure unfortunately, 64 bytes * worker thread count (for me this is pushing a KB), but the speed gain is about 5%+ for the jobs I usually use. I guess this could work for now.
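Stripped of the pool plumbing, the core of the trick is one atomic per worker, each padded to its own cache line; a minimal sketch (assuming 64-byte cache lines and C++17, so the vector's allocation honors alignas; std::hardware_destructive_interference_size would be the portable spelling of the 64):

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// one counter per worker, each on its own cache line, so an increment
// by one worker never invalidates another worker's line
struct alignas(64) PaddedCount {
    std::atomic<uint32_t> value{0};
};

struct CompletionCounts {
    std::vector<PaddedCount> counts;
    explicit CompletionCounts(std::size_t numWorkers) : counts(numWorkers) {}

    // worker `w` finished a job; release pairs with the acquire below so
    // the main thread also sees the job's results once it sees the count
    void bump(std::size_t w) {
        counts[w].value.fetch_add(1, std::memory_order_release);
    }

    // main thread only: sums every line, paying the cache misses itself
    uint32_t total() const {
        uint32_t sum = 0;
        for (const PaddedCount& c : counts)
            sum += c.value.load(std::memory_order_acquire);
        return sum;
    }
};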

What is the optimal multithreading scenario for processing a long file's lines?

I have a big file and I want to read and also [process] all the lines (the even lines) of the file with multiple threads.
One suggestion is to read the whole file and break it into multiple files (the same count as there are threads), then let every thread process a specific file. Since this idea reads the whole file, writes it out again, and then reads the multiple files, it seems slow (3x I/O), and I think there must be better scenarios.
I myself thought this could be a better scenario:
One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:
One thread reads the main file, running function func1, and puts each even line in a buffer, line1Buffer, of max size MAX_BUFFER_SIZE, while the other threads pop their data from the buffer and process it, running function func2. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool moreToRead = true; // set to false to stop threads 2 through N
string file = "reads.fq";
Function func1 : (thread 1)
void func1(){
    string ReadSeq;
    ifstream ifstr(file.c_str());
    for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
        getline(ifstr, ReadSeq);
        getline(ifstr, ReadSeq); // keep only the even line
        while (line1Buffer.size() == MAX_BUFFER_SIZE)
            ; // spin while the buffer is full
        line1Buffer.push_back(ReadSeq);
    }
    moreToRead = false;
    return;
}
And function func2: (other threads)
void func2(){
    string ReadSeq;
    while (moreToRead) {
        // note: access to line1Buffer is unsynchronized here; a real
        // version needs a mutex (see the answers below)
        if (line1Buffer.size() > 0) {
            ReadSeq = line1Buffer.back();
            line1Buffer.pop_back();
            // do the processing....
        }
    }
}
About the speed:
If the reading part is slower, the total time will equal the time to read the file just once (and the buffer may hold only 1 line at a time, so only 1 other thread will be able to work alongside thread 1). If the processing part is slower, the total time will equal the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file and writing it into multiple files with 1 thread, then reading the files and processing with multiple threads...
And so there are 2 questions:
1- How do I call the functions from threads such that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously we know M + N == numberOfThreads.
Edit: the 3rd question is not right, as multiple threads can't help in reading a single file.
Thanks all
Another approach could be interleaved threads.
Reading is done by every thread, but only by one at a time. Because of the waiting in the very first iteration, the threads will be interleaved.
But this is only a scalable option if work() is the bottleneck (otherwise any non-parallel execution would be better).
Thread:
while (!end) {
    // should be fair!
    lock();
    read();
    unlock();
    work();
}
Basic example (you should probably add some error handling):
void thread_exec(ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file, line)) {
            // we only want to process the even lines
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();
        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }
        // free old data
        data.clear();
    // until EOF is reached
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
    // counter for checking whether a line is even
    int global_line_counter = 0;
    // open file
    ifstream ifstr(file.c_str());
    // mutex for synchronization; maybe a fair lock would be a better solution
    std::mutex mutex;
    // create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter)
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}
What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed as it increases disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads. The two scenarios mentioned in the post and answers:
1) Reading with one thread and processing with the others: #p-1 threads process while only one thread reads. Let jobTime be the time for the full operation and pTime(n) the time for processing with n threads. Then:
The worst case occurs when reading is much slower than processing, where jobTime = pTime(1) + readTime; the best case is when processing is slower than reading, where jobTime equals pTime(#p-1) + readTime.
2) Reading and processing with all #p threads: here every thread does two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE, which is sequential; no two threads can read at the same time. The second step is to process the read data, which can be done in parallel. This way, in the worst case jobTime is pTime(1) + readTime as before (but*), while the best case is pTime(#p) + readTime, which is better than the previous scenario.
*: in the 2nd approach's worst case, although reading is slower, you can find an optimized MAX_BUFFER_SIZE for which (even in the worst case) some reading by one thread overlaps with some processing by another. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1) + readTime and could converge to readTime.
First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, file reading will be the limit.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push in front and pop out back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
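A sketch of such a bounded queue, built from the pieces named above (std::deque, a mutex, and a condition variable per direction); the class name is made up, and a real version also needs a closed/done flag in pop()'s predicate so workers can drain and exit at end of file:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// bounded queue: the reader pushes in front and blocks when full,
// workers pop out back and block when empty
class BoundedLineQueue {
public:
    explicit BoundedLineQueue(std::size_t maxSize) : maxSize_(maxSize) {}

    void push(std::string line) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [this] { return q_.size() < maxSize_; });
        q_.push_front(std::move(line));
        notEmpty_.notify_one();
    }

    std::string pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [this] { return !q_.empty(); });
        std::string line = std::move(q_.back());
        q_.pop_back();
        notFull_.notify_one();
        return line;
    }

private:
    std::deque<std::string> q_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
    std::size_t maxSize_;
};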

Boost Thread_Group in a loop is very slow

I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through lines of pixel data, then each pixel, and then multiple images. It's a weird project, but anyway, I bind each thread to a method in the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding a thread for each, and when the loop is done, join_all takes effect once the threads finish. Then it goes to the next pixel and starts over.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
    for (int i = 0; i < 100; i++) {
        std::cout << "Index : " << index << " " << i << std::endl;
    }
}
int main() {
    boost::thread_group tGroup;
    for (int i = 0; i < 50; i++) {
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();
    int done;
    std::cin >> done;
    return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is. It takes about 4 seconds for one loop over sourceImageData (one line) to complete. I'm new to Boost threading, so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but one per scan-line, for example).
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you are basically pausing execution of any new threads until ALL the threads that need to be synchronized (here, every active thread) are done running.
If the iterations of the innermost loop (the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
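Putting both answers together, a sketch of the restructured loop (assuming the per-pixel calls are independent and this still runs inside the same ClassX member function as the original): one thread per scan-line instead of one per pixel/image, never more threads alive than cores, and one join per batch of lines rather than one per pixel.

unsigned nCores = boost::thread::hardware_concurrency();
for (int line = 0; line < (int)sourceImageData.size(); line += nCores) {
    boost::thread_group tGroup;
    int end = std::min<int>(line + nCores, (int)sourceImageData.size());
    for (int l = line; l < end; ++l) {
        tGroup.create_thread([this, l] {
            // one thread walks every pixel and image of one scan-line
            for (int pixel = 0; pixel < (int)sourceImageData[l].size(); ++pixel)
                for (int im = 0; im < (int)m_images.size(); ++im)
                    ClassXFunction(l, pixel, im);
        });
    }
    tGroup.join_all(); // join once per batch of scan-lines
}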

Boost thread_group thread limiter

I am trying to limit the number of threads alive at any time to at most the number of cores available. Is the following a reasonable method? Is there a better alternative? Thanks!
boost::thread_group threads;
iThreads = 0;
for (int i = 0; i < Utility::nIterations; i++)
{
    threads.create_thread(
        boost::bind(&ScenarioInventory::BuildInventoryWorker, this, i));
    thread_limiter.lock();
    iThreads++;
    thread_limiter.unlock();
    while (iThreads > nCores)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
threads.join_all();

void ScenarioInventory::BuildInventoryWorker(int i)
{
    //code code code....
    thread_limiter.lock();
    iThreads--;
    thread_limiter.unlock();
}
What you are likely looking for is a thread_pool with a task queue.
Have a fixed number of threads blocking on a queue. Whenever a task is pushed onto the queue, a worker thread gets signalled (condition variable) and processes the task.
That way you
don't have the (inefficient) waiting lock
don't have any more threads than the "maximum"
don't have to block in the code that pushes a task
don't have redundant creation of threads each time around
See this answer for two different demos of such a thread pool w/ task queue: Calculating the sum of a large vector in parallel

Spawn a set of threads iteratively in C++11?

I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix into n chunks, where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I can spawn a new thread when an existing thread finishes. (As the compute time will differ widely between entries, dividing the matrix equally will not be very efficient here; hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variable, although I don't clearly see how they can be exploited for this kind of spawning.) Some example pseudo code would help greatly!
Why tax the OS scheduler with thread creation and destruction? (Assume these operations are expensive.) Instead, make your existing threads do more work.
EDIT: If you do not want to split the work into equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What follows assumes that you can split the work into equal chunks, so it is not exactly applicable to your question. END OF EDIT.
struct matrix
{
    int nrows, ncols;
    // assuming row-based processing; adjust for column-based processing
    void fill_rows(int first, int last);
};

int num_threads = std::thread::hardware_concurrency();
std::vector<std::thread> threads(num_threads);
matrix m; // must be initialized...

// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for (int i = 0; i != num_threads; ++i)
{
    // thread i will process these rows:
    int first = i * nrows_per_thread;
    int last = first + nrows_per_thread;
    // last thread gets remaining rows
    last += (i == num_threads - 1) ? m.nrows % nrows_per_thread : 0;
    threads[i] = std::thread([&m, first, last] { m.fill_rows(first, last); });
}
for (int i = 0; i != num_threads; ++i)
{
    threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.
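For the finer-chunks idea itself, a common middle ground (a sketch reusing the matrix type from the answer above; the chunk size of 4 is arbitrary) is to keep a fixed set of threads and let each worker claim its next chunk from a shared std::atomic counter. That gives the load balancing the question asks for, without creating a thread per chunk:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

void fill_dynamic(matrix& m, int chunk_rows = 4)
{
    std::atomic<int> next_row{0};
    unsigned num_threads = std::thread::hardware_concurrency();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i != num_threads; ++i)
    {
        threads.emplace_back([&m, &next_row, chunk_rows] {
            for (;;)
            {
                // fetch_add returns the previous value: our chunk's first row
                int first = next_row.fetch_add(chunk_rows);
                if (first >= m.nrows) return; // no rows left
                int last = std::min(first + chunk_rows, m.nrows);
                m.fill_rows(first, last);
            }
        });
    }
    for (std::thread& t : threads) t.join();
}

Slow entries stall only their own chunk; the other workers keep pulling rows, so no thread ever sits idle while work remains.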