I have an application where a bit of parallel processing would be of benefit. For the purposes of the discussion, let's say there is a directory with 10 text files in it, and I want to start a program that forks off 10 processes, each taking one of the files and uppercasing its contents. I acknowledge that the parent program can wait for the children to complete using one of the wait functions, or using the select function.
What I would like to do is have the parent process monitor the progress of each forked process, and display something like a progress bar as the processes run.
My question:
What reasonable alternatives do I have for the forked processes to communicate this information back to the parent? What IPC techniques would be reasonable to use?
In this kind of situation, where you only want to monitor progress, the easiest alternative is to use shared memory. Every process updates its progress value (e.g. an integer) in a shared memory block, and the master process reads the block regularly. Basically, you don't need any locking in this scheme. It is also a "polling" style design: the master can read the information whenever it wants, so you do not need any event handling for the progress data.
If the only progress you need is "how many jobs have completed?", then a simple
while (jobs_running) {
    pid = wait(&status);
    for (i = 0; i < num_jobs; i++)
        if (pid == jobs[i]) {
            jobs_running--;
            break;
        }
    printf("%i/%i\n", num_jobs - jobs_running, num_jobs);
}
will do. For reporting progress while, well, in progress, here are dumb implementations of some of the other suggestions.
Pipes:
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int child(int fd) {
    int i;
    struct timespec ts;
    for (i = 0; i < 100; i++) {
        /* report progress so far, then simulate a random amount of work */
        write(fd, &i, sizeof(i));
        ts.tv_sec = 0;
        ts.tv_nsec = rand() % 512 * 1000000; /* sleep 0-511 ms */
        nanosleep(&ts, NULL);
    }
    write(fd, &i, sizeof(i)); /* final report: i == 100 means "done" */
    exit(0);
}

int main() {
    int fds[10][2];
    int i, j, total, status[10] = {0};
    for (i = 0; i < 10; i++) {
        pipe(fds[i]);
        if (!fork())
            child(fds[i][1]); /* each child reports on its own pipe */
    }
    for (total = 0; total < 1000; sleep(1)) {
        for (i = 0; i < 10; i++) {
            struct pollfd pfds = {fds[i][0], POLLIN};
            /* drain everything the child has written since the last pass */
            for (poll(&pfds, 1, 0); pfds.revents & POLLIN; poll(&pfds, 1, 0)) {
                read(fds[i][0], &status[i], sizeof(status[i]));
                for (total = j = 0; j < 10; j++)
                    total += status[j];
            }
        }
        printf("%i/1000\n", total);
    }
    return 0;
}
Shared memory:
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int child(int *o, sem_t *sem) {
    int i;
    struct timespec ts;
    for (i = 0; i < 100; i++) {
        /* publish progress under the semaphore, then simulate work */
        sem_wait(sem);
        *o = i;
        sem_post(sem);
        ts.tv_sec = 0;
        ts.tv_nsec = rand() % 512 * 1000000; /* sleep 0-511 ms */
        nanosleep(&ts, NULL);
    }
    sem_wait(sem);
    *o = i; /* final value: 100 means "done" */
    sem_post(sem);
    exit(0);
}

int main() {
    int i, size, total;
    void *page;
    int *status;
    sem_t *sems;
    size = sysconf(_SC_PAGESIZE);
    /* round the allocation up to a whole number of pages */
    size = (10 * sizeof(*status) + 10 * sizeof(*sems) + size - 1) & ~(size - 1);
    page = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
    status = page;
    sems = (void *)&status[10];
    for (i = 0; i < 10; i++) {
        status[i] = 0;
        sem_init(&sems[i], 1, 1); /* pshared = 1: shared between processes */
        if (!fork())
            child(&status[i], &sems[i]);
    }
    for (total = 0; total < 1000; sleep(1)) {
        for (total = i = 0; i < 10; i++) {
            sem_wait(&sems[i]);
            total += status[i];
            sem_post(&sems[i]);
        }
        printf("%i/1000\n", total);
    }
    return 0;
}
Error handling etc. elided for clarity.
A few options (no idea which, if any, will suit you -- a lot depends on what you are actually doing, as opposed to the "uppercasing files" analogy):
signals (see the sketch after this list)
fifos / named pipes
the STDOUT of the children or other passed handles
message queues (if appropriate)
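For the signals option, here is a rough sketch (my own illustration, not from the original answer): each child raises SIGUSR1 once per unit of work, and the parent counts the signals. Note that standard signals are not queued, so near-simultaneous ticks can be merged and counts lost -- good enough for a rough progress bar, not for exact accounting.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t progress = 0;

static void on_usr1(int) { ++progress; } // touches only a sig_atomic_t: async-signal-safe

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_usr1;              // no SA_RESTART: SIGUSR1 will interrupt wait()
    sigaction(SIGUSR1, &sa, nullptr);
    for (int i = 0; i < 10; i++) {
        if (fork() == 0) {                // child: 10 units of simulated work
            for (int u = 0; u < 10; u++) {
                usleep(100000);
                kill(getppid(), SIGUSR1); // one "tick" of progress
            }
            _exit(0);
        }
    }
    // wait() returns when a child exits, or fails with EINTR when a signal
    // arrives; either way, print the count. ECHILD means all children are done.
    while (wait(nullptr) != -1 || errno == EINTR)
        printf("%d/100\n", (int)progress);
    return 0;
}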
If all you want is a progress update, by far the easiest way is probably to use an anonymous pipe. The pipe(2) call will give you two file descriptors, one for each end of the pipe. Call it just before you fork, and have the parent listen to the first fd and the child write to the second. (This works because both the file descriptors and the two-element array containing them are shared between the processes -- not shared memory per se, but it's copy-on-write so they share the values unless you overwrite them.)
Just earlier today someone told me that they always use a pipe, through which the children can notify the parent process that all is going well. This seems a decent solution, and is especially useful in places where you would want to print an error but no longer have access to stdout/stderr, etc.
Boost.MPI should be useful in this scenario. You may consider it overkill but it's definitely worth investigating:
www.boost.org/doc/html/mpi.html
Related
I have read various articles on C++ threading, among others the GeeksForGeeks article. I have also read this question, but none of these answers what I need. In my project (which is too complex to mention here), I would need something along these lines:
#include <iostream>
#include <thread>

using namespace std;

class Simulate {
public:
    int Numbers[100][100];
    thread Threads[100][100];

    // Method to be passed to thread - in the same way as function pointer?
    void DoOperation(int i, int j) {
        Numbers[i][j] = i + j;
    }

    // Method to start the thread from
    void Update() {
        // Start executing threads
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j++) {
                Threads[i][j] = thread(DoOperation, i, j);
            }
        }
        // Wait till all of the threads finish
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j++) {
                if (Threads[i][j].joinable()) {
                    Threads[i][j].join();
                }
            }
        }
    }
};

int main()
{
    Simulate sim;
    sim.Update();
}
How can I do this, please? Any help is appreciated, and alternative solutions are welcomed. I am a mathematician by training, learning C++ for less than a week, so simplicity is preferred. I desperately need something along these lines to make my research simulations faster.
The easiest way to call member functions and pass arguments is to use a lambda expression:
Threads[i][j] = std::thread([this, i, j](){ this->DoOperation(i, j); });
The variables listed in [] are captured and their values can be used by the code inside {}. The lambda itself has a unique anonymous type, but the std::thread constructor is a template that accepts any callable type, so the lambda can be passed to it directly.
However, starting 100x100 = 10000 threads will quickly exhaust memory on most systems. Adding more threads than there are CPU cores does not improve performance for computational tasks. Instead it is a better idea to start e.g. 10 threads that each process 1000 items in a loop.
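As a rough sketch of that idea (the thread count of 10 and the row-band split are illustrative choices of mine, not part of the question):

#include <thread>
#include <vector>

class Simulate {
public:
    int Numbers[100][100];

    void DoOperation(int i, int j) { Numbers[i][j] = i + j; }

    void Update() {
        const int numThreads = 10; // arbitrary; a value near std::thread::hardware_concurrency() is typical
        std::vector<std::thread> workers;
        for (int t = 0; t < numThreads; t++) {
            // each worker fills its own band of rows, so no two threads touch the same element
            workers.emplace_back([this, t, numThreads] {
                for (int i = t * 100 / numThreads; i < (t + 1) * 100 / numThreads; i++)
                    for (int j = 0; j < 100; j++)
                        DoOperation(i, j);
            });
        }
        for (auto& w : workers)
            w.join();
    }
};

int main() {
    Simulate sim;
    sim.Update();
}

Because each worker owns a disjoint band of rows, no locking is needed.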
I've made a program which processes a lot of data, and it takes forever at runtime, but looking in Task Manager I found out that the executable only uses a small part of my CPU and my RAM...
How can I tell my IDE to allocate more resources (as much as it can) to my program?
Running it in Release x64 helps, but not enough.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    using namespace std;
    struct library {
        int num = 0;
        unsigned int total = 0;
        int booksnum = 0;
        int signup = 0;
        int ship = 0;
        vector<int> scores;
    };
    unsigned int libraries = 30000; // in the program this number is read from a file
    unsigned int books = 20000;     // in the program this number is read from a file
    unsigned int days = 40000;      // in the program this number is read from a file
    vector<int> scores(books, 0);
    vector<library*> all(libraries);
    for(auto& it : all) {
        it = new library;
        it->booksnum = 15000; // in the program this number is read from a file
        it->signup = 50000;   // in the program this number is read from a file
        it->ship = 99999;     // in the program this number is read from a file
        it->scores.resize(it->booksnum, 0);
    }
    unsigned int past = 0;
    for(size_t done = 0; done < all.size(); done++) {
        if(!(done % 1000)) cout << done << '-' << all.size() << endl;
        for(size_t m = done; m < all.size() - 1; m++) {
            all[m]->total = 0;
            {
                double run = past + all[m]->signup;
                for(auto at : all[m]->scores) {
                    if(days - run > 0) {
                        all[m]->total += scores[at];
                        run += 1. / all[m]->ship;
                    } else
                        break;
                }
            }
        }
        for(size_t n = done; n < all.size(); n++)
            for(size_t m = 0; m < all.size() - 1; m++) {
                if(all[m]->total < all[m + 1]->total) swap(all[m], all[m + 1]);
            }
        past += all[done]->signup;
        if (past > days) break;
    }
    return 0;
}
This is the loop which takes up so much time... For some reason, even using pointers to library doesn't speed it up.
RAM doesn't make things go faster. RAM is just there to store data your program uses; if it's not using much then it doesn't need much.
Similarly, in terms of CPU usage, the program will use everything it can (the operating system can change priority, and there are APIs for that, but this is probably not your issue).
If you're seeing it using a fraction of CPU percentage, the chances are you're either waiting on I/O, or you've written a single-threaded application that can only use a single core at any one time. If you've optimised your solution as much as possible on a single thread, then it's worth looking into breaking its work down across multiple threads.
What you need to do is use a tool called a profiler to find out where your code is spending its time and then use that information to optimise it. This will help you with microoptimisations especially, but for larger algorithmic changes (i.e. changing how it works entirely), you'll need to think about things at a higher level of abstraction.
I have a program which reads the file line by line and then stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads in my program so that it will read 5 lines and then use five different threads to do the processing. The processing involves reading each substring of that line and putting it into the hash map with its frequency. But there seems to be something wrong which I could not figure out, since the program is not faster than the serial approach. Also, for a large input file it is aborted. Here is the piece of code I am using:
unordered_map<string, int> m;
mutex mtx;

void parseLine(char *line, int subLen) {
    int no_substr = strlen(line) - subLen;
    for(int i = 0; i <= no_substr; i++) {
        char *subStr = (char*) malloc(sizeof(char) * subLen + 1);
        strncpy(subStr, line + i, subLen);
        subStr[subLen] = '\0';
        mtx.lock();
        string s(subStr);
        if(m.find(s) != m.end()) m[s]++;
        else {
            pair<string, int> ret(s, 1);
            m.insert(ret);
        }
        mtx.unlock();
    }
}
int main() {
    char **Array = (char **) malloc(sizeof(char *) * num_th + 1);
    int num = 0;
    while (NOT END OF FILE) {
        if(num < num_th) {
            if(num == 0)
                for(int x = 0; x < num_th; x++)
                    Array[x] = (char*) malloc(sizeof(char) * strlen(line) + 1);
            strcpy(Array[num], line);
            num++;
        }
        else {
            vector<thread> threads;
            for(int i = 0; i < num_th; i++) {
                threads.push_back(thread(parseLine, Array[i], subLen));
            }
            for(int i = 0; i < num_th; i++) {
                if(threads[i].joinable()) {
                    threads[i].join();
                }
            }
            for(int x = 0; x < num_th; x++) free(Array[x]);
            num = 0;
        }
    }
}
It's a myth that, just by virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met(*):
1) You actually have to have sufficient physical CPU cores that can run the threads at the same time.
2) The threads have to have independent tasks that they can do on their own.
From a cursory examination, the shown code seems to fail on the second part. It seems to me that, most of the time, all of these threads will be fighting each other to acquire the same mutex. There's little to be gained from multithreading in this situation.
(*) Of course, you don't always use threads for purely performance reasons. Multithreading also comes in useful in many other situations, for example in a program with a GUI: having a separate thread update the GUI keeps the UI responsive even while the main execution thread is chewing on something for a while...
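To make the second condition hold in a case like this one, the usual pattern (a sketch of the general idea, not a drop-in fix for the code above) is to give each thread a private map and merge into the shared one once per line, so the mutex is taken rarely instead of once per substring:

#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

std::unordered_map<std::string, int> freq;
std::mutex freqMtx;

void countSubstrings(const std::string& line, size_t subLen) {
    std::unordered_map<std::string, int> local; // private to this thread: no locking needed
    for (size_t i = 0; i + subLen <= line.size(); i++)
        local[line.substr(i, subLen)]++;
    std::lock_guard<std::mutex> lock(freqMtx);  // one lock per line, not per substring
    for (const auto& kv : local)
        freq[kv.first] += kv.second;
}

int main() {
    std::thread t1(countSubstrings, std::string(60, 'a'), 50);
    std::thread t2(countSubstrings, std::string(55, 'b'), 50);
    t1.join();
    t2.join();
    std::printf("%zu distinct substrings\n", freq.size());
}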
I have created a model of a more complex program that will utilize multithreading and multiple hard drives to increase performance. The data size is so large that reading all of it into memory is not feasible, so the data will be read, processed, and written back out in chunks. This test program uses a pipeline design to be able to read, process, and write at the same time on 3 different threads. Because the reads and writes go to different hard drives, there is no problem with reading and writing at the same time. However, the multithreaded version seems to run 2x slower than its linear version (also in the code). I have tried to have the read and write threads not be destroyed after running a chunk, but the synchronization seems to have slowed it down even more than the current version. I was wondering if I am doing something wrong, or how I can improve this. Thank you.
Tested using an i3-2100 @ 3.1GHz and 16GB of RAM.
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>
#include <thread>

#define CHUNKSIZE 8192   //size of each chunk to process
#define DATASIZE 2097152 //total size of data

using namespace std;

int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE/CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;

/*
Process a chunk of data (simulation only; does not need to sort all data)
*/
void quickSort(int arr[], int left, int right) {
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];
    while (i <= j) {
        while (arr[i] < pivot) i++;
        while (arr[j] > pivot) j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    };
    if (left < j) quickSort(arr, left, j);
    if (i < right) quickSort(arr, i, right);
}

/*
Find runtime
*/
void diffclock() {
    double diff = (endtime - starttime)/(CLOCKS_PER_SEC/1000);
    cout<<"Total run time: "<<diff<<"ms"<<endl;
}

/*
Read a chunk of data
*/
void readData() {
    for(int i = 0; i < CHUNKSIZE; i++) {
        infile>>data[run%3][i];
    }
    finishRead = true;
}

/*
Write a chunk of data
*/
void writeData() {
    for(int i = 0; i < CHUNKSIZE; i++) {
        outfile<<data[(run-2)%3][i]<<endl;
    }
    finishWrite = true;
}

/*
Pipeline read, process, write using multiple threads
*/
void threadtransfer() {
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
    thread read, write;
    run = 0;
    readData();
    run = 1;
    readData();
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    run = 2;
    while(run < totalRun) {
        //cout<<run<<endl;
        finishRead = finishWrite = false;
        read = thread(readData);
        write = thread(writeData);
        read.detach();
        write.detach();
        quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
        while(!finishRead||!finishWrite) {} //check if next cycle is ready
        run++;
    }
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    writeData();
    run++;
    writeData();
    infile.close();
    outfile.close();
    endtime = clock();
    diffclock();
}

/*
Linearly read, sort, and write a chunk, and repeat.
*/
void lineartransfer() {
    int totalRun = DATASIZE/CHUNKSIZE;
    int holder[CHUNKSIZE];
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
    run = 0;
    while(run < totalRun) {
        for(int i = 0; i < CHUNKSIZE; i++) infile>>holder[i];
        quickSort(holder, 0, CHUNKSIZE - 1);
        for(int i = 0; i < CHUNKSIZE; i++) outfile<<holder[i]<<endl;
        run++;
    }
    endtime = clock();
    diffclock();
}

/*
Create a large amount of data for testing
*/
void createData() {
    outfile.open("/home/pcg/test/iothread/source.txt");
    for(int i = 0; i < DATASIZE; i++) {
        outfile<<rand()<<endl;
    }
    outfile.close();
}

int main() {
    int mode = 0;
    cout<<"Number of threads: "<<thread::hardware_concurrency()<<endl;
    cout<<"Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
    cin>>mode;
    if(mode == 1) createData();
    else if(mode == 2) threadtransfer();
    else if(mode == 3) lineartransfer();
    return 0;
}
Don't busy-wait. This wastes precious CPU time and may well slow down the rest. (Not to mention the compiler may optimize the loop into an infinite one, because it cannot guess whether those flags will change, so the code isn't even correct in the first place.) And don't detach() either. Replace both detach() and the busy-waiting with join():
while (run < totalRun) {
    read = thread(readData);
    write = thread(writeData);
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    read.join();
    write.join();
    run++;
}
As to the global design, well, ignoring the global variables, I guess it's otherwise acceptable if you don't expect the processing (quickSort) part to ever exceed the read/write time. I for one would use message queues to pass the buffers between the various threads (which allows you to add more processing threads if you need them, either doing the same tasks in parallel or different tasks in sequence), but maybe that's because I'm used to doing it that way.
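A bare-bones version of such a queue, as a sketch only (a real one would also need a way to signal shutdown):

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class MessageQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            queue_.push(std::move(item));
        }
        cv_.notify_one();            // wake one waiting consumer
    }
    T pop() {                        // blocks until an item is available
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }
private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};

int main() {
    MessageQueue<int> bufferIds;     // e.g. indices of filled buffers
    std::thread producer([&] { for (int i = 0; i < 3; i++) bufferIds.push(i); });
    for (int i = 0; i < 3; i++)
        std::printf("got buffer %d\n", bufferIds.pop());
    producer.join();
}

The reader thread would push indices of filled buffers, the sorter would pop them, sort, and push to a second queue feeding the writer.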
Since you are measuring time using clock on a Linux machine, I expect the total CPU time to be (roughly) the same whether you run one thread or multiple threads.
Maybe you want to use time myprog instead? Or use gettimeofday to fetch the time, which will give you a time in seconds + microseconds (although the microseconds may not be "accurate" down to the last digit).
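For instance, a minimal wall-clock timing sketch with gettimeofday:

#include <cstdio>
#include <sys/time.h>

int main() {
    struct timeval start, end;
    gettimeofday(&start, nullptr);
    // ... the work being measured goes here ...
    gettimeofday(&end, nullptr);
    double ms = (end.tv_sec - start.tv_sec) * 1000.0
              + (end.tv_usec - start.tv_usec) / 1000.0;
    std::printf("wall time: %.3f ms\n", ms);
    return 0;
}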
Edit:
Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime flushes to the file, which is an operating system call. That flush is almost certainly protected against multiple threads, so you have three threads doing write-data, a single line at a time, synchronously. That is most likely going to take nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads - that's going to end badly one way or another.
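For example, in writeData the per-line flush goes away with a one-character change:

outfile << data[(run - 2) % 3][i] << '\n'; // '\n' appends a newline without flushing; endl flushes every line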
Please correct me if I am wrong, but it seems your threaded function is basically a linear function doing 3 times the work of your linear function?
In a threaded program you would create three threads and run the readData/quicksort functions once on each thread (distributing the workload), but in your program it seems like the thread simulation is actually just reading three times, quicksorting three times, and writing three times, totalling the time it takes to do all three of each.
I'm trying to run a short program that creates three threads within a for loop, each of them writing "inside" to the screen. This is happening with Cygwin running on both XP and Vista on different machines. This is the current code.
#include <iostream>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

using namespace std;

void* printInside(void* arg);

int main()
{
    pthread_t threads[3];
    for(int i = 0; i < 3; i++)
    {
        pthread_create(&threads[i], 0, printInside, 0);
    }
    return 0;
}

void* printInside(void* arg)
{
    cout << "inside";
    return 0;
}
And it doesn't work. If I add a cout to the inside of the for loop, it appears to slow it down into working.
for(int i = 0; i < 3; i++)
{
    cout << "";
    pthread_create(&threads[i], 0, printInside, 0);
}
Any suggestions as to why this is the case?
EDIT:
I've gotten responses to add a join after the loop
int main()
{
    pthread_t threads[3];
    for(int i = 0; i < 3; i++)
    {
        pthread_create(&threads[i], 0, printInside, 0);
    }
    for(int i = 0; i < 3; i++)
    {
        void* result;
        pthread_join(threads[i], &result);
    }
}

void* printInside(void* arg)
{
    cout << "inside";
    return 0;
}
But it still doesn't work, is the join done wrong?
FIXED
"Output is usually buffered by the standard library. It is flushed in certain circumstances but sometimes you have to do it manually. So even if the threads run and produce output you won't see it unless you flush it."
You need to join or the main thread will just exit:
for(int i = 0; i < 3; i++)
{
    pthread_create(&threads[i], 0, printInside, 0);
}
/* Join here. */
If I add a cout to the inside of the for loop, it appears to slow it
down into working.
Doing I/O is generally hard and slow. This gives the other threads enough CPU time to run.
Remember: when using multiple threads, if one of them calls exit, they all die.
EDIT
adding an endl to the end of "inside" does make it work better, but it
seems like a cop-out solution. Just wondering why it would be
necessary even with the join present.
Output is usually buffered by the standard library. It is flushed in certain circumstances but sometimes you have to do it manually. So even if the threads run and produce output you won't see it unless you flush it.
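For example, a minimal change to the thread function makes the output appear regardless:

void* printInside(void* arg)
{
    cout << "inside" << flush; // flush explicitly so the text shows even if the process exits right away
    return 0;
}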
You start all threads without any wait, and exit the main thread (thus the whole program) before they start executing.
Calling pthread_join before return will wait for the other threads to finish.
The cout helps, as it generates a context switch and a window for the other threads to run.