Multithreading using OpenMP in C++

What is the best solution if I want to parallelize only the loop and keep the saving to file sequential, using OpenMP? I have a file with a large volume of data that I would like to split into equal chunks (16 bytes each) and encrypt using OpenMP (multithreaded programming in C++). After the encryption is complete, these chunks should be stored in a single file, in the same sequence as the original.
i_count = meta->subchunk2_size - meta->subchunk2_size % 16; // to get the exact length mod 16
// Get the number of processors in this system
int iCPU = omp_get_num_procs();
// Now set the number of threads
omp_set_num_threads(iCPU);
#pragma omp parallel for ordered
for (i = 0; i < i_count; i += 16)
{
    fread(Store, sizeof(char), 16, Rfile); // read
    ENCRYPT();
    #pragma omp ordered
    fwrite(Store, sizeof(char), 16, Wfile); // write
}
The program is supposed to encrypt in parallel and only save to the file sequentially, but in practice the whole program runs sequentially.

You're much better off reading the whole file into a buffer in one thread, processing the buffer in parallel without using ordered, and then writing the buffer in one thread. Something like this:
fread(Store, sizeof(char), i_count, Rfile); // read
#pragma omp parallel for schedule(static)
for (i = 0; i < i_count; i += 16) {
    ENCRYPT(&Store[i]); // ENCRYPT acts on 16 bytes at a time
}
fwrite(Store, sizeof(char), i_count, Wfile); // write
If the file is too big to read in all at once, then do it in chunks in a loop, as in the sketch below. The main point is that the ENCRYPT function should be much slower than reading/writing the file; otherwise there is no point in using multiple threads, because you can't really speed up file I/O with them.
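A minimal sketch of that chunked variant, assuming a hypothetical encrypt_block() in place of ENCRYPT() and hypothetical file names:

#include <omp.h>
#include <cstdio>

// Hypothetical stand-in for ENCRYPT(): a 16-byte block transform.
void encrypt_block(unsigned char* block)
{
    for (int k = 0; k < 16; ++k)
        block[k] ^= 0x5A; // placeholder, not real cryptography
}

int main()
{
    const std::size_t CHUNK = 1 << 20; // 1 MiB per pass, a multiple of 16
    unsigned char* store = new unsigned char[CHUNK];
    FILE* rfile = fopen("input.bin", "rb");
    FILE* wfile = fopen("output.bin", "wb");

    std::size_t n;
    // sequential read, parallel encrypt, sequential write -- output keeps the
    // original order because each chunk is written before the next is read
    while ((n = fread(store, 1, CHUNK, rfile)) > 0) {
        long blocks = (long)(n - n % 16); // whole 16-byte blocks only,
                                          // like the mod-16 truncation above
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < blocks; i += 16)
            encrypt_block(&store[i]);
        fwrite(store, 1, n, wfile);
    }

    fclose(rfile);
    fclose(wfile);
    delete[] store;
}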

Related

How to ensure no buffer overruns occur, multithreading?

I am using ALSA to read in an audio stream. I have set my period size to 960 frames, which are received every 20 ms. I read in the PCM values using: snd_pcm_readi(handle, buffer, period_size);
Every time I fill this buffer, I need to iterate through it and perform multiple checks on the received values. Iterating through it with a simple for loop takes too long, and I get a buffer overrun error on subsequent calls to snd_pcm_readi(). I have been told not to increase the ALSA buffer_size to prevent this from happening. A friend suggested I create a separate thread to iterate through this buffer and perform the checks. How would this work, given that I don't know exactly how long it will take for snd_pcm_readi() to fill the buffer? Knowing when to lock the buffer is a bit confusing to me.
A useful way to multithread an application for signal processing and computation is to use OpenMP (OMP). It saves the developer from having to use locking mechanisms themselves to synchronise multiple computation threads, and locking is typically a bad thing in real-time application programming.
In this example, a for loop is multithreaded in the C language:
#include <omp.h>
#include <stdio.h>

int N = 960;
float audio[960]; // the signal to process

#pragma omp parallel for
for (int n = 0; n < N; n++) { // perform the function on each sample in parallel
    printf("n=%d thread num %d\n", n, omp_get_thread_num()); // remove this to speed up the code
    audio[n] *= 0.5f; // example operation; process your signal here
}
You can see a concrete example of this in the gtkIOStream FIR code base. The FIR filter's channels are processed independently, one per available thread.
To initialise the OMP subsystem, specify the number of threads you want to use. To use the maximum number of available threads:
int MProcessors = omp_get_max_threads();
omp_set_num_threads(MProcessors);
If you would prefer to look at an approach that uses locking techniques, then you could go for a concurrency pattern such as the one developed for the Nuclear Processing, as sketched below.
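For reference, a minimal C++11 sketch of such a locking pattern: double buffering with a mutex and condition variable. The buffer size and the check_samples() hook are assumptions standing in for the question's per-sample checks:

#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// hypothetical stand-in for the per-sample checks from the question
void check_samples(const std::vector<float>& buf)
{
    volatile float sum = 0;
    for (float v : buf) sum += v; // placeholder work
}

std::mutex m;
std::condition_variable cv;
std::vector<float> full_buffer;
bool ready = false, done = false;

void consumer()
{
    std::unique_lock<std::mutex> lock(m);
    while (!done || ready) {
        cv.wait(lock, [] { return ready || done; });
        if (!ready) continue;
        std::vector<float> local = std::move(full_buffer); // take ownership quickly
        ready = false;
        lock.unlock();
        check_samples(local); // heavy checks run while the reader keeps capturing
        lock.lock();
    }
}

void producer()
{
    for (int period = 0; period < 1000; ++period) {
        std::vector<float> buf(960);
        // snd_pcm_readi(handle, buf.data(), 960); // fill from ALSA here
        {
            std::lock_guard<std::mutex> lock(m);
            full_buffer = std::move(buf); // note: drops a period if the consumer lags
            ready = true;
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
}

int main()
{
    std::thread c(consumer);
    producer();
    c.join();
}

The producer only holds the lock long enough to swap the buffer in, so the capture thread is never blocked for the duration of the checks.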

Why does my C++ OpenMP program execute longer?

I have a problem understanding how this is possible. I have a long text file (ten thousand lines) that I read into the variable text as a string. I'd like to split it into 200 parts. I've written this code using OpenMP directives:
std::string str[200];
omp_set_num_threads(200);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 200; i++)
    {
        str[i] = text.substr(i * (text.length() / 200), text.length() / 200);
    }
}
and its execution time is 231059 us.
If I write it sequentially:
for (int i = 0; i < 200; i++)
{
    str[i] = text.substr(i * (text.length() / 200), text.length() / 200);
}
the execution time is 215902 us.
Am I using OpenMP wrong, or what is happening here?
substr causes a memory allocation and a memcpy, and not much else. So instead of 1 thread asking the OS for some RAM, you now have N threads asking the OS for some RAM at the same time. This isn't a great design.
Splitting a workload to be tackled by a thread group makes a lot of sense when the workload is CPU intensive. It makes no sense at all when all of those threads are competing for the same shared resource (e.g. the RAM). One thread will simply block all the others until each allocation has completed.
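One way to remove that contention entirely is to hand out views into the existing string instead of copies, so no thread allocates at all. A C++17 sketch using std::string_view (the 200-part split is taken from the question; the text contents are a stand-in):

#include <string>
#include <string_view>

int main()
{
    std::string text(2000000, 'x'); // stand-in for the file contents
    std::string_view parts[200];
    const std::size_t len = text.length() / 200;

    #pragma omp parallel for
    for (int i = 0; i < 200; i++)
        parts[i] = std::string_view(text.data() + i * len, len); // no allocation, no copy
}

With no allocation in the loop body there is essentially no work left to parallelize, which is the real point: the original loop was memory-bound, not CPU-bound.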

What is the best way to parallelise tasks sharing an object but otherwise independent?

I'm coding a physics simulation whose central loop consists of hundreds of billions of repetitions of operations on an array. These operations are independent from one another (well, actually the array changes along the way), so I'm thinking about parallelising my code, as I can run it on 4- or 8-core computers in my lab.
It's my first time doing anything like this, and I've been advised to look at OpenMP. I've started to write some toy programs with it, but I'm really unsure about how it works, and the documentation is quite cryptic to me. For example, the following code:
int a = 0;
#pragma omp parallel
{
    a++;
}
cout << a << endl;
launched on my computer (4-core CPU) sometimes gives me 4, other times 3 or 2. Is that because it doesn't wait for all the cores to execute the instructions? I definitely need to know how many iterations were done in my case. Should I look at something other than OpenMP, considering what I want in the end?
When writing concurrently to a shared variable (a in your code), you have a data race. To avoid different threads writing "simultaneously", you must either use an atomic update or protect the update with a mutex (mutual exclusion). In OpenMP, the latter is done via a critical region:
int a = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        a++;
    }
}
cout << a << endl;
(Of course, this particular program does nothing in parallel, so it will be slower than a serial one doing the same.)
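The atomic alternative mentioned above looks like this; an OpenMP reduction, also shown, is usually the most idiomatic way to count across threads:

#include <iostream>

int main()
{
    int a = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        a++; // hardware-assisted atomic update, cheaper than a critical region
    }

    int b = 0;
    #pragma omp parallel reduction(+ : b)
    {
        b++; // each thread increments a private copy; the copies are summed at the end
    }

    std::cout << a << " " << b << std::endl;
}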
For more info, read the OpenMP documentation! However, I would advise you not to use OpenMP but TBB if you're using C++. It's much more flexible.
What you are seeing is a typical example of a race condition. Four threads are trying to increment the variable a and are fighting over it. Some 'lose' and are not able to increment it, so you see a result lower than 4.
What happens is that the a++ statement is actually a set of three instructions: read a from memory into a register, increment the value in the register, then write the value back to memory. If thread 1 reads the value of a after thread 2 has read it but before thread 2 has written the new value back, thread 2's increment will be overwritten. Using #pragma omp critical is a way to ensure that the read/increment/write sequence is not interrupted by another thread.
If you need to parallelize iterations, you can use omp parallel for, for instance to increment all the elements in an array.
Typical use:
#pragma omp parallel for
for (i = 0; i < N; i++)
    a[i]++;

Why does OpenMP take so long?

I just spent quite some time trying to parallelise this loop with OpenMP, but with 2 threads it doubles the wall time! Am I missing something important?
The overall task is to read in a big file (~1 GB) in parallel: an ifstream is divided into several string buffers, and these are used to insert the data into Symbol structs. Up to here everything is fast. Giving the loop private variables str and locVec to operate on doesn't change anything either.
vector<string> strbuf;              // filled from ifstream
vector< vector<Symbol> > symVec(2); // to be filled (sized up front so symVec[i] is valid)
#pragma omp parallel for num_threads(2) default(none) shared(strbuf, symVec)
for (int i = 0; i < 2; i++)
{
    string str = strbuf[i];
    std::stringstream ss(str);
    // no problem until here
    // this is where it slows down:
    vector<Symbol> locVec;
    std::copy(std::istream_iterator<Symbol>(ss), std::istream_iterator<Symbol>(),
              std::back_inserter(locVec));
    symVec[i] = locVec;
}
EDIT:
Sorry for being imprecise, but the file content has already been read sequentially and divided into the strbufs at this point. The file is closed. Within the loop there is no file access.
It's much better to do sequential I/O on a file than I/O at different sections of a file, which essentially causes a lot of seeks on the underlying device (I'm assuming a disk here) and increases the number of system calls required to read the file into the buffers. You're better off using one thread to read the file in its totality sequentially (maybe mmap() with MAP_POPULATE) and assigning the processing to different threads.
Another option is to use calls such as aio_read() to handle reading different sections, if for some reason you really do not want to read the file all at once.
Without all the code I cannot be completely sure, but remember that simply opening a file does not guarantee its content is in memory; reading from the file will cause page faults that then cause the actual file contents to be read, so even if you're not explicitly reading from the file with read/write, the OS will take care of that for you.
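A minimal POSIX sketch of that suggestion, with a hypothetical process_chunk() standing in for the istream_iterator<Symbol> parsing and a hypothetical file name:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

void process_chunk(const char* data, std::size_t len)
{
    // stand-in for the istream_iterator<Symbol> parsing in the question
    volatile long sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += data[i];
}

int main()
{
    int fd = open("big.file", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // MAP_POPULATE pre-faults the pages, so the file is read sequentially up front
    const char* base = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0));

    const int nchunks = 2;
    const std::size_t chunk = st.st_size / nchunks; // remainder ignored for brevity
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < nchunks; i++)
        process_chunk(base + i * chunk, chunk); // parse in parallel, no further I/O

    munmap(const_cast<char*>(base), st.st_size);
    close(fd);
}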

Provide a thread-private preallocated buffer to a parallelized for() loop?

My program contains a for() loop that processes some raw image data, line by line, which I want to parallelize using OpenMP like this:
...
#if defined(_OPENMP)
int const threads = 8;
omp_set_num_threads( threads );
omp_set_dynamic( threads );
#endif
int line = 0;
#pragma omp parallel private( line )
{
    // tell the compiler to parallelize the next for() loop using static
    // scheduling (i.e. balance workload evenly among threads),
    // while letting each thread process exactly one line in a single run
    #pragma omp for schedule( static, 1 )
    for( line = 0; line < max; ++line ) {
        // some processing-heavy code in need of a buffer
    }
} // end of parallel section
....
The question is this:
Is it possible to provide an individual (preallocated) buffer (pointer) to each thread of the team executing my loop, using a standard OpenMP pragma/function (thus eliminating the need to allocate a fresh buffer on each iteration)?
Thanks in advance.
Bjoern
I may be understanding you wrong, but I think this should do it:
#pragma omp parallel
{
    unsigned char buffer[1024]; // private to each thread
    // while letting each thread process exactly one line in a single run
    #pragma omp for // ... etc
    for (int line = 0; line < max; ++line) {
        //...
    }
}
If you really meant that you want to share the same buffer across different parallel blocks, you'll have to resort to thread-local storage. (Boost as well as C++11 have facilities for making that easier to do, and more portably, than directly using TlsAlloc and friends.)
Note that this approach puts some of the thread-safety checking burden back on the programmer, because it is perfectly possible to have different omp parallel sections running at the same time, especially when they are nested.
Consider that parallel blocks can be nested at runtime, even though they are not lexically nested. In practice that is usually not good style and often results in poor performance, but it is something you need to be aware of when doing this.
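With C++11 the thread-local route can be as simple as marking the buffer thread_local. A minimal sketch, where the buffer size and the process_line() hook are assumptions:

void process_line(unsigned char* buf, int line)
{
    buf[0] = (unsigned char)line; // stand-in for the real per-line work
}

int main()
{
    const int max = 1000;
    #pragma omp parallel for schedule(static, 1)
    for (int line = 0; line < max; ++line) {
        // one instance per thread, allocated once and reused across iterations
        thread_local unsigned char buffer[1024];
        process_line(buffer, line);
    }
}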
There is threadprivate: http://msdn.microsoft.com/en-us/library/2z1788dd
static int buffer[BUFSIZE];
#pragma omp threadprivate(buffer)
This pragma works on a global/static variable, so you don't need to worry about stack overflow. (And if you do hit a stack overflow, it's not a bad idea at all to increase the stack size by tweaking the linker options.)
Note that compilers differ in their implementation of threadprivate. For example, the VS 2010 compiler can't make a variable threadprivate if it has a constructor, whereas the Intel C/C++ compiler handles this case fine.
Using separate omp parallel and omp for directives is also a good idea, as sehe showed. However, using threadprivate allows you to use omp parallel for directly.
FYI: even if you do need to allocate your own thread-local storage, in many cases you don't actually need an OS-specific call such as TlsAlloc. You can simply allocate an array of N data structures and index it with omp_get_thread_num, which gives a thread ID from 0 to N-1. You must, however, consider false sharing: insert padding so that each data structure is aligned to a different cache line (most modern CPU caches have 64-byte lines).
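A minimal sketch of that array-of-padded-structures approach (the buffer size is an assumption; alignas(64) provides the cache-line padding, and C++17 guarantees the aligned allocation inside std::vector):

#include <omp.h>
#include <vector>

// pad each per-thread buffer to its own cache line(s) to avoid false sharing;
// 64 bytes is the line size on most current CPUs
struct alignas(64) LineBuffer {
    unsigned char data[1024];
};

int main()
{
    const int nthreads = omp_get_max_threads();
    std::vector<LineBuffer> buffers(nthreads); // one preallocated buffer per thread

    const int max = 1000;
    #pragma omp parallel for schedule(static, 1)
    for (int line = 0; line < max; ++line) {
        LineBuffer& buf = buffers[omp_get_thread_num()]; // thread ID indexes the array
        buf.data[0] = (unsigned char)line; // stand-in for the real processing
    }
}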