Sequential program that accesses the whole of the array - c++

I'm working on threads; however, before I use threads I am to write two programs.
Set up an array and write a sequential program that accesses the whole of the array and performs some simple task on the contents.
Modify the program so that it is still sequential but accesses the array by a series of calls to a function. Each call to that function will process a number of rows of the array as defined by a parameter passed to the function.
I'm having problems understanding the questions; they seem so simple, yet I can't get my head around them. I am to write the programs based on the above two questions before I start creating a program that allows the processing to be carried out in one or more threads. Each thread should access a different set of rows of the array.
For the first question, the code I have written so far is
#include <iostream>
#include <stdio.h>
int main()
{
    int array [20][20];
    int i, j;
    /* output each array element's value */
    for ( i = 0; i < 20; i++ )
    {
        for ( j = 0; j < 20; j++ )
        {
            printf("a[%d][%d] = %d\n", i, j, array[i][j] );
        }
    }
    system ("PAUSE");
    return 0;
}
I want to know if the above program is a sequential program. I have run the program and it accesses the whole array and performs one task, which is to print out all the data in the array.
I researched online what is meant by a sequential program and found the following: perform task a before task b, but not at the same time. Is this right?
For the second part I have done the following:
#include <iostream>
#include <stdio.h>
void print_array(int array[20][20]);
int main()
{
    int array [20][20];
    int i, j;
    print_array(array);
    system ("PAUSE");
    return 0;
}
// Output data in an array
void print_array(int array)
{
    int i, j;
    for ( i = 0; i < 20; i++ )
    {
        for ( j = 0; j < 20; j++ )
        {
            printf("a[%d][%d] = %d\n", i, j, array[i][j] );
        }
    }
}
Am I going in the right direction? I also have to write a version of the program that allows the processing to be carried out in one or more threads.
EDIT: I am to use 2D arrays; sorry that wasn't clear above.

I don't think you're going in the right direction, but you're not far off. What the instructions are asking for are some of the preliminary steps needed to take the work of processing an array sequentially and make it run in parallel. When writing a parallel program, it is often useful to start with a working sequential program and slowly transform it into a parallel program. Following the instructions is a way to do this.
Let's consider the parts of the question separately:
Set up an array and write a sequential program that accesses the whole of the array and performs some simple task on the contents.
The simple task that you chose for your array is to print the contents, but this isn't a suitable task, because it has no functional result. A more suitable task would be to sum the elements of the array. Other tasks might be to count the elements that meet some condition, or to multiply each element by two.
So, first try to modify your initial program to sum the elements instead of printing them.
(In your code you are using a two-dimensional array. I would suggest using a 1-dimensional array for simplicity.)
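For instance, a minimal sketch of part 1 with a 1-dimensional array (the fill values are arbitrary, chosen only so that the result is predictable):

#include <cstdio>

int main()
{
    int array[100];

    // Set up the array with something deterministic.
    for (int i = 0; i < 100; i++)
        array[i] = i;

    // The "simple task": sum every element, one after the other.
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += array[i];

    printf("sum = %d\n", sum);   // 0 + 1 + ... + 99 = 4950
    return 0;
}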
Modify the program so that it is still sequential but accesses the array by a series of calls to a function. Each call to that function will process a number of rows of the array as defined by a parameter passed to the function.
In this part what you are trying to do is break up the functionality into small pieces of work. (Eventually you will send these units of work to threads for processing, but you are just doing the preliminary steps now.) If we did a sum in part 1, then here you might write a function such as int sumKitems(int *array, int startIndex, int numItems). The main program would then call this on each set of (say) 10 items in the array, and combine them into the full result by summing the values returned from each sumKitems call.
So, if there are 100 items in the array, you could make 10 calls to sumKitems(...), telling the function to process 0...9, 10...19, ..., 90...99. This would be in place of doing the sum on all 100 items individually.
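A rough sketch of that structure, reusing the sumKitems name from above (the chunk size of 10 and the fill values are arbitrary):

#include <cstdio>

// Sum numItems elements of array, starting at startIndex.
int sumKitems(int *array, int startIndex, int numItems)
{
    int sum = 0;
    for (int i = startIndex; i < startIndex + numItems; i++)
        sum += array[i];
    return sum;
}

int main()
{
    int array[100];
    for (int i = 0; i < 100; i++)
        array[i] = i;

    // Same result as part 1, but computed as 10 independent chunks of 10 items.
    int total = 0;
    for (int chunk = 0; chunk < 10; chunk++)
        total += sumKitems(array, chunk * 10, 10);

    printf("total = %d\n", total);   // still 4950
    return 0;
}

Each call to sumKitems is an independent unit of work, which is exactly what a thread can later be given.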
--
To summarize, part one would be a simple for loop, not too different from what you've written, just with some computation being performed.
The second part should do exactly the same computation and return exactly the same result, just using a function call which handles k items at a time. (If you pass the number of items to handle at a time as a parameter, you will be able to balance the cost of communication vs. work done when moving to a threaded implementation.)
In the end, you will probably be asked to replace the call to sumKitems(...) with a queue that sends work to the threads to do independently.

I believe that if you are not creating separate threads in any way then you are in fact writing a sequential program. There is no part of your code where you jump into a new thread to do some operation while the main thread does something else.
Summary: your code runs on a single thread - it is sequential
You must pass the array not as a plain int but as a 2-D array parameter -> int array[][20] or int (*array)[20]
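For example, a sketch of the question's second program with only the parameter type fixed to match the forward declaration:

// The parameter must carry the inner dimension so that array[i][j] indexes correctly.
void print_array(int array[20][20])
{
    for (int i = 0; i < 20; i++)
    {
        for (int j = 0; j < 20; j++)
        {
            printf("a[%d][%d] = %d\n", i, j, array[i][j]);
        }
    }
}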

To perform operations on the array sequentially is to start at the first place in the array and increment through it, performing operations as you go:
array[10] = {0,1,2,3,4,5,6,7,8,9}
iterate through array with some action such as adding +5
array [10] = {5,6....}
To make this multi-threaded you need to have different threads operate on different segments of the array, such as places 0-4 and 5-9, and perform the action so it can be done in less time. If you're doing it this way you will not need to worry about mutexes.
So thread one increments through the first half of array[10], places {0,1,2,3,4}
Thread two increments through the second half, places {5,6,7,8,9}
Each increments one place at a time, and both threads run concurrently.
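A minimal sketch of that split using std::thread, assuming the 10-element array and the +5 action from the example above:

#include <iostream>
#include <thread>

// Add 5 to count elements of array, starting at start.
void add_five(int *array, int start, int count)
{
    for (int i = start; i < start + count; i++)
        array[i] += 5;
}

int main()
{
    int array[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

    std::thread t1(add_five, array, 0, 5);   // thread one: places 0-4
    std::thread t2(add_five, array, 5, 5);   // thread two: places 5-9
    t1.join();
    t2.join();

    for (int v : array)
        std::cout << v << ' ';               // prints 5 6 7 ... 14
    std::cout << '\n';
    return 0;
}

Because the two halves never overlap, no mutex is needed.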

Related

Increasing array index in openMP

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...
#pragma omp parallel for
for (j = 0; j < m; j++) {
    // some work
    for (i = 0; i < n; i++) {
        p = b[i];
        if (p < 0 && k < m) {
            a[k] = c[i];
            k++;
        } else {
            x = c[i];
        }
    }
    // some work
}
The outer loop is in parallel, and the inner loop updates k. The current value of k is needed for the other threads to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?
For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when run in parallel it produces different results, or results in a different (and therefore wrong) order.
How do I keep the same order and ensure parallelism at the same time?
Note: this answer was completely rewritten in light of OP clarifications.
How do I keep the same order and ensure parallelism at the same time?
Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.
Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.
With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.
It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.
> Instead of giving a[n]={0,1,2,3,.......n} , it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.
There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per-thread (not per-iteration) final values of those independent variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[i] = i;
    k += i;
}
That assumes that the value of k is 0 immediately prior to the loop, as you indeed seem to be doing. If that were not a safe assumption then you would need something like
type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[k0 + i] = i;
    k += k0 + i;
}
Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.
It sounds like you're essentially filling a with a filtered subset of the entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:
Always write a[i], but use a marker for the positions where the P predicate wasn't satisfied. This preserves order, but requires a larger a that you compact in a second pass.
Write an a_i array storing which index each entry belonged to. This still requires an atomic capture of k (#pragma omp atomic capture around k_local = k++;), and a second sort to restore order. And you'd need both a and a_i to be the full size again, or you might miss entries, so all in all a terrible workaround.
Even with some sequential dependencies you can do optimizations: a scan to calculate what k would be for each i can be done in O(log n) parallel depth rather than O(n) sequential steps; see parallel prefix sum and the related OpenMP discussion on Stack Overflow. This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third solution:
Generate a k array, holding the value k will have for each iteration, so that the threads that do write put their results in the correct places. This requires scanning the predicate.
It is useful to have higher level constructs like map, scan and reduce when planning out algorithms.
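A sketch of that third option, assuming the filter is "keep c[i] where b[i] < 0" as in the question; the scan is written serially here for brevity, and it is exactly the part that a parallel prefix sum would replace:

#include <vector>

// Fills a[0..k-1] with the kept entries of c, preserving their order.
void filter_in_order(const int *b, const int *c, int n, int *a, int &k)
{
    std::vector<int> keep(n), pos(n);

    // Pass 1: evaluate the predicate independently for every i.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        keep[i] = (b[i] < 0) ? 1 : 0;

    // Pass 2: exclusive prefix sum gives each kept element its output index.
    int total = 0;
    for (int i = 0; i < n; i++) {
        pos[i] = total;
        total += keep[i];
    }

    // Pass 3: scatter in parallel; distinct i write to distinct a[pos[i]].
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (keep[i])
            a[pos[i]] = c[i];

    k = total;
}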

Multithreading is slower than no threading C++

I am new to multi-thread programming, and I am aware several similar questions have been asked on SO before; however, I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors to main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
In the non-multithreaded case the total time spent on the operation is around 4x shorter than in the multithreaded case (~1.5 s vs ~6 s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 to 100,000s of iterations (that outer loop does not use multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process; thread creation and synchronization are not free.
As is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for the merging of the local vectors into validobjs at the end (inside the critical section): first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
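For instance, a sketch of where those reserve() calls could go in the threaded version from the question (the sizes are upper bounds, since not every element passes the filter and each thread only handles a share of each loop):

std::vector<hobj> validobjs;
validobjs.reserve(this->v1.size() + this->v2.size());   // upper bound on the final size
#pragma omp parallel
{
    std::vector<hobj> threaded1;
    threaded1.reserve(this->v1.size());                  // safe over-estimate of this thread's share
    // ... filter v1 into threaded1 as before ...
    std::vector<hobj> threaded2;
    threaded2.reserve(this->v2.size());
    // ... filter v2 into threaded2 as before ...
    #pragma omp critical
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}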
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and on whether there is a custom copy constructor that executes additional code. Consider storing only indices of the valid elements, or pointers to them.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into an independent part of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure whether this is entirely legal or undefined behaviour (it is in fact allowed: different threads may modify distinct elements of the same std::vector).
You could also think about using std::list (linked list). Concatenating linked lists, or removing elements happens in constant time, however adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in it.
IMHO,
You copy each element twice: into threaded1/2 and after that into validobjs.
It can make your code slower.
You could instead add elements into a single vector by using synchronization.
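A rough sketch of that single-vector variant, reusing the names from the question (whether it beats the two-copy version depends on how often the filter matches, since every matching element now takes the lock):

std::vector<hobj> validobjs;
validobjs.reserve(this->v1.size() + this->v2.size());   // avoid reallocating while a thread holds the lock
#pragma omp parallel for firstprivate(length)
for (auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if ( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        #pragma omp critical                             // only one thread may push at a time
        validobjs.push_back(obj);
    }
}
// ... the same loop again for v2 ...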

Why are threads slowing down the execution of my program?

I have some code that can make use of parallelism for an efficiency gain. Since my PC has a dual-core processor I tried running the code on two threads. So I wrote the code below (this is a very simplified version of it):
void Evaluator::evaluate(vector<inpType> input, bool parallel) {
    std::thread t0;
    if(parallel) {
        // Evaluate the first half of the input in a spawned thread
        t0 = std::thread(&Evaluator::evaluatePart, this, std::ref(input), 0, input.size()/2 + input.size()%2);
        // Evaluate the other half of the input in this thread
        evaluatePart(input, input.size()/2 + input.size()%2, input.size()/2);
    } else {
        // sequentially evaluate all of the input
        evaluatePart(input, 0, input.size());
    }
    // some other code
    // after finishing everything join the thread
    if(parallel) t0.join();
}

void Evaluator::evaluatePart(vector<inpType> &input, int start, int count) {
    for(int i = start; i < start + count; i++) {
        evaluateSingle(input[i]);
    }
}

void Evaluator::evaluateSingle(inpType &input) {
    // do stuff with input
    // note I use a vector<int> belonging to the Evaluator object in here, not sure it matters though
}
Running sequentially takes around 3 ms but running in parallel is taking around 6 ms. Does that mean spawning a thread takes so much time that it is more efficient to just evaluate sequentially? Or am I doing something wrong?
Note that I don't make use of any locking mechanisms because the evaluations are independent of each other. Every evaluateSingle call reads from a vector that is a member of the Evaluator object but only alters the single input that was given to it, hence there is no need for any locking.
Update
I apologize that I didn't make this clear. This is more of a pseudo-code sketch describing in the abstract what my code looks like. It will not work or compile, but mine does, so that is not the issue. Anyway, I fixed the t0 scope issue in this code.
Also, the input size is around 38,000, which I think is sufficient to make use of parallelism.
Update
I tried increasing the size of input to 5,000,000 but that didn't help. Sequential is still faster than multi-threaded.
Update
I tried increasing the number of threads running while splitting the vector evenly between them for evaluation, and got some interesting results:
Note that I have an i7-7500U CPU that can run 4 threads in parallel. This leaves me with two questions:
Why does creating 4 or more threads start to show a performance improvement compared with 2 or 3?
Why is creating more than 4 threads more efficient than just 4 threads (the maximum the CPU can run concurrently)?

Alternative to using mpfr arrays

I am trying to write a function in C++ using MPFR to calculate multiple values. I am currently using an mpfr array to store those values. It is unknown how many values need to be calculated and stored each time. Here is the function:
void Calculator(mpfr_t x, int v, mpfr_t *Values, int numOfTerms, int mpfr_bits) {
    for (int i = 0; i < numOfTerms; i++) {
        mpfr_init2(Values[i], mpfr_bits);
        mpfr_set(Values[i], x, GMP_RNDN);
        mpfr_div_si(Values[i], Values[i], pow(-1,i+1)*(i+1)*pow(v,i+1), GMP_RNDN);
    }
}
The program itself has a while loop that has a nested for loop that takes these values and does calculations with them. In this way, I don't have to recalculate these values each time within the for loop. When the for loop is finished, I clear the memory with
delete[] Values;
before the while loop starts again, at which point it redeclares the array with
mpfr_t *Values;
Values = new mpfr_t[numOfTerms];
The number of values that need to be stored is calculated by a different function and passed to Calculator through the variable numOfTerms. The problem is that, for some reason, the array slows down the program tremendously. I am working with very large numbers, so the thought was that recalculating those values each time would be extremely expensive, yet this method is significantly slower than just recalculating the values in each iteration of the for loop. Is there an alternative method to this?
EDIT: Instead of redeclaring the array each time, I moved the declaration and the delete[] Values outside of the while loop. Now I am just clearing each element of the array with
for (int i = 0; i < numOfTerms; i++) {
    mpfr_clear(Values[i]);
}
inside the while loop before it starts over. The program has gotten noticeably faster but is still much slower than just recalculating each value.
If I understand correctly, you are doing inside a while loop: mpfr_init2 (at the beginning of the iteration) and mpfr_clear (at the end of the iteration) on numOfTerms MPFR numbers, and the value of numOfTerms depends on the iteration. And this is what takes most of the time.
To avoid these many memory allocations by mpfr_init2 and deallocations by mpfr_clear, I suggest that you declare the array outside the while loop and initially call the mpfr_init2 outside the while loop. The length of the array (i.e. the number of terms) should be what you think is the maximum number of terms. What can happen is that for some iterations, the chosen number of terms was too small. In such a case, you need to increase the length of the array (this will need a reallocation) and call mpfr_init2 on the new elements. This will be the new length of the array for the remaining iterations, until the array needs to be enlarged again. After the while loop, do the mpfr_clear's.
When you need to enlarge the array, have a good strategy for choosing the new number of elements. Just taking the value of numOfTerms needed for the current iteration may not be a good choice, since it may yield many reallocations. For instance, make sure that you have at least an N% increase; do some tests to choose the best value for N. See the Dynamic array Wikipedia article, for instance. In particular, you may want to use the C++ implementation of dynamic arrays mentioned there.
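A sketch of that strategy (the buffer type and its member names are illustrative; Calculator would then only mpfr_set each element instead of calling mpfr_init2 again, since the slots stay initialized across iterations of the while loop):

#include <mpfr.h>
#include <algorithm>

struct MpfrBuffer {
    mpfr_t *values = nullptr;
    int capacity = 0;

    // Guarantee at least numOfTerms initialized slots, growing by at least 50%
    // so that enlargements stay rare.
    void ensure(int numOfTerms, int mpfr_bits) {
        if (numOfTerms <= capacity)
            return;
        int newCap = std::max(numOfTerms, capacity + capacity / 2 + 1);
        mpfr_t *bigger = new mpfr_t[newCap];
        for (int i = 0; i < newCap; i++)
            mpfr_init2(bigger[i], mpfr_bits);
        release();                          // the old values are recomputed anyway
        values = bigger;
        capacity = newCap;
    }

    void release() {
        for (int i = 0; i < capacity; i++)
            mpfr_clear(values[i]);
        delete[] values;
        values = nullptr;
        capacity = 0;
    }

    ~MpfrBuffer() { release(); }
};

Inside the while loop you would call ensure(numOfTerms, mpfr_bits) before Calculator and do nothing at the end of the iteration; all the mpfr_clear calls then happen only when the buffer is released.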

Pthreads and mutexes; locking part of an array

I am trying to parallelize an operation using pthreads. The process looks something like:
double* doSomething( .... ) {
    double* foo;
    foo = new double[220];
    for(int i = 0; i < 20; i++)
    {
        // do something with the elements in foo located between 10*i and 10*(i+2)
    }
    return foo;
}
The stuff happening inside the for-loop can be done in any order, so I want to organize this using threads.
For instance, I could use a number of threads such that each thread goes through part of the for-loop, but works on different parts of the array. To avoid trouble when working on overlapping parts, I need to lock some memory.
How can I make a mutex (or something else) that locks only part of the array?
If you are using latest gcc you can try parallel versions of standard algorithms. See the libstdc++ parallel mode.
If you just want to make sure that each section of the array is worked on exactly once...
Make a global variable:
int _iNextSection;
Whenever a thread gets ready to operate on a section, the thread gets the next available section this way:
iMySection = __sync_fetch_and_add(&_iNextSection, 1);
__sync_fetch_and_add() returns the current value of _iNextSection and then increments _iNextSection. The operation is atomic, which means it is guaranteed to complete before another thread can perform it. No locking, no blocking; simple and fast.
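A sketch of a worker thread built around that counter (NUM_SECTIONS and processSection are assumed names, not from the question):

// The shared counter from above; start it at 0 before the threads are created.
int _iNextSection = 0;

const int NUM_SECTIONS = 20;                     // assumed: total number of sections
void processSection(double *foo, int iSection);  // assumed: does the work for one section

void *worker(void *arg)
{
    double *foo = static_cast<double *>(arg);
    for (;;) {
        // Atomically claim the next unworked section.
        int iMySection = __sync_fetch_and_add(&_iNextSection, 1);
        if (iMySection >= NUM_SECTIONS)
            break;                               // nothing left to claim
        processSection(foo, iMySection);
    }
    return nullptr;
}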
If the loop looks exactly like you wrote it, I would use an array of 21 mutexes and have each thread block on the ith and (i + 1)th mutexes at the beginning of each loop iteration.
So something like:
...
for (i = 0; i < 20; i++) {
    mutex[i].lock();
    mutex[i+1].lock();
    ...
    mutex[i+1].unlock();
    mutex[i].unlock();
}
The logic is that only two neighboring loop iterations can access the same data (if the limits are [i * 10, (i + 2) * 10)), so those are the only overlaps you need to worry about.
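Since the question is about pthreads, a matching sketch with pthread_mutex_t might look like this (the sizes follow the example above; the per-chunk work is elided):

#include <pthread.h>

// One mutex per 10-element boundary: indices 0..20 cover all 21 boundaries.
static pthread_mutex_t mutex[21];

void init_locks()
{
    for (int i = 0; i < 21; i++)
        pthread_mutex_init(&mutex[i], nullptr);
}

void process_chunks(double *foo)
{
    for (int i = 0; i < 20; i++) {
        // Always lock the lower index first, so neighbouring threads cannot deadlock.
        pthread_mutex_lock(&mutex[i]);
        pthread_mutex_lock(&mutex[i + 1]);
        // ... do something with foo[10*i] .. foo[10*(i+2) - 1] ...
        pthread_mutex_unlock(&mutex[i + 1]);
        pthread_mutex_unlock(&mutex[i]);
    }
}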