In my previous question,
"Shared vectors in OpenMP",
it was stated that different threads can read and write a shared vector as long as
they access different elements of the vector.
What if different threads have to read all the elements of a vector (so sometimes the same ones), as in the following case?
#include <cmath>
#include <vector>
using namespace std;
int main(){
vector<double> numbers;
vector<double> results(10);
double x;
//write 25 values in vector numbers
for (int i =0; i<25; i++){
numbers.push_back(cos(i));
}
#pragma omp parallel for default(none) \
shared(numbers, results) \
private(x)
for (int j = 0; j < 10; j++){
for(int k = 0; k < 25; k++){
x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
}
results[j] = x;
}
return 0;
}
Will this parallelization be slow because only one thread at a time can read any element of the vector or is this not the case? Could I resolve the problem with the clause firstprivate(numbers)?
Would it make sense to create an array of vectors so that every thread gets its own vector?
For instance:
vector<double> numbersx[/* number of threads */];
Reading elements of the same vector from multiple threads is not a problem. There is no synchronization in your code, so they will be accessed concurrently.
With the size of vectors that you are working with, you will not have any cache problems either, although for bigger vectors you may get some slow-downs due to the cache access pattern. In that case, separate copies of the numbers data would improve performance.
A better approach:
#include <cmath>
#include <vector>
using namespace std;
int main(){
vector<double> numbers;
vector<double> results(10);
//write 25 values in vector numbers
for (int i =0; i<25; i++){
numbers.push_back(cos(i));
}
#pragma omp parallel for
for (int j = 0; j < 10; j++){
double x = 0; // make x local var
for(int k = 0; k < 25; k++){
x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
}
results[j] = x; // no race here
}
return 0;
}
It will still be somewhat slow, because there isn't much work to share among the threads.
Related
I am writing a simple N-body program in C++ and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
S[i] += K*Weight[j];
S[j] -= K*Weight[i];
}
}
The issue is that I do not obtain exactly the same result when I remove the #pragma ... line. I am guessing it has to do with the fact that the second loop depends on the integer i, but I don't see how to get past that issue.
The problem is a data race when updating S[i] and S[j]. Different threads may read from and write to the same element of the array at the same time. Each update should therefore be an atomic operation (add #pragma omp atomic) to avoid the data race and to ensure memory consistency:
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
#pragma omp atomic
S[i] += K*Weight[j];
#pragma omp atomic
S[j] -= K*Weight[i];
}
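An alternative that avoids the per-update atomics, shown here as a sketch rather than a drop-in replacement: give each thread its own private copy of S and let OpenMP sum the copies at the end with an array-section reduction (available since OpenMP 4.5). The function names accumulate_serial and accumulate_reduction and the use of std::vector are assumptions for illustration; the loop body is the one from the question.

```cpp
#include <cassert>
#include <vector>

// Serial reference: the loop from the question without any pragma.
std::vector<double> accumulate_serial(const std::vector<double>& coordinate,
                                      const std::vector<double>& weight) {
    assert(coordinate.size() == weight.size());
    const int n = static_cast<int>(coordinate.size());
    std::vector<double> s(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            const double k = coordinate[i] - coordinate[j];
            s[i] += k * weight[j];
            s[j] -= k * weight[i];
        }
    }
    return s;
}

// Parallel variant (hypothetical name) using an array-section reduction.
std::vector<double> accumulate_reduction(const std::vector<double>& coordinate,
                                         const std::vector<double>& weight) {
    assert(coordinate.size() == weight.size());
    const int n = static_cast<int>(coordinate.size());
    std::vector<double> result(n, 0.0);
    double* s = result.data();  // array sections need a pointer or an array

    // Each thread works on a zero-initialised private copy of s[0..n).
    // The copies are added into the original after the loop, so the
    // updates inside the loop need no atomics.
    // schedule(dynamic) balances the triangular inner loop.
    #pragma omp parallel for reduction(+ : s[:n]) schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            const double k = coordinate[i] - coordinate[j];
            s[i] += k * weight[j];
            s[j] -= k * weight[i];
        }
    }
    return result;
}
```

With either approach, the floating-point results can still differ from the serial run in the last digits, because the additions happen in a different order. The private copies cost extra memory per thread, so for very large N the atomic version may be the more practical one.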
I have the following piece of C++ code. The scale of the problem is given by N and M. Running the code takes about two minutes on my machine (after compiling with g++ -O3). Is there any way to accelerate it further on the same machine? Any kind of option is on the table: a better data structure, a library, the GPU, parallelism, etc.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
Additional info: the second loop above, for (int j=1; j < M; j++), is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
std::iota(begin(data), end(data), 0);
std::vector<std::vector<int>> res(M, data);
}
Alternatively, you could initialize just one vector with those elements, and then create the other vectors by copying that block of memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also, if you're sure that all the inner vectors are identical, you could use a hack: create a single vector with the shared contents, and have every entry of res refer to that same vector instead of holding its own copy.
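One way to read the "same reference" hack is sketched below: every row is a std::shared_ptr to one immutable vector, so the data exists only once. The alias SharedRow and the function make_shared_rows are hypothetical names, and the approach only applies if the rows are never modified independently.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// A row is a shared, read-only view of one underlying vector.
using SharedRow = std::shared_ptr<const std::vector<int>>;

// Builds m rows that all refer to a single vector holding 0, 1, ..., n-1.
std::vector<SharedRow> make_shared_rows(std::size_t m, int n) {
    assert(n >= 0);
    std::vector<int> data(static_cast<std::size_t>(n));
    std::iota(data.begin(), data.end(), 0);

    // The contents are moved into one heap object; no per-row copy is made.
    const SharedRow row =
        std::make_shared<const std::vector<int>>(std::move(data));

    // Copying the shared_ptr m times only bumps a reference count.
    return std::vector<SharedRow>(m, row);
}
```

Element (j, i) is then read as (*rows[j])[i]. The trade-off is that a write to one row would be visible in all of them, which is why the rows are const here.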
On my machine, which has enough memory to avoid swapping, your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds. This is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache locality by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds. There probably isn't much improvement because you have 3,000 vectors of 4 MB each; the difference would likely be larger if N was smaller or M was larger.
Would false sharing happen in the following program?
Memory
1 array divided into 4 equal regions: [A1, A2, B1, B2]
The whole array can fit into L1 cache in the actual program.
Each region is padded to be a multiple of 64 bytes.
Steps
1. thread 1 writes to regions A1 and A2 while thread 2 writes to regions B1 and B2.
2. barrier
3. thread 1 reads B1 and writes to A1 while thread 2 reads B2 and writes to A2.
4. barrier
5. Go to step 1.
Test
#include <vector>
#include <iostream>
#include <cstdint>
int main() {
int N = 64;
std::vector<std::int32_t> x(N, 0);
#pragma omp parallel
{
for (int i = 0; i < 1000; ++i) {
#pragma omp for
for (int j = 0; j < 2; ++j) {
for (int k = 0; k < (N / 2); ++k) {
x[j*N/2 + k] += 1;
}
}
#pragma omp for
for (int j = 0; j < 2; ++j) {
for (int k = 0; k < (N/4); ++k) {
x[j*N/4 + k] += x[N/2 + j*N/4 + k] - 1;
}
}
}
}
for (auto i : x ) std::cout << i << " ";
std::cout << "\n";
}
Result
32 elements of 500500 (1000 * 1001 / 2)
32 elements of 1000
There is some false sharing in your code, since x is not guaranteed to be aligned to a cache line, so padding is not necessarily enough. In your example N is really small, which may be a problem. Note that at this N the biggest overhead is probably worksharing and thread management. If N is sufficiently large, i.e. array-size / number-of-threads >> cache-line-size, false sharing is not a relevant problem.
Alternating writes to A2 from different threads in your code is also not optimal in terms of cache usage, but that is not a false sharing issue.
Note that you do not need to split the loops. If you index into memory contiguously in a loop, one loop is just fine, e.g.
#pragma omp for
for (int j = 0; j < N; ++j)
x[j] += 1;
If you want to be really careful, you may add schedule(static); then you are guaranteed an even, contiguous distribution of the elements across the threads.
Remember that false sharing is a performance issue, not a correctness problem, and only relevant if it occurs frequently. Typical bad patterns are writes to vector[my_thread_index].
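As a sketch of the vector[my_thread_index] pattern mentioned above and one common way to defuse it: pad each per-thread slot to its own cache line. The 64-byte line size, the PaddedCounter struct, and the count_even function are assumptions for illustration. In this particular case a reduction clause would be the simpler choice; the padded array is shown only to make the memory layout explicit.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// One counter per thread, sized to a full cache line so that two
// counters never sit in the same line.
// 64 bytes is an assumed line size; it is not queried from the hardware.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Counts the even entries of data using one padded slot per thread.
long count_even(const std::vector<int>& data) {
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(max_threads));
    const int n = static_cast<int>(data.size());

    #pragma omp parallel
    {
        int tid = 0;  // stays 0 when compiled without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i) {
            if (data[i] % 2 == 0) {
                counters[tid].value += 1;  // frequent write to a per-thread slot
            }
        }
    }

    // Serial combine after the parallel region has finished.
    long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

With a plain std::vector<long> in place of the padded struct, neighbouring slots would share a cache line and every increment could invalidate that line for the other threads. The result would still be correct, only slower.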
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
Here x = 512, z = 512, nind = 0 initially,
and XY is declared as XY[2*x*y].
I want to parallelize these for loops with OpenMP, but the variable nind is tightly bound to the serial order of the loop. I have no clue how to proceed, because I am also checking a condition: sometimes the if body is skipped and nind is not incremented, and sometimes it is entered and nind is incremented. With OpenMP, whichever thread arrives first will increment nind first. Is there any way to unbind it? (By "bound" I mean that it can only be implemented serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
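int compar(const void *a, const void *b)
{
// qsort passes untyped pointers; each element is one (i, j) pair
const uint *p1 = (const uint *)a;
const uint *p2 = (const uint *)b;
if (p1[0] < p2[0])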
return -1;
else if (p1[0] > p2[0])
return 1;
else
{
if (p1[1] < p2[1])
return -1;
else if (p1[1] > p2[1])
return 1;
}
return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
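The same collect, concatenate, and sort idea can be sketched in C++ with per-thread std::vector buffers, which avoids the large stack arrays. The predicate keep is a hypothetical stand-in for inFunc(p, index), and the chunk size 64 is an arbitrary starting value, not a tuned one.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Pair = std::pair<unsigned, unsigned>;

// Hypothetical stand-in for inFunc(p, index) from the question.
bool keep(unsigned i, unsigned j) { return (i * 31u + j * 17u) % 5u == 0u; }

// Returns all accepted (i, j) pairs in the same order as the serial loop.
std::vector<Pair> collect_pairs(unsigned x, unsigned z) {
    std::vector<Pair> all;

    #pragma omp parallel
    {
        std::vector<Pair> mine;  // private to each thread, grows on demand

        #pragma omp for collapse(2) schedule(dynamic, 64) nowait
        for (unsigned i = 0; i < x; i++) {
            for (unsigned j = 0; j < z; j++) {
                if (keep(i, j)) mine.emplace_back(i, j);
            }
        }

        // One thread at a time appends its private results.
        #pragma omp critical(concat_pairs)
        all.insert(all.end(), mine.begin(), mine.end());
    }

    // std::pair compares lexicographically, which restores serial order.
    std::sort(all.begin(), all.end());
    return all;
}
```

std::sort on pairs replaces the hand-written comparison function, since std::pair already orders by first and then by second.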
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(uint *a, uint *b, uint *c, uint na, uint nb) {
uint i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL; // must start as NULL so that the first free(P) is valid
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it goes at O(omp_get_num_threads()) but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations assigned to each thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
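A minimal sketch of hand-written dynamic scheduling, under the assumption that a shared counter claimed with an atomic capture is acceptable: each thread repeatedly claims the next chunk of indices, so the indices any one thread sees are strictly increasing. The names keep_index and collect_by_thread are hypothetical, and keep_index stands in for inFunc(p, index).

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for inFunc(p, index) from the question.
bool keep_index(unsigned k) { return k % 3u == 0u; }

// Returns one sorted list of accepted indices per thread.
// Assumes total + chunk * (number of threads) does not overflow unsigned.
std::vector<std::vector<unsigned>> collect_by_thread(unsigned total,
                                                     unsigned chunk) {
    assert(chunk > 0);
    std::vector<std::vector<unsigned>> per_thread;
    unsigned next = 0;  // first index not yet handed out, shared by all threads

    #pragma omp parallel
    {
        std::vector<unsigned> mine;
        for (;;) {
            unsigned start;
            // Atomically claim the range [start, start + chunk).
            #pragma omp atomic capture
            { start = next; next += chunk; }

            if (start >= total) break;
            const unsigned stop = std::min(total, start + chunk);
            for (unsigned k = start; k < stop; ++k) {
                if (keep_index(k)) mine.push_back(k);
            }
        }

        // Because next only grows, mine is already in increasing order.
        #pragma omp critical(store_thread_list)
        per_thread.push_back(std::move(mine));
    }
    return per_thread;
}
```

The per-thread lists can then be combined with a merge of sorted arrays, as in the merge function above, without relying on how the runtime hands out iterations.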
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = 2 * (i * z + j); // each row has z pairs, each pair takes two slots
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will then have gaps in your array XY, which may or may not be a problem for you.
Here is code to find determinant of matrix n x n.
#include <iostream>
using namespace std;
int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);
int main()
{
int size;
cout << "What is the size of the matrix for which you want to find the determinant?:\t";
cin >> size;
int **matrix;
matrix = new int*[size];
for (int i = 0 ; i < size ; i++)
matrix[i] = new int[size];
cout << "\nEnter the values of the matrix separated by spaces:\n\n";
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
cin >> matrix[i][j];
cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
return 0;
}
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
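for(int i = 0; i < size-1; i++){
delete[] minorMatrix[i]; // rows were allocated with new[]
}
delete[] minorMatrix; // free the array of row pointers as well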
}
return result;
}
}
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
for(int i = 0; i < size; i++){
for(int j = 0; j < size; j++){
if(i < row){
if(j < column)minorMatrix[i][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i][j-1] = matrix[i][j];
}
else if(i == row)continue;
else{
if(j < column)minorMatrix[i-1][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i-1][j-1] = matrix[i][j];
}
}
}
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
#pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
for(int i = 0; i < size-1; i++){
delete minorMatrix[i];
}
}
return result;
delete [] matrix;
}
}
My problem is that the result is different every time. Sometimes it gives the correct value, but most often it is wrong. I think it's because of the sign variable. I am following the cofactor expansion along the first row: det(A) = sum over j of (-1)^(1+j) * a_1j * det(M_1j), where M_1j is the matrix with row 1 and column j removed.
As you can see, every iteration of my for loop should use a different sign, but when I use OpenMP, something goes wrong. How can I make this program run correctly with OpenMP?
Finally, my second issue is that using OpenMP does not make the program run quicker than without OpenMP. I also tried to make a 100,000 x 100,000 matrix, but my program reports an error about allocating memory. How can I run this program with very large matrices?
Your issues, as I see them, are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As is, that means that in every iteration of the loop, after forming your minor, you call your function again on that minor. Therefore, a single thread is going to be performing its iteration, enter the recursive function, and then try to split itself up into nthreads more threads. You can see how this oversubscribes your system by launching many more threads than you physically have cores. Three easy solutions:
Call your original serial function from within the omp parallel code. This is the fastest way to do it because this would avoid any OpenMP-startup overhead.
Turn off nested parallelism by calling omp_set_nested(0); before your first call to determinant().
Add an if clause to your parallel for directive: if(!omp_in_parallel()), so that a new team is created only when the call is not already inside a parallel region.
3) Your memory issues are because every iteration of your recursion, you are allocating more memory. If you fix problem #2, then you should be using comparable amounts of memory in the serial case as the parallel case. That being said, it would be much better to allocate all the memory you want before entering your algorithm. Allocating large chunks of memory (and then freeing it!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
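Putting the three points together, a minimal sketch might look like the following. It assumes a row-major 1D matrix, gives each top-level iteration (rather than each thread) its own workspace slice for simplicity, and returns long long to postpone overflow. The helpers workspace_size, fill_minor, and determinant_parallel are hypothetical names, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ints needed by one recursion chain starting at a size x size matrix:
// (size-1)^2 + (size-2)^2 + ... + 1^2.
std::size_t workspace_size(int size) {
    std::size_t total = 0;
    for (int k = 1; k < size; ++k) total += static_cast<std::size_t>(k) * k;
    return total;
}

// Writes the minor obtained by deleting row 0 and the given column.
void fill_minor(const int* matrix, int* minor, int size, int column) {
    int out = 0;
    for (int i = 1; i < size; ++i)
        for (int j = 0; j < size; ++j)
            if (j != column) minor[out++] = matrix[i * size + j];
}

// Serial cofactor expansion along row 0; allocates nothing.
long long determinant(const int* matrix, int* startOfMyWorkspace, int size) {
    if (size == 1) return matrix[0];
    const int m = size - 1;
    int* minor = startOfMyWorkspace;         // next (size-1)*(size-1) ints
    int* rest = startOfMyWorkspace + m * m;  // workspace for deeper calls
    long long result = 0;
    for (int j = 0; j < size; ++j) {
        fill_minor(matrix, minor, size, j);
        const int sign = (j % 2) ? -1 : 1;   // depends only on j
        result += sign * static_cast<long long>(matrix[j]) *
                  determinant(minor, rest, m);
    }
    return result;
}

// Parallel only at the top level; each iteration calls the serial
// function on its own slice of one contiguous allocation.
long long determinant_parallel(const std::vector<int>& matrix, int size) {
    assert(size >= 1);
    assert(matrix.size() == static_cast<std::size_t>(size) * size);
    if (size == 1) return matrix[0];
    const int m = size - 1;
    const std::size_t slice = workspace_size(size);
    std::vector<int> workspace(slice * size);  // allocated once, up front
    const int* a = matrix.data();
    int* w = workspace.data();
    long long result = 0;

    #pragma omp parallel for reduction(+ : result)
    for (int j = 0; j < size; ++j) {
        int* minor = w + slice * j;          // this iteration's private slice
        fill_minor(a, minor, size, j);
        const int sign = (j % 2) ? -1 : 1;
        result += sign * static_cast<long long>(a[j]) *
                  determinant(minor, minor + m * m, m);
    }
    return result;
}
```

For example, determinant_parallel({1, 2, 3, 4}, 2) evaluates 1*4 - 2*3. Cofactor expansion still takes on the order of n! operations, so this layout removes the allocation overhead but does not make very large matrices feasible.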