Parallel DES implementation using OpenMP in C++

I am trying to parallelize DES but am getting hardly any speedup. Parallelizing the s-box part does not give any speedup; rather, it is running in polynomial time.
Here is the s-box part of the DES:
int row[8],col[8],val[8];
//s box parallelism
#pragma omp parallel for num_threads(8) schedule(static)
for (int i = 0; i < 8; i++) {
//the value of '0' is 48, '1' is 49 and so on, but since we are referring to the matrix index, we are interested in 0,1,..
//So, '0' should be subtracted, i.e. the value 49 of '1' becomes 49-48=1.
//index by the loop variable i rather than omp_get_thread_num(), so every s-box is processed even if fewer than 8 threads are available
row[i] = 2 * int(x[i * 6] - '0') + int(x[i * 6 + 5] - '0');
col[i] = 8 * int(x[i * 6 + 1] - '0') + 4 * int(x[i * 6 + 2] - '0') + 2 * int(x[i * 6 + 3] - '0') + int(x[i * 6 + 4] - '0');
val[i] = sbox[i][row[i]][col[i]];
result[i]= decimalToBinary(val[i]);
}
Is there a way I can parallelize the s-boxes to improve speedup? Or is there another part of the algorithm which can be parallelized to get maximum speedup? Any examples?

Use auto range1 = std::async(dowork) to calculate the ranges in parallel, then combine the results with range1.get() + range2.get(), and so on.
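A minimal sketch of that std::async idea, assuming the work can be split into independent ranges. The names doWork and parallelSum are hypothetical, and the summation only stands in for whatever per-range work (for example, encrypting a run of independent blocks) the real program does:

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the real per-range work: sums data[begin, end).
long long doWork(const std::vector<int>& data, std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
}

// Splits the input into two halves, processes each half in its own task,
// and combines the partial results once both tasks have finished.
long long parallelSum(const std::vector<int>& data) {
    const std::size_t mid = data.size() / 2;
    auto range1 = std::async(std::launch::async, doWork, std::cref(data),
                             std::size_t{0}, mid);
    auto range2 = std::async(std::launch::async, doWork, std::cref(data),
                             mid, data.size());
    return range1.get() + range2.get();
}
```

Each task needs enough work to outweigh the cost of starting it; eight s-box lookups per task would most likely be too little, which is why the sketch splits at a coarser level.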

Related

Mutex gone wrong?

I am trying out Pthreads with a pretty basic program. I have two shared variables (declared global) among all threads:
long Sum = 0;
long Sum1 = 0;
pthread_mutex_t mutexLock = PTHREAD_MUTEX_INITIALIZER;
In thread function:
for(int i=start; i<end; i++) //start and end are being passed to thread and they are being passed correctly
{
pthread_mutex_lock(&mutexLock);
Sum1+=i;
Sum+=Sum1;
pthread_mutex_unlock(&mutexLock);
}
main() in case one needs for reference:
int main()
{
pthread_t threadID[10];
for(int i=0; i<10; i++)
{
int a = (i*500) + 1;
int b =(i + 1)*500;
ThreadStruct* obj = new ThreadStruct(a,b);
pthread_create(&threadID[i],NULL,ThreadFunc,obj);
}
for(int i=0; i<10; i++)
{
pthread_join(threadID[i], NULL);
}
cout<<"Sum: "<<Sum<<endl;
cout<<"Sum1: "<<Sum1<<endl;
return 0;
}
OUTPUT
Sum: 40220835000
Sum1: 12502500
Run again
Sum: 38720835000
Sum1: 12502500
Run again
Sum: 39720835000
Sum1: 12502500
PROBLEM
Why am I getting a different value for Sum on each run?
The rest of the code works fine and the output of Sum1 is correct no matter how many times I run the code; the only issue is Sum. Am I doing something wrong in my use of the mutex here?
UPDATE
If I use local variables as @molbdnilo suggested in his detailed answer, the problem is solved. At first I thought the mutex was irrelevant here, but I tested it a number of times and saw the problem recur when the mutex was not used. So the solution (courtesy of the answer by @molbdnilo) is to use local variables WITH a mutex, and I have tested that it works perfectly!
It's not a threading problem – the problem is that even though the order of additions to Sum1 doesn't matter, the order of additions to Sum does.
Consider the much shorter sum 1 + 2 + 3 and the following interleavings
1:
Sum1 = 1 + 2 = 3
Sum = 0 + 3 = 3
Sum1 = 3 + 3 = 6
Sum = 3 + 6 = 9
2:
Sum1 = 1 + 3 = 4
Sum = 0 + 4 = 4
Sum1 = 4 + 2 = 6
Sum = 4 + 6 = 10
3:
Sum1 = 2 + 3 = 5
Sum = 0 + 5 = 5
Sum1 = 5 + 1 = 6
Sum = 5 + 6 = 11
You could solve this by having the threads compute their own sum-of-sums independently and adding them afterwards.
(Notice that there's no concurrent mutation here, so locking anything can't make any difference.)
For a more concrete example, let's limit your program to two threads and the sum from 1 to 6.
You then have one thread computing 1 + 2 + 3 and one doing 4 + 5 + 6.
At a glance, thread one should also compute 1 + (1 + 2) + (1 + 2 + 3) and thread 2, 4 + (4 + 5) + (4 + 5 + 6).
Except they don't – every time they use it, Sum may have been modified by the other thread.
So thread one may compute 1 + ((1 + 4) + 2) + ((1 + 4) + 2 + 3), or something else.
When you use local variables, you keep each thread's result independent of the others.
(I think this problem is a pretty good illustration of how shared mutable state can complicate things in unexpected ways, by the way.)
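A minimal sketch of the local-variable approach, using std::thread instead of Pthreads for brevity. The names Totals and sumWithLocals are hypothetical, and the inclusive range [start, end] per thread is an assumption based on the values passed in the question's main(). Note that Sum here becomes the total of the per-thread sums-of-sums, which is deterministic but is not the same quantity a single sequential loop would produce:

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct Totals {
    long long sum;   // total of the per-thread sums-of-sums
    long long sum1;  // total of the per-thread plain sums
};

// Each thread accumulates into its own locals and touches the shared
// totals exactly once, under the mutex, after its loop has finished.
Totals sumWithLocals(int threadCount, int perThread) {
    Totals total{0, 0};
    std::mutex totalLock;
    std::vector<std::thread> pool;
    for (int t = 0; t < threadCount; ++t) {
        const int start = t * perThread + 1;
        const int end = (t + 1) * perThread;  // inclusive (assumption)
        pool.emplace_back([&total, &totalLock, start, end] {
            long long localSum1 = 0;
            long long localSum = 0;
            for (int i = start; i <= end; ++i) {
                localSum1 += i;
                localSum += localSum1;
            }
            std::lock_guard<std::mutex> guard(totalLock);
            total.sum1 += localSum1;
            total.sum += localSum;
        });
    }
    for (auto& worker : pool) worker.join();
    return total;
}
```

Calling sumWithLocals(10, 500) mirrors the ten threads of 500 numbers each in the question. Because no thread ever reads another thread's intermediate value, the result no longer depends on how the threads interleave.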

How to convert triangular matrix indexes in to row, column coordinates?

I have these indexes:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,etc...
Which are indexes of nodes in a matrix (including diagonal elements):
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
16 17 18 19 20 21
etc...
and I need to get i,j coordinates from these indexes:
1,1
2,1 2,2
3,1 3,2 3,3
4,1 4,2 4,3 4,4
5,1 5,2 5,3 5,4 5,5
6,1 6,2 6,3 6,4 6,5 6,6
etc...
When I need to calculate coordinates I have only one index and cannot access others.
Not optimized at all :
int j = idx;
int i = 1;
while(j > i) {
j -= i++;
}
Optimized :
int i = std::ceil(std::sqrt(2 * idx + 0.25) - 0.5);
int j = idx - (i-1) * i / 2;
And here is the demonstration:
You're looking for i such that :
sumRange(1, i-1) < idx && idx <= sumRange(1, i)
where sumRange(min, max) sums the integers between min and max, both included.
But since you know that :
sumRange(1, i) = i * (i + 1) / 2
Then you have :
idx <= i * (i+1) / 2
=> 2 * idx <= i * (i+1)
=> 2 * idx <= i² + i + 1/4 - 1/4
=> 2 * idx + 1/4 <= (i + 1/2)²
=> sqrt(2 * idx + 1/4) - 1/2 <= i
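Since i is the smallest integer satisfying that last inequality, it is the ceiling of the left-hand side. A small sketch that puts the loop version and the closed form side by side so they can be checked against each other; the function names triIndexLoop and triIndexClosed are hypothetical, and idx is one-based as in the question:

```cpp
#include <cmath>
#include <utility>

// Reference version: walks down the rows until idx fits into row i.
std::pair<long long, long long> triIndexLoop(long long idx) {
    long long j = idx;
    long long i = 1;
    while (j > i) {
        j -= i;
        ++i;
    }
    return {i, j};
}

// Closed form: i is the smallest integer with idx <= i * (i + 1) / 2.
std::pair<long long, long long> triIndexClosed(long long idx) {
    const long long i =
        static_cast<long long>(std::ceil(std::sqrt(2.0 * idx + 0.25) - 0.5));
    const long long j = idx - (i - 1) * i / 2;
    return {i, j};
}
```

The closed form relies on double-precision sqrt, so for very large indexes it is worth comparing it against the loop version over the range you actually need.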
In my case (a CUDA kernel implemented in standard C), I use zero-based indexing (and I want to exclude the diagonal) so I needed to make a few adjustments:
// idx is still one-based
unsigned long int idx = blockIdx.x * blockDim.x + threadIdx.x + 1; // CUDA kernel launch parameters
// but the coordinates are now zero-based
unsigned long int x = ceil(sqrt((2.0 * idx) + 0.25) - 0.5);
unsigned long int y = idx - (x - 1) * x / 2 - 1;
Which results in:
[0]: (1, 0)
[1]: (2, 0)
[2]: (2, 1)
[3]: (3, 0)
[4]: (3, 1)
[5]: (3, 2)
I also re-derived the formula of Flórez-Rueda y Moreno 2001 and arrived at:
unsigned long int x = floor(sqrt(2.0 * pos + 0.25) + 0.5);
CUDA Note: I tried everything I could think of to avoid using double-precision math, but the single-precision sqrt function in CUDA is simply not precise enough to convert positions greater than 121 million or so to x, y coordinates (when using 1,024 threads per block and indexing only along 1 block dimension). Some articles have employed a "correction" to bump the result in a particular direction, but this inevitably falls apart at a certain point.

Decreasing Loop Interval by 1 in C/C++

Let's say I have 15 elements. I want to group them in such a way that:
group1 = 1 - 5
group2 = 6 - 9
group3 = 10 - 12
group4 = 13 - 14
group5 = 15
This way I'll get elements in each group as below:
group1 = 5
group2 = 4
group3 = 3
group4 = 2
group5 = 1
As you can see loop interval is decreasing.
I took 15 just as an example. In the actual program it's a user-driven parameter which can be anything (hopefully a few thousand).
Now what I'm looking for is:
Whatever is in group1 should have variable "loop" value 0, group2 should have 1, group3 should have 2 and so on... "loop" is an int variable which is being used to calculate some other stuff.
Let's put in other words too
I have an int variable called "loop". I want to assign a value to it in such a way that:
for the first n frames the loop value is 0, for the next (n - 1) frames it is 1, for the next (n - 2) frames it is 2, and so on all the way to loop value (n - 1).
Let's say I have 15 frames on my timeline.
So n will be 5 ====>>>>> (5 + 4 + 3 + 2 + 1 = 15; as interval is decreasing by 1)
then
first 5 frames(1 - 5) loop is 0 then next 4 frames(6 - 9) loop is 1 then next 3 frames(10 - 12) loop is 2 then next 2 frames(13 - 14) loop is 3 and for last frame(15) loop is 4.
frames "loop" value
1 - 5 => 0
6 - 9 => 1
10 - 12 => 2
13 - 14 => 3
15 => 4
I've tried with modulo(%). But the issue is on frame 12 loop is 2 so (12 % (5 - 2)) remainder is 0 so it increments loop value.
The following lines are sample code which is running inside a solver. #loop is by default 0 and #Frame is current processing frame number.
int loopint = 5 - #loop;
if (#Frame % loopint == 0)
#loop += 1;
If I understand this correctly, then
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[]) {
int n = atoi(argv[1]);
for(int i = 1; i <= n; ++i) {
printf("%d: %f\n", i, ceil((sqrt(8 * (n - i + 1) + 1) - 1) / 2));
}
}
is an implementation in C.
The math behind this is as follows: The 1 + 2 + 3 + 4 + 5 you have there is a Gauß sum, which has a closed form S = n * (n + 1) / 2 for n terms. Solving this for n, we get
n = (sqrt(8 * S + 1) - 1) / 2
Rounding this upward would give us the solution if you wanted the short stretches at the beginning, that is to say 1, 2, 2, 3, 3, 3, ...
Since you want the stretches to become progressively shorter, we have to invert the order, so S becomes (n - S + 1). Therefore the formula up there.
EDIT: Note that unless the number of elements in your data set fits the n * (n+1) / 2 pattern precisely, you will have shorter stretches either at the beginning or in the end. This implementation places the irregular stretch at the beginning. If you want them at the end,
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[]) {
int n = atoi(argv[1]);
int n2 = (int) ceil((sqrt(8 * n + 1) - 1) / 2);
int upper = n2 * (n2 + 1) / 2;
for(int i = 1; i <= n; ++i) {
printf("%d: %f\n", i, n2 - ceil((sqrt(8 * (upper - i + 1) + 1) - 1) / 2));
}
}
does it. This calculates the next such number beyond your element count, then calculates the numbers you would have if you had that many elements.
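The same calculation can be wrapped in a function that returns the "loop" value for one frame directly. This is only a sketch: the name loopValue is hypothetical, and frames are assumed to be numbered from 1 as in the question:

```cpp
#include <cmath>

// Returns the zero-based "loop" value for a one-based frame number,
// given the total number of frames. Stretches get shorter by one each
// time; any irregular (shorter) stretch ends up at the end.
int loopValue(int frame, int totalFrames) {
    // Number of stretches needed to cover totalFrames.
    const int n2 = static_cast<int>(
        std::ceil((std::sqrt(8.0 * totalFrames + 1.0) - 1.0) / 2.0));
    // The next triangular number at or beyond totalFrames.
    const int upper = n2 * (n2 + 1) / 2;
    const int remaining = upper - frame + 1;
    const int stretch = static_cast<int>(
        std::ceil((std::sqrt(8.0 * remaining + 1.0) - 1.0) / 2.0));
    return n2 - stretch;
}
```

With 15 frames, frames 1 to 5 map to 0, frames 6 to 9 to 1, frames 10 to 12 to 2, frames 13 and 14 to 3, and frame 15 to 4, matching the table in the question.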

How to find Big-O notation of FFT multiplication under a loop

The Big-O notation of FFT Multiplication is O(nlogn). What is Big-O notation of a FFT multiplication under a loop as given in algorithm below? The code is given in matlab and FFTmulti is a function for FFT Multiplication of two polynomials
rG=1;
rN=1;
AreaFunc=[1 2 5 2 3 6 7 2 4 5 6];
N=length(AreaFunc);
for i=1:(N-1)
ref_coeff(i) = (AreaFunc(i+1) - AreaFunc(i)) / (AreaFunc(i+1) + AreaFunc(i));
end
ref_coeff=[ref_coeff rN];
G = (1 + rG) / 2;
A0 = [1]; B0 = [-rG];
for i = 1 : length(ref_coeff)
G = G * (1 + ref_coeff(i));
A1 = [-ref_coeff(i) 0]; B1 = [1 0];
An = [0 A0] + FFTmulti(A1,B0);
Bn = [0 -ref_coeff(i)*A0] + FFTmulti(B1,B0);
A0=An;
B0=Bn;
end
A0 =fliplr(A0);
num = zeros(1, (floor(N/2)));
num = [num G];
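The complexity of the FFT, for the best known algorithms, is O(N*log2(N)).
If you call it inside a loop of N iterations, the total will be O(N^2 log2(N)). Here the polynomials grow by one coefficient per iteration, so iteration i costs O(i*log2(i)), and the sum over all iterations is still bounded by O(N^2 log2(N)).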

Sum of submatrices of bigger matrix

I have a big matrix as input, and I have the size of a smaller matrix. I have to compute the sum of all possible smaller matrices which can be formed out of the bigger matrix.
Example.
Input matrix size: 4 × 4
Matrix:
1 2 3 4
5 6 7 8
9 9 0 0
0 0 9 9
Input smaller matrix size: 3 × 3 (not necessarily a square)
Smaller matrices possible:
1 2 3
5 6 7
9 9 0
5 6 7
9 9 0
0 0 9
2 3 4
6 7 8
9 0 0
6 7 8
9 0 0
0 9 9
Their sum, final output
14 18 22
29 22 15
18 18 18
I did this:
int** matrix_sum(int **M, int n, int r, int c)
{
int **res = new int*[r];
for(int i=0 ; i<r ; i++) {
res[i] = new int[c];
memset(res[i], 0, sizeof(int)*c);
}
for(int i=0 ; i<=n-r ; i++)
for(int j=0 ; j<=n-c ; j++)
for(int k=i ; k<i+r ; k++)
for(int l=j ; l<j+c ; l++)
res[k-i][l-j] += M[k][l];
return res;
}
I guess this is too slow, can anyone please suggest a faster way?
Your current algorithm is O((m - p) * (n - q) * p * q). The worst case is when p = m / 2 and q = n / 2.
The algorithm I'm going to describe will be O(m * n + p * q), which will be O(m * n) regardless of p and q.
The algorithm consists of 2 steps.
Let the input matrix A's size be m x n and the size of the window matrix be p x q.
First, you will create a precomputed matrix B of the same size as the input matrix. Each element of the precomputed matrix B contains the sum of all the elements in the sub-matrix, whose top-left element is at coordinate (1, 1) of the original matrix, and the bottom-right element is at the same coordinate as the element that we are computing.
B[i, j] = Sum[k = 1..i, l = 1..j]( A[k, l] ) for all 1 <= i <= m, 1 <= j <= n
This can be done in O(m * n), by using this relation to compute each element in O(1):
B[i, j] = B[i - 1, j] + Sum[k = 1..j-1]( A[i, k] ) + A[i, j] for all 2 <= i <= m, 1 <= j <= n
B[i - 1, j], which is everything of the sub-matrix we are computing except the current row, has been computed previously. You keep a prefix sum of the current row, so that you can use it to quickly compute the sum of the current row.
This is another way to compute B[i, j] in O(1), using the property of the 2D prefix sum:
B[i, j] = B[i - 1, j] + B[i, j - 1] - B[i - 1, j - 1] + A[i, j] for all 1 <= i <= m, 1 <= j <= n and invalid entry = 0
Then, the second step is to compute the result matrix S, whose size is p x q. Observe that S[i, j] is the sum of all elements in the sub-matrix of size (m - p + 1) x (n - q + 1) whose top-left coordinate is (i, j) and whose bottom-right coordinate is (i + m - p, j + n - q).
Using the precomputed matrix B, you can compute the sum of any sub-matrix in O(1). Apply this to compute the result matrix S:
SubMatrixSum(top-left = (x1, y1), bottom-right = (x2, y2))
= B[x2, y2] - B[x1 - 1, y2] - B[x2, y1 - 1] + B[x1 - 1, y1 - 1]
Therefore, the complexity of the second step will be O(p * q).
The final complexity is as mentioned above, O(m * n), since p <= m and q <= n.
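The two steps can be sketched in C++ as follows. This is an illustrative sketch rather than a drop-in replacement for the question's function: the names Matrix and windowSums are hypothetical, indexes are zero-based, and B carries an extra row and column of zeros so that the "invalid entry = 0" rule needs no special cases:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Sums all p x q windows of A element-wise, in O(m * n) time.
// Assumes A is non-empty and rectangular, with p <= m and q <= n.
Matrix windowSums(const Matrix& A, std::size_t p, std::size_t q) {
    const std::size_t m = A.size();
    const std::size_t n = A[0].size();

    // Step 1: B[i][j] holds the sum of A[0..i-1][0..j-1].
    Matrix B(m + 1, std::vector<long long>(n + 1, 0));
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            B[i][j] = B[i - 1][j] + B[i][j - 1] - B[i - 1][j - 1]
                      + A[i - 1][j - 1];

    // Step 2: S[i][j] is the sum of the (m - p + 1) x (n - q + 1)
    // block of A whose top-left corner is (i, j).
    const std::size_t h = m - p + 1;
    const std::size_t w = n - q + 1;
    Matrix S(p, std::vector<long long>(q, 0));
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < q; ++j)
            S[i][j] = B[i + h][j + w] - B[i][j + w] - B[i + h][j] + B[i][j];
    return S;
}
```

For the 4 x 4 example in the question with a 3 x 3 window, this yields the rows 14 18 22, 29 22 15 and 18 18 18.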