What is the reason for bad parallel performance? - c++

I'm trying to implement a parallel algorithm that computes the Levenshtein distance between each pair of sequences in a list and stores the results in a matrix (2D vector). In other words, I'm given a 2D vector of numbers (thousands of number sequences of up to 30 numbers each) and I need to compute the Levenshtein distance between every pair of integer vectors. I implemented a serial algorithm that works, but when I converted it to a parallel version, it became much slower (the more threads, the slower it gets). The parallel version is implemented with C++11 threads (I also tried OpenMP, with the same results).
Here is the function that distributes work:
vector<vector<int>> getGraphParallel(vector<vector<int>>& records){
    int V = records.size();
    auto threadCount = std::thread::hardware_concurrency();
    if(threadCount == 0){
        threadCount = 1;
    }
    vector<future<vector<vector<int>>>> futures;
    int rowCount = V / threadCount;
    for(int i = 0; i < threadCount; i++){
        int start = i * rowCount;
        if(i == threadCount - 1){
            rowCount += V % threadCount;
        }
        futures.push_back(std::async(getRows, std::ref(records), start, rowCount, V));
    }
    vector<vector<int>> graph;
    for(int i = 0; i < futures.size(); i++){
        auto result = futures[i].get();
        for(const auto &row : result){
            graph.push_back(row);
        }
    }
    for(int i = 0; i < V; i++){
        for(int j = i + 1; j < V; j++){
            graph[j][i] = graph[i][j];
        }
    }
    return graph;
}
Here is the function that computes the rows of the final matrix:
vector<vector<int>> getRows(vector<vector<int>>& records, int from, int count, int size){
    vector<vector<int>> result(count, vector<int>(size, 0));
    for(int i = 0; i < count; i++){
        for(int j = i + from + 1; j < size; j++){
            result[i][j] = levenshteinDistance(records[i + from], records[j]);
        }
    }
    return result;
}
And finally, the function that computes the Levenshtein distance:
int levenshteinDistance(const vector<int>& first, const vector<int>& second){
    const int sizeFirst = first.size();
    const int sizeSecond = second.size();
    if(sizeFirst == 0) return sizeSecond;
    if(sizeSecond == 0) return sizeFirst;
    vector<vector<int>> distances(sizeFirst + 1, vector<int>(sizeSecond + 1, 0));
    for(int i = 0; i <= sizeFirst; i++){
        distances[i][0] = i;
    }
    for(int j = 0; j <= sizeSecond; j++){
        distances[0][j] = j;
    }
    for (int j = 1; j <= sizeSecond; j++){
        for (int i = 1; i <= sizeFirst; i++){
            if (first[i - 1] == second[j - 1]){
                distances[i][j] = distances[i - 1][j - 1];
            } else {
                distances[i][j] = min(min(
                    distances[i - 1][j] + 1,
                    distances[i][j - 1] + 1),
                    distances[i - 1][j - 1] + 1
                );
            }
        }
    }
    return distances[sizeFirst][sizeSecond];
}
One thing that came to mind is that the slowdown is caused by false sharing, but I couldn't verify that with perf, because I'm running Ubuntu in Oracle VirtualBox, where cache-miss counters are not available. If I'm right and the slowdown is caused by false sharing, what should I do to fix it? If not, what is the reason for the slowdown?

One problem I can see is that you are using std::async without specifying a launch policy. It can run either asynchronously or deferred. With deferred, everything is lazily evaluated in the calling thread, so nothing runs in parallel. The default behavior is implementation-defined, and for your case more deferred evaluations would explain it running slower. Try std::async(std::launch::async, ...).
Make sure your VM is also set up to use more than one core. Ideally, when doing optimizations like these, it's best to eliminate as many variables as possible. If you can, run the program locally without a VM. Profiling tools are your best bet and will show you exactly where the time is being spent.

Related

Improving a solution

The description of a task goes like this:
We have n numbers, and we have to find the number of unique sums over all pairs in the array.
For example:
3 2 5 6 3
The sums of all the pairs(non-repeated) are 5 9 8 6 8 7 5 11 9 8
Unique are 5 9 8 6 7 11
Therefore output is 6
I have come up with this really primitive and time-consuming (in terms of complexity) solution:
int n = 0;
cin >> n;
vector<int> vec(n);
for (int i = 0; i < n; i++)
{
    cin >> vec[i];
}
vector<int> sum;
for (int i = 0; i < n; i++)
{
    for (int j = i+1; j < n; j++)
    {
        sum.push_back(vec[i] + vec[j]);
    }
}
sort(sum.begin(), sum.end());
for (int i = 0; i < sum.size()-1;)
{
    if (sum[i] == sum[i + 1]) sum.erase(sum.begin() + i);
    else i++;
}
cout << endl << sum.size();
I feel like there could be a solution using Combinatorics or something easier. I have thought a lot and couldn't think of anything. So my request is if anyone can improve the solution.
As mentioned above, it is difficult to do this without computing the sums of all pairs, so I am not going to address that; I am just going to advise on efficient data structures.
Analysis of your solution
Your code generates all the sums in advance, O(n^2), then sorts them, O(n^2 log n), then removes duplicates. But since you are erasing from a vector, each erase has complexity linear in the number of elements between the erased position and the end. That makes the second loop drive the overall complexity of your algorithm to O(n^4).
You can count the unique elements in a sorted array without removing anything:
int count = sum.empty() ? 0 : 1; // the first element starts the first run
for (int i = 0; i + 1 < (int)sum.size(); ++i)
{
    if (sum[i] != sum[i + 1]) ++count;
}
This change alone makes your algorithm's complexity O(n^2 log n).
Alternatives without sorting.
Here are alternatives that run in O(n^2), with storage depending on the range of the input values rather than the length of the vector (except for the last one).
I am testing with 1000 elements with values between 0 and 10000:
vector<int> vec;
for(int i = 0; i < 1000; ++i){
    vec.push_back(rand() % 10000);
}
Your implementation, sum_pairs1(vec), takes 18 seconds:
int sum_pairs1(const vector<int> &vec){
    vector<int> sum;
    int n = vec.size();
    for (int i = 0; i < n; i++)
    {
        for (int j = i+1; j < n; j++)
        {
            sum.push_back(vec[i] + vec[j]);
        }
    }
    sort(sum.begin(), sum.end());
    for (int i = 0; i < sum.size()-1;)
    {
        if (sum[i] == sum[i + 1]) sum.erase(sum.begin() + i);
        else i++;
    }
    return sum.size();
}
If you know the range of the sums at compile time, you can use a bitset, which is a very efficient use of memory: sum_pairs2<20000>(vec) takes 0.016 seconds.
template<size_t N>
int sum_pairs2(const vector<int> &vec){
    bitset<N> seen;
    int n = vec.size();
    for (int i = 0; i < n; i++)
    {
        for (int j = i+1; j < n; j++)
        {
            seen[vec[i] + vec[j]] = true;
        }
    }
    return seen.count();
}
If you know that the maximum sum is not too high (the data is not very sparse) but you don't know it at compile time, you can use a vector<bool>. Keeping track of the minimum and maximum lets you allocate as little as possible, and also supports negative values:
int sum_pairs2b(const vector<int> &vec){
    int VMAX = vec[0];
    int VMIN = vec[0];
    for(auto v : vec){
        if(VMAX < v) VMAX = v;
        else if(VMIN > v) VMIN = v;
    }
    vector<bool> seen(2*(VMAX - VMIN) + 1);
    int n = vec.size();
    for (int i = 0; i < n; i++)
    {
        for (int j = i+1; j < n; j++)
        {
            seen[vec[i] + vec[j] - 2*VMIN] = true;
        }
    }
    int count = 0;
    for(auto c : seen){
        if(c) ++count;
    }
    return count;
}
And if you want a more general solution that also works well with sparse data, use an unordered_set: sum_pairs3<int>(vec) takes 0.097 seconds.
template<typename T>
int sum_pairs3(const vector<T> &vec){
    unordered_set<T> seen;
    int n = vec.size();
    for (int i = 0; i < n; i++)
    {
        for (int j = i+1; j < n; j++)
        {
            seen.insert(vec[i] + vec[j]);
        }
    }
    return seen.size();
}

Skipping vector iterations based on index equality

Let's say I have three vectors.
#include <vector>
vector<long> Alpha;
vector<long> Beta;
vector<long> Gamma;
And let's assume I've filled them up with numbers, and that we know they're all the same length. (and we know that length ahead of time - let's say it's 3.)
What I want to have at the end is the minimum of all sums Alpha[i] + Beta[j] + Gamma[k] such that i, j, and k are all unequal to each other.
The naive approach would look something like this:
#include <climits>
long min = LONG_MAX;
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
        for (int k = 0; k < 3; k++) {
            if (i != j && i != k && j != k) {
                long sum = Alpha[i] + Beta[j] + Gamma[k];
                if (sum < min)
                    min = sum;
            }
        }
    }
}
Frankly, that code doesn't feel right. Is there a faster and/or more elegant way - one that skips the redundant iterations?
The computational complexity of your algorithm is O(N^3). You can shave off a small amount by hoisting the partial sum:
for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
        if ( i == j )
            continue;
        long sum1 = Alpha[i] + Beta[j];
        for (int k = 0; k < 3; k++) {
            if (i != k && j != k) {
                long sum2 = sum1 + Gamma[k];
                if (sum2 < min)
                    min = sum2;
            }
        }
    }
}
However, the complexity of the algorithm is still O(N^3). Without the if ( i == j ) check, the innermost loop would be executed N^2 times; with the check, you skip the innermost loop N times, so it is executed N(N-1) times. The check is barely worth it.
If you can temporarily modify the input vectors, you can swap the used values to the end of the vectors and iterate only over the start:
for (int i = 0; i < size; i++) {
    std::swap(Beta[i], Beta[size-1]);       // swap the used index with the last index
    std::swap(Gamma[i], Gamma[size-1]);
    for (int j = 0; j < size-1; j++) {      // don't try the last index
        std::swap(Gamma[j], Gamma[size-2]); // swap j with the 2nd-to-last index
        for (int k = 0; k < size-2; k++) {  // don't try the 2 last indices
            long sum = Alpha[i] + Beta[j] + Gamma[k];
            if (sum < min) {
                min = sum;
            }
        }
        std::swap(Gamma[j], Gamma[size-2]); // restore values
    }
    std::swap(Beta[i], Beta[size-1]);       // restore values
    std::swap(Gamma[i], Gamma[size-1]);
}

Is it possible to avoid the for-loop to compute matrix entries?

I have to use a nested for-loop to compute the entries of an Eigen::MatrixXd matrix, output, column-wise. Here input[0], input[1] and input[2] are defined as Eigen::ArrayXXd in order to use elementwise operations. This part seems to be the bottleneck of my code. Can anyone help me accelerate this loop? Thanks!
for (int i = 0; i < r; i++) {
    for (int j = 0; j < r; j++) {
        for (int k = 0; k < r; k++) {
            output.col(i * (r * r) + j * r + k) =
                input[0].col(i) * input[1].col(j) * input[2].col(k);
        }
    }
}
When optimizing a for loop, it helps to ask, "Are there redundant calculations that I can eliminate?"
Notice how in the innermost loop, only k changes. You should move all calculations that don't involve k out of that loop:
for (int i = 0; i < r; i++) {
    int temp1 = i * (r * r);
    for (int j = 0; j < r; j++) {
        int temp2 = j * r;
        for (int k = 0; k < r; k++) {
            output.col(temp1 + temp2 + k) =
                input[0].col(i) * input[1].col(j) * input[2].col(k);
        }
    }
}
Notice how i * (r * r) was being calculated over and over, even though the answer is always the same; you only need to recalculate it when i increments. The same goes for j * r.
Hopefully this helps!
To reduce the number of flops, you should cache the result of input[0].col(i) * input[1].col(j):
ArrayXd tmp(input[0].rows());
for (int i = 0; i < r; i++) {
    for (int j = 0; j < r; j++) {
        tmp = input[0].col(i) * input[1].col(j);
        for (int k = 0; k < r; k++) {
            output.col(i * (r * r) + j * r + k) = tmp * input[2].col(k);
        }
    }
}
Then, to fully use your CPU, enable AVX/FMA with -march=native and, of course, compiler optimizations (-O3).
Then, to get an idea of what you could still gain, measure accurately the time taken by this part, count the number of multiplications (r^2*(n + r*n)), and compute the number of floating-point operations per second you achieve. Compare that to the capacity of your CPU. If you are already close, the only remaining option is to multithread one of the for loops using, e.g., OpenMP. Which loop to choose depends on the size of your inputs, but you can try the outer one, making sure each thread has its own tmp array.

Is it possible to parallelize this for loop?

I was given some code to parallelize using OpenMP and, among the various function calls, I noticed this for loop takes a good share of the computation time.
double U[n][n];
double L[n][n];
double Aprime[n][n];
for(i=0; i<n; i++) {
    for(j=0; j<n; j++) {
        if (j <= i) {
            double s = 0;
            for(k=0; k<j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        } else if (j >= i) {
            double s = 0;
            for(k=0; k<i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
However, after trying to parallelize it and applying some semaphores here and there (with no luck), I came to the realization that the else if branch has a strong dependency on the earlier if (L[j][i] is computed using U[i][i], which may be set in the earlier if), making it, in my opinion, non-parallelizable due to race conditions.
Is it possible to parallelize this code in such a way that the else if only executes once the earlier if has completed?
Before trying to parallelize things, try simplification first.
For example, the if can be completely eliminated.
Also, the code accesses the matrixes in a way that causes the worst possible cache performance. That may be the real bottleneck.
Note: In update #3 below, I did benchmarks and the cache friendly version fix5, from update #2, outperforms the original by 3.9x.
I've cleaned things up in stages, so you can see the code transformations.
With this, it should be possible to add omp directives successfully. As I mentioned in my top comment, the global vs. function scope of the variables affects the type of update that may be required (e.g. omp atomic update, etc.)
For reference, here is your original code:
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (j <= i) {
            double s = 0;
            for (k = 0; k < j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        }
        else if (j >= i) {
            double s = 0;
            for (k = 0; k < i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
The else if (j >= i) was unnecessary and could be replaced with just else. But, we can split the j loop into two loops so that neither needs an if/else:
// fix2.c -- split up j's loop to eliminate if/else inside
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / U[i][i];
    }
}
U[i][i] is invariant in the second j loop, so we can presave it:
// fix3.c -- save off value of U[i][i]
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / Uii;
    }
}
The matrixes are accessed in probably the worst way for cache performance. So, if the assignment of dimensions can be flipped, a substantial saving in memory access can be achieved:
// fix4.c -- transpose matrix coordinates to get _much_ better memory/cache
// performance
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[k][j] * U[i][k];
        U[i][j] = Aprime[i][j] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[k][j] * U[i][k];
        L[i][j] = (Aprime[i][j] - s) / Uii;
    }
}
UPDATE:
In the OP's first k-loop it's k<j, and in the second k<i; don't you have to fix that?
Yes, I've fixed it. It was too ugly a change for fix1.c, so I removed that version and applied the changes to fix2-fix4, where it was easy to do.
UPDATE #2:
These variables are all local to the function.
If you mean they are function-scoped [without static], this means the matrixes can't be too large, because unless the code increases the stack size, they're limited by the stack size limit (e.g. 8 MB).
Although the matrixes appeared to be VLAs [because n was lowercase], I ignored that. You may want to try a test case using fixed dimension arrays as I believe they may be faster.
Also, if the matrixes are function scope, and want to parallelize things, you'd probably need to do (e.g.) #pragma omp shared(Aprime) shared(U) shared(L).
The biggest drag on the cache was the loops that calculate s. In fix4, I was able to make access to U cache friendly, but access to L was still poor.
I'd need to post a whole lot more if I did include the external context
I guessed as much, so I did the matrix dimension swap speculatively, not knowing how much other code would need changing.
I've created a new version that changes the dimensions on L back to the original way, but keeping the swapped versions on the other ones. This provides the best cache performance for all matrixes. That is, the inner loop for most matrix access is such that each iteration is incrementing along the cache lines.
In fact, give it a try. It may improve things to the point where parallel isn't needed. I suspect the code is memory bound anyway, so parallel might not help as much.
// fix5.c -- further transpose to fix poor performance on s calc loops
//
// flip the U dimensions back to original
double U[n][n];
double L[n][n];
double Aprime[n][n];
double *Up;
double *Lp;
double *Ap;
for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];
    for (j = 0; j <= i; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }
    double Uii = Up[i];
    for (; j < n; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < i; k++)
            s += Lp[k] * Up[k];
        Lp[i] = (Ap[j] - s) / Uii;
    }
}
Even if you really need the original dimensions, depending upon the other code, you might be able to transpose going in and transpose back going out. This would keep things the same for other code, but, if this code is truly a bottleneck, the extra transpose operations might be small enough to merit this.
UPDATE #3:
I've run benchmarks on all the versions. Here are the elapsed times and ratios relative to original for n equal to 1037:
orig: 1.780916929 1.000x
fix1: 3.730602026 0.477x
fix2: 1.743769884 1.021x
fix3: 1.765769482 1.009x
fix4: 1.762100697 1.011x
fix5: 0.452481270 3.936x
Higher ratios are better.
Anyway, this is the limit of what I can do. So, good luck ...

Bad index issues while populating a matrix from an array by rows, columns or randomly

Suppose you have a std::vector< T > (where T is any class or type name) containing n elements, and you want to populate a matrix-like object, for instance an object of class boost::numeric::ublas::matrix< T > with m1 rows and m2 columns, inserting those elements one by one, by rows or by columns.
First of all, it must hold that n <= m1 * m2, or we would be trying to fill, say, 10 slots with 11 or more elements. Then, if n <= m1 (respectively, n <= m2), no problems arise, since a single loop is enough.
But if n > m1 (respectively, n > m2), you have to split your std::vector< T > to avoid bad index issues and populate your matrix according to your choice.
Here is my attempt:
#include <algorithm>
#include <boost/numeric/ublas/matrix.hpp>
#include <vector>
template < class T > inline void PopulateGrid(const std::vector< T >& tVector,
                                              boost::numeric::ublas::matrix< T >& tMatrix,
                                              char by = 'n')
{
    if (tVector.size() > tMatrix.size1() * tMatrix.size2())
        throw("Matrix is too small to contain all the array elements!");
    else
    {
        switch(by)
        {
        case 'r' :
            for (unsigned i = 0; i < tMatrix.size1(); i++)
            {
                for (unsigned j = 0; j < tVector.size(); j++)
                {
                    if (j <= tMatrix.size2())
                        tMatrix(i, j) = tVector[j];
                    else
                        tMatrix(i + 1, j - tMatrix.size2()) = tVector[j];
                }
            }
            break;
        case 'c' :
            for (unsigned j = 0; j < tMatrix.size2(); j++)
            {
                for (unsigned i = 0; i < tVector.size(); i++)
                {
                    if (i <= tMatrix.size1())
                        tMatrix(i, j) = tVector[i];
                    else
                        tMatrix(i - tMatrix.size1(), j + 1) = tVector[j];
                }
            }
            break;
        default:
            for (unsigned i = 0; i < tMatrix.size1(); i++)
            {
                for (unsigned j = 0; j < tVector.size(); j++)
                {
                    if (j <= tMatrix.size2())
                        tMatrix(i, j) = tVector[j];
                    else
                        tMatrix(i + 1, j - tMatrix.size2()) = tVector[j];
                }
            }
            // Following is just to populate it randomly
            std::random_shuffle(tMatrix.begin1(), tMatrix.end1());
            std::random_shuffle(tMatrix.begin2(), tMatrix.end2());
        }
    }
}
A quick main to try it:
int main()
{
    std::vector< int > ciao;
    for (int i = 0; i < 12; i++)
        ciao.push_back(i);
    boost::numeric::ublas::matrix< int > peppa(10, 10);
    PopulateGrid< int >(ciao, peppa); // crashes!
    /* Check failed in file C:\DevTools\boost_1_54_0/boost/numeric/ublas/functional.hpp
     * at line 1371:
     * j < size_j
     * terminate called after throwing an instance of 'boost::numeric::ublas::bad_index'
     * what(): bad index
     */
    return 0;
}
The point is: I am missing some easier operations on the indexes to achieve the same result, and I am messing up the indexes.
I will completely rewrite this answer; it's more of an answer and less of a suggestion now.
Again, the problem you have is here:
for (unsigned j = 0; j < tVector.size(); j++) {
    if (j <= tMatrix.size2())
        tMatrix(i, j) = tVector[j];
    else
        tMatrix(i + 1, j - tMatrix.size2()) = tVector[j];
}
This is the inner loop of the nested for-loop construct you are using. Here, you run through the entire vector. When j exceeds size2, you do not exit the loop; instead you address (i+1) as the first index of tMatrix(i1, i2), and for i2 you subtract size2 from j.
Two problems. First, as soon as j exceeds size2, you will keep rewriting row i+1 of the matrix. And if j grows beyond 2*size2, you will try to write to out-of-bound indices: for example, if the vector contains 20 elements and you have a 7x7 matrix, you will write to vectorIndex - 7 as soon as vectorIndex becomes 8. (Note that your <= tMatrix.size2() is also wrong; it should be only <.) That works all the way until vectorIndex reaches 14 or greater, since 14 - 7 = 7 or greater is out of bounds.
Second, your outer for-loop will still do its iterations and repeat the entire process for every value of i. Eventually you will be at the maximum allowed i, and the inner loop will try to access (i+1). This will also give you an out-of-bounds exception.
I would suggest something like this as a solution:
typename std::vector< T >::const_iterator it = yourVector.begin();
unsigned rowIndex = 0, colmIndex = 0;
for (; it != yourVector.end(); ++it) {
    if (colmIndex >= yourMatrix.size2()) { // wrap to the next row
        rowIndex++;
        colmIndex = 0;
    }
    yourMatrix(rowIndex, colmIndex) = *it;
    colmIndex++;
}
This is untested and fills the matrix row by row. You should be able to figure out the by-column variant yourself!