How to parallelize this array the correct way using OpenMP? - c++

After I tried to parallelize the code with OpenMP, the elements in the array are wrong; the order of the elements is not important. Or would it be more convenient to use a C++ std::vector instead of an array for parallelization? Could you suggest an easy way?
#include <stdio.h>
#include <math.h>

int main()
{
    int n = 100;
    int a[n*(n+1)/2] = {0};
    int count = 0;
    #pragma omp parallel for reduction(+:a,count)
    for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
            double k = sqrt(i * i + j * j);
            if (fabs(round(k) - k) < 1e-10) {
                a[count++] = i;
                a[count++] = j;
                a[count++] = (int) k;
            }
        }
    }
    for (int i = 0; i < count; i++)
        printf("%d %s", a[i], (i+1)%3 ? "" : ", ");
    printf("\ncount: %d", count);
    return 0;
}
Original output:
3 4 5 , 5 12 13 , 6 8 10 , 7 24 25 , 8 15 17 , 9 12 15 , 9 40 41 , 10 24 26 , 11 60 61 , 12 16 20 , 12 35 37 , 13 84 85 , 14 48 50 , 15 20 25 , 15 36 39 , 16 30 34 , 16 63 65 , 18 24 30 , 18 80 82 , 20 21 29 , 20 48 52 , 20 99 101 , 21 28 35 , 21 72 75 , 24 32 40 , 24 45 51 , 24 70 74 , 25 60 65 , 27 36 45 , 28 45 53 , 28 96 100 , 30 40 50 , 30 72 78 , 32 60 68 , 33 44 55 , 33 56 65 , 35 84 91 , 36 48 60 , 36 77 85 , 39 52 65 , 39 80 89 , 40 42 58 , 40 75 85 , 40 96 104 , 42 56 70 , 45 60 75 , 48 55 73 , 48 64 80 , 48 90 102 , 51 68 85 , 54 72 90 , 56 90 106 , 57 76 95 , 60 63 87 , 60 80 100 , 60 91 109 , 63 84 105 , 65 72 97 , 66 88 110 , 69 92 115 , 72 96 120 , 75 100 125 , 80 84 116 ,
count: 189
After using OpenMP (gcc file.c -fopenmp):
411 538 679 , 344 609 711 , 354 533 649 , 218 387 449 , 225 475 534 , 182 283 339 , 81 161 182 , 74 190 204 , 77 138 159 , 79 176 195 , 18 24 30 , 18 80 82 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 ,
count: 189

Your threads are all accessing the shared count.
You would be better off eliminating count and having each loop iteration determine where to write its output based only on the (per-thread) values of i and j.
Alternatively, use a vector to accumulate the results:
#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare \
    reduction(vec_append : std::vector<std::pair<int,int>> : \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100'000;
    std::vector<std::pair<int,int>> result;

    #pragma omp parallel for \
                reduction(vec_append:result) \
                schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            // use a 64-bit type: i*i + j*j overflows int for n this large
            auto const h2 = static_cast<long long>(i) * i
                          + static_cast<long long>(j) * j;  // hypotenuse squared
            long long const h = std::sqrt(h2) + 0.5;        // integer square root
            if (h * h == h2) {
                result.emplace_back(i, j);
            }
        }
    }

    // for (auto const& v: result) {
    //     std::cout << v.first << ' '
    //               << v.second << ' '
    //               << std::hypot(v.first, v.second) << ' ';
    // }
    std::cout << "\ncount: " << result.size() << '\n';
}

As an alternative to using a critical section, this solution uses atomics and could therefore be faster.
The following code might freeze your computer due to memory consumption. Be careful!
#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    int const n = 100;
    // without a better (smaller) upper_bound this is extremely
    // wasteful in terms of memory for big n
    long const upper_bound = 3L * static_cast<long>(n) *
                             (static_cast<long>(n) - 1L) / 2L;
    std::vector<int> a(upper_bound, 0);
    int count = 0;

    #pragma omp parallel for schedule(dynamic) shared(a, count)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                int my_pos;
                #pragma omp atomic capture
                my_pos = count++;
                a[3 * my_pos]     = i;
                a[3 * my_pos + 1] = j;
                a[3 * my_pos + 2] = static_cast<int>(std::round(k));
            }
        }
    }
    count *= 3;

    for (int i = 0; i < count; ++i) {
        std::printf("%d %s", a[i], (i + 1) % 3 ? "" : ", ");
    }
    std::printf("\ncount: %d", count);
    return 0;
}
EDIT:
My answer was initially a reaction to a now-deleted answer that used a critical section in a sub-optimal way. In the following I present another solution which combines a critical section with std::vector::emplace_back() to circumvent the need for upper_bound, similar to Toby Speight's solution. Generally, a reduction clause like in Toby Speight's solution should be preferred over critical sections and atomics, as reductions should scale better for big numbers of threads. In this particular case (relatively few results are written to a) and without a big number of cores to run on, the following code might still be preferable.
#include <cstdio>
#include <cmath>
#include <tuple>
#include <vector>

int main() {
    int const n = 100;
    std::vector<std::tuple<int, int, int>> a{};
    // optional, might reduce the number of reallocations
    a.reserve(2 * n); // 2 * n is an arbitrary choice

    #pragma omp parallel for schedule(dynamic) shared(a)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                #pragma omp critical
                a.emplace_back(i, j, static_cast<int>(std::round(k)));
            }
        }
    }

    long const count = 3L * static_cast<long>(a.size());
    for (unsigned long i = 0UL; i < a.size(); ++i) {
        std::printf("%d %d %d\n",
                    std::get<0>(a[i]), std::get<1>(a[i]), std::get<2>(a[i]));
    }
    std::printf("\ncount: %ld", count);
    return 0;
}

The count variable is an index into a. The reduction(+:a,count) operator sums the arrays element-wise; it is not the concatenation operation that I think you are looking for.
The count variable needs to be protected by a mutex, something like #pragma omp critical, but I am not an OpenMP expert.
Alternatively, create int a[n][n], set all of its elements to -1 (a sentinel value to indicate "invalid"), then assign the result of the sqrt() when it is near enough to a whole number, as sketched below.
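A minimal sketch of that sentinel idea (my illustration, not code from the question): give every (i, j) pair its own slot in a flat (n+1) x (n+1) vector, so the parallel loop needs no shared counter at all, and a serial pass prints the hits in deterministic order.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    int const n = 100;
    // one slot per (i, j) pair; -1 is the "invalid" sentinel
    std::vector<int> h((n + 1) * (n + 1), -1);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                // each iteration writes to its own slot: no race, no counter
                h[i * (n + 1) + j] = static_cast<int>(std::round(k));
            }
        }
    }

    // serial pass: compact and print in deterministic (i, j) order
    int count = 0;
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            if (h[i * (n + 1) + j] != -1) {
                std::printf("%d %d %d , ", i, j, h[i * (n + 1) + j]);
                count += 3;
            }
        }
    }
    std::printf("\ncount: %d\n", count);
    return 0;
}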


counting sort algorithm not working in hackerrank [closed]

This is my code:
vector<int> countingSort(vector<int> arr) {
    int max = arr[0];
    for (unsigned int i = 1; i < arr.size(); i++) {
        if (arr[i] > arr[0]) {
            max = arr[i];
        }
    }
    max++;
    vector<int> frequency_arr;
    for (int i = 0; i < max; i++) {
        frequency_arr.push_back(0);
    }
    for (unsigned int i = 0; i < arr.size(); i++) {
        int elem = arr[i];
        frequency_arr[elem]++;
    }
    return frequency_arr;
}
Explanation: the "countingSort" function sorts elements using the "counting sort" algorithm. It takes a vector as input and returns the "frequency array" as output.
input array
63 25 73 1 98 73 56 84 86 57 16 83 8 25 81 56 9 53 98 67 99 12 83 89 80 91 39 86 76 85 74 39 25 90 59 10 94 32 44 3 89 30 27 79 46 96 27 32 18 21 92 69 81 40 40 34 68 78 24 87 42 69 23 41 78 22 6 90 99 89 50 30 20 1 43 3 70 95 33 46 44 9 69 48 33 60 65 16 82 67 61 32 21 79 75 75 13 87 70 33
my output
0 2 0 2 0 0 1 0 1 2 1 0 1 1 0 0 2 0 1 0 1 2 1 1 1 3 0 2 0 0 2 0 3 3 1 0 0 0 0 2 2 1 1 1 2 0 2 0 1 0 1 0 0 1 0 0 2 1 0 1 1 1 0 1 0 1 0 2 1 3 2
expected output
0 2 0 2 0 0 1 0 1 2 1 0 1 1 0 0 2 0 1 0 1 2 1 1 1 3 0 2 0 0 2 0 3 3 1 0 0 0 0 2 2 1 1 1 2 0 2 0 1 0 1 0 0 1 0 0 2 1 0 1 1 1 0 1 0 1 0 2 1 3 2 0 0 2 1 2 1 0 2 2 1 2 1 2 1 1 2 2 0 3 2 1 1 0 1 1 1 0 2 2
Constraints
length of input array -> [100, 10^6]
range of elements in array -> [1, 100]
I was solving a counting sort problem on HackerRank, but when I run the code, the output does not show all elements of "frequency_arr"; it only shows the bold numbers in the expected output and omits the remaining elements.
Do you know what I am doing wrong here and what the potential fixes are?
It seems you have a typo in the first for loop
int max = arr[0];
for (unsigned int i = 1; i < arr.size(); i++) {
    if (arr[i] > arr[0]) {
        ^^^^^^^^^^^^^^^^^
        max = arr[i];
    }
}
You need to write the if statement within the for loop the following way:
if (arr[i] > max) {
Moreover, in general the variable i should have the type size_t instead of unsigned int:
for (size_t i = 1; i < arr.size(); i++) {
Also pay attention to the standard algorithm std::max_element, declared in the header <algorithm>.
And instead of this code snippet
vector<int> frequency_arr;
for (int i = 0; i < max; i++) {
    frequency_arr.push_back(0);
}
you could just write
vector<int> frequency_arr( max );
In this case all max elements of the vector will be zero-initialized.
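Putting these pieces together, a minimal sketch of the corrected function (my assembly of the fixes above, assuming a non-empty arr as in the original):

#include <algorithm>
#include <vector>

std::vector<int> countingSort(std::vector<int> arr) {
    // compare against the running maximum; std::max_element does this for us
    int const max = *std::max_element(arr.begin(), arr.end()) + 1;
    std::vector<int> frequency_arr(max); // all max counters zero-initialized
    for (int elem : arr) {
        frequency_arr[elem]++;
    }
    return frequency_arr;
}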

Nested for_each loops causing unexpected increase in size of a vector

I have the following piece of code that I am using to convert a 2D array of doubles to a 1D vector. The arrays are allocated using std::vector and a nested for_each loop is used to transfer the contents of the 2D array to the 1D array.
#include <iostream>
#include <algorithm>
#include <vector>
#include <stdexcept>

#define UNUSED(expr) (void)(expr)

using usll = __uint64_t;

void print1d(std::vector<double> vv);
void matToVect1d(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d);
void matToVect1dEx(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d);

int main(int argc, char* argv[])
{
    UNUSED(argc);
    UNUSED(argv);
    std::cout << std::endl;
    const usll DIM0 {10};
    const usll DIM1 {8};
    std::vector<std::vector<double>> data2d(DIM0, std::vector<double>(DIM1));
    std::vector<double> data1d(DIM0 * DIM1);
    double temp = 0.0;
    for (usll i{}; i < DIM0; ++i)
    {
        for (usll j{}; j < DIM1; ++j)
        {
            data2d[i][j] = temp++;
        }
    }
    try
    {
        matToVect1d(data2d, data1d);
        std::cout << "2D array data2d as a 1D vector is:" << std::endl;
        print1d(data1d);
        std::cout << std::endl;
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
    }
    std::cout << "Press enter to continue";
    std::cin.get();
    return 0;
}

void print1d(std::vector<double> vv)
{
    for (size_t i{}; i < vv.size(); ++i)
    {
        std::cout << vv[i] << " ";
        if ((i+1) % 10 == 0)
        {
            std::cout << std::endl;
        }
    }
    std::cout << std::endl;
}

void matToVect1d(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d)
{
    if (v1d.size() != v2d.size()*v2d[0].size())
    {
        throw std::runtime_error("An exception was caught. The sizes of the input arrays must match.");
    }
    for_each(v2d.cbegin(), v2d.cend(), [&v1d](const std::vector<double> vec)
    {
        for_each(vec.cbegin(), vec.cend(), [&v1d](const double& dValue)
        {
            v1d.emplace_back(dValue);
        });
    });
}

void matToVect1dEx(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d)
{
    usll index{};
    if (v1d.size() != v2d.size()*v2d[0].size())
    {
        throw std::runtime_error("An exception was caught. The sizes of the input arrays must match.");
    }
    for (usll i = 0; i < v2d.size(); ++i)
    {
        for (usll j = 0; j < v2d[0].size(); ++j)
        {
            index = j + i*v2d[0].size();
            v1d[index] = v2d[i][j];
        }
    }
}
Each time I run the code, the output is:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
Which is twice as large as the original 1D array. What? Where did the zeros come from? What caused the vector size to grow from 80 to 160? In contrast, when I change the for_each loop to regular for loops, I get the correct output:
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
I suspect the anomaly is due to the use of the for_each algorithm, yet Meyers (2018), in his book Effective C++ Digital Collection: 140 Ways to Improve Your Programming, says "the algorithm call is preferable." He says, and I quote: "From an efficiency perspective, algorithms can beat explicit loops in three ways, two major, one minor. The minor way involves the elimination of redundant computations."
In my actual use case, the method matToVect1d() is called a few thousand times and the total number of elements per array is 240 x 200.
My final question is whether it makes sense to implement the loop using the for_each algorithm? Answers will be highly appreciated.
It is because of the line
std::vector<double> data1d(DIM0 * DIM1);
where you create data1d as a vector of DIM0 x DIM1 doubles, all initialized with 0.0.
Therefore, later in the std::for_each case, when you std::vector::emplace_back, you insert the elements after these. Hence, you see DIM0 x DIM1 0s and only then the elements that you inserted.
You need std::vector::reserve instead for the desired behaviour:
std::vector<double> data1d;
data1d.reserve(DIM0 * DIM1); // to avoid unwanted reallocations
I suspect the anomaly is due to the use of the for_each algorithm [...]
No, it is for the reason (i.e. the mistake) mentioned above.
My final question is whether it makes sense to implement the loop using the std::for_each algorithm?
There is nothing wrong with the std::for_each approach, other than that you copy the std::vector<double> in the first lambda's parameter list, and it is (maybe) less readable:
std::for_each(v2d.cbegin(), v2d.cend(), [&v1d](std::vector<double> const& vec)
//                                             ^^^^^^ --> prefer const-ref
{
    for_each(vec.cbegin(), vec.cend(), [&v1d](const double dValue)
    {
        v1d.emplace_back(dValue);
    });
});
compared to the range-based for-loop approach shown below:
std::vector<double> data1d;
data1d.reserve(DIM0 * DIM1); // to avoid unwanted reallocations
// ... later in the function
for (const std::vector<double>& vector : v2d)
    for (const double element : vector)
        v1d.emplace_back(element);
std::vector<double> data1d(DIM0 * DIM1); <- this line of code doesn't "reserve". It fills the vector with DIM0 * DIM1 value-initialized (zero) doubles.
This is the constructor you are using:
explicit vector( size_type count );
Check this for more reference:
https://en.cppreference.com/w/cpp/container/vector/vector
Change it to this:
std::vector<double> data1d; // note: "data1d()" would declare a function
data1d.reserve(DIM0 * DIM1);

Why is this Pascal Triangle implementation giving me trailing zeroes?

I tried to implement it recursively (iteratively seemed less elegant, but please do correct me if I am wrong). But the output seems to be giving me trailing zeroes, and the first few rows are unexpected. I have checked the base cases and the recursive cases, but they seem to be all right. The problem is definitely within the function.
#include <iostream>

unsigned long long p[1005][1005];

void pascal(int n)
{
    if (n == 1)
    {
        p[0][0] = 1;
        return;
    }
    else if (n == 2)
    {
        p[0][0] = 1; p[0][1] = 1;
        return;
    }
    p[n][0] = 1;
    p[n][n-1] = 1;
    pascal(n-1);
    for (int i = 1; i < n; ++i)
    {
        p[n][i] = p[n-1][i-1] + p[n-1][i];
    }
    return;
}

int main()
{
    int n;
    std::cin >> n;
    pascal(n);
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < i+1; ++j)
        {
            std::cout << p[i][j] << " ";
        }
        std::cout << "\n";
    }
}
Output:
(I enter)15
1
0 0
0 0 0
1 0 0 0
1 1 0 0 0
1 2 1 0 0 0
1 3 3 1 0 0 0
1 4 6 4 1 0 0 0
1 5 10 10 5 1 0 0 0
1 6 15 20 15 6 1 0 0 0
1 7 21 35 35 21 7 1 0 0 0
1 8 28 56 70 56 28 8 1 0 0 0
1 9 36 84 126 126 84 36 9 1 0 0 0
1 10 45 120 210 252 210 120 45 10 1 0 0 0
1 11 55 165 330 462 462 330 165 55 11 1 0 0 0
The base cases n = 1 and n = 2 are too aggressive: n = 1 is never reached for a normal input like 10, because n = 2 breaks the recursion prematurely, leaving untouched zeroes in the array. These values of n should be covered automatically by the recursive case. Our real base case, where we do nothing, is n < 0.
void pascal(int n)
{
    if (n < 0) return;
    p[n][0] = 1;
    pascal(n - 1);
    for (int i = 1; i <= n; ++i)
    {
        p[n][i] = p[n-1][i-1] + p[n-1][i];
    }
}
Output for n = 15:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
1 10 45 120 210 252 210 120 45 10 1
1 11 55 165 330 462 462 330 165 55 11 1
1 12 66 220 495 792 924 792 495 220 66 12 1
1 13 78 286 715 1287 1716 1716 1287 715 286 78 13 1
Having said this, it's poor practice to hard code the size of the array. Consider using vectors and passing parameters to the functions so that they don't mutate global state.
We can also write it iteratively in (to me) a more intuitive way:
void pascal(int n)
{
    for (int i = 0; i < n; ++i)
    {
        p[i][0] = 1;
        for (int j = 1; j <= i; ++j)
        {
            p[i][j] = p[i-1][j-1] + p[i-1][j];
        }
    }
}
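As suggested above, here is a short sketch (my adaptation, not part of the original answer) of the iterative version rewritten to return a vector instead of mutating global state:

#include <vector>

std::vector<std::vector<unsigned long long>> pascal(int n)
{
    std::vector<std::vector<unsigned long long>> p(n);
    for (int i = 0; i < n; ++i)
    {
        p[i].assign(i + 1, 1); // row i has i + 1 entries; both ends are 1
        for (int j = 1; j < i; ++j)
        {
            p[i][j] = p[i-1][j-1] + p[i-1][j];
        }
    }
    return p;
}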

OpenMP integral image slower than sequential

I have implemented a Summed Area Table (or Integral Image) in C++ using OpenMP.
The problem is that the sequential code is always faster than the parallel code, even when changing the number of threads and image sizes.
For example, I tried images from (100x100) to (10000x10000) and threads from 1 to 64, but none of the combinations is ever faster.
I also tried this code on different machines, such as:
Mac OSX 1.4 GHz Intel Core i5 dual core
Mac OSX 2.3 GHz Intel Core i7 quad core
Ubuntu 16.04 Intel Xeon E5-2620 2.4 GHz 12 cores
The time has been measured with the OpenMP function omp_get_wtime().
For compiling I use: g++ -fopenmp -Wall main.cpp.
Here is the parallel code:
void transpose(unsigned long *src, unsigned long *dst, const int N, const int M) {
    #pragma omp parallel for
    for (int n = 0; n < N*M; n++) {
        int i = n / N;
        int j = n % N;
        dst[n] = src[M*j + i];
    }
}

unsigned long * integralImageMP(uint8_t *x, int n, int m) {
    unsigned long * out = new unsigned long[n*m];
    unsigned long * rows = new unsigned long[n*m];

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        rows[i*m] = x[i*m];
        for (int j = 1; j < m; ++j)
        {
            rows[i*m + j] = x[i*m + j] + rows[i*m + j - 1];
        }
    }

    transpose(rows, out, n, m);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        rows[i*m] = out[i*m];
        for (int j = 1; j < m; ++j)
        {
            rows[i*m + j] = out[i*m + j] + rows[i*m + j - 1];
        }
    }

    transpose(rows, out, m, n);

    delete [] rows;
    return out;
}
Here is the sequential code:
unsigned long * integralImage(uint8_t *x, int n, int m) {
    unsigned long * out = new unsigned long[n*m];
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < m; ++j)
        {
            unsigned long val = x[i*m + j];
            if (i >= 1)
            {
                val += out[(i-1)*m + j];
                if (j >= 1)
                {
                    val += out[i*m + j - 1] - out[(i-1)*m + j - 1];
                }
            } else {
                if (j >= 1)
                {
                    val += out[i*m + j - 1];
                }
            }
            out[i*m + j] = val;
        }
    }
    return out;
}
I also tried it without the transpose, but it was even slower, probably because of the cache accesses.
An example of calling code:
int main(int argc, char **argv) {
    uint8_t* image = //read image from file (gray scale)
    int height = //height of the image
    int width = //width of the image

    double start_omp = omp_get_wtime();
    unsigned long* integral_image_parallel = integralImageMP(image, height, width); //parallel
    double end_omp = omp_get_wtime();
    double time_tot = end_omp - start_omp;
    std::cout << time_tot << std::endl;

    start_omp = omp_get_wtime();
    unsigned long* integral_image_serial = integralImage(image, height, width); //sequential
    end_omp = omp_get_wtime();
    time_tot = end_omp - start_omp;
    std::cout << time_tot << std::endl;

    return 0;
}
Each thread is working on a block of rows (illustration omitted), where ColumnSum is done by transposing the matrix and repeating RowSum.
Let me first say that the results are a bit surprising to me; I would guess that the problem lies in the non-local memory access required by the transpose algorithm.
You can mitigate it anyway by turning your sequential algorithm into a parallel one with a two-pass approach. The first pass calculates the 2D integral in T thread blocks, N rows apart, and the second pass compensates for the fact that each block did not start from the accumulated result of the previous row but from zero.
An example with Matlab shows the principle in 2D.
f=fix(rand(12,8)*8) % A random matrix with 12 rows, 8 columns
5 6 1 4 7 5 4 4
4 6 0 7 1 3 2 0
7 0 2 3 0 1 6 3
5 3 1 7 4 3 7 2
6 4 3 2 7 3 5 1
3 3 2 5 5 0 2 1
3 5 7 5 1 4 4 3
6 5 7 4 2 1 0 0
0 2 0 5 3 3 7 4
1 3 5 5 7 4 7 3
1 0 2 1 1 2 6 5
3 7 3 1 6 2 2 5
ff=cumsum(cumsum(f')') % The Summed Area Table
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113
27 46 53 76 95 110 134 144
30 52 61 89 113 128 154 165
33 60 76 109 134 153 183 197
39 71 94 131 158 178 208 222
39 73 96 138 168 191 228 246
40 77 105 152 189 216 260 281
41 78 108 156 194 223 273 299
44 88 121 170 214 245 297 328
fx=[cumsum(cumsum(f(1:4,:)')');   % The original table summed in
    cumsum(cumsum(f(5:8,:)')');   % three parts -- 4 rows per each
    cumsum(cumsum(f(9:12,:)')')]  % "thread"
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113 %% Notice this row #4
6 10 13 15 22 25 30 31
9 16 21 28 40 43 50 52
12 24 36 48 61 68 79 84
18 35 54 70 85 93 104 109 %% Notice this row #8
0 2 2 7 10 13 20 24
1 6 11 21 31 38 52 59
2 7 14 25 36 45 65 77
5 17 27 39 56 67 89 106
fx(4,:) + fx(8,:) %% this is the SUM of row #4 and row #8
39 71 94 131 158 178 208 222
%% and finally -- what is the difference of the piecewise
%% calculated result and the real result?
ff-fx
0 0 0 0 0 0 0 0 %% look !! the first block
0 0 0 0 0 0 0 0 %% is already correct
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
21 36 40 61 73 85 104 113 %% All these rows in this
21 36 40 61 73 85 104 113 %% block are short by
21 36 40 61 73 85 104 113 %% the row #4 above
21 36 40 61 73 85 104 113 %%
39 71 94 131 158 178 208 222 %% and all these rows
39 71 94 131 158 178 208 222 %% in this block are short
39 71 94 131 158 178 208 222 %% by the SUM of the rows
39 71 94 131 158 178 208 222 %% #4 and #8 above
Fortunately one can start integrating block #2, i.e. rows 2N..3N-1, before block #1 has been compensated -- one just has to calculate the offset, which is a relatively small sequential task.
acc_for_block_2 = row[2*N-1] + row[N-1];
acc_for_block_3 = acc_for_block_2 + row[3*N-1];
...
acc_for_block_T-1 = acc_for_block_(T-2) + row[N*(T-1)-1];
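For reference, here is a C++/OpenMP sketch of this two-pass scheme, translated from the Matlab principle above; the function name, the block layout, and the offset buffer are my own choices, assuming the same inputs as integralImageMP in the question:

#include <algorithm>
#include <cstdint>
#include <vector>
#include <omp.h>

unsigned long * integralImageTwoPass(uint8_t *x, int n, int m) {
    unsigned long * out = new unsigned long[static_cast<size_t>(n) * m];
    int const T = omp_get_max_threads();
    int const rows_per_block = (n + T - 1) / T;

    // Pass 1: each block of rows computes its own summed-area table,
    // starting from zero instead of from the row above the block.
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < T; ++b) {
        int const lo = b * rows_per_block;
        int const hi = std::min(n, lo + rows_per_block);
        for (int i = lo; i < hi; ++i) {
            unsigned long row_acc = 0;
            for (int j = 0; j < m; ++j) {
                row_acc += x[i*m + j];
                out[i*m + j] = row_acc + (i > lo ? out[(i-1)*m + j] : 0UL);
            }
        }
    }

    // Small sequential step: the offset row of block b is the accumulated
    // sum of the last rows of all previous blocks (cf. rows #4 and #8 above).
    std::vector<unsigned long> acc(static_cast<size_t>(T) * m, 0UL);
    for (int b = 1; b < T; ++b) {
        int const last = std::min(n, b * rows_per_block) - 1;
        for (int j = 0; j < m; ++j)
            acc[b*m + j] = acc[(b-1)*m + j] + out[last*m + j];
    }

    // Pass 2: compensate every row of block b by its offset row.
    #pragma omp parallel for schedule(static)
    for (int b = 1; b < T; ++b) {
        int const lo = b * rows_per_block;
        int const hi = std::min(n, lo + rows_per_block);
        for (int i = lo; i < hi; ++i)
            for (int j = 0; j < m; ++j)
                out[i*m + j] += acc[b*m + j];
    }
    return out;
}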

Floyd's Algorithm (Shortest Paths) Issue - C++

Basically, I'm tasked with implementing Floyd's algorithm to find the shortest paths in a matrix. A value, in my case arg, is read in and the matrix becomes size arg*arg. The next string of values is applied to the matrix in the order received. Lastly, a -1 represents infinity.
To be quite honest, I've no idea where my problem is coming from. When run through the tests, the first couple pass, but the rest fail. I'll only post the first two failures along with the passes. I'll just post the relevant segment of code.
int arg, var, i, j;
cin >> arg;
int arr[arg][arg];
for (i = 0; i < arg; i++)
{
    for (j = 0; j < arg; j++)
    {
        cin >> var;
        arr[i][j] = var;
    }
}
for (int pivot = 0; pivot < arg; pivot++)
{
    for (i = 0; i < arg; i++)
    {
        for (j = 0; j < arg; j++)
        {
            if ((arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1))
            {
                arr[i][j] = (arr[i][pivot] + arr[pivot][j]);
                arr[j][i] = (arr[i][pivot] + arr[pivot][j]);
            }
        }
    }
}
And here are the failures that I'm receiving. The rest of them get longer and longer, up to a 20*20 matrix, so I'll spare you from that:
floyd>
* * * Program successfully started and correct prompt received.
floyd 2 0 14 14 0
0 14 14 0
floyd> PASS : Input "floyd 2 0 14 14 0" produced output "0 14 14 0".
floyd 3 0 85 85 85 0 26 85 26 0
0 85 85 85 0 26 85 26 0
floyd> PASS : Input "floyd 3 0 85 85 85 0 26 85 26 0" produced output "0 85 85 85 0 26 85 26 0".
floyd 3 0 34 7 34 0 -1 7 -1 0
0 34 7 34 0 -1 7 -1 0
floyd> FAIL : Input "floyd 3 0 34 7 34 0 -1 7 -1 0" did not produce output "0 34 7 34 0 41 7 41 0".
floyd 4 0 -1 27 98 -1 0 41 74 27 41 0 41 98 74 41 0
0 -1 27 68 -1 0 41 74 27 41 0 41 68 74 41 0
floyd> FAIL : Input "floyd 4 0 -1 27 98 -1 0 41 74 27 41 0 41 98 74 41 0" did not produce output "0 68 27 68 68 0 41 74 27 41 0 41 68 74 41 0".
Imagine the situation arr[i][j] == -1: obviously (arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1) fails, but it shouldn't if arr[i][pivot] and arr[pivot][j] are not -1.
Since you are using -1 instead of infinity, you have to have something like if ((arr[i][j] == -1 || arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1)), i.e. you check two things: the first one is your original condition, and the second one is the situation when arr[i][j] is infinity and the path through the pivot exists, as in this case any valid path is less than infinity.
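A self-contained sketch of the fixed relaxation (my illustration, using std::vector instead of the VLA from the question; the function name is mine):

#include <vector>

// Floyd-Warshall on a matrix where -1 stands for infinity.
void floyd(std::vector<std::vector<int>>& arr) {
    int const n = static_cast<int>(arr.size());
    for (int pivot = 0; pivot < n; ++pivot) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                if (arr[i][pivot] == -1 || arr[pivot][j] == -1)
                    continue; // no path through the pivot
                int const via = arr[i][pivot] + arr[pivot][j];
                // relax also when the current distance is still "infinity"
                if (arr[i][j] == -1 || arr[i][j] > via)
                    arr[i][j] = via;
            }
        }
    }
}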