Basically, I'm tasked with implementing Floyd's algorithm to find the shortest paths in a matrix. A value, in my case arg, is read in and the matrix becomes size arg*arg. The next string of values is applied to the matrix in the order received. Lastly, a -1 represents infinity.
To be quite honest, I've no idea where my problem is coming in. When run through the tests, the first couple pass, but the rest fail. I'll only post the first two failures along with the passes, and just the relevant segment of code.
int arg, var, i, j;
cin >> arg;
int arr[arg][arg];
for (i = 0; i < arg; i++)
{
    for (j = 0; j < arg; j++)
    {
        cin >> var;
        arr[i][j] = var;
    }
}
for (int pivot = 0; pivot < arg; pivot++)
{
    for (i = 0; i < arg; i++)
    {
        for (j = 0; j < arg; j++)
        {
            if ((arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1))
            {
                arr[i][j] = (arr[i][pivot] + arr[pivot][j]);
                arr[j][i] = (arr[i][pivot] + arr[pivot][j]);
            }
        }
    }
}
And here are the failures that I'm receiving. The rest of them get longer and longer, up to a 20*20 matrix, so I'll spare you from that:
floyd>
* * * Program successfully started and correct prompt received.
floyd 2 0 14 14 0
0 14 14 0
floyd> PASS : Input "floyd 2 0 14 14 0" produced output "0 14 14 0".
floyd 3 0 85 85 85 0 26 85 26 0
0 85 85 85 0 26 85 26 0
floyd> PASS : Input "floyd 3 0 85 85 85 0 26 85 26 0" produced output "0 85 85 85 0 26 85 26 0".
floyd 3 0 34 7 34 0 -1 7 -1 0
0 34 7 34 0 -1 7 -1 0
floyd> FAIL : Input "floyd 3 0 34 7 34 0 -1 7 -1 0" did not produce output "0 34 7 34 0 41 7 41 0".
floyd 4 0 -1 27 98 -1 0 41 74 27 41 0 41 98 74 41 0
0 -1 27 68 -1 0 41 74 27 41 0 41 68 74 41 0
floyd> FAIL : Input "floyd 4 0 -1 27 98 -1 0 41 74 27 41 0 41 98 74 41 0" did not produce output "0 68 27 68 68 0 41 74 27 41 0 41 68 74 41 0".
Imagine the situation arr[i][j] == -1: obviously (arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1) fails, but it shouldn't as long as arr[i][pivot] and arr[pivot][j] are not -1.
Since you are using -1 instead of infinity, you need something like if ((arr[i][j] == -1 || arr[i][j] > (arr[i][pivot] + arr[pivot][j])) && ((arr[i][pivot] != -1) && arr[pivot][j] != -1)), i.e. you check two things: the first is your original condition, and the second is the situation where arr[i][j] is infinity and a path through the pivot exists, since in that case any valid path is less than infinity.
This is my code:
vector<int> countingSort(vector<int> arr) {
    int max = arr[0];
    for(unsigned int i = 1; i < arr.size(); i++) {
        if(arr[i] > arr[0]) {
            max = arr[i];
        }
    }
    max++;
    vector<int> frequency_arr;
    for(int i = 0; i < max; i++) {
        frequency_arr.push_back(0);
    }
    for(unsigned int i = 0; i < arr.size(); i++) {
        int elem = arr[i];
        frequency_arr[elem]++;
    }
    return frequency_arr;
}
Explanation: the "countingSort" function sorts elements using the counting sort algorithm. It takes a vector as input and returns the "frequency array" as output.
input array
63 25 73 1 98 73 56 84 86 57 16 83 8 25 81 56 9 53 98 67 99 12 83 89 80 91 39 86 76 85 74 39 25 90 59 10 94 32 44 3 89 30 27 79 46 96 27 32 18 21 92 69 81 40 40 34 68 78 24 87 42 69 23 41 78 22 6 90 99 89 50 30 20 1 43 3 70 95 33 46 44 9 69 48 33 60 65 16 82 67 61 32 21 79 75 75 13 87 70 33
my output
0 2 0 2 0 0 1 0 1 2 1 0 1 1 0 0 2 0 1 0 1 2 1 1 1 3 0 2 0 0 2 0 3 3 1 0 0 0 0 2 2 1 1 1 2 0 2 0 1 0 1 0 0 1 0 0 2 1 0 1 1 1 0 1 0 1 0 2 1 3 2
expected output
0 2 0 2 0 0 1 0 1 2 1 0 1 1 0 0 2 0 1 0 1 2 1 1 1 3 0 2 0 0 2 0 3 3 1 0 0 0 0 2 2 1 1 1 2 0 2 0 1 0 1 0 0 1 0 0 2 1 0 1 1 1 0 1 0 1 0 2 1 3 2 0 0 2 1 2 1 0 2 2 1 2 1 2 1 1 2 2 0 3 2 1 1 0 1 1 1 0 2 2
Constraints
length of input array -> [100, 10^6]
range of elements in array -> [1, 100]
I was solving a counting sort problem on HackerRank, but when I run the code, the output does not show all elements of "frequency_arr".
It only shows the bold numbers in the expected output and omits the remaining elements.
Do you know what I am doing wrong here, and what are the potential fixes?
It seems you have a typo in the first for loop
int max = arr[0];
for(unsigned int i = 1; i < arr.size(); i++) {
    if(arr[i] > arr[0]) {
       ^^^^^^^^^^^^^^^
        max = arr[i];
    }
}
You need to write the if statement within the for loop the following way:
if(arr[i] > max) {
Moreover in general the variable i should have the type size_t instead of unsigned int.
for( size_t i = 1; i < arr.size(); i++) {
Note that there is the standard algorithm std::max_element, declared in the header <algorithm>, that does this for you.
And instead of this code snippet
vector<int> frequency_arr;
for(int i = 0; i < max; i++) {
    frequency_arr.push_back(0);
}
you could just write
vector<int> frequency_arr( max );
In this case all max elements of the vector will be zero-initialized.
I have the following piece of code that I am using to convert a 2D array of doubles to a 1D vector. The arrays are allocated using std::vector and a nested for_each loop is used to transfer the contents of the 2D array to the 1D array.
#include <iostream>
#include <algorithm>
#include <vector>
#include <stdexcept>

#define UNUSED(expr) (void)(expr)

using usll = __uint64_t;

void print1d(std::vector<double> vv);
void matToVect1d(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d);
void matToVect1dEx(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d);

int main(int argc, char* argv[])
{
    UNUSED(argc);
    UNUSED(argv);
    std::cout << std::endl;
    const usll DIM0 {10};
    const usll DIM1 {8};
    std::vector<std::vector<double>> data2d(DIM0, std::vector<double>(DIM1));
    std::vector<double> data1d(DIM0 * DIM1);
    double temp = 0.0;
    for (usll i{}; i<DIM0; ++i)
    {
        for (usll j{}; j<DIM1; ++j)
        {
            data2d[i][j] = temp++;
        }
    }
    try
    {
        matToVect1d(data2d, data1d);
        std::cout << "2D array data2d as a 1D vector is:" << std::endl;
        print1d(data1d);
        std::cout << std::endl;
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
    }
    std::cout << "Press enter to continue";
    std::cin.get();
    return 0;
}

void print1d(std::vector<double> vv)
{
    for (size_t i{}; i<vv.size(); ++i)
    {
        std::cout << vv[i] << " ";
        if ((i+1)%10 == 0)
        {
            std::cout << std::endl;
        }
    }
    std::cout << std::endl;
}

void matToVect1d(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d)
{
    if (v1d.size() != v2d.size()*v2d[0].size())
    {
        throw std::runtime_error("An exception was caught. The sizes of the input arrays must match.");
    }
    for_each(v2d.cbegin(), v2d.cend(), [&v1d](const std::vector<double> vec)
    {
        for_each(vec.cbegin(), vec.cend(), [&v1d](const double& dValue)
        {
            v1d.emplace_back(dValue);
        });
    });
}

void matToVect1dEx(const std::vector<std::vector<double>>& v2d, std::vector<double>& v1d)
{
    usll index{};
    if (v1d.size() != v2d.size()*v2d[0].size())
    {
        throw std::runtime_error("An exception was caught. The sizes of the input arrays must match.");
    }
    for (usll i=0; i<v2d.size(); ++i)
    {
        for (usll j=0; j<v2d[0].size(); ++j)
        {
            index = j + i*v2d[0].size();
            v1d[index] = v2d[i][j];
        }
    }
}
Each time I run the code, the output is:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
Which is twice as large as the original 1D array. What? Where did the zeros come from? What caused the vector size to grow from 80 to 160? In contrast, when I change the for_each loop to regular for loops, I get the correct output:
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
I suspect the anomaly is due to the use of the for_each algorithm, yet Meyers (2018), in his book Effective C++ Digital Collection: 140 Ways to Improve Your Programming, says "the algorithm call is preferable." He says, and I quote: "From an efficiency perspective, algorithms can beat explicit loops in three ways, two major, one minor. The minor way involves the elimination of redundant computations."
In my actual use case, the method matToVect1d() is called a few thousand times and the total number of elements per array is 240 x 200.
My final question is whether it makes sense to implement the loop using the for_each algorithm. Answers will be highly appreciated.
It is because of the line
std::vector<double> data1d(DIM0 * DIM1);
where you create a vector of doubles and initialize it with DIM0 x DIM1 0.0s.
Therefore, later in the std::for_each case, when you std::vector::emplace_back, you insert the elements after these. Hence, you see DIM0 x DIM1 0s first and then the elements that you inserted.
You need std::vector::reserve instead for the desired behaviour:
std::vector<double> data1d;
data1d.reserve(DIM0 * DIM1); // to avoid unwanted reallocations
I suspect the anomaly is due to the use of the for_each algorithm[...]
No, it is because of the reason (i.e. the mistake) mentioned above.
My final question is whether it makes sense to implement the loop
using the std::for_each algorithm?
There is nothing wrong with the std::for_each approach, other than that you copy the std::vector<double> in the first lambda's parameter list, and it is (maybe) less readable
std::for_each(v2d.cbegin(), v2d.cend(), [&v1d](std::vector<double> const& vec)
//                                                                 ^^^^^^ --> prefer const-ref
{
    for_each(vec.cbegin(), vec.cend(), [&v1d](const double dValue)
    {
        v1d.emplace_back(dValue);
    });
});
than the range-for-loop approach shown below
std::vector<double> data1d;
data1d.reserve(DIM0 * DIM1); // to avoid unwanted reallocations
// ... later in the function
for (const std::vector<double>& vector : v2d)
    for (const double element : vector)
        v1d.emplace_back(element);
std::vector<double> data1d(DIM0 * DIM1); <- this line of code doesn't "reserve". It fills the vector with DIM0 * DIM1 value-initialised (zero) doubles.
This is the constructor you are using:
explicit vector( size_type count );
Check this for more reference:
https://en.cppreference.com/w/cpp/container/vector/vector
Change it to this:
std::vector<double> data1d;
data1d.reserve(DIM0 * DIM1);
I tried to implement it (printing Pascal's triangle) recursively (iteratively seemed less elegant, but please correct me if I am wrong). But the output seems to give me trailing zeroes, and the first few rows are unexpected. I have checked the base cases and the recursive cases, but they seem to be all right. The problem is definitely within the function.
#include <iostream>

unsigned long long p[1005][1005];

void pascal(int n)
{
    if (n == 1)
    {
        p[0][0] = 1;
        return;
    }
    else if (n == 2)
    {
        p[0][0] = 1; p[0][1] = 1;
        return;
    }
    p[n][0] = 1;
    p[n][n-1] = 1;
    pascal(n-1);
    for (int i = 1; i < n; ++i)
    {
        p[n][i] = p[n-1][i-1] + p[n-1][i];
    }
    return;
}

int main()
{
    int n;
    std::cin >> n;
    pascal(n);
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < i+1; ++j)
        {
            std::cout << p[i][j] << " ";
        }
        std::cout << "\n";
    }
}
Output:
(input: 15)
1
0 0
0 0 0
1 0 0 0
1 1 0 0 0
1 2 1 0 0 0
1 3 3 1 0 0 0
1 4 6 4 1 0 0 0
1 5 10 10 5 1 0 0 0
1 6 15 20 15 6 1 0 0 0
1 7 21 35 35 21 7 1 0 0 0
1 8 28 56 70 56 28 8 1 0 0 0
1 9 36 84 126 126 84 36 9 1 0 0 0
1 10 45 120 210 252 210 120 45 10 1 0 0 0
1 11 55 165 330 462 462 330 165 55 11 1 0 0 0
The base cases n = 1 and n = 2 are too aggressive (the n = 1 case is never reached for a normal input like 10, because n = 2 breaks the recursion prematurely, leaving untouched zeroes in the array). These values of n should be covered automatically by the recursive case. Our real base case, where we do nothing, is when n < 0.
void pascal(int n)
{
    if (n < 0) return;
    p[n][0] = 1;
    pascal(n - 1);
    for (int i = 1; i <= n; ++i)
    {
        p[n][i] = p[n-1][i-1] + p[n-1][i];
    }
}
Output for n = 15:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
1 10 45 120 210 252 210 120 45 10 1
1 11 55 165 330 462 462 330 165 55 11 1
1 12 66 220 495 792 924 792 495 220 66 12 1
1 13 78 286 715 1287 1716 1716 1287 715 286 78 13 1
Having said this, it's poor practice to hard code the size of the array. Consider using vectors and passing parameters to the functions so that they don't mutate global state.
We can also write it iteratively in (to me) a more intuitive way:
void pascal(int n)
{
    for (int i = 0; i < n; ++i)
    {
        p[i][0] = 1;
        for (int j = 1; j <= i; ++j)
        {
            p[i][j] = p[i-1][j-1] + p[i-1][j];
        }
    }
}
I have implemented a Summed Area Table (or integral image) in C++ using OpenMP.
The problem is that the sequential code is always faster than the parallel code, even when changing the number of threads and image sizes.
For example, I tried images from 100x100 to 10000x10000 and thread counts from 1 to 64, but no combination is ever faster.
I also tried this code in different machines like:
Mac OS X, 1.4 GHz Intel Core i5 (dual core)
Mac OS X, 2.3 GHz Intel Core i7 (quad core)
Ubuntu 16.04, 2.4 GHz Intel Xeon E5-2620 (12 cores)
The time has been measured with OpenMP function: omp_get_wtime().
For compiling I use: g++ -fopenmp -Wall main.cpp.
Here is the parallel code:
void transpose(unsigned long *src, unsigned long *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

unsigned long * integralImageMP(uint8_t*x, int n, int m){
    unsigned long * out = new unsigned long[n*m];
    unsigned long * rows = new unsigned long[n*m];

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        rows[i*m] = x[i*m];
        for (int j = 1; j < m; ++j)
        {
            rows[i*m + j] = x[i*m + j] + rows[i*m + j - 1];
        }
    }

    transpose(rows, out, n, m);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        rows[i*m] = out[i*m];
        for (int j = 1; j < m; ++j)
        {
            rows[i*m + j] = out[i*m + j] + rows[i*m + j - 1];
        }
    }

    transpose(rows, out, m, n);

    delete [] rows;
    return out;
}
Here is the sequential code:
unsigned long * integralImage(uint8_t*x, int n, int m){
    unsigned long * out = new unsigned long[n*m];
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < m; ++j)
        {
            unsigned long val = x[i*m + j];
            if (i>=1)
            {
                val += out[(i-1)*m + j];
                if (j>=1)
                {
                    val += out[i*m + j - 1] - out[(i-1)*m + j - 1];
                }
            } else {
                if (j>=1)
                {
                    val += out[i*m + j - 1];
                }
            }
            out[i*m + j] = val;
        }
    }
    return out;
}
I also tried it without the transpose, but it was even slower, probably because of the cache accesses.
An example of calling code:
int main(int argc, char **argv){
    uint8_t* image = //read image from file (gray scale)
    int height = //height of the image
    int width = //width of the image

    double start_omp = omp_get_wtime();
    unsigned long* integral_image_parallel = integralImageMP(image, height, width); //parallel
    double end_omp = omp_get_wtime();
    double time_tot = end_omp - start_omp;
    std::cout << time_tot << std::endl;

    start_omp = omp_get_wtime();
    unsigned long* integral_image_serial = integralImage(image, height, width); //sequential
    end_omp = omp_get_wtime();
    time_tot = end_omp - start_omp;
    std::cout << time_tot << std::endl;

    return 0;
}
Each thread is working on a block of rows (the original post includes an illustration of what each thread is doing), where ColumnSum is done by transposing the matrix and repeating RowSum.
Let me first say that the results are a bit surprising to me, and I would guess the problem lies in the non-local memory access required by the transpose algorithm.
You can in any case mitigate it by turning your sequential algorithm into a parallel one with a two-pass approach. The first pass has to calculate the 2D integral in T threads, N rows apart, and the second pass must compensate for the fact that each block didn't start from the accumulated result of the previous row but from zero.
An example with Matlab shows the principle in 2D.
f=fix(rand(12,8)*8) % A random matrix with 12 rows, 8 columns
5 6 1 4 7 5 4 4
4 6 0 7 1 3 2 0
7 0 2 3 0 1 6 3
5 3 1 7 4 3 7 2
6 4 3 2 7 3 5 1
3 3 2 5 5 0 2 1
3 5 7 5 1 4 4 3
6 5 7 4 2 1 0 0
0 2 0 5 3 3 7 4
1 3 5 5 7 4 7 3
1 0 2 1 1 2 6 5
3 7 3 1 6 2 2 5
ff=cumsum(cumsum(f')') % The Summed Area Table
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113
27 46 53 76 95 110 134 144
30 52 61 89 113 128 154 165
33 60 76 109 134 153 183 197
39 71 94 131 158 178 208 222
39 73 96 138 168 191 228 246
40 77 105 152 189 216 260 281
41 78 108 156 194 223 273 299
44 88 121 170 214 245 297 328
fx=[cumsum(cumsum(f(1:4,:)')'); % The original table summed in
cumsum(cumsum(f(5:8,:)')'); % three parts -- 4 rows per each
cumsum(cumsum(f(9:12,:)')')] % "thread"
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113 %% Notice this row #4
6 10 13 15 22 25 30 31
9 16 21 28 40 43 50 52
12 24 36 48 61 68 79 84
18 35 54 70 85 93 104 109 %% Notice this row #8
0 2 2 7 10 13 20 24
1 6 11 21 31 38 52 59
2 7 14 25 36 45 65 77
5 17 27 39 56 67 89 106
fx(4,:) + fx(8,:) %% this is the SUM of row #4 and row #8
39 71 94 131 158 178 208 222
%% and finally -- what is the difference of the piecewise
%% calculated result and the real result?
ff-fx
0 0 0 0 0 0 0 0 %% look !! the first block
0 0 0 0 0 0 0 0 %% is already correct
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
21 36 40 61 73 85 104 113 %% All these rows in this
21 36 40 61 73 85 104 113 %% block are short by
21 36 40 61 73 85 104 113 %% the row #4 above
21 36 40 61 73 85 104 113 %%
39 71 94 131 158 178 208 222 %% and all these rows
39 71 94 131 158 178 208 222 %% in this block are short
39 71 94 131 158 178 208 222 %% by the SUM of the rows
39 71 94 131 158 178 208 222 %% #4 and #8 above
Fortunately one can start integrating block #2, i.e. rows 2N..3N-1, before block #1 has been compensated -- one just has to calculate the offset, which is a relatively small sequential task.
acc_for_block_2 = row[2*N-1] + row[N-1];
acc_for_block_3 = acc_for_block_2 + row[3*N-1];
..
acc_for_block_T-1 = acc_for_block_(T-2) + row[N*(T-1)-1];
I'm writing some code in CUDA (the Huffman algorithm, to be exact, but it's totally irrelevant to the case). I've got a file Paralellel.cu with two functions: one (WriteDictionary) is an ordinary function; the second (wrtDict) is a special CUDA __global__ function that runs on the CUDA GPU. Here are the bodies of these functions:
// I know the body of this function looks kind of unrelated
// to the program's main topic, but it's just for tests.
__global__ void wrtDict(Node** nodes, unsigned char* str)
{
    int i = threadIdx.x;
    Node* n = nodes[i];
    char c = n->character;
    str[6 * i] = 1;//c; !!!
    str[6 * i + 1] = 2;
    str[6 * i + 2] = 0;
    str[6 * i + 3] = 0;
    str[6 * i + 4] = 0;
    str[6 * i + 5] = 0;
}
I know these first two lines seem pointless, since I don't use the object n of class Node here, but just let them be for a while. And there's a super secret comment marked by "!!!". Here is WriteDictionary:
void WriteDictionary(NodeList* nodeList, unsigned char* str)
{
    Node** nodes = nodeList->elements;
    int N = nodeList->getCount();
    Node** cudaNodes;
    unsigned char* cudaStr;
    cudaMalloc((void**)&cudaStr, 6 * N * sizeof(unsigned char));
    cudaMalloc((void**)&cudaNodes, N * sizeof(Node*));
    cudaMemcpy(cudaStr, str, 6 * N * sizeof(char), cudaMemcpyHostToDevice);
    cudaMemcpy(cudaNodes, nodes, N * sizeof(Node*), cudaMemcpyHostToDevice);
    dim3 block(1);
    dim3 thread(N);
    std::cout << N << "\n";
    wrtDict<<<block,thread>>>(cudaNodes, cudaStr);
    cudaMemcpy(str, cudaStr, 6 * N * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cudaFree(cudaNodes);
    cudaFree(cudaStr);
}
As one can see, WriteDictionary is a kind of proxy between CUDA and the rest of the program. I've got a bunch of objects of my class Node somewhere in ordinary (host) memory, pointed to by the Node* array elements kept within my NodeList object. For now it's enough to know about Node that it has a public field char character. The unsigned char* str is going to be filled with some test data; it contains 6 * N chars of allocated memory, where N is the count of all elements in the elements array. So I allocate in CUDA memory space for 6 * N chars and N Node pointers. Then I copy over my Node pointers, which still point to ordinary host memory, and run the function. Within wrtDict I extract character into the char c variable, and this time do NOT try to put it into the output array str.
So, when I print the content of the output array str (outside the WriteDictionary function), I get a perfectly correct answer, i.e.:
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0 1 2 0 0 0 0
1 2 0 0 0 0
Yeah, here we've got 39 correct sixes of chars (shown in hex). BUT when we slightly change our super secret comment within wrtDict function, like this:
__global__ void wrtDict(Node** nodes, unsigned char* str)
{
    int i = threadIdx.x;
    Node* n = nodes[i];
    char c = n->character;
    str[6 * i] = c;//1; !!!
    str[6 * i + 1] = 2;
    str[6 * i + 2] = 0;
    str[6 * i + 3] = 0;
    str[6 * i + 4] = 0;
    str[6 * i + 5] = 0;
}
we will see strange things. I now expect the first char of every six to be a character from the Node pointed to by the array, each one different. Or, even if it fails, I expect only the first char of every six to be messed up, with the rest left intact: ? 2 0 0 0 0. But NO! When I do this, EVERYTHING gets completely messed up, and the content of the output array str now looks like this:
70 21 67 b7 70 21 67 b7 0 0 0 0
0 0 0 0 18 d7 85 8 b8 d7 85 8
78 d7 85 8 38 d9 85 8 d8 d7 85 8
f8 d5 85 8 58 d6 85 8 d8 d5 85 8
78 d6 85 8 b8 d6 85 8 98 d7 85 8
98 d6 85 8 38 d6 85 8 d8 d6 85 8
38 d5 85 8 18 d6 85 8 f8 d6 85 8
58 d9 85 8 f8 d7 85 8 78 d9 85 8
98 d9 85 8 d8 d4 85 8 b8 d8 85 8
38 d8 85 8 38 d7 85 8 78 d8 85 8
f8 d8 85 8 d8 d8 85 8 18 d5 85 8
61 20 75 6c 74 72 69 63 65 73 20 6d
6f 6c 65 73 74 69 65 20 73 69 74 20
61 6d 65 74 20 69 64 20 73 61 70 69
65 6e 2e 20 4d 61 75 72 69 73 20 73
61 70 69 65 6e 20 65 73 74 2c 20 64
69 67 6e 69 73 73 69 6d 20 61 63 20
70 6f 72 74 61 20 75 74 2c 20 76 75
6c 70 75 74 61 74 65 20 61 63 20 61
6e 74 65 2e 20 46
I'm asking now: why? Is it because I tried to reach ordinary host memory from within the CUDA GPU? I'm getting a warning, probably about exactly this case, saying:
Cannot tell what pointer points to, assuming global memory space
I've googled this and found only that when CUDA can't tell what a pointer points to, it assumes global device memory, and that in 99.99% of cases this warning should be ignored. So I'm ignoring it, thinking it'll be fine, but it isn't; is my case within that 0.01%?
How can I solve this problem? I know I could just copy the Nodes, not the pointers to them, into CUDA, but I assume copying them would cost me more time than I save parallelizing what's being done to them inside. I could also extract the character from every Node, put them all into an array and then copy it to CUDA, but the same concern applies.
I just completely don't know what to do and, what's worse, the deadline of the CUDA project at my college is today, approx. 5 PM (I just haven't had enough time to make it earlier, damn it...).
PS. If it helps: I'm compiling using a pretty simple command (with no switches):
nvcc -o huff ArchiveManager.cpp IOManager.cpp Node.cpp NodeList.cpp Program.cpp Paraleller.cu
Check the error values from every CUDA API call. You will get a launch failure message on the cudaMemcpy after your kernel launch
Run cuda-memcheck to help debug the error (which is basically a segmentation fault)
Realise that you are dereferencing an (unmapped) pointer into host memory from the GPU; you need to copy the nodes, not just the pointers to the nodes
You can also run your program from inside cuda-gdb; it will show you what error you're hitting. Also, right at the beginning in cuda-gdb, do a "set cuda memcheck on"; it will turn on memcheck inside cuda-gdb.
In the latest cuda-gdb version (5.0 as of today), you can also see warnings if you're not checking return codes from API calls and those API calls are failing.