APPCRASH (not when debugging) and Segmentation fault using QtCreator (C/C++) - c++

I'm using QtCreator to implement an algorithm that I have already written in MATLAB.
While coding this program I ran into two errors. The first one (APPCRASH) appears only when I build and run the program normally, not when I debug it (a Heisenbug), and it occurs in the function 'matriceA'. I tried making the variables volatile and moving the formulas for the terms of matrix A into another function, hoping this would stop the compiler from optimizing (I suspect compiler optimization causes the problem), but that did not solve it. I have not tried compiling the project with the -O0 option because my professor (it's a university project) has to be able to compile it normally, without specific options.
The second one is a SIGSEGV segmentation fault. It happens when the code reaches "DestroyFloatArray(&b, width);" in InpaintingColor.
Here is the code:
clanu_process.cpp (it's a little messy because I've tried a lot of things...)
#include "clanu_process.h"
#include "iomanip"
void InpaintingColor(float **Rout, float **Gout, float **Bout, float **Rin, float **Gin, float **Bin, float **Mask, int width, int height, double param)
{
cout << "1" << endl;
float alphak = 0, bethak = 0, res = 0;
float **b = 0, **xk = 0, **dk = 0, **rk = 0, **Ark = 0, **tmp1 = 0,**tmp2 = 0,**tmp3 = 0;
Ark = AllocateFloatArray( width, height);
tmp1 = AllocateFloatArray( width, height);
tmp2 = AllocateFloatArray( width, height);
tmp3 = AllocateFloatArray( width, height);
xk = AllocateFloatArray( width, height);
dk = AllocateFloatArray( width, height);
rk = AllocateFloatArray( width, height);
b = AllocateFloatArray( width, height);
cout << "2" << endl;
res = 1e8;
matrixProductByScalar(b,1.0/(3.0*256),Rin,width,height);
matrixDuplicate(xk, b, width, height);
// APPCRASH error
matriceA(Ark,xk,Mask,width,height);
//More code
// SIGSEGV error
DestroyFloatArray(&b, width);
DestroyFloatArray(&xk, width);
DestroyFloatArray(&dk, width);
DestroyFloatArray(&rk, width);
DestroyFloatArray(&Ark, width);
DestroyFloatArray(&tmp1, width);
DestroyFloatArray(&tmp2, width);
DestroyFloatArray(&tmp3, width);
}
float** matriceA(float **A, float **I, float **Masque, int N2, int N1){
volatile bool bool_iplus = false, bool_imoins = false, bool_jmoins = false, bool_jplus = false;
volatile int iplus = 0, imoins = 0, jplus = 0, jmoins = 0;
for(int i = 1; i <= N1; i++){
bool_iplus = i<N1;
iplus = i+1 < N1 ? i+1 : N1;
bool_imoins = i>1;
imoins = i-1 > 1 ? i-1 : 1;
for(int j = 1; j <= N2; j++){
bool_jplus = j<N2;
jplus = j+1 < N2 ? j+1 : N2;
bool_jmoins = j>1;
jmoins = j -1 > 1 ? j-1 : 1;
if(Masque[i-1][j-1]!=0){
//cout << "if - " << i << ", " << j<< endl;
A[i-1][j-1] = (1.0/36)*(16*I[i-1][j-1]
+ 4*(
(bool_iplus?I[iplus-1][j-1]:0)
+ (bool_imoins?I[imoins-1][j-1]:0)
+ (bool_jplus?I[i-1][jplus-1]:0)
+ (bool_jmoins?I[i-1][jmoins-1]:0)
)+(
(bool_iplus&&bool_jplus?I[iplus-1][jplus-1]:0)
+ (bool_imoins&&bool_jplus?I[imoins-1][jplus-1]:0)
+ (bool_imoins&&bool_jmoins?I[imoins-1][jmoins-1]:0))
+ (bool_iplus&&bool_jmoins?I[iplus-1][jmoins-1]:0));
}else{
//cout << "else - " << i << ", " << j << endl;
A[i-1][j-1]=
-(1.0*N1*N2)*(
-8.0*I[i-1][j-1]
+ I[iplus-1][j-1]
+ I[imoins-1][j-1]
+ I[i-1][jplus-1]
+ I[i-1][jmoins-1]
+ I[iplus-1][jplus-1]
+ I[imoins-1][jplus-1]
+ I[imoins-1][jmoins-1]
+ I[iplus-1][jmoins-1]);
}
}
}
return A;
}
The functions AllocateFloatArray and DestroyFloatArray
float ** AllocateFloatArray(int width, int height)
{
float ** r = new float*[width];
for(int i=0; i<width; i++)
r[i] = new float[height];
return r;
}
void DestroyFloatArray(float ***a, int width)
{
if( *a == 0 ) return;
for(int i=0; i<width; i++)
delete[] a[0][i];
delete[] *a;
*a = 0;
}
Thank you for your time.

I'm not sure this is the cause of your problem, but...
Your "matrix operations" functions (sum(), matrixSubstraction(), matrixAddition(), matrixProductByElement(), matrixProductByScalar(), and matrixDuplicate()) range the first index from zero to width and the second one from zero to height.
If I'm not wrong, this is correct and consistent with the allocation/deallocation (AllocateFloatArray() and DestroyFloatArray()).
But look at the two matriceA() functions; they are defined as
float** matriceA(float **A, float **I, int N2, int N1)
float** matriceA(float **A, float **I, float **Masque, int N2, int N1)
In both functions the first index ranges from zero to N1 and the second one from zero to N2; for example
for(int i = 1; i <= N1; i++){
// ...
for(int j = 1; j <= N2; j++){
// ...
A[i-1][j-1] = (1.0/36)*(16*I[i-1][j-1] // ...
Good. But you call matriceA() in this way
matriceA(Ark,rk,Mask,width,height);
Briefly: you allocate your matrices as width * height matrices and your "matrix operations" use them as width * height matrices, but your matriceA() functions use them as height * width matrices.
A wonderful way to devastate the memory: whenever width and height differ, the loops index past the end of the allocated arrays.
I suppose the solution could be either to
1) switch N1 and N2 in the matriceA() definition
2) or switch width and height in the matriceA() call
P.S.: sorry for my bad English.
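To make the dimension mismatch visible instead of silently corrupting memory, here is a minimal sketch. It assumes std::vector in place of the raw float** arrays; Matrix, makeMatrix and traversalIsInBounds are hypothetical names used only for illustration. The first index runs over width, as in AllocateFloatArray(), and at() throws as soon as a loop uses the swapped order.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the float** matrices: the first index runs over
// width and the second over height, like AllocateFloatArray() does.
using Matrix = std::vector<std::vector<float>>;

Matrix makeMatrix(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    return Matrix(width, std::vector<float>(height, 0.0f));
}

// Visits m[i][j] for i < firstDim and j < secondDim.
// at() throws std::out_of_range on a bad index instead of trashing the heap.
bool traversalIsInBounds(const Matrix& m, std::size_t firstDim, std::size_t secondDim) {
    try {
        float sum = 0.0f;
        for (std::size_t i = 0; i < firstDim; ++i)
            for (std::size_t j = 0; j < secondDim; ++j)
                sum += m.at(i).at(j);
        (void)sum;  // the value is irrelevant, only the accesses matter
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With a 4 x 3 matrix, traversing it as (4, 3) succeeds, while traversing it as (3, 4), which is what matriceA() effectively does, is reported as out of bounds.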

Related

What is the risk of using struct dataContent decimate(struct dataContent)?

I have written the following code to retrieve the content of an audio file. This is just part of the full project. I would like to know whether there is any risk in using struct dataContent decimate(struct dataContent). If so, what are the risks, and how can I improve this code to reduce them?
struct dataContent
{
DoubleArrayPtr data;
DoubleArrayPtr memorydata;
int numberofvalues;
int datasize;
long int sizeoffile;
};
struct dataContent decimate(struct dataContent dataprocess)
{
int i = 0, j = 0, k = 0, l = 0, m = 0, n = 0, p = 0, q = 0, r = 0, s = 0, t = 0;
cout << "Total number of blocks is: " << dataprocess.datasize << endl;
size_t size = dataprocess.datasize;
vector<double> sum(size);
vector<double> mean(size); //The mean is the arithmetic average of a set of given numbers
vector<double> secondmoment(size); //
vector<double> fourthmoment(size);
vector<double> kurtosis(size);
sum[0] = 0.0;
secondmoment[0] = 0.0;
fourthmoment[0] = 0.0;
kurtosis[0] = 0.0;
// Finding statistical moments for the data: mean, second- and fourth-order moments, and kurtosis
for(j = 0 ; j < size ; ++j)
{
// Mean Value
for(i = k ; i < (k + BUFFER_SIZE) ; i++)
{
sum[j] = sum[j] + abs(dataprocess.memorydata[i]);
//sum[j] = sum[j] + dataprocess.memorydata[i];
}
mean[j] = sum[j] / BUFFER_SIZE;
cout << "The mean of the absolute value of data in block " << (j + 1) << " is: " << mean[j] << endl;
k = k + BUFFER_SIZE;
}
return dataprocess;
} // End of decimate()
Thanks for your time.
The obvious issue is that you stride along the data array in BUFFER_SIZE pieces with no check against running off the end.
I assume DoubleArrayPtr is double *; why not use vector<double>, given that you use vector elsewhere?
Also, don't do this:
int i = 0, j = 0, k = 0, l = 0, m = 0, n = 0, p = 0, q = 0, r = 0, s = 0, t = 0;
Create and initialize variables at the point of use,
like this:
for(int j = 0 ; j < size ; ++j)
In C++, structs are types, so you can write
dataContent decimate(dataContent dataprocess)
Pass in a reference to the struct, though; at the moment you are copying it:
dataContent decimate(dataContent &dataprocess)
All of this really belongs on Code Review, though.
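As a rough sketch of the bounds issue, assuming the samples live in a vector<double> rather than behind DoubleArrayPtr: the function below (blockAbsMeans is a hypothetical name, and blockSize plays the role of BUFFER_SIZE) derives the number of blocks from the data size, so the inner loop cannot run past the end. It uses std::fabs so that the floating-point overload is always the one selected.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of |x| over consecutive blocks of blockSize samples.
// Only complete blocks are processed; a partial block at the end is ignored.
std::vector<double> blockAbsMeans(const std::vector<double>& samples, std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t blockCount = samples.size() / blockSize;
    std::vector<double> means(blockCount, 0.0);
    for (std::size_t j = 0; j < blockCount; ++j) {
        double sum = 0.0;
        for (std::size_t i = j * blockSize; i < (j + 1) * blockSize; ++i)
            sum += std::fabs(samples[i]);
        means[j] = sum / static_cast<double>(blockSize);
    }
    return means;
}
```

Taking the input by const reference also avoids the copy that passing the whole struct by value would make.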

How to call existing host function from device function in cuda [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
I have seen a similar question here.
However, I could not find an exact answer there, and it was written in 2012.
I am trying to call the cublasStatus_t cublasSgbmv(...) function, which is declared in "cublas_v2.h", from a __global__ function. However, I could not get the dynamic parallelism feature to work. I only have one source file, Source.cu. I have read that I should compile it in a way that separates device and host functions, and then link these outputs.
Does anyone know how to do this, or know of a good source that explains it?
Thanks in advance
edit: if you downvote, please at least explain the reason so that I can learn from my mistake.
edit2 :
my specific problem is, I'm using the following code in my Source.cu :
#include <iostream>
#include <vector>
#include <cuda.h>
#include <cstdio>
#include <stdio.h>
#include <device_launch_parameters.h>
#include <stdlib.h> //srand(), rand()
#include <time.h>
#include <builtin_types.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#define IDX2C(i ,j , ld ) ((( j )*( ld ))+( i ))
#define HEIGHT 4
#define WIDTH 4
#define V 4
#define KL 2
#define KU 1
#define THREADS_PER_BLOCK 512
#pragma comment(lib, "cublas")
//#pragma comment(lib, "helper_cuda")
using namespace std;
void create_Matrix(int* matrix, int width, int height){
int i, len;
len = height * width;
srand(time(NULL));
for (i = 0; i < len; i++){
matrix[i] = rand() % 10 + 1; //generates number between 1-10
}
}
template <typename T>
void print_vector(T* vector, int len){
for (int i = 0; i < len; i++)
cout << vector[i] << " ";
cout << endl;
}
template <typename T>
void creating_bandedMatrix(T* bandedMatrix, int height, int width, int ku, int kl){
//fill matrix with zeros at the beginning
int i, len;
len = height * width;
for (i = 0; i < len; i++){
bandedMatrix[i] = 0; //generates number between 1-10
}
srand(time(NULL));
//filling banded diagonal
int start, end;
for (int i = 0; i < height; i++){
start = i - kl;
if (start < 0)
start = 0;
end = i + ku + 1;
if (end > width)
end = width;
for (int j = start; j < end; j++){
*(bandedMatrix + (i*width) + j) = (float)(rand() % (10) + 1); //rand() / (T)RAND_MAX;;
}
}
}
template <typename T>
void print_matrix(T* matrix, int width, int height){
int len = width*height;
cout << "asdsffffff" << endl;
for (int i = 0; i < len; i++){
if (!(i%width))
cout << endl;
cout << i << ":" <<matrix[i] << " ";
}
cout << endl;
}
template <typename T>
void computeMatrixVectorMultiplication(T* bandedMatrix, T* vector2){
T row_sum = 0;
T* bandedHostResult = (T*)malloc(WIDTH * sizeof(T));
for (int i = 0; i < HEIGHT; i++){
row_sum = 0;
for (int j = 0; j < WIDTH; j++){
row_sum += (*(bandedMatrix + i*WIDTH + j)) * vector2[j];
}
bandedHostResult[i] = row_sum;
}
//priting the result
cout << "\n\nBanded Host Result...\n";
print_vector(bandedHostResult, WIDTH);
}
template <typename T>
void fillLapackMatrix(T* lapack_matrix, T* bandedMatrix, int kl, int ku, int banded_w, int banded_h, int lapack_w, int lapack_h){
int i, j, lapack_i;
int len = lapack_h * lapack_w;
for (i = 0; i < len; i++){
lapack_matrix[i] = 0; //generates number between 1-10
}
for (i = 0; i < banded_w; i++){
for (j = 0; j < banded_h; j++){
lapack_i = ku + i - j;
*(lapack_matrix + lapack_i*lapack_w + j) = *(bandedMatrix + i*banded_w + j);
//lapack_matrix[lapack_i*lapack_w + j] = bandedMatrix[i*bandedMatrix + j];
}
}
}
__global__ void device_cublasSgbmv(int m,int n,int kl, int ku,float* alpha, float* A, int lda ,float* B,int ldb,float*R, int ldr, float* beta){
int index = blockIdx.x * blockDim.x + threadIdx.x;
cublasHandle_t handle;
cublasCreate(&handle);
cublasOperation_t trans = CUBLAS_OP_N;
float* dev_x;
cudaMalloc((void**)&dev_x,sizeof(float) * n);
if(index < ldr){
cublasSgbmv(handle, trans,m, n, kl, ku, alpha, A, m, B+index*n, 1, beta, R+index*n, 1);
index = 0;
}
}
void fillNormalMatrix(float* B,int h,int w){
for(int i = 0; i < h;i++){
for(int j = 0; j < w;j++){
B[i*w + j] = 1;
}
}
}
int main()
{
cublasStatus_t status;
float *A;
float *x, *y;
float *dev_x, *dev_y;
int incx, incy;
float *dev_A = 0;
float alpha = 1.0f;
float beta = 0.0f;
int matrixSize = WIDTH * HEIGHT;
int i, j;
cublasHandle_t handle;
/* Initialize CUBLAS */
status = cublasCreate(&handle);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! CUBLAS initialization error\n");
return EXIT_FAILURE;
}
//Allocate host memory for the matrices
A = (float *)malloc(matrixSize* sizeof(float));
//Allocate memory for host vectors
x = (float *)malloc(WIDTH * sizeof(float));
y = (float*)malloc(WIDTH * sizeof(float));
// Fill the matrices with test data
creating_bandedMatrix(A, WIDTH, HEIGHT, KU, KL);
cout << "Banded Matrix\n";
print_matrix(A, WIDTH, HEIGHT);
//Fill the vectors with random data
for (i = 0; i < WIDTH; i++){
x[i] = 1;// (float)(rand() % (10) + 1);:
y[i] = (float)(rand() % (10) + 1);
}
cout << "\nvector x...\n";
print_vector(x, WIDTH);
//cout << "\nvector y...\n";
//print_vector(y, WIDTH);
//Allocate device memory for the matrix
if (cudaMalloc((void **)&dev_A, matrixSize * sizeof(float)) != cudaSuccess)
{
fprintf(stderr, "!!!! device memory allocation error (allocate A)\n");
return EXIT_FAILURE;
}
//Allocate device memory for vectors
if (cudaMalloc((void**)&dev_x, WIDTH * sizeof(float)) != cudaSuccess){
fprintf(stderr, "Device Vector Allocation PROBLEM\n");
return EXIT_FAILURE;
}
if (cudaMalloc((void**)&dev_y, WIDTH * sizeof(float)) != cudaSuccess){
fprintf(stderr, "Device Vector Allocation PROBLEM\n");
return EXIT_FAILURE;
}
// Initialize the device vectors with the host vectors
status = cublasSetVector(WIDTH, sizeof(float), x, 1, dev_x, 1);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! device access error (write x vector)\n");
return EXIT_FAILURE;
}
status = cublasSetVector(WIDTH, sizeof(float), y, 1, dev_y, 1);
if (status != CUBLAS_STATUS_SUCCESS)
{
fprintf(stderr, "!!!! device access error (write y vector)\n");
return EXIT_FAILURE;
}
//initialize matrix with lapack format
int lapack_width = WIDTH > HEIGHT ? HEIGHT : WIDTH;
int lapack_height = KL + KU + 1;
int lapackSize = lapack_height * lapack_width;
float* lapack_matrix = (float*)malloc(lapackSize * sizeof(float));
fillLapackMatrix(lapack_matrix, A, KL, KU, WIDTH, HEIGHT, lapack_width, lapack_height);
cout << "\n\nLAPACK MAtrix\n";
print_matrix(lapack_matrix, lapack_width, lapack_height);
//convert to column column matrix
float* col = (float*)malloc(lapackSize * sizeof(float));
for (i = 0; i < WIDTH; i++){
for (j = 0; j < HEIGHT; j++){
col[i + WIDTH*j] = lapack_matrix[WIDTH*i + j];
}
}
cout << "Lapack Column Based Matrix\n";
print_matrix(col,HEIGHT-1,WIDTH);
//status = cublasSetVector(lapackSize, sizeof(float), A, 1, dev_A, 1);
cublasSetMatrix(HEIGHT, WIDTH, sizeof(float), col, HEIGHT, dev_A, HEIGHT);
cublasOperation_t trans = CUBLAS_OP_N;
incy = incx = 1;
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////// Banded Matrix Matrix Multipllicatio ///////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
float* B,*dev_B,*dev_R,*R;
B = (float*)malloc(WIDTH*HEIGHT*sizeof(float));
R = (float*)malloc(WIDTH*HEIGHT*sizeof(float));
fillNormalMatrix(B,WIDTH,HEIGHT);
cudaMalloc((void**)&dev_B,matrixSize*sizeof(*B));
cudaMalloc((void**)&dev_R,matrixSize*sizeof(*R));
cublasSetMatrix(HEIGHT, WIDTH, sizeof(*B), B, HEIGHT, dev_B, HEIGHT);
cout << "Matrix B\n";
print_matrix(B,HEIGHT,WIDTH);
cout << "gfsdf\n";
device_cublasSgbmv<<<1,4>>>(HEIGHT, WIDTH, KL, KU, &alpha, dev_A, WIDTH, dev_B, HEIGHT, dev_R, HEIGHT,&beta);
cout << "after\n";
cublasGetMatrix(HEIGHT,WIDTH, sizeof (*R) ,dev_R ,WIDTH,R,WIDTH);
getchar();
return 0;
}
and compile it like :
nvcc -gencode=arch=compute_35,code=sm_35 -lcublas -lcudadevrt -O3 Source.cu -o Source.o -dc
g++ Source.o -lcublas -lcudart
then, I get the following :
In function `__sti____cudaRegisterAll_48_tmpxft_00001f1e_00000000_6_Source_cpp1_ii_ebe2258a()':
tmpxft_00001f1e_00000000-3_lapack_vector.cudafe1.cpp:(.text.startup+0x575): undefined reference to `__cudaRegisterLinkedBinary_48_tmpxft_00001f1e_00000000_6_Source_cpp1_ii_ebe2258a'
collect2: error: ld returned 1 exit status
You can compile and link the code you have now shown with a single command like this:
nvcc -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o test Source.cu
You may get some warnings like this:
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sgemm.asm.o'
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sm50_sgemm.o'
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/..//lib64/libcublas_device.a:maxwell_sm50_ssyrk.o'
Those can be safely ignored.
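Once the program links, it helps to have a host-side reference to compare the device result against. The sketch below is plain C++, not CUDA, and shows the column-major band storage used by BLAS-style gbmv routines such as cublasSgbmv: element A(i, j) sits at row ku + i - j of column j. packBand and bandMatVec are hypothetical helper names, and the dense input is assumed to be row-major like the matrix built by creating_bandedMatrix().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Packs a dense row-major m x n matrix into column-major band storage
// with kl sub-diagonals, ku super-diagonals and lda = kl + ku + 1 rows.
std::vector<float> packBand(const std::vector<float>& dense, int m, int n, int kl, int ku) {
    assert(static_cast<int>(dense.size()) == m * n);
    const int lda = kl + ku + 1;
    std::vector<float> band(static_cast<std::size_t>(lda) * n, 0.0f);
    for (int j = 0; j < n; ++j) {
        const int iBegin = std::max(0, j - ku);
        const int iEnd = std::min(m - 1, j + kl);
        for (int i = iBegin; i <= iEnd; ++i)
            band[(ku + i - j) + j * lda] = dense[i * n + j];
    }
    return band;
}

// Host reference for y = A * x that reads only the band entries.
std::vector<float> bandMatVec(const std::vector<float>& band, const std::vector<float>& x,
                              int m, int n, int kl, int ku) {
    const int lda = kl + ku + 1;
    std::vector<float> y(m, 0.0f);
    for (int j = 0; j < n; ++j) {
        const int iBegin = std::max(0, j - ku);
        const int iEnd = std::min(m - 1, j + kl);
        for (int i = iBegin; i <= iEnd; ++i)
            y[i] += band[(ku + i - j) + j * lda] * x[j];
    }
    return y;
}
```

Comparing the output of bandMatVec with the vector copied back from the device is a cheap way to check that the matrix was laid out the way the library expects.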

Code not compiling - ends up in xmemory

I just got started trying to learn how to code graphics using C++. When compiling a linear interpolation code, the code does not run and sends VC++ to the xmemory file. No errors or warnings given, thus leaving me with nothing to work on. What did I do wrong? I suspect the problem is connected to the way I assign the vectors, yet none of my changes have worked.
Here is the code:
#include "SDL.h"
#include <iostream>
#include <glm/glm.hpp>
#include <vector>
#include "SDLauxiliary.h"
using namespace std;
using glm::vec3;
using std::vector;
const int SCREEN_WIDTH = 640;
const int SCREEN_HEIGHT = 480;
SDL_Surface* screen;
void Draw();
void Interpolate( float a, float b, vector<float>& result ) {
int i = 0;
for ( float x=a;x < b+1; ++x )
{
result[i] = x;
i = i + 1;
}
}
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
int i = 0;
for (int add=0; add < 4; ++add) {
float count1 = (b[add]-a[add])/resultvec.size() + a[add];
float count2 = (b[add]-a[add])/resultvec.size() + a[add];
float count3 = (b[add]-a[add])/resultvec.size() + a[add];
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
i = i + 1;
}
}
int main( int argc, char* argv[] )
{
vector<float> result(10); // Create a vector width 10 floats
Interpolate(5, 14, result); // Fill it with interpolated values
for( int i=0; i < result.size(); ++i )
cout << result[i] << " "; // Print the result to the terminal
vector<vec3> resultvec( 4 );
vec3 a(1,4,9.2);
vec3 b(4,1,9.8);
InterpolateVec( a, b, resultvec );
for( int i=0; i<resultvec.size(); ++i )
{
cout << "( "
<< resultvec[i].x << ", "
<< resultvec[i].y << ", "
<< resultvec[i].z << " ) ";
}
screen = InitializeSDL( SCREEN_WIDTH, SCREEN_HEIGHT );
while( NoQuitMessageSDL() )
{
Draw();
}
SDL_SaveBMP( screen, "screenshot.bmp" );
return 0;
}
void Draw()
{
for( int y=0; y<SCREEN_HEIGHT; ++y )
{
for( int x=0; x<SCREEN_WIDTH; ++x )
{
vec3 color(1,0,1);
PutPixelSDL( screen, x, y, color );
}
}
if( SDL_MUSTLOCK(screen) )
SDL_UnlockSurface(screen);
SDL_UpdateRect( screen, 0, 0, 0, 0 );
}
I cannot post a comment on the question, so I'll write my thoughts as an answer.
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
It looks like you (or one of your libraries) overload operator, (the comma operator) for float to build a vec2 and then a vec3. A nice solution, but if I am right, there is no reason to assign that value to each component, and your code would be similar to:
resultvec[i] = (count1, count2, count3);
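Again, this is just a hypothesis! I cannot compile your code and see the error. If no such overload exists, the built-in comma operator discards count1 and count2, and each component is simply assigned count3.
Also, I do not understand why you use i, which is always equal to add.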
Strange that some of you could not compile the code; it may be that you have not installed the libraries (the n00b speculating, yay ...).
So here is what I did to make it work (less code is better in this case, as the first comment stated):
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
resultvec[0].x = a[0];
resultvec[0].y = a[1];
resultvec[0].z = a[2];
float count1 = (b[0]-a[0])/(resultvec.size() - 1);
float count2 = (b[1]-a[1])/(resultvec.size() - 1);
float count3 = (b[2]-a[2])/(resultvec.size() - 1);
for (size_t add=1; add < resultvec.size(); ++add) { // stay within the vector's bounds
a[0] = a[0] + count1;
a[1] = a[1] + count2;
a[2] = a[2] + count3;
resultvec[add].x = a[0];
resultvec[add].y = a[1];
resultvec[add].z = a[2];
}
}
What I discovered (after many an hour ...) was that I did not need to add count1, count2 and count3 separately; vec3 is a type for which adding count1 does what I wanted, i.e. assigning a color (something like (0,0,1)). Am I making sense? My vocabulary is not that technical, I know.
Or, you could save some time and let glm::vec3 do what glm::vec3 is supposed to do.
In the meantime, here, have a cookie (cookie.png)
void Interpolate(vec3 a, vec3 b, vector<vec3>& result) {
vec3 diffStep = (b-a) * (1.0f / (result.size() - 1)); // Operator overloading
result[0] = vec3(a);
for(int i = 1; i < result.size(); i++) {
result[i] = result[i-1] + diffStep;
}
}
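For readers without glm installed, the same idea can be sketched with a small stand-in type. Vec3 and interpolate below are hypothetical and only mimic the x, y, z members of glm::vec3. The key points are dividing by size - 1, so that the last element lands on b, and guarding the one-element case against a division by zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for glm::vec3 (assumption: only x, y and z are needed here).
struct Vec3 { float x, y, z; };

// Fills result with evenly spaced points: result.front() == a and,
// when there are at least two elements, result.back() == b.
void interpolate(const Vec3& a, const Vec3& b, std::vector<Vec3>& result) {
    assert(!result.empty());
    if (result.size() == 1) { result[0] = a; return; }  // avoids dividing by zero
    const float last = static_cast<float>(result.size() - 1);
    for (std::size_t i = 0; i < result.size(); ++i) {
        const float t = static_cast<float>(i) / last;  // 0 at a, 1 at b
        result[i] = Vec3{a.x + t * (b.x - a.x),
                         a.y + t * (b.y - a.y),
                         a.z + t * (b.z - a.z)};
    }
}
```

Computing each point from t directly, instead of adding a step repeatedly, also keeps rounding errors from accumulating along the vector.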

How to speed up vector initialization c++

I had a previous question about a stack overflow error and switched to vectors for my arrays of objects. That question can be referenced here if needed: How to get rid of stack overflow error
My current question, however, is how to speed up the initialization of the vectors. My current method takes ~15 seconds. Using arrays instead of vectors, it took about a second, with arrays small enough not to cause the stack overflow error.
Here is how I am initializing it:
in main.cpp I initialize my dungeon object:
dungeon = Dungeon(0, &textureHandler, MIN_X, MAX_Y);
in my dungeon(...) constructor, I initialize my 5x5 vector of rooms and call loadDungeon:
Dungeon::Dungeon(int dungeonID, TextureHandler* textureHandler, int topLeftX, int topLeftY)
{
currentRoomRow = 0;
currentRoomCol = 0;
for (int r = 0; r < MAX_RM_ROWS; ++r)
{
rooms.push_back(vector<Room>());
for (int c = 0; c < MAX_RM_COLS; ++c)
{
rooms[r].push_back(Room());
}
}
loadDungeon(dungeonID, textureHandler, topLeftX, topLeftY);
}
my Room constructor populates my 30x50 vector of cells (so I can set them up in the loadDungeon function):
Room::Room()
{
for (int r = 0; r < MAX_ROWS; ++r)
{
cells.push_back(vector<Cell>());
for (int c = 0; c < MAX_COLS; ++c)
{
cells[r].push_back(Cell());
}
}
}
My default cell constructor is simple and isn't doing much but I'll post it anyway:
Cell::Cell()
{
x = 0;
y = 0;
width = 16;
height = 16;
solid = false;
texCoords.push_back(0);
texCoords.push_back(0);
texCoords.push_back(1);
texCoords.push_back(0);
texCoords.push_back(1);
texCoords.push_back(1);
texCoords.push_back(0);
texCoords.push_back(1);
}
And lastly my loadDungeon() function will set up the cells. Eventually this will read from a file and load the cells up but for now I would like to optimize this a bit if possible.
void Dungeon::loadDungeon(int dungeonID, TextureHandler* textureHandler, int topLeftX, int topLeftY)
{
int startX = topLeftX + (textureHandler->getSpriteWidth()/2);
int startY = topLeftY - (textureHandler->getSpriteHeight()/2);
int xOffset = 0;
int yOffset = 0;
for (int r = 0; r < MAX_RM_ROWS; ++r)
{
for (int c = 0; c < MAX_RM_COLS; ++c)
{
for (int cellRow = 0; cellRow < rooms[r][c].getMaxRows(); ++cellRow)
{
xOffset = 0;
for (int cellCol = 0; cellCol < rooms[r][c].getMaxCols(); ++cellCol)
{
rooms[r][c].setupCell(cellRow, cellCol, startX + xOffset, startY - yOffset, textureHandler->getSpriteWidth(), textureHandler->getSpriteHeight(), false, textureHandler->getSpriteTexCoords("grass"));
xOffset += textureHandler->getSpriteWidth();
}
yOffset += textureHandler->getSpriteHeight();
}
}
}
currentDungeon = dungeonID;
currentRoomRow = 0;
currentRoomCol = 0;
}
So how can I speed this up so it doesn't take ~15 seconds to load every time? I feel like it shouldn't take 15 seconds to load a simple 2D game.
SOLUTION
Well, my solution was to use the std::vector::reserve call (rooms.reserve in my code), and it ended up working well. I changed my function Dungeon::loadDungeon to Dungeon::loadDefaultDungeon because it now loads from a save file.
Anyway here is the code (I got it down to about 4-5 seconds from ~15+ seconds in debug mode):
Dungeon::Dungeon()
{
rooms.reserve(MAX_RM_ROWS * MAX_RM_COLS);
currentDungeon = 0;
currentRoomRow = 0;
currentRoomCol = 0;
}
void Dungeon::loadDefaultDungeon(TextureHandler* textureHandler, int topLeftX, int topLeftY)
{
int startX = topLeftX + (textureHandler->getSpriteWidth()/2);
int startY = topLeftY - (textureHandler->getSpriteHeight()/2);
int xOffset = 0;
int yOffset = 0;
cerr << "Loading default dungeon..." << endl;
for (int roomRow = 0; roomRow < MAX_RM_ROWS; ++roomRow)
{
for (int roomCol = 0; roomCol < MAX_RM_COLS; ++roomCol)
{
rooms.push_back(Room());
int curRoom = roomRow * MAX_RM_COLS + roomCol;
for (int cellRow = 0; cellRow < rooms[curRoom].getMaxRows(); ++cellRow)
{
for (int cellCol = 0; cellCol < rooms[curRoom].getMaxCols(); ++cellCol)
{
rooms[curRoom].setupCell(cellRow, cellCol, startX + xOffset, startY - yOffset, textureHandler->getSpriteWidth(), textureHandler->getSpriteHeight(), false, textureHandler->getSpriteTexCoords("default"), "default");
xOffset += textureHandler->getSpriteWidth();
}
yOffset += textureHandler->getSpriteHeight();
xOffset = 0;
}
cerr << " room " << curRoom << " complete" << endl;
}
}
cerr << "default dungeon loaded" << endl;
}
Room::Room()
{
cells.reserve(MAX_ROWS * MAX_COLS);
for (int r = 0; r < MAX_ROWS; ++r)
{
for (int c = 0; c < MAX_COLS; ++c)
{
cells.push_back(Cell());
}
}
}
void Room::setupCell(int row, int col, float x, float y, float width, float height, bool solid, /*std::array<float, 8>*/ vector<float> texCoords, string texName)
{
cells[row * MAX_COLS + col].setup(x, y, width, height, solid, texCoords, texName);
}
void Cell::setup(float x, float y, float width, float height, bool solid, /*std::array<float,8>*/ vector<float> t, string texName)
{
this->x = x;
this->y = y;
this->width = width;
this->height = height;
this->solid = solid;
for (int i = 0; i < t.size(); ++i)
this->texCoords.push_back(t[i]);
this->texName = texName;
}
It seems wasteful to have so many dynamic allocations. You can get away with one single allocation by flattening out your vector and accessing it in strides:
std::vector<Room> rooms;
rooms.resize(MAX_RM_ROWS * MAX_RM_COLS);
for (unsigned int i = 0; i != MAX_RM_ROWS; ++i)
{
for (unsigned int j = 0; j != MAX_RM_COLS; ++j)
{
Room & r = rooms[i * MAX_RM_COLS + j];
// use `r` ^^^^^^^^^^^^^^^^^^^-----<< strides!
}
}
Note how resize is performed exactly once, incurring only one single allocation, as well as default-constructing each element. If you'd rather construct each element specifically, use rooms.reserve(MAX_RM_ROWS * MAX_RM_COLS); instead and populate the vector in the loop.
You may also wish to profile with rows and columns swapped and see which is faster.
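The flattening idea can be wrapped in a small helper so that the stride arithmetic lives in one place. This is only a sketch: Grid is a hypothetical class, not part of the code in the question, and it assumes the element type is default-constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D grid stored in one contiguous, row-major vector.
template <typename T>
class Grid {
public:
    // One allocation; every element is value-initialized.
    Grid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    T& at(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];  // a stride of cols_ elements per row
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<T> data_;
};
```

A room could then hold a Grid<Cell> built as Grid<Cell>(MAX_ROWS, MAX_COLS), which replaces the 30 inner vectors with a single allocation.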
Since it seems that your vectors have their size defined at compile time, if you can use C++11, you may consider using std::array instead of std::vector. std::array cannot be resized and lacks many of the operations in std::vector, but is much more lightweight and it seems a good fit for what you are doing.
As an example, you could declare cells as:
#include <array>
/* ... */
std::array<std::array<Cell, MAX_COLS>, MAX_ROWS> cells;
UPDATE: since a locally defined std::array allocates its internal array on the stack, the OP will experience a stack overflow due to the considerably large size of the arrays. Still, it is possible to use an std::array (and its benefits compared to using std::vector), by allocating the array on the heap. That can be done by doing something like:
typedef std::array<std::array<Cell, MAX_COLS>, MAX_ROWS> Map;
Map* cells;
/* ... */
cells = new Map();
Even better, smart pointers can be used:
#include <memory>
/* ... */
std::unique_ptr<Map> cells;
cells = std::unique_ptr<Map>(new Map());
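If C++14 is available, std::make_unique avoids writing new by hand. The sketch below assumes a simplified Cell and uses the 30x50 room size mentioned in the question; kMaxRows, kMaxCols and makeMap are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Simplified stand-in for the Cell class in the question.
struct Cell {
    int x = 0;
    int y = 0;
    bool solid = false;
};

constexpr std::size_t kMaxRows = 30;  // room size taken from the question
constexpr std::size_t kMaxCols = 50;
using Map = std::array<std::array<Cell, kMaxCols>, kMaxRows>;

// Allocates the whole map on the heap in one block; the unique_ptr
// releases it automatically when it goes out of scope.
std::unique_ptr<Map> makeMap() {
    return std::make_unique<Map>();
}
```

Cells are then reached as (*cells)[row][col], and the storage stays contiguous, unlike a vector of vectors.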

Optimizing 1D Convolution

Is there a way to speed up this 1D convolution? I tried to make dy cache-efficient,
but compiling with g++ and -O3 gave worse performance.
I am convolving with [-1., 0., 1.] in both directions.
This is not homework.
#include<iostream>
#include<cstdlib>
#include<sys/time.h>
void print_matrix( int height, int width, float *matrix){
for (int j=0; j < height; j++){
for (int i=0; i < width; i++){
std::cout << matrix[j * width + i] << ",";
}
std::cout << std::endl;
}
}
void fill_matrix( int height, int width, float *matrix){
for (int j=0; j < height; j++){
for (int i=0; i < width; i++){
matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
}
}
}
#define RESTRICT __restrict__
void dx_matrix( int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
//init min,max
*min = *max = -1.F * in_matrix[0] + in_matrix[1];
for (int j=0; j < height; j++){
float* row = in_matrix + j * width;
for (int i=1; i < width-1; i++){
float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */
if (res > *max ) *max = res;
if (res < *min ) *min = res;
out_matrix[j * width + i] = res;
}
}
}
void dy_matrix( int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
//init min,max
*min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1];
for (int j=1; j < height-1; j++){
for (int i=0; i < width; i++){
float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
if (res > *max ) *max = res;
if (res < *min ) *min = res;
out_matrix[j * width + i] = res;
}
}
}
double now (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
int main(int argc, char **argv){
int width, height;
float *in_matrix;
float *out_matrix;
if(argc < 3){
std::cout << argv[0] << "usage: width height " << std::endl;
return -1;
}
srand(123);
width = atoi(argv[1]);
height = atoi(argv[2]);
std::cout << "Width:"<< width << " Height:" << height << std::endl;
if (width < 3){
std::cout << "Width too short " << std::endl;
return -1;
}
if (height < 3){
std::cout << "Height too short " << std::endl;
return -1;
}
in_matrix = (float *) malloc( height * width * sizeof(float));
out_matrix = (float *) malloc( height * width * sizeof(float));
fill_matrix(height, width, in_matrix);
//print_matrix(height, width, in_matrix);
float min, max;
double a = now();
dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
std::cout << "dx min:" << min << " max:" << max << std::endl;
dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
double b = now();
std::cout << "dy min:" << min << " max:" << max << std::endl;
std::cout << "time: " << b-a << " sec" << std::endl;
return 0;
}
Use local variables for computing the min and max. Every time you do this:
if (res > *max ) *max = res;
if (res < *min ) *min = res;
max and min have to get written to memory. Adding restrict on the pointers would help (indicating the writes are independent), but an even better way would be something like
//Setup
float tempMin = ...
float tempMax = ...
...
// Inner loop
tempMin = (res < tempMin) ? res : tempMin;
tempMax = (res > tempMax) ? res : tempMax;
...
// End
*min = tempMin;
*max = tempMax;
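Filled in for the horizontal pass, the pattern above might look like the sketch below. dx_matrix_local is a hypothetical name; the function assumes the same row-major layout as the question and seeds the running values with the first result the loop produces.

```cpp
#include <cassert>

// Horizontal [-1, 0, 1] derivative with the running min/max kept in locals,
// so the compiler can hold them in registers for the whole loop.
void dx_matrix_local(int height, int width, const float* in_matrix,
                     float* out_matrix, float* min, float* max) {
    assert(height >= 1 && width >= 3);
    float tempMin = in_matrix[2] - in_matrix[0];  // first value the loop computes
    float tempMax = tempMin;
    for (int j = 0; j < height; ++j) {
        const float* row = in_matrix + j * width;
        float* out = out_matrix + j * width;
        for (int i = 1; i < width - 1; ++i) {
            const float res = row[i + 1] - row[i - 1];
            tempMin = (res < tempMin) ? res : tempMin;
            tempMax = (res > tempMax) ? res : tempMax;
            out[i] = res;
        }
    }
    *min = tempMin;  // one write-back at the end
    *max = tempMax;
}
```

The border columns of out_matrix are left untouched, as in the original dx_matrix.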
First of all, I would rewrite the dy loop to get rid of "[ (j-1) * width + i]" and "in_matrix[ (j+1) * width + i]", and do something like:
float* p, *q, *out;
p = &in_matrix[(j-1)*width];
q = &in_matrix[(j+1)*width];
out = &out_matrix[j*width];
for (int i=0; i < width; i++){
float res = -1.F * p[i] + q[i] ;
if (res > *max ) *max = res;
if (res < *min ) *min = res;
out[i] = res;
}
But that is a trivial optimization that the compiler may already be doing for you.
It will be slightly faster to do "q[i]-p[i]" instead of "-1.f*p[i]+q[i]", but, again, the compiler may be smart enough to do that behind your back.
The whole thing would benefit considerably from SSE2 and multithreading. I'd bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.
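Leaving SSE2 and OpenMP aside, the row-pointer rewrite of the vertical pass can be sketched as follows. dy_matrix_rows, minOut and maxOut are hypothetical names. The inner loop touches three contiguous rows with unit stride, which is the form compilers tend to auto-vectorize most readily, though whether that happens depends on the compiler and flags.

```cpp
#include <algorithm>
#include <cassert>

// Vertical [-1, 0, 1] derivative using one pointer per row.
void dy_matrix_rows(int height, int width, const float* in_matrix,
                    float* out_matrix, float* minOut, float* maxOut) {
    assert(height >= 3 && width >= 1);
    float lo = in_matrix[2 * width] - in_matrix[0];  // result for j = 1, i = 0
    float hi = lo;
    for (int j = 1; j < height - 1; ++j) {
        const float* p = in_matrix + (j - 1) * width;  // row above
        const float* q = in_matrix + (j + 1) * width;  // row below
        float* out = out_matrix + j * width;
        for (int i = 0; i < width; ++i) {
            const float res = q[i] - p[i];
            lo = std::min(lo, res);
            hi = std::max(hi, res);
            out[i] = res;
        }
    }
    *minOut = lo;
    *maxOut = hi;
}
```

The first and last rows of out_matrix are not written, matching the loop bounds of the original dy_matrix.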
The compiler might notice this but you are creating/freeing a lot of variables on the stack as you go in and out of the scope operators {}. Instead of:
for (int j=0; j < height; j++){
float* row = in_matrix + j * width;
for (int i=1; i < width-1; i++){
float res = -1.F * row[i-1] + row[i+1];
How about:
int i, j;
float *row;
float res;
for (j=0; j < height; j++){
row = in_matrix + j * width;
for (i=1; i < width-1; i++){
res = -1.F * row[i-1] + row[i+1];
Well, the compiler might be taking care of these, but here are a couple of small things:
a) Why are you multiplying by -1.F? Why not just subtract? For instance:
float res = -1.F * row[i-1] + row[i+1];
could just be:
float res = row[i+1] - row[i-1];
b) This:
if (res > *max ) *max = res;
if (res < *min ) *min = res;
can be made into
if (res > *max ) *max = res;
else if (res < *min ) *min = res;
and in other places. If the first is true, the second can't be so let's not check it.
Addition:
Here's another thing. To minimize your multiplications, change
for (int j=1; j < height-1; j++){
for (int i=0; i < width; i++){
float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
to
int h = -width; // h equals (j-1)*width after the increment below
int width2 = 2 * width;
for (int j=1; j < height-1; j++){
h += width;
for (int i=h; i < h + width; i++){
float res = in_matrix[i + width2] - in_matrix[i];
and at the end of the loop
out_matrix[i + width] = res;
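You can do similar things in other places, but hopefully you get the idea. Also, there is a minor bug in dy_matrix:
*min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1 ];
should end with in_matrix[ 2 * width ], because the first value the loop computes (j = 1, i = 0) is in_matrix[2 * width] - in_matrix[0].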
Profiling this with -O3 and -O2 using versions of both the clang and g++ compilers on OS X, I found that
30% of the time was spent filling the initial matrix
matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;
40% of the time was spent in dx_matrix on the line.
out_matrix[j * width + i] = row[i+1] -row[i-1];
About 9% of the time was spent in the conditionals in dx_matrix .. I separated them into a separate loop to see if that helped, but it didn't change anything much.
Shark gave the suggestion that this could be improved through the use of SSE instructions.
Interestingly only about 19% of the time was spent in the dy_matrix routine.
This was run on a 10k by 10k matrix (about 1.6 seconds).
Note your results may be different if you're using a different compiler, different OS etc.