What I'm doing here is appending some smaller arrays (Bx, By and Bz) to a global array (optimisedMesh). The contents and sizes of Bx, By and Bz are set in the b and c loops. Once they are fully built, they are appended to optimisedMesh. This should happen on every iteration of the "a" loop.
I've run into two problems trying this. The first is that when I call free(Bx) once I've finished with the array, the debugger reports a segmentation fault, and I'm not sure why.
The second happens on the second iteration of the "a" loop. On the first iteration
the realloc seems to work fine, but on the second it returns the 0x0 address, which causes a segmentation fault further on in the code.
Also, I've omitted the By and Bz code because it looks exactly the same as the Bx code.
Thanks for your time.
int* LBxIA = (int*) calloc(1, sizeof(int*));
int* LBxIB = (int*) calloc(1, sizeof(int*));
int* LByIA = (int*) calloc(1, sizeof(int*));
int* LByIB = (int*) calloc(1, sizeof(int*));
int* LBzIA = (int*) calloc(1, sizeof(int*));
int* LBzIB = (int*) calloc(1, sizeof(int*));
int* LBxFA = (int*) calloc(1, sizeof(int*));
int* LBxFB = (int*) calloc(1, sizeof(int*));
int* LByFA = (int*) calloc(1, sizeof(int*));
int* LByFB = (int*) calloc(1, sizeof(int*));
int* LBzFA = (int*) calloc(1, sizeof(int*));
int* LBzFB = (int*) calloc(1, sizeof(int*));
Quad** Bx = (Quad**) calloc(1,sizeof(Quad*));
int maxSize = Math::maxof(xLenght,yLenght,zLenght);
for(int a = 0; a < maxSize; a++){
int BxCount = 0; int ByCount = 0; int BzCount = 0;
Bx = (Quad**) realloc(Bx,sizeof(Quad*));
for(int b = 0; b < maxSize; b++){
for(int c = 0; c < maxSize; c++){
//Bx
if(a <xLenght && b < yLenght && c < zLenght){
if(cubes[a][b][c] != nullptr){
if(!cubes[a][b][c]->faces[FACE_LEFT].hidden){
if(!LBxIA){
LBxIA = new int(c);
}else{
LBxFA = new int(c);
}
}else{
if(LBxIA && LBxFA){
BxCount++;
Bx = (Quad**) realloc(Bx, BxCount * sizeof(Quad*));
Bx[BxCount - 1] = new Quad(Vector3(a,b,*LBxIA),Vector3(a,b,*LBxFA),Vector3(a,b+1,*LBxIA),Vector3(a,b+1,*LBxFA));
LBxIA = nullptr;
LBxFA = nullptr;
}
}
}else{
if(LBxIA && LBxFA){
BxCount++;
Bx = (Quad**) realloc(Bx, BxCount * sizeof(Quad*));
Bx[BxCount-1] = new Quad(Vector3(a,b,*LBxIA),Vector3(a,b,*LBxFA),Vector3(a,b+1,*LBxIA),Vector3(a,b+1,*LBxFA));
LBxIA = nullptr;
LBxFA = nullptr;
}
if(LBxIB && LBxFB){
BxCount++;
Bx = (Quad**) realloc(Bx, BxCount * sizeof(Quad*));
Bx[BxCount-1] = new Quad(Vector3(a+1,b,*LBxIB),Vector3(a+1,b,*LBxFB),Vector3(a+1,b+1,*LBxIB),Vector3(a+1,b+1,*LBxFB));
LBxIB = nullptr;
LBxFB = nullptr;
}
}
}
}
}
optimisedMeshCount += (BxCount + ByCount + BzCount)*sizeof(Quad*);
optimisedMesh = (Quad**) realloc(optimisedMesh, optimisedMeshCount);
copy(Bx, Bx + BxCount*sizeof(Quad*), optimisedMesh + (optimisedMeshCount - (BxCount + ByCount + BzCount)*sizeof(Quad*)));
copy(By, By + ByCount*sizeof(Quad*), optimisedMesh + (optimisedMeshCount - (BxCount + ByCount + BzCount)*sizeof(Quad*)) + BxCount*sizeof(Quad*));//TODO AquĆ error
copy(Bz, Bz + BzCount*sizeof(Quad*), optimisedMesh + (optimisedMeshCount - (BxCount + ByCount + BzCount)*sizeof(Quad*)) + BxCount*sizeof(Quad*) + ByCount*sizeof(Quad*));
free(Bx);
}
I guess the problem is with the three copy lines.
copy expects the beginning and end of a container or memory range. In your case you provide Bx, which is fine, and Bx + BxCount*sizeof(Quad*), which is way beyond the end of Bx's memory.
This is because Bx + 1 is not Bx plus one byte, but &Bx[1], the second element. Accordingly, Bx + BxCount is already the "end" expected by copy.
That means Bx + BxCount*sizeof(Quad*) points, on a 64-bit system, eight times as far beyond the end of Bx's memory range. The same goes for optimisedMesh, By and Bz. As a consequence you copy too many elements, and the result is memory corruption.
Using std::vector and storing Quads instead of pointers to Quad
std::vector<Quad> Bx, By, Bz, optimisedMesh;
for (int a = 0; a < maxSize; a++) {
Bx.clear();
for (int b = 0; b < maxSize; b++) {
for (int c = 0; c < maxSize; c++) {
// ...
Quad qx(Vector3(a,b,*LBxIA),
Vector3(a,b,*LBxFA),
Vector3(a,b+1,*LBxIA),
Vector3(a,b+1,*LBxFA));
Bx.push_back(qx);
// ...
}
}
std::copy(Bx.begin(), Bx.end(), std::back_inserter(optimisedMesh));
std::copy(By.begin(), By.end(), std::back_inserter(optimisedMesh));
std::copy(Bz.begin(), Bz.end(), std::back_inserter(optimisedMesh));
}
As you can see, no explicit allocation, reallocation or freeing of memory, no counting of elements.
Unrelated, but you must also pay attention to LBxIA = new int(c); followed later by LBxIA = nullptr;, which leaks the allocated int.
According to an exercise, I have to determine whether allocation addresses ascend or descend on the stack and on the heap.
When I print out the addresses, the stack values always differ by 4 bytes in descending order, but on the heap the difference is 32 in some cases and then suddenly 848. Why?
#include <iostream>
using namespace std;
int main(){
int a = 1;
int b = 1;
int c = 1;
int d = 1;
cout<<"Allocate in stack:"<<endl;
cout<<"adress a - "<<(uintptr_t)&a<<endl;
cout<<"adress b - "<<(uintptr_t)&b<<endl;
cout<<"adress c - "<<(uintptr_t)&c<<endl;
cout<<"adress d - "<<(uintptr_t)&d<<endl;
int* i = new int(1);
int* j = new int(1);
int* k = new int(1);
int* l = new int(1);
cout<<"Allocate in heap:"<<endl;
cout<<"adress i - "<<(uintptr_t)i<<endl;
cout<<"adress j - "<<(uintptr_t)j<<endl;
cout<<"adress k - "<<(uintptr_t)k<<endl;
cout<<"adress k - "<<(uintptr_t)l<<endl;
}
These are the results:
Allocate in stack:
adress a - 622057224140
adress b - 622057224136
adress c - 622057224132
adress d - 622057224128
Allocate in heap:
adress i - 2426123523968
adress j - 2426123524000 // note: the difference to the next address is 848
adress k - 2426123524848
adress k - 2426123524880
My computer has 32 GB of RAM available. I want to define a 1500*1500*500 array. How should I define it as a dynamic array?
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <openacc.h>
#include <time.h>
#include <string.h>
#include <cuda_runtime_api.h>
void main(void) {
#define NX 1501
#define NY 1501
#define NZ 501
int i, j, k, l, m, dt, nstop;
double comp;
dt = 5;
nstop = 5
static double ex[NX][NY][NZ] = { 0. }, ey[NX][NY][NZ] = { 0. }, ez[NX][NY][NZ] = { 0. };
static double hx[NX][NY][NZ] = { 1. }, hy[NX][NY][NZ] = { 0. }, hz[NX][NY][NZ] = { 1. };
static double t, comp;
FILE *file;
file = fopen("point A hm=0.csv", "w"); /* Output data file name */
t = 0.;
for (l = 0; l < nstop; l++) {
for (i = 0; i < NX - 1; i++) {
for (j = 1; j < NY - 1; j++) {
for (k = 1; k < NZ - 1; k++) {
ex[i][j][k] = 2 * ey[i][j][k]
+ 3 * (hz[i][j][k] - hx[i][j - 1][k])
- 5 * (hy[i][j][k] - 2 * hz[i][j][k - 1]);
}
}
}
comp = ((double)(l + 1) / nstop) * 100.;
printf("Computation: %4.3f %% completed \r", comp);
fprintf(file, "%e, %e \n", t * 1e6, -ex[1200][950][20] + ex[1170][950][20]) / 2.);
t = t + dt;
}
fclose(file);
}
There must be an error in your problem statement:
the formula to compute ex[i][j][k] only depends on values from the other arrays with the same first index i. Since you only output the value of (-ex[1200][950][20] + ex[1170][950][20]) / 2., you only need to compute the values for i=1200 and i=1170, so there is no need to allocate that much memory.
furthermore, the computed values in ex are the same for every value of l, so there is no need to recompute them at each iteration.
finally, given the initialization of the arrays, all values of ex with a first index other than 0 are zero, so the output is trivially 0.0.
More seriously, if the initial values are small integers, the results seem to require only 32-bit integer arithmetic, which would halve the memory requirements. Yet this would still exceed the maximum size for statically allocated objects on your system. You should allocate these 3D matrices dynamically this way:
double (*ex)[NY][NZ] = calloc(NX, sizeof(*ex));
Assuming your code is more complex than the sample posted, which incidentally contains a few typos that prevent compilation, here is what the modified code would look like:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
#define NX 1501
#define NY 1501
#define NZ 501
int i, j, k, l, dt, nstop;
double comp;
dt = 5;
nstop = 5;
double (*ex)[NY][NZ] = calloc(NX, sizeof(*ex));
if (ex == NULL) { fprintf(stderr, "allocation failed for ex\n"); exit(1); }
double (*ey)[NY][NZ] = calloc(NX, sizeof(*ey));
if (ey == NULL) { fprintf(stderr, "allocation failed for ey\n"); exit(1); }
double (*ez)[NY][NZ] = calloc(NX, sizeof(*ez));
if (ez == NULL) { fprintf(stderr, "allocation failed for ez\n"); exit(1); }
double (*hx)[NY][NZ] = calloc(NX, sizeof(*hx));
if (hx == NULL) { fprintf(stderr, "allocation failed for hx\n"); exit(1); }
double (*hy)[NY][NZ] = calloc(NX, sizeof(*hy));
if (hy == NULL) { fprintf(stderr, "allocation failed for hy\n"); exit(1); }
double (*hz)[NY][NZ] = calloc(NX, sizeof(*hz));
if (hz == NULL) { fprintf(stderr, "allocation failed for hz\n"); exit(1); }
hx[0][0][0] = 1.;
hz[0][0][0] = 1.;
// probably many more initializations missing
double t;
FILE *file;
file = fopen("point A hm=0.csv", "w"); /* Output data file name */
if (file == NULL) { fprintf(stderr, "cannot create output file\n"); exit(1); }
t = 0.;
for (l = 0; l < nstop; l++) {
for (i = 0; i < NX - 1; i++) {
for (j = 1; j < NY - 1; j++) {
for (k = 1; k < NZ - 1; k++) {
ex[i][j][k] = 2 * ey[i][j][k]
+ 3 * (hz[i][j][k] - hx[i][j - 1][k])
- 5 * (hy[i][j][k] - 2 * hz[i][j][k - 1]);
}
}
}
comp = ((double)(l + 1) / nstop) * 100.;
printf("Computation: %4.3f %% completed \r", comp);
fprintf(file, "%e, %e \n", t * 1e6, (-ex[1200][950][20] + ex[1170][950][20]) / 2.);
t = t + dt;
}
fclose(file);
free(ex);
free(ey);
free(ez);
free(hx);
free(hy);
free(hz);
return 0;
}
There are several options. If you need to allocate the entire memory structure at once, you probably want to allocate for a pointer-to-pointer-to-array of int[500] (int (**)[500]) rather than for a pointer-to-pointer-to-pointer to int (int ***), though both are technically correct.
(note: I used int in the example, so just change the type of a to double to satisfy your needs)
To approach the allocation for a pointer-to-pointer-to-array int[500], start with your pointer and allocate 1500 pointers, e.g.
#define Z 500
#define X 1500
#define Y X
int main (void) {
int (**a)[Z] = NULL; /* pointer to pointer to array of int[500] */
if (!(a = malloc (X * sizeof *a))) { /* allocate X pointers to (*)[Z] */
perror ("malloc-X (**)[Z]");
return 1;
}
At this point you have 1500 pointers-to-array-of int[500]. You can loop over each pointer allocated above, allocating 1500 * sizeof(int[500]) for each and assigning the starting address of each block to one of the pointers, e.g.
for (int i = 0; i < X; i++) /* for each pointer */
if (!(a[i] = malloc (Y * sizeof **a))) { /* alloc Y * sizeof int[Z] */
perror ("malloc-YZ (*)[Z]");
return 1;
}
Now you can address each integer in your allocation as a[x][y][z]. Then to free the allocated memory, you just free() in the reverse order you allocated, e.g.
for (int i = 0; i < X; i++)
free (a[i]); /* free allocated blocks */
free (a); /* free pointers */
A short example that exercises this and writes a value to each index could be:
#include <stdio.h>
#include <stdlib.h>
#define Z 500
#define X 1500
#define Y X
int main (void) {
int (**a)[Z] = NULL; /* pointer to pointer to array of int[500] */
if (!(a = malloc (X * sizeof *a))) { /* allocate X pointers to (*)[Z] */
perror ("malloc-X (**)[Z]");
return 1;
}
puts ("pointers allocated");
for (int i = 0; i < X; i++) /* for each pointer */
if (!(a[i] = malloc (Y * sizeof **a))) { /* alloc Y * sizeof int[Z] */
perror ("malloc-YZ (*)[Z]");
return 1;
}
puts ("all allocated");
for (int i = 0; i < X; i++) /* set mem to prevent optimize out */
for (int j = 0; j < Y; j++)
for (int k = 0; k < Z; k++)
a[i][j][k] = i * j * k;
puts ("freeing memory");
for (int i = 0; i < X; i++)
free (a[i]); /* free allocated blocks */
free (a); /* free pointers */
}
Example Use/Output -- Timed Run
$ time ./bin/malloc_1500x1500x500
pointers allocated
all allocated
freeing memory
real 0m1.481s
user 0m0.649s
sys 0m0.832s
Memory Use/Error Check
That's 4.5G of memory allocated and used (warning: you will swap on 8G or less depending what else you have running if you run valgrind)
$ valgrind ./bin/malloc_1500x1500x500
==7750== Memcheck, a memory error detector
==7750== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7750== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==7750== Command: ./bin/malloc_1500x1500x500
==7750==
pointers allocated
all allocated
freeing memory
==7750==
==7750== HEAP SUMMARY:
==7750== in use at exit: 0 bytes in 0 blocks
==7750== total heap usage: 1,502 allocs, 1,502 frees, 4,500,013,024 bytes allocated
==7750==
==7750== All heap blocks were freed -- no leaks are possible
==7750==
==7750== For counts of detected and suppressed errors, rerun with: -v
==7750== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Look things over and let me know if you have questions.
In C (as your code seems to be), you can for example use a triple pointer and malloc():
#define NX 1501
#define NY 1501
#define NZ 501
double*** p_a = malloc(sizeof(double**) * NX);
for (int i = 0; i < NX; i++)
{
p_a[i] = malloc(sizeof(double*) * NY);
for (int j = 0; j < NY; j++)
p_a[i][j] = malloc(sizeof(double) * NZ);
}
A more efficient way is to use a single pointer and pass the product of all three dimension sizes to one malloc() call:
double* p_a = malloc(sizeof(*p_a) * (NX * NY * NZ));
In C++, the most common and efficient way is to use a std::vector for dynamically allocating an array:
#define NX 1501
#define NY 1501
#define NZ 501
std::vector<std::vector<std::vector<double>>> a(NX, std::vector<std::vector<double>>(NY, std::vector<double>(NZ)));
Note that on most modern platforms a double object is 8 bytes. That means each 3D array needs at least 8 * 1500 * 1500 * 500 = 9,000,000,000 bytes, about 8.4 GB, to allocate. You define 6 of them, so roughly 50 GB would be required just for those arrays, which your system cannot provide since, as you said, it only has 32 GB available.
2D array initialization:
....
int main (...) {
....
double **hes = allocArray (2, 2);
// function (....) returning double
hes[0][0] = function (3, &_X, &_Y, _usrvar_a, _usrvar_b);
hes[0][1] = function (4, &_X, &_Y, _usrvar_a, _usrvar_b);
hes[1][0] = function (4, &_X, &_Y, _usrvar_a, _usrvar_b);
hes[1][1] = function (5, &_X, &_Y, _usrvar_a, _usrvar_b);
....
return 0;
}
double **allocArray (int row, int col) {
double **ptr = new double*[row];
for (int i = 0; i < row; i++)
{
ptr[i] = new double[col];
}
return ptr;
}
Values of 2d double type array is:
12 2
2 14
I know this because I have traversed it with indices (i, j):
void otherFunc (double **h, ....) {
....
for (int i = 0; i < 2; i++)
for (int j = 0; j < 2; j++)
std::cout << " " << h[i][j];
....
}
Output is
12 2 2 14
(I do not need to separate the rows of 2D array in output, do not write about that)
I want to traverse it with a pointer:
void otherFunc (double **h, ....) {
....
for (double *ptr = h[0]; ptr <= &h[1][1]; ptr++)
std::cout << " " << *ptr;
....
}
Output is:
12 2 0 1.63042e-322 2 14
Why 0 and 1.63042e-322 appeared here?
h[0] and h[1] in your run are not adjacent in memory:
in your specific run, h[1] happens to point four doubles after h[0].
This layout is effectively arbitrary: as far as we can tell from your question, you didn't explicitly control the relative positions of h[0] and h[1]. In that case, the next time you run your code h[1] could even be lower than h[0], and walking a single pointer across the two separate allocations is undefined behavior.
What you probably want is something of this kind: allocate four doubles in a single block and assign the address of the first to a pointer, double* hh = malloc(4 * sizeof(double)); (or new double[4] in C++). Then, for the variable h, which is a pointer to pointer (double* h[2];), assign the row pointers as follows:
h[0] = hh;
h[1] = hh+2;
of course there are safer ways to do this. But this could be a good start.
I've written a function to calculate the correlation matrix for variables (risks) held in a flat file structure, i.e. RiskID | Year | Amount.
I have written the function because the library routines that I can find necessitate a matrix input. That is, RiskID as 2nd dimension and year as the 1st dimension - with amounts as actual array values. The matrix needs to be complete, in that zero values must be included also and hence for sparsely populated non zero data - this leads to wasted iterations which can be bypassed. The routine relies upon the data being sorted first by Year (asc) then by RiskID (asc)
I have written the routine in C++ (for speed) to be compiled as a dll and referenced in VB.NET. I need to pass 3 arrays (one each for each of the headers) and return a 2 dimensional array back to VB.NET. I guess I'm cheating by passing 3 individual 1d arrays instead of a 2d array but there you go. I'll post the full C++ routine as others may find it useful if seeking to do something similar. I'd be surprised if this hasn't been done before - but I just can't find it.
I lack the interop knowledge to implement this properly and am getting nowhere googling around. As far as I can work out, I may need to use SAFEARRAY?
Or is there a quick fix to this problem? Or is SAFEARRAY a piece of cake? Either way, an example would be very helpful.
Also, as a side note, I suspect the memory management is failing somewhere.
Here is the Visual C++ (VS2013)
Header File
#ifndef CorrelLib_EXPORTS
#define CorrelLib_API __declspec(dllexport)
#else
#define CorrelLib_API __declspec(dllimport)
#endif
// Returns correlation matrix for values in flat file
extern "C" CorrelLib_API double** __stdcall CalcMatrix(int* Risk, int* Year, double* Loss, const int& RowNo, const int& RiskNo, const int& NoSimYear);
CPP File
#include "stdafx.h"
#include "CorrelLib.h"
#include <memory>
#include <ctime>
using namespace std;
extern "C" CorrelLib_API double** __stdcall CalcMatrix(int* Risk, int* Year, double* Loss, const int& RowNo, const int& RiskNo, const int& NoSimYear)
{
int a, b;
int i, j, k;
int YearCount, MissingYears;
int RowTrack;
//Relies on Year and Risk being sorted in ascending order in those respective orders Year asc, Risk asc
double *RiskTrack = new double[RiskNo](); //array of pointers?
int *RiskTrackBool = new int[RiskNo](); //() sets inital values to zero
double *RiskAvg = new double[RiskNo]();
double *RiskSD = new double[RiskNo]();
//Create 2d array to hold results 'array of pointers to 1D arrays of doubles'
double** Res = new double*[RiskNo];
for (i = 0; i < RiskNo; ++i)
{
Res[i] = new double[RiskNo](); //()sets initial values to zero
}
//calculate average
for (i = 0; i < RowNo; i++)
{
a = Risk[i];
RiskAvg[a] = RiskAvg[a] + Loss[i];
}
for (i = 0; i < RiskNo; i++)
{
RiskAvg[i] = RiskAvg[i] / NoSimYear;
}
//Enter Main Loop
YearCount = 0;
i = 0; //start at first row
do {
YearCount = YearCount + 1;
a = Risk[i];
RiskTrack[a] = Loss[i] - RiskAvg[a];
RiskTrackBool[a] = 1;
j = i + 1;
do
{
if (Year[j] != Year[i])
{
break;
}
b = (int)Risk[j];
RiskTrack[b] = Loss[j] - RiskAvg[b];
RiskTrackBool[b] = 1;
j = j + 1;
} while (j < RowNo);
RowTrack = j;
//check through RiskTrack and if no entry set to 0 - avg
for (j = 0; j < RiskNo; j++)
{
if (RiskTrackBool[j] == 0)
{
RiskTrack[j] = -1.0 * RiskAvg[j];
RiskTrackBool[j] = 1;
}
}
//Now loop through and perform calcs
for (j = 0; j < RiskNo; j++)
{
//SD
RiskSD[j] = RiskSD[j] + RiskTrack[j] * RiskTrack[j];
//Covar
for (k = j + 1; k < RiskNo; k++)
{
Res[j][k] = Res[j][k] + RiskTrack[j] * RiskTrack[k];
}
}
//Reset RiskTrack
for (k = 0; k<RiskNo; k++)
{
RiskTrack[k] = 0.0;
RiskTrackBool[k] = 0;
}
i = RowTrack;
} while (i < RowNo);
//Account For Missing Years
MissingYears = NoSimYear - YearCount;
for (i = 0; i < RiskNo; i++)
{
//SD
RiskSD[i] = RiskSD[i] + MissingYears * RiskAvg[i] * RiskAvg[i];
//Covar
for (j = i + 1; j < RiskNo; j++)
{
Res[i][j] = Res[i][j] + MissingYears * RiskAvg[i] * RiskAvg[j];
}
}
//Covariance Matrix
for (i = 0; i < RiskNo; i++)
{
//SD
RiskSD[i] = sqrt(RiskSD[i] / (NoSimYear - 1));
if (RiskSD[i] == 0.0)
{
RiskSD[i] = 1.0;
}
//Covar
for (j = i + 1; j < RiskNo; j++)
{
Res[i][j] = Res[i][j] / (NoSimYear - 1);
}
}
//Correlation Matrix
for (i = 0; i < RiskNo; i++)
{
Res[i][i] = 1.0;
for (j = i + 1; j < RiskNo; j++)
{
Res[i][j] = Res[i][j] / (RiskSD[i] * RiskSD[j]);
}
}
//Clean up
delete[] RiskTrack;
delete[] RiskTrackBool;
delete[] RiskAvg;
delete[] RiskSD;
//Return Array
return Res;
}
Def File
LIBRARY CorrelLib
EXPORTS
CalcMatrix
VB.NET
I've created a simple winform with a button which triggers the code below. I wish to link to the dll, pass the arrays and receive the result as a 2d array.
Imports System
Imports System.Runtime.InteropServices
Public Class Form1
<DllImport("CorrelLib.dll", EntryPoint:="CalcMatrix", CallingConvention:=CallingConvention.StdCall)> _
Public Shared Function CorrelMatrix2(ByRef Risk_FE As Integer, ByRef Year_FE As Integer, ByRef Loss_FE As Double, _
ByRef RowNo As Long, ByRef RiskNo As Long, ByRef NoSimYear As Long) As Double(,)
End Function
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim i As Integer, j As Integer
Dim Risk() As Long, Year() As Long, Loss() As Double
Dim NoRisks As Long, NoSimYear As Long, NoRows As Long
Dim counter As Long
Dim Result(,) As Double
NoRisks = 50
NoSimYear = 10000
NoRows = NoRisks * NoSimYear
ReDim Risk(0 To NoRows - 1), Year(0 To NoRows - 1), Loss(0 To NoRows - 1)
counter = 0
For i = 1 To NoSimYear
For j = 1 To NoRisks
Risk(counter) = j
Year(counter) = i
Loss(counter) = CDbl(Math.Floor((1000000 - 1 + 1) * Rnd())) + 1
counter = counter + 1
Next j
Next i
Dim dllDirectory As String = "C:\Users\Documents\Visual Studio 2013\Projects\CorrelLibTestForm"
Environment.SetEnvironmentVariable("PATH", Environment.GetEnvironmentVariable("PATH") + ";" + dllDirectory)
Result = CorrelMatrix2(Risk(1), Year(1), Loss(1), NoRows, NoRisks, NoSimYear)
End Sub
End Class
Current Error Message
An unhandled exception of type 'System.Runtime.InteropServices.MarshalDirectiveException' occurred in CorrelLibTestForm.exe
Additional information: Cannot marshal 'return value': Invalid managed/unmanaged type combination.
A double** (pointer to pointer) is not the same as a two-dimensional array in VB. Your best bet is to return just a pointer:
double *pdbl;
pdbl = &res[0][0];
return pdbl; //pdbl points to the first element
In VB you use an IntPtr to receive the pointer:
Dim Result As IntPtr
Dim dbl As Double
Result = CorrelMatrix2(Risk(1), Year(1), Loss(1), NoRows, NoRisks, NoSimYear)
' dereference the double pointer; i (an Integer) is the index into the array of doubles
dbl = CType(Marshal.PtrToStructure(IntPtr.Add(Result, i * 8), GetType(Double)), Double)
Your Res array in the C++ function is allocated on the heap, so the memory remains valid after the function returns; just make sure the caller eventually frees it (for example via another exported function) to avoid a leak.
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 64 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative, because NaN and Inf values may slow things down. The example code now clears the memory. It makes no difference in execution speed for me, though.
Couple of ideas:
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
I think I've managed to optimize it. Here's the code: create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
d_b_ic = d_b;
d_w_ic = d_w;
d_e_ic = d_e;
d_x_ic = d_x;
d_x_iw = d_x;
d_x_ie = d_x;
d_x_is = d_x;
d_x_in = d_x;
d_n_ic = d_n;
d_s_ic = d_s;
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3)
printf("app.exe (x)iters (o/n)algo\n");
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), diff);
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it:
You've been using index counters to access and manipulate the arrays; I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembly window, where you can see the generated opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the CPU to put the pointer x in a register (say EAX), then add the offset (3 * sizeof(int)) to it, and only then store the value 123.
The pointer approach is better because it cuts out that repeated address computation; we advance the pointers ourselves, and so can optimize as needed.
I hope this helps.
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem you often benefit from memory locality. Right now the b, w, e, s, n arrays are all in separate memory locations. If the problem did not fit in the L3 cache (or mostly in L2), that would hurt, and a solution of this sort would help:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d[ic].b
- d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
- d[ic].s * d_x[is] - d[ic].n * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d = new D[n]; memset(d,0,n * sizeof(D));
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being the bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is and in indices can land far apart in the matrix, so I'm guessing there are a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. That way all updates within a sweep are independent and can be parallelized.
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float dw_ic, db_ic, de_ic, dn_ic, ds_ic;
float dx_iw, dx_is, dx_ie, dx_in;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
// Perform the prefetch
// Sorting these statements by array may increase speed;
// although sorting by index name may increase speed too.
db_ic = d_b[ic];
dw_ic = d_w[ic];
dx_iw = d_x[iw];
de_ic = d_e[ic];
dx_ie = d_x[ie];
ds_ic = d_s[ic];
dx_is = d_x[is];
dn_ic = d_n[ic];
dx_in = d_x[in];
// Calculate
d_x[ic] = db_ic
- dw_ic * dx_iw - de_ic * dx_ie
- ds_ic * dx_is - dn_ic * dx_in;
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
This differs from your second method in that the values are copied into local temporary variables before the calculation is performed.