I am working with Armadillo and there seems to be some weird memory management in my program.
I need to solve a matrix system recursively, and for this I call the following function in a for-loop:
void get_T(arma::cx_vec &T, arma::mat M0, arma::vec Q, int dim, int nmem){
int n;
n = pow(2 * dim, nmem);
arma::cx_mat M(n, n);
arma::cx_mat P(n, n);
arma::cx_mat InvP(n, n);
arma::vec U(n);
for (int i = 0; i < n; i++)
{
U(i) = 1;
for (int j = 0; j < n; j++)
{
M(i, j) = std::complex<double>(0, 0);
P(i, j) = std::complex<double>(0, 0);
InvP(i, j) = std::complex<double>(0, 0);
}
}
get_M(M, M0, Q, dim, nmem);
P = arma::eye(n, n) - M;
InvP = P.i();
T = InvP * U;
}
I checked the overall RSS memory taken by the entire program, and it seems that the step involving P.i() increases the amount of memory used (which makes sense), but the memory is not freed when the program exits the get_T function. So the overall memory keeps increasing as the for-loop continues, which in the end leads to a huge amount of memory being required. How can I fix this? I read that Armadillo cleans up memory every time it exits a function, but it does not seem to do so here.
Thanks for helping!
I have a very straightforward function that counts how many inner entries of an N by N 2D matrix (represented by a pointer arr) are below a certain threshold, and updates a counter below_threshold that is passed by reference:
void count(float *arr, const int N, const float threshold, int &below_threshold) {
below_threshold = 0; // make sure it is reset
bool comparison;
float temp;
#pragma omp parallel for shared(arr, N, threshold) private(temp, comparison) reduction(+:below_threshold)
for (int i = 1; i < N-1; i++) // count only the inner N-2 rows
{
for (int j = 1; j < N-1; j++) // count only the inner N-2 columns
{
temp = *(arr + i*N + j);
comparison = (temp < threshold);
below_threshold += comparison;
}
}
}
When I do not use OpenMP, it runs fine (thus, the allocation and initialization were done correctly already).
When I use OpenMP with an N that is less than around 40000, it runs fine.
However, once I start using a larger N with OpenMP, it keeps giving me a segmentation fault (I am currently testing with N = 50000 and would like to eventually get it up to ~100000).
Is there something wrong with this at a software level?
P.S. The allocation was done dynamically ( float *arr = new float [N*N] ), and here is the code used to randomly initialize the entire matrix, which didn't have any issues with OpenMP with large N:
void initialize(float *arr, const int N)
{
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
*(arr + i*N + j) = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
}
}
}
UPDATE:
I have tried changing i, j, and N to long long int, and it still has not fixed my segmentation fault. If this was the issue, why has it already worked without OpenMP? It is only once I add #pragma omp ... that it fails.
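I think this is because the largest index, 50000 * 50000 = 2500000000, exceeds INT_MAX (2147483647). The offset i*N + j is computed in int, so it overflows, and signed integer overflow is undefined behaviour in C++; the resulting array access can land anywhere.
So you should compute the index in a wider type, such as std::size_t or long long, or whichever type suits your use case.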
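As a minimal sketch of the index arithmetic described above, assuming the row-major N-by-N layout from the question: the helper names flat_index and count_below are hypothetical, and OpenMP is left out to keep the example small. The only point is that every offset is computed in std::size_t. The counter is also widened, because (N-2)*(N-2) can itself exceed INT_MAX for N = 50000. Whether this is the only cause of the crash is not established by the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major offset of element (i, j) in an N-by-N matrix.
// All arithmetic is done in std::size_t, so i * N cannot overflow a 32-bit int.
std::size_t flat_index(std::size_t i, std::size_t j, std::size_t N) {
    assert(i < N && j < N);
    return i * N + j;
}

// Count the inner entries (first and last row/column excluded) that are
// below the threshold. The result is long long because the number of inner
// entries can be larger than INT_MAX.
long long count_below(const float* arr, std::size_t N, float threshold) {
    long long below = 0;
    for (std::size_t i = 1; i + 1 < N; ++i)
        for (std::size_t j = 1; j + 1 < N; ++j)
            below += (arr[flat_index(i, j, N)] < threshold) ? 1 : 0;
    return below;
}
```

With N = 50000, flat_index(N - 1, N - 1, N) is 2499999999, which fits in std::size_t but not in a 32-bit int.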
I am filling an Eigen matrix with the following code:
int M = 3;
int N = 4;
MatrixXd A(M, N);
double res = sin(4);
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
A(i, j) = sin(i+j);
}
}
In Matlab I only need 1 for loop to do the same thing using vectorization:
M = 3;
N = 4;
N_Vec = 0:(N-1);
A = zeros(M,N);
for i=1:M
A(i,:) = sin((i-1)+N_Vec);
end
Is it possible to do something similar in C++/Eigen so that I can get rid of one of the for loops? If it is possible to somehow get rid of both for loops that would be even better. Is that possible?
Using a NullaryExpr you can do this with zero (manual) loops in Eigen:
Eigen::MatrixXd A = Eigen::MatrixXd::NullaryExpr(M, N,
[](Eigen::Index i, Eigen::Index j) {return std::sin(i+j);});
When compiled with optimization this is not necessarily faster than the manual two-loop version (and without optimization it could even be slower).
You can write int or long instead of Eigen::Index, if that is more readable ...
I seem to be having some trouble getting this mergesort to run. When I try to run it with g++ the terminal says "Segmentation fault (core dumped)," and I don't know what is causing this to happen (you might be able to tell that I'm still a beginner). Could anybody help out?
#include <iostream>
using namespace std;
void merge (int*, int, int, int);
void mergesort (int* A, int p, int r){
if (p < r){
int q = (p+r)/2;
mergesort (A, p, q);
mergesort (A, q+1, r);
merge ( A, p , q, r);
}
}
void merge (int* A, int p, int q, int r){
int n = q-p+1;
int m = r-q ;
int L [n+1];
int R [m+1];
for (int i=1;i <n+1;i++)
L[i] = A[p+i-1];
for (int j=1; j< m+1; j++)
R[j] = A[q+j];
L[n+1];
R[m+1];
int i= 1;
int j=1;
for (int k = p; k= r + 1; k++){
if (L[i] <= R[j]){
A[k] = L[i];
i+=1;
}
else{
j += 1;
}
}
}
int main() {
int A [15] = {1, 5, 6, 7,3, 4,8,2,3,6};
mergesort (A, 0, 9);
for (int i=0; i <9; i++){
cout << A[i] << endl;
}
return 0;
}
Thanks a lot!
There are three things in your implementation that either don't make sense or are outright wrong:
First these:
L[n+1];
R[m+1];
Neither of these statements has any effect at all, and I have no idea what you're trying to do.
Next, a significant bug:
for (int k = p; k= r + 1; k++){
The conditional clause of this for-loop is the assignment k = r + 1. Since r does not change anywhere within your loop, the only way that expression is false is if r == -1, which it never is. You've just created an infinite loop: k is reset to r + 1 before every pass, while i and j keep growing without bound, so L[i] and R[j] are eventually read from memory that is not valid in your process (and every write goes to A[r + 1], outside the range being merged). This is undefined behavior. I'm fairly sure you wanted this:
for (int k = p; k< (r + 1); k++){
though I can't comment on whether that is a valid limit, since I've not dissected your algorithm or debugged it any further. That I leave to you.
Edit: in your main mergesort, this is not "wrong", but it is susceptible to overflow for large indices:
int q = (p+r)/2;
Consider this instead:
int q = p + (r-p)/2;
And not least this:
int L [n+1];
int R [m+1];
These use the variable-length array extension, which is not part of standard C++. You may want to use std::vector<int> L(n+1), etc., instead.
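Putting the points above together, here is one possible sketch, not necessarily what the asker intended. It swaps the 1-based, sentinel-style buffers for 0-based std::vector copies with explicit bounds checks, and uses the overflow-safe midpoint. The function names match the question; everything else is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge the sorted ranges A[p..q] and A[q+1..r] (all bounds inclusive).
void merge(int* A, int p, int q, int r) {
    assert(p <= q && q < r);
    // Copy both halves into vectors sized at run time (no VLA extension).
    std::vector<int> L(A + p, A + q + 1);
    std::vector<int> R(A + q + 1, A + r + 1);

    std::size_t i = 0, j = 0;
    int k = p;
    // Take the smaller head element until one half runs out.
    while (i < L.size() && j < R.size())
        A[k++] = (L[i] <= R[j]) ? L[i++] : R[j++];
    // Copy whatever remains; only one of these loops does any work.
    while (i < L.size()) A[k++] = L[i++];
    while (j < R.size()) A[k++] = R[j++];
}

void mergesort(int* A, int p, int r) {
    if (p < r) {
        int q = p + (r - p) / 2;  // midpoint without computing p + r
        mergesort(A, p, q);
        mergesort(A, q + 1, r);
        merge(A, p, q, r);
    }
}
```

Called as mergesort(A, 0, 9) on the ten values from the question, this sorts A[0..9] in place. Note that the printing loop in the question's main stops at i < 9 and therefore skips the last element.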
In your case the segmentation fault is most likely caused by reading or writing memory that lies outside an array. For example, if you have an array foo of size 10 (declared as foo[10]), its valid indices are 0 through 9, so accessing foo[10] or foo[11] is out of bounds and may cause a segmentation fault.
What you need to do is use debug statements to print out your index variables (i, j, n, m, p and q) and see whether any of them exceed your array sizes.
EDIT: Another, unrelated issue is that you should not use using namespace std. This line can cause name clashes if you are not careful; just something to keep in mind :)
I'm stuck at an impasse with this implementation. My n2 variable is being overwritten during the merging of the subarrays. What could be causing this? I have tried hard-coding values, but it does not seem to work.
#include <iostream>
#include <cstdlib>
#include <ctime> // For time(), time(0) returns the integer number of seconds from the system clock
#include <iomanip>
#include <algorithm>
#include <cmath>//added last nite 3/18/12 1:14am
using namespace std;
int size = 0;
void Merge(int A[], int p, int q, int r)
{
int i,
j,
k,
n1 = q - p + 1,
n2 = r - q;
int L[5], R[5];
for(i = 0; i < n1; i++)
L[i] = A[i];
for(j = 0; j < n2; j++)
R[j] = A[q + j + 1];
for(k = 0, i = 0, j = 0; i < n1 && j < n2; k++)//for(k = p,i = j = 1; k <= r; k++)
{
if(L[i] <= R[j])//if(L[i] <= R[j])
{
A[k] = L[i++];
} else {
A[k] = R[j++];
}
}
}
void Merge_Sort(int A[], int p, int r)
{
if(p < r)
{
int q = 0;
q = (p + r) / 2;
Merge_Sort(A, p, q);
Merge_Sort(A, q+1, r);
Merge(A, p, q, r);
}
}
void main()
{
int p = 1,
A[8];
for (int i = 0;i < 8;i++) {
A[i] = rand();
}
for(int l = 0;l < 8;l++)
{
cout<<A[l]<<" \n";
}
cout<<"Enter the amount you wish to absorb from host array\n\n";
cin>>size;
cout<<"\n";
int r = size; //new addition
Merge_Sort(A, p, size - 1);
for(int kl = 0;kl < size;kl++)
{
cout<<A[kl]<<" \n";
}
}
What tools are you using to compile the program? There are some flags which switch on checks for this sort of thing, e.g. in gcc (-fmudflap; I haven't used it, but it looks potentially useful. Note that it was removed in GCC 4.9, where -fsanitize=address is the usual replacement).
If you can use a debugger (e.g. gdb) you should be able to add a 'data watch' for the variable n2, and the debugger will stop the program whenever it detects anything writing into n2. That should help you track down the bug. Or try valgrind.
A simple technique to temporarily stop this type of bug is to put some dummy variables around the one getting trashed, so:
int dummy1[100];
int n2 = r - q;
int dummy2[100];
int L[5], R[5];
Variables being trashed are usually caused by code writing beyond the bounds of arrays.
The culprit is likely R[5] because that is likely the closest. You can look in the dummies to see what is being written, and may be able to deduce from that what is happening.
Another option is to make all arrays huge while you track down the problem. Again, set values beyond the correct bounds to a known value, and check that those values remain unchanged.
You could make a little macro to do those checks, and drop it in at any convenient place.
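A rough sketch of the oversized-array idea with such a macro follows. All names here (GuardedBuffer, kGuard, kPad, CHECK_GUARDS) are made up for illustration. The payload and its padding live in a single array, so a small overrun lands in the padding, where it can be detected, rather than in a neighbouring variable such as n2.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kGuard = 0x5A5A5A5A;  // arbitrary, easily recognised fill value
constexpr std::size_t kPad = 8;     // guard slots on each side of the payload

// Hypothetical debugging aid: N usable int slots surrounded by guard slots.
template <std::size_t N>
struct GuardedBuffer {
    int raw[N + 2 * kPad];

    GuardedBuffer() {
        for (int& v : raw) v = kGuard;  // fill payload and guards alike
    }

    // Logical index 0 maps to raw[kPad]; any index in [-kPad, N + kPad)
    // still falls inside raw, so a small overrun is not undefined behaviour.
    int& operator[](long i) { return raw[static_cast<long>(kPad) + i]; }

    // True while every guard slot still holds the fill value.
    bool intact() const {
        for (std::size_t i = 0; i < kPad; ++i)
            if (raw[i] != kGuard || raw[kPad + N + i] != kGuard) return false;
        return true;
    }
};

// The "little macro": drop it in at any convenient place while debugging.
#define CHECK_GUARDS(buf) assert((buf).intact() && "guard slots overwritten: " #buf)
```

In Merge() one would temporarily declare GuardedBuffer<5> L, R; in place of int L[5], R[5]; and put CHECK_GUARDS(L); CHECK_GUARDS(R); after each copy loop. The first assertion that fires points at the loop that overran.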
I used a similar Merge function earlier and it did not work properly. After I redesigned it, it works fine. Below is the redesigned merge function in C++.
void merge(int a[], int p, int q, int r){
int n1 = q-p+1; //no of elements in first half
int n2 = r-q; //no of elements in second half
int i, j, k;
int * b = new int[n1+n2]; //temporary array to store merged elements
i = p;
j = q+1;
k = 0;
while(i<(p+n1) && j < (q+1+n2)){ //merging the two sorted arrays into one
if( a[i] <= a[j]){
b[k++] = a[i++];
}
else
b[k++] = a[j++];
}
if(i >= (p+n1)) //checking first which sorted array is finished
while(k < (n1+n2)) //and then store the remaining element of other
b[k++] = a[j++]; //array at the end of merged array.
else
while(k < (n1+n2))
b[k++] = a[i++];
for(i = p,j=0;i<= r;){ //store the temporary merged array at appropriate
a[i++] = b[j++]; //location in main array.
}
delete [] b;
}
I hope it helps.
void Merge(int A[], int p, int q, int r)
{
int i,
j,
k,
n1 = q - p + 1,
n2 = r - q;
int L[5], R[5];
for(i = 0; i < n1; i++)
L[i] = A[i];
You only allocate L[5], but the n1 bound you're using is based on inputs q and p -- and the caller is allowed to call the function with values of q and p that allow writing outside the bounds of L[]. This can manifest itself as over-writing any other automatic variables, but because it is undefined behavior, just about anything could happen. (Including security vulnerabilities.)
I do not know what the best approach to fix this is -- I don't understand why you've got fixed-length buffers in Merge(), I haven't read closely enough to discover why -- but you should not access L[i] when i is greater than or equal to 5.
This entire conversation also holds for R[]. And, since *A is passed to Merge(), it'd make sense to ensure that your array accesses for it are also always in bound. (I haven't spotted them going out of bounds, but since this code needs re-working anyway, I'm not sure it's worth my looking for them carefully.)
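One way to make the bounds issue above visible is sketched here, with hypothetical helper names copy_range and is_out_of_range. The temporary buffer is sized from the same p and q that drive the copy loop, and it is accessed through std::vector::at(), which throws std::out_of_range instead of silently overwriting a neighbouring variable. The copy reads from A[p + i] on the assumption that the left half is meant to start at p; the question's loop reads from A[i].

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copy A[p..q] (inclusive) into a buffer whose size is derived from the
// same bounds, so the buffer can never be smaller than the range copied.
std::vector<int> copy_range(const int A[], int p, int q) {
    assert(0 <= p && p <= q);
    std::vector<int> out(static_cast<std::size_t>(q - p + 1));
    for (std::size_t i = 0; i < out.size(); ++i)
        out.at(i) = A[static_cast<std::size_t>(p) + i];
    return out;
}

// Returns true if index i is outside buf. at() reports the problem as an
// exception rather than corrupting memory.
bool is_out_of_range(const std::vector<int>& buf, std::size_t i) {
    try {
        (void)buf.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a five-element buffer, index 4 is fine and index 5 is reported as out of range, which is exactly the access the fixed-size L[5] cannot catch.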
I'm performing matrix multiplication with this simple algorithm. To be more flexible, I used objects for the matrices, which contain dynamically created arrays.
Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int j = 0; j < a.dim(); j++) {
int sum = 0;
for (int k = 0; k < a.dim(); k++)
sum += a(i,k) * b(k,j);
c(i,j) = sum;
}
return c;
}
EDIT
I corrected my question above! I added the full source code below and tried some of your advice:
swapped k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passing arguments by const reference -> performance loss! why? so I don't use it.
The performance is now nearly the same as it was in the old program. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3
EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algorithm with block size 13 (block).
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int k = 0; k < a.dim(); k++)
for (int j = 0; j < a.dim(); j++) // swapped order
c(i,j) += a(i,k) * b(k,j);
return c;
}
That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.
Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple cache lines, you'd be better off with a block access pattern instead of a full row at-a-time.
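For reference, a block access pattern could look roughly like the sketch below. It assumes square, row-major matrices stored in std::vector<int> rather than the asker's matrix class, and the function name mult_blocked and the block parameter are made up for illustration. The best block size depends on the cache and has to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked (tiled) multiply of two n-by-n row-major matrices: c = a * b.
// Each (ii, kk, jj) step works on block-by-block tiles, so the parts of
// a, b and c being touched are more likely to stay in cache.
std::vector<int> mult_blocked(const std::vector<int>& a, const std::vector<int>& b,
                              std::size_t n, std::size_t block) {
    assert(a.size() == n * n && b.size() == n * n && block > 0);
    std::vector<int> c(n * n, 0);
    for (std::size_t ii = 0; ii < n; ii += block) {
        const std::size_t i_end = std::min(ii + block, n);
        for (std::size_t kk = 0; kk < n; kk += block) {
            const std::size_t k_end = std::min(kk + block, n);
            for (std::size_t jj = 0; jj < n; jj += block) {
                const std::size_t j_end = std::min(jj + block, n);
                // Same i-k-j order as the swapped loop, restricted to one tile.
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const int aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
    return c;
}
```

The std::min calls handle matrix sizes that are not a multiple of the block size, so any positive block value gives the same result as the plain triple loop.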
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE, VMX, or whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.
Use SIMD if you can. If you do extensive vector math on a platform that supports it, you really need to use something like the VMX registers; otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Have the caller provide a matrix reference for your function to fill in, rather than returning a matrix by value. Compilers typically optimize the return efficiently, but in some cases returning by value may result in an expensive copy operation.
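A minimal sketch of the last three tips combined follows. The Matrix class here is a hypothetical stand-in, since the asker's real class is only linked from the question, and mult_into is an invented name. Inputs are taken by const reference, dim() is read once, and the result goes into a matrix supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the asker's class: square, row-major, int elements.
class Matrix {
public:
    explicit Matrix(int dim)
        : dim_(dim), data_(static_cast<std::size_t>(dim) * dim, 0) {}
    int dim() const { return dim_; }
    int& operator()(int r, int c) {
        return data_[static_cast<std::size_t>(r) * dim_ + c];
    }
    int operator()(int r, int c) const {
        return data_[static_cast<std::size_t>(r) * dim_ + c];
    }
private:
    int dim_;
    std::vector<int> data_;
};

// out = a * b. No matrix is copied, and dim() is not called inside the loops.
// out must be a different object from a and b, because it is overwritten
// while a and b are still being read.
void mult_into(const Matrix& a, const Matrix& b, Matrix& out) {
    const int n = a.dim();  // cached once
    assert(b.dim() == n && out.dim() == n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a(i, k) * b(k, j);
            out(i, j) = sum;
        }
}
```

The caller constructs the result once, for example Matrix c(a.dim()); mult_into(a, b, c);, and can reuse it across repeated multiplications instead of allocating a new matrix each time.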
Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock's code since it spares some increments.
#include <cstring> // for memset
static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
memset( dest, 0x0, dim * dim * sizeof(float) );
for( int i = 0; i < dim; i++ ) {
for( int k = 0; k < dim; k++ )
{
const float* a = srcA + i * dim + k;
const float* b = srcB + k * dim;
float* c = dest + i * dim;
float* cMax = c + dim;
while( c < cMax )
{
*c++ += (*a) * (*b++);
}
}
}
}
Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly yours as this problem has been solved a million times before.
Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer overrun),
I have updated the code to run the tests side by side for a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
// DEFINES -------------------------------------------------------------------
// The original problem was here. MAXDIM was 500, but we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// overwrote the dim variable and reset it to 0. As a result, the
// multiplication loop fell out before it had finished (the loop was
// controlled by this global variable).
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage as the compiler can optimize the
// loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
#define MAXDIM 512
#define RUNS 10
// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
public:
matrix(int dim)
: dim_(dim)
{
data_ = new int[dim_ * dim_];
}
inline int dim() const {
return dim_;
}
inline int& operator()(unsigned row, unsigned col) {
return data_[dim_*row + col];
}
inline int operator()(unsigned row, unsigned col) const {
return data_[dim_*row + col];
}
private:
int dim_;
int* data_;
};
// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++)
for (int c = 0; c < MAXDIM; c++)
matrix[r][c] = rand() % 100;
}
void random_matrix_class(matrix& matrix) {
for (int r = 0; r < matrix.dim(); r++)
for (int c = 0; c < matrix.dim(); c++)
matrix(r, c) = rand() % 100;
}
template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
float time = 0;
for (int i = 0; i < RUNS; i++) {
struct timeval start, end;
gettimeofday(&start, NULL);
f(a,b,c);
gettimeofday(&end, NULL);
long s = start.tv_sec * 1000 + start.tv_usec / 1000;
long e = end.tv_sec * 1000 + end.tv_usec / 1000;
time += e - s;
}
return time / RUNS;
}
// SEQ MULTIPLICATION ----------------------------------------------------------
void mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++) {
for (int c = 0; c < MAXDIM; c++) {
z[r][c] = 0;
for (int i = 0; i < MAXDIM; i++)
z[r][c] += a[r][i] * b[i][c];
}
}
}
void mult_std(matrix const& a, matrix const& b, matrix& z) {
for (int r = 0; r < a.dim(); r++) {
for (int c = 0; c < a.dim(); c++) {
z(r,c) = 0;
for (int i = 0; i < a.dim(); i++)
z(r,c) += a(r,i) * b(i,c);
}
}
}
// MAIN ------------------------------------------------------------------------
using namespace std;
int main(int argc, char* argv[]) {
srand(time(NULL));
int matrix_a[MAXDIM][MAXDIM];
int matrix_b[MAXDIM][MAXDIM];
int matrix_c[MAXDIM][MAXDIM];
random_matrix(matrix_a);
random_matrix(matrix_b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));
matrix a(MAXDIM);
matrix b(MAXDIM);
matrix c(MAXDIM);
random_matrix_class(a);
random_matrix_class(b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_std, a, b, c));
return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.
I'm taking a wild guess here, but if dynamically allocating the matrices makes such a huge difference, maybe the problem is fragmentation. Again, I've no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?
Also, does the dim() method have any extra complexity? I would declare it inline, too.
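To make the "allocate by hand" suggestion concrete, here is one possible layout, sketched with std::vector instead of raw new/delete. The class name ContiguousMatrix is hypothetical. All elements sit in a single contiguous block, and a separate table of row pointers provides the m[r][c] syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one contiguous block for the elements plus a table
// of pointers to the start of each row.
class ContiguousMatrix {
public:
    explicit ContiguousMatrix(std::size_t dim)
        : dim_(dim), data_(dim * dim, 0), rows_(dim) {
        for (std::size_t r = 0; r < dim; ++r)
            rows_[r] = data_.data() + r * dim;
    }
    // A copy would leave rows_ pointing into the source's storage, so
    // copying is disabled in this sketch.
    ContiguousMatrix(const ContiguousMatrix&) = delete;
    ContiguousMatrix& operator=(const ContiguousMatrix&) = delete;

    std::size_t dim() const { return dim_; }
    int* operator[](std::size_t r) {
        assert(r < dim_);
        return rows_[r];
    }
    const int* operator[](std::size_t r) const {
        assert(r < dim_);
        return rows_[r];
    }
private:
    std::size_t dim_;
    std::vector<int> data_;   // all dim * dim elements, row after row
    std::vector<int*> rows_;  // rows_[r] points at element (r, 0)
};
```

Because consecutive rows are adjacent in memory, walking along a row and then into the next one touches consecutive addresses, which is the property separate per-row allocations do not guarantee.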