On memory allocation in Armadillo sparse matrices - C++

I want to know whether I need to free the memory occupied by the locations and values objects after the sparse matrix has been created. Here is the code:
#include <armadillo>
using namespace arma;

void load_data(umat& locations, vec& values)
{
    // fill up locations and values
}

int main(int argc, char** argv)
{
    umat loc;
    vec val;
    load_data(loc, val);
    sp_mat X(loc, val);
    return 0;
}
In the above code, load_data() fills up the locations and values objects, and the sparse matrix is then created in main(). My question: do I need to free the memory used by locations and values after the construction of X? The reason is that X could be large and I am on low RAM. I know that when main returns, the OS will free locations and values as well as X. But the real question is whether the memory occupied by X is the same as that occupied by locations and values, or whether X is allocated memory separately, in which case I need to free locations and values.

The constructor (SpMat_meat.hpp:231) you are using
template<typename T1, typename T2>
inline SpMat(const Base<uword,T1>& locations, const Base<eT,T2>& values, const bool sort_locations = true);
fills the sparse matrix with copies of the values in values.
I understand you are worried about running out of memory: if you keep loc, val and X around at the same time, you essentially have two copies of the same data, taking up twice as much memory as actually needed (this is indeed what happens in your code snippet). So I will focus on addressing this problem and give you a few options:
1) If you are fine with keeping two copies of the data for a short while, the easiest solution is to dynamically allocate loc and val and delete them right after the initialization of X:
int main(int argc, char** argv)
{
    umat* ploc = new umat;
    vec*  pval = new vec;
    load_data(*ploc, *pval);
    // at this point we have one copy of the data
    sp_mat X(*ploc, *pval);
    // at this point we have two copies of the data
    delete ploc;
    delete pval;
    // at this point we have one copy of the data
    return 0;
}
Of course you can use smart pointers instead of C-style ones, but you get the idea.
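For instance, a minimal sketch of the same idea with std::unique_ptr (assuming the load_data from your question is visible; requires C++14 for std::make_unique), so the temporaries are released deterministically without an explicit delete:
#include <armadillo>
#include <memory>

using namespace arma;

int main(int argc, char** argv)
{
    auto ploc = std::make_unique<umat>();
    auto pval = std::make_unique<vec>();
    load_data(*ploc, *pval);  // one copy of the data
    sp_mat X(*ploc, *pval);   // two copies of the data
    ploc.reset();             // back to one copy
    pval.reset();
    return 0;
}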
2) If you absolutely don't want to have two copies of the data at any time, I would suggest that you modify your load_data routine to load the values sequentially, one by one, and insert them directly into X:
void load_data(sp_mat& X)
{
    // read the entries one by one and insert them directly into X
}
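A minimal sketch of what such element-wise loading could look like (read_next_entry is a hypothetical stand-in for your actual data source; note that element-wise insertion into an sp_mat is convenient but slower than the batch constructor):
bool read_next_entry(arma::uword& i, arma::uword& j, double& v); // hypothetical reader

void load_data(arma::sp_mat& X)
{
    arma::uword i, j;
    double v;
    while (read_next_entry(i, j, v))
    {
        X(i, j) = v; // inserts a single nonzero element into the sparse matrix
    }
}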
Other options would be to i) use move semantics to move the values in val directly into X, or ii) directly use the memory allocated for val as the memory for X, similar to the advanced constructor for dense matrices:
Mat(eT* aux_mem, const uword aux_n_rows, const uword aux_n_cols, const bool copy_aux_mem = true, const bool strict = false)
Both options would, however, require modifications at the level of the Armadillo library, as such functionality is not yet supplied for sparse matrices (there is only a plain move constructor so far). It would be a good idea to request these features from the developers, though!

Related

What is the proper way to preallocate memory for a function return value that is called many times in a loop?

I'm trying to improve my C++ code and my coding style.
I want to implement this function, which is called multiple times in a loop.
class C {
    double alpha = 0.1;

    std::valarray<double> f(std::valarray<double> const & arr) // called many times in a loop
    {
        return arr * alpha;
    }
};
The array passed in is quite large, and every time f returns it allocates a brand-new array for the return value, which really slows down my code.
I've tried to fix this by preallocating a return value for it in the enclosing class, as soon as the size of the arrays becomes known during execution:
class C {
    double alpha = 0.1;
    std::valarray<double> f_retval;

    void f(std::valarray<double> const & arr) // called many times in a loop
    {
        f_retval = arr * alpha;
    }

    void allocateMembers(int Nx) // known size of the arrays used in the class
    {
        f_retval = std::valarray<double>(Nx);
    }
};
But there must be a better way to do this. Any suggestions?
You could return the result through a non-const reference parameter, so the storage is preallocated outside of the member function.
class C {
    double alpha = 0.1;

    void f(std::valarray<double> const & arr, std::valarray<double>& result) // called many times in a loop
    {
        result = arr * alpha;
    }
};
The caller would then need to create their own preallocated result variable, but then they could reuse that variable during repeated calls to f.
std::valarray<double> f_retval = std::valarray<double>(Nx);
while (/*some condition*/) {
    myC.f(toModify, f_retval);
    // do something with f_retval
}
The advantages that this has over the solution you suggested include:
- the return-by-reference is more obvious to the user of the member function
- the member function's functionality is more contained (it doesn't require two methods to execute), which also avoids bugs caused by improper usage
- the class itself is less complex
The only potential drawback I can see with return-by-reference is that calling the method requires an extra variable declaration.
The first step to speeding this up is eliminating the memory allocations for every call to f. This requires having a valarray variable that can be reused. This can either be a member of class C or passed in as a reference parameter.
However, because the valarray multiplication operator always allocates a new valarray, there will still be a memory allocation on each call. If performance is critical, you need to roll your own multiplication loop that stores the result into the reusable array (possibly resizing it to the correct size, which is essential on the first call).
In addition to not allocating new memory, this can possibly provide extra benefits from cache usage, since the memory is reused and will likely already be in the CPU data cache.
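A minimal sketch of such a hand-rolled loop, combining it with the pass-by-reference interface from the answer above (the resize call is an assumption about how you want to handle the first call):
#include <cstddef>
#include <valarray>

class C {
    double alpha = 0.1;

public:
    // called many times in a loop; writes into caller-owned storage
    void f(std::valarray<double> const& arr, std::valarray<double>& result)
    {
        if (result.size() != arr.size())
            result.resize(arr.size());      // allocates only when the size changes
        for (std::size_t i = 0; i < arr.size(); ++i)
            result[i] = arr[i] * alpha;     // no temporary valarray is created
    }
};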

How can I make my dynamic array or vector operate at a similar speed to a standard array? C++

I'm still quite inexperienced in C++ and I'm trying to write summation code to add numbers precisely. This is a DLL plugin for some finite-difference software, and the code is called several million times during a run. I want to write a function where any number of arguments can be passed in and the sum is returned. My code looks like:
#include <cstdarg>
#include <vector>
using std::vector;

double SumFunction(int numArgs, ...){ // this allows me to pass any number
                                      // of arguments to my function.
    va_list args;
    va_start(args, numArgs); // necessary prerequisites for using cstdarg

    double myarray[10];
    for (int i = 0; i < numArgs; i++) {
        myarray[i] = va_arg(args, double);
    } // I imagine this is sloppy code; however I cannot create
      // myarray[numArgs] because numArgs is not a const int.
    sum(myarray); // The actual method of addition is not relevant here, but
                  // for more complicated methods, I need to put the summation
                  // terms in a list.

    vector<double> vec(numArgs); // instead, place all values in a vector
    for (int i = 0; i < numArgs; i++) {
        vec.at(i) = va_arg(args, double);
    }
    sum(vec); // This would be passed by reference, of course. The function sum
              // doesn't actually exist; it would all be contained within the
              // current function. This method is twice as slow as placing
              // all the values in the static array.

    double *dvec = new double[numArgs];
    for (int i = 0; i < numArgs; i++) {
        dvec[i] = va_arg(args, double);
    }
    sum(dvec); // Again, half the speed of using a standard array, and
               // increasing in magnitude for every extra dynamic array!
    delete[] dvec;

    va_end(args);
}
So the problem I have is that using an oversized static array is sloppy programming, but using either a vector or a dynamic array slows the program down considerably. So I really don't know what to do. Can anyone help, please?
One way to speed the code up (at the cost of making it more complicated) is to reuse a dynamic array or vector between calls, then you will avoid incurring the overhead of memory allocation and deallocation each time you call the function.
For example declare these variables outside your function either as global variables or as member variables inside some class. I'll just make them globals for ease of explanation:
double* sumArray = NULL;
int sumArraySize = 0;
In your SumFunction, check if the array exists and if not allocate it, and resize if necessary:
double SumFunction(int numArgs, ...){ // this allows me to pass any number
                                      // of arguments to my function.
    va_list args;
    va_start(args, numArgs); // necessary prerequisites for using cstdarg

    // if the array has already been allocated, check if it is large enough and delete if not:
    if((sumArray != NULL) && (numArgs > sumArraySize))
    {
        delete[] sumArray;
        sumArray = NULL;
    }

    // allocate the array, but only if necessary:
    if(sumArray == NULL)
    {
        sumArray = new double[numArgs];
        sumArraySize = numArgs;
    }

    double *vec = sumArray; // set to your array, reusable between calls
    for (int i = 0; i < numArgs; i++) {
        vec[i] = va_arg(args, double);
    }
    sum(vec, numArgs); // you will need to pass the array size
    va_end(args);
    // note no array deallocation
}
The catch is that you need to remember to deallocate the array at some point by calling a function similar to this (like I said, you pay for speed with extra complexity):
void freeSumArray()
{
    if(sumArray != NULL)
    {
        delete[] sumArray;
        sumArray = NULL;
        sumArraySize = 0;
    }
}
You can take a similar (and simpler/cleaner) approach with a vector: allocate it the first time if it doesn't already exist, or call resize() on it with numArgs if it does.
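A minimal sketch of that vector variant, under the same assumptions as the code above (sum(ptr, n) is the hypothetical summation routine):
#include <cstdarg>
#include <vector>

double sum(const double* p, int n); // hypothetical, as above

std::vector<double> sumVec; // reused between calls; grows but never shrinks

double SumFunction(int numArgs, ...)
{
    va_list args;
    va_start(args, numArgs);
    sumVec.resize(numArgs); // reallocates only when numArgs exceeds the capacity
    for (int i = 0; i < numArgs; i++) {
        sumVec[i] = va_arg(args, double);
    }
    va_end(args);
    return sum(sumVec.data(), numArgs);
}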
When using a std::vector, the optimizer must consider that reallocation is possible, and this introduces an extra indirection.
In other words the code for
v[index] += value;
where v is for example a std::vector<int> is expanded to
int *p = v._begin + index;
*p += value;
i.e. from the vector you first need to fetch the field _begin (which stores where the content starts in memory), then apply the index, and then dereference to get the value and mutate it.
If the code performing the computation on the elements of the vector in a loop calls any unknown non-inlined code, the optimizer is forced to assume that unknown code may mutate the _begin field of the vector and this will require doing the two-steps indirection for each element.
(NOTE: the fact that the vector is passed by a const std::vector<T>& reference is totally irrelevant: a const reference doesn't mean that the vector is const, but simply limits what operations are permitted through that reference; external code could hold a non-const reference to the vector, and constness can also be legally cast away... constness of references is basically ignored by the optimizer).
One way to remove this extra lookup (if you know that the vector is not being resized during the computation) is to cache this address in a local and use that instead of the vector operator [] to access the element:
int *p = &v[0];
for (int i = 0, n = v.size(); i < n; i++) {
    // use p[i] instead of v[i]
}
This will generate code that is almost as efficient as a static array because, given that the address of p is not published, nothing in the body of the loop can change it, and the value of p can be assumed constant (something that cannot be done for v._begin, as the optimizer cannot know whether someone else knows the address of _begin).
I'm saying "almost" because a static array only requires indexing, while using a dynamically allocated area requires "base + indexing" access; most CPUs however provide this kind of memory access at no extra cost. Moreover if you're processing elements in sequence the indexing addressing becomes just a sequential memory access but only if you can assume the start address constant (i.e. not in the case of std::vector<T>::operator[]).
Assuming that the "max storage ever needed" is in the order of 10-50, I'd say using a local array is perfectly fine.
Using vector<T> will use 3 * sizeof(T*) (at least) to track the contents of the vector. So if we compare that to an array of double arr[10];, the array occupies 7 more elements' worth of stack space (or 8.5 in a 32-bit build). But the vector also needs a call to new, which takes a size argument. So that takes up AT LEAST one, more likely 2-3, elements' worth of stack space, and the implementation of new is quite possibly not straightforward, so further calls are needed, which take up further stack space.
If you "don't know" the number of elements, and need to cope with quite large numbers of elements, then using a hybrid solution, where you have a small stack-based local array, and if numargs > small_size use vector, and then pass vec.data() to the function sum.

workaround for Eigen::Matrix to release data

I want to use Eigen3 on data coming from another library. An earlier answer by ggael indicates a way for Eigen::Matrix to adopt preexisting data with the new keyword. However, this is not sufficient for me, because the resulting Matrix still seems to acquire ownership of the data, meaning that it will free the data when going out of scope. To wit, this is a crasher if data eventually gets deleted by the library it comes from:
void crasher(double* data, size_t dim)
{
    MatrixXd m;
    new (&m) Map<MatrixXd>(data, dim, dim); // matrix adopts the passed data
    m.setRandom(); cout << m << endl;       // manipulate the passed data via the Matrix interface
} // data deleted by m => potential problem in the scope of the function call
I have come up with two workarounds:
void nonCrasher1(double* data, size_t dim)
{
    MatrixXd m; // semantically, a non-owning matrix
    const Map<const MatrixXd> cache(m.data(), 0, 0); // cache the original "data" (in a const-correct way)
    new (&m) Map<MatrixXd>(data, dim, dim); // matrix adopts the passed data
    m.setRandom(); cout << m << endl;       // manipulate the passed data via the Matrix interface
    new (&m) Map<const MatrixXd>(cache);    // re-adopt the original data
} // original data deleted by m
This is rather inconvenient because of the presence of cache. The other is free from this problem:
void nonCrasher2(double* data, size_t dim) // no need for caching
{
    MatrixXd m; // semantically, a non-owning matrix
    new (&m) Map<MatrixXd>(data, dim, dim); // matrix adopts the passed data
    m.setRandom(); cout << m << endl;       // manipulate the passed data via the Matrix interface
    new (&m) Map<MatrixXd>(nullptr, 0, 0);  // adopt nullptr for subsequent deletion
} // nullptr "deleted" by m (what happens with the original data [if any]?)
However, here it's unclear what happens with the original data of m (if any; all this is not completely clear from the Eigen3 documentation).
My question is whether there is a canonical way for Eigen::Matrix to release ownership of its data (self-allocated or adopted).
In nonCrasher2, nothing happens to data, except that it is filled with random values on purpose.
However, this still looks hackish; the clean approach would be to use a Map object instead of a MatrixXd:
Map<MatrixXd> m(data, dim, dim);
m.setRandom();
If you need m to be a MatrixXd because you need to call functions taking MatrixXd objects, and such functions cannot be templated, then you might consider generalizing these functions to take Ref<MatrixXd> objects. By default, a Ref<MatrixXd> can accept any expression whose storage resembles that of a MatrixXd, with an arbitrary leading dimension. It was introduced in Eigen 3.2; check the doc for more details.
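For illustration, a minimal sketch of such a generalized function (scale_in_place is an illustrative name, not from your code):
#include <Eigen/Dense>
#include <cstddef>

using namespace Eigen;

// accepts MatrixXd, Map<MatrixXd>, blocks, etc., without copying
void scale_in_place(Ref<MatrixXd> m, double factor)
{
    m *= factor;
}

void process(double* data, std::size_t dim)
{
    Map<MatrixXd> m(data, dim, dim); // non-owning view of the external data
    scale_in_place(m, 2.0);          // operates on the library's buffer directly
}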

creating large 2d array of size int arr[1000000][1000000]

I want to create a two-dimensional integer array of 10⁶ × 10⁶ elements. For this I'm using the Boost library:
boost::multi_array<int, 2> x(boost::extents[1000000][1000000]);
But it throws the following exception:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Please tell me how to solve the problem.
You seriously don't want to allocate an array that huge. It's about 4 terabytes in memory.
Depending on what you want to do with that array you should consider two options:
1) External data structure. The array is written to a hard drive, and the most recently accessed parts are also kept in RAM, so depending on your access pattern it can be pretty fast, though of course never as fast as if it were fully in RAM. Have a look at STXXL for external data structures.
This method has the advantage that you can access all of the elements in the array (in contrast to the second method, as you'll see). However, the problem still remains: 4 terabytes is huge even on a hard drive, at least if we are talking about a general desktop application.
2) Sparse data structure. If you actually need only a couple of items from that array, but you want to address those items in a space of size 10⁶ × 10⁶, don't use an array but something like a map, or a combination of both: allocate the array in blocks of, say, 1024 × 1024 elements, and put these blocks into a map keyed by the block index (the coordinate divided by 1024).
This method has the advantage that you don't have to link against another library, since you can easily write it yourself. However, it has the disadvantage that if you access elements distributed over the whole 10⁶ × 10⁶ coordinate space, or even need all of the values, it also uses around 4 TB (even a bit more) of memory. It only works if you actually access a small part of this huge "virtual" array.
The following (untested) C++ code should demonstrate this:
#include <array>
#include <map>

class Sparse2DArray
{
    struct Coord {
        int x, y;
        Coord(int x, int y) : x(x), y(y) {}
        bool operator<(const Coord &o) const { return x < o.x || (x == o.x && y < o.y); } // required for std::map
    };

    static const int BLOCKSIZE = 1024;
    std::map<Coord, std::array<std::array<int, BLOCKSIZE>, BLOCKSIZE>> blocks;

    static Coord block(Coord c) {
        return Coord(c.x / BLOCKSIZE, c.y / BLOCKSIZE);
    }
    static Coord blockSubCoord(Coord c) {
        return Coord(c.x % BLOCKSIZE, c.y % BLOCKSIZE);
    }

public:
    int & operator()(int x, int y) { // operator[] cannot take two arguments before C++23
        Coord c(x, y);
        Coord b = block(c);
        Coord s = blockSubCoord(c);
        return blocks[b][s.x][s.y];
    }
};
Instead of a std::map you can also use a std::unordered_map (hash map), but then you have to define a hash function for the Coord type instead of operator< (or use std::pair instead).
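A minimal sketch of what that could look like (a standalone Coord for illustration; the bit-mixing scheme is just one plausible choice, and unordered_map additionally requires operator==):
#include <cstdint>
#include <functional>
#include <unordered_map>

struct Coord {
    int x, y;
    bool operator==(const Coord &o) const { return x == o.x && y == o.y; }
};

struct CoordHash {
    std::size_t operator()(const Coord &c) const {
        // pack both 32-bit coordinates into one 64-bit value and hash that
        std::uint64_t k = (std::uint64_t(std::uint32_t(c.x)) << 32) | std::uint32_t(c.y);
        return std::hash<std::uint64_t>()(k);
    }
};

// usage, with Block being the nested std::array type from the class above:
// std::unordered_map<Coord, Block, CoordHash> blocks;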
When you declare an array that way (as a plain int arr[1000000][1000000]), it is created on the stack, and the stack has a limited size. Your program therefore crashes because it doesn't have enough room to allocate that big an array.
There are two ways you can solve this. You can create the array on the heap using the new keyword, but you have to delete it afterwards or you have a memory leak, and be careful: while the heap allows much larger allocations than the stack, it is still finite.
The other way is to use a std::vector inside a std::vector and let it handle the memory for you.
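A minimal sketch of the nested-vector version; note that at the full 10⁶ × 10⁶ size this still needs roughly 4 TB, so it only helps for sizes that actually fit in RAM:
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t rows = 1000, cols = 1000; // a size that actually fits
    std::vector<std::vector<int>> x(rows, std::vector<int>(cols, 0)); // heap-allocated
    x[5][7] = 42; // element access works like a built-in 2D array
    return 0;
}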
What do you intend by creating a 10⁶ × 10⁶ matrix? If you're trying to create a sparse matrix (i.e. a diffusion matrix for a heat-transfer problem with 10⁶ finite elements), then you should look at using an existing linear algebra library. For example, the Trilinos project has support for solving large sparse systems like the one you may be trying to create.

How can I make my char buffer more performant?

I have to read a lot of data into:
vector<char>
A 3rd party library reads this data in many turns. In each turn it calls my callback function whose signature is like this:
CallbackFun ( int CBMsgFileItemID,
unsigned long CBtag,
void* CBuserInfo,
int CBdataSize,
void* CBdataBuffer,
int CBisFirst,
int CBisLast )
{
...
}
Currently I have implemented a buffer container using an STL container, where my methods insert() and getBuff() insert a new buffer and retrieve the stored buffer. But I still want better-performing code, so that I can minimize allocations and deallocations:
#include <cstring>
#include <iostream>
#include <vector>

template<typename T1>
class buffContainer
{
private:
    class atomBuff
    {
    private:
        atomBuff(const atomBuff& arObj);
        atomBuff operator=(const atomBuff& arObj);
    public:
        int len;
        char *buffPtr;
        atomBuff() : len(0), buffPtr(NULL)
        {}
        ~atomBuff()
        {
            if(buffPtr != NULL)
                delete [] buffPtr;
        }
    };
public:
    buffContainer() : _totalLen(0) {}
    void insert(const char *aPtr, const unsigned long &aLen);
    unsigned long getBuff(T1 &arOutObj);
private:
    std::vector<atomBuff*> moleculeBuff;
    int _totalLen;
};
template<typename T1>
void buffContainer<T1>::insert(const char *aPtr, const unsigned long &aLen)
{
    if(aPtr==NULL,aLen<=0)
        return;
    atomBuff *obj = new atomBuff();
    obj->len = aLen;
    obj->buffPtr = new char[aLen];
    memcpy(obj->buffPtr, aPtr, aLen);
    _totalLen += aLen;
    moleculeBuff.push_back(obj);
}
template<typename T1>
unsigned long buffContainer<T1>::getBuff(T1 &arOutObj)
{
    std::cout << "Total length of data is: " << _totalLen << std::endl;
    if(_totalLen == 0)
        return _totalLen;
    // Note : Logic pending for case size(T1) > T2::Value_Type
    int noOfObjRqd = _totalLen / sizeof(typename T1::value_type);
    arOutObj.resize(noOfObjRqd);
    char *ptr = (char*)(&arOutObj[0]);
    for(typename std::vector<atomBuff*>::const_iterator itr = moleculeBuff.begin(); itr != moleculeBuff.end(); ++itr)
    {
        memcpy(ptr, (*itr)->buffPtr, (*itr)->len);
        ptr += (*itr)->len;
    }
    std::cout << arOutObj.size() << std::endl;
    return _totalLen;
}
How can I make this more performant?
If my wild guess about your callback function makes sense, you don't need anything more than a vector:
std::vector<char> foo;
foo.reserve(MAGIC); // this is the important part. Reserve the right amount here
                    // and you don't have any reallocs.
setup_callback_fun(CallbackFun, &foo);

void CallbackFun( int           CBMsgFileItemID,
                  unsigned long CBtag,
                  void*         CBuserInfo,
                  int           CBdataSize,
                  void*         CBdataBuffer,
                  int           CBisFirst,
                  int           CBisLast )
{
    std::vector<char>* pFoo = static_cast<std::vector<char>*>(CBuserInfo);
    char* data = static_cast<char*>(CBdataBuffer);
    pFoo->insert(pFoo->end(), data, data + CBdataSize);
}
Depending on how you plan to use the result, you might try putting the incoming data into a rope data structure instead of a vector, especially if the strings you expect to come in are very large. Appending to the rope is very fast, but subsequent char-by-char traversal is slower by a constant factor. The tradeoff might or might not work out for you; I don't know what you need to do with the result.
EDIT: I see from your comment that this is no option, then. I don't think you can do much better in the general case, when the size of the incoming data is totally arbitrary. Otherwise you could try to initially reserve enough space in the vector so that the data fits without a reallocation, or with at most one in the average case.
One thing I noticed about your code:
if(aPtr==NULL,aLen<=0)
I think you mean
if(aPtr==NULL || aLen<=0)
The main thing you can do is avoid doing quite so much copying of the data. Right now, when insert() is called, you copy the data into your buffer. Then, when getBuff() is called, you copy the data out to a buffer they've (hopefully) specified. So, to get data from outside to them, you copy each byte twice.
This part:
arOutObj.resize(noOfObjRqd);
char *ptr=(char*)(&arOutObj[0]);
seems to assume that arOutObj is really a vector. If so, it would be a whole lot better to rewrite getBuff as a normal function taking a (reference to a) vector, instead of a template that really only works for one type of parameter.
From there, it becomes a fairly simple matter to completely eliminate one copy of the data. In insert(), instead of manually allocating memory and tracking the size, put the data directly into a vector. Then, when getbuff() is called, instead of copying the data into their buffer, just give then a reference to your existing vector.
class buffContainer {
    std::vector<char> moleculeBuff;
public:
    void insert(char const *p, unsigned long len) {
        // Edit: here you really want to reserve first:
        moleculeBuff.reserve(moleculeBuff.size() + len);
        std::copy(p, p + len, std::back_inserter(moleculeBuff));
    }
    void getbuff(std::vector<char> &output) {
        output = moleculeBuff;
    }
};
Note that I've changed the result of getbuff to void: since you're giving them a vector, its size is known, and there's no point in returning the size. In reality, you might want to change the signature a bit further and just return the buffer:
std::vector<char> getbuff() {
    std::vector<char> temp;
    temp.swap(moleculeBuff);
    return temp;
}
Since it's returning a (potentially large) vector by value, this depends heavily on your compiler implementing the named return value optimization (NRVO), but 1) the worst case is that it does about what you were doing before anyway, and 2) virtually all reasonably current compilers DO implement NRVO.
This also addresses one other detail your original code didn't seem to. As it was, getBuff returns some data, but if you call it again, it apparently doesn't keep track of which data has already been returned, so it returns it all again. It keeps allocating data but never deletes any of it. That's what the swap is for: it creates an empty vector and swaps it with the one maintained by buffContainer, so buffContainer now holds an empty vector, and the filled one is handed over to whatever called getbuff().
Another way to do things would be to take the swap a step further. Basically, you have two buffers:
- one owned by buffContainer
- one owned by whatever calls getbuff()
In the normal course of things, we can probably expect that the buffer sizes will quickly reach some maximum size. From there on, we'd really like to simply re-cycle that space: read some data into one, pass it to be processed, and while that's happening, read data into the other.
As it happens, that's pretty easy to do too. Change getbuff() to look something like this:
void getbuff(std::vector<char> &output) {
    swap(moleculeBuff, output);
    moleculeBuff.clear();
}
}
This should improve speed quite a bit -- instead of copying data back and forth, it just swaps one vector's pointer to the data with the others (along with a couple other details like the current allocation size, and used size of the vector). The clear is normally really fast -- for a vector (or any type without a dtor) it'll just set the number of items in the vector to zero (if the items have dtors, it has to destroy them, of course). From there, the next time insert() is called, the new data will just be copied into the memory the vector already owns (until/unless it needs more space than the vector had allocated).