RcppArmadillo: Issue with memory usage - C++

I have started using Rcpp. I like it a lot. I am fairly new to programming. I have a question regarding memory usage. Below is a reproducible problem:
library(RcppArmadillo)
library(inline)
code <- "
Rcpp::NumericVector input_(input);
arma::cube disturb(input_.begin(), 2, 2, 50000000, false);
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(rnorm(2 * 2 * 50000000), dim = c(2, 2, 50000000))
Test(input)
My understanding is that in the example above, the only real memory use is the array assigned to the variable input in R, which should be around 1.6 GB (2 x 2 x 50,000,000 x 8 bytes = 1.6e9 bytes). When I move to Rcpp, I initialise the variable input_ from the SEXP object, which is just a pointer, so this should not use any additional memory. Then, when I initialise the variable disturb, I again use a pointer and set copy_aux_mem = false, so no copy should be made there either. If my understanding is correct, I should only be using about 1.6 GB when I run the code. Is this correct?
However, when I run the code, the memory usage (judging by the System Monitor in Ubuntu) jumps above 10 GB (from around 1 GB) before falling back down to around 4 GB. I don't understand what is going on. Did I use Rcpp incorrectly?
Your help is appreciated. Many thanks.

Edit after new version of Armadillo (5.300)
After this initial Q/A on StackOverflow, Conrad Sanderson and I engaged in some email discussion about this issue. By design, arma::cube objects create an arma::mat for each slice (the third dimension) of the cube. This is done during the creation of the cube, even when the cube is built on top of existing memory (as in the original question). Since this is not always needed, I suggested there should be an option to disable the pre-allocation of matrices for the slices. As of the current version of Armadillo (5.300.4), there now is. This can be installed from CRAN.
Example code:
library(RcppArmadillo)
library(inline)
code <- "
Rcpp::NumericVector input_(input);
arma::cube disturb(input_.begin(), 2, 2, 50000000, false, true, false);
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(rnorm(2 * 2 * 50000000), dim = c(2, 2, 50000000))
Test(input)
The key thing here is that the cube constructor is now called as arma::cube disturb(input_.begin(), 2, 2, 50000000, false, true, false);. The final false is the new prealloc_mat parameter, which determines whether or not to pre-allocate the matrices for the slices. The slice method still works fine on a cube without pre-allocated matrices - the matrix is simply allocated on demand. However, if you access the mat_ptrs member of such a cube directly, it will be filled with NULL pointers. The help has also been updated.
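As a minimal sketch (assuming the Armadillo 5.300 constructor described above, with its seventh prealloc_mat argument), a cube built without pre-allocated slice matrices can still be used as usual:

#include <armadillo>
#include <vector>

int main() {
    std::vector<double> buf(2 * 2 * 10, 1.0);   // small placeholder data
    // ptr, n_rows, n_cols, n_slices, copy_aux_mem, strict, prealloc_mat
    arma::cube c(buf.data(), 2, 2, 10, false, true, false);
    arma::mat s = c.slice(3);                   // slice matrix is created on demand
    c(0, 0, 0) = 45.0;                          // element access is unaffected
    return 0;
}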
Many thanks to Conrad Sanderson for acting so quickly to provide this additional option, and to Dirk Eddelbuettel for all his work on Rcpp and RcppArmadillo!
Original answer
It's a slightly bizarre one. I've tried with a range of different array sizes, and the problem only occurs with arrays where the 3rd dimension is much bigger than the other 2. Here's a reproducible example:
library("RcppArmadillo")
library("inline")
code <- "
Rcpp::NumericVector input_(input);
IntegerVector dim = input_.attr(\"dim\");
arma::cube disturb(input_.begin(), dim[0], dim[1], dim[2], false);
disturb(0, 0, 0) = 45;
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(0, c(1e7, 2, 2))
Test(input)
# no change in memory usage
dim(input) <- c(2, 1e7, 2)
gc()
Test(input)
# no change in memory usage
dim(input) <- c(2, 2, 1e7)
gc()
Test(input)
# spike in memory usage
dim(input) <- c(20, 2, 1e6)
gc()
Test(input)
# no change in memory usage
This suggests it's something about the way the Armadillo library is implemented (or possibly RcppArmadillo). It certainly doesn't seem to be something you're doing wrong.
Note that I've included some in-place modification of the data (setting the first element to 45); you can confirm that in each case the data is modified in place, which suggests there isn't a copy going on.
For now, I'd suggest if possible organising your 3d arrays such that the largest dimension isn't the third one.
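For illustration, here is a minimal Armadillo-only sketch of that workaround with the long dimension placed first (the sizes and the std::vector standing in for the R array are placeholders; note the data would have to be generated in this layout to begin with, since the subscripts swap roles):

#include <armadillo>
#include <vector>

int main() {
    const arma::uword n = 10000000;       // the "long" dimension, placed first
    std::vector<double> buf(n * 2 * 2);   // stands in for the memory coming from R
    // n rows, 2 columns, 2 slices -> only 2 slices, so negligible per-slice overhead
    arma::cube disturb(buf.data(), n, 2, 2, /*copy_aux_mem*/ false);
    disturb(0, 0, 0) = 45;                // in-place modification still works
    return 0;
}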
EDIT After doing some more digging, it looks as though there is allocation of RAM during the creation of the arma::cube. In Cube_meat.hpp, in the create_mat method, there's the following code:
if(n_slices <= Cube_prealloc::mat_ptrs_size)
  {
  access::rw(mat_ptrs) = const_cast< const Mat<eT>** >(mat_ptrs_local);
  }
else
  {
  access::rw(mat_ptrs) = new(std::nothrow) const Mat<eT>*[n_slices];
  arma_check_bad_alloc( (mat_ptrs == 0), "Cube::create_mat(): out of memory" );
  }
Cube_prealloc::mat_ptrs_size seems to be 4, so it's actually an issue for any array with more than 4 slices.
I posted an issue on GitHub.
EDIT2 However, it's definitely an issue with the underlying Armadillo code. Here's a reproducible example which doesn't use Rcpp at all. This is Linux-only - it uses code from "How to get memory usage at run time in C++?" to pull out the current memory usage of the running process.
#include <iostream>
#include <armadillo>
#include <unistd.h>
#include <ios>
#include <fstream>
#include <string>

//////////////////////////////////////////////////////////////////////////////
//
// process_mem_usage(double &, double &) - takes two doubles by reference,
// attempts to read the system-dependent data for a process' virtual memory
// size and resident set size, and return the results in KB.
//
// On failure, returns 0.0, 0.0
void process_mem_usage(double& vm_usage, double& resident_set)
{
    using std::ios_base;
    using std::ifstream;
    using std::string;

    vm_usage = 0.0;
    resident_set = 0.0;

    // 'file' stat seems to give the most reliable results
    ifstream stat_stream("/proc/self/stat", ios_base::in);

    // dummy vars for leading entries in stat that we don't care about
    string pid, comm, state, ppid, pgrp, session, tty_nr;
    string tpgid, flags, minflt, cminflt, majflt, cmajflt;
    string utime, stime, cutime, cstime, priority, nice;
    string O, itrealvalue, starttime;

    // the two fields we want
    unsigned long vsize;
    long rss;

    stat_stream >> pid >> comm >> state >> ppid >> pgrp >> session >> tty_nr
                >> tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt
                >> utime >> stime >> cutime >> cstime >> priority >> nice
                >> O >> itrealvalue >> starttime >> vsize >> rss; // don't care about the rest
    stat_stream.close();

    long page_size_kb = sysconf(_SC_PAGE_SIZE) / 1024; // in case x86-64 is configured to use 2MB pages
    vm_usage = vsize / 1024.0;
    resident_set = rss * page_size_kb;
}

using namespace std;
using namespace arma;

void test_cube(double* numvec, int dim1, int dim2, int dim3) {
    double vm, rss;

    cout << "Press enter to continue";
    cin.get();

    process_mem_usage(vm, rss);
    cout << "Before:- VM: " << vm << "; RSS: " << rss << endl;
    cout << "cube c1(numvec, " << dim1 << ", " << dim2 << ", " << dim3 << ", false)" << endl;
    cube c1(numvec, dim1, dim2, dim3, false);
    process_mem_usage(vm, rss);
    cout << "After:- VM: " << vm << "; RSS: " << rss << endl << endl;
}

int main(int argc, char** argv)
{
    double* numvec = new double[40000000];

    test_cube(numvec, 10000000, 2, 2);
    test_cube(numvec, 2, 10000000, 2);
    test_cube(numvec, 2, 2, 1000000);
    test_cube(numvec, 2, 2, 2000000);
    test_cube(numvec, 4, 2, 2000000);
    test_cube(numvec, 2, 4, 2000000);
    test_cube(numvec, 4, 4, 2000000);
    test_cube(numvec, 2, 2, 10000000);

    cout << "Press enter to finish";
    cin.get();
    return 0;
}
EDIT 3 Per the create_mat code above, an arma::mat is created for each slice of a cube. On my 64-bit machine, this results in 184 bytes of overhead for each slice. For a cube with 5e7 slices, that equates to 8.6 GiB of overhead even though the underlying numeric data only takes up 1.5 GiB. I've emailed Conrad Sanderson to ask if this is fundamental to the way that Armadillo works or could be changed, but for now it definitely seems that you want your slice dimension (the third one) to be the smallest of the three if at all possible. It's also worth noting that this applies to all cubes, not just those created from existing memory. Using the arma::cube(dim1, dim2, dim3) constructor leads to the same memory usage.
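A quick back-of-the-envelope check of those numbers, using the roughly 184 bytes of per-slice overhead measured above:

#include <cstdio>

int main() {
    const long long n_slices  = 50000000LL;             // 5e7 slices, as in the question
    const long long per_slice = 184;                     // approx. per-slice arma::mat overhead (64-bit build)
    const long long data      = 2LL * 2 * n_slices * 8;  // the actual doubles: 2 x 2 x 5e7 x 8 bytes
    std::printf("data    : %.2f GiB\n", data / (1024.0 * 1024 * 1024));                 // ~1.5 GiB
    std::printf("overhead: %.2f GiB\n", n_slices * per_slice / (1024.0 * 1024 * 1024)); // ~8.6 GiB
    return 0;
}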

Related

the dimension and read problem when using DPC++ image arrays

This problem really confuses me. I want to use DPC++ to read a group of images, so sycl::image was used; my code is below.
#define N 4   // dimension
#define M 128 // dimension
#define C 4   // 4 channels
#define L 2   // 2 images
int* host_array3_2 = malloc_host<int>(N * M * C * L, Q);
image im3(host_array3_2, image_channel_order::rgba, image_channel_type::unsigned_int32, range{ M, N, L }); // the image format
The kernel code is as follows; I use an accessor with the image_array target to read the data:
Q.submit([&](handler &h) {
    auto out = sycl::stream(1024, 1024 * 2, h);
    accessor<int4, 2, access::mode::read, access::target::image_array> acs3(im3, h); // the accessor format
    h.parallel_for(nd_range{ range{ M, N, L }, range{ N, N, L } }, [=](nd_item<3> it) {
        int idx = it.get_global_linear_id();
        if (idx == 0) {
            // confusing here:
            out << acs3.get_count() << " " << acs3.get_range() << " \n";
            // const auto &ss = acs3[0];  // no compile error
            // confusing here: ss.read(int2(0, 1)); gives compiler errors:
            // "array subscript out of range", "SYCL kernel cannot call a variadic function"
        }
    });
});
In addition to the read problem, I found that the reported range is {128,4,4}.
Why is the third dimension 4? Shouldn't it be the value of L (2)?
It also seems that the third dimension depends only on the second dimension, no matter what L is. Can anybody explain this?
An interesting phenomenon: under the Release compile mode it works, and the output is also different. That's strange.

Why are my matrix inversions so slow with LAPACKE in C++? MAGMA alternative and setup

I am using LAPACK to invert a matrix. I pass the matrices by reference, i.e. I work on their addresses. Below is the function, with an input matrix and an output matrix both passed by reference.
The issue is that I am obliged to convert F_matrix into a 1D array, and I think this wastes performance at runtime: how could I get rid of this extra conversion step, which I suspect is costly if I call matrix_inverse_lapack many times?
Here is the function concerned:
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

    // Index for loop and arrays
    int i, j, ip, idx;

    // Size of F_matrix
    int N = F_matrix.size();

    int *IPIV = new int[N];

    // Statement of main array to inverse
    double *arr = new double[N*N];

    // Output Diagonal block
    double *diag = new double[N];

    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++){
            idx = i*N + j;
            arr[idx] = F_matrix[i][j];
        }
    }

    // LAPACKE routines
    int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
    int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++){
            idx = i*N + j;
            F_output[i][j] = arr[idx];
        }
    }

    delete[] IPIV;
    delete[] arr;
}
For example, I call it this way:
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The inversion performance is not what I expected; I think this is due to the 2D -> 1D conversion that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my macOS Big Sur 11.3, but I am having a lot of difficulty setting it up.
I have an AMD Radeon Pro 5600M graphics card, and the OpenCL framework that ships with Big Sur is already installed (maybe I am wrong in saying that). Could anyone describe the procedure to follow for installing MAGMA? I saw that a MAGMA software package exists at http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want: I just need the SDK (headers and libraries), if possible built for my GPU card. I have already installed the Intel oneAPI SDK on my macOS. Maybe I could link it to a MAGMA installation.
I saw another link, https://icl.utk.edu/magma/software/index.html, where MAGMA seems to be public: there is no link between it and the non-free version above, is there?
First of all, let me complain that OP did not provide all the necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant.
Last, but not least, OP didn't check the status codes returned by the Lapack functions for possible errors, and this could also be important for correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>

using Matrix = std::vector<std::vector<double>>;

std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
    const auto size = std::min<int>(10, v.size());
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            out << v[i][j] << "\t";
        }
        if (size < std::ssize(v)) out << "...";
        out << "\n";
    }
    return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
                           std::vector<double> &matrix_buffer)
{
    // std::cout << F_matrix << "\n";
    auto t0 = std::chrono::steady_clock::now();

    const int N = F_matrix.size();

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            auto idx = i * N + j;
            matrix_buffer[idx] = F_matrix[i][j];
        }
    }

    auto t1 = std::chrono::steady_clock::now();

    // LAPACKE routines
    int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
    int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());

    auto t2 = std::chrono::steady_clock::now();

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            auto idx = i * N + j;
            F_output[i][j] = matrix_buffer[idx];
        }
    }

    auto t3 = std::chrono::steady_clock::now();

    auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
    auto lapack_time = std::chrono::duration<double>(t2 - t1).count();

    // std::cout << F_output << "\n";
    std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
              << "\n";
    std::cout << "whole function: " << whole_fun_time << "\n";
    std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
    std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
    const int M = 5; // number of test repetitions
    const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
    std::cout << "Matrix size = " << N << "\n";

    std::vector<int> IPIV_buffer(N);
    std::vector<double> matrix_buffer(N * N);

    // Test matrix_inverse_lapack M times
    for (int i = 0; i < M; i++)
    {
        Matrix CO_CL(N);
        for (auto &v : CO_CL) v.resize(N);

        int idx = 1;
        for (auto &v : CO_CL)
        {
            for (auto &x : v)
            {
                x = idx + 1.0 / idx;
                idx++;
            }
        }
        matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
    }
}
Here, operator<< is an overkill, but may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines), and ensuring that the code is correct is more important than measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed, headers + binaries, for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to avoid repeatedly invoking the memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and guess what value of N OP had in mind. I added some timer readings for benchmarking as well. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N^3, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and on the order of 10 microseconds for N = 10 (see the example output below).
I assume an overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case an overhead at or below 1% is certainly acceptable.
So what could OP (or anyone with a similar problem) have done wrong to obtain "unacceptable overhead conversion values"? Here's my short list:
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular, and Lapack returns quite quickly with a non-zero status. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to the Lapack functions can take 10 times longer than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can badly affect the measurements for small N.
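One simple mitigation is to make an untimed warm-up call before averaging the timed repetitions. A tiny sketch (the helper name and structure are purely illustrative):

#include <chrono>

// Runs f once untimed (warm-up), then returns the average time of 'reps' timed calls.
template <typename F>
double benchmark(F&& f, int reps = 5)
{
    f();  // warm-up call, not measured
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / reps;
}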
4. What else can be done?
OP appears to believe that his approach to 2D arrays is good, and that Lapack is strange and old-fashioned in its packing of a 2D array into a 1D array. No. It is Lapack that is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from the others. Thus, a 100-by-100 matrix may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compaction of the data into a single, contiguous array. This is done to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than the 1-3% he now pays for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member functions (e.g. operator()), or find a library that does just that (a minimal sketch follows this list)
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initialize the pointers with the addresses of the appropriate elements of a preallocated 1D array.
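For example, a minimal sketch of the first option - a row-major matrix backed by one contiguous std::vector<double>, whose data() can be handed to Lapacke directly with no 2D -> 1D conversion (illustrative only, not a full-featured class):

#include <cstddef>
#include <vector>

class Matrix2D {
public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    double& operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    double*     data()       { return data_.data(); }  // pass this to LAPACKE_dgetrf/dgetri
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};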
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
EDIT: Alternative libraries
An alternative approach is to use a library that is known to be highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in, and may vary over time.
As of now (mid-2021), decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.
An easier way (to install and run) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant, it has been developed by a large team of people, including Jack Dongarra, and it should also be rather easy to compile locally, as it is provided with a CMake script.
An example of how much a parallel, multicore CPU-based implementation can outperform a single-threaded implementation of the LU decomposition can be found, for example, here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).

How to explicitly get linear indices from arrayfire?

Suppose I have a std::array<float, 24> foo which is the linearized STL counterpart of a column-major arrayfire array, e.g. af::array bar = af::array(4, 3, 2, 1, f32);. So I have an af::dim4 object dims with the dimensions of bar, I have up to 4 af::seq objects, and I have the linearized array foo.
How can I explicitly get the indices of foo (i.e. the linearized version of bar) representing, e.g., the 2nd and 3rd rows, i.e. bar(af::seq(1,2), af::span, af::span, af::span)? A small code example is given below, which shows what I want. At the end I also explain why I want this.
af::dim4 bigDims = af::dim4(4, 3, 2);
std::array<float, 24> foo;    // Resides in RAM and is big
float* selBuffer_ptr;         // Necessary for AF correct type autodetection
std::vector<float> selBuffer;
// Load some data into foo

af::array selection;          // Resides in VRAM and is small
af::seq selRows = af::seq(1, 2);
af::seq selCols = af::seq(bigDims[1]);   // Emulates af::span
af::seq selSlices = af::seq(bigDims[2]); // Emulates af::span
af::dim4 selDims = af::dim4(selRows.size, selCols.size, selSlices.size);
dim_t* linIndices;

// Magic functionality getting linear indices of the selection
// selRows x selCols x selSlices
// Assign all indexed elements to a consecutive memory region in selBuffer
// I know their positions within the full dataset, b/c I know the selection ranges.

selBuffer_ptr = &selBuffer[0];
selection = af::array(selDims, selBuffer_ptr); // Copies just the selection to the device (e.g. GPU)
// Do sth. with selection and be happy
// I don't need to write back into the foo array.
Arrayfire must have such logic implemented in order to access elements, and I found several related classes/functions such as af::index, af::seqToDims, af::gen_indexing, af::array::operator() - however, I couldn't figure out an easy way to use them yet.
I thought about essentially reimplementing operator() so that it would work similarly but not require a reference to an array object. But this might be wasted effort if there is an easy way in the arrayfire framework.
Background:
The reason I want to do this is that arrayfire does not allow storing data only in main memory (CPU context) while being linked against a GPU backend. Since I have a big chunk of data that needs to be processed only piece by piece, and the VRAM is quite limited, I'd like to instantiate af::array objects ad hoc from an STL container that always resides in main memory.
Of course I know that I could program some index magic to work around my problem, but the af::seq objects I'd like to use are quite complicated, which could make an efficient implementation of the index logic difficult.
After a discussion with Pavan Yalamanchili on Gitter, I managed to get a working piece of code that I want to share, in case anybody else needs to hold their variables only in RAM and copy parts of them on use to VRAM, i.e. into the ArrayFire universe (when linked against the OpenCL or CUDA backend).
This solution will also help anybody who is already using AF elsewhere in their project and who wants a convenient way of accessing a big linearized N-dimensional array (N <= 4).
// Compile as: g++ -lafopencl malloc2.cpp && ./a.out
#include <stdio.h>
#include <arrayfire.h>
#include <af/util.h>
#include <cstdlib>
#include <iostream>

#define M 3
#define N 12
#define O 2
#define SIZE M*N*O

int main() {
    int _foo;                          // Dummy variable for pausing the program

    double* a = new double[SIZE];      // Allocate double array on CPU (big dataset!)
    for(long i = 0; i < SIZE; i++)     // Fill with entry numbers for easy debugging
        a[i] = 1. * i + 1;

    std::cin >> _foo;                  // Pause

    std::cout << "Full array: ";
    // Display full array, out of convenience from GPU.
    // Don't use this if "a" is really big, otherwise you'll still copy all the data to the VRAM.
    af::array ar = af::array(M, N, O, a);  // Copy a RAM -> VRAM
    af_print(ar);

    std::cin >> _foo;                  // Pause

    // Select a subset of the full array in terms of af::seq
    af::seq seq0 = af::seq(1,2,1);     // Row 2-3
    af::seq seq1 = af::seq(2,6,2);     // Col 3:5:7
    af::seq seq2 = af::seq(1,1,1);     // Slice 2

    // BEGIN -- Getting linear indices
    af::array aidx0 = af::array(seq0);
    af::array aidx1 = af::array(seq1).T() * M;
    af::array aidx2 = af::reorder(af::array(seq2), 1, 2, 0) * M * N;
    af::gforSet(true);
    af::array aglobal_idx = aidx0 + aidx1 + aidx2;
    af::gforSet(false);
    aglobal_idx = af::flat(aglobal_idx).as(u64);
    // END -- Getting linear indices

    // Copy index list VRAM -> RAM (for easier/faster access)
    uintl* global_idx = new uintl[aglobal_idx.dims(0)];
    aglobal_idx.host(global_idx);

    // Copy all indexed elements into a new RAM array
    double* a_sub = new double[aglobal_idx.dims(0)];
    for(long i = 0; i < aglobal_idx.dims(0); i++)
        a_sub[i] = a[global_idx[i]];

    // Generate the "subset" array on GPU & display nicely formatted
    af::array ar_sub = af::array(seq0.size, seq1.size, seq2.size, a_sub);
    std::cout << "Subset array: ";     // living on seq0 x seq1 x seq2
    af_print(ar_sub);
    return 0;
}
/*
g++ -lafopencl malloc2.cpp && ./a.out
Full array: ar
[3 12 2 1]
1.0000 4.0000 7.0000 10.0000 13.0000 16.0000 19.0000 22.0000 25.0000 28.0000 31.0000 34.0000
2.0000 5.0000 8.0000 11.0000 14.0000 17.0000 20.0000 23.0000 26.0000 29.0000 32.0000 35.0000
3.0000 6.0000 9.0000 12.0000 15.0000 18.0000 21.0000 24.0000 27.0000 30.0000 33.0000 36.0000
37.0000 40.0000 43.0000 46.0000 49.0000 52.0000 55.0000 58.0000 61.0000 64.0000 67.0000 70.0000
38.0000 41.0000 44.0000 47.0000 50.0000 53.0000 56.0000 59.0000 62.0000 65.0000 68.0000 71.0000
39.0000 42.0000 45.0000 48.0000 51.0000 54.0000 57.0000 60.0000 63.0000 66.0000 69.0000 72.0000
ar_sub
[2 3 1 1]
44.0000 50.0000 56.0000
45.0000 51.0000 57.0000
*/
The solution uses some undocumented AF functions and is supposedly slow due to the for loop running over global_idx, but so far it's really the best one can do if one wants to hold data exclusively in the CPU context and share only parts of it with the GPU context of AF for processing.
If anybody knows a way to speed this code up, I'm still open for suggestions.

Fastest Way to Read a File Into Memory in c++?

I'm trying to read from a file in a faster way. The way I'm currently doing it is shown below, but it is very slow for large files. I am wondering if there is a faster way to do this? I need the values stored in a struct, which I have defined below.
std::vector<matEntry> matEntries;
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
while (!inputfileA.eof()) {
    // Read row, column, and value
    inputfileA >> row;  // row
    inputfileA >> col;  // col
    inputfileA >> val;  // value

    // Add row, column, and value entry to the matrix
    matEntries.push_back(matEntry());
    matEntries[index].row = row - 1;
    matEntries[index].col = col - 1;
    matEntries[index].val = val;

    // Increment index
    index++;
}
my struct:
struct matEntry {
    int row;
    int col;
    float val;
};
The file is formatted like this (int, int, float):
1 2 7.9
4 5 9.008
6 3 7.89
10 4 10.21
More info:
I know the number of lines in the file at run time.
I am facing a bottleneck. The profiler says the while() loop is the bottleneck.
To make things easier, I'd define an input stream operator for your struct.
std::istream& operator>>(std::istream& is, matEntry& e)
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;
    return is;
}
Regarding speed, there is not much to improve without dropping to a much lower level of file I/O. I think the only other thing you could do is to initialize your vector so that it doesn't resize all the time inside the loop. And with the defined input stream operator it looks much cleaner as well:
std::vector<matEntry> matEntries;
matEntries.resize(numberOfLines);
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
while (index < numberOfLines && (inputfileA >> matEntries[index++]))
{ }
In my experience, the slowest part of such code is the parsing of numeric values (especially the floating-point ones). Your code is therefore most probably CPU-bound and can be sped up through parallelization as follows (a rough sketch follows the list):
Assuming that your data is on N lines and you are going to process it using k threads, each thread will have to handle about [N/k] lines.
mmap() the file.
Scan the entire file for newline symbols and identify the range that you are going to assign to every thread.
Let each thread process its range in parallel, using an implementation of std::istream that wraps an in-memory buffer.
Note that this will require ensuring that the code for populating your data structure is thread safe.
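Here is a rough, Linux-only sketch of that approach, reusing the matEntry struct from the question. Error handling is minimal, the chunk boundaries are only nudged forward to the next newline, and a production version would wrap each chunk in a zero-copy stream buffer instead of copying it into a std::string; all names are illustrative:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

struct matEntry { int row, col; float val; };

// Parse one chunk of the file (text between begin and end) into 'out'.
static void parse_range(const char* begin, const char* end, std::vector<matEntry>& out)
{
    std::istringstream in(std::string(begin, end));  // a real version would avoid this copy
    matEntry e;
    while (in >> e.row >> e.col >> e.val) {
        --e.row; --e.col;
        out.push_back(e);
    }
}

// mmap the file, split it into k chunks at newline boundaries, and parse each chunk in its
// own thread. The per-thread results would still need to be concatenated by the caller.
std::vector<std::vector<matEntry>> parse_file_parallel(const char* path, unsigned k)
{
    const int fd = ::open(path, O_RDONLY);
    struct stat st;
    ::fstat(fd, &st);
    const char* data = static_cast<const char*>(::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    std::vector<std::vector<matEntry>> results(k);
    std::vector<std::thread> workers;
    const char* chunk_begin = data;
    for (unsigned t = 0; t < k; ++t) {
        const char* chunk_end = (t + 1 == k) ? data + st.st_size
                                             : data + (st.st_size / k) * (t + 1);
        // move the boundary forward to the next newline so no line is split between threads
        while (chunk_end < data + st.st_size && *chunk_end != '\n') ++chunk_end;
        workers.emplace_back(parse_range, chunk_begin, chunk_end, std::ref(results[t]));
        chunk_begin = chunk_end;
    }
    for (auto& w : workers) w.join();

    ::munmap(const_cast<char*>(data), st.st_size);
    ::close(fd);
    return results;
}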
As suggested in the comments, you should profile your code before trying to optimize. If you want to try random stuff until the performance is good enough, you can try reading it into memory first. Here's a simple example with some basic profiling written in:
#include <vector>
#include <ctime>
#include <fstream>
#include <sstream>
#include <iostream>

// Assuming something like this...
struct matEntry
{
    int row, col;
    double val;
};

std::istream& operator>>( std::istream& is, matEntry& e )
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;
    return is;
}

std::vector<matEntry> ReadMatrices( std::istream& stream )
{
    auto matEntries = std::vector<matEntry>();
    auto e = matEntry();

    // For why this is better than your EOF test, see https://isocpp.org/wiki/faq/input-output#istream-and-while
    while( stream >> e ) {
        matEntries.push_back( e );
    }
    return matEntries;
}

int main()
{
    const auto time0 = std::clock();

    // Read file a piece at a time
    std::ifstream inputFileA( "matFileA.txt" );
    const auto matA = ReadMatrices( inputFileA );
    const auto time1 = std::clock();

    // Read file into memory first (from http://stackoverflow.com/a/2602258/201787)
    std::ifstream inputFileB( "matFileB.txt" );
    std::stringstream buffer;
    buffer << inputFileB.rdbuf();
    const auto matB = ReadMatrices( buffer );
    const auto time2 = std::clock();

    std::cout << "A: " << (double(time1 - time0) / CLOCKS_PER_SEC) << " s  B: "
              << (double(time2 - time1) / CLOCKS_PER_SEC) << " s\n";
    std::cout << matA.size() << " " << matB.size();
}
Beware reading the same file on disk twice in a row since the disk caching may hide performance differences.
Other options include:
Preallocate space in your vector (perhaps adding a size to file format or estimating it based on file size or something)
Change your file format to be binary or perhaps compressed data to minimize read time (see the binary I/O sketch after this list)
Memory map the file
Parallelize (easy: process file A and file B in separate threads [see std::async()]; medium: pipeline it so the read and convert are done on different threads; hard: process the same file in separate threads)
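As a sketch of the binary-format option (assuming matEntry is the trivially copyable struct from the question; error checking is omitted and the file names are placeholders): convert the text file once, and later runs read the raw bytes back without any text parsing.

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct matEntry { int row, col; float val; };

// Write the parsed entries once as a simple count-prefixed binary blob.
void writeBinary(const std::string& path, const std::vector<matEntry>& entries)
{
    std::ofstream out(path, std::ios::binary);
    const std::uint64_t n = entries.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(entries.data()), n * sizeof(matEntry));
}

// Read them back; no number parsing is involved, just one size read and one bulk read.
std::vector<matEntry> readBinary(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<matEntry> entries(n);
    in.read(reinterpret_cast<char*>(entries.data()), n * sizeof(matEntry));
    return entries;
}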
Other higher-level considerations might include:
It looks like you have a 4-D array of data (rows/cols of 2D matrices). In many applications, this is a mistake. Take a moment to reconsider if this data structure is really what you need.
There are many high-quality matrix libraries available (e.g., Boost.QVM, Blaze, etc.). Use them rather than reinventing the wheel.

lookup table vs runtime computation efficiency - C++

My code requires continuously computing a value from the following function:
inline double f (double x) {
    return ( tanh( 3*(5-x) ) *0.5 + 0.5);
}
Profiling indicates that this part of the program is where most of the time is spent. Since the program will run for weeks if not months, I would like to optimize this operation and am considering the use of a lookup table.
I know that the efficiency of a lookup table depends on the size of the table itself, and on the way it's designed. Currently I cannot use less than 100 MB and can use up to 2GB. Values between two points in the matrix will be linearly interpolated.
Would using a lookup table be faster than doing the computation? Also, would using an N-dimensional matrix be better than a 1-D std::vector, and is there a threshold (if any) on the size of the table that should not be crossed?
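For concreteness, here is a minimal sketch of the kind of table I have in mind: a uniformly spaced 1-D grid over some range, with linear interpolation between neighbouring entries (the range, table size, and bounds handling are placeholders):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

class InterpTable {
public:
    // Tabulate f(x) = tanh(3*(5-x))*0.5 + 0.5 at n uniformly spaced points in [x_min, x_max].
    InterpTable(double x_min, double x_max, std::size_t n)
        : x_min_(x_min), step_((x_max - x_min) / (n - 1)), table_(n)
    {
        for (std::size_t i = 0; i < n; ++i)
            table_[i] = std::tanh(3.0 * (5.0 - (x_min + i * step_))) * 0.5 + 0.5;
    }

    // Linear interpolation; assumes x_min <= x <= x_max.
    double operator()(double x) const
    {
        const double pos = (x - x_min_) / step_;
        const std::size_t i = std::min<std::size_t>(static_cast<std::size_t>(pos), table_.size() - 2);
        const double frac = pos - static_cast<double>(i);
        return table_[i] * (1.0 - frac) + table_[i + 1] * frac;
    }

private:
    double x_min_, step_;
    std::vector<double> table_;
};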
If you have a huge lookup table (hundreds of MB, as you said) that does not fit in cache, the memory lookup time would most likely be much higher than the calculation itself. RAM is "very slow", especially when fetching from random locations of huge arrays.
Here is a synthetic test:
#include <boost/progress.hpp>
#include <iostream>
#include <ostream>
#include <vector>
#include <cmath>

using namespace boost;
using namespace std;

inline double calc(double x)
{
    return ( tanh( 3*(5-x) ) *0.5 + 0.5);
}

template<typename F>
void test(F &&f)
{
    progress_timer t;
    volatile double res;
    for(unsigned i=0; i!=1<<26; ++i)
        res = f(i);
    (void)res;
}

int main()
{
    const unsigned size = (1 << 26) + 1;
    vector<double> table(size);
    cout << "table size is " << 1.0*sizeof(double)*size/(1 << 20) << "MiB" << endl;

    cout << "calc ";
    test(calc);

    cout << "dummy lookup ";
    test([&](unsigned i){ return table[(i << 12) % size]; }); // dummy lookup, not real values
}
Output on my machine is:
table size is 512MiB
calc 0.52 s
dummy lookup 0.92 s