This problem really confuses me. I want to use DPC++ to read a group of images, so sycl::image was used. Below is my code.
#define N 4   // dimension
#define M 128 // dimension
#define C 4   // 4 channels
#define L 2   // 2 images
int *host_array3_2 = malloc_host<int>(N * M * C * L, Q);
image im3(host_array3_2, image_channel_order::rgba, image_channel_type::unsigned_int32, range{ M, N, L }); // the image format
The kernel code is as follows; I use an accessor with the image_array target to read the data:
Q.submit([&](handler &h) {
    auto out = sycl::stream(1024, 1024 * 2, h);
    accessor<int4, 2, access::mode::read, access::target::image_array> acs3(im3, h); // the accessor format
    h.parallel_for(nd_range{ range{ M, N, L }, range{ N, N, L } }, [=](nd_item<3> it) {
        int idx = it.get_global_linear_id();
        if (idx == 0) {
            // confusing here:
            out << acs3.get_count() << " " << acs3.get_range() << " \n";
            // const auto &ss = acs3[0];   // no compile error
            // ss.read(int2(0, 1));        // also confusing - compiler errors: "array subscript out of range", "SYCL kernel cannot call a variadic function"
        }
    });
});
In addition to the read problem, I found that the range is {128,4,4}.
Why is the third dimension 4? Shouldn't it be the value of L (2)?
It also seems that the third dimension depends only on the second dimension, no matter what L is. Can anybody explain this?
An interesting phenomenon: in Release compile mode it works, and the output is also different. That's strange.
I am using LAPACK to invert a matrix. I pass the matrices by reference, i.e. I work on their addresses. Below is the function, with an input matrix and an output matrix passed by reference.
The issue is that I am obliged to convert F_matrix into a 1D array, and I think this wastes performance at runtime. How could I get rid of this extra step, which I think is time-consuming if I call matrix_inverse_lapack many times?
Here is the function concerned:
// Passing matrices by reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Indices for loops and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Main array to invert
double *arr = new double[N*N];
// Output diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way:
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The inversion performance is not what I expected; I think this is due to the 2D -> 1D conversion that I described in matrix_inverse_lapack.
Update
I was advised to install MAGMA on my macOS Big Sur 11.3, but I am having a lot of difficulty setting it up.
I have an AMD Radeon Pro 5600M graphics card, and the OpenCL framework that ships with Big Sur is already installed (maybe I am wrong about that). Could anyone describe the procedure to follow to install MAGMA? I saw that MAGMA software exists at http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want: I just need the SDK (headers and libraries), if possible built for my GPU card. I have already installed the Intel oneAPI SDK on my macOS; maybe I could link it to a MAGMA installation.
I saw another link, https://icl.utk.edu/magma/software/index.html, where MAGMA seems to be public: there is no link with the non-free version above, is there?
First of all, let me complain that OP did not provide all the necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant.
Last, but not least, OP didn't check the status codes returned by the Lapack functions for possible errors, and this could also be important for a correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <chrono>
#include <iostream>
#include <iterator> // for std::ssize
#include <vector>
using Matrix = std::vector<std::vector<double>>;
std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
const auto size = std::min<int>(10, v.size());
for (int i = 0; i < size; i++)
{
for (int j = 0; j < size; j++)
{
out << v[i][j] << "\t";
}
if (size < std::ssize(v)) out << "...";
out << "\n";
}
return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
std::vector<double> &matrix_buffer)
{
// std::cout << F_matrix << "\n";
auto t0 = std::chrono::steady_clock::now();
const int N = F_matrix.size();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
matrix_buffer[idx] = F_matrix[i][j];
}
}
auto t1 = std::chrono::steady_clock::now();
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
auto t2 = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
F_output[i][j] = matrix_buffer[idx];
}
}
auto t3 = std::chrono::steady_clock::now();
auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
// std::cout << F_output << "\n";
std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
<< "\n";
std::cout << "whole function: " << whole_fun_time << "\n";
std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
const int M = 5; // number of test repetitions
const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
std::cout << "Matrix size = " << N << "\n";
std::vector<int> IPIV_buffer(N);
std::vector<double> matrix_buffer(N * N);
// Test matrix_inverse_lapack M times
for (int i = 0; i < M; i++)
{
Matrix CO_CL(N);
for (auto &v : CO_CL) v.resize(N);
int idx = 1;
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx + 1.0 / idx;
idx++;
}
}
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
}
}
Here, operator<< is overkill, but it may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines in matrix_inverse_lapack), and ensuring that the code is correct is more important than measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed (headers + binaries) for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to suppress the continual launching of the memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and find a way to read OP's mind as to what the value of N could be. I also added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N³, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5-6 μs (microseconds) for N = 10.
I assume that an overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case the overhead at or below 1% is certainly acceptable.
So what could OP (or anyone having a similar problem) have done wrong to obtain "unacceptable conversion overhead"? Here's my short list:
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
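For reference, the difference boils down to the optimization flag passed to the compiler; the -O0 line below is just an illustration of a non-Release build (the -O3 line is the one used for the numbers above):
g++ -std=c++20 -O0 main.cpp -llapacke   # unoptimized: the conversion loops become relatively much slower
g++ -std=c++20 -O3 main.cpp -llapacke   # optimized (Release-like) build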
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular, and Lapack returns quite quickly with a status different from 0. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to the Lapack functions can take 10 times more time than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can badly affect the measurements for small N.
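A simple mitigation (my sketch, not part of the code above) is to make one untimed warm-up call before the measured repetitions, so that this apparent one-time initialization does not pollute the statistics:
// Warm-up sketch: one untimed call absorbs Lapack's apparent self-initialization cost.
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer); // warm-up, ignore its timings
for (int i = 0; i < M; i++)
{
    // ... (re)initialize CO_CL as in main() ...
    matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer); // only these runs are compared
}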
4. What else can be done?
OP appears to believe that his approach to 2D arrays is good and that Lapack is strange and old-fashioned in its packing of a 2D array into a 1D array. No. It is Lapack that is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from the others. Thus, a matrix of 100 by 100 may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compactification of the data into a single, contiguous array. This is done to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than the 1-3% that he now pays for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member functions (e.g. operator()), or find a library that does just that (see the sketch after this list)
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initialize the pointers with the addresses pointing at the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
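As an illustration of the first option, here is a minimal sketch of such a wrapper (the class name Matrix2D and its members are mine, not OP's; row-major storage compatible with LAPACK_ROW_MAJOR is assumed):
#include <cstddef>
#include <vector>

// Minimal row-major 2D matrix stored in one contiguous buffer,
// so data() can be handed directly to the LAPACKE_* routines.
class Matrix2D
{
public:
    explicit Matrix2D(std::size_t n) : n_(n), data_(n * n) {}
    double &operator()(std::size_t i, std::size_t j) { return data_[i * n_ + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data_[i * n_ + j]; }
    double *data() { return data_.data(); } // e.g. LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, m.data(), n, ipiv.data())
    std::size_t size() const { return n_; }
private:
    std::size_t n_;
    std::vector<double> data_;
};
With such a class the 2D-1D-2D copies disappear entirely: the LAPACKE routines factorize and invert the matrix in place.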
EDIT: Alternative libraries
An alternative approach is to use a library that is known for being highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in and may vary in time.
As for now (mid-2021), decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.
An easier way (to install and run) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant, it has been developed by a large team of people, including Jack Dongarra, and it should also be rather easy to compile locally, as it is provided with a CMake script.
An example of how much a parallel, CPU-based, multicore implementation can outperform a single-threaded implementation of the LU decomposition can be found, for example, here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times faster for matrices of size 1000).
When I try to display/print some tensors to the screen, I get something like the following: instead of the final result, libtorch displays the tensor with a multiplier (i.e. the 0.01 * and the like that you can see below):
offsets.shape: [1, 4, 46, 85]
probs.shape: [46, 85]
offsets: (1,1,.,.) =
0.01 *
0.1006 1.2322
-2.9587 -2.2280
(1,2,.,.) =
0.01 *
1.3772 1.3971
-1.2813 -0.8563
(1,3,.,.) =
0.01 *
6.2367 9.2561
3.5719 5.4744
(1,4,.,.) =
0.2901 0.2963
0.2618 0.2771
[ CPUFloatType{1,4,2,2} ]
probs: 0.0001 *
1.4593 1.0351
6.6782 4.9104
[ CPUFloatType{2,2} ]
How can I disable this behavior and get the final output? I tried to explicitly convert the tensor to float, hoping this would cause the finalized values to be stored/displayed, but that doesn't work either.
Based on libtorch's source code for outputting tensors (found by searching for the " *" string within the repository), it turns out that this "pretty-print" is done in the aten/src/ATen/core/Formatting.cpp translation unit. The scale and asterisk are prepended here:
static void printScale(std::ostream & stream, double scale) {
FormatGuard guard(stream);
stream << defaultfloat << scale << " *" << std::endl;
}
And later on all coordinates of the Tensor are divided by the scale:
if(scale != 1) {
printScale(stream, scale);
}
double* tensor_p = tensor.data_ptr<double>();
for(int64_t i = 0; i < tensor.size(0); i++) {
stream << std::setw(sz) << tensor_p[i]/scale << std::endl;
}
Judging by this translation unit, this is not configurable at all.
I guess you've got two options here:
Tweak the functions, editing the existing ones minimally to meet your requirements.
Remove (or guard with #ifdef) the << operator overload for Tensor in Formatting.cpp and provide your own implementation. When building libtorch, however, you'd have to link it against your target containing that implementation.
Both options, however, require you to change 3rd-party code, which is quite bad, I believe.
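If all you need is readable output (rather than changing libtorch itself), a third, non-invasive option is a small helper of your own that streams the raw values; the sketch below is my suggestion and not part of libtorch's API:
#include <torch/torch.h>
#include <iomanip>
#include <iostream>

// Print a tensor's raw values without the "0.01 *" scale factor,
// by iterating over a flattened CPU copy and streaming each element directly.
void print_raw(const torch::Tensor &t, int precision = 6) {
    auto flat = t.to(torch::kCPU).to(torch::kDouble).flatten();
    std::cout << std::fixed << std::setprecision(precision);
    for (int64_t i = 0; i < flat.numel(); ++i) {
        std::cout << flat[i].item<double>() << ' ';
    }
    std::cout << "  sizes: " << t.sizes() << '\n';
}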
I am trying to perform a QR factorization on the GPU using the cusolver library from CUDA.
I reduced my problem to the example below.
Basically, the steps are:
I allocate memory and initialize a [5x3] matrix with 1s on the host,
I allocate memory and copy the matrix to the device,
I initialize the solver handle with cusolverDnCreate,
I determine the size of the needed work space with cusolverDnDgeqrf_bufferSize,
and, finally, I try to do the QR factorization with cusolverDnDgeqrf.
Unfortunately, the last command systematically fails by returning a CUSOLVER_STATUS_EXECUTION_FAILED (int value = 6) and I can't figure out what went wrong!
Here is the faulty code:
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
int main(void)
{
int N = 5, P = 3;
double *hostData;
cudaMallocHost((void **) &hostData, N * sizeof(double));
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
double *devData;
cudaMalloc((void**)&devData, N * sizeof(double));
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
cusolverStatus_t retVal;
cusolverDnHandle_t solverHandle;
retVal = cusolverDnCreate(&solverHandle);
std::cout << "Handler creation : " << retVal << std::endl;
double *devTau, *work;
int szWork;
cudaMalloc((void**)&devTau, P * sizeof(double));
retVal = cusolverDnDgeqrf_bufferSize(solverHandle, N, P, devData, N, &szWork);
std::cout << "Work space sizing : " << retVal << std::endl;
cudaMalloc((void**)&work, szWork * sizeof(double));
int *devInfo;
cudaMalloc((void **)&devInfo, 1);
retVal = cusolverDnDgeqrf(solverHandle, N, P, devData, N, devTau, work, szWork, devInfo); //CUSOLVER_STATUS_EXECUTION_FAILED
std::cout << "QR factorization : " << retVal << std::endl;
int hDevInfo = 0;
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Info device : " << hDevInfo << std::endl;
cudaFree(devInfo);
cudaFree(work);
cudaFree(devTau);
cudaFree(devData);
cudaFreeHost(hostData);
cudaDeviceReset();
}
Should you see any obvious error in my code, please let me know!
Many thanks.
Any time you are having trouble with a CUDA code, you should always use proper CUDA error checking and run your code with cuda-memcheck before asking for help.
You may also want to be aware that a fully worked QR factorization example is given in the relevant CUDA/cusolver sample code, and there is also sample code in the documentation.
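For reference, a minimal error-checking pattern could look like the sketch below (the macro name is mine; the runtime calls it wraps are the standard CUDA ones):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

// Minimal CUDA runtime error-checking macro: report file/line and abort on failure.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err = (call);                                             \
        if (err != cudaSuccess) {                                             \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                         cudaGetErrorString(err), __FILE__, __LINE__);        \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void **)&devInfo, sizeof(int)));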
With proper error checking, you may have discovered:
this is not correct:
cudaMalloc((void **)&devInfo, 1);
the second parameter is the size in bytes, so it should be sizeof(int), not 1. This error results in an error in a cudaMemcpyAsync operation internal to the cusolverDnDgeqrf call, which would show up in cuda-memcheck output.
This is not correct:
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
the order of the pointer parameters is destination first, followed by source. So you have those parameters reversed, and this call would throw a runtime API error that you could observe if you were doing proper error checking (or visible in cuda-memcheck output).
Once you fix those errors, then the qrf call will actually return a zero status (no error). But we're not quite done yet (again, proper error checking would let us know we are not quite done yet.)
In addition to the above errors, you have made some additional sizing errors. Your matrix is of size N*P, so it has N*P elements, and you are initializing that many elements here:
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
but you are not allocating for that many elements on the host here:
cudaMallocHost((void **) &hostData, N * sizeof(double));
or on the device here:
cudaMalloc((void**)&devData, N * sizeof(double));
and you are not transferring that many elements here:
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
So in the 3 cases above, if you change N*sizeof(double) to N*P*sizeof(double) you will be able to fix those errors, and the code then runs with no errors reported by cuda-memcheck, and also no errors returned from any of the API calls.
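Putting the fixes together, the relevant lines would look roughly like this (a sketch of the corrections discussed above, not the complete program):
// Host and device buffers sized for all N*P matrix elements:
cudaMallocHost((void **)&hostData, N * P * sizeof(double));
cudaMalloc((void **)&devData, N * P * sizeof(double));
cudaMemcpy((void *)devData, (void *)hostData, N * P * sizeof(double), cudaMemcpyHostToDevice);

// devInfo must hold an int, not a single byte:
cudaMalloc((void **)&devInfo, sizeof(int));

// ... cusolverDnDgeqrf(...) as before ...

// Device -> host copy: destination first, then source:
cudaMemcpy((void *)&hDevInfo, (void *)devInfo, sizeof(int), cudaMemcpyDeviceToHost);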
I have started using Rcpp and I like it a lot. I am fairly new to programming. I have a question regarding memory usage. Below is a reproducible problem:
library(RcppArmadillo)
library(inline)
code <- "
Rcpp::NumericVector input_(input);
arma::cube disturb(input_.begin(), 2, 2, 50000000, false);
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(rnorm(2 * 2 * 50000000), dim = c(2, 2, 50000000))
Test(input)
My understanding is that in the problem above the only memory usage is when I assign an array to the variable input in R, so I should only be using around 1.6 GB (2*2*50,000,000*8 bytes = 1.6 GB). When I go to Rcpp, I initialise the variable input_ from the SEXP object, which is a pointer, so this should not use any additional memory. Then, when I initialise the variable disturb, I also use a pointer and set copy_aux_mem = FALSE, so I should not be using any memory there either. So, if my understanding is correct, I should only be using 1.6 GB when I run the code. Is this correct?
However, when I run the code, the memory usage (judging by the System Monitor in Ubuntu) jumps above 10 GB (from around 1 GB) before falling back down to around 4 GB. I don't understand what is going on. Did I use Rcpp incorrectly?
Your help is appreciated. Many thanks.
Edit after new version of Armadillo (5.300)
After this initial Q/A on StackOverflow, Conrad Sanderson and I engaged in some email discussion about this issue. By design, the arma::cube objects create an arma::mat for each slice (the third dimension) of the cube. This is done during the creation of the cube, even if the data is copied from existing memory (as in the original question). Since this is not always needed, I suggested there should be an option to disable the pre-allocation of matrices for the slices. As of the current version of Armadillo (5.300.4), there now is. This can be installed from CRAN.
Example code:
library(RcppArmadillo)
library(inline)
code <- "
Rcpp::NumericVector input_(input);
arma::cube disturb(input_.begin(), 2, 2, 50000000, false, true, false);
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(rnorm(2 * 2 * 50000000), dim = c(2, 2, 50000000))
Test(input)
The key thing here is that the cube constructor is now called as arma::cube disturb(input_.begin(), 2, 2, 50000000, false, true, false);. The final false here is the new prealloc_mat parameter, which determines whether or not to pre-allocate the matrices. The slice method will still work fine on a cube without pre-allocated matrices - the matrix will be allocated on demand. However, if you're directly accessing the mat_ptrs member of such a cube, it will be filled with NULL pointers. The help has also been updated.
Many thanks to Conrad Sanderson for acting so quickly to provide this additional option, and to Dirk Eddelbuettel for all his work on Rcpp and RcppArmadillo!
Original answer
It's a slightly bizarre one. I've tried with a range of different array sizes, and the problem only occurs with arrays where the 3rd dimension is much bigger than the other 2. Here's a reproducible example:
library("RcppArmadillo")
library("inline")
code <- "
Rcpp::NumericVector input_(input);
IntegerVector dim = input_.attr(\"dim\");
arma::cube disturb(input_.begin(), dim[0], dim[1], dim[2], false);
disturb(0, 0, 0) = 45; // modify the first element in place
return wrap(2);
"
Test <- cxxfunction(signature(input = "numeric"), plugin = "RcppArmadillo", body = code)
input <- array(0, c(1e7, 2, 2))
Test(input)
# no change in memory usage
dim(input) <- c(2, 1e7, 2)
gc()
Test(input)
# no change in memory usage
dim(input) <- c(2, 2, 1e7)
gc()
Test(input)
# spike in memory usage
dim(input) <- c(20, 2, 1e6)
gc()
Test(input)
# no change in memory usage
This suggests it's something about the way the Armadillo library is implemented (or possibly RcppArmadillo). It certainly doesn't seem to be something you're doing wrong.
Note I've included some modification in place of the data (setting the first element to 45), and you can confirm that in each case the data is modified in place, suggesting there isn't a copy going on.
For now, I'd suggest if possible organising your 3d arrays such that the largest dimension isn't the third one.
EDIT After doing some more digging, it looks as though there is allocation of RAM during the creation of the arma::cube. In Cube_meat.hpp, in the create_mat method, there's the following code:
if(n_slices <= Cube_prealloc::mat_ptrs_size)
{
access::rw(mat_ptrs) = const_cast< const Mat<eT>** >(mat_ptrs_local);
}
else
{
access::rw(mat_ptrs) = new(std::nothrow) const Mat<eT>*[n_slices];
arma_check_bad_alloc( (mat_ptrs == 0), "Cube::create_mat(): out of memory" );
}
}
Cube_prealloc::mat_ptrs_size seems to be 4, so it's actually an issue for any array with more than 4 slices.
I posted an issue on github.
EDIT 2 However, it's definitely an issue with the underlying Armadillo code. Here's a reproducible example which doesn't use Rcpp at all. This is Linux-only - it uses code from How to get memory usage at run time in C++? to pull out the current memory usage of the running process.
#include <iostream>
#include <armadillo>
#include <unistd.h>
#include <ios>
#include <fstream>
#include <string>
//////////////////////////////////////////////////////////////////////////////
//
// process_mem_usage(double &, double &) - takes two doubles by reference,
// attempts to read the system-dependent data for a process' virtual memory
// size and resident set size, and return the results in KB.
//
// On failure, returns 0.0, 0.0
void process_mem_usage(double& vm_usage, double& resident_set)
{
using std::ios_base;
using std::ifstream;
using std::string;
vm_usage = 0.0;
resident_set = 0.0;
// 'file' stat seems to give the most reliable results
//
ifstream stat_stream("/proc/self/stat",ios_base::in);
// dummy vars for leading entries in stat that we don't care about
//
string pid, comm, state, ppid, pgrp, session, tty_nr;
string tpgid, flags, minflt, cminflt, majflt, cmajflt;
string utime, stime, cutime, cstime, priority, nice;
string O, itrealvalue, starttime;
// the two fields we want
//
unsigned long vsize;
long rss;
stat_stream >> pid >> comm >> state >> ppid >> pgrp >> session >> tty_nr
>> tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt
>> utime >> stime >> cutime >> cstime >> priority >> nice
>> O >> itrealvalue >> starttime >> vsize >> rss; // don't care about the rest
stat_stream.close();
long page_size_kb = sysconf(_SC_PAGE_SIZE) / 1024; // in case x86-64 is configured to use 2MB pages
vm_usage = vsize / 1024.0;
resident_set = rss * page_size_kb;
}
using namespace std;
using namespace arma;
void test_cube(double* numvec, int dim1, int dim2, int dim3) {
double vm, rss;
cout << "Press enter to continue";
cin.get();
process_mem_usage(vm, rss);
cout << "Before:- VM: " << vm << "; RSS: " << rss << endl;
cout << "cube c1(numvec, " << dim1 << ", " << dim2 << ", " << dim3 << ", false)" << endl;
cube c1(numvec, dim1, dim2, dim3, false);
process_mem_usage(vm, rss);
cout << "After:- VM: " << vm << "; RSS: " << rss << endl << endl;
}
int
main(int argc, char** argv)
{
double* numvec = new double[40000000];
test_cube(numvec, 10000000, 2, 2);
test_cube(numvec, 2, 10000000, 2);
test_cube(numvec, 2, 2, 1000000);
test_cube(numvec, 2, 2, 2000000);
test_cube(numvec, 4, 2, 2000000);
test_cube(numvec, 2, 4, 2000000);
test_cube(numvec, 4, 4, 2000000);
test_cube(numvec, 2, 2, 10000000);
cout << "Press enter to finish";
cin.get();
return 0;
}
EDIT 3 Per the create_mat code above, an arma::mat is created for each slice of a cube. On my 64-bit machine, this results in 184 bytes of overhead for each slice. For a cube with 5e7 slices, that equates to 8.6 GiB of overhead even though the underlying numeric data only takes up 1.5 GiB. I've emailed Conrad Sanderson to ask if this is fundamental to the way that Armadillo works or could be changed, but for now it definitely seems that you want your slice dimension (the third one) to be the smallest of the three if at all possible. It's also worth noting that this applies to all cubes, not just those created from existing memory. Using the arma::cube(dim1, dim2, dim3) constructor leads to the same memory usage.
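For anyone who wants to reproduce the arithmetic, here is a quick back-of-the-envelope check (the 184-byte per-slice figure is the one measured above):
#include <cstdio>

int main()
{
    const double slices   = 5e7;                                  // third dimension of the cube
    const double overhead = 184.0 * slices;                       // bytes of per-slice mat overhead
    const double data     = 2.0 * 2.0 * slices * sizeof(double);  // the actual numeric data
    const double GiB      = 1024.0 * 1024.0 * 1024.0;
    std::printf("overhead: %.1f GiB, data: %.1f GiB\n", overhead / GiB, data / GiB);
    // prints roughly: overhead: 8.6 GiB, data: 1.5 GiB
}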
Given the maximum possible value, how can I simply express the space needed to write such a number in decimal form as text?
The real task: logging process ids (pid_t) with a fixed width, using gcc on Linux. It would be good to have a compile-time expression to be used with the std::setw() I/O manipulator.
I have found that the linux/threads.h header contains a PID_MAX value with the maximum pid allocated to a process. So, having
#define LENGTH(t) sizeof(#t)-1
the LENGTH(PID_MAX) would be a compile-time expression, but unfortunately this number is defined in hex:
#define PID_MAX 0x8000
My current best solution is a bit odd:
static_cast<int>( ::floor( ::log(PID_MAX)/::log(10) + 1 ) );
But this is calculated at runtime and uses functions from math.h.
You could do it with a little template meta programming:
// NumLength_internal does the actual calculation.
template <unsigned num>
struct NumLength_internal
{ enum { value = 1 + NumLength_internal<num/10>::value }; };
template <>
struct NumLength_internal<0>
{ enum { value = 0 }; };
//NumLength is a wrapper to handle zero. For zero we want to return
//a length of one as a special case.
template <unsigned num>
struct NumLength
{ enum { value = NumLength_internal<num>::value };};
template <>
struct NumLength<0>
{ enum { value = 1 }; };
This should work for anything now. For example:
cout << NumLength<0>::value << endl; // writes: 1
cout << NumLength<5>::value << endl; // writes: 1
cout << NumLength<10>::value << endl; // writes: 2
cout << NumLength<123>::value << endl; // writes: 3
cout << NumLength<0x8000>::value << endl; // writes: 5
This is all handled at compile time.
Edit: I added another layer to handle the case when the number passed in is zero.
I don't think you can get it exactly without invoking logarithms, but you can get an upper bound:
CHAR_BIT * sizeof(PID_MAX) will give you an upper bound on the number of bits needed to represent PID_MAX. You can then precompute log2(10) ≈ 3.32 and round down to 3. Forget about floor, because integer division will truncate like that anyhow. So
#define LENGTH(t) (((CHAR_BIT * sizeof(t)) / 3) + 1)
Should give you a compile-time computable upper bound on the number of characters needed to display t in decimal.
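A quick illustration of how the bound might be used with setw (a sketch; applying it to pid_t and getpid() is my assumption of OP's logging context):
#include <climits>     // CHAR_BIT
#include <iomanip>
#include <iostream>
#include <sys/types.h> // pid_t
#include <unistd.h>    // getpid

#define LENGTH(t) (((CHAR_BIT * sizeof(t)) / 3) + 1)

int main()
{
    // With a 32-bit pid_t: 32 / 3 + 1 = 11 columns, comfortably enough
    // for PID_MAX = 0x8000 (32768, i.e. 5 decimal digits).
    std::cout << std::setw(LENGTH(pid_t)) << getpid() << '\n';
}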