GNU Octave Code Equivalent in C/C++

I have some GNU Octave/Matlab code that I would like to translate into C or C++. I can handle most of this translation, but I don't know what the line x1=0:1:pts-1; would translate to in C code. If I understand correctly, it is a Range type in Octave, but I'm not sure what data type in C or C++ would support that same functionality.
The full script is:
pkg load signal
fs = 48000;
fc=18300;
rlen=10;
ppiv=100;
beta=9.0;
apof=0.9;
apobeta=0.7;
pts = ppiv*rlen+1;
x1=0:1:pts-1;%this line here!!!!
x2=rlen*2*(x1-(pts-1)/2 +0.00001)/(pts-1); % and the usage of x1 in this line
x3=pi*fc/fs*x2;
h=sin(x3)./x3;
w=kaiser(pts,beta);
g=w.*h;
aw = 1-apof*kaiser(pts,apobeta);
g=aw.*g;
g=g/max(g);
figure(1);
subplot(1,2,1);
plot(x2/2,g);
axis([-rlen/2 rlen/2 -0.2 2.0002]);
%xlabel("Time in Sampling Intervals");
%title('Bandlimited Impulse');
subplot(1,2,2);
zpad=20;
g2=[g;zeros((zpad-1)*pts,1)];
wspec=abs(fft(g2));
wspec=max(wspec/max(wspec),0.00001);
fmax=60000;
rng = round(rlen*zpad*fmax/fs);
xidx = 0:1:rng;
semilogy(fmax/1000*xidx/rng,wspec(1:(rng+1)));
%xlabel('Frequency in kHz');
%title('amplitude spectrum');
grid;
hold;
plot([20 20],[0.00001,1]);
plot([fs/1000-20 fs/1000-20], [0.00001 1]);
plot([fs/1000 fs/1000], [0.00001 1]);
hold off;
So what I am looking for is either a code snippet or some resource on how to deal with this conversion.
Thanks in advance.

Since you are converting from Octave code, it makes a lot of sense to use Octave's C++ library. You can see its Doxygen docs online.
For your specific case you can use the octave_range class:
#include "ov-range.h"
octave_range x (0, pts - 1, 1);
Note that this would only be a range, just like in Octave. If you then want a matrix out of it, you can do:
Matrix mx = x.matrix_value ();
If converting the range to a matrix confuses you, look at how it's actually done in Octave: create a range and check its size in memory, then compare with the matrix created from it:
octave-cli-3.8.1> x = 0:1:10000;
octave-cli-3.8.1> whos x
Variables in the current scope:
  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
       x           1x10001                     24  double
Total is 10001 elements using 24 bytes
octave-cli-3.8.1> x = [0:1:10000];
octave-cli-3.8.1> whos x
Variables in the current scope:
  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
       x           1x10001                  80008  double
Total is 10001 elements using 80008 bytes
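The 24 bytes are the giveaway: Octave stores a range lazily, as just its defining parameters, and only materializes the elements once a matrix is needed. One plausible C++ layout of the same idea (a sketch, not Octave's actual class):
// Three doubles = 24 bytes, matching the whos output above
struct LazyRange {
    double base;        // 0
    double limit;       // 10000
    double increment;   // 1
    double operator()(long i) const { return base + i * increment; }  // i-th element
};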

x1 is an array (or, if you prefer, a matrix) holding the values 0, 1, 2, ..., (pts-1).
So you could generate it in C with something like:
double x1[1001]; /* pts = ppiv*rlen + 1 = 1001 here; double, since x2 does fractional math on it */
int i;
for (i = 0; i < 1001; i++) {
    x1[i] = i;
}
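If plain C++ (without liboctave) is acceptable, here is a sketch of the same two script lines using std::vector; ppiv and rlen are hard-coded to the script's values:
#include <numeric>   // std::iota
#include <vector>

int ppiv = 100, rlen = 10;             // as in the script
int pts  = ppiv * rlen + 1;            // 1001
std::vector<double> x1(pts);
std::iota(x1.begin(), x1.end(), 0.0);  // 0.0, 1.0, ..., pts-1

std::vector<double> x2(pts);           // the next line of the script
for (int i = 0; i < pts; ++i)
    x2[i] = rlen * 2.0 * (x1[i] - (pts - 1) / 2.0 + 0.00001) / (pts - 1);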

Related

Why my inversions of matrices are such slow with LAPACKE in C++ : MAGMA Alternative and set up

I am using LAPACK to inverse a matrix: I did a reference passing, i.e by working on the address. Here below the function with an input matrix and an output matrix referenced by their address.
The issue is that I am obliged to convert the F_matrix into 1D array and I think this is a waste of performances on the runtime level : which way could I find to get rid of this supplementary task which is time consuming I think if I call a lot of times the
function matrix_inverse_lapack.
Below the function concerned :
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way :
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The performances on inversion are not which are expected, I think this is due to this conversion 2D -> 1D that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my MacOS Big Sur 11.3 but I have a lot of difficulties to set up it.
I have a AMD Radeon Pro 5600M graphic card. I have already installed by default Big Sur version all the Framework OpenCL (maybe I am wrong by saying that). Anyone could tell the procedure to follow for the installation of MAGMA. I saw that on a MAGMA software exists on http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want : I just need all the SDK (headers and libraries) , if possible built with my GPU card. I have already installed all the Intel OpenAPI SDK on my MacOS. Maybe, I could link it to a MAGMA installation.
I saw another link https://icl.utk.edu/magma/software/index.html where MAGMA seems to be public : there is none link with the non-free version above, isn't there ?
First of all, let me complain that OP did not provide all necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant.
Last but not least, OP didn't check the status codes of the Lapack functions for possible errors, and this could also be important for correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>
using Matrix = std::vector<std::vector<double>>;
std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
const auto size = std::min<int>(10, v.size());
for (int i = 0; i < size; i++)
{
for (int j = 0; j < size; j++)
{
out << v[i][j] << "\t";
}
if (size < std::ssize(v)) out << "...";
out << "\n";
}
return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
std::vector<double> &matrix_buffer)
{
// std::cout << F_matrix << "\n";
auto t0 = std::chrono::steady_clock::now();
const int N = F_matrix.size();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
matrix_buffer[idx] = F_matrix[i][j];
}
}
auto t1 = std::chrono::steady_clock::now();
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
auto t2 = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
F_output[i][j] = matrix_buffer[idx];
}
}
auto t3 = std::chrono::steady_clock::now();
auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
// std::cout << F_output << "\n";
std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
<< "\n";
std::cout << "whole function: " << whole_fun_time << "\n";
std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
const int M = 5; // number of test repetitions
const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
std::cout << "Matrix size = " << N << "\n";
std::vector<int> IPIV_buffer(N);
std::vector<double> matrix_buffer(N * N);
// Test matrix_inverse_lapack M times
for (int i = 0; i < M; i++)
{
Matrix CO_CL(N);
for (auto &v : CO_CL) v.resize(N);
int idx = 1;
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx + 1.0 / idx;
idx++;
}
}
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
}
}
Here, operator<< is an overkill, but may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines in matrix_inverse_lapack), and ensuring that the code is correct is more important than measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed, headers + binaries, for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also pass external buffers to matrix_inverse_lapack to avoid repeatedly invoking the memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and guess what value of N OP might be using, and I added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N³, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5·10⁻⁶ s (5 microseconds) for N = 10.
I assume the overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case the overhead at or below 1% is certainly acceptable.
So what could OP (or anyone having a similar problem) have done wrong to obtain "unacceptable conversion overhead values"? Here's my short list:
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
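Concretely, with the compile line used above, "forgetting Release mode" amounts to the difference between these two builds (the -O0 line is the pitfall):
g++ -std=c++20 -O0 main.cpp -llapacke   # unoptimized: naive conversion loops
g++ -std=c++20 -O3 main.cpp -llapacke   # optimized: what the benchmark used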
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular (each row differs from its predecessor by the same constant vector, so the matrix has rank 2 for any N > 2), and Lapack returns quite quickly with a status different from 0. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to lapack functions can take 10 times more time than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can affect the measurements for small N badly.
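A hedged way to keep that first, slow call out of the numbers is an explicit warm-up before the measured repetitions (assuming, as the pattern above suggests, that one call suffices to trigger Lapack's self-initialization):
// Warm-up: the first Lapack call self-initializes; discard its timing
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
// ... then re-initialize CO_CL and run the M measured repetitions as in main()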
4. What else can be done?
OP appears to believe that his approach to 2D arrays is good and that Lapack is strange and old-fashioned in its packing of a 2D array into a 1D array. No. It is Lapack that is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from the others. Thus, a matrix 100 by 100 may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compactification of the data into a single, contiguous array, precisely to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than the 1-3% he now pays for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member functions (e.g.: operator ()), or find a library that does just that (a sketch follows after this list)
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initialize the pointers with the addresses of the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
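A minimal sketch of the first option (the class name and interface are mine, not from any particular library):
#include <vector>

// 2D matrix in one contiguous buffer, row-major to match LAPACK_ROW_MAJOR
class DenseMatrix {
    int n_;
    std::vector<double> data_;                // a single allocation
public:
    explicit DenseMatrix(int n) : n_(n), data_(n * n) {}
    double &operator()(int i, int j)       { return data_[i * n_ + j]; }
    double  operator()(int i, int j) const { return data_[i * n_ + j]; }
    double *data() { return data_.data(); }   // feed directly to dgetrf/dgetri
    int size() const { return n_; }
};
With such a class, matrix_inverse_lapack needs no conversion loops at all: LAPACKE_dgetrf and LAPACKE_dgetri can work in place on data().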
EDIT: Alternative libraries
An alternative approach is to use a library that is known for being highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in and may vary in time.
As of now (mid-2021), decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.
An easier way (to install and run) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant, it has been developed by a large team of people including Jack Dongarra, and it should be rather easy to compile locally, as it comes with a CMake script.
An example of how much a parallel, CPU-based, multicore implementation can outperform a single-threaded implementation of the LU decomposition can be found for example here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).

Matlab coder: How to force a variable to have variable :inf size

I can't find how to force Matlab coder to make a parameter size be variable rather than fixed.
Here is a MCVE:
Function code:
function [sz] = my_varsize(x)
sz = length(x);
end
Sample main program used in Matlab coder:
samp = 100;
x = zeros(1,samp);
sz = my_varsize(x);
display(sz);
Then, Matlab coder generates C/C++ code where x size is (1x100).
I can manually change the variable size from 1x100 to 1x:Inf from the GUI; this works fine, but I'd prefer Matlab coder to do it automatically. I tried to add coder.varsize('x',[1,inf]); and coder.typeof(x,[1,inf]); both in the function and in the main program, but neither had the expected behaviour.
Edit: Based on Ryan's comment, I tried to call my_varsize with different objects of different sizes to see if Matlab then realizes that this should use a variable size:
samp = int64(round(rand()*100));
x = zeros(1,samp);
sz = my_varsize(x);
display(sz);
samp = int64(round(rand()*100));
x = zeros(1,samp);
sz = my_varsize(x);
display(sz);
Then, the generated code uses a variable size of 61 (the biggest result of the two rand() calls), [1,:61], while I need [1,:inf] so that my generated C/C++ code can be used with any input!
Presumably you're using the auto-define capability in the MATLAB Coder app. What that does is to run the script you provided and monitor inputs to your function my_varsize. Let's take a concrete example:
function my_varsize.m
function [sz] = my_varsize(x)
sz = length(x);
end
test script my_varsize_tb.m
samp = 20;
x = zeros(1,samp);
sz = my_varsize(x);
display(sz);
samp = 37;
x = zeros(1,samp);
sz = my_varsize(x);
display(sz);
Here, my_varsize_tb is run and Coder detects 2 calls to my_varsize. The first takes a 1-by-20 double array. The second takes a 1-by-37 array. So it computes that the input must be 1-by-:37. Since you can only make a finite number of calls this way, the input will only ever be determined to have a finite upper bound.
You can then tweak the size to be 1-by-:Inf in the Coder App.
More info
There is a command-line function giving the same behavior that you might be using:
>> t = coder.getArgTypes('my_varsize_tb','my_varsize')
t =
1×1 cell array
{1×1 coder.PrimitiveType}
>> t{1}
ans =
coder.PrimitiveType
1×:37 double
You can similarly tweak that size:
>> inputType = coder.resize(t{1},[1,Inf])
inputType =
coder.PrimitiveType
1×:inf double
>> codegen my_varsize -args inputType
to use it with the codegen command.
Lastly given that you have a simple function you can just do:
t = coder.typeof(1, [1,Inf]);
codegen my_varsize -args t
coder.typeof takes the first input to determine that it's a real double and the size. When you pass a second argument, that overrides the size, producing a 1-by-:Inf as expected in this case.

Linear Programming: Modulo constraint

I am using Coin-Or's rehearse to implement linear programming.
I need a modulo constraint. Example: x shall be a multiple of 3.
OsiCbcSolverInterface solver;
CelModel model(solver);
CelNumVar x;
CelIntVar z;
unsigned int mod = 3;
// Maximize
solver.setObjSense(-1.0);
model.setObjective(x);
model.addConstraint(x <= 7.5);
// The modulo constraint:
model.addConstraint(x == z * mod);
The result for x should be 6. However, z is set to 2.5, which should not be possible as I declared it as a CelIntVar.
How can I enforce z to be an integer?
I never used that lib, but I think you should follow the tests.
The core message comes from the readme:
If you want some of your variables to be integers, use CelIntVar instead of CelNumVar. You must bind the solver to an Integer Linear Programming solver as well, for example Coin-cbc.
Looking at Rehearse/tests/testRehearse.cpp -> exemple4() (shown here abbreviated, not a verbatim copy-paste):
OsiClpSolverInterface *solver = new OsiClpSolverInterface();
CelModel model(*solver);
...
CelIntVar x1("x1");
...
solver->initialSolve(); // this is the relaxation (and maybe presolving)!
...
CbcModel cbcModel(*solver); // MIP-solver
cbcModel.branchAndBound(); // Use MIP-solver
printf("Solution for x1 : %g\n", model.getSolutionValue(x1, *cbcModel.solver()));
printf("Solution objvalue = : %g\n", cbcModel.solver()->getObjValue());
This kind of usage (use Osi to get an LP solver; build a MIP solver on top of that Osi-provided LP solver and call branchAndBound) basically follows Cbc's internal interface (with Python's cylp this looks similar).
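Piecing the question's model together with that pattern gives roughly the sketch below (headers and linking as in the examples above). The step where CelModel hands the built problem to the solver is elided here exactly as it is in the test excerpt ("..."); check Rehearse's tests for the precise call:
OsiClpSolverInterface *solver = new OsiClpSolverInterface();
CelModel model(*solver);
CelNumVar x;
CelIntVar z;                        // integer variable, per the readme
unsigned int mod = 3;
solver->setObjSense(-1.0);          // maximize
model.setObjective(x);
model.addConstraint(x <= 7.5);
model.addConstraint(x == z * mod);
// ... hand the model over to the solver, as in the elided part of the test ...
solver->initialSolve();             // LP relaxation only: z may still be fractional
CbcModel cbcModel(*solver);         // MIP solver on top of the LP solver
cbcModel.branchAndBound();          // enforces integrality: z = 2, x = 6
printf("Solution for x : %g\n", model.getSolutionValue(x, *cbcModel.solver()));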
Just as reference: This is the official CoinOR Cbc (Rehearse-free) example from here:
// Copyright (C) 2005, International Business Machines
// Corporation and others. All Rights Reserved.
#include "CbcModel.hpp"
// Using CLP as the solver
#include "OsiClpSolverInterface.hpp"
int main (int argc, const char *argv[])
{
OsiClpSolverInterface solver1;
// Read in example model in MPS file format
// and assert that it is a clean model
int numMpsReadErrors = solver1.readMps("../../Mps/Sample/p0033.mps","");
assert(numMpsReadErrors==0);
// Pass the solver with the problem to be solved to CbcModel
CbcModel model(solver1);
// Do complete search
model.branchAndBound();
/* Print the solution. CbcModel clones the solver so we
need to get current copy from the CbcModel */
int numberColumns = model.solver()->getNumCols();
const double * solution = model.bestSolution();
for (int iColumn=0;iColumn<numberColumns;iColumn++) {
double value=solution[iColumn];
if (fabs(value)>1.0e-7&&model.solver()->isInteger(iColumn))
printf("%d has value %g\n",iColumn,value);
}
return 0;
}

How to explicitly get linear indices from arrayfire?

Suppose I have a std::array<float, 24> foo which is the linearized STL pendant to a column-major format arrayfire array, e.g. af::array bar = af::array(4,3,2, 1, f32);. So I have an af::dim4 object dims with the dimensions of bar, I have up to 4 af::seq objects, and I have the linearized array foo.
How is it possible to explicitly get the indices of foo (i.e. the linearized version of bar) representing e.g. the 2nd and 3rd row, i.e. bar(af::seq(1,2), af::span, af::span, af::span)? I have a small code example given below, which shows what I want. At the end I also explain why I want this.
af::dim4 bigDims = af::dim4(4,3,2);
std::array<float, 24> foo; // Resides in RAM and is big
float* selBuffer_ptr; // Necessary for AF correct type autodetection
std::vector<float> selBuffer;
// Load some data into foo
af::array selection; // Resides in VRAM and is small
af::seq selRows = af::seq(1,2);
af::seq selCols = af::seq(bigDims[1]); // Emulates af::span
af::seq selSlices = af::seq(bigDims[2]); // Emulates af::span
af::dim4 selDims = af::dim4(selRows.size, selCols.size, selSlices.size);
dim_t* linIndices;
// Magic functionality getting linear indices of the selection
// selRows x selCols x selSlices
// Assign all indexed elements to a consecutive memory region in selBuffer
// I know their positions within the full dataset, b/c I know the selection ranges.
selBuffer_ptr = selBuffer.data();
selection = af::array(selDims, selBuffer_ptr); // Copies just the selection to the device (e.g. GPU)
// Do sth. with selection and be happy
// I don't need to write back into the foo array.
Arrayfire must have such logic implemented in order to access elements, and I found several related classes/functions such as af::index, af::seqToDims, af::gen_indexing, af::array::operator() - however, I couldn't figure out an easy way yet.
I thought about basically reimplementing operator(), so that it would work similarly but not require a reference to an array object. But this might be wasted effort if there is an easy way in the arrayfire framework.
Background:
The reason I want to do this is that arrayfire does not allow storing data only in main memory (CPU context) while being linked against a GPU backend. Since I have a big chunk of data that needs to be processed only piece by piece and the VRAM is quite limited, I'd like to instantiate af::array objects ad hoc from an STL container which always resides in main memory.
Of course I know that I could program some index magic to work around my problem, but I'd like to use quite complicated af::seq objects, which could make an efficient implementation of the index logic complicated.
After a discussion with Pavan Yalamanchili on Gitter, I managed to get a working piece of code that I want to share, in case anybody else needs to hold their variables only in RAM and copy-on-use parts of them to VRAM, i.e. the Arrayfire universe (if linked against OpenCL on GPU or Nvidia).
This solution will also help anybody who is using AF somewhere else in their project anyway and wants a convenient way of accessing a big linearized N-dim array (N<=4).
// Compile as: g++ -lafopencl malloc2.cpp && ./a.out
#include <stdio.h>
#include <arrayfire.h>
#include <af/util.h>
#include <cstdlib>
#include <iostream>
#define M 3
#define N 12
#define O 2
#define SIZE M*N*O
int main() {
int _foo; // Dummy variable for pausing program
double* a = new double[SIZE]; // Allocate double array on CPU (Big Dataset!)
for(long i = 0; i < SIZE; i++) // Fill with entry numbers for easy debugging
a[i] = 1. * i + 1;
std::cin >> _foo; // Pause
std::cout << "Full array: ";
// Display full array, out of convenience from GPU
// Don't use this if "a" is really big, otherwise you'll still copy all the data to the VRAM.
af::array ar = af::array(M, N, O, a); // Copy a RAM -> VRAM
af_print(ar);
std::cin >> _foo; // Pause
// Select a subset of the full array in terms of af::seq
af::seq seq0 = af::seq(1,2,1); // Row 2-3
af::seq seq1 = af::seq(2,6,2); // Col 3:5:7
af::seq seq2 = af::seq(1,1,1); // Slice 2
// BEGIN -- Getting linear indices
af::array aidx0 = af::array(seq0);
af::array aidx1 = af::array(seq1).T() * M;
af::array aidx2 = af::reorder(af::array(seq2), 1, 2, 0) * M * N;
af::gforSet(true);
af::array aglobal_idx = aidx0 + aidx1 + aidx2;
af::gforSet(false);
aglobal_idx = af::flat(aglobal_idx).as(u64);
// END -- Getting linear indices
// Copy index list VRAM -> RAM (for easier/faster access)
uintl* global_idx = new uintl[aglobal_idx.dims(0)];
aglobal_idx.host(global_idx);
// Copy all indices into a new RAM array
double* a_sub = new double[aglobal_idx.dims(0)];
for(long i = 0; i < aglobal_idx.dims(0); i++)
a_sub[i] = a[global_idx[i]];
// Generate the "subset" array on GPU & display nicely formatted
af::array ar_sub = af::array(seq0.size, seq1.size, seq2.size, a_sub);
std::cout << "Subset array: "; // living on seq0 x seq1 x seq2
af_print(ar_sub);
return 0;
}
/*
g++ -lafopencl malloc2.cpp && ./a.out
Full array: ar
[3 12 2 1]
1.0000 4.0000 7.0000 10.0000 13.0000 16.0000 19.0000 22.0000 25.0000 28.0000 31.0000 34.0000
2.0000 5.0000 8.0000 11.0000 14.0000 17.0000 20.0000 23.0000 26.0000 29.0000 32.0000 35.0000
3.0000 6.0000 9.0000 12.0000 15.0000 18.0000 21.0000 24.0000 27.0000 30.0000 33.0000 36.0000
37.0000 40.0000 43.0000 46.0000 49.0000 52.0000 55.0000 58.0000 61.0000 64.0000 67.0000 70.0000
38.0000 41.0000 44.0000 47.0000 50.0000 53.0000 56.0000 59.0000 62.0000 65.0000 68.0000 71.0000
39.0000 42.0000 45.0000 48.0000 51.0000 54.0000 57.0000 60.0000 63.0000 66.0000 69.0000 72.0000
ar_sub
[2 3 1 1]
44.0000 50.0000 56.0000
45.0000 51.0000 57.0000
*/
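For reference, the index arithmetic in the "Getting linear indices" block above is plain column-major linearization; here is a quick sanity check against the dump (M = 3, N = 12):
// Element (i, j, k) of an m x n x o column-major array sits at i + j*m + k*m*n
long lin_idx(long i, long j, long k, long m, long n) { return i + j*m + k*(m*n); }
// The subset's first element is (i, j, k) = (1, 2, 1):
// lin_idx(1, 2, 1, 3, 12) == 1 + 6 + 36 == 43, and a[43] == 43 + 1 == 44.0,
// which is indeed the first entry of ar_sub printed above.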
The solution uses some undocumented AF functions and is supposedly slow due to the for loop running over global_idx, but so far it's really the best one can do if one wants to hold data in the CPU context exclusively and share only parts of it with the GPU context of AF for processing.
If anybody knows a way to speed this code up, I'm still open for suggestions.

ASCII data import: how can I match Fortran's bulk read performance in C++?

The setup
Hello, I have Fortran code for reading in ASCII double precision data (example of data file at bottom of question):
program ReadData
integer :: mx,my,mz
doubleprecision, allocatable, dimension(:,:,:) :: charge
! Open the file 'CHGCAR'
open(11,file='CHGCAR',status='old')
! Get the extent of the 3D system and allocate the 3D array
read(11,*)mx,my,mz
allocate(charge(mx,my,mz) )
! Bulk read the entire block of ASCII data for the system
read(11,*) charge
end program ReadData
and the "equivalent" C++ code:
#include <fstream>
#include <vector>
using std::ifstream;
using std::vector;
using std::ios;
int main(){
int mx, my, mz;
// Open the file 'CHGCAR'
ifstream InFile("CHGCAR", ios::in);
// Get the extent of the 3D system and allocate the 3D array
InFile >> mx >> my >> mz;
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
// Method 1: std::ifstream extraction operator to double
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz; ++k)
InFile >> charge[i][j][k];
return 0;
}
Fortran kicking #$$ and taking names
Note that the line
read(11,*) charge
performs the same task as the C++ code:
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz; ++k)
InFile >> charge[i][j][k];
where InFile is an ifstream object (note that while array indices in the Fortran code start at 1 and not 0, the range is the same).
However, the Fortran code runs way, way faster than the C++ code, I think because Fortran does something clever like reading/parsing the file according to the range and shape (values of mx, my, mz) all in one go, and then simply pointing charge at the memory the data was read into. The C++ code, by comparison, needs to access InFile and then charge (which is typically large) back and forth with each iteration, resulting in (I believe) many more IO and memory operations.
I'm reading in potentially billions of values (several gigabytes), so I really want to maximize performance.
My question:
How can I achieve the performance of the Fortran code in C++?
Moving on...
Here is a much faster (than the above C++) C++ implementation, where the file is read in one go into a char array, and then charge is populated as the char array is parsed:
#include <fstream>
#include <vector>
#include <cstdlib> // atof()
#include <cstring> // strtok()
using std::ifstream;
using std::vector;
using std::ios;
int main(){
int mx, my, mz;
// Open the file 'CHGCAR'
ifstream InFile("CHGCAR", ios::in);
// Get the extent of the 3D system and allocate the 3D array
InFile >> mx >> my >> mz;
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
// Method 2: big char array with strtok() and atof()
// Get file size
InFile.seekg(0, InFile.end);
int FileSize = InFile.tellg();
InFile.seekg(0, InFile.beg);
// Read in entire file to FileData
vector<char> FileData(FileSize);
InFile.read(FileData.data(), FileSize);
InFile.close();
/*
* Now simply parse through the char array, saving each
* value to its place in the array of charge density
*/
char* TmpCStr = strtok(FileData.data(), " \n");
// Gets TmpCStr to the first data value
for (int i = 0; i < 3 && TmpCStr != NULL; ++i)
TmpCStr = strtok(NULL, " \n");
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
for (int k = 0; k < mz && TmpCStr != NULL; ++k){
charge[i][j][k] = atof(TmpCStr);
TmpCStr = strtok(NULL, " \n");
}
return 0;
}
Again, this is much faster than the simple >> operator-based method, but still considerably slower than the Fortran version, not to mention much more code.
How to get better performance?
I'm sure that method 2 is the way to go if I am to implement it myself, but I'm curious how I can increase performance to match the Fortran code. The types of things I'm considering and currently researching are:
C++11 and C++14 features
Optimized C or C++ library for doing just this type of thing
Improvements on the individual methods being used in method 2
tokenization library such as that in the C++ String Toolkit Library instead of strtok()
more efficient char to double conversion than atof()
C++ String Toolkit
In particular, the C++ String Toolkit Library will take FileData and the delimiters " \n" and give me a string token object (call it FileTokens); then the triple for loop would look like
for (int k = 0; k < mz; ++k)
for (int j = 0; j < my; ++j)
for (int i = 0; i < mx; ++i)
charge[i][j][k] = FileTokens.nextFloatToken();
This would simplify the code slightly, but there is extra work in copying (in essence) the contents of FileData into FileTokens, which might kill any performance gains from the nextFloatToken() method (presumably more efficient than the strtok()/atof() combination).
There is an example on the C++ String Toolkit (StrTk) Tokenizer tutorial page (included at the bottom of the question) using StrTk's for_each_line() processor that looks similar to my desired application. A difference between the cases, however, is that I cannot assume how many values will appear on each line of the input file, and I do not know enough about StrTk to say if this is a viable solution.
NOT A DUPLICATE
The topic of fast reading of ASCII data to an array or struct has come up before, but I have reviewed the following posts and their solutions were not sufficient:
Fastest way to read data from a lot of ASCII files
How to read numbers from an ASCII file (C++)
Read Numeric Data from a Text File in C++
Reading a file and storing the contents in an array
C/C++ Fast reading large ASCII data file to array or struct
Read ASCII file into matrix in C++
How can I read ASCII data file in C++
Reading in data in columns from a file (C++)
The Fastest way to read a .txt File
How does fast input/ output work in C/C++, by using registers, hexadecimal number and the likes?
reading file into struct array
Example data
Here is an example of the data file I'm importing. The ASCII data is delimited by spaces and line breaks like the below example:
5 3 3
0.23080516813E+04 0.22712439791E+04 0.21616898980E+04 0.19829996749E+04 0.17438686650E+04
0.14601734127E+04 0.11551623512E+04 0.85678544224E+03 0.59238325489E+03 0.38232265554E+03
0.23514479113E+03 0.14651943589E+03 0.10252743482E+03 0.85927499703E+02 0.86525872161E+02
0.10141182750E+03 0.13113419142E+03 0.18057147781E+03 0.25973252462E+03 0.38303754418E+03
0.57142097675E+03 0.85963728360E+03 0.12548019843E+04 0.17106124085E+04 0.21415379433E+04
0.24687336309E+04 0.26588012477E+04 0.27189091499E+04 0.26588012477E+04 0.24687336309E+04
0.21415379433E+04 0.17106124085E+04 0.12548019843E+04 0.85963728360E+03 0.57142097675E+03
0.38303754418E+03 0.25973252462E+03 0.18057147781E+03 0.13113419142E+03 0.10141182750E+03
0.86525872161E+02 0.85927499703E+02 0.10252743482E+03 0.14651943589E+03 0.23514479113E+03
StrTk example
Here is the StrTk example mentioned above. The scenario is parsing the data file that contains the information for a 3D mesh:
input data:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
code:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
// Process points section
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
// Process triangles section
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
This...
vector<vector<vector<double> > > charge(mx, vector<vector<double> >(my, vector<double>(mz)));
...creates a temporary vector<double>(mz), with all 0.0 values, and copies it my times (or perhaps moves then copies my-1 times with a C++11 compiler, but little difference...) to create a temporary vector<vector<double>>(my, ...), which is then copied mx times (...as above...) to initialise all the data. You're reading data in over these elements anyway - there's no need to spend time initialising them here. Instead, create an empty charge and use nested loops to reserve() enough memory for the elements without populating them yet, as sketched below.
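A sketch of that construction, using the question's variable names (capacity is reserved, but size stays zero until the values are read in):
vector<vector<vector<double> > > charge;     // start empty: no zero-fill
charge.reserve(mx);
for (int i = 0; i < mx; ++i) {
    charge.emplace_back();
    charge.back().reserve(my);
    for (int j = 0; j < my; ++j) {
        charge.back().emplace_back();
        charge.back().back().reserve(mz);    // room for mz doubles, none constructed
    }
}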
Next, check that you're compiling with optimisation on. If you are and you're still slower than FORTRAN, in the data-populating nested loops try creating a reference to the vector you're about to .emplace_back() elements onto:
for (int i = 0; i < mx; ++i)
for (int j = 0; j < my; ++j)
{
std::vector<double>& v = charge[i][j];
for (int k = 0; k < mz; ++k)
{
double d;
InFile >> d;
v.emplace_back(d);
}
}
That shouldn't help if your optimiser's done a good job, but is worth trying as a sanity check.
If you're still slower - or just want to try to be even faster - you could try optimising your number parsing: you say your data's all formatted like 0.23080516813E+04 - with fixed sizes like that, you can easily calculate how many bytes to read into a buffer to give you a decent number of values from memory, then for each one you could start an atoll after the . to extract 23080516813 (atol may overflow where long is 32-bit), then multiply it by 10 to the power of minus (11 (your number of digits) minus 04): for speed, keep a table of those powers of ten and index into it using the extracted exponent (i.e. 4). (Note that multiplying by e.g. 1E-7 can be faster than dividing by 1E7 on a lot of common hardware.)
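A sketch of that parsing idea, assuming every token has exactly the sample file's shape - "0." followed by 11 mantissa digits, then "E" and a signed two-digit exponent. That fixed layout (and the absence of negative mantissas) is an assumption about the data, so validate it before trusting this:
#include <cstdint>

static double pow10_tbl[41];              // 10^k for k in [-20, 20]; index with k + 20
static void init_pow10() {
    double p = 1.0;
    for (int k = 0; k <= 20; ++k) { pow10_tbl[20 + k] = p; p *= 10.0; }
    p = 0.1;
    for (int k = 1; k <= 20; ++k) { pow10_tbl[20 - k] = p; p /= 10.0; }
}

double parse_fixed(const char *p) {       // p -> "0.DDDDDDDDDDDE+XX"
    int64_t mant = 0;
    const char *d = p + 2;                // skip "0."
    for (int i = 0; i < 11; ++i)          // accumulate the 11 mantissa digits
        mant = mant * 10 + (d[i] - '0');
    int e = (d[13] - '0') * 10 + (d[14] - '0');  // d[11] == 'E', d[12] is the sign
    if (d[12] == '-') e = -e;
    // value = mant * 10^(e - 11); multiplying by a table entry avoids a division
    return (double)mant * pow10_tbl[(e - 11) + 20];
}
// After init_pow10(): parse_fixed("0.23080516813E+04") == 2308.0516813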
And if you want to blitz this thing, switch to using memory mapped file access. Worth considering boost::mapped_file_source as it's easier to use than even the POSIX API (let alone Windows), and portable, but programming directly against an OS API shouldn't be much of a struggle either.
UPDATE - response to first & second comments
Example of using boost memory mapping:
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <cstring>   // strchr()
#include <sstream>
boost::iostreams::mapped_file_params params("dbldat.in");
boost::iostreams::mapped_file_source file(params); // opens on construction
assert(file.is_open());
const char* p = file.data();
const char* nl = strchr(p, '\n');
std::istringstream iss(std::string(p, nl - p));
size_t x, y, z;
assert(iss >> x >> y >> z);
The above maps a file into memory at address p, then parses the dimensions from the first line. Continue parsing the actual double representations from ++nl onwards. I mention an approach to that above, and you're concerned about the data format changing: you could add a version number to the file, so you can use optimised parsing until the version number changes, then fall back on something generic for "unknown" file formats. As far as something generic goes, for in-memory representations int chars_to_skip; double my_double; assert(sscanf(ptr, "%lf%n", &my_double, &chars_to_skip) == 1); is reasonable (note %lf, not %f, when scanning into a double): see the sscanf docs - you can then advance the pointer through the data by chars_to_skip.
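A sketch of that generic fallback, walking a pointer through the mapped bytes (assuming, as above, that everything after the first newline is whitespace-separated doubles):
#include <cstdio>    // sscanf()

const char *ptr = nl + 1;            // continue just past the dimensions line
double my_double;
int chars_to_skip;
while (sscanf(ptr, "%lf%n", &my_double, &chars_to_skip) == 1) {
    // ... store my_double into charge here ...
    ptr += chars_to_skip;            // advance past the token just consumed
}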
Next, are you suggesting to combine the reserve() solution with the reference creation solution?
Yes.
And (pardon my ignorance) why would using a reference to charge[i][j] and v.emplace_back() be better than charge[i][j].emplace_back()?
That suggestion was to sanity-check that the compiler's not repeatedly evaluating charge[i][j] for each element being emplaced: hopefully it will make no performance difference and you can go back to charge[i][j].emplace_back(), but IMHO it's worth a quick check.
Lastly, I'm skeptical about using an empty vector and reserve()ing at the tops of each loop. I have another program that came to a grinding halt using that method, and replacing the reserve()s with a preallocated multidimensional vector sped it up a lot.
That's possible, but not necessarily true in general or applicable here - a lot depends on the compiler/optimiser (particularly loop unrolling) etc. With unoptimised emplace_back you're having to check the vector's size() against its capacity() repeatedly, but if the optimiser does a good job that should be reduced to insignificance. As with a lot of performance tuning, you often can't reason about things perfectly and conclude what's going to be fastest; you have to try alternatives and measure them with your actual compiler, program data, etc.