How to explicitly get linear indices from arrayfire? - c++

Suppose I have a std::array<float, 24> foo which is the linearized STL counterpart to a column-major ArrayFire array, e.g. af::array bar = af::array(4,3,2, 1, f32);. So I have an af::dim4 object dims with the dimensions of bar, I have up to 4 af::seq objects and I have the linearized array foo.
How can I explicitly get the indices of foo (i.e. the linearized version of bar) that represent e.g. the 2nd and 3rd rows, i.e. bar(af::seq(1,2), af::span, af::span, af::span)? I have given a small code example below, which shows what I want. At the end I also explain why I want this.
af::dim4 bigDims = af::dim4(4,3,2);
std::array<float, 24> foo; // Resides in RAM and is big
float* selBuffer_ptr; // Necessary for AF correct type autodetection
std::vector<float> selBuffer;
// Load some data into foo
af::array selection; // Resides in VRAM and is small
af::seq selRows = af::seq(1,2);
af::seq selCols = af::seq(bigDims[1]); // Emulates af::span
af::seq selSlices = af::seq(bigDims[2]); // Emulates af::span
af::dim4 selDims = af::dim4(selRows.size, selCols.size, selSlices.size);
dim_t* linIndices;
// Magic functionality getting linear indices of the selection
// selRows x selCols x selSlices
// Assign all indexed elements to a consecutive memory region in selBuffer
// I know their positions within the full dataset, b/c I know the selection ranges.
selBuffer_ptr = selBuffer.data(); // Pointer to the first element of the contiguous selection
selection = af::array(selDims, selBuffer_ptr); // Copies just the selection to the device (e.g. GPU)
// Do sth. with selection and be happy
// I don't need to write back into the foo array.
Arrayfire must have such a logic implemented in order to access elements and I found several related classes/functions such as af::index, af::seqToDims, af::gen_indexing, af::array::operator() - however I couldn't figure an easy way out yet.
I thought about basically reimplementing the operator(), so that it would work similarly but not require a reference to an array-object. But this might be wasted effort if there is an easy way in the arrayfire-framework.
Background:
The reason I want to do this is that ArrayFire does not allow data to be stored only in main memory (CPU context) while linked against a GPU backend. Since I have a big chunk of data that needs to be processed only piece by piece and the VRAM is quite limited, I'd like to instantiate af::array objects ad hoc from an STL container which always resides in main memory.
Of course I know that I could program some index magic to work around my problem but I'd like to use quite complicated af::seq objects which could make an efficient implementation of the index logic complicated.

After a discussion with Pavan Yalamanchili on Gitter I managed to get a working piece of code that I want to share in case anybody else needs to hold his variables only in RAM and copy-on-use parts of it to VRAM, i.e. the Arrayfire universe (if linked against OpenCL on GPU or Nvidia).
This solution will also help anybody who is using AF somewhere else in his project anyways and who wants to have a convenient way of accessing a big linearized N-dim array with (N<=4).
// Compile as: g++ -lafopencl malloc2.cpp && ./a.out
#include <stdio.h>
#include <arrayfire.h>
#include <af/util.h>
#include <cstdlib>
#include <iostream>
#define M 3
#define N 12
#define O 2
#define SIZE (M*N*O) // Parenthesized so the macro expands safely inside larger expressions
int main() {
int _foo; // Dummy variable for pausing program
double* a = new double[SIZE]; // Allocate double array on CPU (Big Dataset!)
for(long i = 0; i < SIZE; i++) // Fill with entry numbers for easy debugging
a[i] = 1. * i + 1;
std::cin >> _foo; // Pause
std::cout << "Full array: ";
// Display full array, out of convenience from GPU
// Don't use this if "a" is really big, otherwise you'll still copy all the data to the VRAM.
af::array ar = af::array(M, N, O, a); // Copy a RAM -> VRAM
af_print(ar);
std::cin >> _foo; // Pause
// Select a subset of the full array in terms of af::seq
af::seq seq0 = af::seq(1,2,1); // Row 2-3
af::seq seq1 = af::seq(2,6,2); // Cols 3, 5, 7
af::seq seq2 = af::seq(1,1,1); // Slice 2
// BEGIN -- Getting linear indices
af::array aidx0 = af::array(seq0);
af::array aidx1 = af::array(seq1).T() * M;
af::array aidx2 = af::reorder(af::array(seq2), 1, 2, 0) * M * N;
af::gforSet(true);
af::array aglobal_idx = aidx0 + aidx1 + aidx2;
af::gforSet(false);
aglobal_idx = af::flat(aglobal_idx).as(u64);
// END -- Getting linear indices
// Copy index list VRAM -> RAM (for easier/faster access)
uintl* global_idx = new uintl[aglobal_idx.dims(0)];
aglobal_idx.host(global_idx);
// Copy all indices into a new RAM array
double* a_sub = new double[aglobal_idx.dims(0)];
for(long i = 0; i < aglobal_idx.dims(0); i++)
a_sub[i] = a[global_idx[i]];
// Generate the "subset" array on GPU & display nicely formatted
af::array ar_sub = af::array(seq0.size, seq1.size, seq2.size, a_sub);
std::cout << "Subset array: "; // living on seq0 x seq1 x seq2
af_print(ar_sub);
return 0;
}
/*
g++ -lafopencl malloc2.cpp && ./a.out
Full array: ar
[3 12 2 1]
1.0000 4.0000 7.0000 10.0000 13.0000 16.0000 19.0000 22.0000 25.0000 28.0000 31.0000 34.0000
2.0000 5.0000 8.0000 11.0000 14.0000 17.0000 20.0000 23.0000 26.0000 29.0000 32.0000 35.0000
3.0000 6.0000 9.0000 12.0000 15.0000 18.0000 21.0000 24.0000 27.0000 30.0000 33.0000 36.0000
37.0000 40.0000 43.0000 46.0000 49.0000 52.0000 55.0000 58.0000 61.0000 64.0000 67.0000 70.0000
38.0000 41.0000 44.0000 47.0000 50.0000 53.0000 56.0000 59.0000 62.0000 65.0000 68.0000 71.0000
39.0000 42.0000 45.0000 48.0000 51.0000 54.0000 57.0000 60.0000 63.0000 66.0000 69.0000 72.0000
ar_sub
[2 3 1 1]
44.0000 50.0000 56.0000
45.0000 51.0000 57.0000
*/
The solution uses some undocumented AF functions and is supposedly slow due to the for loop running over global_idx, but so far it's really the best one can do if one wants to hold data in the CPU context exclusively and share only parts with the GPU context of AF for processing.
If anybody knows a way to speed this code up, I'm still open for suggestions.
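The index arithmetic in the answer above (row + column * M + slice * M * N for a column-major layout) can also be done entirely on the host, which avoids copying the index list from VRAM back to RAM. What follows is only a sketch under stated assumptions: Range is a hypothetical stand-in for af::seq with an inclusive begin/end and a positive step, and linearIndices is a hypothetical helper, not an ArrayFire function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for af::seq: inclusive begin/end with a positive step.
struct Range {
    std::size_t begin, end, step;
    std::size_t count() const { return (end - begin) / step + 1; }
};

// Linear indices of the selection r0 x r1 x r2 inside a column-major array
// with d0 rows and d1 columns. The result is ordered column-major as well,
// so the gathered values can be handed to an array constructor directly.
std::vector<std::size_t> linearIndices(const Range& r0, const Range& r1,
                                       const Range& r2,
                                       std::size_t d0, std::size_t d1) {
    assert(r0.step > 0 && r1.step > 0 && r2.step > 0);
    assert(r0.begin <= r0.end && r1.begin <= r1.end && r2.begin <= r2.end);
    std::vector<std::size_t> idx;
    idx.reserve(r0.count() * r1.count() * r2.count());
    for (std::size_t k = r2.begin; k <= r2.end; k += r2.step)         // slices (slowest)
        for (std::size_t j = r1.begin; j <= r1.end; j += r1.step)     // columns
            for (std::size_t i = r0.begin; i <= r0.end; i += r0.step) // rows (fastest)
                idx.push_back(i + d0 * (j + d1 * k));
    return idx;
}
```

For the example in the answer (M = 3, N = 12, rows 1..2, columns 2, 4, 6, slice 1), linearIndices({1,2,1}, {2,6,2}, {1,1,1}, 3, 12) yields 43, 44, 49, 50, 55, 56, which are the positions of the values 44, 45, 50, 51, 56, 57 shown in the printed subset.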

Related

Element-wise shifting from smaller array to a larger array

I am programming an ESP32 in the Arduino framework. For my application, I need to create a buffer which will store information from both the present and the last time it was accessed. Here is what I am attempting to do.
//first buffer
char buffer1[4];
//second buffer
char buffer2[8];
void setup {
//setup
}
//buffer1 values will change with each iteration of loop from external inputs
//buffer2 must store most recent values of buffer1 plus values of buffer1 from when loop last ran
for example:
**loop first iteration**
void loop {
buffer1[0] = {1};
buffer1[1] = {2};
buffer1[2] = {3};
buffer1[3] = {1};
saveold(); //this is the function I'm trying to implement to save values to buffer2 in an element-wise way
}
//value of buffer2 should now be: buffer2 = {1,2,3,1,0,0,0,0}
**loop second iteration**
void loop {
buffer1[0] = {2};
buffer1[1] = {3};
buffer1[2] = {4};
buffer1[3] = {2};
saveold();
}
//value of buffer2 should now be: buffer2 = {2,3,4,2,1,2,3,1}
From what I've been able to understand through searching online, the "saveold" function I'm trying to make
should implement some form of memmove for these array operations
I've tried to piece it together, but I always overwrite the value of buffer2 instead of somehow shifting new values in, while retaining the old ones
This is all I've got:
void saveold() {
memmove(&buffer2[0], &buffer1[0], (sizeof(buffer1[0]) * 4));
}
From my understanding, this copies buffer1 starting from index position 0 to buffer2, starting at index position 0, for 4 bytes (where 1 char = 1 byte).
Computer science is not my background, so perhaps there is some fundamental solution or strategy that I am missing. Any pointers would be appreciated.
You have multiple options to implement saveold():
Solution 1
void saveold() {
// "shift" lower half into upper half, saving recent values (actually it's a copy)
buffer2[4] = buffer2[0];
buffer2[5] = buffer2[1];
buffer2[6] = buffer2[2];
buffer2[7] = buffer2[3];
// copy current values
buffer2[0] = buffer1[0];
buffer2[1] = buffer1[1];
buffer2[2] = buffer1[2];
buffer2[3] = buffer1[3];
}
Solution 2
void saveold() {
// "shift" lower half into upper half, saving recent values (actually it's a copy)
memcpy(buffer2 + 4, buffer2 + 0, 4 * sizeof buffer2[0]);
// copy current values
memcpy(buffer2 + 0, buffer1, 4 * sizeof buffer1[0]);
}
Some notes
There are even more ways to do it. Anyway, choose the one you understand best.
Be sure that buffer2 is exactly double size of buffer1.
memcpy() can be used safely if source and destination don't overlap. memmove() checks for overlaps and reacts accordingly.
&buffer1[0] is the same as buffer1 + 0. Feel free to use the expression you better understand.
sizeof is an operator, not a function. So sizeof buffer1[0] evaluates to the size of buffer1[0]. A common and widely accepted expression to calculate the number of elements in an array is sizeof buffer1 / sizeof buffer1[0]. You only need parentheses if you evaluate the size of a data type, like sizeof (int).
Solution 3
The last note leads directly to this improvement of solution 1:
void saveold() {
// "shift" lower half into upper half, saving recent values
size_t size = sizeof buffer2 / sizeof buffer2[0];
for (size_t i = 0; i < size / 2; ++i) { // size_t avoids a signed/unsigned comparison
buffer2[size / 2 + i] = buffer2[i];
}
// copy current values
for (size_t i = 0; i < size / 2; ++i) {
buffer2[i] = buffer1[i];
}
}
To apply this knowledge to solution 2 is left as an exercise for you. ;-)
The correct way to do this is to use buffer pointers, not hard-copy backups. Doing hard copies with memcpy is particularly bad on slow legacy microcontrollers such as AVR. I'm not quite sure what MCU the ESP32 has; it seems to be some oddball one from Tensilica. Anyway, this answer applies universally to any processor where you have more data than the CPU data word length.
perhaps there is some fundamental solution or strategy that I am missing.
Indeed - it really sounds like what you are looking for is a ring buffer. That is, an array of fixed size which has a pointer to the beginning of the valid data, and another pointer to the end of the data. You move the pointers, not the data. This is much more efficient both in terms of execution speed and RAM usage, compared to making naive hard copies with memcpy.
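A minimal sketch of the "move the pointers, not the data" idea, reduced to the two generations of four samples from the question. The names History, beginWrite and read are hypothetical, and an index is used in place of raw pointers; treat it as an illustration rather than a drop-in replacement for saveold().

```cpp
#include <cassert>
#include <cstddef>

// Sizes follow the question: 4 samples per iteration, 2 generations kept.
constexpr std::size_t kSamples = 4;
constexpr std::size_t kGenerations = 2;

struct History {
    char slots[kGenerations][kSamples] = {};
    std::size_t newest = 0;  // index of the slot holding the latest samples

    // Returns the slot to fill with new samples. Only the index advances;
    // the oldest generation is overwritten and nothing is copied.
    char* beginWrite() {
        newest = (newest + 1) % kGenerations;
        return slots[newest];
    }

    // age 0 = latest samples, age 1 = samples from the previous iteration.
    const char* read(std::size_t age) const {
        assert(age < kGenerations);
        return slots[(newest + kGenerations - age) % kGenerations];
    }
};
```

In loop(), the external inputs would be written straight into the pointer returned by beginWrite() instead of into buffer1, and read(0) and read(1) take the place of the lower and upper halves of buffer2.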

Reading hdf5 into c++ with memory problems

I am rewriting a code I had developed in python into c++ mainly for an improvement in speed; while also hoping to gain more experience in this language. I also plan on using openMP to parallelize this code onto 48 cores which share 204GB of memory.
The program I am writing is simple, I import an hdf5 file which is 3D :
A[T][X][E], where T is associated to each timestep from a simulation, X represents where the field is measured, and E(0:2) represents the electric field in x,y,z.
Each element in A is a double, and the bin sizes span: A[15000][80][3].
The first hiccup I have run into is inputting this 'large' h5 file into an array and would like a professional opinion before I continue. My first attempt:
...
#define RANK 3
#define DIM1 15001
#define DIM2 80
#define DIM3 3
using namespace std;
int main (void)
{
// Define HDF5 variables for opening file.
hid_t file1, dataset1;
double bufnew[DIM1][DIM2][DIM3];
herr_t ret;
uint i, j, k;
file1 = H5Fopen (FILE1, H5F_ACC_RDWR, H5P_DEFAULT);
dataset1 = H5Dopen (file1, "EFieldOnLine", H5P_DEFAULT);
ret = H5Dread (dataset1, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
H5P_DEFAULT, bufnew);
cout << "Let's try dumping 0->100 elements" << endl;
for(i=1; i < 100; i++) cout << bufnew[i][20][2] << endl;
...
which leads to a segmentation fault from array declaration. My next move was to use either a 3D array (new), or a 3D vector. However, I have seen much debate against these methods, and more importantly, I only need ONE component of the E, i.e. I would like to reshape A[T][X][E] -> B[T][X] for say, the x-component of E.
Sorry for the lengthy post, but I wanted to be as clear as possible and would like to emphasize again that I am interested in learning how to write the fastest, and most efficient code.
I appreciate all of your suggestions, time and wisdom.
Defining an array as a local variable means allocating it on the stack. The stack is usually limited to a few megabytes, and a stack overflow leads to a segfault. Large data structures should be allocated on the heap dynamically (using the new operator) or statically (by defining them as global variables).
I wouldn't advise to make a vector of vectors of vectors for such dimensions.
Instead, creating a one-dimensional array to store all values
double *bufnew = new double[DIM1*DIM2*DIM3];
and accessing it with the following formula to calculate linear position of a given 3D item
bufnew[(T*DIM2+X)*DIM3+E] = ... ; // bufnew[T][X][E]
should work ok.
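The formula above, together with the reshaping A[T][X][E] -> B[T][X] asked for in the question, can be sketched as follows. This assumes the row-major layout implied by the formula; flatIndex and extractComponent are hypothetical helper names, and std::vector is used in place of new[] so that the memory is released automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of element [t][x][e] in a row-major buffer of shape [nT][nX][nE].
inline std::size_t flatIndex(std::size_t t, std::size_t x, std::size_t e,
                             std::size_t nX, std::size_t nE) {
    return (t * nX + x) * nE + e;
}

// Copy one field component e out of A[nT][nX][nE] into B[nT][nX].
std::vector<double> extractComponent(const std::vector<double>& a,
                                     std::size_t nT, std::size_t nX,
                                     std::size_t nE, std::size_t e) {
    assert(a.size() == nT * nX * nE && e < nE);
    std::vector<double> b(nT * nX);
    for (std::size_t t = 0; t < nT; ++t)
        for (std::size_t x = 0; x < nX; ++x)
            b[t * nX + x] = a[flatIndex(t, x, e, nX, nE)];
    return b;
}
```

A vector of DIM1 * DIM2 * DIM3 doubles is contiguous, so its data() pointer can be passed as the read buffer in place of bufnew; extractComponent(a, DIM1, DIM2, DIM3, 0) then gives the x-component as a flat [T][X] array.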

Efficient index bound check and double to int cast

Consider the following code snippet
double *x, *id;
int i, n; // = vector size
// allocate and zero x
// set id to 0:n-1
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
The code uses values in vector id of type double as indices into vector x. In order for the indices to be valid I verify that they are greater than or equal to 0, less than vector size n, and that doubles stored in id are in fact integers. In this example id stores integers from 0 to n-1, so all vectors are accessed linearly and branch prediction of the if statement should always work.
For n=1e8 the code takes 0.21s on my computer. Since it seems to me it is a computationally light-weight loop, I expect it to be memory bandwidth bounded. Based on the benchmarked memory bandwidth I expect it to run in 0.15s. I calculate the memory footprint as 8 bytes per id value, and 16 bytes per x value (it needs to be both written, and read from memory since I assume SSE streaming is not used). So a total of 24 bytes per vector entry.
The questions:
Am I wrong saying that this code should be memory bandwidth bounded, and that it can be improved?
If not, do you know a way in which I could improve the performance so that it works with the speed of the memory?
Or maybe everything is fine and I can not easily improve it otherwise than running it in parallel?
Changing the type of id is not an option - it must be double. Also, in the general case id and x have different sizes and must be kept as separate arrays - they come from different parts of the program. In short, I wonder if it is possible to write the bound checks and the type cast/integer validation in a more efficient manner.
For convenience, the entire code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h> // gettimeofday, struct timeval
static struct timeval tb, te;
void tic()
{
gettimeofday(&tb, NULL);
}
void toc(const char *idtxt)
{
long s,u;
gettimeofday(&te, NULL);
s=te.tv_sec-tb.tv_sec;
u=te.tv_usec-tb.tv_usec;
printf("%-30s%10li.%.6li\n", idtxt,
(s*1000000+u)/1000000, (s*1000000+u)%1000000);
}
int main(int argc, char *argv[])
{
double *x = NULL;
double *id = NULL;
int i, n;
// vector size is a command line parameter
n = atoi(argv[1]);
printf("x size %i\n", n);
// not included in timing in MATLAB
x = calloc(sizeof(double),n);
memset(x, 0, sizeof(double)*n);
// create index vector
tic();
id = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("id = 1:n");
// use id to index x and set all entries to 1
tic();
for(i=0; i<n; i++) {
long iid = (long)id[i];
if(iid>=0 && iid<n && (double)iid==id[i]){
x[iid] = 1;
} else break;
}
toc("x(id) = 1");
}
EDIT: Disregard if you can't split the arrays!
I think it can be improved by taking advantage of a common cache concept. You can either make data accesses close in time or location. With tight for-loops, you can achieve a better data hit-rate by shaping your data structures like your for-loop. In this case, you access two different arrays, usually the same indices in each array. Your machine is loading chunks of both arrays each iteration through that loop. To increase the use of each load, create a structure to hold an element of each array, and create a single array with that struct:
struct my_arrays
{
double x;
int id;
};
struct my_arrays* arr = malloc(sizeof(struct my_arrays)*n);
Now, each time you load data into cache, you'll hit everything you load because the arrays are close together.
EDIT: Since your intent is to check for an integer value, and you make the explicit assumption that the values are small enough to be represented precisely in a double with no loss of precision, then I think your comparison is fine.
My previous answer had a reference to beware comparing large doubles after implicit casting, and I referenced this:
What is the most effective way for float and double comparison?
It might be worth considering examination of double type representation.
For example, the following code shows how to compare a double number greater than 1 to 999:
bool check(double x)
{
union
{
double d;
uint32_t y[2];
};
d = x;
bool answer;
uint32_t exp = (y[1] >> 20) & 0x3ff;
uint32_t fraction1 = y[1] << (13 + exp); // upper bits of fractional part
uint32_t fraction2 = y[0]; // lower 32 bits of fractional part
if (fraction2 != 0 || fraction1 != 0)
answer = false;
else if (exp > 8)
answer = false;
else if (exp == 8)
answer = (y[1] < 0x408f3800); // this is the representation of 999
else
answer = true;
return answer;
}
This looks like much code, but it might be vectorized easily (using e.g. SSE), and if your bound is a power of 2, it might simplify the code further.

Thrust: summing the elements of an array indexed by another array [Matlab's syntax sum(x(indices))]

I'm trying to sum the elements of an array indexed by another array using the Thrust library, but I couldn't find an example. In other words, I want to implement Matlab's syntax
sum(x(indices))
Here is some guideline code showing what I would like to achieve:
#define N 65536
// device array copied using cudaMemcpyToSymbol
__device__ int global_array[N];
// function to implement with thrust
__device__ int support(unsigned short* _memory, unsigned short* _memShort)
{
int support = 0;
for(int i=0; i < _memSizeShort; i++)
support += global_array[_memory[i]];
return support;
}
Also, from the host code, can I use the global_array[N] without copying it back with cudaMemcpyFromSymbol ?
Every comment/answer is appreciated :)
Thanks
This is a very late answer provided here to remove this question from the unanswered list. I'm sure that the OP has already found a solution (since May 2012 :-)), but I believe that the following could be useful to other users.
As pointed out by @talonmies, the problem can be solved by a fused gather-reduction. The solution is indeed an application of Thrust's permutation_iterator and reduce. The permutation_iterator makes it possible to (implicitly) reorder the target array x according to the indices in the indices array. reduce performs the sum of the (implicitly) reordered array.
This application is part of Thrust's documentation, below reported for convenience
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <iostream> // std::cout
// this example fuses a gather operation with a reduction for
// greater efficiency than separate gather() and reduce() calls
int main(void)
{
// gather locations
thrust::device_vector<int> map(4);
map[0] = 3;
map[1] = 1;
map[2] = 0;
map[3] = 5;
// array to gather from
thrust::device_vector<int> source(6);
source[0] = 10;
source[1] = 20;
source[2] = 30;
source[3] = 40;
source[4] = 50;
source[5] = 60;
// fuse gather with reduction:
// sum = source[map[0]] + source[map[1]] + ...
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.end()));
// print sum
std::cout << "sum is " << sum << std::endl;
return 0;
}
In the above example, map plays the role of indices, while source plays the role of x.
Concerning the additional question in your comment (iterating over a reduced number of terms), it will be sufficient to change the following line
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.end()));
to
int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()),
thrust::make_permutation_iterator(source.begin(), map.begin()+N));
if you want to iterate only over the first N terms of the indexing array map.
Finally, concerning the possibility of using global_array from the host, you should notice that this is a vector residing on the device, so you do need a cudaMemcpyFromSymbol to move it to the host first.

High number causes seg fault

This bit of code is from a program I am writing to take in x col and x rows to run a matrix multiplication on CUDA, parallel processing. The larger the sample size, the better.
I have a function that auto generates x amount of random numbers.
I know the answer is simple but I just wanted to know exactly why. But when I run it with say 625000000 elements in the array, it seg faults. I think it is because I have gone over the size allowed in memory for an int.
What data type should I use in place of int for a larger number?
This is how the data is being allocated, then passed into the function.
a.elements = (float*) malloc(mem_size_A);
where
int mem_size_A = sizeof(float) * size_A; //for the example let size_A be 625,000,000
Passed:
randomInit(a.elements, a.rowSize,a.colSize, oRowA, oColA);
What the randomInit is doing is say I enter a 2x2 but I am padding it up to a multiple of 16. So it takes the 2x2 and pads the matrix to a 16x16 of zeros and the 2x2 is still there.
void randomInit(float* data, int newRowSize,int newColSize, int oldRowSize, int oldColSize)
{
printf("Initializing random function. The new sized row is %d\n", newRowSize);
for (int i = 0; i < newRowSize; i++)//go per row of new sized row.
{
for(int j=0;j<newColSize;j++)
{
printf("This loop\n");
if(i<oldRowSize&&j<oldColSize)
{
data[newRowSize*i+j]=rand() / (float)RAND_MAX;//brandom();
}
else
data[newRowSize*i+j]=0;
}
}
}
I've even run it with the printf in the loop. This is the result I get:
Creating the random numbers now
Initializing random function. The new sized row is 25000
This loop
Segmentation fault
Your memory allocation for data is probably failing.
Fortunately, you almost certainly don't need to store a large collection of random numbers.
Instead of storing:
data[n]=rand() / (float)RAND_MAX
for some huge collection of n, you can run:
srand(n);
value = rand() / (float)RAND_MAX;
when you need a particular number and you'll get the same value every time, as if they were all calculated in advance.
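One way the allocation can fail here: assuming a 32-bit int, sizeof(float) * 625,000,000 is 2,500,000,000 bytes, which exceeds INT_MAX (2,147,483,647), so int mem_size_A cannot hold the value. A minimal sketch that computes the size in size_t and returns the malloc result so the caller can check it; floatBytes and allocFloats are hypothetical helper names, not part of the original program.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>

// Byte count for `count` floats, computed in size_t instead of int.
inline std::size_t floatBytes(std::size_t count) {
    // Guard against overflow of the multiplication itself.
    assert(count <= static_cast<std::size_t>(-1) / sizeof(float));
    return count * sizeof(float);
}

// Allocate `count` floats. Returns nullptr on failure, so the caller
// must test the result before writing to it.
inline float* allocFloats(std::size_t count) {
    return static_cast<float*>(std::malloc(floatBytes(count)));
}
```

Checking the returned pointer against nullptr before calling randomInit would show directly whether the allocation is what fails, instead of crashing on the first write.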
I think you're going past the size you allocated for data. When your newRowSize is too large, you're accessing unallocated memory.
Remember, data isn't infinitely big.
Well, if the problem really is the integer size used for your array access, you will not be able to fix it that way. I think you probably just don't have enough memory to store that much data.
If you want to go beyond that, define a custom structure, or a class if you are in C++. But you will lose the O(1) access time that comes with an array.