How to create an array of cl::sycl::buffers? - c++

I am using Xilinx's triSYCL implementation from GitHub: https://github.com/triSYCL/triSYCL.
I am trying to create a design with 100 producers/consumers that read from and write to 100 pipes.
What I am not sure of is how to create an array of cl::sycl::buffer objects and initialize it using std::iota.
Here is my code:
constexpr size_t T=6;
constexpr size_t n_threads=100;
cl::sycl::buffer<float, n_threads> a { T };
for (int i=0; i<n_threads; i++)
{
auto ba = a[i].get_access<cl::sycl::access::mode::write>();
// Initialize buffer a with increasing integer numbers starting at 0
std::iota(ba.begin(), ba.end(), i*T);
}
And I am getting the following error:
error: no matching function for call to ‘cl::sycl::buffer<float, 2>::buffer(<brace-enclosed initializer list>)’
cl::sycl::buffer<float, n_threads> a { T };
I am new to C++ programming. So I am not able to figure out the exact way to do this.

I think two points cause the issue you are currently having:
The 2nd template argument in the buffer object definition should be the dimensionality of the buffer (the number of dimensions: 1, 2 or 3), not the size of the buffer.
The constructor for the buffer should contain either the actual dimensions of the buffer, or the data that you want the buffer to have and the dimensions. To pass the dimensions, you need to pass a cl::sycl::range object to the constructor.
As I understand it, you are trying to initialize a buffer of dimensionality 1 with dimensions { 100, 1, 1 }, i.e. 100 elements. To do this, the definition of a should change to:
cl::sycl::buffer < float, 1 > a(cl::sycl::range< 1 >(n_threads));
Also, since the dimensionality template parameter defaults to 1 when it is omitted, you can achieve the same effect with:
cl::sycl::buffer< float > a (cl::sycl::range< 1 >(n_threads));
As for initializing the buffer with std::iota, you have 3 options:
Use an array to initialize the data with the iota usage and pass them to the sycl buffer (case A),
Use the accessor to write to the buffer directly for host - CPU only (case B), or
Use an accessor with a parallel_for for execution on either host or an OpenCL device (case C).
Accessors should not be used as iterators (with .begin(), .end())
Case A:
std::vector<float> data(n_threads); // or std::array<float, n_threads> data;
std::iota(data.begin(), data.end(), 0); // this will create the data { 0, 1, 2, 3, ... }
cl::sycl::buffer<float> a(data.data(), cl::sycl::range<1>(n_threads));
// The data in a are already initialized, you can create an accessor to use them directly
Case B:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
{
auto ba = a.get_access<cl::sycl::access::mode::write>();
for(size_t i=0; i< n_threads; i++) {
ba[i] = i;
}
}
Case C:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
cl::sycl::queue q{cl::sycl::default_selector()}; // create a command queue for host or device execution
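q.submit([&](cl::sycl::handler& cgh) {
// inside a command group the accessor must be bound to the handler
auto ba = a.get_access<cl::sycl::access::mode::write>(cgh);
// parallel_for needs the iteration range as well as the kernel
cgh.parallel_for<class kernel_name>(cl::sycl::range<1>(n_threads), [=](cl::sycl::id<1> i){
ba[i] = i.get(0);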
});
});
q.wait_and_throw(); // wait until kernel execution completes
Also check chapter 4.8 of the SYCL 1.2.1 spec https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf as it has an example for iota

Disclaimer: triSYCL is a research project for now. Please use ComputeCpp for anything serious. :-)
If you really need arrays of buffer, I guess you can use something similar to Is there a way I can create an array of cl::sycl::pipe?
As a variant, you can use a std::vector<cl::sycl::buffer<float>> or std::array<cl::sycl::buffer<float>, n_threads> and initialize with a loop from a cl::sycl::buffer<float> { T }.
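To make the per-buffer std::iota step from the question concrete, here is a minimal sketch that prepares the host data only, so it compiles without a SYCL implementation. The helper name make_host_data is hypothetical; T and n_threads are taken from the question. Row i holds T consecutive values starting at i * T.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr std::size_t T = 6;            // elements per buffer, as in the question
constexpr std::size_t n_threads = 100;  // number of buffers, as in the question

// Hypothetical helper: builds the host data for n_threads buffers.
// Row i is filled with T increasing values starting at i * T.
std::vector<std::array<float, T>> make_host_data() {
    std::vector<std::array<float, T>> rows(n_threads);
    for (std::size_t i = 0; i < n_threads; ++i) {
        std::iota(rows[i].begin(), rows[i].end(), static_cast<float>(i * T));
    }
    return rows;
}
```

Each row could then be handed to a buffer as in Case A, for example by pushing cl::sycl::buffer<float>(rows[i].data(), cl::sycl::range<1>(T)) into a std::vector<cl::sycl::buffer<float>>. This assumes the host data outlives the buffers that wrap it.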

Related

Creating BoolTensor Mask in torch C++

I am trying to create a mask for torch in C++ of type BoolTensor. The first n elements in dimension one need to be False and the rest need to be True.
This is my attempt but I do not know if this is correct (size is the number of elements):
src_mask = torch::BoolTensor({6, 1});
src_mask[:size,:] = 0;
src_mask[size:,:] = 1;
I'm not sure I understand your goal exactly, so here is my best attempt to convert your pseudo-code into C++.
First, with libtorch you declare the type of your tensor through the torch::TensorOptions struct (type names are prefixed with a lowercase k).
Second, your Python-like slicing is possible thanks to the torch::Tensor::slice function (see here and there).
Finally, that gives you something like :
// Creates a tensor of boolean, initially all ones
auto options = torch::TensorOptions().dtype(torch::kBool);
torch::Tensor bool_tensor = torch::ones({6,1}, options);
// Set the slice to 0
int size = 3;
bool_tensor.slice(/*dim=*/0, /*start=*/0, /*end=*/size) = 0;
std::cout << bool_tensor << std::endl;
Please note that this will set the first size rows to 0. I assumed that's what you meant by "first elements in dimension x".
Another way to do it:
using namespace torch::indexing; //for using Slice(...) function
at::Tensor src_mask = at::empty({ 6, 1 }, at::kBool); //empty bool tensor
src_mask.index_put_({ Slice(None, size), Slice() }, 0); //src_mask[:size,:] = 0
src_mask.index_put_({ Slice(size, None), Slice() }, 1); //src_mask[size:,:] = 1
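As a quick way to check what either libtorch version should print, the intended result can be sketched in plain C++ without libtorch. The helper name make_mask and the use of std::vector<bool> as a stand-in for a {rows, 1} bool tensor are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a {rows, 1} bool tensor:
// the first `size` entries are false, the remaining ones are true.
std::vector<bool> make_mask(std::size_t rows, std::size_t size) {
    std::vector<bool> mask(rows, true);
    for (std::size_t i = 0; i < size && i < rows; ++i) {
        mask[i] = false;
    }
    return mask;
}
```

With rows = 6 and size = 3 this yields three false entries followed by three true entries, which is the pattern the slice assignments above are meant to produce.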

dart/flutter: getting data array from C/C++ using ffi?

The official flutter tutorial on C/C++ interop through ffi only touches on calling a C++ function and getting a single return value.
Goal
What if I have a data buffer created on C/C++ side, but want to deliver to dart/flutter-side to show?
Problem
With @MilesBudnek's tip, I'm testing Dart's FFI by trying to have safe memory deallocation from Dart to C/C++. The test reuses the official struct sample.
I could get the Array as a dart Pointer, but it's unclear to me how to iterate the array as a collection easily.
Code
I'm implementing a Dart-side C array binding like this:
In struct.h
struct Array
{
int* array;
int len;
};
and a pair of simple allocation/deallocation test functions:
struct Array* get_array();
int del_array(struct Array* arr);
Then on Dart side in structs.dart:
typedef get_array_func = Pointer<Array> Function();
typedef del_array_func = void Function(int arrAddress);
...
final getArrayPointer = dylib.lookup<NativeFunction<get_array_func>>('get_array');
final getArray = getArrayPointer.asFunction<get_array_func>();
final arrayPointer = getArray();
final array = arrayPointer.ref.array;
print('array.array: $array');
This gives me the print out
array.array: Pointer<Int32>: address=0x7fb0a5900000
Question
Can I convert the array pointer to a List easily? Something like:
final array = arrayPointer.ref.array.toList();
array.forEach(index, elem) => print("array[$idx]: $elem");
======
Old Question (you can skip this)
Problem
It's unclear to me how to retrieve this kind of vector data from C/C++ by dart/flutter.
Possible solutions
More importantly, how to push data from C++ side from various threads?
If there is no builtin support, off the top of my head I'd need to implement some communication schemes.
Option #1: Networking
I could do network through TCP sockets. But I'm reluctant to go there if there are easier solutions.
Option #2: file I/O
Write data to file with C/C++, and let dart/flutter poll on the file and stream data over. This is not realtime friendly.
So, are there better options?
Solved it.
According to this issue, the API asTypedList is the way to go.
Here is the code that works for me
final getArrayPointer = dylib.lookup<NativeFunction<get_array_func>>('get_array');
final getArray = getArrayPointer.asFunction<get_array_func>();
final arrayPointer = getArray();
final arr = arrayPointer.ref.arr;
print('array.array: $arr');
final arrReal = arr.asTypedList(10);
final arrType = arrReal.runtimeType;
print('arrReal: $arrReal, $arrType');
arrReal.forEach((elem) => print("array: $elem"));
This gives me:
array.array: Pointer<Int32>: address=0x7f9eebb02870
arrReal: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], Int32List
array: 0
array: 1
array: 2
array: 3
array: 4
array: 5
array: 6
array: 7
array: 8
array: 9
asTypedList will only work with pointers that relate to TypedData.
There are other cases where, for example, you want to convert a Pointer<UnsignedChar> to a Uint8List. In this case you can:
Use an extension that casts the Pointer<UnsignedChar> to a Pointer<Uint8> and then calls asTypedList. In this case you have to make sure the pointer is not freed while the Uint8List is still referenced.
extension UnsignedCharPointerExtension on Pointer<UnsignedChar> {
Uint8List? toUint8List(int length) {
if (this == nullptr) {
return null;
}
return cast<Uint8>().asTypedList(length);
}
}
Use an extension that does not cast the pointer but copies the data manually. In this case you can free the pointer after you get the Uint8List.
extension UnsignedCharPointerExtension on Pointer<UnsignedChar> {
Uint8List? toUint8List(int length) {
if (this == nullptr) {
return null;
}
final Uint8List list = Uint8List(length);
for (int i = 0; i < length; i++) {
list[i] = this[i];
}
return list;
}
}
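The question declares get_array and del_array but does not show their bodies. A minimal sketch of what the native side might look like follows; the bodies are assumptions, chosen so that the array holds 0 to 9 as in the output shown above, and the return codes of del_array are hypothetical.

```cpp
#include <cstdlib>

extern "C" {

struct Array {
    int* array;
    int len;
};

// Hypothetical implementation: allocates an Array holding 0..9.
// Returns nullptr if either allocation fails.
struct Array* get_array() {
    const int len = 10;
    Array* arr = static_cast<Array*>(std::malloc(sizeof(Array)));
    if (arr == nullptr) {
        return nullptr;
    }
    arr->array = static_cast<int*>(std::malloc(len * sizeof(int)));
    if (arr->array == nullptr) {
        std::free(arr);
        return nullptr;
    }
    arr->len = len;
    for (int i = 0; i < len; ++i) {
        arr->array[i] = i;
    }
    return arr;
}

// Hypothetical implementation: frees the element storage, then the struct.
// Returns 0 on success and -1 when given a null pointer.
int del_array(struct Array* arr) {
    if (arr == nullptr) {
        return -1;
    }
    std::free(arr->array);
    std::free(arr);
    return 0;
}

}  // extern "C"
```

Because the struct carries its own length, the Dart side could presumably pass that len field to asTypedList instead of the hard-coded 10. Any typed list obtained this way must not be used after del_array has been called.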

luaT_pushudata returns proper Tensor type and dimension, but garbage data

I have a short clip of C++ code that should theoretically work to create and return a torch.IntTensor object, but when I call it from Torch I get garbage data.
Here is my code (note this snippet leaves out the function registering, but suffice it to say that it registers fine--I can provide it if necessary):
static int ltest(lua_State* L)
{
std::vector<int> matches;
for (int i = 0; i < 10; i++)
{
matches.push_back(i);
}
performMatching(dist, matches, ratio_threshold);
THIntStorage* storage = THIntStorage_newWithData(&matches[0], matches.size());
THIntTensor* tensorMatches = THIntTensor_newWithStorage1d(storage, 0, matches.size(), 1);
// Push result to Lua stack
luaT_pushudata(L, (void*)tensorMatches, "torch.IntTensor");
return 1;
}
When I call this from Lua, I should get a [torch.IntTensor of size 10] and I do. However, the data appears to be either memory addresses or junk:
29677072
0
16712197
3
0
0
29677328
0
4387616
0
[torch.IntTensor of size 10]
It should have been the numbers [0,9].
Where am I going wrong?
For the record, when I test it in C++
for (int i = 0; i < storage->size; i++)
std::cout << *(storage->data+i) << std::endl;
prints the proper values.
As does
for (int i = 0; i < tensorMatches->storage->size; i++)
std::cout << *(tensorMatches->storage->data+i) << std::endl;
so it seems clear to me that the problem lies in the exchange between C++ and Lua.
So I got an answer elsewhere--the Google group for Torch7--but I'll copy and paste it here for anyone who may need it.
From user @alban desmaison:
Your problem is actually memory management.
When your C++ function returns, your vector<int> is destroyed, and its contents are freed with it.
From that point onward, the tensor points to freed memory, so every access to it reads freed memory.
You will have to either:
Allocate memory on the heap with malloc (as an array of ints) and use THIntStorage_newWithData as you currently do (the pointer that you give to newWithData will be freed when it is no longer used by Torch).
Use a vector<int> the way you currently do, but create a new Tensor of the right size with THIntTensor_newWithSize1d(matches.size()) and then copy the contents of the vector into the tensor.
For the record, I couldn't get it to work with malloc but the copying memory approach worked just fine.
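For reference, the first option (heap allocation) can be sketched without the Torch headers. The helper name copy_to_heap is hypothetical; it only shows how to obtain a malloc'ed copy that stays valid after the vector is destroyed.

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: copies the vector's contents into a malloc'ed block.
// The block outlives the vector; whoever receives the pointer owns it and
// must eventually release it with free().
int* copy_to_heap(const std::vector<int>& values) {
    if (values.empty()) {
        return nullptr;
    }
    const std::size_t bytes = values.size() * sizeof(int);
    int* data = static_cast<int*>(std::malloc(bytes));
    if (data != nullptr) {
        std::memcpy(data, values.data(), bytes);
    }
    return data;
}
```

The returned pointer would then take the place of &matches[0] in the call to THIntStorage_newWithData. According to the quoted answer, Torch frees that pointer itself once the storage is no longer used, so the caller should not free it in that case.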

first value of a loop in c++ different for the others

I need the first value of a loop to be 0, and then iterate over a range.
In MatLab this is possible: x = [0 -range:range] (range is an integer)
This will give a value of [0, -range, -range+1, -range+2, .... , range-1, range]
The problem is that I need to do this in C++. I tried building an array and using its values as the loop variable, without success.
//After loading 2 images, put it into matrix values and then trying to compare each one.
for r=1:bRows
for c=1:bCols
rb=r*blockSize;
cb=c*blockSize;
%%for each block search in the near position(1.5 block size)
search=blockSize*1.5;
for dr= [0 -search:search] //Here's the problem.
for dc= [0 -search:search]
%%check if it is inside the image
if(rb+dr-blockSize+1>0 && rb+dr<=rows && cb+dc-blockSize+1>0 && cb+dc<=cols)
%compute the error and check if it is lower then the previous or not
block=I1(rb+dr-blockSize+1:rb+dr,cb+dc-blockSize+1:cb+dc,1);
TE=sum( sum( abs( block - cell2mat(B2(r,c)) ) ) );
if(TE<E)
M(r,c,:)=[dr dc]; %store the motion vector
Err(r,c,:)=TE; %store th error
E=TE;
end
end
end
end
%reset the error for the next search
E=255*blockSize^2;
end
end
C++ doesn't natively support ranges of the kind you know from MatLab, although external solutions are available, if somewhat of an overkill for your use case. However, C++ allows you to implement them easily (and efficiently) using the primitives provided by the language, such as for loops and resizable arrays. For example:
// Return a vector consisting of
// {0, -limit, -limit+1, ..., limit-1, limit}.
std::vector<int> build_range0(int limit)
{
std::vector<int> ret{0};
for (auto i = -limit; i <= limit; i++)
ret.push_back(i);
return ret;
}
The resulting vector can be easily used for iteration:
for (int dr: build_range0(search)) {
for (int dc: build_range0(search)) {
if (rb + dr - blockSize + 1 > 0 && ...)
...
}
}
The above of course wastes some space to create a temporary vector, only to throw it away (which I suspect happens in your MatLab example as well). If you want to just iterate over the values, you will need to incorporate the loop such as the one in build_range0 directly in your function. This has the potential to reduce readability and introduce repetition. To keep the code maintainable, you can abstract the loop into a generic function that accepts a callback with the loop body:
// Call fn(0), fn(-limit), fn(-limit+1), ..., fn(limit-1), and fn(limit)
template<typename F>
void for_range0(int limit, F fn) {
fn(0);
for (auto i = -limit; i <= limit; i++)
fn(i);
}
The above function can be used to implement iteration by providing the loop body as an anonymous function:
for_range0(search, [&](int dr) {
for_range0(search, [&](int dc) {
if (rb + dr - blockSize + 1 > 0 && ...)
...
});
});
(Note that both anonymous functions capture enclosing variables by reference in order to be able to mutate them.)
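One detail worth noting: like the MatLab expression, both versions above visit 0 twice, once up front and once inside the range. If that repeat is unwanted, a variant can skip it. The names for_range0_once and visit_order below are hypothetical and only illustrate the idea.

```cpp
#include <vector>

// Hypothetical variant of for_range0: calls fn(0) first, then fn(i) for every
// i in [-limit, limit] except 0, so zero is visited only once.
template <typename F>
void for_range0_once(int limit, F fn) {
    fn(0);
    for (int i = -limit; i <= limit; ++i) {
        if (i != 0) {
            fn(i);
        }
    }
}

// Records the order in which values are visited, to make the behavior visible.
std::vector<int> visit_order(int limit) {
    std::vector<int> order;
    for_range0_once(limit, [&](int v) { order.push_back(v); });
    return order;
}
```

For a limit of 2 the recorded order is 0, -2, -1, 1, 2. In the block-matching loop of the question the repeat looks harmless, since a second visit with the same error does not satisfy TE<E, so skipping it mainly saves work.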
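Reading your comment, you could do something like this:
// the flag is declared outside the loop: a for-init statement cannot declare two different types
bool zero = true;
for (int i = 0; i < 5; i++)
{
cout << "hi" << endl;
if (zero)
{
i = 3;
zero = false;
}
}
This starts with i equal to 0; after the first pass through the body, i is set to 3 and the loop keeps incrementing from there. Because i++ still runs at the end of that first pass, the second pass sees i equal to 4.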

Passing a struct to a thread, how access multiple data in struct?

I have a problem. I need to use a struct of OpenCV Mat images for passing multiple arguments to a thread.
I have a struct like this:
struct Args
{
Mat in[6];
Mat out[6];
};
And a void function called by thread, like this:
void grey (void *param){
while (TRUE)
{
WaitForSingleObject(mutex,INFINITE);
Args* arg = (Args*)param;
cvtColor(*arg->in,*arg->out,CV_BGR2GRAY);
ReleaseMutex(mutex);
_endthread();
}
}
To launch the grey function as a thread with the two Mat array arguments, I use the following lines in main:
Args dati;
*dati.in = *inn;
*dati.out = *ou;
handle1 = (HANDLE) _beginthread(grey,0,&dati);
Now, my problem is this: I need to access all 6 elements of the two arrays "in" and "out" in the struct passed to the thread, either from the thread itself or in some other way, so that the "grey" function processes every element from index 0 to 5.
How can I do this from the thread or from main? I mean using the grey function to process all 6 elements of the array Mat in[6] of the struct Args that I pass to the thread.
Can someone help me or give me an idea? I don't know how to do this.
Before you create the thread, you assign the array like this:
*dati.in = *inn;
*dati.out = *ou;
This will only assign the first entry in the array. The rest of the array will be untouched.
You need to copy all of the source array into the destination array. You can use std::copy for this (the source range comes first, the destination last):
std::copy(std::begin(inn), std::end(inn), std::begin(dati.in));
std::copy(std::begin(ou), std::end(ou), std::begin(dati.out));
Of course, that requires the sources inn and ou to be real arrays (not pointers) that contain no more items than the destination arrays.
Then in the thread simply loop over the items:
for (int i = 0; i < 6; i++)
{
cvtColor(arg->in[i], arg->out[i], CV_BGR2GRAY);
}
When you launch your thread, this code:
Args dati;
*dati.in = *inn;
*dati.out = *ou;
is only initialising one of the six elements. If inn and ou are actually 6 element arrays, you will need a loop to initialise all 6.
Args dati;
for (int i = 0; i < 6; i++) {
dati.in[i] = inn[i];
dati.out[i] = ou[i];
}
Similarly, in your thread, you're only processing the first element in the array. So this code:
Args* arg = (Args*)param;
cvtColor(*arg->in,*arg->out,CV_BGR2GRAY);
would need to become something like this:
Args* arg = (Args*)param;
for (int i = 0; i < 6; i++) {
cvtColor(arg->in[i],arg->out[i],CV_BGR2GRAY);
}
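Both answers come down to the same two steps: copy every element into the struct, then loop over every element in the thread function. The following sketch shows those steps without OpenCV or the Windows thread API. It makes several assumptions: int stands in for cv::Mat, doubling the value stands in for cvtColor, and the names kCount, fill_inputs and process_all are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

constexpr std::size_t kCount = 6;  // number of images in the question

// Stand-in for the question's struct: int replaces cv::Mat.
struct Args {
    int in[kCount];
    int out[kCount];
};

// Copies the whole source array into args.in (source range first, destination last).
void fill_inputs(Args& args, const int (&source)[kCount]) {
    std::copy(std::begin(source), std::end(source), std::begin(args.in));
}

// Same parameter shape as the thread entry point in the question.
// Doubling each value is a placeholder for the cvtColor call.
void process_all(void* param) {
    Args* arg = static_cast<Args*>(param);
    for (std::size_t i = 0; i < kCount; ++i) {
        arg->out[i] = arg->in[i] * 2;
    }
}
```

When process_all is passed to a thread-creation function, the Args object must stay alive until the thread has finished, because the thread only receives a pointer to it.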