ComputeLibrary CLTensor data transfer - C++

I am working with integrating ARM ComputeLibrary into a project.
It's not an API whose semantics I am familiar with, but I'm working my way through the docs and examples.
At the moment, I am trying to copy the contents of a std::vector to a CLTensor, and then use the ARMCL GEMM operation.
I've been building an MWE, shown below, with the aim of getting matrix multiplication working.
To get the input data from a standard C++ std::vector, or std::ifstream, I am trying an iterator based approach, based on this example shown in the docs.
However, I keep getting a segfault.
There is an example of sgemm using CLTensor in the source, which is also where I'm drawing inspiration from. However, it gets its input data from Numpy arrays, so it isn't relevant for this step.
I'm not sure whether CLTensor and Tensor have disjoint methods in ARMCL, but I believe they both implement the common interface ITensor. Still, I haven't been able to find an equivalent example that uses CLTensor instead of Tensor for this iterator-based method.
You can see the code I'm working with below, which fails at the *reinterpret_cast line. I'm not entirely sure what operations it performs, but my guess is that our ARMCL iterator input_it is incremented n * m times, with each iteration setting the value of the CLTensor at that address to the corresponding input value. The reinterpret_cast is just to make the types play nicely together?
I reckon my Iterator and Window objects are okay, but can't be sure.
#include <iostream>
#include <vector>

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/CL/CLFunctions.h"
#include "arm_compute/runtime/CL/CLScheduler.h"
#include "arm_compute/runtime/CL/CLTuner.h"
#include "utils/Utils.h"

namespace armcl = arm_compute;
namespace armcl_utils = arm_compute::utils;

int main(int argc, char *argv[])
{
    int n = 3;
    int m = 2;
    int p = 4;

    std::vector<float> src_a = {2, 1,
                                6, 4,
                                2, 3};
    std::vector<float> src_b = {5, 2, 1, 6,
                                3, 7, 4, 1};
    std::vector<float> c_targets = {13, 11, 6, 13,
                                    42, 40, 22, 40,
                                    19, 25, 14, 15};

    // Provides global access to a CL context and command queue.
    armcl::CLTuner tuner{};
    armcl::CLScheduler::get().default_init(&tuner);

    armcl::CLTensor a{}, b{}, c{};
    float alpha = 1;
    float beta = 0;

    // Initialize the tensors' dimensions and type:
    const armcl::TensorShape shape_a(m, n);
    const armcl::TensorShape shape_b(p, m);
    const armcl::TensorShape shape_c(p, n);
    a.allocator()->init(armcl::TensorInfo(shape_a, 1, armcl::DataType::F32));
    b.allocator()->init(armcl::TensorInfo(shape_b, 1, armcl::DataType::F32));
    c.allocator()->init(armcl::TensorInfo(shape_c, 1, armcl::DataType::F32));

    // Configure sgemm
    armcl::CLGEMM sgemm{};
    sgemm.configure(&a, &b, nullptr, &c, alpha, beta);

    // Allocate the input / output tensors:
    a.allocator()->allocate();
    b.allocator()->allocate();
    c.allocator()->allocate();

    // Fill the input tensor:
    // Simplest way: create an iterator to iterate through each element of the input tensor:
    armcl::Window input_window;
    armcl::Iterator input_it(&a, input_window);
    input_window.use_tensor_dimensions(shape_a);

    std::cout << " Dimensions of the input's iterator:\n";
    std::cout << " X = [start=" << input_window.x().start() << ", end=" << input_window.x().end() << ", step=" << input_window.x().step() << "]\n";
    std::cout << " Y = [start=" << input_window.y().start() << ", end=" << input_window.y().end() << ", step=" << input_window.y().step() << "]\n";

    // Iterate through the elements of src_data and copy them one by one to the input tensor:
    execute_window_loop(input_window, [&](const armcl::Coordinates & id)
    {
        std::cout << "Setting item [" << id.x() << "," << id.y() << "]\n";
        *reinterpret_cast<float *>(input_it.ptr()) = src_a[id.y() * m + id.x()];
    },
    input_it);

    // armcl_utils::init_sgemm_output(dst, src0, src1, armcl::DataType::F32);
    // Configure function
    // Allocate all the images
    // src0.allocator()->import_memory(armcl::Memory(&a));
    // src0.allocator()->allocate();
    // src1.allocator()->allocate();
    // dst.allocator()->allocate();
    // armcl_utils::fill_random_tensor(src0, -1.f, 1.f);
    // armcl_utils::fill_random_tensor(src1, -1.f, 1.f);

    // Dummy run for CLTuner
    // sgemm.run();

    std::vector<float> lin_c(n * p);

    return 0;
}

The part you've missed (which, admittedly, could be better explained in the documentation!) is that you need to map / unmap OpenCL buffers in order to make them accessible to the CPU.
If you look inside fill_random_tensor (which is what's used in the cl_sgemm example), you'll find a call to tensor.map();.
So if you map() your buffer before creating your iterator, then I believe it should work:
a.map();
armcl::Iterator input_it(&a, input_window);
execute_window_loop(...)
{
}
a.unmap(); // Don't forget to unmap the buffer before using it on the GPU
Hope this helps

Related

Can we create an iterator for a `std::vector` which increments over three values simultaneously?

Here is a simple example of what I'm trying to achieve: Assume we are given a std::vector<std::byte> rgb_data containing RGB color values and a struct rgb{ double r, g, b; };. I would like to create a std::vector<rgb> transformed_rgb_data containing these RGB color values using std::transform. Without std::transform, this could be done as follows:
std::size_t j = 0;
for (std::size_t i = 0; i < rgb_data.size(); i += 3)
{
    transformed_rgb_data[j++] = rgb{
        static_cast<double>(rgb_data[i]),
        static_cast<double>(rgb_data[i + 1]),
        static_cast<double>(rgb_data[i + 2])
    };
}
Is there a mechanism in the standard library which allows me to construct an iterator for rgb_data which increments by 3 and references a std::tuple (I think that would be the best idea) which then is passed to the unary function passed to std::transform?
The C++23 way would be std::views::chunk combined with std::views::transform, which does exactly this (note that std::views::adjacent gives overlapping windows rather than non-overlapping groups of three), but not many compilers currently support it.
Create your own custom "RGB view" iterator that has an internal uint8_t* p pointer.
When dereferenced, use the 3 adjacent bytes to construct the RGB struct and return it (something like RGB{p[0], p[1], p[2]}).
When incremented, increment the pointer by 3.
Make sure to specialize std::iterator_traits for your iterator (or provide the member typedefs the default traits expect).
Copy that into your RGB vector
std::vector<uint8_t> byteArray = ...;
std::vector<RGB> rgbArray(rgbaIter{byteArray.data()},
rgbaIter{byteArray.data() + byteArray.size()});
// Can also use vector::insert, std::copy, std::copy_n, etc.
With range-v3, you might use the chunk and transform views (equivalents are in std in C++23):
std::vector<int> v{0, 1, 2, 3, 4, 5};
for (auto c : v | ranges::views::chunk(3) | ranges::views::transform([](auto r){
return color{r[0] / 255., r[1] / 255., r[2] / 255.};
}))
{
std::cout << c.r << " " << c.g << " " << c.b << std::endl;
}

About using vertices as indices in graphs in C++: why are we wasting space?

I have a question about the vertices of graphs in C++. Suppose I want a graph with vertices 100, 200, 300, 400, connected in some manner (the details aren't important). If we are creating an adjacency-list graph, what we do is:
adj[u].push_back(v);
adj[v].push_back(u);
Say 400 is connected with 200: we are doing adj[400], creating a large vector of vectors, when all we needed was one of size 4, since there are only four vertices, yet here we go up to index 400. Can someone explain this? Is it that in graphs all vertices must be consecutive and start from some small number? The code works fine when you have vertices like 1, 2, 3, 4, 5. We are using vertices as indices, and depending on our vertices they can vary a lot from what we actually need.
An adjacency list stores a list of the connected vertices for each vertex in the graph. For example, given this graph:
1---2
|\ |
| \ |
| \|
3---4
You would store:
1: 2, 3, 4
2: 1, 4
3: 1, 4
4: 1, 2, 3
This can be done with a std::vector<std::vector<int>>. Note that you do not need to use the values of the graph as the indexes into these vectors. If the values of the graph were instead 100, 200, 300, 400 you could use a separate map container to convert from vertex value to an index into the adjacency list (std::unordered_map<ValueType, IndexType>). You could also store a Vertex structure such as this:
struct Vertex {
int index; // 0, 1, 2, 3, 4, 5, etc.
int value; // 100, 200, or whatever value you want
};
Not sure what the problem is exactly, but I guess it is about speed. The simplest and easiest fix is to use a "memory layout" like in a pixel buffer: an index is an implicit value defined by position, since each segment is:
-------------------------------------------------------------------...
| float, float, float, float | float, float, float, float | float,
-------------------------------------------------------------------...
| index 0 | index 1 | index 2
-------------------------------------------------------------------...
As you didn't give sample code, the example assumes a lot of things, but it basically implements the layout idea. Using plain arrays is not required (just my preference); a vector would give almost no performance penalty, the biggest one being resizing. Some of the lines are not intuitive, like why an offset computation into a flat array is faster than an array inside an array: it just is, because memory is slower than the CPU.
Small note: because all the "small arrays" are really one big array, you need to worry about overflows and underflows, or add a check. If some vertex groups are smaller than the chunk size, just waste the space; the time to compact and uncompact the data is in most cases worse than having the padding.
#include <iostream>
#include <chrono>
#include <cstdint>

template <typename VAL>
struct Ver_Map {
    VAL * base_ptr;
    uint32_t map_size;
    uint32_t vertex_len;

    void alloc_map(uint32_t elem, uint32_t ver_len, VAL in){
        // Note: only the first element is initialized to `in`;
        // the rest are value-initialized (zero for arithmetic types).
        base_ptr = new VAL[elem * ver_len] { in };
        vertex_len = ver_len;
        map_size = elem;
    }
    void free_map(){
        delete[] base_ptr; // array delete to match new[]
    }
    VAL * operator()(uint32_t object){
        return &base_ptr[object * vertex_len];
    }
    VAL & operator()(uint32_t object, uint32_t vertex){
        return base_ptr[(object * vertex_len) + vertex];
    }
};

int main (void) {
    const uint32_t map_len = 10000;
    Ver_Map<float> ver_map;
    ver_map.alloc_map(map_len, 4, 0.0f);

    // Use case
    ver_map(0, 2) = 0.5f;
    std::cout << ver_map(0)[1] << std::endl;
    std::cout << ver_map(0)[2] << std::endl;
    std::cout << ver_map(0, 2) << std::endl;

    // Size in memory (map_len chunks of vertex_len floats, plus the struct itself)
    std::cout << "Size of struct -> "
              << (map_len * ver_map.vertex_len * sizeof(float)) + sizeof(Ver_Map<float>)
              << " bytes" << std::endl;

    // Time to fully clear
    auto start = std::chrono::steady_clock::now();
    for(uint32_t x = 0; x < map_len; x++){
        for(uint32_t y = 0; y < ver_map.vertex_len; y++){
            ver_map(x, y) = 1.0f;
        }
    }
    std::cout << "Full write time -> "
              << std::chrono::duration_cast<std::chrono::microseconds>
                 (std::chrono::steady_clock::now() - start).count()
              << " microseconds" << std::endl;

    ver_map.free_map();
    return 0;
}

Using unordered_map to store key-value pairs in STL

I have to store some data in my program as described below.
The data consists of high-dimensional coordinates and the number of points at each coordinate. The following is a simple example (with coordinate dimension 5):
coordinate # of points
(3, 5, 3, 5, 7) 6
(6, 8, 5, 8, 9) 4
(4, 8, 6, 7, 9) 3
Please note that even if I use 5 dimensions as an example, the actual problem is of 20 dimensions. The coordinates are always integers.
I want to store this information in some kind of data structure. The first thing that comes to mind is a hash table, so I tried unordered_map from the STL, but I cannot figure out how to use the coordinates as the key in an unordered_map. Defining it as:
unordered_map<int[5], int> umap;
or,
unordered_map<int[], int> umap;
gives me a compilation error. What am I doing wrong?
unordered_map needs to know how to hash your coordinates. In addition, it needs a way to compare coordinates for equality.
You can wrap your coordinates in a class or struct and provide a custom operator == to compare coordinate points. Then you need to specialise std::hash to be able to use your Point struct as a key in unordered_map. While comparing coordinates for equality is fairly straightforward, it is up to you to decide how coordinates are hashed. The following is an overview of what you need to implement:
#include <vector>
#include <unordered_map>

class Point
{
    std::vector<int> coordinates;

public:
    // Note: unordered_map compares keys with each other, so the
    // comparison must be Point == Point, not Point == std::vector<int>.
    inline bool operator == (const Point& _other) const
    {
        if (coordinates.size() != _other.coordinates.size())
        {
            return false;
        }
        for (std::size_t c = 0; c < coordinates.size(); ++c)
        {
            if (coordinates[c] != _other.coordinates[c])
            {
                return false;
            }
        }
        return true;
    }
};
namespace std
{
    template<>
    struct hash<Point>
    {
        std::size_t operator() (const Point& _point) const noexcept
        {
            std::size_t hash = 0; // combine each coordinate's hash into this
            // See https://www.boost.org/doc/libs/1_67_0/doc/html/hash/reference.html#boost.hash_combine
            // for an example of a hash implementation for std::vector.
            // Using Boost just for this might be overkill - you could use just the hash_combine code here.
            return hash;
        }
    };
}

int main()
{
    std::unordered_map<Point, int> points;
    // Use points...
    return 0;
}
In case you know how many coordinates you are going to have and you can name them like this
struct Point
{
    int x1;
    int x2;
    int x3;
    // ...
};
you could use a header-only hashing library I wrote exactly for this purpose. Your mileage may vary.
Hacky way
I've seen this used in programming competitions for ease of use. You can convert the point to a string (concatenate each coordinate, separated by a space or any other special character) and then use unordered_map<string, int>:
unordered_map<string, int> map;
int p[5] = {3, 5, 3, 5, 7};
string point = to_string(p[0]) + " " + to_string(p[1]) + " " + to_string(p[2]) + " " + to_string(p[3]) + " " + to_string(p[4]);
map[point] = 6;

Store pointer to Eigen Vector 'segment' without copy?

I have an Eigen vector, and I would like to refer to a segment of it at a later time (e.g. pass it between functions) instead of modifying it immediately.
Eigen::Matrix<float, Eigen::Dynamic, 1> vec(10);
// initialize
vec << 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
I would like to create a pointer to a segment that I can refer to later. The following works but it creates a copy so any changes made to the segment are not reflected in the original vector.
const int start = 2;
const int end = 8;
Eigen::Matrix<float, Eigen::Dynamic, 1> *block = new Eigen::Matrix<float, Eigen::Dynamic, 1>(end - start + 1, 1);
*block = vec.segment(start-1,end-1);
How can I keep a reference to the segment without copying?
You can use an Eigen::Map to wrap an existing segment of memory without copying. I'm not sure why you're allocating the *block object and not just using block. Using a Map it would look like
Eigen::Map<Eigen::VectorXf> block(&vec(start - 1), end - start + 1);
You then use the Map as you would a normal VectorXf, sans resizing and stuff. Simpler yet (at least according to @ggael), you can use an Eigen::Ref to refer to part of an Eigen object without inducing a copy. For example:
#include <iostream>
#include <Eigen/Dense>

void times2(Eigen::Ref<Eigen::VectorXf> rf)
{
    rf *= 2.f;
}

int main()
{
    Eigen::Matrix<float, Eigen::Dynamic, 1> vec(10);
    // initialize
    vec << 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
    const int start = 2;
    const int end = 8;
    // This would work as well:
    // Eigen::Map<Eigen::VectorXf> block(&vec(start - 1), end - start + 1);
    Eigen::Ref<Eigen::VectorXf> block = vec.segment(start, end - start + 1);
    std::cout << block << "\n\n";
    times2(block);
    std::cout << vec << "\n";
    return 0;
}
P.S. I think you're misusing the segment function. It takes a beginning position and the number of elements, i.e. (start, end - start + 1).

C++ Eigen: How to concatenate matrices dynamically (pointer issue?)

I have the following problem:
I have several partial (Eigen) MatrixXds that I want to concatenate into another, larger MatrixXd variable that I only have as a pointer. However, both the size of the smaller matrices and their number are dynamic, so I cannot easily use the << operator.
So I'm trying the following (the smaller matrices are stored in list_subdiagrams, obviously, and basis->cols() defines the number of matrices), using Eigen's MatrixXd block functionality:
// sd[] contains the smaller matrices to be concatenated; all are of the same size
// col defines the total number of smaller matrices
MatrixXd* ret = new MatrixXd(sd[0]->rows(), col*sd[0]->cols());
for (int i = 0; i < col; ++i) {
    ret->block(0, i*sd[0]->cols(), sd[0]->rows(), sd[0]->cols()) = *(sd[i]);
}
This, unfortunately, appears to somehow overwrite some part of the *ret variable - for before the assignment via the block, the size is (in my test-case) correctly shown as being 2x1. After the assignment it becomes 140736006011136x140736006011376 ...
Thank you for your help!
What do you mean you don't know the size? You can use the member functions cols()/rows() to get the size. Also, I assume that by concatenation you mean a direct sum? In that case, you can do something like:
#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd *A = new Eigen::MatrixXd(2, 2);
    Eigen::MatrixXd *B = new Eigen::MatrixXd(3, 3);
    *A << 1, 2, 3, 4;
    *B << 5, 6, 7, 8, 9, 10, 11, 12, 13;
    Eigen::MatrixXd *result = new Eigen::MatrixXd(A->rows() + B->rows(), A->cols() + B->cols());
    result->setZero(); // Zero() only returns an expression; setZero() modifies in place
    result->block(0, 0, A->rows(), A->cols()) = *A;
    result->block(A->rows(), A->cols(), B->rows(), B->cols()) = *B;
    std::cout << *result << std::endl;
    delete A;
    delete B;
    delete result;
}
So first make sure it works for 2 matrices, test it, then extend it to N.