How to use collate_fn in LibTorch - c++

I'm trying to implement an image-based regression using a CNN in libtorch. The problem is that my images have different sizes, which causes an exception when batching them.
First things first, I create my dataset:
auto set = MyDataSet(pathToData).map(torch::data::transforms::Stack<>());
Then I create the dataLoader:
auto dataLoader = torch::data::make_data_loader(
    std::move(set),
    torch::data::DataLoaderOptions().batch_size(batchSize).workers(numWorkersDataLoader)
);
The exception is thrown when batching data in the train loop:
for (torch::data::Example<> &batch: *dataLoader) {
    processBatch(model, optimizer, counter, batch);
}
with a batch size greater than 1 (with a batch size of 1 everything works well because there isn't any stacking involved). For example, I get the following error with a batch size of 2:
...
what(): stack expects each tensor to be equal size, but got [3, 1264, 532] at entry 0 and [3, 299, 294] at entry 1
I read that one could, for example, use collate_fn to implement some padding (for example here), but I do not see where to implement it. torch::data::DataLoaderOptions, for example, does not offer such an option.
Does anyone know how to do this?

I've got a solution now. In summary, I split my CNN into Conv and Dense layers and use the output of a torch::nn::AdaptiveMaxPool2d in the batch construction.
In order to do so, I had to modify my Dataset, Net and train/val/test methods. In my Net I added two additional forward functions. The first passes data through all Conv layers and returns the output of an AdaptiveMaxPool2d layer. The second passes the data through all Dense layers. In practice this looks like:
torch::Tensor forwardConLayer(torch::Tensor x) {
    x = torch::relu(conv1(x));
    x = torch::relu(conv2(x));
    x = torch::relu(conv3(x));
    x = torch::relu(ada1(x));
    x = torch::flatten(x);
    return x;
}

torch::Tensor forwardDenseLayer(torch::Tensor x) {
    x = torch::relu(lin1(x));
    x = lin2(x);
    return x;
}
Then I override the get_batch method and use forwardConLayer to compute every batch entry. In order to train (correctly), I call zero_grad() before I construct a batch. All in all this looks like:
std::vector<ExampleType> get_batch(at::ArrayRef<size_t> indices) override {
    // impl adapted from the default get_batch in base.h
    this->net.zero_grad();
    std::vector<ExampleType> batch;
    batch.reserve(indices.size());
    for (const auto i : indices) {
        ExampleType batchEntry = get(i);
        auto batchEntryData = (batchEntry.data).unsqueeze(0);
        auto newBatchEntryData = this->net.forwardConLayer(batchEntryData);
        batchEntry.data = newBatchEntryData;
        batch.push_back(batchEntry);
    }
    return batch;
}
Lastly I call forwardDenseLayer at all places where I normally would call forward, e.g.:
for (torch::data::Example<> &batch: *dataLoader) {
    auto data = batch.data;
    auto target = batch.target.squeeze();
    auto output = model.forwardDenseLayer(data);
    auto loss = torch::mse_loss(output, target);
    LOG(INFO) << "Batch loss: " << loss.item<double>();
    loss.backward();
    optimizer.step();
}
Update
This solution seems to cause an error if the number of the dataloader's workers isn't 0. The error is:
terminate called after throwing an instance of 'std::runtime_error'
what(): one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [3, 12, 3, 3]] is at version 2; expected version 1 instead. ...
This error makes sense because the data passes through the CNN's head during the batching process. The workaround for this "problem" is to set the number of workers to 0.
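For completeness, an alternative that keeps the stock Stack<> transform and a non-zero worker count would be to pad every image to a common size inside the dataset's get(), so the tensors become stackable before they ever reach the data loader. This is only a sketch under the assumption that zero-padding is acceptable for the regression task; loadImage, loadTarget and the fixed maximum size are illustrative placeholders, not taken from the code above:
torch::data::Example<> MyDataSet::get(size_t index) {
    torch::Tensor image = loadImage(index);   // assumed helper, returns [3, H, W]
    torch::Tensor target = loadTarget(index); // assumed helper

    // Assumed upper bound on the image sizes in the data set.
    const int64_t maxHeight = 1264;
    const int64_t maxWidth = 532;
    const int64_t padBottom = maxHeight - image.size(1);
    const int64_t padRight = maxWidth - image.size(2);

    // constant_pad_nd pads the last dimensions first: {left, right, top, bottom}
    image = torch::constant_pad_nd(image, {0, padRight, 0, padBottom}, 0);

    return {image, target};
}
With all examples sharing one shape, batching no longer runs any part of the network, so the in-place-modification error with multiple workers should not appear.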

Related

TFLite c++ determine the classification on output

I'm trying to get an output from a trained model which has a classification; the input node count is 1 and the output node count is 2. However, I'm not quite sure where the classification lands and how exactly to handle it.
for(size_t idx = 0; idx < input_node_count; idx++)
{
    float* data_ptr = interpreter->typed_input_tensor<float>(idx);
    memcpy(data_ptr, my_input.data(), input_elem_size[idx]);
}

if (kTfLiteOk != interpreter->Invoke())
{
    return false;
}

for(size_t idx = 0; idx < output_node_count; idx++)
{
    float* output = interpreter->typed_output_tensor<float>(idx);
    output_buffer[idx] = std::vector<float>(output,
                                            output + output_elem_size[idx]);
}

result = output_buffer[1];
classification_result = output_buffer[0]; // Best way to approach this
As of now, I can just print out the sizes and see that result has 196,608 elements and classification_result has 2, as it should. My problem is that I hard-coded these to be indices 1 and 0, but this might not always be the case in my program, which runs all sorts of models. So sometimes the classification might be at index 1, which causes the above code to fall apart.
I've tried checking the sizes of the buffers; however, that is also not guaranteed, since the classification size and the result size are different for each input. Is there a way for me to know for certain which index is which? Am I approaching this the right way?
Use TensorFlow Lite signatures for this. Signature defs let you access inputs/outputs using the names defined in the original model.
See conversion and inference example here
Python example
# Load the TFLite model in TFLite Interpreter
interpreter = tf.lite.Interpreter(TFLITE_FILE_PATH)
# There is only 1 signature defined in the model,
# so it will return it by default.
# If there are multiple signatures then we can pass the name.
my_signature = interpreter.get_signature_runner()
# my_signature is callable with input as arguments.
output = my_signature(x=tf.constant([1.0], shape=(1,10), dtype=tf.float32))
# 'output' is dictionary with all outputs from the inference.
# In this case we have single output 'result'.
print(output['result'])
For C++
// To run
auto my_signature = interpreter_->GetSignatureRunner("my_signature");
// Set your inputs and allocate tensors
auto* input_tensor_a = my_signature->input_tensor("input_a");
...
// Execute
my_signature->Invoke();
// Output
auto* output_tensor_x = my_signature->output_tensor("output_x");
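To make the C++ side a bit more concrete, here is a hedged sketch of the same idea with the input filled and the outputs read back by name; the signature key and tensor names ("serving_default", "input_a", "result", "classification") are assumptions, so substitute the names that were defined when the model was exported:
// Assumes <cstring> for std::memcpy and that my_input holds input->bytes worth of floats.
auto* runner = interpreter->GetSignatureRunner("serving_default");
if (runner == nullptr || runner->AllocateTensors() != kTfLiteOk)
    return false;

TfLiteTensor* input = runner->input_tensor("input_a");
std::memcpy(input->data.f, my_input.data(), input->bytes);

if (runner->Invoke() != kTfLiteOk)
    return false;

const TfLiteTensor* result = runner->output_tensor("result");
const TfLiteTensor* classification = runner->output_tensor("classification");
// Each output is now identified by name instead of by a hard-coded index.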

Creating BoolTensor Mask in torch C++

I am trying to create a mask for torch in C++ of type BoolTensor. The first n elements in dimension one need to be False and the rest need to be True.
This is my attempt but I do not know if this is correct (size is the number of elements):
src_mask = torch::BoolTensor({6, 1});
src_mask[:size,:] = 0;
src_mask[size:,:] = 1;
I'm not sure I understand exactly your goal here, so here is my best attempt to convert your pseudo-code into C++.
First, with libtorch you declare the type of your tensor through the torch::TensorOptions struct (type names are prefixed with a lowercase k).
Second, your Python-like slicing is possible thanks to the torch::Tensor::slice function (see here and there).
Finally, that gives you something like:
// Creates a tensor of boolean, initially all ones
auto options = torch::TensorOptions().dtype(torch::kBool);
torch::Tensor bool_tensor = torch::ones({6,1}, options);
// Set the slice to 0
int size = 3;
bool_tensor.slice(/*dim=*/0, /*start=*/0, /*end=*/size) = 0;
std::cout << bool_tensor << std::endl;
Please note that this will set the first size rows to 0. I assumed that's what you meant by "first elements in dimension x".
Another way to do it:
using namespace torch::indexing; //for using Slice(...) function
at::Tensor src_mask = at::empty({ 6, 1 }, at::kBool); //empty bool tensor
src_mask.index_put_({ Slice(None, size), Slice() }, 0); //src_mask[:size,:] = 0
src_mask.index_put_({ Slice(size, None), Slice() }, 1); //src_mask[size:,:] = 1

When creating threads using lambda expressions, how to give each thread its own copy of the lambda expression?

I have been working on a program that basically uses brute force to work backward to find a sequence of operations, from a given set, that reaches a given number. So, for example, if I pass in the set of operations +5, -7, *10, /3, a given number, say 100 (this example probably won't come up with a solution), and a given maximum number of moves to solve it (let's say 8), it will attempt to come up with a use of these operations that gets to 100. This part works using a single thread, which I have tested in an application.
However, I wanted it to be faster, so I came to multithreading. I worked a long time to even get the lambda function to work, and after some serious debugging realized that the solution "combo" is technically found. However, before it is tested, it is changed. I wasn't sure how this was possible, considering that I had thought each thread was given its own copy of the lambda function and its variables to use.
In summary, the program starts off by parsing the input, then passes the pieces divided up by the parser as parameters into an array of operation objects (somewhat like functors). It then uses an algorithm that generates combinations, which are then executed by the operation objects. The algorithm, put simply, takes in the number of operations, assigns each one a char value (each char value corresponds to an operation), and then outputs a char value. It generates all possible combinations.
That is a summary of how my program works. Everything seems to be working fine and in order other than two things. There is another error which I have not added to the title because there is a way to fix it, but I am curious about alternatives; that fix is also probably not good for my computer.
So, going back to the problem with the lambda expression passed to the thread: here is what I saw using breakpoints in the debugger. It appeared that the two threads were not generating individual combos; they properly alternated on the first number but otherwise alternated combos. So it would go 1111, 2211 rather than generating 1111, 2111 (these are generated as described in the previous paragraph, a char at a time, combined using a stringstream). Once they got out of the loop that filled the combo up, combos would get lost. Execution would randomly switch between the two threads and never test the correct combo, because combinations seemed to get scrambled randomly. I realized this must have something to do with race conditions and mutual exclusion. I had thought I had avoided all of that by not changing any variables from outside the lambda expression, but it appears that both threads are using the same lambda expression.
I want to know why this occurs, and how I could, say, create an array of these expressions and assign each thread its own, or do something similar that avoids having to deal with mutual exclusion altogether.
Now, the other problem happens when, at the end, I delete my array of operation objects. The code that creates them and the code that deletes them are shown below.
operation *operations[get<0>(functions)];
for (int i = 0; i < get<0>(functions); i++)
{
    // creates a new object for each operation in the array and sets it to the corresponding parameter
    operations[i] = new operation(parameterStrings[i]);
}
delete[] operations;
get<0>(functions) is where the number of functions is stored in a tuple and is the number of objects to be stored in the array. parameterStrings is a vector in which the strings used as parameters for the class constructor are stored. This code results in an "Exception trace/breakpoint trap." If I use "*operations" instead, I get a segmentation fault in the file where the class is defined, on the first line where it says "class operation." The alternative is just to comment out the delete part, but I am pretty sure that would be a bad idea, considering that the objects are created using the "new" operator and might leak memory.
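As an aside on the crash itself: operations is a stack array of pointers whose elements were allocated with new, so the matching cleanup would be to delete each element rather than to apply delete[] to the array. A sketch based only on the snippet above:
for (int i = 0; i < get<0>(functions); i++)
{
    delete operations[i]; // matches the per-element new operation(...)
}
// no delete[] operations; the array itself lives on the stack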
Below is the code for the lambda expression and the corresponding code for the creation of the threads. I re-added the code inside the lambda expression so that it can be examined for possible causes of race conditions.
auto threadLambda = [&](int thread, char *letters, operation **operations, int beginNumber) {
    int i, entry[len];
    bool successfulComboFound = false;
    stringstream output;
    int outputNum;
    for (i = 0; i < len; i++)
    {
        entry[i] = 0;
    }
    do
    {
        for (i = 0; i < len; i++)
        {
            if (i == 0)
            {
                output << beginNumber;
            }
            char numSelect = *letters + (entry[i]);
            output << numSelect;
        }
        outputNum = stoll(output.str());
        if (outputNum == 23513511)
        {
            cout << "strange";
        }
        if (outputNum != 0)
        {
            tuple<int, bool> outputTuple;
            int previousValue = initValue;
            for (int g = 0; g <= (output.str()).length(); g++)
            {
                operation *copyOfOperation = (operations[((int)(output.str()[g])) - 49]);
                //cout << copyOfOperation->inputtedValue;
                outputTuple = (*operations)->doOperation(previousValue);
                previousValue = get<0>(outputTuple);
                if (get<1>(outputTuple) == false)
                {
                    break;
                }
                debugCheck[thread - 1] = debugCheck[thread - 1] + 1;
                if (previousValue == goalValue)
                {
                    movesToSolve = g + 1;
                    winCombo = outputNum;
                    successfulComboFound = true;
                    break;
                }
            }
            //cout << output.str() << ' ';
        }
        if (successfulComboFound == true)
        {
            break;
        }
        output.str("0");
        for (i = 0; i < len && ++entry[i] == nbletters; i++)
            entry[i] = 0;
    } while (i < len);
    if (successfulComboFound == true)
    {
        comboFoundGlobal = true;
        finishedThreads.push_back(true);
    }
    else
    {
        finishedThreads.push_back(true);
    }
};
The threads are created here:
thread *threadArray[numberOfThreads];
for (int f = 0; f < numberOfThreads; f++)
{
    threadArray[f] = new thread(threadLambda, f + 1, lettersPointer, operationsPointer, ((int)(workingBeginOperations[f])) - 48);
}
If any more of the code is needed to help solve the problem, please let me know and I will edit the post to add the code. Thanks in advance for all of your help.
Your lambda object captures variables by reference ([&]), so each copy of the lambda used by a thread references the same shared objects, and so the various threads race and clobber each other.
This is assuming things like movesToSolve and winCombo come from captures (it is not clear from the code, but it seems like it). winCombo is updated when a successful result is found, but another thread might immediately overwrite it right after.
So every thread is using the same data, data races abound.
You want to ensure that your lambda works on only three types of data:
Private data
Shared, constant data
Properly synchronized mutable shared data
Generally you want to have almost everything in category 1 and 2, with as little as possible in category 3.
Category 1 is the easiest, since you can use e.g., local variables within the lambda function, or captured-by-value variables if you ensure a different lambda instance is passed to each thread.
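For instance, a rough sketch of category 1, where every thread receives its own copies up front; workUnits and processWorkUnit are illustrative placeholders rather than names from the code above:
std::vector<std::string> workUnits(numberOfThreads); // per-thread input, illustrative
std::vector<std::thread> threads;
for (int f = 0; f < numberOfThreads; ++f)
{
    // The init-capture copies the data, so each lambda instance owns private state.
    threads.emplace_back([f, workUnit = workUnits[f]]() mutable {
        processWorkUnit(f, workUnit); // touches only thread-private data
    });
}
for (auto &t : threads)
    t.join();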
For category 2, you can use const to ensure the relevant data isn't modified.
Finally you may need some shared global state, e.g., to indicate that a value is found. One option would be something like a single std::atomic<Result *> where when any thread finds a result, they create a new Result object and atomically compare-and-swap it into the globally visible result pointer. Other threads check this pointer for null in their run loop to see if they should bail out early (I assume that's what you want: for all threads to finish if any thread finds a result).
A more idiomatic way would be to use std::promise.
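A rough sketch of the compare-and-swap variant described above; Result is an illustrative type, not something defined in the original program:
struct Result { long long winCombo; int movesToSolve; };

std::atomic<Result *> foundResult{nullptr}; // the only shared, synchronized state

// Inside each thread's search loop:
if (foundResult.load(std::memory_order_acquire) != nullptr)
    return; // another thread already found a solution, bail out early

// When this thread finds a solution:
Result *candidate = new Result{outputNum, movesToSolve};
Result *expected = nullptr;
if (!foundResult.compare_exchange_strong(expected, candidate))
    delete candidate; // another thread won the race, discard our copy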

How to create an array of cl::sycl::buffers?

I am using Xilinx's triSYCL GitHub implementation, https://github.com/triSYCL/triSYCL.
I am trying to create a design with 100 producers/consumers reading from/writing to 100 pipes.
What I am not sure of is how to create an array of cl::sycl::buffer and initialize it using std::iota.
Here is my code:
constexpr size_t T=6;
constexpr size_t n_threads=100;
cl::sycl::buffer<float, n_threads> a { T };
for (int i=0; i<n_threads; i++)
{
    auto ba = a[i].get_access<cl::sycl::access::mode::write>();
    // Initialize buffer a with increasing integer numbers starting at 0
    std::iota(ba.begin(), ba.end(), i*T);
}
And I am getting the following error:
error: no matching function for call to ‘cl::sycl::buffer<float, 2>::buffer(<brace-enclosed initializer list>)’
cl::sycl::buffer<float, n_threads> a { T };
I am new to C++ programming. So I am not able to figure out the exact way to do this.
There are 2 points I think cause the issue you are currently having:
The 2nd template argument in the buffer object definition should be the dimensionality of the buffer (count of dimensions, should be 1, 2 or 3), not the dimensions themselves.
The constructor for the buffer should contain either the actual dimensions of the buffer, or the data that you want the buffer to have and the dimensions. To pass the dimensions, you need to pass a cl::sycl::range object to the constructor
As I understand it, you are trying to initialize a buffer of dimensionality 1 with dimensions { 100, 1, 1 }. To do this, the definition of a should change to:
cl::sycl::buffer < float, 1 > a(cl::sycl::range< 1 >(n_threads));
Also, as the dimensionality can be deduced from the range template parameter, you can achieve the same effect with:
cl::sycl::buffer< float > a (cl::sycl::range< 1 >(n_threads));
As for initializing the buffer with std::iota, you have 3 options:
Use an array to initialize the data with the iota usage and pass them to the sycl buffer (case A),
Use the accessor to write to the buffer directly for host - CPU only (case B), or
Use an accessor with a parallel_for for execution on either host or an OpenCL device (case C).
Accessors should not be used as iterators (with .begin(), .end())
Case A:
std::vector<float> data(n_threads); // or std::array<float, n_threads> data;
std::iota(data.begin(), data.end(), 0); // this will create the data { 0, 1, 2, 3, ... }
cl::sycl::buffer<float> a(data.data(), cl::sycl::range<1>(n_threads));
// The data in a are already initialized, you can create an accessor to use them directly
Case B:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
{
    auto ba = a.get_access<cl::sycl::access::mode::write>();
    for (size_t i = 0; i < n_threads; i++) {
        ba[i] = i;
    }
}
Case C:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
cl::sycl::queue q{cl::sycl::default_selector()}; // create a command queue for host or device execution
q.submit([&](cl::sycl::handler& cgh) {
    // the device accessor is requested through the command group handler
    auto ba = a.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_name>(cl::sycl::range<1>(n_threads),
                                        [=](cl::sycl::id<1> i) {
        ba[i] = i.get(0);
    });
});
q.wait_and_throw(); // wait until kernel execution completes
Also check chapter 4.8 of the SYCL 1.2.1 spec https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf as it has an example for iota
Disclaimer: triSYCL is a research project for now. Please use ComputeCpp for anything serious. :-)
If you really need arrays of buffer, I guess you can use something similar to Is there a way I can create an array of cl::sycl::pipe?
As a variant, you can use a std::vector<cl::sycl::buffer<float>> or std::array<cl::sycl::buffer<float>, n_threads> and initialize with a loop from a cl::sycl::buffer<float> { T }.
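A sketch of that variant, assuming the T and n_threads constants from the question; this is untested and only meant to show the shape of the loop:
std::vector<cl::sycl::buffer<float>> buffers;
buffers.reserve(n_threads);
for (size_t i = 0; i < n_threads; ++i) {
    std::vector<float> host_data(T);
    std::iota(host_data.begin(), host_data.end(), static_cast<float>(i * T));
    // A const host pointer makes the buffer allocate and copy its own storage,
    // so host_data can safely go out of scope at the end of the iteration.
    buffers.emplace_back(static_cast<const float *>(host_data.data()),
                         cl::sycl::range<1>(T));
}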

openMP slows down when passing from 2 to 4 threads doing binary searches in a custom container

I'm currently having a problem parallelizing a program in C++ using OpenMP. I am implementing a recommendation system with a user-based collaborative filtering method. To do that, I implemented a sparse_matrix class as a dictionary of dictionaries (where I mean something like a Python dictionary). In my case, since insertion is only done at the beginning of the algorithm when data is read from file, I implemented a dictionary as a std library vector of pair objects (key, value) with a flag that indicates whether the vector is sorted. If the vector is sorted, a key is searched for using binary search; otherwise the vector is first sorted and then searched. Alternatively, it is possible to scan the dictionary's entries linearly, for example in loops over all the keys of the dictionary. The relevant portion of the code that is causing problems is the following:
void compute_predicted_ratings_omp (sparse_matrix &targets,
                                    sparse_matrix &user_item_rating_matrix,
                                    sparse_matrix &similarity_matrix,
                                    int k_neighbors)
{
    // Auxiliary private variables
    int user, item;
    double predicted_rating;
    dictionary<int,double> target_vector, item_rating_vector, item_similarity_vector;

    #pragma omp parallel shared(targets, user_item_rating_matrix, similarity_matrix)\
        private(user, item, predicted_rating, target_vector, item_rating_vector, item_similarity_vector)
    {
        if (omp_get_thread_num() == 0)
            std::cout << " - parallelized on " << omp_get_num_threads() << " threads: " << std::endl;

        #pragma omp for schedule(dynamic, 1)
        for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
        {
            // Retrieve target user
            user = targets.row(iter_row).get_key();
            // Retrieve the user rating vector.
            item_rating_vector = user_item_rating_matrix[user];

            for (size_t iter_col = 0; iter_col < targets.row(iter_row).value().size(); ++iter_col)
            {
                // Retrieve target item
                item = targets.row(iter_row).value().entry(iter_col).get_key();
                // retrieve similarity vector associated to the target item
                item_similarity_vector = similarity_matrix[item];
                // Compute predicted rating
                predicted_rating = predict_rating(item_rating_vector,
                                                  item_similarity_vector,
                                                  k_neighbors);
                // Set result in targets
                targets.row(iter_row).value().entry(iter_col).set_value(predicted_rating);
            }
        }
    }
}
In this function I compute the predicted rating for a series of target pairs (user, item) (this is simply a weighted average). To do that, I do an outer loop on the target users (which are on the rows of the targets sparse matrix) and I retrieve the rating vector for the current user performing a binary search on the rows of the user_item_rating_matrix. Then, for each column in the current row (i.e. for each item) I retrieve another vector associated to the current item from the sparse matrix similarity_matrix. With these two vectors, I compute the prediction as a weighted average of their elements (on a subset of the items in common between the two vectors).
My problem is the following: I want to parallelize the outer loop using OpenMP. In the serial version, this function takes around 3 secs. With OpenMP on 2 threads, it takes around 2 secs (which is not bad, since I still have some work imbalance in the outer loop). When using 4 threads, it takes 7 secs. I cannot understand the cause of this slowdown. Do you have any idea?
I have already thought about the problem and I share my considerations with you:
I access the sparse_matrices only in read mode. Since the matrices are pre-sorted, all the binary searches should not modify the matrices, and no race conditions should arise.
Various threads could access the same vector of the sparse matrix at the same time. I read something about false sharing, but since I do not write to these vectors I think this should not be the reason for the slowdown.
The parallel version seems to work fine with two threads (even if the speedup is lower than expected).
No problem is observed with 4 threads for other choices of the parameters. In particular (cf. "Further details on predict_rating function" below), when I consider all the similar items for the weighted average and I scan the rating vector and search in the similarity vector (the opposite of what I normally do), the execution time scales well on 4 threads.
Further details on the predict_rating function: This function works in the following way. The smaller of item_rating_vector and item_similarity_vector is scanned linearly, and I do a binary search on the longer of the two. If the rating/similarity is positive, it is considered in the weighted average.
double predict_rating (dictionary<int, double> &item_rating_vector,
                       dictionary<int, double> &item_similarity_vector)
{
    size_t size_item_rating_vector = item_rating_vector.size();
    size_t size_item_similarity_vector = item_similarity_vector.size();

    if (size_item_rating_vector == 0 || size_item_similarity_vector == 0)
        return 0.0;
    else
    {
        double s, r, sum_s = 0.0, sum_sr = 0.0;
        int temp_item = 0;

        if (size_item_rating_vector < size_item_similarity_vector)
        {
            // Scan item_rating_vector and search in item_similarity_vector
            for (dictionary<int,double>::const_iterator iter = item_rating_vector.begin();
                 iter != item_rating_vector.end();
                 ++iter)
            {
                // scan the rating vector forwards: iterate until the whole vector has
                // been scanned.
                temp_item = (*iter).get_key();
                // Retrieve the similarity between temp_item and the target item (0.0 if not present)
                try { s = item_similarity_vector[temp_item]; }
                catch (const std::out_of_range &e) { s = 0.0; }
                if (s > 0.0)
                {
                    // temp_item is positively similar to the target item. consider it in the average
                    // Retrieve rating that the user gave to temp_item
                    r = (*iter).get_value();
                    // increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }
        else
        {
            // Scan item_similarity_vector and search in item_rating_vector
            for (dictionary<int,double>::const_iterator iter = item_similarity_vector.begin();
                 iter != item_similarity_vector.end();
                 ++iter)
            {
                // scan the similarity vector forwards: iterate until the whole vector has
                // been scanned.
                temp_item = (*iter).get_key();
                s = (*iter).get_value();
                if (!(s > 0.0))
                    continue;
                // Retrieve rating that user gave to temp_item (0.0 if not given)
                try { r = item_rating_vector[temp_item]; }
                catch (const std::out_of_range &e) { r = 0.0; }
                if (r > 0.0)
                {
                    // temp_item is positively similar to the target item: increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }

        if (sum_s > 0.0)
            return sum_sr / sum_s;
        else
            return 0.0;
    }
}
Further details on the hardware: I am running this program on a Dell XPS 15 with a quad-core i7 processor and 16 GB RAM. I execute the code in a Linux VirtualBox VM (I set the VM to use 4 processors and 4 GB RAM).
Thanks in advance,
Pierpaolo
It appears you might have a false sharing problem with your targets variable. False sharing is when different threads frequently write to locations near each other (same cache line). By explicitly setting the schedule to dynamic with a chunk size of 1, you are telling OpenMP to only have each thread take tasks one element at a time, thus allowing different threads to work on data that may be near each other in targets.
I would recommend removing the schedule directive just to see how the default scheduler and chunk size do. Then I would try both static and dynamic schedules while varying the chunk size substantially. If your workload or hardware platform is unbalanced, dynamic will probably win, but I would still try static.
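For illustration, the change is confined to the schedule clause inside the existing parallel region; the chunk size below is a guess to be tuned, not a measured value:
#pragma omp for schedule(static)
// or, if the per-row work is very uneven, something like:
// #pragma omp for schedule(dynamic, 64)
for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
{
    // ... loop body unchanged ...
}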
Well, I found the solution to the problem myself; I post the explanation for the community. In the predict_rating function I used try/catch to handle the out_of_range errors thrown by my dictionary structure when a key that is not contained in the dictionary is searched for. I read in "Are exceptions in C++ really slow" that exception handling is computationally heavy when an exception is actually thrown. In my case, for each call of predict_rating I had multiple out_of_range errors thrown and handled. I simply removed the try/catch block and wrote a function that searches the dictionary and returns a default value if the key does not exist. This modification produced a speedup of around 2000x, and now the program scales well with respect to the number of threads even on the VM.
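For reference, a sketch of such a lookup-with-default helper; the dictionary class itself is not shown in the question, so the find()/end() members used here are assumptions about its interface:
template <typename K, typename V>
V get_or_default(const dictionary<K, V> &dict, const K &key, V default_value)
{
    // assumed: find() does the binary search and returns end() when the key is absent
    typename dictionary<K, V>::const_iterator it = dict.find(key);
    return (it != dict.end()) ? (*it).get_value() : default_value;
}

// In predict_rating, the try/catch blocks then become plain calls such as:
// s = get_or_default(item_similarity_vector, temp_item, 0.0);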
Thanks to all of you and if you have other suggestions don't hesitate!
Pierpaolo