Problem about value assignment in Arrayfire - c++

I'm using ArrayFire and Flashlight to evaluate a network.
auto tmp = output(af::seq(2, 10), af::span, af::span, af::span);
auto softmax_tmp = fl::softmax(tmp, 0);
output(af::seq(2,10),af::span,af::span,af::span)=softmax_tmp;
output is a tensor of shape (12,100,1,1). I want to extract rows 2 through 10 along the first dimension, which gives 100 nine-dimensional vectors, apply softmax to each of them, and write the result back. The code is shown above.
The problem is that the 3rd line doesn't work. softmax_tmp holds the correct values, but the assignment in the 3rd line has no effect: the code compiles without errors, yet output keeps the old values it had in the 1st line.
Can anyone help? Thanks a lot.

Related

How to correctly format input and resize output data while using a TensorRT engine?

I'm trying to run a deep learning model in the TensorRT runtime. The model conversion step went fine and I'm fairly sure it is correct.
There are two parts I'm currently struggling with: copying data from host to device (for example from OpenCV to TensorRT) and getting the right output shape so that I can read the results correctly. So my questions are:
How does the shape of the input dims relate to the memory buffer? What is the difference when the model input dims are NCHW versus NHWC? When I read an OpenCV image it is NHWC, and the model input is also NHWC, so do I have to rearrange the buffer data? If yes, what memory layout do I have to produce? Or, put simply, what format or sequence of data does the engine expect?
About the output (assuming the input is buffered correctly): how do I get the right result shape for each task (detection, classification, etc.)?
For example, an array or something similar to what you get when working with Python.
I read the NVIDIA docs and they are not beginner-friendly at all.
// Let's say I have a model that has a dynamic-shape input in NHWC format.
auto input_dims = nvinfer1::Dims4{1, 386, 342, 3}; //Using fixed H, W for testing
context->setBindingDimensions(input_idx, input_dims);
auto input_size = getMemorySize(input_dims, sizeof(float));
// How do I pack an OpenCV Mat into these dims, and if I encounter a new input dim format, how do I adapt to it?
The expected output dims are something like (1,32,53,8), for example. The output buffer is just a pointer, and I don't know in what order the data is stored, so I can't reconstruct the expected array shape.
// Run TensorRT inference
void* bindings[] = {input_mem, output_mem};
bool status = context->enqueueV2(bindings, stream, nullptr);
if (!status)
{
std::cout << "[ERROR] TensorRT inference failed" << std::endl;
return false;
}
auto output_buffer = std::unique_ptr<int[]>{new int[output_size]}; // array form so that delete[] is used
if (cudaMemcpyAsync(output_buffer.get(), output_mem, output_size, cudaMemcpyDeviceToHost, stream) != cudaSuccess)
{
std::cout << "ERROR: CUDA memory copy of output failed, size = " << output_size << " bytes" << std::endl;
return false;
}
cudaStreamSynchronize(stream);
// How do I use this output_buffer to form the right output shape, (1,32,53,8) in this case?
Could you please edit your question and tell us which model you're using? If it's a commonly known NN, perhaps it's one we can download to test locally.
That said, here is an answer, since most of it does not depend on the model (although knowing the model would help).
How does the shape of the input dims relate to the memory buffer?
If the input is NxCxHxW, you need to allocate N*C*H*W*sizeof(float) memory for that on your CPU and GPU. To be more precise, you need to allocate space on GPU for all the bindings and on CPU for only input and output bindings.
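To make the size arithmetic concrete, here is a minimal Python sketch (not TensorRT code). It assumes 4-byte float elements and a dense, contiguous buffer; the helper names num_bytes, offset_nchw and offset_nhwc are made up for illustration.

```python
def num_bytes(dims, elem_size=4):
    """Bytes needed for a dense tensor with the given dims (elem_size=4 assumes float32)."""
    count = 1
    for d in dims:
        count *= d
    return count * elem_size


def offset_nchw(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, h, w, c, H, W, C):
    """Flat element offset of (n, h, w, c) in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


# Input dims from the question: N=1, H=386, W=342, C=3
input_size = num_bytes((1, 386, 342, 3))
```

The same logical element lands at a different offset depending on the layout, which is why the buffer you copy to the GPU has to be packed in the order the engine's input binding expects.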
When I read an OpenCV image it is NHWC, and the model input is also NHWC, so do I have to rearrange the buffer data?
No, you do not have to rearrange the buffer data. If you ever do need to convert between NHWC and NCHW, you can check this or search for 'opencv NHWC to NCHW'.
Full working code example here, especially this function.
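If a conversion does turn out to be necessary, the repacking itself is just an index permutation. Here is a rough pure-Python sketch for a single image, assuming a flat, interleaved HWC buffer (the way OpenCV stores pixels) and a planar CHW target. hwc_to_chw is a hypothetical helper; in practice you would do this with OpenCV or on the GPU rather than in a Python loop.

```python
def hwc_to_chw(flat, H, W, C):
    """Repack a flat interleaved HWC buffer into a flat planar CHW buffer."""
    if len(flat) != H * W * C:
        raise ValueError("buffer length does not match H*W*C")
    out = [0] * (H * W * C)
    for h in range(H):
        for w in range(W):
            for c in range(C):
                # source: channels interleaved per pixel; target: one plane per channel
                out[(c * H + h) * W + w] = flat[(h * W + w) * C + c]
    return out


# Two pixels with three channels each: (1, 2, 3) and (4, 5, 6)
planar = hwc_to_chw([1, 2, 3, 4, 5, 6], H=1, W=2, C=3)
```

After the call, planar holds all values of channel 0 first, then channel 1, then channel 2.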
Or, put simply, what format or sequence of data does the engine expect?
This depends on how the neural network was trained. In general you should know exactly which preprocessing and image data formats were used to train the NN. You should even use the same libraries to load and process images if possible. It's an open problem in ML: if you try to replicate the results of a paper using its model, but the authors haven't open sourced the preprocessing, you might get worse results. In the "worst" case you can implement both NHWC and NCHW and test which of them works.
About the output (assuming the input is buffered correctly): how do I get the right result shape for each task (detection, classification, etc.)? For example, an array or something similar to what you get when working with Python.
This question clearly requires me to understand which NNs you are referring to. But I myself do the following:
Load the TensorRT .engine file in my code like this and deserialize like this
Print the bindings like this
Then I know the size of the input binding or bindings if there are many inputs, and the size of the output binding or bindings if there are many outputs.
This way you know the right result shape for each task. I hope this answered your question. If not, please add detailed comments and edit your post to be more precise. Thank you.
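Once the output binding dims are known, rebuilding the shape from the flat buffer is plain row-major indexing. A minimal sketch, assuming the output buffer is dense and stored in linear (row-major) order with the last dim varying fastest; unflatten and flat_index are hypothetical helpers, not TensorRT functions.

```python
def flat_index(index, dims):
    """Row-major offset of a multi-dimensional index, last dim varying fastest."""
    offset = 0
    for i, d in zip(index, dims):
        offset = offset * d + i
    return offset


def unflatten(flat, dims):
    """Rebuild nested lists of shape `dims` from a flat row-major sequence."""
    total = 1
    for d in dims:
        total *= d
    if len(flat) != total:
        raise ValueError("buffer length does not match dims")
    if len(dims) == 1:
        return list(flat)
    step = total // dims[0]
    return [unflatten(flat[i * step:(i + 1) * step], dims[1:])
            for i in range(dims[0])]


# Small stand-in for an output such as (1,32,53,8)
dims = (1, 2, 3, 2)
output = unflatten(list(range(12)), dims)
```

With this, output[n][a][b][c] is the same element as the flat buffer entry at flat_index((n, a, b, c), dims). If the engine reports a different tensor format for the binding, the index formula has to change accordingly.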
I read the NVIDIA docs and they are not beginner-friendly at all.
Yes, I agree. You're better off searching GitHub for TensorRT C++ (or Python) repositories and studying their code. Have you seen the TensorRT samples? It doesn't really take many lines of code to implement TensorRT inference.

tensorflow: transpose expects a vector of size 1. But input(1) is a vector of size 2

I want to use a trained RNN language model to do inference. So:
I loaded the trained model graph in c++ using
tensorflow::MetaGraphDef graph_def;
TF_CHECK_OK(ReadBinaryProto(Env::Default(), path_to_graph, &graph_def));
TF_CHECK_OK(session->Create(graph_def.graph_def()));
load the model parameters by:
Tensor checkpointPathTensor(tensorflow::DT_STRING, tensorflow::TensorShape());
checkpointPathTensor.scalar<std::string>()() = path_to_ckpt;
TF_CHECK_OK(session_->Run({{graph_def.saver_def().filename_tensor_name(), checkpointPathTensor} },{},{graph_def.saver_def().restore_op_name()},nullptr));
Up to this point everything works fine. Then I want to compute the value of the node "output/output_batch_major":
TF_CHECK_OK(session->Run(inputs,{"output/output_batch_major"},{"post_control_dependencies"}, &outputs));
I got the error:
2018-07-13 14:13:36.793495: F tf_lm_model_loader.cc:190] Non-OK-status: session->Run(inputs,{"output/output_batch_major"},{"post_control_dependencies"}, &outputs) status: Invalid argument: transpose expects a vector of size 1. But input(1) is a vector of size 2
[[Node: extern_data/placeholders/delayed/sequence_mask_time_major/transpose = Transpose[T=DT_BOOL, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](extern_data/placeholders/delayed/SequenceMask/Less, extern_data/placeholders/delayed/sequence_mask_time_major/transpose/perm)]]
Aborted (core dumped)
I checked the graph using TensorBoard. extern_data/placeholders/delayed/sequence_mask_time_major/transpose/perm is a tensor of size 2. Is this tensor the input(1) in the error? How can I fix the problem? Any ideas? Thanks in advance!
I had a similar issue with the input tensor to my predictor. I expanded the dimension by one and the issue was resolved. I suggest running the predictor in Python first. This helps to identify the size of the input tensor that you are passing to the predictor. Then replicate exactly the same size in C++. Also, based on your code snippet, I am not sure how you define the inputs to the Run method. I defined it as follows in my code:
std::vector<std::pair<std::string, tensorflow::Tensor>> input = {
{"input_1", input_tensor }
};
where "input_1" is the name of my input layer.
I hope this helps.
I get this error when I pass the wrong kind of input to a TensorFlow model. The model required a 3-D array and I passed a 1-D array instead, so check your input data first.

Declaring variables in Python 2.7x to avoid issues later

I am new to Python, coming from MATLAB, and long ago from C. I have written a script in MATLAB which simulates sediment transport in rivers as a Markov process. The code randomly places circles of a random diameter within a rectangular area of a specified dimension. The circles are non-uniform in size, drawn randomly from a specified range of sizes. I do not know how many times I will step through the circle placement operation, so I use a while loop to complete the process. In an attempt to be more community oriented, I am translating the MATLAB script to Python. I used the online tool OMPC to get started, and have been working through it manually from the auto-translated version (which was not that helpful, which is not surprising). To debug the code as I go, I compare the MATLAB-generated results against the results in Python. It seems clear to me that I have declared variables in a way that introduces problems as calculations proceed in the script. Here are two examples of consistent problems between different instances of code execution. First, the code generated what I think are arrays within arrays, because the script is returning results which look like:
array([[ True]
[False]], dtype=bool)
This result was generated for the following code snippet at the overlap_logix operation:
CenterCoord_Array = np.asarray(CenterCoordinates)
Diameter_Array = np.asarray(Diameter)
dist_check = ((CenterCoord_Array[:,0] - x_Center) ** 2 + (CenterCoord_Array[:,1] - y_Center) ** 2) ** 0.5
radius_check = (Diameter_Array / 2) + radius
radius_check_update = np.reshape(radius_check,(len(radius_check),1))
radius_overlap = (radius_check_update >= dist_check)
# Now actually check the overlap condition.
if np.sum([radius_overlap]) == 0:
    # The new circle does not overlap so proceed.
    newCircle_Found = 1
    debug_value = 2
elif np.sum([radius_overlap]) == 1:
    # The new circle overlaps with one other circle
    overlap = np.arange(0,len(radius_overlap), dtype=int)
    overlap_update = np.reshape(overlap,(len(overlap),1))
    overlap_logix = (radius_overlap == 1)
    idx_true = overlap_update[overlap_logix]
    radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
A similar result for the same run was produced for variables:
radius_check_update
radius_overlap
overlap_update
Here is the same code snippet for the working MATLAB version (as requested):
distcheck = ((Circles.CenterCoordinates(1,:)-x_Center).^2 + (Circles.CenterCoordinates(2,:)-y_Center).^2).^0.5;
radius_check = (Circles.Diameter ./ 2) + radius;
radius_overlap = (radius_check >= distcheck);
% Now actually check the overlap condition.
if sum(radius_overlap) == 0
% The new circle does not overlap so proceed.
newCircle_Found = 1;
debug_value = 2;
elseif sum(radius_overlap) == 1
% The new circle overlaps with one other circle
temp = 1:size(radius_overlap,2);
idx_true = temp(radius_overlap == 1);
radius = distcheck(1,idx_true) - (Circles.Diameter(1,idx_true)/2);
In the Python version I have created arrays from lists to more easily operate on the contents (the first two lines of the code snippet). The array within array result and creating arrays to access data suggests to me that I have incorrectly declared variable types, but I am not sure. Furthermore, some variables have a size, for example, (2L,) (the numerical dimension will change as circles are placed) where there is no second dimension. This produces obvious problems when I try to use the array in an operation with another array with a size (2L,1L). Because of these problems I started reshaping arrays, and then I stopped because I decided these were hacks because I had declared one, or more than one variable incorrectly. Second, for the same run I encountered the following error:
TypeError: 'numpy.ndarray' object is not callable
for the operation:
radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
which occurs at the bottom of the above code snippet. I have posted the entire script at the following link because it is probably more useful to execute the script for oneself:
https://github.com/smchartrand/MarkovProcess_Bedload
I have set-up the code to run with some initial parameter values so decisions do not need to be made; these parameter values produce the expected results in the MATLAB-based script, which look something like this when plotted:
So, I seem to specifically be having issues with operations on lines 151-165, depending on the test value np.sum([radius_overlap]) and I think it is because I incorrectly declared variable types, but I am really not sure. I can say with confidence that the Python version and the MATLAB version are consistent in output through the first step of the while loop, and code line 127 which is entering the second step of the while loop. Below this point in the code the above documented issues eventually cause the script to crash. Sometimes the script executes to 15% complete, and sometimes it does not make it to 5% - this is due to the random nature of circle placement. I am preparing the code in the Spyder (Python 2.7) IDE and will share the working code publicly as a part of my research. I would greatly appreciate any help that can be offered to identify my mistakes and misapplications of python coding practice.
I believe I have answered my own question, and maybe it will be of use for someone down the road. The main sources of instruction for me can be found at the following three web pages:
Stackoverflow Question 176011
SciPy FAQ
SciPy NumPy for Matlab users
The third web page was very helpful for me coming from MATLAB. Here is the modified and working python code snippet which relates to the original snippet provided above:
dist_check = ((CenterCoordinates[0,:] - x_Center) ** 2 + (CenterCoordinates[1,:] - y_Center) ** 2) ** 0.5
radius_check = (Diameter / 2) + radius
radius_overlap = (radius_check >= dist_check)
# Now actually check the overlap condition.
if np.sum([radius_overlap]) == 0:
    # The new circle does not overlap so proceed.
    newCircle_Found = 1
    debug_value = 2
elif np.sum([radius_overlap]) == 1:
    # The new circle overlaps with one other circle
    overlap = np.arange(0,len(radius_overlap[0]), dtype=int).reshape(1, len(radius_overlap[0]))
    overlap_logix = (radius_overlap == 1)
    idx_true = overlap[overlap_logix]
    radius = dist_check[idx_true] - (Diameter[0,idx_true] / 2)
In the end it was clear to me that it was more straightforward for this example to use numpy arrays vs. lists to store results for each iteration of filling the rectangular area. For the corrected code snippet this means I initialized the variables:
CenterCoordinates, and
Diameter
as numpy arrays whereas I initialized them as lists in the posted question. This made a few mathematical operations more straightforward. I was also incorrectly indexing into variables with parentheses () as opposed to the correct method using brackets []. Here is an example of a correction I made which helped the code execute as envisioned:
Incorrect: radius = dist_check(idx_true,1) - (Diameter(idx_true,1) / 2)
Correct: radius = dist_check[idx_true] - (Diameter[0,idx_true] / 2)
This example also shows that I had issues with array dimensions which I corrected variable by variable. I am still not sure if my working code is the most pythonic or most efficient way to fill a rectangular area in a random fashion, but I have tested it about 100 times with success. The revised and working code can be downloaded here:
Working Python Script to Randomly Fill Rectangular Area with Circles
Here is an image of the final result of a successful run of the working code:
The main lessons for me were (1) numpy arrays are more efficient for repetitive numerical calculations, and (2) the dimensionality of the arrays I created was not always what I expected it to be, so care must be taken when establishing arrays. Thanks to those who looked at my question and asked for clarification.
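For readers hitting the same two symptoms, here is a small self-contained NumPy sketch with made-up values (they are not taken from the script). It shows how comparing a (2,1) array with a (2,) array broadcasts to a (2,2) result, and that indexing needs brackets, since parentheses try to call the array.

```python
import numpy as np

radius_check = np.array([1.0, 3.0])   # shape (2,)
dist_check = np.array([2.0, 2.5])     # shape (2,)

# Same shapes: elementwise comparison, result has shape (2,)
radius_overlap = radius_check >= dist_check

# Reshaping one operand to a column makes NumPy broadcast to shape (2, 2)
column = np.reshape(radius_check, (len(radius_check), 1))
broadcast_overlap = column >= dist_check

# Brackets index the array with the positions where the mask is True
idx_true = np.flatnonzero(radius_overlap)
picked = dist_check[idx_true]

# Parentheses try to call the array like a function and raise TypeError
try:
    dist_check(idx_true)
except TypeError as exc:
    message = str(exc)
```

Checking .shape on each intermediate array while translating from MATLAB is a cheap way to catch an unintended (N,1) versus (N,) mismatch before it propagates.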

K-Means Algorithm not working properly

I was trying to write my own K-Means clustering algorithm, but it is not working. Can someone take a look and help me find the mistake I am making? I am fairly new to this.
I expect the data to be clustered into 2 groups since K=2, but I am not getting the expected result. I think the mean assignment is not working properly. Can someone take a look?
https://github.com/DivJ/Robo_Lab/blob/master/K_Means.py
dist=[]
lab=[]
x_sum,y_sum=0,0
x_sum1,y_sum1=0,0
k=2
mean=pt[:k]
def assignment():
    global dist
    global lab
    for i in range(0,100):
        for j in range(0,k):
            dist.append(math.hypot(pt[i,0]-mean[j,0],pt[i,1]-mean[j,1]))
        lab.append(dist.index(min(dist)))
        dist=[]
def mean_shift():
    global x_sum,x_sum1,y_sum,y_sum1,lab
    for i in range(0,100):
        if(lab[i]==0):
            plt.scatter(pt[i,0],pt[i,1],c='r')
            x_sum=pt[i,0]+x_sum
            y_sum=pt[i,1]+y_sum
        elif(lab[i]==1):
            plt.scatter(pt[i,0],pt[i,1],c='b')
            x_sum1=pt[i,0]+x_sum1
            y_sum1=pt[i,1]+y_sum1
    mean[0,0]=x_sum/lab.count(0)
    mean[0,1]=y_sum/lab.count(0)
    mean[1,0]=x_sum1/lab.count(1)
    mean[1,1]=y_sum1/lab.count(1)
    lab=[]
def k_means(itr):
    for z in range(0,itr):
        assignment()
        mean_shift()
k_means(100)
Here's what's wrong with your code:
1) You initialize mean as pt[:k], but later you assign new values to mean. Assuming pt is a NumPy array (as the pt[i,0] indexing suggests), the slice pt[:k] is a view into pt rather than a copy, so updating mean silently overwrites the first two points. You need to create a copy of the first two points to avoid changing them:
import copy
mean=copy.copy(pt[:k])
2) You initialize x_sum, y_sum, x_sum1 and y_sum1 outside of mean_shift(), so the sums keep growing from one iteration to the next. Reset them to 0 at the start of every call to mean_shift().
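Putting both fixes together, a corrected version might look roughly like the sketch below. It is a vectorized rewrite rather than your original code, it assumes pt is an (n, 2) array, and the function signature is my own choice.

```python
import numpy as np


def k_means(pt, k=2, iterations=100):
    """Plain k-means; returns the final means and the label of each point."""
    pt = np.asarray(pt, dtype=float)
    mean = pt[:k].copy()  # copy, so updating the means never touches pt
    lab = np.zeros(len(pt), dtype=int)
    for _ in range(iterations):
        # Assignment step: distance from every point to every mean, shape (n, k)
        dist = np.linalg.norm(pt[:, None, :] - mean[None, :, :], axis=2)
        lab = dist.argmin(axis=1)
        # Update step: the sums are recomputed from scratch on every iteration
        for j in range(k):
            members = pt[lab == j]
            if len(members) > 0:
                mean[j] = members.mean(axis=0)
    return mean, lab


points = np.array([[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]], dtype=float)
final_mean, labels = k_means(points, k=2, iterations=10)
```

Because the means are recomputed from the current members on every pass, there are no running totals left to reset, and the copy keeps the first k points of the data intact.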

CUDA: least-squares solving is slow

Recently I used CUDA to write an algorithm called 'orthogonal matching pursuit'. With my ugly CUDA code the entire iteration takes 60 sec, while the Eigen library takes just 3 sec...
In my code the matrix A is [640,1024] and y is [640,1]. In each step I select some columns from A to compose a new matrix called A_temp [640,itera], with itera = 1:500. I allocate an array MaxDex_Host[] on the CPU that records which columns to select.
I want to get x_temp[itera,1] from A_temp*x_temp=y using least squares. I use the CULA API 'culaDeviceSgels' and the cuBLAS matrix-vector multiplication API.
So culaDeviceSgels is called 500 times, and I expected this to be faster than the Eigen library's QR solver.
I checked the Nsight performance analysis and found that cuStreamDestroy takes a long time. I initialize cuBLAS before the iteration and destroy it after I get the result. So I want to know what cuStreamDestroy is, and how it differs from cublasDestroy.
The main problems are memcpy and the function 'gemm_kernel1x1val'. I think this function comes from 'culaDeviceSgels'.
while(itera<500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera] , then
MaxDex_Host[itera]=pos;
itera++;
float* A_temp_cpu=new float[M*itera]; // matrix all in col-major
for (int j=0;j<itera;j++) // build A_temp [M,itera]; MaxDex_Host[] holds the index of each column of A to choose
{
for (int i=0;i<M;i++) // M=640, A is 640*1024, itera increases by 1 each step
{
A_temp_cpu[j*M+i]=A[MaxDex_Host[j]*M+i];
}
}
// I must allocate a second array because culaDeviceSgels overwrites its input array with the decomposition, and I want to use A_temp after the least-squares solve.
float* A_temp_gpu;
float* A_temp2_gpu;
cudaMalloc((void**)&A_temp_gpu,Size_float*M*itera);
cudaMalloc((void**)&A_temp2_gpu,Size_float*M*itera);
cudaMemcpy(A_temp_gpu,A_temp_cpu,Size_float*M*itera,cudaMemcpyHostToDevice);
cudaMemcpy(A_temp2_gpu,A_temp_gpu,Size_float*M*itera,cudaMemcpyDeviceToDevice);
culaDeviceSgels('N',M,itera,1,A_temp_gpu,M,y_Gpu_temp,M); // the x_temp I want is returned in y_Gpu_temp, stored in y_Gpu_temp[0] through y_Gpu_temp[itera-1]
float* x_temp;
cudaMalloc((void**)&x_temp,Size_float*itera);
cudaMemcpy(x_temp,y_Gpu_temp,Size_float*itera,cudaMemcpyDeviceToDevice);
CUDA's memory management seems too complex. Is there a more convenient way to solve least squares?
I think that cuStreamDestroy and gemm_kernel1x1val are called internally by the APIs you are using, so there is not much you can do about them.
To improve your code, I would suggest the following.
You can get rid of A_temp_cpu by keeping a device copy of the matrix A. Then you can copy the selected columns of A into the columns of A_temp_gpu and A_temp2_gpu with a kernel assignment. This would avoid the first two cudaMemcpys.
You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop, using the maximum possible value of itera for the size instead of the current itera. This avoids the first two cudaMallocs inside the loop. The same applies to x_temp.
As far as I know, culaDeviceSgels solves a linear system of equations. I think you can do the same using cuBLAS APIs only. For example, you can perform an LU factorization first with cublasSgetrfBatched() and then call cublasStrsv() twice to solve the two resulting triangular systems (since A_temp is not square, this would have to be applied to the normal equations rather than to A_temp directly). You may wish to see whether this leads to a faster algorithm.
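As a CPU reference to compare timings and results against, the overall loop can be sketched in NumPy. This is only an illustration of the structure, not CUDA code: it assumes the usual OMP selection rule (the column with the largest absolute correlation with the residual, which is what a gemv followed by an isamax would give), and the function name omp and the use of np.linalg.lstsq are my own choices. A_temp is allocated once at its maximum size, as suggested above.

```python
import numpy as np


def omp(A, y, n_iter):
    """Orthogonal matching pursuit sketch: returns chosen column indices and coefficients."""
    M, _ = A.shape
    A_temp = np.empty((M, n_iter), order="F")  # preallocated once, column-major
    selected = []
    residual = y.astype(float)  # astype returns a copy
    x_temp = np.zeros(0)
    for itera in range(n_iter):
        corr = np.abs(A.T @ residual)       # matrix-vector product
        if selected:
            corr[selected] = 0.0            # never pick the same column twice
        pos = int(np.argmax(corr))          # index of the largest entry
        selected.append(pos)
        A_temp[:, itera] = A[:, pos]        # copy one column, no reallocation
        active = A_temp[:, :itera + 1]
        x_temp, *_ = np.linalg.lstsq(active, y, rcond=None)
        residual = y - active @ x_temp
    return selected, x_temp


A_demo = np.eye(4)
y_demo = np.array([0.0, 3.0, 0.0, 1.0])
chosen, coeffs = omp(A_demo, y_demo, n_iter=2)
```

Note that only one new column is written per step here; the loops in the question rebuild the whole A_temp on the host and re-upload it on every iteration, which is likely where much of the memcpy time goes.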