Reduce vector by mask in CUDA C++ [duplicate]

This question already has answers here:
CUDA stream compaction algorithm
(3 answers)
CUDA populate small array with contents of larger array
(1 answer)
Closed 16 days ago.
I am trying to achieve a simple operation using CUDA. I need to reduce a 1D vector by a 1D mask, so for example for the data array [8,9,1,9,6] and the mask [0,1,1,0,1] I should get the result [9,1,6]. I tried to create a parallel CUDA function to handle this task. Nevertheless, I was not successful.
In order to parallelize the work in CUDA, I split the input vector into sectors of similar size inside the function ReduceVectorByMask_D_G. Each sector is processed independently by a separate thread. Each thread has one counter (coutersInput_G) to index the input array and one counter (countersOutput_G) to index the output array. As a thread walks through the mask, values where the mask is non-zero are copied to the output array (which also increments the output counter). To increment the counters safely, I use atomicAdd. Nevertheless, the counters are not incremented correctly and keep the same values in each run of the thread. It looks as if the original values of the counters are cached somewhere and CUDA does not expect the counters to change between individual runs of the function ReduceVectorByMask_D_G_G. Once the execution of ReduceVectorByMask_D_G_G ends, the individual counters do end up with the correct values.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <math.h>
#define THREADS 5//Use only 5 threads for simplicity.
#define SETVECTORVALUE_G_G int i = blockDim.x * blockIdx.x + threadIdx.x;if (i < length){vector_G[i] = value;}
#define CUDAVALLOC1(typ) typ* output_G=NULL;\
cudaMalloc((void**)&output_G, length * sizeof(typ));\
if (output_G == NULL)\
{\
return NULL;\
}\
int gridSize = (int)ceil(((double)length / (double)THREADS))
/// <summary>
/// Set all values in vector to specific int value.
/// </summary>
/// <param name="vector_G">Vector to be edited.</param>
/// <param name="length">Number of elements.</param>
/// <param name="value">The value to be set.</param>
__global__ void SetVectorValue_I32_G_G(int* vector_G, size_t length, int value)
{
SETVECTORVALUE_G_G;//Set values for individual elements.
}
/// <summary>
/// Allocate array in GPU memory and fill with given value.
/// </summary>
/// <param name="length">Number of elements.</param>
/// <param name="value">Value to be used for filling.</param>
/// <returns>Returns address of the new array.</returns>
int* CudaValloc_I32_G(size_t length, int value)
{
CUDAVALLOC1(int);//Initialize arrays using type int.
SetVectorValue_I32_G_G<<<gridSize, THREADS>>>(output_G, length, value);//Set individual values of the output array to specific value in GPU.
cudaDeviceSynchronize();//Wait for the process to finish.
return output_G;
}
/// <summary>
/// Set all values in vector to specific double value.
/// </summary>
/// <param name="vector_G">Vector to be edited.</param>
/// <param name="length">Number of elements.</param>
/// <param name="value">The value to be set.</param>
__global__ void SetVectorValue_D_G_G(double* vector_G, size_t length, double value)
{
SETVECTORVALUE_G_G;//Set values for individual elements.
}
/// <summary>
/// Allocate array in GPU memory and fill with given value.
/// </summary>
/// <param name="length">Number of elements.</param>
/// <param name="value">Value to be used for filling.</param>
/// <returns>Returns address of the new array.</returns>
double* CudaValloc_D_G(size_t length, double value)
{
CUDAVALLOC1(double);//Initialize arrays using type double.
SetVectorValue_D_G_G<<<gridSize, THREADS>>>(output_G, length, value);//Set individual values of the output array to specific value in GPU.
cudaDeviceSynchronize();//Wait for the process to finish.
return output_G;
}
/// <summary>
/// Sum vector elements in individual threads. Each thread has its own accumulator.
/// </summary>
/// <param name="vector_G">Vector to be summed to individual counters.</param>
/// <param name="length">Length of the vector_G.</param>
/// <param name="counters_G">Individual accumulators for individual threads should be zeroed before function call. The length is the number of threads.</param>
__global__ void SumVectorValues_D_I32_G(int* vector_G, size_t length, int* counters_G)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;//Overall index in the input.
if (i < length)
{
atomicAdd((int*)(counters_G + threadIdx.x), vector_G[i]);//Safe add.
}
}
/// <summary>
/// Summation of vector values using cuda.
/// </summary>
/// <param name="vector_G">Vector of integers to be summed.</param>
/// <param name="length">Length of the vector.</param>
/// <returns>Sum of the input vector.</returns>
int SumVectorValues_I32_G(int* vector_G, size_t length)
{
int output = 0;//The sum to be returned.
int gridSize = (int)ceil(((double)length / (double)THREADS));//Determine how many times the threads need to be run to cover whole array.
int* counters_G = CudaValloc_I32_G(THREADS, 0);//Create counter for each thread.
int* counters = (int*)malloc(THREADS * sizeof(int));//Prepare space for counters in the RAM.
SumVectorValues_D_I32_G<<<gridSize, THREADS>>>(vector_G, length, counters_G);//Run individual threads.
cudaDeviceSynchronize();//Wait for the process to finish.
cudaMemcpy(counters, counters_G, THREADS * sizeof(int), cudaMemcpyDeviceToHost);//Move counters to RAM.
for (size_t i = 0; i < THREADS; i++)//Sum individual counters to one value.
{
output += counters[i];
}
cudaFree(counters_G);
free(counters);
return output;
}
/// <summary>
/// Independently process individual sectors of the input into the output.
/// </summary>
/// <param name="vector_G">The whole input data vector to be processed.</param>
/// <param name="mask_G">The whole input data vector to be used.</param>
/// <param name="inputSectorStarts_G">Relative addresses of sector starts in input.</param>
/// <param name="inputSectorSizes_G">Lengths of individual input sectors.</param>
/// <param name="outputSectorStarts_G">Relative addresses of sector starts in output.</param>
/// <param name="coutersInput_G">Zeroed counters to address input.</param>
/// <param name="countersOutput_G">Zeroed counters to address output.</param>
/// <param name="output_G">Array to strore output values.</param>
__global__ void ReduceVectorByMask_D_G_G(double* vector_G, int* mask_G, size_t* inputSectorStarts_G, size_t* inputSectorSizes_G, size_t* outputSectorStarts_G, int* coutersInput_G, int* countersOutput_G, double* output_G)
{
if (coutersInput_G[threadIdx.x] < inputSectorSizes_G[threadIdx.x])//Check counter to be in the current sector borders.
{//Current thread is still in its sector borders.
if (mask_G[inputSectorStarts_G[threadIdx.x] + coutersInput_G[threadIdx.x]])//Check mask value.
{//Mask is true the value copy will be done.
output_G[outputSectorStarts_G[threadIdx.x] + countersOutput_G[threadIdx.x]] = vector_G[inputSectorStarts_G[threadIdx.x] + coutersInput_G[threadIdx.x]];//Copy value from input array to the output array.
atomicAdd(countersOutput_G + ((int)threadIdx.x), 1);//Increment output counter to point to the next empty address.
}
atomicAdd(coutersInput_G + ((int)threadIdx.x), 1);//Increment input counter to point to the next value to be checked.
}
}
/// <summary>
/// Removes data points where mask is zero.
/// </summary>
/// <param name="vector_G">Input data to be reduced.</param>
/// <param name="mask_G">Mask to used for reduction. Contains values 1 or 0.</param>
/// <param name="vectorCount">Number of elements in the mask and in the input vector.</param>
/// <param name="countOfValidElements">Count of the non-zero elements in the mask.</param>
/// <returns>Reduced data vector in GPU memory.</returns>
double* ReduceVectorByMask_D_G(double* vector_G, int* mask_G, size_t vectorCount, size_t* countOfValidElements)
{
int gridSize = (int)ceil(((double)vectorCount / (double)THREADS));//How many times the threads need to be run to browse whole input array.
size_t* inputSizes = (size_t*)calloc(THREADS, sizeof(size_t));//Sizes of individual input blocks for individual threads.
size_t* inputSectorSizes_G = NULL;//Sector sizes of the input vector for individual threads.
size_t* outputSectorSizes = (size_t*)calloc(THREADS, sizeof(size_t));//Sector sizes of the output vector for individual threads.
size_t* outputSectorStarts = (size_t*)calloc(THREADS, sizeof(size_t));//Where in the output array individual sectors start.
size_t* inputSectorStarts = (size_t*)calloc(THREADS, sizeof(size_t));//Where in the input array individual sectors start.
size_t* inputSectorStarts_G = NULL;//Clone of the inputSectorStarts stored in the GPU.
size_t* outputSectorStarts_G = NULL;//Clone of the outputSectorStarts stored in the GPU.
int* countersInput_G = CudaValloc_I32_G(THREADS, 0);//Initialize and zero counters for incremental addressing of the input vector.
int* countersOutput_G = CudaValloc_I32_G(THREADS, 0);//Initialize and zero counters for incremental addressing of the output vector.
double* output_G = NULL;//Output vector.
cudaMalloc((void**)&inputSectorSizes_G, THREADS * sizeof(size_t));//Allocate GPU working arrays.
cudaMalloc((void**)&inputSectorStarts_G, THREADS * sizeof(size_t));//Allocate GPU working arrays.
cudaMalloc((void**)&outputSectorStarts_G, THREADS * sizeof(size_t));//Allocate GPU working arrays.
for (size_t i = 0; i < THREADS - 1; i++)//Browse threads and split input vector equidistantly.
{
inputSizes[i] = vectorCount / THREADS;//Division and flooring.
inputSectorStarts[i + 1] = inputSectorStarts[i] + inputSizes[i];//Address of the start of the new sector is calculated as the address of the start of the previous sector plus size of previous sector.
}
inputSizes[THREADS - 1] = vectorCount - vectorCount / THREADS * (THREADS - 1);//Last sector contains the remaining elements, as the number of input elements may not be divisible by THREADS.
cudaMemcpy(inputSectorStarts_G, inputSectorStarts, THREADS * sizeof(size_t), cudaMemcpyHostToDevice);//Move sector starts addresses to the GPU.
int* mask_G_temp = mask_G;//Prepare incremental pointer to the input mask.
countOfValidElements[0] = 0;//Zero overall counter.
for (size_t i = 0; i < THREADS; i++)//Browse individual sectors in mask to find how large sectors need to be in the output.
{
outputSectorSizes[i] = (size_t)SumVectorValues_I32_G(mask_G_temp, inputSizes[i]);//Get number of non-zero mask elements.
countOfValidElements[0] += outputSectorSizes[i];//Add to the overall output size.
mask_G_temp += inputSizes[i];//Move to next sector.
}
output_G = CudaValloc_D_G(countOfValidElements[0], 0);//Allocate output vector with zeros using precalculated number of non-zero elements.
for (size_t i = 0; i < THREADS - 1; i++)//Calculate sectors starts in the output.
{
outputSectorStarts[i + 1] = outputSectorStarts[i] + outputSectorSizes[i];//Current sector start is the previous sector start address plus previous sector size.
}
cudaMemcpy(outputSectorStarts_G, outputSectorStarts, THREADS * sizeof(size_t), cudaMemcpyHostToDevice);//Move data to GPU to be accepted by __global__ functions.
cudaMemcpy(inputSectorSizes_G, inputSizes, THREADS * sizeof(size_t), cudaMemcpyHostToDevice);//Move data to GPU to be accepted by __global__ functions.
ReduceVectorByMask_D_G_G<<<gridSize, THREADS>>>(vector_G, mask_G, inputSectorStarts_G, inputSectorSizes_G, outputSectorStarts_G, countersInput_G, countersOutput_G, output_G);//Run kernel function
cudaDeviceSynchronize();//Wait for result
cudaFree(countersInput_G);//Clear data from the GPU.
cudaFree(countersOutput_G);//Clear data from the GPU.
cudaFree(inputSectorSizes_G); //Clear data from the GPU.
cudaFree(inputSectorStarts_G);//Clear data from the GPU.
cudaFree(outputSectorStarts_G);//Clear data from the GPU.
free(inputSizes);//Clear data from the RAM.
free(outputSectorSizes);//Clear data from the RAM.
free(outputSectorStarts);//Clear data from the RAM.
free(inputSectorStarts);//Clear data from the RAM.
return output_G;//Return reduced vector in GPU.
}
/// <summary>
/// Generates testing data and tests the ReduceVectorByMask_D_G function.
/// </summary>
/// <returns></returns>
int main()
{
double vector[100] = {8,9,1,9,6,0,2,5,9,9,1,9,9,4,8,1,4,9,7,9,6,0,8,9,6,7,7,3,6,1,7,0,2,0,0,8,6,3,9,0,4,3,7,7,1,4,4,6,7,7,2,6,6,1,1,4,9,3,5,2,7,2,5,6,8,9,5,1,1,2,8,2,8,2,9,3,1,2,6,4,3,8,5,5,9,2,7,7,3,5,0,0,5,7,9,1,5,4,0,3 };//Create random array to be reduced.
int mask[100] = {0,1,0,0,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1 };//Create mask.
double* vector_G = NULL;//Data in GPU.
int* mask_G = NULL;//Mask in GPU.
double* reducedVector_G = NULL;//Variable to store reduced data vector.
double* reducedVector = (double*)malloc(100 * sizeof(double));//Variable to store reduced data vector in RAM.
size_t countOfValidElements = 0;//Number of non-zero elements in the mask counted by the function as by-product.
cudaMalloc((void**)&vector_G, 100 * sizeof(double));//Allocate memory for the data.
cudaMalloc((void**)&mask_G, 100 * sizeof(int));//Allocate memory for the mask.
cudaMemcpy(vector_G, vector, 100 * sizeof(double), cudaMemcpyHostToDevice);//Copy data to the GPU.
cudaMemcpy(mask_G, mask, 100 * sizeof(int), cudaMemcpyHostToDevice);//Copy mask to the GPU.
reducedVector_G = ReduceVectorByMask_D_G(vector_G, mask_G, 100, &countOfValidElements);//Reduce vector by mask in GPU and get overall number of elements in output.
cudaMemcpy(reducedVector, reducedVector_G, countOfValidElements * sizeof(double), cudaMemcpyDeviceToHost);//Copy result to the RAM.
for (size_t i = 0; i < countOfValidElements; i++)//Write results to the console.
{
printf("%d,", (int)reducedVector[i]);//Print element separated.
}
cudaFree(vector_G);//Clear data from memory.
cudaFree(mask_G);//Clear data from memory.
cudaFree(reducedVector_G);//Clear data from memory.
free(reducedVector);//Clear data from memory.
}
Is there any simple way to fix my code in the function ReduceVectorByMask_D_G_G, or is there any other method to get the array reduced by mask in CUDA?
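For reference, the duplicates linked above describe stream compaction. A minimal sketch of that approach using Thrust's copy_if with the mask as a stencil might look like the following (an untested sketch under assumptions, not the original code: vector_G, mask_G and output_G are taken to be device pointers, with output_G pre-allocated to at least vectorCount elements):
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
// Sketch only: compact vector_G into output_G wherever mask_G is non-zero.
double* outputEnd = thrust::copy_if(thrust::device,
    vector_G, vector_G + vectorCount,  // input values
    mask_G,                            // stencil (the mask)
    output_G,                          // compacted output
    thrust::identity<int>());          // keep elements where the mask is non-zero
size_t countOfValidElements = outputEnd - output_G; // number of surviving elements
copy_if does the prefix-sum bookkeeping internally, so no per-thread counters or atomics are needed.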

Related

Why do we need to add the "ranges" when calculating ha_innobase::read_time?

We see that MySQL adds the ranges when calculating ha_innobase::read_time (/storage/innobase/handler/ha_innobase.cc); my question is why it needs to do that.
double ha_innobase::read_time(
uint index, /*!< in: key number */
uint ranges, /*!< in: how many ranges */
ha_rows rows) /*!< in: estimated number of rows in the ranges */
{
ha_rows total_rows;
if (index != table->s->primary_key) {
/* Not clustered */
return(handler::read_time(index, ranges, rows));
}
if (rows <= 2) {
return((double) rows);
}
/* Assume that the read time is proportional to the scan time for all
rows + at most one seek per range. */
double time_for_scan = scan_time();
if ((total_rows = estimate_rows_upper_bound()) < rows) {
return(time_for_scan);
}
return(ranges + (double) rows / (double) total_rows * time_for_scan);
}
This comment tells you the answer:
/* Assume that the read time is proportional to the scan time for all
rows + at most one seek per range. */
There is probably a seek for each range, and each seek increases the read time. Seeks were more costly on spinning storage devices, and most sites now use solid-state storage, but seeks are still a little bit costly. So it's important to count them when calculating the total read time.
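For illustration with made-up numbers: if ranges = 3, rows = 300, total_rows = 10000 and scan_time() = 50, the estimate is 3 + (300 / 10000) * 50 = 4.5, i.e. one seek charged per range plus the proportional share of a full scan.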

OpenCL: Values of __local array are lost after a barrier call

I have a kernel storing some partial results in a local array before reducing
them into a single value (see the example below). Before the reduction process
starts, a barrier is placed to ensure all threads have successfully written their
partial data. However, the barrier resets the values of the temporary array to
default values (i.e. 0.0f for floats).
Minimal example:
__kernel void simulate_plate(__local float *partial)
{
__private int lpos;
lpos = get_local_id(0) + get_local_id(1) * get_local_size(1);
partial[lpos] = 1;
barrier(CLK_LOCAL_MEM_FENCE);
// At this point partial[i] == 0 for all i
// reduce data...
}
The argument partial has the following initializer:
clSetKernelArg(kernel, 0, local_group_size * sizeof(float), NULL);
The clSetKernelArg() call returns a status code CL_SUCCESS and the kernel
terminates without any errors.
Another observation is that swapping the lines partial[lpos] = 1 and barrier(CLK_LOCAL_MEM_FENCE) achieves the wanted result: all components of the array partial are now equal to 1.
Any input why this behaviour occurs would be much appreciated.
I think the index should be computed like this:
lpos = get_local_id(0) + get_local_id(1) * get_local_size(0);
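For illustration, take a hypothetical 4x2 work-group, i.e. get_local_size(0) = 4 and get_local_size(1) = 2. The original formula lpos = get_local_id(0) + get_local_id(1) * get_local_size(1) gives lpos = 3 for both work-items (3, 0) and (1, 1), so some work-items overwrite each other, while indices 6 and 7 are never written at all (which is why parts of partial still hold their initial contents after the barrier). With get_local_size(0) as the stride, every work-item maps to a unique index from 0 to 7.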

Why does this noexcept matters for performance so much while a similar other noexcept does not?

I have a piece of generic code whose performance is important to me, because I was challenged to match the running time of well-known hand-crafted code written in C. Before I began playing with noexcept, my code ran in 4.8 seconds. By putting noexcept in every place I could think of (I know that this is not a good idea, but I did it for the sake of learning), the code sped up to 3.3 seconds. Then I began to revert the changes until I got even better performance (3.1 seconds) and was left with a single noexcept!
The question is: why does this particular noexcept help so much? Here it is:
static const AllActions& allActions_() noexcept {
static const AllActions instance = computeAllActions();
return instance;
}
It is interesting that I have another similar function, which is just fine without noexcept (i.e. putting noexcept there does not improve performance):
static const AllMDDeltas& mdDeltas_() {
static const AllMDDeltas instance = computeAllMDDeltas();
return instance;
}
Both functions are called by my code (which performs recursive depth-first search) a lot of times, so that the second function is as important for overall performance as the first one.
P.S. Here is the surrounding code for more context (the quoted functions and the functions that they call are at the end of the listing):
/// The sliding-tile puzzle domain.
/// \tparam nRows Number of rows on the board.
/// \tparam nColumns Number of columns on the board.
template <int nRows, int nColumns>
struct SlidingTile : core::sb::DomainBase {
/// The type representing the cost of actions in the domain. Every
/// domain must provide this name.
using CostType = int;
using SNeighbor =
core::sb::StateNeighbor<SlidingTile>; ///< State neighbor type.
using ANeighbor =
core::sb::ActionNeighbor<SlidingTile>; ///< Action neighbor type.
/// The type for representing an action. The position of the tile being moved.
using Action = int;
/// Number of positions.
static constexpr int size_ = nRows * nColumns;
/// The type for the vector of actions for a given position of the blank.
using BlankActions = std::vector<ANeighbor>;
/// The type for all the actions in the domain.
using AllActions = std::array<BlankActions, size_>;
/// The type for two-dimension array of Manhattan distance heuristic deltas
/// for a given tile. The indexes are from and to of an action.
using TileMDDeltas = std::array<std::array<int, size_>, size_>;
/// The type for all Manhattan distance heuristic deltas.
using AllMDDeltas = std::array<TileMDDeltas, size_>;
/// The type for raw state representation.
using Board = std::array<int, size_>;
/// Initializes the ordered state.
SlidingTile() {
int i = -1;
for (auto &el : tiles_) el = ++i;
}
/// Initializes the state from a string, e.g. "[1, 4, 2, 3, 0, 5]" or "1 4 2
/// 4 0 5" for 3x2 board.
/// \param s The string.
SlidingTile(const std::string &s) {
int i = -1;
for (auto el : core::util::split(s, {' ', ',', '[', ']'})) {
tiles_[++i] = std::stoi(el);
if (tiles_[i] == 0) blank_ = i;
}
}
/// The default copy constructor.
SlidingTile(const SlidingTile &) = default;
/// The default assignment operator.
/// \return Reference to the assigned state.
SlidingTile &operator=(const SlidingTile &) = default;
/// Returns the array of tiles at each position.
/// \return The raw representation of the state, which is the array of tiles
/// at each position..
const Board &getTiles() const { return tiles_; }
/// Applies an action to the state.
/// \param a The action to be applied, i.e. the next position of the blank
/// on the board.
/// \return The state after the action.
SlidingTile &apply(Action a) {
tiles_[blank_] = tiles_[a];
blank_ = a;
return *this;
}
/// Returns the reverse of the given action in this state.
/// \param a The action whose reverse is to be returned.
/// \return The reverse of the given action.
Action reverseAction(Action a) const {
(void)a;
return blank_;
}
/// Computes the state neighbors of the state.
/// \return Vector of state neighbors of the state.
std::vector<SNeighbor> stateSuccessors() const {
std::vector<SNeighbor> res;
for (auto a : actionSuccessors()) {
auto n = SlidingTile{*this}.apply(a.action());
res.push_back(std::move(n));
}
return res;
}
/// Computes the action neighbors of the state.
/// \return Vector of action neighbors of the state.
const std::vector<ANeighbor> &actionSuccessors() const {
return allActions_()[blank_];
}
/// The change in the Manhattan distance heuristic to the goal state with
/// ordered tiles and the blank at position 0 due to applying the given action.
/// \param a The given action.
/// \return The change in the Manhattan distance heuristic to the goal state
/// with ordered pancake due to applying the given action.
int mdDelta(Action a) const {
return mdDeltas_()[tiles_[a]][a][blank_];
}
/// Computes the Manhattan distance heuristic to the goal state with
/// ordered tiles and the blank at position 0.
/// \return The Manhattan distance heuristic to the goal state with
/// ordered tiles and the blank at position 0.
int mdHeuristic() const {
int res = 0;
for (int pos = 0; pos < size_; ++pos)
if (pos != blank_)
res += rowDist(pos, tiles_[pos]) + colDist(pos, tiles_[pos]);
return res;
}
/// Computes the hash-code of the state.
/// \return The hash-code of the state.
std::size_t hash() const {
boost::hash<Board> v_hash;
return v_hash(tiles_);
}
/// Dumps the state to the given stream.
/// \tparam The stream type.
/// \param o The stream.
/// \return The modified stream.
template <class Stream> Stream &dump(Stream &o) const {
return o << tiles_;
}
/// Randomly shuffles the tiles.
void shuffle() {
auto old = tiles_;
while (old == tiles_)
std::random_shuffle(tiles_.begin(), tiles_.end());
}
/// The equality operator.
/// \param rhs The right-hand side of the operator.
/// \return \c true if the two states compare equal and \c false
/// otherwise.
bool operator==(const SlidingTile &rhs) const {
if (blank_ != rhs.blank_) return false;
for (int i = 0; i < size_; ++i)
if (i != blank_ && tiles_[i] != rhs.tiles_[i]) return false;
return true;
}
/// Returns a random state.
/// \return A random state.
static SlidingTile random() {
SlidingTile res{};
res.shuffle();
return res;
}
private:
/// Tile at each position. This does not include the position of the blank,
/// which is stored separately.
std::array<int, size_> tiles_;
/// Blank position.
int blank_{};
/// Computes the row number corresponding to the given position.
/// \return The row number corresponding to the given position.
static int row(int pos) { return pos / nColumns; }
/// The difference between the row numbers corresponding to the two given
/// positions.
/// \return The difference between the row numbers corresponding to the two
/// given positions.
static int rowDiff(int pos1, int pos2) { return row(pos1) - row(pos2); }
/// The distance between the row numbers corresponding to the two given
/// positions.
/// \return The distance between the row numbers corresponding to the two
/// given positions.
static int rowDist(int pos1, int pos2) {
return std::abs(rowDiff(pos1, pos2));
}
/// Computes the column number corresponding to the given position.
/// \return The column number corresponding to the given position.
static int col(int pos) { return pos % nColumns; }
/// The difference between the column numbers corresponding to the two given
/// positions.
/// \return The difference between the column numbers corresponding to the
/// two given positions.
static int colDiff(int pos1, int pos2) { return col(pos1) - col(pos2); }
/// The distance between the column numbers corresponding to the two given
/// positions.
/// \return The distance between the column numbers corresponding to the
/// two given positions.
static int colDist(int pos1, int pos2) {
return std::abs(colDiff(pos1, pos2));
}
/// Computes the actions available for each position of the blank.
static AllActions computeAllActions() {
AllActions res;
for (int blank = 0; blank < size_; ++blank) {
// the order is compatible with the code of Richard Korf.
if (blank > nColumns - 1)
res[blank].push_back(Action{blank - nColumns});
if (blank % nColumns > 0)
res[blank].push_back(Action{blank - 1});
if (blank % nColumns < nColumns - 1)
res[blank].push_back(Action{blank + 1});
if (blank < size_ - nColumns)
res[blank].push_back(Action{blank + nColumns});
}
return res;
}
/// Computes the heuristic updates for all the possible moves.
/// \return The heuristic updates for all the possible moves.
static AllMDDeltas computeAllMDDeltas() {
AllMDDeltas res;
for (int tile = 1; tile < size_; ++tile) {
for (int blank = 0; blank < size_; ++blank) {
for (const ANeighbor &a: allActions_()[blank]) {
int from = a.action(), to = blank;
res[tile][from][to] =
(rowDist(tile, to) - rowDist(tile, from)) +
(colDist(tile, to) - colDist(tile, from));
}
}
}
return res;
}
/// Returns all the actions.
/// \note See http://stackoverflow.com/a/42208278/2725810
static const AllActions& allActions_() noexcept {
static const AllActions instance = computeAllActions();
return instance;
}
/// Returns all the updates of the MD heuristic.
static const AllMDDeltas& mdDeltas_() {
static const AllMDDeltas instance = computeAllMDDeltas();
return instance;
}
};
Unlike computeAllMDDeltas, the function computeAllActions contains some calls to push_back, which may perform some memory allocations. This may throw if there is an exception from the underlying allocator, for example if you run out of memory. That is something which the compiler cannot optimize away, since it depends on runtime parameters.
Adding noexcept tells the compiler that these errors cannot occur, which allows it to omit the code for exception handling.
Looking at this discussion on exception overhead: Are Exceptions in C++ really slow
You can easily see why your code is faster with noexcept: the compiler does not need to create the list of handlers for each call to push_back that you make. Your function computeAllActions contains the majority of your calls to a potentially throwing function, which is why it gains the most from the optimization.
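As a small illustration (a standalone sketch with hypothetical names, not taken from the question's code), the noexcept operator shows what the compiler may assume at each call site of such a static-local accessor:
#include <vector>
// Hypothetical accessors mirroring the pattern above: a function-local static
// initialized by a computation that may allocate and therefore throw.
static const std::vector<int>& cached() noexcept {
    static const std::vector<int> instance{1, 2, 3}; // may allocate on first call
    return instance;
}
static const std::vector<int>& cachedMaybeThrowing() {
    static const std::vector<int> instance{1, 2, 3};
    return instance;
}
int main() {
    static_assert(noexcept(cached()), "call treated as non-throwing, no unwind code needed");
    static_assert(!noexcept(cachedMaybeThrowing()), "call may propagate an exception");
    return static_cast<int>(cached().size());
}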

VlFeat kdtree setup and query

I've managed to get VlFeat's SIFT implmentation working and I'd like to try matching two sets of image descriptors.
SIFT's feature vectors are 128 element float arrays, I've stored the descriptor lists in std::vectors as shown in the snippet below:
std::vector<std::vector<float> > ldescriptors = leftImage->descriptors;
std::vector<std::vector<float> > rdescriptors = rightImage->descriptors;
/* KDTree, L1 comparison metric, dimension 128, 1 tree, L1 metric */
VlKDForest* forest = vl_kdforest_new(VL_TYPE_FLOAT, 128, 1, VlDistanceL1);
/* Build the tree from the left descriptors */
vl_kdforest_build(forest, ldescriptors.size(), ldescriptors.data());
/* Searcher object */
VlKDForestSearcher* searcher = vl_kdforest_new_searcher(forest);
VlKDForestNeighbor neighbours[2];
/* Query the first ten points for now */
for(int i=0; i < 10; i++){
int nvisited = vl_kdforestsearcher_query(searcher, &neighbours, 2, rdescriptors[i].data());
cout << nvisited << neighbours[0].distance << neighbours[1].distance;
}
As far as I can tell that should work, but all I get out for the distances are NaNs. The lengths of the descriptor arrays check out, so there does seem to be data going into the tree. I've plotted the keypoints and they also look reasonable, so the data is fairly sane.
What am I missing?
Rather sparse documentation here (links to the API): http://www.vlfeat.org/api/kdtree.html
What am I missing?
The 2nd argument of vl_kdforestsearcher_query takes a pointer to VlKDForestNeighbor:
vl_size
vl_kdforestsearcher_query(
VlKDForestSearcher *self,
VlKDForestNeighbor *neighbors,
vl_size numNeighbors,
void const *query
);
But here you declared VlKDForestNeighbor neighbours[2]; and then passed &neighbours as the 2nd parameter, which is not correct - your compiler probably issued an incompatible pointer types warning.
Since you declared an array, what you must do instead is either explicitly pass a pointer to the 1st neighbor:
int nvisited = vl_kdforestsearcher_query(searcher, &neighbours[0], 2, qrys[i]);
Or alternatively let the compiler do it for you:
int nvisited = vl_kdforestsearcher_query(searcher, neighbours, 2, qrys[i]);
EDIT
There is indeed a second (major) problem related to the way you build the kd-tree with ldescriptors.data().
Here you pass a std::vector<float>* pointer, while VLFeat expects a contiguous float * array containing all your data points in row-major order. So what you can do is copy your data into that format:
float *data = new float[128*ldescriptors.size()];
for (unsigned int i = 0; i < ldescriptors.size(); i++)
std::copy(ldescriptors[i].begin(), ldescriptors[i].end(), data + 128*i);
vl_kdforest_build(forest, ldescriptors.size(), data);
// ...
// then, right after `vl_kdforest_delete(forest);`
// do a `delete[] data;`
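Alternatively (a sketch of the same idea, not from the original answer), a single contiguous std::vector<float> avoids the manual delete[] while still giving VLFeat the row-major layout it expects:
// requires <vector> and <algorithm>
std::vector<float> data(128 * ldescriptors.size());
for (unsigned int i = 0; i < ldescriptors.size(); i++)
    std::copy(ldescriptors[i].begin(), ldescriptors[i].end(), data.data() + 128 * i);
vl_kdforest_build(forest, ldescriptors.size(), data.data());
// data must stay alive for as long as the forest is queried; it is freed
// automatically when it goes out of scope.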

OpenCV Error: insufficient memory, in function call

I have a function looks like this:
void foo(){
Mat mat(50000, 200, CV_32FC1);
/* some manipulation using mat */
}
Then after several loops (in each loop, I call foo() once), it gives an error:
OpenCV Error: insufficient memory when allocating (about 1G) memory.
In my understanding, the Mat is local, and once foo() returns it is automatically deallocated, so I am wondering why it leaks.
And it leaks on some data, but not on all of it.
Here is my actual code:
bool VidBOW::readFeatPoints(int sidx, int eidx, cv::Mat &keys, cv::Mat &descs, cv::Mat &codes, int &barrier) {
// initialize buffers for keys and descriptors
int num = 50000; /// a large number
int nDims = 0; /// feature dimensions
if (featName == "STIP")
nDims = 162;
Mat descsBuff(num, nDims, CV_32FC1);
Mat keysBuff(num, 3, CV_32FC1);
Mat codesBuff(num, 3000, CV_64FC1);
// move overlapping codes from a previous window to buffer
int idxPre = -1;
int numPre = keys.rows;
int numMov = 0; /// number of overlapping points to move
for (int i = 0; i < numPre; ++i) {
if (keys.at<float>(i, 0) >= sidx) {
idxPre = i;
break;
}
}
if (idxPre > 0) {
numMov = numPre - idxPre;
keys.rowRange(idxPre, numPre).copyTo(keysBuff.rowRange(0, numMov));
codes.rowRange(idxPre, numPre).copyTo(codesBuff.rowRange(0, numMov));
}
// the starting row in code matrix where new codes from the updated features to add in
barrier = numMov;
// read keys and descriptors from feature file
int count = 0; /// number of new points that are read in buffers
if (featName == "STIP")
count = readSTIPFeatPoints(numMov, eidx, keysBuff, descsBuff);
// update keys, descriptors and codes matrix
descsBuff.rowRange(0, count).copyTo(descs);
keysBuff.rowRange(0, numMov+count).copyTo(keys);
codesBuff.rowRange(0, numMov+count).copyTo(codes);
// see if reaching the end of a feature file
bool flag = false;
if (feof(fpfeat))
flag = true;
return flag;
}
You don't post the code that calls your function, so I can't tell whether this is a true memory leak. The Mat objects that you allocate inside readFeatPoints() will be deallocated correctly, so there are no memory leaks that I can see.
You declare Mat codesBuff(num, 3000, CV_64FC1);. With num = 50000, this means you're trying to allocate about 1.2 gigabytes of memory in one big block. You also copy some of this data to codes with the line:
codesBuff.rowRange(0, numMov+count).copyTo(codes);
If the value of numMov + count changes between iterations, this will cause reallocation of the data buffer in codes. If the value is large enough, you may also be eating up a significant amount of memory that persists across iterations of your loop. Both of these things may lead to heap fragmentation. If at any point there is no contiguous 1.2 GB chunk of memory available, an insufficient-memory error occurs, which is what you have experienced.
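One way to avoid both the repeated huge allocation and the reallocation of codes is to create the big buffers once and reuse them across loop iterations. A minimal sketch of that idea (hypothetical wrapper code, not the poster's actual function):
#include <opencv2/core.hpp>
void processAll(int iterations) {
    const int num = 50000;
    cv::Mat codesBuff(num, 3000, CV_64FC1);     // allocated once, reused for every iteration
    for (int i = 0; i < iterations; ++i) {
        codesBuff.setTo(cv::Scalar::all(0));    // reset the contents instead of re-allocating
        // ... fill and use codesBuff for this iteration ...
    }
}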