Matrix multiplication in SYCL using nd_range - c++

I am doing matrix multiplication in sycl, but having some problems. I am using 2 (4x4) Matrices for multiplication and on first iteration of for loop it works but on second iteration when i = 1 it works fine until C[11] = A[11]*B[15] but then it skips 1 multiplication and move forward. I know the problem why it skips but unfortunately i have been unable to change matrix B index properly. Kindly if someone can help i will greatly appreciate it. Thanks
Here is the code
Matsize= 4,
Blocksize = 4 also i know for loop will be equal to matsize it is 2 just to get clear idea of execution flow
{
range<1> dimensions(matSize * matSize);
const property_list props = { property::buffer::use_host_ptr() };
buffer<T> A_buf(MA, dimensions, props);
buffer<T> B_buf(MB, dimensions, props);
buffer<T> C_buf(MC, dimensions, props);
myQueue.submit([&](handler& cgh) {
auto A_ptr = A_buf.template get_access<access::mode::read>(cgh);
auto B_ptr = B_buf.template get_access<access::mode::read_write>(cgh);
auto C_ptr = C_buf.template get_access<access::mode::write>(cgh);
auto localRange = range<1>(blockSize* blockSize);
accessor<T, 1, access::mode::read_write, access::target::local>
C(matSize * matSize, cgh);
cgh.parallel_for<mxm_kernel>(
nd_range<2>(range<2>{matSize, matSize},
range<2>{blockSize, blockSize}),
[=](nd_item<2> item) {
const auto id_x = item.get_global_id(0);
const auto id_y = item.get_global_id(1);
const auto width = item.get_group_range(0) * item.get_local_range(0);
const auto index = id_x * width + id_y;
const auto index2 = id_y * width + id_x;
for (int i = 0; i < 2 ; i++) {
C[index] += A_ptr[index] * B_ptr[index2 + i ];
}
out << "C is!" << C[index] << sycl::endl;
item.barrier(cl::sycl::access::fence_space::local_space);
C_ptr[index] = C[index];
});
});
}

Related

What should you store or append a batch of tensors to in C++ when using LibTorch?

In C++, when using LibTorch (The C++ version of PyTorch), what should you store a batch of tensors in? I'm running into the problem of not being able to reset the batch on the next step because C++ doesn't allow storing a new variable over an existing variable.
In my attempt my batch of tensors is one single 385x385 tensor. The batch size is 385. In a for loop I use torch::cat to concatenate 385 smaller 1D tensors, which are 385 numbers long. (Maybe 'stack' or 'append' are better terms for what I'm doing since the are stacked together picket fence style more than 'concatenated', but that's what I'm using.) Anyways, there is not problem with this shape. It seems to work fine for one forward and backward pass but then the tensor becomes 770x385 on the next pass instead of a 385x385 tensor of the next 385, 385 long arrays. I hope I am painting a picture and not being too verbose.
The code.
Near the bottom I have the line all_step_obs = torch::tensor({}); to try to wipe out the contents of the tensor, AKA, the batch, but this gives me a Segmentation fault (core dumped). I guess for trying to access the tensor outside of the loop(?)
If I don't have this line I get a 770x385 tensor after the next step.
The model
#include "mujoco/mujoco.h"
struct Net : torch::nn::Module {
torch::Tensor action_high, action_low;
public:
Net(torch::Tensor action_high, torch::Tensor action_low) : action_high(action_high), action_low(action_low){
// Construct and register two Linear submodules.
fc1 = torch::nn::Linear(385, 385);
fc2 = torch::nn::Linear(385, 385);
fc3 = torch::nn::Linear(385, 42);
// cholesky_layer = torch::nn::Linear(385, (42 * (42 + 1)) / 2);
cholesky_layer = torch::nn::Linear(385, 385);
}
// Implement the Net's algorithm.
torch::Tensor forward(torch::Tensor x) {
// Use one of many tensor manipulation functions.
x = torch::relu(fc1->forward(x));
x = torch::dropout(x, /*p=*/0.2, /*train=*/is_training());
x = torch::relu(fc2->forward(x));
auto mean_layer = fc3->forward(x);
auto mean = action_low + (action_high - action_low) * mean_layer;
auto chol_l = cholesky_layer->forward(x);
// auto chol = torch::rand({385, 385});
auto chol = torch::matmul(chol_l, chol_l.transpose(0, 1));
chol = torch::nan_to_num(chol, 0, 2.0);
chol = chol.add(torch::eye(385));
auto cholesky = torch::linalg::cholesky(chol);
// return torch::cat({mean, cholesky}, 0);
return mean_layer;
}
// Use one of many "standard library" modules.
torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr}, cholesky_layer{nullptr};
};
The training
auto high = torch::ones({385, 42}) * 0.4;
auto low = torch::ones({385, 42}) * -0.4;
auto actor = Net(low, high);
int max_steps = 385;
int steps = 2000;
auto l1_loss = torch::smooth_l1_loss;
auto optimizer = torch::optim::Adam(actor.parameters(), 3e-4);
torch::Tensor train() {
torch::Tensor all_step_obs;
for (int i = 0; i<steps; ++i)
{
for (int i = 0; i<max_steps; ++i)
{
all_step_obs = torch::cat({torch::rand({385}).unsqueeze(0), all_step_obs});
}
auto mean = actor.forward(all_step_obs);
auto loss = l1_loss(mean, torch::rand({385, 42}), 1, 0);
optimizer.zero_grad();
loss.backward();
optimizer.step();
all_step_obs = torch::tensor({});
if (steps == 1999) {
return loss;
}
}
};
int main (int argc, const char** argv) {
std::cout << train();
}

Replace pointer arithmetic with std::span

I have the following code that uses pointer arithmetic and would like to replace it using std::span (or, I suppose, gsl::span). The code iterates over a number of pixels, each represented by 4 contiguous bytes, and updates their blue and green colours.
auto* row = (uint8_t*)buffer->data;
for (auto y = 0; y < buffer->height; ++y) {
auto* pixel = (uint32_t*)row;
for (auto x = 0; x < buffer->width; ++x) {
auto blue = x + blueOffset;
auto green = y + greenOffset;
*pixel++ = ((green << 8) | blue);
}
row += buffer->pitch;
}
buffer->data is a void* returned from a call to Windows VirtualAlloc(...) function.
How can this code be written to use safe, modern C++, such as std::span, as suggested by the C++ Core Guidelines and Moderns C++?
This compiles with each of C++20, gsl and gsl-lite. As you didn't provide a reproducible example I didn't test or benchmark this solution.
auto const my_span = span{
reinterpret_cast<uint8_t *>(buffer->data),
buffer->height * buffer->pitch};
constexpr auto size_ratio = sizeof(uint32_t) / sizeof(uint8_t); // 4
auto const uint8_width = size_ratio * buffer->width;
auto row_offset = 0UL;
auto row = my_span.subspan(row_offset, buffer->width);
for (auto y = 0; y < buffer->height; ++y) {
auto const pixels = span{
reinterpret_cast<uint32_t *>(row.data()),
buffer->width};
auto pixel = pixels.begin();
for (auto x = 0; x < buffer->width; ++x) {
auto const blue = x + blueOffset;
auto const green = y + greenOffset;
*pixel++ = ((green << 8) | blue);
}
row_offset += buffer->pitch;
row = my_span.subspan(row_offset, uint8_width);
}
Obviously using spans here doesn't necessarily make the code easier to read due to interpreting the same memory both as uint8_t and uint32_t. The spans should still give more safety.
The code could be made easier to read by providing more members or member functions in the struct which is pointed at by buffer. E.g. member functions could provide you with the needed subspans or at least with uint8_width. As this wasn't asked, I didn't touch that struct here. It might be given by a library, but one could still write a wrapper in that case.

Keras custom layer with CNTK backend (CRF as RNN)

I am attempting to duplicate the CRF as RNN which has been implemented in Keras but uses TensorFlow as a backend (https://github.com/sadeepj/crfasrnn_keras). The Keras front-end is fine, but some of the backend code is written as a TensorFlow custom op. I am trying to duplicate this in CNTK, but I have a few questions.
import cntk as C
from cntk import ops
import copy
import numpy as np
C.device.try_set_default_device(C.device.cpu())
ops.register_native_user_function('HighDimFilterOp', 'Cntk.HighDimFilter-' + C.__version__.rstrip('+'), 'CreateHighDimFilter')
def high_dim_filter(image=None, rgb=None, **kwargs):
inputs = [list(image), list(rgb)];
layer_config = copy.deepcopy(kwargs)
ops.native_user_function('HighDimFilterOp', inputs, layer_config, 'high_dim_filter')
This code is the Python call to my user C++ function. The C++ interface is as follows:
#include "HighDimFilter.h"
using namespace CNTK;
extern "C"
#ifdef _WIN32
__declspec(dllexport)
#endif
Function* CreateHighDimFilter(const Variable* operands, size_t /*numOperands*/, const Dictionary* attributes, const wchar_t* name)
{
printf("Creating HighDimFilter\n");
return new HighDimFilter({operands[0], operands[1]}, *attributes, name);
}
and the custom function itself is defined as:
#pragma once
#include "CNTKLibrary.h"
#include "modified_permutohedral.h"
using namespace CNTK;
class HighDimFilter final : public Function
{
bool _bilateral;
float _theta_alpha;
float _theta_beta;
float _theta_gamma;
enum Input : uint32_t
{
SCORES,
IM_INFO
};
public:
HighDimFilter(const std::vector<Variable>& inputs, const Dictionary& attributes, const std::wstring& name = L"HighDimFilter")
: Function(inputs, attributes, name)
{
if (attributes.Contains(L"bilateral"))
_bilateral = attributes[L"bilateral"].Value<bool>();
if (_bilateral == false)
{
if (attributes.Contains(L"theta_gamma"))
_theta_gamma = static_cast<float>(attributes[L"theta_gamma"].Value<double>());
}
else
{
if (attributes.Contains(L"theta_alpha"))
_theta_alpha = static_cast<float>(attributes[L"theta_alpha"].Value<double>());
if (attributes.Contains(L"theta_beta"))
_theta_beta = static_cast<float>(attributes[L"theta_beta"].Value<double>());
}
}
private:
void _compute_spatial_kernel(NDArrayViewPtr& Tensor, const float theta_gamma)
{
auto output_kernel = Tensor->WritableDataBuffer<float>();
auto outputShape = Tensor->Shape();
//auto channels = outputShape[0];
auto height = outputShape[1];
auto width = outputShape[2];
const auto num_pixels = width * height;
for (int p = 0; p < num_pixels; ++p)
{
output_kernel[2 * p] = static_cast<float>(p % width) / theta_gamma;
output_kernel[2 * p + 1] = static_cast<float>(p / width) / theta_gamma;
}
}
void _compute_bilateral_kernel(NDArrayViewPtr& Tensor, const NDArrayViewPtr& Image,
const float theta_alpha, const float theta_beta)
{
auto output_kernel = Tensor->WritableDataBuffer<float>();
auto rgb = Image->DataBuffer<float>();
auto outputShape = Tensor->Shape();
//auto channels = outputShape[0];
auto height = outputShape[1];
auto width = outputShape[2];
const auto num_pixels = height * width;
for (int p = 0; p < num_pixels; ++p)
{
// Spatial terms
output_kernel[5 * p] = static_cast<float>(p % width) / theta_alpha;
output_kernel[5 * p + 1] = static_cast<float>(p / width) / theta_alpha;
// Color terms
output_kernel[5 * p + 2] = static_cast<float>(rgb[p] / theta_beta);
output_kernel[5 * p + 3] = static_cast<float>(rgb[num_pixels + p] / theta_beta);
output_kernel[5 * p + 4] = static_cast<float>(rgb[2 * num_pixels + p] / theta_beta);
}
}
BackPropStatePtr Forward(const std::vector<ValuePtr>& inputValues,
std::unordered_map<Variable, ValuePtr>& outputs,
const DeviceDescriptor& computeDevice,
const std::unordered_set<Variable>& /*outputsToRetainBackwardStateFor */) override
{
#if 0
auto scoresShape = inputValues[Input::SCORES]->Shape();
auto channels = scoresShape[0];
auto height = scoresShape[1];
auto width = scoresShape[2];
const auto num_pixels = width * height;
auto &outputValue = outputs[this->Output()];
if (outputValue == nullptr)
{
outputValue = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, scoresShape, computeDevice));
}
if (computeDevice.Type() != DeviceKind::CPU)
throw std::runtime_error("HighDimFilter: only CPU evaluation is supported at the moment.");
ModifiedPermutohedral mp;
if (_bilateral)
{
auto &kernel_vals = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, NDShape({5, height, width}), computeDevice));
//float* kernel_vals = new float[5 * num_pixels];
_compute_bilateral_kernel(kernel_vals->Data(), inputValues[Input::IM_INFO]->Data(),
_theta_alpha, _theta_beta);
mp.init(kernel_vals->Data(), 5, num_pixels);
mp.compute(outputValue->Data(), inputValues[Input::SCORES]->Data(), false);
}
else
{
auto &kernel_vals = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, NDShape({2, height, width}), computeDevice));
_compute_spatial_kernel(kernel_vals->Data(), _theta_gamma);
mp.init(kernel_vals->Data(), 2, num_pixels);
mp.compute(outputValue->Data(), inputValues[Input::SCORES]->Data(), channels, false);
}
return MakeSharedObject<BackPropState>(this->shared_from_this(), computeDevice, std::unordered_map<Variable, ValuePtr>({ {Inputs()[Input::IM_INFO], inputValues[Input::IM_INFO]} }));
#else
return nullptr;
#endif
}
void Backward(const BackPropStatePtr& state,
const std::unordered_map<Variable, ValuePtr>& rootGradientValues,
std::unordered_map<Variable, ValuePtr>& backPropagatedGradientValuesForInputs) override
{
#if 0
auto gradOutputVariable = Inputs()[Input::SCORES];
auto inputVariable = Inputs()[Input::IM_INFO];
auto &gradValue = backPropagatedGradientValuesForInputs[gradOutputVariable];
if (gradValue == nullptr)
gradValue = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, gradOutputVariable.Shape(), state->Device()));
auto imageData = state->SavedForwardPropValues().at(inputVariable)->Data();
auto imageShape = imageData->Shape();
auto channels = imageShape[0];
auto height = imageShape[1];
auto width = imageShape[2];
const auto num_pixels = width * height;
if (state->Device().Type() != DeviceKind::CPU)
throw std::runtime_error("HighDimFilter: only CPU evaluation is supported at the moment.");
auto rootGradientData = rootGradientValues.at(this->Output())->Data();
ModifiedPermutohedral mp;
if (_bilateral)
{
auto &kernel_vals = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, NDShape({5, height, width}), state->Device()));
//float* kernel_vals = new float[5 * num_pixels];
_compute_bilateral_kernel(kernel_vals->Data(), imageData,
_theta_alpha, _theta_beta);
mp.init(kernel_vals->Data(), 5, num_pixels);
mp.compute(gradValue->Data(), rootGradientData, true);
}
else
{
auto &kernel_vals = MakeSharedObject<Value>(MakeSharedObject<NDArrayView>(DataType::Float, NDShape({2, height, width}), state->Device()));
_compute_spatial_kernel(kernel_vals->Data(), _theta_gamma);
mp.init(kernel_vals->Data(), 2, num_pixels);
mp.compute(gradValue->Data(), rootGradientData, channels, true);
}
#endif
return;
}
const std::wstring& OpName() const override
{
static const std::wstring opName = L"HighDimFilterOp";
return opName;
}
size_t CurrentVersion() const override
{
NOT_IMPLEMENTED;
}
void InferOutputs(std::vector<Variable>& /*outputs */) override
{
NOT_IMPLEMENTED;
}
FunctionPtr Clone(const std::vector<Variable>& /*clonedInputs */) override
{
return nullptr;
}
};
My python call looks like:
bilateral_high_dim_filter = custom_module.high_dim_filter(image=all_ones_flat,
rgb=rgb_flat,
bilateral=True,
theta_alpha=self.theta_alpha,
theta_beta=self.theta_beta)
high_dim_filter = custom_module.high_dim_filter(image=all_ones_flat,
rgb=rgb_flat,
bilateral=False,
theta_gamma=self.theta_gamma)
The questions are as follows: 1) What are the "operands" passed in to the native_user_function on initialization? Are these only passed on initialization (are they intended to be weight and bias initialization)? How are the input operands used in the "Function" construction initializer? If I set these to "None" in Python, the code crashes.
2) How do you forward propagate the filter? Just call "forward()"? What about the required arguments to forward propagate?
3) Is there a numerical gradient calculation in CNTK similar to TensorFlow to check the gradient?

Several arithmetic operations parallelized in C++Amp

I am trying to parallelize a convolution filter using C++Amp. I would like the following function to start working (I don't know how to do it properly):
float* pixel_color[] = new float [16];
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
concurrency::array_view<float, 1> pixel(16, pixel_color); // I don't know which data structure to use here
parallel_for_each(
pixels.extent, [=](concurrency::index<2> idx) restrict(amp)
{
int row=idx[0];
int col=idx[1];
pixels(row, col) = taps(row, col) * pixels(row, col);
pixel[0] += pixels(row, col);
});
pixel_color.synchronize();
pixels_.at<Pixel>(j, i) = pixel_color
}
The main problem is that I don't know how to use the pixel structure properly (which concurrent data structure to use here as I don't need all 16 elements). And I don't know if I can safely add the values this way.
The following code doesn't work, it does not add appropriate values to pixel[0].
I also would like to define
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
outside the method (for example in the header file) and initialize it in the costructor or other function (as this is a bottle-neck and takes a lot of time copying the data between CPU and GPU). Does anybody know how to do this?
You're no the right track but doing in place manipulations of arrays on a GPU is tricky as you cannot guarantee the order in which different elements are updated.
Here's an example of something very similar. The ApplyColorSimplifierTiledHelper method contains an AMP restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled calculates a new value for each pixel in destFrame based on the value of the pixels surrounding the corresponding pixel in srcFrame. This solves the race condition issue present in your code.
This code comes from the Codeplex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using; arrays, textures, tiled/untiled and multi-GPU. The C++ AMP book discusses the implementation in some detail.
void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
const float_3 W(ImageUtils::W);
assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);
tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>
computeDomain = GetTiledExtent(srcFrame.extent);
parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
(tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx)
restrict(amp)
{
SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
});
}
void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame, array<ArgbPackedPixel,
2>& destFrame, index<2> idx,
UINT neighborWindow, const float_3& W) restrict(amp)
{
const int shift = neighborWindow / 2;
float sum = 0;
float_3 partialSum;
const float standardDeviation = 0.025f;
const float k = -0.5f / (standardDeviation * standardDeviation);
const int idxY = idx[0] + shift; // Corrected index for border offset.
const int idxX = idx[1] + shift;
const int y_start = idxY - shift;
const int y_end = idxY + shift;
const int x_start = idxX - shift;
const int x_end = idxX + shift;
RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));
for (int y = y_start; y <= y_end; ++y)
for (int x = x_start; x <= x_end; ++x)
{
if (x != idxX || y != idxY) // don't apply filter to the requested index, only to the neighbors
{
RgbPixel clr = UnpackPixel(srcFrame(y, x));
float distance = ImageUtils::GetDistance(orgClr, clr, W);
float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
sum += value;
partialSum.r += clr.r * value;
partialSum.g += clr.g * value;
partialSum.b += clr.b * value;
}
}
RgbPixel newClr;
newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
destFrame(idxY, idxX) = PackPixel(newClr);
}
The code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long as C++ AMP does not support char. If your problem is small enough to fit into a texture then you may want to look at using this instead of an array as the pack/unpack is implemented in hardware on the GPU so is effectively "free", here you have to pay for it with additional compute. There is also an example of this implementation on CodePlex.
typedef unsigned long ArgbPackedPixel;
struct RgbPixel
{
unsigned int r;
unsigned int g;
unsigned int b;
};
const int fixedAlpha = 0xFF;
inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp)
{
return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}
inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp)
{
RgbPixel rgb;
rgb.b = packedArgb & 0xFF;
rgb.g = (packedArgb & 0xFF00) >> 8;
rgb.r = (packedArgb & 0xFF0000) >> 16;
return rgb;
}

Skin Detection with Gaussian Mixture Models

I'm doing skin detection algorithm according to this article. There are two models at page 21: Mixture of Gaussian Skin and Non-skin Color Model.
The first model for skin detection works exellent.
There are examples:
1)Orginal image:
2) Skin mask
But the non-skin model gives wrong results:
Here is my code:
ipl_image_wrapper NudityDetector::filterPixelsWithGMM(const float covarinceMatrix[][3], const float meanMatrix[][3], const float weightVector[], const float probValue) const
{
ipl_image_wrapper mask = cvCreateImage(cvGetSize(m_image.get()), IPL_DEPTH_8U, 1);
double probability = 0.0;
float x[3] = { 0, 0, 0};
for(int i = 0; i < m_image.get()->height; ++i)
{
for(int j = 0; j < m_image.get()->width; ++j)
{
if (m_image.get()->nChannels == 3)
{
x[0] = (reinterpret_cast<uchar*>(m_image.get()->imageData + i * m_image.get()->widthStep))[j * 3 + 2];
x[1] = (reinterpret_cast<uchar*>(m_image.get()->imageData + i * m_image.get()->widthStep))[j * 3 + 1];
x[2] = (reinterpret_cast<uchar*>(m_image.get()->imageData + i * m_image.get()->widthStep))[j * 3];
double cov_det = 0.0;
double power = 0.0;
double A1 = 0.0;
double A2 = 0.0;
double A3 = 0.0;
probability = 0;
for (int k = 0; k < 16; ++k)
{
cov_det = covarinceMatrix[k][0] * covarinceMatrix[k][1] * covarinceMatrix[k][2];
A1 = covarinceMatrix[k][1] * covarinceMatrix[k][2];
A2 = covarinceMatrix[k][0] * covarinceMatrix[k][2];
A3 = covarinceMatrix[k][0] * covarinceMatrix[k][1];
power =(std::pow((x[0] - meanMatrix[k][0]), 2) * A1 +
std::pow((x[1] - meanMatrix[k][1]), 2) * A2 +
std::pow((x[2] - meanMatrix[k][2]), 2) * A3 ) / (2 * cov_det);
probability += 100 * weightVector[k] *std::exp(-power) / (std::pow(2 * M_PI, 3/2) * std::pow(cov_det, 1/2));
}
if ( probability < probValue)
{
(reinterpret_cast<uchar*>(mask.get()->imageData + i * mask.get()->widthStep))[j] = 0;
}
else
{
(reinterpret_cast<uchar*>(mask.get()->imageData + i * mask.get()->widthStep))[j] = 255;
}
}
}
}
cvDilate(mask.get(), mask.get(), NULL, 2);
cvErode(mask.get(), mask.get(), NULL, 1);
return mask;
}
ipl_image_wrapper NudityDetector::detectSkinWithGMM(const float probValue) const
{
//matrices are from article
ipl_image_wrapper mask = filterPixelsWithGMM(COVARIANCE_SKIN_MATRIX, MEAN_SKIN_MATRIX, SKIN_WEIGHT_VECTOR, probValue);
return mask;
}
ipl_image_wrapper NudityDetector::detectNonSkinWithGMM(const float probValue) const
{
//matrices are from article
ipl_image_wrapper mask = filterPixelsWithGMM(COVARIANCE_NON_SKIN_MATRIX, MEAN_NON_SKIN_MATRIX, NON_SKIN_WEIGHT_VECTOR, probValue);
return mask;
}
What I'm doing wrong? Maybe I misunderstand the meaning of tre article? Or I translated formula wrong in the code?
Thank you in advance!
In fact, there seems to be nothing wrong with the results, non-skin model correctly identifies non-skin regions as 255 and skin regions as 0. You may just need to tune parameter probValue to a lower value to get rid of some false negatives (small non-skin regions)
GMM may not be an effective approach for skin detection and you may employ some edge intensity information as a regularization parameter so that detected regions will not be fragmented.