Several arithmetic operations parallelized in C++Amp

I am trying to parallelize a convolution filter using C++ AMP. I would like to get the following function working, but I don't know how to do it properly:
float* pixel_color[] = new float [16];
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
concurrency::array_view<float, 1> pixel(16, pixel_color); // I don't know which data structure to use here
parallel_for_each(
pixels.extent, [=](concurrency::index<2> idx) restrict(amp)
{
int row=idx[0];
int col=idx[1];
pixels(row, col) = taps(row, col) * pixels(row, col);
pixel[0] += pixels(row, col);
});
pixel_color.synchronize();
pixels_.at<Pixel>(j, i) = pixel_color
}
The main problem is that I don't know how to use the pixel structure properly (which concurrent data structure to use here, as I don't need all 16 elements). I also don't know whether I can safely add the values this way.
The code above doesn't work: it does not add the appropriate values to pixel[0].
I would also like to define
concurrency::array_view<float, 2> pixels(4, 4, pixel_array), taps(4, 4, myTap4Kernel_array);
outside the method (for example in the header file) and initialize it in the constructor or another function, because copying the data between CPU and GPU is a bottleneck and takes a lot of time. Does anybody know how to do this?

You're on the right track, but doing in-place manipulations of arrays on a GPU is tricky because you cannot guarantee the order in which different elements are updated.
Here's an example of something very similar. The ApplyColorSimplifierTiledHelper method contains an AMP restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. SimplifyIndexTiled calculates a new value for each pixel in destFrame based on the value of the pixels surrounding the corresponding pixel in srcFrame. This solves the race condition issue present in your code.
This code comes from the CodePlex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using arrays, textures, tiled/untiled kernels and multiple GPUs. The C++ AMP book discusses the implementation in some detail.
void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
const float_3 W(ImageUtils::W);
assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);
tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>
computeDomain = GetTiledExtent(srcFrame.extent);
parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
(tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx)
restrict(amp)
{
SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
});
}
void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame, array<ArgbPackedPixel,
2>& destFrame, index<2> idx,
UINT neighborWindow, const float_3& W) restrict(amp)
{
const int shift = neighborWindow / 2;
float sum = 0;
float_3 partialSum;
const float standardDeviation = 0.025f;
const float k = -0.5f / (standardDeviation * standardDeviation);
const int idxY = idx[0] + shift; // Corrected index for border offset.
const int idxX = idx[1] + shift;
const int y_start = idxY - shift;
const int y_end = idxY + shift;
const int x_start = idxX - shift;
const int x_end = idxX + shift;
RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));
for (int y = y_start; y <= y_end; ++y)
for (int x = x_start; x <= x_end; ++x)
{
if (x != idxX || y != idxY) // don't apply filter to the requested index, only to the neighbors
{
RgbPixel clr = UnpackPixel(srcFrame(y, x));
float distance = ImageUtils::GetDistance(orgClr, clr, W);
float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
sum += value;
partialSum.r += clr.r * value;
partialSum.g += clr.g * value;
partialSum.b += clr.b * value;
}
}
RgbPixel newClr;
newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
destFrame(idxY, idxX) = PackPixel(newClr);
}
The code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, as C++ AMP does not support char. If your problem is small enough to fit into a texture, you may want to use one instead of an array: the pack/unpack is implemented in hardware on the GPU and so is effectively "free", whereas here you have to pay for it with additional compute. There is also an example of this implementation on CodePlex.
typedef unsigned long ArgbPackedPixel;
struct RgbPixel
{
unsigned int r;
unsigned int g;
unsigned int b;
};
const int fixedAlpha = 0xFF;
inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp)
{
return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}
inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp)
{
RgbPixel rgb;
rgb.b = packedArgb & 0xFF;
rgb.g = (packedArgb & 0xFF00) >> 8;
rgb.r = (packedArgb & 0xFF0000) >> 16;
return rgb;
}
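The race-free pattern described in this answer (read only from a source buffer, accumulate in a local variable, write each result to its own slot in a separate destination) can be sketched on the CPU in standard C++. This is only an illustration, not the book's code: the name convolve4x4, the flat row-major layout, the top-left anchoring of the 4x4 taps and the edge clamping are all assumptions made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTapSize = 4;  // 4x4 kernel, as in the question

// Hypothetical CPU reference: each output pixel is computed from src only and
// written to its own element of dst, so no two iterations touch the same
// output. This is the property the GPU version needs in order to avoid races.
std::vector<float> convolve4x4(const std::vector<float>& src, int width, int height,
                               const std::array<float, kTapSize * kTapSize>& taps) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    std::vector<float> dst(src.size());
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;  // local accumulator, not a shared pixel[0]
            for (int ty = 0; ty < kTapSize; ++ty) {
                for (int tx = 0; tx < kTapSize; ++tx) {
                    // Clamp reads at the bottom and right edges (assumption).
                    const int y = std::min(row + ty, height - 1);
                    const int x = std::min(col + tx, width - 1);
                    sum += taps[ty * kTapSize + tx] *
                           src[static_cast<std::size_t>(y) * width + x];
                }
            }
            dst[static_cast<std::size_t>(row) * width + col] = sum;
        }
    }
    return dst;
}
```

In a C++ AMP port, the two outer loops would become the parallel_for_each domain, and the body would stay the same apart from reading and writing through the source and destination arrays.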

Related

VP8 C/C++ source, how to encode frames in ARGB format to frame instead of from file

I'm trying to get started with the VP8 library. I'm not building it in the standard way; I just loaded all of the main files and the "encoder" folder into a new Visual Studio C++ DLL project and included the C files in an extern "C" DLL export function, which builds fine so far. I have no idea where to start with the C++ API to encode, say, 3 frames of ARGB data into a very basic video.
The only example I could find is simple_encoder.c in the examples folder. Its premise is that it loads another file, parses its frames and converts them, which seems a bit complicated. I just want to pass in a byte array of a few ARGB frames and have it output a very simple VP8 video.
I've seen How to encode series of images into VP8 using WebM VP8 Encoder API? (C/C++), but the accepted answer just links to the build instructions and references the general specification of the VP8 format. The closest I could find there is the example encoding parameters, but I want to do everything from C++ and I can't find any examples other than the default simple_encoder.c.
Here are some of the relevant parts that I think I understand but still need more help with:
//in int main...
...
vpx_image_t raw;
if (!vpx_img_alloc(&raw, VPX_IMG_FMT_I420, info.frame_width,
info.frame_height, 1)) {
//"Failed to allocate image." error
}
So that part I think I understand for the most part. VPX_IMG_FMT_I420 is the only part that's not defined in this file itself; it's in vpx_image.h, first as
#define VPX_IMG_FMT_PLANAR
//then after...
typedef enum vpx_img_fmt {
VPX_IMG_FMT_NONE,
VPX_IMG_FMT_RGB24, /**< 24 bit per pixel packed RGB */
///some other formats....
VPX_IMG_FMT_ARGB, /**< 32 bit packed ARGB, alpha=255 */
VPX_IMG_FMT_YV12 = VPX_IMG_FMT_PLANAR | VPX_IMG_FMT_UV_FLIP | 1, /**< planar YVU */
VPX_IMG_FMT_I420 = VPX_IMG_FMT_PLANAR | 2,
} vpx_img_fmt_t; /**< alias for enum vpx_img_fmt */
So I guess part of my question is answered already just from writing this: one of the formats is VPX_IMG_FMT_ARGB, although I don't know where it's defined. I'm guessing that in the above code I would replace it with
const VpxInterface *encoder = get_vpx_encoder_by_name("v8");
vpx_image_t raw;
VpxVideoInfo info = { 0, 0, 0, { 0, 0 } };
info.frame_width = 1920;
info.frame_height = 1080;
info.codec_fourcc = encoder->fourcc;
info.time_base.numerator = 1;
info.time_base.denominator = 24;
bool didIt = vpx_img_alloc(&raw, VPX_IMG_FMT_ARGB,
info.frame_width, info.frame_height/*example width and height*/, 1)
//check didIt..
vpx_codec_enc_cfg_t cfg;
vpx_codec_ctx_t codec;
vpx_codec_err_t res;
res = vpx_codec_enc_config_default(encoder->codec_interface(), &cfg, 0);
//check if !res for error
cfg.g_w = info.frame_width;
cfg.g_h = info.frame_height;
cfg.g_timebase.num = info.time_base.numerator;
cfg.g_timebase.den = info.time_base.denominator;
cfg.rc_target_bitrate = 200;
VpxVideoWriter *writer = NULL;
writer = vpx_video_writer_open(outfile_arg, kContainerIVF, &info);
//check if !writer for error
bool startIt = vpx_codec_enc_init(&codec, encoder->codec_interface(), &cfg, 0);
//not even sure where codec was set actually..
//check !startIt for error starting
//now the next part in the original is where it reads from the input file, but instead
//I need to pass in an array of some ARGB byte arrays..
//thing is, in the next step they use a while loop for
//vpx_img_read(&raw, fopen("path/to/YV12formatVideo", "rb"))
//to set the contents of the raw vpx image allocated earlier, then
//they call another program that writes it to the writer object,
//but I don't know how to read the actual ARGB data directly into the raw image
//without using fopen, so that's one question (review at end)
//so I'll just put a placeholder here for the **question**
//assuming I have an array of byte arrays stored individually
//for simplicity sake
int size = 1920 * 1080 * 4;
uint8_t imgOne[size] = {/*some big byte array*/};
uint8_t imgTwo[size] = {/*some big byte array*/};
uint8_t imgThree[size] = {/*some big byte array*/};
uint8_t *images[] = {imgOne, imgTwo, imgThree};
int framesDone = 0;
int maxFrames = 3;
//so now I can replace the while loop with a filler function
//until I find out how to set the raw image with ARGB data
while(framesDone < maxFrames) {
magicalFunctionToSetARGBOfRawImage(&raw, images[framesDone]);
encode_frame(&codec, &raw, framesDone, 0, writer);
framesDone++;
}
//now apparently it needs to be flushed after
while(encode_frame(&codec, 0, -1, 0, writer)){}
vpx_img_free(&raw);
bool isDestroyed = vpx_codec_destroy(&codec);
//check if !isDestroyed for error
//now we gotta define the encode_Frames function, but simpler
//(and make it above other function for reference purposes
//or in header
static int encode_frame(
vpx_codex_ctx_t *coydek,
vpx_image_t pic,
int currentFrame,
int flags,
VpxVideoWriter *koysayv/*writer*/
) {
//now to substitute their encodeFrame function for
//the actual raw calls to simplify things
const DidIt = vpx_codec_encode(
coydek,
pic,
currentFrame,
1,//duration I think
flags,//whatever that is
VPX_DL_REALTIME//different than simlpe_encoder
);
if(!DidIt) return;//error here
vpx_codec_iter_t iter = 0;
const vpx_codec_cx_pkt_t *pkt = 0;
int gotThings = 0;
while(
(pkt = vpx_codec_get_cx_data(
coydek,
&iter
)) != 0
) {
gotThings = 1;
if(
pkt->kind
== VPX_CODEC_CX_FRAME_PKT //don't exactly
//understand this part
) {
const
int
keyframe = (
pkt
->
data
.frame
.flags
&
VPX_FRAME_IS_KEY
) != 0; //don'texactly understand the
//& operator here or how it gets the keyframe
bool wroteFrame = vpx_video_writer_write_frame(
koysayv,
pkt->data.frame.buf
//I'm guessing this is the encoded
//frame data
,
pkt->data.frame.sz,
pkt->data.frame.pts
);
if(!wroteFrame) return; //error
}
}
return gotThings;
}
Thing is though, I don't know how to actually read the
ARGB data into the RAW image buffer itself, as mentioned
above, in the original example, they use
vpx_img_read(&raw, fopen("path/to/file", "rb"))
but if I'm starting off with the byte arrays themselves
then what function do I use for that instead of the file?
I have a feeling it can be solved by the source code for the vpx_img_read found in tools_common.c function:
int vpx_img_read(vpx_image_t *img, FILE *file) {
int plane;
for (plane = 0; plane < 3; ++plane) {
unsigned char *buf = img->planes[plane];
const int stride = img->stride[plane];
const int w = vpx_img_plane_width(img, plane) *
((img->fmt & VPX_IMG_FMT_HIGHBITDEPTH) ? 2 : 1);
const int h = vpx_img_plane_height(img, plane);
int y;
for (y = 0; y < h; ++y) {
if (fread(buf, 1, w, file) != (size_t)w) return 0;
buf += stride;
}
}
return 1;
}
I am not experienced enough to know how to get a single frame's ARGB data in, but I think the key part is fread(buf, 1, w, file), which seems to read parts of the file into buf. Since buf points to img->planes[plane], I think reading into buf also fills img->planes[plane], but I'm not sure that is the case. I'm also not sure how to replace the fread from a file with a byte array that is already loaded into memory.

VPX_IMG_FMT_ARGB is not defined because it is not supported by libvpx (as far as I have seen). To compress an image using this library, you must first convert it to one of the supported formats, such as I420 (VPX_IMG_FMT_I420). The code here (not mine), https://gist.github.com/racerxdl/8164330, does this well for the RGB format. If you don't want to use libswscale for the conversion from RGB to I420, you can do something like the following (this code converts an RGBA array of bytes to an I420 vpx_image that can be used by libvpx):
unsigned int tx = <width of your image>
unsigned int ty = <height of your image>
unsigned char *image = <array of bytes : RGBARGBA... of size ty*tx*4>
vpx_image_t *imageVpx = <result that must have been properly initialized by libvpx>
imageVpx->stride[VPX_PLANE_U ] = tx/2;
imageVpx->stride[VPX_PLANE_V ] = tx/2;
imageVpx->stride[VPX_PLANE_Y ] = tx;
imageVpx->stride[VPX_PLANE_ALPHA] = tx;
imageVpx->planes[VPX_PLANE_U ] = new unsigned char[ty*tx/4];
imageVpx->planes[VPX_PLANE_V ] = new unsigned char[ty*tx/4];
imageVpx->planes[VPX_PLANE_Y ] = new unsigned char[ty*tx ];
imageVpx->planes[VPX_PLANE_ALPHA] = new unsigned char[ty*tx ];
unsigned char *planeY = imageVpx->planes[VPX_PLANE_Y ];
unsigned char *planeU = imageVpx->planes[VPX_PLANE_U ];
unsigned char *planeV = imageVpx->planes[VPX_PLANE_V ];
unsigned char *planeA = imageVpx->planes[VPX_PLANE_ALPHA];
for (unsigned int y=0; y<ty; y++)
{
if (!(y % 2))
{
for (unsigned int x=0; x<tx; x+=2)
{
int r = *image++;
int g = *image++;
int b = *image++;
int a = *image++;
*planeY++ = max(0, min(255, (( 66*r + 129*g + 25*b) >> 8) + 16));
*planeU++ = max(0, min(255, ((-38*r + -74*g + 112*b) >> 8) + 128));
*planeV++ = max(0, min(255, ((112*r + -94*g + -18*b) >> 8) + 128));
*planeA++ = a;
r = *image++;
g = *image++;
b = *image++;
a = *image++;
*planeA++ = a;
*planeY++ = max(0, min(255, ((66*r + 129*g + 25*b) >> 8) + 16));
}
}
else
{
for (unsigned int x=0; x<tx; x++)
{
int const r = *image++;
int const g = *image++;
int const b = *image++;
int const a = *image++;
*planeA++ = a;
*planeY++ = max(0, min(255, ((66*r + 129*g + 25*b) >> 8) + 16));
}
}
}
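The same conversion can be tried out without libvpx by writing into plain buffers first. The sketch below is only an illustration under stated assumptions: I420Frame and rgbaToI420 are hypothetical names, the input is tightly packed RGBA with even width and height, and the chroma sample is taken from the top-left pixel of each 2x2 block, as in the loop above. The alpha plane is dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the three I420 planes (not a libvpx type).
struct I420Frame {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> y;  // width * height samples
    std::vector<std::uint8_t> u;  // (width / 2) * (height / 2) samples
    std::vector<std::uint8_t> v;  // (width / 2) * (height / 2) samples
};

inline std::uint8_t clampToByte(int value) {
    return static_cast<std::uint8_t>(std::max(0, std::min(255, value)));
}

// Converts tightly packed RGBA bytes to I420 using the integer coefficients
// from the answer above.
I420Frame rgbaToI420(const std::uint8_t* rgba, int width, int height) {
    assert(rgba != nullptr);
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);

    I420Frame frame;
    frame.width = width;
    frame.height = height;
    frame.y.resize(static_cast<std::size_t>(width) * height);
    frame.u.resize(static_cast<std::size_t>(width / 2) * (height / 2));
    frame.v.resize(frame.u.size());

    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            const std::size_t pixel = static_cast<std::size_t>(row) * width + col;
            const int r = rgba[4 * pixel + 0];
            const int g = rgba[4 * pixel + 1];
            const int b = rgba[4 * pixel + 2];
            frame.y[pixel] = clampToByte(((66 * r + 129 * g + 25 * b) >> 8) + 16);

            if (row % 2 == 0 && col % 2 == 0) {
                // One chroma sample per 2x2 block. The right shift of a negative
                // value is arithmetic on mainstream compilers (and by rule since C++20).
                const std::size_t c =
                    static_cast<std::size_t>(row / 2) * (width / 2) + col / 2;
                frame.u[c] = clampToByte(((-38 * r - 74 * g + 112 * b) >> 8) + 128);
                frame.v[c] = clampToByte(((112 * r - 94 * g - 18 * b) >> 8) + 128);
            }
        }
    }
    return frame;
}
```

To feed the encoder, each plane would then be copied row by row into img->planes[...] of an image allocated with vpx_img_alloc(..., VPX_IMG_FMT_I420, ...), advancing the destination by img->stride[...] per row, in the same way vpx_img_read fills the planes from a file.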

openCl path tracer creates strange noise patterns

I've made a path tracer using OpenCL and C++, following the basic structure in this tutorial: http://raytracey.blogspot.com/2016/11/opencl-path-tracing-tutorial-2-path.html. As far as I can tell, nothing is wrong with the path tracing algorithm itself, but I get strange stripe patterns in the image that don't match the regular noise of path tracing. [image: striped image]
There are distinct vertical stripes and narrower horizontal ones that make the image look granular regardless of how many samples I take per pixel. Pixel by pixel, the path tracer seems to be working (the outlines of objects are correct even where they appear mid-stripe), as seen here: [image: close-up].
The only differences between my code and the code in the linked tutorial are that Sam Lapere appears to be using the C++ wrapper for OpenCL, and that I've added a couple of features like movement. There are also a few differences in how I'm handling light bounces.
I'm new to OpenCL. What could be causing this? It seems that it has nothing to do with my ray tracer itself, but rather with the way I'm using OpenCL. I'm also using an SDL texture and renderer to show the image on the screen.
Here is the tracer code, if it helps:
kernel:
__kernel void render_kernel
(__constant struct Sphere* spheres, const int width, const int height,
const int sphere_count, __global int * output, __global float3*
pixel_buckets, __global int* counter, __constant struct Ray* camera,
__global bool* reset){
int gid = get_global_id(0);
//for movement
if (*reset){
pixel_buckets[gid] = (float3)(0,0,0);
counter[gid] = 0;
}
int xcoord = gid % width;
int ycoord = gid / width;
struct Ray camray = createCamRay(xcoord, ycoord, width, height, counter[gid], camera);
float3 final_color = trace(spheres, &camray, sphere_count, xcoord, ycoord);
counter[gid] ++;
//average colors
pixel_buckets[gid] += final_color;
output[gid] = colorInt(clampColor(pixel_buckets[gid] / counter[gid]));
}
trace:
float3 trace(__constant struct Sphere* spheres, struct Ray* camray, const int sphere_count,
unsigned int seed0, unsigned int seed1){
struct Ray ray = *camray;
struct Sphere sphere1;
sphere1.center = (float3)(0, 0, 3);
sphere1.radius = 0.7;
sphere1.color = (float3)(1,1,0);
const int bounce_count = 8;
float3 colors[20];
float3 emiss[20];
for (int bounce = 0; bounce < bounce_count; bounce ++){
int sphere_id = 0;
float hit_distance = intersectScene(spheres, &ray, &sphere_id, sphere_count);
struct Sphere hit_sphere = spheres[sphere_id];
float3 hit_point = ray.origin + (ray.direction * hit_distance);
float3 normal = normalize(hit_point - hit_sphere.center);
if (dot(normal, -ray.direction) < 0){
normal = -normal;
}
//random bounce angles
float rand_theta = get_random(seed0, seed1);
float theta = acos(sqrt(rand_theta));
float rand_phi = get_random(seed0, seed1);
float phi = 2 * PI * rand_phi;
//scales the tnb vectors
float x = sin(theta) * sin(phi);
float y = sin(theta) * cos(phi);
float n = cos(theta);
float3 hemx = normalize(cross(ray.direction, normal)) * x;
float3 hemy = normalize(cross(hemx, normal)) * y;
normal = normal * n;
float3 new_ray = normalize(hemx + hemy + normal);
ray.origin = hit_point + (normal * EPSILON);
ray.direction = new_ray;
colors[bounce] = hit_sphere.color;
emiss[bounce] = hit_sphere.emmissive;
}
colors[bounce_count] = (float3)(0,0,0);
emiss[bounce_count] = (float3)(0,0,0);
for (int i = bounce_count - 1; i >= 0; i--){
colors[i] = (colors[i] * emiss[i]) + (colors[i] * colors[i + 1]);
}
return colors[0];
}
random number generator:
float get_random(unsigned int *seed0, unsigned int *seed1) {
/* hash the seeds using bitwise AND operations and bitshifts */
*seed0 = 36969 * ((*seed0) & 65535) + ((*seed0) >> 16);
*seed1 = 18000 * ((*seed1) & 65535) + ((*seed1) >> 16);
unsigned int ires = ((*seed0) << 16) + (*seed1);
/* use union struct to convert int to float */
union {
float f;
unsigned int ui;
} res;
res.ui = (ires & 0x007fffff) | 0x40000000; /* bitwise AND, bitwise OR */
return (res.f - 2.0f) / 2.0f;
}
thanks
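For experimenting with the generator outside the kernel, the get_random function above can be ported to host-side C++. This is a sketch, not part of the original post: the name getRandom is hypothetical, and std::memcpy replaces the union because type punning through a union is not valid C++. It shows two properties of the posted code: the result lies in [0, 1), and the seeds advance only when they are passed by pointer, so two pixels starting from the same seed pair get the same sequence. Whether either property explains the stripes is not established by the post.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side port of get_random from the kernel above.
float getRandom(std::uint32_t* seed0, std::uint32_t* seed1) {
    assert(seed0 != nullptr && seed1 != nullptr);

    // Advance both seeds in place, exactly as the kernel code does.
    *seed0 = 36969u * (*seed0 & 65535u) + (*seed0 >> 16);
    *seed1 = 18000u * (*seed1 & 65535u) + (*seed1 >> 16);
    const std::uint32_t ires = (*seed0 << 16) + *seed1;

    // Keep 23 mantissa bits and force the exponent so the float is in [2, 4).
    const std::uint32_t bits = (ires & 0x007fffffu) | 0x40000000u;
    float f = 0.0f;
    std::memcpy(&f, &bits, sizeof f);

    return (f - 2.0f) / 2.0f;  // map [2, 4) to [0, 1)
}
```

Seeding two generators with the same pair and comparing their first few outputs is a quick way to check how much variation the per-pixel seeds provide.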

Extracting raw data from template for use in CUDA

The following code is a snippet from the PCL (point cloud) library. It calculates the integral sum of an image.
template <class DataType, unsigned Dimension> class IntegralImage2D
{
public:
static const unsigned dim_fst = Dimension;
typedef cv::Vec<typename TypeTraits<DataType>::IntegralType, dim_fst> FirstType;
std::vector<FirstType> img_fst;
//.... lots of methods missing here that actually calculate the integral sum
/** \brief Compute the first order sum within a given rectangle
* \param[in] start_x x position of rectangle
* \param[in] start_y y position of rectangle
* \param[in] width width of rectangle
* \param[in] height height of rectangle
*/
inline FirstType getFirstOrderSum(unsigned start_x, unsigned start_y, unsigned width, unsigned height) const
{
const unsigned upper_left_idx = start_y * (wdt + 1) + start_x;
const unsigned upper_right_idx = upper_left_idx + width;
const unsigned lower_left_idx =(start_y + height) * (wdt + 1) + start_x;
const unsigned lower_right_idx = lower_left_idx + width;
return(img_fst[lower_right_idx] + img_fst[upper_left_idx] - img_fst[upper_right_idx] - img_fst[lower_left_idx]);
}
Currently the results are obtained using the following code:
IntegralImage2D<float,3> iim_xyz;
IntegralImage2D<float, 3>::FirstType fo_elements;
IntegralImage2D<float, 3>::SecondType so_elements;
fo_elements = iim_xyz.getFirstOrderSum(pos_x - rec_wdt_2, pos_y - rec_hgt_2, rec_wdt, rec_hgt);
so_elements = iim_xyz.getSecondOrderSum(pos_x - rec_wdt_2, pos_y - rec_hgt_2, rec_wdt, rec_hgt);
However, I'm trying to parallelise the code (write getFirstOrderSum as a CUDA device function). Since CUDA doesn't recognise these FirstType and SecondType objects (or any OpenCV objects, for that matter), I'm struggling (I'm new to C++) to extract the raw data from the template.
If possible, I would like to cast the img_fst object to some kind of vector or array that I can allocate for the CUDA kernel.
It seems img_fst is of type std::vector<cv::Matx<double,3,1>>

As it turns out you can pass the raw data as you would using a normal vector.
void computation(ps::IntegralImage2D<float, 3> iim_xyz){
cv::Vec<double, 3>* d_img_fst = 0;
cudaErrorCheck(cudaMalloc((void**)&d_img_fst, sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size())));
cudaErrorCheck(cudaMemcpy(d_img_fst, &iim_xyz.img_fst[0], sizeof(cv::Vec<double, 3>)*(iim_xyz.img_fst.size()), cudaMemcpyHostToDevice));
//..
}
__device__ double* getFirstOrderSum(unsigned start_x, unsigned start_y, unsigned width, unsigned height, int wdt, cv::Vec<double, 3>* img_fst)
{
const unsigned upper_left_idx = start_y * (wdt + 1) + start_x;
const unsigned upper_right_idx = upper_left_idx + width;
const unsigned lower_left_idx = (start_y + height) * (wdt + 1) + start_x;
const unsigned lower_right_idx = lower_left_idx + width;
double* result = new double[3];
result[0] = img_fst[lower_right_idx].val[0] + img_fst[upper_left_idx].val[0] - img_fst[upper_right_idx].val[0] - img_fst[lower_left_idx].val[0];
result[1] = img_fst[lower_right_idx].val[1] + img_fst[upper_left_idx].val[1] - img_fst[upper_right_idx].val[1] - img_fst[lower_left_idx].val[1];
result[2] = img_fst[lower_right_idx].val[2] + img_fst[upper_left_idx].val[2] - img_fst[upper_right_idx].val[2] - img_fst[lower_left_idx].val[2];
return result; //i have to delete this pointer otherwise I will create memory leak
}
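One way to avoid the new double[3] and the manual delete is to return a small struct by value. The sketch below shows the idea in standard C++ so it can be checked on the host; Sum3, buildIntegral and firstOrderSum are hypothetical names, and buildIntegral assumes the (wdt + 1)-wide layout with a zero first row and column that the indexing above implies. In CUDA, firstOrderSum would additionally be marked __device__.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain three-component value, standing in for cv::Vec<double, 3>.
struct Sum3 {
    double val[3];
};

// Hypothetical helper: builds a (wdt + 1) x (hgt + 1) integral image whose
// first row and first column are zero.
std::vector<Sum3> buildIntegral(const std::vector<Sum3>& img, unsigned wdt, unsigned hgt) {
    assert(img.size() == static_cast<std::size_t>(wdt) * hgt);
    const std::size_t stride = static_cast<std::size_t>(wdt) + 1;
    std::vector<Sum3> integral(stride * (hgt + 1), Sum3{{0.0, 0.0, 0.0}});
    for (unsigned y = 0; y < hgt; ++y)
        for (unsigned x = 0; x < wdt; ++x)
            for (int c = 0; c < 3; ++c)
                integral[(y + 1) * stride + (x + 1)].val[c] =
                    img[static_cast<std::size_t>(y) * wdt + x].val[c]
                    + integral[y * stride + (x + 1)].val[c]
                    + integral[(y + 1) * stride + x].val[c]
                    - integral[y * stride + x].val[c];
    return integral;
}

// Same index arithmetic as getFirstOrderSum, but the result is returned by
// value, so there is nothing to delete afterwards.
inline Sum3 firstOrderSum(const Sum3* img_fst, unsigned wdt, unsigned start_x,
                          unsigned start_y, unsigned width, unsigned height) {
    assert(img_fst != nullptr);
    const unsigned upper_left_idx = start_y * (wdt + 1) + start_x;
    const unsigned upper_right_idx = upper_left_idx + width;
    const unsigned lower_left_idx = (start_y + height) * (wdt + 1) + start_x;
    const unsigned lower_right_idx = lower_left_idx + width;
    Sum3 result;
    for (int c = 0; c < 3; ++c)
        result.val[c] = img_fst[lower_right_idx].val[c] + img_fst[upper_left_idx].val[c]
                      - img_fst[upper_right_idx].val[c] - img_fst[lower_left_idx].val[c];
    return result;
}
```

For a 3x2 image in which every pixel is {1, 2, 3}, the sum over the 2x2 rectangle starting at (1, 0) covers four pixels and comes out as {4, 8, 12}.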

Implement a near real-time CPU capability like glAlphaFunc(GL_GREATER) with RGB source and RGBA overlay

Latency is the biggest concern here. I have found that trying to render three 1920x1080 video feeds with RGBA overlays to individual windows via OpenGL has limits. I am able to render two windows with overlays or three windows without overlays just fine, but when the third window is introduced, rendering stalls are obvious. I believe that the issue is due to the overuse of glAlphaFunc() to overlay an RGBA-based texture on an RGB video texture. To reduce the overuse, my thought is to move some of the overlay work to the CPU (as I have lots of CPU: dual hex-core Xeon). The ideal place to do this would be when copying the source RGB image to the mapped PBO, replacing the RGB values with the ones from the RGBA overlay where A > 0.
I have tried using Intel IPP methods, but there is no method available that doesn't involve multiple calls and results in too much latency. I've tried straight C code, but this takes longer than the 33 ms that I am allowed. I need help with creating an optimized assembly or SSE based routine that will provide minimal latency.
Compile the below code with > g++ -fopenmp -O2 -mtune=native
Basic C function for clarity:
void copyAndOverlay(const uint8_t* aSourceRGB, const uint8_t* aOverlayRGBA, uint8_t* aDestinationRGB, int aWidth, int aHeight) {
int i;
#pragma omp parallel for
for (i=0; i<aWidth*aHeight; ++i) {
if (0 == aOverlayRGBA[i*4+3]) {
aDestinationRGB[i*3] = aSourceRGB[i*3]; // R
aDestinationRGB[i*3+1] = aSourceRGB[i*3+1]; // G
aDestinationRGB[i*3+2] = aSourceRGB[i*3+2]; // B
} else {
aDestinationRGB[i*3] = aOverlayRGBA[i*4]; // R
aDestinationRGB[i*3+1] = aOverlayRGBA[i*4+1]; // G
aDestinationRGB[i*3+2] = aOverlayRGBA[i*4+2]; // B
}
}
}
uint64_t getTime() {
struct timeval tNow;
gettimeofday(&tNow, NULL);
return (uint64_t)tNow.tv_sec * 1000000 + (uint64_t)tNow.tv_usec;
}
int main(int argc, char **argv) {
int pixels = _WIDTH_ * _HEIGHT_ * 3;
uint8_t *rgba = new uint8_t[_WIDTH_ * _HEIGHT_ * 4];
uint8_t *src = new uint8_t[pixels];
uint8_t *dst = new uint8_t[pixels];
uint64_t tStart = getTime();
for (int t=0; t<1000; ++t) {
copyAndOverlay(src, rgba, dst, _WIDTH_, _HEIGHT_);
}
printf("delta: %lu\n", (getTime() - tStart) / 1000);
delete [] rgba;
delete [] src;
delete [] dst;
return 0;
}

Here is an SSE4 implementation that is a little more than 5 times faster than the code you posted with the question (without parallelization of the loop). As written, it only works on RGBA buffers that are 16-byte aligned and sized in multiples of 64 bytes, and on RGB buffers that are 16-byte aligned and sized in multiples of 48 bytes. These size requirements fit your 1920x1080 resolution exactly, but you may need to add code to ensure your buffers are 16-byte aligned.
void copyAndOverlay(const uint8_t* aSourceRGB, const uint8_t* aOverlayRGBA, uint8_t* aDestinationRGB, int aWidth, int aHeight) {
__m128i const ocmp = _mm_setzero_si128();
__m128i const omskshf1 = _mm_set_epi32(0x00000000, 0x0F0F0F0B, 0x0B0B0707, 0x07030303);
__m128i const omskshf2 = _mm_set_epi32(0x07030303, 0x00000000, 0x0F0F0F0B, 0x0B0B0707);
__m128i const omskshf3 = _mm_set_epi32(0x0B0B0707, 0x07030303, 0x00000000, 0x0F0F0F0B);
__m128i const omskshf4 = _mm_set_epi32(0x0F0F0F0B, 0x0B0B0707, 0x07030303, 0x00000000);
__m128i const ovalshf1 = _mm_set_epi32(0x00000000, 0x0E0D0C0A, 0x09080605, 0x04020100);
__m128i const ovalshf2 = _mm_set_epi32(0x04020100, 0x00000000, 0x0E0D0C0A, 0x09080605);
__m128i const ovalshf3 = _mm_set_epi32(0x09080605, 0x04020100, 0x00000000, 0x0E0D0C0A);
__m128i const ovalshf4 = _mm_set_epi32(0x0E0D0C0A, 0x09080605, 0x04020100, 0x00000000);
__m128i const blndmsk1 = _mm_set_epi32(0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000);
__m128i const blndmsk2 = _mm_set_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000);
__m128i const blndmsk3 = _mm_set_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000);
__m128i a, b, c, x, y, z, w, p, q, r, s;
uint8_t const *const aSourceRGBPast = aSourceRGB + 3 * aWidth * aHeight;
while (aSourceRGB != aSourceRGBPast) {
// source:
// aaabbbcccdddeeef
// ffggghhhiiijjjkk
// klllmmmnnnoooppp
//
// overlay:
// aaaabbbbccccdddd
// eeeeffffgggghhhh
// iiiijjjjkkkkllll
// mmmmnnnnoooopppp
// load source
a = _mm_load_si128((__m128i const*)(aSourceRGB ));
b = _mm_load_si128((__m128i const*)(aSourceRGB + 16));
c = _mm_load_si128((__m128i const*)(aSourceRGB + 32));
// load overlay
x = _mm_load_si128((__m128i const*)(aOverlayRGBA ));
y = _mm_load_si128((__m128i const*)(aOverlayRGBA + 16));
z = _mm_load_si128((__m128i const*)(aOverlayRGBA + 32));
w = _mm_load_si128((__m128i const*)(aOverlayRGBA + 48));
// compute blend mask, put 0xFF in bytes equal to zero
p = _mm_cmpeq_epi8(x, ocmp);
q = _mm_cmpeq_epi8(y, ocmp);
r = _mm_cmpeq_epi8(z, ocmp);
s = _mm_cmpeq_epi8(w, ocmp);
// align overlay to be condensed to 3-byte color
x = _mm_shuffle_epi8(x, ovalshf1);
y = _mm_shuffle_epi8(y, ovalshf2);
z = _mm_shuffle_epi8(z, ovalshf3);
w = _mm_shuffle_epi8(w, ovalshf4);
// condense overlay to 3-byte color
x = _mm_blendv_epi8(x, y, blndmsk1);
y = _mm_blendv_epi8(y, z, blndmsk2);
z = _mm_blendv_epi8(z, w, blndmsk3);
// align blend mask to be condensed to 3-byte color
p = _mm_shuffle_epi8(p, omskshf1);
q = _mm_shuffle_epi8(q, omskshf2);
r = _mm_shuffle_epi8(r, omskshf3);
s = _mm_shuffle_epi8(s, omskshf4);
// condense blend mask to 3-byte color
p = _mm_blendv_epi8(p, q, blndmsk1);
q = _mm_blendv_epi8(q, r, blndmsk2);
r = _mm_blendv_epi8(r, s, blndmsk3);
// select from overlay and source based on blend mask
x = _mm_blendv_epi8(x, a, p);
y = _mm_blendv_epi8(y, b, q);
z = _mm_blendv_epi8(z, c, r);
// write colors to destination
_mm_store_si128((__m128i*)(aDestinationRGB ), x);
_mm_store_si128((__m128i*)(aDestinationRGB + 16), y);
_mm_store_si128((__m128i*)(aDestinationRGB + 32), z);
// update pointers
aSourceRGB += 48;
aOverlayRGBA += 64;
aDestinationRGB += 48;
}
}
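The alignment and size preconditions mentioned above can be handled with a small helper. The following is one possible sketch in standard C++ (C++11 or later), not part of the original answer: AlignedBytes and sizesFitKernel are hypothetical names, and the approach (over-allocate by 15 bytes, then round the pointer up with std::align) is just one of several options; platform functions such as _mm_malloc would work as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical owner of a byte buffer whose data() pointer is 16-byte aligned,
// as required by _mm_load_si128 and _mm_store_si128.
class AlignedBytes {
public:
    explicit AlignedBytes(std::size_t bytes)
        : storage_(bytes + 15), data_(nullptr) {
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align moves p forward to the next 16-byte boundary.
        data_ = static_cast<std::uint8_t*>(std::align(16, bytes, p, space));
        assert(data_ != nullptr);
    }
    AlignedBytes(const AlignedBytes&) = delete;
    AlignedBytes& operator=(const AlignedBytes&) = delete;

    std::uint8_t* data() { return data_; }
    const std::uint8_t* data() const { return data_; }

private:
    std::vector<std::uint8_t> storage_;  // owns the over-allocated memory
    std::uint8_t* data_;                 // aligned view into storage_
};

// Checks the size preconditions stated in the answer: RGB bytes in multiples
// of 48 and RGBA bytes in multiples of 64.
inline bool sizesFitKernel(int width, int height) {
    const std::size_t pixels = static_cast<std::size_t>(width) * height;
    return (pixels * 3) % 48 == 0 && (pixels * 4) % 64 == 0;
}
```

With buffers created this way, src.data(), overlay.data() and dst.data() can be passed straight to copyAndOverlay. Frames whose size fails sizesFitKernel would need a scalar loop for the leftover pixels.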

Flipping the 2D texture on a sphere with Ray-Tracing

I am working on my ray tracer and I think I've made some significant progress. I am currently trying to place texture images onto objects, but they are not mapped correctly: they appear flipped on the sphere. Here is the final image produced by my current code:
Here is the relevant code:
-Image Class for opening image
class Image
{
public:
Image() {}
void read_bmp_file(char* filename)
{
int i;
FILE* f = fopen(filename, "rb");
unsigned char info[54];
fread(info, sizeof(unsigned char), 54, f); // read the 54-byte header
// extract image height and width from header
width = *(int*)&info[18];
height = *(int*)&info[22];
int size = 3 * width * height;
data = new unsigned char[size]; // allocate 3 bytes per pixel
fread(data, sizeof(unsigned char), size, f); // read the rest of the data at once
fclose(f);
for(i = 0; i < size; i += 3)
{
unsigned char tmp = data[i];
data[i] = data[i+2];
data[i+2] = tmp;
}
/*Now data should contain the (R, G, B) values of the pixels. The color of pixel (i, j) is stored at
data[j * 3* width + 3 * i], data[j * 3 * width + 3 * i + 1] and data[j * 3 * width + 3*i + 2].
In the last part, the swap between every first and third pixel is done because windows stores the
color values as (B, G, R) triples, not (R, G, B).*/
}
public:
int width;
int height;
unsigned char* data;
};
-Texture class
class Texture: public Material
{
public:
Texture(char* filename): Material() {
image_ptr = new Image;
image_ptr->read_bmp_file(filename);
}
virtual ~Texture() {}
virtual void set_mapping(Mapping* mapping)
{ mapping_ptr = mapping;}
virtual Vec get_color(const ShadeRec& sr) {
int row, col;
if(mapping_ptr)
mapping_ptr->get_texel_coordinates(sr.local_hit_point, image_ptr->width, image_ptr->height, row, col);
return Vec (image_ptr->data[row * 3 * image_ptr->width + 3*col ]/255.0,
image_ptr->data[row * 3 * image_ptr->width + 3*col+1]/255.0,
image_ptr->data[row * 3 * image_ptr->width + 3*col+2]/255.0);
}
public:
Image* image_ptr;
Mapping* mapping_ptr;
};
-Mapping class
class SphericalMap: public Mapping
{
public:
SphericalMap(): Mapping() {}
virtual ~SphericalMap() {}
virtual void get_texel_coordinates (const Vec& local_hit_point,
const int hres,
const int vres,
int& row,
int& column) const
{
float theta = acos(local_hit_point.y);
float phi = atan2(local_hit_point.z, local_hit_point.x);
if(phi < 0.0)
phi += 2*PI;
float u = phi/(2*PI);
float v = (PI - theta)/PI;
column = (int)((hres - 1) * u);
row = (int)((vres - 1) * v);
}
};
-Local hit points:
virtual void Sphere::set_local_hit_point(ShadeRec& sr)
{
sr.local_hit_point.x = sr.hit_point.x - c.x;
sr.local_hit_point.y = (sr.hit_point.y - c.y)/R;
sr.local_hit_point.z = sr.hit_point.z -c.z;
}
-This is how I constructed the sphere in main:
Texture* t1 = new Texture("Texture\\earthmap2.bmp");
SphericalMap* sm = new SphericalMap();
t1->set_mapping(sm);
t1->set_ka(0.55);
t1->set_ks(0.0);
Sphere *s1 = new Sphere(Vec(-60,0,50), 149);
s1->set_material(t1);
w.add_object(s1);
Sorry for the long code, but if I had any idea where the problem might be, I would have posted only that part. Finally, this is how I call the get_color() function from main:
xShaded += sr.material_ptr->get_color(sr).x * in.x * max(0.0, sr.normal.dot(l)) +
sr.material_ptr->ks * in.x * pow((max(0.0,sr.normal.dot(h))),1);
yShaded += sr.material_ptr->get_color(sr).y * in.y * max(0.0, sr.normal.dot(l)) +
sr.material_ptr->ks * in.y * pow((max(0.0,sr.normal.dot(h))),1);
zShaded += sr.material_ptr->get_color(sr).z * in.z * max(0.0, sr.normal.dot(l)) +
sr.material_ptr->ks * in.z * pow((max(0.0,sr.normal.dot(h))),1);

Shot in the dark: if memory serves, BMPs are stored from the bottom up, while many other image formats are top-down. Could that possibly be the problem? Perhaps your file reader just needs to reverse the rows?

Changing float phi = atan2(local_hit_point.z, local_hit_point.x); to float phi = atan2(local_hit_point.x, local_hit_point.z); solved the problem.
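The corrected mapping can be written as a small standalone function to make the effect of the argument order easy to check. This is a sketch under assumptions: sphericalUV, TexCoord and texelIndex are hypothetical names, the hit point is assumed to lie on a unit sphere in local coordinates, and double precision is used instead of float.

```cpp
#include <cassert>
#include <cmath>

struct TexCoord {
    double u;  // horizontal texture coordinate in [0, 1)
    double v;  // vertical texture coordinate in [0, 1]
};

// Spherical mapping with the corrected argument order atan2(x, z).
TexCoord sphericalUV(double x, double y, double z) {
    assert(y >= -1.0 && y <= 1.0);
    const double kPi = 3.14159265358979323846;
    const double theta = std::acos(y);  // polar angle measured from +y
    double phi = std::atan2(x, z);      // azimuth measured from +z towards +x
    if (phi < 0.0)
        phi += 2.0 * kPi;
    return TexCoord{phi / (2.0 * kPi), (kPi - theta) / kPi};
}

// Converts a texture coordinate to a row or column index, as in
// get_texel_coordinates.
inline int texelIndex(double t, int resolution) {
    return static_cast<int>((resolution - 1) * t);
}
```

With this order, a point on the +x axis maps to u = 0.25 and a point on the -x axis to u = 0.75; swapping the arguments back to atan2(z, x) gives u = 0 for +x and u = 0.5 for -x instead, which changes both the direction in which u increases around the sphere and where it starts.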
