C++AMP exception in simple image processing example - c++

I'm trying to teach myself C++AMP, and would like to start with a very simple task from my field, image processing. I'd like to convert a 24-bit-per-pixel RGB image (a Bitmap) to an 8-bit-per-pixel grayscale one. The image data is available in unsigned char arrays (obtained from Bitmap::LockBits(...) etc.).
I know that C++AMP for some reason cannot deal with char or unsigned char data via array or array_view, so I tried to use textures as described in that blog, which explains how 8bpp textures are written to, although Visual Studio 2013 tells me writeonly_texture_view is deprecated.
My code throws a runtime exception, saying "Failed to dispatch kernel." The complete text of the exception is lengthy:
ID3D11DeviceContext::Dispatch: The Unordered Access View (UAV) in slot 0 of the Compute Shader unit has the Format (R8_UINT). This format does not support being read from a shader as as UAV. This mismatch is invalid if the shader actually uses the view (e.g. it is not skipped due to shader code branching). It was unfortunately not possible to have all hardware implementations support reading this format as a UAV, despite that the format can written to as a UAV. If the shader only needs to perform reads but not writes to this resource, consider using a Shader Resource View instead of a UAV.
The code I use so far is this:
namespace gpu = concurrency;
gpu::extent<3> inputExtent(height, width, 3);
gpu::graphics::texture<unsigned int, 3> inputTexture(inputExtent, eight);
gpu::graphics::copy((void*)inputData24bpp, dataLength, inputTexture);
gpu::graphics::texture_view<unsigned int, 3> inputTexView(inputTexture);
gpu::graphics::texture<unsigned int, 2> outputTexture(width, height, eight);
gpu::graphics::writeonly_texture_view<unsigned int, 2> outputTexView(outputTexture);
gpu::parallel_for_each(outputTexture.extent,
[inputTexView, outputTexView](gpu::index<2> pix) restrict(amp) {
gpu::index<3> indR(pix[0], pix[1], 0);
gpu::index<3> indG(pix[0], pix[1], 1);
gpu::index<3> indB(pix[0], pix[1], 2);
unsigned int sum = inputTexView[indR] + inputTexView[indG] + inputTexView[indB];
outputTexView.set(pix, sum / 3);
});
gpu::graphics::copy(outputTexture, outputData8bpp);
What's the reason for this exception, and what can I do for a workaround?

I've also been learning C++ AMP on my own and faced a problem very similar to yours, but in my case I needed to deal with a 16-bit image.
Likely, the issue can be solved using textures although I can't help you on that due to a lack of experience.
So, what I did is basically based on bit masking.
First off, trick the compiler in order to let you compile:
unsigned int* sourceData = reinterpret_cast<unsigned int*>(source);
unsigned int* destData = reinterpret_cast<unsigned int*>(dest);
Next, your array_view has to cover all of your data. Be aware that the view really thinks your data is 32 bits wide, so you have to convert the element count (divide by 2 for 16-bit data; use 4 for 8-bit data).
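// Views are named differently from the raw pointers they wrap, and the
// destination view wraps destData (not sourceData).
concurrency::array_view<const unsigned int> sourceView((size + 7) / 2, sourceData);
concurrency::array_view<unsigned int> destView((size + 7) / 2, destData);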
Now, you are able to write a typical for_each block.
typedef concurrency::array_view<const unsigned int> OriginalImage;
typedef concurrency::array_view<unsigned int> ResultImage;
bool Filters::Filter_Invert()
{
const int size = k_width*k_height;
const int maxVal = GetMaxSize();
OriginalImage& im_original = GetOriginal();
ResultImage& im_result = GetResult();
im_result.discard_data();
parallel_for_each(
concurrency::extent<2>(k_width, k_height),
[=](concurrency::index<2> idx) restrict(amp)
{
const int pos = GetPos(idx);
const int val = read_int16(im_original, pos);
write_int16(im_result, pos, maxVal - val);
});
return true;
}
int Filters::GetPos( const concurrency::index<2>& idx ) restrict(amp, cpu)
{
return idx[0] * Filters::k_height + idx[1];
}
And here comes the magic:
template <typename T>
unsigned int read_int16(T& arr, int idx) restrict(amp, cpu)
{
// Two 16-bit values per 32-bit word: the word is idx >> 1, the bit offset is 0 or 16.
return (arr[idx >> 1] & (0xFFFFu << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
}
template<typename T>
void write_int16(T& arr, int idx, unsigned int val) restrict(amp, cpu)
{
// The first XOR clears the 16-bit slot (x ^ x == 0); the second stores the new value in it.
atomic_fetch_xor(&arr[idx >> 1], arr[idx >> 1] & (0xFFFFu << ((idx & 0x1) << 4)));
atomic_fetch_xor(&arr[idx >> 1], (val & 0xFFFFu) << ((idx & 0x1) << 4));
}
Notice that these methods are for 16-bit data and won't work for 8 bits as they stand, but it shouldn't be too difficult to adapt them. In fact, this was based on an 8-bit version; unfortunately, I couldn't find the reference.
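One possible 8-bit adaptation, as a plain C++ sketch that runs on the CPU only (no restrict(amp), no atomics). The names read_u8, write_u8 and words_for_pixels are hypothetical, and the sketch assumes four pixels per 32-bit word with pixel 0 in the lowest byte, which is what a little-endian machine sees when an unsigned char buffer is reinterpreted as unsigned int:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers: treat a buffer of 32-bit words as packed 8-bit pixels.
// Pixel idx lives in word idx / 4, at bit offset (idx % 4) * 8.
inline std::uint32_t read_u8(const std::vector<std::uint32_t>& words, std::size_t idx)
{
    const unsigned shift = static_cast<unsigned>(idx & 0x3u) << 3; // 0, 8, 16 or 24
    return (words[idx >> 2] >> shift) & 0xFFu;
}

inline void write_u8(std::vector<std::uint32_t>& words, std::size_t idx, std::uint32_t val)
{
    const unsigned shift = static_cast<unsigned>(idx & 0x3u) << 3;
    std::uint32_t& word = words[idx >> 2];
    // Clear the 8-bit slot, then store the low 8 bits of val in it.
    word = (word & ~(std::uint32_t(0xFFu) << shift)) | ((val & 0xFFu) << shift);
}

// Number of 32-bit words needed to hold pixelCount 8-bit pixels, rounded up.
inline std::size_t words_for_pixels(std::size_t pixelCount)
{
    return (pixelCount + 3) / 4;
}
```

Inside a restrict(amp) kernel, several threads may write to different bytes of the same word, so the plain read-modify-write in write_u8 would presumably have to be replaced by atomic operations like the ones used in write_int16.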
Hope it helps.
David

Related

convert (n first bytes of) unsigned char pointer to float and double c++

Consider the following c++ code:
unsigned char* data = readData(..); //Let say data consist of 12 characters
unsigned int dataSize = getDataSize(...); //the size in byte of the data is also known (let say 12 bytes)
struct Position
{
float pos_x; //remember that float is 4 bytes
double pos_y; //remember that double is 8 bytes
};
Now I want to fill a Position variable/instance with data.
Position pos;
pos.pos_x = ? //data[0:4[ The first 4 bytes of data should be set to pos_x, since pos_x is of type float which is 4 bytes
pos.pos_y = ? //data[4:12[ The remaining 8 bytes of data should be set to pos_y which is of type double (8 bytes)
I know that in data, the first bytes correspond to pos_x and the rest to pos_y. That means the first 4 bytes of data should be used to fill pos_x and the remaining 8 bytes to fill pos_y, but I don't know how to do that.
Any idea? Thanks. Ps: I'm limited to c++11
You can use plain memcpy as another answer advises. I suggest packing memcpy into a function that also does error checking for you for most convenient and type-safe usage.
Example:
#include <cstring>
#include <stdexcept>
#include <type_traits>
struct ByteStreamReader {
unsigned char const* begin;
unsigned char const* const end;
template<class T>
operator T() {
static_assert(std::is_trivially_copyable<T>::value,
"The type you are using cannot be safely copied from bytes.");
if(end - begin < static_cast<decltype(end - begin)>(sizeof(T)))
throw std::runtime_error("ByteStreamReader");
T t;
std::memcpy(&t, begin, sizeof t);
begin += sizeof t;
return t;
}
};
struct Position {
float pos_x;
double pos_y;
};
int main() {
unsigned char data[12] = {};
unsigned dataSize = sizeof data;
ByteStreamReader reader{data, data + dataSize};
Position p;
p.pos_x = reader;
p.pos_y = reader;
}
One thing that you can do is to copy the data byte by byte. There is a standard function to do that: std::memcpy. Example usage:
assert(sizeof pos.pos_x == 4);
std::memcpy(&pos.pos_x, data, 4);
assert(sizeof pos.pos_y == 8);
std::memcpy(&pos.pos_y, data + 4, 8);
Note that simply copying the data only works if the data is in the same representation as the CPU uses, and different processors use different representations. Therefore, if your readData receives the data over the network, for example, a simple copy is not a good idea. The least you would have to do in that case is convert the endianness of the data to the native endianness (probably from big endian, which is conventionally used as the network byte order). Converting from one floating-point representation to another is much trickier, but luckily IEEE 754 is fairly ubiquitous.
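As a rough sketch of that endianness step, the following decodes a big-endian 32-bit float from a byte buffer. The helper names load_be32 and load_be_float are made up for illustration, and the code assumes the native float is a 32-bit IEEE 754 value that uses the same byte order as the native integers:

```cpp
#include <cstdint>
#include <cstring>

// Assemble a 32-bit value from four bytes stored most-significant byte first.
// This is independent of the host's own byte order.
inline std::uint32_t load_be32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Hypothetical helper: decode a big-endian IEEE 754 binary32 value.
// Assumption: the native float is also binary32 with the integer byte order.
inline float load_be_float(const unsigned char* p)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    const std::uint32_t bits = load_be32(p);
    float value;
    std::memcpy(&value, &bits, sizeof value); // reinterpret the bit pattern
    return value;
}
```

With something like this, pos.pos_x could be filled with load_be_float(data) when the sender is known to write big-endian floats; a double would need the analogous 64-bit version.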

Fastest way to produce a mask with n ones starting at position i

What is the fastest way (in terms of cpu cycles on common modern architecture), to produce a mask with len bits set to 1 starting at position pos:
template <class UIntType>
constexpr UIntType make_mask(std::size_t pos, std::size_t len)
{
// Body of the function
}
// Call of the function
auto mask = make_mask<uint32_t>(4, 10);
// mask = 00000000 00000000 00111111 11110000
// (in binary with MSB on the left and LSB on the right)
Plus, is there any compiler intrinsics or BMI function that can help?
Fastest way? I'd use something like this:
template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
return ((static_cast<T>(1) << len)-1) << pos;
}
If by "starting at pos", you mean that the lowest-order bit of the mask is at the position corresponding with 2^pos (as in your example):
((UIntType(1) << len) - UIntType(1)) << pos
If it is possible that len is ≥ the number of bits in UIntType, avoid Undefined Behaviour with a test:
(((len < std::numeric_limits<UIntType>::digits)
? UIntType(1)<<len
: 0) - UIntType(1)) << pos
(If it is also possible that pos is ≥ std::numeric_limits<UIntType>::digits, you'll need another ternary op test.)
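Putting both tests together might look like the sketch below. The name make_mask_checked is hypothetical, and returning 0 when pos is out of range is just one reasonable choice of behaviour, not something the question specifies:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Sketch: mask with len ones starting at bit pos, with both shift counts guarded.
// Written as a single return statement so it stays valid as C++11 constexpr.
template <class UIntType>
constexpr UIntType make_mask_checked(std::size_t pos, std::size_t len)
{
    static_assert(std::is_unsigned<UIntType>::value,
                  "UIntType must be an unsigned integer type");
    return pos >= static_cast<std::size_t>(std::numeric_limits<UIntType>::digits)
        ? UIntType(0)                                   // nothing fits in the type
        : static_cast<UIntType>(
              (len >= static_cast<std::size_t>(std::numeric_limits<UIntType>::digits)
                   ? std::numeric_limits<UIntType>::max()   // all ones, no shift by the full width
                   : static_cast<UIntType>((UIntType(1) << len) - UIntType(1)))
              << pos);                                  // bits shifted past the top are dropped
}
```

For the example in the question, make_mask_checked<uint32_t>(4, 10) gives 0x00003FF0.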
You could also use:
((UIntType(1)<<(len>>1)<<((len+1)>>1)) - UIntType(1)) << pos
which avoids the ternary op at the cost of three extra shift operators; I doubt whether it would be faster but careful benchmarking would be necessary to know for sure.
Maybe using a table? For type uint32_t you can write:
static uint32_t masks[] = { 0x0, 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f...}; // only 32 such masks
return masks[len] << pos;
Whatever is the int type the number of masks is not so huge and the table can be easily generated by templates.
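A minimal sketch of the table idea for uint32_t follows. Here the table is filled once at run time by a lambda rather than generated by templates, and it has 33 entries so that len can be anything from 0 to 32. The names low_masks and make_mask_table are made up for this example:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Lookup table of low-bit masks: entry len has the len lowest bits set.
inline const std::array<std::uint32_t, 33>& low_masks()
{
    static const std::array<std::uint32_t, 33> table = [] {
        std::array<std::uint32_t, 33> t{};              // t[0] == 0
        for (std::size_t len = 1; len < t.size(); ++len)
            t[len] = (t[len - 1] << 1) | 1u;            // one more set bit than the previous entry
        return t;
    }();
    return table;
}

// Hypothetical wrapper; requires len <= 32 and pos <= 31 (not checked here).
inline std::uint32_t make_mask_table(std::size_t pos, std::size_t len)
{
    return low_masks()[len] << pos;
}
```

Whether the memory load beats a couple of shifts depends on the surrounding code, so this would need benchmarking like the other variants.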
For BMI, maybe using BZHI (a BMI2 instruction)? Starting from all bits set, BZHI with index len clears every bit from position len upward, leaving len ones in the low bits; then shift by pos.
Speed is irrelevant here as the expression is constant, hence precomputed by the optimizer and in all likelihood used as an immediate operand. Whatever you use, it will cost you zero cycles.
The biggest issue here is the range of possible inputs. In C, shifts with a count greater than or equal to the type width are Undefined Behaviour. However, it looks like len can meaningfully range from 0 to the type width, e.g. 33 different lengths for uint32_t. With pos=0, we get masks from 0 to 0xFFFFFFFF. (I'm just going to assume 32-bit in English and asm for clarity, but use generic C++.)
If we can exclude either end of that range as possible inputs, then there are only 32 possible lengths, and we can use a left or right shift as a building block. (Use an assert() to verify the input range in debug builds.)
I put several versions (from other answers) of the function on the Godbolt compiler explorer with some macros to compile them with constant len, constant pos, or both inputs variable. Some do better than others. KIIV's looks good for the range it's valid for (len=0..31, pos=0..31).
This version works for len=1..32, and pos=0..31. It generates slightly worse x86-64 asm than KIIV's, so use KIIV's if it works without extra checks.
// right-shift a register of all-ones, then shift it into position.
// works for len=1..32 and pos=0..31
template <class T>
constexpr T make_mask_PJC(std::size_t pos, std::size_t len)
{
// T all_ones = -1LL;
// unsigned typebits = sizeof(T)*CHAR_BIT; // std::numeric_limits<T>::digits
// T len_ones = all_ones >> (typebits - len);
// return len_ones << pos
static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos; // pre-C++14 constexpr needs it all in one statement
}
// Same idea, but mask the shift count the same way x86 shift instructions do, so the compiler can do it for free.
// Doesn't always compile to ideal code with SHRX (BMI2), maybe gcc only knows about letting the shift instruction do the masking for the older SHR / SHL instructions
uint32_t make_mask_PJC_noUB(std::size_t pos, std::size_t len)
{
using T=uint32_t;
static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
T all_ones = -1LL;
unsigned typebits = std::numeric_limits<T>::digits;
T len_ones = all_ones >> ( (typebits - len) & (typebits-1)); // the AND optimizes away
return len_ones << (pos & (typebits-1));
// return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos; // pre-C++14 constexpr needs it all in one statement
}
If len can be anything in [0..32], I don't have any great ideas for efficient branchless code. Perhaps branching is the way to go.
uint32_t make_mask_fullrange(std::size_t pos, std::size_t len)
{
using T=uint32_t;
static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
T all_ones = -1LL;
unsigned typebits = std::numeric_limits<T>::digits;
//T len_ones = all_ones >> ( (typebits - len) & (typebits-1));
T len_ones = len==0 ? 0 : all_ones >> ( (typebits - len) & (typebits-1));
return len_ones << (pos & (typebits-1));
// return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos; // pre-C++14 constexpr needs it all in one statement
}

Getting wrong data back when reading from binary file

I'm having an issue reading in some bytes from a yuv file (it's 1280x720 if that matters) and was hoping someone could point out what I'm doing wrong. I'm getting different results using the read command and using an istream iterator. Here's some example code of what I'm trying to do:
void readBlock(std::ifstream& yuvFile, YUVBlock& destBlock, YUVConfig& config, const unsigned int x, const unsigned int y, const bool useAligned = false)
{
//Calculate luma offset
unsigned int YOffset = (useAligned ? config.m_alignedYFileOffset : config.m_YFileOffset) +
(destBlock.yY * (useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth) + destBlock.yX);// *config.m_bitDepth;
//Copy Luma data
//yuvFile.seekg(YOffset, std::istream::beg);
for (unsigned int lumaY = 0; lumaY < destBlock.m_YHeight && ((lumaY + destBlock.yY) < config.m_YUVHeight); ++lumaY)
{
yuvFile.seekg(YOffset + ((useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth)/* * config.m_bitDepth*/) * (lumaY), std::istream::beg);
int copySize = destBlock.m_YWidth;
if (destBlock.yX + copySize > config.m_YUVWidth)
{
copySize = config.m_YUVWidth - destBlock.yX;
}
if (destBlock.yX >= 1088 && destBlock.yY >= 704)
{
char* test = new char[9];
yuvFile.read(test, 9);
delete[] test;
yuvFile.seekg(YOffset + ((useAligned ? config.m_alignedYUVWidth : config.m_YUVWidth)/* * config.m_bitDepth*/) * (lumaY));
}
std::istream_iterator<uint8_t> start = std::istream_iterator<uint8_t>(yuvFile);
std::copy_n(start, copySize, std::back_inserter(destBlock.m_yData));
}
}
struct YUVBlock
{
std::vector<uint8_t> m_yData;
std::vector<uint8_t> m_uData;
std::vector<uint8_t> m_vData;
unsigned int m_YWidth;
unsigned int m_YHeight;
unsigned int m_UWidth;
unsigned int m_UHeight;
unsigned int m_VWidth;
unsigned int m_VHeight;
unsigned int yX;
unsigned int yY;
unsigned int uX;
unsigned int uY;
unsigned int vX;
unsigned int vY;
};
This error only seems to be happening at X =1088 and Y = 704 in the image. I'm expecting to see a byte value of 10 as the first byte I read back. When I use
yuvFile.read(test, 9);
I get 10 as my first byte. When I use the istream iterator:
std::istream_iterator<uint8_t> start = std::istream_iterator<uint8_t>(yuvFile);
std::copy_n(start, copySize, std::back_inserter(destBlock.m_yData));
The first byte I read is 17. 17 is the byte after 10 so it seems the istream iterator skips the first byte.
Any help would be appreciated
There is a major difference between istream::read and std::istream_iterator.
std::istream::read performs unformatted read.
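std::istream_iterator performs formatted read: it calls operator>>, which skips leading whitespace by default. The byte value 10 is the newline character, so it is treated as whitespace and skipped, and the next byte (17) is returned instead.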
From http://en.cppreference.com/w/cpp/iterator/istream_iterator
std::istream_iterator is a single-pass input iterator that reads successive objects of type T from the std::basic_istream object for which it was constructed, by calling the appropriate operator>>.
If your file was created using std::ostream::write or fwrite, you must use std::istream::read or fread to read the data.
If your file was created using any of the methods that create formatted output, such as std::ostream::operator<<() or fprintf, you have a chance to read the data using std::istream_iterator.

How to get values from unaligned memory in a standard way?

I know C++11 has some standard facilities which would allow to get integral values from unaligned memory. How could something like this be written in a more standard way?
template <class R>
inline R get_unaligned_le(const unsigned char p[], const std::size_t s) {
R r = 0;
for (std::size_t i = 0; i < s; i++)
r |= (*p++ & 0xff) << (i * 8); // take the first 8-bits of the char
return r;
}
To take the values stored in little-endian order, you can then write:
uint_least16_t value1 = get_unaligned_le<uint_least16_t > (&buffer[0], 2);
uint_least32_t value2 = get_unaligned_le<uint_least32_t > (&buffer[2], 4);
How did the integral values get into the unaligned memory to begin with?
If they were memcpyed in, then you can use memcpy to get them out.
If they were read from a file or the network, you have to know their
format: how they were written to begin with. If they are four byte
big-endian 2s complement (the usual network format), then something
like:
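// Supposes native int is at least 32 bits...
unsigned
getNetworkInt( unsigned char const* buffer )
{
// Convert each byte to unsigned before shifting so the top byte cannot overflow a signed int.
return static_cast<unsigned>(buffer[0]) << 24
| static_cast<unsigned>(buffer[1]) << 16
| static_cast<unsigned>(buffer[2]) << 8
| static_cast<unsigned>(buffer[3]);
}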
This will work for any unsigned type, provided the type you're aiming
for is at least as large as the type you input. For signed, it depends
on just how portable you want to be. If all of your potential target
machines are 2's complement, and will have an integral type with the
same size as your input type, then you can use exactly the same code as
above. If your native machine is a 1's complement 36 bit machine (e.g.
a Unisys mainframe), and you're reading signed network format integers
(32 bit 2's complement), you'll need some additional logic.
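Both cases from this answer can be sketched as small templates. The names load_native and load_le are hypothetical, and the little-endian variant assumes 8-bit bytes and an unsigned result type:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Case 1: the value was memcpy'd in, so memcpy it back out (native byte order).
// memcpy has no alignment requirement on its source pointer.
template <class R>
R load_native(const unsigned char* p)
{
    static_assert(std::is_trivially_copyable<R>::value, "R must be trivially copyable");
    R r;
    std::memcpy(&r, p, sizeof r);
    return r;
}

// Case 2: the bytes are in a known external format, here little-endian.
// Each byte is widened to R before shifting, so wide types do not overflow int.
// Requires count <= sizeof(R).
template <class R>
R load_le(const unsigned char* p, std::size_t count = sizeof(R))
{
    static_assert(std::is_unsigned<R>::value, "R must be an unsigned integer type");
    R r = 0;
    for (std::size_t i = 0; i < count; ++i)
        r = static_cast<R>(r | (static_cast<R>(p[i]) << (i * 8)));
    return r;
}
```

With these, the two reads in the question would become load_le<uint_least16_t>(&buffer[0], 2) and load_le<uint_least32_t>(&buffer[2], 4).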
As always, create the desired variable and populate it byte-wise:
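#include <algorithm>
#include <cassert>
#include <cstddef>
#include <type_traits>
template <typename R>
R get(unsigned char const * p, std::size_t len = sizeof(R))
{
// The type requirement is known at compile time; only the length needs a run-time check.
static_assert(std::is_trivially_copyable<R>::value, "R must be trivially copyable");
assert(len >= sizeof(R));
R result;
// static_cast cannot convert R* to unsigned char*; reinterpret_cast is required here.
std::copy(p, p + sizeof(R), reinterpret_cast<unsigned char *>(&result));
return result;
}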
This only works universally for trivially copyable types, though you can probably use it for non-trivial types if you have additional guarantees from elsewhere.

Reading bits in a memory

If I have a pointer to the start of a memory region, and I need to read the value packed in bits 30, 31, and 32 of that region, how can I read that value?
Use bit masks.
It depends on how big a byte is in your machine. The answer will vary depending on if you're zero- or one-indexing for those numbers. The following function returns 0 if the bit is 0 and non-zero if it is 1.
int getBit(char *buffer, int which)
{
int byte = which / CHAR_BIT;
int bit = which % CHAR_BIT;
return buffer[byte] & (1 << bit);
}
If your compiler can't optimize well enough to turn the division and mod operations into bit operations, you could do it explicitly, but I prefer this code for clarity.
(Edited to fix a bug and change to CHAR_BIT, which is a great idea.)
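To read a small multi-bit field such as the three bits in the question, the same indexing can be applied in a loop. This is only a sketch: read_bits is a made-up name, bits are numbered from zero with bit 0 being the least significant bit of the first byte, and the first bit read ends up in the least significant bit of the result. If the question's bits 30, 31 and 32 are counted from one, they correspond to firstBit = 29 here.

```cpp
#include <climits>
#include <cstddef>

// Collect count consecutive bits starting at firstBit into an unsigned value.
// Requires count to be no larger than the number of bits in unsigned.
inline unsigned read_bits(const unsigned char* buffer, std::size_t firstBit, std::size_t count)
{
    unsigned value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t which = firstBit + i;
        // Same byte/bit split as getBit, but normalised to 0 or 1.
        const unsigned bit = (buffer[which / CHAR_BIT] >> (which % CHAR_BIT)) & 1u;
        value |= bit << i;
    }
    return value;
}
```

Taking unsigned char rather than char avoids any surprises from a signed char type when the top bit of a byte is set.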
I'd probably generalize this answer to something like this:
template <typename T>
bool get_bit(const T& pX, size_t pBit)
{
if (pBit >= sizeof(pX) * CHAR_BIT)
throw std::invalid_argument("bit does not exist");
size_t byteOffset = pBit / CHAR_BIT;
size_t bitOffset = pBit % CHAR_BIT;
char byte = (&reinterpret_cast<const char&>(pX))[byteOffset];
unsigned mask = 1U << bitOffset;
return (byte & mask) != 0;
}
Bit easier to use:
int i = 12345;
bool abit = get_bit(i, 4);
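On a 32-bit system, you can simply read the first 32-bit word through the pointer and shift it right by 29 (taking bits 30 to 32 as the three most significant bits, counted from one). If you need the bit values in place, AND the word with 0xE0000000 instead.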