I know that the 'nearest' method of image resizing is the fastest one. Nevertheless, I am looking for ways to speed it up further.
An obvious first step is to precalculate the indices:
void CalcIndex(int sizeS, int sizeD, int colors, int* idx)
{
    float scale = (float)sizeS / sizeD;
    for (int i = 0; i < sizeD; ++i)
    {
        int index = (int)::floor((i + 0.5f) * scale);
        idx[i] = Min(Max(index, 0), sizeS - 1) * colors;
    }
}
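As a sanity check on the mapping this produces: for a 6 -> 4 downscale, scale = 1.5, and the destination pixel centers map to source pixels 0, 2, 3, 5. A standalone sketch of the per-pixel formula (the helper name NearestIndex is introduced here just for illustration; it omits the clamp and channel multiplier):

```cpp
#include <cassert>
#include <cmath>

// Same per-pixel mapping as CalcIndex above, without the clamp and the
// channel multiplier; NearestIndex is a name invented for this sketch.
int NearestIndex(int i, float scale)
{
    return (int)std::floor((i + 0.5f) * scale);
}
```

For upsampling the same formula repeats source indices instead of skipping them.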
template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}
template<int colors> void Resize(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices (VLAs; see CalcIndex)
    CalcIndex(srcH, dstH, 1, idxY);
    CalcIndex(srcW, dstW, colors, idxX);
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<colors>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}
Do further optimization steps exist, for example using SIMD or some other optimization technique?
P.S. I am especially interested in optimizing RGB (colors = 3).
With the current code I see that an ARGB image (colors = 4) is processed about 50% faster than RGB, despite being about 30% bigger.
The speed problem in (SIMD-based) resize algorithms comes from the mismatch of indexing input and output elements. When e.g. the resize factor is 6/5, one needs to consume 6 pixels and write 5. OTOH SIMD register width of 16 bytes maps to either 16 grayscale elements, 4 RGBA-elements or 5.33 RGB-elements.
My experience is that a sufficiently good performance (maybe not optimal, but often beating opencv and other freely available implementations) comes when trying to write 2-4 SIMD registers worth of data at a time, reading the required number of linear bytes from the input + some, and using pshufb in x86 SSSE3 or vtbl in Neon to gather load from the registers -- never from memory. One needs of course a fast mechanism to either calculate the LUT indices inline, or to precalculate the indices, which are shared between different output rows.
One should prepare to have several inner kernels depending on the input/output ratio of the (horizontal) resolution.
RGBrgbRGBrgbRGBr|gbRGBrgb .... <- input
^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, from
0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index
Notice that the LUT method can handle all channel counts.
// inner kernel for downsampling between 1x and almost 2x
// - we need to read at most 32 elements and write 16
void process_row_ds(uint8_t const *input, uint8_t const *indices,
int const *advances, uint8_t *output, int out_width) {
do {
auto a = load16_bytes(input);
auto b = load16_bytes(input + 16);
auto c = load16_bytes(indices);
a = lut32(a,b,c); // get 16 bytes out of 32
store16_bytes(output, a);
output += 16;
input += *advances++;
} while (out_width--); // multiples of 16...
}
// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
int const *advances, uint8_t *output, int out_width) {
do {
auto a = load16_bytes(input);
auto c = load16_bytes(indices);
a = lut16(a, c); // get 16 bytes out of 16
store16_bytes(output, a);
output += 16;
input += *advances++;
} while (out_width--);
}
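A scalar sketch of how the `indices` and `advances` tables fed to the kernels above could be precalculated for an RGB24 row (the names RowLut and BuildRowLut are mine, not from the kernels): each block of 16 output bytes gets 16 register-relative gather indices, plus the number of input bytes to advance before the next block's loads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hedged sketch: precompute pshufb/vtbl-style LUT indices and per-block
// input advances for a nearest-neighbor horizontal resize of RGB24 rows.
struct RowLut
{
    std::vector<uint8_t> indices; // 16 shuffle indices per 16-byte output block
    std::vector<int> advances;    // input-pointer advance after each block
};

RowLut BuildRowLut(int srcW, int dstW)
{
    RowLut lut;
    const int outBytes = dstW * 3;
    int base = 0; // source byte offset where the current block's load starts
    for (int blockStart = 0; blockStart < outBytes; blockStart += 16)
    {
        // source byte feeding the first lane of this block
        const int firstByte = std::min(blockStart, outBytes - 1);
        const int nextBase = firstByte / 3 * srcW / dstW * 3 + firstByte % 3;
        if (blockStart != 0)
            lut.advances.push_back(nextBase - base);
        base = nextBase;
        for (int lane = 0; lane < 16; ++lane)
        {
            const int outByte = std::min(blockStart + lane, outBytes - 1);
            const int srcByte = outByte / 3 * srcW / dstW * 3 + outByte % 3;
            lut.indices.push_back((uint8_t)(srcByte - base)); // register-relative
        }
    }
    lut.advances.push_back(0); // nothing left to consume after the last block
    return lut;
}
```

For moderate downscales the relative indices stay below 32, matching the two-register lut32 in the kernel above; for upscales they stay below 16.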
I would also encourage using some elementary filtering for downsampling, such as Gaussian binomial kernels (1 1, 1 2 1, 1 3 3 1, 1 4 6 4 1, ...) along with hierarchical downsampling, in addition to (at least) bilinear interpolation. It's of course possible that the application will tolerate aliasing artifacts -- the cost AFAIK is often not that large, especially given that otherwise the algorithm will be memory bound.
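For instance, a minimal 1 2 1 binomial prefilter over one grayscale row could look like the sketch below (Binomial121 is a name invented for this example; it is not tied to the SIMD kernels above). Edges are clamped, and the +2 term rounds to nearest:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hedged sketch of the cheapest binomial prefilter (1 2 1) applied to one
// grayscale row before halving the resolution; edges are clamped.
std::vector<uint8_t> Binomial121(const std::vector<uint8_t>& row)
{
    const int n = (int)row.size();
    std::vector<uint8_t> out(n);
    for (int i = 0; i < n; ++i)
    {
        const int l = row[i > 0 ? i - 1 : 0];
        const int r = row[i + 1 < n ? i + 1 : n - 1];
        out[i] = (uint8_t)((l + 2 * row[i] + r + 2) / 4); // +2 rounds to nearest
    }
    return out;
}
```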
I think that using _mm256_i32gather_epi32 (AVX2) can give some performance gain for resizing in the case of 32-bit pixels:
inline void Gather32bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
__m256i _idx = _mm256_loadu_si256((__m256i*)idx);
__m256i val = _mm256_i32gather_epi32((int*)src, _idx, 1);
_mm256_storeu_si256((__m256i*)dst, val);
}
template<> void Resize<4>(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices (see CalcIndex)
    CalcIndex(srcH, dstH, 1, idxY);
    CalcIndex(srcW, dstW, 4, idxX);
    int dstW8 = dstW & ~(8 - 1); // round down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8 * 4)
            Gather32bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<4>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}
P.S. After some modifications this method can be applied to RGB24 as well (note that the 32-bit gathers read one byte past each 3-byte pixel, so the last pixel of a row needs special handling):
const __m256i K8_SHUFFLE = _mm256_setr_epi8(
0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);
inline void Gather24bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
__m256i _idx = _mm256_loadu_si256((__m256i*)idx);
__m256i bgrx = _mm256_i32gather_epi32((int*)src, _idx, 1);
__m256i bgr = _mm256_permutevar8x32_epi32(
_mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
_mm256_storeu_si256((__m256i*)dst, bgr);
}
template<> void Resize<3>(const uint8_t* src, int srcW, int srcH,
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW]; // pre-calculated indices (see CalcIndex)
    CalcIndex(srcH, dstH, 1, idxY);
    CalcIndex(srcW, dstW, 3, idxX);
    int dstW8 = dstW & ~(8 - 1); // round down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t* srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        // note: Gather24bit stores a full 32-byte vector, so the destination
        // row needs 8 bytes of slack after the last vectorized store
        for (; dx < dstW8; dx += 8, offset += 8 * 3)
            Gather24bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;
    }
}
Note that if srcW < dstW, then the method of @Aki-Suihkonen is faster.
It’s possible to use SIMD, and I’m pretty sure it will help; unfortunately, it’s relatively hard. Below is a simplified example which only supports image enlargement, not shrinking.
Still, I hope it might be useful as a starting point.
Both MSVC and GCC compile the hot loop in LineResize::apply method into 11 instructions. I think 11 instructions for 16 bytes should be faster than your version.
#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>
// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
// Each mask produces up to 16 output bytes.
// For enlargement exactly 16, for shrinking up to 16, possibly even 0.
std::vector<__m128i> masks;
// Length is the same as masks.
// For enlargement, the values contain source pointer offsets in bytes.
// For shrinking, the values contain destination pointer offsets in bytes.
std::vector<uint8_t> offsets;
// True if this class will enlarge images, false if it will shrink the width of the images.
bool enlargement;
void resizeFields( size_t vectors )
{
masks.resize( vectors, _mm_set1_epi32( -1 ) );
offsets.resize( vectors, 0 );
}
public:
// Compile the shuffle table. The arguments are line widths in pixels.
LineResize( size_t source, size_t dest );
// Apply the algorithm to a single line of the image.
void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};
LineResize::LineResize( size_t source, size_t dest )
{
const size_t sourceBytes = source * 3;
const size_t destBytes = dest * 3;
assert( sourceBytes >= 16 );
assert( destBytes >= 16 );
// Possible to do much faster without any integer divides.
// Optimizing this sample for simplicity.
if( sourceBytes < destBytes )
{
// Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
enlargement = true;
resizeFields( ( destBytes + 15 ) / 16 );
int8_t* pMasks = (int8_t*)masks.data();
uint8_t* const pOffsets = offsets.data();
int sourceOffset = 0;
const size_t countVectors = masks.size();
for( size_t i = 0; i < countVectors; i++ )
{
const int destSlice = (int)i * 16;
std::array<int, 16> lanes;
int lane;
for( lane = 0; lane < 16; lane++ )
{
const int destByte = destSlice + lane; // output byte index
const int destPixel = destByte / 3; // output pixel index
const int channel = destByte % 3; // output byte within pixel
const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
const int sourceByte = sourcePixel * 3 + channel; // input byte
if( destByte < (int)destBytes )
lanes[ lane ] = sourceByte;
else
{
// Destination offset out of range, i.e. the last SIMD vector
break;
}
}
// Produce the offset
if( i == 0 )
assert( lanes[ 0 ] == 0 );
else
{
const int off = lanes[ 0 ] - sourceOffset;
assert( off >= 0 && off <= 16 );
pOffsets[ i - 1 ] = (uint8_t)off;
sourceOffset = lanes[ 0 ];
}
// Produce the masks
for( int j = 0; j < lane; j++ )
pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
// The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
// no need to handle remainder for the last vector.
pMasks += 16;
}
}
else
{
// Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
enlargement = false;
resizeFields( ( sourceBytes + 15 ) / 16 );
// Not implemented, but the same idea works fine for this too.
// The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
assert( false );
}
}
void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
const __m128i* pm = masks.data();
const __m128i* const pmEnd = pm + masks.size();
const uint8_t* po = offsets.data();
__m128i mask, source;
if( enlargement )
{
// One iteration of the loop produces 16 output bytes
// In MSVC results in 11 instructions for 16 output bytes.
while( pm < pmEnd )
{
mask = _mm_load_si128( pm );
pm++;
source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
rsi += *po;
po++;
_mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
rdi += 16;
}
}
else
{
// One iteration of the loop consumes 16 input bytes
while( pm < pmEnd )
{
mask = _mm_load_si128( pm );
pm++;
source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
rsi += 16;
_mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
rdi += *po;
po++;
}
}
}
// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>&vec )
{
assert( !vec.empty() );
assert( 0 == ( vec.size() % 3 ) );
const uint8_t* rsi = vec.data();
const uint8_t* const rsiEnd = rsi + vec.size();
while( rsi < rsiEnd )
{
const uint32_t r = rsi[ 0 ];
const uint32_t g = rsi[ 1 ];
const uint32_t b = rsi[ 2 ];
rsi += 3;
const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
printf( "%06X ", res );
}
printf( "\n" );
}
// A trivial test to resize 24 pixels -> 32 pixels
int main()
{
constexpr int sourceLength = 24;
constexpr int destLength = 32;
// Initialize sample input with 24 RGB pixels
std::vector<uint8_t> input( sourceLength * 3 );
for( size_t i = 0; i < input.size(); i++ )
input[ i ] = (uint8_t)i;
printf( "Input: " );
printPixels( input );
// That special handling of the last pixels of last line is missing from this example.
static_assert( 0 == destLength % 16 );
LineResize resizer( sourceLength, destLength );
std::vector<uint8_t> result( destLength * 3 );
resizer.apply( result.data(), input.data() );
printf( "Output: " );
printPixels( result );
return 0;
}
The code ignores alignment issues. For production, you’d want another method for the last line of the image which doesn’t run all the way to the end, and instead handles the last few pixels with scalar code.
The code has more memory references in the hot loop. However, the two vectors in that class are not too long; for 4K images their size is about 12 kB, so they should fit in the L1D cache and stay there.
If you have AVX2, it will probably improve things further. For enlarging images, use _mm256_inserti128_si256: the vinserti128 instruction can load 16 bytes from memory into the high half of the vector. Similarly, for shrinking images, use _mm256_extracti128_si256: the instruction has an option to use a memory destination.
If an array cannot be divided by 8 (for integers), what is the best way to write a loop for it? The possible way I have figured out so far is to split it into 2 separate loops: one main loop for almost all elements, and one tail loop with maskload/maskstore for the remaining 1-7 elements. But it does not look like the best way.
for (auto i = 0; i < vec.size() - 8; i += 8) {
__m256i va = _mm256_loadu_si256((__m256i*) & vec[i]);
//do some work
_mm256_storeu_si256((__m256i*) & vec[i], va);
}
for (auto i = vec.size() - vec.size() % 8; i < vec.size(); i += 8) {
auto tmp = (vec.size() % 8) + 1;
char chArr[8] = {};
for (auto j = 0; j < 8; ++j) {
chArr[j] -= --tmp;
}
__m256i mask = _mm256_setr_epi32(chArr[0],
chArr[1], chArr[2], chArr[3], chArr[4], chArr[5], chArr[6], chArr[7]);
__m256i va = _mm256_maskload_epi32(&vec[i], mask);
//do some work
_mm256_maskstore_epi32(&vec[i], mask, va);
}
Could it be made to look better without hurting performance? Just removing the second for-loop in favor of a single masked load doesn’t help much, because it only saves 1 line out of a dozen.
If I put maskload/maskstore in the main loop, it slows it down significantly. There is also no maskloadu/maskstoreu, so I can’t use this approach for unaligned arrays.
To expand on Yves' idea of prebuilding masks, here is one way to structure it:
#include <vector>
#include <immintrin.h>
void foo(std::vector<int>& vec)
{
std::size_t size = vec.size();
int* data = vec.data();
std::size_t i;
for(i = 0; i + 8 <= size; i += 8) {
__m256i va = _mm256_loadu_si256((__m256i*) (data + i));
asm volatile ("" : : : "memory"); // more work here
_mm256_storeu_si256((__m256i*) (data + i), va);
}
static const int maskarr[] = {
-1, -1, -1, -1, -1, -1, -1, -1,
0, 0, 0, 0, 0, 0, 0, 0
};
if(i < size) {
__m256i mask = _mm256_loadu_si256((const __m256i*)(
maskarr + (i + 8 - size)));
__m256i va = _mm256_maskload_epi32(data + i, mask);
asm volatile ("" : : : "memory"); // more work here
_mm256_maskstore_epi32(data + i, mask, va);
}
}
A few notes:
As mentioned in my comment, i + 8 <= vec.size() is safer as it avoids a possible wrap-around if vec.size() is 7 or lower
Use size_t or ptrdiff_t instead of int for such loop counters
The if to skip over the last part is important. Masked memory operations with an all-zero mask are very slow
The static mask array can be slimmed by two elements since we know we never access an all-filled or all-zero mask array
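A scalar illustration of why the sliding window into the mask array produces the right lanes (ActiveLanes is a name invented here): for a remainder of r elements, reading 8 values starting at offset 8 - r yields exactly r leading -1 lanes followed by zeros, which is what _mm256_maskload_epi32 needs.

```cpp
#include <cassert>
#include <cstdint>

// The same 16-entry table as in the answer above: a window of 8 ints taken
// at offset (8 - r) contains r copies of -1 followed by zeros.
static const int32_t maskarr_demo[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0,  0
};

int ActiveLanes(int r) // r = number of remaining elements, 1..7
{
    const int32_t* window = maskarr_demo + (8 - r);
    int active = 0;
    for (int lane = 0; lane < 8; ++lane)
        if (window[lane] == -1)
            ++active;
    return active;
}
```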
I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays that aren't a multiple of these numbers.
For example, how would you add two std::vectors of 53 integers each? Would we slice as many of the vector that would fit in the SIMD vector and just manually process the remainder? Is there a better way to do this?
Would we slice as many of the vector that would fit in the SIMD vector and just manually process the remainder? Is there a better way to do this?
Pretty much this. Here is a basic example that processes all numbers in batches of 8 and uses maskload/maskstore to handle the remainder.
void add(int* const r, const int* const a, const int* const b, const unsigned count) {
// how many blocks of 8, and how many left over
const unsigned c8 = count & ~0x7U;
const unsigned cr = count & 0x7U;
// process blocks of 8
for(unsigned i = 0; i < c8; i += 8) {
__m256i _a = _mm256_loadu_si256((__m256i*)(a + i));
__m256i _b = _mm256_loadu_si256((__m256i*)(b + i));
__m256i _c = _mm256_add_epi32(_a, _b);
_mm256_storeu_si256((__m256i*)(r + i), _c);
}
const __m128i temp[5] = {
_mm_setr_epi32(0, 0, 0, 0),
_mm_setr_epi32(-1, 0, 0, 0),
_mm_setr_epi32(-1, -1, 0, 0),
_mm_setr_epi32(-1, -1, -1, 0),
_mm_setr_epi32(-1, -1, -1, -1)
};
// I'm using mask load / mask store for the remainder here.
// (this is not the only approach)
__m256i mask;
if(cr >= 4) {
mask = _mm256_set_m128i(temp[cr&3], temp[4]);
} else {
mask = _mm256_set_m128i(temp[0], temp[cr]);
}
__m256i _a = _mm256_maskload_epi32((a + c8), mask);
__m256i _b = _mm256_maskload_epi32((b + c8), mask);
__m256i _c = _mm256_add_epi32(_a, _b);
_mm256_maskstore_epi32((r + c8), mask, _c);
}
Of course, if you happen to use your own containers (or provide your own allocators), then you can avoid most of this mess by simply ensuring all container allocations occur in multiples of 256bits.
// yes, this class is missing a lot...
class MyIntArray {
public:
MyIntArray(unsigned count, const int* data) {
// bump capacity to next multiple of 8
unsigned cap = count & 7;
if(cap) cap = 8 - cap;
capacity = cap + count;
// capacity is a multiple of 8 ints (unaligned loads are used below anyway)
alloc = new int[capacity];
size = count;
memcpy(alloc, data, sizeof(int) * size);
}
MyIntArray(unsigned count) {
// bump capacity to next multiple of 8
unsigned cap = count & 7;
if(cap) cap = 8 - cap;
capacity = cap + count;
// capacity is a multiple of 8 ints (unaligned loads are used below anyway)
alloc = new int[capacity];
size = count;
}
unsigned capacity;
unsigned size;
int* alloc;
int* begin() { return alloc; }
int* end() { return alloc + size; }
const int* begin() const { return alloc; }
const int* end() const { return alloc + size; }
};
void add(MyIntArray r, const MyIntArray a, const MyIntArray b) {
// process blocks of 8.
// we may be stamping beyond the end of the array, but not over the
// the end of the capacity allocation....
// (probably also want to check to see if the sizes match!).
for(unsigned i = 0; i < r.size; i += 8) {
__m256i _a = _mm256_loadu_si256((__m256i*)(a.alloc + i));
__m256i _b = _mm256_loadu_si256((__m256i*)(b.alloc + i));
__m256i _c = _mm256_add_epi32(_a, _b);
_mm256_storeu_si256((__m256i*)(r.alloc + i), _c);
}
}
I have a fairly simple loop:
auto indexRecord = getRowPointer(0);
bool equals;
// recordCount is about 6 000 000
for (int i = 0; i < recordCount; ++i) {
equals = BitString::equals(SelectMask, indexRecord, maxBytesValue);
rowsFound += equals;
indexRecord += byteSize; // byteSize is 7
}
Where BitString::equals is:
static inline bool equals(const char * mask, const char * record, uint64_t maxVal) {
return !(((*( uint64_t * ) mask) & (maxVal & *( uint64_t * ) record)) ^ (maxVal & *( uint64_t * ) record));
}
This code is used to simulate a Bitmap Index querying in databases.
My question is whether there's a way to vectorize the loop going through all the records.
When compiling with GCC and -fopt-info-vec-missed -O3 I get: missed: couldn't vectorize loop.
I am new to this kind of optimization and would like to learn more; it just feels like I am missing something.
EDIT
First of all, thank you all for the answers. I should have included a minimal reproducible example.
Here it is now, with all the functionality needed, as close as I could get to the real thing. All of this is on an x86-64 platform, and I have both GCC and Clang available.
#include <iostream>
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <bitset>
#include <ctime>
#include <cstdlib>
constexpr short BYTE_SIZE = 8;
class BitString {
public:
static int getByteSizeFromBits(int bitSize) {
return (bitSize + BYTE_SIZE - 1) / BYTE_SIZE;
}
static void setBitString(char *rec, int bitOffset) {
rec[bitOffset / 8] |= (1 << (bitOffset % BYTE_SIZE));
}
static inline bool equals(const char *mask, const char *record, uint64_t maxVal) {
return !(((*(uint64_t *) mask) & (maxVal & *(uint64_t *) record)) ^ (maxVal & *(uint64_t *) record));
}
};
// Class representing a table schema
class TableSchema {
public:
// number of attributes of a table
unsigned int attrs_count = -1;
// the attribute size in bytes, eg. 3 equals to something like CHAR(3) in SQL
unsigned int *attr_sizes = nullptr;
// max value (domain) of an attribute, -1 for unlimited
int *attr_max_values = nullptr;
// the offset of each attribute, to simplify some pointer arithmetic for further use
unsigned int *attribute_offsets = nullptr;
// sum of attr_sizes is the record size
unsigned int record_size = -1;
void calculate_offsets() {
if (attrs_count <= 0 || attribute_offsets != nullptr) {
return;
}
attribute_offsets = new unsigned int[attrs_count];
int offset = 0;
for (int i = 0; i < attrs_count; ++i) {
attribute_offsets[i] = offset;
offset += attr_sizes[i];
}
record_size = offset;
}
TableSchema() = default;
~TableSchema() {
if (attribute_offsets != nullptr) {
delete[] attribute_offsets;
attribute_offsets = nullptr;
}
attrs_count = -1;
}
};
class BitmapIndex {
private:
char *mData = nullptr;
short bitSize = 0;
int byteSize = 0;
int attrsCount = 0;
int *attrsMaxValue = nullptr;
int *bitIndexAttributeOffset = nullptr;
unsigned int recordCount = 0;
char *SelectMask;
unsigned int capacity = 0;
inline char *getRowPointer(unsigned int rowId) const {
return mData + rowId * byteSize;
}
inline bool shouldColBeIndexed(int max_col_value) const {
return max_col_value > 0;
}
public:
BitmapIndex(const int *attrs_max_value, int attrs_count, unsigned int capacity) {
auto maxValuesSum = 0;
attrsMaxValue = new int[attrs_count];
attrsCount = attrs_count;
bitIndexAttributeOffset = new int[attrs_count];
auto bitOffset = 0;
// attribute's max value is the same as number of bits used to encode the current value
// e.g., if attribute's max value is 3, we use 001 to represent value 1, 010 for 2, 100 for 3 and so on
for (int i = 0; i < attrs_count; ++i) {
attrsMaxValue[i] = attrs_max_value[i];
bitIndexAttributeOffset[i] = bitOffset;
// col is indexed only if its max value is > 0; -1 means the column is not indexed
if (!shouldColBeIndexed(attrs_max_value[i]))
continue;
maxValuesSum += attrs_max_value[i];
bitOffset += attrs_max_value[i];
}
bitSize = (short) maxValuesSum;
byteSize = BitString::getByteSizeFromBits(bitSize);
mData = new char[byteSize * capacity];
memset(mData, 0, byteSize * capacity);
SelectMask = new char[byteSize];
this->capacity = capacity;
}
~BitmapIndex() {
if (mData != nullptr) {
delete[] mData;
mData = nullptr;
delete[] attrsMaxValue;
attrsMaxValue = nullptr;
delete[] SelectMask;
SelectMask = nullptr;
}
}
unsigned long getTotalByteSize() const {
return byteSize * capacity;
}
// add record to index
void addRecord(const char * record, const unsigned int * attribute_sizes) {
auto indexRecord = getRowPointer(recordCount);
unsigned int offset = 0;
for (int j = 0; j < attrsCount; ++j) {
if (attrsMaxValue[j] != -1) {
// byte col value
char colValue = *(record + offset);
if (colValue > attrsMaxValue[j]) {
throw std::runtime_error("Col value is bigger than max allowed value!");
}
// printf("%d ", colValue);
BitString::setBitString(indexRecord, bitIndexAttributeOffset[j] + colValue);
}
offset += attribute_sizes[j];
}
recordCount += 1;
}
// SELECT COUNT(*)
int Select(const char *query) const {
uint64_t rowsFound = 0;
memset(SelectMask, 0, byteSize);
for (int col = 0; col < attrsCount; ++col) {
if (!shouldColBeIndexed(attrsMaxValue[col])) {
continue;
}
auto col_value = query[col];
if (col_value < 0) {
for (int i = 0; i < attrsMaxValue[col]; ++i) {
BitString::setBitString(SelectMask, bitIndexAttributeOffset[col] + i);
}
} else {
BitString::setBitString(SelectMask, bitIndexAttributeOffset[col] + col_value);
}
}
uint64_t maxBytesValue = 0;
uint64_t byteVals = 0xff;
for (int i = 0; i < byteSize; ++i) {
maxBytesValue |= byteVals << (i * 8);
}
auto indexRecord = getRowPointer(0);
for (int i = 0; i < recordCount; ++i) {
rowsFound += BitString::equals(SelectMask, indexRecord, maxBytesValue);
indexRecord += byteSize;
}
return rowsFound;
}
};
void generateRecord(
char *record,
const unsigned int attr_sizes[],
const int attr_max_value[],
int attr_count
) {
auto offset = 0;
for (int c = 0; c < attr_count; ++c) {
if (attr_max_value[c] == -1) {
for (int j = 0; j < attr_sizes[c]; ++j) {
record[offset + j] = rand() % 256;
}
} else {
for (int j = 0; j < attr_sizes[c]; ++j) {
record[offset + j] = rand() % attr_max_value[c];
}
}
offset += attr_sizes[c];
}
}
int main() {
TableSchema schema;
const int attribute_count = 13;
const int record_count = 1000000;
// for simplicity sake, attr_max_value > 0 is set only for attributes, which size is 1.
unsigned int attr_sizes[attribute_count] = {1, 5, 1, 5, 1, 1, 1, 6, 1, 1, 1, 11, 1};
int attr_max_values[attribute_count] = {3, -1, 4, -1, 6, 5, 7, -1, 7, 6, 5, -1, 8};
schema.attrs_count = attribute_count;
schema.attr_sizes = attr_sizes;
schema.attr_max_values = attr_max_values;
schema.calculate_offsets();
srand((unsigned ) time(nullptr));
BitmapIndex bitmapIndex(attr_max_values, attribute_count, record_count);
char *record = new char[schema.record_size];
for (int i = 0; i < record_count; ++i) {
// generate some random records and add them to the index
generateRecord(record, attr_sizes, attr_max_values, attribute_count);
bitmapIndex.addRecord(record, attr_sizes);
}
char query[attribute_count] = {-1, -1, 0, -1, -1, 3, 2, -1, 3, 3, 4, -1, 6};
// simulate Select COUNT(*) WHERE a1 = -1, a2 = -1, a3 = 0, ...
auto found = bitmapIndex.Select(query);
printf("Query found: %d records\n", found);
delete[] record;
return 0;
}
If the record size were 8, both GCC and Clang would autovectorize it. For example (hopefully a sufficiently representative stand-in for the actual context in which the code occurs):
int count(char * indexRecord, const char * SelectMask, uint64_t maxVal)
{
bool equals;
uint64_t rowsFound = 0;
// some arbitrary number of records
for (int i = 0; i < 1000000; ++i) {
equals = BitString::equals(SelectMask, indexRecord, maxVal);
rowsFound += equals;
indexRecord += 8; // record size padded out to 8
}
return rowsFound;
}
The important part of it, as compiled by GCC, looks like this:
.L4:
vpand ymm0, ymm2, YMMWORD PTR [rdi]
add rdi, 32
vpcmpeqq ymm0, ymm0, ymm3
vpsubq ymm1, ymm1, ymm0
cmp rax, rdi
jne .L4
Not bad. It uses the same ideas I would have used manually: vpand the data with a mask (a simplification of your bitwise logic), compare it with zero, and subtract the comparison results (subtract because a True result is indicated with -1) from 4 counters packed in a vector. The four separate counts are added after the loop.
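In scalar terms, the simplification the compiler found is that your test ((mask & v) == v, with v = maxVal & record) is equivalent to ANDing the record with a precomputed constant and comparing with zero; the function names below are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// The original test from BitString::equals, written out on scalars.
bool equalsOriginal(uint64_t mask, uint64_t record, uint64_t maxVal)
{
    uint64_t v = maxVal & record;
    return !((mask & v) ^ v);
}

// Equivalent form: one AND with the constant (maxVal & ~mask), then a
// compare with zero -- exactly the vpand + vpcmpeqq pair in the compiled loop.
bool equalsAndZero(uint64_t mask, uint64_t record, uint64_t maxVal)
{
    return (record & (maxVal & ~mask)) == 0;
}
```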
By the way, note that I made rowsFound an uint64_t. That's important. If rowsFound is not 64-bit, then both Clang and GCC will try very hard to narrow the count ASAP, which is exactly the opposite of a good approach: that costs many more instructions in the loop, and has no benefit. If the count is intended to be a 32-bit int in the end, it can simply be narrowed after the loop, where it is probably not merely cheap but actually free to do that.
Something equivalent to that code would not be difficult to write manually with SIMD intrinsics, that could make the code less brittle (it wouldn't be based on hoping that compilers will do the right thing), but it wouldn't work for non-x86 platforms anymore.
If the records are supposed to stay 7 bytes, that's a more annoying problem to deal with. GCC gives up, while Clang goes ahead with its auto-vectorization, but the result is not good: the 8-byte loads are all done individually and then put together in a vector, which is a big waste of time.
When doing it manually with SIMD intrinsics, the main problem is unpacking the 7-byte records into qword lanes. An SSE4.1 version could use pshufb to do this easily (pshufb is from SSSE3, but pcmpeqq is from SSE4.1, so it makes sense to target SSE4.1). An AVX2 version could do a load that starts 2 bytes before the first record it's trying to load, such that the "split" between the two 128-bit halves of the 256-bit register falls between two records. Then vpshufb, which cannot move bytes from one 128-bit half to the other, can still move the bytes into place because none of them need to cross into the other half.
For example, an AVX2 version with manual vectorization and 7-byte records could look something like this. This requires either some padding at both the end and the start, or just skip the first record and end before hitting the last record and handle those separately. Not tested, but it would at least give you some idea of how code with manual vectorization would work.
int count(char * indexRecord, uint64_t SelectMask, uint64_t maxVal)
{
__m256i mask = _mm256_set1_epi64x(~SelectMask & maxVal);
__m256i count = _mm256_setzero_si256();
__m256i zero = _mm256_setzero_si256();
__m256i shufmask = _mm256_setr_epi8(2, 3, 4, 5, 6, 7, 8, -1, 9, 10, 11, 12, 13, 14, 15, -1, 0, 1, 2, 3, 4, 5, 6, -1, 7, 8, 9, 10, 11, 12, 13, -1);
for (int i = 0; i < 1000000; ++i) {
__m256i records = _mm256_loadu_si256((__m256i*)(indexRecord - 2));
indexRecord += 7 * 4;
records = _mm256_shuffle_epi8(records, shufmask);
__m256i isZero = _mm256_cmpeq_epi64(_mm256_and_si256(records, mask), zero);
count = _mm256_sub_epi64(count, isZero);
}
__m128i countA = _mm256_castsi256_si128(count);
__m128i countB = _mm256_extracti128_si256(count, 1);
countA = _mm_add_epi64(countA, countB);
return _mm_cvtsi128_si64(countA) + _mm_extract_epi64(countA, 1);
}
Here’s another approach. This code doesn’t use unaligned load tricks (which is especially valuable if you align your input data by 16 bytes), but it uses more instructions overall because of the extra shuffles, and it only operates on 16-byte SSE vectors.
I have no idea how it compares to the other answers; it may be either faster or slower. The code requires the SSSE3 and SSE4.1 instruction sets.
// Load 7 bytes from memory into the vector
inline __m128i load7( const uint8_t* rsi )
{
__m128i v = _mm_loadu_si32( rsi );
v = _mm_insert_epi16( v, *(const uint16_t*)( rsi + 4 ), 2 );
v = _mm_insert_epi8( v, rsi[ 6 ], 6 );
return v;
}
// Prepare mask vector: broadcast the mask, and duplicate the high byte
inline __m128i loadMask( uint64_t mask )
{
__m128i vec = _mm_cvtsi64_si128( (int64_t)mask );
const __m128i perm = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 6, 0, 1, 2, 3, 4, 5, 6, 6 );
return _mm_shuffle_epi8( vec, perm );
}
// Prepare needle vector: load 7 bytes, duplicate 7-th byte into 8-th, duplicate 8-byte lanes
inline __m128i loadNeedle( const uint8_t* needlePointer, __m128i mask )
{
__m128i vec = load7( needlePointer );
const __m128i perm = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 6, 0, 1, 2, 3, 4, 5, 6, 6 );
vec = _mm_shuffle_epi8( vec, perm );
return _mm_and_si128( vec, mask );
}
// Compare first 14 bytes with the needle, update the accumulator
inline void compare14( __m128i& acc, __m128i vec, __m128i needle, __m128i mask )
{
// Shuffle the vector matching the needle and mask; this duplicates two last bytes of each 7-byte record
const __m128i perm = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 12, 13, 13 );
vec = _mm_shuffle_epi8( vec, perm );
// bitwise AND with the mask
vec = _mm_and_si128( vec, mask );
// Compare 8-byte lanes for equality with the needle
vec = _mm_cmpeq_epi64( vec, needle );
// Increment the accumulator if comparison was true
acc = _mm_sub_epi64( acc, vec );
}
size_t countRecords( const uint8_t* rsi, size_t count, const uint8_t* needlePointer, uint64_t maskValue )
{
const __m128i mask = loadMask( maskValue );
const __m128i needle = loadNeedle( needlePointer, mask );
__m128i acc = _mm_setzero_si128();
// An iteration of this loop consumes 16 records = 112 bytes = 7 SSE vectors
const size_t countBlocks = count / 16;
for( size_t i = 0; i < countBlocks; i++ )
{
const __m128i* p = ( const __m128i* )rsi;
rsi += 7 * 16;
__m128i a = _mm_loadu_si128( p );
compare14( acc, a, needle, mask );
__m128i b = _mm_loadu_si128( p + 1 );
compare14( acc, _mm_alignr_epi8( b, a, 14 ), needle, mask );
a = _mm_loadu_si128( p + 2 );
compare14( acc, _mm_alignr_epi8( a, b, 12 ), needle, mask );
b = _mm_loadu_si128( p + 3 );
compare14( acc, _mm_alignr_epi8( b, a, 10 ), needle, mask );
a = _mm_loadu_si128( p + 4 );
compare14( acc, _mm_alignr_epi8( a, b, 8 ), needle, mask );
b = _mm_loadu_si128( p + 5 );
compare14( acc, _mm_alignr_epi8( b, a, 6 ), needle, mask );
a = _mm_loadu_si128( p + 6 );
compare14( acc, _mm_alignr_epi8( a, b, 4 ), needle, mask );
compare14( acc, _mm_srli_si128( a, 2 ), needle, mask );
}
// Sum high / low lanes of the accumulator
acc = _mm_add_epi64( acc, _mm_srli_si128( acc, 8 ) );
// Handle the remainder, 7 bytes per iteration
// Compared to your 6M records, the remainder is small, the performance doesn't matter much.
for( size_t i = 0; i < count % 16; i++ )
{
__m128i a = load7( rsi );
rsi += 7;
compare14( acc, a, needle, mask );
}
return (size_t)_mm_cvtsi128_si64( acc );
}
P.S. Also, I would expect 8-byte indices to be faster despite the 15% RAM bandwidth overhead. Especially when vectorizing into AVX2.
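For reference, loadMask, loadNeedle and load7 are helpers not shown above; here are plausible sketches under the same record layout (their exact semantics are assumptions, not the original implementation):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Broadcast the 7-byte record mask to both 8-byte lanes of the vector.
inline __m128i loadMask( uint64_t maskValue )
{
    return _mm_set1_epi64x( (int64_t)maskValue );
}

// Load the 7-byte needle, duplicate its last byte (mirroring what the
// pshufb permutation in compare14 does to the input), pre-apply the mask,
// and broadcast to both lanes so it lines up with the shuffled records.
inline __m128i loadNeedle( const uint8_t* needlePointer, __m128i mask )
{
    uint64_t n = 0;
    memcpy( &n, needlePointer, 7 );
    n |= (uint64_t)needlePointer[ 6 ] << 56;
    return _mm_and_si128( _mm_set1_epi64x( (int64_t)n ), mask );
}

// Load 7 bytes into the low lane without reading past the buffer's end.
inline __m128i load7( const uint8_t* p )
{
    uint64_t v = 0;
    memcpy( &v, p, 7 );
    return _mm_cvtsi64_si128( (int64_t)v );
}
```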
First, your code is not a complete example. You're missing definitions and types of many variables, which makes it difficult to answer. You also did not indicate which platform you're compiling on/for.
Here are reasons why vectorization might fail:
Your reads are overlapping! You're reading 8 bytes at 7-byte intervals. That alone might confuse the vectorization logic.
Your pointers may not be __restrict'ed, meaning that the compiler must assume they might alias, meaning that it might need to reread from the address on every access.
Your equals() function pointer parameters are definitely not __restrict'ed (although the compiler could be seeing through that with inlining).
Alignment. x86_64 processors do not require aligned accesses, but on some platforms, some larger instructions need to know they work on properly aligned places in memory. Moreover, as #PeterCordes points out in a comment, compilers and libraries may be more picky than the hardware regarding alignment.
Why don't you put *SelectMask in a local variable?
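To illustrate the aliasing point, a minimal sketch (the function and names are made up; `__restrict` is the common spelling on MSVC/GCC/Clang, `restrict` in C99):

```cpp
#include <cstddef>
#include <cstdint>

// Without __restrict the compiler must assume dst and src may overlap,
// so it can be forced to re-read src after every store to dst.
// With it, the loop is trivially vectorizable.
void addOffset( uint8_t* __restrict dst, const uint8_t* __restrict src,
                size_t n, uint8_t offset )
{
    for( size_t i = 0; i < n; i++ )
        dst[ i ] = (uint8_t)( src[ i ] + offset );
}
```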
I have a binary data file that contains 2d and 3d coordinates in such order:
uint32 numberOfUVvectors;
2Dvec uv[numberOfUVvectors];
uint32 numberOfPositionVectors;
3Dvec position[numberOfPositionVectors];
uint32 numberOfNormalVectors;
3Dvec normal[numberOfNormalVectors];
2Dvec and 3Dvec are structs composed from 2 and 3 floats respectively.
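(As an aside, identifiers like `2Dvec` cannot actually start with a digit in C++; treat them as placeholders.) The layout implies tightly packed structs along these lines (names illustrative):

```cpp
// Assumed record layouts matching the file format above.
struct Vec2 { float u, v; };    // one "2Dvec" record, 2 floats = 8 bytes
struct Vec3 { float x, y, z; }; // one "3Dvec" record, 3 floats = 12 bytes

// Reading whole arrays straight from a byte buffer is only safe if
// the structs carry no padding:
static_assert(sizeof(Vec2) == 2 * sizeof(float), "Vec2 must be packed");
static_assert(sizeof(Vec3) == 3 * sizeof(float), "Vec3 must be packed");
```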
At first, I read all these values using the "usual" way:
in.read(reinterpret_cast<char *>(&num2d), sizeof(uint32));
2Dvectors.reserve(num2d); // It's for an std::vector<2DVec> 2Dvectors();
for (int i = 0; i < num2d; i++){
2Dvec 2Dvector;
in.read(reinterpret_cast<char *>(&2Dvector), sizeof(2DVec));
2Dvectors.push_back(2Dvector);
}
It worked fine, but it was painfully slow (there can be more than 200k entries in a file and with so many read calls, the hdd access became a bottleneck). I decided to read the entire file into a buffer at once:
in.seekg (0, in.end);
int length = in.tellg();
in.seekg (0, in.beg);
char * buffer = new char [length];
in.read (buffer,length);
The reading is way faster now, but here's the question: how to parse that char buffer back into integers and structs?
To answer your specific question:
unsigned char * pbuffer = (unsigned char *)buffer;
uint32 num2d = *((uint32 *)pbuffer);
pbuffer += sizeof(uint32);
if(num2d)
{
2Dvec * p2Dvec = (2Dvec *)pbuffer;
2Dvectors.assign(p2Dvec, p2Dvec + num2d);
pbuffer += (num2d * sizeof(2Dvec));
}
uint32 numpos = *((uint32 *)pbuffer);
pbuffer += sizeof(uint32);
if(numpos)
{
3Dvec * p3Dvec = (3Dvec *)pbuffer;
Pos3Dvectors.assign(p3Dvec, p3Dvec + numpos);
pbuffer += (numpos * sizeof(3Dvec));
}
uint32 numnorm = *((uint32 *)pbuffer);
pbuffer += sizeof(uint32);
if(numnorm)
{
3Dvec * p3Dvec = (3Dvec *)pbuffer;
Normal3Dvectors.assign(p3Dvec, p3Dvec + numnorm);
pbuffer += (numnorm * sizeof(3Dvec));
}
// do not forget to release the allocated buffer
An even faster way would be:
in.read(reinterpret_cast<char *>(&num2d), sizeof(uint32));
if(num2d)
{
2Dvectors.resize(num2d);
2Dvec * p2Dvec = &2Dvectors[0];
in.read(reinterpret_cast<char *>(p2Dvec), num2d * sizeof(2Dvec)); // note: p2Dvec, not &p2Dvec -- we read into the vector's storage
}
//repeat for position & normal vectors
Use memcpy with the appropriate sizes and start values
or cast the values (example):
#include <iostream>
void copy_array(void *a, void const *b, std::size_t size, int amount)
{
std::size_t bytes = size * amount;
for (std::size_t i = 0; i < bytes; ++i)
reinterpret_cast<char *>(a)[i] = static_cast<char const *>(b)[i];
}
int main()
{
int a[10], b[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
copy_array(a, b, sizeof(b[0]), 10);
for (int i = 0; i < 10; ++i)
std::cout << a[i] << ' ';
}
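Applied to the buffer in the question, the memcpy route might look like this sketch (`Vec2` stands in for the packed "2Dvec" record; the name is illustrative):

```cpp
#include <cstring>
#include <cstdint>
#include <vector>

struct Vec2 { float u, v; }; // assumed packed 8-byte record

// Parse one length-prefixed array out of a raw byte buffer,
// advancing the cursor past the bytes consumed.
// memcpy is safe regardless of the buffer's alignment.
std::vector<Vec2> parseVec2Array(const char*& cursor)
{
    uint32_t count = 0;
    std::memcpy(&count, cursor, sizeof(count));
    cursor += sizeof(count);
    std::vector<Vec2> out(count);
    std::memcpy(out.data(), cursor, count * sizeof(Vec2));
    cursor += count * sizeof(Vec2);
    return out;
}
```

Repeat the same pattern for the position and normal arrays.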
I have a text file being saved by a matrix library containing a 2D matrix as such:
1 0 0
6 0 4
0 1 1
Where each number is represented with a colored pixel. I am looking for some insight as to how I'd go about solving this problem. If any more information is required, do not hesitate to ask.
EDIT: Another approach I've tried is: fwrite(&intmatrix, size,1, bmp_ptr); where I pass in the matrix pointer, which does not seem to output a readable BMP file. The value of size is the rows*cols of course, and the type of matrix is arma::Mat<int> which is a matrix from the Armadillo Linear Algebra Library.
EDIT II: Reading this indicated that my size should probably be rows*cols*4 given the size of the rows if I am not mistaken, any guidance on this point as well would be great.
Here's an app which generates a text file of random integers, reads them back, and writes them to disk as a (roughly square) 32-bit-per-pixel .BMP image.
Note, I made a number of assumptions on things like the format of the original text file, the range of numbers, etc., but they are documented in the code. With this working example you should be able to tweak them easily, if necessary.
// IntToBMP.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <random>
#include <ctime>
#include <memory>
#include <cmath> // ceil / sqrt used when sizing the bitmap
#pragma pack( push, 1 )
struct BMP
{
BMP();
struct
{
uint16_t ID;
uint32_t fileSizeInBytes;
uint16_t reserved1;
uint16_t reserved2;
uint32_t pixelArrayOffsetInBytes;
} FileHeader;
enum class CompressionMethod : uint32_t { BI_RGB = 0x00,
BI_RLE8 = 0x01,
BI_RLE4 = 0x02,
BI_BITFIELDS = 0x03,
BI_JPEG = 0x04,
BI_PNG = 0x05,
BI_ALPHABITFIELDS = 0x06 };
struct
{
uint32_t headerSizeInBytes;
uint32_t bitmapWidthInPixels;
uint32_t bitmapHeightInPixels;
uint16_t colorPlaneCount;
uint16_t bitsPerPixel;
CompressionMethod compressionMethod;
uint32_t bitmapSizeInBytes;
int32_t horizontalResolutionInPixelsPerMeter;
int32_t verticalResolutionInPixelsPerMeter;
uint32_t paletteColorCount;
uint32_t importantColorCount;
} DIBHeader;
};
#pragma pack( pop )
BMP::BMP()
{
//Initialized fields
FileHeader.ID = 0x4d42; // == 'BM' (little-endian)
FileHeader.reserved1 = 0;
FileHeader.reserved2 = 0;
FileHeader.pixelArrayOffsetInBytes = sizeof( FileHeader ) + sizeof( DIBHeader );
DIBHeader.headerSizeInBytes = 40;
DIBHeader.colorPlaneCount = 1;
DIBHeader.bitsPerPixel = 32;
DIBHeader.compressionMethod = CompressionMethod::BI_RGB;
DIBHeader.horizontalResolutionInPixelsPerMeter = 2835; // == 72 ppi
DIBHeader.verticalResolutionInPixelsPerMeter = 2835; // == 72 ppi
DIBHeader.paletteColorCount = 0;
DIBHeader.importantColorCount = 0;
}
void Exit( void )
{
std::cout << "Press a key to exit...";
std::getchar();
exit( 0 );
}
void MakeIntegerFile( const std::string& integerFilename )
{
const uint32_t intCount = 1 << 20; //Generate 1M (2^20) integers
std::unique_ptr< int32_t[] > buffer( new int32_t[ intCount ] );
std::mt19937 rng;
uint32_t rngSeed = static_cast< uint32_t >( time( NULL ) );
rng.seed( rngSeed );
std::uniform_int_distribution< int32_t > dist( INT32_MIN, INT32_MAX );
for( size_t i = 0; i < intCount; ++i )
{
buffer[ i ] = dist( rng );
}
std::ofstream writeFile( integerFilename, std::ofstream::binary );
if( !writeFile )
{
std::cout << "Error writing " << integerFilename << ".\n";
Exit();
}
writeFile << buffer[ 0 ];
for( size_t i = 1; i < intCount; ++i )
{
writeFile << " " << buffer[ i ];
}
}
int _tmain(int argc, _TCHAR* argv[]) //Replace with int main( int argc, char* argv[] ) if you're not under Visual Studio
{
//Assumption: 32-bit signed integers
//Assumption: Distribution of values range from INT32_MIN through INT32_MAX, inclusive
//Assumption: number of integers contained in file are unknown
//Assumption: source file of integers is a series of space-delimitied strings representing integers
//Assumption: source file's contents are valid
//Assumption: non-rectangular numbers of integers yield non-rectangular bitmaps (final scanline may be short)
// This may cause some .bmp parsers to fail; others may pad with 0's. For simplicity, this implementation
// attempts to render square bitmaps.
const std::string integerFilename = "integers.txt";
const std::string bitmapFilename = "bitmap.bmp";
std::cout << "Creating file of random integers...\n";
MakeIntegerFile( integerFilename );
std::vector< int32_t >integers; //If quantity of integers being read is known, reserve or resize vector or use array
//Read integers from file
std::cout << "Reading integers from file...\n";
{ //Nested scope will release ifstream resource when no longer needed
std::ifstream readFile( integerFilename );
if( !readFile )
{
std::cout << "Error reading " << integerFilename << ".\n";
Exit();
}
std::string number;
while( std::getline( readFile, number, ' ' ) )
{
integers.push_back( std::stoi( number ) );
}
if( integers.size() == 0 )
{
std::cout << "No integers read from " << integerFilename << ".\n";
Exit();
}
}
//Construct .bmp
std::cout << "Constructing .BMP...\n";
BMP bmp;
size_t intCount = integers.size();
bmp.DIBHeader.bitmapSizeInBytes = intCount * sizeof( integers[ 0 ] );
bmp.FileHeader.fileSizeInBytes = bmp.FileHeader.pixelArrayOffsetInBytes + bmp.DIBHeader.bitmapSizeInBytes;
bmp.DIBHeader.bitmapWidthInPixels = static_cast< uint32_t >( ceil( sqrt( intCount ) ) );
bmp.DIBHeader.bitmapHeightInPixels = static_cast< uint32_t >( ceil( intCount / static_cast< float >( bmp.DIBHeader.bitmapWidthInPixels ) ) );
//Write integers to .bmp file
std::cout << "Writing .BMP...\n";
{
std::ofstream writeFile( bitmapFilename, std::ofstream::binary );
if( !writeFile )
{
std::cout << "Error writing " << bitmapFilename << ".\n";
Exit();
}
writeFile.write( reinterpret_cast< char * >( &bmp ), sizeof( bmp ) );
writeFile.write( reinterpret_cast< char * >( &integers[ 0 ] ), bmp.DIBHeader.bitmapSizeInBytes );
}
//Exit
Exit();
}
Hope this helps.
If you choose the right image format this is very easy. PGM has an ASCII variant that looks almost exactly like your matrix, but with a header.
P2
3 3
6
1 0 0
6 0 4
0 1 1
Where P2 is the magic for ASCII PGM, the size is 3x3 and 6 is the maxval. I chose 6 because that was the maximum value you presented, which makes 6 white (while 0 is black). In a typical PGM that's 255, which is consistent with an 8-bit grayscale image.
PPM is almost as simple, it just has 3 color components per pixel instead of 1.
You can operate on these images with anything that takes PPM (netpbm, ImageMagick, GIMP, etc). You can resave them as binary PPMs which are basically the same size as an equivalent BMP.
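A minimal sketch of writing such an ASCII PGM from a plain row-major array (the path and parameters are up to you; for an arma::Mat you would index with m(i,j) instead):

```cpp
#include <fstream>

// Write a grayscale matrix as ASCII PGM (P2): magic, width, height,
// maxval, then the samples row by row.
void writePGM(const char* path, const int* data, int rows, int cols, int maxval)
{
    std::ofstream out(path);
    out << "P2\n" << cols << " " << rows << "\n" << maxval << "\n";
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out << data[i * cols + j] << (j + 1 < cols ? ' ' : '\n');
}
```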
To output a readable BMP file, you need to put a header first:
#include <WinGDI.h>
DWORD dwSizeInBytes = rows*cols*4; // when your matrix contains RGBX data
// fill in the headers
BITMAPFILEHEADER bmfh;
bmfh.bfType = 0x4D42; // 'BM'
bmfh.bfSize = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER) + dwSizeInBytes;
bmfh.bfReserved1 = 0;
bmfh.bfReserved2 = 0;
bmfh.bfOffBits = sizeof(BITMAPFILEHEADER) + sizeof(BITMAPINFOHEADER);
BITMAPINFOHEADER bmih;
bmih.biSize = sizeof(BITMAPINFOHEADER);
bmih.biWidth = cols;
bmih.biHeight = rows;
bmih.biPlanes = 1;
bmih.biBitCount = 32;
bmih.biCompression = BI_RGB;
bmih.biSizeImage = 0;
bmih.biXPelsPerMeter = 0;
bmih.biYPelsPerMeter = 0;
bmih.biClrUsed = 0;
bmih.biClrImportant = 0;
Now before you write your color information, just write the bitmap header
fwrite(&bmfh, sizeof(bmfh),1, bmp_ptr);
fwrite(&bmih, sizeof(bmih),1, bmp_ptr);
And finally the color information:
fwrite(intmatrix.memptr(), sizeof(int), size, bmp_ptr);
Note that fwrite's arguments are (pointer, element size, element count, stream): the element size is sizeof(int) and the count is size (rows*cols), as your matrix doesn't contain single characters, but an integer for each value. Also note memptr(): passing &intmatrix would write the arma::Mat object itself (including its bookkeeping members), not its element data. Depending on the content of your matrix, it might be a good idea to convert the values to COLORREF values (check the RGB macro, which can be found in WinGDI.h, too)
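For that conversion step, a portable sketch of what such a packing amounts to (a 32bpp BMP pixel sits in memory as B,G,R,X, i.e. 0x00RRGGBB as a little-endian uint32_t; the grayscale mapping via maxVal is an assumption):

```cpp
#include <cstdint>

// Map a matrix value onto a 32bpp BMP pixel (0x00RRGGBB).
// Here the value is scaled to a gray level; swap in any palette you like.
uint32_t toPixel(int value, int maxVal)
{
    uint32_t g = (uint32_t)(value * 255 / maxVal) & 0xFF;
    return (g << 16) | (g << 8) | g; // R == G == B: grayscale
}
```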
I've rewritten and commented the answer from https://stackoverflow.com/a/2654860/586784. I hope you find it clear enough.
#include <cstddef>
#include <armadillo>
#include <map>
#include <cstdio>
#include <cassert>
///Just a tiny struct to bundle three values in range [0-255].
struct Color{
Color(unsigned char red, unsigned char green, unsigned char blue)
: red(red),green(green),blue(blue)
{}
///Default-constructed Color() is black.
Color()
: red(0),green(0),blue(0)
{}
///Each color is represented by a combination of red, green, and blue.
unsigned char red,green,blue;
};
int main(int argc,char**argv)
{
///The width of the image. Replace with your own.
std::size_t w = 7;
///The height of the image. Replace with your own
std::size_t h = 8;
///http://arma.sourceforge.net/docs.html#Mat
///The Armadillo Linear Algebra Library Mat constructor is of the following
/// signature: mat(n_rows, n_cols).
arma::Mat<int> intmatrix(h,w);
///Fill out matrix, replace this with your own.
{
///Zero fill matrix
for(std::size_t i=0; i<h; ++i)
for(std::size_t j=0;j<w; ++j)
intmatrix(i,j) = 0;
intmatrix(0,3) = 1;
intmatrix(1,3) = 1;
intmatrix(2,2) = 6;
intmatrix(2,4) = 6;
intmatrix(3,2) = 4;
intmatrix(3,4) = 4;
intmatrix(4,1) = 6;
intmatrix(4,2) = 6;
intmatrix(4,3) = 6;
intmatrix(4,4) = 6;
intmatrix(4,5) = 6;
intmatrix(5,1) = 1;
intmatrix(5,2) = 1;
intmatrix(5,3) = 1;
intmatrix(5,4) = 1;
intmatrix(5,5) = 1;
intmatrix(6,0) = 4;
intmatrix(6,6) = 4;
intmatrix(7,0) = 6;
intmatrix(7,6) = 6;
}
///Integer to color associations. This is a map
///that records the meanings of the integers in the matrix.
///It associates a color with each integer.
std::map<int,Color> int2color;
///Fill out the color associations. Replace this with your own associations.
{
///When we see 0 in the matrix, we will use this color (red-ish).
int2color[0] = Color(255,0,0);
///When we see 1 in the matrix, we will use this color (green-ish).
int2color[1] = Color(0,255,0);
///When we see 4 in the matrix, we will use this color (blue-ish).
int2color[4] = Color(0,0,255);
///When we see 6 in the matrix, we will use this color (grey-ish).
int2color[6] = Color(60,60,60);
}
///The file size will consist of w*h pixels, each pixel will have an RGB,
/// where each color R,G,B is 1 byte, making the data part of the file to
/// be of size 3*w*h. In addition there is a header to the file which will
/// take of 54 bytes as we will see.
std::size_t filesize = 54 + 3*w*h;
///We make an array of 14 bytes to represent one part of the header.
///It is filled out with some default values, and we will fill in the
///rest momentarily.
unsigned char bmpfileheader[14] = {'B','M', 0,0,0,0, 0,0, 0,0, 54,0,0,0};
///The second part of the header is 40 bytes; again we fill it with some
///default values, and will fill in the rest soon.
unsigned char bmpinfoheader[40] = {40,0,0,0, 0,0,0,0, 0,0,0,0, 1,0, 24,0};
///We will now store the filesize,w,h into the header.
///We can't just write them to the file directly, because different platforms
///encode their integers in different ways. This is called "endianness"
///or "byte order". So we chop our integers up into bytes, and put them into
///the header byte-by-byte in the way we need to.
///Encode the least significant 8 bits of filesize into this byte.
///Because sizeof(unsigned char) is one byte, and one byte is eight bits,
///when filesize is cast to (unsigned char) only the least significant
///8 bits are kept and stored into the byte.
bmpfileheader[ 2] = (unsigned char)(filesize );
///...Now we shift filesize right by 8 bits (one byte) and truncate
///that to its least significant 8 bits. This gets stored in the next
///byte.
bmpfileheader[ 3] = (unsigned char)(filesize>> 8);
///...
bmpfileheader[ 4] = (unsigned char)(filesize>>16);
///Encodes the most significant 8 bits of filesize into this byte.
bmpfileheader[ 5] = (unsigned char)(filesize>>24);
///Now we will store w (the width of the image) in the same way,
/// but into bytes [4-7] of bmpinfoheader.
bmpinfoheader[ 4] = (unsigned char)( w );
bmpinfoheader[ 5] = (unsigned char)( w>> 8);
bmpinfoheader[ 6] = (unsigned char)( w>>16);
bmpinfoheader[ 7] = (unsigned char)( w>>24);
///Now we will store h (the height of the image) in the same way,
/// but into bytes [8-11] of bmpinfoheader.
bmpinfoheader[ 8] = (unsigned char)( h );
bmpinfoheader[ 9] = (unsigned char)( h>> 8);
bmpinfoheader[10] = (unsigned char)( h>>16);
bmpinfoheader[11] = (unsigned char)( h>>24);
///Now we open the output file
FILE* f = fopen("img.bmp","wb");
///First write the bmpfileheader to the file. It is 14 bytes.
///The 1 means we are writing 14 elements of size 1.
///Remember, bmpfileheader is an array which is basically
///the same thing as saying it is a pointer to the first element
///in an array of contiguous elements. We can thus say:
///write 14 bytes, starting from the spot where bmpfileheader points
///to.
fwrite(bmpfileheader,1,14,f);
///Then write the bmpinfoheader, which is 40 bytes, in the same way.
fwrite(bmpinfoheader,1,40,f);
///Now we write the data.
///For each row (there are h rows), starting from the last, going
///up to the first.
///We iterate through the rows in reverse order here,
///apparently in the BMP format, the image
///is stored upside down.
for(std::size_t i=h-1; i != std::size_t(-1); --i)
{
///For each column in the row,
for(std::size_t j=0; j<w; ++j)
{
///We retrieve the integer of the matrix at (i,j),
///and assert that there is a color defined for it.
assert (int2color.count(intmatrix(i,j)) != 0
&& "Integer in matrix not defined in int2color map");
///We somehow get the color for pixel (i,j).
///In our case, we get it from the intmatrix, and looking
///up the integer's color.
Color color = int2color[intmatrix(i,j)];
///Now the colors are written in reverse order: BGR
///We write the color using fwrite, by taking a pointer
///of the (unsigned char), which is the same thing as
///an array of length 1. Then we write the byte.
///First for blue,
fwrite(&color.blue,1,1,f);
///Same for green,
fwrite(&color.green,1,1,f);
///Finally red.
fwrite(&color.red,1,1,f);
}
///Now we pad the row to a multiple of 4 bytes, adding 0-3 zero bytes depending on the width.
unsigned char bmppad[3] = {0,0,0};
fwrite(bmppad,1,(4-(w*3)%4)%4,f);
}
///Close the file.
fclose(f);
return 0;
}
Is your problem seeing the matrix as an image, or writing an image from your code?
In the former case, just do as Ben Jackson said.
In the latter case, you want to pass the address of the data pointer of the arma::Mat, and using fwrite assumes that arma::Mat holds its data as a contiguous memory array.
[edit]
A brief look at the Armadillo docs also tells that data is stored in column-major order, but BMP assumes row-major order, so your image will look flipped.
[edit2]
Using Armadillo Matrix functions, it's even simpler
// assume A is a matrix,
// and maxVal is the maximum int value in your matrix (you might scale it to maxVal = 255)
std::ofstream outfile("name.pgm");
outfile << "P2" << std::endl << A.n_cols << " " << A.n_rows << std::endl << maxVal << std::endl;
outfile << A << std::endl;
outfile.close();