How do I align a pointer to a 16 byte boundary?
I found this code, not sure if its correct
char* p= malloc(1024);
if ((((unsigned long) p) % 16) != 0)
{
unsigned char *chpoint = (unsigned char *)p;
chpoint += 16 - (((unsigned long) p) % 16);
p = (char *)chpoint;
}
Would this work?
thanks
C++0x proposes std::align, which does just that.
// get some memory
T* const p = ...;
std::size_t const size = ...;
void* start = p;
std::size_t space = size;
void* aligned = std::align(16, 1024, p, space);
if(aligned == nullptr) {
// failed to align
} else {
// here, p is aligned to 16 and points to at least 1024 bytes of memory
// also p == aligned
// size - space is the amount of bytes used for alignment
}
which seems very low-level. I think
// also available in Boost flavour
using storage = std::aligned_storage_t<1024, 16>;
auto p = new storage;
also works. You can easily run afoul of aliasing rules though if you're not careful. If you had a precise scenario in mind (fit N objects of type T at a 16 byte boundary?) I think I could recommend something nicer.
Try this:
It returns aligned memory and frees the memory, with virtually no extra memory management overhead.
#include <malloc.h>
#include <assert.h>
size_t roundUp(size_t a, size_t b) { return (1 + (a - 1) / b) * b; }
// we assume here that size_t and void* can be converted to each other
void *malloc_aligned(size_t size, size_t align = sizeof(void*))
{
assert(align % sizeof(size_t) == 0);
assert(sizeof(void*) == sizeof(size_t)); // not sure if needed, but whatever
void *p = malloc(size + 2 * align); // allocate with enough room to store the size
if (p != NULL)
{
size_t base = (size_t)p;
p = (char*)roundUp(base, align) + align; // align & make room for storing the size
((size_t*)p)[-1] = (size_t)p - base; // store the size before the block
}
return p;
}
void free_aligned(void *p) { free(p != NULL ? (char*)p - ((size_t*)p)[-1] : p); }
Warning:
I'm pretty sure I'm stepping on parts of the C standard here, but who cares. :P
In glibc library malloc, realloc always returns 8 bytes aligned. If you want to allocate memory with some alignment which is a higher power 2 then you can use memalign and posix_memalign. Read http://www.gnu.org/s/hello/manual/libc/Aligned-Memory-Blocks.html
posix_memalign is one way: http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_memalign.html as long as your size is a power of two.
The problem with the solution you provide is that you run the risk of writing off the end of your allocated memory. An alternative solution is to alloc the size you want + 16 and to use a similar trick to the one you're doing to get a pointer that is aligned, but still falls within your allocated region. That said, I'd use posix_memalign as a first solution.
Updated: New Faster Algorithm
Don't use modulo because it takes hundreds of clock cycles on x86 due to the nasty division and a lot more on other systems. I came up with a faster version of std::align than GCC and Visual-C++. Visual-C++ has the slowest implementation, which actually uses an amateurish conditional statement. GCC is very similar to my algorithm but I did the opposite of what they did but my algorithm is 13.3 % faster because it has 13 as opposed to 15 single-cycle instructions. See here is the research paper with dissassembly. The algorithm is actually one instruction faster if you use the mask instead of the pow_2.
/* Quickly aligns the given pointer to a power of two boundaries.
#return An aligned pointer of typename T.
#desc Algorithm is a 2's compliment trick that works by masking off
the desired number in 2's compliment and adding them to the
pointer. Please note how I took the horizontal comment whitespace back.
#param pointer The pointer to align.
#param mask Mask for the lower LSb, which is one less than the power of
2 you wish to align too. */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
intptr_t value = reinterpret_cast<intptr_t>(pointer);
value += (-value) & mask;
return reinterpret_cast<T*>(value);
}
Here is how you call it:
enum { kSize = 256 };
char buffer[kSize + 16];
char* aligned_to_16_byte_boundary = AlignUp<> (buffer, 15); //< 16 - 1 = 15
char16_t* aligned_to_64_byte_boundary = AlignUp<char16_t> (buffer, 63);
Here is the quick bit-wise proof for 3 bits, it works the same for all bit counts:
~000 = 111 => 000 + 111 + 1 = 0x1000
~001 = 110 => 001 + 110 + 1 = 0x1000
~010 = 101 => 010 + 101 + 1 = 0x1000
~011 = 100 => 011 + 100 + 1 = 0x1000
~100 = 011 => 100 + 011 + 1 = 0x1000
~101 = 010 => 101 + 010 + 1 = 0x1000
~110 = 001 => 110 + 001 + 1 = 0x1000
~111 = 000 => 111 + 000 + 1 = 0x1000
Just in case you're here to learn how to align to a cache line an object in C++11, use the in-place constructor:
struct Foo { Foo () {} };
Foo* foo = new (AlignUp<Foo> (buffer, 63)) Foo ();
Here is the std::align implmentation, it uses 24 instructions where the GCC implementation uses 31 instructions, though it can be tweaked to eliminate a decrement instruction by turning (--align) to the mask for the Least Significant bits but that would not operate functionally identical to std::align.
inline void* align(size_t align, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & (--align);
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}
Faster to Use mask rather than pow_2
Here is the code for aligning using a mask rather than the the pow_2 (which is the even power of 2). This is 20% fatert than the GCC algorithm but requires you to store the mask rather than the pow_2 so it's not interchangable.
inline void* AlignMask(size_t mask, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & mask;
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}
few things:
don't change the pointer returned by the malloc/new: you'll need it later to free the memory;
make sure your buffer is big enough after adjusting the alignment
use size_t instead of unsigned long, since size_t guaranteed to have the same size as the pointer, as opposed to anything else:
here's the code:
size_t size = 1024; // this is how many bytes you need in the aligned buffer
size_t align = 16; // this is the alignment boundary
char *p = (char*)malloc(size + align); // see second point above
char *aligned_p = (char*)((size_t)p + (align - (size_t)p % align));
// use the aligned_p here
// ...
// when you're done, call:
free(p); // see first point above
Related
What is the fastest way to convert a vector of size 4 into a 32 bit float?
My failed attempt:
static bool vec2float32(std::vector<uint8_t> bytes, float &result)
{
if(bytes.size() != 4) return false;
uint8_t sign = (bytes.at(0) & 0x10000000); //will be 1 or 0
uint8_t exponent = (bytes.at(0) & 0x01111111);
uint16_t mantissa = (bytes.at(1) << (2*8)) + (bytes.at(2) << (1*8)) + (bytes.at(3) << (0*8));
result = (2^(exponent - 127)) * mantissa;
if(sign == 1) result = result * -1;
return true;
}
From the vector you can get a pointer to the first byte. If the bytes are already in the correct order, then you can copy the bytes directly into a float variable.
I believe this should be the fastest:
return *reinterpret_cast<float*>(&bytes[0]);
I think its technically undefined behavior. But most compilers should output what you expect here. The &bytes[0] is guaranteed to work because std::vector is guaranteed to be contiguous.
I'm trying to understand the SSE strstr implementation, and one particular function is doing something I don't quite understand wrt loading a const unsigned char* into an __m128i. The function is the __m128i_strloadu function (taken from here: http://strstrsse.googlecode.com/svn-history/r135/trunk/strmatch/lib/strstrsse42.c):
static inline __m128i __m128i_strloadu (const unsigned char * p) {
int offset = ((size_t) p & (16 - 1));
if (offset && (int) ((size_t) p & 0xfff) > 0xff0) {
__m128i a = _mm_load_si128 ((__m128i *) (p - offset));
__m128i zero = _mm_setzero_si128 ();
// I don't understand what this movemask, in concert
// with the shift right comparison below, are accomplishing
int bmsk = _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, zero));
if ((bmsk >> offset) != 0) {
return __m128i_shift_right(a, offset);
}
}
return _mm_loadu_si128 ((__m128i *) p);
}
I feel like this is a simple align to 16 bits operation taking place, but I'm having trouble visualizing /how/ it's happening. What does the movemask comparison accomplish here / what is it checking for?
It's testing whether the end of the string is in this block, and if so, it shifts out the extra bytes and returns that. Otherwise it goes ahead and does the normal unaligned load, avoiding the shift and containing "more of this string" instead of spurious zeroes.
The mask is a mask of which bytes in the 16-byte block are zero. bmsk >> offset is the part of the mask that represents bytes that were asked for (starting from p), the extra bytes are due to alignment.
Is there a function (SSEx intrinsics is OK) which will fill the memory with a specified int32_t value? For instance, when this value is equal to 0xAABBCC00 the result memory should look like:
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
...
I could use std::fill or simple for-loop, but it is not fast enough.
Resizing of a vector performed only once in the beginning of program, this is not an issue. The bottleneck is filling the memory.
Simplified code:
struct X
{
typedef std::vector<int32_t> int_vec_t;
int_vec_t buffer;
X() : buffer( 5000000 ) { /* some more action */ }
~X() { /* some code here */ }
// the following function is called 25 times per second
const int_vec_t& process( int32_t background, const SOME_DATA& data );
};
const X::int_vec_t& X::process( int32_t background, const SOME_DATA& data )
{
// the following one string takes 30% of total time of #process function
std::fill( buffer.begin(), buffer.end(), background );
// some processing
// ...
return buffer;
}
This is how I would do it (please excuse the Microsoft-ness of it):
VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
{
__m128i f;
// Fix mis-alignment.
if ((ULONG_PTR)M & 0xf)
{
switch ((ULONG_PTR)M & 0xf)
{
case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }
}
}
f.m128i_i32[0] = Fill;
f.m128i_i32[1] = Fill;
f.m128i_i32[2] = Fill;
f.m128i_i32[3] = Fill;
while (Count >= 4)
{
_mm_store_si128((__m128i *)M, f);
M += 4;
Count -= 4;
}
// Fill remaining LONGs.
switch (Count & 0x3)
{
case 0x3: *M++ = Fill;
case 0x2: *M++ = Fill;
case 0x1: *M++ = Fill;
}
}
I have to ask: Have you definitely profiled std::fill and shown it to be the performance bottleneck? I would guess it to be implemented in a pretty efficient manner, such that the compiler can automatically generate the appropriate instructions (for example -march on gcc).
If it is the bottleneck, it may still be possible to get better benefit from an algorithmic redesign (if possible) to avoid setting so much memory (apparently over and over) such that it doesn't matter anymore which fill mechanism you use.
Thanks to everyone for your answers. I've checked wj32's solution , but it shows very similar time as std::fill do. My current solution works 4 times faster (in Visual Studio 2008) than std::fill with help of the function memcpy:
// fill the first quarter by the usual way
std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
// copy the first quarter to the second (very fast)
memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
// copy the first half to the second (very fast)
memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));
In the production code one needs to add check if buffer.size() is divisible by 4 and add appropriate handling for that.
Have you considered using
vector<int32_t> myVector;
myVector.reserve( sizeIWant );
and then use std::fill? Or perhaps the constructor of a std::vector which takes as an argument the number of items held and the value to initialize them at?
Assuming you have a limited amount of values in your background parameter (or even better, only on), maybe you should try to allocate a static vector, and simply use memcpy.
const int32_t sBackground = 1234;
static vector <int32_t> sInitalizedBuffer(n, sBackground);
const X::int_vec_t& X::process( const SOME_DATA& data )
{
// the following one string takes 30% of total time of #process function
std::memcpy( (void*) data[0], (void*) sInitalizedBuffer[0], n * sizeof(sBackground));
// some processing
// ...
return buffer;
}
I just tested std::fill with g++ with full optimizations (SSE etc.. enabled):
#include <algorithm>
#include <inttypes.h>
int32_t a[5000000];
int main(int argc,char *argv[])
{
std::fill(a,a+5000000,0xAABBCC00);
return a[3];
}
and the inner loop looked like:
L2:
movdqa %xmm0, -16(%eax)
addl $16, %eax
cmpl %edx, %eax
jne L2
Looks like 0xAABBCC00 x 4 was loaded into xmm0 and is being moved 16-bytes at a time.
Not totally sure how you set 4 bytes in a row, but if you want to fill memory with just one byte over an over again, you can use memset.
void * memset ( void * ptr, int value, size_t num );
Fill block of memory
Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).
the vs2013 and vs2015 can optimize a plain for-loop to a rep stos instruction. It's the fastest way to fill a buffer. You can specify the std::fill for your type like this:
namespace std {
inline void fill(vector<int>::iterator first, vector<int>::iterator last, int value){
for (size_t i = 0; i < last - first; i++)
first[i] = value;
}
}
BTW. To have the compiler do the optimization, the buffer must be accessed by the subscript operator.
It will not work on the gcc and clang. They both will compile the code to a conditional jump loop. It runs as slow as the original std::fill. And though the wchar_t is 32-bit, the wmemset does not have an assemble implement likes the memset. So you have to write assemble code to do the optimization.
It might be a bit non portable but you could use an overlapping memory copy.
Fill the first four bytes with the pattern you want and use memcpy().
int32* p = (int32*) malloc( size );
*p = 1234;
memcpy( p + 4, p, size - 4 );
don't think you can get much faster
If I have a pointer to the start of a memory region, and I need to read the value packed in bits 30, 31, and 32 of that region, how can I read that value?
Use bit masks.
It depends on how big a byte is in your machine. The answer will vary depending on if you're zero- or one-indexing for those numbers. The following function returns 0 if the bit is 0 and non-zero if it is 1.
int getBit(char *buffer, int which)
{
int byte = which / CHAR_BIT;
int bit = which % CHAR_BIT;
return buffer[byte] & (1 << bit);
}
If your compiler can't optimize well enough to turn the division and mod operations into bit operations, you could do it explicitly, but I prefer this code for clarity.
(Edited to fix a bug and change to CHAR_BIT, which is a great idea.)
I'd probably generalize this answer to something like this:
template <typename T>
bool get_bit(const T& pX, size_t pBit)
{
if (pBit > sizeof(pX) * CHAR_BIT)
throw std::invalid_argument("bit does not exist");
size_t byteOffset = pBit / CHAR_BIT;
size_t bitOffset = pBit % CHAR_BIT;
char byte = (&reinterpret_cast<const char&>(pX))[byteOffset];
unsigned mask = 1U << bitOffset;
return (byte & mask) == 1;
}
Bit easier to use:
int i = 12345;
bool abit = get_bit(i, 4);
On a 32-bit system, you can simply shift pointer right 29. If you need the bit values in place, and by 0xE0000000.
In C/C++, is there an easy way to apply bitwise operators (specifically left/right shifts) to dynamically allocated memory?
For example, let's say I did this:
unsigned char * bytes=new unsigned char[3];
bytes[0]=1;
bytes[1]=1;
bytes[2]=1;
I would like a way to do this:
bytes>>=2;
(then the 'bytes' would have the following values):
bytes[0]==0
bytes[1]==64
bytes[2]==64
Why the values should be that way:
After allocation, the bytes look like this:
[00000001][00000001][00000001]
But I'm looking to treat the bytes as one long string of bits, like this:
[000000010000000100000001]
A right shift by two would cause the bits to look like this:
[000000000100000001000000]
Which finally looks like this when separated back into the 3 bytes (thus the 0, 64, 64):
[00000000][01000000][01000000]
Any ideas? Should I maybe make a struct/class and overload the appropriate operators? Edit: If so, any tips on how to proceed? Note: I'm looking for a way to implement this myself (with some guidance) as a learning experience.
I'm going to assume you want bits carried from one byte to the next, as John Knoeller suggests.
The requirements here are insufficient. You need to specify the order of the bits relative to the order of the bytes - when the least significant bit falls out of one byte, does to go to the next higher or next lower byte.
What you are describing, though, used to be very common for graphics programming. You have basically described a monochrome bitmap horizontal scrolling algorithm.
Assuming that "right" means higher addresses but less significant bits (ie matching the normal writing conventions for both) a single-bit shift will be something like...
void scroll_right (unsigned char* p_Array, int p_Size)
{
unsigned char orig_l = 0;
unsigned char orig_r;
unsigned char* dest = p_Array;
while (p_Size > 0)
{
p_Size--;
orig_r = *p_Array++;
*dest++ = (orig_l << 7) + (orig_r >> 1);
orig_l = orig_r;
}
}
Adapting the code for variable shift sizes shouldn't be a big problem. There's obvious opportunities for optimisation (e.g. doing 2, 4 or 8 bytes at a time) but I'll leave that to you.
To shift left, though, you should use a separate loop which should start at the highest address and work downwards.
If you want to expand "on demand", note that the orig_l variable contains the last byte above. To check for an overflow, check if (orig_l << 7) is non-zero. If your bytes are in an std::vector, inserting at either end should be no problem.
EDIT I should have said - optimising to handle 2, 4 or 8 bytes at a time will create alignment issues. When reading 2-byte words from an unaligned char array, for instance, it's best to do the odd byte read first so that later word reads are all at even addresses up until the end of the loop.
On x86 this isn't necessary, but it is a lot faster. On some processors it's necessary. Just do a switch based on the base (address & 1), (address & 3) or (address & 7) to handle the first few bytes at the start, before the loop. You also need to special case the trailing bytes after the main loop.
Decouple the allocation from the accessor/mutators
Next, see if a standard container like bitset can do the job for you
Otherwise check out boost::dynamic_bitset
If all fails, roll your own class
Rough example:
typedef unsigned char byte;
byte extract(byte value, int startbit, int bitcount)
{
byte result;
result = (byte)(value << (startbit - 1));
result = (byte)(result >> (CHAR_BITS - bitcount));
return result;
}
byte *right_shift(byte *bytes, size_t nbytes, size_t n) {
byte rollover = 0;
for (int i = 0; i < nbytes; ++i) {
bytes[ i ] = (bytes[ i ] >> n) | (rollover < n);
byte rollover = extract(bytes[ i ], 0, n);
}
return &bytes[ 0 ];
}
Here's how I would do it for two bytes:
unsigned int rollover = byte[0] & 0x3;
byte[0] >>= 2;
byte[1] = byte[1] >> 2 | (rollover << 6);
From there, you can generalize this into a loop for n bytes. For flexibility, you will want to generate the magic numbers (0x3 and 6) rather then hardcode them.
I'd look into something similar to this:
#define number_of_bytes 3
template<size_t num_bytes>
union MyUnion
{
char bytes[num_bytes];
__int64 ints[num_bytes / sizeof(__int64) + 1];
};
void main()
{
MyUnion<number_of_bytes> mu;
mu.bytes[0] = 1;
mu.bytes[1] = 1;
mu.bytes[2] = 1;
mu.ints[0] >>= 2;
}
Just play with it. You'll get the idea I believe.
Operator overloading is syntactic sugar. It's really just a way of calling a function and passing your byte array without having it look like you are calling a function.
So I would start by writing this function
unsigned char * ShiftBytes(unsigned char * bytes, size_t count_of_bytes, int shift);
Then if you want to wrap this up in an operator overload in order to make it easier to use or because you just prefer that syntax, you can do that as well. Or you can just call the function.