Pointer arithmetic in memcpy has weird result [duplicate] - c++

I am coming back to C programming after some years so I guess I am a bit rusty, but I am seeing some weird behavior in my code.
I have the following:
memcpy(dest + (start_position * sizeof(MyEnum)), source, source_size * sizeof(MyEnum));
Where:
dest and source are arrays of MyEnum with different sizes,
dest is 64 bytes long.
source is 16 bytes long.
sizeof(MyEnum) is 4 bytes
source_size is 4, as there are 4 enums inside the array.
I am looping this code 4 times, advancing start_position each time, so the 4 iterations call memcpy with the following values (I already checked this with the debugger):
memcpy(dest + (0), source, 16); (start_position = 0 * 4, since source size is 4)
memcpy(dest + (16), source, 16); (start_position = 1 * 4, since source size is 4)
memcpy(dest + (32), source, 16); (start_position = 2 * 4, since source size is 4)
memcpy(dest + (48), source, 16); (start_position = 3 * 4, since source size is 4)
memcpy works fine on the first loop, but on the second it copies data to another array instead, clearly going outside the memory region of dest array, violating another array's memory area.
So I checked the pointer arithmetic happening inside my function and this is what I got:
dest address is 0xbefffa74
dest + (start_position * sizeof(MyEnum)) is 0xbefffab4 for start_position * sizeof(MyEnum) = 16.
The array being violated is at 0xbefffab4.
Although this explains why the array's memory is being violated, I don't get how 0xbefffa74 + 16 ends up being 0xbefffab4, but I can confirm that's the address memcpy is being called with.
I am running this on a Raspberry Pi, but AFAIK this shouldn't matter.

Pointer arithmetic works on the size of the pointed-to datatype. If you have a char*, then every time you increment the pointer it moves by one byte. If it's an int*, then every increment adds more than one, usually 4, to the pointer (since int is usually, but not always, 32 bits).
If you have a pointer to a struct, then incrementing the pointer moves it by the size of the struct. Therefore the sizeof shouldn't be there, or you'll move way too far.
memcpy(dest + (start_position * sizeof(MyEnum)), source, source_size * sizeof(MyEnum));
This moves the pointer 4*4 = 16 bytes per position, since MyEnum is four bytes.
memcpy(dest + start_position, source, source_size * sizeof(MyEnum));
This moves it only 4 bytes at a time.
This is logical because pointer[2] is the same as *(pointer + 2); if pointer arithmetic didn't implicitly take the pointed-to type's size into account, all indexing would also need the sizeof, and you'd end up writing a lot of pointer[2 * sizeof(*pointer)].
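For what it's worth, a minimal sketch of the corrected loop (the enum, buffer sizes and names below are stand-ins for the ones in the question, and sizeof(MyEnum) == 4 is assumed):
#include <cstring>
#include <cstddef>

enum MyEnum { A, B, C, D };            // assume sizeof(MyEnum) == 4 on this platform

int main() {
    MyEnum source[4] = { A, B, C, D }; // 16 bytes
    MyEnum dest[16];                   // 64 bytes
    const size_t source_size = 4;      // elements, not bytes
    for (size_t i = 0; i < 4; ++i) {
        size_t start_position = i * source_size;   // element index, no sizeof here
        std::memcpy(dest + start_position,         // pointer arithmetic already scales by sizeof(MyEnum)
                    source,
                    source_size * sizeof(MyEnum)); // the byte count still needs sizeof
    }
}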

Related

aligned malloc c++ implementation

I found this piece of code:
void* aligned_malloc(size_t required_bytes, size_t alignment) {
    int offset = alignment - 1;
    void* p = (void * ) malloc(required_bytes + offset);
    void* q = (void * ) (((size_t)(p) + offset) & ~(alignment - 1));
    return q;
}
that is the implementation of aligned malloc in C++. Aligned malloc is a function that supports allocating memory such that the
memory address returned is divisible by a specific power of two.
Example:
align_malloc (1000, 128) will return a memory address that is a multiple of 128 and that points to memory of size 1000 bytes.
But I don't understand line 4. Why sum twice the offset?
Thanks
Why sum twice the offset?
offset isn't exactly being summed twice. First use of offset is for the size to allocate:
void* p = (void * ) malloc(required_bytes + offset);
Second time is for the alignment:
void* q = (void * ) (((size_t)(p) + offset) & ~(alignment - 1));
Explanation:
~(alignment - 1) is the bitwise complement of offset (remember, int offset = alignment - 1;), which gives you the mask you need to satisfy the requested alignment. Arithmetic-wise, adding the offset and then doing a bitwise and (&) with this mask gives you the address of the aligned pointer.
How does this arithmetic work? First, remember that the internal call to malloc() is for required_bytes + offset bytes. As in, not necessarily at the alignment you asked for. For example, say you want to allocate 10 bytes with an alignment of 16 (so the desired behavior is to allocate the 10 bytes starting at an address divisible by 16). The malloc() above will then give you 10+16-1=25 bytes, not necessarily starting at an address divisible by 16. But this 16-1 is 0x000F and its complement (~) is 0xFFF0. Now we apply the bitwise and like this: (p + 15) & 0xFFF0, which forces the resulting pointer to be a multiple of 16.
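If it helps, here is that rounding step as a tiny self-contained sketch (align_up is a name made up for illustration; it assumes alignment is a power of two):
#include <cstdint>

constexpr std::uintptr_t align_up(std::uintptr_t p, std::uintptr_t alignment) {
    return (p + alignment - 1) & ~(alignment - 1); // add the offset, then mask off the low bits
}

static_assert(align_up(0x1003, 16) == 0x1010, "rounds up to the next multiple of 16");
static_assert(align_up(0x1010, 16) == 0x1010, "an already aligned address is unchanged");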
But wait, why add this offset of alignment - 1 in the first place? You do it because once you get the pointer p returned by malloc(), the one thing you cannot do, in order to find the nearest address which is a multiple of the requested alignment, is look for it before p, as this could cross into the address space of something allocated before p. So you begin by adding alignment - 1, which, if you think about it, is exactly the maximum by which you'd ever have to advance to reach the requested alignment.
* Thanks to user DevSolar for some additional phrasing.
Note 1: For this way to work the alignment must be a power of 2. This snippet does not enforce such a thing and so could cause unexpected behavior.
Note 2: An interesting question is how could you implement a free() version for such an allocation, with the return value from this function.

How to quickly replicate a 6-byte unsigned integer into a memory region?

I need to replicate a 6-byte integer value into a memory region, starting with its beginning and as quickly as possible. If such an operation is supported in hardware, I'd like to use it (I'm on x64 processors now, compiler is GCC 4.6.3).
The memset doesn't suit the job, because it can replicate bytes only. std::fill isn't good either, because I can't even define an iterator that jumps between 6-byte-wide positions in the memory region.
So, I'd like to have a function:
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num)
This looks like memset, but there is an additional argument width to define how many bytes from the value to replicate.
I already know about the obvious myMemset implementation, which would call memcpy in a loop with the last argument (bytes to copy) equal to the width. I also know that I can define a temporary memory region of size 6 * 8 = 48 bytes, fill it up with the 6-byte integer and then memcpy it to the destination area.
Can we do better?
Something along the lines of @Mark Ransom's comment:
Copy 6 bytes, then 6, 12, 24, 48, 96, etc.
void memcpy6(void *dest, const void *src, size_t n /* number of 6-byte blocks */) {
    if (n-- == 0) {
        return;
    }
    memcpy(dest, src, 6);
    size_t width = 1;
    while (n >= width) {
        memcpy(&((char *) dest)[width * 6], dest, width * 6); // double the filled region
        n -= width;
        width <<= 1; // double width
    }
    if (n > 0) {
        memcpy(&((char *) dest)[width * 6], dest, n * 6);     // copy the remaining tail
    }
}
Optimization: scale n and width by 6.
[Edit]
Corrected destination, per @SchighSchagh
Added (char *) cast
Determine the most efficient write size that the CPU supports; then find the smallest number that can be evenly divided by both 6 and that write size and call that "block size".
Now split the memory region up into blocks of that size. Each block will be identical and all writes will be correctly aligned (assuming the memory region itself is correctly aligned).
For example, if the most efficient write size that the CPU supports is 4 bytes (e.g. ancient 80486) then the "size of block" would be 12 bytes. You'd set 3 general purpose registers and do 3 stores per block.
For another example, if the most efficient write size that the CPU supports is 16 bytes (e.g. SSE) then the "size of block" would be 48 bytes. You'd set 3 SSE registers and do 3 stores per block.
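As a sketch of that "block size" computation (the helper names below are made up for illustration; the block size is just the least common multiple of the pattern width and the preferred write size):
#include <cstddef>
using std::size_t;

constexpr size_t gcd_(size_t a, size_t b) { return b == 0 ? a : gcd_(b, a % b); }
constexpr size_t block_size(size_t pattern, size_t write) { return pattern / gcd_(pattern, write) * write; } // lcm

static_assert(block_size(6, 4)  == 12, "4-byte writes (e.g. ancient 80486)");
static_assert(block_size(6, 16) == 48, "16-byte writes (e.g. SSE)");
static_assert(block_size(6, 32) == 96, "32-byte writes (e.g. AVX)");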
Also, I'd recommend rounding the size of the memory region up to ensure it is a multiple of the block size (with some "not strictly necessary" padding). A few unnecessary writes are less expensive than code to fill a "partial block".
The second most efficient method might be to use a memory copy (but not memcpy() or memmove()). In this case you'd write the initial 6 bytes (or 12 bytes or 48 bytes or whatever), then copy from (e.g.) &area[0] to &area[6] (working from lowest to highest) until you reach the end. For this memmove() will not work because it will notice the area is overlapping and work from highest to lowest instead; and memcpy() will not work because it assumes the source and destination do not overlap; so you'd have to create your own memory copy to suit. The main problem with this is that you double the number of memory accesses - "reading and writing" is slower than "writing alone".
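A rough sketch of such a hand-rolled forward copy (it assumes the first pattern_len bytes of area have already been written; unlike memmove, it deliberately copies from low to high so that the overlap propagates the pattern):
#include <cstddef>

// area[0 .. pattern_len-1] must already contain the pattern (e.g. the 6 bytes).
void propagate_pattern(unsigned char* area, std::size_t total_bytes, std::size_t pattern_len) {
    for (std::size_t i = pattern_len; i < total_bytes; ++i)
        area[i] = area[i - pattern_len];   // forward, byte-wise, so the overlap repeats the pattern
}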
If your num is large enough, you can try using the AVX vector instructions that handle 32 bytes at a time (_mm256_load_si256/_mm256_store_si256 or their unaligned variants).
As 32 is not a multiple of 6, you will have to first replicate the 6-byte pattern 16 times using short memcpy's or 32/64-bit moves.
ABCDEF
ABCDEF|ABCDEF
ABCD EFAB CDEF|ABCD EFAB CDEF
ABCDEFAB CDEFABCD EFABCDEF|ABCDEFAB CDEFABCD EFABCDEF
ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF|ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF
You will also finish with a short memcpy.
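For illustration, a sketch along those lines (it assumes x86-64 with AVX and, for brevity, a destination whose length is a multiple of 96 bytes = lcm(6, 32); fill6_avx and its parameters are made-up names):
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <cstddef>

void fill6_avx(void* dst, const std::uint8_t pattern[6], std::size_t num_blocks /* 96-byte blocks */) {
    alignas(32) std::uint8_t stage[96];          // 96 = lcm(6, 32)
    std::memcpy(stage, pattern, 6);
    std::size_t filled = 6;
    while (filled < 96) {                        // replicate by doubling: 6, 12, 24, 48, 96
        std::size_t n = (filled <= 96 - filled) ? filled : 96 - filled;
        std::memcpy(stage + filled, stage, n);
        filled += n;
    }
    __m256i v0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(stage + 0));
    __m256i v1 = _mm256_load_si256(reinterpret_cast<const __m256i*>(stage + 32));
    __m256i v2 = _mm256_load_si256(reinterpret_cast<const __m256i*>(stage + 64));
    std::uint8_t* p = static_cast<std::uint8_t*>(dst);
    for (std::size_t i = 0; i < num_blocks; ++i, p += 96) {
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + 0),  v0); // unaligned stores: dst need not be 32-byte aligned
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + 32), v1);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + 64), v2);
    }
    // a real version would finish any tail shorter than 96 bytes with a short memcpy
}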
Try the __movsq intrinsic (x64 only; in assembly, rep movsq) that will move 8 bytes at a time, with a suitable repetition factor, and setting the destination address 6 bytes after the source. Check that overlapping addresses are handled smartly.
Write 8 bytes at a time.
Being on a 64-bit machine, the generated code can certainly operate well with 8-byte writes. After dealing with some set-up issues, write 8 bytes per write, about num times, in a tight loop. Assumptions apply; see the code.
// assume little endian
// (requires <assert.h>, <stdint.h>, <string.h>)
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num) {
    assert(width > 0 && width <= 8);
    uint64_t *ptr64 = (uint64_t *) ptr;
    // number of trailing elements to finish with memcpy, to prevent writing past the array end
    static const unsigned stop_early[8 + 1] = { 0, 8, 3, 2, 1, 1, 1, 1, 0 };
    size_t se = stop_early[width];
    if (num > se) {
        num -= se;
        // assume an unaligned 64-bit write to bytes ptr64[0..7] does not bus-fault
        while (num > 0) { // tight loop
            num--;
            *ptr64 = value;
            ptr64 = (uint64_t *) ((char *) ptr64 + width);
        }
        ptr = ptr64;
        num = se;
    }
    // Cope with the last few writes
    while (num-- > 0) {
        memcpy(ptr, &value, width);
        ptr = (char *) ptr + width;
    }
}
Further optimization includes writing 2 blocks at a time when width == 3 or 4, 4 blocks at a time when width == 2, and 8 blocks at a time when width == 1.

Get 5 bytes from the network: warning "left shift count >= width of type"

My machine is 64-bit. My code is as below:
unsigned long long periodpackcount = *(mBuffer+offset)<<32 | *(mBuffer+offset+1)<<24 | *(mBuffer+offset+2)<<16 | *(mBuffer+offset+3)<<8 | *(mBuffer+offset+4);
mBuffer is unsigned char*. I want to get 5 bytes of data and transform it to host byte order.
How can I avoid this warning ?
Sometimes it's best to break things apart into a few lines in order to avoid issues. You have a 5-byte integer you want to read.
// Create the number to read into.
uint64_t number = 0; // uint64_t is in <cstdint>
char *ptr = (char *)&number;
// Copy from the buffer, offset by 3 for the leading zero bytes.
memcpy(ptr + 3, mBuffer + offset, 5); // memcpy is in <cstring>
// Reverse the byte order.
std::reverse(ptr, ptr + 8); // std::reverse is in <algorithm>; could bit shift here instead
Probably not the best byte swap ever (bit shifting is faster). And my logic might be off for the offsetting, but something along those lines should work.
The other thing you may want to do is cast each byte before shifting, since you're currently leaving it up to the compiler to determine the data type: *(mBuffer + offset) is a character (I believe), so you may want to cast it to a larger type, e.g. static_cast<uint64_t>(*(mBuffer + offset)) << 32, or something along those lines.
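A sketch of that cast-before-shift version, which also silences the warning (read_uint40_be is a made-up name; it assumes mBuffer holds the value most-significant byte first):
#include <cstdint>
#include <cstddef>

std::uint64_t read_uint40_be(const unsigned char* mBuffer, std::size_t offset) {
    return (static_cast<std::uint64_t>(mBuffer[offset])     << 32) | // promote to 64 bits before shifting
           (static_cast<std::uint64_t>(mBuffer[offset + 1]) << 24) |
           (static_cast<std::uint64_t>(mBuffer[offset + 2]) << 16) |
           (static_cast<std::uint64_t>(mBuffer[offset + 3]) <<  8) |
            static_cast<std::uint64_t>(mBuffer[offset + 4]);
}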

Aligned malloc in C++

I have a question on problem 13.9 in the book "Cracking the Coding Interview".
The question is to write aligned malloc and free functions that support allocating aligned memory, and in the answer the code below is given:
void *aligned_malloc(size_t required_bytes, size_t alignment) {
    void *p1;
    void **p2;
    int offset=alignment-1+sizeof(void*);
    if((p1=(void*)malloc(required_bytes+offset))==NULL)
        return NULL;
    p2=(void**)(((size_t)(p1)+offset)&~(alignment-1)); //line 5
    p2[-1]=p1; //line 6
    return p2;
}
I am confused by line 5 and line 6. Why do you have to do an "and" when you have already added offset to p1? And what does [-1] mean? Thanks in advance for the help.
Your sample code is not complete. It allocates nothing. It is pretty obvious you are missing a malloc statement, which sets the p1 pointer. I don't have the book, but I think the complete code should go along these lines:
void *aligned_malloc(size_t required_bytes, size_t alignment) {
    void *p1;
    void **p2;
    int offset=alignment-1+sizeof(void*);
    p1 = malloc(required_bytes + offset); // the line you are missing
    p2=(void**)(((size_t)(p1)+offset)&~(alignment-1)); //line 5
    p2[-1]=p1; //line 6
    return p2;
}
So ... what does the code do?
The strategy is to malloc more space than what we need (into p1), and return a p2 pointer somewhere after the beginning of the buffer.
Since alignment is a power of two, in binary it has the form of 1 followed by zeros. e.g. if alignment is 32, it will be 00100000 in binary
(alignment-1) in binary format will turn the 1 into 0, and all the 0's after the 1 into 1. For example: (32-1) is 00011111
The ~ will invert all the bits. That is: 11100000
Now, p1 is a pointer to the buffer (remember, the buffer is larger by offset than what we need). We add offset to p1: p1+offset.
So now, (p1+offset) is greater than what we want to return. We'll zero out the insignificant low bits by doing a bitwise and: (p1+offset) & ~(alignment-1)
This is p2, the pointer that we want to return. Note that because its last 5 bits are zero it is 32-byte aligned, as requested.
p2 is what we'll return. However, we must be able to reach p1 when the user calls aligned_free. For this, note that we reserved room for one extra pointer when we calculated the offset (that's the sizeof(void*) in line 4).
So, we want to put p1 immediately before p2. This is p2[-1]. This is a slightly tricky calculation. Remember that p2 is declared as void**. One way to look at it is as an array of void*. C array arithmetic says that p2[0] is exactly p2, p2[1] is p2 + sizeof(void*), and in general p2[n] = p2 + n*sizeof(void*). The compiler also supports negative indices, so p2[-1] is one void* (typically 4 or 8 bytes) before p2.
I'm going to guess that aligned_free is something like:
void aligned_free( void* p ) {
    void* p1 = ((void**)p)[-1]; // get the pointer to the buffer we allocated
    free( p1 );
}
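Putting the two halves of this answer together, a small self-contained sketch (error handling kept minimal; alignment is assumed to be a power of two):
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

void* aligned_malloc(size_t required_bytes, size_t alignment) {
    size_t offset = alignment - 1 + sizeof(void*);
    void* p1 = malloc(required_bytes + offset);
    if (p1 == NULL)
        return NULL;
    void** p2 = (void**)(((size_t)(p1) + offset) & ~(alignment - 1));
    p2[-1] = p1;                                  // stash the original pointer just before p2
    return p2;
}

void aligned_free(void* p) {
    free(((void**)p)[-1]);                        // recover and free the original allocation
}

int main() {
    void* p = aligned_malloc(1000, 128);
    assert(p != NULL && (uintptr_t)p % 128 == 0); // returned address is a multiple of 128
    aligned_free(p);
}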
p1 is the actual allocation. p2 is the pointer being returned, which references memory past the point of allocation and leaves enough space for both alignment AND storing the actual allocated pointer in the first place. When aligned_free() is called, p1 will be retrieved to do the "real" free().
Regarding the bit math, that gets a little more cumbersome, but it works.
p2=(void**)(((size_t)(p1)+offset)&~(alignment-1)); //line 5
Remember, p1 is the actual allocation reference. For kicks, let's assume the following, with 32-bit pointers:
alignment = 64 bytes, 0x40
offset = 0x40-1+4 = 0x43
p1 = 0x20000110, a value returned from the stock malloc()
What is important is that the original malloc() allocates an additional 0x43 bytes of space above and beyond the original request. This is to ensure both the alignment math and the space for the 32-bit pointer can be accounted for:
p2=(void**)(((size_t)(p1)+offset)&~(alignment-1)); //line 5
p2 = (0x20000110 + 0x43) &~ (0x0000003F)
p2 = 0x20000153 &~ 0x0000003F
p2 = 0x20000153 & 0xFFFFFFC0
p2 = 0x20000140
p2 is aligned on a 0x40 boundary (i.e. all bits in 0x3F are 0) and enough space is left behind to store the 4-byte pointer for the original allocation, referenced by p1.
It has been forever since I did alignment math, so if I f'ed up the bits, please someone correct this.
And by the way, this is not the only way to do this.
void* align_malloc(size_t size, size_t alignment)
{
    // sanity checks for size/alignment omitted;
    // make sure alignment is a power of 2: (alignment & (alignment - 1)) == 0
    // Allocate enough buffer to accommodate the alignment and the metadata info.
    // We want to store an offset back to the address the CRT actually reserved, to avoid a leak.
    size_t total = size + alignment + sizeof(size_t);
    char* crtAlloc = static_cast<char*>(malloc(total));
    char* crtStart = crtAlloc;                   // where the CRT allocation really begins
    crtAlloc += sizeof(size_t);                  // make sure there is room ahead to store the metadata
    size_t crtArithmetic = reinterpret_cast<size_t>(crtAlloc); // treat as an integer for the alignment math
    char* pRet = crtAlloc + (alignment - (crtArithmetic % alignment));
    size_t* pMetadata = reinterpret_cast<size_t*>(pRet);
    pMetadata[-1] = static_cast<size_t>(pRet - crtStart);      // offset back to the CRT start
    return pRet;
}
As an example, take size = 20, alignment = 16, assume crt malloc returned address 10, and assume sizeof(size_t) is 4 bytes:
- total bytes to allocate = 20+16+4 = 40
- memory committed by crt = address 10 to 50
- first make space in front by adding sizeof(size_t), 4 bytes, so you point at 14
- add the offset to align, which is 14 + (16 - 14%16) = 16
- move back sizeof(size_t), 4 bytes (i.e. to 12), treat that as a size_t pointer, and store there the offset 6 (= 16 - 10) back to the crt malloc start
In the same way, align_free will cast the void pointer to a size_t pointer, move back one location, read the offset stored there, and subtract it from the passed-in pointer to get back to the beginning of the crt allocation:
void align_free(void* ptr)
{
    size_t* pMetadata = reinterpret_cast<size_t*>(ptr);
    free(static_cast<char*>(ptr) - pMetadata[-1]); // step back to the pointer the CRT returned
}
Optimization: you don't need the extra sizeof(size_t) allocation if your alignment is larger than sizeof(size_t).

C++: &a[2] - &a[1] ==?

a is an array of integers. If I subtract the addresses, what is &a[2] - &a[1]?
What should the result be, 4 or 1?
EDIT: see the 4th comment on the top answer here, why does he say 1?? This is why I'm confused, I thought it would be 4.
EDIT: here is a test
&a[2] is the same as &(*(a + 2)) (i.e. (a + 2)) and &a[1] is the same as &(*(a + 1)) (i.e. (a + 1)). So the answer will be 1.
Pointer subtraction gives you the difference in elements, not bytes. It does not matter what the element type of the array is, the result of &a[2] - &a[1] will always be 1, because they are 1 element apart.
It is always 1. Pointer arithmetic is not concerned with the number of bytes each element occupies, and this is very useful. Compare these:
ptr++; // go to the next element (correct)
ptr += sizeof *ptr; // go to the next element (wrong)
When you work with arrays you are usually interested in the elements, not in the bytes comprising them, and that is why pointer arithmetic in C has been defined this way.
The difference must be 1. When you subtract pointers you always get the difference in elements.
Since this is C++, I'm going to assume that you have not overloaded the & or * operators on whatever type a is. With that in mind, the following is true:
&a[2] - &a[1]
(a + 2) - (a + 1)
a + 2 - a - 1
2 - 1
1
A couple of the answers here (deleted since this answer was posted) clearly had byte* in mind:
int a[10];
byte * pA2 = (byte*)&a[2];
byte * pA1 = (byte*)&a[1];
int sz1 = &a[2] - &a[1];
int sz2 = pA2 - pA1;
CString msg;
msg.Format("int * %d, byte * %d\n", sz1, sz2);
OutputDebugString(msg);
output is:
int * 1, byte * 4
Same two addresses, but depending on the type of the pointers the addresses are stored in, the difference between the two can be 1 (element) or 4 (bytes).
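The same demonstration without the MFC dependencies, as a sketch (it assumes a 4-byte int):
#include <cstdio>

int main() {
    int a[10];
    std::printf("elements: %td, bytes: %td\n",
                &a[2] - &a[1],                 // 1: difference in elements
                (char*)&a[2] - (char*)&a[1]);  // sizeof(int), typically 4: difference in bytes
}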