Most efficient way to read UInt32 from any memory address? - c++

What would be the most efficient way to read a UInt32 value from an arbitrary memory address in C++? (Assuming Windows x86 or Windows x64 architecture.)
For example, consider having a byte pointer that points somewhere into a memory block that contains a combination of ints, string data, etc., all mixed together. The following sample shows reading the various fields from this block in a loop.
typedef unsigned char* BytePtr;
typedef unsigned int UInt32;
...
BytePtr pCurrent = ...;
while ( *pCurrent != 0 )
{
    ...
    if ( *pCurrent == ... )
    {
        UInt32 nValue = *( (UInt32*) ( pCurrent + 1 ) ); // line A
        ...
    }
    pCurrent += ...;
}
If at line A the address pCurrent + 1 happens to be 4-byte-aligned, reading the UInt32 should be a single memory read. If it is a non-aligned address, more than one memory cycle may be needed, which slows the code down. Is there a faster way to read the value from non-aligned addresses?

I'd recommend memcpy into a temporary of type UInt32 within your loop.
This takes advantage of the fact that a four byte memcpy will be inlined by the compiler when building with optimization enabled, and has a few other benefits:
If you are on a platform where alignment matters (hpux, solaris sparc, ...) your code isn't going to trap.
On a platform where alignment matters, it may be worthwhile to check the address for alignment and then do either a single aligned load, or a set of four byte loads combined with shifts and ORs. Your compiler's memcpy very likely already does this in the optimal way.
If you are on a platform where unaligned access is allowed and doesn't hurt performance (x86, x64, powerpc, ...), you are pretty much guaranteed that such a memcpy is the cheapest way to do this access.
If your memory was initially a pointer to some other data structure, your code may have undefined behaviour because of aliasing problems: you are casting to another type and dereferencing through that cast. Run-time problems caused by aliasing-related optimizations are very hard to track down! Presuming you can figure them out, fixing them can also be very hard in established code, and you may have to use obscure compilation options like -fno-strict-aliasing or -qansialias, which can limit the compiler's ability to optimize significantly.
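The suggestion above can be sketched as follows (the function name is my own):

```cpp
#include <cstring>

typedef unsigned int UInt32;

// Read a UInt32 from a possibly unaligned address.
// With optimization enabled, the memcpy is inlined to a single
// load on platforms that allow unaligned access (x86/x64).
inline UInt32 ReadUInt32(const unsigned char *p)
{
    UInt32 value;
    std::memcpy(&value, p, sizeof(value));
    return value;
}
```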

Your code is undefined behaviour.
Pretty much the only "correct" solution is to only read something as a type T if it is a type T, as follows:
uint32_t n;
char * p = point_me_to_random_memory();
std::copy(p, p + 4, reinterpret_cast<char*>(&n));
std::cout << "The value is: " << n << std::endl;
In this example, you want to read an integer, and the only way to do that is to have an integer. If you want it to contain a certain binary representation, you need to copy that data to the address starting at the beginning of the variable.

Let the compiler do the optimizing!
UInt32 ReadU32(unsigned char *ptr)
{
    return static_cast<UInt32>(ptr[0]) |
           (static_cast<UInt32>(ptr[1]) << 8) |
           (static_cast<UInt32>(ptr[2]) << 16) |
           (static_cast<UInt32>(ptr[3]) << 24);
}
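One benefit of the shift version worth noting: it decodes little-endian data identically regardless of the host's byte order, whereas the cast or memcpy versions return whatever the host order dictates. A quick usage sketch (repeating the function for completeness):

```cpp
typedef unsigned int UInt32;

UInt32 ReadU32(const unsigned char *ptr)
{
    return static_cast<UInt32>(ptr[0]) |
           (static_cast<UInt32>(ptr[1]) << 8) |
           (static_cast<UInt32>(ptr[2]) << 16) |
           (static_cast<UInt32>(ptr[3]) << 24);
}

// The bytes 78 56 34 12 decode to 0x12345678 on any host,
// big- or little-endian, and alignment never matters.
```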

Related

Are these integers misaligned, and should I even care?

I have some code that interprets multibyte width integers from an array of bytes at an arbitrary address.
std::vector<uint8> m_data; // filled with data
uint32 pos = 13; // position that might not be aligned well for 8 byte integers
...
uint64 * ptr = reinterpret_cast<uint64*>(m_data.data() + pos);
*ptr = swap64(*ptr); // (swaps endianness)
Would alignment be an issue for this code? And if it is, is it a severe issue, or one that can safely be ignored because the penalty is trivial?
Use memcpy instead:
uint64_t x;
memcpy(&x, m_data.data()+pos, sizeof(uint64_t));
x = swap(x);
memcpy(m_data.data()+pos, &x, sizeof(uint64_t));
It has two benefits:
you avoid strict aliasing violation (caused by reading uint8_t buffer as uint64_t)
you don't have to worry about misalignment at all (which you otherwise would need to care about: even on x86, the cast version can crash if the compiler autovectorizes your code)
Current compilers are good enough to do the right thing (i.e., your code will not be slow, memcpy is recognized, and will be handled well).
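The swap function is not shown in the question; a portable sketch (my own implementation) that compilers typically reduce to a single bswap instruction at -O2:

```cpp
#include <cstdint>

// Reverse the byte order of a 64-bit value, one byte at a time:
// the original low byte ends up most significant, and so on.
inline std::uint64_t swap64(std::uint64_t v)
{
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | ((v >> (8 * i)) & 0xFF);
    }
    return r;
}
```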
Some architectures require reads to be aligned in order to work, and raise a processor exception if the alignment is incorrect.
Depending on the platform, a misaligned read can:
Crash the program
Trap to a handler that re-runs the access as an unaligned read (a performance hit)
Just work correctly
Performing a performance measure is a good start, and checking the OS specifications for your target platform would be prudent.
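When checking a target platform's behaviour, a small helper to detect misaligned addresses before dereferencing can be useful (a sketch; the name is mine):

```cpp
#include <cstdint>

// True if p satisfies the alignment requirement of T.
template <typename T>
bool is_aligned(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```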

C++ struct aligment to 1 byte causes crash on WinCE

I'm working on an application that requires a big chunk of memory. To decrease memory usage I've switched the alignment for a huge structure to 1 byte (#pragma pack(1)).
After this my struct size was around 10-15% smaller, but a problem appeared.
When I try to use one of the fields of my structure through a pointer or reference, the application just crashes. If I change the field directly, it works fine.
In a test application I found out that the problem starts to appear after adding a field smaller than 4 bytes to the struct.
Test code:
#pragma pack(1)
struct TestStruct
{
    struct
    {
        long long lLongLong;
        long lLong;
        //bool lBool;           // << if uncommented then crash
        //short lShort;         // << if uncommented then crash
        //char lChar;           // << if uncommented then crash
        //unsigned char lUChar; // << if uncommented then crash
        //byte lByte;           // << if uncommented then crash
        __int64 lInt64;
        unsigned int Int;
        unsigned int Int2;
    } General;
};
struct TestStruct1
{
    TestStruct lT[5];
};
#pragma pack()
void TestFunct(unsigned int &pNewLength)
{
    std::cout << pNewLength << std::endl;
    std::cout << "pNL pointer: " << &pNewLength << std::endl;
    pNewLength = 7; // << crash
    char *lPointer = (char *)&pNewLength;
    *lPointer = 0x32; // << or crash here
}
int _tmain(int argc, _TCHAR* argv[])
{
    std::cout << sizeof(TestStruct1) << std::endl;
    TestStruct1 *lTest = new TestStruct1();
    TestFunct(lTest->lT[4].General.Int);
    std::cout << lTest->lT[4].General.Int << std::endl;
    char lChar;
    std::cin >> lChar;
    return 0;
}
Compiling this code on ARM (WinCE 6.0) results in a crash. The same code on Windows x86 works fine. Changing pack(1) to pack(4) or just pack() resolves the problem, but the structure is larger.
Why does this alignment cause a problem?
You can fix it (to run on WinCE with ARM) by using the __unaligned keyword. I was able to compile this code with VS2005 and run it successfully on a WM5 device by changing:
void TestFunct(unsigned int &pNewLength)
to
void TestFunct(unsigned int __unaligned &pNewLength)
Using this keyword will more than double the instruction count, but it allows you to keep using any legacy structures.
more on this here:
http://msdn.microsoft.com/en-us/library/aa448596.aspx
ARM architectures only support aligned memory accesses. This means four-byte types can only be read and written at addresses that are a multiple of 4, and for two-byte types the address must be a multiple of 2. Any attempt at unaligned memory access will normally reward you with a DATATYPE_MISALIGNMENT exception and a subsequent crash.
Now you might wonder why you only started seeing crashes when passing your unaligned structure members around as pointers and references; this has to do with the compiler. As long as you directly access the fields in the structure, it knows that you are accessing unaligned data and deals with it by transparently reading and writing the data in several aligned chunks that are split and reassembled. I have seen eVC++ do this to write a four-byte structure member that was two-byte-aligned: the generated assembly instructions split the integer into separate two-byte pieces and wrote them separately.
The compiler does not know whether a pointer or reference is aligned or not, so as soon as you pass unaligned structure fields around as pointers or references, there is no way for it to know these should be treated in a special manner. It will treat them as aligned data and will access them accordingly, which leads to crashes when the address is unaligned.
As marcin_j mentioned, it is possible to work around this by telling the compiler a particular pointer/reference is unaligned with the __unaligned keyword, or rather the UNALIGNED macro, which does nothing on platforms that do not need it. It basically tells the compiler to be careful with the pointer, in a way that I assume is similar to the way unaligned structure members are accessed.
A naïve approach would be to plaster UNALIGNED all over your code, but that is not recommended because it can incur a performance penalty: any data access through an __unaligned pointer needs several memory reads/writes, whereas the aligned version needs only one. UNALIGNED is typically used only at the places in the code where it is known that unaligned data will be passed around, and left out elsewhere.
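The pattern described above might look like the following sketch. Note that on Windows, winnt.h already provides an UNALIGNED macro along these lines; the macro name here is my own to avoid clashing with it:

```cpp
// Expand to __unaligned only on compilers/targets that support it;
// elsewhere the qualifier disappears entirely.
#if defined(_MSC_VER) && defined(_M_ARM)
#define MY_UNALIGNED __unaligned
#else
#define MY_UNALIGNED
#endif

void TestFunct(unsigned int MY_UNALIGNED &pNewLength)
{
    pNewLength = 7; // compiled as byte-wise accesses when __unaligned applies
}
```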
On x86, unaligned access is slow. ARM flat out can't do it. Your small types break the alignment of the next element.
Not that it matters. The overhead is unlikely to be more than 3 bytes, if you sort your members by size.
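Sorting members largest-first keeps each member naturally aligned and pushes all padding to the tail. For example (illustrative structs; the sizes shown assume a typical LP64 ABI):

```cpp
#include <cstdint>

// Small members interleaved with large ones force padding
// before every larger member.
struct Unsorted {
    std::uint8_t  a;  // 1 byte, then 7 bytes of padding
    std::uint64_t b;  // 8 bytes
    std::uint8_t  c;  // 1 byte, then 3 bytes of padding
    std::uint32_t d;  // 4 bytes           -> sizeof == 24
};

// Largest-first ordering leaves only 2 bytes of tail padding.
struct Sorted {
    std::uint64_t b;
    std::uint32_t d;
    std::uint8_t  a;
    std::uint8_t  c;  //                   -> sizeof == 16
};
```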

How portable is using the low bit of a pointer as a flag?

Suppose, for example, there is a class that requires a pointer and a bool. For simplicity an int pointer will be used in the examples, but the pointer type is irrelevant as long as it points to something whose size is more than 1.
Defining the class with { bool, int* } data members will result in the class having a size that is double the size of the pointer, with a lot of wasted space.
If the pointer does not point to a char (or other data of size 1), then presumably the low bit will always be zero. The class could be defined with { int* }, or for convenience: union { int*, uintptr_t }.
The bool is implemented by setting/clearing the low bit of the pointer according to the logical bool value, and clearing the bit when you need to use the pointer.
The defined way:
struct myData
{
    int * ptr;
    bool flag;
};
myData x;
// initialize
x.ptr = new int;
x.flag = false;
// set flag true
x.flag = true;
// set flag false
x.flag = false;
// use ptr
*(x.ptr) = 7;
// change ptr
x.ptr = y; // y is another int *
And the proposed way:
union tiny
{
    int * ptr;
    uintptr_t flag;
};
tiny x;
// initialize
x.ptr = new int;
// set flag true
x.flag |= 1;
// set flag false
x.flag &= ~1;
// use ptr
tiny clean = x; // note that clean will likely be optimized out
clean.flag &= ~1; // back to original value as assigned to ptr
*(clean.ptr) = 7;
// change ptr
bool flag = x.flag & 1;
x.ptr = y; // y is another int *
x.flag |= flag;
This seems to be undefined behavior, but how portable is this?
As long as you restore the pointer's low-order bit before trying to use it as a pointer, it's likely to be "reasonably" portable, provided your system, your C++ implementation, and your code meet certain assumptions.
I can't necessarily give you a complete list of assumptions, but off the top of my head:
It assumes you're not pointing to anything whose size is 1 byte. This excludes char, unsigned char, signed char, int8_t, and uint8_t. (And that assumes CHAR_BIT == 8; on exotic systems with, say, 16-bit or 32-bit bytes, other types might be excluded.)
It assumes objects whose size is at least 2 bytes are always aligned at an even address. Note that x86 doesn't require this; you can access a 4-byte int at an odd address, but it will be slightly slower. But compilers typically arrange for objects to be stored at even addresses. Other architectures may have different requirements.
It assumes a pointer to an even address has its low-order bit set to 0.
For that last assumption, I actually have a concrete counterexample. On Cray vector systems (J90, T90, and SV1 are the ones I've used myself) a machine address points to a 64-bit word, but the C compiler under Unicos sets CHAR_BIT == 8. Byte pointers are implemented in software, with the 3-bit byte offset within a word stored in the otherwise unused high-order 3 bits of the 64-bit pointer. So a pointer to an 8-byte-aligned object could easily have its low-order bit set to 1.
There have been Lisp implementations (example) that use the low-order 2 bits of pointers to store a type tag. I vaguely recall this causing serious problems during porting.
Bottom line: You can probably get away with it for most systems. Future architectures are largely unpredictable, and I can easily imagine your scheme breaking on the next Big New Thing.
Some things to consider:
Can you store the boolean values in a bit vector outside your class? (Maintaining the association between your pointer and the corresponding bit in the bit vector is left as an exercise).
Consider adding code to all pointer operations that fails with an error message if it ever sees a pointer with its low-order bit set to 1. Use #ifdef to remove the checking code in your production version. If you start running into problems on some platform, build a version of your code with the checks enabled and see what happens.
I suspect that, as your application grows (they seldom shrink), you'll want to store more than just a bool along with your pointer. If that happens, the space issue goes away, because you're already using that extra space anyway.
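Combining the assertion idea above with the scheme from the question, a checked wrapper might look like this (a sketch; the class and member names are mine, and it assumes the pointee's alignment is at least 2 so bit 0 is free):

```cpp
#include <cassert>
#include <cstdint>

// Pointer plus one flag bit, stored in the pointer's low-order bit.
class TaggedIntPtr {
    std::uintptr_t bits;
public:
    TaggedIntPtr(int *p, bool flag) {
        std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(p);
        assert((raw & 1) == 0 && "pointer not 2-byte aligned");
        bits = raw | (flag ? 1 : 0);
    }
    // Mask the flag bit off before using the value as a pointer.
    int *ptr() const { return reinterpret_cast<int *>(bits & ~std::uintptr_t(1)); }
    bool flag() const { return (bits & 1) != 0; }
    void set_flag(bool f) { bits = (bits & ~std::uintptr_t(1)) | (f ? 1 : 0); }
};
```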
In "theory": it's undefined behavior as far as I know.
In "reality": it'll work on everyday x86/x64 machines, and probably ARM too?
I can't really make a statement beyond that.
It's very portable, and furthermore, you can assert when you accept the raw pointer to make sure it meets the alignment requirement. This will guard against the unfathomable future compiler that somehow messes you up.
Only reasons not to do it are the readability cost and general maintenance associated with "hacky" stuff like that. I'd shy away from it unless there's a clear gain to be made. But it is sometimes totally worth it.
Conform to those rules and it should be very portable.

Byte array to int in C++

Would the following be the most efficient way to get an int16 (short) value from a byte array?
inline __int16* ReadINT16(unsigned char* ByteArray, __int32 Offset){
    return (__int16*)&ByteArray[Offset];
};
The byte array contains a dump of bytes in the same endian format as the machine this code is being called on. Alternatives are welcome.
It depends on what you mean by "efficient", but note that in some architectures this method will fail if Offset is odd, since the resulting 16 bit int will be misaligned and you will get an exception when you subsequently try to access it. You should only use this method if you can guarantee that Offset is even, e.g.
inline int16_t ReadINT16(uint8_t *ByteArray, int32_t Offset){
    assert((Offset & 1) == 0); // Offset must be a multiple of 2
    return *(int16_t*)&ByteArray[Offset];
};
Note also that I've changed this slightly so that it returns a 16-bit value directly, since returning a pointer and then subsequently de-referencing it will most likely be less "efficient" than just returning the value directly. I've also switched to the standard fixed-width integer types (int16_t, uint8_t) - I recommend you do the same.
I'm surprised no one has suggested this yet for a solution that is both alignment safe and correct across all architectures. (well, any architecture where there are 8 bits to a byte).
inline int16_t ReadINT16(uint8_t *ByteArray, int32_t Offset)
{
    int16_t result;
    memcpy(&result, ByteArray + Offset, sizeof(int16_t));
    return result;
};
And I suppose the overhead of memcpy could be avoided:
inline int16_t ReadINT16(uint8_t *ByteArray, int32_t Offset)
{
    int16_t result;
    uint8_t* ptr1 = (uint8_t*)&result;
    uint8_t* ptr2 = ptr1 + 1;
    *ptr1 = *(ByteArray + Offset);
    *ptr2 = *(ByteArray + Offset + 1);
    return result;
};
I believe alignment issues don't generate exceptions on x86. And if I recall, Windows (when it ran on DEC Alpha and others) would trap the alignment exception and fix it up (at a modest perf hit). And I do remember learning the hard way that SPARC on SunOS just flat out crashes when you have an alignment issue.
inline __int16* ReadINT16(unsigned char* ByteArray, __int32 Offset)
{
    return (__int16*)&ByteArray[Offset];
};
Unfortunately this has undefined behaviour in C++, because you are accessing storage using two different types, which is not allowed under the strict aliasing rules. You can access the storage of a type using a char*, but not the other way around.
From previous questions I asked, the only really safe way is to use memcpy to copy the bytes into an int and then use that. (This will likely be optimised to the same code you'd hope for anyway, so it just looks horribly inefficient.)
Your code will probably work, and most people seem to do this... But the point is that you can't go crying to your compiler vendor when one day it generates code that doesn't do what you'd hope.
I see no problem with this; it's exactly what I'd do. As long as the byte array is safe to access, just make sure the offset is correct (shorts are 2 bytes, so you may want to ensure odd offsets can't occur, or something like that).

How to reconstruct a data-structure from injected process' memory space?

I've got this DLL I made. It's injected into another process. Inside the other process, I do a search of its memory space with the following function:
void MyDump(const void *m, unsigned int n)
{
    const char *p = reinterpret_cast<const char *>(m);
    for (unsigned int i = 0; i < n; ++i) {
        // Do something with p[i]...
    }
}
Now my question. If the target process uses a data structure, let's say
struct S
{
    unsigned char a;
    unsigned char b;
    unsigned char c;
};
Is it always laid out the same way in the process' memory? I mean, if S.a = 2 (always followed by b = 3 and c = 4), is the structure stored as a contiguous run in the process' memory space, like
Offset
---------------------
0x0000 | 0x02 0x03 0x04
Or can those variables be in a different places there, like
Offset
---------------------
0x0000 | 0x00 0x02 0x00
0x03fc | 0x00 0x03 0x04
If the latter, how can the data structure be reconstructed from various points in memory?
Many thanks in advance,
nhaa123
If your victim is written in C or C++, and the datatypes used are truly that simple, then you'll always find them as a single block of bytes in memory.
But as soon as you have C++ types like std::string that observation no longer holds. For starters, the exact layout will differ between C++ compilers, and even different versions of the same compiler. The bytes of a std::string will likely not be in a contiguous array, but sometimes they are. If they're split in two, finding the second half probably will not help you in finding the first half.
Now throw in more complicated environments like a JIT'ting JVM running a Java app. The types you encounter in memory are very, very complex; one could write a book about decoding them.
The order of members will always be the same, and the structure will occupy a contiguous memory block.
Depending on the compiler, padding might be added between members, but it will still be the same if the program is recompiled with the same compiler and the same settings. If padding is added and you are unaware of it, you can't detect it reliably at runtime: all the information the compiler had is lost by that point, and you are left to analyze the patterns and guess.
It depends on the alignment of the structure.
If you have something like this:
struct A
{
    int16_t a;
    char b;
    int32_t c;
    char d;
};
then by default on a 32-bit platform (I don't know if that is true for 64-bit), the offset of c is 4, as there is one byte of padding after b, and after d there are 3 more bytes of padding at the end (if I remember correctly).
It will be different if the structure has a specified alignment.
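The layout described above can be confirmed with offsetof (a sketch using the fixed-width names; the exact offsets assume the typical 4-byte alignment of int32_t):

```cpp
#include <cstddef>
#include <cstdint>

struct A
{
    std::int16_t a;  // offset 0
    char         b;  // offset 2
    std::int32_t c;  // offset 4: one padding byte follows b
    char         d;  // offset 8: three tail-padding bytes follow
};
```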
Now my question. If the target process uses a data structure [...] is it always presented the same way in the process' memory? I mean, if S.a = 2 (which always follows b = 3, c = 4), is the structure presented in a continuous row in the process' memory space?
Yes; however, it will often be padded to align members in ways you may not expect. Thus, simply recreating the data structure in your own code, in order to interface with it via code injection, requires matching the original compiler's padding and alignment settings.
I would highly recommend using ReClassEx or ReClass.NET, two open-source programs created specifically for reconstructing data structures from memory and generating usable C++ code.