Visual C++ appends 0xCC (int3) bytes at the end of functions - c++

This is my first time around, and I really hope you guys can help me, as I have run out of ideas by now.
I have searched for an answer for a couple of hours now, and could not find an answer that would actually work.
I would like to directly inject code into a running process. Yes, you read that right. I am trying to inject code into another application, and - believe it or not - this is only to extend its functionality.
I am using Visual Studio 2012 Express Edition on Windows.
I have the following code:
__declspec(naked) void Foo()
{
    __asm
    {
        // Inline assembly code here
    }
}

__declspec(naked) void FooEnd() {}

int main()
{
    cout << HEX(Foo) << endl;
    cout << HEX(FooEnd) << endl;
    cout << (int)FooEnd - (int)Foo << endl;
    // Inject code here using WriteProcessMemory
    return 0;
}
Most of the code has been removed in order to maintain readability, though I can post other portions of it on request.
Output is the following:
0x010B1000
0x010B1010
16
The resulting size is actually incorrect. The functions are compiled in the right order (I made sure using /ORDER), but the compiler adds a bunch of 0xCC (int 3) bytes after each function, which extends its size, so I can't get the real (useful) number of bytes of actual executable code.
In another Stack Overflow question, it was said that disabling "Edit and Continue" would make these extra bytes go away, but no matter what I tried, that didn't work for me.
I also tried using Release setup instead of Debug, changed a bunch of optimization settings, but none of these had any effect. What do you think could be the solution? I may be missing something obvious.
Anyway, is this (in your opinion) the best way to acquire a function's length (readability, reliability, ease of use)?
I hope I explained everything I had to in order for you to be able to help. If you have further questions, please feel free to leave a comment.
Thanks for your time and efforts.

As Devolus points out, the compiler is inserting these extra bytes after your code in order to align the next function on a reasonable (usually divisible by 16) starting address.
The compiler is actually trying to help you: since 0xCC is the breakpoint instruction, the code will break into the debugger (if attached) should the instruction pointer accidentally point outside a function at any point during execution.
None of this should worry you for your purposes. You can consider the 0xCC padding as part of the function.

You don't need the extra padding when you're injecting the code, so it's fine to discard it. It should also be fine to copy it over; it will just result in a few extra bytes of copying. Chances are the memory you're injecting into will be a page-aligned block anyway, so you're not really gaining anything by stripping it out.
But if you really want to strip it out, a simple solution to your problem would be to iterate backwards from the last byte before the next function until there are no more 0xCC bytes.
i.e.:
#include <iostream>

__declspec(naked) void Foo()
{
    __asm
    {
        _emit 0x4A
        _emit 0x4B
    }
}

__declspec(naked) void FooEnd() {}

int main(int argc, char** argv)
{
    // Start at the last byte of the memory-aligned code instead of the first byte of FooEnd.
    unsigned char* fooLast = (unsigned char*)FooEnd - 1;

    // Keep going backwards until we don't have a 0xCC.
    while (*fooLast == 0xCC)
        fooLast--;

    // fooLast now points at the last byte of the function, so add 1.
    int length = ((int)fooLast - (int)Foo) + 1;

    // Should output 2 for the length of Foo.
    std::cout << length;
}

The extra bytes are inserted by the compiler to create a memory alignment, so you can't discard it, since you are using the next function as a marker.
On the other hand, since you are writing the injected code in assembly anyway, you can just as well write the code, compile it, and then put the binary form in a byte array. That's how I would do this, because then you have the exact length.

Related

Why is this variable returning 32766?

I wrote a very basic evolution algorithm. The way it's supposed to work is that the user types in the desired value, and the amount of generations to try to reach it. Then, the program will run through, taking the nearest value in an array to the goal and mutating it four times (while also leaving the original, in case it's right) to try and get closer to the goal. In theory, it should take roughly |n|/2 generations to reach the value, as mutations happen in either one or two points.
Here's the code to demonstrate what I mean:
#include <iostream>

using namespace std;

int gen[5] = {0, 0, 0, 0, 0};
int goal; int gens; int best; int i = 0; int fit;

int dif(int in) {
    return abs(gen[in] - goal);
}

void nextgen() {
    int fit[5] = {dif(1), dif(2), dif(3), dif(4), dif(5)};
    best = *max_element(fit, fit + 6);
    int gen[5] = {best - 2, best - 1, best, best + 1, best + 2};
}

int main() {
    cout << "Goal: "; cin >> goal;
    cout << "Gens: "; cin >> gens;
    while (i < gens) {
        nextgen();
        cout << "Generation " << i + 1 << ": " << best << "\n";
        i = i + 1;
    }
}
It's pretty simple code. However, it seems that the int best bit of the output is returning 32766 every time, no matter what I do. Do you know what I've done wrong?
I've tried outputting the entire generation (which is even worse: a jumbled mess of non-user-friendly data that appears meaningless), I've reworked the code, I've added variables and functions to try and pin down exactly where the error is, and I watched the entire Code Aesthetic YouTube channel to make sure this looked good for you guys.
Looks like you're driving C++ without a license or a safety belt. Joking aside, please keep trying and learning, but with C/C++ you should always enable compiler warnings. The godbolt link in the comment from #user4581301 is really good; the compiler flags -Wall -Wextra -pedantic -O2 -fsanitize=address,undefined are all best practice. (I would add -Werror.)
Why you got 32766 could be analyzed with a debugger, but it's not meaningful. A number close to 32768 (= 2^15) should set off all the warning bells (it could be an integer overflow). Your code is accessing uninitialized memory (among other issues), leading to what is called undefined behaviour. This means it may produce different output depending on your machine, compiler, optimization flags, OS, standard libraries, etc. - even adding a debug print could change what it does.
For optimization algorithms (like GAs) it's also super easy to fool yourself into thinking that your implementation is correct, because the optimization will find a way to avoid (or exploit) any bugs. I've had one in my NN implementation that was accessing some data from the previous example by accident, and it took several days until I even noticed there was a problem.
If you want to focus on the algorithms, I suggest starting with a different language (anything except C/C++/assembly). My advice would be either Python (though it can be 50x slower, it's much easier to learn and write) or Rust (just as fast as C++ and just as complicated, but with no undefined behaviour). With Rust, every mistake in your code above would have given you either a warning by default, a compiler error, or a runtime error instead of wrong output. That said, C++ with the flags mentioned above does the same for your specific code.

C++ struct alignment to 1 byte causes crash on WinCE

I'm working on an application that requires a big chunk of memory. To decrease memory usage I've switched alignment for a huge structure to 1 byte (#pragma pack(1)).
After this my struct size was around 10-15% smaller, but a problem appeared.
When I try to use one of the fields of my structure through a pointer or reference, the application just crashes. If I change the field directly, it works OK.
In a test application I found out that the problem starts to appear after using a smaller-than-4-bytes field in the struct.
Test code:
#pragma pack(1)
struct TestStruct
{
    struct
    {
        long long lLongLong;
        long lLong;
        //bool lBool;           // << if uncommented, then crash
        //short lShort;         // << if uncommented, then crash
        //char lChar;           // << if uncommented, then crash
        //unsigned char lUChar; // << if uncommented, then crash
        //byte lByte;           // << if uncommented, then crash
        __int64 lInt64;
        unsigned int Int;
        unsigned int Int2;
    } General;
};

struct TestStruct1
{
    TestStruct lT[5];
};
#pragma pack()

void TestFunct(unsigned int &pNewLength)
{
    std::cout << pNewLength << std::endl;
    std::cout << "pNL pointer: " << &pNewLength << std::endl;
    pNewLength = 7; // << crash
    char *lPointer = (char *)&pNewLength;
    *lPointer = 0x32; // << or crash here
}

int _tmain(int argc, _TCHAR* argv[])
{
    std::cout << sizeof(TestStruct1) << std::endl;
    TestStruct1 *lTest = new TestStruct1();
    TestFunct(lTest->lT[4].General.Int);
    std::cout << lTest->lT[4].General.Int << std::endl;
    char lChar;
    std::cin >> lChar;
    return 0;
}
Compiling this code for ARM (WinCE 6.0) results in a crash. The same code on Windows x86 works OK. Changing pack(1) to pack(4) or just pack() resolves the problem, but the structure is larger.
Why does this alignment cause a problem?
You can fix it (to run on WinCE with ARM) by using the __unaligned keyword. I was able to compile this code with VS2005 and successfully run it on a WM5 device by changing:
void TestFunct(unsigned int &pNewLength)
to
void TestFunct(unsigned int __unaligned &pNewLength)
Using this keyword will more than double the instruction count, but it will allow you to use any legacy structures.
more on this here:
http://msdn.microsoft.com/en-us/library/aa448596.aspx
ARM architectures only support aligned memory accesses. This means four-byte types can only be read and written at addresses that are a multiple of 4; for two-byte types, the address must be a multiple of 2. Any attempt at unaligned memory access will normally reward you with a DATATYPE_MISALIGNMENT exception and a subsequent crash.
Now you might wonder why you only started seeing crashes when passing your unaligned structure members around as pointers and references; this has to do with the compiler. As long as you directly access the fields in the structure, it knows that you are accessing unaligned data and deals with it by transparently reading and writing the data in several aligned chunks that are split and reassembled. I have seen eVC++ do this to write a four-byte structure member that was two-byte-aligned: the generated assembly instructions split the integer into separate two-byte pieces and wrote them separately.
The compiler does not know whether a pointer or reference is aligned or not, so as soon as you pass unaligned structure fields around as pointers or references, there is no way for it to know these should be treated in a special manner. It will treat them as aligned data and will access them accordingly, which leads to crashes when the address is unaligned.
As marcin_j mentioned, it is possible to work around this by telling the compiler a particular pointer/reference is unaligned with the __unaligned keyword, or rather the UNALIGNED macro, which does nothing on platforms that do not need it. It basically tells the compiler to be careful with the pointer in a way that I assume is similar to the way unaligned structure members are accessed.
A naïve approach would be to plaster UNALIGNED all over your code, but that is not recommended because it can incur a performance penalty: any data access through an __unaligned pointer may need several memory reads/writes, whereas the aligned version needs only one. UNALIGNED is typically used only at places in the code where it is known that unaligned data will be passed around, and left out elsewhere.
On x86, unaligned access is slow. ARM flat out can't do it. Your small types break the alignment of the next element.
Not that it matters. The overhead is unlikely to be more than 3 bytes, if you sort your members by size.

Most efficient way to read UInt32 from any memory address?

What would be the most efficient way to read a UInt32 value from an arbitrary memory address in C++? (Assuming Windows x86 or Windows x64 architecture.)
For example, consider having a byte pointer that points somewhere in memory to a block that contains a combination of ints, string data, etc., all mixed together. The following sample shows reading the various fields from this block in a loop.
typedef unsigned char* BytePtr;
typedef unsigned int UInt32;
...
BytePtr pCurrent = ...;
while ( *pCurrent != 0 )
{
    ...
    if ( *pCurrent == ... )
    {
        UInt32 nValue = *( (UInt32*) ( pCurrent + 1 ) ); // line A
        ...
    }
    pCurrent += ...;
}
If at line A, pCurrent happens to contain a 4-byte-aligned address, reading the UInt32 should be a single memory read. If it contains a non-aligned address, more than one memory cycle may be needed, which slows the code down. Is there a faster way to read the value from non-aligned addresses?
I'd recommend memcpy into a temporary of type UInt32 within your loop.
This takes advantage of the fact that a four-byte memcpy will be inlined by the compiler when building with optimization enabled, and it has a few other benefits:
If you are on a platform where alignment matters (hpux, solaris sparc, ...) your code isn't going to trap.
On a platform where alignment matters, it may be worthwhile to do an address check for alignment, then either a regular aligned load or a set of four byte loads and bit-ors. Your compiler's memcpy very likely will do this the optimal way.
If you are on a platform where an unaligned access is allowed and doesn't hurt performance (x86, x64, powerpc, ...), you are pretty much guaranteed that such a memcpy is the cheapest way to do this access.
If your memory was initially a pointer to some other data structure, your code may be undefined because of aliasing problems, because you are casting to another type and dereferencing that cast. Run-time problems due to aliasing-related optimization issues are very hard to track down! Presuming that you can figure them out, fixing them can also be very hard in established code, and you may have to use obscure compilation options like -fno-strict-aliasing or -qansialias, which can limit the compiler's optimization ability significantly.
Your code is undefined behaviour.
Pretty much the only "correct" solution is to only read something as a type T if it is a type T, as follows:
uint32_t n;
char * p = point_me_to_random_memory();
std::copy(p, p + 4, reinterpret_cast<char*>(&n));
std::cout << "The value is: " << n << std::endl;
In this example, you want to read an integer, and the only way to do that is to have an integer. If you want it to contain a certain binary representation, you need to copy that data to the address starting at the beginning of the variable.
Let the compiler do the optimizing!
UInt32 ReadU32(unsigned char *ptr)
{
    return  static_cast<UInt32>(ptr[0])         |
           (static_cast<UInt32>(ptr[1]) << 8)   |
           (static_cast<UInt32>(ptr[2]) << 16)  |
           (static_cast<UInt32>(ptr[3]) << 24);
}

Why would a C++ program allocate more memory for local variables than it would need in the worst case?

Inspired by this question.
Apparently in the following code:
#include <Windows.h>
int _tmain(int argc, _TCHAR* argv[])
{
    if( GetTickCount() > 1 ) {
        char buffer[500 * 1024];
        SecureZeroMemory( buffer, sizeof( buffer ) );
    } else {
        char buffer[700 * 1024];
        SecureZeroMemory( buffer, sizeof( buffer ) );
    }
    return 0;
}
compiled with default stack size (1 megabyte) with Visual C++ 10 with optimizations on (/O2) a stack overflow occurs because the program tries to allocate 1200 kilobytes on stack.
The code above is of course slightly exaggerated to show the problem - uses lots of stack in a rather dumb way. Yet in real scenarios stack size can be smaller (like 256 kilobytes) and there could be more branches with smaller objects that would induce a total allocation size enough to overflow the stack.
That makes no sense. The worst case would be 700 kilobytes - it would be the codepath that constructs the set of local variables with the largest total size along the way. Detecting that path during compilation should not be a problem.
So the compiler produces a program that tries to allocate even more memory than the worst case. According to this answer LLVM does the same.
That could be a deficiency in the compiler or there could be some real reason for doing it this way. I mean maybe I just don't understand something in compilers design that would explain why doing allocation this way is necessary.
Why would the compiler want a program allocate more memory than the code needs in the worst case?
I can only speculate that this optimization was deemed too unimportant by the compiler designers. Or perhaps, there is some subtle security reason.
BTW, on Windows, stack is reserved in its entirety when the thread starts execution, but is committed on as-needed basis, so you are not really spending much "real" memory even if you reserved a large stack.
Reserving a large stack can be a problem on 32-bit system, where having large number of threads can eat the available address space without really committing much memory. On 64-bit, you are golden.
It could be down to your use of SecureZeroMemory. Try replacing it with regular ZeroMemory and see what happens - the MSDN page essentially indicates that SZM has some additional semantics beyond what its signature implies, and they could be the cause of the bug.
The following code when compiled using GCC 4.5.1 on ideone places the two arrays at the same address:
#include <iostream>
int main()
{
    int x;
    std::cin >> x;

    if (x % 2 == 0)
    {
        char buffer[500 * 1024];
        std::cout << static_cast<void*>(buffer) << std::endl;
    }

    if (x % 3 == 0)
    {
        char buffer[700 * 1024];
        std::cout << static_cast<void*>(buffer) << std::endl;
    }
}
input: 6
output:
0xbf8e9b1c
0xbf8e9b1c
The answer is probably "use another compiler" if you want this optimization.
OS paging and byte alignment could be a factor. Also, housekeeping may use extra stack, along with the space required for calling other functions within that function.

Which is more readable (C++)?

int valueToWrite = 0xFFFFFFFF;
static char buffer2[256];

int* writePosition = (int*) &buffer2[5];
*writePosition = valueToWrite;

// OR

*((int*) &buffer2[10]) = valueToWrite;
Now, I ask you guys: which one do you find more readable, the two-step technique involving a temporary variable, or the one-step technique?
Do not worry about optimization, they both optimize to the same thing, as you can see here.
Just tell me which one is more readable for you.
or DWORD PTR ?buffer2@?1??main@@9@4PADA+5, -1
or DWORD PTR ?buffer2@?1??main@@9@4PADA+10, -1
int* writePosition = (int* ) &buffer2[5]
Or
*((int*) &buffer2[10] ) = valueToWrite;
Are both incorrect because on some platforms access to unaligned values (+5 +10) may cost hundreds of CPU cycles and on some (like older ARM) it would cause an illegal operation.
The correct way is:
memcpy(buffer2 + 5, &valueToWrite, sizeof(valueToWrite));
And it is more readable.
Once you encapsulate it inside a class, it does not really matter which technique you use. The method name will provide the description as to what the code is doing. Thus, in most cases you will not have to delve into the actual impl. to see what is going on.
class Buffer
{
    char buffer2[256];
public:
    void write(int pos, int value) {
        int* writePosition = (int*) &buffer2[pos];
        *writePosition = value;
    }
};
If I was forced to choose, I'd say 1. However, I'll note the code as presented is very C-like either way; I'd shy away from either and re-examine the problem. Here's a simple alternative that is more C++-y:
const char * begin = static_cast<char*>(static_cast<void*>(&valueToWrite));
std::copy(begin, begin+sizeof(int), &buffer2[5]);
The first example is more readable purely on the basis that your brain doesn't have to decipher the pointer operations globbed together.
This will reduce the time a developer looking at the code for the first time needs to understand what's actually going on. In my experience this loosely correlates with reducing the probability of introducing new bugs.
I find the second, shorter one easier to read.
I suspect, however, that this rather depends on whether you are the type of person that can easily 'get' pointers.
The type casting from char* to int* is a little awkward, though. I presume there is a good reason this needs to be done.
Watch out -- this code probably won't work due to alignment issues! Why not just use memset?
#include <string.h>
memset(buffer2+10, 0xFF, 4);
If you can afford to tie yourself to a single compiler (or do preprocessor hacks around compatibility issues), you can use a packed-structs option to get symbolic names for the values you're writing. For example, on GCC:
struct __attribute__ ((__packed__)) packed_struct
{
    char stuff_before[5];
    int some_value;
};

/* .... */

static char buffer2[256];
struct packed_struct *ps = (struct packed_struct *)buffer2;
ps->some_value = valueToWrite;
This has a number of advantages:
Your code more clearly reflects what you're doing, if you name your fields well.
Since the compiler knows if the platform you're on supports efficient unaligned access, it can automatically choose between native unaligned access, or appropriate workarounds on platforms that don't support unaligned access.
But again, has the major disadvantage of not having any standardized syntax.
Most readable would be either variant with a comment added on what you're doing there.
That being said, I despise variables introduced simply for the purpose of a one-time use a couple of lines later. Doing most of my work in the maintenance area, getting dozens of variable names pushed in my face as poor substitutes for an explanatory comment sets me on edge.
Definitely:
* ((int*) &buffer2[10] ) = valueToWrite;
I parse it not in one but in a few steps, and that is why it is more readable to me: I have all the steps in one line.
From the readability perspective, the behaviour of your code should be clear, but "clear" is not how I would describe either of these alternatives. In fact, they are the opposite of "clear", as they are non-portable.
On top of alignment issues, there's integer representation (the size varies from system to system, as does sign representation, endianness and padding to throw into the soup). Thus, the behaviour of your code from system to system is erratic.
If you want to be clear about what your algorithm is supposed to do, you should explicitly put each byte into its correct place. For example:
void serialise_uint_lsb(unsigned char *destination, unsigned source) {
    destination[0] = source & 0xff; source >>= 8;
    destination[1] = source & 0xff; source >>= 8;
    assert(source == 0);
}

void deserialise_uint_lsb(unsigned *destination, unsigned char *source) {
    *destination = 0;
    *destination <<= 8; *destination += source[1];
    *destination <<= 8; *destination += source[0];
}
Serialisation and deserialisation are idiomatic concepts for programmers... *printf and *scanf are forms of serialisation/deserialisation, for example, except it's idiomatically instilled into your head that the most significant (decimal) digit goes first. That is the problem with your code: it doesn't tell your system the direction of the integer, how many bytes there are, etc. Bad news.
Use a serialisation/deserialisation function. Programmers will understand that best.