How to allocate an executable memory buffer?


I would like to allocate a buffer that I can execute on Win32, but I get an exception in Visual Studio because malloc returns memory that is not executable. I read that there is an NX flag that can be disabled... My goal is to convert bytecode to x86 assembly on the fly, with performance in mind.
Can someone help me?

You don't use malloc for that. Why would you anyway, in a C++ program? You don't use new for executable memory either. Instead, there's the Windows-specific VirtualAlloc function to allocate the memory, which you then mark as executable with the VirtualProtect function, applying, for instance, the PAGE_EXECUTE_READ flag.
When you have done that, you can cast the pointer to the allocated memory to an appropriate function pointer type and just call the function. Don't forget to call VirtualFree when you are done.
Here is some very basic example code with no error handling or other sanity checks, just to show you how this can be accomplished in modern C++ (the program prints 5):
#include <windows.h>
#include <vector>
#include <iostream>
#include <cstring>
#include <cstdint> // for std::int32_t
int main()
{
std::vector<unsigned char> const code =
{
0xb8, // move the following value to EAX:
0x05, 0x00, 0x00, 0x00, // 5
0xc3 // return what's currently in EAX
};
SYSTEM_INFO system_info;
GetSystemInfo(&system_info);
auto const page_size = system_info.dwPageSize;
// prepare the memory in which the machine code will be put (it's not executable yet):
auto const buffer = VirtualAlloc(nullptr, page_size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
// copy the machine code into that memory:
std::memcpy(buffer, code.data(), code.size());
// mark the memory as executable:
DWORD dummy;
VirtualProtect(buffer, code.size(), PAGE_EXECUTE_READ, &dummy);
// interpret the beginning of the (now) executable memory as the entry
// point of a function taking no arguments and returning a 4-byte int:
auto const function_ptr = reinterpret_cast<std::int32_t(*)()>(buffer);
// call the function and store the result in a local std::int32_t object:
auto const result = function_ptr();
// free the executable memory:
VirtualFree(buffer, 0, MEM_RELEASE);
// use your std::int32_t:
std::cout << result << "\n";
}
It's very unusual compared to normal C++ memory management, but not really rocket science. The hard part is to get the actual machine code right. Note that my example here is just very basic machine code: mov eax, 5 followed by ret, an encoding that is valid in both 32-bit x86 and x64 mode.
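The example above deliberately leaves out error handling. A minimal sketch of the same allocate, copy, protect sequence with return-value checks could look like the following. The helper names make_executable_copy and release_executable_copy are hypothetical, and the non-Windows branch (mmap/mprotect) is only my assumption so that the sketch also builds outside Windows; it is not part of the original answer.

```cpp
#include <cstddef>
#include <cstring>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h> // assumption: POSIX fallback, only so the sketch builds elsewhere
#endif

// Copies `size` bytes of machine code into freshly allocated memory and then
// switches that memory from read-write to read-execute.
// Returns nullptr on any failure. (Hypothetical helper, not a Windows API.)
void* make_executable_copy(const unsigned char* code, std::size_t size)
{
    if (code == nullptr || size == 0) return nullptr;
#ifdef _WIN32
    void* buffer = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (buffer == nullptr) return nullptr;
    std::memcpy(buffer, code, size);
    DWORD old_protection = 0;
    if (!VirtualProtect(buffer, size, PAGE_EXECUTE_READ, &old_protection)) {
        VirtualFree(buffer, 0, MEM_RELEASE);
        return nullptr;
    }
    // Make sure no stale instruction-cache contents are executed.
    FlushInstructionCache(GetCurrentProcess(), buffer, size);
    return buffer;
#else
    void* buffer = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) return nullptr;
    std::memcpy(buffer, code, size);
    if (mprotect(buffer, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(buffer, size);
        return nullptr;
    }
    return buffer;
#endif
}

// Releases memory obtained from make_executable_copy. Returns true on success.
bool release_executable_copy(void* buffer, std::size_t size)
{
    if (buffer == nullptr) return false;
#ifdef _WIN32
    (void)size; // MEM_RELEASE requires a size of 0
    return VirtualFree(buffer, 0, MEM_RELEASE) != 0;
#else
    return munmap(buffer, size) == 0;
#endif
}
```

On an x86 or x64 build, the returned pointer can then be cast to a function pointer type such as std::int32_t(*)() and called, exactly as in the example above.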

Extending the above answer, a good practice is:
Allocate memory with VirtualAlloc and read-write access.
Fill that region with your code.
Change that region's protection with VirtualProtect to execute-read access.
Jump to/call the entry point in this region.
So it could look like this:
adr = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
// write code to the region
ok = VirtualProtect(adr, size, PAGE_EXECUTE_READ, &oldProtection);
// execute the code in the region

As stated in the documentation for VirtualAlloc:
flProtect [in]
The memory protection for the region of pages to be allocated. If the pages are being committed, you can specify any one of the memory protection constants.
Among them are:
PAGE_EXECUTE
0x10
Enables execute access to the committed region of pages. An attempt to write to the committed region results in an access violation.
This flag is not supported by the CreateFileMapping function.
PAGE_EXECUTE_READ
0x20
Enables execute or read-only access to the committed region of pages. An attempt to write to the committed region results in an access violation.
Windows Server 2003 and Windows XP: This attribute is not supported by the CreateFileMapping function until Windows XP with SP2 and Windows Server 2003 with SP1.
PAGE_EXECUTE_READWRITE
0x40
Enables execute, read-only, or read/write access to the committed region of pages.
Windows Server 2003 and Windows XP: This attribute is not supported by the CreateFileMapping function until Windows XP with SP2 and Windows Server 2003 with SP1.
and so on; see the memory protection constants documentation for the full list.

C version based off of Christian Hackl's answer
I think SIZE_T dwSize of VirtualAlloc should be the size of the code in bytes, not system_info.dwPageSize (what if sizeof code is bigger than system_info.dwPageSize?).
I don't know C enough to know if sizeof(code) is the "correct" way of getting the size of the machine code
this compiles under c++ so I guess it's not off topic lol
#include <Windows.h>
#include <stdio.h>
#include <string.h> // for memcpy
int main()
{
// double add(double a, double b) {
// return a + b;
// }
unsigned char code[] = { //Antonio Cuni - How to write a JIT compiler in 30 minutes: https://www.youtube.com/watch?v=DKns_rH8rrg&t=118s
0xf2,0x0f,0x58,0xc1, //addsd %xmm1,%xmm0
0xc3, //ret
};
LPVOID buffer = VirtualAlloc(NULL, sizeof(code), MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
memcpy(buffer, code, sizeof(code));
//protect after write, because protect will prevent writing.
DWORD oldProtection;
VirtualProtect(buffer, sizeof(code), PAGE_EXECUTE_READ, &oldProtection);
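// a typedef such as: typedef double (*binary_fn)(double, double); would make this cast shorter
double (*function_ptr)(double, double) = (double (*)(double, double))buffer;
// (*function_ptr)(2, 234) also works: dereferencing a function pointer yields the
// function itself, which converts straight back to a pointer for the call
double result = function_ptr(2, 234);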
VirtualFree(buffer, 0, MEM_RELEASE);
printf("%f\n", result);
}
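On the size question: as far as I know, VirtualAlloc rounds the requested size up to the next page boundary when the address parameter is NULL, so passing sizeof(code) should be fine, and code larger than one page simply gets more pages. sizeof applied to an array (not to a pointer) does give its size in bytes. A small sketch of the rounding arithmetic, assuming a power-of-two page size; round_up_to_page is a hypothetical helper, not a Windows API:

```cpp
#include <cstddef>

// Rounds `size` up to the next multiple of `page_size`.
// Assumes `page_size` is a non-zero power of two (for example 4096).
constexpr std::size_t round_up_to_page(std::size_t size, std::size_t page_size)
{
    return (size + page_size - 1) & ~(page_size - 1);
}

// A 5-byte code buffer still occupies one whole 4 KB page,
// and a buffer one byte larger than a page occupies two.
static_assert(round_up_to_page(5, 4096) == 4096, "one page");
static_assert(round_up_to_page(4097, 4096) == 8192, "two pages");
```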

At link time, the linker organizes your program's memory footprint into data sections and code sections. When the program is loaded, the pages of each section get protection attributes, and on a CPU with NX support (DEP on Windows) an attempt to execute an instruction from a page that is not marked executable raises a hardware exception. This provides some security by making sure your program only executes code from pages that are meant to hold code. malloc is intended for allocating data memory. It hands out memory from your application's heap, whose pages are marked as data memory (readable and writable, but not executable), so whatever malloc returns will always be data.
I hope this helps you have a better understanding what's going on, though it might not be enough to get you where you need to be. Perhaps you can pre-allocate a "code heap" or memory pool for your runtime-generated code. You will probably need to fuss with the linker to accomplish this but I don't know any of the details.

Related

Stack overflow when passing a vector to memcpy

I am trying to copy a vector of unsigned char (shellcode) into memory from VirtualAlloc, but I keep getting a stack overflow when trying to execute it:
std::vector<unsigned char> decrypted = {0x00, 0x15, ...} ; // shell code decrypted to unsigned char
cout << &decrypted.front() << endl;
size_t shellcodesize = decrypted.size();
void *exec = VirtualAlloc(0, shellcodesize, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
memcpy(exec, &decrypted.front(), shellcodesize);
((void(*)())exec)();
The shellcode is not the problem, because I tried different shellcodes and they all cause the same problem.
The encryption/decryption works as intended; it has been tested in other projects and works flawlessly, which leaves me with the last 4-5 lines shown above.
Compiling shows no errors, but when running in WinDbg Preview I get this:
(3cd4.3664): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
WARNING: Stack overflow detected. The unwound frames are extracted from outside normal stack bounds.
00000000`00170003 385c7838 cmp byte ptr [rax+rdi*2+38h],bl ds:00000000`01f13078=??
WARNING: Stack overflow detected. The unwound frames are extracted from outside normal stack bounds.
I think that unsigned char buf[] = "\x00\x15"; is automatically null-terminated, but as far as I know a vector is not, which I think causes the stack overflow issue (please correct me if I am wrong).

There is one slight issue with your code:
void *exec = VirtualAlloc(0, shellcodesize, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
is the correct call to VirtualAlloc here. Most Windows versions actually return allocated memory if the first parameter is 0, but there is no guarantee of this. If the first parameter is non-zero (and MEM_COMMIT is requested without MEM_RESERVE), the call will fail.
Other than that, your code is correct (you might check that the return value of VirtualAlloc is non-zero), and I would suspect that the loaded code itself causes the stack overflow, especially as 0x00 0x15 does not make any sense when disassembled and will certainly crash your application.
EDIT: If your architecture is X64, you can test the following shellcode: 0xb8, 0x01, 0x00, 0x00, 0x00, 0xc3 (it simply returns 1 from the subroutine)
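On the null-terminator worry in the question: memcpy copies exactly the number of bytes it is told to, so a missing terminator in the vector should not matter here. The sketch below only illustrates that point; copy_shellcode and literal_bytes are hypothetical names, and the two bytes are the placeholder values from the question, not working code.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// A string literal gets an implicit trailing '\0', so this array has 3 elements,
// and strlen() on it would report 0 because the first byte is already '\0'.
const unsigned char literal_bytes[] = "\x00\x15";

// Copies the vector's bytes into `dest`, which must have room for bytes.size() bytes.
// The byte count comes from the vector itself, so no terminator is involved.
std::size_t copy_shellcode(void* dest, const std::vector<unsigned char>& bytes)
{
    if (!bytes.empty()) {
        std::memcpy(dest, bytes.data(), bytes.size());
    }
    return bytes.size();
}
```

If anything, the string literal form is the more fragile one: code that measured it with strlen would stop at the first zero byte.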

How to know committed memory (shared and private) of process in C++

I need to know the total committed memory size (shared and private).
I get the private part from PagefileUsage in PROCESS_MEMORY_COUNTERS_EX.
How can I find the (total) shared memory of the process?

Generally, shared memory works like this: CreateFileMapping creates a shared memory object; OpenFileMapping opens an existing one and returns a HANDLE; MapViewOfFile maps it into the address space of the program, where it can be read and written.
The MSDN page for MapViewOfFile says: To obtain the size of a view, use the VirtualQuery function. The return value of VirtualQuery is the number of bytes written to the buffer passed as the second parameter, not the size of the memory. The actual memory information is found in that buffer after the call.
SIZE_T VirtualQuery(
LPCVOID lpAddress,
PMEMORY_BASIC_INFORMATION lpBuffer,
SIZE_T dwLength
);
Here is the code:
HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 1000, NULL);
PVOID pData = MapViewOfFile(hMap, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 0);
MEMORY_BASIC_INFORMATION mem_info;
SIZE_T nBufferSize = VirtualQuery(pData, &mem_info, sizeof(mem_info));
UnmapViewOfFile(pData); //you could place a breakpoint here
CloseHandle(hMap);
When the breakpoint is hit, inspect the contents of mem_info, paying special attention to the RegionSize member of the structure.
The system hands out memory in whole pages. The page size is normally 4 KB, so the size of a successfully mapped region is generally an integer multiple of 4 KB. The code above asks for 1000 bytes, but the system actually provides 4 KB.

ARM GCC heap not fully used

I am setting up my Cortex-M4 platform to use heap memory and encountering some issues.
I set heap region size to be 512 bytes, and it only allocates 9 bytes. Then I set heap to be 10kB and it can only allocate 362 bytes.
Here is my gcc stub:
int _sbrk(int a)
{
//align a to 4 bytes
if (a & 3)
{
a += (4 - (a & 0x3));
}
extern long __heap_start__;
extern long __heap_end__;
static char* heap_ptr = (char*)&__heap_start__;
if (heap_ptr + a < (char*)&__heap_end__)
{
int res = (int)heap_ptr;
heap_ptr += a;
return res;
}
else
{
return -1;
}
}
__heap_start__ and __heap_end__ are correct and their difference shows the correct region size.
I added debug output to the _sbrk function to see which value of the argument a is passed on each call. The values, in call order, are:
2552
1708
4096
What can I do to make it use the full heap memory? And how is the _sbrk argument calculated? Basically, what's wrong here?
Building C++ code, using new (std::nothrow).
EDIT
If I am using malloc (C style) it allocates 524 bytes and no _sbrk call before main, unlike when using operator new.
arm-none-eabi-g++.exe (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
EDIT2 Minimal Complete Verifiable Example
Here is my application code and _sbrk with info printing:
void foo()
{
while (true)
{
uint8_t * byte = new (std::nothrow) uint8_t;
if (byte)
{
DBB("Byte allocated");
cnt++;
}
else
{
DBB_ERROR("Allocated %d bytes", cnt);
}
}
}
int _sbrk(int a)
{
//align a to 4 bytes
if (a & 3)
{
a += (4 - (a & 0x3));
}
extern long __heap_start__;
extern long __heap_end__;
static char* heap_ptr = (char*)&__heap_start__;
DBB("%d 0x%08X", a, a);
DBB("0x%08X", heap_ptr);
DBB("0x%08X", &__heap_start__);
DBB("0x%08X", &__heap_end__);
if (heap_ptr + a < (char*)&__heap_end__)
{
int res = (int)heap_ptr;
heap_ptr += a;
DBB("OK 0x%08X 0x%08X", res, heap_ptr);
return res;
}
else
{
DBB("ERROR");
return -1;
}
}
And produced output is:

Your output reveals the C++ memory allocation system first asks for 32 bytes and then 132 bytes. It is then able to satisfy nine requests for new uint8_t with that space. Presumably it uses some of the 164 bytes for its internal record-keeping. This may involve keeping linked lists or maps of which blocks are allocated, or some other data structure. Also, for efficiency, it likely does not track single-byte allocations but rather provides some minimum block size for each allocation, perhaps 8 or 16 bytes. When it runs out of space, it asks for another 4096 bytes. Your sbrk then fails since this is not available.
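To make that sequence concrete, here is a small host-side model of an sbrk-style function working on a fixed 512-byte region. This is only a sketch: g_heap, g_used, model_sbrk and sbrk_failed are hypothetical names, and the array stands in for the region between __heap_start__ and __heap_end__.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the linker-provided heap region (512 bytes, as in the question).
alignas(8) static unsigned char g_heap[512];
static std::size_t g_used = 0;

// The conventional failure value of sbrk: (void*)-1.
inline void* sbrk_failed()
{
    return reinterpret_cast<void*>(static_cast<std::intptr_t>(-1));
}

// Simplified model of _sbrk: returns the previous break on success,
// or (void*)-1 when the rounded-up request does not fit any more.
void* model_sbrk(std::ptrdiff_t increment)
{
    if (increment < 0) return sbrk_failed(); // shrinking is not modelled here
    // Round the request up to a multiple of 4, like the stub in the question.
    const std::size_t request =
        (static_cast<std::size_t>(increment) + 3u) & ~static_cast<std::size_t>(3);
    if (request > sizeof(g_heap) - g_used) return sbrk_failed();
    void* previous_break = g_heap + g_used;
    g_used += request;
    return previous_break;
}
```

Calling model_sbrk(32) and model_sbrk(132) succeeds and leaves 348 bytes unused, yet the following model_sbrk(4096) fails, because the allocator asks for a whole new chunk rather than for the few bytes the caller wanted.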
The C++ memory allocation system is working as designed. In order to operate, it requires more space than is doled out for individual requests. In order to supply more memory for requests, you must provide more memory in the heap. You cannot expect a one-to-one correspondence, or any simple correspondence, between memory supplied from sbrk to the memory allocation system and memory supplied from the memory allocation system to its clients.
There is no way to tell the C++ memory allocation system to use “full heap memory” to satisfy requests to it. It is required to track dynamic allocations and releases of memory. Since its clients may make diverse size requests and may release them in any order, it needs to be able to track which blocks are currently allocated and which are not—a simple stack will not suffice. Therefore, it must use additional data structures to keep track of memory, and those data structures will consume space. So not all of the heap space can be given to clients; some of it must be used for overhead.
If the memory use of the memory allocation system in your C++ implementation is too inefficient for your purposes, you might replace it with one you write yourself or with third-party software. Any implementation of the memory allocation system makes various trade-offs about speed and block size, and those can be tailored to particular situations and goals.
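As a rough illustration of why per-allocation overhead matters so much for one-byte requests, consider a toy cost model in which every block carries a header and the payload is rounded up to an alignment. The header size and alignment below are made-up parameters, not the values your C library actually uses, and block_cost and max_allocations are hypothetical names.

```cpp
#include <cstddef>

// Bytes a single allocation occupies in a toy allocator that prepends a
// `header` and rounds the payload up to a multiple of `alignment`.
constexpr std::size_t block_cost(std::size_t request, std::size_t header,
                                 std::size_t alignment)
{
    return header + ((request + alignment - 1) / alignment) * alignment;
}

// How many allocations of `request` bytes fit into `heap_bytes` under that model.
constexpr std::size_t max_allocations(std::size_t heap_bytes, std::size_t request,
                                      std::size_t header, std::size_t alignment)
{
    return heap_bytes / block_cost(request, header, alignment);
}

// With an assumed 8-byte header and 8-byte alignment, each 1-byte request
// costs 16 bytes, so a 512-byte heap holds 32 of them rather than 512.
static_assert(block_cost(1, 8, 8) == 16, "1-byte request costs 16 bytes");
static_assert(max_allocations(512, 1, 8, 8) == 32, "32 blocks in 512 bytes");
```

A real allocator has further fixed costs (free lists, bins, chunk granularity), so the actual numbers will be lower still.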

Invalid Handle with NtWow64AllocateVirtualMemory64

Test code:
typedef NTSTATUS(NTAPI *ntalloc64t)(HANDLE, PULONG64, ULONG64, PULONG64, ULONG, ULONG);
#define NtCurrentProcess() ( (HANDLE)(PULONG64) -1 ) ;
int _tmain(int argc, _TCHAR* argv[])
{
ULONG64 dwSize = 0x1000;
ntalloc64t ntalloc64f = (ntalloc64t)(GetProcAddress(GetModuleHandleA("ntdll"), "NtWow64AllocateVirtualMemory64"));
PVOID pvBaseAddress;
pvBaseAddress = (PVOID)NULL;
long kk = ntalloc64f((HANDLE)GetCurrentProcess(), (PULONG64)&pvBaseAddress, 0, (PULONG64)&dwSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}
I am running under WOW64. This returns 0xc0000008, which means the handle is invalid. It also does not work when passing -1 as the handle, which should tell the WinAPI to use the current process.

NtWow64AllocateVirtualMemory64 is undocumented but you can assume that its parameters are almost the same as NtAllocateVirtualMemory and MSDN says this about the base address parameter:
A pointer to a variable that will receive the base address of the allocated region of pages. If the initial value of this parameter is non-NULL, the region is allocated starting at the specified virtual address rounded down to the next host page size address boundary. If the initial value of this parameter is NULL, the operating system will determine where to allocate the region.
You are hiding a bug with your casts: (PULONG64)&pvBaseAddress points to 32 zero bits from pvBaseAddress = (PVOID)NULL followed by 32 undefined bits from somewhere on your stack. If those bits are not all zero, you are asking for a specific base address that is probably not available!
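The effect of reading 64 bits through a pointer to a 32-bit variable can be modelled without any Windows API. In this sketch the struct is a hypothetical stand-in for the 32-bit pointer variable and whatever happens to sit next to it on the stack; the real stack layout is of course up to the compiler.

```cpp
#include <cstdint>
#include <cstring>

// Two adjacent 32-bit slots: the 32-bit "pointer" variable and its neighbour.
struct StackSlots {
    std::uint32_t base_address; // the variable that was set to NULL (0)
    std::uint32_t neighbour;    // unrelated data that happens to follow it
};

// Reads 64 bits starting at base_address, which is what the callee
// effectively does when it is handed (PULONG64)&pvBaseAddress.
std::uint64_t read_as_64_bits(const StackSlots& slots)
{
    static_assert(sizeof(StackSlots) == sizeof(std::uint64_t),
                  "the model assumes two packed 32-bit slots");
    std::uint64_t value = 0;
    std::memcpy(&value, &slots, sizeof value);
    return value;
}
```

With base_address set to 0 and any non-zero neighbour, the 64-bit value is non-zero, so the callee sees a request for a specific base address instead of "let the system choose".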
Remove as many casts as possible and it should start working:
typedef NTSTATUS(NTAPI *ntalloc64t)(HANDLE, PULONG64, ULONG64, PULONG64, ULONG, ULONG);
ntalloc64t ntalloc64f = (ntalloc64t) GetProcAddress(GetModuleHandleA("ntdll"), "NtWow64AllocateVirtualMemory64");
// TODO: if (!ntalloc64f) not wow64, handle error...
HANDLE hTargetProcess = OpenProcess(...);
ULONG64 base = 0, size = 0x1000;
long nts = ntalloc64f(hTargetProcess, &base, 0, &size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);
printf("status=%d base=%I64x size=%I64x\n", nts, base, size);

When we call NtWow64AllocateVirtualMemory64 from the 32-bit ntdll.dll (it exists only in the WOW64 ntdll.dll), the 64-bit function whNtWow64AllocateVirtualMemory64 inside wow64.dll gets called. My reconstruction from the Windows 10 assembly code:
struct Wow64AllocateVirtualMemory64_Stack {
ULONG ProcessHandle;// !!! unsigned !!
ULONG BaseAddress;
ULONG64 ZeroBits;
ULONG RegionSize;
ULONG AllocationType;
ULONG Protection;
};
NTSTATUS
NTAPI
whNtWow64AllocateVirtualMemory64(Wow64AllocateVirtualMemory64_Stack* p)
{
return NtAllocateVirtualMemory(
(HANDLE)(ULONG_PTR)p->ProcessHandle,
(void**)(ULONG_PTR)p->BaseAddress,
p->ZeroBits,
(PSIZE_T)(ULONG_PTR)p->RegionSize,
p->AllocationType,
p->Protection);
}
The key point here is that a HANDLE is 32 bits wide in 32-bit code and 64 bits wide in 64-bit code. As a result, the 32-bit handle value must be extended to a 64-bit handle in the 64-bit code, and it can be either zero-extended or sign-extended. When we extend a positive 32-bit value (a real process handle) there is no difference; the result is the same. But when we extend the negative value -1, zero extension gives 0xFFFFFFFF (an invalid handle), while sign extension gives 0xFFFFFFFFFFFFFFFF, the correct pseudo handle for the current process.
Windows 10 zero-extends the handle, so we cannot use -1 (GetCurrentProcess()) here.
Windows 8 sign-extends the handle.
However, it makes no sense to use this API to allocate memory in a WOW64 process. If we accept any base address, or one below 4 GB, we can use NtAllocateVirtualMemory or VirtualAlloc[Ex]. So this function only makes sense when we want to allocate memory at a base address >= 4 GB. But that is impossible in a WOW64 process: the system reserves all address space at and above 4 GB. In a typical memory map of a WOW64 process (with the /LARGEADDRESSAWARE option),
only the 64-bit ntdll.dll is visible there, and all other memory is reserved.
Without the /LARGEADDRESSAWARE option the reserved range begins at 7FFF0000. This reserved memory also cannot be released: on calling NtFreeVirtualMemory (from a 64-bit process) I got a STATUS_INVALID_PAGE_PROTECTION error.
So there is no sense in using this API to allocate inside your own process (or any other WOW64 process). It is only useful if we want to allocate memory in a 64-bit process, and not just anywhere but in the range above 4 GB. I do not even know what this would be needed for, or why a base address below 4 GB, which can be allocated with the usual NtAllocateVirtualMemory or VirtualAlloc[Ex], would not be good enough. It is also funny that there is no matching NtWow64FreeVirtualMemory64 API, so the allocated memory cannot be freed. Of course, it is possible to write position-independent (and therefore import-free) 64-bit code, embed it in a 32-bit process, and call it via a 64-bit call gate; such code can call functions from the 64-bit ntdll (and only from it) and return. This is possible, but that is another story.
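The difference between zero- and sign-extending a 32-bit handle value, as described above, can be checked with plain integer arithmetic. The function names are hypothetical; this only models the widening step, not the actual wow64.dll code.

```cpp
#include <cstdint>

// Zero extension: the upper 32 bits are filled with zeros.
constexpr std::uint64_t zero_extend(std::uint32_t handle32)
{
    return static_cast<std::uint64_t>(handle32);
}

// Sign extension: the upper 32 bits are filled with copies of bit 31.
constexpr std::uint64_t sign_extend(std::uint32_t handle32)
{
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(static_cast<std::int32_t>(handle32)));
}

// The pseudo handle -1 is 0xFFFFFFFF in 32-bit code.
static_assert(zero_extend(0xFFFFFFFFu) == 0x00000000FFFFFFFFull,
              "zero extension loses the pseudo handle");
static_assert(sign_extend(0xFFFFFFFFu) == 0xFFFFFFFFFFFFFFFFull,
              "sign extension keeps the pseudo handle");
```

For ordinary (small, positive) handle values both functions return the same result, which is why only the -1 pseudo handle is affected.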

Higher than expected memory usage with VirtualAlloc; what's going on?

Important: Scroll down to the "final update" before you invest too much time here. Turns out the main lesson is to beware of the side effects of other tests in your unittest suite, and to always reproduce things in isolation before jumping to conclusions!
On the face of it, the following 64-bit code allocates (and accesses) 2^20 (about one million) 4 KB pages using VirtualAlloc (a total of 4 GByte):
const size_t N=4; // Tests with this many Gigabytes
const size_t pagesize4k=4096;
const size_t npages=(N<<30)/pagesize4k;
BOOST_AUTO_TEST_CASE(test_VirtualAlloc) {
std::vector<void*> pages(npages,0);
for (size_t i=0;i<pages.size();++i) {
pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);
*reinterpret_cast<char*>(pages[i])=1;
}
// Check all allocs succeeded
BOOST_CHECK(std::find(pages.begin(),pages.end(),nullptr)==pages.end());
// Free what we allocated
bool trouble=false;
for (size_t i=0;i<pages.size();++i) {
const BOOL err=VirtualFree(pages[i],0,MEM_RELEASE);
if (err==0) trouble=true;
}
BOOST_CHECK(!trouble);
}
However, while executing it grows the "Working Set" reported in Windows Task Manager (and confirmed by the value "sticking" in the "Peak Working Set" column) from a baseline ~200,000K (~200MByte) to over 6,000,000 or 7,000,000K (tested on 64bit Windows7, and also on ESX-virtualized 64bit Server 2003 and Server 2008; unfortunately I didn't take note of which systems the various numbers observed occurred on).
Another very similar test case in the same unittest executable tests one-mega 4k mallocs (followed by frees) and that only expands by around the expected 4GByte when running.
I don't get it: does VirtualAlloc have some quite high per-alloc overhead? It's clearly a significant fraction of the page size if so; why is so much extra needed and what's it for? Or am I misunderstanding what the "Working Set" reported actually means? What's going on here?
Update: With reference to Hans' answer, I note this fails with an access violation in the second page access, so whatever is going on isn't as simple as the allocation being rounded up to the 64K "granularity".
char*const ptr = reinterpret_cast<char*>(
VirtualAlloc(0, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE)
);
ptr[0] = 1;
ptr[4096] = 1;
Update: Now on an AWS/EC2 Windows 2008 R2 instance, with Visual Studio Express 2013 installed, I can't reproduce the problem with this minimal code (compiled 64-bit), which tops out with an apparently overhead-free peak working set of 4,335,816K, which is the sort of number I'd expected to see originally. So either there is something different about the other machines I'm running on, or about the boost-test based exe used in the previous testing. Bizarre, to be continued...
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <vector>
int main(int, char**) {
const size_t N = 4;
const size_t pagesize4k = 4096;
const size_t npages = (N << 30) / pagesize4k;
std::vector<void*> pages(npages, 0);
for (size_t i = 0; i < pages.size(); ++i) {
pages[i] = VirtualAlloc(0, pagesize4k, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
*reinterpret_cast<char*>(pages[i]) = 1;
}
Sleep(5000);
for (size_t i = 0; i < pages.size(); ++i) {
VirtualFree(pages[i], 0, MEM_RELEASE);
}
return 0;
}
Final update: Apologies! I'd delete this question if I could, because it turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems the scalable allocator actually retains such allocations in its own pool rather than returning them to the system. This became obvious once I ran the tests individually with enough of a Sleep after them to observe their on-completion working set in Task Manager (whether anything can be done about the TBB behaviour might be an interesting question, but as-is the question here is a red herring).

pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);
You won't get 4096 bytes, it will be rounded up to the smallest permitted allocation. Which is SYSTEM_INFO.dwAllocationGranularity, it has been 64KB for a long time. It is a very basic address space fragmentation counter-measure.
So you are allocating way more than you think.
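As far as I understand it, the 64 KB granularity applies to the address at which each allocation starts, so it limits how densely allocations can be packed into the address space, while only the requested size rounded up to whole pages is actually committed. That would be consistent with the access violation at ptr[4096] reported in the question's update. A back-of-the-envelope sketch under those assumptions (helper names are hypothetical, the 64 KB and 4 KB figures are the ones mentioned in this thread):

```cpp
#include <cstdint>

// Rounds `value` up to the next multiple of `multiple` (which must be non-zero).
constexpr std::uint64_t round_up(std::uint64_t value, std::uint64_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

// Address space consumed when every allocation starts on a granularity boundary.
constexpr std::uint64_t address_space_used(std::uint64_t count, std::uint64_t size,
                                           std::uint64_t granularity)
{
    return count * round_up(size, granularity);
}

// Memory actually committed when every allocation is rounded up to whole pages.
constexpr std::uint64_t committed_bytes(std::uint64_t count, std::uint64_t size,
                                        std::uint64_t page_size)
{
    return count * round_up(size, page_size);
}

// The test in the question: 2^20 allocations of 4096 bytes each.
constexpr std::uint64_t kAllocations = (4ull << 30) / 4096;
static_assert(address_space_used(kAllocations, 4096, 65536) == (64ull << 30),
              "64 GB of address space");
static_assert(committed_bytes(kAllocations, 4096, 4096) == (4ull << 30),
              "4 GB committed");
```

Under this model the test burns 64 GB of address space but commits only 4 GB, so the granularity alone would not explain a larger working set.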

It turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems the scalable allocator actually retains such allocations in its own pool rather than returning them to the system. This became obvious once I ran the tests individually with enough of a Sleep after them to observe their on-completion working set in Task Manager.
