Dynamic Memory Allocation in fast RAM - C++

On Windows (32-bit and 64-bit), I have to allocate memory to store a large amount of live streaming data, around 1 GB in total. If I use malloc(), I obtain a virtual address, and depending on how much physical memory the machine has, that region may end up being paged to the hard drive. I'm afraid the hard drive access will hurt performance and cause data to be dropped.
Is there a way to force memory to allocate only in RAM, even if it means that I get an error when not enough memory is available (so the user needs to close other things or use another machine)? I want to guarantee that all operations will be done in memory. If this fails, forcing the application to exit is acceptable.
I know that another process could come along and take some memory itself, but I'm not worried about that: this will be the only application on the machine making such a large allocation.
[Edit:]
My attempt so far has been to use VirtualLock, as follows:
if(!SetProcessWorkingSetSize(this, 300000, 300008))
    printf("Error Changing Working Set Size\n");

// Allocate 1GB space
unsigned long sz = sizeof(unsigned char) * 1000000000;
unsigned char * m_buffer = (unsigned char *) malloc(sz);
if(m_buffer == NULL)
{
    printf("Memory Allocation failed\n");
}
else
{
    // Protect memory from being swapped
    if(!VirtualLock(m_buffer, sz))
    {
        printf("Memory swap protection failed\n");
    }
}
But the working set change fails, and so does the VirtualLock call. malloc does return a non-NULL pointer.
[Edit2]
I have also tried:
unsigned long sz = sizeof(unsigned char)*1000000000;
LPVOID lpvResult;
lpvResult = VirtualAlloc(NULL,sz, MEM_PHYSICAL|MEM_RESERVE, PAGE_NOCACHE);
But lpvResult is 0, so no luck there either.

You can use the mlock, mlockall, munlock and munlockall functions to prevent pages from being swapped out (they are part of POSIX, and also available under MinGW). Unfortunately I have no experience with Windows, but it looks like VirtualLock does the same thing.
Hope it helps. Good Luck!
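A minimal sketch of the POSIX variant (the 1 GB size is illustrative; mlock can fail with ENOMEM or EPERM if RLIMIT_MEMLOCK or the process's privileges don't allow locking that much):

#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t sz = 1000000000;            // ~1 GB, illustrative
    void* buf = std::malloc(sz);
    if (buf == NULL)
    {
        std::printf("Memory allocation failed\n");
        return 1;
    }
    // Pin the pages in RAM; fails if the process lacks the privilege
    // or exceeds its RLIMIT_MEMLOCK limit.
    if (mlock(buf, sz) != 0)
    {
        std::perror("mlock");
        return 1;
    }
    // ... stream data into buf ...
    munlock(buf, sz);
    std::free(buf);
    return 0;
}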

I think VirtualAlloc might get you some of what you want.
This problem really boils down to writing your own memory manager instead of using the CRT functions.

You need to use the undocumented NtLockVirtualMemory function with lock option 2 (LOCK_VM_IN_RAM); make sure you request and obtain the SE_LOCK_MEMORY_NAME privilege first, and be aware that it might not be granted (I'm not sure what the group policy defaults the privilege to, but it might very well be granted to nobody).
I suggest using VirtualLock as a fallback, and if that fails too, SetProcessWorkingSetSize. If that fails as well, then just let it fail, I guess...
See this link for some nice discussion about this. One person says:
When you specify the LOCK_VM_IN_WSL flag, you just tell the Balance Set Manager that you don't want some particular page to get swapped to disk, and ask it to leave this page alone when trimming the working set of the target process. This is just an indication, so the target page may still get swapped if the system is low on RAM. However, when you specify the LOCK_VM_IN_RAM flag, you issue a directive to the Memory Manager to treat this page as non-pageable (i.e. do what a driver does when it calls MmProbeAndLockPages() to lock the pages described by an MDL), so that the page in question is guaranteed to be resident in RAM all the time.
Edit:
Read this.
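For the documented fallback path, a minimal sketch (the sizes are illustrative; VirtualLock is limited by the process's minimum working set size, so the working set has to be grown first, using a real process handle rather than an arbitrary pointer):

#include <windows.h>
#include <cstdio>

int main()
{
    const SIZE_T sz = 1000000000;    // ~1 GB, illustrative

    // Grow the working set so the locked region fits inside it.
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  sz + 10 * 1024 * 1024,   // minimum
                                  sz + 20 * 1024 * 1024))  // maximum
    {
        std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return 1;
    }

    unsigned char* buffer = static_cast<unsigned char*>(
        VirtualAlloc(nullptr, sz, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (buffer == nullptr)
    {
        std::printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    // Ask the OS to keep these pages resident.
    if (!VirtualLock(buffer, sz))
    {
        std::printf("VirtualLock failed: %lu\n", GetLastError());
        return 1;
    }

    // ... use buffer ...

    VirtualUnlock(buffer, sz);
    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}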

One option would be to create a RAM disk out of your host's memory. While Windows no longer ships native support for this, you can still find the necessary drivers for free or through commercial products. For instance, Dataram provides a driver that is free for personal use and commercially licensed for business use at: http://memory.dataram.com/products-and-services/software/ramdisk
There is also the ImDisk Virtual Disk Driver, available at http://www.ltr-data.se/opencode.html/#ImDisk. It is open source, free for commercial use, and digitally signed with a certificate trusted by Microsoft.
For more information concerning RAM Drives on Windows, check out ServerFault.com.

You should take a look at Address Windowing Extensions (AWE). It sounds like it matches the memory constraints you have (emphasis mine):
AWE uses physical nonpaged memory and window views of various portions of this physical memory within a 32-bit virtual address space.
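A rough sketch of that flow, using the documented AWE APIs (AllocateUserPhysicalPages, MapUserPhysicalPages); it assumes the account has been granted the "Lock pages in memory" privilege, links against Advapi32 for the token calls, and abbreviates error handling:

#include <windows.h>
#include <cstdio>

int main()
{
    // 1. Enable SeLockMemoryPrivilege for this process (requires the
    //    "Lock pages in memory" user right to have been granted).
    HANDLE token;
    OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token);
    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValue(nullptr, SE_LOCK_MEMORY_NAME, &tp.Privileges[0].Luid);
    AdjustTokenPrivileges(token, FALSE, &tp, 0, nullptr, nullptr);
    if (GetLastError() != ERROR_SUCCESS)
    {
        std::printf("Could not enable SeLockMemoryPrivilege\n");
        return 1;
    }

    // 2. Allocate physical (non-paged) pages. pageCount may come back
    //    smaller than requested.
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const SIZE_T bytes = 1000000000;                 // ~1 GB, illustrative
    ULONG_PTR pageCount = (bytes + si.dwPageSize - 1) / si.dwPageSize;
    ULONG_PTR* pfns = new ULONG_PTR[pageCount];
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns))
    {
        std::printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    // 3. Reserve a virtual address window and map the physical pages into it.
    void* window = VirtualAlloc(nullptr, pageCount * si.dwPageSize,
                                MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
    if (!window || !MapUserPhysicalPages(window, pageCount, pfns))
    {
        std::printf("Mapping AWE pages failed: %lu\n", GetLastError());
        return 1;
    }

    // window now addresses non-paged RAM; use it, then unmap and free.
    MapUserPhysicalPages(window, pageCount, nullptr);
    FreeUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns);
    VirtualFree(window, 0, MEM_RELEASE);
    delete[] pfns;
    CloseHandle(token);
    return 0;
}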

Related

Dynamic memory allocation seems instant in debug but gradual in release mode

I have a large dynamically allocated array (C++, MSVC110), and I am initializing it like this:
try {
    size_t arrayLength = 1 << 28;
    data = new int[arrayLength];
    for (size_t i = 0; i < arrayLength; ++i) {
        data[i] = rand();
    }
}
catch (std::bad_alloc&) { /* Report error. */ }
Everything was fine until I tried to allocate more than the system's actual RAM, like 10 GB. I was expecting to catch a bad_alloc exception, but instead the system (Win7) started to swap like crazy, etc. - you know what I am talking about.
Then I examined the situation in Task Manager and noticed an interesting thing: in debug mode the allocation was instant, but in release mode it was gradual.
Debug mode: (Task Manager screenshot - memory usage jumps to the full amount almost immediately)
Release mode: (Task Manager screenshot - memory usage climbs gradually)
What is causing this? Could it have any negative impact on performance? Have I done something wrong? Is the OS causing it, or the C++ allocator?
I would actually prefer to get an exception if there is not enough memory, rather than go into an endless swapping loop. Is there any way to achieve that in C++?
I know that one solution might be to turn off swapping in Windows, but that would solve the problem only on my machine.
I think the memory allocator does extra work in debug mode to allow better detection of memory-handling errors. It touches the allocated block to write its bookkeeping and fill patterns, thus forcing the system to commit all the allocated pages quickly.
In release mode, it is your code that fills the block linearly, thus committing one page at a time.
As for limiting the amount of memory, there are system calls that report the available resources (GlobalMemoryStatusEx, for instance, in a Windows environment).
Having an allocation fail whenever swapping would be required would make no sense, since the amount of available memory changes constantly due to circumstances a given program cannot control (like other applications being started).
There are possibilities to make some memory blocks non-swappable (i.e. locked in RAM), but that kind of usage is usually limited to system layers like drivers.
It is up to you to detect available memory and enforce an allocation limit.
Note that this is a dangerous game: you are usually not running alone on the computer, and there is no telling whether another application will be launched later and consume more memory.
If swapping is a killer for your application, you should consider keeping a safety margin (i.e. try to leave something like 500 MB or 1 GB of RAM available to the system).
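To implement such a margin on Windows, one possible sketch (allocate_with_margin is a hypothetical helper; the margin value is up to you):

#include <windows.h>
#include <cstdio>
#include <new>

int* allocate_with_margin(size_t arrayLength, size_t marginBytes)
{
    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);                 // reports available physical memory

    unsigned long long needed =
        static_cast<unsigned long long>(arrayLength) * sizeof(int) + marginBytes;
    if (ms.ullAvailPhys < needed)
    {
        std::printf("Refusing allocation: only %llu MB of physical RAM free\n",
                    ms.ullAvailPhys / (1024 * 1024));
        return nullptr;
    }
    return new (std::nothrow) int[arrayLength];   // nothrow: caller checks for nullptr
}

This refuses up front instead of letting the system start thrashing, which is closer to the bad_alloc-style behaviour the question asks for.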

Allocating more memory than there exists using malloc

This code snippet will allocate 2 GB every time it reads the letter 'u' from stdin, and will initialize all the allocated chars once it reads 'a'.
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <vector>

#define bytes 2147483648
using namespace std;

int main()
{
    char input[8] = {0};      // zero-initialized so the first loop test is defined
    vector<char *> activate;
    while(input[0] != 'q')
    {
        fgets(input, sizeof(input), stdin);   // gets() is unsafe; fgets bounds the read
        if(input[0] == 'u')
        {
            char *m = (char*)malloc(bytes);
            if(m == NULL) cout << "cant allocate mem" << endl;
            else cout << "ok" << endl;
            activate.push_back(m);
        }
        else if(input[0] == 'a')
        {
            for(size_t i = 0; i < activate.size(); i++)
            {
                char *m = activate[i];
                // Touch every byte so the pages actually get committed.
                for(unsigned long long x = 0; x < bytes; x++)
                {
                    m[x] = 'a';
                }
            }
        }
    }
    return 0;
}
I am running this code on a Linux virtual machine that has 3 GB of RAM. While monitoring the system resource usage with the htop tool, I have noticed that the malloc operation is not reflected in the reported usage.
For example, when I input 'u' only once (i.e. allocate 2 GB of heap memory), I don't see the memory usage increase by 2 GB in htop. It is only when I input 'a' (i.e. initialize) that I see the memory usage increase.
As a consequence, I am able to "malloc" more heap memory than actually exists. For example, I can malloc 6 GB (which is more than my RAM plus swap) and malloc allows it (i.e. it does not return NULL). But when I try to initialize the allocated memory, I can see RAM and swap filling up until the process is killed.
My questions:
1. Is this a kernel bug?
2. Can someone explain to me why this behavior is allowed?
It is called memory overcommit. You can disable it by running as root:
echo 2 > /proc/sys/vm/overcommit_memory
It is not a kernel feature that I like (so I always disable it). See malloc(3), mmap(2) and proc(5).
NB: echo 0 instead of echo 2 often - but not always - works as well. Read the docs (in particular the proc man page just mentioned).
from man malloc (online here):
By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available.
So when you simply allocate too much, it "lies" to you; when you actually use the allocated memory, the kernel tries to find enough physical memory for you, and the process may be killed if it can't.
No, this is not a kernel bug. You have discovered something known as late paging (or overcommit).
Until you write a byte to the address allocated with malloc (...) the kernel does little more than "reserve" the address range. This really depends on the implementation of your memory allocator and operating system of course, but most good ones do not incur the majority of kernel overhead until the memory is first used.
The Hoard allocator is one big offender that comes to mind immediately; through extensive testing I have found that it almost never takes advantage of a kernel that supports late paging. You can always mitigate the effects of late paging in any allocator by zero-filling the entire memory range immediately after allocation.
Real-time operating systems like VxWorks will never allow this behavior because late paging introduces serious latency. Technically, all it does is put the latency off until a later indeterminate time.
For a more detailed discussion, you may be interested to see how IBM's AIX operating system handles page allocation and overcommitment.
This is a result of what Basile mentioned: memory overcommit. However, the explanation is kind of interesting.
Basically, when you attempt to map additional memory in Linux (POSIX?), the kernel will just reserve it, and will only actually end up using it if your application accesses one of the reserved pages. This allows multiple applications to reserve more than the actual total amount of RAM/swap.
This is desirable behavior on most Linux environments unless you've got a real-time OS or something where you know exactly who will need what resources, when and why.
Otherwise somebody could come along, malloc all the RAM (without actually doing anything with it) and OOM your apps.
Another example of this lazy allocation is mmap(), where you have a virtual mapping that the file you're mapping can fit inside, but only a small amount of real memory dedicated to the effort. This allows you to mmap() huge files (larger than your available RAM) and use them like normal file handles, which is nifty.
Initializing / working with the memory should work:
memset(m, 0, bytes);
Also, you could use calloc, which not only allocates memory but also fills it with zeros for you:
char* m = (char*) calloc(1, bytes);
1. Is this a kernel bug?
No.
2. Can someone explain to me why this behavior is allowed?
There are a few reasons:
Mitigating the need to know the eventual memory requirement - it's often convenient for an application to be able to allocate an amount of memory that it considers an upper limit on what it might actually need. For example, if it's preparing some kind of report, either an initial pass just to calculate the eventual size of the report or a realloc() of successively larger areas (with the risk of having to copy) may significantly complicate the code and hurt performance, whereas multiplying some maximum length of each entry by the number of entries can be very quick and easy. If you know virtual memory is relatively plentiful as far as your application's needs are concerned, then making a larger allocation of virtual address space is very cheap.
Sparse data - if you have virtual address space to spare, being able to use a sparse array with direct indexing, or to allocate a hash table with a generous capacity() to size() ratio, can lead to a very high-performance system. Both work best (in the sense of low overhead/waste and efficient use of memory caches) when the data element size is a multiple of the memory paging size, or failing that, much larger than it or a small integral fraction of it.
Resource sharing - consider an ISP offering a "1 gigabit per second" connection to 1000 consumers in a building. They know that if all the consumers use it simultaneously they'll each get about 1 megabit, but they rely on their real-world experience that, though people ask for 1 gigabit and want a good fraction of it at specific times, there's inevitably some lower maximum and a much lower average for concurrent usage. The same insight applied to memory allows operating systems to support more applications than they otherwise could, with reasonable average success at satisfying expectations. Much as the shared Internet connection degrades in speed as more users make simultaneous demands, paging to and from swap on disk may kick in and reduce performance. But unlike an Internet connection, there's a limit to the swap space, and if all the apps really do try to use the memory concurrently such that that limit is exceeded, some will start getting signals/interrupts/traps reporting memory exhaustion. In summary, with this memory overcommit behaviour enabled, simply checking that malloc()/new returned a non-NULL pointer is not sufficient to guarantee the physical memory is actually available, and the program may still receive a signal later as it attempts to use the memory.

how to manage large arrays

I have a C++ program that uses several very large arrays of doubles, and I want to reduce the memory footprint of this particular part of the program. Currently, I'm allocating 100 of them, and they can be 100 MB each.
Now, I do have the advantage that parts of these arrays eventually become obsolete during later stages of the program's execution, and there is little need to ever have the whole of any one of them in memory at any one time.
My question is this:
Is there any way of telling the OS, after I have created the array with new or malloc, that a part of it is not needed any more?
I'm coming to the conclusion that the only way to achieve this is to declare an array of pointers, each of which points to a chunk of, say, 1 MB of the desired array, so that old chunks that are no longer needed can be reused for new parts of the array. This seems to me like writing a custom memory manager, which feels like a bit of a sledgehammer and is going to add a performance hit as well.
I can't move the data in the array because that would cause too many thread contention issues. The arrays may be accessed by any one of a large number of threads at any time, though only one thread ever writes to any given array.
It depends on the operating system. POSIX - including Linux - has the madvise system call to improve memory behavior. From the man page:
The madvise() system call advises the kernel about how to handle paging input/output in the address range beginning at address addr and with size length bytes. It allows an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. This call does not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. The kernel is free to ignore the advice.
See the man page of madvise for more information.
Edit: Apparently, the above description was not clear enough. So, here are some more details, and some of them are specific to Linux.
You can use mmap to allocate a block of memory (directly from the OS instead of from libc) that is not backed by any file. For large chunks of memory, malloc does exactly the same thing. You have to use munmap to release the memory, regardless of the usage of madvise:
void* data = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
::munmap(data, size);
If you want to get rid of some parts of this chunk, you can use madvise to tell the kernel to do so:
madvise(static_cast<unsigned char*>(data) + 7 * page_size,
        3 * page_size, MADV_DONTNEED);
The address range is still valid, but it is no longer backed - neither by physical RAM nor by storage. If you access the pages later, the kernel will allocate new pages on the fly and re-initialize them to zero. Be aware that the DONTNEED pages still count toward the virtual memory size of the process. It might be necessary to make some configuration changes to virtual memory management, e.g. activating overcommit.
It would be easier to answer if we had more details.
1°) The answer to the question "Is there any way of telling the OS, after I have created the array with new or malloc, that a part of it is not needed any more?" is "not really". That's the point of C and C++, and of any language that lets you handle memory manually.
2°) If you're using C++ and not C, you should not be using malloc.
3°) Nor arrays, unless for a very specific reason. Use a std::vector.
4°) Preferably, if you need to change the content of the array often and reduce the memory footprint, use a linked list (std::list), though it will be more expensive to access individual elements (iterating through it is almost as fast).
A std::deque of pointers to std::array<double, LARGE_NUMBER> may do the job, but you are better off wrapping the deque in a dedicated container, so you can remap the indexes and, most importantly, define when entries are no longer used.
The dedicated container can also contain a read/write lock, so it can be used in a thread-safe way.
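A minimal sketch of such a dedicated container, assuming C++11 (ChunkedArray and the 1 MB chunk size are illustrative, and the read/write locking mentioned above is left out):

#include <array>
#include <deque>
#include <memory>
#include <cstddef>

class ChunkedArray
{
    static constexpr std::size_t kChunkSize = 131072;   // doubles per chunk (1 MB)
    std::deque<std::unique_ptr<std::array<double, kChunkSize>>> chunks_;

public:
    explicit ChunkedArray(std::size_t elements)
        : chunks_((elements + kChunkSize - 1) / kChunkSize)
    {
    }

    double& at(std::size_t i)
    {
        auto& chunk = chunks_[i / kChunkSize];
        if (!chunk)                                      // allocate lazily on first touch
            chunk.reset(new std::array<double, kChunkSize>());
        return (*chunk)[i % kChunkSize];
    }

    // Give a whole chunk back to the allocator once it is obsolete.
    void release_chunk(std::size_t chunkIndex)
    {
        chunks_[chunkIndex].reset();
    }
};

Releasing a chunk hands its memory back to the allocator (and, for blocks this large, usually back to the OS), while the indexes seen by reader threads stay stable.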
You could try using lists instead of arrays. Of course a list is 'heavier' than an array, but on the other hand it is easy to reconstruct a list, so you can throw away a part of it when it becomes obsolete. You could also use a wrapper that only contains indexes saying which part of the list is up to date and which part may be reused.
This will help you improve performance, but will require a little bit more (reusable) memory.
Allocating by chunk, and delete[]-ing and new[]-ing along the way, seems like a good solution. It keeps the memory management you have to do yourself to a minimum: do not reuse chunks yourself, simply deallocate old ones and allocate new chunks when needed.

new[] doesn't decrease available memory until populated

This is in C++ on CentOS 64bit using G++ 4.1.2.
We're writing a test application to load up the memory usage on a system by n Gigabytes. The idea being that the overall system load gets monitored through SNMP etc. So this is just a way of exercising the monitoring.
What we've seen however is that simply doing:
char* p = new char[1000000000];
doesn't affect the memory used as shown in either top or free -m
The memory allocation only seems to become "real" once the memory is written to:
memset(p, 'a', 1000000000); // shows an increase in mem usage of 1GB
But we have to write to all of the memory, simply writing to the first element does not show an increase in the used memory:
p[0] = 'a'; //does not show an increase of 1GB.
Is this normal? Has the memory actually been fully allocated? I'm not sure whether the tools we are using (top and free -m) are displaying incorrect values, or whether there is something clever going on in the compiler, the runtime and/or the kernel.
This behavior is seen even in a debug build with optimizations turned off.
It was my understanding that new[] allocated the memory immediately. Does the C++ runtime delay the actual allocation until the memory is accessed? In that case, can an out-of-memory exception be deferred until well after the allocation, when the memory is first accessed?
As it is, this is not a problem for us, but it would be nice to know why it happens this way!
Cheers!
Edit:
I don't want to hear about how we should be using vectors, how this isn't OO / modern C++ / the current way of doing things, etc. I just want to know why this is happening the way it is, rather than get suggestions for alternative ways of trying it.
When your library allocates memory from the OS, the OS will just reserve an address range in the process's virtual address space. There's no reason for the OS to actually provide this memory until you use it - as you demonstrated.
If you look at e.g. /proc/self/maps you'll see the address range. If you look at top's memory use you won't see it - you're not using it yet.
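If the goal is simply to make the OS commit the pages (as in this kind of load-generation tool), it is enough to touch one byte per page rather than fill the whole block. A sketch, with allocate_and_commit as a hypothetical helper:

#include <cstddef>
#include <unistd.h>   // sysconf

// Touch one byte per page so the kernel actually commits the whole allocation.
char* allocate_and_commit(std::size_t size)
{
    const std::size_t pageSize =
        static_cast<std::size_t>(sysconf(_SC_PAGESIZE));   // typically 4096
    char* p = new char[size];
    for (std::size_t i = 0; i < size; i += pageSize)
        p[i] = 'a';
    return p;
}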
Please look up overcommit. By default, Linux doesn't actually back an allocation with physical memory until it is accessed. And if you end up needing more memory than is available, you don't get an error; instead, some (more or less random) process gets killed. You can control this behavior with /proc/sys/vm/*.
IMO, overcommit should be a per-process setting, not a global one. And the default should be no overcommit.
About the second half of your question:
The language standard doesn't allow any delays in throwing a bad_alloc. That must happen as an alternative to new[] returning a pointer. It cannot happen later!
Some OSs might try to overcommit memory allocations, and fail later. That is not conforming to the C++ language standard.

Reading value at an address

I'm trying to make a program that reads the value at a certain address.
I have this:
int _tmain(int argc, _TCHAR* argv[])
{
    int *address;
    address = (int*)0x00000021;
    cout << *address;
    return 0;
}
But this gives a read violation error. What am I doing wrong?
Thanks
That reads the value at that address within the process's own space. You'll need to use other methods if you want to read another process's space, or physical memory.
It's open to some question exactly what OllyDbg is showing you. 32-bit (and 64-bit) Windows uses virtual memory, which means the address you use in your program is not the same as the address actually sent over the bus to the memory chips. Instead, Windows (and I should add that other OSes such as Linux, MacOS, *BSD, etc., do roughly the same) sets up tables that say, in essence: when the program uses an address in this range, use that range of physical addresses.
This mapping is done on a page-by-page basis (where each page is normally 4K bytes, though other sizes are possible). In that table, a page can also be marked as "not present" - this is what supports paging memory to disk. When you try to read a page that's marked as not present, the CPU generates an exception. The OS then handles that exception by reading the data from disk into a block of memory and updating the table to say the data is present at physical address X. Along with not-present, the tables support a few other attributes, such as read-only, so you can read but not write some addresses.
Windows (again, like the other OSes) sets up the tables for the first part of the address space, but does NOT associate any memory with them. From the viewpoint of a user program, those addresses simply should never be used.
That gets us back to my uncertainty about what OllyDbg is giving you when you ask it to read from address 0x21. That address simply doesn't refer to any real data -- never has and never will.
What others have said is true as well: a debugger will usually use some OS functions (E.g. ReadProcessMemory and WriteProcessMemory, among others under Windows) to get access to things that you can't read or write directly. These will let you read and write memory in another process, which isn't directly accessible by a normal pointer. Neither of those would help in trying to read from address 0x21 though -- that address doesn't refer to any real memory in any process.
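For completeness, a sketch of the ReadProcessMemory route (the process ID and address here are placeholders; the call still fails cleanly if the target address isn't mapped in that process):

#include <windows.h>
#include <cstdio>

int main()
{
    DWORD pid = 1234;                         // placeholder: the target process ID
    HANDLE process = OpenProcess(PROCESS_VM_READ, FALSE, pid);
    if (!process)
    {
        std::printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    int value = 0;
    SIZE_T bytesRead = 0;
    LPCVOID address = reinterpret_cast<LPCVOID>(0x00400000);   // placeholder address
    if (ReadProcessMemory(process, address, &value, sizeof(value), &bytesRead))
        std::printf("Read %d (%llu bytes)\n", value,
                    static_cast<unsigned long long>(bytesRead));
    else
        std::printf("ReadProcessMemory failed: %lu\n", GetLastError());

    CloseHandle(process);
    return 0;
}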
You can only use a pointer that points to an actual object.
If you don't have an object at address 0x00000021, this won't work.
If you want to create an object on the free store (the heap), you need to do so using new:
int* address = new int;
*address = 42;
cout << *address;
delete address;
When your program is running on an operating system that provides virtual memory (Windows, *nix, OS X), not all addresses are backed by memory. CPUs that support virtual memory use page tables to control which addresses refer to memory. The size of an individual page is usually 4096 bytes, but that varies and is likely to be larger in the future.
The APIs that you use to query the page tables aren't part of the standard C/C++ runtime, so you will need to use operating-system-specific functions to know which addresses are OK to read from and which will cause a fault. On Windows you would use VirtualQuery to find out whether a given address can be read, written, executed, or any/none of the above.
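A sketch of such a check (is_readable is a hypothetical helper; even a committed page may be protected with PAGE_NOACCESS or PAGE_GUARD, which is why the protection flags are inspected too):

#include <windows.h>
#include <cstdio>

// Returns true if the page containing 'address' is committed and readable.
bool is_readable(const void* address)
{
    MEMORY_BASIC_INFORMATION mbi = {};
    if (VirtualQuery(address, &mbi, sizeof(mbi)) == 0)
        return false;                                  // not part of the address space
    if (mbi.State != MEM_COMMIT)
        return false;                                  // reserved or free: no memory behind it
    const DWORD readable = PAGE_READONLY | PAGE_READWRITE |
                           PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE;
    return (mbi.Protect & readable) != 0 && !(mbi.Protect & PAGE_GUARD);
}

int main()
{
    int* address = reinterpret_cast<int*>(0x00000021);
    if (is_readable(address))
        std::printf("%d\n", *address);
    else
        std::printf("Address %p is not readable\n", static_cast<const void*>(address));
    return 0;
}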
You can't just read data from an arbitrary address in memory.