I have a need to allocate all memory available to a process, in order to implement a test of a system service. The test (among others) requires exhausting all available resources, attempting the call, and checking for a specific outcome.
In order to do this, I wrote a loop that reallocates a block of memory until realloc returns NULL, then, starting from the last successful allocation, bisects between the last successful size and the last unsuccessful size until the unsuccessful size is exactly 1 byte larger than the successful one, guaranteeing that all available memory is consumed.
The code I wrote is as follows (debug prints included):
#include <stdio.h>
#include <malloc.h>

int main(void)
{
    char *X;
    char *lastgood = NULL;
    char *toalloc = NULL;
    unsigned int top = 1;
    unsigned int bottom = 1;
    unsigned int middle;

    do
    {
        bottom = top;
        lastgood = toalloc;
        top = bottom * 2;
        printf("lastgood = %p\ntoalloc = %p\n", lastgood, toalloc);
        if (lastgood != NULL)
            printf("*lastgood = %i\n", *lastgood);
        toalloc = realloc(toalloc, top);
        printf("lastgood = %p\ntoalloc = %p\n", lastgood, toalloc);
        if (toalloc == NULL && lastgood != NULL)
            printf("*lastgood = %i\n", *lastgood); // segfault happens here
    } while (toalloc != NULL);

    do
    {
        if (toalloc != NULL) lastgood = toalloc;
        else toalloc = lastgood;
        middle = bottom + (top - bottom)/2;
        toalloc = realloc(toalloc, middle);
        if (toalloc == NULL) top = middle;
        else bottom = middle;
    } while (top - bottom > 1);
    if (toalloc != NULL) lastgood = toalloc;
    X = lastgood;
    // make a call that attempts to get more memory
    free(X);
}
According to realloc's man page, realloc does not free the original block if it returns NULL. Even so, this code segfaults when it tries to print lastgood after toalloc receives NULL from realloc. Why is this happening, and is there a better way to grab the exact quantity of unallocated memory?
I am running it on glibc, on Ubuntu with kernel 3.11.x.
You are not checking the value of top for overflow. This is what happens with its value:
2
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
131072
262144
524288
1048576
2097152
4194304
8388608
16777216
33554432
67108864
134217728
268435456
536870912
1073741824
2147483648
0
Just before the last realloc(), the new value of top is 0 again (actually 2^32, which doesn't fit in 32 bits). Calling realloc() with a size of 0 is allowed to free the block and return NULL, so the memory really is deallocated and lastgood is left dangling.
Trying to allocate the maximum contiguous block is not a good idea: the memory map of a user process already has blocks allocated for shared libraries and for the code and data of the process itself, so the largest contiguous free block is smaller than the total free memory. Unless you specifically want to know the maximum contiguous block you can allocate, the way to go is to allocate as much as you can in a single block; when you have reached that, do the same with a different pointer, and keep doing that until you really run out of memory. Take into account that on 64-bit systems you don't get all available memory in just one malloc()/realloc(). As I've just seen, malloc() on 64-bit systems allocates up to 4 GB of memory in one call, even though you can issue several mallocs() and still succeed with each of them.
A visual of a user process memory map as seen in a 32-bit Linux system is described in an answer I gave a few days ago:
Is kernel space mapped into user space on Linux x86?
I've come up with this program, to "eat" all the memory it can:
#include <stdio.h>
#include <malloc.h>

typedef struct slist
{
    char *p;
    struct slist *next;
} TList;

int main(void)
{
    size_t nbytes;
    size_t totalbytes = 0;
    int i = 0;
    TList *list = NULL, *node;

    node = malloc (sizeof *node);
    while (node)
    {
        node->next = list;
        list = node;
        nbytes = -1; /* can I actually do this? */
        node->p = malloc(nbytes);
        while (nbytes && !node->p)
        {
            nbytes /= 2;
            node->p = malloc(nbytes);
        }
        totalbytes += nbytes + sizeof *node;
        if (nbytes == 0)
            break;
        i++;
        printf ("%8d", i);
    }
    printf ("\nBlocks allocated: %d. Memory used: %f GB\n",
            i, totalbytes/(1024*1048576.0));
    return 0;
}
The execution yields these values in a 32-bit Linux system:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53
Blocks allocated: 53. Memory used: 2.998220 GB
Very close to the 3 GB limit on 32-bit Linux systems. On a 64-bit Linux system, I've reached 30000 blocks of 4 GB each, and still counting. I don't really know if Linux can allocate that much memory or if it's my mistake. According to this, the maximum virtual address space is 128 TB (that would be 32768 blocks of 4 GB).
UPDATE: in fact, so it is. I left this program running on a 64-bit box, and after 110074 successfully allocated blocks, the grand total of memory allocated was 131071.578884 GB. Each malloc() was able to allocate as much as 4 GB per operation, but after reaching 115256 GB it began to allocate up to 2 GB, and after reaching 123164 GB it began to allocate up to 1 GB per malloc(). The progression tends asymptotically to 131072 GB, but it actually stops a little earlier, at 131071.578884 GB, because the process itself, its data, and its shared libraries use a few KB of memory.
I think getrlimit() is the answer to your new question. Here is the man page http://linux.die.net/man/2/getrlimit
You are probably interested in RLIMIT_AS (the address-space limit) rather than RLIMIT_STACK, which only limits the stack. Please check the man page.
The first part can be replaced with the following:

char *p;
size_t size = SIZE_MAX; /* SIZE_MAX requires <stdint.h> */
do {
    p = malloc(size);
    if (p == NULL)
        size >>= 1; /* halve only on failure, so size still matches the block we got */
} while (p == NULL && size > 0);
Why does creating a 1D array larger than my memory fail, while I can create a 2D array larger than my memory? I thought the OS gives you virtual memory and you can request as much as you want; it's not until you start reading from and writing to memory, and it becomes part of the resident set, that the hardware constraints become an issue.
On the small VM with 512 MB of memory I tried:
1 × 512 MB array: no issue
1 × 768 MB array: no issue
1 × 879 MB array: no issue
1 × 880 MB array: fails
1 × 1024 MB array: fails
1000 × 512 MB arrays: no issue (at this point I've allocated 256 GB of virtual memory, well exceeding the physical limits)
On a large VM with 8GB of memory, all the above worked.
For this experiment, I used this code:
#include <stdio.h>  /* printf */
#include <stdlib.h> /* atoi */
#include <iostream>
#include <unistd.h>

int main(int argc, char *argv[], char **envp) {
    if (argc < 3) {
        printf("main <mb> <times>\n");
        return -1;
    }
    int megabytes = atoi(argv[1]);
    int times = atoi(argv[1]); // note: reads argv[1] again, so times == megabytes
    // megabytes   1024 kilobytes   1024 bytes   1 integer
    // --------- * -------------- * ---------- * ---------
    //             megabyte         kilobyte     4 bytes
    int sizeOfArray = megabytes*1024*1024/sizeof(int);
    long long bytes = (long long)megabytes*1024*1024;
    printf("grabbing memory :%dmb, arrayEntrySize:%d, times:%d bytes:%lld\n",
           megabytes, sizeOfArray, times, bytes);
    int **array = new int*[times];
    for (int i = 0; i < times; i++) {
        array[i] = new int[sizeOfArray];
    }
    while (true) {
        // 1 second to microseconds
        usleep(1*1000000);
    }
    for (int i = 0; i < times; i++) {
        delete [] array[i];
    }
    delete [] array;
}
Commands and outputs of experiments on small 512MB VM:
free -h
total used free shared buff/cache available
Mem: 488M 66M 17M 5.6M 404M 381M
Swap: 511M 72K 511M
./a.out 512 1
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
./a.out 768 1
grabbing memory :768mb, arrayEntrySize:201326592, times:768 bytes:805306368
./a.out 1024 1
grabbing memory :1024mb, arrayEntrySize:268435456, times:1024 bytes:1073741824
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
./a.out 512 1000
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
#htop
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
2768 root 20 0 256G 4912 2764 S 0.0 1.0 0:00.00 ./a.out 512 1000
Commands and outputs of experiments on the large 8GB VM:
free -h
total used free shared buff/cache available
Mem: 7.8G 78M 7.6G 8.8M 159M 7.5G
Swap: 511M 0B 511M
./a.out 512 1
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
./a.out 768 1
grabbing memory :768mb, arrayEntrySize:201326592, times:768 bytes:805306368
./a.out 1024 1
grabbing memory :1024mb, arrayEntrySize:268435456, times:1024 bytes:1073741824
./a.out 512 1000
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
# htop
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1292 root 20 0 256G 6920 2720 S 0.0 0.1 0:00.00 ./a.out 512 1000
This is due to the fact that memory is allocated in chunks based on how much you ask for.
You are asking for a series of relatively small blocks in the 2D array, and the blocks aren't necessarily next to each other.
However, the 1D array is massive and requires a single contiguous block of virtual address space, and you may not have a free block of that size available even if you have that much memory available in total.
long long size = (long long)1*1024*1024*1024;
long long i;
char *p = NULL;

while (p == NULL)
{
    p = malloc(size);
    if (!p)
    {
        printf("\nallocation failed for size %lld", size);
        size = size/2;
    }
    else
    {
        printf("\nallocation passed for size %lld", size);
        break;
    }
}
if (p)
{
    printf("\nmaking this allocated memory resident, %p", p);
    /* Touch every single page to make them resident */
    for (i = 0; i < size / 4096; i++)
    {
        printf("\n%lld", i);
        p[i * 4096] = 1;
    }
    sleep(600); /* sleep 10 min and analyze the memory usage */
}
free -m [while the code above is running]
Mon Jul 17 07:57:13 CDT 2017
total used free shared buffers cached
Mem: 15950 3225 12725 1 263 1076
-/+ buffers/cache: 1885 14065
Swap: 2047 0 2047
free -m [after the code has exited]
total used free shared buffers cached
Mem: 15950 2199 13751 1 263 1076
-/+ buffers/cache: 860 15090
Swap: 2047 0 2047
So as you can see, I write 1 GB of data (the very first allocation succeeds), yet free -m shows only about 1000 KB being used.
Can someone help me fill this gap? How can I verify the exact RAM usage of my code?
This difference was there because my system gives the same output for free -m and free -k: the numbers are actually in MB, not KB.
I am using cudaMemGetInfo in order to get the vram currently used by the system.
extern __host__ cudaError_t CUDARTAPI cudaMemGetInfo(size_t *free, size_t *total);
And I am having two problems:
The main one is that the free value returned is only correct when the graphics device has almost no memory left for allocation. Otherwise it reports about 20% memory used even when GPU-Z clearly states that about 80% is used; only when I reach 95% memory used does cudaMemGetInfo suddenly return a good value. Note that the total memory is always correct.
The second problem is that as soon as I use the function, video memory is allocated: at least 40 MB, but it can reach 400 MB on some graphics devices.
My code :
#include <cuda_runtime.h>
size_t Profiler::GetGraphicDeviceVRamUsage(int _NumGPU)
{
cudaSetDevice(_NumGPU);
size_t l_free = 0;
size_t l_Total = 0;
cudaError_t error_id = cudaMemGetInfo(&l_free, &l_Total);
return (l_Total - l_free);
}
I tried with 5 different nvidia graphic devices. The problems are always the same.
Any idea ?
On your first point, I cannot reproduce this. If I expand your code into a complete example:
#include <cuda_runtime.h>
#include <iostream>
size_t GetGraphicDeviceVRamUsage(int _NumGPU)
{
cudaSetDevice(_NumGPU);
size_t l_free = 0;
size_t l_Total = 0;
cudaError_t error_id = cudaMemGetInfo(&l_free, &l_Total);
return (l_Total - l_free);
}
int main()
{
const size_t sz = 1 << 20;
for(int i=0; i<20; i++) {
size_t before = GetGraphicDeviceVRamUsage(0);
char *p;
cudaMalloc((void **)&p, sz);
size_t after = GetGraphicDeviceVRamUsage(0);
std::cout << i << " " << before << "->" << after << std::endl;
}
return cudaDeviceReset();
}
I get this on a linux machine:
$ ./meminfo
0 82055168->83103744
1 83103744->84152320
2 84152320->85200896
3 85200896->86249472
4 86249472->87298048
5 87298048->88346624
6 88346624->89395200
7 89395200->90443776
8 90443776->91492352
9 91492352->92540928
10 92540928->93589504
11 93589504->94638080
12 94638080->95686656
13 95686656->96735232
14 96735232->97783808
15 97783808->98832384
16 98832384->99880960
17 99880960->100929536
18 100929536->101978112
19 101978112->103026688
and I get this on a Windows WDDM machine:
>meminfo
0 64126976->65175552
1 65175552->66224128
2 66224128->67272704
3 67272704->68321280
4 68321280->69369856
5 69369856->70418432
6 70418432->71467008
7 71467008->72515584
8 72515584->73564160
9 73564160->74612736
10 74612736->75661312
11 75661312->76709888
12 76709888->77758464
13 77758464->78807040
14 78807040->79855616
15 79855616->80904192
16 80904192->81952768
17 81952768->83001344
18 83001344->84049920
19 84049920->85098496
Both seem consistent to me.
On your second point: cudaSetDevice establishes a CUDA context on the device number you pass to it, if no context already exists. Establishing a CUDA context reserves memory for the runtime components required to run CUDA code. So it is completely normal that calling the function consumes memory if it is the first call you make that touches the CUDA API.
I'm writing a CUDA program. The code copies pinned memory to shared memory; by pinned memory I mean memory allocated with cudaHostAlloc(., ., cudaHostAllocMapped). It takes 600 us to copy 16 bytes and 8 ms to copy 256 bytes. Why is there such a huge difference?
My code looks like:
__global__
void kernel_func(char* dict, int dict_len)
{
__shared__ char s_dict[256];
/* dict_len = 16; */
if(threadIdx.x == 0) {// copy once for each block
memcpy((unsigned char*)s_dict, (unsigned char*)dict, dict_len);
}
__syncthreads();
}
kernel_func<<<32, 128>>>("256 bytes pinned memory", 256);
The environment: GTX650 + CUDA 6.5 + Win7-32bit
Because 256 bytes is 16 times what you previously copied (16 bytes).
Now, 16 bytes took 600 us, and 16 times that is 9600 us, which is close to the 8 ms you observed (1 ms = 1000 us).
I wrote a simple application to test memory consumption. In this test application, I created four processes to continually consume memory, those processes won't release the memory unless the process exits.
I expected this test application to consume most of the RAM and cause other applications to slow down or crash. But the result is not as expected. Below is the code:
#include <stdio.h>
#include <unistd.h>
#include <list>
#include <vector>

using namespace std;

unsigned short calcrc(unsigned char *ptr, int count)
{
    unsigned short crc;
    unsigned char i;
    // high cpu-consumption code
    // implements the CRC algorithm
    // CRC is Cyclic Redundancy Code
}

void* ForkChild(void* param){
    vector<unsigned char*> MemoryVector;
    pid_t PID = fork();
    if (PID > 0){
        const int TEN_MEGA = 10 * 10 * 1024 * 1024; // note: this is actually 100 MB
        unsigned char* buffer = NULL;
        while(1){
            buffer = NULL;
            buffer = new unsigned char [TEN_MEGA];
            if (buffer){
                try{
                    calcrc(buffer, TEN_MEGA);
                    MemoryVector.push_back(buffer);
                } catch(...){
                    printf("An error was thrown, but caught by our app!\n");
                    delete [] buffer;
                    buffer = NULL;
                }
            }
            else{
                printf("no memory to allocate!\n");
                try{
                    if (MemoryVector.size()){
                        buffer = MemoryVector[0];
                        calcrc(buffer, TEN_MEGA);
                        buffer = NULL;
                    } else {
                        printf("no memory ever allocated for this Process!\n");
                        continue;
                    }
                } catch(...){
                    printf("An error was thrown -- branch 2, "
                           "but caught by our app!\n");
                    buffer = NULL;
                }
            }
        } //while(1)
    } else if (PID == 0){
    } else {
        perror("fork error");
    }
    return NULL;
}

int main(){
    int children = 4;
    while(--children >= 0){
        ForkChild(NULL);
    };
    while(1) sleep(1);
    printf("exiting main process\n");
    return 0;
}
TOP command
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2775 steve 20 0 1503m 508 312 R 99.5 0.0 1:00.46 test
2777 steve 20 0 1503m 508 312 R 96.9 0.0 1:00.54 test
2774 steve 20 0 1503m 904 708 R 96.6 0.0 0:59.92 test
2776 steve 20 0 1503m 508 312 R 96.2 0.0 1:00.57 test
Though the CPU usage is high, the memory percentage remains 0.0. How can that be possible?
Free command
free shared buffers cached
Mem: 3083796 0 55996 428296
Free memory is more than 3G out of 4G RAM.
Does anybody know why this test app just doesn't work as expected?
Linux uses optimistic memory allocation: it will not physically allocate a page of memory until that page is actually written to. For that reason, you can allocate much more memory than what is available, without increasing memory consumption by the system.
If you want to force the system to allocate (commit) a physical page, then you have to write to it.
The following line does not issue any write, as it is default-initialization of unsigned char, which is a no-op:
buffer = new unsigned char [TEN_MEGA];
If you want to force a commit, use zero-initialization:
buffer = new unsigned char [TEN_MEGA]();
To make the comments into an answer:
Linux will not allocate memory pages for a process until it writes to them (copy-on-write).
Additionally, you are not writing to your buffer anywhere, as the default constructor for unsigned char does not perform any initializations, and new[] default-initializes all items.
fork() returns the PID in the parent, and 0 in the child. Your ForkChild as written will execute all the work in the parent, not the child.
And the standard new operator will never return null; it will throw if it fails to allocate memory (but due to overcommit it won't actually do that either in Linux). This means your test of buffer after the allocation is meaningless: it will always either take the first branch or never reach the test. If you want a null return, you need to write new (std::nothrow) .... Include <new> for that to work.
But your program is in fact doing what you expected it to do. As an answer has pointed out (Michael Foukarakis's answer), memory that is not used is not allocated. In the output of the top program, I noticed that the VIRT column has a large amount of memory in it for each process running your program. A little googling later, I saw what this was:
VIRT -- Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
So as you can see, your program does in fact reserve memory for itself, but in the form of pages kept as virtual memory. And I think that is a smart thing to do.
A snippet from this wiki page
A page, memory page, or virtual page -- a fixed-length contiguous block of virtual memory, and it is the smallest unit of data for the following:
memory allocation performed by the operating system for a program; and
transfer between main memory and any other auxiliary store, such as a hard disk drive.
...Thus a program can address more (virtual) RAM than physically exists in the computer. Virtual memory is a scheme that gives users the illusion of working with a large block of contiguous memory space (perhaps even larger than real memory), when in actuality most of their work is on auxiliary storage (disk). Fixed-size blocks (pages) or variable-size blocks of the job are read into main memory as needed.
Sources:
http://www.computerhope.com/unix/top.htm
https://stackoverflow.com/a/18917909/2089675
http://en.wikipedia.org/wiki/Page_(computer_memory)
If you want to gobble up a lot of memory:
int mb = 0;
char* buffer;
while (1) {
    buffer = malloc(1024*1024);
    if (!buffer) break;            /* stop once allocation finally fails */
    memset(buffer, 0, 1024*1024);  /* touch the pages so they become resident */
    mb++;
}
I used something like this to make sure the file buffer cache was empty when taking some file I/O timing measurements.
As other answers have already mentioned, your code doesn't ever write to the buffer after allocating it. Here memset is used to write to the buffer.