If I do something like:
void f() {
    const int n = 1 << 14;
    int *foo = new int[n];
}
or
void f() {
    const int n = 1 << 14;
    int *foo = new int[n]();
}
Will the Linux kernel use lazy memory allocation? For the second case, in the same way as when creating static arrays?
How far can I take this? For instance, for a struct that will be filled with 0s, will it always be allocated lazily, or will physical RAM actually be allocated when it is initialized?
struct X {
    int a, b, c, d, f, g, ..., z;
};

void f() {
    X *foo = new X(); // lazy?
    const int n = 1 << 14;
    X *bar = new X[n](); // lazy?
}
For a standard Ubuntu 20.04 machine running Linux 5.4.0-51-generic....
We can observe this directly. In the code below, I increased the n value to 1 << 24 (~16 million ints = 64MB for 32-bit int) so it's the dominant factor in overall memory usage. I compiled, ran, and observed memory usage in htop:
#include <unistd.h>

int main() {
    int *foo = new int[1 << 24];
    sleep(100);
}
htop values: VIRT 71416KB / RES 1468KB
The virtual address allocations include the memory allocated by new, but the resident memory size is much smaller - indicating that distinct physical backing memory pages weren't needed yet for all the 64MB allocated.
After changing to int *foo = new int[1<<24]();:
htop values: VIRT 71416KB / RES 57800KB
Requesting that the memory be zeroed resulted in a resident memory value just under the 64MB that was initialised. That shortfall won't have been due to memory pressure (I have 64GB of RAM); some algorithm in the kernel must have decided to page out a little of the backing memory after it was zeroed (I suspect kswapd?). The large RES value suggests that each page zeroed was given a distinct page of physical backing memory (as distinct from e.g. being mapped to the OS's zero page for COW-allocation of an actual backing page).
With structs:
#include <unistd.h>

struct X {
    int a[1 << 24];
};

int main() {
    auto foo = new X;
    sleep(100);
}
htop values: VIRT 71416KB / RES 1460KB
This shows insufficient RES for the struct's array to have distinct backing pages. Either the virtual memory has been pre-mapped to the OS zero-page, or it's unmapped and will be mapped to the zero-page on first access, then given its own physical backing page if written to - I'm not sure which, but in terms of actual physical RAM usage it doesn't make any difference.
After changing to auto foo = new X{};
htop values: VIRT 71416KB / RES 67844KB
You can clearly see that initialising the bytes to 0s resulted in use of backing memory for the arrays.
Addressing your questions:
Will the Linux kernel use lazy memory allocation?
The virtual memory allocation is done when the new is done. Distinct physical backing memory is allocated lazily when an actual write is done to the memory by the user-space code.
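If you want to watch the same effect without htop, here is a minimal sketch (my own illustration, not part of the original answer) that reads the resident-set size from /proc/self/statm before the allocation, after the allocation, and after writing to every page of it:

#include <cstring>
#include <fstream>
#include <iostream>
#include <unistd.h>

// Returns the process's resident-set size in KB, read from /proc/self/statm
// (fields are reported in pages; the first two are total size and RSS).
static long resident_kb() {
    long size_pages = 0, resident_pages = 0;
    std::ifstream statm("/proc/self/statm");
    statm >> size_pages >> resident_pages;
    return resident_pages * sysconf(_SC_PAGESIZE) / 1024;
}

int main() {
    std::cout << "RSS before new:    " << resident_kb() << " KB\n";
    int *foo = new int[1 << 24];                // ~64MB of virtual address space
    std::cout << "RSS after new:     " << resident_kb() << " KB\n";
    std::memset(foo, 0, sizeof(int) << 24);     // touch every page
    std::cout << "RSS after writing: " << resident_kb() << " KB\n";
    delete[] foo;
}

On a typical Linux box the RSS should barely move after the new, and jump by roughly 64MB after the memset, matching the htop observations above.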
For the second case, in the same way as when creating static arrays?
#include <unistd.h>

int g_a[1 << 24];

int f(int i) {
    static int a[1 << 24];
    return a[i];
}

int main(int argc, const char* argv[]) {
    sleep(20);
    int k = f(2930);
    sleep(20);
    return argc + k;
}
VIRT 133MB RES 1596KB
When this was run, the memory didn't jump after 20 seconds, indicating all the virtual address space was allocated during program loading. The low resident memory shows that the pages were not accessed and zeroed the way they were for new.
Just to address a potential point of confusion: while the Linux Kernel will zero out backing memory the first time it's provided to the process, any given call to new won't (in any implementation I've seen) know whether the memory allocated is being recycled from earlier dynamic allocations - which might have had non-zero values written into it - that have since been deleted/freed. Because of this, if you use memory-zeroing forms like new X{} or new int[n]() then the memory will be unconditionally cleared by the user-space code, causing the full amount of backing memory to be assigned and faulted in.
As many comments said, operator new usually uses malloc under the hood. malloc allocates space but does not by default allocate physical pages. However, malloc often writes internal data to the beginning of a block of memory, so only the first or first couple of pages allocated as virtual address space will fault and be allocated physically by the Linux kernel. The Linux kernel zeroes all allocated physical pages so whether you add () to the end of the allocation to zero-initialize the allocated memory probably has no effect, in terms of new physical pages being assigned. (Already allocated physical pages mapped to the allocated virtual address range are zeroed in that case.)
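To illustrate the point that a bare malloc hardly touches any physical pages, here is a rough sketch (my addition, not from this answer) that uses mincore(2) to count how many pages of a 64MB malloc'd block are resident, before and after writing to the block:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const std::size_t bytes = std::size_t(1) << 26;   // 64 MB
    unsigned char *p = static_cast<unsigned char *>(std::malloc(bytes));
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));

    // mincore() requires a page-aligned start address, so round up past the
    // beginning of the block (which may hold the allocator's bookkeeping).
    std::uintptr_t start = (reinterpret_cast<std::uintptr_t>(p) + page - 1) & ~std::uintptr_t(page - 1);
    std::size_t length = bytes - (start - reinterpret_cast<std::uintptr_t>(p));
    std::vector<unsigned char> vec((length + page - 1) / page);

    auto resident_pages = [&] {
        mincore(reinterpret_cast<void *>(start), length, vec.data());
        std::size_t n = 0;
        for (unsigned char v : vec) n += v & 1;   // bit 0 = page resident
        return n;
    };

    std::printf("resident after malloc: %zu of %zu pages\n", resident_pages(), vec.size());
    std::memset(p, 1, bytes);                     // touch every page
    std::printf("resident after writes: %zu of %zu pages\n", resident_pages(), vec.size());
    std::free(p);
}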
Related
This is the first time I am trying to use std::unique_ptr, but I am getting an access violation
when using std::make_unique with a large size.
What is the difference in this case, and is it possible to catch this type of exception in C++?
#include <iostream>
#include <memory>

void SmartPointerfunction(std::unique_ptr<int>& Mem, int Size)
{
    try
    {
        /* declare smart pointer */
        //Mem = std::unique_ptr<int>(new int[Size]); // using new (No crash)
        Mem = std::make_unique<int>(Size); // using make_unique (crash when Size = 10000!!)

        /* set values */
        for (int k = 0; k < Size; k++)
        {
            Mem.get()[k] = k;
        }
    }
    catch (std::exception& e)
    {
        std::cout << "Exception: " << e.what() << std::endl;
    }
}
When you invoke std::make_unique<int>(Size), what you actually do is allocate memory for a single int (commonly 4 bytes) and initialize it as an int with the value Size. Since the allocation holds only one int, Mem.get()[k] touches addresses that are out of bounds.
But out of bounds doesn't mean your program crashes immediately. The addresses we touch in our program are virtual addresses, so let's look at the layout of a process's virtual address space.
The address space is divided into several segments (stack, heap, bss, etc.). When we request dynamic memory, the returned address is usually located in the heap segment (I say usually because sometimes the allocator will use mmap instead, in which case the address lies in a memory-mapped area between the stack and the heap).
The dynamic memory we obtain is not contiguous, but the heap itself is one contiguous segment, and from the OS's point of view any access within the heap segment is legal. This is exactly what the allocator relies on. The allocator manages the heap and divides it into blocks, some of which are marked "used" and some of which are marked "free". When we request dynamic memory, the allocator looks for a free block that can hold the size we need (splitting off a smaller new block if the free block is much larger than needed), marks it as used, and returns its address. If no such free block can be found, the allocator calls sbrk to grow the heap.
Even if we access an address that is out of range, as long as it lies within the heap the OS regards the access as legal, although it might overwrite data in a used block or write data into a free block. But if the address we try to access is outside the heap (for example, an address beyond the program break, or an address located in the bss), the OS treats it as a segmentation fault and the program crashes immediately.
So your program crashing has nothing to do with the parameter of std::make_unique<int>. It just so happens that when you specify 10000, the addresses you access fall outside the segment.
std::make_unique<int>(Size);
This doesn't do what you are expecting!
It creates a single int and initializes it to the value Size!
I'm pretty sure your plan was to do:
auto p = std::make_unique<int[]>(Size);
Note the extra brackets. Also note that the result type is different: it is not std::unique_ptr<int> but std::unique_ptr<int[]>, and for this type operator[] is provided!
Here is a fixed version (sketched below), but IMO you should just use std::vector.
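Roughly what the fixed version might look like (a sketch of my own, adapting the question's function; the std::vector alternative is shown alongside):

#include <iostream>
#include <memory>
#include <vector>

// The parameter type changes to std::unique_ptr<int[]>, which is what
// std::make_unique<int[]>(Size) returns and which provides operator[].
void SmartPointerfunction(std::unique_ptr<int[]>& Mem, int Size)
{
    Mem = std::make_unique<int[]>(Size);   // allocates Size ints, zero-initialized
    for (int k = 0; k < Size; k++)
    {
        Mem[k] = k;                        // no out-of-bounds access now
    }
}

int main()
{
    std::unique_ptr<int[]> Mem;
    SmartPointerfunction(Mem, 10000);
    std::cout << Mem[9999] << std::endl;   // prints 9999

    std::vector<int> v(10000);             // the simpler alternative
    for (int k = 0; k < (int)v.size(); k++) v[k] = k;
    std::cout << v[9999] << std::endl;
}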
I need to make a big array in one task (more than 10^7 elements).
What I found is that if I define it inside int main, the code doesn't work (the program exits before reaching the cout, with "Process returned -1073741571 (0xC00000FD)").
If I define it outside, everything works.
(I am using Code::Blocks 17.12.)
// doesn't work
#include <bits/stdc++.h>
using namespace std;

const int N = 1e7;

int main() {
    int a[N];
    cout << 1;
    return 0;
}
// works
#include <bits/stdc++.h>
using namespace std;

const int N = 1e7;
int a[N];

int main() {
    cout << 1;
    return 0;
}
So I have two questions:
- Why does this happen?
- What can I do to define the array inside int main()? (Actually, if I make a vector of the same size inside int main(), everything works, which seems strange.)
There are four main types of memory that are interesting to C++ programmers: the stack, the heap, static memory, and registers.
In
const int N = 1e7;
int main(){int a[N];}
stack memory is used.
This type of memory is usually much more limited in size than the heap or static memory, which is why that error code is returned.
Operator new (or another function that allocates memory on the heap) is needed in order to use the heap:
const int N = 1e7;
int main(){int* a = new int[N]; delete[] a;}
Usually, operator new is not used explicitly.
std::vector uses the heap (i.e. it uses new or something lower level underneath), as opposed to std::array or a C-style array such as int[N]. Because of that, std::vector is usually capable of holding bigger chunks of data than std::array or a C-style array.
If you do
const int N = 1e7;
int a[N];
int main(){}
static memory is used. It's usually much less limited in size than stack memory.
To wrap up: you used the stack in int main(){int a[N];}, static memory in int a[N]; int main(){}, and the heap in int main(){std::vector<int> v(N);}, and because of that received different results.
Use the heap for big arrays (via std::vector or operator new; examples are given above, and a short combined sketch follows).
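As a compact side-by-side of the three storage options discussed above (a sketch of mine, not the original answer's code):

#include <vector>

const int N = 1e7;

int g[N];                        // static memory: fine for large arrays

int main() {
    // int a[N];                 // stack: would overflow the ~1MB/8MB default limit
    std::vector<int> v(N);       // heap, managed for you (preferred)
    int *p = new int[N];         // heap via new; requires a matching delete[]
    delete[] p;
    return 0;
}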
The problem is that your array is actually very big. Assuming that int is 4 bytes, 10,000,000 integers take 40,000,000 bytes, which is about 40 MB. On Windows the default maximum stack size is 1 MB, and on modern Linux it is 8 MB. Local variables are located on the stack, so you are allocating your 40 MB array in a 1 MB or 8 MB stack (on Windows or Linux respectively), and your program runs out of stack space. A global array is fine because global variables are located in the bss (data) segment of the program, which has a static size that does not change. And a std::vector allocates its elements in dynamic memory, i.e. on the heap, which is why your program does not crash with it. If you don't want to use std::vector, you can dynamically allocate an array on the heap like this:
int* arrayPtr = new int[N];
Then you need to free the dynamically allocated memory when you are done with it, using the delete[] operator:
delete[] arrayPtr;
But in this case you need to know how to work with pointers. Or, if you want the array to be non-dynamic and visible only in main, you can make it static (I think 99.9% this will work 😅, give it a try), like this:
int main() { static int arr[N]; return 0; }
This will be located in the data segment (like a global variable).
I have a struct:
struct A
{
    double a;
    int c;
    double *array;
};

int main()
{
    A *str = new A[50];
    for (int i = 0; i < 50; i++)
    {
        str[i].array = new double[5];
        str[i].array[0] = 50;
    }
    .....
    Buffer BufA = Buffer(..., ..., 50 * sizeof(A), str);
    .....
}
In the kernel:
struct A
{
    double a;
    int c;
    double *array;
};

__kernel void vector(__global A *str)
{
    int id = get_global_id(0);
    printf("Element - %f", str[id].array[0]);
}
But the kernel does not see the values in the array. Probably this is because the buffer only holds the memory for the array of structures, not the memory of the dynamically allocated arrays. How can I implement this?
On modern systems, a process doesn't see the actual addresses of objects, but rather their virtual addresses.
This means two processes cannot pass each other pointers and expect them to mean the same thing. You need to rethink your application with that in mind.
On top of the address virtualization mentioned by YSC, you should also keep in mind that the memory that your graphics card (or other OCL device) is operating on may be distinct (as in, different pieces of hardware) from the memory your CPU is operating on.
The OpenCL buffers are responsible for transporting their contents between these memories. So for example an array of ints that you create and write to on the CPU would have to be copied to GPU memory (and have space allocated there, and possibly be copied back after the kernel is done), which these buffers do for you. But if you store pointers to other CPU memory in your buffer, then that other memory will not be transferred automatically. Further, the pointer relation would most likely break, as there is no guarantee that your other data is at the same location in GPU memory as in CPU memory.
The solution, naturally, is to put all the data you want transferred into buffers, including the sub-arrays. One way to do this without using an excessive number of buffers is to pack the sub-arrays together into a single buffer and store indices into it instead of pointers to memory.
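For concreteness, here is a rough host-side sketch of that packing idea (my illustration in plain C++; the struct name A2 and the layout are assumptions, and the actual OpenCL buffer creation is left out): the per-struct arrays live in one flat array, and the kernel would read flat[i * 5 + j] instead of following str[i].array.

#include <vector>

struct A2 {        // hypothetical pointer-free version of the struct
    double a;
    int c;
    // the 5 doubles that used to hang off `array` now live in `flat` below
};

int main() {
    const int count = 50, inner = 5;
    std::vector<A2> str(count);
    std::vector<double> flat(count * inner);

    for (int i = 0; i < count; i++) {
        flat[i * inner + 0] = 50;          // was: str[i].array[0] = 50;
    }

    // Create one buffer from str.data() (count * sizeof(A2) bytes) and a
    // second buffer from flat.data() (count * inner * sizeof(double) bytes);
    // inside the kernel, element j of struct i is flat[i * inner + j].
    return 0;
}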
I wrote a CUDA kernel like this:
#include <iostream>
using std::cout;

__global__ void mykernel(int size, int *h){
    double *x[size];
    for(int i = 0; i < size; i++){
        x[i] = new double[2];
    }
    h[0] = 20;
}

int main(){
    int size = 2.5 * 100000; // or 10,000
    int *h = new int[size];
    int *u;
    size_t sizee = size * sizeof(int);
    cudaMalloc(&u, sizee);
    mykernel<<<size, 1>>>(size, u);
    cudaMemcpy(&h, &u, sizee, cudaMemcpyDeviceToHost);
    cout << h[0];
}
I have some other code in the kernel too, but I have commented it out. The code above it also allocates some more memory.
Now when I run this with size = 2.5*10^5 I get an h[0] value of 0.
When I run this with size = 100*100 I get an h[0] value of 20.
So I am guessing that my kernels are crashing because I am running out of memory. I am using a Tesla C2075 card, which has 2GB of RAM! I even tried this after shutting down the X server. What I am working on is not even 100MB of data.
How can I allocate more memory to each block?
Now when I run this with size = 2.5*10^5 I get an h[0] value of 0.
When I run this with size = 100*100 I get an h[0] value of 20.
In your kernel launch, you are using this size variable also:
mykernel<<<size, 1>>>(size, u);
           ^^^^
On a cc2.0 device (Tesla C2075), this particular parameter in the 1D case is limited to 65535. So 2.5*10^5 exceeds 65535, but 100*100 does not. Therefore, your kernel may be running if you specify size of 100*100, but is probably not running if you specify size of 2.5*10^5.
As already suggested to you, proper CUDA error checking should point this error out to you, and in general will probably result in you needing to ask far fewer questions on SO, as well as posting higher-quality questions. Take advantage of the CUDA runtime's ability to let you know when things have gone wrong and when you are making a mistake. Then you won't be in a quandary, thinking you have a memory allocation problem when in fact you probably have a kernel launch configuration problem.
How can I allocate more memory to each block?
Although it is probably not your main issue (as indicated above), in-kernel new and malloc are limited to the size of the device heap. Once this has been exhausted, further calls to new or malloc will return a null pointer. If you use this null pointer anyway, your kernel code will begin to perform unspecified behavior, and will likely crash.
When using new and malloc, especially when you're having trouble, it's good practice to check for a null return value. This applies to both host (at least for malloc) and device code.
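For example, a guarded version of the in-kernel allocation might look like this (a sketch, assuming a cc2.0+ device where device-side printf and new/delete are available; the kernel name is mine):

#include <cstdio>

__global__ void mykernel_checked(int *h) {
    double *p = new double[2];       // comes from the device heap
    if (p == nullptr) {              // heap exhausted: bail out cleanly
        printf("device heap exhausted in block %d\n", (int)blockIdx.x);
        return;
    }
    p[0] = 0.0;
    h[0] = 20;
    delete[] p;
}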
The size of the device heap is pretty small to begin with (8MB), but it can be modified.
Referring to the documentation:
The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() without explicitly specifying the heap size.
The following API functions get and set the heap size:
• cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
• cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
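Putting the heap-size APIs together with the error checking mentioned above, a host-side sketch (my own, with the kernel launch itself elided) could look like:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    std::printf("default device heap: %zu bytes\n", heap);

    // Request a larger heap before any kernel that uses new/malloc is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)128 * 1024 * 1024);

    // ... kernel launch goes here ...

    cudaError_t err = cudaGetLastError();       // catches launch-configuration errors
    if (err != cudaSuccess)
        std::printf("launch error: %s\n", cudaGetErrorString(err));
    err = cudaDeviceSynchronize();              // catches errors during execution
    if (err != cudaSuccess)
        std::printf("kernel error: %s\n", cudaGetErrorString(err));
    return 0;
}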
Here is the code.
First I malloc and free a big block of memory, then I malloc many small blocks of memory until it runs out of memory, and I free ALL those small blocks.
After that, I try to malloc a big block of memory again.
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    static const int K = 1024;
    static const int M = 1024 * K;
    static const int G = 1024 * M;

    static const int BIG_MALLOC_SIZE = 1 * G;
    static const int SMALL_MALLOC_SIZE = 3 * K;
    static const int SMALL_MALLOC_TIMES = 1 * M;

    void **small_malloc = (void **)malloc(SMALL_MALLOC_TIMES * sizeof(void *));

    void *big_malloc = malloc(BIG_MALLOC_SIZE);
    printf("big malloc first time %s\n", (big_malloc == NULL) ? "failed" : "succeeded");
    free(big_malloc);

    for (int i = 0; i != SMALL_MALLOC_TIMES; ++i)
    {
        small_malloc[i] = malloc(SMALL_MALLOC_SIZE);
        if (small_malloc[i] == NULL)
        {
            printf("small malloc failed at %d\n", i);
            break;
        }
    }

    for (int i = 0; i != SMALL_MALLOC_TIMES && small_malloc[i] != NULL; ++i)
    {
        free(small_malloc[i]);
    }

    big_malloc = malloc(BIG_MALLOC_SIZE);
    printf("big malloc second time %s\n", (big_malloc == NULL) ? "failed" : "succeeded");
    free(big_malloc);

    return 0;
}
Here is the result:
big malloc first time succeeded
small malloc failed at 684912
big malloc second time failed
It looks like there is memory fragmentation.
I know memory fragmentation happens when there are many small empty spaces in memory but no empty space big enough for a large malloc.
But I've already freed EVERYTHING I malloc'd, so the memory should be empty.
Why can't I malloc the big block the second time?
I use Visual Studio 2010 on Windows 7, and I build a 32-bit program.
The answer, sadly, is still fragmentation.
Your initial large allocation ends up tracked by one allocation block; however, when you start allocating large numbers of 3k blocks of memory, your heap gets sliced into chunks.
Even when you free the memory, small pieces of the block remain allocated within the process's address space. You can use a tool like Sysinternals VMMap to see these allocations visually.
It looks like 16M blocks are used by the allocator, and once these blocks are freed up they never get returned to the free pool (i.e. the blocks remain allocated).
As a result you don't have enough contiguous memory to allocate the 1GB block the second time.
Even though I know just a little about this, I found the following thread, Why does malloc not work sometimes?, which covers a similar topic to yours.
It contains the following links:
http://www.eskimo.com/~scs/cclass/int/sx7.html (Pointer Allocation Strategies)
http://www.gidforums.com/t-9340.html (reasons why malloc fails?)
The issue is likely that even if you free every allocation, malloc does not return all the memory to the operating system.
When your program requested the numerous smaller allocations, malloc had to increase the size of the "arena" from which it allocates memory.
There is no guarantee that if you free all the memory, the arena will shrink to the original size. It's possible that the arena is still there, and all the blocks have been put into a free list (perhaps coalesced into larger blocks).
The presence of this lingering arena in your address space may be making it impossible to satisfy the large allocation request.