Difference between copying 16 bytes and 256 bytes to shared memory - C++

I'm writing a CUDA program. The code copies pinned memory to shared memory; by pinned memory I mean memory allocated with cudaHostAlloc(., ., cudaHostAllocMapped). It takes 600us to copy 16 bytes and 8ms to copy 256 bytes. Why is the difference so large?
My code looks like:
__global__
void kernel_func(char* dict, int dict_len)
{
    __shared__ char s_dict[256];
    /* dict_len = 16; */
    if (threadIdx.x == 0) { // copy once for each block
        memcpy((unsigned char*)s_dict, (unsigned char*)dict, dict_len);
    }
    __syncthreads();
}
kernel_func<<<32, 128>>>("256 bytes pinned memory", 256);
The environment: GTX650 + CUDA 6.5 + Win7-32bit

Because 256 bytes is 16 times what you previously copied (16 bytes).
16 bytes took 600us, and 16 times that is 9600us, which is close to the 8000us (8ms) you observed (1ms = 1000us).
In other words, there is nothing surprising here: a single thread is copying mapped host memory byte by byte, so the time scales roughly linearly with the number of bytes.
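If the absolute copy time matters, a common pattern (not from the original answer) is to let all the threads of the block participate in the copy instead of having thread 0 call memcpy; a minimal sketch, assuming dict_len <= 256:

__global__
void kernel_func_coop(const char* dict, int dict_len)
{
    __shared__ char s_dict[256];
    // Each thread copies a strided subset of the bytes, so the reads
    // from mapped memory are issued in parallel rather than serially.
    for (int i = threadIdx.x; i < dict_len; i += blockDim.x)
        s_dict[i] = dict[i];
    __syncthreads();
    // ... use s_dict ...
}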

Related

Memory allocation fails when requesting more than 16GB of contiguous memory

My workstation has 128GB of memory. I cannot allocate a single array that takes more than ~16GB of (contiguous) memory, but I can allocate multiple arrays of around 15GB each.
Sample code:
#include <stdlib.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    int MM = 1000000;
    int NN = 2200; // 2000 is okay, uses ~16GB of memory; 2200 produces a segmentation fault
    double* testMem1d;
    testMem1d = (double*) malloc(MM*NN*sizeof(double));
    double* testMem1d1; // with NN=2000, allocating another array (or two) at the same time is okay
    testMem1d1 = (double*) malloc(MM*NN*sizeof(double));
    cout << "testMem1d allocated" << endl;
    cin.get(); // okay up to here: only malloc, no access to the array elements
    cout << "testMem1d[MM*NN-1]=" << testMem1d[MM*NN-1] << endl;
    cout << "testMem1d1[MM*NN-1]=" << testMem1d1[MM*NN-1] << endl;
    // keep running and check the physical memory footprint
    for (int tt = 0; tt < 1000; tt++)
    {
        for (int ii = 0; ii < MM*NN; ii++)
        {
            testMem1d[ii] = ii;
            testMem1d1[ii] = ii;
        }
        cout << "MM=" << MM << ", NN=" << NN << ", testMem1d[MM*NN-1]=" << testMem1d[MM*NN-1] << endl;
    }
}
Please ignore the fact that I am using malloc() in C++, unless it is an essential part of the problem. (Is it?) I need/want to use malloc() for other reasons.
Some observations:
(1) Allocating multiple arrays, each smaller than 15GB, is fine.
(2) Calling malloc() alone is fine; the segmentation fault happens when accessing the array elements.
I thought there might be some system setting that limited memory allocation, but everything in "ulimit -a" seems fine. Since the program can access a 64-bit virtual address space, I couldn't find any reason that would limit only contiguous memory allocation.
OS: Ubuntu 16.04. I tried g++ and icc with -mcmodel=large; it seems irrelevant.
uname -a
Linux 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515031
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 515031
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Edits:
(1) malloc() actually returns NULL [to mcleod_ideafix]
(2) [to zwol]
free -m
total used free shared buff/cache available
Mem: 128809 18950 107840 1129 2018 107910
Swap: 974 939 35
The multiplication MM*NN*sizeof(double) is left-associative, so it is evaluated as (MM * NN) * sizeof(double). On a platform with a 32-bit int, MM * NN would be 2200000000, which cannot be represented in a 32-bit int: the multiplication overflows (undefined behavior) and in practice wraps around to -2094967296. This value is then promoted to the common type with sizeof(double), which is size_t. A signed-to-unsigned conversion is well-defined: the value is reduced modulo 2^64, which on a two's-complement machine with a 64-bit size_t amounts to sign extension and yields 18446744071614584320. That value is then multiplied by sizeof(double) (assumed to be 8); this wraps around several times (which is safe, since size_t is unsigned) and results in a request for 18446744056949813248 bytes. Your machine doesn't have that much memory, so malloc returns NULL.
That's why it's good to put sizeof as the first operand in the malloc call:
malloc(sizeof(double) * MM * NN);
in which case the operands are promoted to size_t before the multiplication.
Still, that alone wouldn't be enough, because the overflow also happens in testMem1d[MM*NN-1] and in ii<MM*NN. So you should change the types of MM and NN to a type with enough bits to hold the product:
size_t MM = 1000000;
size_t NN = 2200;
Or cast the values to a suitable type before each multiplication that can overflow.
Note also that the two statements printing testMem1d[MM*NN-1] and testMem1d1[MM*NN-1] read uninitialized memory.
Prefer using new in C++.
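Putting the fixes together, a minimal corrected sketch of the allocation (keeping the asker's malloc(), with size_t so no multiplication overflows):

#include <stdlib.h>
#include <iostream>
using namespace std;

int main()
{
    size_t MM = 1000000;
    size_t NN = 2200;
    // sizeof(double) comes first: all arithmetic is done in 64-bit size_t
    double* testMem1d = (double*) malloc(sizeof(double) * MM * NN);
    if (testMem1d == NULL) {
        cout << "allocation failed" << endl;
        return 1;
    }
    testMem1d[MM*NN - 1] = 42.0; // write before reading
    cout << "testMem1d[MM*NN-1]=" << testMem1d[MM*NN - 1] << endl;
    free(testMem1d);
}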

How does memory on the heap get exhausted?

I have been testing some of my own code to see how much allocated memory it takes to exhaust the heap (free store). However, unless my test code is wrong, I am getting completely different results for how much memory can be put on the heap.
I am testing two different programs. The first program creates vector objects on the heap. The second program creates integer objects on the heap.
Here is my code:
#include <vector>
#include <stdio.h>

int main()
{
    long long unsigned bytes = 0;
    unsigned megabytes = 0;
    for (long long unsigned i = 0; ; i++) {
        std::vector<int>* pt1 = new std::vector<int>(100000, 10);
        bytes += sizeof(*pt1);
        bytes += pt1->size() * sizeof(pt1->at(0));
        megabytes = bytes / 1000000;
        if (i >= 1000 && i % 1000 == 0) {
            printf("There are %u megabytes on the heap\n", megabytes);
        }
    }
}
The final output of this code before getting a bad_alloc error is: "There are 2000 megabytes on the heap"
In the second program:
#include <stdio.h>

int main()
{
    long long unsigned bytes = 0;
    unsigned megabytes = 0;
    for (long long unsigned i = 0; ; i++) {
        int* pt1 = new int(10);
        bytes += sizeof(*pt1);
        megabytes = bytes / 1000000;
        if (i >= 100000 && i % 100000 == 0) {
            printf("There are %u megabytes on the heap\n", megabytes);
        }
    }
}
The final output of this code before getting a bad_alloc error is: "There are 511 megabytes on the heap"
The final output in both programs is vastly different. Am I misunderstanding something about the free store? I thought that both results would be about the same.
It is very likely that pointers returned by new on your platform are 16-byte aligned.
If int is 4 bytes, this means that for every new int(10) you're getting four bytes and making 12 bytes unusable.
This alone would explain the difference between getting 500MB of usable space from small allocations and 2000MB from large ones.
On top of that, there's the overhead of keeping track of allocated blocks (at a minimum, their size and whether they're free or in use). That is specific to your system's memory allocator, but it too is a per-allocation cost. See "What is a Chunk" in https://sourceware.org/glibc/wiki/MallocInternals for an explanation of glibc's allocator.
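On glibc you can observe this per-allocation overhead directly with malloc_usable_size() (glibc-specific, declared in <malloc.h>); a small sketch:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h> // malloc_usable_size, glibc-specific

int main(void)
{
    // A 4-byte request typically comes back as a chunk with ~24 usable
    // bytes on 64-bit glibc, plus hidden bookkeeping before the pointer.
    int* small = (int*) malloc(sizeof(int));
    printf("requested %zu bytes, usable %zu\n",
           sizeof(int), malloc_usable_size(small));

    char* big = (char*) malloc(400000);
    printf("requested %d bytes, usable %zu\n",
           400000, malloc_usable_size(big));

    free(small);
    free(big);
    return 0;
}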
First of all, you have to understand that the operating system assigns memory to a process in fairly large chunks called pages (this is a hardware property). Page sizes are typically in the 4-16 kB range.
The standard library tries to use that memory efficiently, so it has to chop pages into smaller pieces and manage them. To do that, some extra information about the heap structure has to be maintained.
Here is a good Andrei Alexandrescu CppCon talk on more or less how this works (it omits the details of page management).
So when you allocate lots of small objects, the bookkeeping for the heap structure is quite large. On the other hand, allocating a smaller number of larger objects is more efficient: less memory is wasted on tracking the heap structure.
Note also that, depending on the heap strategy, it is sometimes more efficient (when a small piece of memory is requested) to waste some memory and return a larger block than was requested.
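For reference, the page size can be queried at runtime with POSIX sysconf(); a tiny sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE); // typically 4096 bytes on x86-64
    printf("page size: %ld bytes\n", page);
    return 0;
}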

Why does creating a 1D array larger than my memory fail, but I have no issue creating a 2D array larger than my memory?

Why does creating a 1D array larger than my memory fail, but I can create a 2D array larger than my memory? I thought the OS gives you virtual memory and you can request as much as you want. It's not until you start reading from and writing to memory, and it becomes part of the resident set, that the hardware constraints become an issue.
On the small VM with 512MB of memory I tried:
1 × 512 MB array: no issue
1 × 768 MB array: no issue
1 × 879 MB array: no issue
1 × 880 MB array: fails
1 × 1024 MB array: fails
1000 × 512 MB arrays: no issue (at this point I've allocated 256GB of virtual memory, well exceeding the physical limits)
On a large VM with 8GB of memory, all the above worked.
For this experiment, I used this code:
#include <stdio.h>  /* printf */
#include <stdlib.h> /* atoi */
#include <iostream>
#include <unistd.h>

int main(int argc, char *argv[], char **envp) {
    if (argc < 3) {
        printf("main <mb> <times>\n");
        return -1;
    }
    int megabytes = atoi(argv[1]);
    int times = atoi(argv[1]);
    // megabytes   1024 kilobytes   1024 bytes   1 integer
    // --------- * -------------- * ---------- * ---------
    //               megabyte        kilobyte     4 bytes
    int sizeOfArray = megabytes*1024*1024/sizeof(int);
    long long bytes = megabytes*1024*1024;
    printf("grabbing memory :%dmb, arrayEntrySize:%d, times:%d bytes:%lld\n",
           megabytes, sizeOfArray, times, bytes);
    int ** array = new int*[times];
    for (int i = 0; i < times; i++) {
        array[i] = new int[sizeOfArray];
    }
    while (true) {
        // 1 second to microseconds
        usleep(1*1000000);
    }
    for (int i = 0; i < times; i++) {
        delete [] array[i];
    }
    delete [] array;
}
Commands and outputs of experiments on the small 512MB VM:
free -h
total used free shared buff/cache available
Mem: 488M 66M 17M 5.6M 404M 381M
Swap: 511M 72K 511M
./a.out 512 1
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
./a.out 768 1
grabbing memory :768mb, arrayEntrySize:201326592, times:768 bytes:805306368
./a.out 1024 1
grabbing memory :1024mb, arrayEntrySize:268435456, times:1024 bytes:1073741824
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
./a.out 512 1000
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
#htop
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
2768 root 20 0 256G 4912 2764 S 0.0 1.0 0:00.00 ./a.out 512 1000
Commands and outputs of experiments on the large 8GB VM:
free -h
total used free shared buff/cache available
Mem: 7.8G 78M 7.6G 8.8M 159M 7.5G
Swap: 511M 0B 511M
./a.out 512 1
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
./a.out 768 1
grabbing memory :768mb, arrayEntrySize:201326592, times:768 bytes:805306368
./a.out 1024 1
grabbing memory :1024mb, arrayEntrySize:268435456, times:1024 bytes:1073741824
./a.out 512 1000
grabbing memory :512mb, arrayEntrySize:134217728, times:512 bytes:536870912
# htop
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1292 root 20 0 256G 6920 2720 S 0.0 0.1 0:00.00 ./a.out 512 1000
This is because memory is allocated in chunks based on how much you ask for.
For the 2D array you are asking for a series of relatively small blocks, and each block isn't necessarily next to the others.
The 1D array, however, is massive and requires a single full-sized, contiguous block of address space, and you may not have a free block of that size available even if that much memory is available in total.
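Whether contiguity is really the limiting factor on a given machine can be probed directly. A hedged sketch that requests the same total once as a single block and once as 1 MB pieces (the outcome depends on the system's overcommit policy and address-space layout):

#include <stdio.h>
#include <new> // std::nothrow

int main()
{
    const size_t total = 880UL * 1024 * 1024; // the size that failed above
    const size_t piece = 1024 * 1024;

    // One contiguous block:
    char* one = new (std::nothrow) char[total];
    printf("single %zu MB block: %s\n", total >> 20, one ? "ok" : "failed");
    delete[] one;

    // The same total, split into 1 MB pieces (leaked: one-shot experiment):
    size_t got = 0;
    for (size_t i = 0; i < total / piece; i++) {
        char* p = new (std::nothrow) char[piece];
        if (p == NULL) break;
        got += piece;
    }
    printf("split into pieces: %zu MB reserved\n", got >> 20);
    return 0;
}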

Why is previously allocated memory inaccessible after a failed realloc?

I need to allocate all the memory available to a process, in order to implement a test of a system service. The test (among others) requires exhausting all available resources, attempting the call, and checking for a specific outcome.
To do this, I wrote a loop that reallocates a block of memory, doubling the size until realloc returns NULL. It then takes the last good allocation and bisects between the last successful quantity and the last unsuccessful quantity, until the unsuccessful quantity is 1 byte larger than the last successful one, guaranteeing that all available memory is consumed.
The code I wrote is as follows (debug prints included):
#include <stdio.h>
#include <malloc.h>

int main(void)
{
    char* X;
    char* lastgood = NULL;
    char* toalloc = NULL;
    unsigned int top = 1;
    unsigned int bottom = 1;
    unsigned int middle;
    do
    {
        bottom = top;
        lastgood = toalloc;
        top = bottom*2;
        printf("lastgood = %p\ntoalloc = %p\n", lastgood, toalloc);
        if (lastgood != NULL)
            printf("*lastgood = %i\n", *lastgood);
        toalloc = realloc(toalloc, top);
        printf("lastgood = %p\ntoalloc = %p\n", lastgood, toalloc);
        if (toalloc == NULL && lastgood != NULL)
            printf("*lastgood = %i\n", *lastgood); // the segfault happens here
    } while (toalloc != NULL);
    do
    {
        if (toalloc != NULL) lastgood = toalloc;
        else toalloc = lastgood;
        middle = bottom + (top - bottom)/2;
        toalloc = realloc(toalloc, middle);
        if (toalloc == NULL) top = middle;
        else bottom = middle;
    } while (top - bottom > 1);
    if (toalloc != NULL) lastgood = toalloc;
    X = lastgood;
    // make a call that attempts to get more memory
    free(X);
}
According to realloc's man page, realloc does not destroy the previous block if it returns NULL. Even so, this code results in a segfault when it tries to print *lastgood after toalloc receives NULL from realloc. Why is this happening, and is there a better way to grab the exact quantity of unallocated memory?
I am running it on glibc, on Ubuntu with kernel 3.11.x.
You are not checking the value of top for overflow. This is what happens with its value:
2 4 8 16 32 64 128 256 512 1024
2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576
2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456 536870912 1073741824
2147483648 0
Just before the last realloc(), the new value of top is 0 again (actually 2^32, but that doesn't fit in 32 bits), and realloc(ptr, 0) causes the memory block to actually be deallocated, which is why the subsequent dereference crashes.
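Independently of the overflow, the usual defensive idiom is to keep the old pointer until realloc succeeds, and to use size_t for sizes; a sketch of that pattern:

#include <stdlib.h>

/* Grows *buf to new_size; on failure *buf is left valid and unchanged. */
static int grow(char **buf, size_t new_size)
{
    char *tmp = (char *) realloc(*buf, new_size);
    if (tmp == NULL)
        return -1; /* old block is still allocated and still safe to use */
    *buf = tmp;
    return 0;
}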
Trying to allocate the maximum contiguous block is not a good idea. The memory map seen by the user process already has some blocks allocated for shared libraries, plus the code and data of the current process. Unless you specifically want to know the largest contiguous block you can allocate, the way to go is to allocate as much as you can in a single block; when you have reached that limit, do the same with a different pointer, and keep going until you really run out of memory. Note that on 64-bit systems you don't get all available memory in just one malloc()/realloc(): as I've just seen, malloc() on 64-bit systems allocates up to 4GB per call, even though you can issue several mallocs() and still succeed on each of them.
A visual of a user process memory map as seen in a 32-bit Linux system is described in an answer I gave a few days ago:
Is kernel space mapped into user space on Linux x86?
I've come up with this program, to "eat" all the memory it can:
#include <stdio.h>
#include <malloc.h>
typedef struct slist
{
char *p;
struct slist *next;
} TList;
int main(void)
{
size_t nbytes;
size_t totalbytes = 0;
int i = 0;
TList *list = NULL, *node;
node = malloc (sizeof *node);
while (node)
{
node->next = list;
list = node;
nbytes = -1; /* can I actually do this? */
node->p = malloc(nbytes);
while (nbytes && !node->p)
{
nbytes/=2;
node->p = malloc(nbytes);
}
totalbytes += nbytes + sizeof *node;
if (nbytes==0)
break;
i++;
printf ("%8d", i);
}
printf ("\nBlocks allocated: %d. Memory used: %f GB\n",
i, totalbytes/(1024*1048576.0));
return 0;
}
The execution yields these values in a 32-bit Linux system:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53
Blocks allocated: 53. Memory used: 2.998220 GB
Very close to the 3GB limit of 32-bit Linux systems. On a 64-bit Linux system, I've reached 30000 blocks of 4GB each, and still counting. I don't really know whether Linux can allocate that much memory or it's my mistake. According to this, the maximum virtual address space is 128TB (that would be 32768 4GB blocks).
UPDATE: in fact, it is. I've left this program running on a 64-bit box, and after 110074 successfully allocated blocks, the grand total of memory allocated was 131071.578884 GB. Each malloc() was able to allocate as much as 4 GB per operation, but on reaching 115256 GB it began to allocate at most 2 GB, and on reaching 123164 GB allocated, it began to allocate at most 1 GB per malloc(). The progression tends asymptotically to 131072 GB, but it actually stops a little earlier, at 131071.578884 GB, because the process itself, its data, and its shared libraries use a few KB of memory.
I think getrlimit() is the answer to your new question. Here is the man page: http://linux.die.net/man/2/getrlimit
You are probably interested in RLIMIT_AS and RLIMIT_DATA; please check the man page.
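A minimal sketch of querying a limit (RLIMIT_AS shown here; which constant matters for your test is an assumption on my part):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0) { /* address-space limit */
        printf("soft limit: %llu\n", (unsigned long long) rl.rlim_cur);
        printf("hard limit: %llu\n", (unsigned long long) rl.rlim_max);
    }
    return 0;
}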
The first part can be replaced with the following:
char *p;
size_t size = SIZE_MAX; /* SIZE_MAX: #include <stdint.h> */
do {
    p = malloc(size);
    size >>= 1;
} while (p == NULL);

Why doesn't this app consume as much memory as expected?

I wrote a simple application to test memory consumption. In this test application, I created four processes to continually consume memory; those processes won't release the memory unless the process exits.
I expected this test application to consume most of the RAM and cause other applications to slow down or crash. But the result is not as expected. Below is the code:
#include <stdio.h>
#include <unistd.h>
#include <list>
#include <vector>
using namespace std;

unsigned short calcrc(unsigned char *ptr, int count)
{
    unsigned short crc;
    unsigned char i;
    // high CPU-consumption code
    // implements the CRC algorithm
    // CRC is Cyclic Redundancy Code
}

void* ForkChild(void* param){
    vector<unsigned char*> MemoryVector;
    pid_t PID = fork();
    if (PID > 0){
        const int TEN_MEGA = 10 * 10 * 1024 * 1024;
        unsigned char* buffer = NULL;
        while(1){
            buffer = NULL;
            buffer = new unsigned char [TEN_MEGA];
            if (buffer){
                try{
                    calcrc(buffer, TEN_MEGA);
                    MemoryVector.push_back(buffer);
                } catch(...){
                    printf("An error was throwed, but caught by our app!\n");
                    delete [] buffer;
                    buffer = NULL;
                }
            }
            else{
                printf("no memory to allocate!\n");
                try{
                    if (MemoryVector.size()){
                        buffer = MemoryVector[0];
                        calcrc(buffer, TEN_MEGA);
                        buffer = NULL;
                    } else {
                        printf("no memory ever allocated for this Process!\n");
                        continue;
                    }
                } catch(...){
                    printf("An error was throwed -- branch 2,"
                           "but caught by our app!\n");
                    buffer = NULL;
                }
            }
        } // while(1)
    } else if (PID == 0){
    } else {
        perror("fork error");
    }
    return NULL;
}

int main(){
    int children = 4;
    while(--children >= 0){
        ForkChild(NULL);
    };
    while(1) sleep(1);
    printf("exiting main process\n");
    return 0;
}
TOP command
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2775 steve 20 0 1503m 508 312 R 99.5 0.0 1:00.46 test
2777 steve 20 0 1503m 508 312 R 96.9 0.0 1:00.54 test
2774 steve 20 0 1503m 904 708 R 96.6 0.0 0:59.92 test
2776 steve 20 0 1503m 508 312 R 96.2 0.0 1:00.57 test
Though the CPU usage is high, the memory percentage remains 0.0. How can that be possible?
Free command
free shared buffers cached
Mem: 3083796 0 55996 428296
Free memory is more than 3G out of 4G RAM.
Does anybody know why this test app just doesn't work as expected?
Linux uses optimistic memory allocation: it will not physically allocate a page of memory until that page is actually written to. For that reason, you can allocate much more memory than what is available, without increasing memory consumption by the system.
If you want to force the system to allocate (commit) a physical page, then you have to write to it.
The following line does not issue any write, as it performs default-initialization of unsigned char, which is a no-op:
buffer = new unsigned char [TEN_MEGA];
If you want to force a commit, use zero-initialization:
buffer = new unsigned char [TEN_MEGA]();
To make the comments into an answer:
Linux will not allocate memory pages for a process until it writes to them (copy-on-write).
Additionally, you are not writing to your buffer anywhere, as the default constructor for unsigned char does not perform any initializations, and new[] default-initializes all items.
fork() returns the PID in the parent, and 0 in the child. Your ForkChild as written will execute all the work in the parent, not the child.
And the standard new operator will never return null; it will throw if it fails to allocate memory (but due to overcommit it won't actually do that either in Linux). This means your test of buffer after the allocation is meaningless: it will always either take the first branch or never reach the test. If you want a null return, you need to write new (std::nothrow) .... Include <new> for that to work.
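For completeness, a tiny sketch of the nothrow form, which yields a null pointer on failure instead of throwing:

#include <new> // std::nothrow
#include <stdio.h>

int main()
{
    const int TEN_MEGA = 10 * 1024 * 1024;
    // Returns NULL on failure instead of throwing std::bad_alloc.
    unsigned char* buffer = new (std::nothrow) unsigned char[TEN_MEGA];
    if (buffer == NULL)
        printf("no memory to allocate!\n");
    delete[] buffer;
    return 0;
}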
But your program is in fact doing what you expected it to do. As another answer has pointed out (Michael Foukarakis's answer), memory that is not used is not allocated. In your output of the top program, I noticed that the VIRT column showed a large amount of memory for each process running your program. A little googling later, I saw what this was:
VIRT -- Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
So as you can see, your program does in fact reserve memory for itself, but in the form of pages of virtual memory, and I think that is a smart thing to do.
A snippet from this wiki page
A page, memory page, or virtual page -- a fixed-length contiguous block of virtual memory, and it is the smallest unit of data for the following:
memory allocation performed by the operating system for a program; and
transfer between main memory and any other auxiliary store, such as a hard disk drive.
...Thus a program can address more (virtual) RAM than physically exists in the computer. Virtual memory is a scheme that gives users the illusion of working with a large block of contiguous memory space (perhaps even larger than real memory), when in actuality most of their work is on auxiliary storage (disk). Fixed-size blocks (pages) or variable-size blocks of the job are read into main memory as needed.
Sources:
http://www.computerhope.com/unix/top.htm
https://stackoverflow.com/a/18917909/2089675
http://en.wikipedia.org/wiki/Page_(computer_memory)
If you want to gobble up a lot of memory:
int mb = 0;
char* buffer;
while (1) {
    buffer = malloc(1024*1024);
    if (buffer == NULL) break;    /* stop once allocation fails */
    memset(buffer, 0, 1024*1024); /* writing forces the pages to be committed */
    mb++;
}
I used something like this to make sure the file buffer cache was empty when taking some file I/O timing measurements.
As other answers have already mentioned, your code doesn't ever write to the buffer after allocating it. Here memset is used to write to the buffer.