Unable to measure static arrays memory usage with GetProcessMemoryInfo - c++

I am trying to learn both how memory usage works and how to measure it in C++. I know that under Windows, a quick way to retrieve the amount of RAM being used by the current process, when including <Windows.h> and <Psapi.h>, is:
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) );
(uint64_t)info.WorkingSetSize;
Then, I used that to run a very simple test:
#include <iostream>
#include <cstdint>
#include <Windows.h>
#include <Psapi.h> // declares GetProcessMemoryInfo and PROCESS_MEMORY_COUNTERS
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
const int N(1000000);
int x[N]; //in the second run, comment this line out
int y[N]; //in the second run, comment this line out
//int *x = new int[N]; //in the second run UNcomment this line out
//int *y = new int[N]; //in the second run UNcomment this line out
for (int i = 0; i < N; i++)
{
x[i] = 1;
y[i] = 2;
}
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize - currentUsedRAM;
std::cout << "Current RAM used: " << currentUsedRAM << "\n";
return 0;
}
What I don't understand is that when I run the code above, the output is: Current RAM used: 0, while I was expecting something around 8 MB, since I filled two 1D int arrays of 1 million entries each. Now, if I re-run the code but make x and y dynamically allocated arrays, the output is, as expected: Current RAM used: 8007680.
Why is that? How can I make it detect memory usage in both cases?

The compiler has optimised your code. In fact, in your first run neither x nor y is allocated. Considering that there is a visible side effect (the return value of GetProcessMemoryInfo), this optimisation seems somewhat odd.
Anyway, you can prevent this by adding some other side effect, such as outputting the sum of the elements of those two arrays, which will guarantee a crash (presumably because the two arrays need about 8 MB of stack, well above the 1 MB default stack size on Windows).
Memory for local objects with automatic storage duration is allocated at the beginning of the enclosing code block and deallocated at the end. So your code cannot measure the memory usage of any variable with automatic storage duration in main (nor could my deleted code snippet, a point I was not aware of). Things are different for objects with dynamic storage duration: they are allocated on request.
I designed a test that uses recursion, for the discussion in the comments. You can see that the memory usage increases as the program recurses deeper. This shows that the figure does include memory used on the stack. By the way, it does not count how much memory your objects need, but how much your program needs.
void foo(int depth, int *a, int *b, uint64_t usage) {
if (depth >= 100)
return ;
int x[100], y[100];
for (int i = 0; i < 100; i++)
{
x[i] = 1 + (a==nullptr?0:a[i]);
y[i] = 2 + (b==nullptr?0:b[i]);
}
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
std::cout << "Current RAM used: " << info.WorkingSetSize - usage << "\n";
foo(depth+1,x,y,usage);
int sum = 0;
for (int i=0; i<100; i++)
sum += x[i] + y[i];
std::cout << sum << std::endl;
}
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
foo(0, nullptr, nullptr, currentUsedRAM);
return 0;
}
/*
Current RAM used: 0
Current RAM used: 61440
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 73728
*/
The system allocates 4 KB at a time, which is the size of a page. I don't know why the first value is 0 and the next is suddenly 61440. Explaining how Windows manages memory is very hard and far beyond my ability, though I am confident about the 4 KB granularity, and that the figure does count the memory used by variables with automatic storage duration.
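The 4 KB granularity can be checked against the figures quoted in this thread. The following is only a sketch: it assumes 4096-byte pages and 4-byte ints, and the helper names pagesNeeded and deltaToPages are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4 KiB pages, as the deltas printed above suggest.
constexpr std::uint64_t kPageSize = 4096;

// Hypothetical helper: whole pages needed to hold `bytes`.
constexpr std::uint64_t pagesNeeded(std::uint64_t bytes) {
    return (bytes + kPageSize - 1) / kPageSize;
}

// Hypothetical helper: converts a working-set delta into a page count.
// Working-set figures are expected to be whole pages, hence the assert.
inline std::uint64_t deltaToPages(std::uint64_t deltaBytes) {
    assert(deltaBytes % kPageSize == 0);
    return deltaBytes / kPageSize;
}

// Two arrays of 1,000,000 four-byte ints need at least this many pages.
static_assert(pagesNeeded(2ull * 1000000 * 4) == 1954,
              "8,000,000 bytes round up to 1954 pages");
```

With these helpers, deltaToPages(8007680) gives 1955, one page more than the minimum of 1954, which would be consistent with the arrays not starting exactly on a page boundary. The recursion output likewise corresponds to whole pages: 61440, 65536, 69632 and 73728 bytes are 15, 16, 17 and 18 pages.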

Related

C++: Checking if dynamic de-allocation has worked correctly

I am currently following a book from Springer called "Guide to scientific computing in C++", and one of its exercises regarding pointers says as follows:
"Write code that allocates memory dynamically to two vectors of doubles of length 3, assigns values to each of the entries, and then de-allocates the memory. Extend this code so that it calculates the scalar product of these vectors and prints it to screen before the memory is de-allocated. Put the allocation of memory, calculation and de-allocation of memory inside a for loop that runs 1,000,000,000 times: if the memory is not de-allocated properly your code will use all available resources and your computer may struggle."
My attempt at this is:
for (long int j = 0; j < 1000000000; j++) {
// Allocate memory for the variables
int length = 3;
double *pVector1;
double *pVector2;
double *scalarProduct;
pVector1 = new double[length];
pVector2 = new double[length];
scalarProduct = new double[length];
for (int i = 0; i < length; i++) { // loop to give values to the variables
pVector1[i] = (double) i + 1;
pVector2[i] = pVector1[i] - 1;
scalarProduct[i] = pVector1[i] * pVector2[i];
std::cout << scalarProduct[i] << " " << std::flush; // print scalar product
}
std::cout << std::endl;
// deallocate memory
delete[] pVector1;
delete[] pVector2;
delete[] scalarProduct;
}
My problem is that this code runs, but is inefficient. It seems that de-allocating the memory should be much faster, since the program runs for over a minute before I terminate it. I assume that I am misusing the de-allocation, but I haven't found a proper way to fix it.
Your code does exactly what it is supposed to: run for a long time without crashing your computer by running out of memory. The book might be a bit dated, as it assumes you cannot allocate more than 72,000,000,000 bytes before crashing. You can test this by removing the deletes, thereby leaking the memory.
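For comparison, here is a minimal sketch of one iteration of the exercise written with std::vector, so that the memory is released automatically at the end of each iteration. The function names are hypothetical. Note that it computes a single scalar (dot) product, whereas the code in the question prints the element-wise products; the console output inside the loop is also likely to account for much of the run time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes the scalar (dot) product of two equally sized vectors.
inline double scalarProduct(const std::vector<double>& a,
                            const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// One iteration of the exercise: allocate, fill, compute, release.
// The vectors free their storage automatically when they go out of scope.
inline double oneIteration() {
    const std::size_t length = 3;
    std::vector<double> v1(length), v2(length);
    for (std::size_t i = 0; i < length; ++i) {
        v1[i] = static_cast<double>(i) + 1.0; // 1, 2, 3
        v2[i] = v1[i] - 1.0;                  // 0, 1, 2
    }
    return scalarProduct(v1, v2);             // 1*0 + 2*1 + 3*2
}
```

Calling oneIteration() in the billion-iteration loop allocates and frees on every pass without any explicit delete[], so there is nothing to forget.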

Strange behavior: memcpy faster 1x when src is not set value

GCC version: gcc 4.8.5
copt: -std=c++11 -O3
SIZE = 50 * 1024 * 1024
The first piece of code:
int main() {
char* src = new char[SIZE];
char* dst = new char[SIZE];
memset(dst, 'a', SIZE);
for (size_t i = 0; i < 5; ++i) {
size_t start = now();
memcpy(dst, src, SIZE);
cout << "timer:" << now() - start << "ms" << endl;
}
return 0;
}
Output:
timer:5ms
timer:4ms
timer:5ms
timer:5ms
timer:4ms
The second piece of code:
int main() {
char* src = new char[SIZE];
char* dst = new char[SIZE];
memset(src, 'a', SIZE);
memset(dst, 'a', SIZE);
for (size_t i = 0; i < 5; ++i) {
size_t start = now();
memcpy(dst, src, SIZE);
cout << "timer:" << now() - start << "ms" << endl;
}
return 0;
}
Output:
timer:9ms
timer:8ms
timer:8ms
timer:8ms
timer:8ms
The third piece of code:
int main() {
char* src = new char[SIZE];
char* dst = new char[SIZE];
for (size_t i = 0; i < 5; ++i) {
size_t start = now();
memcpy(dst, src, SIZE);
cout << "timer:" << now() - start << "ms" << endl;
}
return 0;
}
Output:
timer:22ms
timer:4ms
timer:5ms
timer:5ms
timer:5ms
Summary:
Comparing the first and third cases: the first round of the third case is slow because of minor page faults.
Questions:
Why doesn't reading src in memcpy trigger any minor page faults in the first case?
Why is the second case about twice as slow as the first? Is there some optimization in the OS?
memcpy is bound by external memory throughput. It looks like the OS is able to allocate memory virtually in the page tables and perform copy-on-write. This would explain both phenomena: for an unmodified src there would be only one reserved block of physical memory, which would sit in the fastest cache (cases 1 and 3 as numbered in the question). In case 2, where src has been written, all memory accesses have to go out to external memory. The roughly 5x penalty in the first run of case 3 is due to the virtually allocated dst pages being given unique physical pages on first write.
Timing the initial memsets N times in a row should confirm the hypothesis.
The copy-on-write technique can be extended to support efficient memory allocation by having a page of physical memory filled with zeros. When the memory is allocated, all the pages returned refer to the page of zeros and are all marked copy-on-write. This way, physical memory is not allocated for the process until data is written, allowing processes to reserve more virtual memory than physical memory and use memory sparsely, at the risk of running out of virtual address space.

Why deallocating heap memory is much slower than allocating it?

This is an empirical assumption (that allocating is faster than de-allocating).
This is also one of the reasons, I guess, why heap-based storage (such as STL containers) chooses not to return currently unused memory to the system (that is why the shrink-to-fit idiom was born).
And we shouldn't confuse, of course, 'heap' memory with 'heap'-like data structures.
So why is de-allocation slower?
Is it Windows-specific (I see it on Win 8.1) or OS-independent?
Is there some C++ specific memory manager automatically involved on using 'new' / 'delete', or does the whole memory management rely completely on the OS? (I know C++11 introduced some garbage-collection support, which I never really used; I prefer relying on the old stack and static duration, or self-managed containers and RAII.)
Also, in the code of the Folly string I saw old C heap allocation / deallocation being used; is it faster than C++ 'new' / 'delete'?
P.S. Please note that the question is not about virtual memory mechanics; I understand that user-space programs don't use real memory addressing.
The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64MB of memory in 32-byte chunks (so 2M calls to new), and I tried deleting that memory in the same order it was allocated, and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random deallocation was about 10% slower than linear allocation.
I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.
So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.
To answer some of your questions:
Is there some C++ specific memory manager automatically involved on using 'new' / 'delete'?
Yes. The OS has system calls which allocate pages of memory (typically 4KB chunks) to processes. It's the process' job to divide up those pages into objects. Try looking up the "GNU Memory Allocator."
I saw old C heap allocation / deallocation being used; is it faster than C++ 'new' / 'delete'?
Most C++ new/delete implementations just call malloc and free under the hood. This is not required by the standard, however, so it's a good idea to always use the same allocation and deallocation function on any particular object.
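One way to keep the allocation and deallocation functions matched is to tie the deleter to the pointer type. This is only a sketch with hypothetical names (FreeDeleter, MallocedInts, mallocInts): memory obtained from std::malloc is always released with std::free, never with delete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter that releases memory with std::free, matching std::malloc.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

// Owning pointer to a malloc'ed int array.
using MallocedInts = std::unique_ptr<int[], FreeDeleter>;

// Hypothetical helper: allocates `count` ints with malloc and returns an
// owner that will call free on destruction.
inline MallocedInts mallocInts(std::size_t count) {
    assert(count > 0);
    void* raw = std::malloc(count * sizeof(int));
    if (!raw) {
        throw std::bad_alloc();
    }
    return MallocedInts(static_cast<int*>(raw));
}
```

A caller writes auto p = mallocInts(4); and uses p[0] to p[3]; the buffer is freed when p goes out of scope.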
I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:
#include "stdafx.h"
#include "CppUnitTest.h"
using namespace Microsoft::VisualStudio::CppUnitTestFramework;
namespace AllocationSpeedTest
{
class Obj32 {
uint64_t a;
uint64_t b;
uint64_t c;
uint64_t d;
};
constexpr int len = 1024 * 1024 * 2;
Obj32* ptrs[len];
TEST_CLASS(UnitTest1)
{
public:
TEST_METHOD(Linear32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
}
TEST_METHOD(Linear32AllocDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Random32AllocShuffle)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
}
TEST_METHOD(Random32AllocShuffleDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Mixed32Both)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Dealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Neither)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
};
}
And here are the raw results over several runs. All numbers are in milliseconds.
I had much the same idea as @Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.
#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>
int main() {
std::cout.imbue(std::locale(""));
using namespace std::chrono;
using factor = microseconds;
auto const size = 2000;
std::vector<int *> allocs(size);
auto start = high_resolution_clock::now();
for (int i = 0; i < size; i++)
allocs[i] = new int[size];
auto stop = high_resolution_clock::now();
auto alloc_time = duration_cast<factor>(stop - start).count();
start = high_resolution_clock::now();
for (int i = 0; i < size; i++)
delete[] allocs[i];
stop = high_resolution_clock::now();
auto del_time = duration_cast<factor>(stop - start).count();
std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}
I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.
alloc time: 2,381 uS
del time: 1,429 uS
alloc time: 2,764 uS
del time: 1,592 uS
alloc time: 2,492 uS
del time: 1,442 uS
I'd warn, however, that allocation and freeing are handled (primarily) by the standard library, so this could differ between one standard library and another (even when using the same compiler). I'd also note that it wouldn't surprise me if this were to change somewhat in multi-threaded code. Although it's not actually correct, there appear to be a few authors who are under the misapprehension that freeing in a multithreaded environment requires locking a heap for exclusive access. This can be avoided, but the means to do so isn't necessarily immediately obvious.
I am not sure of your observation. I wrote the following program (on Linux, hopefully you could port it to your system).
// public domain code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
const unsigned possible_word_sizes[] = {
1, 2, 3, 4, 5,
8, 12, 16, 24,
32, 48, 64, 128,
256, 384, 2048
};
long long totalsize;
// return a calloc-ed array of nbchunks malloced zones of
// somehow random size
void **
malloc_chunks (int nbchunks)
{
const int nbsizes =
(int) (sizeof (possible_word_sizes)
/ sizeof (possible_word_sizes[0]));
void **ad = calloc (nbchunks, sizeof (void *));
if (!ad)
{
perror ("calloc chunks");
exit (EXIT_FAILURE);
};
for (int ix = 0; ix < nbchunks; ix++)
{
unsigned sizindex = random () % nbsizes;
unsigned size = possible_word_sizes[sizindex];
void *zon = malloc (size * sizeof (void *));
if (!zon)
{
fprintf (stderr,
"malloc#%d (%d words) failed (total %lld) %s\n",
ix, size, totalsize, strerror (errno));
exit (EXIT_FAILURE);
}
((int *) zon)[0] = ix;
totalsize += size;
ad[ix] = zon;
}
return ad;
}
void
free_chunks (void **chks, int nbchunks)
{
// first, free the two thirds of chunks in random order
for (int i = 0; 3 * i < 2 * nbchunks; i++)
{
int pix = random () % nbchunks;
if (chks[pix])
{
free (chks[pix]);
chks[pix] = NULL;
}
}
// then, free the rest in reverse order
for (int i = nbchunks - 1; i >= 0; i--)
if (chks[i])
{
free (chks[i]);
chks[i] = NULL;
}
}
int
main (int argc, char **argv)
{
assert (sizeof (int) <= sizeof (void *));
int nbchunks = (argc > 1) ? atoi (argv[1]) : 32768;
if (nbchunks < 128)
nbchunks = 128;
srandom (time (NULL));
printf ("nbchunks=%d\n", nbchunks);
void **chks = malloc_chunks (nbchunks);
clock_t clomall = clock ();
printf ("clomall=%ld totalsize=%lld words\n",
(long) clomall, totalsize);
free_chunks (chks, nbchunks);
clock_t clofree = clock ();
printf ("clofree=%ld\n", (long) clofree);
return 0;
}
I compiled it with gcc -O2 -Wall mf.c -o mf on my Debian/Sid/x86-64 (i3770k, 16 GB). I ran time ./mf 100000 and got:
nbchunks=100000
clomall=54162 totalsize=19115681 words
clofree=83895
./mf 100000 0.02s user 0.06s system 95% cpu 0.089 total
On my system, clock gives CPU microseconds. If the call to random is negligible (and I don't know if it is) w.r.t. malloc & free time, I tend to disagree with your observations: free seems to be about twice as fast as malloc. My gcc is 6.1, my libc is Glibc 2.22.
Please take time to compile the above benchmark on your system and report the timings.
FWIW, I took Jerry's code and
g++ -O3 -march=native jerry.cc -o jerry
time ./jerry; time ./jerry; time ./jerry
gives
alloc time: 1940516
del time: 602203
./jerry 0.00s user 0.01s system 68% cpu 0.016 total
alloc time: 1893057
del time: 558399
./jerry 0.00s user 0.01s system 68% cpu 0.014 total
alloc time: 1818884
del time: 527618
./jerry 0.00s user 0.01s system 70% cpu 0.014 total
When you allocate small memory blocks, the block size you specify maps directly to a suballocator for that size, which is commonly represented as a "slab" of memory containing same-size records, to avoid memory fragmentation. This can be very fast, similar to an array access. But freeing such blocks is not so straightforward, because you are passing a pointer to memory of unknown size, which requires additional work to determine which slab it belongs to before the block can be returned to its proper place.
When you allocate large blocks of virtual memory, a memory page range is set up in your process space without actually mapping any physical memory to it, and that requires very little work to accomplish. But freeing such large blocks can require much more work, because the pointer freed must first be matched to the page tables for that range, followed by walking through all of the page entries for the memory range that it spans, and releasing all of the physical memory pages assigned to that range by the intervening page faults.
Of course, the details of this will vary depending on the implementation being used, but the principles remain much the same: memory allocation of a known block size requires less effort than releasing a pointer to a memory block of unknown size. My knowledge of this comes directly from my experience developing high-performance commercial grade RAII memory allocators.
I should also point out that since every heap allocation has a matching and corresponding release, this pair of operations represents a single allocation cycle, i.e. as the two sides of one coin. Together, their execution time can be accurately measured, but separately such measurement is difficult to pin down, as it varies widely depending on block size, previous activity across similar sizes, caching and other operational considerations. But in the end, allocate/free differences may not much matter, since you don't do one without the other.
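To make the asymmetry described above concrete, here is a deliberately simplified fixed-size pool. It is a hypothetical illustration, not how any particular allocator is implemented: it is not thread-safe, it ignores alignment, and it manages a single slab. Allocation pops a free slot index, while deallocation first has to recover the slot from the raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-slab pool of equally sized slots.
class FixedPool {
public:
    FixedPool(std::size_t slotSize, std::size_t slotCount)
        : slotSize_(slotSize), storage_(slotSize * slotCount) {
        assert(slotSize > 0);
        // All slots start free; push in reverse so slot 0 is handed out first.
        for (std::size_t i = slotCount; i > 0; --i) {
            freeSlots_.push_back(i - 1);
        }
    }

    // Allocation: the size is known up front, so this is just a stack pop.
    void* allocate() {
        if (freeSlots_.empty()) {
            return nullptr;
        }
        const std::size_t index = freeSlots_.back();
        freeSlots_.pop_back();
        return storage_.data() + index * slotSize_;
    }

    // Deallocation: only a pointer is given, so the slot index has to be
    // recovered from the address before the slot can be reused.
    void deallocate(void* p) {
        assert(p != nullptr);
        const auto* bytes = static_cast<const unsigned char*>(p);
        const auto offset = static_cast<std::size_t>(bytes - storage_.data());
        assert(offset % slotSize_ == 0 && offset < storage_.size());
        freeSlots_.push_back(offset / slotSize_);
    }

private:
    std::size_t slotSize_;
    std::vector<unsigned char> storage_;
    std::vector<std::size_t> freeSlots_; // stack of free slot indices
};
```

With one slab the lookup in deallocate is a subtraction and a division. A real allocator with many slabs of different sizes would first have to find the slab that owns the pointer, which is the extra work the answer refers to.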
The problem here is heap fragmentation. Programs written in languages with explicit pointer arithmetic have no realistic way of defragmenting the heap.
If your heap is fragmented, you can't return memory to the OS. The OS, barring virtual memory, depends on a brk(2)-like mechanism, i.e. you set an upper bound for all the memory addresses you'll refer to. But when even one buffer that is still in use sits near the existing boundary, you can't return memory to the OS explicitly. It doesn't matter if 99% of all the memory in your program has been freed.
Deallocation doesn't have to be slower than allocation. But the fact that you have manual deallocation with heap fragmentation makes allocation slower and more complex.
GCs fight this by compacting the heap. This way, allocation is just incrementing a pointer for them, and deallocation is not needed for the bulk of objects.
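The "allocation is just incrementing a pointer" part can be sketched with a bump (arena) allocator. This is a hypothetical illustration only, not a garbage collector: it shows the cheap allocation path, while a real compacting collector would additionally move live objects and update references to them. It ignores alignment and is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump ("arena") allocator.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : storage_(capacity), used_(0) {}

    // Allocation is a bounds check plus a pointer increment.
    void* allocate(std::size_t bytes) {
        assert(bytes > 0);
        if (bytes > storage_.size() - used_) {
            return nullptr; // arena exhausted
        }
        void* p = storage_.data() + used_;
        used_ += bytes;
        return p;
    }

    // Individual objects are never freed; the whole arena is released at once.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_;
};
```

Consecutive calls to allocate return adjacent addresses, and reset makes the full capacity available again without touching any individual block.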

Linux really allocating memory it shouldn't in C++ code

In Linux, the kernel doesn't allocate any physical memory pages until we actually use that memory, but I am having a hard time working out why it does in fact allocate memory here:
for(int t = 0; t < T; t++){
for(int b = 0; b < B; b++){
Matrix[t][b].length = 0;
Matrix[t][b].size = 60;
Matrix[t][b].pointers = (Node**)malloc(60*sizeof(Node*));
}
}
I then access this data structure to add one element to it like this:
Node* elem = NULL;
Matrix[a][b].length++;
Matrix[a][b].pointers[ Matrix[a][b].length ] = elem;
Essentially, I run my program with htop on the side, and Linux does allocate more memory if I increase the number 60 in the code above. Why? Shouldn't it only allocate one page when the first element is added to the array?
It depends on how your Linux system is configured.
Here's a simple C program that tries to allocate 1TB of memory and touches some of it.
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
char *array[1000];
int i;
for (i = 0; i < 1000; ++i)
{
if (NULL == (array[i] = malloc((int) 1e9)))
{
perror("malloc failed!");
return -1;
}
array[i][0] = 'H';
}
for (i = 0; i < 1000; ++i)
printf("%c", array[i][0]);
printf("\n");
sleep(10);
return 0;
}
When I run top by its side, it says the VIRT memory usage goes to 931g (where g means GiB), while RES only goes to 4380 KiB.
Now, when I change my system to use a different overcommit strategy by /sbin/sysctl -w vm.overcommit_memory=2 and re-run it, I get:
malloc failed!: Cannot allocate memory
So your system may be using a different overcommit strategy than you expected. For more information read this.
Your assumption that malloc / new doesn't cause any memory to be written, and therefore assigned physical memory by the OS, is incorrect (for the memory allocator implementation you have).
I've reproduced the behavior you are describing in the following simple program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv)
{
char **array[128][128];
int size;
int i, j;
if (1 == argc || 0 >= (size = atoi(argv[1])))
fprintf(stderr, "usage: %s <num>; where num > 0\n", argv[0]), exit(-1);
for (i = 0; i < 128; ++i)
for (j = 0; j < 128; ++j)
if (NULL == (array[i][j] = malloc(size * sizeof(char*))))
{
fprintf(stderr, "malloc failed when i = %d, j = %d\n", i, j);
perror(NULL);
return -1;
}
sleep(10);
return 0;
}
When I run this with various small size parameters as input, the VIRT and RES memory footprints (as reported by top) grow together in-step, even though I'm not explicitly touching the inner arrays that I'm allocating.
This basically holds true until size exceeds ~512. Thereafter, RES stays constant at 64 MiB while VIRT can be extremely large (e.g. - 1220 GiB when size is 10M). That is because 512 * 8 = 4096, which is a common virtual page size on Linux systems, and 128 * 128 * 4096 B = 64 MiB.
Therefore, it looks like the first page of every allocation is being mapped to physical memory, probably because malloc / new itself is writing to part of the allocation for its own internal book keeping. Of course, lots of small allocations may fit in and be placed on the same page, so only one page gets mapped to physical memory for many such allocations.
In your code example, changing the size of the array matters because it means fewer of those arrays can fit on one page, therefore requiring more memory pages to be touched by malloc / new itself (and therefore mapped to physical memory by the OS) over the run of the program.
When you use 60, that takes about 480 bytes, so ~8 of those allocations can be put on one page. When you use 100, that takes about 800 bytes, so only ~5 of those allocations can be put on one page. So, I'd expect the "100 program" to use about 8/5ths as much memory as the "60 program", which seems to be a big enough difference to make your machine start swapping to stable storage.
If each of your smaller "60" allocations were already over 1 page in size, then changing it to be bigger "100" wouldn't affect your program's initial physical memory usage, just like you originally expected.
PS - I think whether you explicitly touch the initial page of your allocations or not will be irrelevant as malloc / new will have already done so (for the memory allocator implementation you have).
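The arithmetic in the paragraphs above can be written out as a rough sketch. It assumes 4 KiB pages and 8-byte pointers, as in the answer, and ignores the allocator's own per-allocation overhead; arraysPerPage and pagesTouched are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions: 4 KiB pages and 8-byte pointers; allocator overhead ignored.
constexpr std::size_t kPageSize = 4096;
constexpr std::size_t kPointerSize = 8;

// How many arrays of `pointerCount` pointers fit into one page.
constexpr std::size_t arraysPerPage(std::size_t pointerCount) {
    return kPageSize / (pointerCount * kPointerSize);
}

// Rough number of pages touched by `arrayCount` small allocations.
inline std::size_t pagesTouched(std::size_t arrayCount,
                                std::size_t pointerCount) {
    const std::size_t perPage = arraysPerPage(pointerCount);
    assert(perPage > 0); // only meaningful while one array fits in a page
    return (arrayCount + perPage - 1) / perPage;
}

static_assert(arraysPerPage(60) == 8, "480-byte arrays: about 8 per page");
static_assert(arraysPerPage(100) == 5, "800-byte arrays: about 5 per page");
```

For the 128 x 128 test program, pagesTouched(128 * 128, 60) is 2048 pages and pagesTouched(128 * 128, 100) is 3277 pages, a ratio of about 8/5. At 512 pointers per array each allocation fills a whole page, giving 16384 pages, i.e. the 64 MiB plateau mentioned above.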
Here's a sketch of what you could do if you expect that your b arrays will usually be small, i.e. fewer than 2^X pointers (X = 5 in the code below), while still handling exceptional cases where they get even bigger.
You can adjust X down if your expected usage doesn't match. You could also adjust the minimum size arrays up from 0 (and not allocate the smaller 2^i levels), if you expect most of your arrays will usually use at least 2^Y pointers (e.g. - Y = 3).
If you think that actually X == Y (e.g. - 4) for your usage pattern, then you can just do one allocation of B * (0x1 << X) * sizeof(Node*) and divvy up that T array to your b's. Then if a b array needs to exceed 2^X pointers, then resort to malloc for it followed by realloc's if it needs to grow even further.
The main point here is that the initial allocation will map to very little physical memory, addressing the problem that initially spurred your original question.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define T 1278
#define B 131072
#define CAP_MAX_LG2 5
#define CAP_MAX (0x1 << CAP_MAX_LG2) // pre-alloc T's to handle all B arrays of length up to 2^CAP_MAX_LG2
typedef struct Node Node;
typedef struct
{
int t; // so a matrix element can know to which T_Allocation it belongs
int length;
int cap_lg2; // log base 2 of capacity; -1 if capacity is zero
Node **pointers;
} MatrixElem;
typedef struct
{
Node **base; // pre-allocs B * 2^(CAP_MAX_LG2 + 1) Node pointers; every b array can be any of { 0, 1, 2, 4, 8, ..., CAP_MAX } capacity
Node **frees_pow2[CAP_MAX_LG2 + 1]; // frees_pow2[i] will point at the next free array of 2^i pointers to Node to allocate to a growing b array
} T_Allocation;
MatrixElem Matrix[T][B];
T_Allocation T_Allocs[T];
int Node_init(Node *n) { return 0; } // just a dummy
void Node_fini(Node *n) { } // just a dummy
int Node_eq(const Node *n1, const Node *n2) { return 0; } // just a dummy
void Init(void)
{
for(int t = 0; t < T; t++)
{
T_Allocs[t].base = malloc(B * (0x1 << (CAP_MAX_LG2 + 1)) * sizeof(Node*));
if (NULL == T_Allocs[t].base)
abort();
T_Allocs[t].frees_pow2[0] = T_Allocs[t].base;
for (int x = 1; x <= CAP_MAX_LG2; ++x)
T_Allocs[t].frees_pow2[x] = &T_Allocs[t].base[B * (0x1 << (x - 1))];
for(int b = 0; b < B; b++)
{
Matrix[t][b].t = t;
Matrix[t][b].length = 0;
Matrix[t][b].cap_lg2 = -1;
Matrix[t][b].pointers = NULL;
}
}
}
Node *addElement(MatrixElem *elem)
{
if (-1 == elem->cap_lg2 || elem->length == (0x1 << elem->cap_lg2)) // elem needs a bigger pointers array to add an element
{
int new_cap_lg2 = elem->cap_lg2 + 1;
int new_cap = (0x1 << new_cap_lg2);
if (new_cap_lg2 <= CAP_MAX_LG2) // new b array can still fit in pre-allocated space in T
{
Node **new_pointers = T_Allocs[elem->t].frees_pow2[new_cap_lg2];
memcpy(new_pointers, elem->pointers, elem->length * sizeof(Node*));
elem->pointers = new_pointers;
T_Allocs[elem->t].frees_pow2[new_cap_lg2] += new_cap;
}
else if (elem->cap_lg2 == CAP_MAX_LG2) // exceeding pre-alloc'ed arrays in T; use malloc
{
Node **new_pointers = malloc(new_cap * sizeof(Node*));
if (NULL == new_pointers)
return NULL;
memcpy(new_pointers, elem->pointers, elem->length * sizeof(Node*));
elem->pointers = new_pointers;
}
else // already exceeded pre-alloc'ed arrays in T; use realloc
{
Node **new_pointers = realloc(elem->pointers, new_cap * sizeof(Node*));
if (NULL == new_pointers)
return NULL;
elem->pointers = new_pointers;
}
++elem->cap_lg2;
}
Node *ret = malloc(sizeof(Node)); // needs a complete definition of struct Node
if (ret)
{
Node_init(ret);
elem->pointers[elem->length] = ret;
++elem->length;
}
return ret;
}
int removeElement(const Node *a, MatrixElem *elem)
{
int i;
for (i = 0; i < elem->length && !Node_eq(a, elem->pointers[i]); ++i);
if (i == elem->length)
return -1;
Node_fini(elem->pointers[i]);
free(elem->pointers[i]);
--elem->length;
memmove(&elem->pointers[i], &elem->pointers[i+1], sizeof(Node*) * (elem->length - i));
return 0;
}
int main()
{
return 0;
}

CUDA kernel automatically recall kernel to finish vector addition. Why?

I am just beginning to play with CUDA so I tried out a textbook vector addition code. However, when I specify kernel calls to only add the first half of vector, the second half also gets added! This behavior stops when I include some thrust library header.
I am totally confused. Please see the code below:
#include <iostream>
using namespace std;
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC)
{
//printf("gridDim.x is %d \n",gridDim.x);
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// printf("tid is %d \n",tid);
d_resultC[tid] = d_dataA[tid] + d_dataB[tid];
}
int main()
{
const int ARRAY_SIZE = 8*1024;
const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
float *h_dataA, *h_dataB, *h_resultC;
float *d_dataA, *d_dataB, *d_resultC;
h_dataA = (float *)malloc(ARRAY_BYTES);
h_dataB = (float *)malloc(ARRAY_BYTES);
h_resultC = (float *)malloc(ARRAY_BYTES);
for(int i=0; i<ARRAY_SIZE;i++){
h_dataA[i]=i+1;
h_dataB[i]=2*(i+1);
};
cudaMalloc((void **)&d_dataA,ARRAY_BYTES);
cudaMalloc((void **)&d_dataB,ARRAY_BYTES);
cudaMalloc((void **)&d_resultC,ARRAY_BYTES);
cudaMemcpy(d_dataA, h_dataA,ARRAY_BYTES, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataB, h_dataB,ARRAY_BYTES, cudaMemcpyHostToDevice);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
dim3 dimBlock(ARRAY_SIZE/8,1,1);
dim3 dimGrid(1,1,1);
VecAdd<<<dimGrid,dimBlock>>>(d_dataA, d_dataB, d_resultC);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
cudaMemcpy(h_resultC,d_resultC ,ARRAY_BYTES,cudaMemcpyDeviceToHost);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
return 0;
}
Have you launched it first with ARRAY_SIZE threads and then with the half of them? (or 1/8)
You are not initializing d_resultC, so it probably still holds the result of a previous execution. That would explain the behavior, though it may not be the cause.
Add a cudaMemset on d_resultC and tell us what happens.
I can't say for sure why your kernel is processing more elements than expected. It processes one element per thread, so the number of elements processed should definitely be blockDim.x*gridDim.x.
I want to point out though, that it's good practice to write kernels that use "grid stride loops" so they aren't so dependent on the block and thread count. The performance cost is negligible and if you are performance-sensitive, the blocking parameters are different for different GPUs.
http://cudahandbook.to/15QbFWx
So you should add a count parameter (the number of elements to process), then write something like:
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC, int N)
{
for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
i < N;
i += blockDim.x*gridDim.x ) {
d_resultC[i] = d_dataA[i] + d_dataB[i];
}
}
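To see why the grid-stride form does not depend on the launch configuration, the indexing can be emulated on the host. This is plain C++, not CUDA: gridDim and blockDim are ordinary parameters here, the (block, thread) pairs are visited sequentially rather than in parallel, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the grid-stride loop above. Returns, for each
// element index, how many times it was processed.
inline std::vector<int> emulateGridStride(std::size_t n, std::size_t gridDim,
                                          std::size_t blockDim) {
    assert(gridDim > 0 && blockDim > 0);
    std::vector<int> visits(n, 0);
    const std::size_t stride = blockDim * gridDim;
    for (std::size_t block = 0; block < gridDim; ++block) {
        for (std::size_t thread = 0; thread < blockDim; ++thread) {
            // Same index arithmetic as in the kernel.
            for (std::size_t i = block * blockDim + thread; i < n; i += stride) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True if every element was processed exactly once.
inline bool allVisitedOnce(const std::vector<int>& visits) {
    for (int v : visits) {
        if (v != 1) {
            return false;
        }
    }
    return true;
}
```

With the configuration from the question (one block of 1024 threads, 8192 elements), every element is still visited exactly once, because each emulated thread loops over eight elements instead of handling only one.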
As others mentioned above, this may be caused by data left over from your previous run. Not freeing the memory you allocated may be the reason for this odd situation.
I think you should free the allocated arrays on the host using free and also free the memory on the GPU using cudaFree.
I also strongly recommend allocating the host memory with cudaMallocHost instead of malloc, and freeing it at the end of the program with cudaFreeHost. This will give you faster copies. See here: cudaMallocHost
Anyway, don't forget to free heap memory in a C/C++ program, whether it uses CUDA or not.