Memory page alignment doesn't seem to affect performance...?

I am developing an application for which performance is a fundamental issue. In particular, I want to organize a tree-like structure that needs to be traversed really quickly, in blocks of the same size as my memory page, so as to reduce the number of cache misses needed to reach a leaf.
I am quite a novice in the art of memory optimization. As far as I understand, the process of accessing the main memory goes more or less as follows:
CPUs have several layers of cache, of increasing size and decreasing speed.
Every time some data that I need is already in the cache, it is fetched from the cache (cache hit).
If it is not in the cache, it will be fetched from the main memory.
Any time something is loaded from the main memory, the whole page (or pages) containing the data is loaded and stored in the cache. This way, if I try to access locations in memory that are close to the ones I have already fetched from the main memory, they will already be in my CPU cache.
However, if I organize my data in blocks of the same size as my memory page, I thought I would also need to align that data properly, so that whenever a new block of my data needs to be loaded, only one page of memory has to be fetched from the main memory, rather than the two pages containing the first half and the second half of my data block. In principle, shouldn't a correctly aligned data block mean only one access to memory rather than two? Shouldn't that more or less double memory performance?
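(For what it's worth, a block can be forced to start exactly on a page boundary by requesting aligned memory explicitly. A minimal sketch, assuming a POSIX system with posix_memalign; C++17's std::aligned_alloc would work as well:)
#include <stdlib.h>
#include <unistd.h>

int main()
{
    size_t pagesize = (size_t) sysconf(_SC_PAGE_SIZE);
    void *block = NULL;
    // Request one page-sized block starting exactly at a page boundary.
    if (posix_memalign(&block, pagesize, pagesize) != 0)
        return 1;
    // ... fill the block with one tree node's worth of data ...
    free(block);
}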
I tried the following:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>

using namespace std;

#define BLOCKS 262144
#define TESTS  131072

// Wall-clock time in microseconds.
unsigned long int utime()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return tp.tv_sec * 1000000 + tp.tv_usec;
}

unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);
unsigned long int block_slots = pagesize / sizeof(unsigned int);
unsigned int t = 0;
unsigned int p = 0;

// Sum the contents of TESTS pseudo-randomly chosen blocks and return the
// elapsed time in microseconds.
unsigned long int test(unsigned int * data)
{
    unsigned long int start = utime();
    for(unsigned int n = 0; n < TESTS; n++)
    {
        for(unsigned int i = 0; i < block_slots; i++)
            t += data[p * block_slots + i];
        p = t % BLOCKS; // the running sum determines the next block to visit
    }
    unsigned long int end = utime();
    return end - start;
}

int main()
{
    srand((unsigned int) time(NULL));
    char * buffer = new char[(BLOCKS + 3) * pagesize];
    for(unsigned int i = 0; i < (BLOCKS + 3) * pagesize; i++)
        buffer[i] = rand();
    // Run the test once for each possible shift within a memory page.
    for(unsigned int i = 0; i < pagesize; i++)
        cout << test((unsigned int *) (buffer + i)) << endl;
    cout << "(" << t << ")" << endl; // print the sum so it cannot be optimized away
    delete [] buffer;
}
This code allocates roughly 1 GB and fills it with random numbers. Then the function test is called with every possible shift within a memory page (from a shift of 0 to a shift of 4096). The test function interprets the pointer it is given as a group of blocks of data and carries out a simple operation (a sum) over those blocks. The order of access to the blocks is more or less random (it is determined by the partial sums), so that every time a new block is accessed it is nearly certain not to be in the cache already.
The function test is then timed. In all shift configurations but one I should observe some baseline timing, while in one particular configuration (the null shift, maybe?) I should observe a big improvement in efficiency. This, however, does not happen at all: all the shift timings are perfectly compatible with each other.
Why does this happen and what does this mean? Can I just forget about memory alignment? Can I also forget about making my data blocks exactly as big as a memory page? (I was planning to use some padding in case they were smaller). Or maybe something in the cache management process is just unclear to me?

Related

How does memory on the heap get exhausted?

I have been testing some of my own code to see how much allocated memory it takes to exhaust the heap, or free store. However, unless my testing code is wrong, I am getting completely different results in terms of how much memory can be put on the heap.
I am testing two different programs. The first program creates vector objects on the heap. The second program creates integer objects on the heap.
Here is my code:
#include <vector>
#include <stdio.h>

int main()
{
    long long unsigned bytes = 0;
    unsigned megabytes = 0;
    for (long long unsigned i = 0; ; i++) {
        // Deliberately leaked: the goal is to exhaust the free store.
        std::vector<int>* pt1 = new std::vector<int>(100000, 10);
        bytes += sizeof(*pt1);
        bytes += pt1->size() * sizeof(pt1->at(0));
        megabytes = bytes / 1000000;
        if (i >= 1000 && i % 1000 == 0) {
            printf("There are %u megabytes on the heap\n", megabytes);
        }
    }
}
The final output of this code before getting a bad_alloc error is: "There are 2000 megabytes on the heap"
In the second program:
#include <stdio.h>

int main()
{
    long long unsigned bytes = 0;
    unsigned megabytes = 0;
    for (long long unsigned i = 0; ; i++) {
        int* pt1 = new int(10); // deliberately leaked as well
        bytes += sizeof(*pt1);
        megabytes = bytes / 1000000;
        if (i >= 100000 && i % 100000 == 0) {
            printf("There are %u megabytes on the heap\n", megabytes);
        }
    }
}
The final output of this code before getting a bad_alloc error is: "There are 511 megabytes on the heap"
The final output in both programs is vastly different. Am I misunderstanding something about the free store? I thought that both results would be about the same.
It is very likely that pointers returned by new on your platform are 16-byte aligned.
If int is 4 bytes, this means that for every new int(10) you're getting four bytes and making 12 bytes unusable.
This alone would explain the difference between getting 500MB of usable space from small allocations and 2000MB from large ones.
On top of that, there's overhead of keeping track of allocated blocks (at a minimum, of their size and whether they're free or in use). That is very much specific to your system's memory allocator but also incurs per-allocation overhead. See "What is a Chunk" in https://sourceware.org/glibc/wiki/MallocInternals for an explanation of glibc's allocator.
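A quick way to observe this rounding on glibc is malloc_usable_size() (a GNU extension declared in <malloc.h>; the exact numbers below are typical for 64-bit glibc, not guaranteed):
#include <malloc.h> // malloc_usable_size, a GNU extension
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int *p = (int *) malloc(sizeof(int));
    // Requested 4 bytes; glibc typically reports 24 usable bytes here, and
    // the 16-byte alignment rule spaces consecutive allocations out further.
    printf("requested %zu bytes, usable %zu bytes\n",
           sizeof(int), malloc_usable_size(p));
    free(p);
}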
First of all, you have to understand that the operating system assigns memory to a process in quite large chunks called pages (this is a hardware property). Page sizes are typically 4-16 kB.
Now, the standard library tries to use memory efficiently, so it has to find a way to chop pages into smaller pieces and manage them. To do that, some extra information about the heap structure has to be maintained.
Here is a cool Andrei Alexandrescu CppCon talk on more or less how it works (it omits page management).
So when you allocate lots of small objects, the bookkeeping for the heap structure becomes quite large. On the other hand, allocating a smaller number of larger objects is more efficient - less memory is wasted on tracking the memory structure.
Note also that, depending on the heap strategy, it is sometimes more efficient (when a small piece of memory is requested) to waste some memory and return a larger block than was requested.

CPU caching understanding

I tested the following code with the Google Benchmark framework to measure memory access latency with different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        volatile int64_t ignored = data[id]; // volatile keeps the load alive
    }
    return 0;
}

int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}
And I get the following results (averaged by Google Benchmark; for big arrays there are about ~10 iterations, and much more work is performed for smaller ones): [latency graph not reproduced here]
There is a lot of different stuff happening in this picture and, unfortunately, I can't explain all the changes in the graph.
I tested this code on a single-core CPU with the following cache configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
In this picture we can see many changes in the graph's behavior:
There is a spike after a 64-byte array size, which can be explained by the fact that the cache line is 64 bytes long: with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory cache miss).
Also, the latency increases near the cache size boundaries, which is also easy to explain - at those points we experience capacity cache misses.
But there are a lot of things in the results that I can't explain:
Why does the latency for MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?
Why do we see another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that moment we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this event, and if so, how can it be explained?
Why do we see the latency increase after passing the L2 cache size while benchmarking MemoryAccessEvery4th, but not with MemoryAccessAllElements?
I've tried to compare my results with the gallery of processor cache effects and with What Every Programmer Should Know About Memory, but I can't fully explain my results with the reasoning from those articles.
Can someone help me understand the internal processes of CPU caching?
UPD:
I use the following code to measure performance of memory access:
#include <benchmark/benchmark.h>

using namespace benchmark;

// Random / NextLong are helpers from the linked repository.
void InitializeWithRandomNumbers(long long *array, size_t length) {
    auto random = Random(0);
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
    }
}

static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
You can find slightly different examples in the repository, but the basic approach for the benchmark in the question is the same.

Why does allocating large chunks of memory fail when reallocing small chunks doesn't

This code results in x pointing to a chunk of memory 100GB in size.
#include <stdlib.h>
#include <stdio.h>

int main() {
    auto x = malloc(1);
    for (int i = 1; i < 1024; ++i)
        x = realloc(x, i*1024ULL*1024*100); // grow in 100 MiB steps
    while (true); // Give us time to check top
}
While this code fails to allocate:
#include <stdlib.h>
#include <stdio.h>

int main() {
    auto x = malloc(1024ULL*1024*100*1024); // 100 GiB in a single request
    printf("%p\n", x); // %p for pointers; prints (nil) on failure
    while (true); // Give us time to check top
}
My guess is that the memory size of your system is less than the 100 GiB that you are trying to allocate. While Linux does overcommit memory, it still bails out of requests that are way beyond what it can fulfill. That is why the second example fails.
The many small increments of the first example, on the other hand, stay well below that threshold. Each one of them succeeds because the kernel knows you haven't actually touched most of the memory you were already given, so it has no indication that it won't be able to back those 100 additional MiB.
I believe that the threshold for when a memory request from a process fails is relative to the available RAM, and that it can be adjusted (though I don't remember how exactly).
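(For the record: on Linux this policy lives in /proc/sys/vm/overcommit_memory. Below is a small sketch, under the assumption of the default heuristic overcommit, showing that a successful malloc() only reserves address space; physical memory is committed when pages are first touched:)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    size_t huge = 50ULL * 1024 * 1024 * 1024; // 50 GiB, assumed > physical RAM
    char *p = (char *) malloc(huge);
    if (!p) {
        puts("refused up front (request too far beyond RAM + swap)");
        return 1;
    }
    puts("malloc succeeded; nothing is backed by physical memory yet");
    // Touching the pages is what actually commits memory; on a smaller
    // machine this is where the OOM killer would eventually step in.
    // memset(p, 1, huge); // uncomment at your own risk
    free(p);
}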
Well you're allocating less memory in the one that succeeds:
for (int i = 1; i< 1024; ++i) x = realloc(x, i*1024ULL*1024*100);
The last realloc is:
x = realloc(x, 1023 * (1024ULL*1024*100));
As compared to:
auto x = malloc(1024 * (1024ULL*100*1024));
Maybe that's right where your memory boundary is - the last 100M that broke the camel's back?

Read from 5x10^8 different array elements, 4 bytes each time

So I'm taking an assembly course and have been tasked with making a benchmark program for my computer - needless to say, I'm a bit stuck on this particular piece.
As the title says, we're supposed to create a function that reads from 5×10^8 different array elements, 4 bytes each time. My only problem is, I don't even think it's possible for me to create an array of 500 million elements. So what exactly should I be doing? (For the record, I'm trying to code this in C++.)
//Benchmark Program in C++
#include <iostream>
#include <time.h>

using namespace std;

int main() {
    clock_t t1, t2;
    int readTemp;
    int* arr = new int[5*100000000]; // 5*10^8 ints = ~1.9 GB

    t1 = clock();
    cout << "Memory Test" << endl;
    // Note: the bound must be < (not <=), otherwise the last iteration
    // reads one element past the end of the array.
    for (long long int j = 0; j < 500000000; j += 1)
    {
        readTemp = arr[j]; // without volatile, the compiler may drop this loop entirely
    }
    t2 = clock();

    float diff((float)t2 - (float)t1);
    float seconds = diff / CLOCKS_PER_SEC;
    cout << "Time Taken: " << seconds << " seconds" << endl;
    delete[] arr;
}
Your program tries to allocate 2 billion bytes (1907 MiB), while the maximum user address space of a 32-bit process on Windows is 2 GiB (2048 MiB). These numbers are very close; it's likely the remaining 141 MiB are already taken. Even though your code is very small, the OS is pretty liberal in carving up the 2048 MiB address space, spending large chunks on e.g. the following:
C++ runtime (standard library and other libraries)
Stack: the OS reserves a lot of memory to support deeply recursive functions; it doesn't matter that you don't have any
Padding between virtual memory pages
Padding used just to make specific sections of data appear at specific addresses (e.g. 0x00400000 for the lowest code address, or something like that, on Windows)
Padding used to randomize pointer values (address space layout randomization)
There's a Windows application that shows a memory map of a running process. You can use it by adding a delay (e.g. getchar()) before the allocation and looking at the largest contiguous free block of memory at that point, and which allocations prevent it from being large enough.
The size is possible:
5 × 10^8 × 4 bytes = ~1.9 GB.
First you will need to allocate your array (dynamically only! No stack can hold that much).
For your task, 4 bytes is the size of an integer, so you can do
int* arr = new int[5*100000000];
Alternatively, if you want to be more explicit, you can allocate it as bytes:
char* arr = new char[5*4*100000000];
Next, you need to make the memory dirty (meaning actually write something into it):
memset(arr, 0, 5*100000000*sizeof(int));
Now you can benchmark cache misses (which, I'm guessing, is the point of such a huge array):
int randomIndex = GetRandomNumberBetween(0, 5*100000000-1); // make your own random implementation
int bytes = arr[randomIndex]; // access 4 bytes through an integer
If you want 5×10^8 accesses in random order, you can do a Knuth shuffle inside your getRandomNumber instead of using pure random.
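A complete sketch of such a benchmark, assuming std::mt19937_64 for the random indices (the RNG choice is left open above) and a 64-bit build with ~2 GB of free memory:
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <random>

int main() {
    const size_t N = 500000000; // 5*10^8 elements, ~1.9 GB
    int* arr = new int[N];
    std::memset(arr, 0, N * sizeof(int)); // touch the pages so they are really backed

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<size_t> pick(0, N - 1);

    long long sink = 0; // accumulate so the reads cannot be optimized away
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; i++)
        sink += arr[pick(rng)]; // one 4-byte read per iteration, mostly cache misses
    auto t2 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double>(t2 - t1).count()
              << " s (" << sink << ")\n";
    delete[] arr;
}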

Why is my program slow? How can I improve its efficiency?

I have a program that does a block nested loop join. Basically what it does is: it reads the contents of a file (say a 10 GB file) into buffer1 (say 400 MB) and puts them into a hash table. It then reads the contents of a second file (say another 10 GB file) into buffer2 (say 100 MB) and looks up each element of buffer2 in the hash table. Outputting the result doesn't matter; I'm just concerned with the efficiency of the program for now. In this program, I need to read 8 bytes at a time from both files, so I use long long int.
The problem is that my program is very inefficient. How can I make it efficient?
// I compile using g++ -o hash hash.c -std=c++0x
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <stdint.h>
#include <math.h>
#include <limits.h>
#include <iostream>
#include <algorithm>
#include <vector>
#include <unordered_map>

using namespace std;

typedef std::unordered_map<unsigned long long int, unsigned long long int> Mymap;

int main()
{
    uint64_t block_size1 = (400*1024*1024)/sizeof(long long int); // number of 8-byte slots in the 400 MB block for table A
    uint64_t block_size2 = (100*1024*1024)/sizeof(long long int); // number of 8-byte slots in the 100 MB block for table B
    int k = 0;
    uint64_t x, z;

    unsigned long long int *buffer1 = (unsigned long long int *)malloc(block_size1 * sizeof(long long int));
    unsigned long long int *buffer2 = (unsigned long long int *)malloc(block_size2 * sizeof(long long int));

    Mymap c1; // Hash table

    FILE *file1 = fopen64("10G1.bin", "rb"); // Input is a binary file of 10 GB
    FILE *file2 = fopen64("10G2.bin", "rb");

    printf("size of buffer1 : %llu \n", block_size1 * sizeof(long long int));
    printf("size of buffer2 : %llu \n", block_size2 * sizeof(long long int));

    size_t got1, got2;
    // Read the first file block by block; use fread's return value rather
    // than feof() so a short final read is handled correctly.
    while ((got1 = fread(buffer1, sizeof(long long int), block_size1, file1)) > 0)
    {
        k++;
        printf("Iterations completed : %d \n", k);
        for (x = 0; x < got1; x++)
            c1.insert(Mymap::value_type(buffer1[x], x)); // insert values into the hash table

        // Rescan the whole second file for this block of the first file.
        while ((got2 = fread(buffer2, sizeof(long long int), block_size2, file2)) > 0)
        {
            for (z = 0; z < got2; z++)
                c1.find(buffer2[z]); // probe the hash table
        }
        rewind(file2);
        c1.clear(); // clear the hash table before the next block of file1
    }

    free(buffer1);
    free(buffer2);
    fclose(file1);
    fclose(file2);
}
Update:
Is it possible to read a chunk (say 400 MB) from a file and put it directly into the hash table using C++ stream readers? I think that could further reduce the overhead.
If you're using fread, then try using setvbuf(). The default buffers used by the standard lib file I/O calls are tiny (often of the order of 4kB). When processing large amounts of data quickly, you will be I/O bound and the overhead of fetching many small buffer-fuls of data can become a significant bottleneck. Set this to a larger size (e.g. 64kB or 256kB) and you can reduce that overhead and may see significant improvements - try out a few values to see where you get the best gains as you will get diminishing returns.
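For instance, a sketch of what that looks like for the question's first file (the 256 KiB size is just a starting point to tune):
#include <stdio.h>

int main() {
    FILE *file1 = fopen("10G1.bin", "rb");
    if (!file1) return 1;

    // Must be called before the first read; _IOFBF means fully buffered.
    // The buffer has to stay alive for as long as the stream uses it.
    static char iobuf[256 * 1024];
    if (setvbuf(file1, iobuf, _IOFBF, sizeof iobuf) != 0)
        perror("setvbuf");

    // ... fread() calls now refill from a 256 KiB buffer instead of ~4 KiB ...
    fclose(file1);
}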
Because the second file is rescanned in full for every block of the first file, the running time is roughly l1 + (l1 / bs1) × l2 (where l1 and l2 are the number of records in the two files, and bs1 is the number of records per block of the first buffer): every record of file1 is inserted once, but every record of file2 is probed once per block of file1. Since the block sizes are constant, this is O(l1 × l2 / bs1) = O(n × m), which in the worst case (two files of comparable size) essentially ends up being O(n²).
You can have an O(n) algorithm if you do something like this (pseudocode follows):
map = new Map();
duplicate = new List();
unique = new List();

for each line in file1
    map.put(line, true)
end for

for each line in file2
    if (map.get(line))
        duplicate.add(line)
    else
        unique.add(line)
    fi
end for
Now duplicate will contain a list of duplicate items and unique will contain a list of unique items.
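If it helps, here is a minimal C++ rendering of that pseudocode, assuming the question's 8-byte binary records and file names (records are read one at a time for clarity; everything is held in memory, per the trade-off discussed below):
#include <stdio.h>
#include <stdint.h>
#include <unordered_set>
#include <vector>

int main() {
    std::unordered_set<uint64_t> seen;       // one pass over file1 fills this
    std::vector<uint64_t> duplicate, unique;

    uint64_t v;
    FILE *f1 = fopen("10G1.bin", "rb");
    if (!f1) return 1;
    while (fread(&v, sizeof v, 1, f1) == 1)
        seen.insert(v);
    fclose(f1);

    FILE *f2 = fopen("10G2.bin", "rb");
    if (!f2) return 1;
    while (fread(&v, sizeof v, 1, f2) == 1)
        (seen.count(v) ? duplicate : unique).push_back(v);
    fclose(f2);

    printf("%zu duplicates, %zu unique\n", duplicate.size(), unique.size());
}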
In your original algorithm, you needlessly traverse the second file for every block of the first file. So even though each hash lookup is O(1), you lose the overall benefit of hashing. The trade-off in this case, of course, is that you have to store the entire 10 GB in memory, which is probably not practical. Usually in cases like these the trade-off is between run time and memory.
There is probably a better way to do this. I need to think about it some more. If not, I'm pretty sure someone will come up with a better idea :).
UPDATE
You can probably reduce memory usage if you can find a good way to hash the line (that you read in from the first file) so that you get a unique value (i.e., a 1-to-1 mapping between the line and the hash value). Essentially you would do something like this:

for each line in file1
    map.put(hash(line), true)
end for

for each line in file2
    if (map.get(hash(line)))
        duplicate.add(line)
    else
        unique.add(line)
    fi
end for

Here the hash function is the one that performs the hashing. This way you don't have to store all the lines in memory, only their hashed values. This might help you a little bit. Even so, in the worst case (where you are comparing two files that are either identical or entirely different) you can still end up with 10 GB in memory for the duplicate or the unique list. You can get around that, at the loss of some information, by simply storing a count of unique or duplicate items instead of the items themselves.
long long int *ptr = mmap() your files, then compare them with memcmp() in chunks.
Once a discrepancy is found, step back one chunk and compare them in more detail. (More detail means long long int in this case.)
If you expect to find discrepancies often, do not bother with memcmp(), just write your own loop comparing the long long ints to each other.
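A minimal sketch of that approach, assuming POSIX mmap() and two equally-sized files named as in the question (error checks mostly omitted for brevity):
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd1 = open("10G1.bin", O_RDONLY);
    int fd2 = open("10G2.bin", O_RDONLY);
    struct stat st;
    fstat(fd1, &st);
    size_t len = st.st_size;

    // Let the kernel page the files in; no explicit fread buffering needed.
    char *a = (char *) mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd1, 0);
    char *b = (char *) mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd2, 0);

    const size_t chunk = 1 << 20; // compare 1 MiB at a time
    for (size_t off = 0; off < len; off += chunk) {
        size_t n = (len - off < chunk) ? len - off : chunk;
        if (memcmp(a + off, b + off, n) != 0) {
            printf("first difference is in the chunk at offset %zu\n", off);
            break; // step back here and recompare long long by long long
        }
    }
    munmap(a, len); munmap(b, len);
    close(fd1); close(fd2);
}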
The only way to know is to profile it, e.g. with gprof. Create a benchmark of your current implementation, then experiment with other modifications methodically and re-run the benchmark.
I'd bet that if you read in larger chunks you'd get better performance: fread() more data and process multiple blocks per pass.
The problem I see is that you are reading the second file n times, once for every block of the first file. Really slow.
The best way to make this faster is to pre-sort the files then do a Sort-merge join. The sort is almost always worth it, in my experience.
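For illustration, the merge step of such a join over two already-sorted runs of 8-byte keys might look like the sketch below (the external sort that produces the sorted runs from the 10 GB files is assumed to have happened already):
#include <stdio.h>
#include <stdint.h>
#include <vector>

// Two-pointer merge join: each sorted input is scanned exactly once.
void merge_join(const std::vector<uint64_t> &a, const std::vector<uint64_t> &b) {
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else {                // match: emit and advance both sides
            printf("match: %llu\n", (unsigned long long) a[i]);
            i++; j++;
        }
    }
}

int main() {
    std::vector<uint64_t> a = {1, 3, 5, 7}, b = {3, 4, 7, 9};
    merge_join(a, b); // prints 3 and 7
}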