Read from 5x10^8 different array elements, 4 bytes each time

So I'm taking an assembly course and have been tasked with making a benchmark program for my computer - needless to say, I'm a bit stuck on this particular piece.
As the title says, we're supposed to create a function to read from 5x10^8 different array elements, 4 bytes each time. My only problem is that I don't think it's even possible for me to create an array of 500 million elements. So what exactly should I be doing? (For the record, I'm trying to code this in C++.)
//Benchmark Program in C++
#include <iostream>
#include <time.h>
using namespace std;
int main() {
clock_t t1,t2;
int readTemp;
int* arr = new int[5*100000000];
t1=clock();
cout << "Memory Test"
<< endl;
for(long long int j=0; j < 500000000; j+=1) // '<' rather than '<=': the last valid index is 499999999
{
readTemp = arr[j];
}
t2=clock();
float diff ((float)t2-(float)t1);
float seconds = diff / CLOCKS_PER_SEC;
cout << "Time Taken: " << seconds << " seconds" <<endl;
}

Your program tries to allocate 2 billion bytes (1907 MiB) in one contiguous block, while a 32-bit process on Windows gets only 2 GiB (2048 MiB) of user address space by default. These numbers are very close, and the remaining 141 MiB is likely already taken by other things. Even though your code is very small, the OS is pretty liberal with the 2048 MiB address space and spends large chunks of it on, for example:
C++ runtime (standard library and other libraries)
Stack: OS allocates a lot of memory to support recursive functions; it doesn't matter that you don't have any
Paddings between virtual memory pages
Paddings used just to make specific sections of data appear at specific addresses (e.g. 0x00400000 for lowest code address, or something like that, is used in Windows)
Padding used to randomize the values of pointers
There's a Windows application that shows a memory map of a running process. You can use it by adding a delay (e.g. getchar()) before the allocation and looking at the largest contiguous free block of memory at that point, and which allocations prevent it from being large enough.

The size is possible :
5 * 10^8 * 4 = ~1.9 GB.
First you will need to allocate your array (dynamically only! The stack is far too small for an array of this size).
For your task, 4 bytes is the size of an integer, so you can allocate it like this:
int* arr = new int[5*100000000];
Alternatively, if you want to be more precise, you can allocate it as bytes (and cast the pointer to int* before doing the 4-byte reads):
char* raw = new char[5*4*100000000];
Next, you need to make the memory dirty (meaning write something into it) :
memset(arr,0,5*100000000*sizeof(int));
Now you can benchmark cache misses (I'm guessing that's what is intended with such a huge array):
int randomIndex= GetRandomNumberBetween(0,5*100000000-1); // make your own random implementation
int bytes = arr[randomIndex]; // access 4 bytes through integer
If you want 5*10^8 random accesses that each hit a different element, use a Knuth shuffle of the indices inside your GetRandomNumberBetween instead of drawing independent random numbers.
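The steps above (allocate, write to every element, then read each element once in shuffled order) can be sketched roughly as follows. This is only a sketch under some assumptions: the names `ReadResult` and `read_in_random_order` are made up for illustration, `std::shuffle` stands in for a hand-written Knuth shuffle, and the element count is a parameter so the code can be tried on small sizes first. Note that the index array costs as much memory as the data itself, and that reading it sequentially is part of what gets timed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Result of one run: elapsed wall-clock seconds and a checksum of the values read.
// (Hypothetical helper type, not part of the original question.)
struct ReadResult {
    double seconds;
    std::uint64_t checksum;
};

// Reads every element of a freshly written 4-byte-int array exactly once,
// in shuffled order. `count` must fit in 32 bits (5x10^8 does).
ReadResult read_in_random_order(std::size_t count, unsigned seed = 12345) {
    assert(count > 0);

    // Allocate and write to every element so the pages are really backed by memory.
    std::vector<std::int32_t> data(count);
    std::iota(data.begin(), data.end(), 0);

    // Build the indices 0..count-1 and shuffle them, so each element is hit once.
    std::vector<std::uint32_t> order(count);
    std::iota(order.begin(), order.end(), 0u);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    // Timed part: one 4-byte read per element. The sum keeps the reads observable,
    // so the compiler cannot simply drop the loop.
    std::uint64_t sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t index : order) {
        sum += static_cast<std::uint64_t>(data[index]);
    }
    const auto stop = std::chrono::steady_clock::now();

    return {std::chrono::duration<double>(stop - start).count(), sum};
}
```

Calling `read_in_random_order(500000000)` and printing both fields would give the full-size measurement, provided the process can actually obtain roughly 4 GB for the two arrays; on a 32-bit build that will not be possible.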

Related

How does memory on the heap get exhausted?

I have been testing out some of my own code to see how much allocated memory it takes to exhaust the memory on the heap or free store. However, unless my code is wrong in the testing of it, I am getting completely different results in terms of how much memory can be put on the heap.
I am testing two different programs. The first program creates vector objects on the heap. The second program creates integer objects on the heap.
Here is my code:
#include <vector>
#include <stdio.h>
int main()
{
long long unsigned bytes = 0;
unsigned megabytes = 0;
for (long long unsigned i = 0; ; i++) {
std::vector<int>* pt1 = new std::vector<int>(100000,10);
bytes += sizeof(*pt1);
bytes += pt1->size() * sizeof(pt1->at(0));
megabytes = bytes / 1000000;
if (i >= 1000 && i % 1000 == 0) {
printf("There are %u megabytes on the heap\n", megabytes);
}
}
}
The final output of this code before getting a bad_alloc error is: "There are 2000 megabytes on the heap"
In the second program:
#include <stdio.h>
int main()
{
long long unsigned bytes = 0;
unsigned megabytes = 0;
for (long long unsigned i = 0; ; i++) {
int* pt1 = new int(10);
bytes += sizeof(*pt1);
megabytes = bytes / 1000000;
if (i >= 100000 && i % 100000 == 0) {
printf("There are %u megabytes on the heap\n", megabytes);
}
}
}
The final output of this code before getting a bad_alloc error is: "There are 511 megabytes on the heap"
The final output in both programs is vastly different. Am I misunderstanding something about the free store? I thought that both results would be about the same.

It is very likely that pointers returned by new on your platform are 16-byte aligned.
If int is 4 bytes, this means that for every new int(10) you're getting four bytes and making 12 bytes unusable.
This alone would explain the difference between getting 500MB of usable space from small allocations and 2000MB from large ones.
On top of that, there's overhead of keeping track of allocated blocks (at a minimum, of their size and whether they're free or in use). That is very much specific to your system's memory allocator but also incurs per-allocation overhead. See "What is a Chunk" in https://sourceware.org/glibc/wiki/MallocInternals for an explanation of glibc's allocator.
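The alignment argument above can be checked with a few lines of arithmetic. This is a sketch only: the 16-byte granularity is the assumption made in this answer, not something measured, and the helper names `rounded_up` and `usable_fraction` are made up for illustration. Per-allocation bookkeeping is deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>

// Bytes actually consumed when `requested` bytes are rounded up
// to a multiple of `alignment`.
constexpr std::size_t rounded_up(std::size_t requested, std::size_t alignment) {
    return (requested + alignment - 1) / alignment * alignment;
}

// Fraction of the consumed bytes that the program can actually use.
double usable_fraction(std::size_t requested, std::size_t alignment) {
    assert(requested > 0 && alignment > 0);
    return static_cast<double>(requested) /
           static_cast<double>(rounded_up(requested, alignment));
}
```

With these assumptions, `usable_fraction(4, 16)` is 0.25, and a quarter of roughly 2000 MB is roughly 500 MB, which is in the same range as the 511 MB observed in the question. A 400000-byte vector buffer is already a multiple of 16, so `usable_fraction(400000, 16)` is 1.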

First of all, you have to understand that the operating system assigns memory to a process in fairly large chunks called pages (this is a hardware property). Page size is typically 4 kB to 16 kB.
The standard library tries to use that memory efficiently, so it has to find a way to chop pages into smaller pieces and manage them. To do that, it has to maintain some extra information about the heap structure.
Here is a cool Andrei Alexandrescu CppCon talk that shows more or less how it works (it omits page management).
So when you allocate lots of small objects, the information about the heap structure becomes quite large. On the other hand, allocating a smaller number of larger objects is more efficient: less memory is wasted on tracking the memory structure.
Note also that, depending on the heap strategy, it is sometimes more efficient (when a small piece of memory is requested) to waste some memory and return a larger block than was requested.

Memory usage of large 2D static array and vector of vectors

I need to use a large 20000 * 20000 matrix for a machine learning project. When it is implemented as a static array, it uses approximately 1.57 GB of memory. If it is implemented as a vector of vectors, it uses much more memory than the static array (approximately 3.06 GB). I could not figure out the reason behind it.
Array version:
static double distanceMatrix[20000][20000] = {0};
Vector of vectors:
vector<vector<double>> distanceMatrix(20000, vector<double> (20000));
I used them to store the distances between points.
for (int i = 0; i < 20000; i++){
for (int j = i+1; j < 20000; j++)
distanceMatrix[i][j] = euclid_dist(pointVec[i], pointVec[j]);
}
I also observed that with the array version, memory usage increases gradually during the nested loop. With the vector of vectors, however, memory usage reaches 3.06 GB before the nested loop even starts.
I checked the memory usage with Xcode debug navigator and Activity Monitor. Thanks in advance!

That's because of vector's memory allocation strategy, which is probably newsize = oldsize * const when reaching its limit (implementation-dependent); see also "vector memory allocation strategy".

First of all, the array doesn't take 1.57 GB of memory, so there is an issue with the measurement.
Experiment with the static array
When running the following code in Xcode, you'll find out that the array is exactly 3.2 GB in size:
const size_t matsize=20000;
static double mat2D[matsize][matsize] = {0};
cout<<"Double: " << sizeof (double) <<endl;
cout<<"Array: " << sizeof mat2D <<endl;
cout<<" " << sizeof(double)*matsize*matsize<<endl;
// ... then your loop
When the programme starts, its reported memory consumption is only 5.3 MB before entering the loop, although the static array is already there. Once the loop has finished, the reported memory consumption is 1.57 GB, as you explained, but still not the 3.2 GB that we would expect.
The memory consumption figure that you read is the physical memory used by your process. The remaining memory of the process is in virtual memory, which is much larger (7 GB during my experiment).
Experiment of the vector
First, let's look at the approximate memory size of the vector, knowing that each vector has a fixed size plus a dynamically allocated variable size (based on the capacity which can be equal or greater than the number of elements actually stored in the vector). The following code can give you some estimates:
vector<vector<double>> mat2D(matsize, vector<double> (matsize));
cout<<"Vector (fixed part):" << sizeof (vector<double>)<<endl;
cout<<"Vector capacity: " << mat2D[0].capacity()<<endl;
cout<<" (variable): " << sizeof(double)*mat2D[0].capacity()<<endl;
cout<<" (total): " << sizeof(double)*mat2D[0].capacity() + sizeof(mat2D[0])<<endl;
cout<<"Vector of vector: " << (sizeof(double)*mat2D[0].capacity() + sizeof(mat2D[0]))*matsize+sizeof(mat2D)<<endl;
// then your loop
Running this code will show that the memory required to store your vector of vectors is about 3.2 GB + 480 KB overhead (24 bytes per vector * 20001 vectors).
Before entering the loop, we will notice that 3 GB of physical memory is already used. So macOS uses twice the physical memory for this dynamic version compared to the array version. This is certainly because the allocation process really has to touch more memory upfront: each of the 20000 rows requires a separate allocation and initialization.
Conclusion
The overall virtual memory used in the two approaches is not that different. I measured around 7 GB running it in debug mode with Xcode. The vector variant uses a little more than the array variant due to the extra 480 KB overhead for the vectors.
The strategy used by macOS for using more or less physical memory is complex and may depend on many factors (including the access patterns), the goal being to find the best tradeoff between the physical memory available and the cost of swapping.
In short, physical memory usage is not representative of the memory consumption of the process.
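The size estimates used in this answer can be reproduced without allocating anything. The following is a minimal sketch, assuming an 8-byte `double`; the function names `element_bytes` and `vector_header_bytes` are made up for illustration, and allocator bookkeeping and any excess capacity are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bytes of element storage for an n x n matrix of double.
// This part is the same for the static array and the vector of vectors.
std::uint64_t element_bytes(std::uint64_t n) {
    assert(n > 0);
    return n * n * sizeof(double);
}

// Bytes taken by the vector objects themselves:
// n row vectors plus the one outer vector.
std::uint64_t vector_header_bytes(std::uint64_t n) {
    assert(n > 0);
    return (n + 1) * sizeof(std::vector<double>);
}
```

For `n = 20000`, `element_bytes` gives 3,200,000,000 bytes, and `vector_header_bytes` gives 20001 times `sizeof(std::vector<double>)`, which is about 480 KB when that size is 24 bytes (as reported in the experiment above).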

How do you calculate memory access time?

I create a large boolean 2D array (5000x5000, for a total of 25 million elements at 23MB). Then I loop through and initialize every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100ms.
23MB is too big to fit in the CPU's cache and I think my program is too simple to benefit from any type of compiler optimization so am I right to conclude that the program is reading 25 million elements from RAM in ~100ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/

As other answers have mentioned, the "speed" you are seeing (assuming the CPU is actually executing your code and it has not been stripped by the compiler) is about 250 MB/s, which is a very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark such as this, even in the simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs much worse on random access. Your code seems to be accessing memory randomly, so you should add that to your conclusion as a qualifier: you are "reading 25 million elements from random locations in RAM in ~100ms."
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of what exactly you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler is generating, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere between 4 and 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache-line. And cache line size can be different from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64-bytes (or more) from RAM. If the memory locations you are reading are at the cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth used by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. Only the generated assembly tells you what your code is actually doing: whether, and which of, your memory accesses are optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
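To make the cache-line point above concrete, here is a small sketch of the arithmetic. The function name `bytes_per_second` is made up for illustration, and treating every access as a full cache-line transfer is an upper-bound assumption (it pretends every access misses the cache), not a measurement.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second moved if each of `accesses` memory accesses
// transfers `bytes_per_access` bytes within `seconds`.
double bytes_per_second(std::uint64_t accesses, double seconds,
                        std::uint64_t bytes_per_access) {
    assert(seconds > 0.0);
    return static_cast<double>(accesses) *
           static_cast<double>(bytes_per_access) / seconds;
}
```

With the numbers from the question's output (25,000,000 accesses in 0.137 s), counting 1 byte per access gives roughly 182 MB/s, while counting a 64-byte cache line per access gives roughly 11.7 GB/s. The real figure lies somewhere in between, depending on how many accesses actually miss the cache.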

Portions (blocks) of memory will be stored in the processor cache at a time, which allows the processor to quickly access those items. Even so, that speed is perfectly reasonable for modern memory: even the slowest DDR3 RAM can transfer data at about 6 GB/s.

Cache usage is independent of the program's complexity. Whenever data is read from RAM, it goes into the cache. Since the cache has a certain size, there's always that amount of data available. If you access a memory location next to the previous one, there is a good chance it will be cached already. In that case RAM is not accessed.
I would suggest reading the Wikipedia entry on CPU caches to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; are you certain that this is evaluated from left to right? I am not. Before C++17 the read of i and the ++i are unsequenced relative to each other, which is undefined behavior. I'd suggest putting the ++i on its own line below the accessor line.
//EDIT:
Nothing is done with the value read from memory, so it is quite possible that these instructions are not executed at all! Check the generated assembly, or make the read observable (for example by declaring val as volatile, or by using the value in something you print) so that the compiler has to keep it.
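Both suggestions (reading the two indices in separate statements, and actually using the value that was read) can be sketched like this. It is an illustration rather than a drop-in replacement: the function name `count_true_cells` is made up, and `std::vector` containers stand in for the question's raw `bool**` and `int*`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Uses consecutive pairs of `idx` as (row, column) and counts how many of the
// visited cells are non-zero. Each index is read in its own statement, so the
// result does not depend on the evaluation order inside a single expression.
std::size_t count_true_cells(const std::vector<std::vector<unsigned char>>& locs,
                             const std::vector<int>& idx) {
    assert(idx.size() % 2 == 0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i + 1 < idx.size(); i += 2) {
        const std::size_t row = static_cast<std::size_t>(idx[i]);
        const std::size_t col = static_cast<std::size_t>(idx[i + 1]);
        hits += locs[row][col];  // the value read contributes to the result
    }
    return hits;
}
```

Printing the returned count after the timed section gives the compiler a reason to keep the reads. It is still worth checking the generated assembly to see what was really emitted.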

No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to get pulled into the cache. Because you are storing the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.

Wrangling memory for a highly iterative c++ program

tl;dr I need a way to better manage memory in C++ while retaining large datasets.
I am currently creating a program that outputs a database that I need for a later project, and I am struggling with memory control. The program is functional and outputs the dataset I need on a small scale, but to ramp up the size to where I need it and keep it realistic, I need to increase the number of iterations. The problem is that when I do that, I end up running out of memory on my computer (4 GB) and it has to start paging, which slows the processing considerably.
The basic outline is that I am creating stores, then creating a year's worth of transactional data for said store. When the store is created, a list of numbers is generated that represents the daily sales goals for the transactions, then transactions are randomly generated until that number is reached. This method gives some nicely organic results that I am quite happy with. Unfortunately all of those transactions have to be stored in memory until they are output to my file.
When the transactions are created they are temporarily stored in a vector, which I execute .clear() on after I store a copy of the vector in my permanent storage location.
I have started to try to move to unique_ptrs for my temporary storage, but I am unsure whether they are even being deleted properly upon returning from the functions that generate my data.
The code is something like this (I cut some superfluous code that wasn't pertinent to the question at hand):
void store::populateTransactions() {
vector<transaction> tempVec;
int iterate=0, month=0;
double dayTotal=0;
double dayCost=0;
int day=0;
for(int i=0; i<365; i++) {
if(i==dsf[month]) {
month++;
day=0;
}
while(dayTotal<dailySalesTargets[i]) {
tempVec.push_back(transaction(2013, month+1, day+1, 1.25, 1.1));
dayTotal+=tempVec[iterate].returnTotal();
dayCost+=tempVec[iterate].returnCost();
iterate++;
}
day++;
dailyTransactions.push_back(tempVec);
dailyCost.push_back(dayCost);
dailySales.push_back(dayTotal);
tempVec.clear();
dayTotal = 0;
dayCost = 0;
iterate = 0;
}
}
transaction::transaction(int year, int month, int day, double avg, double dev) {
rng random;
transTime = &testing;
testing = random.newTime(year, month, day);
itemCount = round(random.newNum('l', avg, dev,0));
if(itemCount <= 0) {
itemCount = 1;
}
for(int i=0; i<itemCount; i++) {
int select = random.newNum(0,libs::products.products.size());
items.push_back(libs::products.products[select]);
transTotal += items[i].returnPrice();
transCost += items[i].returnCost();
}
}

The reason you are running into memory issues is that as you add elements to the vector, it eventually has to resize its internal buffer. This entails allocating a new block of memory, copying the existing data to the new block, and then deleting the old buffer.
Since you know the number of elements the vector will hold beforehand, you can call the vector's reserve() member function to allocate the memory ahead of time. This will eliminate the constant resizing that you are no doubt encountering now.
For instance, in the constructor for transaction you would do the following before the loop that adds data to the vector:
items.reserve(itemCount);
In store::populateTransactions() you should calculate the total number of elements the vector will hold and call tempVec.reserve() in the same way as described above. Also keep in mind that if you are using a local variable to populate the vector, you will eventually need to copy it. This causes the same issue, as the destination vector will need to allocate memory before the contents can be copied (unless you use the move semantics available in C++11). If the data needs to be returned to the calling function (as opposed to being a member variable of store), you should take the vector by reference as a parameter.
void store::populateTransactions(vector<transaction>& tempVec)
{
//....
}
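To see the effect of reserve() described above, one can count how often the capacity changes while elements are appended. This is a small sketch; the function name `count_reallocations` is made up for illustration, and the exact count without reserve() depends on the standard library's growth policy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends `count` ints and returns how many times the vector's capacity
// changed during the appends (each change means a new buffer was allocated).
std::size_t count_reallocations(std::size_t count, bool reserve_first) {
    assert(count > 0);
    std::vector<int> data;
    if (reserve_first) {
        data.reserve(count);  // one allocation up front
    }
    std::size_t changes = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t before = data.capacity();
        data.push_back(1);
        if (data.capacity() != before) {
            ++changes;
        }
    }
    return changes;
}
```

With `reserve_first` set to true the function returns 0, because the buffer is already large enough for every push_back. Without it, the result is some positive number that varies between implementations.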
If it is not practical to determine the number of elements ahead of time, you should consider using std::deque instead. From cppreference.com:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the expansion of a std::vector because it does not involve copying of the existing elements to a new memory location.
In regard to the comment by Rafael Baptista about how the resize operation allocates memory, the following example should give you a better idea of what is going on. The amount of memory listed is the amount required during the resize (old buffer plus new buffer):
#include <iostream>
#include <vector>
int main ()
{
std::vector<int> data;
for(int i = 0; i < 10000001; i++)
{
size_t oldCap = data.capacity();
data.push_back(1);
size_t newCap = data.capacity();
if(oldCap != newCap)
{
std::cout
<< "resized capacity from "
<< oldCap
<< " to "
<< newCap
<< " requiring " << (oldCap + newCap) * sizeof(int)
<< " total bytes of memory"
<< std::endl;
}
}
return 0;
}
When compiled with VC++10, the following results are generated when adding 10,000,001 elements to a vector. These results are specific to VC++10 and can vary between implementations of std::vector.
resized capacity from 0 to 1 requiring 4 total bytes of memory
resized capacity from 1 to 2 requiring 12 total bytes of memory
resized capacity from 2 to 3 requiring 20 total bytes of memory
resized capacity from 3 to 4 requiring 28 total bytes of memory
resized capacity from 4 to 6 requiring 40 total bytes of memory
resized capacity from 6 to 9 requiring 60 total bytes of memory
resized capacity from 9 to 13 requiring 88 total bytes of memory
resized capacity from 13 to 19 requiring 128 total bytes of memory
...snip...
resized capacity from 2362204 to 3543306 requiring 23622040 total bytes of memory
resized capacity from 3543306 to 5314959 requiring 35433060 total bytes of memory
resized capacity from 5314959 to 7972438 requiring 53149588 total bytes of memory
resized capacity from 7972438 to 11958657 requiring 79724380 total bytes of memory

This is fun! Some quick comments I can think of.
a. clear() does not release the vector's memory; it destroys the elements but leaves the capacity in place. To actually release the memory, you can use std::vector<transaction>().swap(tempVec);.
b. If you are using a compiler that supports C++11's vector::emplace_back, then you should use it instead of push_back. It should be a big boost in both memory and speed. With push_back you basically have two copies of the same data floating around, and you are at the mercy of the allocator to return the memory to the OS.
c. Is there any reason you cannot flush dailyTransactions to disk every once in a while? You can always serialize the vector, write it out to disk, clear the memory, and you should be good again.
d. As pointed out by others, reserve should also help a lot.
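Point (a) can be observed directly by comparing capacity() after clear() and after the swap idiom. The following is a sketch with a made-up function name, `capacities_after_release`, and with `int` standing in for the `transaction` type. Whether the freed memory goes back to the OS is still up to the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {capacity after clear(), capacity after swapping with an empty temporary}.
std::pair<std::size_t, std::size_t> capacities_after_release(std::size_t count) {
    assert(count > 0);
    std::vector<int> data(count, 1);

    data.clear();  // destroys the elements, keeps the buffer
    const std::size_t after_clear = data.capacity();

    std::vector<int>().swap(data);  // the temporary takes the buffer and frees it
    const std::size_t after_swap = data.capacity();

    return {after_clear, after_swap};
}
```

For a call such as `capacities_after_release(1000)`, the first value is still at least 1000, while the second is the capacity of a freshly constructed empty vector (0 on the common implementations).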

maximum vector size reached prematurely

I am testing how large I can make a 1D vector on my computer. For this I am using the following MWE:
#include <iostream>
#include <vector>
using namespace std;
int main()
{
vector<double> vec;
const unsigned long long lim = 1E8;
for(unsigned long long i=0; i<lim; i++)
{
vec.push_back(i);
}
cout << vec.max_size() << endl; //outputs 536.870.911 on my 32-bit system
return 0;
}
As shown, max_size() gives me that a std::vector can contain 536.870.911 elements on my system. However, when I run the above MWE, I get the error
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc
My computer has 2GB RAM, but 1E8 integers will only take up 381MB, so I don't see why I get a bad_alloc error?

1E8 = 100000000 and sizeof(double) = 8 [in nearly all systems], so 762MB. Now, if we start with a vector of, say, 16 elements, and it doubles each time it "outgrows" the current size, to get space for 1E8 elements, we get the following sequence:
16, 32, 64, 128, 256, ... 67108864 (64M entries). The next one is 134217728, taking up 8 * 128M = 1GB, and you ALSO have to have space for the 64M * 8 = 512MB chunk at the same time, to copy the old data from. Given that there isn't a full 2GB of space available in a 32-bit process, because some memory is used up for the stack, program code, DLLs, and other such things, finding a 1GB contiguous region of space may be hard when (more than) 512MB is already occupied.
The problem of "I can't fit as much as I thought in my memory" is not an unusual problem.
One solution would be to use std::vector::reserve() to pre-allocate enough space. That is much more likely to work, since you only need a single large allocation, not two - and it won't require much more than the 762MB either, since it's allocated to the right size, not some arbitrary "double what it currently is".
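The growth sequence described above can be turned into a short calculation. This sketch assumes exactly the model used in this answer (start at 16 elements, double on every growth step); real implementations may use a different start size or growth factor, and the function name `peak_bytes_while_doubling` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes that must exist at the same time during the last reallocation
// (old buffer plus new buffer) when the capacity doubles until it can
// hold `elements` items of `element_size` bytes each.
std::uint64_t peak_bytes_while_doubling(std::uint64_t elements,
                                        std::uint64_t element_size,
                                        std::uint64_t initial_capacity = 16) {
    assert(initial_capacity > 0);
    std::uint64_t previous = 0;  // capacity of the buffer being copied from
    std::uint64_t capacity = initial_capacity;
    while (capacity < elements) {
        previous = capacity;
        capacity *= 2;
    }
    return (previous + capacity) * element_size;
}
```

For 1E8 doubles this gives (67108864 + 134217728) * 8 = 1,610,612,736 bytes, about 1.5 GiB that must be available at once, compared with 800,000,000 bytes (about 762 MiB) for a single reserve() of the exact size.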
