C++ stack efficient for multicore application

I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations from each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a std::vector object that looks like this:
struct densityStack {
    vector<int> density; // will store the sum of boolean variables
    int card;            // will store the number of elements we summed over, for normalizing at the end

    densityStack(int size){ // constructor taking as only parameter the size of the array, usually size = 30
        density = vector<int>(size, 0);
        card = 0;
    }

    void push_back(vector<int> & toBeAdded){ // method summing a new array (of measurements) into our stack
        for(auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++newVal)
            *valStack += *newVal;
        card++;
    }

    void savef(const char * fname){ // method outputting to a file
        ofstream out(fname);
        out.precision(10);
        out << card << "\n"; // saving the cardinal on the first line
        for(auto val = density.begin(); val != density.end(); ++val)
            out << (double) *val / card << "\n";
        out.close();
    }
};
Then, in my code I use a single densityStack object, and every time a CPU core has data (which can be 100 times per second) it calls push_back to send the data back to the densityStack.
My issue is that this seems to be slower than my first raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean up (I was unhappy with that because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Because to me, copying back the vector even at frequencies of 1000 Hz should not be too much.

How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
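To make the idea concrete, here is a minimal sketch of that buffering scheme, assuming a mutex is added around the shared densityStack; the mutex, runOneStep() and the buffer size of 1000 are illustrative, not part of the original code:
#include <mutex>
#include <vector>

std::mutex stackMutex;        // assumed: guards the shared densityStack
densityStack sharedStack(30); // the single shared instance from the question

void worker(int nSamples)
{
    // Each thread accumulates its measurements locally...
    std::vector<std::vector<int>> localBuffer;
    for (int s = 0; s < nSamples; ++s) {
        std::vector<int> measurement = runOneStep(); // hypothetical: one chain step
        localBuffer.push_back(std::move(measurement));

        // ...and only grabs the lock once per 1000 samples instead of once per sample.
        if (localBuffer.size() == 1000) {
            std::lock_guard<std::mutex> guard(stackMutex);
            for (auto& m : localBuffer)
                sharedStack.push_back(m);
            localBuffer.clear();
        }
    }
    // A final flush of any leftover samples would go here.
}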

Related

Why are my threaded Thrift calls slow?

My thrift definition is something like this:
list<i32> getValues()
Implemented it in C++.
Server.cpp has the following piece of code:
.....
std::vector<int32_t> store;
TransferServiceHandler() {
    for(int i = 0; i < 100000000; i++)
        store.push_back(i);
}

void getValues(std::vector<int32_t> & _return) {
    // Your implementation goes here
    _return = store;
}
.....
Client.cpp has a simple loop in which it calls the getValues():
for(int k = 0; k < 10; k++){
    clock_gettime(CLOCK_REALTIME, &ds_spec);
    int64_t dstarted = ds_spec.tv_sec * 1000 + (ds_spec.tv_nsec / 1.0e6);
    std::vector<int32_t> values;
    client.getValues(values);
    clock_gettime(CLOCK_REALTIME, &de_spec);
    int64_t dended = de_spec.tv_sec * 1000 + (de_spec.tv_nsec / 1.0e6);
    std::cout << "Values size :" << values.size() << " in " << (dended - dstarted) << " ms\n";
}
Connections are initialized and closed outside the loop.
Usually a few hundred thousand entries are returned by this call.
When there is no data (when the lists are empty) I can see the call completing in 1-2 ms; when I vary the data there is an unpredictable delay in the transfer. Both the client and server are running on the same machine (equipped with 10 Gb/s Ethernet, 8 cores and 30 GB of memory).
How do you normally debug a situation like this? I don't think the issue is with the network, since it's a 10 Gb/s machine and the size of the data is hardly a few MB.
I ran a benchmark with various data sizes, and the delay isn't stable from call to call.
I am not sure I fully understand the interaction between the client and server; however, your getValues method could be improved by using move semantics (C++11), so you could move the store vector rather than making a copy (memory operations are quite expensive), as follows:
void getValues(std::vector<int32_t> & _return) {
    // Your implementation goes here
    _return = std::move(store);
}
Note that this works fine as long as the content of store (which has now been moved into _return) does not need to persist beyond the call to getValues.
It looks to me like you are losing resolution here:
clock_gettime(CLOCK_REALTIME, &ds_spec);
int64_t dstarted = ds_spec.tv_sec * 1000 + (ds_spec.tv_nsec / 1.0e6);
That conversion goes against the reason for using clock_gettime() in the first place.
Here is a link on how to profile code using clock_gettime(); hopefully it will solve your problem.
I'm pointing at resolution because it is a common cause of unexpected profiling results.
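For example, here is a sketch of a timing helper that keeps the full nanosecond resolution and only converts when printing; CLOCK_MONOTONIC is used instead of CLOCK_REALTIME to avoid wall-clock adjustments (adapt as needed):
#include <cstdint>
#include <time.h> // clock_gettime, timespec (POSIX)

// Difference of two timespecs in nanoseconds, without intermediate truncation.
int64_t elapsed_ns(const timespec& start, const timespec& end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

// Usage in the client loop:
//   timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   client.getValues(values);
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   std::cout << elapsed_ns(t0, t1) / 1.0e6 << " ms\n";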
I made a significant improvement in performance after transferring the data as binary rather than as a vector.
On the Thrift definition file, I changed the list to binary.
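For context, Thrift's binary type maps to std::string in the generated C++ code, so the server-side handler would look roughly like this (a sketch only; endianness handling and the exact service details are left out):
void getValues(std::string& _return) {
    // Pack the raw int32 bytes into one opaque blob; the client must
    // reinterpret them with the same endianness on the other side.
    _return.assign(reinterpret_cast<const char*>(store.data()),
                   store.size() * sizeof(int32_t));
}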
Here's the new benchmark on the same amount of data:

Multithread performance drops down after a few operations

I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. A single iteration first loads a SIFT file containing some features, and then queries these features against a tree. Since I have a lot of images, I use multiple threads to do this querying. Here are the code snippets.
struct MultiMatchParam
{
    int thread_id;
    float *scores;
    double *scores_d;
    int *perm;
    size_t db_image_num;
    std::vector<std::string> *query_filenames;
    int start_id;
    int num_query;
    int dim;
    VocabTree *tree;
    FILE *file;
};
// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
    for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
    {
        // Clear scores
        for (size_t i = 0; i < param.db_image_num; i++)
            param.scores[i] = 0.0;

        DTYPE *keys;
        int num_keys;
        keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);

        int normalize = true;
        double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
        delete [] keys;
    }
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20 images), all of a sudden the performance (CPU usage) drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug lies in this line of code:
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
If I replace it with other costly operations (e.g., a large for-loop containing sqrt), the CPU usage stays at nearly 100%. This MultiScoreQueryKeys function performs a complex operation on a tree. Since all eight cores may read the same tree (there is no write operation to this tree), I wonder whether the read operation has some kind of blocking effect. But it shouldn't have this effect, because there are no write operations in this function. Also, the operations in the loop are basically the same; if they were going to block the CPU, it would happen in the first few iterations. If you need to see the details of this function or other parts of this project, please let me know.
Use std::async() instead of a zeta::SimpleLock lock.
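That answer is terse, so here is a rough sketch of what it might look like: launch one std::async task per chunk of query images, each with its own MultiMatchParam (and therefore its own scores buffer), so no explicit lock is needed while reading the shared, read-only tree. Everything except MultiMatch and MultiMatchParam is hypothetical:
#include <future>
#include <vector>

void RunQueries(std::vector<MultiMatchParam>& params)
{
    std::vector<std::future<void>> tasks;
    // One task per parameter block; each block covers a disjoint
    // [start_id, start_id + num_query) range of query images.
    for (auto& p : params)
        tasks.push_back(std::async(std::launch::async, MultiMatch, std::ref(p)));
    // Wait for all query tasks to finish.
    for (auto& t : tasks)
        t.get();
}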

How do you calculate memory access time?

I create a large boolean 2D array (5000 x 5000, for a total of 25 million elements, about 23 MB). Then I loop through and initialize every element with a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100 ms.
23 MB is too big to fit in the CPU's cache, and I think my program is too simple to benefit from any type of compiler optimization, so am I right to conclude that the program is reading 25 million elements from RAM in ~100 ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it has not been stripped by the compiler) is about 250 MB/s, which is a very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark like this, even in the simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs very poorly when accessed randomly. Your code seems to be accessing memory randomly, so you should add that to your conclusion as a qualifier: that you are "reading 25 million elements from random locations in RAM in ~100 ms" (a small sketch at the end of this answer illustrates the difference).
Another aspect of this sort of benchmark is the distinction between latency and throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of exactly what you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler generates, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere from 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array), read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but that's unlikely with the cache thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (a read, then a write back), one for idx, one for locs and one for val. You might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot).
You need to be aware that memory is never accessed in single bytes or even single words. Memory is always read and written in units of cache lines, and the cache line size can differ from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64 bytes (or more) from RAM. If the memory location you are reading straddles a cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it can. So you have to at least multiply your calculated bandwidth by the size of your cache line.
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. This means that you have to look at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to be able to know what your code is actually doing: whether and which of your memory accesses have been optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
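As promised above, here is a minimal sketch that separates the two access patterns when measuring. It is illustrative only, and all the caveats in this answer about compilers and caches still apply:
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const size_t n = 25u * 1000 * 1000;
    std::mt19937_64 rng(42);
    std::vector<unsigned char> data(n);
    for (auto& b : data) b = rng() & 1;

    // Sequential pass: cache lines and the hardware prefetcher do most of the work.
    auto t0 = std::chrono::high_resolution_clock::now();
    unsigned long seqSum = std::accumulate(data.begin(), data.end(), 0ul);
    auto t1 = std::chrono::high_resolution_clock::now();

    // Random pass: precomputed random indices defeat the prefetcher.
    std::uniform_int_distribution<size_t> dist(0, n - 1);
    std::vector<size_t> idx(n);
    for (auto& i : idx) i = dist(rng);
    unsigned long rndSum = 0;
    auto t2 = std::chrono::high_resolution_clock::now();
    for (size_t i : idx) rndSum += data[i];
    auto t3 = std::chrono::high_resolution_clock::now();

    // Printing the sums keeps the compiler from discarding the loops.
    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "sequential: " << duration_cast<milliseconds>(t1 - t0).count()
              << " ms (sum " << seqSum << ")\n"
              << "random:     " << duration_cast<milliseconds>(t3 - t2).count()
              << " ms (sum " << rndSum << ")\n";
    return 0;
}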
Portions (blocks) of memory are stored in the processor cache, which allows the processor to access those items quickly. However, that speed is perfectly reasonable for modern memory: even the slowest DDR3 RAM can transfer data at about 6 GB/s.
Cache usage is independent of a program's complexity. Whenever data is read from RAM, it goes into the cache. Since the cache has a certain size, there is always that much data available in it. If you access a memory location next to a previously accessed one, there is a good chance it will already be cached, in which case RAM is not accessed at all.
I would suggest reading the CPU cache Wikipedia entry to broaden your knowledge.
BTW: val = locs[*(idx + i)][*(idx + ++i)]; are you certain this is evaluated from left to right? I am not; this is undefined behavior. I'd suggest putting the ++i on the line below the access.
//EDIT:
There is nothing done with the value read from memory. It is quite possible that these instructions are not executed at all! Check the generated assembly, or add a (void) val; statement, which should force the read to be kept.
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to get pulled into the cache. Because you store the values in your program before reading them, the data you are reading is most likely already in the cache from when you stored it. Therefore, it is likely that your loop that reads the values never needs to access RAM.

Wrangling memory for a highly iterative C++ program

tl;dr: I need a way to better manage memory in C++ while retaining large datasets.
I am currently creating a program that outputs a database that I need for a later project, and I am struggling with memory control. I have the program written to a functional level where it outputs the dataset I need on a small scale, but to ramp up the size to where I need it and keep it realistic, I need to increase the number of iterations. The problem is that when I do so I end up running out of memory on my computer (4 GB) and it has to start paging, which slows the processing considerably.
The basic outline is that I am creating stores, then creating a year's worth of transactional data for each store. When a store is created, a list of numbers is generated that represents the daily sales goals for the transactions, and then transactions are randomly generated until that number is reached. This method gives some nicely organic results that I am quite happy with. Unfortunately all of those transactions have to be stored in memory until they are written out to my file.
When the transactions are created they are temporarily stored in a vector, which I execute .clear() on after I store a copy of the vector in my permanent storage location.
I have started to try to move to unique_ptrs for my temporary storage, but I am unsure whether they are even being deleted properly upon returning from the functions that generate my data.
The code is something like this (I cut some superfluous code that wasn't pertinent to the question at hand):
void store::populateTransactions() {
    vector<transaction> tempVec;
    int iterate = 0, month = 0;
    double dayTotal = 0;
    double dayCost = 0;
    int day = 0;
    for(int i = 0; i < 365; i++) {
        if(i == dsf[month]) {
            month++;
            day = 0;
        }
        while(dayTotal < dailySalesTargets[i]) {
            tempVec.push_back(transaction(2013, month+1, day+1, 1.25, 1.1));
            dayTotal += tempVec[iterate].returnTotal();
            dayCost += tempVec[iterate].returnCost();
            iterate++;
        }
        day++;
        dailyTransactions.push_back(tempVec);
        dailyCost.push_back(dayCost);
        dailySales.push_back(dayTotal);
        tempVec.clear();
        dayTotal = 0;
        dayCost = 0;
        iterate = 0;
    }
}
transaction::transaction(int year, int month, int day, double avg, double dev) {
    rng random;
    transTime = &testing;
    testing = random.newTime(year, month, day);
    itemCount = round(random.newNum('l', avg, dev, 0));
    if(itemCount <= 0) {
        itemCount = 1;
    }
    for(int i = 0; i < itemCount; i++) {
        int select = random.newNum(0, libs::products.products.size());
        items.push_back(libs::products.products[select]);
        transTotal += items[i].returnPrice();
        transCost += items[i].returnCost();
    }
}
The reason you are running into memory issues is that as you add elements to the vector it eventually has to resize its internal buffer. This entails allocating a new block of memory, copying the existing data to the new buffer and then deleting the old one.
Since you know the number of elements the vector will hold beforehand, you can call the vector's reserve() member function to allocate the memory ahead of time. This will eliminate the constant resizing that you are no doubt encountering now.
For instance, in the constructor for transaction you would do the following before the loop that adds data to the vector:
items.reserve(itemCount);
In store::populateTransactions() you should calculate the total number of elements the vector will hold and call tempVec.reserve() in the same way as described above. Also keep in mind that if you are using a local variable to populate the vector you will eventually need to copy it. This will cause the same issues, as the destination vector will need to allocate memory before the contents can be copied (unless you use the move semantics available in C++11; see the sketch below). If the data needs to be returned to the calling function (as opposed to being a member variable of store) you should take it by reference as a parameter:
void store::populateTransactions(vector<transaction>& tempVec)
{
//....
}
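To illustrate the move-semantics point mentioned above, the copy into permanent storage in the question's loop could be avoided like this (a sketch against the member names from the question):
// Inside store::populateTransactions(), instead of copying tempVec:
dailyTransactions.push_back(std::move(tempVec)); // steals tempVec's buffer, no element-by-element copy
tempVec.clear();                                 // the moved-from vector is left empty and ready for reuse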
If it is not practical to determine the number of elements ahead of time, you should consider using std::deque instead. From cppreference.com:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the expansion of a std::vector because it does not involve copying of the existing elements to a new memory location.
In regard to the comment by Rafael Baptista about how the resize operation allocates memory, the following example should give you a better idea of what is going on. The amount of memory listed is the amount required during the resize:
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> data;
    for(int i = 0; i < 10000001; i++)
    {
        size_t oldCap = data.capacity();
        data.push_back(1);
        size_t newCap = data.capacity();
        if(oldCap != newCap)
        {
            std::cout
                << "resized capacity from "
                << oldCap
                << " to "
                << newCap
                << " requiring " << (oldCap + newCap) * sizeof(int)
                << " total bytes of memory"
                << std::endl;
        }
    }
    return 0;
}
When compiled with VC++10, the following results are generated when adding 10,000,001 elements to a vector. These results are specific to VC++10 and can vary between implementations of std::vector.
resized capacity from 0 to 1 requiring 4 total bytes of memory
resized capacity from 1 to 2 requiring 12 total bytes of memory
resized capacity from 2 to 3 requiring 20 total bytes of memory
resized capacity from 3 to 4 requiring 28 total bytes of memory
resized capacity from 4 to 6 requiring 40 total bytes of memory
resized capacity from 6 to 9 requiring 60 total bytes of memory
resized capacity from 9 to 13 requiring 88 total bytes of memory
resized capacity from 13 to 19 requiring 128 total bytes of memory
...snip...
resized capacity from 2362204 to 3543306 requiring 23622040 total bytes of memory
resized capacity from 3543306 to 5314959 requiring 35433060 total bytes of memory
resized capacity from 5314959 to 7972438 requiring 53149588 total bytes of memory
resized capacity from 7972438 to 11958657 requiring 79724380 total bytes of memory
This is fun! Some quick comments I can think of.
a. vector::clear() does not free the underlying memory (the capacity is kept). Instead you can use std::vector<transaction>().swap(tempVec); to actually release it.
b. If you are using a compiler that has the C++11 vector::emplace_back, you should replace the push_back calls with it. It should be a big boost in both memory and speed. With push_back you basically have two copies of the same data floating around, and you are at the mercy of the allocator to return the memory to the OS.
c. Is there any reason you cannot flush dailyTransactions to disk every once in a while? You can always serialize the vector and write it out to disk, clear the memory, and you should be good again (a sketch follows below).
d. As pointed out by others, reserve() should also help a lot.
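A rough sketch of point (c), assuming a hypothetical writeTo() serialization helper on transaction (none is shown in the question) and that dailyTransactions is a vector of vector<transaction>:
#include <fstream>

// Hypothetical periodic flush: stream the accumulated days out, then release the memory.
void store::flushTransactions(std::ofstream& out)
{
    for (const auto& day : dailyTransactions)
        for (const auto& t : day)
            t.writeTo(out); // hypothetical serialization helper
    // Release the memory as well (see point (a) above).
    std::vector<std::vector<transaction>>().swap(dailyTransactions);
}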

Random memory accesses are expensive?

While optimizing my Connect Four game engine I reached a point where further improvements can only be minimal, because much of the CPU time is used by the instruction TableEntry te = mTable[idx + i] in the following code sample:
TableEntry getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        TableEntry te = mTable[idx + i]; // bottleneck, about 35% of CPU usage
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return TableEntry();
}
The hash table mTable is defined as std::vector<TableEntry> and has about 4.2 million entries (about 64 MB). I have tried to replace the vector by allocating the table with new, without any speed improvement.
I suspect that accessing the memory randomly (because of the Zobrist Hashing function) could be expensive, but really that much? Do you have suggestions to improve the function?
Thank you!
Edit: BUCKETSIZE has a value of 4; it is used as the collision strategy. The size of one TableEntry is 16 bytes; the struct looks like the following:
struct TableEntry
{                                        // Old New
    unsigned __int64 lock;               //  8   8
    enum { VALID, UBOUND, LBOUND } flag; //  4   4
    short score;                         //  4   2
    char move;                           //  4   1
    char height;                         //  4   1
                                         // -------
                                         // 24  16 Bytes
    TableEntry() : lock(0LL), flag(VALID), score(0), move(0), height(-127) {}
};
Summary: the function originally needed 39 seconds. After making the changes jdehaan suggested, the function now needs 33 seconds (the program stops after 100 seconds). It's better, but I think Konrad Rudolph is right and the main reason it's still slow is the cache misses.
You are making copies of your table entries; what about using a const TableEntry& instead? For the default value at the bottom, a static default TableEntry will also do. I suppose that is where you lose most of the time:
const TableEntry& getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    for (int i = 0; i < BUCKETSIZE; i++)
    {
        // hopefully now less than 35% of CPU usage :-)
        const TableEntry& te = mTable[idx + i];
        if (te.height == NOTSET || lock == te.lock)
            return te;
    }
    return DEFAULT_TABLE_ENTRY;
}
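Here DEFAULT_TABLE_ENTRY is assumed to be the static default entry mentioned above, e.g. something like:
// One shared default entry, built once; the question's default constructor
// (height = -127) is assumed to serve as the "not set" sentinel.
static TableEntry DEFAULT_TABLE_ENTRY;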
How big is a table entry? I suspect it's the copy that is expensive, not the memory lookup.
Memory accesses are quicker when they are contiguous because of cache hits, but it seems you are already doing that.
The point about copying the TableEntry is valid. But let’s look at this question:
I suspect that accessing the memory randomly (…) could be expensive, but really that much?
In a word, yes.
Random memory access with an array of your size is a cache killer. It will generate lots of cache misses which can be up to three orders of magnitude slower than access to memory in cache. Three orders of magnitude – that’s a factor 1000.
On the other hand, it actually looks as though you are using lots of array elements in order, even though you generated your starting point using a hash. This speaks against the cache miss theory, unless your BUCKETSIZE is tiny and the code gets called very often with different lock values from the outside.
I have seen this exact problem with hash tables before. The problem is that continuous random access to the hash table touches all of the memory used by the table (both the main array and all of the elements). If this is large relative to your cache size, you will thrash. This manifests as exactly the problem you are encountering: the instruction that first references new memory appears to have a very high cost due to the memory stall.
In the case I worked on, a further issue was that the hash table represented a rather small part of the key space. The "default" value (similar to what you call DEFAULT_TABLE_ENTRY) applied to the vast majority of keys so it seemed like the hash table was not heavily used. The problem was that although default entries avoided many inserts, the continuous action of searching touched every element of the cache over and over (and in random order). In that case I was able to move the values from the hashed data to live with the associated structure. It took more overall space because even keys with the default value had to explicitly store the default value, but the locality of reference was vastly improved and the performance gain was huge.
Use pointers:
TableEntry* getTableEntry(unsigned __int64 lock)
{
    int idx = (lock & 0xFFFFF) * BUCKETSIZE;
    TableEntry* te = &mTable[idx];
    TableEntry* end = te + BUCKETSIZE;
    for (; te != end; ++te)
    {
        if (te->height == NOTSET || lock == te->lock)
            return te;
    }
    return &DEFAULT_TABLE_ENTRY; // address of the static default entry
}