Modifying values through pointers is very slow? - C++

I'm working with a huge amount of data stored in an array, and am trying to optimize the amount of time it takes to access and modify it. I'm using Windows, C++ and VS2015 (Release mode).
I ran some tests and don't really understand the results I'm getting, so I would love some help optimizing my code.
First, let's say I have the following class:
class foo
{
public:
    int x;

    foo()
    {
        x = 0;
    }

    void inc()
    {
        x++;
    }

    int X()
    {
        return x;
    }

    void addX(int &_x)
    {
        _x++;
    }
};
I start by initializing 10 million pointers to instances of that class into a std::vector of the same size.
#include <vector>
int count = 10000000;
std::vector<foo*> fooArr;
fooArr.resize(count);
for (int i = 0; i < count; i++)
{
    fooArr[i] = new foo();
}
When I run the following code, and profile the amount of time it takes to complete, it takes approximately 350ms (which, for my purposes, is far too slow):
for (int i = 0; i < count; i++)
{
    fooArr[i]->inc(); //increment all elements
}
To test how long it takes to increment an integer that many times, I tried:
int x = 0;
for (int i = 0; i < count; i++)
{
    x++;
}
Which returns in <1ms.
I thought maybe the number of integers being changed was the problem, but the following code still takes 250ms, so I don't think it's that:
for (int i = 0; i < count; i++)
{
    fooArr[0]->inc(); //only increment first element
}
I thought maybe the array index access itself was the bottleneck, but the following code takes <1ms to complete:
int x;
for (int i = 0; i < count; i++)
{
    x = fooArr[i]->X(); //set x
}
I thought maybe the compiler was doing some hidden optimization on the loop itself in the last example (since x is assigned the same value on every iteration, maybe the compiler skips the redundant iterations?). So I tried the following, and it takes 350ms to complete:
int x;
for (int i = 0; i < count; i++)
{
    fooArr[i]->addX(x); //increment x inside foo function
}
So that one was slow again, but maybe only because I'm incrementing an integer through a pointer again.
I tried the following too, and it returns in 350ms as well:
for (int i = 0; i < count; i++)
{
    fooArr[i]->x++;
}
So am I stuck here? Is ~350ms the absolute fastest that I can increment an integer inside each of 10 million pointed-to objects in a vector? Or am I missing something obvious? I experimented with multithreading (giving each thread a different chunk of the array to increment) and that actually took longer once I started using enough threads. Maybe that was due to some other obvious thing I'm missing, so for now I'd like to stay away from multithreading to keep things simple.
I'm open to trying containers other than a vector too, if it speeds things up, but whatever container I end up using, I need to be able to easily resize it, remove elements, etc.
I'm fairly new to c++ so any help would be appreciated!

Let's look from the CPU point of view.
Incrementing a plain integer: I have it in a CPU register and just increment it. This is the fastest option.
Incrementing through a pointer: I'm given an address (vector element -> object member), I must load the value at that address into a register, increment it, and copy the result back to the address. Worse: my CPU cache is filled with the pointers stored in the vector, not with the objects they point to. Too few hits, too much cache "refueling".
If I could manage to have all those members directly in a vector, CPU cache hits would be much more frequent.

Try the following:
int count = 10000000;
std::vector<foo> fooArr;
fooArr.resize(count, foo());
for (auto it = fooArr.begin(); it != fooArr.end(); ++it) {
    it->inc();
}
The new is killing you, and you don't actually need it: resize inserts elements at the end when the new size is greater than the old one (check the docs: std::vector::resize).
The other thing is the use of pointers, which IMHO should be avoided until the last moment and is unnecessary in this case. The performance should be somewhat faster here, since you get better locality of your references (see cache locality). If the objects were polymorphic or something more complicated, it might be different.
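If you want to measure the difference yourself, here is a minimal timing sketch using <chrono> (it assumes the foo class from the question; absolute numbers will of course depend on your machine and compiler):

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int count = 10000000;
    std::vector<foo> fooArr(count);   // objects stored by value, contiguously

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; i++)
    {
        fooArr[i].inc();
    }
    auto stop = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
    return 0;
}

Swapping the vector<foo> for the original vector<foo*> in the same harness lets you compare the two layouts directly.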

Related

Writing two versions of a function, one for "clarity" and one for "speed"

My professor assigned homework to write a function that takes in an array of integers and sorts all zeros to the end of the array while maintaining the current order of non-zero ints. The constraints are:
Cannot use the STL or other templated containers.
Must have two solutions: one that emphasizes speed and another that emphasizes clarity.
I wrote up this function attempting for speed:
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;
void sortArray(int array[], int size)
{
    int i = 0;
    int j = 1;
    int n = 0;
    for (i = j; i < size;)
    {
        if (array[i] == 0)
        {
            n++;
            i++;
        }
        else if (array[i] != 0 && j != i)
        {
            array[j++] = array[i++];
        }
        else
        {
            i++;
            n++;
        }
    }
    while (j < size)
    {
        array[j++] = 0;
    }
}
int main()
{
    //Example 1
    int array[]{20, 0, 0, 3, 14, 0, 5, 11, 0, 0};
    int size = sizeof(array) / sizeof(array[0]);
    sortArray(array, size);
    cout << "Result :\n";
    for (int i = 0; i < size; i++)
    {
        cout << array[i] << " ";
    }
    cout << endl << "Press any key to exit...";
    cin.get();
    return 0;
}
It outputs correctly, but:
I don't know what the speed of it actually is, can anyone help me figure out how to calculate that?
I have no idea how to go about writing a function for "clarity"; any ideas?
In my experience, unless you have a very complicated algorithm, speed and clarity come together:
void sortArray(int array[], int size)
{
    int item;
    int dst = 0;
    int src = 0;
    // collect all non-zero elements
    while (src < size) {
        if ((item = array[src++]) != 0) {
            array[dst++] = item;
        }
    }
    // fill the rest with zeroes
    while (dst < size) {
        array[dst++] = 0;
    }
}
Speed comes from a good algorithm. Clarity comes from formatting, naming variables and commenting.
Speed as in complexity?
Since you need to look at every element of the array, and as such have a single loop going through the indexes in the range [0, N), where N denotes the size of the input, your solution is O(N).
Further reading:
Plain English explanation of big O
Determining big O Notation
Regarding clarity
In my honest opinion there shouldn't need to be two alternatives when implementing such functionality as you are presenting. If you rename your variables to more suitable (descriptive) names your current solution should be clear enough to count as both performant and clear.
Your current approach can be written in plain English in a very clear fashion:
pseudo-explanation
set write_index to 0
set number_of_zeroes to 0
for each element in array
    if element is 0
        increase number_of_zeroes by one
    otherwise
        write element value to position denoted by write_index
        increase write_index by one
write number_of_zeroes 0s at the end of array
Having stated the explanation above we can quickly see that sortArray is not a descriptive name for your function, a more suitable name would probably be partition_zeroes or similar.
Adding comments could improve readability, but your current focus should lie in renaming your variables to better express the intent of the code.
(I feel your question is almost off-topic; I am answering it from a Linux perspective; I recommend using Linux to learn C++ programming; you can adapt my advice to your operating system if you are using something else.)
speed
Regarding speed, you should have two complementary approaches.
The first (somehow "theoretical") is to analyze (i.e. think on) your algorithm and give (with some proof) its asymptotic time complexity.
The second approach (only "practical", and often pragmatic) is to benchmark and profile your program. Don't forget to compile with optimizations enabled (e.g. using g++ -Wall -O2 with GCC). Have a benchmark which runs for more than half a second (so it processes a large amount of data, e.g. several million numbers) and repeat it several times (e.g. using the time(1) command on Linux). You could also measure some time inside your program using e.g. <chrono> in C++11, or just clock(3) (if you read a large array from some file, or build a large array of pseudo-random numbers with <random> or with random(3), you certainly want to measure the time to read or fill the array separately from the time to move zeros out of it). See also time(7).
(You need to process a large amount of data - more than a million items, perhaps many millions of them - because computers are very fast; a typical "elementary" operation - a machine instruction - takes less than a nanosecond, and you have a lot of uncertainty on a single run, see this.)
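As a concrete illustration of the benchmarking approach, a minimal sketch with <chrono> and <random> might look like the following (it assumes the sortArray from the question; the STL restriction applies to the solution itself, not to the test harness):

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const int n = 10000000;
    std::vector<int> data(n);

    // Fill the array first, so setup time is not mixed into the measurement.
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 5);   // small range gives plenty of zeros
    for (int &v : data)
        v = dist(gen);

    auto start = std::chrono::steady_clock::now();
    sortArray(data.data(), n);                       // the function under test
    auto stop = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
    return 0;
}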
clarity
Regarding clarity, it is a bit subjective, but you might try to make your code readable and concise. Adding a few good comments could also help.
Be careful about naming: sorting is not exactly what your program is doing (it is more moving zeros than sorting the array)...
I think this is the best approach. Of course you may wish to use doxygen or some other documentation tool:
// Shift the non-zeros to the front and put zero in the rest of the array
void moveNonZerosTofront(int *list, unsigned int length)
{
    unsigned int from = 0, to = 0;
    // This will move the non-zeros
    for (; from < length; ++from) {
        if (list[from] != 0) {
            list[to] = list[from];
            to++;
        }
    }
    // So the rest of the array needs to be assigned zero (as we found those on the way)
    for (; to < length; ++to) {
        list[to] = 0;
    }
}

Vector performance suffering

I've been working on state space exploration and was originally using a map to store the assignment of the world states, like map<Variable *, int>, where variables are objects in the world with a domain from 0 to n, where n is finite. The implementation was extremely fast, but I noticed that it does not scale well with the size of the state space. I changed the states to use vector<int> instead, where I use the id of a variable to find its index in the vector. Memory usage improved greatly, but the efficiency of the solver has tanked (gone from <30 seconds to 400+). The only code that I modified was generating the states and validating whether a state is the goal. I can't figure out why using a vector has degraded performance, especially since the vector operations should only take linear time at worst.
Originally this was how I generated nodes:
State * SuccessorGen::generate_successor(const Operator &op, map<Variable *, int> &var_assignment){
    map<Variable *, int> values;
    values.insert(var_assignment.begin(), var_assignment.end());
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        values[eff_it->var] = eff_it->after;
    }
    return new State(values);
}
And in my new implementation:
State* SuccessorGen::generate_successor(const Operator &op, const vector<int> &assignment){
    vector<int> child;
    child = assignment;
    vector<Operator::Effect> effect = op.get_effect();
    vector<Operator::Effect>::const_iterator eff_it = effect.begin();
    for (; eff_it != effect.end(); eff_it++){
        Variable *v = eff_it->var;
        int id = v->get_id();
        child[id] = eff_it->after;
    }
    return new State(child);
}
(The goal checking is similar, just looping over the goal assignment instead of operator effects.)
Are these vector operations really that much slower than using a map? Is there an equally efficient STL container I can use that has a lower overhead? The number of variables is relatively small (<50) and the vector never needs to be resized or modified after the for loop.
Edit:
I tried timing one loop through all the operators to compare: with the effect list and assignment, the vector version runs one loop in 0.3 seconds, while the map version takes a little over 0.4 seconds. When I comment that section out, the map version stays about the same, yet the vector version jumps up to closer to 0.5 seconds. I added child.reserve(assignment.size()) but that did not make any difference.
Edit 2:
From user63710's answer, I've also been digging through the rest of the code and noticed something really strange going on in the heuristic calculation. The vector version works fine, but for the map version I use this line: Node *n = new Node(i, transition.value, label_cost); open_list.push(n);. Once the loop finishes filling the queue, the nodes get totally screwed up. Nodes are a simple struct:
struct Node{
    // Source Value, Destination Value
    int from;
    int to;
    int distance;
    Node(int &f, int &t, int &d) : from(f), to(t), distance(d){}
};
Instead of holding from, to, distance, the nodes end up with some random numbers in from and to, and the search does not do what it should and returns much faster than it should. When I tweak the map version to convert the map to a vector and run this:
Node n(i, transition.value, label_cost); open_list.push(n);
the performance is about equal to that of the vector. So that fixes my main issue, but it leaves me wondering why using Node *n gets this behaviour as opposed to Node n(...)?
If, as you say, the sizes of these structures are fairly small (~50 elements), I have to think the issue is somewhere else. At least, I don't think it involves the memory accesses or allocation of the vector/map.
Some example code I made to test: Map version:
unique_ptr<map<int, int>> make_successor_map(const vector<int> &ids,
                                             const map<int, int> &input)
{
    auto new_map = make_unique<map<int, int>>(input.begin(), input.end());
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_map)[ids[i]], (*new_map)[i]);
    return new_map;
}
int main()
{
    auto a_map = make_unique<map<int, int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_map->insert({i, rand()});
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_map = make_successor_map(ids, *a_map);
        swap(temp_map, a_map);
    }
    cout << a_map->begin()->second << endl;
}
Vector version:
unique_ptr<vector<int>> make_successor_vec(const vector<int> &ids,
                                           const vector<int> &input)
{
    auto new_vec = make_unique<vector<int>>(input);
    for (size_t i = 0; i < ids.size(); ++i)
        swap((*new_vec)[ids[i]], (*new_vec)[i]);
    return new_vec;
}
int main()
{
    auto a_vec = make_unique<vector<int>>();
    // ids to access
    vector<int> ids;
    const int n = 100;
    for (int i = 0; i < n; ++i)
    {
        a_vec->push_back(rand());
        ids.push_back(i);
    }
    random_shuffle(ids.begin(), ids.end());
    for (int i = 0; i < 1e6; ++i)
    {
        auto temp_vec = make_successor_vec(ids, *a_vec);
        swap(temp_vec, a_vec);
    }
    cout << *a_vec->begin() << endl;
}
The map version takes around 15 seconds to run on my old Core 2 Duo T9600, and the vector version takes 0.406 seconds. Both were compiled with G++ 4.9.2 using g++ -O3 --std=c++1y. So if your code takes 0.4 s per iteration (note that my example code took 0.4 s for 1 million calls), then I really think your problem is somewhere else.
That's not to say you aren't seeing a performance decrease from switching map -> vector, but the code you posted doesn't show much reason for that to happen.
The problem is that you create vectors without reserving space. Vectors store their elements contiguously; that is what guarantees constant-time access to elements.
So every time you add an item to the vector (for example via your inserter), the vector may have to reallocate more space and move all the existing elements to the newly allocated memory. This causes slowdown and considerable heap fragmentation.
The solution is to reserve() space if you know in advance how many elements you'll have. If you don't, reserve() larger chunks and compare size() and capacity() to check when it's time to reserve more.
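A minimal sketch of that pattern (the function name and count are just illustrative):

#include <vector>

std::vector<int> build(int expected_count)
{
    std::vector<int> v;
    v.reserve(expected_count);            // one allocation up front
    for (int i = 0; i < expected_count; ++i)
        v.push_back(i);                   // no reallocation inside the loop
    return v;
}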

How to speed up a function that returns a pointer to object in c++?

I am a mechanical engineer so please understand I am not trained in proper coding. I have a finite element code that uses grids to make elements which make a model. The element is not important to this question so I have left it out. The elements and grids are read in from a file and that part works.
class Grid
{
private:
    int id;
    double x;
    double y;
    double z;
public:
    Grid();
    Grid(int, double, double, double);
    int get_id() { return id; };
};
Grid::Grid() {};
Grid::Grid(int t_id, double t_x, double t_y, double t_z)
{
    id = t_id; x = t_x; y = t_y; z = t_z;
}
class SurfaceModel
{
private:
    Grid** grids;
    Element** elements;
    int grid_count;
    int elem_count;
public:
    SurfaceModel();
    SurfaceModel(int, int);
    ~SurfaceModel();
    void read_grid(std::string);
    int get_grid_count() { return grid_count; };
    Grid* get_grid(int);
};
SurfaceModel::SurfaceModel()
{
    grids = NULL;
    elements = NULL;
}
SurfaceModel::SurfaceModel(int g, int e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}
void SurfaceModel::read_grid(std::string line)
{
    ... blah blah ...
    grids[index] = new Grid(n_id, n_x, n_y, n_z);
    ... blah blah ....
}
Grid* SurfaceModel::get_grid(int i)
{
    if (i < grid_count)
        return grids[i];
    else
        return NULL;
}
When I need to actually use a grid, I use get_grid, something like this:
SurfaceModel model(...);
.... blah blah .....
for (int i = 0; i < model.get_grid_count(); i++)
{
    Grid *cur_grid = model.get_grid(i);
    int cur_id = cur_grid->get_id();
}
My problem is that the call to get_grid seems to be taking more time than I think it should to simply return my object. I ran gprof on the code and found that get_grid gets called about 4 billion times during a very large simulation, and another operation that uses the x, y, z values occurs about as often. That operation does some multiplication. What I found is that get_grid and the math take about the same amount of time (~40 seconds). This seems like I have done something wrong. Is there a faster way to get that object out of there?
I think you're forgetting to set grid_count and elem_count.
That means they will have uninitialized (indeterminate) values. If you loop up to those values, you can easily end up running far more iterations than you expect.
SurfaceModel::SurfaceModel()
    : grids(NULL),
      elements(NULL),
      grid_count(0),
      elem_count(0)
{
}
SurfaceModel::SurfaceModel(int g, int e)
    : grid_count(g),
      elem_count(e)
{
    grids = new Grid*[g];
    for (int i = 0; i < g; i++)
        grids[i] = NULL;
    elements = new Element*[e];
    for (int i = 0; i < e; i++)
        elements[i] = NULL;
}
However, I suggest getting rid of every instance of new in this program (and using a vector for the grids).
On a modern CPU accessing memory often takes longer than doing multiplication. Getting good performance on modern systems can often mean focusing more on optimizing memory accesses than optimizing computation. Because you are storing your grid objects as an array of dynamically allocated pointers the grid objects themselves will be stored non-contiguously in memory and you will likely get many cache misses when trying to access them. In this example you would probably see a significant speedup by storing your grid objects directly in an array or vector since you will be accessing contiguous memory in your loop and so get good cache utilization and effective hardware prefetching.
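For illustration, a minimal sketch of the contiguous layout (it assumes the Grid class from the question; sum_ids is just a hypothetical consumer of the data):

#include <cstddef>
#include <vector>

// Grids stored by value sit next to each other in memory, so the loop
// below walks sequential addresses instead of chasing scattered pointers.
long long sum_ids(std::vector<Grid> &grids)
{
    long long sum = 0;
    for (std::size_t i = 0; i < grids.size(); ++i)
        sum += grids[i].get_id();
    return sum;
}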
4 billion calls at a microsecond each (which is a pretty acceptable time in many cases) gives 4,000 seconds. Since you only see about 40 s (if I read it right), I doubt anything is seriously wrong here. If it's still too slow for the task, I'd consider parallel computing.

Stack versus Integer

I've created a program to solve Cryptarithmetics for a class on Data Structures. The professor recommended that we utilize a stack consisting of linked nodes to keep track of which letters we replaced with which numbers, but I realized an integer could do the same trick. Instead of a stack {A, 1, B, 2, C, 3, D, 4} I could hold the same info in 1234.
My program, though, seems to run much more slowly than the estimation he gave us. Could someone explain why a stack would behave much more efficiently? I had assumed that, since I wouldn't be calling methods over and over again (push, pop, top, etc) and instead just add one to the 'solution' that mine would be faster.
This is not an open ended question, so do not close it. Although you can implement things different ways, I want to know why, at the heart of C++, accessing data via a stack has performance benefits over storing it in ints and extracting digits with the modulo operator.
Although this is homework, I don't actually need help, just very intrigued and curious.
Thanks and can't wait to learn something new!
EDIT (Adding some code)
letterAssignments is an int array of size 26. For a problem like SEND + MORE = MONEY, A isn't used, so letterAssignments[0] is set to 11. All chars that are used are initialized to 10.
answerNum is a number with as many digits as there are unique characters (in this case, 8 digits).
int Cryptarithmetic::solve(){
    while(!solved()){
        for(size_t z = 0; z < 26; z++){
            if(letterAssignments[z] != 11) letterAssignments[z] = 10;
        }
        if(answerNum < 1) return NULL;
        size_t curAns = answerNum;
        for(int i = 0; i < numDigits; i++){
            if(nextUnassigned() != '$') {
                size_t nextAssign = curAns % 10;
                if(isAssigned(nextAssign)){
                    answerNum--;
                    continue;
                }
                assign(nextUnassigned(), nextAssign);
                curAns /= 10;
            }
        }
        answerNum--;
    }
    return answerNum;
}
Two helper methods in case you'd like to see them:
char Cryptarithmetic::nextUnassigned(){
    char nextUnassigned = '$';
    for(int i = 0; i < 26; i++) {
        if(letterAssignments[i] == 10) return ('A' + i);
    }
    return nextUnassigned; // nothing unassigned: fall back to the sentinel
}
void Cryptarithmetic::assign(char letter, size_t val){
    assert('A' <= letter && letter <= 'Z'); // valid letter
    assert(letterAssignments[letter-'A'] != 11); // has this letter
    assert(!isAssigned(val)); // not already assigned.
    letterAssignments[letter-'A'] = val;
}
From the looks of things, the way you are doing this is quite inefficient.
As a general rule, try to have as few nested for loops as possible, since each level of nesting multiplies the work your implementation does.
For instance, if we strip all other code away, your program looks like:
while(thing) {
    for(z < 26) {
    }
    for(i < numDigits) {
        for(i < 26) {
        }
        for(i < 26) {
        }
    }
}
This means that each pass of the while loop does ((26+26)*numDigits)+26 loop iterations. That's assuming isAssigned() does not use a loop.
Ideally you want:
while(thing) {
    for(i < numDigits) {
    }
}
which I'm sure is possible with changes to your code.
This is why your implementation with the integer array is much slower than an implementation using the stack which does not use the for(i < 26) loops (I assume).
In answer to your original question, however: storing your data in an array of integers will generally be faster than any struct you can come up with, simply because of the extra overhead involved in allocating memory, calling functions, etc.
But as with everything, implementation is the key difference between a slow program and a fast program.
The problem is that by simply counting you are also considering repetitions, while the puzzle presumably asks you to assign a different digit to each different letter so that the numeric equation holds.
For example, for four letters you are testing 10*10*10*10 = 10000 letter->number mappings instead of 10*9*8*7 = 5040 of them (the more letters there are, the bigger the ratio between the two numbers becomes...).
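A small sketch that makes the gap concrete (k is the number of distinct letters; purely illustrative):

#include <iostream>

int main()
{
    const int k = 4;
    long long with_repeats = 1, without_repeats = 1;
    for (int i = 0; i < k; ++i)
    {
        with_repeats *= 10;        // any digit allowed each time
        without_repeats *= 10 - i; // each digit may be used only once
    }
    std::cout << with_repeats << " vs " << without_repeats << "\n"; // 10000 vs 5040
    return 0;
}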
The div instruction used by the modulo operation is quite expensive. Using it for your purpose can easily be less efficient than a good stack implementation. Here is an instruction timings table: http://gmplib.org/~tege/x86-timing.pdf
You should also write unit tests for your int-based stack to make sure that it works as intended.
Programming is often a matter of trading memory for time and vice versa.
Here you are packing data into an integer: you save memory but lose time.
Speed of course depends on the implementation of the stack. C++ is C with classes; if you are not using classes, it's basically C (as fast as C).
#include <cassert>

const int stack_size = 26;
struct Stack
{
    int _data[stack_size];
    int _stack_p;

    Stack()
        : _stack_p(0)
    {}

    inline void push(int val)
    {
        assert(_stack_p < stack_size); // no overhead in a release build,
                                       // where -DNDEBUG disables assert
        _data[_stack_p++] = val;
    }

    inline int pop()
    {
        assert(_stack_p > 0); // same thing. assert is very useful for tracing bugs
        return _data[--_stack_p];
    }

    inline int size()
    {
        return _stack_p;
    }

    inline int val(int i)
    {
        assert(i >= 0 && i < _stack_p);
        return _data[i];
    }
};
There is no overhead like a vtable pointer. Also, pop() and push() are so simple that they will be inlined, so there is no function call overhead. Using int as the stack element is also good for speed, because int is typically the natural word size of the processor (no alignment issues, etc.).
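A quick usage sketch for the struct above:

#include <iostream>

int main()
{
    Stack s;
    s.push(3);
    s.push(7);
    while (s.size() > 0)
        std::cout << s.pop() << "\n"; // prints 7, then 3
    return 0;
}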

Vector push_back in while and for loops returns SIGABRT signal (signal 6) (C++)

I'm making a C++ game which requires me to initialize 36 numbers into a vector. You can't initialize a vector with an initializer list, so I've written a loop to initialize it faster. I want to make it push back 4 of each number from 2 to 10, so I'm using an int named fourth to check whether the loop counter is a multiple of 4. If it is, it changes the number pushed back to the next number up. When I run it, though, I get SIGABRT. It must be a problem with fourth, though, because when I took it out, it didn't give the signal.
Here's the program:
for (int i; i < 36;) {
    int fourth = 0;
    fourth++;
    fourth %= 4;
    vec.push_back(i);
    if (fourth == 0) {
        i++;
    }
}
Please help!
You do not initialize i. Use for (int i = 0; i < 36;). Also, a new variable fourth is created on each iteration of the loop body, so the test fourth == 0 will always yield false.
I want to make it push back 4 of each number from 2 to 10
I would use the most straight forward approach:
for (int value = 2; value <= 10; ++value)
{
    for (int count = 0; count < 4; ++count)
    {
        vec.push_back(value);
    }
}
The only optimization I would do is making sure that the capacity of the vector is sufficient before entering the loop. I would leave other optimizations to the compiler. My guess is, what you gain by omitting the inner loop, you lose by frequent modulo division.
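For example, the capacity hint is a single call in front of the loops; a complete sketch (make_deck is just an illustrative name):

#include <vector>

std::vector<int> make_deck()
{
    std::vector<int> vec;
    vec.reserve(36); // 9 values (2 through 10) * 4 copies each, allocated once
    for (int value = 2; value <= 10; ++value)
    {
        for (int count = 0; count < 4; ++count)
        {
            vec.push_back(value);
        }
    }
    return vec;
}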
You did not initialize i, and you are resetting fourth in every iteration. Also, with your for loop condition, I do not think it will do what you want.
I think this should work:
int fourth = 0;
for (int i = 2; i <= 10;) {
    fourth++;
    fourth %= 4;
    vec.push_back(i);
    if (fourth == 0) {
        i++;
    }
}
I've been able to declare a static array and pass it into the vector at initialization without issue. Pretty clean too:
const int initialValues[36] = {0, 1, 2 /* ... */, 35};
std::vector<int> foo(initialValues, initialValues + 36);
Works with constants, but I haven't tried it with non-const arrays.
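As a side note: if your compiler supports C++11, a vector can in fact be initialized directly from an initializer list, which avoids the loop entirely:

std::vector<int> vec = {2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4}; // ...and so on up to 10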