I am currently going through some code. I have a Road class with a private vector of pointers to Lanes, and each Lane contains a vector of pointers to Vehicles; the Vehicle class has simple get and set functions to update and obtain a vehicle's position, velocity, etc. Vehicles move in separate lanes and are allowed to switch lanes, as in real traffic flow. However, I would like each vehicle to continuously find the distance between itself and the vehicle in front, i.e., look in the vehicles vector and find the closest vehicle ahead, and then use that to decide whether the car should decelerate. I would also like to handle the leading cars: once a vehicle leaves the display window height, it should be deleted.
My attempt at this is as follows:
void Lane::Simulate(double time)
{ // This Simulate allows checks between other vehicles.
    double forwardDistance = 0;
    for (unsigned int iV = 0; iV < fVehicles.size(); iV++)
    {
        for (unsigned int jV = 0; jV < fVehicles.size(); jV++)
        {
            forwardDistance = fVehicles[iV]->getPosition() - fVehicles[jV]->getPosition();
        }
    }
    if (fVehicles.size() < 15)
    {
        addRanVehicle(); // Adds a vehicle, with position zero but a random velocity, to each lane.
    }
    for (unsigned int iVehicle = 0; iVehicle < fVehicles.size(); iVehicle++)
    {
        fVehicles[iVehicle]->Simulate(time); // Updates position based on time, velocity and acceleration.
    }
}
There may be a much better method than using this forwardDistance parameter. The idea is to loop over each pair of vehicles, skip the case iV == jV, find the vehicle that is in front of the iVth vehicle, and record the distance between the two via a setDistance() function (a member of my Vehicle class). I should then be able to use this to check whether a car is too close, whether it can overtake, or whether it just has to brake.
Currently, I am not sure how to make an efficient looping mechanism for this.
Investigate the cost of performing an ordered insert of Vehicles into the lane. If the Vehicles are ordered according to position on the road, finding the distance between two Vehicles is child's play:
Eg
for (size_t n = 0; n + 1 < fVehicles.size(); n++) // n + 1 avoids unsigned underflow on an empty lane
{
    distance = fVehicles[n].getPosition() - fVehicles[n+1].getPosition();
}
This is O(N) vs O(N^2) (using ^ as exponent, not XOR). The price of this simplification is requiring an ordered insert into fVehicles, and that should be O(N): one std::lower_bound to find the insertion point and whatever shuffling is required by fVehicles to free up space to place the Vehicle.
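A minimal sketch of what that ordered insert could look like, assuming fVehicles still holds Vehicle pointers and getPosition() is the getter from your Vehicle class (addOrdered is a made-up name):
#include <algorithm>

void Lane::addOrdered(Vehicle *v)
{
    // Find the first vehicle whose position is not less than v's, insert before it.
    auto it = std::lower_bound(fVehicles.begin(), fVehicles.end(), v,
        [](Vehicle *a, Vehicle *b) { return a->getPosition() < b->getPosition(); });
    fVehicles.insert(it, v);
}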
Maintaining ordering of fVehicles may be beneficial in other places as well. Visualizing the list (graphically or by print statements) will be much easier, debugging is generally easier on the human brain when everything is in a nice predictable order, and CPUs... They LOVE going in a nice, predictable straight line. Sometimes you get a performance boost that you didn't see coming. Great write-up on that here: Why is it faster to process a sorted array than an unsorted array?
The only way to be sure this is better is to try it and measure it.
Other Suggestions:
Don't use pointers to the vehicles.
Not only are they harder to manage, they can slow you down quite a bit. As mentioned above, modern CPUs are really good at going in straight lines, and pointers can throw a kink in that straight line.
You never really know where in dynamic memory a pointer is going to land relative to the last pointer you looked at. But with a contiguous block of Vehicles, when the CPU loads Vehicle N it can possibly also grab Vehicles N+1 and N+2. If it can't because they are too big, it doesn't matter much because it already knows where they are, and while the CPU is processing, an idle memory channel could be reading ahead and grabbing the data you're going to need soon.
With the pointer you save a bit every time you move a Vehicle from lane to lane (pointers are usually much cheaper than objects to copy), but you may suffer on each and every loop iteration in each and every simulation tick, and the volume really adds up. Bjarne Stroustrup, God-Emperor of C++, has an excellent write-up on this problem using linked lists as an example (note that a linked list is often worse than a vector of pointers, but the idea is the same).
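For example, the storage and a lane change could look like this (sketch only; TransferTo and InsertOrdered are hypothetical names):
std::vector<Vehicle> fVehicles; // instead of std::vector<Vehicle *>: contiguous storage

void Lane::TransferTo(Lane &other, std::size_t n)
{
    other.InsertOrdered(fVehicles[n]);      // copy the Vehicle into the other lane
    fVehicles.erase(fVehicles.begin() + n); // and remove it from this one
}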
Take advantage of std::deque.
std::vector is really good at stack-like behaviour. You can add to and remove from the end lightning fast, but if you add to or remove from the beginning, everything in the vector is moved.
Most of the lane insertions are likely to be at one end and the removals at the other, simply because older Vehicles will gravitate toward the end as Vehicles are added to the beginning, or vice versa. This is a certainty if the first suggestion is taken and fVehicles is ordered. New vehicles will be added to the lane at the beginning, a few will change lanes into or out of the middle, and old vehicles will be removed from the end. deque is optimized for inserting and removing at both ends, so adding new cars is cheap, removing old cars is cheap, and you only pay full price for cars that change lanes.
Documentation on std::deque
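As a sketch of that pattern, assuming fVehicles is ordered by position and positions grow away from the spawn point (removeFinished is a made-up name, addRanVehicle is yours, and Vehicle is assumed default-constructible here):
#include <deque>

std::deque<Vehicle> fVehicles;

void Lane::addRanVehicle()
{
    fVehicles.emplace_front(); // new Vehicle enters at the cheap front end (position zero, random velocity set elsewhere)
}

void Lane::removeFinished(double windowHeight)
{
    // Leading vehicles have the largest positions, so they sit at the other cheap end.
    while (!fVehicles.empty() && fVehicles.back().getPosition() > windowHeight)
        fVehicles.pop_back();
}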
Addendum
Take advantage of range-based for where possible. Range-based for takes most of the iteration logic away and hides it from you.
Eg this
for (unsigned int iV = 0; iV < fVehicles.size(); iV++)
{
    for (unsigned int jV = 0; jV < fVehicles.size(); jV++)
    {
        forwardDistance = fVehicles[iV]->getPosition() - fVehicles[jV]->getPosition();
    }
}
becomes
for (auto v_outer: fVehicles)
{
    for (auto v_inner: fVehicles)
    {
        forwardDistance = v_outer->getPosition() - v_inner->getPosition();
    }
}
It doesn't look much better if you are counting lines, but you can't accidentally write
iV <= fVehicles.size()
or
fVehicles[iV]->getPosition() - fVehicles[iV]->getPosition()
It removes the possibility for you to make silly, fatal, and hard-to-spot errors.
Let's break one down:
for (auto v_outer: fVehicles)
     ^    ^        ^
     |    |        +-- container to iterate
     |    +-- variable name
     +-- type
Documentation on Range-based for
In this case I'm also taking advantage of auto. auto allows the compiler to select the type of the data. The compiler knows that fVehicles contains pointers to Vehicles, so it replaces auto with Vehicle * for you. This takes away some of the headaches if you find yourself refactoring the code later.
Documentation on auto
Unfortunately, in this case it can also trap you. If you follow the suggestions above, fVehicles becomes
std::deque<Vehicle> fVehicles;
which means auto is now Vehicle. That makes v_outer a copy, costing you copying time and meaning that if you change v_outer, you change a copy and the original goes unchanged. To avoid that, tend toward
for (auto &v_outer: fVehicles)
The compiler is good at deciding how best to handle that reference or if it even needs it.
Related
I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
    // Insert one entry with value x
    void Insert(int x);
    // Erase one entry with value x, if one exists
    void Erase(int x);
    // Erase all entries, x, from <= x < to
    void Erase(int from, int to);
    // Return the count of all entries, x, from <= x < to
    size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
    std::vector<int> integerCollection;
};

void IntegerCollection::Insert(int x) {
    /* using lower_bound to find the right place for x to be inserted
       keeps the vector sorted and makes life much easier */
    auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
    integerCollection.insert(it, x);
}

void IntegerCollection::Erase(int x) {
    // find the location of the first element containing x and delete it if it exists
    auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
    if (it != integerCollection.end()) {
        integerCollection.erase(it);
    }
}

void IntegerCollection::Erase(int from, int to) {
    if (integerCollection.empty()) return;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    /* std::vector::erase deletes entries between the two iterators
       fromBound (included) and toBound (not included) */
    integerCollection.erase(fromBound, toBound);
}

size_t IntegerCollection::Count(int from, int to) const {
    if (integerCollection.empty()) return 0;
    int count = 0;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    // increment the iterator until fromBound == toBound (we don't count elements of value = to)
    while (fromBound != toBound) {
        ++count; ++fromBound;
    }
    return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is. deque only makes sense if I can guarantee insertion or deletion at the ends of the container, and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without concise reasoning, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well-performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the former, you can:
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.
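For illustration, here is a minimal sketch of the interface on top of std::multiset (multiset rather than set, since the interface allows duplicate entries); note that Count still walks the range via std::distance, which is the weak spot:
#include <cstddef>
#include <iterator>
#include <set>

class IntegerCollection {
public:
    void Insert(int x) { values.insert(x); }

    void Erase(int x) {
        auto it = values.find(x);
        if (it != values.end()) values.erase(it); // erase exactly one entry
    }

    void Erase(int from, int to) {
        values.erase(values.lower_bound(from), values.lower_bound(to));
    }

    size_t Count(int from, int to) const {
        // std::distance over set iterators is linear in the range size
        return std::distance(values.lower_bound(from), values.lower_bound(to));
    }

private:
    std::multiset<int> values;
};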
Let's compare only insertion and erasure, assuming the container needs to stay sorted:
operation    std::set    std::vector
insert       log(N)      N
erase        log(N)      N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality, data can be far apart in memory (std::set), leading to many cache misses, or it can be local, as with std::vector. The log(N) only wins for huge N. To get an idea of the difference, 500000/log(500000) is roughly 26410, while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for reasonably small container sizes, but at some point the log(N) wins over the cache. The exact location of this turning point depends on many factors and can only reliably be determined by profiling and measuring.
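A rough harness for that measurement could look like the following (only a sketch; the numbers and the crossover point will vary per machine):
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main()
{
    const int N = 500000;
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> v;
    for (int i = 0; i < N; ++i) {
        int x = dist(gen);
        v.insert(std::lower_bound(v.begin(), v.end(), x), x); // ordered insert
    }
    auto t1 = std::chrono::steady_clock::now();

    std::multiset<int> s;
    for (int i = 0; i < N; ++i) s.insert(dist(gen));
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("vector: %lld ms, multiset: %lld ms\n",
        (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}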
Nobody knows which container is MOST efficient for multiple insertions/deletions. That is like asking what the most fuel-efficient design for a car engine is; people are always innovating on car engines and make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant: some insertions take a long time and some take only a very short time. However, the amortized time per insertion/deletion is always guaranteed to be O(log n), where n is the number of items stored in the splay tree. Logarithmic time is extremely efficient and should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single lookups can be done in constant time.
The integer value can be hashed to compute an index into an array of bools or bits, used to tell whether the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, each storing ints from 0 to 0xFFFF, and were using 32-bit integers, you could mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert/delete values from.
IntHashTable containers[0x10000];
uint32_t hashIndex = (uint32_t)value / 0x10000;                  // upper 16 bits select the table
uint32_t valueInTable = (uint32_t)value - (hashIndex * 0x10000); // lower 16 bits
containers[hashIndex].insert(valueInTable);
Count, for example, could be implemented like so, if each hash table kept count of the number of elements it contains (the two boundary tables would additionally need a partial count for values that fall inside their range):
int indexStart = startRange / 0x10000;
int indexEnd = endRange / 0x10000;
int countTotal = 0;
for (int i = indexStart; i <= indexEnd; ++i) {
    countTotal += containers[i].count();
}
I'm not sure sorting really is a requirement for removing the range; it might be based on position. Anyway, here is a link with some hints on which STL container to use:
In which scenario do I use a particular STL container?
Just FYI: vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require one big contiguous chunk of memory to allocate all items. For a requirement such as yours, list would probably fit better.
A basic solution for this problem might be std::map<int, int>, where the key is the integer you are storing and the value is the number of occurrences.
The problem with this is that you cannot quickly remove/count ranges; in other words, the complexity is linear.
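A sketch of that basic idea, with the linear-range caveat visible in Count:
#include <cstddef>
#include <map>

class IntegerCollection {
public:
    void Insert(int x) { ++counts[x]; }

    void Erase(int x) {
        auto it = counts.find(x);
        if (it != counts.end() && --it->second == 0) counts.erase(it);
    }

    void Erase(int from, int to) {
        counts.erase(counts.lower_bound(from), counts.lower_bound(to));
    }

    size_t Count(int from, int to) const {
        size_t total = 0;
        auto end = counts.lower_bound(to);
        for (auto it = counts.lower_bound(from); it != end; ++it)
            total += it->second; // linear walk over the range, as noted above
        return total;
    }

private:
    std::map<int, int> counts;
};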
For a quick count you would need to implement your own complete binary tree, where you can know the number of nodes between two nodes (the upper- and lower-bound nodes) because you know the size of the tree and how many left and right turns you took to reach the upper- and lower-bound nodes. Note that we are talking about a complete binary tree; in a general binary tree you cannot make this calculation fast.
For a quick range remove I do not know how to make it faster than linear.
I am a bit curious about vector optimization and have a couple of questions about it. (I am still a beginner in programming.)
example:
struct GameInfo {
    EnumType InfoType;
    bool bhasbeenAdded;
    // Other info...
};
int _lastPosition;
// _gameInfoV is sorted beforehand
std::vector<GameInfo> _gameInfoV;
// The tick function is called every game frame (in "perfect" condition it's every 1.0/60 second)
void BaseClass::tick()
{
    for (unsigned int i = _lastPosition; i < _gameInfoV.size(); i++) {
        auto & info = _gameInfoV[i];
        if (!info.bhasbeenAdded) {
            if (DoWeNeedNow()) {
                _lastPosition++;
                info.bhasbeenAdded = true;
                _otherPointer->DoSomething(info.InfoType);
                // Do something more with "info"....
            }
            else return; // Break out of the loop since we don't need the other "info" yet
        }
    }
}
The _gameInfoV vector size can be between 2000 and 5000.
My main 2 questions are:
Is it better to leave it the way it is, or is it better to split it into smaller chunks, one checked for each different GameInfo.InfoType?
Is it worth the hassle of storing the last start position index of the vector instead of iterating from the beginning?
Note that if using smaller vectors there will be 3 to 6 of them.
The third thing is that I am not using vector iterators; is it safe to use them like this?
std::vector<GameInfo>::iterator it = _gameInfoV.begin() + _lastPosition;
for (; it != _gameInfoV.end(); ++it){
    //Do something
}
Note: it will be used on smartphones, so every optimization will be appreciated when targeting weaker phones.
-Thank you
Don't, except if you frequently move memory around.
It is no hassle if you do it correctly:
std::vector<GameInfo>::iterator _lastPosition(_gameInfoV.begin());
// ...
for (std::vector<GameInfo>::iterator info = _lastPosition; info != _gameInfoV.end(); ++info)
{
    if (!info->bhasbeenAdded)
    {
        if (DoWeNeedNow())
        {
            ++_lastPosition;
            info->bhasbeenAdded = true;
            _otherPointer->DoSomething(info->InfoType);
            // Do something more with "info"....
        }
        else return; // Break out of the loop since we don't need the other "info" yet
    }
}
Breaking one vector up into several smaller vectors in general doesn't improve performance. It could even slightly degrade performance because the compiler has to manage more variables, which take up more CPU registers etc.
I don't know about gaming so I don't understand the implication of GameInfo.InfoType. Your processing time and CPU resource requirements are going to increase if you do more total iterations through loops (where each loop iteration performs the same type of operation). So if separating the vectors causes you to avoid some loop iterations because you can skip entire vectors, that's going to increase performance of your app.
Iterators are the most secure way to iterate through containers, but for a vector I often just use the index operator [] and my own indexer (a plain old unsigned integer).
First, some background:
I'm working on a project which requires me to simulate interactions between objects that can be thought of as polygons (usually triangles or quadrilaterals, almost certainly fewer than seven sides), each side of which is composed of the radii of two circles, with a variable (and possibly zero) number of 'rivers' of various constant widths passing between them and out of the polygon through some other side. As these rivers and circles and their widths (and the positions of the circles) are specified at runtime, one of these polygons with N sides and M rivers running through it can be completely described by an array of N+2M pointers, each referring to the relevant rivers/circles, starting from an arbitrary corner of the polygon and passing around (in principle, since rivers can't overlap, they should be specifiable with less data, but in practice I'm not sure how to implement that).
I was originally programming this in Python, but quickly found that for more complex arrangements performance was unacceptably slow. In porting this over to C++ (chosen because of its portability and compatibility with SDL, which I'm using to render the result once optimization is complete) I am at somewhat of a loss as to how to deal with the polygon structure.
The obvious thing to do is to make a class for them, but as C++ lacks even runtime-sized arrays or multi-type arrays, the only way to do this would be with a ludicrously cumbersome set of vectors describing the list of circles, rivers, and their relative placement, or else an even more cumbersome 'edge' class of some kind. Rather than this, it seems like the better option is to use a much simpler, though still annoying, vector of void pointers, each pointing to the rivers/circles as described above.
Now, the question:
If I am correct, the proper way to handle the relevant memory allocations here with the minimum amount of confusion (not saying much...) is something like this:
int doStuffWithPolygons(){
    std::vector<std::vector<void *>> polygons;
    while(/*some circles aren't assigned a polygon*/){
        std::vector<void *> polygon;
        void *start = &/*next circle that has not yet been assigned a polygon*/;
        void *lastcircle = start;
        void *nextcircle;
        nextcircle = &/*next circle to put into the polygon*/;
        while(nextcircle != start){
            polygon.push_back(lastcircle);
            std::vector<River *> rivers = /*list of rivers between last circle and next circle*/;
            for(unsigned i = 0; i < rivers.size(); i++){
                polygon.push_back(rivers[i]);
            }
            lastcircle = nextcircle;
            nextcircle = &/*next circle to put into the polygon*/;
        }
        polygons.push_back(polygon);
    }
    int score = 0;
    //do whatever you're going to do to evaluate the polygons here
    return score;
}

int main(){
    int bestscore = 0;
    std::vector<int> bestarrangement; //contains position of each circle
    std::vector<int> currentarrangement = /*whatever arbitrary starting arrangement is appropriate*/;
    while(/*not done evaluating polygon configurations*/){
        //fiddle with current arrangement a bit
        int currentscore = doStuffWithPolygons();
        if(currentscore > bestscore){
            bestscore = currentscore;
            bestarrangement = currentarrangement;
        }
    }
    //somehow report what the best arrangement is
    return 0;
}
If I properly understand how this stuff is handled, I shouldn't need any delete or .clear() calls because everything goes out of scope after the function call. Am I correct about this? Also, is there any part of the above that is needlessly complex, or else is insufficiently complex? Am I right in thinking that this is as simple as C++ will let me make it, or is there some way to avoid some of the roundabout construction?
And if your response is going to be something like 'don't use void pointers' or 'just make a polygon class', unless you can explain how it will make the problem simpler, save yourself the trouble. I am the only one who will ever see this code, so I don't care about adhering to best practices. If I forget how/why I did something and it causes me problems later, that's my own fault for insufficiently documenting it, not a reason to have written it differently.
edit
Since at least one person asked, here's my original python, handling the polygon creation/evaluation part of the process:
#lots of setup stuff, such as the Circle and River classes

def evaluateArrangement(circles, rivers, tree, arrangement): #circles, rivers contain all the circles, rivers to be placed. tree is a class describing which rivers go between which circles, unrelated to the problem at hand. arrangement contains (x,y) position of the circles in the current arrangement.
    polygons = []
    unassignedCircles = range(len(circles))
    while unassignedCircles:
        polygon = []
        start = unassignedCircles[0]
        lastcircle = start
        lastlastcircle = start
        nextcircle = getNearest(start, arrangement)
        unassignedCircles.pop(start)
        unassignedCircles.pop(nextcircle)
        while not nextcircle == start:
            polygon += [lastcircle]
            polygon += getRiversBetween(tree, lastcircle, nextcircle)
            lastlastcircle = lastcircle
            lastcircle = nextcircle
            nextcircle = getNearest(lastcircle, arrangement, lastlastcircle) #the last argument here guarantees that the new nextcircle is not the same as the last lastcircle, which it otherwise would have been guaranteed to be.
            unassignedCircles.pop(nextcircle)
        polygons += [polygon]
    return EvaluatePolygons(polygons, circles, rivers) #defined outside.
Void as a template argument must be lower case. Other than that it should work, but I would also recommend using a base class for the circles and rivers; with smart pointers you can let the system handle all the memory management.
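A sketch of what that could look like (all names here are made up, and shared_ptr ownership is just one option):
#include <memory>
#include <vector>

struct Edge { virtual ~Edge() = default; };          // common base for Circle and River
struct Circle : Edge { /* centre, radius, ... */ };
struct River  : Edge { /* width, ... */ };

using Polygon = std::vector<std::shared_ptr<Edge>>;  // replaces std::vector<void *>

int main(){
    std::vector<Polygon> polygons;
    Polygon p;
    p.push_back(std::make_shared<Circle>());
    p.push_back(std::make_shared<River>());
    polygons.push_back(std::move(p));
    return 0;
}   // all Edges are freed automatically here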
With my current project, I did my best to adhere to the principle that premature optimization is the root of all evil. However, now the code is tested, and it is time for optimization. I did some profiling, and it turns out my code spends almost 20% of its time in a function where it finds all possible children, puts them in a vector, and returns them. As a note, I am optimizing for speed, memory limitations are not a factor.
Right now the function looks like this:
void Board::GetBoardChildren(std::vector<Board> &children)
{
    children.reserve(open_columns_.size()); // only reserve max number of children
    UpdateOpenColumns();
    for (auto i : open_columns_)
    {
        short position_adding_to = ColumnToPosition(i);
        MakeMove(position_adding_to); // make the possible move
        children.push_back(*this);    // add to vector of children
        ReverseMove();                // undo move
    }
}
According to the profiling, my code spends about 40% of its time just on the line children.push_back(*this). I am calling the function like this:
std::vector<Board> current_children;
current_state.GetBoardChildren(current_children);
I was thinking since the maximum number of possible children is small (7), would it be better to just use an array? Or is there not a ton I can do to optimize this function?
From your responses to my comments, it seems very likely that most of the time is spent copying the board in
children.push_back(*this);
You need to find a way to avoid making all those copies, or a way to make them cheaper.
Simply changing the vector into an array or a list will likely not make any difference to performance.
The most important question is: do you really need all the child states at once in current_children?
If you just iterate over them once or twice in the default order, then there is no need for a vector; just generate them on demand.
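A sketch of that on-demand idea, reusing the names from your GetBoardChildren (ForEachChild is made up): each child is visited in place instead of being copied into a vector.
template <typename Visitor>
void Board::ForEachChild(Visitor visit)
{
    UpdateOpenColumns();
    for (auto i : open_columns_)
    {
        short position_adding_to = ColumnToPosition(i);
        MakeMove(position_adding_to); // make the possible move
        visit(*this);                 // let the caller inspect the child, no copy
        ReverseMove();                // undo move
    }
}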
If you really need them all, here is the next step. Since Board is expensive to copy, a DifferenceBoard that keeps track only of the difference may be better. Pseudocode:
struct DifferenceBoard { // or maybe inherit from Board, so that a DifferenceBoard
                         // can be built from another DifferenceBoard
    Board *original;
    int fromposition, toposition;
    State state_at_position;
    State get(int y, int x) const {
        if ((x,y) == fromposition) return Empty;
        if ((x,y) == toposition)   return state_at_position;
        return original->get(y, x);
    }
};
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
    double count;
    double val;
    double sumA;
    double sumB;
    vector<double> sumVectorC;
    vector<double> sumVectorD;
};

vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if(it != classes[i].sumVectorC.end())
        result[i] += *it;
}
The alternative is instead of one giant loop, split the computation into two separate loops such as:
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classes[i].sumA;
}

for(int i = 0; i < classes.size(); i++)
{
    vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
    if(it != classes[i].sumVectorC.end())
        result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for(int i = 0; i < classes.size(); i++)
{
    result[i] += classCounts[i];
    ...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
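For instance, the lookup in the original loop could become the following (assuming each sumVectorC is kept sorted; needs <algorithm>):
auto it = std::lower_bound(classes[i].sumVectorC.begin(),
                           classes[i].sumVectorC.end(), testval);
if (it != classes[i].sumVectorC.end() && *it == testval)
    result[i] += *it;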
If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, ensure that the arrays are sorted with a dummy infinite entry at the end, and then manually code the find so that the loop condition is a single test:
array[i] < test_val
followed by a single equality test at the end.
Then you know that the average number of values looked at is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently then keeping it sorted becomes an issue.
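A sketch of that sentinel search (assuming the vector is sorted ascending and its last element is the infinity sentinel):
#include <cstddef>
#include <vector>

bool sentinelFind(const std::vector<double> &v, double testval)
{
    std::size_t i = 0;
    while (v[i] < testval) ++i; // guaranteed to stop: the sentinel is never < testval
    return v[i] == testval;     // the single equality test
}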
Of course, you don't tell us much about sumVectorC, or the rest of A for that matter, so it is hard to be certain and give really good advice. For example, if sumVectorC is never updated then it is probably possible to find an EXTREMELY cheap hash (e.g. cast to ULL and bit extraction) that is perfect on the sumVectorC values and that fits into double[8]. Then the overhead is a bit extraction and 1 comparison versus 3 or 6.
Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems not bad), you could consider using an aggregated array, or even just a boost::array<double, 8> with your own dynamic size, e.g.:
class AggregatedArray : public boost::array<double, 8>{
    size_t _size;
public:
    size_t size() const {
        return _size;
    }
    ....
    push_back(..){...
    pop(){...
    resize(...){...
};
This gets rid of the extra cache-line access for the separately allocated array data of sumVectorC.
If sumVectorC updates very infrequently, and finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can afford to redo it whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant - it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield considerably worse performance, as pretty much each node in the tree will be in a separate cache line.
With 5 entries in a sorted array, a search accesses (4*1 + 1*2)/5 = 1.2 cache lines on average (the first 4 entries are in the first cache line, the 5th in the second). With a 5-node tree, it accesses (1 + 2*2 + 2*3)/5 = 11/5 = 2.2 cache lines on average (1 access for the root node, 2 for each of its two children, 3 for each of the two grandchildren), plus roughly 1 more for the tree bookkeeping itself: about 3.2 cache lines per search.
So I would predict using a std::map to take roughly 3.2/1.2 = 267% as long for a sumVectorC having 5 entries.
This is what I meant when I said: "It is an engineering problem and not a theoretical one."