At what point does an std::map make more sense for grouping objects compared to two vectors and a linear search?

I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seem to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.

Instead of using either of the approaches you mention directly, I suggest you evaluate std::unordered_map for your use case. It is a hash table that stores elements in buckets selected by the hash of the key, and it has average constant complexity for search, insertion and removal.
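As a rough illustration of that suggestion, here is a minimal sketch of the grouping pass with std::unordered_map. The Object struct, its payload field and the name groupByIdentifier are placeholders invented for this example; the question does not say what the objects or their identifiers look like, so an int identifier is assumed.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical object type: `identifier` is the value used to classify it,
// `payload` stands in for whatever else the real objects carry.
struct Object {
    int identifier;
    double payload;
};

using Groups = std::unordered_map<int, std::vector<Object>>;

// Groups all objects by identifier in a single pass.
// Each lookup/insertion is O(1) on average.
Groups groupByIdentifier(const std::vector<Object>& objects) {
    Groups groups;
    // There can never be more groups than objects, so reserving this many
    // buckets up front avoids rehashing while the map is being filled.
    groups.reserve(objects.size());
    for (const Object& obj : objects) {
        // operator[] creates an empty vector the first time a key is seen.
        groups[obj.identifier].push_back(obj);
    }
    return groups;
}
```

Afterwards the groups can be visited with a range-based for loop over the map; the iteration order is unspecified, which should not matter if only the grouping is needed.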

Related

Unordered map vs vector

I'm building a small 2D game engine. I need to store the prototypes of the game objects (all kinds of information). The container will hold at most a few thousand elements, I guess, each with a unique key, and no elements will be added or deleted after the initial load. The key is a string.
Several threads will run, and I need to send each of them a key (or index) with which it can access other information (such as a texture for the render process or a sound for the mixer process) that is available only to that thread.
Normally I use vectors because they are much faster at accessing a known element. But I see that an unordered map also usually has constant-time access if I use the ::at element access. It would make the code much cleaner and easier to maintain, because I would be dealing with much more understandable human-readable strings.
So the question is: is the difference in speed between accessing vector[n] and unorderedmap.at("string") negligible compared to the benefits?
As I understand it, being able to access various maps in different parts of the program, from different threads, with just a "name" is a big deal for me, and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I have found information about it, I can't really tell whether I'm right or wrong.
Thank you for your time.

As an alternative, you could consider using a sorted vector, because the vector itself will not be modified. You can easily write an implementation yourself with the STL's lower_bound etc., or use an implementation from a library (boost::flat_map).
There is a blog post from Scott Meyers about container performance in this case. He ran some benchmarks, and the conclusion is that an unordered_map is probably a very good choice, with a high chance of being the fastest option. If you have a restricted set of keys, you can also generate a perfect hash function, e.g. with gperf.
However, for this kind of problem the first rule is to measure it yourself.
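To make the sorted-vector alternative concrete, here is a minimal sketch built on std::sort and std::lower_bound. The Prototype struct, its textureId field and the function names are placeholders invented for this example, not part of the question.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical prototype record; `textureId` stands in for the real data.
struct Prototype {
    std::string key;
    int textureId;
};

// Sort once after loading; the vector is not modified afterwards.
void sortByKey(std::vector<Prototype>& protos) {
    std::sort(protos.begin(), protos.end(),
              [](const Prototype& a, const Prototype& b) { return a.key < b.key; });
}

// Binary search in O(log n). Returns nullptr if the key is not present.
// Precondition: `protos` has been sorted with sortByKey.
const Prototype* findByKey(const std::vector<Prototype>& protos,
                           const std::string& key) {
    auto it = std::lower_bound(
        protos.begin(), protos.end(), key,
        [](const Prototype& p, const std::string& k) { return p.key < k; });
    return (it != protos.end() && it->key == key) ? &*it : nullptr;
}
```

Because the vector never changes after loading, pointers returned by findByKey stay valid and can be read from several threads without extra synchronization.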

My problem was to find a record in a container using a std::string key. Only keys that EXIST are looked up (not finding one was not an option), and the elements of the container are generated only at the beginning of the program and never touched thereafter.
I was very afraid that unordered map would not be fast enough. So I tested it, and I want to share the results, hoping I have not gotten everything wrong.
I hope this can help others like me, and that I get some feedback, because in the end I'm a beginner.
So, given a record struct filled randomly like this:
struct The_Mess
{
    std::string A_string;
    long double A_ldouble;
    char C[10];
    int* intPointer;
    std::vector<unsigned int> A_vector;
    std::string Another_String;
};
I made an unordered map, given that A_string contains the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector that I sort by the A_string value (which contains the key):
std::vector<The_Mess> The_Vector;
along with a sorted index vector, used as a third way to access the records:
std::vector<std::string> index;
The key is a random string of 0-20 characters in length (I wanted the worst possible scenario) containing upper- and lower-case letters, digits, or spaces.
So, in short, our contenders are:
Unordered map: I measure the time the program takes to execute:
record = The_UnOrdMap.at( key ); // record is just a The_Mess struct
Sorted vector, measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The time is in nanoseconds and is the arithmetic mean of 200 iterations. I have a vector containing all the keys that I shuffle at every iteration, and at every iteration I cycle through it and look up each key in the three ways.
So these are my results:
I think the initial spikes are a fault of my testing logic (the table I iterate over contains only the keys generated so far, so it only has 1 to n elements): 200 iterations of a 1-key search the first time, 200 iterations of a 2-key search the second time, and so on.
Anyway, it seems that in the end the best option is the unordered map, considering that it takes a lot less code, is easier to implement, and will make the whole program much easier to read and probably to maintain and modify.

You have to think about caching as well. With std::vector you get very good cache performance when accessing the elements: when the CPU accesses one element in RAM, it caches nearby memory, and this includes nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. std::map is usually implemented as a self-balancing binary search tree, so its values can be scattered around RAM. This hurts cache performance badly, especially as the map gets bigger, because the CPU cannot have already cached the memory that you are about to access.
You'll have to run some tests and measure, but cache misses can greatly hurt the performance of your program.

You will most likely get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree. The underlying data structure is a vector. As a result, cache locality does not matter here; it is the same as for a vector. Granted, you are going to suffer if you have collisions because your hash function is bad. But if your key is a simple integer, this is not going to happen. As a result, access to an element in the hash map will be exactly the same as access to an element in the vector, plus the time spent computing the hash value for the integer, which is really not measurable.

Chained hash table keys with universal hashing: does it need a rehash?

I am implementing a chained hash table using a vector<list>. I resized my vector to a prime number, let's say 5. To choose the key I am using universal hashing.
My question is, do I need to rehash my vector? I mean, this code will always generate a key in the range 0 to 4 because it depends on the size of my hash table. That causes collisions, of course, but the new strings will all be added to the lists at each position in the vector... so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?

Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.

do I need to rehash my vector?
Your container could continue to function without rehashing, but searches, insertions and erasures will perform more and more like a plain list instead of a hash table. For example, if you've inserted 10,000 elements you can expect each list in your vector to have roughly 2,000 elements, and you may have to search all 2,000 to see whether a value you're considering inserting is a duplicate, to find a value to erase, or simply to return an iterator to one. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.
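To illustrate what growing and rehashing involves, here is a minimal sketch of a chained hash set of strings. It is not the asker's code: std::hash stands in for the universal hash function from the question, the class name ChainedSet is invented for this example, and the growth rule (2n + 1 buckets) is a simplification, since a real table would typically move to the next suitable prime.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Minimal chained hash set of strings (illustrative only).
class ChainedSet {
public:
    explicit ChainedSet(std::size_t buckets = 5) : table_(buckets) {}

    bool contains(const std::string& s) const {
        const auto& chain = table_[index(s, table_.size())];
        for (const auto& e : chain) {
            if (e == s) return true;
        }
        return false;
    }

    void insert(const std::string& s) {
        if (contains(s)) return;
        // Grow before the average chain length exceeds 1.
        if (size_ + 1 > table_.size()) rehash(table_.size() * 2 + 1);
        table_[index(s, table_.size())].push_back(s);
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t bucketCount() const { return table_.size(); }

private:
    // std::hash is an assumption standing in for the universal hash function.
    static std::size_t index(const std::string& s, std::size_t buckets) {
        return std::hash<std::string>{}(s) % buckets;
    }

    // The bucket of an element depends on the table size, so after resizing
    // every element has to be moved to the bucket it now belongs to.
    void rehash(std::size_t newBuckets) {
        std::vector<std::list<std::string>> fresh(newBuckets);
        for (auto& chain : table_) {
            for (auto& e : chain) {
                const std::size_t i = index(e, newBuckets);
                fresh[i].push_back(std::move(e));
            }
        }
        table_ = std::move(fresh);
    }

    std::vector<std::list<std::string>> table_;
    std::size_t size_ = 0;
};
```

With this growth rule the chains stay short on average, so lookups remain close to O(1) instead of degrading towards a linear scan of one fifth of the elements.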

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some opinions on the performance of the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add and delete deques as data is added or deleted. Probably less memory is used, but maybe it is slower? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, i.e. object types ObjA, ObjB, ObjC per grid point; does my data then essentially become 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, after reading about nearest neighbour approaches. Since my data is so regular, I thought I'd skip all the tree-building and dividing of cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask here about the best way to store this grid.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way to build the above-mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a little bit like the chunks that games like Minecraft use, which is a little like using a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.

Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or an array if you will. std::vector brings easier maintenance at minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, nest the loops in the order the indices are declared, so that the last index varies in the innermost loop and consecutive accesses touch adjacent memory.
B) If you instead wish to keep things as small as possible, use a hash map, and use one that is optimized for space. That results in O(N) space, with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
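For option B, the question names Pos3D, Pos3DHash and Pos3DEqual but does not show them. The following is a minimal sketch of what they might look like with std::unordered_map; the field layout, the placeholder Obj type, the helper functions and the hash-mixing formula (written in the style of boost::hash_combine) are all assumptions made for this example.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>

struct Obj { int id; };          // placeholder for the real object type
struct Pos3D { int x, y, z; };   // assumed layout of a grid position

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

// Combines the hashes of the three coordinates into one value.
struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        std::size_t h = std::hash<int>{}(p.x);
        h ^= std::hash<int>{}(p.y) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int>{}(p.z) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

using SparseGrid =
    std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual>;

// Adds an object; the deque for `pos` is created on first use.
void addObject(SparseGrid& grid, const Pos3D& pos, const Obj& obj) {
    grid[pos].push_back(obj);
}

// Removes the most recently added object at `pos` and drops the deque
// once it is empty, so only occupied grid points use memory.
void removeLast(SparseGrid& grid, const Pos3D& pos) {
    auto it = grid.find(pos);
    if (it == grid.end()) return;
    it->second.pop_back();
    if (it->second.empty()) grid.erase(it);
}
```

Extending Pos3D with a bucket number, as discussed for the second question, would only require adding the extra field to the equality and hash functors.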
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed but large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a k-d tree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is used by the Point Cloud Library, which handles many tasks on multidimensional data and could be worth a look as well.
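As for the associated question about "all objects with keys between 780 and 790" in a map<int, Obj>: since std::map keeps its keys ordered, its lower_bound and upper_bound member functions can delimit such a range without building a separate tree. A minimal sketch, with a placeholder Obj type and a function name chosen for this example:

```cpp
#include <map>
#include <string>
#include <vector>

struct Obj { std::string name; };  // placeholder for the real object type

// Collects all values whose key k satisfies lo <= k <= hi.
// Finding both ends is O(log n); the walk is linear in the result size.
std::vector<Obj> objectsInRange(const std::map<int, Obj>& m, int lo, int hi) {
    std::vector<Obj> out;
    if (lo > hi) return out;             // empty range, nothing to collect
    auto first = m.lower_bound(lo);      // first element with key >= lo
    auto last = m.upper_bound(hi);       // first element with key > hi
    for (auto it = first; it != last; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

This only helps for a one-dimensional key; a range over x, y and z at once still calls for the grid or tree structures discussed above.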

How to make a fast search for an object with a particular value in a vector of structs or classes? c++

If I have thousands of struct or class objects in a vector, how can I quickly find the ones I need?
For example:
I'm making a game, and I need the fastest way of doing collision detection. Each tile is a struct, and there are many tiles in the vector map, each with the values x and y.
So basically I do:
for (i = 0; i < /* number of tiles in the vector */; i++)
{
    // check whether this tile has x == 100 and y == 200
}
So maybe there is a different way, like smart pointers or something, to search for particular objects faster?

You should sort your vector and then use standard library algorithms such as binary_search, lower_bound, or upper_bound.
That gives you better complexity than the O(n) of walking through the entire vector or using the standard library algorithm find.

I think you have to go deeper than a simple search for a value inside a group of structs, even more so if you are planning to search among a large number of them.
How are the structs generated, how are they collected, and how do you keep track of them? Is there a common key that you can use to order them as you create them?
You should focus on keeping them sorted as you add them to the whole structure; that way you avoid a massive burst of computation every time you have to perform a search. Choose a good data structure (for example an AVL tree); that way you get O(log(n)) insertion, deletion and search.

A vector is just an unordered collection of objects. There is not really any way to do what you are asking unless you start sorting your vector in specific ways (e.g. if it is sorted, you can jump to the middle of the vector and halve the remaining search range at each step).
You may be better off picking a different data structure (either instead of the vector or in combination with it)

For example:
for_each(v.begin(), v.end(), [](int e)
{
    if (e % 2 != 0) // odd elements, i.e. not divisible by 2
        cout << e << endl;
});

C++ map really slow?

I've created a DLL for GameMaker. The DLL's arrays were really slow, so after asking around a bit I learnt I could use maps in C++ and make a DLL.
Anyway, I'll represent what I need to store as a 3D array:
information[id][number][number]
The id corresponds to an object's id. The first number field ranges from 0 to 3, and each number represents a different setting. The second number field represents the value of the setting in number field 1.
so..
information[101][1][4];
information[101][2][4];
information[101][3][4];
This would translate to "object with id 101 has a value of 4 for settings 1, 2 and 3".
I did this to try and copy it with maps:
//declared as a class member
map<double, map<int, double>*> objIdMap;
///// lower down the page, in some function
map<int, double> objSettingsMap;
objSettingsMap[1] = 4;
objSettingsMap[2] = 4;
objSettingsMap[3] = 4;
map<int, double>* temp = &objSettingsMap;
objIdMap[id] = temp;
So the first map, objIdMap, stores the id as the key and a pointer to another map, which stores the number representing the setting as the key and the value of the setting as the value.
However, this is for a game, so new objects with their own ids and settings might need to be stored (sometimes a hundred or so new ones every few seconds), and the existing ones constantly need to retrieve their values at every step of the game. Are maps not able to handle this? I had a very similar thing going with GameMaker's arrays and it worked fine.

Do not use doubles as the key of a map.
Use a floating-point comparison function if you want to compare two doubles.

1) Your code is buggy: you store a pointer to a local object, objSettingsMap, which will be destroyed as soon as it goes out of scope. You must store a map object, not a pointer to it, so that the local map is copied into the outer map.
2) Maps can become arbitrarily large (I have maps with millions of entries). If you need speed, try hash maps (std::unordered_map, part of C++0x, but also available from other sources), which are considerably faster. But adding a few hundred entries each second shouldn't be a problem. Before worrying about execution speed, you should always use a profiler.
3) I am not really sure that your nested structures MUST be maps. Depending on how many settings you have and what values they may take, a struct, a bitfield or a vector might be more appropriate.
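Point 1 can be sketched as follows: the outer map owns the inner maps by value, so nothing points at a destroyed local. The sketch also assumes an integer object id as the key rather than the double from the question; the type aliases and function names are invented for this example.

```cpp
#include <map>

using SettingsMap = std::map<int, double>;          // setting number -> value
using ObjectSettings = std::map<int, SettingsMap>;  // object id -> its settings

// Stores a value; the inner map is created inside the outer map on first
// use, so its lifetime is tied to `all` and not to any local variable.
void setSetting(ObjectSettings& all, int id, int setting, double value) {
    all[id][setting] = value;
}

// Reads a value without inserting anything; returns `fallback` if the
// object or the setting does not exist.
double getSetting(const ObjectSettings& all, int id, int setting,
                  double fallback) {
    auto obj = all.find(id);
    if (obj == all.end()) return fallback;
    auto s = obj->second.find(setting);
    return s == obj->second.end() ? fallback : s->second;
}
```

For example, setSetting(info, 101, 1, 4) followed by getSetting(info, 101, 1, 0) would yield 4 for an ObjectSettings variable named info.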

If you need really fast associative containers, learn about hashing. Maps are 'fast enough' but not brilliant in some cases.
Try to analyze the structure of the objects you need to store. If the fields are fixed, I'd recommend not using nested maps. At all. Maps are usually intended for an 'average' number of keys. For a low number, simple lists are more effective because their insert and erase operations are cheaper. For a great number of keys you really need to think about hashing.
Don't forget about memory. std::map is a highly dynamic template, so when you store small objects you lose tons of memory to dynamic allocation. Is that what you are really expecting? I was once involved in removing a use of std::map, which cut memory requirements roughly in half.
If you only need to fill the map at startup and then only search for elements (no need to change the structure), I'd recommend a simple std::vector, sorted after all the elements have been inserted. Then you can just use binary search (as you have a sorted vector). Why? std::vector is a much more predictable thing. Its biggest advantage is a contiguous memory area.
