chained hash table keys with universal hasing,does it need a rehash?

chained hash table keys with universal hasing,does it need a rehash? - c++

I am implementing a chained hash table using a vector < lists >. I resized my vector to a prime number, let's say 5. To choose the key I am using the universal hasing.
My question is, do I need to rehash my vector? I mean this code will generate always a key in a range between 0 and 5 because it depends from the size of my hashtable, causing collisions of course but the new strings will be all added in the lists of every position in the vector...so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?

Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.

do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, insertions and erasing will perform more and more like a plain list instead of a hash table: for example, if you've inserted 10,000 elements you can expect each list in your vector to have roundly 2000 elements, and you may have to search all 2000 to see if a value you're considering inserting is a duplicate, or to find a value to erase, or simply return an iterator to. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.

Related

Returning zero when the key is not exist in unordered_map

I have the following container:
std::unordered_map<uint8_t,int> um;
um is assumed to have keys between 0 and 255 but not all of them. So, in certain point of time I want to ask it to give me the value of the key 13 for example. If it was there, I want its value (which is guaranteed to be not 0). If not, I want it to return 0.
What is the best way (performance point of view) to implement this?
What I tried till now: use find and return 0 if it was not found or the value if it was found.
P.S. Changing to std::vector<int> that contains 256 items is not an option. I can not afford the space to storing 256 values always.
EDIT:
My problem is histogram computing problem keys (colors 0-255) values(frequent, int is enough). I will not be satisfied if I just know that some key is exist or not. I also need the value (the frequent).
Additional information:
I will never erase any item.
I will add items sometimes (at most 256 items) and usually less than 10.
I will query on key so so many times.
Usually querying and inserting come with no specific order.

You have a trade-off between memory and speed.
Your unordered_map should have the less speed complexity.
Using std::vector<std::pair<uint8_t, int>> would be more compact (and more cache friendly).
std::pair<std::vector<uint8_t>, std::vector<int>> would be even more compact (no padding between uint8_t and int)
You can even do better by factorizing size/capacity, but it is no longer in std::.
With vector, you have then an other trade of: complexity for searching and add key:
unsorted vector: constant add, Linear search
sorted vector: linear add (due to insert value in middle of vector), logarithmic search.

I might use a vector for space compactness.
It is tempting to keep it sorted for logarithmic search performance. But since the expected number of elements is less than 10, I might just leave it unsorted and use linear search.
So
vector<pair<uint8_t, int>> data;
If the number of expected elements is large, then having a sorted vector might help.
Boost offers a map-like interface with vector-like layout. See boost flat_map at http://www.boost.org/doc/libs/1_48_0/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx

At what point does an std::map make more sense for grouping objects compared to two vectors and a linear search?

I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seems to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.

Instead of using either of your mentioned approaches directly, I suggest you evaluate the use of std::unordered_map (docs here) for your use case. It uses maps with buckets and hashed values internally and has average constant complexity for search, insertion and removal.

Unordered map vs vector

I'm building a little 2d game engine. Now I need to store the prototypes of the game objects (all type of informations). A container that will have at most I guess few thousand elements all with unique key and no elements will be deleted or added after a first load. The key value is a string.
Various threads will run, and I need to send to everyone a key(or index) and with that access other information(like a texture for the render process or sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster to accessing a known element. But I see that unordered map also usually have a constant speed if I use the ::at element access. It would make the code much cleaner and also easier to maintain because I will deal with much more understandable man made strings.
So the question is, the difference in speed between a access to a vector[n] compared to a unorderedmap.at("string") is negligible compared to his benefits?
From what I understand accessing various maps in different part of the program, with different threads running just with a "name" for me is a big deal and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I found informations about it seem I can't really understand if I'm right or wrong.
Thank you for your time.

As an alternative, you could consider using an ordered vector because the vector itself will not be modified. You can easily write an implementation yourself with STL lower_bound etc, or use an implementation from libraries ( boost::flat_map).
There is a blog post from Scott Meyers about container performance in this case. He did some benchmarks and the conclusion would be that an unordered_mapis probably a very good choice with high chances that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal optimal hash function, e.g. with gperf
However, for these kind of problems the first rule is to measure yourself.

My problem was to find a record on a container by a given std::string type as Key access. Considering Keys that only EXISTS(not finding them was not a option) and the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears unordered map was not fast enough. So I tested it, and I want to share the results hoping I've not mistaken everything.
I just hope that can help others like me and to get some feedback because in the end I'm beginner.
So, given a struct of record filled randomly like this:
struct The_Mess
{
std::string A_string;
long double A_ldouble;
char C[10];
int* intPointer;
std::vector<unsigned int> A_vector;
std::string Another_String;
}
I made a undordered map, give that A_string contain the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector I sort by the A_string value(which contain the key):
std::vector<The_Mess> The_Vector;
with also a index vector sorted, and used to access as 3thrd way:
std::vector<std::string> index;
The key will be a random string of 0-20 characters in lenght(I wanted the worst possible scenario) containing letter both capital and normal and numbers or spaces.
So, in short our contendents are:
Unordered map I measure the time the program get to execute:
record = The_UnOrdMap.at( key ); record is just a The_Mess struct.
Sorted Vector measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The time is in nanoseconds and is a arithmetic average of 200 iterations. I have a vector that I shuffle at every iteration containing all the keys, and at every iteration I cycle through it and look for the key I have here in the three ways.
So this are my results:
I think the initials spikes are a fault of my testing logic(the table I iterate contains only the keys generated so far, so it only has 1-n elements). So 200 iterations of 1 key search for the first time. 200 iterations of 2 keys search the second time etc...
Anyway, it seem that in the end the best option is the unordered map, considering that is a lot less code, it's easier to implement and will make the whole program way easier to read and probably maintain/modify.

You have to think about caching as well. In case of std::vector you'll have very good cache performance when accessing the elements - when accessing one element in RAM, CPU will cache nearby memory values and this will include nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. Maps are usually implemented as self balancing binary-search trees, and in this case values can be scattered around the RAM. This imposes great hit on cache performance, especially as maps get bigger and bigger as CPU just cannot cache the memory that you're about to access.
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.

You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree. The underlying data structure is a vector. As a result, cache locality does not matter here - it is the same as for vector. Granted, you are going to suffer if you have collissions due to your hashing function being bad. But if your key is a simple integer, this is not going to happen. As a result, access to to element in hash map will be exactly the same as access to the element in the vector with time spent on getting hash value for integer, which is really non-measurable.

std::map's behavior on referring to a key

I am writing a program for numerical simulation by using std::map to store some key-value pairs. The map is used as storing the states evoluted during the simulation. The type of the key is a integer and the value of corresponds to the key tells how many copies are there for the same keys, i.e. std::map. For each step of the simulation, I need to calculate how many values are there for the same key, so I will check that by the following code
if (map[key]>0) {do something here with the number of copies}
However, I soon find that this code doesn't work because even there is no such key in the map, whenever you call the map[key], it will generate a placeholder for that key and set the value as zero; therefore, I always overcount the total number of keys by std::map.size(). I later change the code as follow to search the key instead
if (map.find(key)!=map.end()) {...}
So is it the only and fastest way to check if a key exists or not for a map? I am going to run the simulation for hundreds millions times and it will call above code very often to check the key. Will it be too slow to use map.find() instead? Thanks.

The find member function is probably the fastest way to find whether a key is already in the map. That said, if you don't need to iterate over items in the map in order, you might get better performance with an std::unordered_map instead.

In a std::map or hashtable (std::unordered_map), the find function is very fast, as fast as using the [] subscripting operator. In fact, it's faster when the element is not found, because it doesn't have to insert one.

I don't think there is much difference in speed for various ways to check for existence of key. On the other hand: if your keys are integers and range is known, you might just use the array.
BTW:
I got interested about the speed of simple array, vector, map and unordered map. I have written simple program, that does 100000000 container[n]++, where n is a random number in range of 0 to 10000. The results:
array: 1.27s
vector: 1.36s
unordered map: 2.6s
map: 11.6s
The overhead of loop + index calculation in this simple case is ~0.8s.
So it all depends on how much time is spent elsewhere. If it's considerably more (per 100000000 iterations) then it does not matter much what you use. But if it's not, it can be quite a difference.

you can use hash_map, it is the fastest data structures for your key-value type;
also you can use map,but it is slower than hash_map

How to store a sequence of timestamped data?

I have an application that need to store a sequence of voltage data, each entry is something like a pair {time, voltage}
the time is not necessarily continuous, if the voltage doesn't move, I will not have any reading.
The problem is that i also need to have a function that lookup timestamp, like, getVoltageOfTimestamp(float2second(922.325))
My solution is to have a deque that stores the paires, then for every 30 seconds, I do a sampling and store the index into a map
std::map,
so inside getVoltageOfTimestamp(float2second(922.325)), I simply find the nearest interval_of_30_seconds to the desired time, and then move my pointer of deque to that corresponding_index_of_deque, iterate from there and find the correct voltage.
I am not sure whether there exist a more 'computer scientist' solution here, can anyone give me a clue?

You could use a binary search on your std::deque because the timestamps are in ascending order.
If you want to optimize for speed, you could also use a std::map<Timestamp, Voltage>. For finding an element, you can use upper_bound on the map and return the element before the one found by upper_bound. This approach uses more memory (because std::map<Timestamp, Voltage> has some overhead and it also allocates each entry separately).

Rather then use a separate map, you can do a binary search directly on the deque to find the closet timestamp. Given the complexity requirements of a std::map, doing a binary search will be just as efficient as a map lookup (both are O(log N)) and won't require the extra overhead.

Do you mind using c++ ox conepts ? If not deque<tuple<Time, Voltage>> will do the job.

One way you can improve over binary search is to identify the samples of your data. Assuming your samples are every 30 milliseconds, then in vector/list store the readings as you get them. In the other array, insert the index of the array every 30 seconds. Now given a timestamp, just go to the first array and find the index of the element in the list, now just go there and check few elements preceding/succeeding it.
Hope this helps.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js