I am creating a genetic algorithm that classifies a set of data, for which I need to generate random sequences of 1s, 0s and 2s to define a rule.
A 2 represents being in either of the two states, 1 or 0. I am trying to use the STL std::map to map each rule in the randomly generated set to its output. I need the map keys to be dynamic, changing every iteration/generation, so the map can be filled with the new population of rules.
I realise I have the option of using pointers, but that complicates my code and hurts readability.
The other option I know of is copying out each key and value and deleting the entry so it can be replaced by the new rule.
So, my questions are:
1) Is it better to use a vector and my own mapping algorithm? My only concern is efficiency and speed, as I will be dealing with 2000 or more data points.
2) Are there any other STL containers I can use? No libraries that I need to download, please.
3) Shall I just use std::map and keep resetting its elements each time so I can initialise them again?
Which method would be most effective?
I'm open to any other suggestions or advice.
If you don't want the data to be in sorted order, you could consider std::unordered_map<>.
Have a look at this for some benchmarks.
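Here's a minimal sketch of that idea, assuming each rule is encoded as a string of the characters '0', '1' and '2' (the rule length of 8 and the population size are illustrative); the map is simply cleared and refilled each generation:

```cpp
#include <iostream>
#include <random>
#include <string>
#include <unordered_map>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> digit(0, 2);

    // Maps each rule of the current generation to its score/output.
    std::unordered_map<std::string, double> scores;

    for (int gen = 0; gen < 3; ++gen) {
        scores.clear();                     // discard the old population
        for (int i = 0; i < 5; ++i) {       // fill with the new one
            std::string rule(8, '0');
            for (char& c : rule)
                c = static_cast<char>('0' + digit(rng));
            scores[rule] = 0.0;             // fitness computed later
        }
        std::cout << "generation " << gen << ": "
                  << scores.size() << " rules\n";
    }
}
```

Clearing and reusing one map avoids both the pointer bookkeeping and the copy-then-delete dance described above.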
Related
I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seem to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.
Instead of using either of your mentioned approaches directly, I suggest you evaluate std::unordered_map (docs here) for your use case. Internally it uses buckets indexed by hashed keys, and it has average constant complexity for search, insertion and removal.
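For illustration, a minimal sketch of that approach (the Object type and its identifier field are hypothetical stand-ins for your own classes):

```cpp
#include <unordered_map>
#include <vector>

struct Object {
    int identifier;   // whatever property you classify objects by
    // ... payload ...
};

// One pass over the input; each insertion is average O(1), so grouping
// 80,000 objects into ~20,000 groups stays close to linear overall.
std::unordered_map<int, std::vector<Object>>
group(const std::vector<Object>& input) {
    std::unordered_map<int, std::vector<Object>> groups;
    groups.reserve(20000);   // rough upper bound on the group count
    for (const Object& obj : input)
        groups[obj.identifier].push_back(obj);
    return groups;
}
```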
I am implementing a chained hash table using a vector<list>. I resized my vector to a prime number, let's say 5. To choose the bucket I am using universal hashing.
My question is: do I need to rehash my vector? This code will always generate an index between 0 and 4 because it depends on the size of my hash table, causing collisions of course, but new strings will simply be appended to the lists at each position of the vector... so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?
Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.
do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, insertion and erasure would perform more and more like a plain list rather than a hash table: for example, if you've inserted 10,000 elements you can expect each list in your vector to hold roughly 2,000 of them, and you may have to search all 2,000 to see whether a value you're about to insert is a duplicate, to find a value to erase, or simply to return an iterator to it. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still O(N).
Is this a mistake?
Yes, a fundamental one.
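To make the point concrete, here is a minimal sketch of growing such a table, with std::hash standing in for the universal hash family (every element must be re-inserted, because its bucket index depends on the table size):

```cpp
#include <functional>
#include <list>
#include <string>
#include <vector>

void rehash(std::vector<std::list<std::string>>& table, std::size_t new_size) {
    std::vector<std::list<std::string>> bigger(new_size);
    for (auto& bucket : table)
        for (auto& value : bucket) {
            // The index changes with the table size -- this is exactly
            // why a resize without a rehash breaks lookups.
            std::size_t idx = std::hash<std::string>{}(value) % new_size;
            bigger[idx].push_back(std::move(value));
        }
    table.swap(bigger);
}
```

A typical policy is to grow (ideally to the next prime) whenever the load factor, size divided by bucket count, climbs past some threshold such as 1.0.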
I have a very large array storing some numbers. My task is to determine efficiently whether a particular number exists in the array. Which algorithm and data structure should I go with?
A few assumptions:
Each number in the array is unique.
I am not concerned about where the data is found in the array; I just want to return true if it is found, false otherwise.
I will be using C++ as the programming language.
Please suggest.
Thanks
Constant time lookup with unordered_set.
There are also options such as bitsets. It depends on exactly how large "very large" is, and on how sparse the stored values are relative to how many of them there actually are.
It seems std::unordered_set is suitable for your requirement.
PS: Please remember that all elements in this set are immutable.
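A minimal sketch of the std::unordered_set approach (the element type long long and the sample values are assumptions; use whatever your numbers are):

```cpp
#include <unordered_set>
#include <vector>

int main() {
    std::vector<long long> numbers = {5, 17, 42, 1000000007LL};

    // Build once in O(N)...
    std::unordered_set<long long> lookup(numbers.begin(), numbers.end());

    // ...then each membership test is average O(1).
    bool found = lookup.count(42) != 0;   // true
    (void)found;
}
```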
A well-known way to check whether an element (number) is a member of a set (array) is to use a Bloom filter. It works well if the set changes over time or if there are set operations among sets. Bloom filters are easy to implement, and good implementations are available. Bear in mind, though, that they are probabilistic: a lookup can return a false positive, but never a false negative.
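For reference, a minimal Bloom filter sketch; the filter size, the number of hash functions, and the way the second hash is derived are all illustrative choices, not tuned values:

```cpp
#include <bitset>
#include <cstddef>
#include <functional>

class Bloom {
    static const std::size_t M = 1 << 20;   // bits in the filter
    static const int K = 4;                 // hash functions per element
    std::bitset<M> bits_;

    static std::size_t h(long long x, int i) {
        // Double hashing: combine two hashes as h1 + i*h2.
        std::size_t h1 = std::hash<long long>{}(x);
        std::size_t h2 = std::hash<long long>{}(~x);
        return (h1 + static_cast<std::size_t>(i) * h2) % M;
    }

public:
    void insert(long long x) {
        for (int i = 0; i < K; ++i) bits_.set(h(x, i));
    }
    bool maybe_contains(long long x) const {
        for (int i = 0; i < K; ++i)
            if (!bits_.test(h(x, i))) return false;   // definitely absent
        return true;   // present, or a false positive
    }
};
```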
If the set is static (i.e. it never changes), a good approach is a perfect hash function. It takes time to build, but it will outperform the general-purpose hash function used by std::unordered_set.
I am working on a piece of C/C++ code that needs to do something very similar to the Combine function in ArcGIS. See: http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Combining%20multiple%20rasters
The C++ code will read multiple very large raster data files (2 GB+) in chunks, find the unique combinations, and output a single map. For example, if there were 3 maps and <1,3,5> occurred, respectively, in the first cell of the three maps, then I want all subsequent instances of <1,3,5> to have the same key in the final output map.
What STL containers should I be using to store the maps? Reading the files in chunks certainly adds complexity to the project. The algorithm needs to be very fast, so I cannot use vectors, which have O(n) search complexity. Currently I'm thinking of an unordered_map of unordered_multimaps, but I am not sure whether this is correct and whether it will give me the performance I need.
Any thoughts?
Yes, std::map or std::unordered_map is the correct choice. unordered_map is faster and more compact if you don't need ordering. If you need something even faster, you can replace it with another map implementation.
Use something compact for the key, such as std::tuple or std::array.
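A minimal sketch of that combination-to-key step, assuming 3 input rasters with 32-bit cell values (std::unordered_map needs a hash functor for an array key, so a simple one is supplied):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Combo = std::array<std::int32_t, 3>;   // e.g. <1,3,5>

struct ComboHash {
    std::size_t operator()(const Combo& c) const {
        std::size_t h = 0;
        for (std::int32_t v : c)   // simple polynomial hash combine
            h = h * 1000003u + static_cast<std::uint32_t>(v);
        return h;
    }
};

// Returns the id for a combination, assigning a fresh one on first sight,
// so every later occurrence of the same combination gets the same key.
int id_for(std::unordered_map<Combo, int, ComboHash>& ids, const Combo& c) {
    return ids.emplace(c, static_cast<int>(ids.size())).first->second;
}
```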
Is there a way to write a simple hash table with strings as keys and frequencies as values, such that there are NO collisions? There will be no removal from the hash table, and if a key already exists, its frequency is simply updated (the counts added together).
I was thinking there might be an algorithm that can compute a unique number from the string, which would then be used as the index.
Yes, I am avoiding the use of all STL constructs, including unordered_map.
You can use any perfect hash generator, such as gperf.
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
PS: You'd possibly still want to use a map instead of a flat array/vector in case the mapped domain gets too big/sparse.
It really depends on what you mean by 'simple'.
std::map is a fairly simple class to use. Still, it uses a red-black tree, with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as a key and any type as a value. Most map classes use a similar implementation and avoid any sort of hashing functionality.
Hashing without collisions is not a trivial matter at all. Perhaps the simplest method is Pearson hashing.
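For the curious, a minimal sketch of Pearson hashing: the permutation table is just a shuffled 0..255, and for a fixed, known key set you would search over seeds for a permutation that maps every key to a distinct byte (i.e. a perfect hash for that set):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>
#include <random>
#include <string>

std::array<std::uint8_t, 256> make_table(unsigned seed) {
    std::array<std::uint8_t, 256> t;
    std::iota(t.begin(), t.end(), 0);                    // 0, 1, ..., 255
    std::shuffle(t.begin(), t.end(), std::mt19937(seed));
    return t;
}

std::uint8_t pearson_hash(const std::string& key,
                          const std::array<std::uint8_t, 256>& t) {
    std::uint8_t h = 0;
    for (unsigned char c : key)
        h = t[h ^ c];   // feed each byte through the permutation table
    return h;
}
```

Note this yields only an 8-bit hash, so collision-free behaviour is only achievable for key sets small enough to fit the 256 output values.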
It seems you have three choices:
Implement your own perfect hashing class. This would be a good-sized class with a lot of functionality and some fairly complex algorithms. I don't think this is simple.
Download and use an existing perfect hashing library. Of course, you then have to worry about deployability.
Use the STL's map class. It's built in, well-documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
If I may ask, why are you avoiding the STL?
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. But otherwise, what you ask is impossible.
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a table big enough to make the probability implied by the birthday paradox low enough to suit you. Then you just take n bits of output from SHA-1, and 2^n will be your table size.
I'm also wondering whether you could use a Bloom filter with actual counters instead of bits. Keep a list of all the words you've put into the filter and which entries they incremented (these will be the same each time), and you have yourself a gigantic linear system that you might be able to solve to recover the individual counts.
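If it helps, here is a sketch of that counter variant, usually called a counting Bloom filter (the sizes and the second-hash derivation are illustrative); the minimum of a word's counters is an upper bound on its true count, since collisions can only inflate counters:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CountingBloom {
    std::vector<std::uint32_t> counts_;
    int k_;   // number of counters touched per word

    std::size_t slot(const std::string& w, int i) const {
        std::size_t h1 = std::hash<std::string>{}(w);
        std::size_t h2 = (h1 * 0x9e3779b9u) | 1;   // derived second hash, odd
        return (h1 + static_cast<std::size_t>(i) * h2) % counts_.size();
    }

public:
    CountingBloom(std::size_t m, int k) : counts_(m, 0), k_(k) {}

    void add(const std::string& w) {
        for (int i = 0; i < k_; ++i) ++counts_[slot(w, i)];
    }

    // Never less than the true count of w; equal when w's counters
    // happen to be collision-free.
    std::uint32_t count_upper_bound(const std::string& w) const {
        std::uint32_t m = UINT32_MAX;
        for (int i = 0; i < k_; ++i)
            m = std::min(m, counts_[slot(w, i)]);
        return m;
    }
};
```

Recovering exact counts from the raw counters, as suggested, amounts to solving a linear system and only works when that system happens to be well-determined.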