I want to store strings and issue each with a unique ID number (an index would be fine). I would only need one copy of each string, and I require quick lookup. I check whether a string exists in the table often enough that I notice a performance hit. What's the best container to use for this, and how do I look up whether a string exists?
I would suggest tr1::unordered_map. It is implemented as a hash map, so it has an expected complexity of O(1) for lookups and a worst case of O(n). There is also a Boost implementation if your compiler doesn't support TR1.
#include <string>
#include <iostream>
#include <tr1/unordered_map>

using namespace std;

int main()
{
    tr1::unordered_map<string, int> table;
    table["One"] = 1;
    table["Two"] = 2;
    cout << "find(\"One\") == " << boolalpha << (table.find("One") != table.end()) << endl;
    cout << "find(\"Three\") == " << boolalpha << (table.find("Three") != table.end()) << endl;
    return 0;
}
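If you also need to issue the IDs, a common idiom (a sketch of my own, not something the snippet above does) is to use the table's current size as the next free ID:

#include <string>
#include <tr1/unordered_map>

// Hypothetical helper: returns the existing ID for s, or issues the
// next free one (the current table size). Each string is stored once.
int getId(std::tr1::unordered_map<std::string, int>& table,
          const std::string& s)
{
    std::tr1::unordered_map<std::string, int>::const_iterator it = table.find(s);
    if (it != table.end())
        return it->second;
    int id = static_cast<int>(table.size());
    table[s] = id;
    return id;
}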
Try std::map.
First and foremost, you must be able to quantify your options. You have also told us that the main usage pattern you're interested in is lookup, not insertion.
Let N be the number of strings you expect to have in the table, and let C be the average number of characters in any given string present in said table (or in the strings that are checked against the table).
In the case of a hash-based approach, for each lookup you pay the following costs:
O(C) - calculating the hash for the string you are about to look up
between O(1 x C) and O(N x C), where 1..N is the cost you expect from traversing the bucket based on hash key, here multiplied by C to re-check the characters in each string against the lookup key
total time: between O(2 x C) and O((N + 1) x C)
In the case of a std::map-based approach (which uses red-black trees), for each lookup you pay the following costs:
total time: between O(1 x C) and O(log(N) x C) - where O(log(N)) is the maximal tree traversal cost, and O(C) is the time that std::map's generic less<> implementation takes to recheck your lookup key during tree traversal
In the case of large values for N and in the absence of a hash function that guarantees less than log(N) collisions, or if you just want to play it safe, you're better off using a tree-based (std::map) approach. If N is small, by all means, use a hash-based approach (while still making sure that hash collision is low.)
Before making any decision, though, you should also check:
http://meshula.net/wordpress/?p=183
http://wyw.dcweb.cn/mstring.htm
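If you want to quantify the options on your own data, a rough micro-benchmark sketch like this can help (illustrative only; the key names, sizes, and the use of std::unordered_map, the C++11 name for tr1::unordered_map, are my assumptions):

#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// Illustrative micro-benchmark: fill both containers with the same
// strings, then time repeated lookups of one key in each.
template <class Table>
long long timeLookups(const Table& table, const std::string& key, int reps)
{
    auto start = std::chrono::steady_clock::now();
    int hits = 0;
    for (int i = 0; i < reps; ++i)
        hits += (table.find(key) != table.end());
    auto stop = std::chrono::steady_clock::now();
    std::cout << "(hits: " << hits << ") "; // keep the loop observable
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main()
{
    std::map<std::string, int> m;
    std::unordered_map<std::string, int> h;
    for (int i = 0; i < 100000; ++i) {
        std::string s = "key" + std::to_string(i);
        m[s] = i;
        h[s] = i;
    }
    std::cout << "map:           " << timeLookups(m, "key54321", 100000) << " us\n";
    std::cout << "unordered_map: " << timeLookups(h, "key54321", 100000) << " us\n";
}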
Are the strings to be searched available statically? If so, you might want to look at a perfect hash function, as sketched below.
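To illustrate the idea (a toy sketch of my own; tools such as gperf derive such functions automatically), here is a perfect hash for a fixed set of four strings. The function (first char + length) % 11 happens to be collision-free for this particular set; that property was found by hand:

#include <cstring>
#include <iostream>

// Table laid out so that each keyword sits at its own hash slot.
const char* const kTable[11] = {
    "three", nullptr, nullptr, nullptr, "one", nullptr, nullptr,
    "four", nullptr, "two", nullptr
};

std::size_t perfectHash(const char* s)
{
    return (static_cast<unsigned char>(s[0]) + std::strlen(s)) % 11;
}

bool contains(const char* s)
{
    const char* entry = kTable[perfectHash(s)];
    return entry && std::strcmp(entry, s) == 0;
}

int main()
{
    std::cout << std::boolalpha
              << contains("three") << ' '   // true
              << contains("seven") << '\n'; // false
}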
Sounds like an array would work just fine, where the ID is the index into the array. To check whether an entry exists, just make sure the index is within the bounds of the array and that its entry isn't NULL.
EDIT: if you sort the list, you could always use a binary search, which should give fast lookup.
EDIT: Also, if you want to search for a string, you can always use a std::map<std::string, int>. This should have decent lookup speed.
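A small sketch (my own illustration) of the sorted-container idea, where the element's position doubles as its ID:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Sort once, then use binary search for each lookup.
    std::vector<std::string> strings = {"pear", "apple", "orange"};
    std::sort(strings.begin(), strings.end());

    std::string needle = "orange";
    std::vector<std::string>::iterator it =
        std::lower_bound(strings.begin(), strings.end(), needle);
    if (it != strings.end() && *it == needle)
        std::cout << "found, ID = " << (it - strings.begin()) << '\n';
}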
Easiest is to use std::map.
It works like this:
#include <iostream>
#include <map>
#include <string>

using namespace std;

int main()
{
    map<string, int> myContainer;
    myContainer["foo"] = 5; // map string "foo" to id 5

    // Now check if "foo" has been added to the container:
    if (myContainer.find("foo") != myContainer.end())
    {
        // Yes!
        cout << "The ID of foo is " << myContainer["foo"] << endl;
    }

    // Let's get "foo" out of it
    myContainer.erase("foo");
}
Google sparse hash, maybe.
I solved a problem of finding duplicates in a list.
I used the property of a set that it contains only unique members:
set<int> s;
// inside a loop over nums:
size_t previousSize = s.size();
// insert the new item into the set
s.insert(nums[index]);
// if the size did not increase, there is a duplicate
if (s.size() == previousSize)
{
    DuplicateFlag = true;
    break;
}
Now I am trying to solve the same problem with hash functions from the Standard Library. I have sample code like this:
#include <functional>
#include <iostream>

using namespace std;

int main()
{
    hash<int> hash_fn2;
    int x = 34567672;
    size_t int_hash2 = hash_fn2(x);
    cout << x << " " << int_hash2 << '\n';
}
x and int_hash2 are always the same.
Am I missing something here?
For std::hash<int>, it's OK to directly return the original int value. The specification only requires that, for two different arguments k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) be very small, approaching 1.0/std::numeric_limits<size_t>::max(). Returning the original value clearly satisfies this requirement for std::hash<int>.
x and int_hash2 are always the same. Am I missing something here?
Yes. You say "I am trying to solve the same problem with hash functions", but hash functions are not functional alternatives to std::set<>s and cannot, by themselves, be used to solve your problem. You probably want to use a std::unordered_set<>, which internally uses a hash table, using the std::hash<> function (by default) to help it map from elements to "buckets". For the purposes of a hash table, a hash function for integers that returns its input is usually good enough, and if it's not, the programmer is expected to provide a preferred alternative as a template parameter.
Anyway, all you have to do to try a hash-table approach is change std::set<int> s; to std::unordered_set<int> s; in your original code.
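As a side note, here is a sketch (my own, not your original code) that drops the size-comparison trick entirely: insert() already reports, via the second member of its return value, whether the element was newly added:

#include <iostream>
#include <unordered_set>
#include <vector>

int main()
{
    std::vector<int> nums = {1, 4, 2, 4, 3};
    std::unordered_set<int> s;
    bool duplicateFlag = false;
    for (int n : nums) {
        // insert() returns pair<iterator, bool>; false means the
        // element was already present, i.e. n is a duplicate.
        if (!s.insert(n).second) {
            duplicateFlag = true;
            break;
        }
    }
    std::cout << std::boolalpha << duplicateFlag << '\n'; // true
}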
I'm looking for a data structure (and a C++ implementation) that allows me to search efficiently for all elements having an integer value within a given interval. Example: say the set contains:
3,4,5,7,11,13,17,20,21
Now I want to know all elements from this set within [5,19]. So the answer should be 5,7,11,13,17.
For my usage, trivial search is not an option, as the number of elements is large (several million) and I have to do the search quite often. Any suggestions?
For this, you typically use std::set, that is, an ordered set which has a search tree built on top (at least, that's one possible implementation).
To get the elements in the queried interval, find the two iterators pointing at the first element you're looking for and one past the last. That's a use case for the member functions set::lower_bound and set::upper_bound, which here treat both interval limits as inclusive: [x,y]. (If you want the end to be exclusive, use lower_bound for the end as well.) Note that the free algorithms std::lower_bound/std::upper_bound only achieve logarithmic time on random-access iterators, so on a std::set prefer the member functions.
These member functions have logarithmic complexity in the size of the set: O(log n).
Note that you may also use a std::vector if you sort it before applying these operations (there the free algorithms are the right tool). This might be advantageous in some situations, but if you always want the elements sorted, use std::set, as it does that automatically for you.
Live demo
#include <iostream>
#include <set>

int main()
{
    // Your set (note that these numbers don't have to be given in order):
    std::set<int> s = { 3,4,5,7,11,13,17,20,21 };

    // Your query:
    int x = 5;
    int y = 19;

    // The iterators (member functions, so O(log n) on a std::set):
    auto lower = s.lower_bound(x);
    auto upper = s.upper_bound(y);

    // Iterating over the interval:
    for (auto it = lower; it != upper; ++it) {
        // Do something with *it, or just print *it:
        std::cout << *it << '\n';
    }
}
Output:
5
7
11
13
17
For searching within intervals like you mentioned, segment trees are a good fit. In competitive programming, many questions are based on this data structure.
One such implementation can be found here:
http://www.sanfoundry.com/cpp-program-implement-segement-tree/
You might need to modify the code to suit your question, but the basic implementation remains the same.
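To make the tree-over-the-value-domain idea concrete, here is a minimal sketch of my own using a Fenwick (binary indexed) tree, a close relative of the segment tree. It assumes the values are small non-negative integers, and it only counts, rather than reports, the elements in [x, y]:

#include <iostream>
#include <vector>

// Minimal Fenwick tree over the value domain: counts how many
// stored elements fall in a value interval [x, y].
struct Fenwick {
    std::vector<int> t;
    explicit Fenwick(int maxValue) : t(maxValue + 2, 0) {}
    void add(int v) {                      // record one occurrence of v
        for (int i = v + 1; i < (int)t.size(); i += i & -i) ++t[i];
    }
    int prefix(int v) const {              // count of values <= v
        int s = 0;
        for (int i = v + 1; i > 0; i -= i & -i) s += t[i];
        return s;
    }
    int countInRange(int x, int y) const {
        return prefix(y) - (x > 0 ? prefix(x - 1) : 0);
    }
};

int main()
{
    Fenwick f(21);                         // values up to 21 in the example
    for (int v : {3,4,5,7,11,13,17,20,21}) f.add(v);
    std::cout << f.countInRange(5, 19) << '\n'; // prints 5
}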
I have a set of strings and I need to find whether one specific string is in it. I need to do this only one time (the next time the strings are different).
I'm thinking of sorting the strings with bucket sort and then doing a binary search.
Time complexity: O(n+k) + O(log n)
Is there any faster/better solution?
By "set" I mean a number of strings, not std::set.
To summarize the comments above in an answer: if you are loading strings to be compared on the fly and do not need them to be in a specific order, then std::unordered_set is by far the fastest.
unordered_set is a hash set; it will punch your string through a hash function and find out whether it is already in the set in expected constant time, O(1).
If you need to retain the order of the elements, then it becomes a question of whether it is faster to keep a vector and do a linear search through it, or whether it is still worth building the hash set.
Code:
#include <iostream>
#include <string>
#include <unordered_set>

using namespace std;

int main()
{
    unordered_set<string> theSet;

    // Insert a few elements.
    theSet.insert("Mango");
    theSet.insert("Grapes");
    theSet.insert("Bananas");

    if ( theSet.find("Hobgoblins") == theSet.end() ) {
        cout << "Could not find any hobgoblins in the set." << endl;
    }
    if ( theSet.find("Bananas") != theSet.end() ) {
        cout << "But we did find bananas!!! YAY!" << endl;
    }
}
For comparison:
If you use a std::vector, you will need O(n) time to build the vector and then O(n) time to find an element.
If you use a std::unordered_set, you will still need O(n) time to build the set, but afterwards you can find an element in expected constant time, O(1).
There are two ways in which I can easily make a key-value association in the C++ STL: maps and sets of pairs. For instance, I might have
map<key_class,value_class>
or
set<pair<key_class,value_class> >
In terms of algorithm complexity and coding style, what are the differences between these usages?
They are semantically different. Consider:
#include <set>
#include <map>
#include <utility>
#include <iostream>

using namespace std;

int main() {
    pair<int, int> p1(1, 1);
    pair<int, int> p2(1, 2);

    set< pair<int, int> > s;
    s.insert(p1);
    s.insert(p2);

    map<int, int> m;
    m.insert(p1);
    m.insert(p2);

    cout << "Set size = " << s.size() << endl;
    cout << "Map size = " << m.size() << endl;
}
http://ideone.com/cZ8Vjr
Output:
Set size = 2
Map size = 1
Set elements cannot be modified while they are in the set. set's iterator and const_iterator are equivalent. Therefore, with set<pair<key_class,value_class> >, you cannot modify the value_class in-place. You must remove the old value from the set and add the new value. However, if value_class is a pointer, this doesn't prevent you from modifying the object it points to.
With map<key_class,value_class>, you can modify the value_class in-place, assuming you have a non-const reference to the map.
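A tiny sketch (illustrative, not from the answer above) of that difference:

#include <map>
#include <set>
#include <utility>

int main() {
    std::map<int, int> m = {{1, 1}};
    m[1] = 42;                       // value modified in place

    std::set<std::pair<int, int>> s = {{1, 1}};
    // s.find({1, 1})->second = 42;  // will not compile: set elements are const
    s.erase({1, 1});                 // instead: remove the old pair...
    s.insert({1, 42});               // ...and insert the updated one
}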
map<key_class,value_class> will sort on key_class and allow no duplicates of key_class.
set<pair<key_class,value_class> > will sort on key_class and then on value_class if the key_class instances are equal, and will allow multiple values for the same key_class.
The basic difference is that for the set the key is the pair, whereas for the map the key is key_class - this makes looking things up by key_class, which is what you want to do with maps, difficult for the set.
Both are typically implemented with the same data structure (normally a red-black balanced binary tree), so the complexity for the two should be the same.
std::map acts as an associative data structure. In other words, it allows you to query and modify values using its associated key.
A std::set<pair<K,V> > can be made to work like that, but you have to write extra code for the query using a key and more code to modify the value (i.e. remove the old pair and insert another with the same key and a different value). You also have to make sure there are no more than two values with the same key (you guessed it, more code).
In other words, you can try to shoe-horn a std::set to work like a std::map, but there is no reason to.
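To make that "extra code for the query" concrete, here is a small sketch of my own, with an int key and INT_MIN used as a sentinel for the lowest possible second component:

#include <climits>
#include <iostream>
#include <set>
#include <utility>

int main() {
    std::set<std::pair<int, int>> s = {{1, 5}, {2, 7}, {4, 9}};
    int key = 2;
    // Query by key alone: find the smallest pair whose first element
    // could be `key`, then check that we actually landed on that key.
    auto it = s.lower_bound({key, INT_MIN});
    if (it != s.end() && it->first == key)
        std::cout << "value for key " << key << " is " << it->second << '\n';
}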
To understand the algorithmic complexity, you first need to understand the implementation.
std::map is implemented using a red-black tree, whereas hash maps are implemented using an array of buckets (typically linked lists). std::map provides O(log(n)) insert/delete/search operations; a hash map is O(1) in the best case and O(n) in the worst case, depending on hash collisions.
Visualising the semantic difference Philipp mentioned after stepping through the code: note how the map key is a const int and how p2 was not inserted into m.
This is similar to a recent question.
I will be maintaining a sorted list of values. I will be inserting items of arbitrary value into the list. Each time I insert a value, I would like to determine its ordinal position in the list (is it 1st, 2nd, 1000th?). What is the most efficient data structure and algorithm for accomplishing this? There are obviously many algorithms that would allow you to do this, but I don't see any way to do it easily using plain STL or Qt template functionality. Ideally, I would like to know about existing open-source C++ libraries or sample code that can do this.
I can imagine how to modify a B-tree or similar structure for this purpose, but it seems like there should be an easier way.
Edit 3:
Mike Seymour pretty well confirmed that, as I wrote in my original post, there is indeed no way to accomplish this task using plain STL. So I'm looking for a good B-tree, balanced-tree, or similar open-source C++ template which can accomplish this without modification, or with the least modification possible. Pavel Shved showed this was possible, but I'd prefer not to dive into implementing a balanced tree myself.
(The history should show my unsuccessful efforts to modify Mathieu's code to be O(log N) using make_heap.)
Edit 4:
I still give credit to Pavel for pointing out that a B-tree can give a solution to this, but I have to mention that the simplest way to achieve this kind of functionality without implementing a custom B-tree C++ template of your own is to use an in-memory database. This would give you O(log n) and is fairly easy to implement.
A binary tree is fine for this. Its modification is easy as well: just keep in each node the number of nodes in its subtree.
After you have inserted a node, search for it again by walking from the root to that node, and update the index recursively (a C++ sketch follows the pseudocode):
if (traverse to left subtree)
    index = index_on_previous_stage;
if (traverse to right subtree)
    index = index_on_previous_stage + left_subtree_size + 1;
if (found)
    return index + left_subtree_size;
This will take O(log N) time, just like inserting.
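Here is a minimal C++ sketch of that idea (my own illustration; balancing is omitted, so a production version would rebalance, e.g. as a red-black tree, to guarantee the O(log N) bound):

#include <iostream>

// Each node stores the size of its subtree, so the rank (ordinal
// position) of a value can be computed during an ordinary BST walk.
struct Node {
    int value;
    int subtreeSize = 1;
    Node *left = nullptr, *right = nullptr;
    explicit Node(int v) : value(v) {}
};

int sizeOf(Node* n) { return n ? n->subtreeSize : 0; }

Node* insert(Node* root, int value) {
    if (!root) return new Node(value);
    if (value < root->value) root->left = insert(root->left, value);
    else                     root->right = insert(root->right, value);
    root->subtreeSize = 1 + sizeOf(root->left) + sizeOf(root->right);
    return root;
}

// 0-based ordinal position of value, assuming it is present.
int rankOf(Node* root, int value) {
    int index = 0;
    while (root) {
        if (value < root->value) {
            root = root->left;
        } else if (value > root->value) {
            index += sizeOf(root->left) + 1;
            root = root->right;
        } else {
            return index + sizeOf(root->left);
        }
    }
    return -1; // not found
}

int main() {
    Node* root = nullptr;                    // leaks ignored in this sketch
    for (int v : {10, 5, 20, 15}) root = insert(root, v);
    std::cout << rankOf(root, 15) << '\n';   // prints 2 (sorted: 5,10,15,20)
}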
I think you can use std::set here. It provides sorted storage and also returns an iterator to the position where the value was inserted. From that position you can get the index. For example:
std::set<int> s;
std::pair<std::set<int>::iterator, bool> aPair = s.insert(5);
size_t index = std::distance(s.begin(), aPair.first);
(Note, though, that std::distance is O(n) on std::set's bidirectional iterators, so computing the index this way is linear, not logarithmic.)
Note that the std::list insert(it, value) member function returns an iterator to the newly inserted element. Maybe it can help?
If, as you say in one of your comments, you only need an approximate ordinal position, you could estimate it from the range of values you already have. You only need to read the smallest and largest values in the collection, in constant time, something like this:
multiset<int> values;
values.insert(value);
int smallest = *values.begin();   // multiset has no front()/back()
int largest = *values.rbegin();   // guard against largest == smallest in real code
int ordinal = values.size() * (value - smallest) / (largest - smallest);
To improve the approximation, you could keep track of statistical properties (mean and variance, and possibly higher-order moments for better accuracy) of the values as you add them to the multiset. This will still be constant time per insertion. Here's a vague sketch of the sort of thing you might do:
#include <cmath>
#include <set>

using namespace std;

class SortedValues : public multiset<int>
{
public:
    SortedValues() : sum(0), sum2(0) {}

    int insert(int value)
    {
        // Insert the value and update the running totals
        multiset<int>::insert(value);
        sum += value;
        sum2 += value*value;

        // Calculate the mean and standard deviation
        // (variance = E[x^2] - mean^2).
        const float mean = float(sum) / size();
        const float deviation = sqrt(float(sum2)/size() - mean*mean);

        // This function is left as an exercise for the reader.
        return size() * EstimatePercentile(value, mean, deviation);
    }

private:
    int sum;
    int sum2;
};
If you want ordinal position, you want a container which models the RandomAccessContainer concept: basically, a std::vector.
Operations on a sorted std::vector are relatively fast. You can get to the position you wish using std::lower_bound or std::upper_bound. If you want to retrieve all equal values at once, a good way is std::equal_range, which gives you the same result as applying both bounds, but with better complexity.
Now, for the ordinal position, the great news is that std::distance has O(1) complexity on models of RandomAccessIterator.
#include <algorithm>
#include <iostream>
#include <vector>

typedef std::vector<int> ints_t;
typedef ints_t::iterator iterator;

int main()
{
    ints_t another(100, 50); // stand-in for the incoming values
    ints_t myInts;

    for (iterator it = another.begin(), end = another.end(); it != end; ++it)
    {
        int myValue = *it;
        iterator search = std::lower_bound(myInts.begin(), myInts.end(), myValue);
        // insert() invalidates `search`, so use the iterator it returns
        iterator pos = myInts.insert(search, myValue);
        std::cout << "Inserted " << myValue << " at "
                  << std::distance(myInts.begin(), pos) << "\n";
        // Not necessary to flush there, that would slow things down
    }

    // Find all values equal to 50
    std::pair<iterator,iterator> myPair =
        std::equal_range(myInts.begin(), myInts.end(), 50);
    std::cout << "There are " << std::distance(myPair.first,myPair.second)
              << " values '50' in the vector, starting at index "
              << std::distance(myInts.begin(), myPair.first) << std::endl;
}
Easy, isn't it?
std::lower_bound, std::upper_bound and std::equal_range have O(log(n)) complexity and std::distance has O(1) complexity, so everything there is quite efficient...
EDIT: as pointed out in the comments, inserting into the vector is actually O(n), since you have to move elements around.
Why do you need the ordinal position? As soon as you insert another item in the list, the ordinal positions of other items later in the list will change, so there doesn't seem to be much point in finding the ordinal position when you do an insert.
It may be better to simply append elements to a vector, sort it, and then use a binary search to find the ordinal position, but it depends on what you are really trying to achieve.
If you have the iterator to the item (as suggested by dtrosset), you can use std::distance (e.g. std::distance(my_list.begin(), item_it))
If you have an iterator that you want to find the index of, then use std::distance, which is either O(1) or O(n) depending on the container. However, the O(1) containers are going to have O(n) inserts, so overall you are looking at an O(n) algorithm with any STL container.
As others have said, it's not immediately obvious why this is useful.