I use typeid(ClassName).name() to get the name of a wide range of class types. However, I need to make its length fixed (e.g. 8 characters). In many cases the class is in a namespace, which makes the string quite long, and it does not work if I just take the first 10 characters.
Does anyone know a good way to encode/decode a string into a fixed-size string? I can't really keep a table mapping the hash_code to a name, since I'm going to send the string to another machine which does not have access to the map.
template <typename ClassType> const char* get_name() {
    return typeid(ClassType).name(); // ??
}
In general, it's not possible to build a collision-free function mapping arbitrary-length strings into a fixed-size domain. That would violate the pigeonhole principle.
The following suggestion seems to me fairly convoluted, but given the lack of larger context to your problem, here goes...
Suppose you build a class through which to run all your names, as so
class compressor {
public:
    explicit compressor(std::size_t seed);
    std::string operator()(const std::string &name) const;
};
It has two member functions: a ctor taking a seed, and an operator() taking a name string and returning an 8-char key string. In your code, initialize this object with some fixed, arbitrary seed.
Internally, the object should hold an unordered_map recording, for each distinct name it has been applied to, the key that name was mapped to. Initially, obviously, this internal unordered_map will be empty.
The class object should use a universal hash function, pseudo-randomly selected by the seed in the constructor. See the answer to this question on one way to create a universal hash function.
When the operator is called, it should check if the name is in the internal unordered_map. If so, return the key found for it. Otherwise, first use the hash function to calculate the key and place it in the internal unordered_map. When generating a new key, though, check if it collides with an existing key, and throw an exception if so.
Realistically speaking, since each distinct name corresponds to a place in your code where you call typeid, the number of distinct names, say n, should be in the 1000s at most. Set m to be the range possible with 8 characters (2^64).
The probability of a collision is ~n^2 / (2m), which should be tiny. Thus, most chances are that there will be no collisions, and no exception thrown. If one is thrown, though, change the seed, and build the program again. The expected number of times you'll have to do that (after the initial time) is close to 0.
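A minimal sketch of the scheme above, assuming a seeded FNV-1a-style hash; the mixing constants and the key encoding are illustrative choices, not prescribed by anything in the answer:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

class compressor {
public:
    explicit compressor(std::uint64_t seed) : seed_(seed) {}

    std::string operator()(const std::string &name) {
        auto it = keys_.find(name);
        if (it != keys_.end())
            return it->second;                 // name seen before: reuse its key

        std::string key = hash_to_key(name);
        if (!used_.emplace(key, name).second)  // key already taken by another name
            throw std::runtime_error("key collision; rebuild with a new seed");
        keys_.emplace(name, key);
        return key;
    }

private:
    // Seeded FNV-1a over the name, then 8 printable characters taken from the
    // 64-bit result. (Using the 8 raw bytes instead would keep the full 2^64
    // range the answer assumes; hex digits here just keep the key printable.)
    std::string hash_to_key(const std::string &name) const {
        std::uint64_t h = 1469598103934665603ull ^ seed_;
        for (unsigned char c : name) {
            h ^= c;
            h *= 1099511628211ull;
        }
        static const char digits[] = "0123456789abcdef";
        std::string key(8, '0');
        for (int i = 0; i < 8; ++i)
            key[i] = digits[(h >> (i * 8)) & 0xf];
        return key;
    }

    std::uint64_t seed_;
    std::unordered_map<std::string, std::string> keys_;  // name -> key
    std::unordered_map<std::string, std::string> used_;  // key  -> name
};
```

Given the same seed, the same name always yields the same 8-char key, so the receiving machine needs only the seed, not a shared table.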
Related
I have been pondering a data structure problem for a while, but can't seem to come up with a good solution. I can not shake off the feeling that the solution is simple and I'm just not seeing it, however, so hopefully you guys can help!
Here is the problem: I have a large collection of objects in memory. Each of them has a number of data fields. Some of the data fields, such as an ID, are unique to each object, but others, such as a name, can appear in multiple objects.
class Object {
size_t id;
std::string name;
Histogram histogram;
Type type;
...
};
I need to organize these objects in a way that will allow me to quickly (even if the number of objects is relatively large, i.e. millions) filter the collection given a specification of an arbitrary number of object members while all members that are left unspecified count as wildcards. For example, if I specify a given name, I want to retrieve all the objects whose name member equals the given name. However, if I then add a histogram to the query, I would like the query to return only the objects that match in both the name and the histogram fields, and so on. So, for example, I'd like a function
std::set<Object*> retrieve(size_t, std::string, Histogram, Type)
that can both do
retrieve(42, WILDCARD, WILDCARD, WILDCARD)
as well as
retrieve(42, WILDCARD, WILDCARD, Type_foo)
where the second call would return at most as many objects as the first one. Which data structure allows queries like this, and can be both constructed and queried in reasonable time for object counts in the millions?
Thanks for the help!
First, you could use Boost Multi-index to implement efficient lookup over different members of your Object. This can help limit the number of elements to consider. As a second step, you can simply use a lambda expression as a predicate for std::find_if to get the first matching element, or std::copy_if to copy all matching elements to a target sequence. If you decide to use Boost, you can also use Boost Range with filtering.
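A small sketch of that second step using only standard containers; the Object fields here are simplified stand-ins for the poster's class (Histogram omitted):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Simplified stand-in for the poster's Object.
struct Object {
    std::size_t id;
    std::string name;
    int type;  // stand-in for the Type field
};

// Apply the remaining (non-wildcard) criteria with std::copy_if and a lambda.
std::vector<Object> retrieve(const std::vector<Object> &candidates,
                             const std::string &name, int type) {
    std::vector<Object> out;
    std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(out),
                 [&](const Object &o) { return o.name == name && o.type == type; });
    return out;
}
```

In practice a multi-index (or a map keyed on the most selective field) would first narrow `candidates`, and the lambda then applies whatever criteria were not wildcards.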
I have two objects, Account and Transaction where Transaction is the unique pair of Account and an incrementing id number. I want to use boost::hash to get unique values for these and have overloaded the hash_value method per the instructions: http://www.boost.org/doc/libs/1_53_0/doc/html/hash/custom.html
class Account {
...
};
class Transaction
{
Account account;
unsigned int id;
};
Account's hash_value method works correctly, and the value returned is always unique for a given account; however, to make the unique pair, Transaction's method needs to use hash_combine (per boost's instructions):
inline std::size_t hash_value( const Account& acct )
{
boost::hash<int> hasher;
size_t rval = hasher( acct.id() ); //just an int. guaranteed to be unique
return rval;
}
inline std::size_t hash_value( const Transaction& t )
{
std::size_t seed = 0;
boost::hash_combine( seed, t.account );
boost::hash_combine( seed, t.id );
return seed;
}
This sometimes returns the same values for different inputs. Why?? I only have a few thousand accounts, and the id number only goes up to a few hundred thousand, so this doesn't seem like an upper-bound issue.
Does anyone know if this is a bug, or if I need to seed boost hash?
Thanks
Look up perfect hashing, and the birthday paradox, and for completeness's sake the pigeonhole principle.
What it boils down to is that hash functions generally do produce collisions, unless what you're hashing has very specific properties you've taken advantage of. Your chances of seeing a hash collision for any given set of keys are counterintuitively high, because that's one of the mathematical realities we're not wired for: with a 1/365 chance of getting any particular hash, your odds of a collision are 50/50 given just 23 keys.
Boost provides good generic hash functions because it makes no/few assumptions about the input and tries to be fast. In most cases you can make specific assumptions about the input to create a far better hash function than what you get from boost. For example, you can optimize a string hash function by assuming the string contains English text. By using assumptions you can make far better hash functions (as in: far fewer collisions). For example, if you need to merge two hash values that are each integers between 1 and 1000, it's obvious that you will not get collisions if you multiply one of them by 1000 and then add the other.
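That last example can be sketched directly; the function name and the [1, 1000] ranges are just the ones from the paragraph above:

```cpp
// Collision-free combination of two values each in [1, 1000]: the result
// decomposes uniquely back into (a, b), so distinct pairs never collide.
unsigned combine(unsigned a, unsigned b) {
    return a * 1000u + b;  // assumes 1 <= a, b <= 1000
}
```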
Be very careful when writing custom hash functions, because there is a clear disadvantage beyond the risk of getting it wrong: code robustness always suffers.
Example 1: You optimize a UTF-8 string hash for English-language strings. Suddenly the application gets Chinese-language strings.
Example 2: You assume an ID is always small because the IDs start at 1, increase by one each time one is assigned, and there are never more than a few thousand assigned. Now someone changes the ID to be a random GUID.
I have settings stored in a std::map. For example, there is a WorldTime key whose value updates on each main-loop iteration. I don't want to read it from the map every time I need it (it's also processed each frame); I don't think that's fast at all. So, can I get a pointer to the map's value and access it directly? The code is:
std::map<std::string, int> mSettings;
// Somewhere in cycle:
mSettings["WorldTime"] += 10; // ms
// Somewhere in another place, also called in cycle
DrawText(mSettings["WorldTime"]); // Is slow to call each frame
So the idea is something like:
int *time = &mSettings["WorldTime"];
// In cycle:
DrawText(*time);
How wrong is it? Should I do something like that?
Best use a reference:
int & time = mSettings["WorldTime"];
If the key doesn't already exist, the []-access will create the element (and value-initialize the mapped value, i.e. 0 for an int). Alternatively (if the key already exists):
int & time = mSettings.find("WorldTime")->second; // find() returns an iterator to a key/value pair
As an aside: if you have hundreds of thousands of string keys or use lookup by string key a lot, you might find that an std::unordered_map<std::string, int> gives better results (but always profile before deciding). The two maps have virtually identical interfaces for your purpose.
According to this answer on StackOverflow, it's perfectly OK to store a pointer to a map element as it will not be invalidated until you delete the element (see note 3).
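A compile-and-run sketch of the cached-reference idea (the variable names are just illustrative):

```cpp
#include <map>
#include <string>

// std::map is node-based: inserting other keys never invalidates references
// to existing mapped values, so one lookup can be cached and reused.
int demo() {
    std::map<std::string, int> mSettings;
    int &time = mSettings["WorldTime"];  // creates the entry, value 0

    for (int frame = 0; frame < 3; ++frame)
        time += 10;                      // per-frame update, no map lookup

    mSettings["Gravity"] = 9;            // later inserts don't invalidate `time`
    time += 10;
    return mSettings["WorldTime"];
}
```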
If you're worried so much about performance then why are you using strings for keys? What if you had an enum? Like this:
enum Settings
{
WorldTime,
...
};
Then your map would be using ints for keys rather than strings. The map has to do comparisons between keys (std::map is typically implemented as a balanced tree), and comparisons between ints are much faster than comparisons between strings.
Furthermore, if you're using an enum for keys, you can just use an array, because an enum IS essentially a map from some sort of symbol (i.e. WorldTime) to an integer, starting at zero. So then do this:
enum Settings
{
WorldTime,
...
NumSettings
};
And then declare your mSettings as an array:
int mSettings[NumSettings];
This has a faster lookup time than a std::map. Then access it like this:
DrawText(mSettings[WorldTime]);
Since you're basically just accessing a value in an array rather than accessing a map this is going to be a lot faster and you don't have to worry about the pointer/reference hack you were trying to do in the first place.
I've only recently started delving into Boost and its containers, and I read a few articles on the web and on stackoverflow saying that a boost::unordered_map is the fastest-performing container for big collections.
So, I have this class State, which must be unique in the container (no duplicates) and there will be millions if not billions of states in the container.
Therefore I have been trying to optimize it for small size and as few computations as possible. I was using a boost::ptr_vector before, but as I read on stackoverflow, a vector is only good as long as there are not that many objects in it.
In my case, the State describes sensorimotor information from a robot, so there can be an enormous number of states, and therefore fast lookup is of topmost priority.
Following the boost documentation for unordered_map, I realize that there are two things I could do to speed things up: use a hash function, and use an equality operator to compare States based on their hash values.
So, I implemented a private hash() function which takes in the State's information and, using boost::hash_combine, creates a std::size_t hash value.
The operator== basically compares the states' hash values.
So:
1. Is std::size_t enough to cover billions of possible hash combinations? In order to avoid duplicate states I intend to use their hash values.
2. When creating a state_map, should I use the State* or the hash value as the key? I.e.: boost::unordered_map<State*, std::size_t> state_map; or boost::unordered_map<std::size_t, State*> state_map;
3. Are lookups with a boost::unordered_map::iterator = state_map.find() faster than going through a boost::ptr_vector and comparing each iterator's key value?
4. Finally, any tips or tricks on how to optimize such an unordered map for speed and fast lookups would be greatly appreciated.
EDIT: I have seen quite a few answers, one being to use C++0x rather than boost, another not to use an unordered_set, but to be honest, I still want to see how boost::unordered_set is used with a hash function.
I have followed boost's documentation and implemented it, but I still cannot figure out how to use boost's hash function with the unordered set.
This is a bit muddled.
What you say are not "things that you can do to speed things up"; rather, they are mandatory requirements of your type to be eligible as the element type of an unordered map, and also for an unordered set (which you might rather want).
You need to provide an equality operator that compares objects, not hash values. The whole point of the equality is to distinguish elements with the same hash.
size_t is an unsigned integral type, 32 bits on x86 and 64 bits on x64. Since you want "billions of elements", which means many gigabytes of data, I assume you have a solid x64 machine anyway.
What's crucial is that your hash function is good, i.e. has few collisions.
You want a set, not a map. Put the objects directly in the set: std::unordered_set<State>. Use a map if you are mapping to something, i.e. states to something else. Oh, use C++0x, not boost, if you can.
Using hash_combine is good.
Baby example:
struct State
{
    inline bool operator==(const State &) const;
    /* Stuff */
};

namespace std
{
    template <> struct hash<State>
    {
        inline std::size_t operator()(const State & s) const
        {
            /* your hash algorithm here */
        }
    };
}

// An alternative hash must be a function object type, not a free function:
struct Foo
{
    std::size_t operator()(const State & s) const { /* some code */ }
};

int main()
{
    std::unordered_set<State> states;             // uses std::hash<State>
    std::unordered_set<State, Foo> other_states;  // another hash function
}
An unordered_map is a hashtable. You don't store the hash; it is done internally as the storage and lookup method.
Given your requirements, an unordered_set might be more appropriate, since your object is the only item to store.
You are a little confused though -- the equality operator and hash function are not truly performance items, but required for nontrivial objects for the container to work correctly. A good hash function will distribute your nodes evenly across the buckets, and the equality operator will be used to remove any ambiguity about matches based on the hash function.
std::size_t is fine for the hash function. Remember that no hash is perfect; there will be collisions, and these collision items are stored in a linked list at that bucket position.
Thus, .find() will be O(1) in the optimal case and very close to O(1) in the average case (and O(N) in the worst case, but a decent hash function will avoid that.)
You don't mention your platform or architecture; at billions of entries you still might have to worry about out-of-memory situations depending on those and the size of your State object.
Forget about hashes; there is nothing (at least in your question) that suggests you have a meaningful key.
Let's take a step back and rephrase your actual performance goal:
you want to quickly validate that no duplicates ever exist for any of your State objects
Comment if I need to add others.
Given that goal, and from your comment, I would actually suggest you use an ordered set rather than an unordered_map. Yes, the ordered set does a binary search, O(log n), while the unordered one does an O(1) lookup.
However, the difference is that with this approach you need the ordered set ONLY to check that an equal state doesn't already exist when you are about to create a new one, that is, at State creation time.
In all the other lookups, you don't actually need to look into the ordered set at all, because you already have the key: the State*. The key reaches the value through the magic dereference operator: *key.
So with this approach you use the ordered set only as an index to verify States at creation time. In all other cases, you access your State through the dereference operator of your pointer key.
If all the above wasn't enough to convince you, here is the final nail in the coffin of the idea of using a hash to quickly determine equality: a hash function has a small probability of collision for any one pair, but as the number of states grows into the billions, seeing collisions becomes a near-certainty. So depending on your fault tolerance, you are going to have to deal with state collisions (and from your question and the number of States you expect to handle, it seems you will deal with a lot of them).
For this to work, you obviously need the comparison predicate to test all the internal properties of your state (gyroscope, thrusters, accelerometers, proton rays, etc.)
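A sketch of the creation-time check with std::set and an ordering predicate over all of the state's fields; the two fields here are stand-ins for the real sensorimotor data:

```cpp
#include <set>
#include <tuple>

struct State {
    int gyro;    // stand-ins for the real sensor fields
    int thrust;
    bool operator<(const State &o) const {
        // The predicate must order over ALL internal properties.
        return std::tie(gyro, thrust) < std::tie(o.gyro, o.thrust);
    }
};

// Returns the canonical State*: the existing element if an equal state was
// already registered, otherwise the freshly inserted one.
const State *register_state(std::set<State> &index, const State &s) {
    auto result = index.insert(s);  // no-op when an equal state exists
    return &*result.first;
}
```

After creation, code holds the returned State* and never needs to consult the set again, as described above.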
I want to see the number of appearances of words from some phrases.
My problem is that I can't use a map to do this:
map[word] = appearance++;
Instead I have a class that uses a binary tree and behaves like a map, but I only have the method:
void insert(string, int);
Is there a way to count the word appearances using this function? (Because I can't find a way to increment the int for every different word.) Or do I have to overload operator[] for the class? What should I do?
Presumably you also have a way to retrieve data from your map-like structure (storing data does little good unless you can also retrieve it). The obvious method would be to retrieve the current value, increment it, and store the result (or store 1 if retrieving showed the value wasn't present previously).
I guess this is homework and you're learning about binary trees. In that case I would implement operator[] to return a reference to the existing value (and if no value exists, default-construct a value, insert it, and return a reference to that). Obviously operator[] will be implemented quite similarly to your insert method.
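A sketch of that operator[] on a minimal binary search tree; the class and member names are made up for the example (the real homework class will differ), and the destructor/cleanup is omitted for brevity:

```cpp
#include <string>

class WordTree {
public:
    // Same contract as std::map::operator[]: returns a reference to the
    // count for `word`, inserting a zero-valued node if the word is new.
    int &operator[](const std::string &word) {
        Node **cur = &root_;
        while (*cur) {
            if (word < (*cur)->key)      cur = &(*cur)->left;
            else if ((*cur)->key < word) cur = &(*cur)->right;
            else                         return (*cur)->value;  // found
        }
        *cur = new Node{word, 0, nullptr, nullptr};  // default-constructed count
        return (*cur)->value;
    }

private:
    struct Node {
        std::string key;
        int value;
        Node *left, *right;
    };
    Node *root_ = nullptr;  // destructor omitted for brevity
};
```

Counting then becomes simply `tree[word]++;` for every word read.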
can you edit "insert" function?
if you can, you can add static variable that count the appearnces inside the function