C++ Complicated look-up table - c++

I have around 400.000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. Therefore I am muplicating their double values. This is quite time-consuming.
I have made some tests, and I found out that there are only 40.000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id, item-id, compare return value
1 1 499483,49834
1 2 -0.0928
1 3 499483,49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!

Assuming I understand you correctly, You have two inputs with 400K possibilities, so 400K * 400K = 160B entries... assuming you have them indexed sequentially, and the you stored your 40K possibilities in a way that allowed 2-octets each, you're looking at a table size of roughly 300GB... pretty sure that's beyond current every-day computing. So, you might instead research if there is any correlation between the 400K "items", and if so, if you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should result. Clearly your hash function and lookup needs to be shorter than just doing the multiplication in the first place. Or maybe you can reduce the comparison time with some kind of intelligent reduction, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...

To speed things up, you should probably compute all of the possible answers, and store the inputs to each answer.
Then, I would recommend making some sort of look up table that uses the answer as the key(since the answers will all be unique), and then storing all of the possible inputs that get that result.
To help visualize:
Say you had the table 'Table'. Inside Table you have keys, and associated to those keys are values. What you do is you make the keys have the type of whatever format your answers are in(the keys will be all of your answers). Now, give your 400k inputs each a unique identifier. You then store the unique identifiers for a multiplication as one value associated to that particular key. When you compute that same answer again, you just add it as another set of inputs that can calculate that key.
Example:
Table<AnswerType, vector<Input>>
Define Input like:
struct Input {IDType one, IDType two}
Where one 'Input' might have ID's 12384, 128, meaning that the objects identified by 12384 and 128, when multiplied, will give the answer.
So, in your lookup, you'll have something that looks like:
AnswerType lookup(IDType first, IDType second)
{
foreach(AnswerType k in table)
{
if table[k].Contains(first, second)
return k;
}
}
// Defined elsewhere
bool Contains(IDType first, IDType second)
{
foreach(Input i in [the vector])
{
if( (i.one == first && i.two == second ) ||
(i.two == first && i.one == second )
return true;
}
}
I know this isn't real C++ code, its just meant as pseudo-code, and it's a rough cut as-is, but it might be a place to start.
While the foreach is probably going to be limited to a linear search, you can make the 'Contains' method run a binary search by sorting how the inputs are stored.
In all, you're looking at a run-once application that will run in O(n^2) time, and a lookup that will run in nlog(n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.

Related

Multiplying inventory with values

I was wondering whether there is an easier way of multiplying the inventory I have by their values assigned? I came out with the below code, which seems to work but it looks lengthy.
score = {
"handle" : 4 ,
"blade" : 3 ,
"bottle" : 10 ,
"full blade" : 10
}
inventory = ["handle", "blade", "bottle", "full blade"]
list = []
def ScoreCompute():
for x in inventory:
list.append(score.get(x))
ScoreCompute()
print sum(list)
x = sum(list)
If you want to sum all the values in your dictionary score, you can do this with
sum(score.values())
or
sum(score.itervalues())
An additional note: In your example, inventory is equal to score.keys() (though, the order is not necessarily the same).
Edit:
As of Python 3.X inventory would equal something like list(score.keys()).
Thanks to #dwanderson for the hint.
>>>> sum(score.values())
Python dictionaries have lots of convenience methods/accessors, specifically for these types of things. First, if you want every item (key, value, key-value pair) in a dictionary, you probably don't need to use the .get(...) syntax.
If you want just the keys, you could do:
for key in score:
(Note: for key in score.keys(): works as well, but in Python3, it's unnecessary, and in Python2, it can give a performance hit; for key in score.iterkeys() works without creating a temporary list, but there's probably no reason to call score.iterkeys() over just 'score`, at least not that I can think of off-hand.)
If you want just the values, Python provides a built-in, efficient method for though - no need to construct an entire new list of the values, just to sum them and then throw the list away. This gets into the topic of generators, but that seems a bit of a digression here. Suffice to say,
for value in score.itervalues():
would let you look at and examine each of the values. You don't know which key goes with which value, but in your particular case, you don't care about the key, only the value, so that's fine.
If you want to look at both the key and the value at the same time, then all you need is
for key, value in score.iteritems():
Now, in your particular case, you just need each value once, and Python's sum is smart enough to figure out how to extract that from the .values() generator, so there's no need for an explicit for-loop. You can just write sum(score.itervalues()) and get what you need.
It doesn't look like a particular concern here, but just to note: even though you don't have an explicit for-loop, it's still performing one "under the hood", so the longer the list, the longer it will take to sum. Again, though, this shouldn't matter for such a straight-forward and small example; it would be something to keep in mind if you were taking the sum of a list with millions and millions of values (or if you were calling sum(score.itervalues()) over and over and over again).
A little bit shorter if inventory has fewer entries than score:
>>> sum(score[key] for key in inventory)
27

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings, without making any duplication.
First thought was to use a binary tree create and locate for duplicate in tree, if any.
But this may not be very effective.
Second thought was using MD5 like hash method which create messages based only on time, but this may introduce another problem, different machines has different accuracy of time.
And in a modern processor, more than one string could be created in a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
Beware of MD5, there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc... Two solutions from the top of my head:
1 Generate UUIDs then turn them into String with a binary representation or base 64 algorithm.
2 Simply generate random Strings and put them in a searchable structure (HashMap) so that you can find very quickly (O(1)-O(log n)) if a generated String already has a duplicate, in which case it is discarded.
A tree probably won't be the most efficient, especially for insertions - as it will have to constantly re-balance itself (somewhat of an "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify in what programming language you're coding. For instance, in Java this will work nicely: UUID.randomUUID().toString() . UUID identifiers are unique in practice, as is stated in wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here - no rebalancing necessary, because your strings are random, and it's on random data that binary trees work their best. However, it's still O(log(n)) for lookup and addition.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
For a while, I listed treaps here, but that's silly - they do a lot of operations in O(log(n)) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

How to create a hash table

I would like to mention before continuing that I have looked at other questions asking the same thing on this site as well as on other sites. I hope that I can get a good answer, because my goal is twofold:
Foremost, I would like to learn how to create a hash table.
Secondly, I find that a lot of answers on Stack Overflow tend to assume a certain level of knowledge on a subject that is often not there, especially for the newer types. That being said, I hope to edit my main message to include an explanation of the process a bit more in depth once I figure it out myself.
Onto the main course:
As I understand them so far, a hash table is an array of lists (or a similar data structure) that hopes to, optimally, have as few collisions as possible in order to preserve it's lauded O(1) complexity. The following is my current process:
So my first step is to create an array of pointers:
Elem ** table;
table = new Elem*[size];//size is the desired size of the array
My second step is to create a hashing function( a very simple one ).
int hashed = 0;
hashed = ( atoi( name.c_str() ) + id ) % size;
//name is a std string, and id is a large integer. Size is the size of the array.
My third step would be to create something to detect collisions, which is the part I'm currently at.
Here's some pseudo-code:
while( table[hashedValue] != empty )
hashedValue++
else
put in the list at that index.
It's relatively inelegant, but I am still at the "what is this" stage. Bear with me.
Is there anything else? Did I miss something or do something incorrectly?
Thanks
Handle finding no empty slots and resizing the table.
You're missing a definition for Elem. That's not trivial, as it depends on whether you want a chaining or a probing hash table.
A hash function produces the same value for the same data. Your collision check, however, modifies that value, which means that the hash value not only depends on the input, but also on the presence of other elements in the hash map. This is bad, as you almost never will be able to actually access the element you put in before through its name, only through iterating over the map.
Second, your collision check is vulnerable to overflow / range errors, as you just increase the hash value without checking against the size of the map (though, as I said before, you shouldn't even be doing this).

How to map 13 bit value to 4 bit code?

I have a std::map for some packet processing program.
I didn't noticed before profiling but unfortunately this map lookup alone consume about 10% CPU time (called too many time).
Usually there only exist at most 10 keys in the input data. So I'm trying to implement a kind of key cache in front of the map.
Key value is 13 bit integer. I know there are only 8192 possible keys and array of 8192 items can give constant time lookup but I feel already ashamed and don't want use such a naive approach :(
Now, I'm just guessing some method of hashing that yield 4 bit code value for 13 bit integer very fast.
Any cool idea?
Thanks in advance.
UPDATE
Beside my shame, I don't have total control over source code and it's almost prohibited to make new array for this purpose.
Project manager said (who ran the profiler) linked list show small performance gain and recommended using std::list instead of std::map.
UPDATE
Value of keys are random (no relationship) and doesn't have good distribution.
Sample:
1) 0x100, 0x101, 0x10, 0x0, 0xffe
2) 0x400, 0x401, 0x402, 0x403, 0x404, 0x405, 0xff
Assuming your hash table either contains some basic type -- it's almost no memory at all. Even on 64-bit systems it's only 64kb of memory. There is no shame in using a lookup table like that, it has some of the best performance you can get.
You may want to go with middle solution and open addressing technique: one array of size 256. Index to an array is some simple hash function like XOR of two bytes. Element of the array is struct {key, value}. Collisions are handled by storing collided element at the next available index. If you need to delete element from array, and if deletion is rare then just recreate array (create a temporary list from remaining elements, and then create array from this list).
If you pick your hash function smartly there would not be any collisions almost ever. For instance, from your two examples one such hash would be to XOR low nibble of high byte with high nibble of low byte (and do what you like with remaining 13-th bit).
Unless you're writing for some sort of embedded system where 8K is really significant, just use the array and move on. If you really insist on doing something else, you might consider a perfect hash generator (e.g., gperf).
If there are really only going to be something like 10 active entries in your table, you might seriously consider using an unsorted vector to hold this mapping. Something like this:
typedef int key_type;
typedef int value_type;
std::vector<std::pair<key_type, value_type> > mapping;
inline void put(key_type key, value_type value) {
for (size_t i=0; i<mapping.size(); ++i) {
if (mapping[i].first==key) {
mapping[i].second=value;
return;
}
}
mapping.push_back(std::make_pair(key, value));
}
inline value_type get(key_type key) {
for (size_t i=0; i<mapping.size(); ++i) {
if (mapping[i].first==key) {
return mapping[i].second;
}
}
// do something reasonable if not found?
return value_type();
}
Now, the asymptotic speed of these algorithms (each O(n)) is much worse than you'd have with either a red-black tree (like std::map at O(log n)) or hash table (O(1)). But you're not talking about dealing with a large number of objects, so asymptotic estimates don't really buy you much.
Additionally, std::vector buys you both low overhead and locality of reference, which neither std::map nor std::list can offer. So it's more likely that a small std::vector will stay entirely within the L1 cache. As it's almost certainly the memory bottleneck that's causing your performance issues, using a std::vector with even my poor choice of algorithm will likely be faster than either a tree or linked list. Of course, only a few solid profiles will tell you for sure.
There are certainly algorithms that might be better choices: a sorted vector could potentially give even better performance; a well tuned small hash table might work as well. I suspect that you'll run into Amdahl's law pretty quickly trying to improve on a simple unsorted vector, however. Pretty soon you might find yourself running into function call overhead, or some other such concern, as a large contributor to your profile.
I agree with GWW, you don't use so much memory in the end...
But if you want, you could use an array of 11 or 13 linkedlists, and hash the keys with the % function. If the key number is less than the array size, complexity tents still to be O(1).
When you always just have about ten keys, use a list (or array). Do some benchmarking to find out whether or not using a sorted list (or array) and binary search will improve performance.
You might first want to see if there are any unnecessary calls to the key lookup. You only want to do this once per packet ideally -- each time you call a function there is going to be some overhead, so getting rid of extra calls is good.
Map is generally pretty fast, but if there is any exploitable pattern in the way keys are mapped to items you could use that and potentially do better. Could you provide a bit more information about the keys and the associated 4-bit values? E.g. are they sequential, is there some sort of pattern?
Finally, as others have mentioned, a lookup table is very fast, 8192 values * 4 bits is only 4kb, a tiny amount of memory indeed.
I would use a lookup table. It's tiny unless you are using a micrcontroller or something.
Otherwise I would do this -
Generate a table of say 30 elements.
For each lookup calculate a hash value of (key % 30) and compare it with the stored key in that location in the table. If the key is there then you found your value. if the slot is empty, then add it. If the key is wrong then skip to the next free cell and repeat.
With 30 cells and 10 keys collisions should be rare but if you get one it's fast to skip to the next cell, and normal lookups are simply a modulus and a compare operation so fairly fast

Perfect hash function for a set of integers with no updates

In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
//Return if iTest appears in a set of numbers.
}
The number list is known at app load up (But are not always the same between two instances of the same application) and will not change (or added to) throughout the whole of the program. The integers themselves maybe large and have a large range so it is not efficient to have a vector<bool>. Performance is a issue as the function sits in a hot spot. I have heard about Perfect hashing but could not find out any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like if the solution isn't a third party library because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the bloom filter is not part of the standard library, so you'll have to implement it yourself. However it turns out to be quite a simple structure!
A Bloom Filter is a data structure that is specialized in the question: Is this element part of the set, but does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special: Is this element part of the set ?
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you, is that for all the cases it answers No, you have the guarantee that it isn't part of the set.
As such a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned it will only let very few false positive pass. For example, gcc uses one.
What comes to my mind is gperf. However, it is based in strings and not in numbers. However, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here, find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
#54:30 The Prof. draws picture of a standard way of doing perfect hash. The perfect minimal hash is beyond this lecture. (good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
The std::map you get very good performance in 99.9% scenarios. If your hot spot has the same iTest(s) multiple times, combine the map result with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
It's not necessary or practical to aim for mapping N distinct randomly dispersed integers to N contiguous buckets - i.e. a perfect minimal hash - the important thing is to identify an acceptable ratio. To do this at run-time, you can start by configuring a worst-acceptible ratio (say 1 to 20) and a no-point-being-better-than-this-ratio (say 1 to 4), then randomly vary (e.g. changing prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For worst-acceptible you don't time out, or you fall back on something slower but reliable (container or displacement lists to resolve collisions). Then, allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or reach the no-pint-being-better ratio....
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptible to simple say "integer inputs form a hash", because there are collisions when %-ed into any sane array size. But, you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
Original Question
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique - perfect hashing.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In you case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fall back position.... binary search, or map or something like that, so that
you can guarantee that the container will work in all cases.
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space efficient set of integers with lookup time bring constant against the number of objects in the data structure, assuming that even std::bitset would be too large.