I apologize in advance if this has been asked before. If it has I have no idea what this data structure would be called.
I have a collection of N (roughly 300 or fewer) widgets. Each widget has M (around 10) different names. This collection will be populated once and then looked up many times. I will be implementing this in C++.
An example of this might be a collection of 200 people and storing their names in 7 different languages.
The lookup function would basically look like this:
lookup("name", A, B), which will return the translation of the name "name" from language A to language B, (only if name is in the collection).
Is there any known data structure in the literature for doing this sort of thing efficiently? The most obvious solution is to create a bunch of hash tables for the lookups, but having MxM hash tables for all the possible pairs quickly gets unwieldy and memory inefficient. I'd also be willing to consider sorted arrays (binary search) or even trees. Since the collection is not huge, log(N) lookups are just fine.
Thank you everyone!
Based on your description of the desired lookup function, it sounds like you could use a single hash table where the key is tuple<string, Language, Language> and the value is the result of the translation. The two languages in the key identify the source language of the string and the language of the desired translation, respectively.
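A minimal sketch of that idea (the Language enum, the hash combiner, and the sample entries below are illustrative assumptions, not anything fixed by the question):

#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical language tag; the real set of languages is application-specific.
enum class Language { English, French, German };

// Key = (word, source language, target language); value = the translation.
using Key = std::tuple<std::string, Language, Language>;

// std::unordered_map has no built-in hash for tuples, so supply a simple combiner.
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t h = std::hash<std::string>{}(std::get<0>(k));
        h ^= std::hash<int>{}(static_cast<int>(std::get<1>(k))) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int>{}(static_cast<int>(std::get<2>(k))) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

int main() {
    std::unordered_map<Key, std::string, KeyHash> table;

    // Populate once: every (word, from, to) combination maps directly to its translation.
    table[{"apple", Language::English, Language::French}] = "pomme";

    auto it = table.find({"apple", Language::English, Language::French});
    if (it != table.end())
        std::cout << it->second << '\n'; // prints "pomme"
}

Note that this stores one entry per (word, source, target) combination, so roughly N x M x (M-1) entries; with N around 300 and M around 10 that is still only tens of thousands of entries.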
Create an N-by-M array D, such that D[u,v] is the word in language v for widget u.
Also create M hash tables H₀ ... H_{M-1}, such that Hᵥ(w).data is u if w is the word for widget u in language v.
To perform lookup(w, r, s),
(1) set u = Hᵣ(w).data
(2) if D[u,r] equals w, return D[u,s], else return not-found.
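A rough sketch of this scheme, assuming widgets are indexed 0..N-1 and languages 0..M-1 (the Translator name and the std::optional return are just illustrative choices):

#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct Translator {
    // D[u][v] is the word for widget u in language v.
    std::vector<std::vector<std::string>> D;
    // H[v] maps a word in language v back to its widget index u.
    std::vector<std::unordered_map<std::string, int>> H;

    // lookup(w, r, s): translate word w from language r to language s.
    std::optional<std::string> lookup(const std::string& w, int r, int s) const {
        auto it = H[r].find(w);                 // step (1): u = H_r(w)
        if (it == H[r].end()) return std::nullopt;
        int u = it->second;
        if (D[u][r] != w) return std::nullopt;  // step (2): verify, then translate
        return D[u][s];
    }
};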
I'm looking to do a reverse lookup on a 2D table, i.e. for a table F(x,y): given F, find x and y.
My current method uses a nested for loop to search the table for all x's and y's to find F within some error. The complication here is that the queried "F" may not be a perfect match for the "F" in my lookup table. I also have NaNs in my table. I'm hoping to have this program find the nearest "F" to the queried "F".
The table is currently a 2D array, but I'm thinking a map may be more appropriate here. I know how to create a multidimensional map from this: https://www.geeksforgeeks.org/implementing-multidimensional-map-in-c/
I also found some great answers (@Rob's specifically) on how to have a reverse map lookup for a 1D map using Boost here: Reverse map lookup
I'm having some trouble combining the two methods, as well as having a findNearest feature.
Sounds like you would like to use the indexing suite of Boost Geometry.
Not only does it have nearest-k queries, but it affords you all kinds of coordinate systems (including geodesic systems).
See Spatial Indexes
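As a rough illustration of what an rtree-based nearest lookup could look like for this problem (indexing the F values and carrying the (x, y) position as payload; the 2D point with a dummy second coordinate and all names below are assumptions of this sketch, not the library's prescribed usage):

#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

// Treat F as a 1D coordinate (second dimension fixed at 0 for simplicity);
// the payload is the (x, y) index pair it came from.
using FPoint = bg::model::point<double, 2, bg::cs::cartesian>;
using Entry  = std::pair<FPoint, std::pair<int, int>>;

int main() {
    bgi::rtree<Entry, bgi::quadratic<16>> tree;

    // Populate from the table, skipping NaN entries.
    tree.insert({FPoint(3.5, 0.0), {0, 1}});
    tree.insert({FPoint(7.2, 0.0), {2, 4}});

    // Nearest-k query: find the entry whose F is closest to the queried value.
    std::vector<Entry> hit;
    tree.query(bgi::nearest(FPoint(7.0, 0.0), 1), std::back_inserter(hit));

    std::cout << "nearest F maps to (" << hit.front().second.first
              << ", " << hit.front().second.second << ")\n";
}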
I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. To do that, I am multiplying their double values. This is quite time-consuming.
I have made some tests, and I found out that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id   item-id   compare return value
1         1         499483.49834
1         2         -0.0928
1         3         499483.49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries. Assuming you index them sequentially and store your 40K possible results in a way that takes 2 octets each, you're looking at a table size of roughly 300GB; pretty sure that's beyond current every-day computing.
So, you might instead research whether there is any correlation between the 400K "items", and if so, whether you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should come out. Clearly your hash function and lookup need to be cheaper than just doing the multiplication in the first place.
Or maybe you can reduce the comparison time with some kind of intelligent reduction, like knowing the result under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
To speed things up, you should probably compute all of the possible answers, and store the inputs to each answer.
Then, I would recommend making some sort of lookup table that uses the answer as the key (since the answers will all be unique), and then storing all of the possible inputs that give that result.
To help visualize:
Say you had the table 'Table'. Inside Table you have keys, and associated with those keys are values. What you do is make the keys have the type of whatever format your answers are in (the keys will be all of your answers). Now, give your 400k inputs each a unique identifier. You then store the unique identifiers for a multiplication as one value associated with that particular key. When you compute that same answer again, you just add it as another set of inputs that can produce that key.
Example:
Table<AnswerType, vector<Input>>
Define Input like:
struct Input { IDType one; IDType two; };
Where one 'Input' might have the IDs 12384 and 128, meaning that the objects identified by 12384 and 128, when multiplied, will give the answer.
So, in your lookup, you'll have something that looks like:
AnswerType lookup(IDType first, IDType second)
{
    for (const auto& [answer, inputs] : table)
    {
        if (contains(inputs, first, second))
            return answer;
    }
    return AnswerType{}; // or however you want to signal "not found"
}

// Defined elsewhere
bool contains(const std::vector<Input>& inputs, IDType first, IDType second)
{
    for (const Input& i : inputs)
    {
        if ((i.one == first && i.two == second) ||
            (i.two == first && i.one == second))
            return true;
    }
    return false;
}
This is just a rough sketch rather than polished C++, but it might be a place to start.
While the outer loop is probably going to be limited to a linear search, you can make the 'contains' step run a binary search by sorting how the inputs are stored.
In all, you're looking at a run-once application that will run in O(n^2) time, and a lookup that will run in nlog(n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.
I have a table that is accessed by row and column, where those two are not integral. They will, however, be unique and taken from the same set. The table will need to be expanded, but always from the end. Removal may be necessary from the middle, but is not a priority.
I am currently testing 2 approaches:
map<Key, int> headers;
vector<vector<Value> > table;
Or:
map<Key, map<Key, Value> > table;
Which is going to be more appropriate? I am also open to new suggestions.
Examples showing basic usage (though both very much oversimplified) are here and here.
It all depends on how this structure is going to be used: How densely populated the table will be, how efficient various operations have to be, how big the payload type (Value) will be, etc.
Your first approach (vector of vectors, with a map to translate indices) is a dense representation: Every value is stored explicitly in the table. If vectors grow by a factor of L, then the total allocation excess for the data proper can go up to L^2. For example, if L == 1.25, you may end up over 50% excess storage; if sizeof(Value) is large or if your table is large, that may be prohibitive. Also expanding the table may occasionally be quite expensive (when the vectors must be reallocated).
Your second approach (map of maps) is potentially sparse. However, if all table (row, column) pairs are accessed, it will get dense. Also, the book-keeping information for maps is a bit larger than for vectors. So for small Value sizes, the vector of vectors approach might be more space-efficient. If most of your table will be populated by "default" values, then you might improve things by distinguishing between read and write access to the table: a read for a value could perform a "find" and return a synthesized default value if no table entry was found.
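A small sketch of that read/write split for the map-of-maps variant (the class and method names are placeholders, not anything from the question):

#include <map>

template <typename Key, typename Value>
class SparseTable {
public:
    // Write access: creates the cell if it does not exist yet.
    Value& at(const Key& row, const Key& col) { return table_[row][col]; }

    // Read access: never inserts; synthesizes a default for missing cells.
    Value get(const Key& row, const Key& col, const Value& def = Value{}) const {
        auto r = table_.find(row);
        if (r == table_.end()) return def;
        auto c = r->second.find(col);
        return c == r->second.end() ? def : c->second;
    }

private:
    std::map<Key, std::map<Key, Value>> table_;
};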
I've got 200 strings. Each string has a relationship (measured by a float between 0 and 1) with every other string. This relationship is two-way; that is, relationship A/B == relationship B/A. This yields n(n-1)/2 relationships, or 19,900.
What I want to do is store these relationships in a lookup table so that given any two words I can quickly find the relationship value.
I'm using C++, so I'd probably use a std::map to store the LUT. The question is, what's the best key to use for this purpose.
The key needs to be unique and needs to be able to be calculated quickly from both words.
My approach is going to be to create a unique identifier for each word pair. For example given the words "apple" and "orange" then I combine them together as "appleorange" (alphabetical order, smallest first) and use that as the key value.
Is this a good solution or can someone suggest something more cleverer? :)
Basically you are describing a function of two parameters with the added property that order of parameters is not significant.
Your approach will work as long as concatenating the two words cannot produce the same key for two different pairs (I would suggest putting a comma or similar separator between the two words to remove possible ambiguities). Any 2D array would also work.
I would probably convert each keyword to some unique identifier (using a simple map) before trying to find the relationship value, but it does not change much from what you are proposing.
If Boost/TR1 is acceptable, I would go for an unordered_map with the pair of strings as key. The main question would then be: what about the order of the strings? This could be handled by the hash function, which starts with the lexicographically first string.
Remark: this is just a suggestion after reading the design-issue, not a study.
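One possible way to realize that suggestion, shown here with std::unordered_map (C++11) rather than the Boost/TR1 version: normalize the pair so the lexicographically smaller string always comes first, which keeps both hashing and equality order-insensitive. The hash combiner and helper names are just for the sketch.

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

using WordPair = std::pair<std::string, std::string>;

// The standard library provides no hash for std::pair, so supply a small combiner.
struct PairHash {
    std::size_t operator()(const WordPair& p) const {
        std::size_t h1 = std::hash<std::string>{}(p.first);
        std::size_t h2 = std::hash<std::string>{}(p.second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// Order-insensitive key: smaller string first.
WordPair make_key(std::string a, std::string b) {
    if (b < a) std::swap(a, b);
    return {std::move(a), std::move(b)};
}

int main() {
    std::unordered_map<WordPair, double, PairHash> lut;
    lut[make_key("orange", "apple")] = 0.42;
    std::cout << lut[make_key("apple", "orange")] << '\n'; // prints 0.42
}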
How "quickly" is quickly? Given you don't care about the order of the two words, you could try a map like this:
std::map<std::set<std::string>, double> lut;
Here the key is a set of the two words, so if you insert "apple" and "orange", then the order is the same as "orange" "apple", and given set supports the less than operator, it can function as a key in a map. NOTE: I intentionally did not use a pair for a key, given the order matters there...
I'd start with something fairly basic like this, profile and see how fast/slow the lookups etc. are before seeing if you need to do anything smarter...
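For what it's worth, a tiny usage sketch of that set-keyed map (the words and the value are made up):

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    std::map<std::set<std::string>, double> lut;

    // The set ignores insertion order, so {"apple","orange"} and {"orange","apple"} are the same key.
    lut[{"apple", "orange"}] = 0.42;

    auto it = lut.find({"orange", "apple"});
    if (it != lut.end())
        std::cout << it->second << '\n'; // prints 0.42
}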
If you create a sorted array with the 200 strings, then you can binary search it to find the matching indices of the two strings, then use those two indices in a 2D array to find the relationship value.
If your 200 strings are in an array, your 20,100 similarity values can be in a one-dimensional array too. It's all down to how you index into that array. Say x and y are the indexes of the strings you want the similarity for. Swap x and y if necessary so that y >= x, then look at entry i = x + y(y+1)/2 in the large array.
(x,y) of (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),(0,3),(1,3)... will take you to entry 0,1,2,3,4,5,6,7...
So this uses space optimally and it gives faster look up than a map would. I'm assuming efficiency is at least mildly important to you since you are using C++!
[if you're not interested in self similarity values where y=x, then use i = x + y(y-1)/2 instead].
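A small sketch of that indexing scheme (the container and function names are illustrative; the string-to-index step uses the sorted-array binary search mentioned in the previous answer):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t N = 200;

// n*(n+1)/2 entries, including the diagonal (self-similarity) values.
std::vector<double> similarity(N * (N + 1) / 2);

// Map a pair of string indices to the flat index; the order of x and y is irrelevant.
std::size_t pair_index(std::size_t x, std::size_t y) {
    if (y < x) std::swap(x, y);          // ensure y >= x
    return x + y * (y + 1) / 2;
}

// Indices come from a binary search over the sorted array of 200 strings.
std::size_t string_index(const std::vector<std::string>& sorted, const std::string& s) {
    return std::lower_bound(sorted.begin(), sorted.end(), s) - sorted.begin();
}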
I am trying to represent a relation (table) in C++ code:
The columns of the relation are integers.
The number of columns in the relation is fixed at runtime.
No duplicates should be allowed (this is the major source of cost).
I want to have a map from names to relations.
Any ideas for an efficient implementation? The main issue here is detecting duplicates at insertion time, which can be very costly.
Make each row of the table a struct Row.
Use a std::set or std::unordered_set to store these structs. A duplicate (or a query hit) can be detected in O(d log n) time for std::set, or amortized O(d) time for std::unordered_set, where d is the number of columns.
To efficiently map from names to rows, create a boost::bimap<std::string, Row>.
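A minimal sketch of the duplicate-detection part (the Row layout, the hash combiner, and the example values are assumptions of this sketch; the name-to-relation mapping via boost::bimap is left out here):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// A Row is the tuple of integer column values; the column count is fixed at runtime.
struct Row {
    std::vector<int> cols;
    bool operator==(const Row& other) const { return cols == other.cols; }
};

// Combine the hashes of all d columns, so insertion/lookup costs amortized O(d).
struct RowHash {
    std::size_t operator()(const Row& r) const {
        std::size_t h = 0;
        for (int c : r.cols)
            h ^= std::hash<int>{}(c) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

int main() {
    std::unordered_set<Row, RowHash> relation;
    relation.insert(Row{{1, 2, 3}});
    bool inserted = relation.insert(Row{{1, 2, 3}}).second; // false: duplicate rejected
    (void)inserted;
}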
KennyTM has a point. You could use SQLite. As described in the link, you can use it to create a temporary in-memory database.