I'm looking to have a reverse lookup of a 2D table i.e. Table: F(x,y), given F, find x and y
My current method uses a nested for loop to search the table for all x's and y's to find F within some error. The complication here is that the queried "F" may not be a perfect match for the "F" in my lookup table. I also have NaNs in my table. I'm hoping to have this program find the nearest "F" to the queried "F".
The table is currently a 2D array, but I'm thinking a map may be more appropriate here. I know how to create a multidimensional map from this: https://www.geeksforgeeks.org/implementing-multidimensional-map-in-c/
I also found some great answers (@Rob's specifically) on how to have a reverse map lookup for a 1D map using Boost here: Reverse map lookup
I'm having some trouble combining the two methods, as well as having a findNearest feature.
Sounds like you would like to use the indexing suite of Boost Geometry.
Not only does it have nearest-k queries, but it affords you all kinds of coordinate systems (including geodesic systems).
See Spatial Indexes
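Since the queried F is a single scalar, a standard-library-only alternative to a spatial index may also be enough. The sketch below is not the Boost.Geometry approach: it flattens the table into a vector sorted by F (skipping NaNs) and binary-searches for the nearest value. The names Cell, buildReverseIndex and findNearest are made up for illustration, the table is assumed to be a vector of rows, and it needs C++17 for std::optional.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// One non-NaN table cell: the stored value F and its (x, y) indices.
struct Cell {
    double f;
    std::size_t x;
    std::size_t y;
};

// Flatten the table into a vector sorted by F, dropping NaN cells.
std::vector<Cell> buildReverseIndex(const std::vector<std::vector<double>>& table) {
    std::vector<Cell> cells;
    for (std::size_t x = 0; x < table.size(); ++x)
        for (std::size_t y = 0; y < table[x].size(); ++y)
            if (!std::isnan(table[x][y]))
                cells.push_back({table[x][y], x, y});
    std::sort(cells.begin(), cells.end(),
              [](const Cell& a, const Cell& b) { return a.f < b.f; });
    return cells;
}

// Return the cell whose F is closest to the query, or nullopt if there are no cells.
std::optional<Cell> findNearest(const std::vector<Cell>& cells, double query) {
    if (cells.empty()) return std::nullopt;
    // First cell with f >= query.
    auto hi = std::lower_bound(cells.begin(), cells.end(), query,
                               [](const Cell& c, double q) { return c.f < q; });
    if (hi == cells.begin()) return *hi;
    if (hi == cells.end()) return *(hi - 1);
    auto lo = hi - 1;
    // Pick whichever neighbour is closer; ties go to the smaller F.
    return (query - lo->f <= hi->f - query) ? *lo : *hi;
}
```

Building the index costs O(n log n) once; each query is then O(log n) instead of a full scan. If you also want to reject matches outside some error bound, compare the returned f against the query after the call.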
Related
I apologize in advance if this has been asked before. If it has I have no idea what this data structure would be called.
I have a collection of N (approx ~300 or less) widgets. Each widget has M (around ~10) different names. This collection will be populated once and then looked up many times. I will be implementing this in C++.
An example of this might be a collection of 200 people and storing their names in 7 different languages.
The lookup function would basically look like this:
lookup("name", A, B), which will return the translation of the name "name" from language A to language B, (only if name is in the collection).
Is there any known data structure in the literature for doing this sort of thing efficiently? The most obvious solution is to create a bunch of hash tables for the lookups, but having MxM hash tables for all the possible pairs quickly gets unwieldy and memory inefficient. I'd also be willing to consider sorted arrays (binary search) or even trees. Since the collection is not huge, log(N) lookups are just fine.
Thank you everyone!
Based on your description of the desired lookup function, it sounds like you could use a single hash table where the key is tuple<string, Language, Language> and the value is the result of the translation. The two languages in the key identify the source language of the string and the language of the desired translation, respectively.
Create an N-by-M array D, such that D[u,v] is the word in language v for widget u.
Also create M hash tables, H₀...Hₘ (where m is M-1) such that Hᵥ(w).data is u if w is the word for widget u in language v.
To perform lookup(w, r, s),
(1) set u = Hᵣ(w).data
(2) if D[u,r] equals w, return D[u,s], else return not-found.
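A minimal sketch of the D-plus-hash-tables scheme above, assuming languages are identified by a column index and that std::unordered_map stands in for the hash tables. The class name Translator and its members are made up for illustration. Because unordered_map::find only returns exact key matches, the "D[u,r] equals w" check in step (2) is already implied here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class Translator {
public:
    // names[u][v] is the name of widget u in language v (the array D above).
    explicit Translator(std::vector<std::vector<std::string>> names)
        : d_(std::move(names)) {
        for (std::size_t u = 0; u < d_.size(); ++u) {
            if (d_[u].size() > h_.size()) h_.resize(d_[u].size());
            for (std::size_t v = 0; v < d_[u].size(); ++v)
                h_[v].emplace(d_[u][v], u);  // H_v: word -> widget index u
        }
    }

    // Translate word w from language r to language s, or nullopt if not found.
    std::optional<std::string> lookup(const std::string& w,
                                      std::size_t r, std::size_t s) const {
        if (r >= h_.size()) return std::nullopt;
        auto it = h_[r].find(w);                   // step (1): u = H_r(w)
        if (it == h_[r].end()) return std::nullopt;
        const auto& row = d_[it->second];
        if (s >= row.size()) return std::nullopt;
        return row[s];                             // step (2): D[u, s]
    }

private:
    std::vector<std::vector<std::string>> d_;                      // D
    std::vector<std::unordered_map<std::string, std::size_t>> h_;  // H_0 .. H_{M-1}
};
```

This keeps M hash tables rather than M×M, and each lookup is one hash probe plus one array access.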
So I have a boost::multi_index_container with multiple non-unique indexes. I would like to find an elegant way to do a relational-database style query to find all elements that match a set of criteria using multiple indexes.
For instance, given a list of connections between devices, I'd like to search for all elements whose source is 'server' and whose destination is 'pc2'. I've got an index on Source and an index on Dest.
Source   Dest  Port
-------  ----  ----
server   pc1   23
server   pc1   27
server   pc1   80
server   pc2   80    <- want to find these two
server   pc2   90    <-
printer  pc3   110
printer  pc1   110
scanner  mac   8080
Normally I might do lower_bound and upper_bound on the first index (to match 'server'), then do a linear search between those iterators to find those elements that match in the "Dest" column, but that's not very satisfying, since I've got a second index. Is there an elegant stl/boost-like way to take advantage of the fact that there are two indexes and avoid a linear search (or an equivalent amount of work, such as adding all intermediate results to another container, etc.)?
(Obviously in the example, a linear search would be fastest, but if there were 10000 items with 'server' as the source, having the second index would start to be nice.)
Any ideas are appreciated!
You might simply get some inspiration from relational databases...
... but first we need to demystify a thing about indexes.
Compound Indexes
In a relational database there are two types of indexes:
regular indexes: an index on one column
compound indexes: an index on multiple columns at once
The two give different performance results. When you need to use two indexes, there is a merge pass to combine the results given by them (also called join), therefore compound indexes can provide a speed boost.
Multi-Index
Boost multi-index can use compound indexes; you are free to provide your own hashing or comparison function, after all.
A key difference from a relational database is that you cannot have an efficient merge pass (merging two ROWID sets), because this requires intrinsic knowledge to be efficient. You are therefore indeed stuck with a linear search among the results of the first search. It is up to you to find the most discriminating first search.
Note: the name multi-index refers to the idea that it automatically maintains multiple indexes when you insert, update and delete your elements. It also means that you can search using any of those indexes with a performance profile that you decided. But it is not a full-blown database engine with statistics, heuristics and a query engine.
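To illustrate the compound-index idea without pulling in Boost, here is a standard-library sketch that keys a std::multimap on the (source, destination) pair, so the two-column query becomes a single equal_range call. The names Key, ConnectionIndex and portsBetween are made up for this example. Inside a multi_index_container, the equivalent would be an index keyed on both fields (Boost.MultiIndex has composite_key for that purpose).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Compound key: (source, destination). std::pair compares lexicographically,
// so all rows that share both fields sit next to each other.
using Key = std::pair<std::string, std::string>;
using ConnectionIndex = std::multimap<Key, int>;  // mapped value: port

// Return the ports of every connection from `source` to `dest`.
std::vector<int> portsBetween(const ConnectionIndex& index,
                              const std::string& source,
                              const std::string& dest) {
    std::vector<int> ports;
    auto range = index.equal_range(Key{source, dest});  // O(log n) to locate
    for (auto it = range.first; it != range.second; ++it)
        ports.push_back(it->second);
    return ports;
}
```

Because the key is ordered by source first, the same container can still answer "everything from server" with lower_bound on (source, "") and a scan while the source matches.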
The most elegant way to do a relational-database style query is to use a relational database. I'm not being flippant; you're using the wrong data structure. If "relational-database style query" operations are going to be something that you do frequently, I would strongly urge you to invest in SQLite.
The purpose of Boost.MultiIndex is not to be a quick-and-dirty database.
I have created a vector which contains several map<>.
vector<map<key,value>*> v;
v.push_back(&map1);
// ...
v.push_back(&map2);
// ...
v.push_back(&map3);
At any point in time, if a value has to be retrieved, I iterate through the vector and look for the key in every map element (i.e. v[0], v[1] etc.) until it's found. Is this the best way? I am open to any suggestions. This is just an idea; I have not implemented it yet (please point out any mistakes).
Edit: It's not important, in which map the element is found. In multiple modules different maps are prepared. And they are added one by one as the code progresses. Whenever any key is searched, the result should be searched in all maps combined till that time.
Without more information on the purpose and use, it might be a little difficult to answer. For example, is it necessary to have multiple map objects? If not, then you could store all of the items in a single map and eliminate the vector altogether. This would be more efficient to do the lookups. If there are duplicate entries in the maps, then the key for each value could include the differentiating information that currently defines into which map the values are put.
If you need to know which submap the key was found in, try:
unordered_map<key, pair<mapid, value>>
This has much better complexity for searching.
If the keys do not overlap, i.e., are unique throughout all maps, then I'd advise a set or unordered_set with a custom comparison functor, as this will help with the lookup. Or even extend the first map with the new maps, if profiling shows that is fast enough / faster.
If the keys are not unique, go with a multiset or unordered_multiset, again with a custom comparison functor.
You could also sort your vector manually and search it with a binary_search. In any case, I advise using a tree to store all maps.
It depends on how your maps are "independently created", but if it's an option, I'd make just one global map (or multimap) object and pass that to all your creators. If you have lots of small maps all over the place, you can just call insert on the global one to merge your maps into it.
That way you have only a single object in which to perform lookup, which is reasonably efficient (O(log n) for multimap, expected O(1) for unordered_multimap).
This also saves you from having to pass raw pointers to containers around and having to clean up!
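A small sketch of merging the per-module maps into one global map, assuming string keys and int values as placeholders for the real key and value types. The names Map and mergeInto are made up for illustration.

```cpp
#include <map>
#include <string>

using Map = std::map<std::string, int>;  // placeholder key/value types

// Copy every entry of `part` into `global`. std::map::insert leaves an
// existing key untouched, so the first map merged wins on duplicate keys.
void mergeInto(Map& global, const Map& part) {
    global.insert(part.begin(), part.end());
}
```

Call mergeInto once for each module's map as it becomes available; every later lookup is then a single find on the global map. If duplicate keys must all be kept, the same insert call works with std::multimap.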
I've got 200 strings. Each string has a relationship (measured by a float between 0 and 1) with every other string. This relationship is two-way; that is, relationship A/B == relationship B/A. This yields n(n-1)/2 relationships, or 19,900.
What I want to do is store these relationships in a lookup table so that given any two words I can quickly find the relationship value.
I'm using c++ so I'd probably use a std::map to store the LUT. The question is, what's the best key to use for this purpose.
The key needs to be unique and needs to be able to be calculated quickly from both words.
My approach is going to be to create a unique identifier for each word pair. For example given the words "apple" and "orange" then I combine them together as "appleorange" (alphabetical order, smallest first) and use that as the key value.
Is this a good solution or can someone suggest something more cleverer? :)
Basically you are describing a function of two parameters with the added property that order of parameters is not significant.
Your approach will work if you do not have ambiguity between words when changing order (I would suggest putting a comma or the like between the two words to remove possible ambiguities). Any 2D array would also work.
I would probably convert each keyword to some unique identifier (using a simple map) before trying to find the relationship value, but it does not change much from what you are proposing.
If boost/tr1 is acceptable, I would go for an unordered_map with the pair of strings as key. The main question would then be: what about the order of the strings? This could be handled by the hash function, which starts with the lexicographically first string.
Remark: this is just a suggestion after reading the design-issue, not a study.
How "quickly" is quickly? Given you don't care about the order of the two words, you could try a map like this:
std::map<std::set<std::string>, double> lut;
Here the key is a set of the two words, so if you insert "apple" and "orange", then the order is the same as "orange" "apple", and given set supports the less than operator, it can function as a key in a map. NOTE: I intentionally did not use a pair for a key, given the order matters there...
I'd start with something fairly basic like this, profile and see how fast/slow the lookups etc. are before seeing if you need to do anything smarter...
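For what it's worth, here is roughly how that set-keyed map could be wrapped. The helper names Lut, setRelation and relation are made up for illustration, and std::optional needs C++17.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>

using Lut = std::map<std::set<std::string>, double>;

// Store the relationship for an unordered pair of words.
void setRelation(Lut& lut, const std::string& a, const std::string& b, double value) {
    lut[std::set<std::string>{a, b}] = value;
}

// Look up the relationship; the argument order does not matter because
// {a, b} and {b, a} build the same set.
std::optional<double> relation(const Lut& lut, const std::string& a, const std::string& b) {
    auto it = lut.find(std::set<std::string>{a, b});
    if (it == lut.end()) return std::nullopt;
    return it->second;
}
```

Each lookup builds a temporary two-element set, so this is convenient rather than fast; that is the part worth profiling.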
If you create a sorted array with the 200 strings, then you can binary search it to find the matching indices of the two strings, then use those two indices in a 2D array to find the relationship value.
If your 200 strings are in an array, your 20,100 similarity values can be in a one dimensional array too. It's all down to how you index into that array. Say x and y are the indexes of the strings you want the similarity for. Swap x and y if necessary so that y>=x, then look at entry i= x + y(y+1)/2 in the large array.
(x,y) of (0,0),(0,1),(1,1),(0,2),(1,2),(2,2),(0,3),(1,3)... will take you to entry 0,1,2,3,4,5,6,7...
So this uses space optimally and it gives faster look up than a map would. I'm assuming efficiency is at least mildly important to you since you are using C++!
[if you're not interested in self similarity values where y=x, then use i = x + y(y-1)/2 instead].
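The indexing scheme above (the variant that includes the diagonal) can be sketched as follows. The function names triIndex and triSize are made up for illustration.

```cpp
#include <cstddef>
#include <utility>

// Index into a packed triangular array that includes the diagonal.
// The arguments may be given in either order.
std::size_t triIndex(std::size_t x, std::size_t y) {
    if (x > y) std::swap(x, y);  // ensure y >= x
    return x + y * (y + 1) / 2;
}

// Number of entries needed for n strings, diagonal included.
std::size_t triSize(std::size_t n) {
    return n * (n + 1) / 2;
}
```

With a std::vector<float> of triSize(200) entries, the value for strings x and y lives at element triIndex(x, y).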
I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too.
I suppose I could use a boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits?
Further, I am concerned that sparse storage is not appropriate, since this is an all-by-all matrix (there will be a distance in each and every cell, provided the gene exists).
I'm unlikely to need to perform any matrix operations (e.g., matrix multiplication). I will need to pull vectors out of the matrix (slices).
It seems like the best type of matrix would be keyed by a Boost unordered_map (a hash map), or perhaps even simply an STL map.
Am I looking at this the wrong way? Do I really need to roll my own? I thought I saw such a class somewhere before.
Thanks!
You could use a std::map to map the gene identifiers to unique, consecutively assigned integers (every time you add a new gene identifier to the map, you can give it the map's size as its identifier, assuming you never remove genes from the map).
If you want to be able to search for the identifier of a gene based on its unique integer, you can use a second map or you could use a boost::bimap, which provides a bidirectional mapping of elements.
As for which matrix container to use, you might consider boost::ublas::matrix; it provides vector-like access to rows and columns of the matrix.
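A standard-library sketch of the same idea, assuming the gene identifiers are handled as strings and are all known up front. It uses a plain std::vector as dense row-major storage instead of boost::ublas::matrix, and the class name GeneMatrix and its members are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class GeneMatrix {
public:
    // Assign each gene identifier the next consecutive integer index.
    explicit GeneMatrix(const std::vector<std::string>& geneIds)
        : n_(geneIds.size()), data_(n_ * n_, 0.0) {
        for (const auto& id : geneIds)
            index_.emplace(id, index_.size());
    }

    // Access one cell by gene identifiers; throws std::out_of_range
    // if either identifier is unknown.
    double& at(const std::string& row, const std::string& col) {
        return data_[index_.at(row) * n_ + index_.at(col)];
    }

    // Copy of one full row (a "slice") for the given gene.
    std::vector<double> row(const std::string& gene) const {
        auto first = data_.begin()
                   + static_cast<std::ptrdiff_t>(index_.at(gene) * n_);
        return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(n_));
    }

private:
    std::size_t n_;                                       // number of genes
    std::vector<double> data_;                            // n_ x n_, row-major
    std::unordered_map<std::string, std::size_t> index_;  // gene id -> index
};
```

Storage is n × n doubles regardless of how large or sparse the identifiers themselves are; only the id-to-index map depends on the identifier values.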
If you don't need matrix operations, you don't need a matrix. A 2D map with string keys can be done with map<string, map<string, double> > in plain C++, or using a hash map accordingly from Boost.
There is Boost.MultiArray which will allow you to manage with non-continuous indexes.
If you want an efficient implementation working with matrices with static size, there is also Boost.LA, which is now on the review schedule.
And lastly there is also NT2, which should be submitted to Boost soon.