What is the best autocomplete/suggest algorithm, data structure [C++/C] - c++

We see Google, Firefox, and some AJAX pages show a list of probable items while the user types characters.
Can someone suggest a good algorithm and data structure for implementing autocomplete?

A trie is a data structure that can be used to quickly find words that match a prefix.
Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html
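To make the prefix lookup concrete, here is a minimal sketch of a trie in C++. It is not taken from the linked post; the names Trie, TrieNode, and complete() are made up for illustration, and it assumes plain char keys with a std::map per node (simple, but not the most compact layout).

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative sketch; Trie, TrieNode and complete() are hypothetical names.
struct TrieNode {
    bool is_word = false;                                // true if a stored word ends here
    std::map<char, std::unique_ptr<TrieNode>> children;  // kept sorted by character
};

class Trie {
public:
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->is_word = true;
    }

    // Returns up to `limit` stored words starting with `prefix`,
    // in lexicographic order (because std::map iterates in key order).
    std::vector<std::string> complete(const std::string& prefix, std::size_t limit) const {
        std::vector<std::string> out;
        const TrieNode* node = &root_;
        for (char c : prefix) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return out;  // no stored word has this prefix
            node = it->second.get();
        }
        std::string buffer = prefix;
        collect(node, buffer, limit, out);
        return out;
    }

private:
    // Depth-first walk below `node`, appending complete words to `out`.
    static void collect(const TrieNode* node, std::string& buffer,
                        std::size_t limit, std::vector<std::string>& out) {
        if (out.size() >= limit) return;
        if (node->is_word) out.push_back(buffer);
        for (const auto& kv : node->children) {
            if (out.size() >= limit) return;
            buffer.push_back(kv.first);
            collect(kv.second.get(), buffer, limit, out);
            buffer.pop_back();
        }
    }

    TrieNode root_;
};
```

After inserting "car", "card", "care", and "dog", calling complete("car", 10) walks three nodes for the prefix and then collects the words below that node.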
Here's a comparison of 3 different auto-complete implementations (though it's in Java not C++).
* In-Memory Trie
* In-Memory Relational Database
* Java Set
When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.
The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.
These results are in Java; your mileage may vary with a C++ solution.

For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.
See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528
The goal is the fast retrieval of a finite result set as the user types. Let's first consider that to search for "computer science" you can start typing from "computer" or "science" but not "omputer". So, given a phrase, generate the sub-phrases starting with a word. Now feed each of these sub-phrases into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the fixed number of results (10 here) for a node, there should be a ranking function to resolve competition between two results.
The tree can be built once every few hours, depending on the dynamism of the data. If the data is in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is the lightning-fast retrieval of results for every keystroke typed which it does very well.
More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.
For small datasets like a list of countries, a simple implementation of a trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you after you provide the data in a list. If you use YUI3 as just the frontend for an autocomplete backed by large data, build the TST-based web service in C++, and then use the script node data source of the autocomplete widget to fetch data from the web service instead of a simple list.
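The TST scheme described above (index every sub-phrase that starts at a word, keep the best few results in each node) could be sketched roughly as follows. This is only an illustration under assumptions: the class name SuggestTst, the integer score used as the ranking function, and the choice to rebuild the per-node list on every insert are all mine, not part of the answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Illustrative sketch; SuggestTst and its members are hypothetical names.
class SuggestTst {
public:
    explicit SuggestTst(std::size_t k) : k_(k) {}

    // Index `phrase` under every sub-phrase that starts at a word boundary.
    // `score` stands in for the ranking function (higher is better).
    void add(const std::string& phrase, int score) {
        phrases_.push_back(phrase);
        scores_.push_back(score);
        const std::size_t id = phrases_.size() - 1;
        for (std::size_t i = 0; i < phrase.size(); ++i) {
            if (phrase[i] != ' ' && (i == 0 || phrase[i - 1] == ' ')) {
                insert(phrase.substr(i), id);
            }
        }
    }

    // Returns the phrases stored at the node reached by `prefix`, best first.
    std::vector<std::string> suggest(const std::string& prefix) const {
        std::vector<std::string> out;
        if (prefix.empty()) return out;
        const Node* node = root_.get();
        std::size_t i = 0;
        while (node) {
            if (prefix[i] < node->ch) {
                node = node->lo.get();
            } else if (prefix[i] > node->ch) {
                node = node->hi.get();
            } else {
                ++i;
                if (i == prefix.size()) break;  // `node` matches the last character
                node = node->eq.get();
            }
        }
        if (node) {
            for (std::size_t id : node->top) out.push_back(phrases_[id]);
        }
        return out;
    }

private:
    struct Node {
        char ch;
        std::unique_ptr<Node> lo, eq, hi;
        std::vector<std::size_t> top;  // ids of the best phrases for this prefix
        explicit Node(char c) : ch(c) {}
    };

    void insert(const std::string& key, std::size_t id) {
        std::unique_ptr<Node>* slot = &root_;
        std::size_t i = 0;
        while (i < key.size()) {
            if (!*slot) slot->reset(new Node(key[i]));
            Node* node = slot->get();
            if (key[i] < node->ch) {
                slot = &node->lo;
            } else if (key[i] > node->ch) {
                slot = &node->hi;
            } else {
                remember(*node, id);  // this node represents key[0..i]
                slot = &node->eq;
                ++i;
            }
        }
    }

    // Keep at most k_ phrase ids per node, ordered by descending score.
    void remember(Node& node, std::size_t id) {
        if (std::find(node.top.begin(), node.top.end(), id) != node.top.end()) return;
        node.top.push_back(id);
        std::stable_sort(node.top.begin(), node.top.end(),
                         [this](std::size_t a, std::size_t b) { return scores_[a] > scores_[b]; });
        if (node.top.size() > k_) node.top.pop_back();
    }

    std::size_t k_;
    std::unique_ptr<Node> root_;
    std::vector<std::string> phrases_;
    std::vector<int> scores_;
};
```

Lookup cost per keystroke is one walk down the tree plus copying at most k results, which is what makes this layout attractive when the tree is rebuilt offline every few hours.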

Segment trees can be used for efficiently implementing auto complete

If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice.

For a simple solution: generate candidates within a small edit (Levenshtein) distance (1 or 2) of the typed word, then test whether each candidate exists in a hash container (a set will suffice for a simple solution; later, use unordered_set from TR1 or Boost).
Example:
You wrote carr and you want car.
arr is generated by 1 deletion. Is arr in your unordered_set ? No. crr is generated by 1 deletion. Is crr in your unordered_set ? No. car is generated by 1 deletion. Is car in your unordered_set ? Yes, you win.
Of course there's insertion, deletion, transposition etc...
You see that your algorithm for generating candidates is really where you’re wasting time, especially if you have a very small unordered_set.
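As a rough illustration of the candidate-generation idea, the sketch below produces the distance-1 candidates from deletions and adjacent transpositions only, then keeps those found in a std::unordered_set. The function names are made up, and insertions and substitutions are left out on purpose to keep it short.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Illustrative sketch; all function names here are hypothetical.

// Every string obtained by deleting exactly one character from `word`.
std::vector<std::string> deletions(const std::string& word) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        std::string candidate = word;
        candidate.erase(i, 1);
        out.push_back(candidate);
    }
    return out;
}

// Every string obtained by swapping two adjacent characters of `word`.
std::vector<std::string> transpositions(const std::string& word) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i + 1 < word.size(); ++i) {
        std::string candidate = word;
        std::swap(candidate[i], candidate[i + 1]);
        out.push_back(candidate);
    }
    return out;
}

// Candidates (deletions first, then transpositions) that exist in `dictionary`,
// each reported once.
std::vector<std::string> suggest_corrections(
        const std::string& typed,
        const std::unordered_set<std::string>& dictionary) {
    std::vector<std::string> candidates = deletions(typed);
    const std::vector<std::string> swapped = transpositions(typed);
    candidates.insert(candidates.end(), swapped.begin(), swapped.end());

    std::vector<std::string> hits;
    std::unordered_set<std::string> seen;
    for (const std::string& c : candidates) {
        if (dictionary.count(c) && seen.insert(c).second) hits.push_back(c);
    }
    return hits;
}
```

For "carr", deletions() yields arr, crr, car, car, matching the walk-through above; the duplicate "car" is why the sketch tracks what it has already reported.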

Related

Managing large spatial data set with attributes in C++

I have a data set with about 700 000 entries, and each entry is a set of 3D coordinates with attributes such as name, timestamp, ID, and so on.
Right now I'm just reading the coordinates and rendering them as points in OpenGL. However, I want to associate each point with its corresponding attributes, and I want to be able to sort and pick them at runtime based on their attributes. How would I go about achieving this in an efficient manner?
I know I can put the data in a struct and use STL sort for sorting, but is that a good design choice, or is there a more efficient/elegant way of handling the problem?
The way I tend to look at these design choices is to first use one of the standard library containers (btw, if you need to "just" do lookup you don't necessarily have to sort, but you need a container that allows lookup), then check if this is an "efficient enough" solution for the problem.
You can usually come up with a custom solution that is more efficient and maybe more elegant but you tend to run into two issues with that:
1) You end up having to implement some type of a container, which will cost you time both in implementation and debugging compared to a well understood and tested container that is already out there. Most of the time you're better off trying to solve the problem at hand rather than make it bigger by adding more code.
2) If someone else will have to maintain your code at some point, chances are they are familiar with standard library components both from a design and implementation perspective, but they won't be familiar with your custom container, thus increasing the learning curve.
If you consider each attribute of your point class as a component of a vector, then your selection process is a region query. Your example of a string attribute being equal to something means that the region is actually a line in your data space. However, there won't be any sorting made on other attributes within that selection, you will have to implement it by yourself, but it should be relatively straightforward for octrees, which partition data in ordered regions.
As advocated in another answer, try existing standard solutions first. If you can find an off-the-shelf implementation of one of these data structures:
R-tree
KD tree
BSP
Octree, or more likely, an n-dimensional version of the quadtree or octree principle (I will use the term octree herein to denote the general data structure)
then go for it. These are the data structures I recommend for spatial data management.
You could also use an embedded RDBMS capable of working with spatial data (they usually implement R-tree for spatial indexing), but it may not be interesting if your dataset isn't dynamic.
If your dataset falls within the 10,000-entry range, then by today's standards it isn't that large, so simpler structures should suffice. In that range, I would first go for a simple std::vector, and use std::sort and std::find to filter the data into a smaller set and sort it afterward.
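A minimal sketch of that first attempt might look like the following. The PointRecord layout and the helper names are assumptions for illustration (the question only says each entry has coordinates, a name, a timestamp, and an ID), and it uses std::find_if and std::copy_if rather than plain std::find because the lookups are on a single attribute.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Illustrative sketch; the field layout and function names are hypothetical.
struct PointRecord {
    float x, y, z;
    std::uint32_t id;
    std::int64_t timestamp;
    std::string name;
};

// Sort the records in place by ascending timestamp.
void sort_by_timestamp(std::vector<PointRecord>& points) {
    std::sort(points.begin(), points.end(),
              [](const PointRecord& a, const PointRecord& b) {
                  return a.timestamp < b.timestamp;
              });
}

// Linear search by name; returns nullptr when no record matches.
const PointRecord* find_by_name(const std::vector<PointRecord>& points,
                                const std::string& name) {
    auto it = std::find_if(points.begin(), points.end(),
                           [&name](const PointRecord& p) { return p.name == name; });
    return it == points.end() ? nullptr : &*it;
}

// Copy out the records whose timestamp lies in [from, to].
std::vector<PointRecord> select_time_range(const std::vector<PointRecord>& points,
                                           std::int64_t from, std::int64_t to) {
    std::vector<PointRecord> out;
    std::copy_if(points.begin(), points.end(), std::back_inserter(out),
                 [from, to](const PointRecord& p) {
                     return p.timestamp >= from && p.timestamp <= to;
                 });
    return out;
}
```

If copying whole records turns out to be too expensive, sorting a vector of indices into the original array is a common variation that also leaves the OpenGL vertex data untouched.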
I would probably try an ordered set or map on the most queried attribute in a second attempt, then do some benchmarks to pick the more performing solution.
For a more efficient one-dimensional indexing structure (in essence, that's what sets and maps are), you might want to try B-trees: there's a C++ implementation available from Google.
My third attempt would go toward an OpenCL solution (although if you are doing heavy OpenGL rendering, you might prefer doing the work on the CPU instead, but that depends on your framerate needs).
If your dataset is much larger, as it seems to be, then consider one of the more complex solutions I listed initially.
At any rate, without more details about your dataset and how you plan to use it, it will be difficult to provide a good solution, so the only real advice we can give is: try everything you can and benchmark.
If you're dealing with point clouds, take a look at PCL, it could save you a lot of time and effort without having to dig into the intricacies of spatial indexing yourself. It also includes visualisation.

Features combinations

I have a list of features set (40 features) and my idea firstly was to evaluate the classifier on all the combinations that I can get. However, after I did some calculations I found that the combinations will reach millions!! Thus, it will take forever!!!!
I read about the possibility of using a random search method to choose random features. However, each time I run the random search I get the same feature sets. Do I need to change the seed number or some other option??
Also, is random search effective, and can it substitute for the approach of trying all combinations???
I would appreciate your help experts.
Many thanks in advance,
Ahmad
When you want to perform attribute selection in WEKA, you should take into account two algorithms: the searcher and the attribute evaluator (I will talk about it later).
As you said, an exhaustive search may not be feasible because it takes too long. There are greedy alternatives that give good results (depending on the problem), such as Best first (based on hill climbing). The option you mention (Random search) is another way of generating subsets: it makes random iterations to select the subsets that will be evaluated.
Why are you getting the same subset of selected attributes? Because Random search always selects the same subsets and the evaluator picks the best one (the final output). If you change the seed parameter, the result should change. Maybe, or maybe not. Why? Because if the algorithm performs enough iterations, then even with a different seed it will reach the same subsets as the previous run (convergence), and the evaluator will choose the same subset as before.
If you do not want the selector output to converge, change the seed and also choose a smaller search percent to limit the exploration and get different results.
But, in my opinion, if you always get the same results, it is because the evaluator (I do not know which algorithm you are using) has determined that this subset is "the best" given your dataset. I also recommend trying another search method, such as Best first or a Genetic search.

Search structure with history (persistence)

I need a map-like data structure (in C++) for storing pairs (Key,T) with the following functionality:
You can insert new elements (Key,T) into the current structure
You can search for elements based on Key in the current structure
You can make a "snapshot" of the current version of the structure
You can switch to one of the versions of the structures which you took the snapshot of and continue all operations from there
Completely remove one of the versions
What I don't need
Element removal from the structure
Merging of different versions of the structure into one
Iteration over all (or some of) elements currently stored in the structure
In other words, you have some search structure that you can build up, but at any point you can jump in history, and expand the earlier/different version of the structure in a different way. Later on you may jump between those different versions.
In my project, Key and T are likely to be integers or pointer values, but not strings.
The primary objective is to reduce the time complexity; space consumption is secondary (but should be reasonable as well). To clarify, for me log(N)+log(S) (where N-number of elements, S-number of snapshots) would be enough, although faster is better :)
I have a rough idea of how to implement it. For example, if the structure is a binary search tree, inserting a new element can clone the path from the root to the insertion location while keeping the rest of the tree intact. Switching tree versions would be equivalent to picking a different version of the root node, for which some changes are simply not visible.
However, to make this custom tree efficient (e.g. self-balancing) it will require some additional effort and careful coding. Of course I can do it myself but perhaps there are already existing libraries to do exactly that?
Also, there is probably a proper name for this kind of data structure that I simply don't know, making my Google searches (or SO searches) total failures...
Thank you for your help!
I think what you are looking for is an immutable map. Functional (or functionally inspired) programming languages (such as Haskell or Scala) have immutable versions of most of the containers you'd find in the STL. Operations such as insertion or removal then return a copy of the map (preserving the original), with the copy containing your requested modification. A lot of work has gone into designing these data structures so that the copies can point to as much of the original data structure as possible, reducing the time and memory cost of each operation.
You can find a lot more details in a book such as this one: http://www.amazon.co.uk/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.
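For a feel of how such structure sharing works in C++, here is a minimal sketch of the path-copying binary search tree mentioned in the question, using std::shared_ptr so that versions share untouched subtrees. PersistentMap is a made-up name, the tree is deliberately unbalanced (so operations are O(height), not guaranteed O(log N)), and it is not the API of any existing library.

```cpp
#include <memory>
#include <utility>

// Illustrative sketch; PersistentMap is a hypothetical, unbalanced
// path-copying BST, not an existing library class.
template <typename Key, typename T>
class PersistentMap {
public:
    PersistentMap() = default;  // the empty version

    // Returns a new version containing (key, value); *this is unchanged.
    PersistentMap insert(const Key& key, const T& value) const {
        return PersistentMap(insert_node(root_, key, value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const T* find(const Key& key) const {
        const Node* node = root_.get();
        while (node) {
            if (key < node->key) node = node->left.get();
            else if (node->key < key) node = node->right.get();
            else return &node->value;
        }
        return nullptr;
    }

private:
    struct Node;
    using Ptr = std::shared_ptr<const Node>;

    struct Node {
        Key key;
        T value;
        Ptr left, right;
        Node(const Key& k, const T& v, Ptr l, Ptr r)
            : key(k), value(v), left(std::move(l)), right(std::move(r)) {}
    };

    explicit PersistentMap(Ptr root) : root_(std::move(root)) {}

    // Copies only the nodes on the path from the root to the insertion point;
    // every other subtree is shared with the previous version.
    static Ptr insert_node(const Ptr& node, const Key& key, const T& value) {
        if (!node) return std::make_shared<Node>(key, value, Ptr(), Ptr());
        if (key < node->key) {
            return std::make_shared<Node>(node->key, node->value,
                                          insert_node(node->left, key, value), node->right);
        }
        if (node->key < key) {
            return std::make_shared<Node>(node->key, node->value,
                                          node->left, insert_node(node->right, key, value));
        }
        return std::make_shared<Node>(key, value, node->left, node->right);  // replace value
    }

    Ptr root_;
};
```

In this sketch a snapshot is just a copy of a PersistentMap object (one shared_ptr copy), switching versions means using a different object, and removing a version means destroying its object; nodes reachable only from that version are then freed by the reference counts.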
While searching for some persistent search trees libraries I stumbled on this:
http://cg.scs.carleton.ca/~dana/pbst/
While it does not have the exact same functionality as needed, it seems pretty close to it. I will investigate.
(posting here, as someone may find it useful as well)

Hash Table Implementation Using An Array of Linked Lists

This question has been bugging me for quite a long time, and today I read a detailed article about hash tables. Without checking any implementation examples, I wanted to take a shot at writing a hash table from scratch.
The separate chaining method gave me the idea for implementing the hash table. Anyone with experience in data structures might regard this question as a joke, but I'm a beginner, and before diving straight into the code I wanted to discuss my implementation's efficiency. Would it be efficient, or would some other fundamental approach be preferable?
I think for starters you can also peek into the source (or documentation) of a hash map implemented in the Boost libraries. It is called unordered_map. (link is here)
As long as you don't know about these implementations, want a hash table, and are annoyed that it is not in the STL, you will be tempted to write your own fast data store.
But implementing hash maps yourself is now so unnecessary that C++11 has unordered_map in its standard library. You'll find there is plenty of more interesting stuff out there.
Note: separate chaining is also called bucket hashing. In fact, Boost uses bucket hashing, see this link. Maybe you could look up some performance comparisons instead. Chances are that those who run performance comparisons write good enough implementations.
Using closed addressing, another alternative is to use a self balancing binary search tree, e.g. red-black tree/std::map or heap tree, for the inner data structure, or even another hash map with different hashing algorithm.
Using open addressing, alternatives to linear probing are quadratic probing and double hashing; there are also less commonly used strategies such as cuckoo hashing, hopscotch hashing, etc.
The key points of implementing a hash table are choosing the right hashing algorithm, resizing strategy (load factor), and collision resolution strategy. The best strategy is highly dependent on the type of workload that you're expecting, as there are tradeoffs for each approach.
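To tie those three choices together, here is a small sketch of a separate-chaining table as an array of linked lists, which is what the question describes. Everything specific in it is an assumption made for illustration: the class name ChainedHashMap, string keys with int values, std::hash as the hashing algorithm, and doubling the bucket array once the load factor exceeds 1.

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch; ChainedHashMap is a hypothetical name, and the
// resize policy (double when load factor > 1) is an arbitrary choice.
class ChainedHashMap {
    using Bucket = std::forward_list<std::pair<std::string, int>>;

public:
    explicit ChainedHashMap(std::size_t bucket_count = 8)
        : buckets_(bucket_count ? bucket_count : 1), size_(0) {}

    // Insert a new key or overwrite the value of an existing one.
    void put(const std::string& key, int value) {
        Bucket& bucket = buckets_[index_of(key)];
        for (auto& entry : bucket) {
            if (entry.first == key) {
                entry.second = value;
                return;
            }
        }
        bucket.emplace_front(key, value);
        ++size_;
        if (size_ > buckets_.size()) rehash(buckets_.size() * 2);
    }

    // Returns a pointer to the value, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        for (const auto& entry : buckets_[index_of(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    // Move every entry into a fresh array of `new_count` buckets.
    void rehash(std::size_t new_count) {
        std::vector<Bucket> old(new_count);
        old.swap(buckets_);  // buckets_ is now the empty, larger table
        for (Bucket& bucket : old) {
            for (auto& entry : bucket) {
                buckets_[index_of(entry.first)].push_front(std::move(entry));
            }
        }
    }

    std::vector<Bucket> buckets_;
    std::size_t size_;
};
```

With a reasonable hash and this resize policy the chains stay short on average, so lookups are expected constant time; a poor hash degrades each bucket to a linear scan, which is where the tree-per-bucket variant mentioned above comes in.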

Is a radix tree (Patricia Trie) an efficient data structure for a mobile-phone address book

I have been thinking about implementing an address book in C++. Since it's being developed for a mobile application, the address book should use as little memory as possible, and the user should still be able to search or sort contacts by name quickly (a paradox, I know).
After researching a bit, I found that most people suggest that a trie would be the best data structure to fit my needs, more precisely a radix tree (Patricia trie). This data structure would also be great for implementing autocomplete.
Are there other viable solutions, or is it OK if I start coding using this idea?
Beware of tries for small collections. Though they do offer good asymptotic behavior, their hidden constants in both time and space might be too big.
In particular, tries tend to have poor cache performance, which should be the main concern for small collections.
Assuming your data is relatively small [<10,000 entries], a std::vector can offer good cache performance, which will probably have much more influence than the size factor. So even though its search time is asymptotically higher than that of a trie or a std::set, in practice it might be better than both, thanks to good caching behavior.
If you can also keep the vector sorted and use binary search, you can benefit from both logarithmic search time and good cache behavior.
(*) This answer assumes the hardware the app will be deployed on has a CPU cache.
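The sorted-vector approach can be sketched in a few lines with std::lower_bound and std::upper_bound. The function names are made up, and the comparison is the plain byte-wise, case-sensitive one of std::string, which a real address book would probably want to replace with a locale-aware ordering.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative sketch; function names are hypothetical.

// Insert `name` so that `sorted_names` stays sorted in ascending order.
void add_contact(std::vector<std::string>& sorted_names, const std::string& name) {
    sorted_names.insert(
        std::upper_bound(sorted_names.begin(), sorted_names.end(), name), name);
}

// Returns every name starting with `prefix`. `sorted_names` must be sorted.
std::vector<std::string> names_with_prefix(const std::vector<std::string>& sorted_names,
                                           const std::string& prefix) {
    std::vector<std::string> out;
    // Binary search for the first name that is not less than the prefix;
    // all names sharing the prefix follow it contiguously.
    auto it = std::lower_bound(sorted_names.begin(), sorted_names.end(), prefix);
    for (; it != sorted_names.end() &&
           it->compare(0, prefix.size(), prefix) == 0; ++it) {
        out.push_back(*it);
    }
    return out;
}
```

The lookup is logarithmic plus the number of matches, and the same function doubles as the autocomplete query; insertion is linear because elements are shifted, which is usually acceptable for a few thousand contacts.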
Tries are the best for this purpose, as they offer quick search, insertion, and deletion.