what data structure to use for range searches? - c++

Trying to make a simple program to catalogue books. Something like this, for example:
struct book{
string author;
string title;
int catalogNumber;
};
Ultimately, I want to be able to do title searches based on a range. So the user could specify to display results of books where the title begins with "aa" through "be". Ideally, the search average case would be logarithmic.
Is there something in the STL that could help me? Otherwise, what is the best way to go about this?
Thanks!

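You can store them in a std::set and use its lower_bound and upper_bound member functions to find a range (and yes, that is logarithmic). Prefer the members over the free std::lower_bound and std::upper_bound algorithms: a set's iterators are not random-access, so the free algorithms need a linear number of iterator steps. To do that, you'll need to define operator< to operate on just the field(s) you care about (title, in this case).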
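As a rough sketch of that range lookup, assuming titles are stored in a consistent case and that a comparator struct is used in place of operator< (the names Book, ByTitle, Catalog and titlesInRange are made up for illustration):

```cpp
#include <set>
#include <string>
#include <vector>

struct Book {
    std::string author;
    std::string title;
    int catalogNumber;
};

// Order books by title only, so the container is effectively keyed on title.
struct ByTitle {
    bool operator()(const Book& a, const Book& b) const {
        return a.title < b.title;
    }
};

// multiset rather than set, so two books may share a title (an assumption).
using Catalog = std::multiset<Book, ByTitle>;

// Returns the books whose titles sort from `first` up to and including
// every title that begins with `last`.
// Assumes `last` is non-empty and its final character can be incremented.
std::vector<Book> titlesInRange(const Catalog& catalog,
                                const std::string& first,
                                std::string last) {
    ++last.back();  // "be" -> "bf": an exclusive bound just past all "be..." titles
    if (first >= last) return {};
    auto begin = catalog.lower_bound(Book{"", first, 0});
    auto end = catalog.lower_bound(Book{"", last, 0});
    return std::vector<Book>(begin, end);
}
```

Both bounds are found with the container's own lower_bound, so locating the range is logarithmic; copying the results is linear in the number of matches.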
If you're (virtually) always treating the title as the key, you might prefer to use an std::map<std::string, info>, with info defined like:
struct info {
    string author;
    int catalogNumber;
    info() : catalogNumber(0) {}  // operator[] on the map needs a default constructor
    info(string a, int c) : author(a), catalogNumber(c) {}
};
This makes a few operations a little easier, such as:
books["Moby Dick"] = info("Herman Melville", 1234);
If you want to support searching by title or author (for example) consider using something like a Boost bimap or multi_index.
For what it's worth, I'd also give serious thought to using a string instead of an int for the catalog number. Almost none of the standard numbering systems (e.g., Dewey Decimal, Library of Congress, ISBN) will store very nicely in an int.

You can use a trie [expanding on @smarinov's suggestion here]:
Finding the set of words with a common prefix is fairly easy in a trie: just follow pointers until you reach the node representing the desired prefix. The subtree rooted at that node contains every word with that prefix.
In your example, you will need:
range("aa","be") = prefix("a") + prefix("b[a-e]")
The expected complexity of this operation is O(|S|), where |S| is the length of the common prefix. Note that no algorithm can be expected to do better than that: O(log n) algorithms are actually O(|S| * log n), because the cost of each comparison depends on the length of the string.
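A minimal sketch of such a prefix lookup is below. The class name Trie and its members insert and withPrefix are illustrative, and storing children in a std::map is just one possible layout, chosen here so results come back in sorted order:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

class Trie {
    struct Node {
        bool isWord = false;
        std::map<char, std::unique_ptr<Node>> children;  // ordered by character
    };
    Node root_;

    // Depth-first walk that appends every word below `node` to `out`.
    static void collect(const Node& node, std::string& current,
                        std::vector<std::string>& out) {
        if (node.isWord) out.push_back(current);
        for (const auto& entry : node.children) {
            current.push_back(entry.first);
            collect(*entry.second, current, out);
            current.pop_back();
        }
    }

public:
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char ch : word) {
            auto& child = node->children[ch];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Follows one pointer per prefix character, then lists the subtree.
    std::vector<std::string> withPrefix(const std::string& prefix) const {
        const Node* node = &root_;
        for (char ch : prefix) {
            auto it = node->children.find(ch);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        std::vector<std::string> out;
        std::string current = prefix;
        collect(*node, current, out);
        return out;
    }
};
```

Reaching the prefix node costs one step per prefix character; listing the matches is then proportional to the size of the subtree.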

You can put your elements in a std::set. The problem with that is that you'd probably like your users to be able to search by title as well as by author. A solution is just to maintain two sets, but if your data changes this can be tricky to maintain and you need twice as much space.
You can always write something like a trie, but chances are your data will change and it becomes harder to maintain the logarithmic search time. You can implement any kind of self-balancing binary search tree, but that's essentially what a set is: a red-black tree. Writing one is not the easiest task, however...
Update: You can hash everything and implement some form of the Rabin-Karp string search algorithm, but you should be aware that there are collisions possible if you do it. You can reduce the probability of one by double-hashing and/or using good hashing functions.

Related

How to improve linked list searching. C++

I have a simple method in C++ which searches for a string in a linked list. It works well, but I need to make it faster. Is that possible? Maybe I need to insert items into the list in alphabetical order? But I don't think that would help when searching the list. The list holds about 300 000 items (words).
int GetItemPosition(const char* stringToFind)
{
    int i = 0;
    MyList* Tmp = FistListItem;
    while (Tmp) {
        if (!strcmp(Tmp->Value, stringToFind))
        {
            return i;
        }
        Tmp = Tmp->NextItem;
        i++;
    }
    return -1;
}
The method returns the position number if the item is found; otherwise it returns -1.
Any suggestion will be helpful.
Thanks for answers, I can change structure. I have only one constraint. Code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure would be the most suitable, in your opinion?
Searching a linked list is linear: you have to iterate from the beginning one node at a time, so it is O(n). Linked lists are not the best choice if you need to search; more suitable data structures exist, such as binary search trees.
Ordering the elements does not help much, because you still have to walk the list node by node.
Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).
So in the first case you can slightly improve (by statistical assumptions) your search performance by moving items found previously closer to the beginning of the list. This assumes that previously found elements will be searched more frequently.
The second method requires an additional data structure alongside the list.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
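One way the required interface could sit on top of such a structure is sketched below, assuming a std::vector for positional access plus a std::unordered_map as the index. The class name WordList is made up, the return value of AddItem and its handling of duplicates are guesses at the intended semantics, and GetString returns const char* rather than the char* in the interface:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class WordList {
    struct Entry {
        std::string value;
        int occurrence;
    };
    std::vector<Entry> entries_;                  // position -> entry
    std::unordered_map<std::string, int> index_;  // string -> position

public:
    int Count() const { return static_cast<int>(entries_.size()); }

    // Appends the string if it is new; returns its position either way.
    // (Assumed behaviour: an existing entry is left unchanged.)
    int AddItem(const char* stringValue, int wordOccurrence) {
        auto found = index_.find(stringValue);
        if (found != index_.end()) return found->second;
        int position = Count();
        entries_.push_back(Entry{stringValue, wordOccurrence});
        index_.emplace(stringValue, position);
        return position;
    }

    // Average O(1) hash lookup instead of walking a list; -1 if absent.
    int GetItemPosition(const char* stringValue) const {
        auto found = index_.find(stringValue);
        return found == index_.end() ? -1 : found->second;
    }

    // The returned pointer may be invalidated by a later AddItem.
    const char* GetString(int index) const {
        return entries_.at(index).value.c_str();
    }

    int GetOccurrenceNum(int index) const {
        return entries_.at(index).occurrence;
    }

    void SetInteger(int index, int wordOccurrence) {
        entries_.at(index).occurrence = wordOccurrence;
    }
};
```

The vector keeps index-based access constant-time, while the hash map replaces the linear scan in GetItemPosition; the price is that every string is stored twice.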
Consider using an array or std::vector as storage instead of a linked list, and use binary search to find a particular string, or, even better, std::set if you don't need a numerical index. If for some reason it is not possible to use other containers, there is not much you can do; you may want to speed up the comparison by storing a hash of the string alongside it in each node.
I suggest hashing.
Since you've already got a linked list of your own, you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree or a red-black tree. These trees are designed to minimize the number of traversals needed to find an item.
You could also store "short cut links". For example, if the list is of strings, you could have an array of links of where to start searching based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
The answer is no, you cannot improve the search without changing your data-structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only allow you to quickly decide that a given item is not in the list when it falls outside the list's range, by testing against the first item (which will be either the smallest or the largest entry), and this improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we will be unlikely to help you...
The best option is to use a faster data structure for storing strings:
std::map - red-black tree behind the scenes. Has O(log n) search/insert/delete operations. Suitable if you want to store additional values with strings (for example, positions).
std::set - basically the same tree but without values. Best when you only need a "contains" operation.
std::unordered_map - hash table. O(1) access on average.
std::unordered_set - hash set. Also O(1) access on average.
Note: in all of these cases there is a catch. The complexity is calculated only in terms of n (the number of strings). In reality string comparison is not free, so O(1) becomes O(m) and O(log n) becomes O(m log n), where m is the maximum string length. This does not matter for relatively short strings, but otherwise consider using a trie. In practice a trie can be even faster than a hash table: each character of the query string is accessed only once instead of multiple times. For a hash table/set it's at least once for the hash calculation and at least once for the actual string comparison (depending on the collision resolution strategy; I'm not sure how it is implemented in C++).

Fastest way to search for a string

I have 300 strings to store and search, and most of them are nearly identical in characters and length. For example, I have strings "ABC1", "ABC2", "ABC3" and so on, and another set like sample1, sample2, sample3. So I am confused about how to store them: in an array or in a hash table. My main concern is the time it takes to search for a string when I need to get one out of storage. If I use an array, I will have to do a string compare on every index to arrive at one. If I implement a hash table, I will have to take care of collisions (obviously) and implement chaining to store strings that collide.
So I am looking for suggestions that weigh the pros and cons of each, to arrive at the best practice.
Because the keys are short and tend to have a common prefix, you should consider radix data structures such as the Patricia trie and the ternary search tree (search for these; you'll find lots of examples). Search time in these structures tends to be O(1) with respect to the number of entries and O(n) with respect to the length of the key. Beware, however, that long strings can use lots of memory.
Search time is similar to hash maps if you don't consider collision resolution which is not a problem in a radix search. Note that I am considering the time to compute the hash as part of the cost of a hash map. People tend to forget it.
One downside is radix structures are not cache-friendly if your keys tend to show up in random order. As someone mentioned, if the search time is really important: measure the performance of some alternative approaches.
This depends on how much your data is changing. By that I mean: if you have 300 index strings which each reference another string, how often do those 300 index strings change?
You can use a std::map for quick lookups, but the map will require more resources when it is first created (compared to an array, vector or list).
I use maps mostly for some kind of dynamic lookup tables (for example: ip to socket).
So in your case it will look like this:
std::map<std::string, std::string> my_map;
my_map["ABC1"] = "sample1";
my_map["ABC2"] = "sample2";
std::string looked_up = my_map["ABC1"];
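One caveat with the snippet above: std::map's operator[] inserts a default-constructed value when the key is missing, and it cannot be called on a const map. A small sketch of a read-only lookup using find instead (the helper name lookupOr and its fallback parameter are made up for illustration):

```cpp
#include <map>
#include <string>

// Returns the mapped value, or `fallback` when the key is absent.
// Unlike operator[], find() never inserts a new element.
std::string lookupOr(const std::map<std::string, std::string>& table,
                     const std::string& key,
                     const std::string& fallback) {
    auto it = table.find(key);
    return it == table.end() ? fallback : it->second;
}
```

With this, looking up a key that was never stored leaves the map's size unchanged.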

C++ Boggle Solver: Finding Prefixes in a Set

This is for a homework assignment, so I don't want the exact code, but would appreciate any ideas that can help point me in the right direction.
The assignment is to write a boggle solving program. I've got the recursive part down I feel, but I need some insight on how to compare the current sequence of characters to the dictionary.
I'm required to store the dictionary in either a set or sorted list. I've been trying a way to implement this using a set. In order to make the program run faster and not follow dead end paths, I need to check and see if the current sequence of characters exists as a prefix to anything in the set (dictionary).
I've found that set.find() operation only returns true if the string is an exact match. In the lab requirements, the professor mentioned that:
"If the dictionary is stored in a Set, many data structure libraries provide a way to find the string in the Set that is closest to the one you are searching for. Such an operation could be used to quickly find a word with a given prefix."
I've been searching today for what the professor is describing. I've found a lot of information on tries, but since I'm required to use a list or set, I don't think that will work.
I've also tried looking up algorithms for autocomplete functions, but the ones that I've found seem extremely complicated for what I'm trying to accomplish here.
I also was thinking of using strncmp() to compare the current sequence to a word from the dictionary set, but again, I don't know how exactly that would function in this situation, if at all.
Is it worth it to continue investigating how this would work in a set or should I just try using a sorted list to store my dictionary?
Thanks
As @Raymond Hettinger mentions in his answer, a trie would be extremely useful here. However, if you either are uncomfortable writing a trie or would prefer to use off-the-shelf components, you can use a cute property of how words are ordered alphabetically to check in O(log n) time whether a given prefix exists. The idea is as follows: suppose, for example, that you are checking for the prefix "thr". Every word that begins with the prefix "thr" must be sandwiched between the strings "thr" and "ths". For example, thr ≤ through < ths, and thr ≤ throat < ths. If you are storing your words in a giant sorted array, you can use a modified version of binary search to find the first word alphabetically at least the prefix of your choice and the first word alphabetically at least the next prefix (formed by taking the last letter of the prefix and incrementing it). If these are the same word, then nothing is between them and the prefix doesn't exist. If they're not, then something is between them and the prefix does exist.
Since you're using C++, you can do this with a std::vector and the std::lower_bound algorithm. You could also throw all the words into a std::set and use the set's version of lower_bound. For example:
std::set<std::string> dictionary;
std::string prefix = /* ... */;
/* Get the next prefix (assumes prefix is non-empty). */
std::string nextPrefix = prefix;
nextPrefix[nextPrefix.length() - 1]++;
/* Check whether there is something with the prefix. */
if (dictionary.lower_bound(prefix) != dictionary.lower_bound(nextPrefix)) {
    /* ... something has that prefix ... */
} else {
    /* ... no word has that prefix ... */
}
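A slightly different variant of the same idea, shown here only as a sketch: do a single lower_bound and then test whether the word it lands on actually starts with the prefix. This is not the two-bound comparison described above, but it avoids incrementing the last character, so it also copes with an empty prefix. The function name hasPrefix is made up for illustration:

```cpp
#include <set>
#include <string>

// True if at least one word in the dictionary starts with `prefix`.
bool hasPrefix(const std::set<std::string>& dictionary,
               const std::string& prefix) {
    // First word that is not less than the prefix. If any word carries the
    // prefix, this is one of them, because such words sort right after it.
    auto it = dictionary.lower_bound(prefix);
    return it != dictionary.end() &&
           it->compare(0, prefix.size(), prefix) == 0;
}
```

In a Boggle search this check can be run on the current letter sequence before recursing, so dead-end paths are abandoned early.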
That said, the trie is probably a better structure here. If you're interested, there is another data structure called a DAWG (Directed Acyclic Word Graph) that is similar to the trie but uses substantially less memory; in the Stanford introductory CS courses (where Boggle is an assignment), students actually are provided a DAWG containing all the words in the language. There is also another data structure called a ternary search tree that is somewhere in-between a binary search tree and a trie that may be useful here, if you'd like to look into it.
Hope this helps!
The trie is the preferred data structure of choice for this problem.
If you're limited to sets and dictionaries, I would choose a dictionary that maps prefixes to an array of possible matches:
asp -> aspberger aspire
bal -> balloon balance bale baleen ...
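Building such a prefix dictionary might look roughly like the sketch below. It assumes a single fixed prefix length, as in the three-letter example; a Boggle solver would probably want every prefix length, which costs more memory. The function name buildPrefixIndex is made up for illustration:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Maps each prefix of exactly `prefixLength` characters to the words that
// start with it. Words shorter than the prefix length are skipped.
std::map<std::string, std::vector<std::string>>
buildPrefixIndex(const std::vector<std::string>& words,
                 std::size_t prefixLength) {
    std::map<std::string, std::vector<std::string>> index;
    for (const std::string& word : words) {
        if (word.size() >= prefixLength) {
            index[word.substr(0, prefixLength)].push_back(word);
        }
    }
    return index;
}
```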

Searching a C++ Vector<custom_class> for the first/last occurrence of a value

I'm trying to work out the best method to search a vector of type "Tracklet" (a class I have built myself) to find the first and last occurrence of a given value for one of its variables. For example, I have the following classes (simplified for this example):
class Tracklet {
TimePoint *start;
TimePoint *end;
int angle;
public:
Tracklet(CvPoint*, CvPoint*, int, int);
};
class TimePoint {
int x, y, t;
public:
TimePoint(int, int, int);
TimePoint(CvPoint*, int);
// Relevant getters and setters exist here
};
I have a vector "vector<Tracklet> tracklets" and I need to search for any tracklets with a given value of "t" for the end timepoint. The vector is ordered in terms of end time (i.e. tracklet.end->t).
I'm happy to code up a search algorithm, but am unsure of which route to take with it. I'm not sure binary search would be suitable, as I seem to remember it won't necessarily find the first. I was thinking of a method where I use binary search to find an index of an element with the correct time, then iterate back to find the first and forward to find the last. I'm sure there's a better way than that, since it wastes binary searches O(log n) by iterating.
Hopefully that makes sense: I struggled to explain it a bit!
Cheers!
If the vector is sorted and contains the value, std::lower_bound will give you an iterator to the first element with a given value and std::upper_bound will give you an iterator to one element past the last one containing the value. Compare the value with the returned element to see if it existed in the vector. Both these functions use binary search, so time is O(logN).
To compare on tracklet.end->t, use:
bool compareTracklets(const Tracklet &tr1, const Tracklet &tr2) {
return (tr1.end->t < tr2.end->t);
}
and pass compareTracklets as the fourth argument to lower_bound or upper_bound
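Put together, a lookup of the first and last match could be sketched as follows. The Tracklet here is a simplified stand-in with a public endTime field instead of end->t, and the lambdas compare an element directly against the integer time, which is a variation on the two-Tracklet comparator above; findByEndTime is a made-up name:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Tracklet {
    int endTime;  // stand-in for tracklet.end->t
    int angle;
};

using TrackletIter = std::vector<Tracklet>::const_iterator;

// Returns the half-open range [first, last) of tracklets whose end time
// equals t. Requires `tracklets` to be sorted by endTime.
std::pair<TrackletIter, TrackletIter>
findByEndTime(const std::vector<Tracklet>& tracklets, int t) {
    // First element whose end time is not less than t.
    auto first = std::lower_bound(
        tracklets.begin(), tracklets.end(), t,
        [](const Tracklet& tr, int value) { return tr.endTime < value; });
    // First element whose end time is greater than t.
    auto last = std::upper_bound(
        first, tracklets.end(), t,
        [](int value, const Tracklet& tr) { return value < tr.endTime; });
    return {first, last};
}
```

If the two iterators are equal, no tracklet ends at time t; otherwise the matches are exactly the elements between them, found with two binary searches and no linear scan.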
I'd just use find and find_end, and then do something more complicated only if testing showed it to be too slow.
If you're really concerned about lookup performance, you might consider a different data structure, like a map with timestamp as the key and a vector or list of elements as the value.
A binary search seems like your best option here, as long as your vector remains sorted. It's essentially identical, performance-wise, to performing a lookup in a binary tree-structure.
dirkgently referred to a sweet optimization comparative. But I would in fact not use a std::vector for this.
Usually, when deciding to use a STL container, I don't really consider the performance aspect, but I do consider its interface regarding the type of operation I wish to use.
std::set<T>::find
std::set<T>::lower_bound
std::set<T>::upper_bound
std::set<T>::equal_range
Really, if you want an ordered sequence, outside of a key/value setup, std::set is just easier to use than any other.
You don't have to worry about inserting at a 'bad' position
You don't have problems of iterators invalidation when adding / removing an element
You have built-in methods for searching
Of course, you also want your Comparison Predicate to really shine (hopes the compiler inlines the operator() implementation), in every case.
But really, if you are not convinced, try a build with a std::vector and manual insertion / searching (using the <algorithm> header) and try another build using std::set.
Compare the size of the implementations (number of lines of code), compare the number of bugs, compare the speed, and then decide.
Most often, the 'optimization' you aim for is actually a pessimization, and in those rare times it's not, it's just so complicated that it's not worth it.
Optimization:
Don't
Expert only: Don't, we mean it
The vector is ordered in terms of time
The start time or the end time?
What is wrong with a naive O(n) search? Remember you are only searching and not sorting. You could use a sorted container as well (if that doesn't go against the basic design).

How to implement an associative array/map/hash table data structure (in general and in C++)

Well I'm making a small phone book application and I've decided that using maps would be the best data structure to use but I don't know where to start. (Gotta implement the data structure from scratch - school work)
Tries are quite efficient for implementing maps where the keys are short strings. The wikipedia article explains it pretty well.
To deal with duplicates, just make each node of the tree store a linked list of duplicate matches
Here's a basic structure for a trie
struct Trie {
struct Trie* letter;
struct List *matches;
};
Allocate letter with malloc(26*sizeof(struct Trie)) and you have an array of 26 child nodes, one per letter. If you want to support punctuation, add entries for it at the end of the letter array.
matches can be a linked list of matches, implemented however you like, I won't define struct List for you.
Simplest solution: use a vector which contains your address entries and loop over the vector to search.
A map is usually implemented either as a binary tree (look up red-black trees for balancing) or as a hash map. Neither is trivial: trees have some overhead for organisation, memory management and balancing, and hash maps need good hash functions, which are also not trivial. But both structures are fun, and you'll gain a lot of insight by implementing one of them (or better, both :-)).
Also consider keeping the data in the vector and letting the map contain indices into the vector (or pointers to the entries): then you can easily have multiple indices, say one for the name and one for the phone number, so you can look up entries by both.
That said I just want to strongly recommend using the data structures provided by the standard library for real-world-tasks :-)
A simple approach to get you started would be to create a map class that uses two vectors - one for the key and one for the value. To add an item, you insert a key in one and a value in another. To find a value, you just loop over all the keys. Once you have this working, you can think about using a more complex data structure.
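A minimal sketch of that two-vector starting point, assuming string keys and string values as in a phone book (the class name SimpleMap and its members put, find and size are made up for illustration):

```cpp
#include <cstddef>
#include <string>
#include <vector>

class SimpleMap {
    std::vector<std::string> keys_;
    std::vector<std::string> values_;  // values_[i] belongs to keys_[i]

public:
    // Inserts a new pair, or overwrites the value if the key already exists.
    void put(const std::string& key, const std::string& value) {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (keys_[i] == key) {
                values_[i] = value;
                return;
            }
        }
        keys_.push_back(key);
        values_.push_back(value);
    }

    // Linear search over the keys; returns nullptr when the key is absent.
    const std::string* find(const std::string& key) const {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (keys_[i] == key) return &values_[i];
        }
        return nullptr;
    }

    std::size_t size() const { return keys_.size(); }
};
```

Every operation is O(n), but the interface is already map-like, so the two vectors can later be swapped for a tree or hash table without changing the calling code.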