I have vector of strings. It contains some names. I need to search whether particular string is present in the vector. Eg: vector of string contains "Name" and "Age". Search string is "NameXYZ". So I have to search whether "NameXYZ" contains any of the vector element. Since one of the vector element is "Name", it should return true. Is there any possibility to achieve this without iterating.
The answer is NO.
It's impossibile to search something in a vector without iterating.
The vector is unorder and unmapped container so you need to iterating to it to find something.
I attach you this link to the cppreference site:
std::vector - cppreference.com
The more complex is answer is SOMETIMES
What you are looking for is something like a hash set. Or for your situation hash map with Key=Name and Value=Age.
This works by defining a function that turns a string into a number, called a hash. When you want to test whether the string is in the list, you calculate what number it has. You then get a list of potential candidates and iterate through those.
If you are lucky, every string has a unique number and you need to test at most one string. You still have to search for the number, but if you use the standard container, you can be certain that it is optimized. Unless you want to spend a long time making your own, it's a simple easy win.
However, be aware that it is very hard to get any search faster than O(Log(N)) complexity. This method is still O(log(N)) complex under the hood, as it has to still search for the hash.
Related
I am implementing a chained hash table using a vector < lists >. I resized my vector to a prime number, let's say 5. To choose the key I am using the universal hasing.
My question is, do I need to rehash my vector? I mean this code will generate always a key in a range between 0 and 5 because it depends from the size of my hashtable, causing collisions of course but the new strings will be all added in the lists of every position in the vector...so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?
Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.
do I need to rehash my vector?
Your container could continue to function without rehashing, but searching, insertions and erasing will perform more and more like a plain list instead of a hash table: for example, if you've inserted 10,000 elements you can expect each list in your vector to have roundly 2000 elements, and you may have to search all 2000 to see if a value you're considering inserting is a duplicate, or to find a value to erase, or simply return an iterator to. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.
I have simple method in C++ which searchs for string in linked list. That works well but I need to make it faster. Is it possible? Maybe I need to insert items into list in alphabetical order? But I dont think it could help in serching list anymore. In list there is about 300 000 items (words).
int GetItemPosition(const char* stringToFind)
{
int i = 0;
MyList* Tmp = FistListItem;
while (Tmp){
if (!strcmp(Tmp->Value, stringToFind))
{
return i;
}
Tmp = Tmp->NextItem;
i++;
}
return -1;
}
Method returns the position number if item found, otherwise returns -1.
Any sugesstion will be helpfull.
Thanks for answers, I can change structure. I have only one constraint. Code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure will be the in your opinion the most suitable?
Searching a linked list is linear so you need to iterate from beginning one by one so it is O(n). Linked lists are not the best if you will use it for searching, you can utilize more suitable data structures such as binary trees.
Ordering elements does not help much because still you need to iterate each element anyway.
Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more
efficient external data structure. For example, one can build a
red-black tree or hash table whose elements are references to the
linked list nodes. Multiple such indexes can be built on a single
list. The disadvantage is that these indexes may need to be updated
each time a node is added or removed (or at least, before that index
is used again).
So in the first case you can slightly improve (by statistical assumptions) your search performance by moving items found previously closer to the beginning of the list. This assumes that previously found elements will be searched more frequently.
Second method requires to use other data structures.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
Consider using array or std::vector as a storage instead of linked list, and use binary search to find particular string, or even better, std::set, if you don't need a numerical index. If for some reasons it is not possible to use other containers, there is not much possible to do - you may want to speed up the process of comparison by storing hash of the string along with it in node.
I suggest hashing.
Since you've already got a linked list of your own), you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree, or a red/black tree. These trees are designed on minimizing the traversals to find an item.
You could also store "short cut links". For example, if the list is of strings, you could have an array of links of where to start searching based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
The answer is no, you cannot improve the search without changing your data-structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only allow you to quickly decide if the given item is in the list by testing against the first item (which will be either the smallest or the largest entry) and this improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we will be unlikely to help you...
The best option is to use faster data structure for storing strings:
std::map - red-black tree behind the scenes. Has O(logn) for search/insert/delete operations. Suitable if you want to store additional values with strings (for example - positions).
std::set - basically the same tree but without values. Best for case when you need only contains operation.
std::unordered_map - hash table. O(1) access.
std::unordered_set - hash set. Also O(1) access.
Note. But in all of these cases there is a catch. Complexity is calculated only based on n (count of strings). In reality string comparison is not free. So, O(1) becomes O(m), O(logn) becomes O(mlogn) (where m is maximal length of string). This does not matter in case of relatively short strings. But if this is not true consider using Trie. In practice trie can be even faster than hash table - each character of query string is accessed only once instead of multiple times. For hash table/set it's at least once for hash calculation and at least once for actual string comparison (depending on collision resolution strategy - not sure how it is implemented in C++).
For certain reasons, I have a linked list of objects, with the Object containing a string.
I might be required to search for a particular string, and in doing so, retrieve the object, based on that string.
The starting header of the list is the only input I have for the list.
Though the number of objects I have, is capped at 3000, and that's not so much, I still wondered if there was an efficient way to do this, instead of searching the objects one by one for a matching string.
The Objects in the list are not sorted in any way, and I cannot expect them to be sorted, and the entry point to the linked list is the only input I have.
So, could anyone tell me if there's an efficient way ( search algorithm perhaps ) to achieve this?
Separately for this kind of search, if required, what would be the sort of data structure recommended, assuming that this search be the most data intensive function of the object?
Thanks..
Use a std::map<std::string, YourObjectType>. You can still iterate all objects. But they are sorted by the string now.
If you might have multiple objects with the same string, use a multimap instead.
If you can't switch to any different structure/container, then there is no way to do this better than linear to size of list.
Having 3000 you would like to use a unordered map instead of a linked list, which will give you average O(1) lookup, insertion, and deletion time.
I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.
I want to implement to a search function that can partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects.
If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.
Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!
a map-like collection is probably your best bet, the key will be the string, and the value is a reference to the containing object. If your strings are held inside the objects as a stl string, then you could store a reference to the data in the key part of the map instead (alternatively use a shared_ptr for the strings and reference them in both the object and the map)
Searching, sorting just becomes a matter of implementing a custom search functor that uses the dereferenced data. The size of the map will be 2 references plus the map overhead which isn't going to be that bad if you consider the alternatives will be as large, if not larger.
partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects
Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects, many of these pointers may point to a single object instance.
You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size)... then use lower_bound to search. That way, say you search on "dog" you could still see that there was a match for "dogged" (doesn't matter if it matches say "dogfood" instead. This would be a simple std::map<string, string>. While you increment forwards from the lower_bound position and the string still matches (i.e. from dogfood to dogged to ... until it doesn't start with dog), you can search for that in the "exact match" map and aggregate results.
For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.
(substitute your favourite smart pointers for object*s if useful)
Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.
I'm trying to work out the best method to search a vector of type "Tracklet" (a class I have built myself) to find the first and last occurrence of a given value for one of its variables. For example, I have the following classes (simplified for this example):
class Tracklet {
TimePoint *start;
TimePoint *end;
int angle;
public:
Tracklet(CvPoint*, CvPoint*, int, int);
}
class TimePoint {
int x, y, t;
public:
TimePoint(int, int, int);
TimePoint(CvPoint*, int);
// Relevant getters and setters exist here
};
I have a vector "vector<Tracklet> tracklets" and I need to search for any tracklets with a given value of "t" for the end timepoint. The vector is ordered in terms of end time (i.e. tracklet.end->t).
I'm happy to code up a search algorithm, but am unsure of which route to take with it. I'm not sure binary search would be suitable, as I seem to remember it won't necessarily find the first. I was thinking of a method where I use binary search to find an index of an element with the correct time, then iterate back to find the first and forward to find the last. I'm sure there's a better way than that, since it wastes binary searches O(log n) by iterating.
Hopefully that makes sense: I struggled to explain it a bit!
Cheers!
If the vector is sorted and contains the value, std::lower_bound will give you an iterator to the first element with a given value and std::upper_bound will give you an iterator to one element past the last one containing the value. Compare the value with the returned element to see if it existed in the vector. Both these functions use binary search, so time is O(logN).
To compare on tracklet.end->t, use:
bool compareTracklets(const Tracklet &tr1, const Tracklet &tr2) {
return (tr1.end->t < tr2.end->t);
}
and pass compareTracklets as the fourth argument to lower_bound or upper_bound
I'd just use find and find_end, and then do something more complicated only if testing showed it to be too slow.
If you're really concerned about lookup performance, you might consider a different data structure, like a map with timestamp as the key and a vector or list of elements as the value.
A binary search seems like your best option here, as long as your vector remains sorted. It's essentially identical, performance-wise, to performing a lookup in a binary tree-structure.
dirkgently referred to a sweet optimization comparative. But I would in fact not use a std::vector for this.
Usually, when deciding to use a STL container, I don't really consider the performance aspect, but I do consider its interface regarding the type of operation I wish to use.
std::set<T>::find
std::set<T>::lower_bound
std::set<T>::upper_bound
std::set<T>::equal_range
Really, if you want an ordered sequence, outside of a key/value setup, std::set is just easier to use than any other.
You don't have to worry about inserting at a 'bad' position
You don't have problems of iterators invalidation when adding / removing an element
You have built-in methods for searching
Of course, you also want your Comparison Predicate to really shine (hopes the compiler inlines the operator() implementation), in every case.
But really, if you are not convinced, try a build with a std::vector and manual insertion / searching (using the <algorithm> header) and try another build using std::set.
Compare the size of the implementations (number of lines of code), compare the number of bugs, compare the speed, and then decide.
Most often, the 'optimization' you aim for is actually a pessimization, and in those rares times it's not, it's just so complicated that it's not worth it.
Optimization:
Don't
Expert only: Don't, we mean it
The vector is ordered in terms of time
The start time or the end time?
What is wrong with a naive O(n) search? Remember you are only searching and not sorting. You could use a sorted container as well (if that doesn't go against the basic design).