Fastest way to search for a string - C++

I have 300 strings to store and search, and most of them are very similar in characters and length. For example, I have strings "ABC1", "ABC2", "ABC3" and so on, and another set like "sample1", "sample2", "sample3". So I am confused about how to store them: should I use an array or a hash table? My main concern is the time it takes to find a string when I need to get one out of the storage. If I use an array, I will have to do a string compare at every index to find the one I want. If I implement a hash table instead, I will have to handle collisions (obviously), and I will have to implement chaining for strings that hash to the same bucket.
So I am looking for suggestions weighing the pros and cons of each approach, to arrive at a best practice.

Because the keys are short and tend to have a common prefix, you should consider radix data structures such as the Patricia trie and the ternary search tree (google these; you'll find lots of examples). Search time in these structures tends to be O(1) with respect to the number of entries and O(n) with respect to the length of the keys. Beware, however, that long strings can use lots of memory.
Search time is similar to that of hash maps if you don't count collision resolution, which is not a problem in a radix search. Note that I am counting the time to compute the hash as part of the cost of a hash map; people tend to forget it.
One downside is that radix structures are not cache-friendly if your keys tend to show up in random order. As someone mentioned, if search time is really important: measure the performance of some alternative approaches.
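To make the ternary search tree idea concrete, here is a minimal, hedged sketch (the names and structure are invented for illustration; this is not a production implementation). Each node stores one character; the lo/hi children branch on character order, while eq advances to the next character of the key.
#include <memory>
#include <string>

struct TSTNode {
    char ch;
    bool is_end = false;                       // true if a key ends at this node
    std::unique_ptr<TSTNode> lo, eq, hi;
    explicit TSTNode(char c) : ch(c) {}
};

void insert(std::unique_ptr<TSTNode>& node, const std::string& key, std::size_t i = 0) {
    if (key.empty()) return;
    if (!node) node = std::make_unique<TSTNode>(key[i]);
    if (key[i] < node->ch)        insert(node->lo, key, i);
    else if (key[i] > node->ch)   insert(node->hi, key, i);
    else if (i + 1 == key.size()) node->is_end = true;      // whole key consumed
    else                          insert(node->eq, key, i + 1);
}

bool contains(const TSTNode* node, const std::string& key, std::size_t i = 0) {
    if (!node || key.empty()) return false;
    if (key[i] < node->ch) return contains(node->lo.get(), key, i);
    if (key[i] > node->ch) return contains(node->hi.get(), key, i);
    if (i + 1 == key.size()) return node->is_end;
    return contains(node->eq.get(), key, i + 1);
}
Lookup cost is proportional to the key length (plus a few character comparisons per node), independent of how many keys are stored.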

This depends on how much your data changes. By that I mean: if you have 300 index strings which reference another string, how often do those 300 index strings change?
You can use a std::map for quick lookups, but the map will require more resources when it is first created (compared to an array, vector or list).
I use maps mostly for dynamic lookup tables (for example: IP to socket).
So in your case it will look like this:
std::map<std::string, std::string> my_map;
my_map["ABC1"] = "sample1";
my_map["ABC2"] = "sample2";
std::string looked_up = my_map["ABC1"];
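Note that operator[] default-constructs and inserts a value when the key is missing, so for a pure lookup std::map::find avoids accidental insertions:
auto it = my_map.find("ABC3");
if (it != my_map.end()) {
    std::string looked_up = it->second;  // found
} else {
    // key absent; nothing was inserted
}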

Related

C++ Find in a vector of <int, pair>

So previously I only had 1 key I needed to look up, so I was able to use a map:
std::map <int, double> freqMap;
But now I need to look up 2 different keys. I was thinking of using a vector with std::pair i.e.:
std::vector <int, std::pair<int, double>> freqMap;
Eventually I need to look up both keys to find the correct value. Is there a better way to do this, or will this be efficient enough (there will be ~3k entries)? Also, I am not sure how to search using the second key (the first key in the std::pair). Is there a find for the pair based on its first element? Essentially I can access the first key by:
freqMap[key1]
But I am not sure how to iterate and find the second key in the pair.
Edit: OK, adding the use case for clarification:
I need to look up a value based on 2 keys: a mux selection and a frequency selection. The raw data looks something like this:
Mux, Freq, Val
0, 1000, 1.1
0, 2000, 2.7
0, 10e9, 1.7
1, 1000, 2.2
1, 2500, 0.8
6, 2000, 2.2
The blanket answer to "which is faster" is generally "you have to benchmark it".
But besides that, you have a number of options. A std::map is more efficient than other data structures on paper, but not necessarily in practice. If you truly are in a situation where this is performance-critical (i.e. avoid premature optimisation), try different approaches, as sketched below, and measure the performance you get (memory-wise and CPU-wise).
Instead of using a std::map, consider throwing your data into a struct, giving the fields proper names, and storing all values in a simple std::vector. If you modify the data only seldom, you can optimise retrieval cost at the expense of additional insertion cost by sorting the vector according to the key you typically use to find an entry. This allows you to use binary search, which can be much faster than linear search.
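A minimal sketch of that approach, using the Mux/Freq/Val data from the question (the struct and function names are made up for illustration):
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Entry {
    std::uint8_t  mux;   // small domain, so a narrow type packs tighter
    std::uint64_t freq;
    double        val;
};

bool byKey(const Entry& a, const Entry& b) {
    return std::tie(a.mux, a.freq) < std::tie(b.mux, b.freq);
}

// After loading, sort once: std::sort(table.begin(), table.end(), byKey);
const Entry* lookup(const std::vector<Entry>& table, std::uint8_t mux, std::uint64_t freq) {
    Entry probe{mux, freq, 0.0};
    auto it = std::lower_bound(table.begin(), table.end(), probe, byKey);
    return (it != table.end() && it->mux == mux && it->freq == freq) ? &*it : nullptr;
}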
However, linear search can be surprisingly fast on a std::vector, because of both cache locality and branch prediction - both of which you are likely to lose when dealing with a map, an unordered_map, or a (binary-searched) sorted vector. So although O(n) sounds much scarier than, say, O(log n) for a map or even O(1) for an unordered_map, linear search can still be faster under the right conditions.
Especially if you discover that you don't have a discernible index member to sort your entries by, you will have to either stick to linear search in contiguous memory (i.e. a vector) or invest in a doubly indexed data structure (effectively something akin to two maps or two unordered_maps). Having two indexes usually prevents you from using a single map/unordered_map.
If you can pack your data more tightly (do you really need an int, or would a std::uint8_t do the job? do you really need a double? etc.), you will amplify cache locality, and with only 3k entries an unsorted vector has a good chance of performing best. Although operations on a std::size_t are typically faster in themselves than operations on smaller types, iterating over contiguous memory usually offsets this effect.
Conclusion: try an unsorted vector, a sorted vector (+ binary search), a map and an unordered_map. Do proper benchmarking (with several repetitions) and pick the fastest one. If it doesn't make a difference, pick the one that is the most straightforward to understand.
Edit: Given your example data, it sounds like the first key has an extremely small domain. As far as I can tell, "Mux" is limited to a small number of values which lie near each other. In such a situation you may consider using a std::array as your primary indexing structure, with a suitable lookup structure as the secondary one. For example:
std::array<std::vector<std::pair<std::uint64_t,double>>,10>
std::array<std::unordered_map<std::uint64_t,double>,10>
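With the second variant, a lookup could then look like this (a sketch assuming, as the example data suggests, that Mux values stay below 10):
#include <array>
#include <cstdint>
#include <unordered_map>

void example() {
    std::array<std::unordered_map<std::uint64_t, double>, 10> table;
    table[0][1000] = 1.1;             // insert: array index for Mux, hash key for Freq
    double v = table[0].at(1000);     // lookup; at() throws if the Freq key is absent
    (void)v;
}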

Unordered map vs vector

I'm building a little 2D game engine. Now I need to store the prototypes of the game objects (all types of information). The container will hold at most, I guess, a few thousand elements, all with unique keys, and no elements will be deleted or added after the first load. The key value is a string.
Various threads will run, and I need to send each of them a key (or index) with which to access other information (like a texture for the render process or a sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster for accessing a known element. But I see that an unordered map also usually has constant-time access if I use the ::at element access. It would make the code much cleaner and easier to maintain, because I would deal with much more understandable man-made strings.
So the question is: is the difference in speed between an access to vector[n] and an access to unordered_map.at("string") negligible compared to its benefits?
From what I understand, being able to access the various maps in different parts of the program, with different threads running, just by a "name" is a big deal for me, and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I found information about it, I can't really tell whether I'm right or wrong.
Thank you for your time.
As an alternative, you could consider using an ordered vector, since the vector will not be modified after the initial load. You can easily write an implementation yourself with the STL's lower_bound etc., or use an existing implementation from a library (e.g. boost::flat_map).
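For illustration, a hand-rolled ordered vector could look like this (a sketch; Prototype stands in for the asker's game-object data, and the vector must be kept sorted by key after the initial load):
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct Prototype { /* textures, sounds, ... */ };

using Table = std::vector<std::pair<std::string, Prototype>>;

// Precondition: table is sorted by .first (one std::sort after loading).
const Prototype* find_proto(const Table& table, const std::string& key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const Table::value_type& e, const std::string& k) { return e.first < k; });
    return (it != table.end() && it->first == key) ? &it->second : nullptr;
}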
There is a blog post from Scott Meyers about container performance in this kind of scenario. He did some benchmarks, and the conclusion is that an unordered_map is probably a very good choice, with a high chance that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal perfect hash function, e.g. with gperf.
However, for this kind of problem the first rule is: measure it yourself.
My problem was to find a record in a container by a given std::string key. Consider that the keys always EXIST (not finding them was not an option) and that the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears that an unordered map would not be fast enough. So I tested it, and I want to share the results, hoping I haven't gotten everything wrong.
I just hope this can help others like me, and to get some feedback, because in the end I'm a beginner.
So, given a struct of records filled randomly like this:
struct The_Mess
{
    std::string A_string;
    long double A_ldouble;
    char C[10];
    int* intPointer;
    std::vector<unsigned int> A_vector;
    std::string Another_String;
};
I made an unordered map, given that A_string contains the key of each record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector that I sort by the A_string value (which contains the key):
std::vector<The_Mess> The_Vector;
along with a sorted index vector, used for access as a third way:
std::vector<std::string> index;
The keys are random strings of 0-20 characters in length (I wanted the worst possible scenario), containing uppercase and lowercase letters, digits and spaces.
So, in short, our contenders are:
Unordered map - I measure the time the program takes to execute:
record = The_UnOrdMap.at( key ); // record is just a The_Mess struct
Sorted vector - measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The times are in nanoseconds and are an arithmetic average over 200 iterations. I keep a vector containing all the keys, which I shuffle at every iteration; each iteration I cycle through it and look up every key in the three ways.
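For reference, here is a sketch of the kind of timing loop described (my reconstruction, not the poster's actual code; the map and keys are the ones defined above):
#include <algorithm>
#include <chrono>
#include <random>
#include <string>
#include <vector>

template <typename Map>
long long average_lookup_ns(const Map& map, std::vector<std::string> keys, int reps = 200) {
    std::mt19937 rng{12345};   // fixed seed so runs are repeatable
    long long total_ns = 0;
    for (int r = 0; r < reps; ++r) {
        std::shuffle(keys.begin(), keys.end(), rng);
        auto t0 = std::chrono::steady_clock::now();
        for (const auto& key : keys) {
            volatile const void* sink = &map.at(key);  // crude guard against the lookup being optimised away
            (void)sink;
        }
        auto t1 = std::chrono::steady_clock::now();
        total_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    return total_ns / reps;
}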
So these are my results:
I think the initial spikes are a fault of my testing logic (the table I iterate over contains only the keys generated so far, so it has only 1-n elements): 200 iterations of a 1-key search the first time, 200 iterations of a 2-key search the second time, etc.
Anyway, it seems that in the end the best option is the unordered map, considering that it takes a lot less code, is easier to implement, and will make the whole program way easier to read and probably to maintain/modify.
You have to think about caching as well. In the case of std::vector you'll have very good cache performance when accessing the elements: when one element is accessed in RAM, the CPU caches nearby memory, and this includes nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. std::map is usually implemented as a self-balancing binary search tree, and in this case the values can be scattered around RAM. This imposes a great hit on cache performance, especially as maps get bigger and bigger, because the CPU just cannot cache the memory that you're about to access.
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.
You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree; the bucket index underneath is a contiguous array, so cache behaviour is much closer to a vector's than to a tree's. Granted, you are going to suffer if you have collisions because your hashing function is bad, but if your key is a simple integer this is not going to happen. As a result, access to an element in the hash map will be nearly the same as access to an element in a vector, plus the time spent computing the hash of an integer, which is really not measurable.

Map of vector of struct vs Vector of struct

I am making a small project program that involves inputting quotes that would later be saved into a database (in this case a .txt file). There are also commands the user can input, such as list (which shows the quotes by author) and random (which displays a random quote).
Here's the structure if I were to use a map (with the author string as the key):
struct Information{
    string quoteContent;
    vector<string> tags;
};
and here's the structure if I were to use a vector instead:
struct Information{
    string author;
    string quoteContent;
    vector<string> tags;
};
note: The largest number of quotes I've had in the database is 200 (imported from a file).
I was just wondering which data structure would yield better performance. I'm still pretty new to this C++ thing, so any help would be appreciated!
For your data volumes it obviously doesn't matter from a performance perspective, but a multimap will likely let you write shorter, more comprehensible and maintainable code. Regarding the general performance of vectors vs. maps (which is good to know about, but likely only becomes relevant with millions of data elements or low-latency requirements)...
A vector doesn't do any automatic sorting for you, so you'd probably push_back quotes as you read them, then do one std::sort once the data is loaded, after which you can find elements very quickly by author with std::binary_search or std::lower_bound, or identify insertion positions for new quotes using e.g. std::lower_bound. But if you want to insert a new quote thereafter, you have to move the existing vector elements from that position onwards out of the way to make room - that's relatively slow. As you're just doing a few ad-hoc insertions based on user input, the time to do that with only a few hundred quotes in the vector will be totally insignificant.
For the purposes of learning programming, though, it's good to understand that a multimap is arranged as a kind of branching binary tree, with pointers linking the data elements, which allows relatively quick insertion (and deletion). For some applications, following all those pointers around can be more expensive (i.e. slower) than vector's contiguous memory (which works better with CPU cache memory), but in your case the data elements are all strings and vectors of strings that will likely (unless Short String Optimisation kicks in) require jumping all over memory anyway.
In general, if author is naturally a key for your data, just use a multimap... it'll do all your operations in reasonable time - maybe not the fastest, but never particularly slow, unlike a vector for post-population mid-container insertions (/deletions). See the sketch below.
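Here is that sketch (hedged: the quotes are invented placeholders, and it reuses the Information struct from the question):
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Information {
    std::string quoteContent;
    std::vector<std::string> tags;
};

int main() {
    std::multimap<std::string, Information> quotes;   // author -> quote, duplicates allowed
    quotes.insert({"Twain", {"Quote one...", {"humor"}}});
    quotes.insert({"Twain", {"Quote two...", {"wit"}}});

    // "list by author": equal_range yields every quote stored under one key
    auto range = quotes.equal_range("Twain");
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->first << ": " << it->second.quoteContent << '\n';
}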
It depends on the purpose of usage; both data structures have their pros and cons.
Vectors:
- position indexing via at() or operator[]
- no member find function; you would have to use the std::find algorithm
Maps:
- can be searched by key
- position indexing is not applicable; keys are stored instead
- (use an unordered map for better performance than map)
Use the data structure that matches what you want to achieve.
The golden rule is: "When in doubt, measure."
i.e. Write some tests, do some benchmarking.
Anyway, considering that you have circa 200 items, I don't think there should be an important difference between the two cases on modern PC hardware. Big-O notation matters when N is big (e.g. 10,000s, 100,000s, 1,000,000s, etc.).
A vector tends to be simpler than a map, and I'd use it as the default container of choice (unless your main goal is to access the items given the author's name as a key, in which case a map seems more logically suited).
Another option might be to have a vector with items sorted by author name, so you can use binary search (which is O(log N)) inside the vector.

hash table vs. linear list

Will there be an instance where a search for a keyword in a linear list is quicker than in a hash table?
I'd basically like to know if there is a fringe case where searching for a keyword in a linear list is faster than a hash table search.
Thanks!
Searching in a hash table is not always constant time in reality. If the hash function is a poor match for the data, you can have a lot of collisions, and in the extreme case where every data item has the same hash value, the result looks much like a linear search. Depending on the details, this effective linear search may be slower than a linear search over the data in an array. (E.g. open addressing with a quadratic probing sequence, which makes poor use of the processor caches, might well be slower than a linear search over an array.)
Here's an example of a real-world case where all keys ended up in the same bucket: Java bug 4669519.
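You can even provoke that degenerate behaviour deliberately. A hash functor that puts every key into the same bucket turns unordered_map lookups into linear scans (a contrived example to illustrate the point, not something you would ever ship):
#include <string>
#include <unordered_map>

struct BadHash {
    std::size_t operator()(const std::string&) const { return 42; }  // every key collides
};

// All entries land in one bucket, so find() degrades to a linear walk of the chain.
std::unordered_map<std::string, int, BadHash> degenerate;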
Yes - in the case of a very small number of elements. Think about how a hash table works: it has to compute the hash to find a bucket, then search through the list in that bucket. Plus it could be a complex multi-level hash, etc. So you break even around the point where searching through a linear list becomes more work than the whole hash lookup.
Another instance would be if the element you are looking for is always at, or near, the beginning of the list. Depending on what you are doing, that can happen.
There are other cases, but these should help you think about it.
Still, don't get confused: the hash table is usually what you want.

data structure for storing array of strings in a memory

I'm considering data structures for storing a large array of strings in memory. The strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible; saving memory is not important. I am inclined towards the standard hash_set structure, which allows searching for elements in roughly constant time, but there is no guarantee that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a prefix tree (trie).
A trie is better than a binary search tree for searching elements. For a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup, once you have all the strings, you could compute a perfect hash over them and use it as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, you can require that your process has the privileges needed to load and execute code: write the code for the perfect hash, run it through a compiler, and load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not, fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
"Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running."
If the strings are immutable - so insertion/deletion is "infrequent", so to speak - another option is to build a Directed Acyclic Word Graph (DAWG) or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.
*Standard disclaimer applies: it depends on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately and store a pointer to it (or an index into the data structure) instead of the string itself. That way, testing the equality of two strings becomes a simple pointer or integer comparison (and you also save memory by storing each string only once). A sketch follows below.
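A minimal sketch of the atom technique (my illustration; it relies on the fact that references into a std::unordered_set remain valid across rehashing, which is what makes the stored pointers safe to keep):
#include <string>
#include <unordered_set>

class AtomTable {
    std::unordered_set<std::string> pool_;
public:
    // Returns a stable pointer; the same string always yields the same pointer.
    const std::string* intern(const std::string& s) {
        return &*pool_.insert(s).first;
    }
};

// Usage: two atoms are equal iff their pointers are equal.
// AtomTable atoms;
// const std::string* a = atoms.intern("ABC1");
// const std::string* b = atoms.intern("ABC1");
// assert(a == b);  // character-wise comparison happened once, at intern time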
Your best bet would be as follows:
Building your structure:
- Insert all your strings (char*s) into an array.
- Sort the array lexicographically.
Lookup:
- Use a binary search on your array.
This maintains cache locality, allows efficient lookup (a space of ~4 billion strings can be searched with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries: they are complicated, and slower than they appear (especially if you have long strings).
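In sketch form (using std::string instead of the raw char*s mentioned above, for brevity; the idea is identical):
#include <algorithm>
#include <string>
#include <vector>

// Build once at startup:
//   std::vector<std::string> words = ...;
//   std::sort(words.begin(), words.end());
bool contains(const std::vector<std::string>& sorted_words, const std::string& key) {
    return std::binary_search(sorted_words.begin(), sorted_words.end(), key);
}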
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy described in Raymond Chen's blog post would be efficient.