Efficient data structure for searching numbers and strings - c++

I have a scenario wherein strings and numbers are combined into a single entity. I need to search by either the string or the number. What data structure should I use for this?
I thought of using a hash table for the strings and a search tree for the numbers. Can you please comment on my choice and suggest better structures, if any?
Thanks!

Use two std::maps, one from std::string to a pointer and the other from number to a pointer. The pointers go to your "single entity". See how far you can scale this (millions of entries...) before trying to optimize further.
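A minimal sketch of the two-map idea, in case it helps. The names Entity and EntityIndex are made up for illustration, and the entity is assumed to hold one string and one int. Both maps store plain pointers into a vector of unique_ptr that owns the entities, so the addresses stay valid as more entries are added.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical record type standing in for the "single entity".
struct Entity {
    std::string name;
    int number;
};

// Owns the entities and keeps two indexes that point at them.
class EntityIndex {
public:
    // Adds an entity and registers it under both keys.
    // A repeated key overwrites the earlier index entry.
    void add(const std::string& name, int number) {
        storage_.push_back(std::make_unique<Entity>(Entity{name, number}));
        Entity* e = storage_.back().get();
        by_name_[name] = e;
        by_number_[number] = e;
    }

    // Both lookups return nullptr when the key is absent.
    const Entity* find_by_name(const std::string& name) const {
        auto it = by_name_.find(name);
        return it == by_name_.end() ? nullptr : it->second;
    }

    const Entity* find_by_number(int number) const {
        auto it = by_number_.find(number);
        return it == by_number_.end() ? nullptr : it->second;
    }

private:
    std::vector<std::unique_ptr<Entity>> storage_;  // heap objects keep stable addresses
    std::map<std::string, Entity*> by_name_;
    std::map<int, Entity*> by_number_;
};
```

Swapping std::map for std::unordered_map would be a one-line change per index if ordering by key is not needed.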

Related

Not sure which data structure to use

Assuming I have the following text:
today was a good day and today was a sunny day.
I break up this text into words, separated by white space, which gives
Today
was
a
good
etc.
Now I use a vector to simply count the number of words in the text via .size(). That's done.
However, I also want to check if a word comes up more than once and, if so, how many times. In my example "today" comes up 2 times.
I want to store that "today" and append a 2 (or x, depending on how often it comes up in a large text). That's not just for "today" but for every word in the text. I want to look up how often a word appears, append a counter, and sort the words with their counters in descending order (that's another thing, but not important right now).
I'm not sure which data structure to use here. A map perhaps? But I can't add counters to a map.
Edit: This is what I've done so far: http://pastebin.com/JncR4kw9
You should use a map. In fact, you should use an unordered_map.
unordered_map<string,int> gives you a hash table that uses strings as keys, and you can increment the integer to keep the count.
unordered_map has the advantage of average O(1) lookup and insertion over the O(log n) lookup and insertion of a map. This is because the former is a hash table built on an array of buckets, whereas the latter uses a balanced tree (typically a red-black tree).
The only disadvantage of an unordered_map is that, as its name says, you can't iterate over the elements in lexical order. This should be clear from the explanation of their structure above. However, you don't seem to need such a traversal, so it shouldn't be an issue.
unordered_map<string,int> mymap;
mymap[word]++; // increments the count stored for word (a missing key starts at 0)
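To make that concrete, here is a small sketch that counts the words and then produces the descending order the question asks about. The function names count_words and sorted_by_count are my own, and the splitting is plain whitespace splitting, so "day" and "day." would count as different words.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Counts whitespace-separated words; case and punctuation are left untouched.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[word];  // operator[] value-initialises a missing count to 0
    }
    return counts;
}

// Copies the counts into a vector sorted by descending count;
// ties are broken alphabetically so the order is deterministic.
std::vector<std::pair<std::string, int>> sorted_by_count(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> out(counts.begin(), counts.end());
    std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    return out;
}
```

The hash table itself cannot be sorted, which is why the sort happens on a copy in a vector.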
Why not use two data structures? The vector you have now, and a map, using the string as the key, and an integer as data, which then will be the number of times the word was found in the text.
Sort the vector in alphabetical order.
Scan it and compare every word to those that follow, until you find a different one, and so on.
a, a, and, day, day, good, sunny, today, today, was, was
a: 2, and: 1, day: 2, good: 1, sunny: 1, today: 2, was: 2
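The sort-and-scan approach could look roughly like this. count_runs is a name I picked for the sketch. It takes the vector by value so the caller's original word order is kept.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sorts a copy of the words, then collapses each run of equal
// neighbours into one (word, count) pair.
std::vector<std::pair<std::string, int>> count_runs(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, int>> runs;
    for (const std::string& w : words) {
        if (!runs.empty() && runs.back().first == w) {
            ++runs.back().second;    // same word as the previous one
        } else {
            runs.emplace_back(w, 1); // a new word starts a new run
        }
    }
    return runs;
}
```

This costs O(n log n) for the sort but needs no extra container besides the result.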
A better option to consider is a radix tree: https://en.wikipedia.org/wiki/Radix_tree
It is quite memory efficient because shared prefixes are stored only once, and with large text input it can perform better than the alternative data structures.
One can store the frequency of a word in the tree node where that word ends. It will also reap the benefits of locality of reference (for any text document).
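As a rough illustration of storing frequencies in tree nodes, here is a plain character trie. Note that this is not a real radix tree: a radix tree additionally merges chains of single-child nodes into one edge, which is where most of the memory saving comes from. WordTrie and its members are names invented for this sketch.

```cpp
#include <map>
#include <memory>
#include <string>

// Uncompressed trie: one node per character, with a counter on the
// node where a word ends.
class WordTrie {
    struct Node {
        int count = 0;  // how many times a word ending here was added
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;

public:
    // Walks (and creates) the path for the word, then bumps its counter.
    void add(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        ++node->count;
    }

    // Returns 0 for words that were never added, including prefixes
    // of added words.
    int count(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return 0;
            node = it->second.get();
        }
        return node->count;
    }
};
```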

Best way to Store and Search Numbers in C++

I have a very large array storing some numbers. My task is to efficiently find whether a particular number exists in the array. Which algorithm and data structure should I go with?
A few assumptions:
Each number in array would be unique.
I am not concerned about where the data is found in the array; I just want to return true if it is found and false otherwise.
I will be using C++ as the programming language.
Please suggest.
Thanks
Constant time lookup with unordered_set.
There are also options such as bitsets. It depends on exactly how large "very large" is, and on how sparse the stored values are relative to how many of them there actually are.
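For the bitset route, a sketch along these lines might do, assuming the numbers are non-negative and fall below a known limit. DenseNumberSet is a hypothetical name, and std::vector<bool> is used instead of std::bitset so the size can be chosen at run time.

```cpp
#include <cstddef>
#include <vector>

// One bit per possible value in [0, limit). Worth it only when the
// range is not much larger than the number of stored values.
class DenseNumberSet {
public:
    explicit DenseNumberSet(std::size_t limit) : bits_(limit, false) {}

    // Values outside [0, limit) are silently ignored in this sketch.
    void insert(std::size_t value) {
        if (value < bits_.size()) bits_[value] = true;
    }

    bool contains(std::size_t value) const {
        return value < bits_.size() && bits_[value];
    }

private:
    std::vector<bool> bits_;  // packed, typically one bit per element
};
```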
It seems unordered_set is suitable for your requirement.
PS: Please remember that all elements in this set are immutable, so to change one you erase it and insert the new value.
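A short sketch of the unordered_set suggestion. The helper names build_lookup and contains are mine, and long long is just an assumed element type.

```cpp
#include <unordered_set>
#include <vector>

// Builds the hash set once from the existing array of numbers.
std::unordered_set<long long> build_lookup(const std::vector<long long>& numbers) {
    return std::unordered_set<long long>(numbers.begin(), numbers.end());
}

// Average constant-time membership test; only true/false is reported.
bool contains(const std::unordered_set<long long>& lookup, long long value) {
    return lookup.find(value) != lookup.end();
}
```

Building the set is a one-off O(n) cost, after which each query no longer scans the array.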
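A well-known way to check whether an element (number) is a member of a set (array) is a Bloom filter. Note that it is probabilistic: it never misses an element that was inserted, but it can report false positives, so it suits cases where an occasional wrong "yes" is acceptable or is confirmed afterwards by an exact check. It works well if elements are added over time or if there are set operations among sets. Bloom filters are easy to implement and good implementations are available.
If the set is static (i.e. never changes), a good way is to use a perfect hash function. It takes time to build, but can outperform the general-purpose hash function used by std::unordered_set.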
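For reference, a toy Bloom filter for 64-bit numbers might look like the following. The class name, the bit-mixing function and the double-hashing scheme are all choices made for this sketch rather than any particular library's API, and bit_count is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy Bloom filter: k bit positions per value, derived from two mixed
// hashes (h1 + i * h2). Never gives a false negative, may give false positives.
class BloomFilter {
public:
    BloomFilter(std::size_t bit_count, std::size_t hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {}

    void insert(std::uint64_t value) {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            bits_[index(value, i)] = true;
        }
    }

    // false means "definitely absent"; true means "probably present".
    bool possibly_contains(std::uint64_t value) const {
        for (std::size_t i = 0; i < hash_count_; ++i) {
            if (!bits_[index(value, i)]) return false;
        }
        return true;
    }

private:
    // Bit-mixing step in the style of the splitmix64 finalizer.
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    std::size_t index(std::uint64_t value, std::size_t i) const {
        const std::uint64_t h1 = mix(value);
        const std::uint64_t h2 = mix(value ^ 0x9e3779b97f4a7c15ULL) | 1;  // odd, never 0
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t hash_count_;
};
```

If an exact answer is required, a true result still has to be confirmed against the real data.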

Fastest way to search for a string

I have 300 strings to store and search, and most of them are nearly identical in characters and length. For example, I have strings "ABC1", "ABC2", "ABC3" and so on, and another set like sample1, sample2, sample3. So I am confused about how to store them: in an array or in a hash table. My main concern is the time it takes to search for a string when I need to get one out of storage. If I use an array, I will have to do a string compare on every index to arrive at the one I want. If I implement a hash table, I will have to take care of collisions (obviously) and implement chaining for strings that collide.
So I am looking for suggestions weighing the pros and cons of each, to arrive at the best practice.
Because the keys are short and tend to share a common prefix, you should consider radix data structures such as the Patricia trie and the ternary search tree (google these, you'll find lots of examples). Search time in these structures tends to be O(1) with respect to the number of entries and O(n) with respect to the length of the key. Beware, however, that long strings can use lots of memory.
Search time is similar to hash maps if you don't consider collision resolution which is not a problem in a radix search. Note that I am considering the time to compute the hash as part of the cost of a hash map. People tend to forget it.
One downside is radix structures are not cache-friendly if your keys tend to show up in random order. As someone mentioned, if the search time is really important: measure the performance of some alternative approaches.
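As an example of one of these structures, here is a bare-bones ternary search tree that only supports insert and exact lookup. It is a sketch written for this thread, not a tuned implementation: empty keys are not supported and nothing is balanced.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Ternary search tree: each node holds one character and three links
// (smaller character, next character of the key, larger character).
class TernarySearchTree {
    struct Node {
        char ch;
        bool is_end = false;  // true if some key ends at this node
        std::unique_ptr<Node> lo, eq, hi;
        explicit Node(char c) : ch(c) {}
    };
    std::unique_ptr<Node> root_;

public:
    void insert(const std::string& key) {
        if (key.empty()) return;  // empty keys are not supported here
        std::unique_ptr<Node>* link = &root_;
        std::size_t i = 0;
        while (true) {
            if (!*link) *link = std::make_unique<Node>(key[i]);
            Node* n = link->get();
            if (key[i] < n->ch) {
                link = &n->lo;
            } else if (key[i] > n->ch) {
                link = &n->hi;
            } else if (i + 1 == key.size()) {
                n->is_end = true;  // last character matched
                return;
            } else {
                link = &n->eq;     // move on to the next character
                ++i;
            }
        }
    }

    bool contains(const std::string& key) const {
        if (key.empty()) return false;
        const Node* n = root_.get();
        std::size_t i = 0;
        while (n) {
            if (key[i] < n->ch) {
                n = n->lo.get();
            } else if (key[i] > n->ch) {
                n = n->hi.get();
            } else if (i + 1 == key.size()) {
                return n->is_end;
            } else {
                n = n->eq.get();
                ++i;
            }
        }
        return false;
    }
};
```

With keys like "ABC1", "ABC2", "ABC3" the shared prefix "ABC" is stored once and only the last character branches.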
This depends on how much your data changes. By that I mean: if you have 300 index strings that each reference another string, how often do those 300 index strings change?
You can use a std::map for quick lookups, but the map will require more resources when it is first built (compared to an array, vector or list).
I use maps mostly for some kind of dynamic lookup tables (for example: ip to socket).
So in your case it will look like this:
std::map<std::string, std::string> my_map;
my_map["ABC1"] = "sample1";
my_map["ABC2"] = "sample2";
std::string looked_up = my_map["ABC1"];
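One thing to keep in mind with the snippet above: operator[] inserts an empty entry when the key is missing. If that matters, a lookup through find avoids it. The helper name lookup_or is made up for this sketch.

```cpp
#include <map>
#include <string>

// Returns the mapped value, or `fallback` when the key is absent.
// Unlike operator[], find() never inserts into the map.
std::string lookup_or(const std::map<std::string, std::string>& table,
                      const std::string& key,
                      const std::string& fallback) {
    auto it = table.find(key);
    return it == table.end() ? fallback : it->second;
}
```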

Find substring in many objects containing multiple strings

I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.
I want to implement a search function that can partially, exactly, or RegEx match any one of these strings and subsequently return a list of objects.
If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.
Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!
A map-like collection is probably your best bet: the key is the string, and the value is a reference to the containing object. If your strings are held inside the objects as std::string, then you could store a reference to that data in the key part of the map instead (alternatively, use a shared_ptr for the strings and reference them from both the object and the map).
Searching and sorting then just become a matter of implementing a custom search functor that uses the dereferenced data. The size of the map will be 2 references plus the map overhead, which isn't going to be that bad if you consider that the alternatives will be as large, if not larger.
partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects
Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects, many of these pointers may point to a single object instance.
You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size), then use lower_bound to search. That way, if you search on "dog" you could still see that there was a match for "dogged" (it doesn't matter if lower_bound lands on, say, "dogfood" first). This would be a simple std::map<string, string>. While you increment forwards from the lower_bound position and the key still matches (i.e. from dogfood to dogged to ... until it no longer starts with dog), you can look the full string up in the "exact match" map and aggregate results.
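A sketch of that suffix front-end, under one deliberate change: it uses std::multimap rather than std::map, because the same suffix (say "ed") can come from several full strings. SubstringIndex is a name invented here, and the result would then be fed into the exact-match map described above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Maps every suffix of every full string back to that full string.
// A substring match is then a prefix match on some suffix.
class SubstringIndex {
public:
    void add(const std::string& full) {
        for (std::size_t i = 0; i < full.size(); ++i) {
            suffix_to_full_.emplace(full.substr(i), full);
        }
    }

    // Returns the full strings that contain `needle` anywhere.
    std::set<std::string> find(const std::string& needle) const {
        std::set<std::string> result;
        // Start at the first suffix >= needle and walk forwards while
        // the suffix still starts with needle.
        for (auto it = suffix_to_full_.lower_bound(needle);
             it != suffix_to_full_.end() &&
             it->first.compare(0, needle.size(), needle) == 0;
             ++it) {
            result.insert(it->second);
        }
        return result;
    }

private:
    std::multimap<std::string, std::string> suffix_to_full_;
};
```

The memory cost is the sadness mentioned above: a string of length n contributes n suffix entries.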
For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.
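The brute-force starting point could be as simple as this, using std::regex with its default ECMAScript grammar. The function name regex_matches is mine, and an invalid pattern will throw std::regex_error.

```cpp
#include <regex>
#include <string>
#include <vector>

// Tries the pattern against every full string and keeps the ones
// where it matches anywhere in the string.
std::vector<std::string> regex_matches(const std::vector<std::string>& strings,
                                       const std::string& pattern) {
    const std::regex re(pattern);  // compile once, reuse for every string
    std::vector<std::string> hits;
    for (const std::string& s : strings) {
        if (std::regex_search(s, re)) {
            hits.push_back(s);
        }
    }
    return hits;
}
```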
(substitute your favourite smart pointers for object*s if useful)
Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.

Selecting Appropriate Data Structure

We are reading a book and have to store every character of this book with its count.
For example, for "This is to test" the output should be: T4 H1 I2 S3 O1 E1
What would be the most appropriate data structure here, and why? And what would the logic be?
An array of ints would do fine here. Create an array, each index is a letter of the alphabet (you'll probably want to store upper and lower case separately for perf reasons when scanning the book). As you scan, increment the int at the array location for that letter. When you're done, print them all out.
Based on your description, you only need a simple hash of characters to their count. This is because you only have a limited set of characters that can occur in a book (even counting punctuation, accents and special characters). So a hash with a few hundred entries will suffice.
The appropriate data structure would be a std::vector (or, if you want something built-in, an array). If you're using C++0x, std::array would work nicely as well.
The logic is pretty simple -- read a character, (apparently convert to upper case), and increment the count for that item in the array/vector.
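That logic could be sketched as below, assuming only the 26 letters A to Z are counted, case is folded to upper case, and the character set is ASCII-compatible. count_letters is a hypothetical name.

```cpp
#include <array>
#include <cctype>
#include <string>

// Counts letters case-insensitively; index 0 is 'A', index 25 is 'Z'.
// Everything that is not A-Z after upper-casing is skipped.
std::array<long long, 26> count_letters(const std::string& text) {
    std::array<long long, 26> counts{};  // zero-initialised
    for (unsigned char c : text) {
        const int upper = std::toupper(c);
        if (upper >= 'A' && upper <= 'Z') {
            ++counts[upper - 'A'];
        }
    }
    return counts;
}
```

To cover punctuation and other characters as well, the array could be widened to 256 entries and indexed by the unsigned char value directly.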
The choice of selecting a data structure not only depends on what kind of data you want to store inside the data structure but more importantly on what kind of operations you want to perform on the data.
Have a look at this excellent chart which helps to decide when to use which STL container.
Of course, in your case a std::array (C++0x) or std::vector seems to be a good choice.