Find substring in many objects containing multiple strings

Find substring in many objects containing multiple strings - c++

I am dealing with a collection of objects where the reasonable size of it could be anywhere between 1 and 50K (but there's no set upper limit). Each object contains a handful of strings.
I want to implement to a search function that can partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects.
If each object only contained a single string then I could simply lexicographically sort them, and pull out ranges fairly easily - but I am reluctant to implement a map-like structure for each of the contained strings due to speed/memory concerns.
Is there a data structure well suited to this kind of operation for speed and memory efficiency? I'm sensing a database maybe on the horizon, but I know little about them, so I want to hold off researching until someone more knowledgeable can nudge me in the right direction!

a map-like collection is probably your best bet, the key will be the string, and the value is a reference to the containing object. If your strings are held inside the objects as a stl string, then you could store a reference to the data in the key part of the map instead (alternatively use a shared_ptr for the strings and reference them in both the object and the map)
Searching, sorting just becomes a matter of implementing a custom search functor that uses the dereferenced data. The size of the map will be 2 references plus the map overhead which isn't going to be that bad if you consider the alternatives will be as large, if not larger.

partially, exactly, or RegEx match any of one these strings and subsequently return a list of objects
Well, for exact matches, you could have a std::map<std::string, std::vector<object*> >. The key would be the exact string, and the vector holds pointers to matching objects, many of these pointers may point to a single object instance.
You could have a front-end map from partial strings to full strings: say the string is "dogged", you'd sadly have to put entries in for "dogged", "ogged", "gged", "ged", "ed" and "d" (stop wherever you like if you want a minimum match size)... then use lower_bound to search. That way, say you search on "dog" you could still see that there was a match for "dogged" (doesn't matter if it matches say "dogfood" instead. This would be a simple std::map<string, string>. While you increment forwards from the lower_bound position and the string still matches (i.e. from dogfood to dogged to ... until it doesn't start with dog), you can search for that in the "exact match" map and aggregate results.
For regular expressions, I have no good suggestion... I'd start with a brute force search through all the full strings. If it really isn't good enough, then you do some rough optimisations like checking for a constant substring to filter by before doing the brute force matching, but it's beyond me to imagine how to do this very thoroughly and fast.
(substitute your favourite smart pointers for object*s if useful)

Thanks for all the replies, but following on from techniques mentioned in this post, I've decided to use an enhanced suffix array from the header-only SeqAn project.

Related

c++ find vector of element present in the string

I have vector of strings. It contains some names. I need to search whether particular string is present in the vector. Eg: vector of string contains "Name" and "Age". Search string is "NameXYZ". So I have to search whether "NameXYZ" contains any of the vector element. Since one of the vector element is "Name", it should return true. Is there any possibility to achieve this without iterating.

The answer is NO.
It's impossibile to search something in a vector without iterating.
The vector is unorder and unmapped container so you need to iterating to it to find something.
I attach you this link to the cppreference site:
std::vector - cppreference.com

The more complex is answer is SOMETIMES
What you are looking for is something like a hash set. Or for your situation hash map with Key=Name and Value=Age.
This works by defining a function that turns a string into a number, called a hash. When you want to test whether the string is in the list, you calculate what number it has. You then get a list of potential candidates and iterate through those.
If you are lucky, every string has a unique number and you need to test at most one string. You still have to search for the number, but if you use the standard container, you can be certain that it is optimized. Unless you want to spend a long time making your own, it's a simple easy win.
However, be aware that it is very hard to get any search faster than O(Log(N)) complexity. This method is still O(log(N)) complex under the hood, as it has to still search for the hash.

The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

I have list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings which can contain any characters. I need check each received line and decide whether it contain any of seeds or not. Comparison must be case insensitive.
I need the fastest possible algorithm to test received string.
Right now my app uses this algo:
std::wstring testedStr;
for (auto & seed : seeds)
{
if (boost::icontains(testedStr, seed))
{
return true;
}
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented Aho-Corasick algo. If someone could review my code it would be great - I do not have big experience with such algorithms. Link to implementation: gist.github.com

If there are a finite amount of matching strings, this means that you can construct a tree such that, read from root to leaves, similar strings will occupy similar branches.
This is also known as a trie, or Radix Tree.
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
A search for one of the words in the tree will search the tree, starting from a root. Making it to a leaf would correspond to a match to a seed. Regardless, each character in the string should match to one of their children. If it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").
There are various algorithms for constructing the tree as well as searching it as well as modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one since I don't know of the best algorithm for it.
Conceptually, an algorithm you might use to search the tree would be related to the idea behind radix sort of a fixed amount of categories or values that a character in a string might take on at a given point in time.
This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.
Edit: As other answers have mentioned, the Aho-Corasick algorithm for string matching is a sophisticated algorithm for performing string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and having a different search pattern to accompany this. (As an interesting note, Alfred Aho is also a contributor to the the popular compiler textbook, Compilers: Principles, Techniques, and Tools as well as the algorithms textbook, The Design And Analysis Of Computer Algorithms. He is also a former member of Bell Labs. Margaret J. Corasick does not seem to have too much public information on herself.)

You can use Aho–Corasick algorithm
It builds trie/automaton where some vertices marked as terminal which would mean string has seeds.
It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length)
Advantages:
It's specifically works with several dictionary words and check time doesn't depend on number of words (If we not consider cases where it doesn't fit to memory etc)
The algorithm is not hard to implement (comparing to suffix structures at least)
You may make it case insensitive by lowering each symbol if it's ASCII (non ASCII chars don't match anyway)

You should try a pre-existing regex utility, it may be slower than your hand-rolled algorithm but regex is about matching multiple possibilities, so it is likely it will be already several times faster than a hashmap or a simple comparison to all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well tested and fast implementation.
If you have C++11 you already have a standard regex library
#include <string>
#include <regex>
int main(){
std::regex self_regex("google|yahoo|stackoverflow");
regex_match(input_string ,self_regex);
}
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)

One of the faster ways is to use suffix tree https://en.wikipedia.org/wiki/Suffix_tree, but this approach has huge disadvantage - it is difficult data structure with difficult constructing. This algorithm allows to build tree from string in linear complexity https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm

Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a high number of possible candidates and knowing them at compile time using a perfect hash function with a tool like gperf is worth a try. The main principle is, that you seed a generator with your seed and it generates a function that contains a hash function which has no collisions for all seed values. At runtime you give the function a string and it calculates the hash and then it checks if it is the only possible candidate corresponding to the hashvalue.
The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) for seed size and O(1) for string length).
To make the comparison case insensitive you just use tolower on the seed and on your string.

Because number of string is not big (~100), you can use next algo:
Calculate max length of word you have. Let it be N.
Create int checks[N]; array of checksum.
Let's checksum will be sum of all characters in searching phrase. So, you can calculate such checksum for each word from your list (that is known at compile time), and create std::map<int, std::vector<std::wstring>>, where int is checksum of string, and vector should contain all your strings with that checksum.
Create array of such maps for each length (up to N), it can be done at compile time also.
Now move over big string by pointer. When pointer points to X character, you should add value of X char to all checks integers, and for each of them (numbers from 1 to N) remove value of (X-K) character, where K is number of integer in checks array. So, you will always have correct checksum for all length stored in checks array.
After that search on map does there exists strings with such pair (length & checksum), and if exists - compare it.
It should give false-positive result (when checksum & length is equal, but phrase is not) very rare.
So, let's say R is length of big string. Then looping over it will take O(R).
Each step you will perform N operations with "+" small number (char value), N operations with "-" small number (char value), that is very fast. Each step you will have to search for counter in checks array, and that is O(1), because it's one memory block.
Also each step you will have to find map in map's array, that will also be O(1), because it's also is one memory block.
And inside map you will have to search for string with correct checksum for log(F), where F is size of map, and it will usually contain no more then 2-3 strings, so we can in general pretend that it is also O(1).
Also you can check, and if there is no strings with same checksum (that should happens with high chance with just 100 words), you can discard map at all, storing pairs instead of map.
So, finally that should give O(R), with quite small O.
This way of calculating checksum can be changed, but it's quite simple and completely fast, with really rare false-positive reactions.

As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to configure the strings without reconfiguration, but since flex is open source you could cannibalise its code.

string cursors/markers in C++ strings

I am working with some big (megabytes) strings, and I need to modify them by inserting and removing characters at different locations.
To make it more efficient, instead of searching insertion/deletion points every time, I would like to have "cursors" or "tags" which are still valid if text is inserted (i.e. they are moved accordingly), and are still valid if the removed text is not "enclosing" the cursor position (i.e. a cursor becomes invalid only if it was in the removed substring, other cursors are moved accordingly).
I do not need to operate on the string concurrently, but insertion/deletion operations happen always one at time.
Do you know how this can be done with standard C++, boost or a portable, lightweight library?

If the number of insertion points will be relatively small, why not just keep a list (or array) of your insertion points, ordered by and including their offset into the data string.
Then, any time you insert/remove some text, simply pass through that list and adjust any insertion points that are past the offset of the modification, either up or down by the size of the insertion/removal.
Of course, you'd have to decide what it meant to have a modification "hit" one of your insertion points (e.g. what to do if a deleted range includes one or more insertion points), but that'd depend on what you're trying to maintain those markers for.

Maybe you can use special keywords in your text, which you match and replace using regular expressions (regex): http://www.cplusplus.com/reference/regex/
Be careful which keywords you use, because they could potentially occur naturally in the string.

When do we actually use a Trie?

I am starting to read about Trie. I got also references from friends here in: Tutorials on Trie
I am not clear on the following:
It seems that to go on and use a Trie one assumes that all the input strings that will be the search space and used to build the Trie are separated in distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:
S={ball, bid, byte, car, cat, mac, map etc...}
Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean before starting to read about tries I imagined that S would be an arbitrarily long text e.g. A Shakespeare passage.
Then using a Trie we could find things really fast.
But it seems this is not the case.
Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?
So if one wants to search for patterns (same way as you do when you Google and see all pages having also spaces in your search query) a Trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?

Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable it may require less storage for a large dictionary but may well take longer to look up. One example place I have used it is for mapping URLs to operations on a web server were there may be inheritance of functionality based on the prefix. Here recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular url. It would also be efficient for storing a dictionary.
For doing text searches you would typically represent documents using a token vector of leximes with weights (perhaps based on occurance frequency), and then search against that to get a ranking of documents against a particular search vector. There a number of standard libraries to do this which I would suggest using rather than writing your own - particularly for removing stopwords, dealing with synonyms and stemming.

We can use tries for sub string searching in linear time, without pre processing the string every time. You can get a best tutorial on suffix tree generation #
Ukkonen's suffix tree algorithm in plain English?

As the other examples have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:
My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time), each of these names has to be unique and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)
The trie tag probably has many more examples.

There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However Tries can also be used to fully index a complete text. Either you use the Ukkonen suffix tree algorithm, to produce a suffix trie, or you explicetly construct the suffix trie by storing suffixes (much slower than Ukkonens algorithm, but also much simpler). As this is preprocessing, which needs to be done only once speed is not that crucial.
For this you would just take your text, insert the full text, then chop of the first letter, insert the resulting text, chop of second letter, insert...
So if we have the text "The Text" we would insert the following set:
{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}
In the resulting suffix trie we can easily search for any kind of prefix. Also this is space efficient, because we do not need to store the whole string, since common prefixes are stored only once.
If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.
So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example a inverted list (if order is important) or a vector space based retrieval algorithm (in case word order does not matter).

What are some good methods to replace string names with integer hashes

Usually, entities and components or other parts of the game code in data-driven design will have names that get checked if you want to find out which object you're dealing with exactly.
void Player::Interact(Entity *myEntity)
{
if(myEntity->isNearEnough(this) && myEntity->GetFamilyName() == "guard")
{
static_cast<Guard*>(myEntity)->Say("No mention of arrows and knees here");
}
}
If you ignore the possibility that this might be premature optimization, it's pretty clear that looking up entities would be a lot faster if their "name" was a simple 32 bit value instead of an actual string.
Computing hashes out of the string names is one possible option. I haven't actually tried it, but with a range of 32bit and a good hashing function the risk of collision should be minimal.
The question is this: Obviously we need some way to convert in-code (or in some kind of external file) string-names to those integers, since the person working on these named objects will still want to refer to the object as "guard" instead of "0x2315f21a".
Assuming we're using C++ and want to replace all strings that appear in the code, can this even be achieved with language-built in features or do we have to build an external tool that manually looks through all files and exchanges the values?

Jason Gregory wrote this on his book :
At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune.
So you may want to look into that.
And about the build step you mentioned, he also talked about it. They basically encapsulate the strings that need to be hashed in something like:
_ID("string literal")
And use an external tool at build time to hash all the occurrences. This way you avoid any runtime costs.

This is what enums are for. I wouldn't dare to decide which resource is best for the topic, but there are plenty to choose from: https://www.google.com/search?q=c%2B%2B+enum

I'd say go with enums!
But if you already have a lot of code already using strings, well, either just keep it that way (simple and usually enough fast on a PC anyway) or hash it using some kind of CRC or MD5 into an integer.

This is basically solved by adding an indirection on top of a hash map.
Say you want to convert strings to integers:
Write a class wraps both an array and a hashmap. I call these classes dictionaries.
The array contains the strings.
The hash map's key is the string (shared pointers or stable arrays where raw pointers are safe work as well)
The hash map's value is the index into the array the string is located, which is also the opaque handle it returns to calling code.
When adding a new string to the system, it is searched for already existing in the hashmap, returns the handle if present.
If the handle is not present, add the string to the array, the index is the handle.
Set the string and the handle in the map, and return the handle.
Notes/Caveats:
This strategy makes getting the string back from the handle run in constant time (it is merely an array deference).
handle identifiers are first come first serve, but if you serialize the strings instead of the values it won't matter.
Operator[] overloads for both the key and the value are fairly simple (registering new strings, or getting the string back), but wrapping the handle with a user-defined class (wrapping an integer) adds a lot of much needed type safety, and also avoids ambiguity if you want the key and the values to be the same types (overloaded[]'s wont compile and etc)
You have to store the strings in RAM, which can be a problem.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js