Fast string matching algorithm with simple wildcard support - C++

I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a "part" of a URL. That is, fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area, where "/*" should match anything after the slash.
Examples:
*.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
Fast lookup (I realize "fast" is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
Work within the scope of a modern desktop (e.g. not a server implementation)
Ability to return 0:n matches given an input string
Matches will have rule data attached to them
What is the best system/algorithm for such a task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.

First of all, one of the worst performing searches you can do is with a wildcard at both ends of the string "*.domain.com/path*" -- and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domains as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.
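For illustration, the reversal itself is a few lines of code at insert time. A minimal sketch (reverseDomain is just an illustrative helper name, not an existing API):

#include <sstream>
#include <string>
#include <vector>

// Reverse the dot-separated labels of the host part so that
// "sub.site.com/path/home.html" becomes "com.site.sub/path/home.html".
std::string reverseDomain(const std::string& url) {
    const std::size_t slash = url.find('/');
    const std::string host = url.substr(0, slash);
    const std::string rest = (slash == std::string::npos) ? "" : url.substr(slash);

    std::vector<std::string> labels;
    std::stringstream ss(host);
    std::string label;
    while (std::getline(ss, label, '.'))
        labels.push_back(label);

    std::string out;
    for (auto it = labels.rbegin(); it != labels.rend(); ++it) {
        if (!out.empty()) out += '.';
        out += *it;
    }
    return out + rest;
}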
I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.

If I'm not mistaken, you can take a string rule and break it up into domain, path, and query pieces, just like it's a URL. Then you can apply a standard wildcard matching algorithm to each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.
Example
Rule: *.site.com/*
domain => *.site.com
path => /*
query => [empty]
URL: sub.site.com/path/home.html
domain => sub.site.com
path => /path/home.html
query => [empty]
Matching process:
domain => *.site.com matches sub.site.com? YES
path => /* matches /path/home.html? YES
query => [empty] matches [empty] YES
Result: MATCH
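For each individual piece, a standard iterative wildcard matcher is enough. Here is a minimal sketch (wildcardMatch is an illustrative helper; splitting the URL into its three pieces is assumed to happen elsewhere):

#include <string>

// Classic iterative '*' match: returns true if text matches pattern,
// where '*' can stand in for any run of characters (including none).
bool wildcardMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0, t = 0;
    std::size_t starP = std::string::npos, starT = 0;
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == text[t]) {
            ++p; ++t;                       // literal characters match
        } else if (p < pattern.size() && pattern[p] == '*') {
            starP = p++;                    // remember the '*' position
            starT = t;                      // and where we were in the text
        } else if (starP != std::string::npos) {
            p = starP + 1;                  // backtrack: let '*' absorb one more character
            t = ++starT;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing '*'s match ""
    return p == pattern.size();
}

// wildcardMatch("*.site.com", "sub.site.com")  -> true
// wildcardMatch("/*", "/path/home.html")       -> true

A rule matches when all three calls return true.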
As you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the *'s to %'s and then use the database's native LIKE operation to do the matching for you. Then you'd just have a query like
SELECT *
FROM ruleTable
WHERE @urlDomain LIKE ruleDomain
AND @urlPath LIKE rulePath
AND @urlQuery LIKE ruleQuery
where @urlDomain, @urlPath, and @urlQuery are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.
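Against the SQLite C API the lookup could look roughly like this (a sketch only: the ruleData column and the matchRules wrapper are assumptions for illustration, and error handling is omitted):

#include <sqlite3.h>
#include <string>
#include <vector>

// Return the attached data of every rule matching the already-split URL.
std::vector<std::string> matchRules(sqlite3* db,
                                    const std::string& domain,
                                    const std::string& path,
                                    const std::string& query) {
    static const char* sql =
        "SELECT ruleData FROM ruleTable "
        "WHERE @urlDomain LIKE ruleDomain "
        "  AND @urlPath   LIKE rulePath "
        "  AND @urlQuery  LIKE ruleQuery";

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);   // error handling omitted

    sqlite3_bind_text(stmt, sqlite3_bind_parameter_index(stmt, "@urlDomain"),
                      domain.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, sqlite3_bind_parameter_index(stmt, "@urlPath"),
                      path.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, sqlite3_bind_parameter_index(stmt, "@urlQuery"),
                      query.c_str(), -1, SQLITE_TRANSIENT);

    std::vector<std::string> matches;
    while (sqlite3_step(stmt) == SQLITE_ROW)
        matches.emplace_back(
            reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));

    sqlite3_finalize(stmt);
    return matches;
}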

Why won't CloudSearch find substring matches in filename text field?

I have a CloudSearch domain with a filename text field. My issue is that a text query won't match (some) documents with filenames I think it (logically) should. If I have documents with these filenames:
'cars'
'Cars Movie.jpg'
'cars.pdf'
'cars@.jpg'
and I perform a simple text query of 'cars', I get back files #1, #2, and #4 but not #3. If I search 'cars*' (or do a structured query using prefix) I can match #3. This doesn't make sense to me, especially that #4 matches but #3 does not.
TL;DR It's because of the way the tokenization algorithm handles periods.
When you perform a text search, you're performing a search against processed data, not the literal field. (Maybe that should've been obvious, but it wasn't how I was thinking about it before.)
The documentation gives an overview of how text is processed:
During indexing, Amazon CloudSearch processes text and text-array fields according to the analysis scheme configured for the field to determine what terms to add to the index. Before the analysis options are applied, the text is tokenized and normalized.
The part of the process that's ultimately causing this behavior is the tokenization:
During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.
According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.
The reason I was seeing the matches described in the question is because the file extensions are being included with whatever precedes them as a single token. If we look back at the example, and build an index according to these rules, it makes sense why a search of 'cars' returns documents #1, #2, and #4 but not #3.
#   Text              Index
1   'cars'            ['cars']
2   'Cars Movie.jpg'  ['cars', 'movie.jpg']
3   'cars.pdf'        ['cars.pdf']
4   'cars@.jpg'       ['cars', '.jpg']
Possible Solutions
It might seem like setting a custom analysis scheme could fix this, but none of the options there (stopwords, stemming, synonyms) help you overcome the tokenization problem. I think the only possible solution, to get the desired behavior, is to tokenize the filename (using a custom algorithm) before upload, and then store the tokens in a text array field. Although devising a custom tokenization algorithm that supports multiple languages is a large problem.
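As a rough illustration of that pre-upload step, a crude tokenizer that, unlike the default, also breaks on periods might look like this (a simplified sketch that sidesteps the multi-language problem):

#include <cctype>
#include <string>
#include <vector>

// Split a filename into lowercase tokens, breaking on anything that is
// not a letter or digit, so "cars.pdf" becomes ["cars", "pdf"].
std::vector<std::string> tokenizeFilename(const std::string& name) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : name) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// tokenizeFilename("Cars Movie.jpg") -> {"cars", "movie", "jpg"}
// tokenizeFilename("cars.pdf")       -> {"cars", "pdf"}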

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at least three and no more than nine characters long, the alphabetic and numeric portions of the string are treated as separate tokens. For example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric suffix separately from the alphabetic prefix, but that's additional work that may be unnecessary.
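If you did go that route, the split itself is cheap to do at upload time. A sketch, assuming the sid is always letters followed by digits (splitSid is just an illustrative name):

#include <cctype>
#include <string>
#include <utility>

// Split an id like "LC12345" into its alphabetic prefix and numeric
// suffix so they can be stored in separate index fields.
std::pair<std::string, std::string> splitSid(const std::string& sid) {
    std::size_t i = 0;
    while (i < sid.size() && std::isalpha(static_cast<unsigned char>(sid[i])))
        ++i;
    return { sid.substr(0, i), sid.substr(i) };   // {"LC", "12345"}
}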
I'm guessing you are using Cloudsearch in English, so maybe this isn't your specific problem, but also watch out for Stopwords in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
In your example, the word "jo" is a stop word in Danish and other languages; each supported language has a dictionary of very common stop words. If you don't specify a language for your text field, it defaults to English. You can see them here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

Jax-RS overloading methods/paths order of execution

I am writing an API for my app, and I am confused about how Jax-RS deals with certain scenarios.
For example, I define two paths:
@Path("user/{name : [a-zA-Z]+}")
and
@Path("user/me")
The first path that I specified clearly encompasses the second path since the regular expression includes all letters a-z. However, the program doesn't seem to have an issue with this. Is it because it defaults to the most specific path (i.e. /me and then looks for the regular expression)?
Furthermore, what happens if I define two regular expressions as the path with some overlap. Is there a default method which will be called?
Say I want to create three paths for three different methods:
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
Is this best practice/appropriate? How will it know which method to call?
Thank you in advance for any clarification.
This is in the spec in "Matching Requests to Resource Methods"
Sort E using (1) the number of literal characters in each member as the primary key (descending order), (2) the number of capturing groups as a secondary key (descending order), (3) the number of capturing groups with non-default regular expressions (i.e. not ‘([^/]+?)’) as the tertiary key (descending order), ...
What happens is that the candidate paths are sorted using those keys, in the order specified above.
The first sort key is the number of literal characters. So for these three
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
if the requested URI is ../user/me, the last one will always be chosen, as it has the most literal characters (7; the / counts). The others only have 5.
Aside from ../user/me, anything else under ../user/.. will depend on the regexes. In your case one matches only numbers and the other only letters, so there is no way for the two regexes to overlap, and each request will match accordingly.
Now just for fun, let's say we have
@Path("user/{name : .*}")
@Path("user/{id : \\d+}")
@Path("user/me")
If you look at the top two, we now have overlapping regexes. The first will match all numbers, as will the second one. So which one will be used? We can't make any assumptions. This is a level of ambiguity not specified and I've seen different behavior from different implementations. AFAIK, there is no concept of a "best matching" regex. Either it matches or it doesn't.
But what if we wanted the {id : \\d+} to always be checked first? If it matches numbers, then that method should be selected. We can hack it based on the specification. The spec talks about "capturing groups", which are basically the {..}s. The second sorting key is the number of capturing groups. The way we could hack it is to add another "optional" group
@Path("user/{name : .*}")
@Path("user/{id : \\d+}{dummy: (/)?}")
Now the latter has more capturing groups, so it will always be ahead in the sort. All it does is allow an optional /, which doesn't really affect the API, but ensures that if the request URI is all numbers, this path will always be chosen.
You can see a discussion with some test cases in this answer

Search and Replace in Solr?

I'm looking for something like a search and replace functionality in Solr.
I have dumped a document into Solr and am doing some text analysis over it. At times I may need to group a couple of words together and want Solr to treat them as one single token.
For example, "South Africa" should be treated as one single token for further processing. Also note that these groupings can be dynamic; I'm going to let the end user decide which words to group, so NO semantics are required.
My current plan is to add a special character between these two words so Solr will treat it as one single token (StandardTokenizerFactory) for further processing.
So I'm looking for something like:
replace("South Africa", "South_Africa")
Does anyone have a solution?
Use a Synonym filter and define these replacements in a synonyms.txt file. Once you have all of your definitions, rebuild the index.
You would probably have an entry like this to handle both the case where a field has a LowerCase filter before Synonym and where Synonym comes before LowerCase.
South Africa,south africa => southafrica
More info here http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
You could perhaps use a PatternReplaceFilter and a clever regexp.

Tokenize the text depending on some specific rules. Algorithm in C++

I am writing a program which will tokenize the input text depending upon some specific rules. I am using C++ for this.
Rules
Letter 'a' should be converted to token 'V-A'
Letter 'p' should be converted to token 'C-PA'
Letter 'pp' should be converted to token 'C-PPA'
Letter 'u' should be converted to token 'V-U'
This is just a sample and in real time I have around 500+ rules like this. If I am providing input as 'appu', it should tokenize like 'V-A + C-PPA + V-U'. I have implemented an algorithm for doing this and wanted to make sure that I am doing the right thing.
Algorithm
All rules will be kept in a XML file with the corresponding mapping to the token. Something like
<rules>
<rule pattern="a" token="V-A" />
<rule pattern="p" token="C-PA" />
<rule pattern="pp" token="C-PPA" />
<rule pattern="u" token="V-U" />
</rules>
1 - When the application starts, read this XML file and keep the values in a 'std::map'. This will be available until the end of the application (singleton pattern implementation).
2 - Iterate over the characters of the input text. For each character, look for a match. If found, become more greedy and look for a longer match by taking the next characters from the input text. Do this until we get no match. So for the input text 'appu', first look for a match for 'a'. If found, try to get a longer match by taking the next character from the input text. So it will try to match 'ap', find no match, and just return.
3 - Remove the letter 'a' from the input text, as we got a token for it.
4 - Repeat step 2 and 3 with the remaining characters in the input text.
Here is a more simple explanation of the steps
input-text = 'appu'
tokens-generated=''
// First iteration
character-to-match = 'a'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ap'
pattern-found = false
tokens-generated = 'V-A'
// since no match found for 'ap', taking the first success and replacing it from input text
input-text = 'ppu'
// second iteration
character-to-match = 'p'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'pp'
pattern-found = true
// since pattern found, going recursive and check for more matches
character-to-match = 'ppu'
pattern-found = false
tokens-generated = 'V-A + C-PPA'
// since no match found for 'ppu', taking the first success and replacing it from input text
input-text = 'u'
// third iteration
character-to-match = 'u'
pattern-found = true
tokens-generated = 'V-A + C-PPA + V-U' // we're done!
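For reference, the matching loop I have in mind looks roughly like this (a simplified sketch, not my exact code; loading the rules from the XML is omitted):

#include <map>
#include <string>
#include <vector>

// Greedy loop: start with one character and keep extending the match
// while the longer string is still a known pattern.
std::vector<std::string> tokenize(const std::string& input,
                                  const std::map<std::string, std::string>& rules) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t len = 1;
        if (rules.find(input.substr(pos, len)) == rules.end())
            break;                               // no rule matches here (error handling TBD)
        while (pos + len < input.size() &&
               rules.find(input.substr(pos, len + 1)) != rules.end())
            ++len;
        tokens.push_back(rules.at(input.substr(pos, len)));
        pos += len;                              // "replace" the matched letters
    }
    return tokens;
}

// With the four sample rules, tokenize("appu", rules) yields {"V-A", "C-PPA", "V-U"}.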
Questions
1 - Does this algorithm look fine for this problem, or is there a better way to address it?
2 - If this is the right method, is std::map a good choice here? Or do I need to create my own key/value container?
3 - Is there a library available which can tokenize strings like the above?
Any help would be appreciated
:)
So you're going through all of the tokens in your map looking for matches? You might as well use a list or array there; it's going to be an inefficient search regardless.
A much more efficient way of finding just the tokens suitable for starting or continuing a match would be to store them as a trie. A lookup of a letter there would give you a sub-trie which contains only the tokens which have that letter as the first letter, and then you just continue searching downward as far as you can go.
Edit: let me explain this a little further.
First, I should explain that I'm not familiar with the C++ std::map beyond the name, which makes this a perfect example of why one learns the theory of this stuff as well as the details of particular libraries in particular programming languages: unless that library is badly misusing the name "map" (which is rather unlikely), the name itself tells me a lot about the characteristics of the data structure. I know, for example, that there's going to be a function that, given a single key and the map, will very efficiently search for and return the value associated with that key, and that there's also likely a function that will give you a list/array/whatever of all of the keys, which you could search yourself using your own code.
My interpretation of your data structure is that you have a map where the keys are what you call a pattern, those being a list (or array, or something of that nature) of characters, and the values are tokens. Thus, you can, given a full pattern, quickly find the token associated with it.
Unfortunately, while such a map is a good match for converting your XML input format to an internal data structure, it's not a good match for the searches you need to do. Note that you're not looking up entire patterns, but the first character of a pattern, producing a set of possible tokens, followed by a lookup of the second character of a pattern from within the set of patterns produced by that first lookup, and so on.
So what you really need is not a single map, but maps of maps of maps, each keyed by a single character. A lookup of "p" on the top level should give you a new map, with two keys: p, producing the C-PPA token, and "anything else", producing the C-PA token. This is effectively a trie data structure.
Does this make sense?
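A bare-bones sketch of such a trie, just to make the idea concrete (not production code; longestMatch returns 0 when nothing matches):

#include <map>
#include <memory>
#include <string>

// Each node maps a character to a child; a node that ends a pattern
// carries that pattern's token.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::string token;                       // non-empty if a pattern ends here
};

void insert(TrieNode& root, const std::string& pattern, const std::string& token) {
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->token = token;
}

// Walk downward from `pos`, remembering the deepest node that ended a
// pattern. Returns the length of the longest match and fills `token`.
std::size_t longestMatch(const TrieNode& root, const std::string& text,
                         std::size_t pos, std::string& token) {
    const TrieNode* node = &root;
    std::size_t best = 0;
    for (std::size_t i = pos; i < text.size(); ++i) {
        auto it = node->children.find(text[i]);
        if (it == node->children.end()) break;
        node = it->second.get();
        if (!node->token.empty()) { best = i - pos + 1; token = node->token; }
    }
    return best;
}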
It may help if you start out by writing the parsing code first, in this manner: imagine someone else will write the functions to do the lookups you need, and he's a really good programmer and can do pretty much any magic that you want. Writing the parsing code, concentrate on making that as simple and clean as possible, creating whatever interface using these arbitrary functions you need (while not getting trivial and replacing the whole thing with one function!). Now you can look at the lookup functions you ended up with, and that tells you how you need to access your data structure, which will lead you to the type of data structure you need. Once you've figured that out, you can then work out how to load it up.
This method will work - I'm not sure that it is efficient, but it should work.
I would use the standard std::map rather than your own system.
There are tools like lex (or flex) that can be used for this. The issue would be whether you can regenerate the lexical analyzer that it would construct when the XML specification changes. If the XML specification does not change often, you may be able to use tools such as lex to do the scanning and mapping more easily. If the XML specification can change at the whim of those using the program, then lex is probably less appropriate.
There are some caveats - notably that both lex and flex generate C code, rather than C++.
I would also consider looking at pattern matching technology - the sort of stuff that egrep in particular uses. This has the merit of being something that can be handled at runtime (because egrep does it all the time). Or you could go for a scripting language - Perl, Python, ... Or you could consider something like PCRE (Perl Compatible Regular Expressions) library.
Better yet, if you're going to use the boost library, there's always the Boost tokenizer library -> http://www.boost.org/doc/libs/1_39_0/libs/tokenizer/index.html
You could use a regex (perhaps the boost::regex library). If all of the patterns are just strings of letters, a regex like "(pp|a|p|u)" would find the longest match at each position (note the longer alternatives are listed first, since alternation takes the first alternative that matches). So:
1. Run a regex_search using the above pattern to locate the next match.
2. Plug the matched text into your std::map to get the replace text.
3. Print the non-matched consumed input and the replace text to your output, then repeat step 1 on the remaining input.
And done.
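In code that loop could look something like this (a sketch using std::regex in place of boost::regex, with the alternation written out by hand for the four sample rules and the non-matched-input printing from step 3 left out):

#include <iostream>
#include <map>
#include <regex>
#include <string>

// Longer alternatives are listed first so "pp" is preferred over "p".
void tokenize(const std::string& input,
              const std::map<std::string, std::string>& rules) {
    const std::regex pattern("(pp|a|p|u)");
    std::string out;
    for (auto it = std::sregex_iterator(input.begin(), input.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        if (!out.empty()) out += " + ";
        out += rules.at(it->str());          // map the matched text to its token
    }
    std::cout << out << '\n';                // "V-A + C-PPA + V-U" for "appu"
}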
It may seem a bit complicated, but the most efficient way to do this is to use a graph to represent a state chart. At first, I thought boost.statechart would help, but I figured it wasn't really appropriate. This method can be more efficient than using a simple std::map IF there are many rules, the number of possible characters is limited, and the length of the text to read is quite high.
So anyway, using a simple graph:
0) create the graph with a "start" vertex
1) read the XML configuration file and create vertices as needed (a transition from one "set of characters" (e.g. "pp") to a longer one (e.g. "ppa")). Inside each vertex, store a transition table to the next vertices. If the "key text" is complete, mark the vertex as final and store the resulting text
2) now read the text and interpret it using the graph. Start at the "start" vertex. (*) Use the table to interpret one character and jump to the new vertex. If no new vertex can be selected, issue an error. Otherwise, if the new vertex is final, print the resulting text and jump back to the start vertex. Go back to (*) until there is no more text to interpret.
You could use boost.graph to represent the graph, but I think it is overly complex for what you need. Make your own custom representation.