Derive minimal regular expression from input - C++

I have a remote "agent" that returns "yes" or "no" when handed a string. Communicating with this agent is expensive, so I'm hoping to find a library that will allow me to iteratively build a regular expression given positive and negative feedback, while being intelligent about its construction. This would allow me to cache answers on the sending side.
For example, suppose we query the agent with "good" and receive a "yes". The initial derived regular expression ought to be "good".
Suppose I then query with "goop" and receive a "yes". I would expect the derived regular expression to be "goo[dp]", not "good|goop".
And so forth.
I do not need backtracking or any other fancy non-linear-time operations in my derived regex. Presumably the generated regex would be a DFA under the hood. Is anyone aware of any C/C++ regular expression libraries capable of doing this? Alternatively, reasons why this is a dumb idea, and better solutions to my real problem, would also be useful.

Rather than a regular expression, you could use a Trie.
Then for each new string you walk the trie one node for each character. I suspect that you would also want a marker character for end of string - once you reach this character, if the node exists, it holds the yes/no answer.
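A minimal sketch of that trie in C++ (the node layout and all names here are my own invention, not from any particular library):

#include <memory>
#include <string>
#include <unordered_map>

// One node per character. 'hasAnswer' plays the role of the end-of-string
// marker: it is set only on the node reached after consuming the whole string.
struct TrieNode {
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    bool hasAnswer = false;
    bool answer = false;   // the cached yes/no from the agent
};

struct TrieCache {
    TrieNode root;

    void insert(const std::string& s, bool yes) {
        TrieNode* node = &root;
        for (char c : s) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->hasAnswer = true;
        node->answer = yes;
    }

    // nullptr means "we have never asked the agent about s".
    const bool* lookup(const std::string& s) const {
        const TrieNode* node = &root;
        for (char c : s) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node->hasAnswer ? &node->answer : nullptr;
    }
};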

Well, unless I'm missing something in your situation, I think memory is cheap enough to just straight up implement a dumb cache - say, a std::unordered_map<std::string, bool>. Not only will this be much easier to build, it will probably be faster too, since it's a hash map. The only downside is that if you were going to query the remote service with a bazillion different keys, this might not be the best approach.
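For illustration, a sketch of that dumb cache, with queryRemoteAgent standing in for whatever your expensive call actually is (it is a hypothetical name, not a real API):

#include <string>
#include <unordered_map>

bool queryRemoteAgent(const std::string& s);  // the expensive call (assumed)

std::unordered_map<std::string, bool> cache;

bool isAccepted(const std::string& s) {
    auto it = cache.find(s);
    if (it != cache.end()) return it->second;  // answered before: no round trip
    bool yes = queryRemoteAgent(s);
    cache.emplace(s, yes);
    return yes;
}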

Related

Efficient algorithm for string matching with a very large pattern set

I'm looking for an efficient algorithm able to find all patterns that match a specific string. The pattern set can be very large (more than 100,000) and dynamic (patterns added or removed at any time). Patterns are not necessarily standard regexps; they can be a subset of regexp or something similar to shell patterns (e.g. file-*.txt). A solution for a subset of regex is preferred (as explained below).
FYI: I'm not interested in brute-force approaches based on a list of regexps.
By simple regexp, I mean a regular expression that supports ?, *, +, character classes like [a-z], and possibly the alternation operator |.
To clarify my need: I wish to find all patterns that match the URL:
http://site1.com/12345/topic/news/index.html
The response should be these patterns based on the pattern set below.
http://*.site1.com/*/topic/*
http://*.site1.com/*
http://*
Pattern set:
http://*.site1.com/*/topic/*
http://*.site1.com/*/article/*
http://*.site1.com/*
http://*.site2.com/topic/*
http://*.site2.com/article/*
http://*.site2.com/*
http://*
Here is an approach we've used pretty successfully (implementation here):
Adding Patterns:
For any pattern there exists a set of sub-strings a string must contain in order to have a chance of matching against it. Call these meta words. For example:
dog*fish -> [dog, fish]
[lfd]og -> [og]
dog? -> [dog]
When you add a pattern to the data structure, break it up into meta words and store them in an Aho-Corasick string-matching data structure. Maintain an internal data structure to map meta words back to pattern words.
Running Queries:
Given an input string, use the Aho-Corasick data structure you've built to get all the meta words contained in that string. Then, using the map you've created, test the patterns that correspond to those meta words.
This works well because, while matching against a full pattern is fairly slow, you can narrow down the number of patterns you actually have to match against very quickly. Our implementation can perform about 200,000 queries per second, on a regular laptop, against sets of 150,000+ patterns. See the benchmarking mode in the program to test that.
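To make the scheme concrete, here is a simplified C++ sketch. It is not the linked implementation: plain std::string::find stands in for the Aho-Corasick scan (so it shows the filtering idea, not the fast multi-pattern search), metaWords handles only glob-style *, ? and [...], and matchesPattern is assumed to be whatever full matcher you use:

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Assumed to exist elsewhere: the full (slow) matcher for one pattern.
bool matchesPattern(const std::string& pattern, const std::string& text);

// Extract the literal substrings ("meta words") a string must contain
// to have any chance of matching the pattern. Glob-style only: '*' and
// '?' break words, and character classes like [lfd] are skipped.
std::vector<std::string> metaWords(const std::string& pattern) {
    std::vector<std::string> words;
    std::string cur;
    for (size_t i = 0; i < pattern.size(); ++i) {
        char c = pattern[i];
        if (c == '[') {                              // skip the whole class
            while (i < pattern.size() && pattern[i] != ']') ++i;
            if (!cur.empty()) { words.push_back(cur); cur.clear(); }
        } else if (c == '*' || c == '?') {
            if (!cur.empty()) { words.push_back(cur); cur.clear(); }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) words.push_back(cur);
    return words;                 // dog*fish -> [dog, fish], [lfd]og -> [og]
}

struct PatternIndex {
    std::vector<std::string> patterns;
    std::unordered_map<std::string, std::vector<size_t>> byWord;
    std::vector<size_t> noWords;  // patterns with no literal part, e.g. "*"

    void add(const std::string& pattern) {
        size_t id = patterns.size();
        patterns.push_back(pattern);
        auto words = metaWords(pattern);
        if (words.empty()) { noWords.push_back(id); return; }
        for (const auto& w : words) byWord[w].push_back(id);
    }

    std::vector<std::string> query(const std::string& text) const {
        // Aho-Corasick would find every meta word in one pass over 'text';
        // the find() loop below is the slow stand-in.
        std::unordered_set<size_t> candidates(noWords.begin(), noWords.end());
        for (const auto& [word, ids] : byWord)
            if (text.find(word) != std::string::npos)
                candidates.insert(ids.begin(), ids.end());
        std::vector<std::string> hits;
        for (size_t id : candidates)
            if (matchesPattern(patterns[id], text))
                hits.push_back(patterns[id]);
        return hits;
    }
};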
One approach that comes to mind is to create tree structures of patterns.
Example: http://* would contain all the patterns (listed above). http://*.site1.com/* would contain all the site1.com ones. This could significantly reduce the number of patterns that need to be checked.
Additionally, you could determine which patterns are mutually exclusive to further prune the list you search.
So first take all the patterns and create trees out of them. Search all roots to determine which branches and nodes need to be analyzed.
Improve the algorithm by determining which branches are mutually exclusive so once you find a hit on a given branch you would know which branches/nodes do not need to be visited.
To get started you could be lazy: a first pass could sort the patterns and use simple "does the next pattern contain this pattern" logic to determine whether "this" is contained in the next. E.g. if ("http://*.site1.com/*".startsWith("http://*"))
You could get more sophisticated in your ability to determine whether one pattern actually contains another, but this would get you started.
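A sketch of that lazy first pass in C++ (the helper names and the trailing-* heuristic are mine; this is the startsWith trick, not a real subsumption test):

#include <algorithm>
#include <string>
#include <vector>

// Heuristic containment: B starts with A's literal text and A ends in '*'.
// e.g. roughlyContains("http://*", "http://*.site1.com/*") == true
bool roughlyContains(const std::string& general, const std::string& specific) {
    return !general.empty() && general.back() == '*' &&
           specific.size() >= general.size() &&
           specific.compare(0, general.size(), general) == 0;
}

// Sort shortest-first so potential parents are seen before their children,
// then link each pattern to the tightest (longest) pattern containing it.
std::vector<int> buildParents(std::vector<std::string>& patterns) {
    std::sort(patterns.begin(), patterns.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() < b.size();
              });
    std::vector<int> parent(patterns.size(), -1);
    for (size_t i = 0; i < patterns.size(); ++i)
        for (size_t j = 0; j < i; ++j)
            if (roughlyContains(patterns[j], patterns[i]))
                parent[i] = static_cast<int>(j);  // later j = longer = tighter
    return parent;
}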
To get any better at determining the question:
"Does this pattern contain that pattern?"
I believe you would need to be able to parse the regex... This article looks like a good place to start in understanding how to accomplish that: Parsing regular expressions with recursive descent
If the set of URLs doesn't change very fast, you really should use a regex engine that compiles down its patterns. Java provides one of these, but it might not be satisfactory if you want to know which pattern matches.
A widely used mechanism for doing this, and for determining which pattern matched, is the family of lexer generators, e.g., FLEX and similar tools. They accept what amounts to a regex for each "lexeme" and build an integrated FSA to recognize any of them, which is extremely efficient to execute.
You could invoke Flex whenever your set changes. If that's too slow, get an open-source version of Flex and integrate it into your engine; it internally builds the FSA, so you could use it directly. (Some engineering is likely necessary.) But if you really have a high-performance matching problem, some work to do it well won't bother you.
If the set of URLs changes faster than FLEX can generate FSAs (odd), you have a real problem. In that case, you might build an online discrimination tree by scanning the "regex" from left to right and integrating the characters/predicates you see into your existing discrimination tree. Matching then consists of walking down the discrimination tree, performing the various tests; if you arrive at a leaf, you have a match, otherwise not. This might be just as fast as a FLEX-generated automaton if done right, but likely a lot, lot bigger.

Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion, however I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is, if I just use if (expression == ".*"), then this could be easily sidestepped by entering something such as .*.*.
Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?
My thoughts are:
I could see if the expression is one or more repetitions of .* (i.e. if it matches (\.\*)+; the quotations/escapes may not be entirely accurate, but you get the idea). The problem with this is that there may be other forms of writing a global match (e.g. with $ and ^) that are too exhaustive to even think of upfront, let alone test.
I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
Thoughts, anyone?
(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)
Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions
You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
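A sketch of that probabilistic test, in C++ for concreteness (the thread is Java, but the logic carries over directly); the trial count and length/character ranges are arbitrary choices:

#include <random>
#include <regex>
#include <string>

// Heuristic only: one non-match proves the filter is narrower than .*;
// a clean run of matches merely suggests it is overly broad.
bool looksLikeMatchAll(const std::regex& re, int trials = 1000) {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> len(0, 40);
    std::uniform_int_distribution<int> ch(32, 126);   // printable ASCII
    for (int i = 0; i < trials; ++i) {
        std::string s;
        for (int n = len(gen); n > 0; --n) s += static_cast<char>(ch(gen));
        if (!std::regex_match(s, re)) return false;   // counterexample found
    }
    return true;
}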
There are many, many possibilities for achieving something equivalent to .*: e.g. just put any class of characters and its complement into a character class or an alternation and it will match anything.
So I don't think it's possible to test, with a regular expression, whether another regular expression is equivalent to .*.
Here are some examples that match the same strings as .* (they will additionally match newline characters):
/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
So I assume your idea 2 would be a lot easier to achieve.
Thanks everyone,
I did miss the testing for equivalence entry on the wikipedia, which was interesting.
My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left rested at the moment!
I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.
Now to decide what type of strings to generate in order to run the tests....
Kind regards,
Russ.

TCL string match vs regexps

Is it right that we should avoid using regexp because it is slow, and instead use string operations? Are there cases where both can be used but regexp is the better choice?
You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo" with string operations and it will also find "Foobar". Is this OK? No? Well then, maybe search for "Foo ", but then it will not find "Foo," or "Foo.".
With regex there is no problem: you can match a word boundary with /\mFoo\M/, and this regex will not be slow.
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex; if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here is an interesting read about code optimization.
Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, but it won't be pretty. And some problems can be solved with REs if you hit them enough, but there are much better solutions available (for example, things that are a good fit for string map)
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception - REs support them; with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.
The best advice I can give, and the advice I use myself is, use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no time when you can use either but that regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.
I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e.g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e.g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool (\bc\w*t\b would be a good regex for this - compare it to the program logic you'd need if you had to write this yourself).
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two a's and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
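To illustrate the contrast outside Tcl, a small C++ sketch of the same two tasks (exact text vs. pattern), with std::regex purely as a stand-in for Tcl's regexp:

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "the cat sat on the mat";

    // Exact text: a plain substring search is all you need.
    if (line.find("#define") != std::string::npos)
        std::cout << "found a #define\n";

    // Pattern: a word that starts with 'c' and ends with 't'.
    std::regex wordCtoT(R"(\bc\w*t\b)");
    std::smatch m;
    if (std::regex_search(line, m, wordCtoT))
        std::cout << "first match: " << m.str() << "\n";  // prints "cat"
}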
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.
Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)
regexp, aka regular expressions, are used to match many different strings, can be very complex, and can even be used to validate specific input.
string match only allows wildcards such as * and ?, plus basic character grouping with [], as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide what regexp can do also with some examples are explained here: http://www.regular-expressions.info/
So in short: if you don't need regexp, or don't know much about it, I recommend not using it. If you just want to compare two strings for equality, use string equal.

intersection regular expression with groups, how to derive the intersection of the grouped bit?

I'm currently trying to solve a problem that's similar to Testing intersection of two regular languages with the exception that I know how to do the intersection, but have an additional requirement.
The intersection logic I intend to use is the Dragon Book's algorithm for converting an NFA to a DFA, but executed on two NFAs at the same time. Since all DFAs are NFAs (just with very little non-determinism), you can repeat this as needed for more intersections.
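A sketch of that lock-step product construction in C++, restricted to two DFAs and an emptiness test for clarity (the DFA struct and state numbering are my own framing, not the Dragon Book's):

#include <map>
#include <queue>
#include <set>
#include <utility>

struct DFA {
    int start;
    std::set<int> accepting;
    std::map<std::pair<int, char>, int> delta;  // (state, symbol) -> state
};

// Run both DFAs in lock-step over the product of their state spaces.
// A product state (p, q) accepts iff p accepts in A and q accepts in B,
// so the intersection is non-empty iff such a state is reachable.
bool intersectionNonEmpty(const DFA& a, const DFA& b,
                          const std::set<char>& alphabet) {
    std::set<std::pair<int, int>> seen{{a.start, b.start}};
    std::queue<std::pair<int, int>> work;
    work.push({a.start, b.start});
    while (!work.empty()) {
        auto [p, q] = work.front();
        work.pop();
        if (a.accepting.count(p) && b.accepting.count(q))
            return true;                 // reachable joint accepting state
        for (char c : alphabet) {
            auto pa = a.delta.find({p, c});
            auto qb = b.delta.find({q, c});
            if (pa == a.delta.end() || qb == b.delta.end()) continue;
            std::pair<int, int> next{pa->second, qb->second};
            if (seen.insert(next).second) work.push(next);
        }
    }
    return false;
}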
My problem is that one of my regexes has groups that can be used further on as a part of a new regex. Concretely:
bin/x86/a.out: obj/x86/.*\.o
obj/{[a-zA-Z0-9]+}/{.*}.o: src/\2.c
At the end of the first line I have a regex that matches all objects for x86 targets. In the second line I have a regex that specifies a possible build line, which should match the first group against the fixed "x86" and the second against any given string after it. In the example the first match isn't used yet, but it should be retrievable. To make sure that the matching ends (and to allow recursive rules), I want to use the information gained from the first regex in matching the second. The rule is selected by taking the second regex from the first line and the first regex from the second line, and determining whether the intersection of the two (the DFA resulting from the intersection) has an accepting state. If it does, there are sentences that both can parse and therefore some values that the group can take.
Is it possible, in general, to extract information from the first regex for use in matching the group of the second regex?
If not in general, what kinds of restrictions do I need to add?
I believe back-references make the language non-regular, so you won't be able to convert it to a finite automaton.
Why do your examples look like Makefile rules, even though Makefiles don't support regular expressions?
Because that's the thing I'm trying to make (no pun intended).
Which regex library do you use?
None, as of yet. I'm considering writing my own based on the output of this question. If this isn't possible, I may make do with an existing one that supports this. If it is theoretically possible, I'll develop my own to do exactly this and make the application as I intend it.
Some support lookahead expressions, which are another way of expressing intersections.
The idea behind the intersection is to define rules that are generic and can contain multiple varying left-side parts (the use of % in usual makefiles, but without the requirement to do some sort of recursive make if you have more than one variation point, such as the platform, build type or file name). If I can't take the second regex into account for the group, I can't use such a rule recursively, because the recursion wouldn't have any change between each step/level. That would reduce the genericity but would still be acceptable. Still, it's an interesting question to know the answer to (i.e., can it be done generically?) and it'll be a deciding factor in the requirements I'll have for a regex library.
(Not posted as original author because I lost my cookie & am waiting for the accounts to be merged).

How to generate random strings that match a given regexp?

Duplicate:
Random string that matches a regexp
No, it isn't. I'm looking for an easy and universal method, one that I could actually implement. That's far more difficult than randomly generating passwords.
I want to create an application that takes a regular expression and shows 10 randomly generated strings that match that expression. It's supposed to help people better understand their regexps, and to decide e.g. whether they're secure enough for validation purposes. Does anyone know of an easy way to do that?
One obvious solution would be to write (or steal) a regexp parser, but that seems really over my head.
I repeat, I'm looking for an easy and universal way to do that.
Edit: Brute force approach is out of the question. Assuming the random strings would just be [a-z0-9]{10} and 1 million iterations per second, it would take 65 years to iterate through the space of all 10-char strings.
Parse your regular expression into a DFA, then traverse your DFA randomly until you end up in an accepting state, outputting a character for each transition. Each walk will yield a new string that matches the expression.
This doesn't work for "regular" expressions that aren't really regular, though, such as expressions with backreferences. It depends on what kind of expression you're after.
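A sketch of the random-walk step in C++, assuming the regex has already been compiled into a DFA (the parsing is omitted, and the stop probability and length cap are arbitrary choices to guarantee termination):

#include <map>
#include <optional>
#include <random>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Dfa {
    int start;
    std::set<int> accepting;
    // state -> outgoing edges as (symbol, next state)
    std::map<int, std::vector<std::pair<char, int>>> out;
};

// One random walk: follow random transitions, emitting one character per
// edge; on an accepting state, flip a coin to stop. Returns nothing if
// the length cap is hit or we get stuck in a dead end; the caller retries.
std::optional<std::string> randomMatch(const Dfa& d, std::mt19937& gen,
                                       int maxLen = 100) {
    std::bernoulli_distribution stopHere(0.3);
    std::string s;
    int state = d.start;
    for (int steps = 0; steps <= maxLen; ++steps) {
        if (d.accepting.count(state) && stopHere(gen)) return s;
        auto it = d.out.find(state);
        if (it == d.out.end() || it->second.empty()) {
            if (d.accepting.count(state)) return s;  // dead end, but accepting
            return std::nullopt;                     // stuck: retry
        }
        std::uniform_int_distribution<size_t> pick(0, it->second.size() - 1);
        auto [c, next] = it->second[pick(gen)];
        s += c;
        state = next;
    }
    return std::nullopt;
}

Calling this ten times (retrying on std::nullopt) gives the ten sample strings; tuning the stop probability changes how long the samples tend to be.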
Take a look at Perl's String::Random.
One rather ugly solution that may or may not be practical is to leverage an existing regex diagnostics option. Some regex libraries have the ability to figure out where the regex failed to match. In this case, you could use what is in effect a form of brute force, but adding one character at a time and trying to get longer (and further-matching) strings until you get a full match. This is a very ugly solution. However, unlike a standard brute-force solution, its failure on a string like ab will also tell you whether there exists a string matching ab.* (if not, stop and try ac; if so, try a longer string). This is probably not feasible with all regex libraries.
On the bright side, this kind of solution is probably pretty cool from a teaching perspective. In practice it's probably similar in effect to a DFA solution, but without the requirement to think about DFAs.
Note that you won't want to use random strings with this technique. However, you can use random characters to start with if you keep track of what you've tested in a tree, so the effect is the same.
If your only criteria are that your method is easy and universal, then there ain't nothing easier or more universal than brute force. :)
var myListOfGoodStrings = [];
for (var i = 0; i < 10; ++i) {
    var str;                             // declared outside the loop body
    do {
        str = generateRandomString();    // assumed to exist elsewhere
    } while (!myRegex.test(str));        // RegExp has test(), not match()
    myListOfGoodStrings.push(str);
}
Of course, this is a very silly way to do things and mostly was meant as a joke.
I think your best bet would be to try writing your own very basic parser, teaching it just the things you're expecting to encounter (e.g. letter and number ranges, repeating/optional characters; don't worry about look-behinds etc.).
The universality criterion is impossible. Given the regular expression "^To be, or not to be -- that is the question:$", there will not be ten unique random strings that match.
For non-degenerate cases:
moonshadow's link to Perl's String::Random is the answer. A Perl program that reads a RegEx from stdin and writes the output from ten invocations of String::Random to stdout is trivial. Compile it to either a Windows or Unix exe with Perl2exe and invoke it from PHP, Python, or whatever.
Also see Random Text generator based on regex