Haskell: should I bother compiling regular expressions? - regex

My impulse is to say yes, especially if I'm using the same regex in multiple code locations, but this article indicates that the library will cache the compilation for me (which I'm not even sure how it would do):
There’s normally no need to compile regular expression patterns. A pattern will be compiled the first time it’s used, and your Haskell runtime should memoise the compiled representation for you.

If you reuse a regular expression then it is worthwhile to use the RegexMaker type class to define the "compiled" regular expression. It has the ability to take additional options and the ability to report compilation failure in your choice of Monad.
To use the "compiled" form you can use 'match' or 'matchM' from RegexLike which gives you the equivalent to =~ or ==~ operators.

GHC (as of 7.8.4/regex-tdfa-1.2.0) does not memoize regular expressions matched with (=~) or (=~~). I see an order of magnitude performance improvement using compile and execute with a large number of potential matches.
compile and execute
(=~)

Related

Construct String from Regexp

I've trying to find out a way to construct string data from a compiled *regexp.Regexp with given interface{} array. For example:
re := regexp.MustCompile(`(?P<word>\w+)\s*(?P<num>\d+)`)
I want to construct a string from the structure found in re by a string and a int data which may be received as interface{}.
Can't figure out how I can do that in Go. Please help me out.
Thanks in advance.
Such a library, often called Xeger, exists for many languages, including go. However, this one is called regen: It's a tool to generate random strings from Go/RE2 regular expressions.
Here's an example:
$ regen -n 3 '[a-z]{6,12}(\+[a-z]{6,12})?#[a-z]{6,16}(\.[a-z]{2,3}){1,2}'
iprbph+gqastu#regegzqa.msp
abxfcomj#uyzxrgj.kld.pp
vzqdrmiz#ewdhsdzshvvxjk.pi
Essentially, all regen does is parse the regular expressions it's
given and iterate over the tree produced by regexp/syntax and attempt
to generate strings based on the ops described by its results. This
could probably be optimized further by compiling the resulting Regexp
into a Prog, but I didn't feel like this was worthwhile when it's a
very small tool.
Some additional information can be
found at https://godoc.org/go.spiff.io/regen.
While technically not impossible this is a huge amount of work.
Go regexp package compiles regular expressions into bytecode programs that are then executed to search strings. De-compiling this bytecode back into a regular expression is not impossible but quite difficult and moreover you cannot be sure that the regexp expression you get is the exact same that was used originally... for example
/(?:a+)+/
and
/a+/
will be compiled to the same code for matching (expressions are "simplified" before being passed to the code generator).
Most probably you want to find another solution, like saving the original strings before compiling them into regular expression objects.

Compiletime build up of std::regex

Since I know the regexes at compiletime, and building up a regex is in O(2^m) where m is the length of the regex, I would love to build up the regex at compiletime.
Is this possible with std::regex? (I don't think so, because I don't see any constexpr constructor for basic_regex)
And if not, is there a regex library which can buildup my regexes at compiletime?
A CppCon 2017 lightning talk by Hana Dusikova "Regular Expressions Redefined in C++” described an approach to compile-time regular expressions using a user-defined literal for regex strings and a compile-time approach to generating the matching function. The code is on GitHub, but is still experimental and highly fluid at this time. So it seems that compile-time regexes are probably going to appear sometime soon.
We need to distinguish between program compile and regex compile. The latter is really done at a program runtime and it means building a large but efficient structure (state machine) suitable for fast matching against various strings.
in c++11 regex, regex compilation is done when you construct a regex object of string:
std::regex e (your_re_string);
If you use such an object in regex_match, regex_search, regex_replace, you take the advantage of working with an already-compiled regular expression. So, if you know your string at program compile time, the best thing you can do for the sake of speed is to construct a corresponding regex object just once per program run, say, having it somewhere declared as a static variable with initializer:
static std::regex e (your_constant_re_string);
Probably it is what you want.
Some forms of regex_match, ... function may work immediately with regular expression strings instead. But please note that although it's usually more convenient for a programmer, if you use them, the performance will suffer of doing regex compiling every time such a function called.
P.S. If you really, really, really want to have you regexp compiled at a program compile time, you may
(1) Use an external regexp/lexer compiler software (like https://github.com/madelson/PrecompiledRegex.Fody, Flex https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator) or similar)
(2) compile an std::regex object, then serialize and convert to C++ input (which is actually a DIY version of (1))
But I'm quite sure that it doesn't worth if only wanted in order to save one regex compile per program run. Maybe unless you have really overwhelming expressions.

PCRE repetition based on captured number -- (\d)(.{\1})

1xxx captures x
2xxx captures xx
3xxx captures xxx
I thought maybe this simple pattern would work:
(\d)(.{\1})
But no.
I know this is easy in Perl, but I'm using PCRE in Julia which means it would be hard to embed code to change the expression on-the-fly.
Note that regular expressions are usually compiled to a state machine before being executed, and are not naively interpreted.
Technically, n Xn (where n is a number and X a rule containing all characters) isn't a regular language. It isn't a context-free language, and isn't even a context-sensitive language! (See the Chomsky Hierarchy). While PCRE regexes can match all all context-free languages (if expressed suitably), the engine can only match a very limited subset of context-sensitive languages. We have a big problem on our hand that can neither be solved by regular expressions nor regexes with all the PCRE extensions.
The solution here usually is to separate tokenization, parsing, and semantic validation when trying to parse some input. Here:
read the number (possibly using a regex)
read the following characters (possibly using a regex)
validate that the length of the character string is equal to the given number.
Obviously this isn't going to work in this specific case without implementing backtracking or similar strategies, so we will have to write a parser ourselves that can handle the input:
read the number (possibly using a regex)
then read that number of characters at that position (possibly using a substr-like function).
Regexes are awesome, but they are simply not the correct tool for every problem. Sometimes, writing the program yourself is easier.
It can't be done in general. For the particular example you gave, you can use the following:
1.{1}|2.{2}|3.{3}
If you have a long but fix list of numbers, you can generate the pattern programmatically.

Regular Expression Vs. String Parsing

At the risk of open a can of worms and getting negative votes I find myself needing to ask,
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question Here that only had 1 answer that even bothered giving an example. I need more to understand this.
I'm currently playing around in C++ but Regular Expressions are in almost every Higher Level language and I'd like to know how different languages use/ handle regular expressions also but that's more an after thought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)
It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV for example because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong job for parsing complicated arithmetic expressions though.
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
The rule of thumb for CF grammars is
If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo
#endif
foo * bar
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
It should be Regular Expression AND String Parsing..
You can use both of them to your advantage!Many a times programmers try to make a SINGLE regular expression for parsing a text and then find it very difficult to maintain..You should use both as and when required.
The REGEX engine is FAST.A simple match takes less than a microsecond.But its not recommended for parsing HTML.

Efficiently querying one string against multiple regexes

Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single non deterministic automata (NFA) and possibily transformed in a deterministic automata (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language are you using. For C programmers I would suggest to have a look at the re2c tool. Of course the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann Has done quite a bit of work in this field.
He has a HackageDB project explained breifly here which use partial derivatives seems to be tailor made for this.
The language used is Haskell and thus will be very hard to translate to a non functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.
this is a joke - I like professors, the less the smart ones try to work the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.
If you're thinking in terms of "10,000 regexes" you need to shift your though processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like somethings gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main Caveat: The patters to match were all language patters not regex patterns e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick
A = ahocorasick.Automaton()
patterns = [
[['cat','dog'],'mammals'],
[['bass','tuna','trout'],'fish'],
[['toad','crocodile'],'amphibians'],
]
for row in patterns:
vals = row[0]
for val in vals:
A.add_word(val, (row[1], val))
A.make_automaton()
_string = 'tom loves lions tigers cats and bass'
def test():
vals = []
for item in A.iter(_string):
vals.append(item)
return vals
Running %timeit test() on my 2000 categories with about 2-3 traces per category and a _string length of about 100,000 got me 2.09 ms vs 631 ms doing sequential re.search() 315x faster!.
You'd need to have some way of determining if a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts allowing you to determine that all regexs of a certain branch did not match
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what what captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have it's matched text. If not, it would be undefined/None/null/...
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use ragel. Although it produces a single DFA and can result in many states, and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
disclaimer: the only one i know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regex into a hybrid DFA/Bucchi automata where each time the BA enters an accept state you flag which regex rule "hit".
Bucchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Their regular expressions are quite limited and the machine language is preferred instead, the braces from your example only work with the more general language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
List<Regex> matches = new List<Regex>();
foreach (Regex r in regexes)
{
if (r.IsMatch(string))
{
matches.Add(r);
}
}
return matches;
}
Oh, you meant the fastest code? i don't know then....