How to convert a PCRE to a POSIX RE?

This interesting question, "Regex to match anything (including the empty string) except a specific given string", concerned how to do a negative look-ahead in MySQL. The poster wanted to get the effect of
Kansas(?! State)
Because MySQL doesn't implement look-ahead assertions, a number of answers came up with the equivalent
Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))
The poster pointed out that this is a PITA to construct by hand for potentially many expressions.
Is there a script/utility/mode of PCRE (or some other package) that will convert a PCRE (if possible) to an equivalent regex that doesn't use Perl's snazzy features? I'm fully aware that some Perl-style regexes cannot be stated as an ordinary regex, so I would not expect the tool to do the impossible, of course!
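Purely for illustration, here is a quick Perl sketch (my own, not from the original thread) checking that the look-ahead form and the hand-expanded alternation classify a few sample strings the same way. The sample strings are invented; single-quote delimiters are used for qr so the $ alternatives are not interpolated as variables.

#!/usr/bin/env perl
use strict;
use warnings;

# The look-ahead version and the hand-expanded version from the question.
my $lookahead = qr'Kansas(?! State)';
my $expanded  = qr'Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))';

for my $s ('Kansas City', 'Kansas State', 'Kansas', 'Kansas Statute') {
    printf "%-15s lookahead=%d expanded=%d\n",
        $s,
        ($s =~ $lookahead ? 1 : 0),
        ($s =~ $expanded  ? 1 : 0);
}

Both columns come out the same for these inputs, which is the sense in which the expansion is "equivalent".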

You don't want to do this. It isn't actually mind-bogglingly difficult to translate the advanced features to basic features - it's just another flavor of compiler, and compiler writers are pretty clever people - but most of the things that the snazzy features solve are (a) impossible to do with a standard regex, because they recognize non-regular languages, so you'd have to approximate them so that they at least work for limited-length text, or (b) possible, but only with a regex of exponential size. And 'exponential' is compsci-speak for "don't go there". You will get swamped in OutOfMemory errors and seemingly-infinite loops if you try to use an exponential solution on anything you would actually want to process.
In other words: abandon all hope, ye who enter here. It is virtually always better to let the regex do what it's good at and do the rest with other tools. Even such a simple thing as inverting a regex is much, much more easily solved with the original regex in combination with the negation operator than with the monstrosity that an accurate regex inverter would produce.
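To make the negation-operator point concrete, here is a minimal Perl sketch (illustrative only; the sample lines are made up): keep the pattern simple and positive, and invert the result in the host language (with !~ here, or with NOT REGEXP in MySQL) instead of expanding the pattern itself.

#!/usr/bin/env perl
use strict;
use warnings;

# Print every line EXCEPT those containing "Kansas State".
while (my $line = <DATA>) {
    print $line if $line !~ /Kansas State/;
}

__DATA__
Kansas City is on the Missouri River.
Kansas State plays in the Big 12.
Kansas is the Sunflower State.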

Related

Why don't regex engines ensure all required characters are in the string?

For example, look at this email validating regex:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. If you look carefully, there are three parts: stuff, the @ character, and more stuff. So the regex requires an email address to contain an @, and thus the string aaaaaaaaaaaaaaaaaaaaaa! will not match.
Yet most regex engines will catastrophically backtrack given this combination. (PCRE, which powers Regex101, is smarter than most, but other regex/string combinations can cause catastrophic backtracking.)
Without needing to know much about Big O, I can tell that combinatorial things are exponential, while searching is linear. So why don't regex engines ensure the string contains required characters (so they can quit early)?
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Coming back after getting more experience, I know that regexes are declarative, meaning the execution plan is determined by the computer, not the programmer. Optimization is one of the ways that regex engines differ the most.
While PCRE and Perl have challenged the declarative status quo by introducing backtracking control verbs, it is the engines without those verbs that are most likely to backtrack catastrophically.
I think you're taking this the wrong way, really:
Unfortunately, most of what I've read about catastrophic backtracking puts the blame on the regex writer for writing evil regexes, instead of exploring the possibility that regex engines/compilers need to do better. Although I found several sources that look at regex engines/compilers, they are too technical.
Well, if you write a regex, your regex engine will need to follow that program you've written.
If you write a complex program, then there's nothing the engine can do about that; this regex explicitly specifies that you'll need to match "stuff" first, before looking for the @.
Now, not being too involved in writing compilers, I agree that in this case it might be possible to first identify all the "static" elements, which here are only said @, and look for them. Sadly, in the general case, this won't really help you, because there might be more than one static element, or none at all…
If you cared about speed, you'd actually just search for the @ first with a plain linear scan, and only then do your regex thing once you've found one.
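As a rough Perl sketch of that idea (my own example, not from the answer; whether the pre-check actually pays off depends on the engine and the data): reject lines with a cheap index() scan and only run the full pattern on candidates that contain an @ at all.

#!/usr/bin/env perl
use strict;
use warnings;

# The email pattern from the question; single-quoted qr so the @ and $
# are not interpolated as Perl variables.
my $email_re = qr'^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$';

while (my $line = <DATA>) {
    chomp $line;
    next if index($line, '@') < 0;   # linear scan: no '@', no point running the regex
    print "looks valid: $line\n" if $line =~ $email_re;
}

__DATA__
aaaaaaaaaaaaaaaaaaaaaa!
someone@example.com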
Regexes were never meant to be as fast as a plain linear search; rather, they were meant to be much, much more powerful.
So, not only are you hauling the wrong party before the judge (the regex engine, rather than the regex, which is a program with a complexity of its own), you're also blaming the victim for the crime (you want the speed of just looking for the @ character, but still use a regex).
By the way, don't validate email addresses with regexes. It's the wrong tool:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

When should I prefer regex over built-in string functions?

Some say I should use regex whenever possible, others say I should use it as little as possible. Is there something like a "Perl Etiquette" about that matter, or is it just TIMTOWTDI?
The level of complexity generally dictates whether I use a regex or not. Some of the questions I ask when deciding whether or not to use a regex are:
Is there no built-in string function that handles this relatively easily?
Do I need to capture substring groups?
Do I need complex features like look behind or negative sets?
Am I going to make use of character sets?
Will using a regex make my code more readable?
If I answer yes to any of these, I generally use a regex.
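As a small Perl illustration of that checklist (my own example): a plain containment test reads fine with index, while capturing substring groups is where a regex is the natural fit.

#!/usr/bin/env perl
use strict;
use warnings;

my $line = 'user=alice id=42';

# No captures, no pattern: a built-in does fine.
print "mentions alice\n" if index($line, 'alice') >= 0;

# Capturing substring groups: reach for a regex.
if ($line =~ /^user=(\w+)\s+id=(\d+)$/) {
    my ($user, $id) = ($1, $2);
    print "user=$user, id=$id\n";
}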
I think a lot of the answers you got already are good. I want to address the etiquette part because I think there is some.
Summed up: if there is a robust parser available, use it instead of regular expressions, 100% of the time. Never recommend anything else to a novice. So:
Don'ts
Don't split or match against commas for CSV; use Text::CSV/Text::CSV_XS (see the sketch after these lists).
Don't write regexes against HTML or XML, use XML::LibXML, XML::Twig, HTML::TreeBuilder, HTML::TokeParser::Simple, et cetera.
Don't write regexes for things that are trivial to split or unpack.
Dos
Do use substr, index, and rindex where appropriate but recognize they can come off "unperly" so they are best used when benchmarking shows them superior to regular expressions; regexes can be surprisingly fast in many cases.
Do use regular expressions when there is no good parser available and writing a Parse::RecDescent grammar is overkill, too much work, or will be too slow.
Do use regular expressions for throw-away code like one-liners on well-known/predictable data including the HTML/CSV previously banned from regular expression use.
Do be aware of alternatives for bigger problems like P::RecD, Parse::Yapp, and Marpa.
Do keep your own counsel. Perl is supposed to be fun. Do whatever you like; just be prepared to get bashed if you complain when not following advice and it goes sideways. :P
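To make the CSV point from the Don'ts list concrete, here is a minimal Text::CSV sketch (the field values are made up for illustration). The quoted field containing a comma is exactly the case a naive split /,/ gets wrong.

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;

# A quoted field with an embedded comma: split /,/ would produce four fields.
my $line = 'alice,"1 Main St, Springfield",42';

$csv->parse($line) or die $csv->error_diag;
print "$_\n" for $csv->fields;   # alice / 1 Main St, Springfield / 42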
I don't know of any "etiquette" about this.
Perl regexes are highly optimized (that's one of the things the language is known for, although there are engines that are faster), and in the end, if your regex is so simple that it could be replaced by a string function, I don't believe the regex will be significantly less performant. If the problem you are trying to solve is that time-sensitive, you should look into other kinds of optimization entirely.
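If you want to check that claim on your own data, a quick Benchmark sketch is usually enough (the sample string and the needle are made up; results will vary by perl version and input).

#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $s = 'a fairly ordinary line of text with a needle somewhere in it';

# Compare a built-in string function against the equivalent simple regex.
cmpthese(-1, {
    'index' => sub { index($s, 'needle') >= 0 },
    'regex' => sub { $s =~ /needle/ },
});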
Another important aspect is readability. And I think that handling all string transformations through regexes also adds to this, instead of mixing and matching different approaches.
Just my two cents.
Though I would classify this as too opinionated for SO, I'll give my point of view.
Use regex when the string is:
"Too Dynamic" (The string could have a lot of variation to it, that making use of the string library(ies) would be cumbersome.
"Contains patterns" if there is a genuine pattern to the string (and may be as simple as 1 character or a group of characters) this is where (i feel) regex excels.
"Too Complex" If you find yourself declaring a whole function block just to do what a single pattern can do, I can see it being worthwhile just to use regex. (However, see "Too Complex" below, too).
Do not use regex to be:
"Fast" Consider the overhead involved in spinning up a regex library over grabbing information directly from a string.
"Too Complex" Good code isn't always short. If you begin making a huge pattern to circumvent several lines of code, that's fine, but keep in mind it's at the risk of readability. Coming back to that piece and trying to wrap your head around it again may not be worth just doing the plain-jane method.
I'd say, if you need more than one or two string function calls to do it, use a regex. ;)
Avoid regexes for problems complex enough that the pattern becomes bloated, hurts the readability of the code, and causes performance issues. You can handle those via a series of steps, using built-in functions and other means. You may not have a cool single-line regex, but your code will be readable and maintainable.
Also avoid them for very simple problems because, again, regexes are heavyweight and there are usually built-in functions that handle the simple scenarios.
It is going to depend on what you are going to do. Of course, please don't use regexes for parsing (especially HTML, etc.).
Perl is a great language for regexes. It honestly has one of the best regex engines of any language, so that is why you see so many "use regex" answers. I am not sure what the aversion to regexes is, however.
My answer would be: can you express the work in a single pattern more easily than with a string function, or would you need multiple string function calls where a single regex would do? In either case, I would reach for the regex. Otherwise, do what feels comfortable for you.

RegEx to match a string if it does not follow another string

What RegEx pattern should be used to match CP_ but not CPLAT::CP_?
(?<!CPLAT::)CP_
Uses a negative lookbehind.
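A quick Perl demonstration (the sample identifiers are invented); Perl, PCRE, and most modern engines support this fixed-width negative lookbehind:

#!/usr/bin/env perl
use strict;
use warnings;

# (?<!CPLAT::) succeeds only when the current position is NOT preceded by "CPLAT::".
my $re = qr/(?<!CPLAT::)CP_/;

for my $code ('CP_String s;', 'CPLAT::CP_String s;') {
    printf "%-22s %s\n", $code, ($code =~ $re ? 'matches' : 'does not match');
}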
Also, does anyone have a very simple tutorial like RegEx for Dummies? Is it strange that I code in C++ but cannot grasp RegEx easily?
No, it's not strange. Regex mastery requires a certain mindset that doesn't come naturally. And being able to program, in C++ or any other language, doesn't seem to help--if anything, it's a handicap. There's a good tutorial here, but even the best tutorial will only get you to a pidgin level. If you really want to get your head around regexes, you need The Book.
Another problem is that there's no standard for regexes; every programming language, every framework, every IDE or text editor seems to have its own "flavor" of regex. Some have features that others don't, while some use different syntax to do the same things. That's where The Other Book comes in. Many examples of the kinds of tasks we commonly use regexes for, in several of the most popular flavors, and thoroughly explained.
[^:]CP_
This will find instances of CP_ that aren't preceded by a :. Note that it also consumes the preceding character as part of the match, and it won't match a CP_ at the very start of the string, since [^:] must match something.
Use the g option (depending on regex flavor) if you expect more than one CP_ match per line.
I think you want "^CP_" as your regular expression. The ^ tells the engine to look for this pattern at the start of the input.
http://www.regular-expressions.info/anchors.html

Library for converting regular expressions to NFAs?

Is there a good library for converting Regular Expressions into NFAs? I see lots of academic papers on the subject, which are helpful, but not much in the way of working code.
My question is due partially to curiosity, and partially to an actual need to speed up regular expression matching on a production system I'm working on. Although it might be fun to explore this subject for learning's sake, I'm not sure it's a "practical" solution to speeding up our pattern matching. We're a Java shop, but would happily take pointers to good code in any language.
Edit:
Interesting, I did not know that Java's regexes were already NFA-based. The title of this paper led me to believe otherwise. Incidentally, we are currently doing our regex matching in Postgres; if the simple solution is to move the matching into the Java code, that would be great.
Addressing your need to speed up your regexes:
Java's implementation of its regex engine is NFA based. As such, to tune your regexes, I would say that you would benefit from a deeper understanding of how the engine is implemented.
And as such I direct you to Mastering Regular Expressions. The book gives substantial treatment to the NFA engine and how it performs matches, including how to tune your regexes specifically for an NFA engine.
Additionally, look into Atomic Grouping for tuning your regex.
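To show what atomic grouping actually does, here is a small Perl sketch (the (?>...) syntax is the same in Perl, PCRE, and java.util.regex): once an atomic group has matched, it refuses to hand characters back to the rest of the pattern, which is what prunes needless backtracking.

#!/usr/bin/env perl
use strict;
use warnings;

my $s = 'aaab';

# Ordinary a* backtracks and gives one 'a' back so that "ab" can still match.
print "plain:  ", ($s =~ /a*ab/     ? "match" : "no match"), "\n";   # match

# Atomic (?>a*) grabs all the a's and never gives one back, so "ab" cannot
# match afterwards and the whole pattern fails.
print "atomic: ", ($s =~ /(?>a*)ab/ ? "match" : "no match"), "\n";   # no match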
Disclaimer: I'm not an expert on java+regexes. But, if I understand correctly...
If Java's regular expression matcher is similar to most others, it does use NFAs - but not the way you might expect. Instead of the forward-only implementation you may have heard about, it uses a backtracking approach, which simplifies subexpression matching and is probably required for backreference support. However, it handles some alternations poorly.
You want to see http://swtch.com/~rsc/regexp/regexp1.html (concerning edge cases which perform poorly on this backtracking architecture).
I've also written a question which I suppose comes down to the same thing:
Regex implementation that can handle machine-generated regex's: *non-backtracking*, O(n)?
But basically, it looks like for some very odd reason all common major-vendor regex implementations have terrible performance when used on certain regexes, even though this is unnecessary.
Disclaimer: I'm a googler, not an expert on regexes.
There are a bunch of faster-than-JDK regex libraries, one of which is dk.brics.automaton. According to the benchmark linked in the article, it is approximately 20x faster than the JDK implementation.
This library was written by Anders Møller and has also been mavenized.

Constructing regex

I use RegexBuddy, which takes in a regex and then explains what it means, so one can see what it is doing. Along similar lines, is it possible to have some engine which takes a natural-language description of the pattern one needs to match/replace and gives out the correct (or almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So the regex for that could be: \<dio\>
or
\bdio\b
-AD.
P.S. I think a few guys here might consider this a 'subjective', 'not-related-to-programming' question, but I just need to ask it nonetheless, for myself. Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case you're reinventing an expression language, and you'll eventually wind up back at regular expressions, only with bigger symbols. So what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to be able to ask for "Match the whole word 'dio' in some file", they would basically need significant knowledge of regular expressions already. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*\/>/i
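For what it's worth, a tiny Perl check of that pattern against a couple of made-up tags (sample markup invented for illustration); braces are used as the regex delimiter so the closing /> needs no escaping.

#!/usr/bin/env perl
use strict;
use warnings;

my $re = qr{<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>}i;

my @tags = (
    q{<img src="x.png" alt='logo' />},
    q{<img src="x.png">},            # no self-closing "/>", so it should not match
);

for my $tag (@tags) {
    printf "%-32s %s\n", $tag, ($tag =~ $re ? 'matches' : 'does not match');
}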
Yeah, I agree with you that it is subjective. But I will answer your question, because I think that you have asked the wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which produces a regex as output. If your goal is to produce regexes for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexes one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that originally compiled down to C).
Try the open source Mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in Ruby, so you can use some of the code even if you are not on a Mac. You can describe a lot of simple regular expressions in an easy English grammar. As a disclosure, I did make this tool.