The engine for parsing strings which is called "regular expressions" in Perl is very different from what is known by the term "regular expressions" in books.
So, my question is: is there some document describing the Perl's regexp implementation and how and in what ways does it really differ from the classic one (by classic I mean a regular expressions that can really be transformed to ordinary DFA/NFA) and how it works?
Thank you.
Perl regular expressions are of course called Perl regular expressions, or regexes for short. They may also be called patterns or rules. But what they are, or at least can be, is recursive descent parsers. They’re implemented using a recursive backtracker, although you can swap in a DFA engine if you prefer to offload DFA‐solvable tasks to it.
Here are some relevant citations about these matters, with all emboldening — and some of the text :) — mine:
You specify a pattern by creating a regular expression (or regex),
and Perl’s regular expression engine (the “Engine”, for the rest of this
chapter) then takes that expression and determines whether (and how) the
pattern matches your data. While most of your data will probably be
text strings, there’s nothing stopping you from using regexes to search
and replace any byte sequence, even what you’d normally think of as
“binary” data. To Perl, bytes are just characters that happen to have
an ordinal value less than 256.
If you’re acquainted with regular expressions from some other venue, we
should warn you that regular expressions are a bit different in Perl.
First, they aren’t entirely “regular” in the theoretical sense of the
word, which means they can do much more than the traditional regular
expressions taught in computer science classes. Second, they are used
so often in Perl that they have their own special variables, operators,
and quoting conventions which are tightly integrated into the language,
not just loosely bolted on like any other library.
— Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant
This is the Apocalypse on Pattern Matching, generally having to do with
what we call “regular expressions”, which are only marginally related to
real regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I’m not going to try to
fight linguistic necessity here. I will, however, generally call them
“regexes” (or “regexen”, when I’m in an Anglo‐Saxon mood).
— Perl6 Apocalypse 5: Pattern Matching, by Larry Wall
There’s a lot of new syntax there, so let’s step through it slowly, starting with:
$file = rx/ ^ <$hunk>* $ /;
This statement creates a pattern object. Or, as it’s known in Perl 6, a
“rule”. People will probably still call them “regular expressions” or
“regexes” too (and the keyword rx reflects that), but Perl patterns long
ago ceased being anything like “regular”, so we’ll try and avoid those
terms.
[Update: We’ve resurrected the term “regex” to refer to these patterns in
general. When we say “rule” now, we’re specifically referring to the kind
of regex that you would use in a grammar. See S05.]
— Perl6 Exegesis 5: Pattern Matching, by Damian Conway
This document summarizes Apocalypse 5, which is about the new regex syntax.
We now try to call them regex rather than “regular expressions” because
they haven’t been regular expressions for a long time, and we think the
popular term “regex” is in the process of becoming a technical term with a
precise meaning of: “something you do pattern matching with, kinda like a regular
expression”. On the other hand, one of the purposes of the redesign
is to make portions of our patterns more amenable to analysis under
traditional regular expression and parser semantics, and that involves
making careful distinctions between which parts of our patterns and
grammars are to be treated as declarative, and which parts as procedural.
In any case, when referring to recursive patterns within a grammar, the
terms rule and token are generally preferred over regex.
— Perl6 Synopsis 5: Regexes and Rules,
by Damian Conway, Allison Randal, Patrick Michaud, Larry Wall, and Moritz Lenz
The O'Reilly book 'Mastering Regular Expressions' has a very good explanation of Perl's and other engines. For me this is the reference book on the topic.
There is no formal mathematical name for the language accepted by PCREs.
The term "regular expressions with backtracking" or "regular expressions with backreferences" is about as close as you will get. Anybody familiar with the difference will know what you mean.
(There are only two common types of regexp implementations: DFA-based, and backtracking-based. The former generally accept the "regular languages" in the traditional Computer Science sense. The latter generally accept... More, and it depends on the specific implementation, but backreferences are always one the non-DFA features.)
I asked the same question on the theoretical CS Stack Exchange (Regular expressions aren't), and the answer that got the most upvotes was “regex.”
The dialect is called PCRE (Perl-compatible Regular Expressions).
It's documented in the Perl manual.
Or in "Programming Perl" by Wall, Orwant and Christiansen
Related
Has anyone ever tried to describe regular expression that would match regular expressions?
This topic is almost impossible to find on the web due to the repeating keywords.
It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.
But still I am wondering how would a regular expression that matches all regular expressions look like. It should be possible to write one.
I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.
Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Regular expressions permit nested capture groups, which seem to fall into this category:
Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.
In any case, this may be a better question for the Computer Science Stack Exchange site.
Edit:
¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!
(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)
So it may be technically possible to represent regular expressions as a regular expression.
Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^#]+#[^#]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".
But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.
Having seen this, I wouldn't want to try to write a regex to validate regular expressions.
I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.
^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$
It can be extended, a looooooot...
I'd be interested to know what kind of algorithms are used for matching it, and how they are optimised, because I imagine that somes regexes could produce a vast number of possible matches that could cause serious problems on a poorly witten regex parser.
Also, I recently discovered the concept of a ReDoS, why do regexes such as (a|aa)+ or (a|a?)+ cause problems?
EDIT: I have used them most in C# and Python, so that's what was in my mind when I was considering the question. I assume Python's is written in C like the rest of the interpreter, but I have no idea about C#
I find http://www.regular-expressions.info has really useful info about regular expressions.
The author specifically talks about catastrophic uses of regular expression.
Regex Buddy has this debug page which "offers you a unique view inside a regular expression engine".
http://www.regexbuddy.com/debug.html
There are two kinds of regular expression engine: NFA and DFA. I am quite rusty so I don't dare go into specifics by memory. Here is a page that goes through the algorithms, though. Some parsers will perform better with poorly-written expressions. A good book on the subject (that is sitting on my shelf) is Mastering Regular Expression.
What does the "regular" in the phrase "regular expression" mean?
I have heard that regexes were regular at one time, but no more
The regular in regular expression comes from that it matches a regular language.
The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.
It comes from regular language. This is part of formal language theory. Check out the Chomsky hierarchy for other formal languages.
It's signifying that it's a regular language.
Regexes are still popular. Some people frown on them but they remain a quick and easy (if you know how to use them) way of matching certain types of strings. The alternative is often a good few lines of code looping through strings and extracting the bits that you need which is much nastier!
I still use them on a regular (pun fully intended) basis, to give you a use case I used one the other day to match lines of guitar chords as oppose to lyrics. They're also commonly used for things like basic validation of email addresses and the like.
They're certainly not dead.
I think it comes from the term for the class of grammars that regular expressions describe: regular grammars (or "regular" languages). Where that term comes from is likely answered by a trip to Wikipedia.
Modern regex engines that implement all those fancy look-ahead, pattern re-match, and subexpression counting features, well, those are recognizing a class of grammar that's a superset of regular grammars. "Classical" regular expressions correspond in mechanical ways to theoretical machines called "finite automata". That's a really fun subject in and of itself.
As the name suggests we may think that regular expressions can match regular languages only. But regular expressions we use in practice contain stuff that I am not sure it's possible to implement with their theoretical counterparts. How for example would you simulate a back-reference?
So the question arises: what is the theoretical power of the regular expressions we use in practice? Can you think of a way to match {(a^n)(b^n)|n>=0}? What about {(a^n)(b^n)(c^n)|n>=0}?
The answer to your question is, "regular expression" languages that allow back-references are neither regular nor context-free. (In other words, as you pointed out, you cannot simulate back-reference with a regular language, nor with a CFL.) In fact, Wikipedia says many of the "regular expression" languages we use in practice are NP-Complete:
Pattern matching with an unbounded
number of back references, as
supported by numerous modern tools, is
NP-complete (see,[11] Theorem 6.2).
As others have suggested, the regular expression languages commonly supported in computer languages and libraries are a different animal from regular expressions in formal language theory. Larry Wall wrote in regard to Perl "regexes,"
'Regular expressions' [...] are only
marginally related to real regular
expressions. Nevertheless, the term
has grown with the capabilities of our
pattern matching engines, so I'm not
going to try to fight linguistic
necessity here. I will, however,
generally call them "regexes"
You asked,
Can you think of a way to match
{(a^n)(b^n)|n>=0}? What about
{(a^n)(b^n)(c^n)|n>=0}?
I'm not sure here if you're trying to test whether theoretical regular expression languages can match the "language of squares", or whether you're looking for an implementation in a (practical) regex language. Here's the proof why the former is not possible; and here's a long explanation and implementation of the latter for java regexes.
The basic difficulty with regular expressions that you are hinting at is the fact that regular expressions don't have a "memory" to them. In the purest form, no real regular expression should be able to recognize either of these languages. Any regular expression that could parse these sorts of languages would be, by definition, not regular. I think what you mean by "regular expressions we use is practice" is extended regular expressions, which are not technically regular expressions.
The problem with your question is that you are asking to apply a specifically contrived theoretical scenario to a practical situation, which almost always ends in disaster.
So my answer is sort of a non-answer, in that I am saying you would have to rephrase the question to ask about extended regular expressions for it to have an answer.
A couple of resources that might help in this matter:
Helpful wikipedia article
Similar StackOverflow question
Good book with a chapter on this topic
I'm also making my answer a community wiki for anyone else who wants to contribute to this line of thought.
I'm currently learning lua. regarding pattern-matching in lua I found the following sentence in the lua documentation on lua.org:
Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.
As I'm familiar with posix regular expressions I would like to know if there are any common samples where lua pattern matching is "better" compared to regular expression -- or did I misinterpret the sentence? and if there are any common examples: why is any of pattern-matching vs. regular expressions better suited?
Are any common samples where lua pattern matching is "better" compared to regular expression?
It is not so much particular examples as that Lua patterns have a higher signal-to-noise ratio than POSIX regular expressions. It is the overall design that is often preferable, not particular examples.
Here are some factors that contribute to the good design:
Very lightweight syntax for matching common character types including uppercase letters (%u), decimal digits (%d), space characters (%s) and so on. Any character type can be complemented by using the corresponding capital letter, so pattern %S matches any nonspace character.
Quoting is extremely simple and regular. The quoting character is %, so it is always distinct from the string-quoting character \, which makes Lua patterns much easier to read than POSIX regular expressions (when quoting is necessary). It is always safe to quote symbols, and it is never necessary to quote letters, so you can just go by that rule of thumb instead of memorizing what symbols are special metacharacters.
Lua offers "captures" and can return multiple captures as the result of a match call. This interface is much, much better than capturing substrings through side effects or having some hidden state that has to be interrogated to find captures. Capture syntax is simple: just use parentheses.
Lua has a "shortest match" - modifier to go along with the "longest match" * operator. So for example s:find '%s(%S-)%.' finds the shortest sequence of nonspace characters that is preceded by space and followed by a dot.
The expressive power of Lua patterns is comparable to POSIX "basic" regular expressions, without the alternation operator |. What you are giving up is "extended" regular expressions with |. If you need that much expressive power I recommend going all the way to LPEG which gives you essentially the power of context-free grammars at quite reasonable cost.
http://lua-users.org/wiki/LibrariesAndBindings contains a listing of functionality including regex libraries if you wish to continue using them.
To answer the question (and note that I'm by no means a Lua guru), the language has a strong tradition of being used in embedded applications, where a full regex engine would unduly increase the size of the code being used on the platform, sometimes much larger than just all of the Lua library itself.
[Edit] I just found in the online version of Programming in Lua (an excellent resource for learning the language) where this is described by one of the principles of the language: see the comments below
[/Edit]
I find personally that the default pattern matching Lua provides satisfies most of my regex-y needs. Your mileage may vary.
Ok, just a slight noob note for this discussion; I particularly got confused by this page:
SciTE Regular Expressions
since that one says \s matches whitespace, as I know from other regular expression syntaxes... And so I'm trying it in a shell:
$ lua
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> c=" d"
> print(c:match(" "))
> print(c:match("."))
> print(c:match("\s"))
nil
> print("_".. c:match("[ ]") .."_")
_ _
> print("_".. c:match("[ ]*") .."_")
_ _
> print("_".. c:match("[\s]*") .."_")
__
Hmmm... seems \s doesn't get recognized here - so that page probably refers to the regular expression in Scite's Find/Replace - not to Lua's regex syntax (which scite also uses).
Then I reread lua-users wiki: Patterns Tutorial, and start getting the comment about the escape character being %, not \ in #NormanRamsey's answer. So, trying this:
> print("_".. c:match("[%s]*") .."_")
_ _
... does indeed work.
So, as I originally thought that Lua's "patterns" are different commands/engine from Lua's "regular expression", I guess a better way to say it is: Lua's "patterns" are the Lua-specific "regular expression" syntax/engine (in other words, there aren't two of them :) )
Cheers!
With the risk of getting some downvotes for speaking the truth, I'll be bluntly honest about it (like an answer should be, after all): aside from being able to return multiple captures for a single match call (possible in regular expressions, but in a much more convoluted manner) and the %bxy pattern which matches a balanced pair of delimiters (e.g. all kind of brackets and such) and qualifies as useful, powerful and "better", almost everything Lua patterns can do, regular expressions can do as well.
The shortcomings of Lua patterns compared to regular expressions when it comes to "features" on the other hand are significant and too many too mention (e.g. lack of OR, lack of non-capturing groups, lookaround expressions, etc). Now that would be balanced if, say, Lua patterns would be significantly faster that the usually slower regular expressions, but I'm not sure whether - and where - such a comparison exists, one that would exclude the general native Lua speed due to its lightweight nature, the use of tables and so on.
The real reason Lua didn't bother to add regular expressions to its toolbox can't be the length of the required code (that's nonsense, modern computers don't even blink when it comes to 4000 lines of code vs "just" 500, even if it translates a bit differently into a library), but is probably due to the fact that being a scripting language, it was assumed that the "parent" language already includes the ability to use regular expressions. It is plain obvious when looking at the overall picture that Lua as a language was designed with simplicity, speed and only the necessary features in mind. It works well in most cases, but if you need more capabilities in this area and you cannot replicate them using Lua's other features, regular expressions are more comprehensive.
The good thing is that the differences in syntax between the Lua pattern and regular expressions are mostly minor, so if you know one you can relatively easy adapt to the other.