Lua pattern matching vs. regular expressions - regex

I'm currently learning Lua. Regarding pattern matching in Lua, I found the following sentence in the Lua documentation on lua.org:
Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.
As I'm familiar with POSIX regular expressions, I would like to know if there are any common examples where Lua pattern matching is "better" than regular expressions -- or did I misinterpret the sentence? And if there are such examples: why is one of the two better suited than the other?

Are any common samples where lua pattern matching is "better" compared to regular expression?
It is not so much particular examples as that Lua patterns have a higher signal-to-noise ratio than POSIX regular expressions. It is the overall design that is often preferable, not particular examples.
Here are some factors that contribute to the good design:
Very lightweight syntax for matching common character types including uppercase letters (%u), decimal digits (%d), space characters (%s) and so on. Any character type can be complemented by using the corresponding capital letter, so pattern %S matches any nonspace character.
Quoting is extremely simple and regular. The quoting character is %, so it is always distinct from the string-quoting character \, which makes Lua patterns much easier to read than POSIX regular expressions (when quoting is necessary). It is always safe to quote symbols, and it is never necessary to quote letters, so you can just go by that rule of thumb instead of memorizing what symbols are special metacharacters.
Lua offers "captures" and can return multiple captures as the result of a match call. This interface is much, much better than capturing substrings through side effects or having some hidden state that has to be interrogated to find captures. Capture syntax is simple: just use parentheses.
Lua has a "shortest match" - modifier to go along with the "longest match" * operator. So for example s:find '%s(%S-)%.' finds the shortest sequence of nonspace characters that is preceded by space and followed by a dot.
The expressive power of Lua patterns is comparable to POSIX "basic" regular expressions, without the alternation operator |. What you are giving up is "extended" regular expressions with |. If you need that much expressive power I recommend going all the way to LPEG which gives you essentially the power of context-free grammars at quite reasonable cost.
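For readers coming from regex engines with lazy quantifiers, the shortest-match example above has a close analogue. The following is only a rough sketch in Python's re module (not Lua), where `*?` plays the role of Lua's `-`; the sample string is made up for illustration.

```python
import re

# Lua:    s:find '%s(%S-)%.'
# Python: \s(\S*?)\.   (%s -> \s, %S -> \S, '-' -> lazy '*?', %. -> \.)
text = "see foo.bar.baz now"  # hypothetical sample input

lazy = re.search(r"\s(\S*?)\.", text)
print(lazy.group(1))    # shortest run of non-space characters before a dot

greedy = re.search(r"\s(\S*)\.", text)
print(greedy.group(1))  # the greedy version runs up to the last dot it can reach
```

The lazy form stops at the first dot, while the greedy form keeps going as far as the rest of the pattern allows.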

http://lua-users.org/wiki/LibrariesAndBindings contains a listing of functionality including regex libraries if you wish to continue using them.
To answer the question (and note that I'm by no means a Lua guru), the language has a strong tradition of being used in embedded applications, where a full regex engine would unduly increase the size of the code on the platform; such an engine is sometimes much larger than the whole Lua library itself.
[Edit] I just found in the online version of Programming in Lua (an excellent resource for learning the language) where this is described by one of the principles of the language: see the comments below
[/Edit]
I find personally that the default pattern matching Lua provides satisfies most of my regex-y needs. Your mileage may vary.

Ok, just a slight noob note for this discussion; I particularly got confused by this page:
SciTE Regular Expressions
since that one says \s matches whitespace, as I know from other regular expression syntaxes... And so I'm trying it in a shell:
$ lua
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> c=" d"
> print(c:match(" "))
> print(c:match("."))
> print(c:match("\s"))
nil
> print("_".. c:match("[ ]") .."_")
_ _
> print("_".. c:match("[ ]*") .."_")
_ _
> print("_".. c:match("[\s]*") .."_")
__
Hmmm... seems \s doesn't get recognized here - so that page probably refers to the regular expression in Scite's Find/Replace - not to Lua's regex syntax (which scite also uses).
Then I reread lua-users wiki: Patterns Tutorial, and started to get the comment about the escape character being %, not \ in @NormanRamsey's answer. So, trying this:
> print("_".. c:match("[%s]*") .."_")
_ _
... does indeed work.
So, as I originally thought that Lua's "patterns" are different commands/engine from Lua's "regular expression", I guess a better way to say it is: Lua's "patterns" are the Lua-specific "regular expression" syntax/engine (in other words, there aren't two of them :) )
Cheers!

With the risk of getting some downvotes for speaking the truth, I'll be bluntly honest about it (like an answer should be, after all): aside from being able to return multiple captures for a single match call (possible in regular expressions, but in a much more convoluted manner) and the %bxy pattern which matches a balanced pair of delimiters (e.g. all kind of brackets and such) and qualifies as useful, powerful and "better", almost everything Lua patterns can do, regular expressions can do as well.
The shortcomings of Lua patterns compared to regular expressions when it comes to "features", on the other hand, are significant and too many to mention (e.g. lack of OR, lack of non-capturing groups, lookaround expressions, etc). Now that would be balanced if, say, Lua patterns were significantly faster than the usually slower regular expressions, but I'm not sure whether - and where - such a comparison exists, one that would exclude the general native Lua speed due to its lightweight nature, the use of tables and so on.
The real reason Lua didn't bother to add regular expressions to its toolbox can't be the length of the required code (that's nonsense, modern computers don't even blink when it comes to 4000 lines of code vs "just" 500, even if it translates a bit differently into a library), but is probably due to the fact that being a scripting language, it was assumed that the "parent" language already includes the ability to use regular expressions. It is plain obvious when looking at the overall picture that Lua as a language was designed with simplicity, speed and only the necessary features in mind. It works well in most cases, but if you need more capabilities in this area and you cannot replicate them using Lua's other features, regular expressions are more comprehensive.
The good thing is that the differences in syntax between Lua patterns and regular expressions are mostly minor, so if you know one you can adapt to the other relatively easily.

Related

What is the Perl regular expressions dialect/implementation called?

The engine for parsing strings which is called "regular expressions" in Perl is very different from what is known by the term "regular expressions" in books.
So, my question is: is there a document describing Perl's regexp implementation, how it works, and in what ways it really differs from the classic one (by classic I mean regular expressions that can really be transformed into an ordinary DFA/NFA)?
Thank you.
Perl regular expressions are of course called Perl regular expressions, or regexes for short. They may also be called patterns or rules. But what they are, or at least can be, is recursive descent parsers. They’re implemented using a recursive backtracker, although you can swap in a DFA engine if you prefer to offload DFA‐solvable tasks to it.
Here are some relevant citations about these matters, with all emboldening — and some of the text :) — mine:
You specify a pattern by creating a regular expression (or regex),
and Perl’s regular expression engine (the “Engine”, for the rest of this
chapter) then takes that expression and determines whether (and how) the
pattern matches your data. While most of your data will probably be
text strings, there’s nothing stopping you from using regexes to search
and replace any byte sequence, even what you’d normally think of as
“binary” data. To Perl, bytes are just characters that happen to have
an ordinal value less than 256.
If you’re acquainted with regular expressions from some other venue, we
should warn you that regular expressions are a bit different in Perl.
First, they aren’t entirely “regular” in the theoretical sense of the
word, which means they can do much more than the traditional regular
expressions taught in computer science classes. Second, they are used
so often in Perl that they have their own special variables, operators,
and quoting conventions which are tightly integrated into the language,
not just loosely bolted on like any other library.
      — Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant
This is the Apocalypse on Pattern Matching, generally having to do with
what we call “regular expressions”, which are only marginally related to
real regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I’m not going to try to
fight linguistic necessity here. I will, however, generally call them
“regexes” (or “regexen”, when I’m in an Anglo‐Saxon mood).
      — Perl6 Apocalypse 5: Pattern Matching, by Larry Wall
There’s a lot of new syntax there, so let’s step through it slowly, starting with:
$file = rx/ ^ <$hunk>* $ /;
This statement creates a pattern object. Or, as it’s known in Perl 6, a
“rule”. People will probably still call them “regular expressions” or
“regexes” too (and the keyword rx reflects that), but Perl patterns long
ago ceased being anything like “regular”, so we’ll try and avoid those
terms.
[Update: We’ve resurrected the term “regex” to refer to these patterns in
general. When we say “rule” now, we’re specifically referring to the kind
of regex that you would use in a grammar. See S05.]
      — Perl6 Exegesis 5: Pattern Matching, by Damian Conway
This document summarizes Apocalypse 5, which is about the new regex syntax.
We now try to call them regex rather than “regular expressions” because
they haven’t been regular expressions for a long time, and we think the
popular term “regex” is in the process of becoming a technical term with a
precise meaning of: “something you do pattern matching with, kinda like a regular
expression”. On the other hand, one of the purposes of the redesign
is to make portions of our patterns more amenable to analysis under
traditional regular expression and parser semantics, and that involves
making careful distinctions between which parts of our patterns and
grammars are to be treated as declarative, and which parts as procedural.
In any case, when referring to recursive patterns within a grammar, the
terms rule and token are generally preferred over regex.
      — Perl6 Synopsis 5: Regexes and Rules,
by Damian Conway, Allison Randal, Patrick Michaud, Larry Wall, and Moritz Lenz
The O'Reilly book 'Mastering Regular Expressions' has a very good explanation of Perl's and other engines. For me this is the reference book on the topic.
There is no formal mathematical name for the language accepted by PCREs.
The term "regular expressions with backtracking" or "regular expressions with backreferences" is about as close as you will get. Anybody familiar with the difference will know what you mean.
(There are only two common types of regexp implementations: DFA-based, and backtracking-based. The former generally accept the "regular languages" in the traditional Computer Science sense. The latter generally accept... More, and it depends on the specific implementation, but backreferences are always one of the non-DFA features.)
I asked the same question on the theoretical CS Stack Exchange (Regular expressions aren't), and the answer that got the most upvotes was “regex.”
The dialect is called PCRE (Perl-compatible Regular Expressions).
It's documented in the Perl manual.
Or in "Programming Perl" by Wall, Orwant and Christiansen

what does regular in regex/"regular expression" mean?

What does the "regular" in the phrase "regular expression" mean?
I have heard that regexes were regular at one time, but no more
The "regular" in regular expression comes from the fact that it matches a regular language.
The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.
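Backreferences are another feature of this kind. As a minimal sketch, assuming Python's re module as a stand-in for a backtracking engine, the pattern below accepts strings of the form "n a's, one b, then the same n a's", which no finite automaton can recognise because it would have to remember n.

```python
import re

# a^n b a^n is not a regular language: a finite automaton cannot remember n.
# The backreference \1 must repeat exactly what group 1 captured.
pattern = re.compile(r"(a+)b\1")

for s in ["aba", "aaabaaa", "aabaaa", "aaba"]:
    # fullmatch succeeds only when both runs of a's have the same length
    print(s, bool(pattern.fullmatch(s)))
```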
It comes from regular language. This is part of formal language theory. Check out the Chomsky hierarchy for other formal languages.
It's signifying that it's a regular language.
Regexes are still popular. Some people frown on them but they remain a quick and easy (if you know how to use them) way of matching certain types of strings. The alternative is often a good few lines of code looping through strings and extracting the bits that you need which is much nastier!
I still use them on a regular (pun fully intended) basis. To give you a use case, I used one the other day to match lines of guitar chords as opposed to lyrics. They're also commonly used for things like basic validation of email addresses and the like.
They're certainly not dead.
I think it comes from the term for the class of grammars that regular expressions describe: regular grammars (or "regular" languages). Where that term comes from is likely answered by a trip to Wikipedia.
Modern regex engines that implement all those fancy look-ahead, pattern re-match, and subexpression counting features, well, those are recognizing a class of grammar that's a superset of regular grammars. "Classical" regular expressions correspond in mechanical ways to theoretical machines called "finite automata". That's a really fun subject in and of itself.

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can someone explain them in simple terms and provide some basic examples where they could be useful, or even critical?
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
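As a quick sketch of how that pattern might be used, here it is in Python's re module; the helper name is_valid and the sample inputs are made up for illustration.

```python
import re

# 5 or 6 letters (either case), then 3 digits, then one lowercase letter.
CODE = re.compile(r"[A-Za-z]{5,6}[0-9]{3}[a-z]")

def is_valid(s: str) -> bool:
    # fullmatch requires the whole string to fit the pattern,
    # not just some substring of it
    return CODE.fullmatch(s) is not None

print(is_valid("Hello123x"))   # 5 letters, 3 digits, lowercase letter
print(is_valid("HelloW123x"))  # 6 letters is allowed too
print(is_valid("Hello123X"))   # trailing letter is uppercase
print(is_valid("Hell123x"))    # only 4 letters
```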
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use \. to mean "a dot" rather than just "any character"). In many languages the backslash itself needs escaping.
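The escaping point can be illustrated with a small sketch in Python, where raw strings (the r prefix) are one way to avoid doubling the backslash; other languages handle this differently.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only ".".
wildcard = re.fullmatch(r"a.c", "abc")      # dot acts as a wildcard
literal_miss = re.fullmatch(r"a\.c", "abc")  # a literal dot is required
literal_hit = re.fullmatch(r"a\.c", "a.c")

print(bool(wildcard), bool(literal_miss), bool(literal_hit))

# Without a raw string, the backslash itself must be doubled in the source.
same_pattern = (r"a\.c" == "a\\.c")
print(same_pattern)
```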
What regular expressions are used for:
Regular expressions is a language in itself that allows you to perform complex validation of string inputs. I.e. you pass it a string and it will return true or false if it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
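The three features above can be tried out directly. This is a minimal sketch using Python's re module; the helper name matches is made up, and it uses fullmatch so that the whole input has to fit the pattern.

```python
import re

def matches(pattern: str, s: str) -> bool:
    # True only if the entire string fits the pattern
    return re.fullmatch(pattern, s) is not None

# Alternation: one thing or another
print(matches(r"yes|no", "yes"), matches(r"yes|no", "maybe"))

# Grouping: the parentheses limit the scope of the inner alternation
shades = r"gr(a|e)y|black|white"
print(matches(shades, "gray"), matches(shades, "grey"), matches(shades, "groy"))

# Quantification: + means at least one, so the empty string is rejected
print(matches(r"(0|1)+", "0101"), matches(r"(0|1)+", ""), matches(r"(0|1)+", "012"))
```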
Why regular expressions?
Regular expressions make it easy to match strings, it can often replace several dozen lines of source code with a simple small regular expression string.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
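To make the finite state machine claim concrete, here is a hand-built machine for [hc]+at, sketched in Python. The state numbering and the names TRANSITIONS and dfa_accepts are my own; a real regex engine would build something equivalent automatically. The last column compares the result with Python's re module.

```python
import re

# Transition table: state -> {input character -> next state}.
# 0 = start, 1 = seen one or more of h/c, 2 = seen "a", 3 = seen "t" (accept).
# A missing entry means "no transition", i.e. the input is rejected.
TRANSITIONS = {
    0: {"h": 1, "c": 1},
    1: {"h": 1, "c": 1, "a": 2},
    2: {"t": 3},
    3: {},
}
ACCEPTING = {3}

def dfa_accepts(s: str) -> bool:
    state = 0
    for ch in s:
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition for this character: reject
            return False
    return state in ACCEPTING

for word in ["hat", "cat", "chat", "ccchat", "at"]:
    print(word, dfa_accepts(word), bool(re.fullmatch(r"[hc]+at", word)))
```

The machine needs only four states no matter how long the input is, which is what makes this kind of matching cheap.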
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorial and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand to anyone familiar with regexpes (though it should be commented in either case)
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
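A small sketch of that capture in Python, with the dot before txt escaped so that it matches only a literal dot; the function name extract_date and the second filename are made up for illustration.

```python
import re

# \. matches a literal dot; the parentheses capture the date part.
VERSIONED = re.compile(r"my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt")

def extract_date(filename: str):
    # Return the captured date string, or None if the name does not fit.
    m = VERSIONED.fullmatch(filename)
    return m.group(1) if m else None

print(extract_date("my_versioned_file_2009-02-26.txt"))
print(extract_date("my_other_file_2009-02-26.txt"))
```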
No, regexes are not necessary, but they can save a world of time compared to writing a hand-rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a built-in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Good luck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance. hence the \w+ (At least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -E " sh| bash"
this is useful if you want to kill them all or something, if you did a search for just sh you'd not get the bash ones and have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 alphabet and 1 digit
How can you achieve these requirements?
The best way is to use regular expression.
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Constructing regex

I use RegexBuddy, which takes in a regex and explains what it means, so that one can tell what it is doing. Along similar lines, is it possible to have an engine that takes a natural-language description of the pattern one needs to match/replace and gives out the correct (or almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : \<dio\>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
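For reference, the whole-word behaviour of \bdio\b can be checked with a short sketch in Python's re module; the sample sentence is made up.

```python
import re

text = "The radio plays dio, not audio."  # hypothetical sample text

# \b asserts a word boundary, so "dio" inside "radio" or "audio" is skipped
whole_words = re.findall(r"\bdio\b", text)
anywhere = re.findall(r"dio", text)

print(whole_words)
print(len(anywhere))
```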
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolic Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by an equals sign and either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*\/>/i
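To see how the plain-language description lines up with the regex, here is a sketch of the same pattern in Python's re module. Python has no /.../i delimiters, so the slash needs no escaping and case-insensitivity is passed as a flag; the sample tags are made up.

```python
import re

# Same pattern as above, written as a raw triple-quoted string so that both
# quote characters can appear unescaped.
IMG = re.compile(r"""<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>""", re.IGNORECASE)

samples = [
    """<img src="a.png" alt='x' />""",  # two attributes, mixed quoting
    """<IMG SRC="a.png"/>""",            # upper case, no space before />
    """<img>""",                         # no attributes, no closing slash
    """<img src=a.png />""",             # attribute value is not quoted
]
for tag in samples:
    print(tag, bool(IMG.search(tag)))
```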
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexes for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexes one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that ran, originally, on C).
Try the open source Mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in Ruby, so you can use some of the code even if you are not on a Mac. You can describe a lot of simple regular expressions in an easy English grammar. As a disclosure, I made this tool.

BNF to regular Expressions

How can I describe the language that the grammar
A → AA | ( A ) | ε
generates, using regular expressions?
Regular expressions accept strings from regular languages. Regular languages can also be accepted by an FSM.
There's a potentially unbounded number of brackets in your language that you have to match up. That means you need an unbounded number of states, which is impossible in any Finite State Machine. Therefore, your language isn't regular and can't be matched with a regex.
Regular expressions cannot match nesting brackets.
I'm not sure how you could notate that language, but it isn't regular, as can be shown using the pumping lemma for regular languages (and thus it can't be denoted by a regex). An intuitive explanation is that accepting words from that language would require the DFA to 'remember' the number of opening parentheses it has read every time it begins to read closing parentheses, and that isn't possible because a DFA has no 'memory' beyond its finite set of states. A push-down automaton, on the other hand...
Could that language be noted as {(^n )^n}*, for any n?
You can't - regular expressions can only recognise a small subset of possible languages. In particular, informally, any language that requires an unbounded amount of memory potentially to recognise is not RE recognisable.
Here, you'd need an unbounded amount of memory to remember how many opening parentheses you've seen in order to make sure the number of closing parentheses are the same.
You'll need some mechanism that is capable of parsing Context-Free Grammars to be able to recognise languages described by BNF in general. Modern parsers are very good at this!
As others have said you cannot do this with a single regular expression, however you can tokenize it with two "\(" and "\)". Seeing that your language can only have brackets in it though I'm not sure this is going to be very useful.
Note: You will also need a parser to ensure the brackets are paired up correctly. So “(()()” will tokenize but will not parse.
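The split between tokenizing and parsing can be sketched as follows in Python. The function names tokenize and is_balanced are made up, and the "parser" here is just a depth counter, which is enough for this particular grammar; the unbounded counter is exactly the memory a finite state machine lacks.

```python
import re

# The regex only recognises the two token kinds; it knows nothing about pairing.
TOKEN = re.compile(r"\(|\)")

def tokenize(s: str):
    tokens = TOKEN.findall(s)
    # Every token is one character long, so a length mismatch means
    # the input contained something other than brackets.
    if len(tokens) != len(s):
        raise ValueError("input contains characters other than brackets")
    return tokens

def is_balanced(tokens) -> bool:
    depth = 0  # number of currently open brackets
    for tok in tokens:
        depth += 1 if tok == "(" else -1
        if depth < 0:  # a closing bracket with nothing open
            return False
    return depth == 0  # everything opened must also be closed

for s in ["", "(())()", "(()()", ")("]:
    print(repr(s), is_balanced(tokenize(s)))
```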