BNF to regular Expressions - regex

How can I describe the language
A → AA | ( A ) | ε
generates using regular expressions?

Regular expressions accept strings from regular languages. Regular languages can also be accepted by an FSM.
There's a potentially unbounded number of brackets in your language that you have to match up. That means you would need infinitely many states, which is impossible in any finite state machine. Therefore, your language isn't regular and can't be matched with a regex.

Regular expressions cannot match nesting brackets.

I'm not sure how you could notate that language, but it isn't regular, as can be shown using the pumping lemma for regular languages (and thus it can't be described by a regex). An intuitive explanation is that accepting words from that language would require the FSA to 'remember' the number of opening parentheses it has just read every time it begins to read closing parentheses, and that isn't possible because finite automata have no such 'memory'. A push-down automaton, on the other hand...
Could that language be noted as {(^n )^n}*, for any n?

You can't - regular expressions can only recognise a small subset of possible languages. In particular, informally, any language that potentially requires an unbounded amount of memory to recognise is not recognisable by a regular expression.
Here, you'd need an unbounded amount of memory to remember how many opening parentheses you've seen in order to make sure the number of closing parentheses is the same.
You'll need some mechanism that is capable of parsing Context-Free Grammars to be able to recognise languages described by BNF in general. Modern parsers are very good at this!

As others have said, you cannot do this with a single regular expression; however, you can tokenize it with two, "\(" and "\)". Seeing that your language can only have brackets in it, though, I'm not sure this is going to be very useful.
Note: You will also need a parser to ensure the brackets are paired up correctly. So “(()()” will tokenize but will not parse.
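A minimal sketch of that tokenize-then-parse split, in Python. The function name `is_balanced` is made up for illustration; the regex only recognises the two bracket tokens, and a plain counter does the pairing that the regex cannot do.

```python
import re

# The regex only recognises single tokens; it knows nothing about nesting.
TOKEN = re.compile(r"\(|\)")

def is_balanced(text: str) -> bool:
    """Return True if text is a word of A -> AA | ( A ) | empty (hypothetical helper)."""
    depth = 0
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            return False              # a character that is not a bracket
        depth += 1 if match.group() == "(" else -1
        if depth < 0:
            return False              # a ")" with no "(" left to pair with
        pos = match.end()
    return depth == 0                 # every "(" must have been closed

print(is_balanced("(()())"))  # True
print(is_balanced("(()()"))   # False: tokenizes fine, but does not parse
```

The counter is the unbounded "memory" the answers above talk about: it can grow as large as the input requires, which a fixed set of states cannot.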

Related

Regular expression that matches regular expressions

Has anyone ever tried to describe regular expression that would match regular expressions?
This topic is almost impossible to find on the web due to the repeating keywords.
It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.
But still I am wondering what a regular expression that matches all regular expressions would look like. It should be possible to write one.
I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.
Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Regular expressions permit nested capture groups, which seem to fall into this category:
Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.
In any case, this may be a better question for the Computer Science Stack Exchange site.
Edit:
¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!
(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)
So it may be technically possible to represent regular expressions as a regular expression.
Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^@]+@[^@]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".
But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.
Having seen this, I wouldn't want to try to write a regex to validate regular expressions.
I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.
^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$
It can be extended, a looooooot...

what does regular in regex/"regular expression" mean?

What does the "regular" in the phrase "regular expression" mean?
I have heard that regexes were regular at one time, but no more.
The "regular" in "regular expression" comes from the fact that it matches a regular language.
The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.
It comes from regular language. This is part of formal language theory. Check out the Chomsky hierarchy for other formal languages.
It's signifying that it's a regular language.
Regexes are still popular. Some people frown on them but they remain a quick and easy (if you know how to use them) way of matching certain types of strings. The alternative is often a good few lines of code looping through strings and extracting the bits that you need which is much nastier!
I still use them on a regular (pun fully intended) basis. To give you a use case, I used one the other day to match lines of guitar chords as opposed to lyrics. They're also commonly used for things like basic validation of email addresses and the like.
They're certainly not dead.
I think it comes from the term for the class of grammars that regular expressions describe: regular grammars (or "regular" languages). Where that term comes from is likely answered by a trip to Wikipedia.
Modern regex engines that implement all those fancy look-ahead, pattern re-match, and subexpression counting features, well, those are recognizing a class of grammar that's a superset of regular grammars. "Classical" regular expressions correspond in mechanical ways to theoretical machines called "finite automata". That's a really fun subject in and of itself.
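As a small illustration of that extra power, here is a sketch using a backreference in Python's `re` module (one example of the "pattern re-match" feature mentioned above). The pattern name `MIRRORED` and the chosen language are just for illustration.

```python
import re

# (a+) captures a run of a's; \1 must then repeat exactly that run.
# The language { a^n b a^n : n >= 1 } is not regular, so no classical
# regular expression (and no finite automaton) can describe it.
MIRRORED = re.compile(r"(a+)b\1")

print(bool(MIRRORED.fullmatch("aaabaaa")))  # True
print(bool(MIRRORED.fullmatch("aaabaa")))   # False
```

The engine has to remember how long the first run was, which is exactly the kind of memory a finite automaton lacks.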

Lua pattern matching vs. regular expressions

I'm currently learning Lua. Regarding pattern matching in Lua, I found the following sentence in the Lua documentation on lua.org:
Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.
As I'm familiar with POSIX regular expressions, I would like to know if there are any common samples where Lua pattern matching is "better" compared to regular expressions, or did I misinterpret the sentence? And if there are any common examples: why is pattern matching or regular expressions better suited?
Are there any common samples where Lua pattern matching is "better" compared to regular expressions?
It is not so much particular examples as that Lua patterns have a higher signal-to-noise ratio than POSIX regular expressions. It is the overall design that is often preferable, not particular examples.
Here are some factors that contribute to the good design:
Very lightweight syntax for matching common character types including uppercase letters (%u), decimal digits (%d), space characters (%s) and so on. Any character type can be complemented by using the corresponding capital letter, so pattern %S matches any nonspace character.
Quoting is extremely simple and regular. The quoting character is %, so it is always distinct from the string-quoting character \, which makes Lua patterns much easier to read than POSIX regular expressions (when quoting is necessary). It is always safe to quote symbols, and it is never necessary to quote letters, so you can just go by that rule of thumb instead of memorizing what symbols are special metacharacters.
Lua offers "captures" and can return multiple captures as the result of a match call. This interface is much, much better than capturing substrings through side effects or having some hidden state that has to be interrogated to find captures. Capture syntax is simple: just use parentheses.
Lua has a "shortest match" modifier, -, to go along with the "longest match" * operator. So for example s:find '%s(%S-)%.' finds the shortest sequence of nonspace characters that is preceded by space and followed by a dot.
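For readers coming from other regex flavours, a rough Python analogue of that last pattern may help. This is only a sketch: the sample string is invented, and Python's `\s` and Lua's `%s` are not guaranteed to agree on every character.

```python
import re

s = "see foo.bar for details"  # made-up sample input

# \s   ~ Lua %s   (a whitespace character)
# \S*? ~ Lua %S-  (shortest possible run of non-space characters)
# \.   ~ Lua %.   (a literal dot)
m = re.search(r"\s(\S*?)\.", s)
print(m.group(1))  # foo
```

The backslashes here are what the Lua version avoids by using % as its quoting character.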
The expressive power of Lua patterns is comparable to POSIX "basic" regular expressions, without the alternation operator |. What you are giving up is "extended" regular expressions with |. If you need that much expressive power I recommend going all the way to LPEG which gives you essentially the power of context-free grammars at quite reasonable cost.
http://lua-users.org/wiki/LibrariesAndBindings contains a listing of functionality including regex libraries if you wish to continue using them.
To answer the question (and note that I'm by no means a Lua guru), the language has a strong tradition of being used in embedded applications, where a full regex engine would unduly increase the size of the code being used on the platform, sometimes much larger than just all of the Lua library itself.
[Edit] I just found in the online version of Programming in Lua (an excellent resource for learning the language) where this is described by one of the principles of the language: see the comments below
[/Edit]
I find personally that the default pattern matching Lua provides satisfies most of my regex-y needs. Your mileage may vary.
Ok, just a slight noob note for this discussion; I particularly got confused by this page:
SciTE Regular Expressions
since that one says \s matches whitespace, as I know from other regular expression syntaxes... And so I'm trying it in a shell:
$ lua
Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
> c=" d"
> print(c:match(" "))
> print(c:match("."))
> print(c:match("\s"))
nil
> print("_".. c:match("[ ]") .."_")
_ _
> print("_".. c:match("[ ]*") .."_")
_ _
> print("_".. c:match("[\s]*") .."_")
__
Hmmm... seems \s doesn't get recognized here - so that page probably refers to the regular expressions in SciTE's Find/Replace - not to Lua's regex syntax (which SciTE also uses).
Then I reread lua-users wiki: Patterns Tutorial, and started getting the comment about the escape character being %, not \, in @NormanRamsey's answer. So, trying this:
> print("_".. c:match("[%s]*") .."_")
_ _
... does indeed work.
So, as I originally thought that Lua's "patterns" are different commands/engine from Lua's "regular expression", I guess a better way to say it is: Lua's "patterns" are the Lua-specific "regular expression" syntax/engine (in other words, there aren't two of them :) )
Cheers!
With the risk of getting some downvotes for speaking the truth, I'll be bluntly honest about it (like an answer should be, after all): aside from being able to return multiple captures for a single match call (possible in regular expressions, but in a much more convoluted manner) and the %bxy pattern which matches a balanced pair of delimiters (e.g. all kind of brackets and such) and qualifies as useful, powerful and "better", almost everything Lua patterns can do, regular expressions can do as well.
The shortcomings of Lua patterns compared to regular expressions when it comes to "features", on the other hand, are significant and too many to mention (e.g. lack of OR, lack of non-capturing groups, lookaround expressions, etc). Now that would be balanced if, say, Lua patterns were significantly faster than the usually slower regular expressions, but I'm not sure whether - and where - such a comparison exists, one that would exclude the general native Lua speed due to its lightweight nature, the use of tables and so on.
The real reason Lua didn't bother to add regular expressions to its toolbox can't be the length of the required code (that's nonsense, modern computers don't even blink when it comes to 4000 lines of code vs "just" 500, even if it translates a bit differently into a library), but is probably due to the fact that being a scripting language, it was assumed that the "parent" language already includes the ability to use regular expressions. It is plain obvious when looking at the overall picture that Lua as a language was designed with simplicity, speed and only the necessary features in mind. It works well in most cases, but if you need more capabilities in this area and you cannot replicate them using Lua's other features, regular expressions are more comprehensive.
The good thing is that the differences in syntax between Lua patterns and regular expressions are mostly minor, so if you know one you can adapt to the other relatively easily.

When is an issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
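To make the "DFA with a stack" idea concrete, here is a minimal hand-written sketch in Python rather than a generated parser. The name `brackets_match` and the extra bracket kinds are assumptions added for illustration; characters that are not brackets are simply ignored.

```python
# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_match(text: str) -> bool:
    """Check pairing with an explicit stack (hypothetical helper)."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)                  # remember the open bracket
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False                  # wrong or missing opener
    return not stack                          # nothing may be left open

print(brackets_match("(())()"))  # True
print(brackets_match("([)]"))    # False
```

The stack is what separates this from a finite automaton: its depth is bounded only by the input, not fixed in advance.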
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
Regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
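The ".museum" limitation is easy to reproduce. Below is a small Python sketch, assuming case-insensitive matching and an @ separator; the helper name `looks_like_email` and both addresses are invented for the demonstration.

```python
import re

# The short pattern from above, compiled case-insensitively.
SIMPLE_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE
)

def looks_like_email(address: str) -> bool:
    """True if the whole string matches the short pattern (hypothetical helper)."""
    return SIMPLE_EMAIL.fullmatch(address) is not None

print(looks_like_email("alice@example.com"))       # True
print(looks_like_email("curator@example.museum"))  # False: TLD longer than 4 letters
```

The {2,4} quantifier is the culprit: it is a deliberate simplification that trades completeness for a pattern someone can still read.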
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your regex solution starting to become extremely complex, you need to either break it up logically into smaller regular expressions that match portions of your text, or start looking at other methods to solve your problem. Similarly, there are simply problems that regular expressions, due to their nature, can't solve (as one poster said, languages that are not regular).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign to stop using regexps is this: if you have many grouping parentheses '()' and many alternatives '|', then you are trying to do (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc., and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about (e.g. is there an input on which this parser will run in exponential time?).
This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).
Besides unwieldy expressions, there are fundamental limits on the words that a regexp can handle.
For instance, you cannot write a regexp for words consisting of n characters a followed by n characters b, where n can be any number; more strictly, for the language {a^n b^n : n ≥ 0}.
In various programming languages regexps are an extension of regular languages, but matching time can become extremely large and such code is non-portable.
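The "n a's then n b's" language is a handy case where a regex plus one line of ordinary code does what the regex alone cannot. A minimal Python sketch follows, with the helper name `is_anbn` invented for illustration:

```python
import re

# The regex checks only the *shape*: some a's followed by some b's.
SHAPE = re.compile(r"(a*)(b*)")

def is_anbn(word: str) -> bool:
    """True if word is n a's followed by n b's (hypothetical helper)."""
    m = SHAPE.fullmatch(word)
    # The counting step is done in ordinary code, outside the regex.
    return m is not None and len(m.group(1)) == len(m.group(2))

print(is_anbn("aaabbb"))  # True
print(is_anbn("aabbb"))   # False
```

Splitting the work this way keeps the pattern portable across engines, since it relies on no extensions.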
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions): in most cases the maintainability and readability limit is reached a lot earlier than the technical limit.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid, but I often lament not being able to do database-style queries using regular expressions. Now especially more than before, because I am entering those types of search strings all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression".
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
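In engines that support lookahead, an AND of two conditions can be written as two lookaheads anchored at the start. The sketch below uses Python's `re` module, not Emacs; the list of names is just a sample chosen for the demonstration, and matching is made case-insensitive as an assumption.

```python
import re

# Each (?=...) must succeed from the start of the string, so both words
# have to occur somewhere, in either order.
BOTH = re.compile(r"^(?=.*buffer)(?=.*window)", re.IGNORECASE)

names = [
    "switch-to-buffer-other-window",
    "window-buffer",
    "kill-buffer",
    "delete-window",
]
hits = [name for name in names if BOTH.search(name)]
print(hits)  # ['switch-to-buffer-other-window', 'window-buffer']
```

Without lookahead, the fallback is the one described above: run both orderings, or join them with alternation.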

Regexp that matches valid regexps

Is there a regular expression that matches valid regular expressions?
(I know there are several flavors of regexps. One would do.)
Is there a regular expression that matches valid regular expressions?
By definition, it's quite simple: No.
The language of all regexes is not a regular language (just look at nested parentheses) and therefore there can't be a regular expression to parse it.
If you merely want to check whether a regular expression is valid or not, simply try to compile it with whichever programming language or regular expression library you're working with.
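In Python, for instance, that check could look like the sketch below. The helper name `is_valid_regex` is made up, and the result only says whether the pattern is valid in Python's own `re` flavour, not in any other.

```python
import re

def is_valid_regex(pattern: str) -> bool:
    """Let the regex compiler itself decide (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False      # the compiler rejected the pattern
    return True

print(is_valid_regex(r"(a|b)*c"))  # True
print(is_valid_regex(r"(a|b"))     # False: unterminated group
```

This sidesteps the nesting problem entirely, because the compiler is a real parser rather than a regex.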
Parsing regular expressions is far from trivial. As the author of RegexBuddy, I have been around that block a few times. If you really want to do it, use a regex to tokenize the input, and leave the parsing logic to procedural code. That is, your regex would match one regex token (^, $, \w, (, ), etc.) at a time, and your procedural code would check if they're in the right order.
Unfortunately, most invalid regular expressions are invalid due to parentheses nesting errors. This is exactly the type of strings that regular expressions can't match. (Okay, some fancy regular expression systems have recursion extensions, but that's rare)
As already said, you cannot describe regular expressions with a regular expression due to their recursive nature. You'll need a context free grammar for that.
But what would be the point of having such a regular expression, anyway? If you just want to check whether a regular expression is correct, you can simply try to use it (Pattern.compile(regexp) in Java), and if it throws an exception, it is not valid.
You probably need a parser, not a regex. Regexes are powerful tools, but are not parsing tools. They are not well suited to nested grammars, for example.
From Douglas Crockford's The JavaScript Programming Language video 4 (of 4):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
http://video.yahoo.com/watch/111596/1710658 at approximately -17.20.
Depending on your goal I would say definitely maybe.
If you want to filter regexps out from somewhere, it might prove difficult, as regular expressions come in all sizes and shapes and they don't all start and end with slashes.
If you just need to know whether or not a regexp is valid, there is another way. Depending on the language you're using, you could try/catch.
If you can be more specific I could try and give a better answer; the question is intriguing.