RegExp to validate a formula (math expression with matched parentheses)? - regex

I have used regular expressions quite a few times but am still far from being an expert. This time I want to validate a formula (or math expression) with a regex. The difficult part is validating that the opening and closing parentheses within the formula match.
I believe there must be some sample on the web, but I could not find one. Can somebody post a link to such an example, or help me by some other means?

Languages with matched nested parentheses are not regular languages and therefore cannot be recognized by regular expressions. Some regular expression implementations (for example in the .NET framework) have extensions to deal with this, but they are really no fun to work with. So I suggest using an available parser or implementing a simple parser yourself (for the sake of fun).
For the extension in the .NET implementation see the MSDN on balancing groups.

If your mathematical expression involves matched nested parentheses, then it's not a regular grammar but a context-free one and, as such, cannot be parsed using regex.
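For the parentheses part alone, a "simple parser" can be as small as a counter. Here is a minimal sketch in Python (the function name `parens_balanced` is made up for illustration). It only checks that the parentheses nest properly; it does not validate operators or operands of the formula.

```python
def parens_balanced(expr: str) -> bool:
    """Return True if every '(' in expr is closed by a later ')'."""
    depth = 0  # number of currently open parentheses
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to close.
                return False
    # Balanced only if nothing is left open at the end.
    return depth == 0
```

The counter is exactly the unbounded memory that a finite state machine, and therefore a strictly regular expression, does not have.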

Related

Trying to understand parsing and scanning (difference for reg. languages and cf languages)

First off, I don't study Computer Science; I'm just interested in the subject.
A parser basically does this, right?
reading the input
create tokens
actually parse tokens and create an AST
So I thought that in order to decide whether a word is in a regular language, you use a FSM and for CF languages you need a parser because of the recursive structures that may exist. Hence, scanner generators exist for regular languages and parser generators for CF languages.
But now I read that you can build a recursive descent parser for regular expressions:
http://matt.might.net/articles/parsing-regex-with-recursive-descent/
So how does this all go together?
Why do I need to parse regular languages? I thought a finite state machine was enough?
If, e.g., I want to recognize block comments in a Java program (i.e. /* .. */), I only need to write an FSM, so basically a switch-case statement. I don't need a parser for this...
Thanks for help and clarification!
There is a difference between what a regular expression can match and what you need to parse a regular expression. Regular expressions can contain nested groups, for instance, so you can't parse those with a regular expression. You have to "count" nested pairs of parentheses, for example, which is outside the capabilities of a regular language.
See also: Is there a regular language to represent regular expressions.
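To make the contrast concrete: the block comment case from the question really does need only a fixed set of states and no counting. A rough sketch in Python (the function name and state names are made up for illustration; it deliberately ignores string literals and // line comments, which a real Java scanner would have to handle):

```python
def strip_block_comments(src: str) -> str:
    """Remove /* ... */ comments using a four-state finite state machine."""
    out = []
    state = "code"  # one of: code, slash, comment, star
    for ch in src:
        if state == "code":
            if ch == "/":
                state = "slash"        # might be the start of "/*"
            else:
                out.append(ch)
        elif state == "slash":
            if ch == "*":
                state = "comment"      # saw "/*"
            elif ch == "/":
                out.append("/")        # emit the earlier '/', keep waiting
            else:
                out.append("/")        # the '/' was ordinary code after all
                out.append(ch)
                state = "code"
        elif state == "comment":
            if ch == "*":
                state = "star"         # might be the start of "*/"
        elif state == "star":
            if ch == "/":
                state = "code"         # saw "*/", comment is over
            elif ch != "*":
                state = "comment"
    if state == "slash":
        out.append("/")                # input ended right after a lone '/'
    return "".join(out)
```

Because /* .. */ comments do not nest, four states are enough. Nested groups in a regular expression have no such bound, which is why a parser is needed there.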

Regular expression that matches regular expressions

Has anyone ever tried to describe regular expression that would match regular expressions?
This topic is almost impossible to find on the web due to the repeating keywords.
It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.
But still, I am wondering what a regular expression that matches all regular expressions would look like. It should be possible to write one.
I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.
Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Regular expressions permit nested capture groups, which seem to fall into this category:
Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.
In any case, this may be a better question for the Computer Science Stack Exchange site.
Edit:
¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!
(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)
So it may be technically possible to represent regular expressions as a regular expression.
Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^@]+@[^@]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".
But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.
Having seen this, I wouldn't want to try to write a regex to validate regular expressions.
I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.
^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$
It can be extended, a looooooot...

When should I not use regular expressions?

After some research I figured that it is not possible to parse recursive structures (such as HTML or XML) using regular expressions. Is it possible to comprehensively list out day to day coding scenarios where I should avoid using regular expressions because it is just impossible to do that particular task using regular expressions? Let us say the regex engine in question is not PCRE.
Don't use regular expressions when:
the language you are trying to parse is not a regular language, or
when there are readily available parsers specifically made for the data you are trying to parse.
Parsing HTML and XML with regular expressions is usually a bad idea both because they are not regular languages and because libraries already exist that can parse it for you.
As another example, if you need to check if an integer is in the range 0-255, it's easier to understand if you use your language's library functions to parse it to an integer and then check its numeric value instead of trying to write the regular expression that matches this range.
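A quick sketch of both approaches in Python, for comparison (the names `is_byte_value` and `BYTE_RE` are made up for illustration). Note that `int()` is more lenient than the regex: it also accepts surrounding whitespace, a sign, and leading zeros, so the two are not exactly equivalent.

```python
import re

def is_byte_value(text: str) -> bool:
    """Parse first, then compare numerically."""
    try:
        value = int(text)
    except ValueError:
        return False  # not an integer at all
    return 0 <= value <= 255

# The regex version has to spell the numeric range out digit by digit.
BYTE_RE = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]")

def is_byte_value_regex(text: str) -> bool:
    return BYTE_RE.fullmatch(text) is not None
```

The first version states the range directly; in the second, the range is only visible once you decode the alternation.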
I'll plagiarize myself from my blog post, When to use and when not to use regular expressions...
Public websites should not allow users to enter regular expressions for searching. Giving the full power of regex to the general public for a website's search engine could have a devastating effect. There is such a thing as a regular expression denial of service (ReDoS) attack that should be avoided at all costs.
HTML/XML parsing should not be done with regular expressions. First of all, regular expressions are designed to parse a regular language, which is the simplest class in the Chomsky hierarchy. Now, with the advent of balancing group definitions in the .NET flavor of regular expressions, you can venture into slightly more complex territory and do a few things with XML or HTML in controlled situations. However, there's not much point. There are parsers available for both XML and HTML which will do the job more easily, more efficiently, and more reliably. In .NET, XML can be handled the old XmlDocument way or even more easily with LINQ to XML. Or for HTML there's the HTML Agility Pack.
Conclusion
Regular expressions have their uses. I still contend that in many cases they can save the programmer a lot of time and effort. Of course, given infinite time & resources, one could almost always build a procedural solution that's more efficient than an equivalent regular expression.
Your decision to abandon regex should be based on 3 things:
1.) Is the regular expression so slow in your scenario that it has become a bottleneck?
2.) Is your procedural solution actually quicker & easier to write than the regular expression?
3.) Is there a specialized parser that will do the job better?
My rule of thumb is, use regular expressions when no other solution exists. If there's already a parser (for example, XML, HTML) or you're just looking for strings rather than patterns, there's no need to use regular expressions.
Always ask yourself "can I solve this without using regular expressions?". The answer to that question will tell you whether you should use regular expressions.

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can some explain them in simple terms and provide some basic examples where they could be useful, or even critical.
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
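As a sketch of how that pattern might be used from code, here it is in Python (the answer itself is language-neutral; the names `CODE_RE` and `is_valid_code` are made up for illustration). `fullmatch` is used so the whole string has to fit the pattern, not just part of it.

```python
import re

# 5 or 6 letters, then 3 digits, then one lower-case letter.
CODE_RE = re.compile(r"[A-Za-z]{5,6}[0-9]{3}[a-z]")

def is_valid_code(text: str) -> bool:
    return CODE_RE.fullmatch(text) is not None
```

For example, is_valid_code("Hello123x") is true, while "Hello123X" fails on the final upper-case letter and "Hi123x" fails because it has too few leading letters.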
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use \. to mean "a dot" rather than just "any character"). In many languages the backslash itself needs escaping.
What regular expressions are used for:
Regular expressions are a language in themselves that lets you perform complex validation of string inputs: you pass in a string and get back true or false depending on whether it matches.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 0 or 1, * means 0 or more, + means at least one. Example: accept a binary string that is not empty:
(0|1)+
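The three patterns above can be tried out directly. A small sketch using Python's re module (the helper name `matches` and the constant names are made up for illustration):

```python
import re

ANSWER_RE = re.compile(r"yes|no")                 # alternation
SHADE_RE = re.compile(r"gr(a|e)y|black|white")    # grouping
BINARY_RE = re.compile(r"(0|1)+")                 # quantification

def matches(regex: re.Pattern, text: str) -> bool:
    """True if the whole of text matches regex."""
    return regex.fullmatch(text) is not None
```

With these, "gray" and "grey" both match SHADE_RE but "green" does not, and BINARY_RE accepts "0101" but rejects the empty string because + requires at least one digit.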
Why regular expressions?
Regular expressions make it easy to match strings; a single small regular expression can often replace several dozen lines of source code.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks, for example when you need to guarantee that a string has an equal number of opening and closing parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. From that state machine you can work out how to write source code that checks whether a string matches your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
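To illustrate the state machine idea with this example, here is a sketch in Python that writes out a deterministic finite automaton for [hc]+at by hand and puts it next to the regex. The state numbering and the names `TRANSITIONS` and `dfa_matches` are my own choices for illustration, not the output of any tool.

```python
import re

PATTERN = re.compile(r"[hc]+at")

# States: 0 = start, 1 = seen one or more of h/c, 2 = seen the 'a', 3 = seen the 't'.
TRANSITIONS = {
    (0, "h"): 1, (0, "c"): 1,
    (1, "h"): 1, (1, "c"): 1, (1, "a"): 2,
    (2, "t"): 3,
}
ACCEPTING = {3}

def dfa_matches(text: str) -> bool:
    """Run the hand-written automaton over the whole of text."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING
```

For whole-string matching, dfa_matches should agree with PATTERN.fullmatch: "at" is rejected by both because state 0 has no transition on 'a'.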
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorials and examples; start, for example, from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand for anyone familiar with regexps (though it should be commented in either case).
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
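A sketch of pulling the date out in Python (the function name `extract_version_date` is made up for illustration). In this version the dot before txt is escaped as \. so that it matches only a literal dot rather than any character.

```python
import re
from typing import Optional

VERSIONED_RE = re.compile(r"my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt")

def extract_version_date(filename: str) -> Optional[str]:
    """Return the captured date part, or None if the name does not fit."""
    match = VERSIONED_RE.fullmatch(filename)
    return match.group(1) if match else None
```

Group 1 is the part wrapped in parentheses, so a matching filename yields just the date string, ready to be parsed further.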
No, regexes are not necessary, but they can save a world of time compared with writing a hand-rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will be quite long, whereas with a regular expression you can have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a built-in library of expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Good luck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance, hence the \w+ (at least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -E " sh| bash"
This is useful if you want to kill them all or something. If you did a search for just sh you'd not get the bash ones and would have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 letter and 1 digit
How can you achieve these requirements?
The best way is to use a regular expression.
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Regexp that matches valid regexps

Is there a regular expression that matches valid regular expressions?
(I know there are several flavors of regexps. One would do.)
Is there a regular expression that matches valid regular expressions?
By definition, it's quite simple: No.
The language of all regexes is not a regular language (just look at nested parentheses) and therefore there can't be a regular expression to parse it.
If you merely want to check whether a regular expression is valid or not, simply try to compile it with whichever programming language or regular expression library you're working with.
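In Python, for instance, that check might look like the sketch below (the function name `is_valid_regex` is made up for illustration). Keep in mind that "valid" here means valid for Python's re flavor; other engines accept a different syntax.

```python
import re

def is_valid_regex(pattern: str) -> bool:
    """Let the regex engine itself decide whether pattern compiles."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for syntax errors such as an unclosed group.
        return False
    return True
```

This sidesteps the whole question of describing regex syntax: the library's own parser is the authority on what it accepts.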
Parsing regular expressions is far from trivial. As the author of RegexBuddy, I have been around that block a few times. If you really want to do it, use a regex to tokenize the input, and leave the parsing logic to procedural code. That is, your regex would match one regex token (^, $, \w, (, ), etc.) at a time, and your procedural code would check if they're in the right order.
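A heavily simplified sketch of that split, in Python: one regex matches a single token at a time, and ordinary code checks the order. This is not how RegexBuddy does it. The token set, the names `TOKEN_RE` and `check_simple_regex`, and the two rules checked (parentheses balance, no quantifier at the very start or right after an opening parenthesis) are all assumptions chosen to keep the example short. It does not catch other errors, such as an unterminated character class.

```python
import re

# One alternative per token kind; the named group tells us which one matched.
TOKEN_RE = re.compile(
    r"(?P<escape>\\.)"                        # escaped character, e.g. \w or \(
    r"|(?P<char_class>\[(?:\\.|[^\]\\])*\])"  # bracket expression, e.g. [a-z]
    r"|(?P<open>\()"
    r"|(?P<close>\))"
    r"|(?P<quantifier>[*+?])"
    r"|(?P<other>.)",                         # any other single character
    re.DOTALL,
)

def check_simple_regex(pattern: str) -> bool:
    depth = 0        # currently open groups
    previous = None  # kind of the previous token
    for match in TOKEN_RE.finditer(pattern):
        kind = match.lastgroup
        if kind == "open":
            depth += 1
        elif kind == "close":
            depth -= 1
            if depth < 0:
                return False  # ')' without a matching '('
        elif kind == "quantifier" and previous in (None, "open"):
            return False      # nothing to repeat
        previous = kind
    return depth == 0
```

Because escapes and character classes are consumed as single tokens, a parenthesis inside \( or [(] is not counted, which is the main thing a naive character-by-character count would get wrong.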
Unfortunately, most invalid regular expressions are invalid due to parentheses nesting errors. This is exactly the type of string that regular expressions can't match. (Okay, some fancy regular expression systems have recursion extensions, but that's rare.)
As already said, you cannot describe regular expressions with a regular expression due to their recursive nature. You'll need a context free grammar for that.
But what would be the point of having such a regular expression, anyway? If you just want to check whether a regular expression is correct, you can simply try to use it (Pattern.compile(regexp) in Java), and if it throws an exception, it is not valid.
You probably need a parser, not a regex. Regexes are powerful tools, but are not parsing tools. They are not well suited to nested grammars, for example.
From Douglas Crockford's The JavaScript Programming Language video 4 (of 4):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
http://video.yahoo.com/watch/111596/1710658 at approximately -17.20.
Depending on your goal I would say definitely maybe.
If you want to filter regexps out from somewhere, it might prove difficult as regular expressions come in all sizes and shapes and they don't all start and end with slashes.
If you just need to know whether or not a regexp is valid, there is another way. Depending on the language you're using, you could try to compile it inside a try/catch.
If you can be more specific I could try and give a better answer; the question is intriguing.