To use or not to use regular expressions? - regex

I just asked this question about using a regular expression to allow numbers between -90.0 and +90.0. I got some answers on how to implement the regular expression, but most of the answers also mentioned that that would be better handled without using a regular expression or using a regular expression would be overkill. So how do you decide when to use a regular expression and when not to use a regular expression. Is there a check list you can follow?

Regular expressions are a text processing tool for character-based tests. More formally, regular expressions are good at handling regular languages and bad at almost anything else.
In practice, this means that regular expressions are not well suited for tasks that require discovering meaning (semantics) in text that goes beyond the character level. This would require a full-blown parser.
In your particular case: recognizing a number in a text is an exercise that regular expressions are good at (decimal numbers can be trivially described using a regular language). This works on the character level.
But doing more advanced stuff with the number that requires knowledge of its numerical value (i.e. its semantics) requires interpretation. Regular expressions are bad at this. So finding a number in text is easy. Finding a number in text that is greater than 11 but smaller than 1004 (or that is divisible by 3) is hard: it requires recognizing the meaning of the number.

I would say that regex expressions are most effective on Strings. For other data types, manipulations of that data type will usually be more intuitive and provide better results.
For example, if you know that you're dealing with DateTime, then you can use the Parse and TryParse methods will the different formats and it will usually be more reliable than your own regex expressions.
In your example, you are dealing with numbers so deal with them accordingly.
Regex is very powerful, but it isn't the easiest code to read and to debug. When another reliable solution is at hand, you should probably go for that.

Without meaning to be circular or obtuse, you should use regular expressions when you have a string which contains information structured in a regular language, and you want to turn this string into an object model.

Basic use-case for RegEx :-
You need "Key Value Pairs" - Both Key and Values are embedded within other noisy text - cant be accessed or isolated otherwise.
You need to automate extraction of these values by looping over multiple documents.
Number and combination of Key Value pairs maybe discovered as you progress parsing through text.

The answer is straight forward:
If you can solve your problem without regular expressions (just by string functions), you don't use regular expressions. As it was said in one book I've read: regular expressions are violence over computer.
If it's to complicated to use language string functions, use regular expressions.

Related

Using RegEx for simple operations

I was wondering if there might be some reason someone would want to use a regular expression for a problem that could also be written easily without using regular expressions.
I came to this thought because of this question.
The question is something fairly simple, and the answers vary in 2 categories, those who do solve it with regular expressions and those who just use some other simple operation.
Summary of the question: Remove the first part of a url path (example: String path = "/folder1/folder2/folder3/").
2 Solutions:
//With regex
String newPathRegex = path.replaceAll("^/[^/]*", "");
//Without regex
String newPathNoRegex = path.substring(path.indexOf('/', 1));
Personally I think the no RegEx solution is a lot easier to read, but I'm not an expert on regular expressions.
So the question comes down to: Should you avoid using regular expressions in cases as simple as this one? Is there better performance in the RegEx solution?
A few reasons why it is useful to use regular expressions:
Regular expressions run in O(n log n) in the size of the expression and O(n) of the length of the string. So the time complexity is guaranteed to be very reasonable, whereas custom programs can sometimes be badly implemented. Most programs running in (pseudo)-linear time are considered to be very fast. Although it is possible to construct tailor made algorithm that will outperform regular expressions for each task that can be carried out by a regex, it is in general not easy for humans to do so. Regular expressions thus guarantee the construction of a fast enough algorithm.
Most properties on regular expressions are decidable: it is decidable whether two regular expressions determine the same set of strings, etc. So there is an entire algebra defined over it. All (non-trivial, language-invariant) properties on programs are undecidable: that's a consequence of Rice's theorem, so you can't prove in general that two programs will do the same thing (are equivalent), whereas this is an easy task for regular expressions.
Modifiable. Perhaps you want to remove the first part of the path, but only if it is not ... In general modifications to a regular expression tend to be easy whereas modifying a program can blow up the size of the code.
The most problematic part is that not all programmers are familiar with regular expressions, and that they are a bit cryptic: the semantics are sometimes a bit hard to guess. And furthermore, the pumping lemma states not every problem can be transformed into a regular expression (problem).

Regular expression that matches regular expressions

Has anyone ever tried to describe regular expression that would match regular expressions?
This topic is almost impossible to find on the web due to the repeating keywords.
It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.
But still I am wondering how would a regular expression that matches all regular expressions look like. It should be possible to write one.
I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.
Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Regular expressions permit nested capture groups, which seem to fall into this category:
Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.
In any case, this may be a better question for the Computer Science Stack Exchange site.
Edit:
¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!
(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)
So it may be technically possible to represent regular expressions as a regular expression.
Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^#]+#[^#]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".
But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.
Having seen this, I wouldn't want to try to write a regex to validate regular expressions.
I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.
^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$
It can be extended, a looooooot...

When should I not use regular expressions?

After some research I figured that it is not possible to parse recursive structures (such as HTML or XML) using regular expressions. Is it possible to comprehensively list out day to day coding scenarios where I should avoid using regular expressions because it is just impossible to do that particular task using regular expressions? Let us say the regex engine in question is not PCRE.
Don't use regular expressions when:
the language you are trying to parse is not a regular language, or
when there are readily available parsers specifically made for the data you are trying to parse.
Parsing HTML and XML with regular expressions is usually a bad idea both because they are not regular languages and because libraries already exist that can parse it for you.
As another example, if you need to check if an integer is in the range 0-255, it's easier to understand if you use your language's library functions to parse it to an integer and then check its numeric value instead of trying to write the regular expression that matches this range.
I'll plagiarize myself from my blog post, When to use and when not to use regular expressions...
Public websites should not allow users to enter regular expressions for searching. Giving the full power of regex to the general public for a website's search engine could have a devastating effect. There is such a thing as a regular expression denial of service (ReDoS) attack that should be avoided at all costs.
HTML/XML parsing should not be done with regular expressions. First of all, regular expressions are designed to parse a regular language which is the simplest among the Chomsky hierarchy. Now, with the advent of balancing group definitions in the .NET flavor of regular expressions you can venture into slightly more complex territory and do a few things with XML or HTML in controlled situations. However, there's not much point. There are parsers available for both XML and HTML which will do the job more easily, more efficiently, and more reliably. In .NET, XML can be handled the old XmlDocument way or even more easily with Linq to XML. Or for HTML there's the HTML Agility Pack.
Conclusion
Regular expressions have their uses. I still contend that in many cases they can save the programmer a lot of time and effort. Of course, given infinite time & resources, one could almost always build a procedural solution that's more efficient than an equivalent regular expression.
Your decision to abandon regex should be based on 3 things:
1.) Is the regular expression so slow in your scenario that it has become a bottleneck?
2.) Is your procedural solution actually quicker & easier to write than the regular expression?
3.) Is there a specialized parser that will do the job better?
My rule of thumb is, use regular expressions when no other solution exists. If there's already a parser (for example, XML, HTML) or you're just looking for strings rather than patterns, there's no need to use regular expressions.
Always ask yourself "can I solve this without using regular expressions?". The answer to that question will tell you whether you should use regular expressions.

In Which Cases Is Better To Use Regular Expressions?

I'm starting to learn Regular Expressions and I want to know: In which cases is better to use them?
Regular expressions is a form of pattern matching that you can apply on textual content. Take for example the DOS wildcards ? and * which you can use when you're searching for a file
. That is a kind of very limited subset of RegExp. For instance, if you want to find all files beginning with "fn", followed by 1 to 4 random characters, and ending with "ht.txt", you can't do that with the usual DOS wildcards. RegExp, on the other hand, could handle that and much more complicated patterns.
Regular expressions are, in short, a way to effectively
handle data
search and replace strings
provide extended string handling.
Often a regular expression can in itself provide string handling that other functionalities such as the built-in string methods and properties can only do if you use them in a complicated function or loop.
When you are trying to find/replace/validate complicated string patterns.
I use regular expressions when comparing strings (preg_match), replacing substrings (sed,preg_replace), replacing characters (sed,preg_replace), searching for strings in files (grep), splitting strings (preg_split) etc.
It is a very flexible and widespread pattern expression language and it is very useful to know.
BUT! It's like they say about poker, it's very easy to learn, but very hard to master.
I just came across a question that i thought was perfect for a RegEx, have a look and decide for yourself.
There are some cases where, if you need better performance, you should avoid regular expressions in favor of writing code. An example of this is parsing very large CSV files.
Regular expressions are a dsl (domain specific language) for parsing text. Just like xpath is a dsl for traversing xml. It is essentially a mini language inside of a general purpose language. You can accomplish quite a bit in a very small amount of code because it is specialized for a narrow purpose. One very common use for regular expressions is checking if a string is an email address, phone number, ssn, etc...
There are also cases where regular expressions are >>NOT<< appropriate (in general; there are always exceptions).
Parsing HTML
Parsing XML
In the above cases a DOM parser is almost always a better choice. The grammars are complex and there are too many edge cases, such as nested tags.
Also be sure to consider future maintenance programmers (which may be you). Comments and/or well-chosen method/constant/variable names can make a world of difference, especially for developers not fluent in regular expressions.
Regular expressions can be especially useful for validating the format of free text input. Of course they can't validate the correctness of data, just its format. And you have to keep in mind regional variations for certain types of values (phone numbers or postal codes for example). But for cases where valid input can be defined as a text pattern, regexes make quick work of the validation.

When is a issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
To truely match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses vs the monsterous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmers toolbox but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex you need to either attempt to logically break it up into smaller regular expressions to match portions of your text or you need to start looking at other methods to solve your problem. Similarly, there are simply problems that Regular Expressions, due to their nature, can't solve (as one poster said, not adhering to Regular Language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
Essentially, you're going to far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver.
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
Sure sign to stop using regexps is this: if you have many grouping braces '()' and many alternatives '|' then it is a sure sign that you try to do a (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about it's properties (e.g. is there an input on which this parser will work in a exponential time).
This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).
Along with tremendous expressions, there are principal limitations on the words, which can be handled by regexp.
For instance you can not not write regexp for word described by n chars a, then n chars b, where n can be any, more strictly .
In different languages regexp is a extension of Regular language, but time of parsing can be extremely large and this code is non-portable.
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions), the maintainability and readability limit is surpassed a lot earlier than the technical limit in most cases.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid but I often lament not being able to do database type of queries using regular expression. Now especially more then before because I am entering those types of search string all the time on search engines. its very difficult, if not impossible to search for +complex AND +"regular expression"
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer