After some research I figured that it is not possible to parse recursive structures (such as HTML or XML) using regular expressions. Is it possible to comprehensively list out day to day coding scenarios where I should avoid using regular expressions because it is just impossible to do that particular task using regular expressions? Let us say the regex engine in question is not PCRE.
Don't use regular expressions when:
the language you are trying to parse is not a regular language, or
when there are readily available parsers specifically made for the data you are trying to parse.
Parsing HTML and XML with regular expressions is usually a bad idea, both because they are not regular languages and because libraries already exist that can parse them for you.
As another example, if you need to check if an integer is in the range 0-255, it's easier to understand if you use your language's library functions to parse it to an integer and then check its numeric value instead of trying to write the regular expression that matches this range.
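To illustrate the range check described above, here is a minimal sketch in Python. The function names are made up for this example, and the regular expression is just one possible way to spell out the range 0-255. It is shown only for comparison.

```python
import re

def is_byte_value(text: str) -> bool:
    """Parse the text as an integer, then check its numeric value."""
    # Accept plain ASCII digits only; int() alone would also accept
    # signs, surrounding whitespace and underscores.
    if not (text.isascii() and text.isdigit()):
        return False
    return 0 <= int(text) <= 255

# One possible regex for the same range. Note that it is stricter about
# leading zeros: "007" is rejected here but accepted by is_byte_value().
BYTE_RE = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]")

def is_byte_value_regex(text: str) -> bool:
    """Check the same range purely at the character level."""
    return BYTE_RE.fullmatch(text) is not None
```

The first version states the intent (a number between 0 and 255) directly, while the second one has to be decoded by the reader before the range becomes visible.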
I'll plagiarize myself from my blog post, When to use and when not to use regular expressions...
Public websites should not allow users to enter regular expressions for searching. Giving the full power of regex to the general public for a website's search engine could have a devastating effect. There is such a thing as a regular expression denial of service (ReDoS) attack that should be avoided at all costs.
HTML/XML parsing should not be done with regular expressions. First of all, regular expressions are designed to parse regular languages, the simplest class in the Chomsky hierarchy. Now, with the advent of balancing group definitions in the .NET flavor of regular expressions you can venture into slightly more complex territory and do a few things with XML or HTML in controlled situations. However, there's not much point. There are parsers available for both XML and HTML which will do the job more easily, more efficiently, and more reliably. In .NET, XML can be handled the old XmlDocument way or even more easily with LINQ to XML. Or for HTML there's the HTML Agility Pack.
Conclusion
Regular expressions have their uses. I still contend that in many cases they can save the programmer a lot of time and effort. Of course, given infinite time & resources, one could almost always build a procedural solution that's more efficient than an equivalent regular expression.
Your decision to abandon regex should be based on 3 things:
1.) Is the regular expression so slow in your scenario that it has become a bottleneck?
2.) Is your procedural solution actually quicker & easier to write than the regular expression?
3.) Is there a specialized parser that will do the job better?
My rule of thumb is, use regular expressions when no other solution exists. If there's already a parser (for example, XML, HTML) or you're just looking for strings rather than patterns, there's no need to use regular expressions.
Always ask yourself "can I solve this without using regular expressions?". The answer to that question will tell you whether you should use regular expressions.
Related
I just asked this question about using a regular expression to allow numbers between -90.0 and +90.0. I got some answers on how to implement the regular expression, but most of the answers also mentioned that this would be better handled without a regular expression, or that a regular expression would be overkill. So how do you decide when to use a regular expression and when not to? Is there a checklist you can follow?
Regular expressions are a text processing tool for character-based tests. More formally, regular expressions are good at handling regular languages and bad at almost anything else.
In practice, this means that regular expressions are not well suited for tasks that require discovering meaning (semantics) in text that goes beyond the character level. This would require a full-blown parser.
In your particular case: recognizing a number in a text is an exercise that regular expressions are good at (decimal numbers can be trivially described using a regular language). This works on the character level.
But doing more advanced stuff with the number that requires knowledge of its numerical value (i.e. its semantics) requires interpretation. Regular expressions are bad at this. So finding a number in text is easy. Finding a number in text that is greater than 11 but smaller than 1004 (or that is divisible by 3) is hard: it requires recognizing the meaning of the number.
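The split between "find the number" and "interpret the number" could be sketched like this in Python. The function name and the sample predicates are assumptions made for illustration. The regex only locates runs of digits, and ordinary code decides what their values mean.

```python
import re

def find_numbers(text, predicate):
    """Locate digit runs with a regex, then filter them by numeric value."""
    # Character-level work: find candidate numbers (unsigned integers only).
    candidates = re.findall(r"\d+", text)
    # Semantic work: convert to int and apply an ordinary numeric test.
    return [int(c) for c in candidates if predicate(int(c))]

# Greater than 11 but smaller than 1004.
in_range = find_numbers("ids 7, 12, 1003, 1004 and 20000", lambda n: 11 < n < 1004)

# Divisible by 3.
multiples_of_three = find_numbers("9 10 33", lambda n: n % 3 == 0)
```

Negative numbers and decimals are ignored in this sketch; handling them would change the pattern slightly, but not the division of labour.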
I would say that regular expressions are most effective on strings. For other data types, manipulations of that data type will usually be more intuitive and provide better results.
For example, if you know that you're dealing with DateTime, then you can use the Parse and TryParse methods with the different formats, and it will usually be more reliable than your own regular expressions.
In your example, you are dealing with numbers so deal with them accordingly.
Regex is very powerful, but it isn't the easiest code to read and to debug. When another reliable solution is at hand, you should probably go for that.
Without meaning to be circular or obtuse, you should use regular expressions when you have a string which contains information structured in a regular language, and you want to turn this string into an object model.
Basic use-case for RegEx :-
You need "Key Value Pairs" - both keys and values are embedded within other noisy text and can't be accessed or isolated otherwise.
You need to automate extraction of these values by looping over multiple documents.
The number and combination of Key Value pairs may be discovered as you progress through parsing the text.
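As a rough sketch of that use case in Python: the `key=value` notation, the pattern and the function names below are all assumptions made for illustration, since the answer does not say what the pairs look like.

```python
import re

# Hypothetical format: an identifier, "=", then a value that runs up to
# the next whitespace, comma or semicolon.
PAIR_RE = re.compile(r"(?P<key>[A-Za-z_]\w*)\s*=\s*(?P<value>[^\s,;]+)")

def extract_pairs(text):
    """Return every key/value pair found in one noisy document."""
    return {m.group("key"): m.group("value") for m in PAIR_RE.finditer(text)}

def extract_from_documents(documents):
    """Loop over several documents; the keys found may differ per document."""
    return [extract_pairs(doc) for doc in documents]
```

For example, `extract_pairs("noise user=alice, more noise status = 200; end")` picks out `user` and `status` without needing to know in advance which keys will appear.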
The answer is straightforward:
If you can solve your problem without regular expressions (just by string functions), you don't use regular expressions. As it was said in one book I've read: regular expressions are violence over computer.
If it's too complicated to use the language's string functions, use regular expressions.
I have used regExp quite a few times but am still far from being an expert. This time I want to validate a formula (or math expression) by regExp. The difficult part here is to validate proper starting and ending parentheses within the formula.
I believe, there would be some sample on the web but I could not find it. Can somebody post a link for such example? or help me by some other means?
Languages with matched nested parentheses are not regular languages and can therefore not be recognized by regular expressions. Some implementations of regular expression (for example in the .NET framework) have extensions to deal with this but that is really no fun to work with. So I suggest to use an available parser or implement a simple parser yourself (for the sake of fun).
For the extension in the .NET implementation see the MSDN on balancing groups.
If your mathematical expression involves matched nested parentheses, then it's not a regular grammar but a context-free one and, as such, cannot be parsed using regex.
I'm starting to learn Regular Expressions and I want to know: In which cases is better to use them?
Regular expressions are a form of pattern matching that you can apply on textual content. Take for example the DOS wildcards ? and * which you can use when you're searching for a file. That is a kind of very limited subset of RegExp. For instance, if you want to find all files beginning with "fn", followed by 1 to 4 random characters, and ending with "ht.txt", you can't do that with the usual DOS wildcards. RegExp, on the other hand, could handle that and much more complicated patterns.
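The file-name example above might look like this in Python. The function name and the sample file names are invented for illustration.

```python
import re

# "fn", then 1 to 4 arbitrary characters, then a literal "ht.txt".
FILE_RE = re.compile(r"fn.{1,4}ht\.txt")

def matching_files(names):
    """Keep only the names that match the whole pattern."""
    return [name for name in names if FILE_RE.fullmatch(name)]

candidates = ["fnAht.txt", "fnabcdht.txt", "fnht.txt", "fnabcdeht.txt", "xfnabht.txt"]
selected = matching_files(candidates)
```

Here `selected` keeps only the first two names; the others have zero or five characters in the middle, or do not start with "fn". Passing `re.IGNORECASE` to `re.compile` would mimic the case-insensitive matching of DOS file names.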
Regular expressions are, in short, a way to effectively
handle data
search and replace strings
provide extended string handling.
Often a regular expression can in itself provide string handling that other functionalities such as the built-in string methods and properties can only do if you use them in a complicated function or loop.
When you are trying to find/replace/validate complicated string patterns.
I use regular expressions when comparing strings (preg_match), replacing substrings (sed,preg_replace), replacing characters (sed,preg_replace), searching for strings in files (grep), splitting strings (preg_split) etc.
It is a very flexible and widespread pattern expression language and it is very useful to know.
BUT! It's like they say about poker, it's very easy to learn, but very hard to master.
I just came across a question that I thought was perfect for a RegEx; have a look and decide for yourself.
There are some cases where, if you need better performance, you should avoid regular expressions in favor of writing code. An example of this is parsing very large CSV files.
Regular expressions are a DSL (domain-specific language) for parsing text, just as XPath is a DSL for traversing XML. It is essentially a mini language inside of a general-purpose language. You can accomplish quite a bit in a very small amount of code because it is specialized for a narrow purpose. One very common use for regular expressions is checking whether a string is an email address, phone number, SSN, etc.
There are also cases where regular expressions are >>NOT<< appropriate (in general; there are always exceptions).
Parsing HTML
Parsing XML
In the above cases a DOM parser is almost always a better choice. The grammars are complex and there are too many edge cases, such as nested tags.
Also be sure to consider future maintenance programmers (which may be you). Comments and/or well-chosen method/constant/variable names can make a world of difference, especially for developers not fluent in regular expressions.
Regular expressions can be especially useful for validating the format of free text input. Of course they can't validate the correctness of data, just its format. And you have to keep in mind regional variations for certain types of values (phone numbers or postal codes for example). But for cases where valid input can be defined as a text pattern, regexes make quick work of the validation.
Are Regular Expressions useful for a Web Designer (XHTML/CSS)? Can a web designer get any help if they learn regular expressions?
Anyone who works with text files on a regular basis, which includes all programmers, can benefit from learning regular expressions. They make find-and-replace tasks much easier, and save you a lot of manual editing. Almost all text-editing programs support regular expression searches.
You won't be able to use them in your code, if all you're doing is HTML and CSS. But if you start to use JavaScript, you'll find them useful for things like testing the value of input fields.
Yes.
Regular expressions are not part of (X)HTML or CSS, but they are part of the tools you will probably use with them: JavaScript, XSLT, and any server scripts.
The key is to remember that, because of the difficulty in parsing a language with quoted strings and SGML-style tags, regex shouldn't be used to parse (X)HTML except as a last resort. Tokenizers exist so you don't have to do that particular hard work. You would find most regular expressions in use for checking and sanitizing input.
In short: yes, but always use the best tool for the job.
Among other things, it's great for validating input and cleansing HTML.
Even if you don't use them directly, RegExes are a way of thinking about manipulating text that has proven very valuable to me over the years, from the very first project I was paid money to write in Perl to things I do today in .NET.
Can a web designer get any help if they learn regular expressions?
If you're asking for some resources to help you with regular expressions, I find the MDC page on Regular Expressions and the Regular Expression Cheat Sheet quite useful references, and the Regular Expression Tester handy for testing regular expressions.
Steve
CSS has no regular expressions as such, but its attribute selectors support limited substring matching (^=, $= and *=), so the way of thinking might be of some use in that area.
Other than that, only if you code JavaScript.
There are languages and protocols you use when creating web artifacts (pages, scripts, styles, etc.) and there are tools you use to manipulate those artifacts (editors, utilities, interpreters, etc.).
Regular expressions belong to the latter set, and to the more advanced end of its spectrum (based on the fact that the average web designer is likely to have no clue about them). But any computer-literate person would benefit from learning and using regular expressions - not just web designers.
Anything that makes you stand out (in a positive way, of course) in comparison to the crowd will benefit you - so go ahead and learn regex - you won't regret it.
Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
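For the parenthesis question above, a hand-written check is only a few lines. This is a minimal Python sketch with an invented function name. Because there is only one kind of bracket, a depth counter can stand in for the stack of a pushdown automaton.

```python
def parens_balanced(text: str) -> bool:
    """Return True if every '(' is closed by a later ')' and vice versa."""
    depth = 0  # number of currently open parentheses (the "stack height")
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing left to close
                return False
    return depth == 0      # nothing may remain open at the end
```

The counter is exactly the unbounded memory that a finite-state automaton lacks, which is why no classical regular expression can do this for arbitrary nesting depth.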
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
Regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
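The ".museum" limitation is easy to reproduce. The sketch below assumes the pattern uses the usual `@` separator and is applied case-insensitively (it only lists upper-case letters). The function name and the sample addresses are invented for illustration.

```python
import re

# The short pattern from above; it lists A-Z only, so it is compiled
# with IGNORECASE here.
SIMPLE_EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

def looks_like_email(text: str) -> bool:
    """Check whether the whole string matches the simple pattern."""
    return SIMPLE_EMAIL.fullmatch(text) is not None

ordinary = looks_like_email("john@example.com")        # 3-letter TLD fits {2,4}
museum = looks_like_email("curator@example.museum")    # 6-letter TLD does not
```

The `{2,4}` quantifier on the top-level domain is what rejects the second address. Widening it is a one-character change, which is part of the argument for keeping the pattern simple.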
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex, you need to either break it up logically into smaller regular expressions that match portions of your text, or start looking at other methods to solve your problem. Similarly, there are simply problems that regular expressions, due to their nature, can't solve (as one poster said, input that does not adhere to a regular language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
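The division of labour described above can be sketched as a tiny validator for arithmetic formulas in Python. The grammar, the token pattern and all function names here are assumptions chosen for illustration. The regex only splits the input into tokens, and a small recursive-descent parser decides whether they fit together, including the pairing of parentheses.

```python
import re

# Regex side: one token is a number or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/()])")

def tokenize(text):
    """Split the text into tokens, raising ValueError on anything else."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input near position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

# Parser side. Each function returns the index just past what it parsed.
def parse_expr(tokens, i):
    """expr := term (('+' | '-') term)*"""
    i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ("+", "-"):
        i = parse_term(tokens, i + 1)
    return i

def parse_term(tokens, i):
    """term := factor (('*' | '/') factor)*"""
    i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in ("*", "/"):
        i = parse_factor(tokens, i + 1)
    return i

def parse_factor(tokens, i):
    """factor := NUMBER | '-' factor | '(' expr ')'"""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[i]
    if tok == "-":
        return parse_factor(tokens, i + 1)
    if tok == "(":
        i = parse_expr(tokens, i + 1)  # recursion handles the nesting
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("missing closing parenthesis")
        return i + 1
    if tok[0].isdigit():
        return i + 1
    raise ValueError(f"unexpected token {tok!r}")

def is_valid_formula(text):
    """True if the whole text is one well-formed arithmetic expression."""
    try:
        tokens = tokenize(text)
        return bool(tokens) and parse_expr(tokens, 0) == len(tokens)
    except ValueError:
        return False
```

With this sketch, `is_valid_formula("(1 + 2) * (3 - (4 / 5))")` is accepted, while `"(1 + 2"` and `"1 + * 2"` are rejected. The recursive call in `parse_factor` is what a plain regular expression has no equivalent for.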
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign to stop using regexps is this: if you have many grouping parentheses '()' and many alternatives '|', then you are trying to do (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc. and soon you have yourself a parser that is hard to read, hard to modify, and whose properties are hard to reason about (e.g. is there an input on which this parser will run in exponential time?).
This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).
Apart from expressions growing tremendously large, there are fundamental limitations on the languages that a regexp can handle.
For instance, you cannot write a regexp for the words consisting of n characters a followed by n characters b, where n can be any number; more formally, the language { a^n b^n : n >= 0 }.
In many programming languages, the regexp implementation is an extension of regular languages, but matching time can become extremely large and such code is non-portable.
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions): in most cases the limit of maintainability and readability is reached long before the technical limit.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands).
This may sound stupid, but I often lament not being able to do database-type queries using regular expressions. Now especially more than before, because I am entering those types of search strings all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression".
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
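Outside the editor, an "AND" query of this kind is usually easier to express in code than in a single pattern. The following Python sketch uses invented function and command names. It shows two options: combining several simple searches, or using lookahead assertions, which Python's `re` supports but which may not be available in other regex engines.

```python
import re

def names_with_all(names, words):
    """Keep the names that contain every word, in any order."""
    patterns = [re.compile(re.escape(word)) for word in words]
    return [n for n in names if all(p.search(n) for p in patterns)]

# Alternative: one pattern with two lookaheads anchored at the start.
BOTH = re.compile(r"^(?=.*Buffer)(?=.*Window)")

commands = ["BufferWindowSplit", "WindowBufferSwap", "BufferList", "WindowResize"]
by_loop = names_with_all(commands, ["Buffer", "Window"])
by_lookahead = [c for c in commands if BOTH.search(c)]
```

Both approaches select the first two names. The loop version scales more naturally when the number of required words grows, since the single-pattern alternative without lookahead needs one branch per ordering.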