Using RegEx for simple operations - regex

I was wondering if there might be some reason someone would want to use a regular expression for a problem that could also be written easily without using regular expressions.
I came to this thought because of this question.
The question is fairly simple, and the answers fall into two categories: those who solve it with regular expressions and those who use some other simple string operation.
Summary of the question: Remove the first part of a URL path (example: String path = "/folder1/folder2/folder3/").
2 Solutions:
//With regex
String newPathRegex = path.replaceAll("^/[^/]*", "");
//Without regex
String newPathNoRegex = path.substring(path.indexOf('/', 1));
Personally I think the no RegEx solution is a lot easier to read, but I'm not an expert on regular expressions.
So the question comes down to: Should you avoid using regular expressions in cases as simple as this one? Is there better performance in the RegEx solution?

A few reasons why it is useful to use regular expressions:
With an automaton-based engine, a regular expression is compiled once (in time depending on the size of the expression) and then matched in O(n) in the length of the string, so the time complexity is guaranteed to be very reasonable, whereas custom programs can sometimes be badly implemented. (Backtracking engines, by contrast, can be slow in pathological cases.) Most programs running in (pseudo-)linear time are considered to be very fast. Although it is possible to construct a tailor-made algorithm that will outperform regular expressions for each task that can be carried out by a regex, it is in general not easy for humans to do so. Regular expressions thus guarantee the construction of a fast enough algorithm.
Most properties of regular expressions are decidable: it is decidable whether two regular expressions determine the same set of strings, and so on, so there is an entire algebra defined over them. All non-trivial semantic properties of programs are undecidable (a consequence of Rice's theorem), so you can't prove in general that two programs do the same thing (are equivalent), whereas for regular expressions equivalence is at least decidable.
Modifiable. Perhaps you want to remove the first part of the path, but only if it is not ... In general modifications to a regular expression tend to be easy whereas modifying a program can blow up the size of the code.
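To make the "modifiable" point concrete, here is a small sketch building on the path example from the question (the extra "not static" condition is hypothetical, added only for illustration):
// Hypothetical variation: strip the first path segment, but only if it is not "static".
String path = "/folder1/folder2/folder3/";

// Regex version: one lookahead added to the original pattern.
String newPathRegex = path.replaceAll("^/(?!static/)[^/]*", "");

// Non-regex version: the extra condition turns one line into several.
String newPathNoRegex = path;
int secondSlash = path.indexOf('/', 1);
if (secondSlash >= 0 && !path.substring(1, secondSlash).equals("static")) {
    newPathNoRegex = path.substring(secondSlash);
}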
The most problematic part is that not all programmers are familiar with regular expressions, and they are a bit cryptic: the semantics are sometimes hard to guess. Furthermore, as the pumping lemma shows, not every matching problem can be expressed as a regular expression.

Related

TCL string match vs regexps

Is it right that we should avoid using regexp because it is slow, and use string operations instead? Are there cases where both can be used but regexp is better?
You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo". Using string operations you will also find "Foobar"; is that OK? No. Well then, maybe search for "Foo ", but then you will not find "Foo," or "Foo.".
With a regex there is no problem: you can match on a word boundary, /\mFoo\M/, and this regex will not be slow.
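For example, a hedged Java sketch of the same whole-word search (my own illustration; Java uses \b for word boundaries rather than \m and \M):
import java.util.regex.Pattern;

public class WholeWord {
    public static void main(String[] args) {
        // \b marks a word boundary, so "Foo" only matches as a whole word.
        Pattern foo = Pattern.compile("\\bFoo\\b");

        System.out.println(foo.matcher("Foo, bar").find());       // true
        System.out.println(foo.matcher("ends with Foo.").find()); // true
        System.out.println(foo.matcher("Foobar").find());         // false
    }
}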
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex, if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here is an interesting read about code optimization
Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, though it won't be pretty. And some problems can be solved with REs if you hit them hard enough, but there are much better solutions available (for example, things that are a good fit for string map).
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception: REs support them, while with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.
The best advice I can give, and the advice I use myself, is: use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no case where you could use either but regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.
I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e.g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e.g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool (\bc\w*t\b would be a good regex for this; compare it to the program logic you'd need if you had to write this yourself).
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two a's and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
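A rough Java sketch of that contrast (the example lines are my own, not from the answer):
import java.util.List;
import java.util.regex.Pattern;

public class LineFilter {
    public static void main(String[] args) {
        List<String> lines = List.of("#define MAX 10", "int count;", "class cart {}");

        // Exact text: plain string operations are enough.
        lines.stream()
             .filter(l -> l.startsWith("#define"))
             .forEach(System.out::println);            // #define MAX 10

        // Pattern: a word that starts with c and ends with t.
        Pattern cWordT = Pattern.compile("\\bc\\w*t\\b");
        lines.stream()
             .filter(l -> cWordT.matcher(l).find())
             .forEach(System.out::println);            // int count;  /  class cart {}
    }
}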
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.
Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)
regexp, aka regular expressions, is used to match many different strings; it can be very complex, or it can even be used to validate a specific input.
string match only allows wildcards such as * and ?, plus basic character grouping with [] as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide what regexp can do also with some examples are explained here: http://www.regular-expressions.info/
So in short: if you don't need regexp or don't know much about it, I recommend not using it. If you just want to compare two strings for equality, use string equal.

To use or not to use regular expressions?

I just asked this question about using a regular expression to allow numbers between -90.0 and +90.0. I got some answers on how to implement the regular expression, but most of the answers also mentioned that this would be better handled without a regular expression, or that using one would be overkill. So how do you decide when to use a regular expression and when not to? Is there a checklist you can follow?
Regular expressions are a text processing tool for character-based tests. More formally, regular expressions are good at handling regular languages and bad at almost anything else.
In practice, this means that regular expressions are not well suited for tasks that require discovering meaning (semantics) in text that goes beyond the character level. This would require a full-blown parser.
In your particular case: recognizing a number in a text is an exercise that regular expressions are good at (decimal numbers can be trivially described using a regular language). This works on the character level.
But doing more advanced stuff with the number that requires knowledge of its numerical value (i.e. its semantics) requires interpretation. Regular expressions are bad at this. So finding a number in text is easy. Finding a number in text that is greater than 11 but smaller than 1004 (or that is divisible by 3) is hard: it requires recognizing the meaning of the number.
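To make that split concrete, here is a hedged Java sketch for the -90.0 to +90.0 case from the question: the regex handles the character-level shape, and ordinary code handles the numeric meaning (the pattern and method name are my own):
import java.util.regex.Pattern;

public class LatitudeCheck {
    // The regex only answers: does this look like a signed decimal number?
    private static final Pattern DECIMAL = Pattern.compile("[+-]?\\d+(\\.\\d+)?");

    static boolean isValidLatitude(String s) {
        if (!DECIMAL.matcher(s).matches()) {
            return false;                              // character-level check: regex territory
        }
        double value = Double.parseDouble(s);
        return value >= -90.0 && value <= 90.0;        // semantic check: plain code
    }

    public static void main(String[] args) {
        System.out.println(isValidLatitude("+45.5"));  // true
        System.out.println(isValidLatitude("90.01"));  // false
        System.out.println(isValidLatitude("abc"));    // false
    }
}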
I would say that regexes are most effective on strings. For other data types, manipulations of that data type will usually be more intuitive and provide better results.
For example, if you know that you're dealing with DateTime, then you can use the Parse and TryParse methods with the different formats and it will usually be more reliable than your own regexes.
In your example, you are dealing with numbers so deal with them accordingly.
Regex is very powerful, but it isn't the easiest code to read and to debug. When another reliable solution is at hand, you should probably go for that.
Without meaning to be circular or obtuse, you should use regular expressions when you have a string which contains information structured in a regular language, and you want to turn this string into an object model.
Basic use case for RegEx:
You need "Key Value Pairs": both keys and values are embedded within other noisy text and can't be accessed or isolated otherwise.
You need to automate extraction of these values by looping over multiple documents.
The number and combination of key-value pairs may be discovered as you progress through the text.
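A hedged Java sketch of that use case (the key=value format and the sample text are my own assumptions, not from the answer):
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueScan {
    public static void main(String[] args) {
        String noisy = "log 2314 user=alice retries=3 junk junk host=db01 end";

        // Capture groups isolate the key and the value from the surrounding noise.
        Pattern kv = Pattern.compile("(\\w+)=(\\S+)");
        Matcher m = kv.matcher(noisy);

        Map<String, String> pairs = new LinkedHashMap<>();
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        System.out.println(pairs);   // {user=alice, retries=3, host=db01}
    }
}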
The answer is straightforward:
If you can solve your problem without regular expressions (just with string functions), don't use regular expressions. As one book I've read put it: regular expressions are violence against the computer.
If it's too complicated to use the language's string functions, use regular expressions.

When to use a parser-generator, and when is regex enough?

I have not gotten into the field of formal languages in computer science yet, so maybe my question is silly. I am writing a simple NMEA parser in C++, and I have to choose:
My first idea was to build a simple finite state machine manually, but then I thought that maybe I could do it with less work, even more efficiently. I have used regular expressions before, but I think the NMEA regular expression would be very long and would take a "long time" to match.
Then I thought about using a parser generator. I think they all use the same method: they generate an FSA. But I don't know which is more efficient. When do you normally use parser generators instead of regexes (I think you could write a regex in a parser generator)?
Please explain the differences, I'm interested in both theory and experience.
Well, a simple rule of thumb is: If the grammar of the data you are trying to parse is regular, use regular expressions. If it is not, regular expressions may still work (as most regex engines also support non-regular grammars), but it might well be painful (complicated / bad performance).
Another aspect is what you are trying to do with the parsed data. If you are only interested in one field, a regex is probably easier to read. If you need to read deeply nested structures, a parser is likely to be more maintainable.
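For illustration, a hedged Java sketch of the "only one field" case, using a simplified NMEA-style sentence (the sentence and field positions are example assumptions, not a full NMEA implementation):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NmeaTimeField {
    public static void main(String[] args) {
        // Simplified NMEA-style sentence used only for illustration.
        String sentence = "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47";

        // One field wanted: a regex that anchors on the sentence type and grabs it.
        Matcher m = Pattern.compile("^\\$GPGGA,(\\d{6}),").matcher(sentence);
        if (m.find()) {
            System.out.println("time field: " + m.group(1));         // 123519
        }

        // Deeply structured use: splitting (or a real parser) scales better.
        String[] fields = sentence.split(",");
        System.out.println("same field by position: " + fields[1]);  // 123519
    }
}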
Regex is a parser-generator.
From wikipedia:
Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
If you're going over a list that only needs to be gone over once, then save the list to a file and read it from there. If you're checking things that are different every time, use regex and store the results in an array or something.
It's much faster than you would assume it to be. I've seen expressions bigger than this post.
Adding that you can nest as much as you'd like, in whatever language you decide to code it in. You could even do it in sections, for maximum re-usability.
As Sneakyness points out, you can have a large and complicated regular expression that is surprisingly powerful. I've seen some examples of this, but none were maintainable by mere mortals. Even using Expresso only helped so much; it was still difficult to understand and risky to modify. So unless you're a savant with a fixation on Grep, I would not recommend this direction.
Instead, consider focusing on the grammar and letting a compiler compiler do the heavy lifting for you.

When is it best to use Regular Expressions over basic string splitting / substring'ing?

It seems that the choice to use string parsing vs. regular expressions comes up on a regular basis for me anytime a situation arises that I need part of a string, information about said string, etc.
The reason that this comes up is that we're evaluating a soap header's action, after it has been parsed into something manageable via the OperationContext object for WCF and then making decisions on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
So I have to ask, what's the typical threshold that people use when trying to decide to use RegEx over typical string parsing. Note that I'm not very strong in Regular Expressions, and because of this, I try to shy away unless it's absolutely vital to avoid introducing more complication than I need.
If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.
EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.
I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted as accordingly.
My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.
For example, consider a tiny "expression language" for evaluating parenthesized arithmetic expressions. Examples of "programs" in this language would look like this:
1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3
A grammar is easy to write, and looks something like this:
DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"
With that grammar, you can build a recursive descent parser in a jiffy.
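For concreteness, here is a minimal Java sketch of such a parser for this toy grammar (my own code; it follows the grammar literally, so operators are right-associative and the usual precedence rules are ignored):
class ExpressionParser {
    private final String src;
    private int pos;

    ExpressionParser(String src) { this.src = src.replaceAll("\\s+", ""); }

    double parse() {
        double value = expression();
        if (pos != src.length()) throw new IllegalArgumentException("Trailing input at " + pos);
        return value;
    }

    // EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    private double expression() {
        double left = (peek() == '(') ? group() : number();
        if (pos < src.length() && "+-*/".indexOf(peek()) >= 0) {
            char op = src.charAt(pos++);
            double right = expression();
            switch (op) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                default:  return left / right;
            }
        }
        return left;
    }

    // GROUP := "(" EXPRESSION ")"
    private double group() {
        pos++;                                   // consume "("
        double value = expression();
        if (peek() != ')') throw new IllegalArgumentException("Expected ')' at " + pos);
        pos++;                                   // consume ")"
        return value;
    }

    // NUMBER := (DIGIT)+
    private double number() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected a digit at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    private char peek() {
        if (pos >= src.length()) throw new IllegalArgumentException("Unexpected end of input");
        return src.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(new ExpressionParser("5 * (10 - 6)").parse());   // 20.0
    }
}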
An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.
Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for real grammars and recursive descent parsers.
Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.
I interpreted it as "when should you use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"
Given that interpretation, my answer is: never.
Okay.... one more edit.
I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)
I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:
if (str.equals("DooWahDiddy")) // No problemo.
if (str.contains("destroy the earth")) // Okay.
if (str.indexOf(";") < str.length / 2) // Not bad.
Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.
if (str.startsWith("I") && str.endsWith("Widget") &&
(!str.contains("Monkey") || !str.contains("Pox"))) // Madness.
Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).
Here's a great reference:
http://www.regular-expressions.info/
PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.
The regex can be
easier to understand
express more clearly the intent
much shorter
easier to change/adapt
In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.
The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.
Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).
Hm ... still no simple rule-of-thumb, sorry.
[W]e're evaluating a soap header's action and making decisions on that
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.
I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.
As soon as you start finding yourself needing to define a grammar however, i.e. complex string parsing, get back to using some sort of finite state machine or the likes as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and eventually simply incapable.
I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.
When your required transformation isn't basic -- but is still conceptually simple.
There's no reason to pull out Regex if you're doing a straight string replacement, for example; it's easier to just use string.Replace.
On the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out.
I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.
I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.
Regexes are rarely overkill - if the match is simple, so is the regex.
I would think the easiest way to know when to use regular expressions is this: when your string search requires an IF/THEN statement or anything resembling this-or-that logic, you need something better than a simple string comparison, and that is where regex shines.

When is an issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
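The check itself needs only a counter acting as a trivial stack; a hedged Java sketch:
public class ParenCheck {
    // Returns true if every "(" is matched by a later ")".
    static boolean balanced(String s) {
        int depth = 0;                        // the "stack" a plain regex does not have
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')') depth--;
            if (depth < 0) return false;      // a ")" appeared before its "("
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(balanced("(())()"));   // true
        System.out.println(balanced("(()"));      // false
    }
}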
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted, and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex, you need to either attempt to logically break it up into smaller regular expressions that match portions of your text, or you need to start looking at other methods to solve your problem. Similarly, there are simply problems that regular expressions, due to their nature, can't solve (as one poster said, problems that do not adhere to a regular language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
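A hedged Java sketch of that division of labour, with the regex producing tokens and separate code tracking how the brace tokens relate (the token pattern is my own):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BraceTokens {
    public static void main(String[] args) {
        String source = "if (x) { y = f(z); }";

        // The regex only tokenizes; it knows nothing about pairing.
        Pattern token = Pattern.compile("[{}()]|[A-Za-z_]\\w*|\\d+|[=;]");
        Matcher m = token.matcher(source);

        // The "parser" part: a depth counter relates brace tokens to each other.
        int depth = 0;
        while (m.find()) {
            String t = m.group();
            if (t.equals("{")) depth++;
            if (t.equals("}")) depth--;
            System.out.println("token: " + t + "   (brace depth " + depth + ")");
        }
    }
}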
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign that you should stop using regexps is this: if you have many grouping braces '()' and many alternatives '|', then you are trying to do (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc., and soon you have a parser that is hard to read, hard to modify, and hard to reason about in terms of its properties (e.g. is there an input on which this parser will run in exponential time).
This is the time to stop regexing and start parsing (with a hand-made parser, parser generators, or parser combinators).
Along with monstrous expressions, there are fundamental limitations on the languages that can be handled by a regexp.
For instance, you cannot write a regexp for the words consisting of n characters a followed by n characters b, where n can be anything; more strictly, the language {a^n b^n : n >= 0} is not regular.
In many languages the regexp syntax is an extension of regular languages, but then matching time can become extremely large and such code is non-portable.
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions), the maintainability and readability limit is surpassed a lot earlier than the technical limit in most cases.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid, but I often lament not being able to do database-type queries using regular expressions, now more than before because I am entering those types of search strings all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression".
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
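For what it's worth, regex engines that support lookahead (Java, PCRE and the like; as far as I know Emacs's own regexp syntax does not) can combine the two orderings into one pattern. A hedged Java sketch:
import java.util.regex.Pattern;

public class BothWords {
    public static void main(String[] args) {
        // Two lookaheads require both words to appear, in either order.
        Pattern both = Pattern.compile("(?=.*Buffer)(?=.*Window)", Pattern.CASE_INSENSITIVE);

        System.out.println(both.matcher("switch-to-buffer-other-window").find()); // true
        System.out.println(both.matcher("kill-buffer").find());                   // false
    }
}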