A successor to regex? [closed] - regex

Looking at some of the regex questions commonly asked on SO, it seems to me there's a number of areas where the traditional regex syntax is falling short of the kind of tasks people are looking for it to do nowadays. For instance:
I want to match a number between 1 and 31, how do I do that ?
The usual answer is don't use regex for this, use normal conditional comparisons. That's fine if you've got just the number by itself, but not so great when you want to match the number as part of a longer string. Why can't we write something like \d{1~31}, and either modify the regex to do some form of counting or have the regex engine internally translate it into [1-9]|[12]\d|3[01] ?
How do I match an even/odd number of occurrences of a specific string ?
This results in a very messy regex, it would be great to be able to just do (mytext){Odd}.
How do I parse XML with regex ?
We all know that's a bad idea, but this and similar tasks would be easier if the [^ ] operator wasn't limited to just a single character. It'd be nice to be able to do <name>(.*)[^(</name>)]
How do I validate an email with regex ?
Very commonly done and yet very complex to do correctly with regex. It'd save everyone having to re-invent the wheel if a syntax like {IsEmail} could be used instead.
I'm sure there are others that would be useful too. I don't know enough about regex internals to know how easy these would be to implement, or if it would even be possible. Implementing some form of counting (to solve the first two problems) may mean it's not technically a 'regular expression' anymore, but it sure would be useful.
Is a 'regex 2.0' syntax desirable, is it technically possible, and is anyone working on anything like this?

I believe Larry Wall covered this with Perl 6 regexes. The basic idea is to replace simple regular expressions with more useful grammar rules. They're easier to read, and it's easier to put code in for things like making sure that you have a specific number of matches. Plus, you can name rules like IsEmail. I can't possibly list all the details here, but suffice it to say, it sounds like what you're suggesting.
Here are some examples from http://dev.perl.org/perl6/doc/design/exe/E05.html:
Matching IP address:
token quad { (\d**1..3) <?{ $1 < 256 }> }
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
Matching nested parentheses:
$str =~ m/ \( [ <-[()]> + : | <self> ]* \) /;
Annotated:
$str =~ m/ <'('> # Match a literal '('
[ # Start a non-capturing group
<-[()]> + # Match a non-paren (repeatedly)
: # ...and never backtrack that match
| # Or
<self> # Recursively match entire pattern
]* # Close group and match repeatedly
<')'> # Match a literal ')'
/;

Don't blame the tool, blame the user.
Regular Expressions were built for matching patterns in strings. That's it.
It was not made for:
Integer validation
Markup language parsing
Very complex validation (e.g. RFC 2822)
Exact string comparison
Spelling correction
Vector computation
Genetic decoding
Miracle making
Baby saving
Finance administering
Sub-atomic partitioning
Flux capacitor activating
Warp core engaging
Time traveling
Headache inducing
Never mind that last one. It seems that regular expressions are very well adapted to that last task when they are used where they shouldn't be.
Should we redesign the screwdriver because it can't nail? NO, use a hammer.
Simply use the proper tool for the task. Stop using regular expressions for tasks which they don't qualify for.
I want to match a number between 1 and 31, how do I do that?
Use your language constructs to try to convert the string to an integer and do the appropriate comparisons.
How do I match an even/odd number of occurrences of a specific string?
Regular expressions are not a string parser. You can however extract the relevant part with a regular expression if you only need to parse a sub-section of the original string.
How do I parse XML with regex?
You don't. Use an XML or an HTML parser depending on your needs. Also, an XML parser can't do the job of an HTML parser (unless you have a perfectly formed XHTML document) and the reverse is also true.
How do I validate an email with regex?
You either use this large abomination or you do it properly with a parser.
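The "extract with a regex, validate with ordinary code" advice above can be sketched as follows. This is only an illustration in Python; the function name days_in is made up for the example.

```python
import re

def days_in(text):
    """Return the integers found in `text` that lie in the range 1..31."""
    # The regex only locates runs of digits; the range check is plain code.
    return [int(d) for d in re.findall(r"\d+", text) if 1 <= int(d) <= 31]
```

The regex stays trivial, and the numeric rule lives where it is easy to read and change.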

All of those are reasonably possible in Perl.
To match a 1..31 with a regex pattern:
/( [0-9]+ ) (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x
To generate something like [1-9]|[12]\d|3[01]:
use Regexp::Assemble qw( );
my $ra = Regexp::Assemble->new();
$ra->add($_) for (1..31);
my $re = $ra->re; # qr/(?:[456789]|3[01]?|1\d?|2\d?)/
Perl 5.10+ uses tries to optimise alternations, so the following should be sufficient:
my $re = join '|', 1..31;
$re = qr/$re/;
To match an even number of occurrences:
/ (?: pat{2} )* /x
To match an odd number of occurrences:
/ pat (?: pat{2} )* /x
Pattern negative match:
/<name> (.*?) </name>/x # Non-greedy matching
/<name> ( (?: (?!</name>). )* ) </name>/x
To get a pattern matching email addresses:
use Regexp::Common qw( Email::Address );
/$RE{Email}{Address}/
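For readers outside Perl, the same ideas can be sketched in Python. This is a rough equivalent rather than a translation: the helper names are invented for the example, and `pat` is assumed to already be a valid regex fragment.

```python
import re

# Alternation covering 1..31, listed from 31 down so longer numbers are
# tried first. The lookarounds keep it from matching inside a longer number.
DAY = "|".join(str(n) for n in range(31, 0, -1))
day_re = re.compile(rf"(?<!\d)(?:{DAY})(?!\d)")

def has_even_count(pat, text):
    """True if `text` is zero or more *pairs* of `pat` and nothing else."""
    return re.fullmatch(rf"(?:(?:{pat}){{2}})*", text) is not None

def has_odd_count(pat, text):
    """True if `text` is one `pat` followed by zero or more pairs of it."""
    return re.fullmatch(rf"(?:{pat})(?:(?:{pat}){{2}})*", text) is not None
```

Note that "even" includes zero occurrences, exactly as in the Perl pattern above.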

It probably already exists, and has for a long time. It's called "grammars". Ever heard of yacc and lex? What is needed now is something simple. Strange as it may appear, the big strength of regexes is that they are very simple to write on the spot.
I believe in some (but large) specialized areas there is already what is needed. I'm thinking of XPath syntax.
Is there a larger (not limited to XML but still simple) alternative around that could cover all cases ? Maybe you should take a look at perl 6 grammars.

No. We should leave regular expressions as is. They are already far too complicated. When was the last time you thought you had nailed it, i.e., got the whole extended regex syntax (choose your flavour) loaded in your squashy memory?
The theory behind regexes is nice and simple. But then we wanted this and that to go with it. The tool is useful, but falls short on non-regular matching. That is ok!
What most people miss, is that context-free grammars and little specialized interpreters are really easy to write.
Instead of making regexes more difficult, we should be rooting for parser support in standard libraries for our languages of choice!

Related

auto-generating substitution in perl

I'm trying to autogenerate a regex pattern in perl based on some input, to handle various variables that are created by token pasting in a Makefile... So, for example, I might have a pattern such as:
foo_1_$(AB)_$(CB)
Given this pattern, I need to create a regex that will convert all instances of foo_1_\$(\w+)_\$(\w+) to bar_1_\$($1)_\$($2). The main issue I'm having is with the "to" side of the pattern -- I need to increment the $ number reference each time -- notice that there may be a variable number of tokens in any given pattern.
So... I'm thinking something like this:
foreach $pattern (@patterns) {
my $from = $pattern;
# foo_1_$(AB)_$(CD)
$from =~ s/\$\(\w+\)/\$\(\\w\\\+\)/g;
# foo_1_$(\w+)_$(\w+)
my $to = $pattern =~ s/foo/bar/r;
# bar_1_$(AB)_$(CD);
$to =~ s/\$\(\w+\)/\\\$\(\$?)/g; #???
# bar_1_\$($1)_\$($2)
# ^ ^
#this next part is done outside of this loop, but for the example code:
$line =~ s/\Q$from\E/$to/;
}
How do I cause each subsequent replacement in my to to have an incremental index?
Writing code to generate regex off of a given pattern is a complex undertaking (except in simplest cases), and that's when it is precisely specified what that pattern can be. In this case I also don't see why one can't solve the problem by writing the regex for a given type of a pattern (instead of writing code that would write regex).†
In either case one would need those regex so here's some of that. Since no precise rules for what the patterns may be are given, I use some basic assumptions drawn from hints in the question.
I take it that the pattern to replace (foo_) is followed by a number, and then by the pattern _$(AB) (literal dollar and parens with chars inside), repeated any number of times ("there may be a variable number of tokens").
One way to approach this is by matching the whole following pattern (all repetitions). With lookahead
s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/;
A simple minded test in a one-liner
perl -wE'$_=q{foo_1_$(AB)_$(CB)}; s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/; say'
replaces foo with XXX. It also works when there is only one group _$(AB), or more than two.
This does not match the lone foo_1, without following _$(AB), decided based on the "spirit" of the question (since such a requirement is not spelled out). If such a case in fact should be matched as well then that is possible with a few small changes (mostly related to moving _ into the pattern to be replaced, as optional ([a-z]+_[0-9]+_?))
Update If the "tokens" that follow foo_ (to be replaced) can in fact be anything (so not necessarily $(..)), except that they are strung together with _, then we can use a modification like
s/[a-z]+_(\d?)(?=(_[^_]+)*)/XXX_$1/;
where the number after foo_ is optional, per example given in a comment. But then it's simpler
s/[a-z]+(?=(_[^_]+)*)/XXX/;
Example
perl -wE'
$_=q{foo_$(AB)_123_$(CD)_foo_$(EF)}; say;
s/[a-z]+(?=(_[^_]+)*)/XXX/; say'
prints
foo_$(AB)_123_$(CD)_foo_$(EF)
XXX_$(AB)_123_$(CD)_foo_$(EF)
Note: what the above regex does is also done by s/[a-z]+(?=_)/XXX/. However, the more detailed regex above can be tweaked and adapted for more precise requirements and I'd use that, or its variations, as a main building block for complete solutions.
If the rules for what may be a pattern are less structured (less than "any tokens connected with _") then we need to know them, and probably very precisely.
This clearly doesn't generate the regex from a given pattern, as asked, but is a regex to match such a (class of) patterns. That can solve the problem given sufficient specification for what those patterns may be like -- which would be necessary for regex generation as well.
† Another option is that some templating system is used but then you are again directly writing regex to match given types of patterns.
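For the narrow case in the question (literal text with $(NAME) tokens, foo renamed to bar), the numbering of the back-references can be generated with a counter. Below is a minimal sketch in Python rather than Perl; build_substitution and its old/new parameters are hypothetical names, and the code assumes the literal parts of the pattern contain no backslashes.

```python
import re
from itertools import count

TOKEN = re.compile(r"\$\(\w+\)")  # a Makefile-style reference such as $(AB)

def build_substitution(pattern, old="foo", new="bar"):
    """Return (compiled "from" regex, "to" template) for one pattern."""
    # Literal text between tokens is escaped; each token becomes a capture group.
    literals = TOKEN.split(pattern)
    from_re = r"\$\((\w+)\)".join(re.escape(part) for part in literals)
    # Number the tokens on the "to" side in order: $(\1), $(\2), ...
    index = count(1)
    renamed = pattern.replace(old, new, 1)
    to = TOKEN.sub(lambda m: "$(\\%d)" % next(index), renamed)
    return re.compile(from_re), to

from_re, to = build_substitution("foo_1_$(AB)_$(CB)")
result = from_re.sub(to, "OBJS = foo_1_$(X)_$(Y).o")
```

The counter is what answers the "incremental index" part of the question: each token replaced on the "to" side takes the next number, however many tokens the pattern has.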

Regex for nested matches

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns the whole string cos(t(2))+t(51), which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".
Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in php, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos.
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is fairly simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.
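If your engine has no recursion (Python's standard re module, for example, has no (?1)), the same overlapping results can be collected with a small hand-written scan. This is a different technique from the recursive regex above, shown only as a sketch: a regex finds each "name(" and a depth counter finds the matching close. Unlike the PCRE pattern, it accepts any contents between the parentheses, not just digits and nested calls.

```python
import re

def nested_calls(expr):
    """Return every `name(...)` substring whose parentheses balance."""
    found = []
    for m in re.finditer(r"[A-Za-z]\w*\(", expr):
        depth = 0
        # Start counting at the opening parenthesis that ends the match.
        for i in range(m.end() - 1, len(expr)):
            if expr[i] == "(":
                depth += 1
            elif expr[i] == ")":
                depth -= 1
                if depth == 0:
                    found.append(expr[m.start():i + 1])
                    break
    return found
```

Calls whose parentheses never close are silently skipped.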

regex taking a long time

I have the following script, which grabs a webpage, then does a regex to find items I'm looking for:
use warnings;
use strict;
use LWP::Simple;
my $content=get('http://mytempscripts.com/2011/09/temporary-post.html') or die $!;
$content=~s/\n//g;
$content=~s/&nbsp;/ /g;
$content=~/<b>this is a temp post<\/b><br \/><br \/>(.*?)<div style='clear: both;'><\/div>/;
my $temp=$1;
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}
This works, but takes a long, long time. When I shorten the regex to the following, I get the results in less than a second. Why does my original regex take so long? How do I correct it?
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}
Regular expressions are like the sort function in Perl. You think it's pretty simple because it's just a single command, but in the end, it uses a lot of processing power to do the job.
There are certain things you can do to help out:
Keep your syntax as simple as possible.
Precompile your regular expression pattern by using qr// if you're using that regular expression in a loop. That'll prevent Perl from having to compile your regular expression with each loop.
Try to avoid regular expression syntax that has to do backtracking. This usually ends up being the most general matching patterns (such as .*).
The wretched truth is that after decades of writing in Perl, I've never mastered the deep dark secrets of regular expression parsing. I've tried many times to understand it, but that usually means doing research on the Web, and ...well... I get distracted by all of the other stuff on the Web.
And, it's not that difficult, any half decent developer with an IQ of 240, and a penchant for sadism should easily be able to pick it up.
@David W.: I guess I'm confused on backtracking. I had to read your link several times but still don't quite understand how to implement it (or, not implement it) in my case. – user522962
Let's take a simple example:
my $string = 'foobarfubar';
$string =~ /foo.*bar.*(.+)/;
my $result = $1;
What will $result be? It will be r. You see how that works? Let's see what happens.
Originally, the regular expression is broken into tokens, and the first token foo.* is used. That actually matches the whole string:
"foobarfubar" =~ /foo.*/
However, if the first regular expression token captures the whole string, the rest of the regular expression fails. Therefore, the regular expression matching algorithm has to backtrack:
"foobarfubar" =~ /foo.*/ #/bar.*/ doesn't match
"foobarfuba" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfub" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfu" =~ /foo.*/ #/bar.*/ matches here, but nothing is left for the final /(.+)/, so keep going.
"foobarf" =~ /foo.*/ #/bar.*/ doesn't match.
"foobar" =~ /foo.*/ #/bar.*/ doesn't match.
...
"foo" =~ /foo.*/ #Now /bar.*/ can match!
Now, the same happens for the rest of the string:
"foobarfubar" =~ /foo.*bar.*/ #But the final /.+/ doesn't match
"foobarfuba" =~ /foo.*bar.*/ #And the final /.+/ can match the "r"!
Backtracking tends to happen with the .* and .+ expression since they're so loose. I see you're using non-greedy matches which can help, but it can still be an issue if you are not careful -- especially if you have very long and complex regular expressions.
I hope this helps explain backtracking.
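The same experiment is easy to repeat in other engines. Here is a short Python sketch of it, with a non-greedy variant added for comparison (the variable names are just for the example):

```python
import re

s = "foobarfubar"

# Greedy: each .* grabs as much as it can, then gives characters back
# until the rest of the pattern can match.
greedy = re.search(r"foo.*bar.*(.+)", s)

# Non-greedy: each .*? takes as little as it can, so the first "bar" is
# used and the final group gets everything after it.
lazy = re.search(r"foo.*?bar.*?(.+)", s)
```

Here greedy.group(1) is "r", as in the Perl example, while lazy.group(1) is "fubar".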
The issue you're running into isn't that your program doesn't work, but that it takes a long, long time.
I was hoping that the general gist of my answer is that regular expression parsing isn't as simple as Perl makes it out to be. I can see the command sort @foo; in a program, but forget that if @foo contains a million or so entries, it might take a while. In theory, Perl could be using a bubble sort, which would make the algorithm O(n²). I hope that Perl is actually using a more efficient algorithm and my actual time will be closer to O(n log n). However, all this is hidden by my simple one-line statement.
I don't know if backtracking is an issue in your case, but you're treating an entire webpage output as a single string to match against a regular expression, which could result in a very long string. You then attempt to match it against another regular expression, which you do over and over again. Apparently, that is quite a process-intensive step which is hidden by the fact that it's a single Perl statement (much like sort @foo hides its complexity).
Thinking about this on and off over the weekend, you really should not attempt to parse HTML or XML with regular expressions because it is so sloppy. You end up with something rather inefficient and fragile.
In cases like this you may be better off using something like HTML::Parser, or XML::Simple, which I'm more familiar with but which doesn't necessarily work with poorly formatted HTML.
Perl regular expressions are nice, but they can easily get out of our control.
One thing you might try is changing all the capture groups (...) to non-capture groups (?:...)
That will save some effort for the matcher if all you need to do is print out "found a match", but I'm not sure you can do that in reality if your real code does more.
Also, just generally speaking having a lot of wildcards like (.*?) is just going to add weight I think, so maybe knowing what you are trying to match you will be able to eliminate some of those? I can't say for sure; don't see any purely formal optimizations here.
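To make the last two suggestions concrete, here is a sketch in Python of what a tightened pattern might look like. It rests on an assumption the question does not confirm: that each item is a whitespace-delimited token containing at least one digit. The names ITEM, THREE_ITEMS and count_triples are invented for the example.

```python
import re

# Hypothetical tightened pattern (assumption: each item is a run of
# non-space characters with at least one digit, followed by whitespace).
# Negated classes replace the open-ended .*? wildcards, and the group is
# non-capturing because only the fact of a match is needed.
ITEM = r"[^\s\d]*\d+\S*\s+"
THREE_ITEMS = re.compile(rf"(?:{ITEM}){{3}}")  # compiled once, reused

def count_triples(text):
    """Count non-overlapping runs of three consecutive items."""
    return sum(1 for _ in THREE_ITEMS.finditer(text))
```

Because each piece of the pattern can only match one kind of character, the engine has far fewer ways to split the input and so far less to backtrack over.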

Regular Expression to find words in varying orders

I am searching for a way to model a RegEx which would give a match for both of these strings when searched for "sun shining".
the sun is shining
a shining sun is nice
I'd use positive lookaheads for each word, like this (and you can add as many as you like):
(?=.*?\bsun\b)(?=.*?\bshining\b).*
Basic regular expressions don't handle differing orders of words very well. There are ways to do it but the regular expressions become ugly and unreadable to all but the regex gurus. I prefer to opt for readability in most cases myself.
My advice would be to use a simple or variant, something like:
sun.+shining|shining.+sun
with word boundaries if necessary:
\bsun\b.+\bshining\b|\bshining\b.+\bsun\b
As Lucero points out, this will become unwieldy as the number of words you're searching for increases, in which case I would go for the multiple regex match solution:
def hasAllWords (string, words[]):
    for each word in words[]:
        if not string.match ("\b" + word + "\b"):
            return false
    return true
That pseudo-code will run a check for each word and ensure that all of them appear.
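A runnable version of that idea, sketched in Python, together with a helper that builds the lookahead pattern from the earlier answer for any list of words. Both function names are made up for the example; re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re

def has_all_words(text, words):
    """True if every word occurs in `text` as a whole word, in any order."""
    return all(re.search(rf"\b{re.escape(w)}\b", text) for w in words)

def lookahead_pattern(words):
    """Build one regex with a positive lookahead per word."""
    return re.compile("".join(rf"(?=.*?\b{re.escape(w)}\b)" for w in words))
```

For example, has_all_words("a shining sun is nice", ["sun", "shining"]) is true, while "a sunny day is shining" fails because of the word boundaries.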
You will need to use a regular expression that considers every permutation like this:
\b(sun\b.+\bshining|shining\b.+\bsun)\b
Here the word boundaries \b are used to only match the words sun and shining and no sub-words like in “sunny”.
You use two regexes.
if ( ( $line =~ /\bsun\b.+\bshining\b/ ) ||
( $line =~ /\bshining\b.+\bsun\b/ ) ) {
# do whatever
}
Sometimes you have to do what seems to be low-tech. Other answers to this question will have you building complex regexes with alternation and lookahead and whatever, but sometimes the best way is to do it the simplest way, and in this case, it's to use two different regexes.
Don't worry about execution speed. Unless you benchmark this solution against other more complicated single-expression solutions, you don't know which is faster. It's incredibly easy to write slow regexes.

What are good regular expressions?

I have worked for 5 years mainly in java desktop applications accessing Oracle databases and I have never used regular expressions. Now I enter Stack Overflow and I see a lot of questions about them; I feel like I missed something.
For what do you use regular expressions?
P.S. sorry for my bad english
Consider an example in Ruby:
puts "Matched!" unless /\d{3}-\d{4}/.match("555-1234").nil?
puts "Didn't match!" if /\d{3}-\d{4}/.match("Not phone number").nil?
The "/\d{3}-\d{4}/" is the regular expression, and as you can see it is a VERY concise way of finding a match in a string.
Furthermore, using groups you can extract information, as such:
match = /([^@]*)@(.*)/.match("myaddress@domain.com")
name = match[1]
domain = match[2]
Here, the parentheses in the regular expression mark capturing groups, so you can see exactly WHAT data you matched and do further processing on it.
This is just the tip of the iceberg... there are many many different things you can do in a regular expression that makes processing text REALLY easy.
Regular Expressions (or Regex) are used to pattern match in strings. You can thus pull out all email addresses from a piece of text because it follows a specific pattern.
In some cases regular expressions are enclosed in forward-slashes and after the second slash are placed options such as case-insensitivity. Here's a good one :)
/(bb|[^b]{2})/i
Spoken it can read "2 be or not 2 be".
The first part are the (brackets), they are split by the pipe | character which equates to an or statement so (a|b) matches "a" or "b". The first half of the piped area matches "bb". The second half's name I don't know but it's the square brackets, they match anything that is not "b", that's why there is a roof symbol thingie (technical term) there. The squiggly brackets match a count of the things before them, in this case two characters that are not "b".
After the second / is an "i" which makes it case insensitive. Use of the start and end slashes is environment specific, sometimes you do and sometimes you do not.
Two links that I think you will find handy for this are
regular-expressions.info
Wikipedia - Regular expression
Coolest regular expression ever:
/^1?$|^(11+?)\1+$/
It tests if a number is prime. And it works!!
N.B.: to make it work, a bit of set-up is needed; the number that we want to test has to be converted into a string of “1”s first, then we can apply the expression to test if the string does not contain a prime number of “1”s:
def is_prime(n)
str = "1" * n
return str !~ /^1?$|^(11+?)\1+$/
end
There’s a detailed and very approachable explanation over at Avinash Meetoo’s blog.
If you want to learn about regular expressions, I recommend Mastering Regular Expressions. It goes from the very basic concepts all the way up to how different engines work underneath. The last four chapters are each dedicated to one of PHP, .NET, Perl, and Java. I learned a lot from it, and still use it as a reference.
If you're just starting out with regular expressions, I heartily recommend a tool like The Regex Coach:
http://www.weitz.de/regex-coach/
I've also heard good things about RegexBuddy:
http://www.regexbuddy.com/
As you may know, Oracle now has regular expressions: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html. I have used the new functionality in a few queries, but it hasn't been as useful as in other contexts. The reason, I believe, is that regular expressions are best suited for finding structured data buried within unstructured data.
For instance, I might use a regex to find Oracle messages that are stuffed in a log file. It isn't possible to know where the messages are, only what they look like. So a regex is the best solution to that problem. When you work with a relational database, the data is usually pre-structured, so a regex doesn't shine in that context.
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
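To see the wildcard-to-regex correspondence in practice: Python's standard library can do the translation for you with fnmatch.translate. A small sketch follows; the exact regex text it produces differs between Python versions, so only the matching behaviour is relied on here.

```python
import fnmatch
import re

# Turn the shell-style wildcard into an equivalent regex and compile it.
txt_files = re.compile(fnmatch.translate("*.txt"))

def is_txt(name):
    """True if `name` matches the wildcard *.txt."""
    return txt_files.match(name) is not None
```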
A great resource for regular expressions: http://www.regular-expressions.info
These RE's are specific to Visual Studio and C++ but I've found them helpful at times:
Find all occurrences of "routineName" with non-default params passed:
routineName\(:a+\)
Conversely to find all occurrences of "routineName" with only defaults:
routineName\(\)
To find code enabled (or disabled) in a debug build:
\#if.*_DEBUG
Note that this will catch all the variants: ifdef, if defined, ifndef, if !defined
Validating strong passwords:
This one will validate a password with a length of 5 to 10 alphanumerical characters, with at least one upper case, one lower case and one digit:
^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}$
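A quick way to try this pattern out, sketched in Python. The function name is_strong is just for the example, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Three lookaheads assert "at least one upper, one lower, one digit";
# the character class then enforces alphanumerics only, length 5 to 10.
STRONG = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])[a-zA-Z0-9]{5,10}")

def is_strong(password):
    """True if `password` satisfies the rule described above."""
    return STRONG.fullmatch(password) is not None
```

Each lookahead checks one requirement without consuming any characters, which is why the order of the character types in the password doesn't matter.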