regex taking a long time - regex

I have the following script, which grabs a webpage, then does a regex to find items I'm looking for:
use warnings;
use strict;
use LWP::Simple;
my $content=get('http://mytempscripts.com/2011/09/temporary-post.html') or die $!;
$content=~s/\n//g;
$content=~s/&nbsp;/ /g;
$content=~/<b>this is a temp post<\/b><br \/><br \/>(.*?)<div style='clear: both;'><\/div>/;
my $temp=$1;
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}
This works, but takes a long, long time. When I shorten the regex to the following, I get the results in less than a second. Why does my original regex take so long? How do I correct it?
while($temp=~/((.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9]{1,})(.*?)\s+(.*?)([0-9] {1,})(.*?)\s+)/g){
print "found a match\n";
}

Regular expressions are like the sort function in Perl. You think it's pretty simple because it's just a single command, but in the end, it uses a lot of processing power to do the job.
There are certain things you can do to help out:
Keep your syntax as simple as possible.
Precompile your regular expression pattern by using qr// if you're using that regular expression in a loop. That'll prevent Perl from having to compile your regular expression with each loop.
Try to avoid regular expression syntax that has to do backtracking. This usually ends up being the most general matching patterns (such as .*).
The wretched truth is that after decades of writing in Perl, I've never mastered the deep dark secrets of regular expression parsing. I've tried many times to understand it, but that usually means doing research on the Web, and ...well... I get distracted by all of the other stuff on the Web.
And it's not that difficult: any half-decent developer with an IQ of 240 and a penchant for sadism should easily be able to pick it up.
@David W.: I guess I'm confused on backtracking. I had to read your link several times but still don't quite understand how to implement it (or, not implement it) in my case. – user522962
Let's take a simple example:
my $string = 'foobarfubar';
$string =~ /foo.*bar.*(.+)/;
my $result = $1;
What will $result be? It will be r. You see how that works? Let's see what happens.
Originally, the regular expression is broken into tokens, and the first token foo.* is used. That actually matches the whole string:
"foobarfubar" =~ /foo.*/
However, if the first regular expression token captures the whole string, the rest of the regular expression fails. Therefore, the regular expression matching algorithm has to backtrack:
"foobarfubar" =~ /foo.*/ #/bar.*/ doesn't match
"foobarfuba" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfub" =~ /foo.*/ #/bar.*/ doesn't match.
"foobarfu" =~ /foo.*/ #/bar.*/ matches the last "bar", but nothing is left for /(.+)/.
"foobarf" =~ /foo.*/ #/bar.*/ doesn't match.
"foobar" =~ /foo.*/ #/bar.*/ doesn't match.
...
"foo" =~ /foo.*/ #Now /bar.*/ can match!
Now, the same happens for the rest of the string:
"foobarfubar" =~ /foo.*bar.*/ #But the final /.+/ doesn't match
"foobarfuba" =~ /foo.*bar.*/ #And the final /.+/ can match the "r"!
Backtracking tends to happen with the .* and .+ expression since they're so loose. I see you're using non-greedy matches which can help, but it can still be an issue if you are not careful -- especially if you have very long and complex regular expressions.
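To see the difference on the example string, here is a small sketch that runs the greedy pattern next to a non-greedy variant. The non-greedy pattern and the variable names are mine, added only for comparison.

```perl
use strict;
use warnings;

my $string = 'foobarfubar';

# Greedy: each .* grabs as much as it can, then gives characters back
# one at a time until the rest of the pattern fits.
my ($greedy) = $string =~ /foo.*bar.*(.+)/;

# Non-greedy: each .*? starts out empty and grows only when it must.
my ($lazy) = $string =~ /foo.*?bar.*?(.+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version leaves only the final r for the capture group, while the non-greedy version stops at the first bar and captures fubar.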
I hope this helps explain backtracking.
The issue you're running into isn't that your program doesn't work, but that it takes a long, long time.
I was hoping that the general gist of my answer is that regular expression parsing isn't as simple as Perl makes it out to be. I can see the command sort @foo; in a program, but forget that if @foo contains a million or so entries, it might take a while. In theory, Perl could be using a bubble sort, and thus the algorithm would be O(n^2). I hope that Perl is actually using a more efficient algorithm and my actual time will be closer to O(n log n). However, all this is hidden by my simple one-line statement.
I don't know if backtracking is an issue in your case, but you're treating an entire webpage output as a single string to match against a regular expression, which could result in a very long string. You then attempt to match it against another regular expression, which you do over and over again. Apparently, that is quite a processing-intensive step which is hidden by the fact it's a single Perl statement (much like sort @foo hides its complexity).
Thinking about this on and off over the weekend, you really should not attempt to parse HTML or XML with regular expressions because it is so sloppy. You end up with something rather inefficient and fragile.
In cases like this, you may be better off using something like HTML::Parser, or XML::Simple, which I'm more familiar with but which doesn't necessarily work with poorly formatted HTML.
Perl regular expressions are nice, but they can easily get out of our control.

One thing you might try is changing all the capture groups (...) to non-capture groups (?:...)
That will save some effort for the matcher if all you need to do is print out "found a match", but I'm not sure you can do that in reality if your real code does more.
Also, just generally speaking having a lot of wildcards like (.*?) is just going to add weight I think, so maybe knowing what you are trying to match you will be able to eliminate some of those? I can't say for sure; don't see any purely formal optimizations here.
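As a rough sketch of what "eliminating wildcards" could look like: the pattern below assumes that each repeated unit in the question means "some non-digits, a number, the rest of that word, then whitespace". That reading is a guess on my part, the sample text is made up, and the result is not strictly equivalent to the original regex.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for $temp from the question.
my $temp = "item 12 abc  x 7 y  foo 345 bar\n";

# One unit: optional non-digits, a run of digits, any trailing
# non-space characters, then whitespace. There are no capture groups
# and no .*? that could match the same text in several different ways.
my $unit = qr/\D*\d+\S*\s+/;

# Three units in a row, as in the original pattern.
my $re = qr/(?:$unit){3}/;

my $count = 0;
$count++ while $temp =~ /$re/g;
print "found $count match(es)\n";
```

Because \D, \d, \S and \s cannot overlap the way consecutive (.*?) groups can, the engine has far fewer ways to split the same text, which is usually where the time goes.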

Related

auto-generating substitution in perl

I'm trying to autogenerate a regex pattern in perl based on some input, to handle various variables that are created by token pasting in a Makefile... So, for example, I might have a pattern such as:
foo_1_$(AB)_$(CB)
Given this pattern, I need to create a regex that will convert all instances of foo_1_\$(\w+)_\$(\w+) to bar_1_\$($1)_\$($2). The main issue I'm having is with the "to" side of the pattern -- I need to increment the $ number reference each time -- notice that there may be a variable number of tokens in any given pattern.
So... I'm thinking something like this:
foreach $pattern (@patterns) {
my $from = $pattern;
# foo_1_$(AB)_$(CD)
$from =~ s/\$\(\w+\)/\$\(\\w\\\+\)/g;
# foo_1_$(\w+)_$(\w+)
my $to = $pattern =~ s/foo/bar/r;
# bar_1_$(AB)_$(CD);
$to =~ s/\$\(\w+\)/\\\$\(\$?)/g; #???
# bar_1_\$($1)_\$($2)
# ^ ^
#this next part is done outside of this loop, but for the example code:
$line =~ s/\Q$from\E/$to/;
}
How do I cause each subsequent replacement in my to to have an incremental index?
Writing code to generate regex off of a given pattern is a complex undertaking (except in simplest cases), and that's when it is precisely specified what that pattern can be. In this case I also don't see why one can't solve the problem by writing the regex for a given type of a pattern (instead of writing code that would write regex).†
In either case one would need those regex so here's some of that. Since no precise rules for what the patterns may be are given, I use some basic assumptions drawn from hints in the question.
I take it that the pattern to replace (foo_) is followed by a number, and then by the pattern _$(AB) (literal dollar and parens with chars inside), repeated any number of times ("there may be a variable number of tokens").
One way to approach this is by matching the whole following pattern (all repetitions). With lookahead
s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/;
A simple minded test in a one-liner
perl -wE'$_=q{foo_1_$(AB)_$(CB)}; s/[a-z]+_([0-9]+)(?=_(\$\(\w+\))+)/XXX_$1/; say'
replaces foo with XXX. It works for only one group _$(AB), and for more than two, as well.
This does not match the lone foo_1, without following _$(AB), decided based on the "spirit" of the question (since such a requirement is not spelled out). If such a case in fact should be matched as well then that is possible with a few small changes (mostly related to moving _ into the pattern to be replaced, as optional ([a-z]+_[0-9]+_?))
Update If the "tokens" that follow foo_ (to be replaced) can in fact be anything (so not necessarily $(..)), except that they are strung together with _, then we can use a modification like
s/[a-z]+_(\d?)(?=(_[^_]+)*)/XXX_$1/;
where the number after foo_ is optional, per example given in a comment. But then it's simpler
s/[a-z]+(?=(_[^_]+)*)/XXX/;
Example
perl -wE'
$_=q{foo_$(AB)_123_$(CD)_foo_$(EF)}; say;
s/[a-z]+(?=(_[^_]+)*)/XXX/; say'
prints
foo_$(AB)_123_$(CD)_foo_$(EF)
XXX_$(AB)_123_$(CD)_foo_$(EF)
Note: what the above regex does is also done by s/[a-z]+(?=_)/XXX/. However, the more detailed regex above can be tweaked and adapted for more precise requirements and I'd use that, or its variations, as a main building block for complete solutions.
If the rules for what may be a pattern are less structured (less than "any tokens connected with _") then we need to know them, and probably very precisely.
This clearly doesn't generate the regex from a given pattern, as asked, but is a regex to match such a (class of) patterns. That can solve the problem given sufficient specification for what those patterns may be like -- which would be necessary for regex generation as well.
† Another option is that some templating system is used but then you are again directly writing regex to match given types of patterns.
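On the narrower question of numbering the references in the "to" string, one possibility is a counter combined with the /e modifier. This is only a sketch: it uses the sample pattern from the question, the variable $i is my own, and it covers building the string only, not applying it (a string containing $1 is not expanded as a capture reference when it is merely interpolated into a replacement).

```perl
use strict;
use warnings;

my $pattern = 'foo_1_$(AB)_$(CB)';

# Start from the pattern with the prefix swapped.
(my $to = $pattern) =~ s/^foo/bar/;

# Replace each $(NAME) token with \$($N), where N counts up from 1.
# With /e the replacement side is Perl code, so ++$i runs per match.
my $i = 0;
$to =~ s{\$\(\w+\)}{ '\\$($' . ++$i . ')' }ge;

print "$to\n";
```

The counter does not care how many tokens the pattern contains, so the same two substitutions work for one token or for ten.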

Perl, Change the Case of Letter at { character

I am a perl newb, and just need to get something done quick and dirty.
I have lines of text (from .bib files) such as
Title = {{the Particle Swarm - Explosion, Stability, and Convergence in a Multidimensional Complex Space}},
How can I write a regex such that the first letter after the second { becomes capitalised.
Thanks
One way, for the question as asked
$string =~ s/{{\K(\w)/uc($1)/ge;
whereby /e makes it evaluate the replacement side as code. The \K makes it drop all previous matches so {{ aren't "consumed" (and thus need not be retyped in the replacement side).
If you wish to allow for possible spaces:  $string =~ s/{{\s*\K(\w)/uc($1)/ge;, and as far as I know bibtex why not allow for spaces between curlies as well, so {\s*{.
If simple capitalization is all you need then \U$1 in the replacement side suffices and there is no need for /e modifier with it, per comment by Grinnz. The \U is a generic quote-like operator, which can thus also be used in regex; see under Escape sequences in perlre, and in perlretut.
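Here are the two variants side by side, as a quick check that they agree. The sample line is shortened from the question, the braces are escaped to stay clear of brace-related warnings in newer Perls, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $string = 'Title = {{the Particle Swarm - Explosion}},';

# Variant 1: /e evaluates uc($1) as code.
(my $with_uc = $string) =~ s/\{\s*\{\s*\K(\w)/uc($1)/e;

# Variant 2: \U upper-cases what follows it; no /e needed.
(my $with_U = $string) =~ s/\{\s*\{\s*\K(\w)/\U$1/;

print "$with_uc\n$with_U\n";
```

Both lines print with "The" capitalised and everything else untouched.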
I recommend a good read through the tutorial perlretut. That will go a long way.
However, I must also ask: Are you certain that you may indeed just unleash that on your whole file? Will it catch all cases you need? Will it not clip something else you didn't mean to?

perl regular expressions substitute

I'm new to perl and I found this expressions bit difficult to understand
$a =~ s/\'/\'\"\'\"\'/g
Can someone help me understand what this piece of code does?
It is not clear which part of it is a problem. Here is a brief mention of each.
The expression $str =~ s/pattern/replacement/ is a substitution driven by a regular expression: it replaces the part of the string $str that is matched by the "pattern" with the "replacement". With /g at the end it does so for all instances of the "pattern" found in the string. There is far, far more to it -- see the tutorial, a quick start, the reference, and full information in perlre and perlop.
Some characters have a special meaning, in particular when used in the "pattern". A common set is .^$*+?()[{\|, but this depends on the flavor of regex. Also, whatever is used as a delimiter (commonly /) has to be escaped as well. See documentation and for example this post. If you want to use any one of those characters as a literal character to be found in the string, they have to be escaped by the backslash. (Also see about escaped sequences.) In your example, the ' is escaped, and so are the characters used as the replacement.
The thing is, these in fact don't need to be escaped. So this works
use strict;
use warnings;
my $str = "a'b'c";
$str =~ s/'/'"'/g;
print "$str\n";
It prints the desired a'"'b'"'c. Note that you still may escape them and it will work. One situation where ' is indeed very special is in a one liner, where it delimits the whole piece of code to submit to Perl via shell, to be run. (However, merely escaping it there does not work.) Altogether, the expression you show does the replacement just as in the answer by Jens, but with a twist that is not needed.
A note on a related feature: you can use nearly anything for delimiters, not necessarily just /. So this |patt|repl| is fine, and nicely even this is: {patt}{repl} (unless they are used inside). This is sometimes extremely handy, improving readability a lot. For one thing, then you don't have to escape the / itself inside the regex.
I hope this helps a little.
All the \ in that are useless (but harmless), so it is equivalent to s/'/'"'"'/g. It replaces every ' with '"'"' in the string.
This is often used for shell quoting. See for example https://stackoverflow.com/a/24869016/17389
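A small sketch of that shell-quoting use, with a made-up argument. It only builds and prints the quoted word; nothing is passed to a shell here.

```perl
use strict;
use warnings;

# Hypothetical argument containing a single quote.
my $arg = "it's here";

# Close the quote, emit a double-quoted single quote, reopen the quote.
(my $escaped = $arg) =~ s/'/'"'"'/g;

# Wrap the result in single quotes to get one shell word.
my $shell_word = "'$escaped'";

print "$shell_word\n";
```

A POSIX shell reads the printed word as three adjacent quoted pieces that concatenate back into the original text.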
This piece of code replaces every single quote with '"'"'
It can be simplified by removing the needless backslashes:
$a =~ s/'/'"'"'/g
Every ' will be replaced with '"'"'

A successor to regex? [closed]

Closed 12 years ago.
Looking at some of the regex questions commonly asked on SO, it seems to me there's a number of areas where the traditional regex syntax is falling short of the kind of tasks people are looking for it to do nowadays. For instance:
I want to match a number between 1 and 31, how do I do that ?
The usual answer is don't use regex for this, use normal conditional comparisons. That's fine if you've got just the number by itself, but not so great when you want to match the number as part of a longer string. Why can't we write something like \d{1~31}, and either modify the regex to do some form of counting or have the regex engine internally translate it into [1-9]|[12]\d|3[01] ?
How do I match an even/odd number of occurrences of a specific string ?
This results in a very messy regex, it would be great to be able to just do (mytext){Odd}.
How do I parse XML with regex ?
We all know that's a bad idea, but this and similar tasks would be easier if the [^ ] operator wasn't limited to just a single character. It'd be nice to be able to do <name>(.*)[^(</name>)]
How do I validate an email with regex ?
Very commonly done and yet very complex to do correctly with regex. It'd save everyone having to re-invent the wheel if a syntax like {IsEmail} could be used instead.
I'm sure there are others that would be useful too. I don't know enough about regex internals to know how easy these would be to implement, or if it would even be possible. Implementing some form of counting (to solve the first two problems) may mean it's not technically a 'regular expression' anymore, but it sure would be useful.
Is a 'regex 2.0' syntax desirable, technically possible, and is there anyone working on anything like this ?
I believe Larry Wall covered this with Perl 6 regexes. The basic idea is to replace simple regular expressions with more-useful grammar rules. They're easier to read and it's easier to put code in for things like making sure that you have a given number of matches. Plus, you can name rules like IsEmail. I can't possibly list all the details here, but suffice it to say, it sounds like what you're suggesting.
Here are some examples from http://dev.perl.org/perl6/doc/design/exe/E05.html:
Matching IP address:
token quad { (\d**1..3) <?{ $1 < 256 }> }
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
Matching nested parentheses:
$str =~ m/ \( [ <-[()]> + : | <self> ]* \) /;
Annotated:
$str =~ m/ <'('> # Match a literal '('
[ # Start a non-capturing group
<-[()]> + # Match a non-paren (repeatedly)
: # ...and never backtrack that match
| # Or
<self> # Recursively match entire pattern
]* # Close group and match repeatedly
<')'> # Match a literal ')'
/;
Don't blame the tool, blame the user.
Regular Expressions were built for matching patterns in strings. That's it.
They were not made for:
Integer validation
Markup language parsing
Very complex validation (ie.: RFC 2822)
Exact string comparison
Spelling correction
Vector computation
Genetic decoding
Miracle making
Baby saving
Finance administering
Sub-atomic partitioning
Flux capacitor activating
Warp core engaging
Time traveling
Headache inducing
Never mind that last one. It seems that regular expressions are very well adapted to doing that last task when they are being used where they shouldn't.
Should we redesign the screwdriver because it can't nail? NO, use a hammer.
Simply use the proper tool for the task. Stop using regular expressions for tasks which they don't qualify for.
I want to match a number between 1 and 31, how do I do that?
Use your language constructs to try to convert the string to an integer and do the appropriate comparisons.
How do I match an even/odd number of occurrences of a specific string?
Regular expressions are not a string parser. You can however extract the relevant part with a regular expression if you only need to parse a sub-section of the original string.
How do I parse XML with regex?
You don't. Use an XML or an HTML parser depending on your need. Also, an XML parser can't do the job of an HTML parser (unless you have a perfectly formed XHTML document) and the reverse is also true.
How do I validate an email with regex?
You either use this large abomination or you do it properly with a parser.
All of those are reasonably possible in Perl.
To match a 1..31 with a regex pattern:
/( [0-9]+ ) (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x
To generate something like [1-9]|[12]\d|3[01]:
use Regexp::Assemble qw( );
my $ra = Regexp::Assemble->new();
$ra->add($_) for (1..31);
my $re = $ra->re; # qr/(?:[456789]|3[01]?|1\d?|2\d?)/
Perl 5.10+ uses tries to optimise alternations, so the following should be sufficient:
my $re = join '|', 1..31;
$re = qr/$re/;
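A sketch of the code-condition approach applied to numbers inside longer strings. The \b anchors and the sample strings are my additions; without the anchors the engine can backtrack and accept the "4" of "45". The match is rejected when the captured number is below 1 or above 31.

```perl
use strict;
use warnings;

# Capture a whole run of digits, then fail the match unless it is 1..31.
my $day = qr/ \b ( [0-9]+ ) \b (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x;

for my $text ('due on 17 May', 'day 31 ends', 'room 45', 'day 0') {
    if ($text =~ $day) {
        print "$text: matched $1\n";
    }
    else {
        print "$text: no match\n";
    }
}
```

The (?{ ... }) block and (*FAIL) both need Perl 5.10 or later.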
To match an even number of occurrences:
/ (?: (?:pat){2} )* /x
To match an odd number of occurrences:
/ (?:pat) (?: (?:pat){2} )* /x
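A quick check of the even/odd idea, using ab as a stand-in for pat. The stand-in and the \A ... \z anchors are assumptions of mine; a quantifier binds only to the item right before it, so a multi-character pattern has to be grouped before {2} is applied.

```perl
use strict;
use warnings;

# Even: zero or more pairs of "ab" and nothing else.
my $even = qr/ \A (?: (?:ab){2} )* \z /x;

# Odd: one "ab", then zero or more pairs.
my $odd  = qr/ \A (?:ab) (?: (?:ab){2} )* \z /x;

for my $s ('', 'ab', 'abab', 'ababab') {
    my $kind = $s =~ $even ? 'even' : $s =~ $odd ? 'odd' : 'neither';
    print "'$s' is $kind\n";
}
```

Without the anchors both patterns would happily match a substring, so the count would mean very little.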
Pattern negative match:
m{<name> (.*?) </name>}x # Non-greedy matching
m{<name> ( (?: (?!</name>). )* ) </name>}x
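For illustration, the negative-lookahead form run against a made-up two-element string. It uses m{...} delimiters because the pattern itself contains a slash.

```perl
use strict;
use warnings;

# Hypothetical input with two adjacent elements.
my $xml = '<name>Ann</name><name>Bob</name>';

# Consume one character at a time, but only while the closing tag
# does not start at the current position.
my @names = $xml =~ m{<name> ( (?: (?!</name>) . )* ) </name>}xg;

print join(', ', @names), "\n";
```

Each capture stops at its own closing tag, so the two names come back separately.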
To get a pattern matching email addresses:
use Regexp::Common qw( Email::Address );
/$RE{Email}{Address}/
Probably it is already there, and has been for a long time. It's called "grammars". Ever heard of yacc and lex? Now there is a need for something simple. Strange as it may appear, the big strength of regexes is that they are very simple to write on the spot.
I believe in some (but large) specialized areas there is already what is needed. I'm thinking of XPath syntax.
Is there a larger (not limited to XML but still simple) alternative around that could cover all cases ? Maybe you should take a look at perl 6 grammars.
No. We should leave regular expressions as is. They are already far too complicated. When was the last time you thought you had nailed it, i.e., got the whole extended regex syntax (choose your flavour) loaded in your squashy memory?
The theory behind regexes is nice and simple. But then we wanted this and that to go with it. The tool is useful, but falls short on non-regular matching. That is ok!
What most people miss, is that context-free grammars and little specialized interpreters are really easy to write.
Instead of making regexes more difficult, we should be rooting for parser support in standard libraries for our languages of choice!

Regex to replace all occurrences of a given character, ONLY after a given match

For the sake of simplicity, let's say that we have input strings with this format:
*text1*|*text2*
So, I want to leave text1 alone, and remove all spaces in text2.
This could be easy if we didn't have text1, a simple search and replace like this one would do:
%s/\s//g
but in this context I don't know what to do.
I tried with something like:
%s/\(.*|\S*\).\(.*\)/\1\2/g
which works, but removes only one space per run; it would have to be run on the same line once for each offending space.
So, a preferred restriction, is to solve this with only one search and replace. And, although I used Vim syntax, use the regular expression flavor you're most comfortable with to answer, I mean, maybe you need some functionality only offered by Perl.
Edit:
My solution for Vim:
%s:\(|.*\)\@<=\s::g
One way, in perl:
s/(^.*\||(?=\s))\s*/$1/g
Certainly much greater efficiency is possible if you allow more than just one search and replace.
So you have a string with one pipe (|) in it, and you want to replace only those spaces that don't precede the pipe?
s/\s+(?![^|]*\|)//g
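A quick run of that substitution on a made-up line, assuming, as the answer does, that there is exactly one pipe.

```perl
use strict;
use warnings;

# Hypothetical input: text1 | text2, with spaces on both sides.
my $line = 'keep these spaces|but  remove these';

# Remove whitespace only when no pipe follows it later in the string.
(my $out = $line) =~ s/\s+(?![^|]*\|)//g;

print "$out\n";
```

Whitespace before the pipe still has a pipe ahead of it, so the negative lookahead rejects it and it stays.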
You might try embedding Perl code in a regular expression (using the (?{...}) syntax), which is, however, rather an experimental feature and might not work or even be available in your scenario.
This
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s:\s::g })/$1$x/
should theoretically work, but I got an "Out of memory!" failure, which can be fixed by replacing '\s' with a space:
s/(.*?\|)(.*)(?{ $x = $2; $x =~ s: ::g })/$1$x/