Regex to match absence of substring - regex

I am looking for a regex which only matches when certain sub-strings are not present. In particular - if a line of code does not assign or return the return value from a method.
Examples:
this.execute(); // should match
var x = this.execute(); // no match
return this.execute(); // no match
I was trying to use the following regex
^(?!.*=|return).*execute\(\).*
This works with regex testers etc. - but I am getting "invalid perl operator" exception when using in practice.
Thanks..

Since you want to exclude only assignment or return it's easily negated
while (<DATA>) { print if not /(?:=|return)\s+this\.execute/ }
__DATA__
this.execute();
var x = this.execute();
return this.execute();
This prints only the line this.execute();.
With Lookaround Assertions, a negative lookahead that you offer does work
if (/^(?!.*=|return)\s*this\.execute/x) { print "$_\n" }
As for the negative lookbehind, there is one problem. First, here's what works
if ( /(?<! =\s ) this\.execute/x ) { print "$_\n" }
if ( /(?<! return \s ) this\.execute/x ) { print "$_\n" }
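Each of these excludes = or return followed by exactly one space. The thing is, we can't put \s+ there, nor can we alternate between branches of different lengths: Perl does not implement variable-length lookbehind (Perl 5.30 added experimental support, but only for bounded lengths, so \s+ is still rejected); see perlretut. We get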
Variable length lookbehind not implemented in regex m/(?<!=\s+)this\.execute/ at
We can add varying space \s+ outside of the assertion, with this..., and then combine multiple conditions to provide for a possibility that there is no space between = and this....
However, there's no reason for this if you can use a regular negated match.
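A minimal sketch of that combination, assuming only fixed-length lookbehinds are available. The extra (?<!\s) is an addition of this sketch, not something from the answer above: it pins the match to the start of any whitespace run, so the engine cannot begin the match after the space and skip the = and return checks. The sample lines are hypothetical.

```perl
use strict;
use warnings;

# Fixed-length lookbehinds only; the variable \s* sits outside them.
# (?<!\s) (an assumption of this sketch) forces \s* to consume the
# whole whitespace run, so the checks for = and return see the
# character(s) right before that run.
my $bare_call = qr/(?<!=)(?<!return)(?<!\s)\s*this\.execute\(/;

my @lines = (
    'this.execute();',
    'var x = this.execute();',
    'var x=this.execute();',
    'return this.execute();',
);

my @hits = grep { $_ =~ $bare_call } @lines;
print "$_\n" for @hits;
```

Only the bare call is printed. The plain negated match shown earlier stays the simpler choice.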
The reported error can only be about basic syntax. It is about the exact code you run, not the regex.

Not so sure if I understand the question, but you might consider trying this one: ^this\.execute\(\);

With situations like these, it's best to find the "lowest common denominator" in the matches you want to distinguish from similar-looking strings. In this case, the var x can be ignored - your requirements are satisfied by saying "anything before the method call is ok - the method call alone is not." That statement is probably a bit too tight though, so let's change it to "anything other than whitespace before the method call is ok, otherwise flag the call". Which means:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
if (/ ^ \s* $method_call /x) {
warn "Found method call on line $.: $1\n"
}
}
I'm presuming $fh is a filehandle to the source code file. I've also made some presumptions, which you may need to tweak, about how you want to define a method call - i.e. the opening bracket for parameters is compulsory. Using extended mode (/x) allows whitespace in the regex for easier reading. Also, the qr// operator lets you refer to a regex by name inside another to make things clearer.
If on the other hand, you want to insist on the presence of var x or return before giving the ok, we can reverse the search - ie explicitly look for the "ok" situations and flag any other calls:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
next if / ^ \s* return \s+ $method_call /x; # return OK
next if / ^ \s* var \s+ \w+ \s* = \s* $method_call /x; # var OK
warn "Found method call on line $.: $1\n" if /$method_call/ ;
}
Both of these are a little verbose but show more clearly what you're trying to do.
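For a quick check without a file, the second loop can be sketched over in-memory lines. The @source array and its contents are hypothetical stand-ins for the source file, and the var pattern here allows optional whitespace on both sides of the = sign.

```perl
use strict;
use warnings;

my $method_call = qr/ ( this \. \w+ ) \( /x;

# Hypothetical sample input standing in for the source file.
my @source = (
    'var x = this.execute();',
    'return this.execute();',
    '    this.cleanup();',
);

my @flagged;
for my $i (0 .. $#source) {
    my $line = $source[$i];
    next if $line =~ / ^ \s* return \s+ $method_call /x;               # return OK
    next if $line =~ / ^ \s* var \s+ \w+ \s* = \s* $method_call /x;    # var OK
    # Anything else that contains a call gets flagged with its line number.
    push @flagged, ($i + 1) . ": $1" if $line =~ /$method_call/;
}
print "$_\n" for @flagged;
```

With this input only the third line is flagged.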

I don't think we have enough information here. I say this because the following works for me in the shell
~$ echo "execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
execute()
~$ echo "return execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
~$
In the above code, I am running a one liner in a shell that pipes a string into a perl program. The perl program will print the string if it matches the regex. I get no errors from your regex.
It's possible that the error is due to your version of perl or something else entirely may be happening.
I am using perl v5.22.2

I mean, the simple answer is, just use the ! operator on your test, but here's the conversion in case you were wondering:
/expression/ => /^(?!.*expression)/ (either use DOTALL or [^] in JavaScript)
/^expression/ => /^(?!expression)/
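A small sketch of the first conversion in Perl, where /s plays the role of DOTALL. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my @lines = ("alpha beta", "gamma", "beta\ngamma", "delta\nbeta");

# Plain negated match: keep strings that do not contain "beta".
my @negated = grep { !/beta/ } @lines;

# The same test written as a positive match; /s lets . cross newlines.
my @lookahead = grep { /^(?!.*beta)/s } @lines;

# Without /s the lookahead only inspects the first line, so a string
# with "beta" on a later line slips through.
my @no_dotall = grep { /^(?!.*beta)/ } @lines;

printf "negated: %d, lookahead: %d, without /s: %d\n",
    scalar @negated, scalar @lookahead, scalar @no_dotall;
```

The first two lists agree; the third shows why the DOTALL note matters for multi-line input.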

Related

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recursively match the last captured group
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
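The numbering can be checked with a shortened version of the question's code. This is only a sketch; the sample $text is made up and much smaller than the original.

```perl
use strict;
use warnings;

# Same shape as the question's regex: invoke first, define afterwards.
my $code_block = qr/(?&block)
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
    )/x;

my $text = "int f()\n{ return 1; }\n// tail\n";

my ($before, $group2_defined, $after);
if ($text =~ m/(.+?)$code_block(.+)/s) {
    $before         = $1;
    $group2_defined = defined($2) ? 1 : 0;   # group 2 is the <block> definition
    $after          = $3;                    # the trailing text lands in group 3
}
print "group 2 defined: $group2_defined\n";
print "after: $after";
```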
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
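A sketch of the second option, assuming the whole pattern is written in one place rather than assembled from a separate $code_block object. With the DEFINE block last, the two real captures keep the numbers 1 and 2. The sample $text is made up.

```perl
use strict;
use warnings;

my $text = "int f()\n{\n  if (x) { return 1; }\n}\n// trailing comment\n";

my ($before, $after);
if ($text =~ m/
        (.+?)           # group 1: text before the block
        (?&block)       # invoke the named pattern, no capture here
        (.+)            # group 2: text after the block
        (?(DEFINE)
            (?<block> \{ (?: [^{}]++ | (?&block) )* \} )   # group 3, definition only
        )
    /xs) {
    ($before, $after) = ($1, $2);
}
print "before: $before";
print "after: $after";
```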
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
The extract_bracketed in list context returns: matched substring, remainder of the string after the match, and the substring before the match. Then we can keep matching in the remainder
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s+|\s+$//gr }.
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it; therefore, the only true solution is to capture it. There is, however, a hack-around alternative that works for your situation. If you want to grab the first and last sections without the second section, you can use the following regex. The downside is that it will not check the second section for proper syntax. If you do need to check the syntax, you're going to have to deal with there being an additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

Perl regular expression letter.number

Perl
I have strings:
xxxx.log, log.1, log.2, blog, photolog
So I would like to match only the strings named
[log.num] e.g log.1 log.12 log.3
[.log] e.g xxxx.log
but not the strings blog or photolog.
Any help would be appreciated.
So far this is my code, but it also matches blog and other words that contain "log", which is not what I want.
while( <> ) {
printf "%s",$_ if /log/;
}
As explained, \blog\b may work for your data. However, if log can appear next to other non-word characters (as in log-a.txt), that gets matched too, so you need to be more specific. One way is to match precisely what is expected. Going by the sample shown:
while (<>) {
print if /(?: \.log | log\.\d+ )$/x;
}
where (?: ) is the non capturing group, used so that either of the alternation patterns is anchored to the end of the string, $ (but not needlessly captured). Otherwise we'd match x.log.OLD or such. With the /x modifier spaces may be used without being matched, good for readability.
The patterns in alternation | can be combined but that gets far more complicated.
The printf %s, $var has no advantage over print $var (unless the format is more involved).
A one-liner test
perl -wE'
@ary = qw(log-out xxxx.log log.1 log.2 blog photolog);
/(?:\.log|log\.\d+)$/ && say for @ary
'
where feature say is used for the newline, needed here. In a one-liner -E (capital) enables it.
This should do the trick:
while( <> ) {
printf "%s",$_ if /\blog\b/;
}
The key is the \b, which is a word boundary. You can read about them here: http://www.regular-expressions.info/wordboundaries.html but in short it matches at a position between a word character (letter, digit, or underscore) and a non-word character, or at the start or end of the string next to a word character. Another option would be:
/\.log\d/
That would match .log3. The period has to be escaped, and \d means a digit. That being said, I think \b is what you want.
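To see how the word-boundary pattern behaves next to the stricter anchored alternation, here is a small sketch over the sample names plus one extra name, log-out, added as an assumption to show the difference.

```perl
use strict;
use warnings;

my @names = qw(xxxx.log log.1 log.2 blog photolog log-out);

# \b only requires a word/non-word transition on each side of "log".
my @word_boundary = grep { /\blog\b/ } @names;

# Spell out the two expected shapes and anchor them at the end.
my @exact = grep { /(?:\.log|log\.\d+)$/ } @names;

print "word boundary: @word_boundary\n";
print "exact:         @exact\n";
```

Both reject blog and photolog; only the second also rejects log-out.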

PHP preg_match_all trouble

I have written a regular expression that I tested in rubular.com and it returned 4 matches. The subject of testing can be found here http://pastebin.com/49ERrzJN and the PHP code is below. For some reason the PHP code returns only the first 2 matches. How to make it to match all 4? It seems it has something to do with greediness or so.
$file = file_get_contents('x.txt');
preg_match_all('~[0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})~', $file, $m, PREG_SET_ORDER);
foreach($m as $v) echo 'S: '. $v[1]. '; N: '. $v[3]. '; D:'. $v[7]. '<br>';
Your regex is very slooooooow. After trying it on regex101.com, I found it would timeout on PHP (but not JS, for whatever reason). I'm pretty sure the timeout happens at around 50,000 steps. Actually, it makes sense now why you're not using an online PHP regex tester.
I'm not sure if this is the source of your problem, but there is a default memory limit in PHP:
memory_limit [default:] "128M"
[history:] "8M" before PHP 5.2.0, "16M" in PHP 5.2.0
If you use the multiline modifier (I assume that preg_match_all essentially adds the global modifier), you can use this regex that only takes 1282 steps to find all 4 matches:
^ [0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})
Actually, there are only 2 characters that I added. They're at the beginning, the anchor ^ and the literal space.
If you have to write a long pattern, the first thing to do is to make it readable. To do that, use the verbose mode (x modifier) that allows comments and free-spacing, and use named captures.
Then you need to make a precise description of what you are looking for:
your target takes a whole line => use the anchors ^ and $ with the modifier m, and use the \h class (that only contains horizontal white-spaces) instead of the \s class.
instead of using inefficient sub-patterns like (?:(?!.....).){m,n} to describe what your field must not contain, describe what the field can contain.
use atomic groups (?>...) when needed instead of non-capturing groups to avoid useless backtracking.
in general, using precise character classes avoids a lot of problems
pattern:
~
^ \h*+ # start of the line
# named captures # field separators
(?<VOTERNO> [0-9]+ ) \h+
(?<SURNAME> \S+ (?>\h\S+)*? ) \h{2,}
(?<OTHERNAMES> \S+ (?>\h\S+)*? ) \h{2,}
(?<DOB> [0-9]{2}-[0-9]{2}-[0-9]{4} ) \h+
(?<SEX> [FM] ) \h+
(?<APPID_RECNO> [0-9A-Z/]+ ) \h+
(?<VILLAGE> \S+ (?>\h\S+)* )
\h* $ # end of the line
~mx
If you want to know what goes wrong with a pattern, you can use the function preg_last_error()

Can not catch substring by regex which ends with tab

I have two types of strings:
1: ANN=abcdefgh;blabla
2 wrong version: ANN=abcdefgh\tyxz\tyxz
2 actual version: ANN=abcdefgh
Now I want to extract the abcdefgh with a regex. So the start to extract is always after "ANN=". But the end is either a semicolon (;) or the FIRST occurrence of a tab.
How does the regex for this look? I tried:
(my @splitUpAnn) = $tabValues[7] =~ /ANN=(.*)[;\t]/;
But I always get just the version 1 with the semicolon back, but it does not work for version two...
EDIT: To be clear. I did not get back ANYTHING for the version two. The problem is NOT that the last tab is used!
EDIT2: Oops, the input data was different from what I expected. After the value I have either a semicolon or NOTHING (see "2 actual version"). Sorry for that! So what would the regex be then?
Use .*? instead of .*.
.* is greedy, so it matches up to the last occurrence of TAB.
Just use the non-greedy quantifier *? that matches the least it can:
for my $string ('ANN=abcdefgh;blabla', "ANN=abcdefgh\tyxz\tyxz") {
(my @splitUpAnn) = $string =~ /ANN=(.*?)[;\t]/;
print "@splitUpAnn\n";
}
If you want to get the string up to the first semicolon if present, or everything otherwise, just use
$string =~ /ANN=([^;]*)/
i.e. capture everything that's not a semicolon.
/ANN=(.*?)[;\t]/
Make your regex non greedy.
.* is greedy and will match up to the last ; or \t available.
my ($ann) = $tabValues[7] =~ /ANN=([^;\t]*)/;
The leading ^ negates the character class, so [^;\t] matches any character except ; and tab.
There are multiple suggestions to make your .* non-greedy, but using non-greediness as anything but an optimization is very fragile and error-prone.
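A short sketch comparing the three patterns from this thread on the three input shapes in the question, including the "2 actual version" where nothing follows the value. The variable names are only for illustration.

```perl
use strict;
use warnings;

my @inputs = (
    "ANN=abcdefgh;blabla",
    "ANN=abcdefgh\tyxz\tyxz",
    "ANN=abcdefgh",
);

my @rows;
for my $s (@inputs) {
    my ($greedy)  = $s =~ /ANN=(.*)[;\t]/;     # runs to the LAST ; or tab
    my ($lazy)    = $s =~ /ANN=(.*?)[;\t]/;    # stops at the first, but needs one
    my ($negated) = $s =~ /ANN=([^;\t]*)/;     # stops before the first, or at the end
    push @rows, [ $greedy, $lazy, $negated ];
}

for my $row (@rows) {
    # Show tabs visibly and mark failed matches.
    print join(' | ', map { defined $_ ? s/\t/\\t/gr : '(no match)' } @$row), "\n";
}
```

Only the negated character class returns abcdefgh for all three inputs. The two patterns that require a trailing delimiter fail on the last one.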
I've tested and I got a match
if ( "ANN=abcdefgh;blabla" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
if ( "ANN=abcdefgh\tyxz\tyxz" =~ /(ANN=(.*)[;\t])/ ) {
print $1."\n" ;}
result is:
ANN=abcdefgh;
ANN=abcdefgh yxz
So:
your regex really is greedy, as described in previous answers
Perhaps the problem lies in the way you put the values in the array, but the regexp is correct

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
\A # absolute start of string
(?:$alphanum)+ # I can change this elsewhere
;
/x;
if( $line =~ $regex ) { ... }
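The difference between ^ and \A only shows up once /m is in effect. A small sketch with a made-up two-line string, where /m is written explicitly to stand in for a default flag set elsewhere:

```perl
use strict;
use warnings;

# Hypothetical input: the alphanumeric-plus-semicolon run is on line two.
my $line = "first line\nabc123; rest";

my $caret_m  = $line =~ /^[a-z0-9]+;/im  ? 1 : 0;   # ^ also matches after a newline
my $caret    = $line =~ /^[a-z0-9]+;/i   ? 1 : 0;   # ^ is start of string only
my $absolute = $line =~ /\A[a-z0-9]+;/im ? 1 : 0;   # \A ignores /m

print "^ with /m: $caret_m, ^ without /m: $caret, \\A with /m: $absolute\n";
```

Only the first pattern matches, which is why \A is the safer anchor when the flags are not under your control.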
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.