PHP preg_match_all trouble - regex

I have written a regular expression that I tested on rubular.com, where it returned 4 matches. The test subject can be found here http://pastebin.com/49ERrzJN and the PHP code is below. For some reason the PHP code returns only the first 2 matches. How can I make it match all 4? It seems to have something to do with greediness.
$file = file_get_contents('x.txt');
preg_match_all('~[0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})~', $file, $m, PREG_SET_ORDER);
foreach($m as $v) echo 'S: '. $v[1]. '; N: '. $v[3]. '; D:'. $v[7]. '<br>';

Your regex is very slooooooow. After trying it on regex101.com, I found it would timeout on PHP (but not JS, for whatever reason). I'm pretty sure the timeout happens at around 50,000 steps. Actually, it makes sense now why you're not using an online PHP regex tester.
I'm not sure if this is the source of your problem, but there is a default memory limit in PHP:
memory_limit [default:] "128M"
[history:] "8M" before PHP 5.2.0, "16M" in PHP 5.2.0
If you use the multiline modifier (I assume that preg_match_all essentially adds the global modifier), you can use this regex that only takes 1282 steps to find all 4 matches:
^ [0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})
Actually, there are only 2 characters that I added. They're at the beginning, the anchor ^ and the literal space.

If you have to write a long pattern, the first thing to do is to make it readable. To do that, use the verbose mode (x modifier) that allows comments and free-spacing, and use named captures.
Then you need to make a precise description of what you are looking for:
your target takes a whole line => use the anchors ^ and $ with the modifier m, and use the \h class (that only contains horizontal white-spaces) instead of the \s class.
instead of using this kind of inefficient sub-patterns (?:(?!.....).){m,n} to describe what your field must not contain, describe what the field can contain.
use atomic groups (?>...) when needed instead of non-capturing groups to avoid useless backtracking.
in general, using precise characters classes avoids a lot of problems
pattern:
~
^ \h*+ # start of the line
# named captures # field separators
(?<VOTERNO> [0-9]+ ) \h+
(?<SURNAME> \S+ (?>\h\S+)*? ) \h{2,}
(?<OTHERNAMES> \S+ (?>\h\S+)*? ) \h{2,}
(?<DOB> [0-9]{2}-[0-9]{2}-[0-9]{4} ) \h+
(?<SEX> [FM] ) \h+
(?<APPID_RECNO> [0-9A-Z/]+ ) \h+
(?<VILLAGE> \S+ (?>\h\S+)* )
\h* $ # end of the line
~mx
demo
If you want to know what goes wrong with a pattern, you can use the function preg_last_error()
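For readers outside PHP, the same layout can be sketched in Python. This is only an illustration under stated assumptions: the two sample records are made up (the pastebin data is not reproduced here), `[ \t]` stands in for `\h` because Python's `re` has no `\h`, and plain non-capturing groups replace the atomic groups so the sketch also runs on Python versions older than 3.11.

```python
import re

# Verbose, named-capture version of the line pattern.
# Assumption: fields are separated by spaces or tabs, as in the PCRE pattern.
RECORD = re.compile(r"""
    ^ [ \t]*
    (?P<voterno>    [0-9]+ )                      [ \t]+
    (?P<surname>    \S+ (?:[ \t]\S+)*? )          [ \t]{2,}
    (?P<othernames> \S+ (?:[ \t]\S+)*? )          [ \t]{2,}
    (?P<dob>        [0-9]{2}-[0-9]{2}-[0-9]{4} )  [ \t]+
    (?P<sex>        [FM] )                        [ \t]+
    (?P<appid>      [0-9A-Z/]+ )                  [ \t]+
    (?P<village>    \S+ (?:[ \t]\S+)* )
    [ \t]* $
""", re.MULTILINE | re.VERBOSE)

# Hypothetical sample data, invented for this sketch.
SAMPLE = (
    "  101  DOE        JOHN PAUL      01-02-1980  M  12/AB/345  GREEN HILL\n"
    "  102  VAN DYKE   MARY           15-11-1975  F  77/XY/001  RIVERSIDE\n"
    "not a record line\n"
)

rows = [m.groupdict() for m in RECORD.finditer(SAMPLE)]
for row in rows:
    print(row["surname"], "|", row["othernames"], "|", row["dob"])
```

Because every field says what it may contain, a line that is not a record is rejected right at the voter number instead of after a long search.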

Related

Regular expression for Medicare MBI number format

I'm wondering if someone can help me to create a regular expression to check if a string matches the new Medicare MBI number format. Here are the specifics in regards to character position and what they can contain.
I'm using Cache ObjectScript, but any language would be helpful just so I can get the idea.
If PCRE is an option, you could leverage subroutines:
(?(DEFINE)
(?P<numeric>\d) # numbers
(?P<abc>(?![SLOIBZ])[A-Z]) # A-Z without S,L,O,I,B,Z
(?P<both>(?&numeric)|(?&abc)) # combined
)
^ # start of line/string
(?&numeric)(?&abc)(?&both) # in packs of three
(?&numeric)(?&abc)(?&both)
(?&numeric)(?&abc)(?&abc)
(?&numeric)(?&numeric)
$ # end of line/string
Paste your IDs into the demo on regex101.com (but don't save it on regex101 or you'll expose those IDs to the public permanently).
Of course, it is not a must to use subroutines, it just makes the expression clearer, more readable and maintainable.
But, you could very well go for
^
\d
(?![SLOIBZ])[A-Z]
(?:\d|(?![SLOIBZ])[A-Z])
\d
(?![SLOIBZ])[A-Z]
(?:\d|(?![SLOIBZ])[A-Z])
\d
(?![SLOIBZ])[A-Z]
(?![SLOIBZ])[A-Z]
\d
\d
$
Or condensed (just copy and paste it):
^\d(?![SLOIBZ])[A-Z](?:\d|(?![SLOIBZ])[A-Z])\d(?![SLOIBZ])[A-Z](?:\d|(?![SLOIBZ])[A-Z])\d(?![SLOIBZ])[A-Z](?![SLOIBZ])[A-Z]\d\d$
First position should be 1-9
https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/Downloads/MedicareCard-FactSheet-TextOnly-909365.pdf
\b[1-9][AC-HJKMNP-RT-Yac-hjkmnp-rt-y][AC-HJKMNP-RT-Yac-hjkmnp-rt-y0-9][0-9]-?[AC-HJKMNP-RT-Yac-hjkmnp-rt-y][AC-HJKMNP-RT-Yac-hjkmnp-rt-y0-9][0-9]-?[AC-HJKMNP-RT-Yac-hjkmnp-rt-y]{2}\d{2}\b
Using ! condition (a negative look-ahead) for one of the previous answers:
\b[1-9](?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z0-9][0-9]-?(?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z0-9][0-9]-?(?:(?![sloibzSLOIBZ])[a-zA-Z]){2}\d{2}\b
Or, even shorter:
\b[1-9](?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z\d]\d-?(?![sloibzSLOIBZ])[a-zA-Z](?![sloibzSLOIBZ])[a-zA-Z\d]\d-?(?:(?![sloibzSLOIBZ])[a-zA-Z]){2}\d{2}\b
public static boolean isValidHICN(String mHICN) {
// Positions 3 and 6 are alphanumeric: a digit, or a letter other than S, L, O, I, B, Z
String mPatternHICN = "[1-9][ACDEFGHJKMNPQRTUVWXY][0-9ACDEFGHJKMNPQRTUVWXY][0-9][ACDEFGHJKMNPQRTUVWXY][0-9ACDEFGHJKMNPQRTUVWXY][0-9][ACDEFGHJKMNPQRTUVWXY]{2}[0-9]{2}";
return mHICN.matches(mPatternHICN);
}
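Python's `re` has no (?(DEFINE)...) subroutines, but the same "name the pieces once" idea can be approximated by composing the pattern from string fragments. The sketch below assumes the position rules used in the answers above (first digit 1-9, letters exclude S, L, O, I, B, Z, optional dashes after positions 4 and 7). The function name `is_valid_mbi` and the choice to upper-case the input first are assumptions made for this illustration.

```python
import re

# Building blocks, named once and reused (stand-ins for PCRE subroutines).
ALPHA = "[AC-HJKMNP-RT-Y]"      # A-Z without S, L, O, I, B, Z
ALNUM = "[0-9AC-HJKMNP-RT-Y]"   # a digit or an allowed letter
NUM = "[0-9]"

MBI = re.compile(
    "[1-9]" + ALPHA + ALNUM + NUM      # positions 1-4
    + "-?" + ALPHA + ALNUM + NUM       # positions 5-7
    + "-?" + ALPHA + ALPHA + NUM + NUM # positions 8-11
)

def is_valid_mbi(value):
    """Return True if value has the MBI shape described above."""
    # Assumption: lowercase input is accepted and normalised first.
    return MBI.fullmatch(value.upper()) is not None

print(is_valid_mbi("1EG4-TE5-MK73"))
print(is_valid_mbi("1SG4TE5MK73"))
```

Using `fullmatch` plays the role of the ^ and $ anchors, so a valid prefix followed by extra characters is rejected.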

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recurse into the named group "block"
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
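The numbering rule is not specific to Perl. As a rough illustration in Python (which has no DEFINE blocks, so an optional named group that never participates in the match plays the role of the unused block group here), the named group still takes slot 2 and pushes the trailing capture to slot 3. The patterns and the input string are made up for this sketch.

```python
import re

# Numbered captures: the named group occupies slot 2 even though it is unused.
numbered = re.compile(r"(\w+)(?P<block>\{\})?=(\w+)")
m = numbered.match("before=after")

slot_of_block = numbered.groupindex["block"]  # position of the named group
second = m.group(2)                           # the unused group: no value
third = m.group(3)                            # the text we actually wanted

# Named captures only: nothing to count, nothing to shift.
named = re.compile(r"(?P<before>\w+)(?:\{\})?=(?P<after>\w+)")
n = named.match("before=after")

print(slot_of_block, second, third)
print(n.group("before"), n.group("after"))
```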
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
In list context, extract_bracketed returns the matched substring, the remainder of the string after the match, and the substring before the match. We can then keep matching in the remainder:
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
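If no balanced-text module is at hand, the same "text outside the blocks" result can be had from a small depth counter. The following is a hand-written Python sketch, not a port of Text::Balanced: the function name `outside_braces` is made up, and it assumes braces never appear inside string literals or comments.

```python
def outside_braces(text):
    """Return the pieces of text that lie outside balanced {...} blocks."""
    parts = []      # finished pieces
    current = []    # characters of the piece being collected
    depth = 0       # current brace nesting level
    for ch in text:
        if ch == "{":
            if depth == 0:
                # A top-level block starts: close the current piece.
                parts.append("".join(current))
                current = []
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
        elif depth == 0:
            current.append(ch)
    parts.append("".join(current))  # whatever follows the last block
    return parts

for piece in outside_braces("start {some {stuff} one} and {more {of it} two}, and done"):
    print(piece)
```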
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s+|\s+$//gr }.
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it; therefore, the only true solution is to capture it. There is, however, a workaround that fits your situation. If you want to grab the first and last sections without the second section, you can use the following regex. The downside is that it will not check the second section for proper syntax. If you do need to check the syntax, you're going to have to deal with there being an additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

Regex to match absence of substring

I am looking for a regex which only matches when certain sub-strings are not present. In particular - if a line of code does not assign or return the return value from a method.
Examples:
this.execute(); // should match
var x = this.execute(); // no match
return this.execute(); // no match
I was trying to use the following regex
^(?!.*=|return).*execute\(\).*
This works with regex testers etc. - but I am getting "invalid perl operator" exception when using in practice.
Thanks..
Since you want to exclude only assignment or return it's easily negated
while (<DATA>) { print if not /(?:=|return)\s+this\.execute/ }
__DATA__
this.execute();
var x = this.execute();
return this.execute();
This prints only the line this.execute();.
With lookaround assertions, a negative lookahead like the one you offer does work
if (/^(?!.*=|return)\s*this\.execute/x) { print "$_\n" }
As for the negative lookbehind, there is one problem. First, here's what works
if ( /(?<! =\s ) this\.execute/x ) { print "$_\n" }
if ( /(?<! return \s ) this\.execute/x ) { print "$_\n" }
This excludes = or return, with one space. The thing is, we can't put \s+ there nor can we do alternation -- Perl can't do it for this particular assertion, see perlretut. We get
Variable length lookbehind not implemented in regex m/(?<!=\s+)this\.execute/ at
We can add varying space \s+ outside of the assertion, with this..., and then combine multiple conditions to provide for a possibility that there is no space between = and this....
However, there's no reason for this if you can use a regular negated match.
The reported error can only be about basic syntax. It is about the exact code you run, not the regex.
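The "plain negated match" idea is the same in any language. Here is a minimal Python sketch of it, assuming the three example lines from the question. The names `CALL`, `USED` and `result_discarded` are invented for the illustration.

```python
import re

CALL = re.compile(r"this\.execute\(")
# The call's value is used when it follows "=" or "return".
USED = re.compile(r"(?:=\s*|\breturn\s+)this\.execute\(")

def result_discarded(line):
    """True if the line calls this.execute() and ignores what it returns."""
    return CALL.search(line) is not None and USED.search(line) is None

for line in ["this.execute();", "var x = this.execute();", "return this.execute();"]:
    print(result_discarded(line), line)
```

Two simple positive patterns combined with `not` avoid the fixed-width lookbehind restriction entirely.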
Not so sure if I understand the question but you might consider trying this one. ^this\.execute\(\);
With situations like these, it's best to find the "lowest common denominator" in the matches you want to distinguish from similar-looking strings. In this case, the var x can be ignored - your requirements are satisfied by saying "anything before the method call is ok - the method call alone is not." That statement is probably a bit too tight though, so let's change it to "anything other than whitespace before the method call is ok, otherwise flag the call". Which means:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
if (/ ^ \s* $method_call /x) {
warn "Found method call on line $.: $1\n"
}
}
I'm presuming $fh is a filehandle to the source code file. I've also made some presumptions, which you may need to tweak, about how you want to define a method call - i.e. an opening bracket for parameters is compulsory. Using 'extended mode' regexes (the x modifier) allows the use of whitespace in the regex for easier reading. Also, using qr// ('quote regex') allows referring to a regex by name inside another to make things clearer.
If on the other hand, you want to insist on the presence of var x or return before giving the ok, we can reverse the search - ie explicitly look for the "ok" situations and flag any other calls:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
next if / ^ \s* return \s+ $method_call /x; # return OK
next if / ^ \s* var \s+ \w+ \s* = \s* $method_call /x; # var OK
warn "Found method call on line $.: $1\n" if /$method_call/ ;
}
Both of these are a little verbose but show more clearly what you're trying to do.
I don't think we have enough information here. I say this because the following works for me in the shell
~$ echo "execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
execute()
~$ echo "return execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
~$
In the above code, I am running a one liner in a shell that pipes a string into a perl program. The perl program will print the string if it matches the regex. I get no errors from your regex.
It's possible that the error is due to your version of perl or something else entirely may be happening.
I am using perl v5.22.2
I mean, the simple answer is, just use the ! operator on your test, but here's the conversion in case you were wondering:
/expression/ => /^(?!.*expression)/ (either use DOTALL or [^] in JavaScript)
/^expression/ => /^(?!expression)/
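The first conversion can be checked quickly in Python. This is only a sketch for simple expressions (an expression with its own anchors or flags may behave differently). The helper names and the sample strings are made up.

```python
import re

def absent(expr, text):
    """The simple way: negate the result of an ordinary search."""
    return re.search(expr, text) is None

def absent_via_lookahead(expr, text):
    """The converted form: ^(?!.*expr), with DOTALL so . crosses newlines."""
    return re.search(rf"^(?!.*(?:{expr}))", text, re.DOTALL) is not None

samples = ["this.execute();", "return this.execute();", "line one\nreturn x", ""]
results = [(absent("return", s), absent_via_lookahead("return", s)) for s in samples]
print(results)
```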

Regular expression to get executed script names from command history

I am trying to write a regex that will parse the syntax for calling a script and capture the script name.
All of these are valid syntax for the call
# normal way
run cred=username/password script.bi
# single quoted username password, also separated in a different way
run cred='username password' script.bi
# username/password is optional
run script.bi
# script extension is optional
run script
# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
This is what I have so far
my $r = q{run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\]+)};
to capture the values in $1.
But I get a
Unmatched [ before HERE mark in regex m/run +(?:cred=(?:[^\s\']*|\'.*\') +)?([ << HERE ^\s\]+)/
The \\ is being treated as a \ and hence in the regex it becomes \] so escaping ] and hence the unmatched [
Replace with run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\\\]+) ( note the \\\\ ) and try.
Also, from the comments: you should be using qr for the regex rather than just q.
( I had just looked at the error, not the validity / efficiency of the regex for your problem)
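The same two-layer escaping trap exists wherever a regex is first written as an ordinary string. As a rough Python analogue (raw strings play the part that qr plays in Perl; the shortened pattern is an assumption made to keep the sketch small):

```python
import re

# Ordinary string: "\\" collapses to one backslash before the regex engine
# sees it, so the engine receives  run +([^\s\]+)  and "\]" escapes the bracket.
under_escaped = "run +([^\\s\\]+)"
try:
    re.compile(under_escaped)
    under_escaped_compiles = True
except re.error:
    under_escaped_compiles = False

# Raw string: the engine receives the backslashes exactly as typed.
raw_pattern = re.compile(r"run +([^\s\\]+)")
script = raw_pattern.match("run script.bi").group(1)

print(under_escaped_compiles, script)
```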
The essence of your problem with specifying the regex is a difference of one byte: q versus qr. You're writing a regex, so call it what it is. Treating the pattern as a string means you have to deal with the rules for string quoting on top of the rules for regex escaping.
As for the language that your regex matches, add anchors to force the pattern to match the entire line. The regex engine is fiercely determined and will keep working until it finds a match. Without anchors, it's happy to find a substring.
Sometimes this gives you surprising results. Have you ever dealt with a petulant child (or a childish adult) who takes a narrow, exceedingly literal interpretation of what you say? The regex engine is that way, but it's trying to help.
With the last example it matches because
You said with the ? quantifier that the cred=... subpattern can match zero times, so the regex engine skipped it.
You said the script name is the following substring that's a run of one or more non-whitespace, non-backslash characters, so the regex engine saw cred=username/password, none of which are whitespace or backslash characters, and matched. Regexes are greedy: they consider what's right in front of them without regard to whether a given substring “should have” been matched by another subpattern.
The last example fits the bill—although not in the way that you intended. An important lesson with regexes is any quantifier such as ? or * that can match zero times always succeeds!
Without the $ anchor, the pattern from your question leaves the trailing backslash unmatched, which you can see with a slight modification to $runpat.
qr{run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)(.*)}; # ' SO hiliter hack
Notice the (.*) at the end to grab any non-newline characters that may be left. Changing the loop to
while (<DATA>) {
next unless /$runpat/;
print "line $.: \$1=[$1]; \$2=[$2]\n";
}
gives the following output for line 15.
line 15: $1=[cred=username/password]; $2=[ \]
As a complete program, that becomes
#! /usr/bin/env perl
use strict;
use warnings;
# The goofy comment on the next line is a hack to
# help Stack Overflow's syntax highlighter recover
# from its confusion after seeing the quotes. It's
# for presentation only: you won't need it in your
# real code.
my $runpat = qr{^\s*run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)$}; # '
while (<DATA>) {
next unless /$runpat/;
print "line $.: \$1=[$1]\n";
}
__DATA__
# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
Output:
line 2: $1=[script.bi]
line 5: $1=[script.bi]
line 8: $1=[script.bi]
line 11: $1=[script]
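The effect of the anchors can be reproduced outside Perl as well. The following Python sketch assumes the two patterns transfer unchanged (they use no Perl-only syntax) and feeds them the continued line from the question.

```python
import re

UNANCHORED = re.compile(r"run +(?:cred=(?:[^\s']*|'.*') +)?([^\s\\]+)")
ANCHORED = re.compile(r"^\s*run +(?:cred=(?:[^\s']*|'.*') +)?([^\s\\]+)$")

continued = "run cred=username/password \\"   # ends with a single backslash

# Without anchors the optional cred= part is skipped and the credentials
# are captured as if they were the script name.
wrong = UNANCHORED.search(continued).group(1)

# With anchors the trailing " \" cannot be accounted for, so nothing matches.
rejected = ANCHORED.search(continued) is None

quoted = ANCHORED.search("run cred='username password' script.bi").group(1)
print(wrong, rejected, quoted)
```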
Conciseness isn't always helpful with regexes. Consider the following alternative but equivalent specification:
my $runpat = qr{
^ \s*
(?:
run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+) # ' hiliter
| run \s+ (?!cred=) (?<script> [^\s\\]+)
)
\s* $
}x;
Yes, it takes more room to write, but it's clearer about acceptable alternatives. Your loop is nearly the same
while (<DATA>) {
next unless /$runpat/;
print "line $.: script=[$+{script}]\n";
}
and even spares the poor reader from having to count parentheses.
To use named capture buffers, e.g., (?<script>...), be sure to add
use 5.10.0;
to the top of your program to provide executable documentation of the minimum required version of perl.
Are there sometimes arguments to the script? If not, why not:
/^run(?:\s.*\s|\s)(\S+)\s*$/
I guess that doesn't work on the line continuation bit.
/^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/
Test program:
#!/usr/bin/perl
$foo="# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \\
script.bi
";
foreach my $line (split(/\n/,$foo))
{
print "Looking >$line<\n";
print "Match >$1<\n"
if ($line =~ /^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/);
}
Example output:
Looking ># normal way<
Looking >run cred=username/password script.bi<
Match >script.bi<
Looking ><
Looking ># single quoted username password, also separated in a different way<
Looking >run cred='username password' script.bi<
Match >script.bi<
Looking ><
Looking ># username/password is optional<
Looking >run script.bi<
Match >script.bi<
Looking ><
Looking ># script extension is optional<
Looking >run script<
Match >script<
Looking ><
Looking ># the call might be broken into multiple lines using <
Looking ># THIS ONE SHOULD NOT MATCH<
Looking >run cred=username/password \<
Looking >script.bi<

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print:
stop  here!
A short explanation:
\[ # match '['
( # start group 1
[^\[\]] # match any char except '[' and ']'
| # OR
(?0) # recursively match group 0 (the entire pattern!)
)* # end group 1 and repeat it zero or more times
] # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As @ridgerunner mentioned, you can make the regex more efficient by letting the character class [^\[\]] match one or more times, making both quantifiers possessive, and even by turning group 1 into a non-capturing group:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?:[^\[\]]|(?R))*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while will basically do nothing while the condition is true. If a s/// matches and removes a bracketed section the loop is repeated and the s/// is run again.
This will work even if you're using an older version of Perl or another language that doesn't support the (?0) recursion extended pattern in Bart Kiers's answer.
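Python's built-in `re` is one such engine without recursion, so the inside-out removal is a natural fit there. A minimal sketch, with a made-up function name, using `subn` to learn whether a pass removed anything:

```python
import re

INNERMOST = re.compile(r"\[[^\[\]]*\]")   # a [...] with no brackets inside

def strip_brackets(text):
    """Remove bracketed sections, innermost first, until none are left."""
    while True:
        text, removed = INNERMOST.subn("", text)
        if removed == 0:
            return text

print(strip_brackets("stops [chomping [too] early]!"))
```

Each pass peels off one nesting level, so the loop runs once per level plus a final pass that finds nothing.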
You want to remove only things between the []s that aren't []s themselves, i.e.:
\[[^\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle multiple nested []s though. For example, matching [foo[bar]baz] won't work.