RegExp infinite loop only in perl, why? - regex

I have a regular expression to test whether a CSV cell contains a correct file path:
EDIT: The CSV lists file paths that do not yet exist when the script runs (so I cannot use -e), and a file path can include * or %variable% or {$variable}.
my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]{0,2})*\1$';
Since CSV cells sometimes contain wrapping double quotes, and sometimes the filename itself needs to be wrapped in double quotes, I made this grouping: (|"|""") ... \1
Then using this function:
sub ValidateUNCPath{
my $input = shift;
if ($input !~ /$FILENAME_REGEXP/){
return;
}
else{
return "This is a Valid File Path.";
}
}
I'm trying to test if this phrase is matching my regexp (It should not match):
"""c:\my\dir\lord"
but Perl gets into an infinite loop on:
ValidateUNCPath('"""c:\my\dir\lord"');
EDIT actually it loops on this:
ValidateUNCPath('"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"');
I made sure in http://regexpal.com that my regexp correctly catches those non-symmetric """ ... " wrapping double quotes, but Perl has a mind of its own :(
I even tried the /g and /o flags in
/$FILENAME_REGEXP/go
but it still hangs. What am I missing ?

First off, nothing you have posted can cause an infinite loop, so if you're getting one, it's not from this part of the code.
When I try out your subroutine, it returns true for all sorts of strings that are far from looking like paths, for example:
.....
This is a Valid File Path.
.*.*
This is a Valid File Path.
-
This is a Valid File Path.
This is because your regex is rather loose.
^(|"|""") # can match the empty string
(?:[a-zA-Z]:[\\\/])? # same, matches 0-1 times
[\\\/]{0,2} # same, matches 0-2 times
(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]?)+\1$ # only this is not optional
Since only the last part actually has to match anything, you are allowing all kinds of strings, mainly through the first character class: [\w\s\.\*-]
In my personal opinion, when you start relying on regexes that look like yours, you're doing something wrong, unless you're skilled at regexes and hope that no one who isn't will ever be forced to fix them.
Why don't you just remove the quotes? Also, if this path exists in your system, there is a much easier way to check if it is valid: -e $path

If the regex engine was naïve,
('y' x 20) =~ /^.*.*.*.*.*x/
would take a very long time to fail since it has to try
20 * 20 * 20 * 20 * 20 = 3,200,000 possible matches.
Your pattern has a similar structure: many of its components can match a wide range of substrings of your input.
Now, Perl's regex engine is highly optimised, and far from naïve. In the above pattern, it will start by looking for x, and exit very fast. Unfortunately, it doesn't or can't similarly optimise your pattern.
Your pattern is a complete mess. I'm not going to even try to guess what it's supposed to match. You will find that this problem will solve itself once you switch to a correct pattern.

Update
Edit: From trial and error, the [\w\s.*-]+ inside the grouping sub-expression below is what causes the backtracking problem
(?:
(?:
[\w\s.*-]+
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
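To see why that sub-expression hurts: the trailing separator is optional, so a run of n ordinary characters can be cut into consecutive [\w\s.*-]+ chunks in 2^(n-1) different ways, and a match that is doomed to fail may end up trying them all. A rough Python sketch that only counts those cuts (count_splits is a name invented here; it models the size of the search space, not Perl's actual engine):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n ordinary characters into
    consecutive non-empty chunks, i.e. the ways (?:[\\w\\s.*-]+[\\\\/]?)+
    can consume the run when no separator is present."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way
    # The first chunk takes i characters; the remainder is cut recursively.
    return sum(count_splits(n - i) for i in range(1, n + 1))


if __name__ == "__main__":
    # "Netwxn00.map" from the question is a 12-character run.
    for n in (4, 12, 24):
        print(n, count_splits(n))
```

Each extra character doubles the count, which would explain why a longer file name looks like a hang rather than a slow match.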
Fix #1,
Unrolled method
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)+
(?:
[\\\/]
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)+
)*
[\\\/]?
\1
$
'
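A quick way to check the unrolled pattern is to run it against the inputs from the question. The sketch below transcribes Fix #1 for Python's re module, which is an assumption made purely for illustration (the question uses Perl); the names UNROLLED and is_valid_path are invented here.

```python
import re

# Fix #1 (unrolled), transcribed for Python's re module in verbose mode.
UNROLLED = re.compile(r'''
    ^
    ( | " | """ )                          # opening wrapper: nothing, " or """
    (?: [a-zA-Z] : [\\/] )?                # optional drive letter
    [\\/]{0,2}                             # optional leading separators
    (?: [\w\s.*-] | \{\$\w+\} | %\w+% )+   # first path segment
    (?:
        [\\/]                              # exactly one separator between segments
        (?: [\w\s.*-] | \{\$\w+\} | %\w+% )+
    )*
    [\\/]?                                 # optional trailing separator
    \1                                     # the same wrapper must close the cell
    $
''', re.VERBOSE)


def is_valid_path(cell: str) -> bool:
    """Return True when the whole cell matches the unrolled pattern."""
    return UNROLLED.search(cell) is not None


if __name__ == "__main__":
    for cell in (r'c:\my\dir\lord',
                 r'"""c:\my\dir\lord"""',
                 r'"""c:\my\dir\lord"',
                 r'"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"'):
        print(cell, is_valid_path(cell))
```

Because every alternative inside a segment starts with a different character, and segments are now divided by a mandatory separator, there is only one way to consume a given path, so the mismatched-quote inputs are rejected without the long wait.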
Fix #2, Independent Sub-Expression
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?>
(?:
(?:
[\w\s.*-]+
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
)
\1
$
'
Fix #3, remove the + quantifier (or add +?)
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?:
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
\1
$
'

Thanks to sln this is my fixed regexp:
my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s.-]++|\{\$\w+\}|%\w+%)[\\\/]{0,2})*\*?[\w.-]*\1$';
(I also disallowed * char in directories, and only allowed single * in (last) filename)

Related

Perl regex - read java file and match entire text of a function in file

I am trying to read a .java file into a perl variable, and I want to match a function, say for instance:
public String example(){
return "hello";
}
What would the regex pattern for this look like?
Current Attempt:
use strict;
use warnings;
open ( FILE, "example.java" ) || die "can't open file!";
my @lines = <FILE>;
close (FILE);
my $line;
foreach $line (@lines) {
if($line =~ /String example(.*)}/s){
print $line;
}
}
Adapted from this answer:
Regex:
^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/(\w+)["']?[^;]+\4;$|[^{}<'"/]++|[^{}]++|(?2))*)}))
Breakdown:
^ \s*
( # (1 start)
[\w\s]+ \( .* \) \s* # How it matches a function definition
( # (2 start)
\{ # Opening curly bracket
( # (3 start)
(?> # Atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # Double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # Single quoted strings
| // .* $ # A comment block starting with //
| /\* [\s\S]*? \*/ # A multi-line comment block /*...*/
( \w+ ) # (4) ^
["']? [^;]+ \4 ; $ # ^
| [^{}<'"/]++ # Force engine to backtrack if it encounters special characters (possessive)
| [^{}]++ # Default matching behavior (possessive)
| (?2) # Recurs 2nd capturing group
)* # Zero to many times of atomic group
) # (3 end)
} # Closing curly bracket
) # (2 end)
) # (1 end)
Revo's regex is the Right Way To Do it (as much as a regex ever can be!).
But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".
So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:
( # Capture \1.
^([\t ]+) # Match and capture leading whitespace to \2.
(?:\w+\s*)? # Privacy specifier, if any.
\w+\s*\( # Name and opening round brace: is a function.
.*? # Need Dot-matches-newline, to match fn body.
\n\2} # Curly brace is as indented as start of fn.
) # End capture of \1.
Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
Will work with K&R, Horstmann and Allman indent styles.
Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Pico and Ratliff - things which Revo's answer handles with no problems at all.
Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.
And neither of our regexes capture the comments preceding a function, which is pretty sinful.
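For a concrete feel of the indentation approach, here is a minimal Python sketch. It is an adaptation rather than a transcription of the pattern above: it allows several words before the method name so that "public String example(" is accepted, and find_method and SAMPLE are names and data invented for this illustration.

```python
import re


def find_method(source: str, name: str):
    """Return the text of the first method called `name`, relying on the
    closing brace being indented exactly like the signature line."""
    pattern = re.compile(
        r'^([\t ]*)'            # group 1: indentation of the signature line
        r'(?:\w+\s+)*'          # modifiers and return type, e.g. "public String "
        + re.escape(name) +
        r'\s*\('                # method name followed by its parameter list
        r'.*?'                  # lazily consume the body (DOTALL lets . cross lines)
        r'\n\1\}',              # first closing brace at the signature's indentation
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(source)
    return match.group(0) if match else None


SAMPLE = '''class Example {
    public String example(){
        if (true) {
            return "hello";
        }
        return "bye";
    }

    int other() {
        return 1;
    }
}
'''

if __name__ == "__main__":
    print(find_method(SAMPLE, "example"))
```

The nested if block does not end the match early, because its closing brace is indented more deeply than the signature. The same caveats apply as above: a call such as "return example();" that starts a line would also be picked up, so this is only "good enough" for consistently formatted code.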

Regex word can be optional but only if it matches the characters

Following pattern: (v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?((-schema)?(-dev)?)((-schema)?(-dev)?) from http://regexr.com/ is meant to be used in a shell script with grep and does match the following strings (working example):
Hello I am a text and this is my v1.12.33-32 version
Hello I am a text and this is my v1.12.33-dev version
Hello I am a text and this is my v1.12.33-dev-schema version
Hello I am a text and this is my v1.12.33-schema version
Hello I am a text and this is my v1.12.33-3-schema version
and so forth
So I made the words schema and dev optional. They can be omitted or used in an arbitrary order. What I don't want is this:
Hello I am a text and this is my v1.12.33-foo version
or Hello I am a text and this is my v1.12.33-asfs version
to match.
I want the option to be a bit more constrained. At the moment the Regex is still matching the stuff that...well actually matches.
This for example:
Hello I am a text and this is my v1.123.33
results in an empty string while this:
Hello I am a text and this is my v1.12.33-bla
still results in v1.12.33
Is this because of the grouping I made? So at least the fully matching groups will be taken for the returned match-string?
To match only the version string, disallow extra trailing tags, yet allow trailing unmatched text, you need a regex language that supports lookahead. Standard grep / egrep regexes do not support lookahead.
You have two options:
Since you seem to be relying on GNU grep anyway, you could use a Perl regex, such as
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)
The negative lookahead at the end allows the match to appear at the end of the line, but also requires that if it does not end the line then the next character following the match must be whitespace (which is not itself included in the match).
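To try the lookahead idea outside grep, here is a small Python sketch run against the sample lines from the question. The pattern is a slightly tidied equivalent of the one above (non-capturing groups, redundant optionals dropped); VERSION and extract_version are names invented for this illustration.

```python
import re

VERSION = re.compile(
    r'v[0-9]{1,2}(?:\.[0-9]{1,2}){2}'            # v1.12.33
    r'(?:-[0-9]{1,2})?'                          # optional build number, e.g. -32
    r'(?:-schema(?:-dev)?|-dev(?:-schema)?)?'    # optional tags, in either order
    r'(?!\S)'                                    # next char is whitespace, or end of text
)


def extract_version(line: str):
    """Return the version string found in `line`, or None if there is none."""
    match = VERSION.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    prefix = "Hello I am a text and this is my "
    for tail in ("v1.12.33-32 version", "v1.12.33-dev-schema version",
                 "v1.12.33-foo version", "v1.123.33"):
        print(tail, "->", extract_version(prefix + tail))
```

The lookahead is what rejects v1.12.33-foo: without it the engine would happily stop after v1.12.33 and report that as the match.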
You could give up on completely isolating the target text via -o, and instead allow the pattern to match the trailing context, too:
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(\s.*)?$
In this case, you could isolate the target text in a second step, by stripping off any tail beginning with whitespace.
Note that neither of these pays attention to text preceding the match. You have similar options for handling that portion as you do for handling the trailing portion.
The problem seems to be all the optional expressions lurking at the end of the pattern.
You can solve that in a few ways, but none is 100% reliable, because you'd need more rules to control what matches.
You can't simply say that no - is allowed afterward: the engine will backtrack into one of the {1,2} digit ranges to make a match.
What seems to work for now is requiring either a whitespace boundary at the end or one of the dev/schema items.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
( schema | dev ) # (5)
)?
) # (3 end)
)
edit
If you want to avoid matching the same schema|dev word twice, just add
a negative assertion of group 4, before capture group 5 above.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
(?! \4 ) # Not same word twice
( schema | dev ) # (5)
)?
) # (3 end)
)
Since regular expressions are open-ended, you need to specify with $ where you want the match to end, so you don't let the regex engine silently ignore trailing junk.
With only two tags in the optional set, I would just enumerate the 4 possibilities:
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(-schema|-dev|-dev-schema|-schema-dev)?$
My version:
grep --perl-regexp \
'\bv(?:\d{1,2}\.){2}\d{1,2}(?:\-\d{1,2})?(?:\-(?:schema|dev))?(?:\s|$)' \
path/to/file
Where
the first \b is a word boundary (you might want to make it stricter);
(?: ... ) expressions are non-capturing groups;
\s|$ is either a space character, or the end of line
The rest is just refactored for simplicity.
The expression allows only schema, or dev at the "end".

Regex to find strings not containing a specified value

I'm using notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the below example), where all strings begin with the same value (ABC in the below example).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The Third and Fifth strings both lack "HIJ" and therefore are included in the output, all others are not included in the output.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future, as my job description suddenly changed earlier this week when someone else in the company left suddenly, and therefore I have been doing this manually by searching (ABC|HIJ) and going through the search function's results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.
You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?: # non-capturing group
[^HA]+ # all that is not a H or an A
| # OR
H(?!IJ) # an H not followed by IJ
|
A(?!BC) # an A not followed by BC
)*+ # repeat the group
(?=ABC|$) # followed by "ABC" or the end of the string
Note: if you want to remove all that is not the items you want you can make this search replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n
you could use this pattern
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
( # Capturing Group (1)
ABC # "ABC"
(?: # Non Capturing Group
(?! # Negative Look-Ahead
HIJ # "HIJ"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
*? # (zero or more)(lazy)
) # End of Capturing Group (1)
(?= # Look-Ahead
ABC # "ABC"
| # OR
\R # <line break>
) # End of Look-Ahead
You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This starts with ABC and looks ahead (negative) for HIJ pattern. The pattern works for the separated strings.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
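The single-line pattern can be checked against the example input with a few lines of Python. This is only a sketch for experimenting outside Notepad++ (whose regex flavour differs from Python's re in other respects); ITEM_WITHOUT_HIJ and items_without_hij are names invented here.

```python
import re

# ABC, then as few characters as possible, none of which may start "HIJ",
# up to the next "ABC" or the end of the text.
ITEM_WITHOUT_HIJ = re.compile(r'ABC(?:(?!HIJ).)*?(?=ABC|$)', re.DOTALL)


def items_without_hij(text: str):
    """Return every ABC-delimited item that does not contain HIJ."""
    return ITEM_WITHOUT_HIJ.findall(text)


if __name__ == "__main__":
    data = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
    print(items_without_hij(data))
```

On the example input this yields ABC89 and ABC12~34HI456J, the two items named in the question; an item that contains HIJ can never reach the next ABC, so it is skipped entirely.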

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it does depend on the regex engine and the options whether they match line begins/ends or file begins/ends. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in programming, use the m flag.
It could be done like this.
Edit: So after realizing it's a line thing, this is the regex now.
Note(s) -
Uses multiline mode like Bergi's.
Also, you CANNOT just use the \s whitespace class in this.
No matter which mode is used, \s WILL match CRLF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match whitespace except CRLF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
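Here is the same pattern exercised in Python with the multiline flag, as a sketch for checking the test cases from the question. BLANK, PERCENT_LINE and valid_percentages are names invented for this illustration, and the sample text assumes \n line breaks.

```python
import re

BLANK = r'[^\S\r\n]*'   # horizontal whitespace only: never crosses a line break

PERCENT_LINE = re.compile(
    r'^' + BLANK + r'-?' + BLANK                 # optional sign, padded with blanks
    + r'(?:\.\d+|\d+(?:,\d+)*(?:\.\d*)?)'        # .5  or  1,234  or  1,234.56
    + BLANK + r'%' + BLANK + r'$',               # required percent sign, then end of line
    re.MULTILINE,                                # ^ and $ match at every line
)


def valid_percentages(text: str):
    """Return the lines of `text` that are valid percentages."""
    return [m.group(0) for m in PERCENT_LINE.finditer(text)]


if __name__ == "__main__":
    sample = "\n".join([
        "0", "% 0", "- 0 %", "315%", "2,104.86%",
        ", , , ,,,, 0,0,000,00,00,,,0%",
        "-", "000,000000.22", "%",
    ])
    print(valid_percentages(sample))
```

The last three sample lines are the split-over-lines case from the note above: with plain \s they would be accepted as one percentage, while with [^\S\r\n] none of them matches.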

RegEx lookahead on .*

I have a pattern that needs to find the last occurrence of string1 unless string2 is found anywhere in the subject, then it needs the first occurrence of string1. In order to solve this I wrote this inefficient negative lookahead.
/(.(?!.*?string2))*string1/
It takes several seconds to run (prohibitively long on subjects lacking any occurrence of either string). Is there a more efficient way to accomplish this?
You should be able to use the following:
/string1(?!.*?string2)/
This will match string1 as long as string2 is not found later in the string, which I think meets your requirements.
Edit: After seeing your update, try the following:
/.*?string1(?=.*?string2)|.*string1/
You could also do if/else statements in your regex !
(?(?=.*string2).*(string1).*$|^.*?(string1))
Explanation:
(? # If
(?=.*string2) # Lookahead, if there is string2
.*(string1).*$ # Then match the last string1
| # Else
^.*?(string1) # Match the first string1
)
If string1 is found, you'll find it in group 1.
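Not every engine has lookahead conditionals; Python's re module, for one, only supports conditionals on a group reference. In such engines the if/else can be written as an alternation whose first branch is guarded by a lookahead from the start of the subject. The sketch below follows the question's wording (first string1 when string2 appears anywhere, otherwise the last string1). pick_string1 is a name invented here, and nutria\d. / RABBIT are just placeholder values borrowed from the PHP answer in this thread.

```python
import re


def pick_string1(subject: str, string1: str = r'nutria\d.', string2: str = 'RABBIT'):
    """Return the first string1 if string2 occurs anywhere in subject,
    otherwise the last string1 (None when string1 is absent).
    Both arguments are regex fragments, not literal text."""
    pattern = re.compile(
        rf'^(?=.*?{string2}).*?({string1})'   # string2 present: first string1
        rf'|^.*({string1})',                  # otherwise: last string1
        re.DOTALL,
    )
    match = pattern.search(subject)
    if match is None:
        return None
    return match.group(1) or match.group(2)


if __name__ == "__main__":
    for subject in ('groundhog nutria1A beaver nutria1B',
                    'polecat nutria2A badger RABBIT nutria2B',
                    'weasel RABBIT nutria3A nutria3B nutria3C'):
        print(pick_string1(subject))
```

Since both branches are anchored with ^, a subject containing neither string fails after a couple of linear scans instead of re-running a lookahead at every character, which was the slow part of the original pattern.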
OK, now I understand what you want. It's a bit long, but optimized to be fast:
nutria\d. -> string1
RABBIT -> string2
The pattern (example in PHP):
$pattern = <<<LOD
~(?J) # allow multiple capture groups with the same name
### capture the first nutria if RABBIT isn't found before ###
^ (?>[^Rn]++|R++(?!ABBIT)|n++(?!utria\d.))* (?<res>nutria\d.)
### try to capture the last nutria without RABBIT until the end ###
(?>
(?>
(?> [^Rn]++ | R++(?!ABBIT) | n++(?!utria\d.) )*
(?<res>nutria\d.)
)* # repeat as possible to catch the last nutria
(?> [^R]++ | R++(?!ABBIT) )* $ # the end without RABBIT
)? # /!\important/!\ this part is optional, then only the first captured
# nutria is in the result when RABBIT is found in this part
| # OR
### capture the first nutria when RABBIT is found before
^(?> [^n]++ | n++(?!utria\d.) )* (?<res>nutria\d.)
~x
LOD;
$subjects = array( 'groundhog nutria1A beaver nutria1B',
'polecat nutria2A badger RABBIT nutria2B',
'weasel RABBIT nutria3A nutria3B nutria3C',
'vole nutria4A marten nutria4B marmot nutria4C RABBIT');
foreach($subjects as $subject) {
if (preg_match($pattern, $subject, $match))
echo '<br/>'.$match['res'];
}
The pattern is designed to fail as fast as possible by using atomic groups and possessive quantifiers with alternations. It avoids catastrophic backtracking and uses as few lookaheads as possible (one is tried only when an n or an R is found, and it fails quickly).
Try this regex:
string1(?!.*?string1)|string1(?=.*?string2)
Live Demo: http://www.rubular.com/r/uAjOqaTkYH
Try using the possessive operator .*+, it uses less memory (it doesn't store the entire backtrace of matching cases). It may also run faster because of this.