Single regex not working for three different patterns - regex

I need an optimal regexp to match all three of these types of text in a text file.
[TRUE,FALSE]
[4,5,6,7]
[2-15]
I am trying the following regex match, which is not working:
m/([0-9A-Fa-fx,]+)\s*[-~,]\s*([0-9A-Fa-fx,]+)/

/
(?(DEFINE)
(?<WORD> [a-zA-Z]+ )
(?<NUM> [0-9]+ )
)
\[ \s*
(?: (?&WORD) (?: \s* , \s* (?&WORD) )+
| (?&NUM) (?: \s* , \s* (?&NUM) )+
| (?&NUM) \s* - \s* (?&NUM)
)
\s* \]
/x

4-7 is a subset of 2-15. This regex should capture them:
/TRUE|FALSE|[2-9]|1[0-5]/

A quick'n'dirty test program:
#!/usr/bin/env perl
use strict;
use warnings;
for my $line (<DATA>) {
chomp $line;
print "$line: ";
if ($line =~ /
^ # beginning of the string
\[ # a literal opening sq. bracket
( # alternatives:
(TRUE|FALSE) (,(TRUE|FALSE))* # one or more truth words
| (\d+) (,\d+)* # one or more numbers
| (\d+) - (\d+) # a range of numbers
) # end of alternatives
\] # a literal closing sq. bracket
$ # end of the string
/x) {
print "match\n";
}
else {
print "no match\n";
}
}
__DATA__
[TRUE]
foo
[FALSE,TRUE,FALSE]
[FALSE,TRUE,]
[42,FALSE]
[17,42,666]
bar
[17-42]
[17,42-666]
Output:
[TRUE]: match
foo: no match
[FALSE,TRUE,FALSE]: match
[FALSE,TRUE,]: no match
[42,FALSE]: no match
[17,42,666]: match
bar: no match
[17-42]: match
[17,42-666]: no match
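For comparison, here is a minimal Python sketch of the same three alternatives used in the test program above. The names BRACKET_LIST and is_valid are made up for illustration, and it assumes, like the Perl version, that no whitespace is allowed inside the brackets.

```python
import re

# The same three alternatives as the Perl test program, for Python's re module.
BRACKET_LIST = re.compile(r"""
    \[                                        # a literal opening bracket
    (?:
        (?:TRUE|FALSE)(?:,(?:TRUE|FALSE))*    # one or more truth words
      | \d+(?:,\d+)*                          # one or more numbers
      | \d+-\d+                               # a range of numbers
    )
    \]                                        # a literal closing bracket
""", re.VERBOSE)


def is_valid(line: str) -> bool:
    """Return True if the whole line is one of the three bracketed forms."""
    return BRACKET_LIST.fullmatch(line) is not None


if __name__ == "__main__":
    for sample in ["[TRUE]", "foo", "[FALSE,TRUE,]", "[17,42,666]", "[17-42]", "[17,42-666]"]:
        print(f"{sample}: {'match' if is_valid(sample) else 'no match'}")
```

Using fullmatch plays the role of the ^ and $ anchors in the Perl pattern.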

Related

Regex to match specific functions and their arguments in files

I'm working on a gettext javascript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to a specific method call _n( and _(. For example, if I have these in my javascript files:
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
This refs this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I'm planning on doing it in two passes (with two regexes):
catch all function arguments for _n( or _( method calls
catch the stringy ones only
Basically, I'd like a regex that could say "catch everything after _n( or _( and stop at the last parenthesis ) where the function call actually ends". I don't know if that is possible with a regex and without a JavaScript parser.
What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the beginning of a new _n( or _( call".
In everything I've tried, I get stuck either on _( "one (optional)" ); with its inner parentheses or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.
Here is what I have implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one
Note: Read this answer if you're not familiar with recursion.
Part 1: match specific functions
Who said that regex can't be modular? Well PCRE regex to the rescue!
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s is for matching newlines with . and the x modifier is for this fancy spacing and commenting of our regex.
Part 2: getting rid of opening & closing brackets
Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
Part 3: extracting the arguments
So here's another modular regex, you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable) # Tried last, otherwise it would swallow the start of a quoted string
~xis
We will loop and use preg_match_all(). The final code would look like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if(preg_match_all($functionPattern, $input, $m)){
$filtered = preg_replace(
'~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x', // Regex
'', // Replace with nothing
$m['results'] // Subject
); // Getting rid of opening & closing brackets
// Part 3: extract arguments:
$parsedTree = array();
foreach($filtered as $arguments){ // Loop
if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => $m[0]
); // Add an array to our tree and fill it
}else{
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => array()
); // Add an array with empty branches
}
}
print_r($parsedTree); // Let's see the results;
}else{
echo 'no matches';
}
You might want to create a recursive function to generate a full tree. See this answer.
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
Try this:
(?<=\().*?(?=\s*\)[^)]*$)
See live demo
Below regex should help you.
^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);
Check the demo here
\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)
This should get anything between a pair of parentheses, ignoring parentheses in quotes.
Explanation:
\( // Literal open paren
(
| //Space or
"(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
'(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
[^)"'] //Any character that isn't a quote or close paren
)*? // All that, as many times as necessary
\) // Literal close paren
No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?
// This is just pseudocode. A loop like this can be more readable, maintainable, and predictable than a regular expression.
for(int i = 0; i < input.length; i++) {
    // Ignoring anything that isn't an opening paren
    if(input[i] == '(') {
        String capturedText = "";
        // Loop until a close paren is reached, or an EOF is reached
        // (the bounds check comes first so input[i] is never read past the end)
        for(; i < input.length && input[i] != ')'; i++) {
            if(input[i] == '"' || input[i] == "'") {
                char quote = input[i];
                // Copy the opening quote, then loop until an unescaped matching close quote is reached, or an EOF is reached
                capturedText += input[i++];
                for(; i < input.length && (input[i] != quote || input[i - 1] == '\\'); i++) {
                    capturedText += input[i];
                }
            }
            // Copy the current character (for a string, its closing quote)
            if(i < input.length) {
                capturedText += input[i];
            }
        }
        capture(capturedText);
    }
}
Note: I didn't cover how to determine if it's a function or just a grouping symbol. (ie, this will match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own javascript parser. You might want to take a look at the source code for actual javascript parsers if you need that sort of accuracy.
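A runnable take on the loop idea, sketched in Python rather than JavaScript or PHP. The names CALL_START and extract_call_args are hypothetical. It assumes that only _( and _n( calls matter and that strings use single or double quotes with backslash escapes. It ignores comments, regex literals, template strings and calls nested inside another call's arguments.

```python
import re

# An underscore call: "_" or "_n", optional whitespace, then "(".
# \b keeps identifiers such as foo_( from being treated as calls.
CALL_START = re.compile(r"\b_n?\s*\(")


def extract_call_args(source: str) -> list[str]:
    """Return the raw argument text of every _( ... ) and _n( ... ) call."""
    results = []
    pos = 0
    while (m := CALL_START.search(source, pos)):
        i = start = m.end()   # first character after the opening paren
        depth = 1             # how many parentheses are currently open
        quote = None          # the quote character we are inside, if any
        while i < len(source) and depth > 0:
            ch = source[i]
            if quote:
                if ch == "\\":
                    i += 1            # skip the escaped character
                elif ch == quote:
                    quote = None      # string closed
            elif ch in "\"'":
                quote = ch            # string opened
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        if depth == 0:
            # i is one past the closing paren, so drop that paren.
            results.append(source[start:i - 1].strip())
        pos = i
    return results


if __name__ == "__main__":
    line = 'apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)'
    print(extract_call_args(line))
```

Tracking the quote state and the parenthesis depth by hand is what the recursive regex does implicitly. An unterminated call is simply dropped.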
One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):
$string = '_("foo")
_n("bar", "baz", 42);
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);
foreach($matches[0] as $test){
$opArr = explode(',', $test);
foreach($opArr as $test2){
echo trim($test2) . "\n";
}
}
You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1
Output is:
"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
We can do this in two steps:
1) Catch all function arguments for _n( or _( method calls
(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)
See demo.
http://regex101.com/r/oE6jJ1/13
2) Catch the stringy ones only
"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))
See demo.
http://regex101.com/r/oE6jJ1/14
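For the second step, a minimal Python sketch of pulling only the string literals out of an argument list. STRING_LITERAL and string_args are illustrative names. It assumes single- or double-quoted strings with backslash escapes and does nothing special for other JavaScript syntax.

```python
import re

# A quoted string: any run of non-quote, non-backslash characters
# or backslash-escaped characters between matching quotes.
STRING_LITERAL = re.compile(r"""
    "(?:[^"\\]|\\.)*"     # double-quoted, backslash escapes allowed
  | '(?:[^'\\]|\\.)*'     # single-quoted, backslash escapes allowed
""", re.VERBOSE | re.DOTALL)


def string_args(arguments: str) -> list[str]:
    """Return every string literal found in the argument text, quotes included."""
    return STRING_LITERAL.findall(arguments)


if __name__ == "__main__":
    print(string_args('"bar", "baz", 42'))
    print(string_args('domain, "bux", var'))
```

Because the pattern has no capturing groups, findall returns the whole literal each time.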

Why is this regex not greedy?

In this regex
$line = 'this is a regular expression';
$line =~ s/^(\w+)\b(.*)\b(\w+)$/$3 $2 $1/;
print $line;
Why is $2 equal to " is a regular "? My thought process is that (.*) should be greedy and match all characters until the end of the line and therefore $3 would be empty.
That's not happening, though. The regex matcher is somehow stopping right before the last word boundary and populating $3 with what's after the last word boundary and the rest of the string is sent to $2.
Any explanation?
Thanks.
$3 can't be empty when using this regex because the corresponding capturing group is (\w+), which must match at least one word character or the whole match will fail.
So what happens is (.*) matches " is a regular expression" (including the leading space), \b matches the end of the string, and (\w+) fails to match. The regex engine then backtracks, one character at a time, until (.*) matches " is a regular " (note the match includes the trailing space), \b matches the word boundary before e, and (\w+) matches "expression".
If you change (\w+) to (\w*) then you will end up with the result you expected, where (.*) consumes the rest of the string.
Greedy doesn't mean it gets to match absolutely everything. It just means it can take as much as possible and still have the regex succeed.
This means that since you use the + in group 3 it can't be empty and still succeed as + means 1 or more.
If you want $3 to be empty, just change (\w+) to (\w?). Now since ? means 0 or 1 it can be empty, and therefore the greedy .* takes everything. Note: This seems to work only in Perl, due to how Perl deals with lines.
In order for the regex to match the whole string, ^(\w+)\b requires that the entire first word be \1. Likewise, \b(\w+)$ requires that the entire last word be \3. Therefore, no matter how greedy (.*) is, it can only capture ' is a regular ', otherwise the pattern won't match. At some point while matching the string, .* probably did take up the entire ' is a regular expression', but then it found that it had to backtrack and let the \w+ get its match too.
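The backtracking described above can be checked with a short Python sketch. It is a translation of the question's pattern into Python's re module, not the original Perl. The variable names are made up for illustration.

```python
import re

line = "this is a regular expression"

# Original pattern: the greedy (.*) must give characters back
# until \b(\w+)$ can match the last word.
original = re.match(r"^(\w+)\b(.*)\b(\w+)$", line)
print(original.groups())
# ('this', ' is a regular ', 'expression')

# With (\w*) the last group may be empty, so (.*) keeps everything.
relaxed = re.match(r"^(\w+)\b(.*)\b(\w*)$", line)
print(relaxed.groups())
# ('this', ' is a regular expression', '')
```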
The way that you wrote your regexp it doesn't matter if .* is being greedy, or non-greedy.
It will still match.
The reason is that you used \b between .* and \w+.
use strict;
use warnings;
my $string = 'this is a regular expression';
sub test{
my($match,$desc) = @_;
print '# ', $desc, "\n" if $desc;
print "test( qr'$match' );\n";
if( my @elem = $string =~ $match ){
print ' 'x4,'[\'', join("']['",@elem), "']\n\n"
}else{
print ' 'x4,"FAIL\n\n";
}
}
test( qr'^ (\w+) \b (.*) \b (\w+) $'x, 'original' );
test( qr'^ (\w+) \b (.*+) \b (\w+) $'x, 'extra-greedy' );
test( qr'^ (\w+) \b (.*?) \b (\w+) $'x, 'non-greedy' );
test( qr'^ (\w+) \b (.*) \b (\w*) $'x, '\w* instead of \w+' );
test( qr'^ (\w+) \b (.*) (\w+) $'x, 'no \b');
test( qr'^ (\w+) \b (.*?) (\w+) $'x, 'no \b, non-greedy .*?' );
# original
test( qr'(?^x:^ (\w+) \b (.*) \b (\w+) $)' );
['this'][' is a regular ']['expression']
# extra-greedy
test( qr'(?^x:^ (\w+) \b (.*+) \b (\w+) $)' );
FAIL
# non-greedy
test( qr'(?^x:^ (\w+) \b (.*?) \b (\w+) $)' );
['this'][' is a regular ']['expression']
# \w* instead of \w+
test( qr'(?^x:^ (\w+) \b (.*) \b (\w*) $)' );
['this'][' is a regular expression']['']
# no \b
test( qr'(?^x:^ (\w+) \b (.*) (\w+) $)' );
['this'][' is a regular expressio']['n']
# no \b, non-greedy .*?
test( qr'(?^x:^ (\w+) \b (.*?) (\w+) $)' );
['this'][' is a regular ']['expression']

Regex replace variables pointer replace

I have a string like
<? if $items.var1.var2 == $items2.var1.var2 ?>
I want to replace the string so that it looks like this
<? if $items->{var1}->{var2} == $items2->{var1}->{var2} ?>
So I want to replace strings like that, but not quoted strings such as '$items2.var1.var2' in this example
<? if $items.var1.var2 == '$items2.var1.var2' ?>
Can anyone help me?
So you need to replace dots/variables only if they are followed by an even number of quotes. In Python, for example:
>>> re.sub(r"\.(\w+)(?=(?:[^']*'[^']*')*[^']*$)", r"->{\1}",
... "<? if $items.var1.var2 == '$items2.var1.var2' ?>")
"<? if $items->{var1}->{var2} == '$items2.var1.var2' ?>"
Explanation:
\. # Match .
(\w+) # Match an alnum word and capture it
(?= # Assert that it's possible to match:
(?: # Match this (don't capture):
[^']*' # Any number of non-quotes, followed by a '
[^']*' # The same again
)* # any number of times, including 0
[^']* # Match any number of non-quotes
$ # until the end of the string.
) # End of lookahead assertion.
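An alternative sketch in the same language, using a different technique from the lookahead above: match quoted strings and dotted variables in a single pass, then rewrite only the variables. TOKEN and to_arrows are illustrative names. It assumes single-quoted strings with no escaped quotes inside, as in the question's examples.

```python
import re

# Either a single-quoted string (left alone) or a $variable with dotted keys.
TOKEN = re.compile(r"'[^']*'|\$\w+(?:\.\w+)+")


def to_arrows(template: str) -> str:
    """Rewrite $a.b.c as $a->{b}->{c}, skipping single-quoted strings."""
    def rewrite(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith("'"):
            return text                      # quoted: keep unchanged
        head, *keys = text.split(".")
        return head + "".join(f"->{{{key}}}" for key in keys)
    return TOKEN.sub(rewrite, template)


if __name__ == "__main__":
    print(to_arrows("<? if $items.var1.var2 == '$items2.var1.var2' ?>"))
```

Since quoted strings are consumed as whole tokens, nothing inside them is ever considered for replacement, and the string is scanned only once.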
This works in perl:
$str =~ s/\.(?=(\w|\.)+(?!'|"|\.)\W)/->/g;
#explained:
$str =~ s/
\. #A literal .
(?= #Followed by
(\w|\.)+ #One or more word chars or literal .
(?!'|"|\.) #Terminated by a \W (non-word char) that's not ',", or .
\W
)
/->/gx;
At least for your examples. http://codepad.org/a4L43hDr

Perl Regular expression for IP address range

I have some internet traffic data to analyze. I need to analyze only those packets that are within a certain IP range, so I need to write an if statement. I suppose I need a regular expression for the test condition, but my knowledge of regexps is a little weak. Can someone tell me how I would construct a regular expression for that condition? An example range may be like
Group A
56.286.75.0/19
57.256.106.0/21
64.131.14.0/22
Group B
58.176.44.0/21
58.177.92.0/19
The if statement would be like
if("IP in A" || "IP in B") {
do something
}
else { do something else }
So I would need to make the equivalent regexp for the "IP in A" and "IP in B" conditions.
I don't think that regexps provide much advantage for this problem.
Instead, use the Net::Netmask module. The "match" method should do what you want.
I have to echo the disagreement with using a regex to check IP addresses...however, here is a way to pull IPs out of text:
qr{
(?<!\d) # No digit having come immediately before
(?: [1-9] \d? # any one or two-digit number
| 1 \d \d # OR any three-digit number starting with 1
| 2 (?: [0-4] \d # OR 200 - 249
| 5 [0-6] # OR 250 - 256
)
)
(?: \. # followed by a dot
(?: [1-9] \d? # 1-256 reprise...
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
){3} # that group exactly 3 times
(?!\d) # no digit following immediately after
}x
;
But given that general pattern, we can construct an IP parser. But for the given "ranges", I wouldn't do anything less than the following:
A => qr{
(?<! \d )
(?: 56\.186\. 75
| 57\.256\.106
| 64\.131\. 14
)
\.
(?: [1-9] \d?
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
(?! \d )
}x
B => qr{
(?<! \d )
58 \.
(?: 176\.44
| 177\.92
)
\.
(?: [1-9] \d?
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
(?! \d )
}x
I'm doing something like:
use NetAddr::IP;
my @group_a = map NetAddr::IP->new($_), @group_a_masks;
...
my $addr = NetAddr::IP->new( $ip_addr_in );
if ( grep $_->contains( $addr ), @group_a ) {
print "group a";
}
I chose NetAddr::IP over Net::Netmask for IPv6 support.
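The same containment test can be sketched with Python's standard ipaddress module, which is not the Perl modules named above. parse_networks, in_group, GROUP_A and GROUP_B are hypothetical names. Two of the question's Group A entries contain octets above 255, so only the parseable prefixes are used here.

```python
import ipaddress


def parse_networks(cidrs):
    """Turn CIDR strings into network objects."""
    # strict=False accepts prefixes with host bits set, e.g. 64.131.14.0/22.
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


GROUP_A = parse_networks(["64.131.14.0/22"])
GROUP_B = parse_networks(["58.176.44.0/21", "58.177.92.0/19"])


def in_group(ip: str, networks) -> bool:
    """Return True if the address falls inside any network of the group."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    ip = "58.177.95.1"
    if in_group(ip, GROUP_A) or in_group(ip, GROUP_B):
        print("do something")
    else:
        print("do something else")
```

Like NetAddr::IP, the ipaddress module also handles IPv6 networks.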
Martin is right, use Net::Netmask. If you really want to use a regex though...
$prefix = "192.168.1.0/25";
$ip1 = "192.168.1.1";
$ip2 = "192.168.1.129";
$prefix =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\/([0-9]+)/$mask=(2**32-1)<<(32-$5); $1<<24|$2<<16|$3<<8|$4/e;
$ip1 =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/$1<<24|$2<<16|$3<<8|$4/e;
$ip2 =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/$1<<24|$2<<16|$3<<8|$4/e;
if (($prefix & $mask) == ($ip1 & $mask)) {
print "ip1 matches\n";
}
if (($prefix & $mask) == ($ip2 & $mask)) {
print "ip2 matches\n";
}
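The arithmetic in that snippet (pack the four octets into a 32-bit integer, build a mask from the prefix length, compare the masked values) can be written out as a small Python sketch. ip_to_int and in_prefix are made-up names, and it assumes well-formed dotted-quad IPv4 input.

```python
def ip_to_int(ip: str) -> int:
    """Pack a dotted-quad IPv4 address into one 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def in_prefix(ip: str, prefix: str) -> bool:
    """Return True if ip lies inside the CIDR prefix, e.g. 192.168.1.0/25."""
    network, bits = prefix.split("/")
    # Keep the top `bits` bits set and clear the rest, truncated to 32 bits.
    mask = (0xFFFFFFFF << (32 - int(bits))) & 0xFFFFFFFF
    return (ip_to_int(ip) & mask) == (ip_to_int(network) & mask)


if __name__ == "__main__":
    print(in_prefix("192.168.1.1", "192.168.1.0/25"))
    print(in_prefix("192.168.1.129", "192.168.1.0/25"))
```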

PCRE (recursive) pattern that matches a string containing a correctly parenthesized substring. Why does this one fail?

Well, there are other ways (hmmm... or rather working ways) to do it, but the question is why does this one fail?
/
\A # start of the string
( # group 1
(?: # group 2
[^()]* # something other than parentheses (greedy)
| # or
\( (?1) \) # parenthesized group 1
) # -group 2
+ # at least once (greedy)
) # -group 1
\Z # end of the string
/x
Fails to match a string with nested parentheses: "(())"
It doesn't fail
$ perl junk.pl
matched junk >(())<
$ cat junk.pl
my $junk = qr/
\A # start of the string
( # group 1
(?: # group 2
[^()]* # something other than parentheses (greedy)
| # or
\( (?1) \) # parenthesized group 1
) # -group 2
+ # at least once (greedy)
) # -group 1
\Z # end of the string
/x;
if( "(())" =~ $junk ){
print "matched junk >$1<\n";
}
Wow!.. Thank you, junk! It really works... in Perl. But not in PCRE. So, the question is mutating into "What's the difference between Perl and PCRE regex pattern matching?"
And voila! There is an answer:
Recursion difference from Perl
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure.
Therefore, we just need to swap two subpatterns:
/ \A ( (?: \( (?1) \) | [^()]* )+ ) \Z /x
Thank you!
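Python's standard-library re module has no recursive subpattern calls, so the pattern cannot be ported there directly. As a cross-check for the regex, here is a plain counter-based sketch (not a regex; is_balanced is just an illustrative name) that should accept the same strings the swapped pattern is meant to accept.

```python
def is_balanced(text: str) -> bool:
    """Return True if every ( has a matching ) and none closes too early."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ) with no ( before it
                return False
    return depth == 0            # nothing left open at the end


if __name__ == "__main__":
    for sample in ["(())", "a(b)c", "(()", ")("]:
        print(sample, is_balanced(sample))
```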