How to recursively match strings balanced with multi-character delimiters? - regex

How can I recursively match a string balanced with multi-character delimiters?
Consider a LaTeX inline quotation such that 2 doubleticks (``) mark up where the quote begins, and 2 apostrophes (\x27\x27) where it ends.
The following code gives me ``five''. I want to capture two ``three `four' ``five'' three four'' six
my $str = q|one ``two ``three `four' ``five'' three four'' six'' seven|;
if ( $str =~ /
(
``
(?:
[^`']
|
(?1)
)*
''
)
/x
)
{
print "$1\n";
}
I guess it has to do with how to negate not a character class ([^`']) but multi-character strings.

(?:(?!PAT)(?s:.))* is to PAT as [^CHAR]* is to CHAR, so
(?:(?!``|'')(?s:.))*
matches any sequence of characters, none of which starts either of those two sequences. However, I think lookaheads are a little expensive, so I believe
(?: [^`']+ | `(?!`) | '(?!') )*
would be cheaper. We get the following:
/
(
``
(
(?: [^`']+ | `(?!`) | '(?!') )*
(?:
(?-2)
(?: [^`']+ | `(?!`) | '(?!') )*
)*
)
''
)
/x
We can simplify for a small performance drop.
/
(
``
(
(?: [^`']+
| `(?!`)
| '(?!')
| (?-2)
)*
)
''
)
/x
In both snippets, the text you want to capture is in $2.
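For engines without pattern recursion (Python's built-in re module has no equivalent of (?1) or (?-2)), the same balancing can be approximated by scanning for the two delimiters and keeping a depth counter. This is only a sketch of that alternative, not the recursive technique shown above; the names TOKEN and outermost_quotes are made up for illustration.

```python
import re

# Only the two multi-character delimiters matter; everything else is skipped.
TOKEN = re.compile(r"``|''")

def outermost_quotes(text):
    """Return the contents of every outermost ``...'' span in text."""
    spans = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "``":
            if depth == 0:
                start = m.end()      # content begins after the opener
            depth += 1
        elif depth > 0:              # a closer that has a matching opener
            depth -= 1
            if depth == 0:
                spans.append(text[start:m.start()])
    return spans

sample = "one ``two ``three `four' ``five'' three four'' six'' seven"
print(outermost_quotes(sample))
```

A lone backtick or apostrophe (as in `four') never forms a token, so it is treated as ordinary text, which mirrors the `(?!`) and '(?!') branches of the regex.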

Related

Powershell Regex to match between vertical bar ( | )

Below are just two lines of the string that I am matching against
6 |UDP |ENABLED | |15006 |010.247.060.120 | UDP/IP Communications | UDP/IP Communications GH1870
10 |Gway |ONLINE | |41794 |127.000.000.001 | DM-MD64x64 | DM-MD64x64
Below is the regex I have so far, but it only matches the bottom line
(?i)(?<cipid>([\w\.]+))\s*\|\s*(?<ty>\w+)?\s*\|\s*(?<stat>[\w ]+)\s*\|\s*(?<devid>\w+)?\s*\|\s*(?<prt>\d+)\s*\|\s*(?<ip>([\d\.]+))\s*\|\s*(?<mdl>[\w-]+)\s*\|\s*(?<desc>.+)
I was wondering if I could have a regular expression that just matches every character between every vertical line, instead of having to explicitly say what is between the vertical lines
Thanks all
This usually works. (?:^|(?<=\|))[^|]*?(?=\||$)
https://regex101.com/r/KMNc47/1
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
[^|]*? # Optional non-pipe chars
(?= \| | $ ) # Pipe ahead or EOS
Here it is with whitespace trim and includes a capture group.
(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)
https://regex101.com/r/KMNc47/2
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
Here it is in a Capture Collection configuration.
(?:(?:^|\|)\s*([^|]*?)\s*(?=\||$))+
https://regex101.com/r/KMNc47/3
Formatted
(?:
(?: ^ | \| ) # BOS or Pipe
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
)+
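As a quick cross-check of the trimmed-capture pattern, here is a small sketch in Python (not PowerShell, purely for illustration) that applies it to the first sample line and compares the result with a plain split on the pipe character. The variable names are assumptions made for this example.

```python
import re

# The trimmed-capture pattern from the answer above, unchanged.
FIELD = re.compile(r"(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)")

line = ("6 |UDP |ENABLED | |15006 |010.247.060.120 "
        "| UDP/IP Communications | UDP/IP Communications GH1870")

# findall returns the contents of capture group 1 for each match.
fields = FIELD.findall(line)

# The same result without a regex: split on the pipe, then trim each field.
simple = [part.strip() for part in line.split("|")]

print(fields)
print(fields == simple)
```

The empty fourth column comes back as an empty string in both versions, so when the fields never contain an escaped pipe a split-and-trim may be all that is needed.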

Regex skip in C++

This is my string:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
# Block3 { }
Block4 {
anything here
}
I am using this regex to get each block name and inside content.
regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);
But this regex gets all inside of description too. There is a “skip” option in PHP that you can use to skip all descriptions.
What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
But this is C++ and I cannot use this skip method. What should I do to skip all descriptions and just get Block4 in C++ regex?
This regex detects Block1, Block2, Block3 and Block4 but I want to skip Block1, Block2, Block3 and just get Block4 (skip descriptions). How do I have to edit my regex to get just Block4 (everything outside the descriptions)?
Since you requested this long regex, here it is.
This will not handle nested Blocks like block{ block{ } }
it would match block{ block{ } } only.
Since you specified you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion say if you were to use
PCRE or Perl, or even BOOST::Regex. Let me know if you'd want to see that.
As it is it's flawed, but works for your sample.
Another thing it won't do is parse Preprocessor Directives '#...' because
I forgot the rules for that (thought I did it recently, but can't find a record).
To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].success) etc.. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to progress the match position.
The code is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.
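The scan-and-discard loop described above can be sketched in Python with a much simplified pattern: comments and quoted strings are listed first so that they are consumed and thrown away, and only matches where group 1 took part are kept. This is an illustration of the idea, not a port of the long regex below. The names COMMENT, STRING, BODY, SCANNER and find_blocks are hypothetical, and like the original it handles neither nested blocks nor preprocessor lines.

```python
import re

COMMENT = r'/\*.*?\*/|//[^\n]*'
STRING = r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\''
# Inside a block: a comment, a string, or any single character except '}'.
BODY = rf'(?:{COMMENT}|{STRING}|[^}}])*'
# Group 1 is the block name, group 2 its body; the other branches only skip.
SCANNER = re.compile(rf'{COMMENT}|{STRING}|(\w+)\s*\{{({BODY})\}}', re.DOTALL)

def find_blocks(source):
    """Return (name, body) for each block outside comments and strings."""
    return [(m.group(1), m.group(2))
            for m in SCANNER.finditer(source)
            if m.group(1) is not None]

sample = r'''/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
'''

print([name for name, _ in find_blocks(sample)])
```

Checking that group 1 is not None plays the same role as testing m[1] for success in the C++ loop.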
Benchmark
Sample:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
Results:
Regex1: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 8
Elapsed Time: 1.95 s, 1947.26 ms, 1947261 µs
Regex Explained:
# Raw: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
# Stringed: "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
( # (1 start), BLOCK
\w+ \s* \{
####################
(?: # ------------------------
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ------------------------
#####################
\}
) # (1 end), BLOCK
| # OR,
[\S\s] # Any other char
(?: # -------------------------
(?! # ASSERT: Here, cannot be a BLOCK{ }
\w+ \s* \{
(?: # ==============================
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ==============================
\}
) # ASSERT End
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
Tl;DR: Regular expressions cannot be used to parse full blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini-C++ parser to filter out comments. The answer to this related question might point you in the right direction.
Regex can be used to process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin / end quotes, and comments that can contain symbols.
If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context free grammars. If you are really curious, enroll in a Formal Language Theory class.

Regex to match specific functions and their arguments in files

I'm working on a gettext javascript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to a specific method call _n( and _(. For example, if I have these in my javascript files:
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
This refs this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I'm planning on doing it in two passes (with two regexes):
catch all function arguments for _n( or _( method calls
catch the stringy ones only
Basically, I'd like a regex that could say "catch everything after _n( or _( and stop at the last parenthesis )", that is, where the function call actually ends. I don't know if that is possible with a regex and without a JavaScript parser.
What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the beginning of a new _n( or _(".
In everything I've done I get either stuck on _( "one (optional)" ); with its inside parenthesis or apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.
Here is what I have implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one
Note: Read this answer if you're not familiar with recursion.
Part 1: match specific functions
Who said that regex can't be modular? Well PCRE regex to the rescue!
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s is for matching newlines with . and the x modifier is for this fancy spacing and commenting of our regex.
Online regex demo
Online php demo
Part 2: getting rid of opening & closing brackets
Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
Online php demo
Part 3: extracting the arguments
So here's another modular regex, you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&variable)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
~xis
We will loop and use preg_match_all(). The final code would look like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if(preg_match_all($functionPattern, $input, $m)){
$filtered = preg_replace(
'~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x', // Regex
'', // Replace with nothing
$m['results'] // Subject
); // Getting rid of opening & closing brackets
// Part 3: extract arguments:
$parsedTree = array();
foreach($filtered as $arguments){ // Loop
if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => $m[0]
); // Add an array to our tree and fill it
}else{
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => array()
); // Add an array with empty branches
}
}
print_r($parsedTree); // Let's see the results;
}else{
echo 'no matches';
}
Online php demo
You might want to create a recursive function to generate a full tree. See this answer.
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
Try this:
(?<=\().*?(?=\s*\)[^)]*$)
See live demo
Below regex should help you.
^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);
Check the demo here
\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)
This should get anything between a pair of parentheses, ignoring parentheses in quotes.
Explanation:
\( // Literal open paren
(
| //Space or
"(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
'(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
[^)"'] //Any character that isn't a quote or close paren
)*? // All that, as many times as necessary
\) // Literal close paren
No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?
// This is just pseudocode. A loop like this can be more readable, maintainable, and predictable than a regular expression.
for(int i = 0; i < input.length; i++) {
// Ignoring anything that isn't an opening paren
if(input[i] == '(') {
String capturedText = "";
// Start just after the opening paren and loop until a close paren is reached, or an EOF is reached
for(i++; i < input.length && input[i] != ')'; i++) {
if(input[i] == '"' || input[i] == '\'') {
char quote = input[i];
// Copy the opening quote, then loop until an unescaped close quote is reached, or an EOF is reached
capturedText += input[i++];
for(; i < input.length && input[i] != quote; i++) {
// Copy an escaped character together with its backslash
if(input[i] == '\\' && i + 1 < input.length) capturedText += input[i++];
capturedText += input[i];
}
}
// Append the current character (the closing quote when a string was just copied)
if(i < input.length) capturedText += input[i];
}
capture(capturedText);
}
}
Note: I didn't cover how to determine if it's a function or just a grouping symbol. (ie, this will match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own javascript parser. You might want to take a look at the source code for actual javascript parsers if you need that sort of accuracy.
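For comparison, here is a runnable sketch of the same character-scanning idea in Python. It narrows the search to calls that start with _( or _n(, tracks nested parentheses with a depth counter, and ignores parentheses inside quoted strings. The names CALL and call_arguments are made up for this example, and it makes no attempt to understand JavaScript comments or regex literals.

```python
import re

# An underscore, an optional n, optional whitespace, then the opening paren.
CALL = re.compile(r'\b_n?\s*\(')

def call_arguments(source):
    """Yield the raw argument text of every _( ... ) or _n( ... ) call."""
    pos = 0
    while True:
        m = CALL.search(source, pos)
        if m is None:
            return
        i = m.end()        # first character after the opening paren
        depth = 1
        quote = None       # the quote character while inside a string
        while i < len(source) and depth:
            ch = source[i]
            if quote:
                if ch == '\\':
                    i += 1             # skip the escaped character
                elif ch == quote:
                    quote = None       # string closed
            elif ch in '"\'':
                quote = ch
            elif ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
            i += 1
        if depth == 0:                 # only report calls that were closed
            yield source[m.end():i - 1].strip()
        pos = i

line = 'apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)'
print(list(call_arguments(line)))
```

Splitting the returned text into individual arguments would need the same quote tracking, since a comma can appear inside a string.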
One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):
$string = '_("foo")
_n("bar", "baz", 42);
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);
foreach($matches[0] as $test){
$opArr = explode(',', $test);
foreach($opArr as $test2){
echo trim($test2) . "\n";
}
}
you can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1
Output is:
"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
We can do this in two steps:
1)catch all function arguments for _n( or _( method calls
(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)
See demo.
http://regex101.com/r/oE6jJ1/13
2)catch the stringy ones only
"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))
See demo.
http://regex101.com/r/oE6jJ1/14

single regex not working for a three different patterns

I need an optimal regexp to match all three of these types of text in a text file.
[TRUE,FALSE]
[4,5,6,7]
[2-15]
I am trying the following regex match, which is not working:
m/([0-9A-Fa-fx,]+)\s*[-~,]\s*([0-9A-Fa-fx,]+)/)
/
(?(DEFINE)
(?<WORD> [a-zA-Z]+ )
(?<NUM> [0-9]+ )
)
\[ \s*
(?: (?&WORD) (?: \s* , \s* (?&WORD) )+
| (?&NUM) (?: \s* , \s* (?&NUM) )+
| (?&NUM) \s* - \s* (?&NUM)
)
\s* \]
/x
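Python's built-in re module has no (?(DEFINE)...) construct, so the same pattern can be approximated there by composing named string fragments. The sketch below assumes the fragment names WORD, NUM and SEP and the compiled name BRACKET_LIST, none of which come from the answer; the structure of the alternation is otherwise the same, including the requirement of at least two items in a comma list.

```python
import re

WORD = r"[a-zA-Z]+"
NUM = r"[0-9]+"
SEP = r"\s*,\s*"

BRACKET_LIST = re.compile(rf"""
    \[ \s*
    (?: {WORD} (?: {SEP} {WORD} )+     # two or more words
      | {NUM}  (?: {SEP} {NUM}  )+     # two or more numbers
      | {NUM} \s* - \s* {NUM}          # a numeric range
    )
    \s* \]
""", re.VERBOSE)

for text in ("[TRUE,FALSE]", "[4,5,6,7]", "[2-15]", "[42,FALSE]"):
    print(text, bool(BRACKET_LIST.search(text)))
```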
4-7 is a subset of 2-15. This regex should capture them:
/TRUE|FALSE|[2-9]|1[0-5]/
A quick'n'dirty test program:
#!/usr/bin/env perl
use strict;
use warnings;
for my $line (<DATA>) {
chomp $line;
print "$line: ";
if ($line =~ /
^ # beginning of the string
\[ # a literal opening sq. bracket
( # alternatives:
(TRUE|FALSE) (,(TRUE|FALSE))* # one or more truth words
| (\d+) (,\d+)* # one or more numbers
| (\d+) - (\d+) # a range of numbers
) # end of alternatives
\] # a literal closing sq. bracket
$ # end of the string
/x) {
print "match\n";
}
else {
print "no match\n";
}
}
__DATA__
[TRUE]
foo
[FALSE,TRUE,FALSE]
[FALSE,TRUE,]
[42,FALSE]
[17,42,666]
bar
[17-42]
[17,42-666]
Output:
[TRUE]: match
foo: no match
[FALSE,TRUE,FALSE]: match
[FALSE,TRUE,]: no match
[42,FALSE]: no match
[17,42,666]: match
bar: no match
[17-42]: match
[17,42-666]: no match

Perl Regular expression for IP address range

I have some internet traffic data to analyze. I need to analyze only those packets that are within a certain IP range, so I need to write an if statement. I suppose I need a regular expression for the test condition, but my knowledge of regexps is a little weak. Can someone tell me how I would construct a regular expression for that condition? An example range may be like
Group A
56.286.75.0/19
57.256.106.0/21
64.131.14.0/22
Group B
58.176.44.0/21
58.177.92.0/19
The if statement would be like
if("IP in A" || "IP in B") {
do something
}
else { do something else }
So I would need to make the equivalent regexp for the "IP in A" and "IP in B" conditions.
I don't think that regexps provide much advantage for this problem.
Instead, use the Net::Netmask module. The "match" method should do what you want.
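To illustrate why a network-aware library is the easier route, here is a minimal sketch of the same membership test using Python's standard ipaddress module rather than Net::Netmask. It uses only the Group B prefixes from the question; the names GROUP_B and in_group are made up for the example.

```python
import ipaddress

# Group B from the question. strict=False accepts prefixes written with host
# bits set (58.176.44.0/21 is normalised to 58.176.40.0/21).
GROUP_B = [ipaddress.ip_network(prefix, strict=False)
           for prefix in ("58.176.44.0/21", "58.177.92.0/19")]

def in_group(ip, networks):
    """Return True if the dotted-quad address falls inside any network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

for ip in ("58.176.47.200", "58.176.48.1", "58.177.95.255"):
    print(ip, in_group(ip, GROUP_B))
```

Note that a /21 or /19 prefix covers several values of the third octet, which is hard to express in a regex and trivial with a netmask.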
I have to echo the disagreement with using a regex to check IP addresses...however, here is a way to pull IPs out of text:
qr{
(?<!\d) # No digit having come immediately before
(?: [1-9] \d? # any one or two-digit number
| 1 \d \d # OR any three-digit number starting with 1
| 2 (?: [0-4] \d # OR 200 - 249
| 5 [0-6] # OR 250 - 256
)
)
(?: \. # followed by a dot
(?: [1-9] \d? # 1-256 reprise...
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
){3} # that group exactly 3 times
(?!\d) # no digit following immediately after
}x
;
But given that general pattern, we can construct an IP parser. But for the given "ranges", I wouldn't do anything less than the following:
A => qr{
(?<! \d )
(?: 56\.186\. 75
| 57\.256\.106
| 64\.131\. 14
)
\.
(?: [1-9] \d?
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
(?! \d )
}x
B => qr{
(?<! \d )
58 \.
(?: 176\.44
| 177\.92
)
\.
(?: [1-9] \d?
| 1 \d \d
| 2 (?: [0-4] \d
| 5 [0-6]
)
)
(?! \d )
}x
I'm doing something like:
use NetAddr::IP;
my @group_a = map NetAddr::IP->new($_), @group_a_masks;
...
my $addr = NetAddr::IP->new( $ip_addr_in );
if ( grep $_->contains( $addr ), @group_a ) {
print "group a";
}
I chose NetAddr::IP over Net::Netmask for IPv6 support.
Martin is right, use Net::Netmask. If you really want to use a regex though...
$prefix = "192.168.1.0/25";
$ip1 = "192.168.1.1";
$ip2 = "192.168.1.129";
$prefix =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\/([0-9]+)/$mask=(2**32-1)<<(32-$5); $1<<24|$2<<16|$3<<8|$4/e;
$ip1 =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/$1<<24|$2<<16|$3<<8|$4/e;
$ip2 =~ s/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/$1<<24|$2<<16|$3<<8|$4/e;
if (($prefix & $mask) == ($ip1 & $mask)) {
print "ip1 matches\n";
}
if (($prefix & $mask) == ($ip2 & $mask)) {
print "ip2 matches\n";
}
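The arithmetic hidden inside those substitutions is easier to follow when written out directly. The sketch below does the same thing in Python: pack the four octets into one 32-bit integer, build a mask from the prefix length, and compare the masked values. The helper names to_int and in_prefix are assumptions for this example.

```python
def to_int(dotted):
    """Pack a dotted-quad IPv4 address into a single 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return a << 24 | b << 16 | c << 8 | d

def in_prefix(ip, prefix):
    """Return True if ip lies inside prefix, written as 'a.b.c.d/len'."""
    network, bits = prefix.split("/")
    # Keep the top `bits` bits set and clamp the result to 32 bits.
    mask = (0xFFFFFFFF << (32 - int(bits))) & 0xFFFFFFFF
    return to_int(ip) & mask == to_int(network) & mask

print(in_prefix("192.168.1.1", "192.168.1.0/25"))
print(in_prefix("192.168.1.129", "192.168.1.0/25"))
```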