Regex parsing of DNS answer - regex

I would like to parse the following lines
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
and get the following output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
I succeeded in parsing the first two fields with command
sed -Eun 's/[^>]+> ([0-9.]+)\.[0-9]+:.+q: A\? ([a-z0-9.-]+)\.([^:]+).*/\1:\2:\3/pg
`
But I cannot get the resolved IPs (A xx.xx.xx.xx). In fact there may be several.
Would it be possible to get such output using sed or Perl ?
EDIT:
As I added in comments, parsing of a larger input sample, I also require several lines to be discarded in the output. This lines can be characterized by :
the number of A records ("A xx.xx.xx.xx") is non null
or the line must not contains NXDomain\*?-
I succeed in meeting the new first need, but not for the second.
Following the #ikegami reply, here is my attempt:
perl -nle '
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $rest) = /^ \s+ ( [^:]++ ) : \s++ $field_value_re ( .* ) /sx
or next;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers = split(/\s*+,\s*+/, $answers);
my ($afield) = join " ", map { /^\S++\s++A\s++(\S++)/ } #answers;
if ( length($afield) != 0)
{
print join " ", $ip, $query, $afield;
}
' dns.sample

This does as you ask with the sample data
I first build a regex pattern $url_re that matches numeric URLs to make the following code more concise. Then I search for the first URL immediately after >, the named URL right after A?, and all of the following URLs which are preceded by A
They are all stored in array #urls and printed
use strict;
use warnings 'all';
use 5.010;
my $url_re = qr/(?:\d+\.){3}\d+/;
while ( <DATA> ) {
my #urls = ( />\s+($url_re)/, /A\?\s+([-\w.]+\w)/, /(A\s+$url_re)/g );
say "#urls";
}
__DATA__
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254 A 4.26.235.155
125.15.15.9 www.toto.net A 120.33.1.11

Each line appears to be of the form
{"id" with spaces}: {stuff} [ {key}: {stuff} ]*
You appear to be interested in information inside the "id", and inside the field named q. The value of the q field appears to be of the form
A? {word} {word} {ns_return} [, {ns_return} ]*
Here's a robust solution that handles the format described above.
perl -nle'
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $id_val, $rest) = /^ ( [^:]++ ) : \s++ ( $field_value_re ) ( .* ) /sx
or next;
next if $id_val =~ /\bNXDomain\b/;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers =
map { /^\S++\s++A\s++(\S++)/ }
split(/\s*+,\s*+/, $answers);
next if !#answers;
print join " ", $ip, $query, map { "A $_" } #answers;
' log
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11

This prints the desired output by using the map function in a somewhat unorthodox way to ignore any fields after q:
perl -lne 'print join qq/\t/, m/> ([\d\.]+)\./, map {/A\? ([^\s]+)\./, /(A [\d\.]+)/g} / q:([^:]+)/' log.txt

Related

Substitute: replacement evaluation

Is the first and the second substitution equivalent if the replacement is passed in a variable?
#!/usr/bin/env perl6
use v6;
my $foo = 'switch';
my $t1 = my $t2 = my $t3 = my $t4 = 'this has a $foo in it';
my $replace = prompt( ':' ); # $0
$t1.=subst( / ( \$ \w+ ) /, $replace );
$t2.=subst( / ( \$ \w+ ) /, { $replace } );
$t3.=subst( / ( \$ \w+ ) /, { $replace.EVAL } );
$t4.=subst( / ( \$ \w+ ) /, { ( $replace.EVAL ).EVAL } );
say "T1 : $t1";
say "T2 : $t2";
say "T3 : $t3";
say "T4 : $t4";
# T1 : this has a $0 in it
# T2 : this has a $0 in it
# T3 : this has a $foo in it
# T4 : this has a switch in it
The only difference between $replace and {$replace} is that the second is a block that returns the value of the variable. It's only adding a level of indirection, but the result is the same.
Update: Edited according to #raiph's comments.

Rewriting a recursive regex for older Perl version

The following piece of code works just fine with Perl (v5.16.2). However, when I run it using Perl v5.8.9, it complains about the following regex. How can I rewrite this regex in a way that works with Perl v5.8.9. (I can't update the version).
REGEX:
use strict;
use warnings;
our %formula_per_k;
INIT {
# List all functions that you want to allow in formulas. All other words will be interpretted as variables.
my #FORMULA_FUNCS = qw(sqrt exp log);
# Load the data via a file.
my $data = do {local $/; <DATA>};
# Parse K blocks
while ($data =~ m{
^K \s+ (\w+) \s* \{
( (?: [^{}]+ | \{(?2)\} )* ) # Matched braces only.
\}
}mgx) {
my ($name, $params) = ($1, $2);
# Parse LOL block
next if $params !~ m{
LOL \s* \{
( (?: [^{}]+ | \{(?1)\} )*? ) # Matched braces only.
\}
}mx;
my $lol = $1;
# Start building anonymous subroutine
my $conditions = '';
# Parse Conditions and Formulas
while ($lol =~ m{
COND \s* \{ (.*?) \} \s*
FORMULA \s* \{ (.*?) \}
}gx) {
my ($cond, $formula) = ($1, $2);
# Remove Excess spacing and translate variable into perl scalar.
for ($cond, $formula) {
s/^\s+|\s+$//g;
s{([a-zA-Z]+)}{
my $var = $1;
$var = "\$hashref->{$var}" if ! grep {$var eq $_} #FORMULA_FUNCS;
$var
}eg;
}
$conditions .= "return $formula if $cond; ";
}
my $code = "sub {my \$hashref = shift; ${conditions} return; }";
my $sub = eval $code;
if ($#) {
die "Invalid formulas in $name: $#";
}
$formula_per_k{$name} = $sub;
}
}
sub formula_per_k {
my ($k, $vars) = #_;
die "Unrecognized K value '$k'" if ! exists $formula_per_k{$k};
return $formula_per_k{$k}($vars);
}
print "'K1', {d => .1} = " . formula_per_k('K1', {d => .1}) . "\n";
print "'K1', {d => .05} = " . formula_per_k('K1', {d => .05}) . "\n";
print "'K3', {d => .02} = " . formula_per_k('K3', {d => .02}) . "\n";
print "'K3', {d => .021} = " . formula_per_k('K3', {d => .021}) . "\n";
__DATA__
... #OTHER STUFFS
K K1 {
LOL {
COND { d < 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d) }
COND { d >= 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d+0.4) }
}
}
... #OTHER STUFFS
K K2 {
LOL {
COND { d < 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d) }
COND { d >= 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d+0.8) }
}
}
... #OTHER STUFFS
K K3 {
LOL {
COND { d < 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d) }
COND { d >= 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d+0.3) }
}
}
... #OTHER STUFF
Outputs:
'K1', {d => .1} = 2.13345237791561
'K1', {d => .05} = 2.01370729772479
'K3', {d => .02} = -4.13029437251523
'K3', {d => .021} = -4.13002941430942
ERROR:
Sequence (?1...) not recognized in regex; marked by <-- HERE in m/
^K \s+ M3 \s* {
( (?: [^{}]+ | {(?2 <-- HERE )} )* ) # Matched braces only.
}
/ at ./code.pl line 215, <RFILE> line 12.
UPDATE:
Code is updated.
This was originally suggested by https://stackoverflow.com/users/1733163/miller
Before the introduction of (?PARNO), we had to use (??{ code }) to create recursive regular expressions. An example can be found in perlre - Extended Patterns.
The following is tested on v5.16.2, v5.20.0, and locally on a v5.8.9 perlbrew:
our $braces_re;
$braces_re = qr{
\{
(?:
(?> [^{}]+ )
|
(??{ $braces_re })
)*
\}
}sx;
# parse FOO block
while (
$data =~ m{
^FOO \s+ (\w+) \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )* ) # Matched braces only.
\}
}mgx
)
{
my $params = $1;
# parse BAR block
next if $params !~ m{
BAR \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )*? ) # Matched braces only.
\}
}mx;
# SOME CODE
}
Note, I intentionally separated out the declaration of the _re variable and its initialization. There are some versions of perl that will let you declare a recursive regular expression in the same statement as the initialization, but v5.8.9 is not one of them.
Also, if you're comfortable altering your original regex more than just dropping in a replacement for (?PARNO) notation, then the above can be reduced to the following. Also confirmed on v5.16.2:
my $braces_re;
$braces_re = qr{
(?:
(?> [^{}]+ )
| # The following is a "postponed" regular subexpression.
\{ (??{ $braces_re }) \} # Deferred execution enables recursive regex
)*
}sx;
# parse FOO block
while ( $data =~ m{^FOO \s+ (\w+) \s* \{ ( $braces_re ) \} }mgx ) {
my $params = $1;
# parse BAR block
next if $params !~ m{BAR \s* \{ ( $braces_re ) \}}mx;
# SOME CODE
}

perl stream file for regex token including scanned tokens

I am trying to stream a file in perl and tokenize the lines and include the tokens.
I have:
while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
my $word = $1;
#...
}
But it doesn't work when there's no spaces in the token.
For example, if my line is:
$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.'
I would like to split that line into:
$output =
[
'/15%one (1)(2)%15/',
' ',
'is',
' ',
'a',
'/%good (1)%/',
' ',
'+',
'/%number(2)%/',
'.'
]
What is the best way to do this?
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
)
}sxg;
but that doesn't validate. If you want to validate, you could use
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
| \z (*COMMIT) (*FAIL)
| (?{ die "Syntax error" })
)
}sxg;
The following also validates, but it's a bit more readable and makes it easy to differentiate the token types.:
my #tokens;
for ($line) {
m{\G ( \s+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( ([\/])([0-9]*)% (?: (?! %\3\2 ). )* %\3\2 ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G \z }sxgc
&& last;
die "Syntax error";
}
pos will get you information about where the error occurred.

Regex with recursive expression to match nested braces?

I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?
There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;
This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.

How can I parse runmqsc command output using Perl?

I am trying devise Perl regex to parse command output from IBM's runmqsc utility.
Each line of output of interest contains one or more attribute/value pairs with format: "ATTRIBUTE(VALUE)". The value for an attribute can be empty, or can contain parenthesis itself. Typically, a maximum of two attribute/value pairs appear on a given line, so the regex is written under this assumption.
Example input to Perl RE:
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR)
DISCINT(6000) SHORTRTY(10)
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )
I have the following Perl code to capture each attribute/value pair.
Perl Code
my $resplit = qr/\s+([^\s]+(?:\([^)]*\))?)\s?/;
while ( <IN2> )
{ s/[\s\r\n]+$//;
if ( m/^\s(?:$resplit)(?:$resplit)?$/ )
{ my ($one,$two) = ($1,$2);
print "one: $one, two: $two\n";
}
}
Here's the output when the above code is applied to sample input:
one: CHANNEL(TO.IPTWX01), two: CHLTYPE(CLUSRCVR)
one: DISCINT(6000), two: SHORTRTY(10)
one: TRPTYPE(TCP), two: DESCR( )
one: LONGTMR(1200), two: SCYEXIT( )
one: CONNAME(NODE(1414)), two: MREXIT( )
one: MREXIT( ), two: CONNAME2(SOME(1416))
one: TPNAME( ), two: BATCHSZ(50)
one: MCANAME( ), two: MODENAME( )
one: ALTTIME(00.41.56), two: SSLPEER()
one: CONTRIVED(), two: ATTR(00-41-56)
one: CONTRIVED(), two: DOCTORED()
one: MSGEXIT(, two: )
This works great with the exception of the last line in the output
above. I'm really struggling to figure out how
to modify the above expression $resplit to capture the last case.
Can anyone offer any ideas/suggestions on how to make this work or
another approach?
The Text::Balanced module is designed to handle this sort of problem. This approach will handle any number of columns as well.
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
my ($extracted, $remainder, $prefix);
while ( defined($remainder = <DATA>) ){
while ( Get_paren_text() ){
$prefix =~ s/ //g;
print $prefix, $extracted, "\n";
}
}
sub Get_paren_text {
($extracted, $remainder, $prefix)
= extract_bracketed($remainder, '()', '[\w ]+');
return defined $extracted;
}
__DATA__
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR) FOO( ( BAR) )
DISCINT(6000) SHORTRTY(10) BIZZ((((BUZZ) ) ) ) )
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )
I wanted to try to use Regexp::Grammars.
So here it is:
#! /opt/perl/bin/perl
use strict;
#use warnings;
use 5.10.1;
use Regexp::Grammars;
my $grammar = qr{
<line>
<token: line>
(?: <[pair]> \s* )+
(?{
my $arr = $MATCH{pair};
local $MATCH = {};
for my $pair( #$arr ){
my($key) = keys %$pair;
my($value) = values %$pair;
$MATCH->{$key} = $value;
}
})
<token: pair>
<attrib> \s* \( \s* <value> \s* \)
(?{
$MATCH = {
$MATCH{attrib} => $MATCH{value}
};
})
<token: attrib>
[^()]*?
<token: value>
(?:
<MATCH=pair> |
[^()]*?
)
}x;
use warnings;
my %attr;
while( my $line = <> ){
$line =~ /$grammar/;
for my $key ( keys %{ $/{line} } ){
$attr{$key} = $/{line}{$key};
}
}
use YAML;
say Dump \%attr;
---
ALTTIME: 00.41.56
ATTR: 00-41-56
BATCHSZ: 50
CHANNEL: TO.IPTWX01
CHLTYPE: CLUSRCVR
CONNAME:
NODE: 1414
CONNAME2:
SOME: 1416
CONTRIVED: ''
DESCR: ''
DISCINT: 6000
DOCTORED: ''
LONGTMR: 1200
MCANAME: ''
MODENAME: ''
MREXIT: ''
MSGEXIT: ''
SCYEXIT: ''
SHORTRTY: 10
SSLPEER: ''
TPNAME: ''
TRPTYPE: TCP
while ( <IN2> ) {
while ( /([A-Z]+)\s*(\((?:[^()]*+|(?2))*\))/g ) {
print "$1$2\n";
}
}
This works for nested parens e.g.
CONNAME(NODE(1414, SOME(1416) ) ) ATTR (00-41-56)
The (?2) part is recursive, the *+ means "don't backtrack" - only works in Perl 5.10 or later; I got this from http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul
#!/usr/bin/perl
use strict;
use warnings;
my #parsed;
while ( my $line = <DATA> ) {
while ( $line =~ / ([A-Z0-9]+) \s* \( (.*?) \) \s /gx ) {
push #parsed, { $1 => $2 }
}
}
use Data::Dumper;
print Dumper \#parsed;
__DATA__
CHANNEL(TO.IPTWX01) CHLTYPE(CLUSRCVR)
DISCINT(6000) SHORTRTY(10)
TRPTYPE(TCP) DESCR( )
LONGTMR(1200) SCYEXIT( )
CONNAME(NODE(1414)) MREXIT( )
MREXIT( ) CONNAME2(SOME(1416))
TPNAME( ) BATCHSZ(50)
MCANAME( ) MODENAME( )
ALTTIME(00.41.56) SSLPEER()
CONTRIVED() ATTR (00-41-56)
CONTRIVED() DOCTORED()
MSGEXIT( )