I am trying to stream a file in perl and tokenize the lines and include the tokens.
I have:
while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
my $word = $1;
#...
}
But it doesn't work when there's no spaces in the token.
For example, if my line is:
$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.'
I would like to split that line into:
$output =
[
'/15%one (1)(2)%15/',
' ',
'is',
' ',
'a',
'/%good (1)%/',
' ',
'+',
'/%number(2)%/',
'.'
]
What is the best way to do this?
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
)
}sxg;
but that doesn't validate. If you want to validate, you could use
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
| \z (*COMMIT) (*FAIL)
| (?{ die "Syntax error" })
)
}sxg;
The following also validates, but it's a bit more readable and makes it easy to differentiate the token types.:
my #tokens;
for ($line) {
m{\G ( \s+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( ([\/])([0-9]*)% (?: (?! %\3\2 ). )* %\3\2 ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G \z }sxgc
&& last;
die "Syntax error";
}
pos will get you information about where the error occurred.
Related
I need to process a csv file but one of the fields contains line breaks.
How can I replace all line breaks that are not after the double quote character (") with space? Any solution with awk, perl, sed etc is acceptable.
The file that is in the form:
497,50,2008-08-02T16:56:53Z,469,4,"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
The desired output is:
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
I understood your question to be a request to replace intra-field newlines (even if they occur immediately after a ", such as in a field containing ␊foo or foo "bar"␊baz). The following achieves that:
use Text::CSV_XS qw( );
my $qfn_in = ...;
my $qfn_out = ...;
open(my $fh_in, '<', $qfn_in) or die("Can't open \"$qfn_in\": $!\n");
open(my $fh_out, '>', $qfn_out) or die("Can't create \"$qfn_out\": $!\n");
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
while ( my $row = $csv->getline($fh_in) ) {
s/\n/ /g for #$row;
$csv->say($fh_out, $row);
}
I think it would make more sense to use the following:
for (#$row) {
s/^\s+//; # Remove leading whitespace.
s/\s+\z//; # Remove trailing whitespace.
s/\s+/ /g; # Replaces whitespace with a single space.
}
You can try this sed but the question is'nt clear enough to know what to do with a line like
497,50,2008-08-02T16:56:53Z,469,4,"truc biz",test
sed ':A;/[^"]$/{N;bA};y/\n/ /' infile
Its fairly easy to match the fields in csv.
The framework is the stuff between quoted/non-quoted fields
and is either delimiter or end of record tokens.
So the framework is matched as well to validate the fields.
After doing that, it's just a matter of replacing linebreaks in quoted fields.
That can be done in a call back.
The regex ((?:^|,|\r?\n)[^\S\r\n]*)(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"[^\S\r\n]*(?=$|,|\r?\n))|([^,\r\n]*(?=$|,|\r?\n)))
Here it is in Perl, all in one package.
use strict;
use warnings;
$/ = undef;
sub RmvNLs {
my ($delim, $quote, $non_quote) = #_;
if ( defined $non_quote ) {
return $delim . $non_quote;
}
$quote =~ s/\s*\r?\n/ /g;
return $delim . $quote;
}
my $csv = <DATA>;
$csv =~ s/
( # (1 start), Delimiter (BOS, comma or newline)
(?: ^ | , | \r? \n )
[^\S\r\n]* # Leading optional horizontal whitespaces
) # (1 end)
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
/RmvNLs($1,$2,$3)/xeg;
print $csv;
__DATA__
497,50,2008-08-02T16:56:53Z,469,4,"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
Output
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
I would like to parse the following lines
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
and get the following output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
I succeeded in parsing the first two fields with command
sed -Eun 's/[^>]+> ([0-9.]+)\.[0-9]+:.+q: A\? ([a-z0-9.-]+)\.([^:]+).*/\1:\2:\3/pg
`
But I cannot get the resolved IPs (A xx.xx.xx.xx). In fact there may be several.
Would it be possible to get such output using sed or Perl ?
EDIT:
As I added in comments, parsing of a larger input sample, I also require several lines to be discarded in the output. This lines can be characterized by :
the number of A records ("A xx.xx.xx.xx") is non null
or the line must not contains NXDomain\*?-
I succeed in meeting the new first need, but not for the second.
Following the #ikegami reply, here is my attempt:
perl -nle '
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $rest) = /^ \s+ ( [^:]++ ) : \s++ $field_value_re ( .* ) /sx
or next;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers = split(/\s*+,\s*+/, $answers);
my ($afield) = join " ", map { /^\S++\s++A\s++(\S++)/ } #answers;
if ( length($afield) != 0)
{
print join " ", $ip, $query, $afield;
}
' dns.sample
This does as you ask with the sample data
I first build a regex pattern $url_re that matches numeric URLs to make the following code more concise. Then I search for the first URL immediately after >, the named URL right after A?, and all of the following URLs which are preceded by A
They are all stored in array #urls and printed
use strict;
use warnings 'all';
use 5.010;
my $url_re = qr/(?:\d+\.){3}\d+/;
while ( <DATA> ) {
my #urls = ( />\s+($url_re)/, /A\?\s+([-\w.]+\w)/, /(A\s+$url_re)/g );
say "#urls";
}
__DATA__
8.8.19.12.53 > 125.15.15.9.40583: [udp sum ok] 62639 q: A? mp.microsoft.com. 6/5/9 mp.microsoft.com. CNAME .mp.microsoft.com.c.footprint.net., mp.microsoft.com.c.footprint.net. A 8.250.143.254, mp.microsoft.com.c.footprint.net. A 8.250.157.254 ns: c.footprint.net. NS d.ns.c.footprint.net. ar: d.ns.c.footprint.net. A 4.26.235.155 (439)
8.8.19.12.53 > 125.15.15.9.42091: [udp sum ok] 46555 q: A? www.toto.net. 1/0/0 www.toto.net. A 120.33.1.11 (47)
output
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254 A 4.26.235.155
125.15.15.9 www.toto.net A 120.33.1.11
Each line appears to be of the form
{"id" with spaces}: {stuff} [ {key}: {stuff} ]*
You appear to be interested in information inside the "id", and inside the field named q. The value of the q field appears to be of the form
A? {word} {word} {ns_return} [, {ns_return} ]*
Here's a robust solution that handles the format described above.
perl -nle'
my $field_value_re = qr/(?![^\s:]++:(?!\S)) \S++ (?: (?! \s++ [^\s:]++:(?!\S) ) \s++ \S++ )*+/x;
my ($id, $id_val, $rest) = /^ ( [^:]++ ) : \s++ ( $field_value_re ) ( .* ) /sx
or next;
next if $id_val =~ /\bNXDomain\b/;
my ($ip) = $id =~ /^ \S++ \s++ \S++ \s++ ( [^\s\.]++\.[^\s\.]++\.[^\s\.]++\.[^\s\.]++ )\.[^\s\.]++ \z /x
or next;
my %fields = $rest =~ /\G \s++ ( [^\s:]++ ) :(?!\S) \s++ ( $field_value_re ) /gsx;
my ($query, $answers) = $fields{q} =~ /^ A\? \s++ ( \S++ ) \s++ \S++ \s++ ( .* ) /sx
or next;
$query =~ s/\.\z//;
my #answers =
map { /^\S++\s++A\s++(\S++)/ }
split(/\s*+,\s*+/, $answers);
next if !#answers;
print join " ", $ip, $query, map { "A $_" } #answers;
' log
125.15.15.9 mp.microsoft.com A 8.250.143.254 A 8.250.157.254
125.15.15.9 www.toto.net A 120.33.1.11
This prints the desired output by using the map function in a somewhat unorthodox way to ignore any fields after q:
perl -lne 'print join qq/\t/, m/> ([\d\.]+)\./, map {/A\? ([^\s]+)\./, /(A [\d\.]+)/g} / q:([^:]+)/' log.txt
The regex below captures everything up to the last balanced }.
Now, what regex would be able to capture everything up to the next unbalanced }? In other words, how can I can get ... {three {four}} five} from $str instead of just ... {three {four}}?
my $str = "one two {three {four}} five} six";
if ( $str =~ /
(
.*?
{
(?> [^{}] | (?-1) )+
}
)
/sx
)
{
print "$1\n";
}
So you want to match
[noncurlies [block noncurlies [...]]] "}"
where a block is
"{" [noncurlies [block noncurlies [...]]] "}"
As a grammar:
start : text "}"
text : noncurly* ( block noncurly* )*
block : "{" text "}"
noncurly : /[^{}]/
As a regex (5.10+):
/
^
(
(
[^{}]*
(?:
\{ (?-1) \}
[^{}]*
)*
)
\}
)
/x
As a regex (5.10+):
/
^ ( (?&TEXT) \} )
(?(DEFINE)
(?<TEXT> [^{}]* (?: (?&BLOCK) [^{}]* )* )
(?<BLOCK> \{ (?&TEXT) \} )
)
/x
The following piece of code works just fine with Perl (v5.16.2). However, when I run it using Perl v5.8.9, it complains about the following regex. How can I rewrite this regex in a way that works with Perl v5.8.9. (I can't update the version).
REGEX:
use strict;
use warnings;
our %formula_per_k;
INIT {
# List all functions that you want to allow in formulas. All other words will be interpretted as variables.
my #FORMULA_FUNCS = qw(sqrt exp log);
# Load the data via a file.
my $data = do {local $/; <DATA>};
# Parse K blocks
while ($data =~ m{
^K \s+ (\w+) \s* \{
( (?: [^{}]+ | \{(?2)\} )* ) # Matched braces only.
\}
}mgx) {
my ($name, $params) = ($1, $2);
# Parse LOL block
next if $params !~ m{
LOL \s* \{
( (?: [^{}]+ | \{(?1)\} )*? ) # Matched braces only.
\}
}mx;
my $lol = $1;
# Start building anonymous subroutine
my $conditions = '';
# Parse Conditions and Formulas
while ($lol =~ m{
COND \s* \{ (.*?) \} \s*
FORMULA \s* \{ (.*?) \}
}gx) {
my ($cond, $formula) = ($1, $2);
# Remove Excess spacing and translate variable into perl scalar.
for ($cond, $formula) {
s/^\s+|\s+$//g;
s{([a-zA-Z]+)}{
my $var = $1;
$var = "\$hashref->{$var}" if ! grep {$var eq $_} #FORMULA_FUNCS;
$var
}eg;
}
$conditions .= "return $formula if $cond; ";
}
my $code = "sub {my \$hashref = shift; ${conditions} return; }";
my $sub = eval $code;
if ($#) {
die "Invalid formulas in $name: $#";
}
$formula_per_k{$name} = $sub;
}
}
sub formula_per_k {
my ($k, $vars) = #_;
die "Unrecognized K value '$k'" if ! exists $formula_per_k{$k};
return $formula_per_k{$k}($vars);
}
print "'K1', {d => .1} = " . formula_per_k('K1', {d => .1}) . "\n";
print "'K1', {d => .05} = " . formula_per_k('K1', {d => .05}) . "\n";
print "'K3', {d => .02} = " . formula_per_k('K3', {d => .02}) . "\n";
print "'K3', {d => .021} = " . formula_per_k('K3', {d => .021}) . "\n";
__DATA__
... #OTHER STUFFS
K K1 {
LOL {
COND { d < 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d) }
COND { d >= 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d+0.4) }
}
}
... #OTHER STUFFS
K K2 {
LOL {
COND { d < 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d) }
COND { d >= 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d+0.8) }
}
}
... #OTHER STUFFS
K K3 {
LOL {
COND { d < 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d) }
COND { d >= 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d+0.3) }
}
}
... #OTHER STUFF
Outputs:
'K1', {d => .1} = 2.13345237791561
'K1', {d => .05} = 2.01370729772479
'K3', {d => .02} = -4.13029437251523
'K3', {d => .021} = -4.13002941430942
ERROR:
Sequence (?1...) not recognized in regex; marked by <-- HERE in m/
^K \s+ M3 \s* {
( (?: [^{}]+ | {(?2 <-- HERE )} )* ) # Matched braces only.
}
/ at ./code.pl line 215, <RFILE> line 12.
UPDATE:
Code is updated.
This was originally suggested by https://stackoverflow.com/users/1733163/miller
Before the introduction of (?PARNO), we had to use (??{ code }) to create recursive regular expressions. An example can be found in perlre - Extended Patterns.
The following is tested on v5.16.2, v5.20.0, and locally on a v5.8.9 perlbrew:
our $braces_re;
$braces_re = qr{
\{
(?:
(?> [^{}]+ )
|
(??{ $braces_re })
)*
\}
}sx;
# parse FOO block
while (
$data =~ m{
^FOO \s+ (\w+) \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )* ) # Matched braces only.
\}
}mgx
)
{
my $params = $1;
# parse BAR block
next if $params !~ m{
BAR \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )*? ) # Matched braces only.
\}
}mx;
# SOME CODE
}
Note, I intentionally separated out the declaration of the _re variable and its initialization. There are some versions of perl that will let you declare a recursive regular expression in the same statement as the initialization, but v5.8.9 is not one of them.
Also, if you're comfortable altering your original regex more than just dropping in a replacement for (?PARNO) notation, then the above can be reduced to the following. Also confirmed on v5.16.2:
my $braces_re;
$braces_re = qr{
(?:
(?> [^{}]+ )
| # The following is a "postponed" regular subexpression.
\{ (??{ $braces_re }) \} # Deferred execution enables recursive regex
)*
}sx;
# parse FOO block
while ( $data =~ m{^FOO \s+ (\w+) \s* \{ ( $braces_re ) \} }mgx ) {
my $params = $1;
# parse BAR block
next if $params !~ m{BAR \s* \{ ( $braces_re ) \}}mx;
# SOME CODE
}
I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?
There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;
This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.