Perl - Parse blocks from text file

Perl - Parse blocks from text file - regex

First, I apologize if you feel this is a duplicate. I looked around and found some very similar questions, but I either got lost or it wasn't quite what I think I need and therefore couldn't come up with a proper implementation.
QUESTION:
So I have a txt file that contains entries made by another script (I can edit the format for how these entries are generated if you can suggest a better way to format them):
SR4 Pool2
11/5/2012 13:45
----------
Beginning Wifi_Main().
SR4 Pool2
11/8/2012 8:45
----------
This message is a
multiline message.
SR4 Pool4
11/5/2012 14:45
----------
Beginning Wifi_Main().
SR5 Pool2
11/5/2012 13:48
----------
Beginning Wifi_Main().
And I made a perl script to parse the file:
#!C:\xampp-portable\perl\bin\perl.exe
use strict;
use warnings;
#use Dumper;
use CGI 'param','header';
use Template;
#use Config::Simple;
#Config::Simple->import_from('config.ini', \%cfg);
my $cgh = CGI->new;
my $logs = {};
my $key;
print "Content-type: text/html\n\n";
open LOG, "logs/Pool2.txt" or die $!;
while ( my $line = <LOG> ) {
chomp($line);
}
print $logs;
close LOG;
My goal is to have a hash in the end that looks like this:
$logs = {
SR4 => {
Pool2 => {
{
time => '11/5/2012 13:45',
msg => 'Beginning Wifi_NDIS_Main().',
},
{
time => '11/8/2012 8:45',
msg => 'This message is a multiline message.',
},
},
Pool4 => {
{
time => '11/5/2012 13:45',
msg => 'Beginning Wifi_NDIS_Main().',
},
},
},
SR5 => {
Pool2 => {
{
time => '11/5/2012 13:45',
msg => 'Beginning Wifi_NDIS_Main().',
},
},
},
};
What would be the best way of going about this? Should I change the formatting of the generated logs to make it easier on myself? If you need anymore info, just ask. Thank you in advanced. :)

The format makes no sense. You used a hash at the third level, but you didn't specify keys for the values. I'm assuming it should be an array.
my %logs;
{
local $/ = ""; # "Paragraph mode"
while (<>) {
my #lines = split /\n/;
my ($x, $y) = split ' ', $lines[0];
my $time = $lines[1];
my $msg = join ' ', #lines[3..$#lines];
push #{ $logs{$x}{$y} }, {
time => $time,
msg => $msg,
};
}
}
Should I change the formatting of the generated logs
Your time stamps appear to be ambiguous. In most time zones, an hour of the year is repeated.

If you can possibly output it as XML, reading it in would be embarrasingly easy with XML::Simple

Although Karthik T idea of using XML makes sense, and I would also consider it, I'm not sure if this is the best route. The first problem is putting it in XML format in the first place.
The second is that XML format might not be so easily parsed. Sure, the XML::Simple module will read the whole thing in one swoop, you then have to parse the XML data structure itself.
If you can set the output however you want, make it in a format that's easy to parse. I like using prefix data identifiers. In the following example, each piece of data has it's own identifier. The ER: tells me when I hit the end of record:
DT: 11/5/2012 13:35
SR: SR4
PL: Pool2
MG: Beginning Wifi_Main().
ER:
DT: 1/8/2012 8:45
SR: SR4
PL: Pool2
MG: This message is a
MG: multiline message.
ER:
Parsing this output is straight forward:
my %hash;
while ( $line = <DATA> ) {
chomp $line;
if ( not $line eq "ER:" ) {
my ($key, $value) = split ( ": ", $line );
$hash{$key} .= "$value "; #Note trailing space!
}
else {
clean_up_hash ( \%hash ); #Remove trailing space on all values
create_entry ( \%log, \%hash );
%hash = ();
}
}
I like using classes whenever I start getting complex data structures, and I would probably create a Local::Log class and subclasses to store each layer of the log. However, it's not an absolute necessity and wasn't part of your question. Still, I would use a create_entry subroutine just to keep the logic of figuring out where in your log that entry belongs inside your loop.
NOTE: I append a space after each piece of data. I did this to make the code simpler since some of your messages may take more than one line. There are other ways to handle this, but I was trying to keep the loop as clean as possible and with as few if statements as possible.

Related

Regular expression only finding first match

I'm working on something that is similar to other designs I've done, but for some reason, it's only finding the first key/value pair, whereas other ones found all of them. It looks good in regex101.com, which is where I typically test these.
I'm parsing c++ code to get what I need for a reference spreadsheet for error tracking across a system, and results go into a spreadsheet, or is used as a key to lookup info in another file. I do something similar for about 20 files, plus there's other data coming from a sql query, or access/mdb file. The data for this file looks like this:
m_ErrorMap.insert(make_pair(
MAKEWORD(scError,seFatal),
HOP_FATAL_ERROR ));
m_ErrorMap.insert(make_pair(
MAKEWORD(scError,seNotSelected),
HOP_NOT_SELECTED));
m_ErrorMap.insert(make_pair(
MAKEWORD(scError,seCoverOpen),
HOP_COVER_OPEN ));
m_ErrorMap.insert(make_pair(
MAKEWORD(scError,seLeverPosition),
HOP_LEVER_POSITION ));
m_ErrorMap.insert(make_pair(
MAKEWORD(scError,seJam),
HOP_JAM ));
I read this as a string from the file (looks good), and feed it into this Function as $fileContent:
Function Get-Contents60{
[cmdletbinding()]
Param ([string]$fileContent)
Process
{
#m_ErrorMap.insert(make_pair(
#MAKEWORD(scError,seJam),
#HOP_JAM ));
# construct regex
switch -Regex ($fileContent -split '\r?\n') { #this is splitting on each line test regex with https://regex101.com/
'MAKEWORD["("][\w]+,(\w+)[")"],' { #seJam
# add relevant key to key collection
$keys = $Matches[1] } #only match once
',(HOP.*?)[\s]' { # HOP_JAM
# we've reached the relevant error, set it for all relevant keys
foreach($key in $keys){
Write-Host "60 key: $key"
Write-Host "Matches[0]: $($Matches[0]) Matches[1]: $($Matches[1])"
$errorMap[$key] = $($Matches[1])
Write-Host "60 key: $key ... value: $($errorMap[$key])"
}
}
'break' {
# reset/clear key collection
$keys = #()
}
}#switch
#Write-Host "result:" $result -ForegroundColor Green
#$result;
return $errorMap
}#End of Process
}#End of Function
I stepped through it in VSCode, and its finding the first key/value pair, and after that it's not finding anything. I looked at it in regex101.com, and it's finding line endings/breaks, and the MAKEWORD regex and HOP regex are finding what they should on each line it should.
I'm not sure if the issue is that they aren't all in the same line, and maybe I need to change it so it doesn't break on newline and breaks on something else for each key/value pair? I'm a little fuzzy on this.
I'm using powershell 5.1, and VSCode.
Update:
I modified Theo's answer and it worked great. I had simplified the class name from m_HopErrorMap to m_ErrorMap for this question, and the regular expression was grabbing that for each one. I modified that slightly, and Theo's works.
function Get-Contents60{
[cmdletbinding()]
Param ([string]$fileContent)
# create an ordered hashtable to store the results
$errorMap = [ordered]#{}
# process the lines one-by-one
switch -Regex ($fileContent -split '\r?\n') {
'MAKEWORD\([^,]+,([^)]+)\),' { # seJam, seFatal etc.
$key = $matches[1]
}
'(HOP_[^)]+)' {
$errorMap[$key] = $matches[1].Trim()
}
}
# output the completed data as object
[PsCustomObject]$errorMap
return $errorMap
}

I would simplify your function to
function Get-Contents60{
[cmdletbinding()]
Param ([string]$fileContent)
# create an ordered hashtable to store the results
$errorMap = [ordered]#{}
# process the lines one-by-one
switch -Regex ($fileContent -split '\r?\n') {
'MAKEWORD\([^,]+,([^)]+)\),' { # seJam, seFatal etc.
$key = $matches[1]
}
'(HOP[^)]+)' {
$errorMap[$key] = $matches[1].Trim()
}
}
# output the completed data as object
[PsCustomObject]$errorMap
}
Then, using your example text, for which I'm using a Here-string, but in real life you would load the file content with $c = Get-Content -Path 'X:\TheErrors.txt' -Raw you do
$result = Get-Contents60 -fileContent $c
To display on screen
$result | Format-Table -AutoSize
giving you
seFatal seNotSelected seCoverOpen seLeverPosition seJam
------- ------------- ----------- --------------- -----
HOP_FATAL_ERROR HOP_NOT_SELECTED HOP_COVER_OPEN HOP_LEVER_POSITION HOP_JAM

Perl deferred interpolation of string

I have a situation where there is a triage script that takes in a message, compares it against a list of regex's and the first one that matches sets the bucket. Some example code would look like this.
my $message = 'some message: I am bob';
my #buckets = (
{
regex => '^some message:(.*)',
bucket => '"remote report: $1"',
},
# more pairs
);
foreach my $e (#buckets) {
if ($message =~ /$e->{regex}/i) {
print eval "$e->{bucket}";
}
}
This code will give remote report: I am bob. I keep looking at this and feel like there has to be a better way to do this then it is done now. especially with the double quoting ('""') in the bucket. Is there a better way for this to be handled?

Perl resolves the interpolation when that expression is evaluated. For that, it is sufficient to use a subroutine, no eval needed:
...
bucket => sub { "remote report: $1" },
...
print $e->{bucket}->();
Note that you effectively eval your regexes as well. You can use pre-compiled regex objects in your hash, with the qr// operator:
...
regex => qr/^some message:(.*)/i,
...
if ($message =~ /$e->{regex}/) {

You could use sprintf-style format strings:
use strict;
use warnings;
my $message = 'some message: I am bob';
my #buckets = (
{
regex => qr/^some message:(.*)/,
bucket => 'remote report: %s',
},
# more pairs
);
foreach my $e (#buckets) {
if (my #matches = ($message =~ /$e->{regex}/ig)) {
printf($e->{bucket}, #matches);
}
}

Twitch TMI, extract all users

I'm trying to extract all the user names from the source (https://tmi.twitch.tv/group/user/twitchpresents/chatters), but so far I'm only able to get like the first "name".
Goal is to get all the names into an array, and then just remove the "viewers", "admins", "staff", etc. names.
use strict;
use warnings;
my #listusers = userlist();
sub userlist {
my $url = "https://tmi.twitch.tv/group/user/twitchpresents/chatters";
my $array = get($url);
my #array2;
my $time = 0;
while ($time != 2){
my $mylist = (join "",grep(/"\s*(.*?)\s*"/, $array[$time])) =~ /"\s*(.*?)\s*"/;
print $1;
$time++;
}
return #array2;
}
print #listusers;

I assume that you get the page with get from LWP::Simple. Please always show relevant includes.
Since this is valid JSON, use a module for that
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
use LWP::Simple;
use JSON;
my $data_json = get($url);
my $data = JSON->new->decode($data_json);
#dd $data;
my #names = #{ $data->{chatters}{moderators} }; # get "moderators"
say "#names";
# my #all_names = map { #$_ } values %{$data->{chatters}}; # or get all names
This prints the line: cliccer cuda hnlbot nixi93 scorpy0 somppe
I print "names" of moderators as an example of getting one category out; the commented out line gets all names in one array. Once you have a hashref there are various ways to extract what you need.
Your code seems to be trying to parse that JSON string. It is very easy with a module.
One can view complex data structures with Data::Dumper or such. I use Data::Dump.
Here JSON is used, which delegates to JSON::XS if installed or to the "pure Perl" (and slower) JSON::PP otherwise. Another option is Cpanel::JSON::XS.
For convenience, the structure printed by dd $data is
{
_links => {},
chatter_count => 15,
chatters => {
admins => [],
global_mods => [],
moderators => [
"cliccer",
"cuda",
"hnlbot",
"joffy95",
"nixi93",
"scorpy0",
"somppe",
],
staff => [],
viewers => [
"coldblood94",
"coldbot",
"gabenator",
"gharokk",
"reconcrusadershadow",
"scrubnubslulz",
"shai_the_panda",
"sonadourge",
],
},
}

I recommend that you use
Mojo::UserAgent
for this, as it contains a JSON decoder as well as an HTTP user agent and makes the code much more concise
I assume you want the names of all the different categories of chatters so that's what the map statement does: simply flattening all of the categories—admins, global_mods, moderators, staff, and viewers—into a single list
If anything goes wrong with the HTTP transfer then the subroutine will simply return an empty list, so you may want to add some proper error handling
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;
my #list_users = chatters();
say for #list_users;
sub chatters {
my $url = 'https://tmi.twitch.tv/group/user/twitchpresents/chatters';
return unless my $data = Mojo::UserAgent->new->get($url)->res->json;
map { #$_ } values %{ $data->{chatters} };
}
output
girlonduty
moobot
ravager
100tesports
123mickeypr
13eebo
13urnxcalibur
2dmoody
2l33t
2tony79
3nticed
...

Perl taint mode with domain name input for CGI resulting in “Insecure dependency in eval”

Given the following in a CGI script with Perl and taint mode I have not been able to get past the following.
tail /etc/httpd/logs/error_log
/usr/local/share/perl5/Net/DNS/Dig.pm line 906 (#1)
(F) You tried to do something that the tainting mechanism didn't like.
The tainting mechanism is turned on when you're running setuid or
setgid, or when you specify -T to turn it on explicitly. The
tainting mechanism labels all data that's derived directly or indirectly
from the user, who is considered to be unworthy of your trust. If any
such data is used in a "dangerous" operation, you get this error. See
perlsec for more information.
[Mon Jan 6 16:24:21 2014] dig.cgi: Insecure dependency in eval while running with -T switch at /usr/local/share/perl5/Net/DNS/Dig.pm line 906.
Code:
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS::Dig;
use CGI;
$ENV{"PATH"} = ""; # Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
$domain = "$1\.$2";
}
else {
warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
$domain = ""; # successful match did not occur
}
my $dig = new Net::DNS::Dig(
Timeout => 15, # default
Class => 'IN', # default
PeerAddr => $domain,
PeerPort => 53, # default
Proto => 'UDP', # default
Recursion => 1, # default
);
my #result = $dig->for( $domain, 'NS' )->to_text->rdata();
#result = sort #result;
print #result;
I normally use Data::Validate::Domain to do checking for a “valid” domain name, but could not deploy it in a way in which the tainted variable error would not occur.
I read that in order to untaint a variable you have to pass it through a regex with capture groups and then join the capture groups to sanitize it. So I deployed $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/. As shown here it is not the best regex for the purpose of untainting a domain name and covering all possible domains but it meets my needs. Unfortunately my script is still producing tainted failures and I can not figure out how.
Regexp-Common does not provide a domain regex and modules don’t seem to work with untainting variable so I am at a loss now.
How to get this thing to pass taint checking?

$domain is not tainted
I verified that your $domain is not tainted. This is the only variable you use that could be tainted, in my opinion.
perl -T <(cat <<'EOF'
use Scalar::Util qw(tainted);
sub p_t($) {
if (tainted $_[0]) {
print "Tainted\n";
} else {
print "Not tainted\n";
}
}
my $domain = shift;
p_t($domain);
if ($domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/) {
$domain = "$1\.$2";
} else {
warn("$domain\n");
$domain = "";
}
p_t($domain);
EOF
) abc.def
It prints
Tainted
Not tainted
What Net::DNS::Dig does
See Net::DNS::Dig line 906. It is the beginning of to_text method.
sub to_text {
my $self = shift;
my $d = Data::Dumper->new([$self],['tobj']);
$d->Purity(1)->Deepcopy(1)->Indent(1);
my $tobj;
eval $d->Dump; # line 906
…
From new definition I know that $self is just hashref containing values from new parameters and several other filled in the constructor. The evaled code produced by $d->Dump is setting $tobj to a deep copy of $self (Deepcopy(1)), with correctly set self-references (Purity(1)) and basic pretty-printing (Indent(1)).
Where is the problem, how to debug
From what I found out about &Net::DNS::Dig::to_text, it is clear that the problem is at least one tainted item inside $self. So you have a straightforward way to debug your problem further: after constructing the $dig object in your script, check which of its items is tainted. You can dump the whole structure to stdout using print Data::Dumper::Dump($dig);, which is roughly the same as the evaled code, and check suspicious items using &Scalar::Util::tainted.
I have no idea how far this is from making Net::DNS::Dig work in taint mode. I do not use it, I was just curious and wanted to find out, where the problem is. As you managed to solve your problem otherwise, I leave it at this stage, allowing others to continue debugging the issue.

As resolution to this question if anyone comes across it in the future it was indeed the module I was using which caused the taint checks to fail. Teaching me an important lesson on trusting modules in a CGI environment. I switched to Net::DNS as I figured it would not encounter this issue and sure enough it does not. My code is provided below for reference in case anyone wants to accomplish the same thing I set out to do which is: locate the nameservers defined for a domain within its own zone file.
#!/usr/bin/perl -wT
use warnings;
use strict;
use IO::Socket::INET;
use Net::DNS;
use CGI;
$ENV{"PATH"} = ""; // Latest attempted fix
my $q = CGI->new;
my $domain = $q->param('domain');
my #result;
if ( $domain =~ /(^\w+)\.(\w+\.?\w+\.?\w+)$/ ) {
$domain = "$1\.$2";
}
else {
warn("TAINTED DATA SENT BY $ENV{'REMOTE_ADDR'}: $domain: $!");
$domain = ""; # successful match did not occur
}
my $ip = inet_ntoa(inet_aton($domain));
my $res = Net::DNS::Resolver->new(
nameservers => [($ip)],
);
my $query = $res->query($domain, "NS");
if ($query) {
foreach my $rr (grep { $_->type eq 'NS' } $query->answer) {
push(#result, $rr->nsdname);
}
}
else {
warn "query failed: ", $res->errorstring, "\n";
}
#result = sort #result;
print #result;
Thanks for the comments assisting me in this matter, and SO for teaching more then any other resource I have come across.

Perl web API using Data::Dumper

We've developed an open web API using Apache and mod_perl, where you can pass text created by Data::Dumper to make requests.
Our data generally looks like this:
$VAR1 = {
'OurField' => 'OurValue'
};
Currently, I noticed we're using an eval to get the data back into a Perl hash server side:
my $VAR1;
eval $our_dumper_string;
#$VAR1 is now filled with hash value
The problem with this, is it is a major security issue. You can pass malicious perl code in there and it will run server side...
It there a better way to safely take a Data::Dumper string and turn it into a hash?

Yes. Use JSON::XS and use JSON rather than Data::Dumper format. That is much more compatible with other web APIs

If your data is simple and predictable you can even try to write a simple "parser" to read back the values in a data stricture
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $data = { 'key1' => 'value' };
my $dumper = Dumper($data);
print $dumper;
my $data_2;
while( $dumper =~ /(.+)$/mg) {
if ( $1 =~ m/'(.*)' => '(.*)'/ ) {
$data_2->{$1} = $2;
}
}
print Dumper( $data_2 );
(this is just an example and wont work with integers or nested data structures)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Perl - Parse blocks from text file - regex

If you can possibly output it as XML, reading it in would be embarrasingly easy with XML::Simple

Related

Regular expression only finding first match

Perl deferred interpolation of string

Twitch TMI, extract all users

Perl taint mode with domain name input for CGI resulting in “Insecure dependency in eval”

Perl web API using Data::Dumper

Categories

Resources