Parsing (partially) non-uniform text blocks in Perl - regex

I have a file with a few blocks that look like this (the content is also in a variable at this point in the program).
Vlan2 is up, line protocol is up
....
reliability 255/255, txload 1/255, rxload 1/255^M
....
Last clearing of "show interface" counters 49w5d
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
....
L3 out Switched: ucast: 17925 pkt, 23810209 bytes mcast: 0 pkt, 0 bytes
33374 packets input, 13154058 bytes, 0 no buffer
Received 926 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
3094286 packets output, 311981311 bytes, 0 underruns
0 output errors, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
Here's a second block, to show you how the blocks can slightly vary:
port-channel86 is down (No operational members)
...
reliability 255/255, txload 1/255, rxload 1/255
...
Last clearing of "show interface" counters 31w2d
...
RX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 input packets 119954232 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 0 CRC 0 no buffer
0 input error 0 short frame 0 overrun 0 underrun 0 ignored
0 watchdog 0 bad etype drop 0 bad proto drop 0 if down drop
0 input with dribble 0 input discard
0 Rx pause
TX
147636 unicast packets 0 multicast packets 0 broadcast packets
84356 output packets 119954232 bytes
0 jumbo packets
0 output error 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 output discard
0 Tx pause
0 interface resets
I want to pick out certain data elements from each block, which may or may not exist in each block. For example, in the first block I posted I may want to know that there are 0 runts, 0 input errors and 0 overrun. In the second block, I might want to know that there are 0 jumbo packets, collisions, etc. If a given query isn't in the block, it's acceptable to just return na, as this is designed to be processed uniformly.
Each block is structured in a similar way to the two I posted; newlines and spaces delimiting some entries, commas delimiting others.
I have a few ideas as to how this might work. I'm unaware if there is any kind of "look back" function in Perl, but I could attempt to look for the field names (runts, "input errors", etc) and then grab the previous integer; that seems like it would be the most elegant solution for this, but I'm unsure if it's possible.
Currently, I'm doing this in Perl. Each "block" that I'm processing is actually several of these blocks (separated by double newlines). It doesn't have to be done in a single regular expression; I believe it can be done by applying several regular expressions per block. Performance is not really a factor, as this script will run maybe once per hour.
My goal is to get all of this into a .csv file (or some other data format that's easily graphable) in an automated fashion.
Any ideas?
Edit: example output in CSV as I mentioned, which would be written line by line (for multiple entries like this) to a file as the end result. If a particular entry isn't found in the block, it is marked na in the corresponding line:
interface_name,txload,rxload,last_clearing,input_queue,output_drops,runts,....
vlan2,1,1,49w5d,0-75-0-0,0,0,....
port-channel86,1,1,31w2d,na,na,0,...

Simple hash of properties and numbers.
sub extract {
    my ($block) = @_;
    my %r;
    # Find every "<number> <name>" pair in the block.
    while ($block =~ /(?<num>\d+) \s (?<name>[A-Za-z\s]+)/gmsx) {
        my $name = $+{name};
        my $num  = $+{num};
        $name =~ s/\A \s+//msx;   # trim leading whitespace
        $name =~ s/\s+ \z//msx;   # trim trailing whitespace
        $r{$name} = $num;
    }
    return %r;
}
my $block = <<'';
Vlan2 is up, line protocol is up
⋮
my $block2 = <<'';
port-channel86 is down (No operational members)
⋮
use Data::Dumper qw(Dumper);
print Dumper {extract $block};
print Dumper {extract $block2};

I don't think a single regex could do it, nor would I want to support it if it could.
Using multiple regexes, you could easily use something like:
(\d+) runts
(\d+) input errors
...etc...
A simple array of property names and a loop could solve this pretty quickly and with very little fuss.
If you can strip down the input to smaller chunks with some preprocessing, you would be less likely to get false positives.
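For instance, a minimal sketch of that array-plus-loop idea in Perl, assuming each block has already been isolated (the field list below is just a sample taken from the question; anything not found comes out as na, matching the CSV example above):
use strict;
use warnings;

# Field names to extract; these few are taken from the question, extend as needed.
my @fields = ('runts', 'giants', 'input errors', 'output errors', 'collision', 'jumbo packets');

# Turn one interface block into one CSV line; missing fields become "na".
sub block_to_csv {
    my ($block) = @_;
    # Assumption: the interface name is the first word of the block.
    my ($iface) = $block =~ /\A\s*(\S+)/;
    my @row = ($iface // 'na');
    for my $field (@fields) {
        if ($block =~ /(\d+)\s+\Q$field\E/) {
            push @row, $1;     # the integer just before the field name
        } else {
            push @row, 'na';   # field not present in this block
        }
    }
    return join ',', @row;
}
Splitting the captured output on double newlines first, as the question describes, yields one block per interface, and each call to block_to_csv then produces one CSV line.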

Here is one way to do it in awk, but this needs a lot of tweaking to be perfect.
But again, use SNMP.
awk '{
  printf $1
  for (i=1; i<=NF; i++) {
    if ($i" "$(i+1) ~ /Input queue:/) printf ",%s", $(i+2)
    if ($i ~ /runts/)      printf ",%s", $(i-1)
    if ($i ~ /multicast,/) printf ",%s", $(i-1)
  }
  print ""
}' RS="swapped out" file

Related

Python: Match a special character with regular expression

Hi everyone, I'm using the re.match function to extract pieces of a string from each row of a file.
My code is as follows:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        cpuOverall=re.match(r"(Overall CPU load average)\s+(\S+)(%)",x)
        cpuUsed=re.match(r"(Total)\s+(\d+)(%)",x)
        ramUsed=re.match(r"(RAM Utilization)\s+(\d+\%)",x)
        ####Not Work####
        if cpuUsed is not None: cpuused_new=cpuUsed.group(2)
        if ramUsed is not None: ramused_new=ramUsed.group(2)
        if cpuOverall is not None: cpuoverall_new=cpuOverall.group(2)
    except:
        searchbox_result = None
Each field is extracted from the following corresponding line:
ramUsed => RAM Utilization 2%
cpuUsed => Total 4%
cpuOverall => Overall CPU load average 12%
ramUsed, cpuUsed, cpuOverall are the variables where I want to write the results.
The lines actually look like this:
(space undefined) RAM Utilization 2%
(space undefined) Total 4%
(space undefined) Overall CPU load average 12%
When I execute the script, all the variables return None.
With other variables the script works correctly.
Why doesn't the code work in this case? I am using Python 3.
I think the problem is the % character, which is not being read.
Do you have any suggestions?
PROBLEM 2:
## fp_tmp => pointer of file
for x in fp_tmp:
    try:
        emailReceived=re.match(r".*(Messages Received)\s+\S+\s+\S+\s+(\S+)",x)
        ####Not Work####
        if emailReceived is not None: emailreceived_new=emailReceived.group(2)
    except:
        searchbox_result = None
The field is extracted from the following, which occurs twice in the file:
[....]
Counters: Reset Uptime Lifetime
Receiving
Messages Received 3,406 1,558 3,406
[....]
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0
Recipients Received 0 0 0
[....]
I want to extract only the second occurrence, that is:
Rates (Events Per Hour): 1-Minute 5-Minutes 15-Minutes
Receiving
Messages Received 0 0 0 <-this
Do you have any suggestions?
cpuOverall line: you forgot that there is more information at the start of the line. Change to
'.*(Overall CPU load average)\s+(\S+%)'
cpuUsed line: you forgot that there is more information at the start of the line. Change to
'.*(Total)\s+(\d+%)'
ramUsed line: you forgot that there is more information at the start of the line... Change to
'.*(RAM Utilization)\s+(\d+%)'
Remember that re.match only looks for a match at the start of the string:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. [..]
With these changes, your three variables are set to the percentages:
>>> print (cpuused_new,ramused_new,cpuoverall_new)
4% 2% 12%

How to match this regular expression using TCL

Kindly give me some input on this. I have the below input for a TCL regular expression.
set a { Descriptor Blocks:
10.132.224.74 (Tunnel42), from 10.132.224.74, Send flag is 0x0
Composite metric is (2032896/128256), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55000 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 1
Originating router is 10.128.9.65
10.135.0.86 (GigabitEthernet0/1), from 10.135.0.86, Send flag is 0x0
Composite metric is (2033152/2032896), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55010 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 2
Originating router is 10.128.9.65
Internal tag is 200 }
From the above I want to separate it into two list elements; the regular expression should split on the following words.
There are two interface outputs here: one is for
10.132.224.74 (Tunnel42)
and the other is for
10.135.0.86 (GigabitEthernet0/1)
If there is no line starting with "Internal tag is " after the "Originating router is " line, it should take everything up to the "Originating router is " line as one list element.
If a line starting with "Internal tag is " is present after the "Originating router is " line, it should take everything up to the "Internal tag is " line as one list element.
I am expecting output like:
{Tunnel42), from 10.132.224.74, Send flag is 0x0
Composite metric is (2032896/128256), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55000 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 1
Originating router is 10.128.9.65
10.135.0.86 (GigabitEthernet0/1), from 10.135.0.86, Send flag is 0x0
Composite metric is (2033152/2032896), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55010 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 2
Originating router is 10.128.9.65
Internal tag is 200
A more generalized approach is to split the input into lines and parse them as needed:
set a { Descriptor Blocks:
10.132.224.74 (Tunnel42), from 10.132.224.74, Send flag is 0x0
Composite metric is (2032896/128256), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55000 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 1
Originating router is 10.128.9.65
10.135.0.86 (GigabitEthernet0/1), from 10.135.0.86, Send flag is 0x0
Composite metric is (2033152/2032896), route is Internal
Vector metric:
Minimum bandwidth is 4096 Kbit
Total delay is 55010 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1380
Hop count is 2
Originating router is 10.128.9.65
Internal tag is 200 }
set tunnelStart 0
set interfaceStart 0
set tunnelInfo {}
set interfaceInfo {}
set result {}
foreach line [split $a \n] {
    if {[regexp {\(Tunnel\d+\)} $line]} {
        # If we have already identified and extracted 'tunnelInfo', the variable won't be empty
        if {$tunnelInfo ne {}} {
            regsub {\n$} $tunnelInfo {} tunnelInfo
            # So, appending it to 'result'
            lappend result $tunnelInfo
            # Then, resetting 'tunnelInfo'
            set tunnelInfo {}
        }
        set tunnelStart 1
        set interfaceStart 0
    } elseif {[regexp {\(GigabitEthernet\d+/\d+\)} $line]} {
        # Same reason as explained above
        if {$interfaceInfo ne {}} {
            regsub {\n$} $interfaceInfo {} interfaceInfo
            lappend result $interfaceInfo
            set interfaceInfo {}
        }
        set interfaceStart 1
        set tunnelStart 0
    }
    if {$tunnelStart} {
        # Appending each line along with '\n'
        append tunnelInfo $line\n
    } elseif {$interfaceStart} {
        append interfaceInfo $line\n
    }
}
# Removing the last '\n' alone
regsub {\n$} $tunnelInfo {} tunnelInfo
regsub {\n$} $interfaceInfo {} interfaceInfo
# Finally, if the variable is not empty, append it to 'result'
if {$tunnelInfo ne {}} {
    lappend result $tunnelInfo
}
if {$interfaceInfo ne {}} {
    lappend result $interfaceInfo
}
puts $result
You can put this in a procedure and call it wherever you need to separate such input. If your input has more than one tunnel or interface block, you could rewrite the code to parse it accordingly.
You can use the textutil module to do this easily:
package require textutil
textutil::split::splitx $a {\n(?=\s*\d)}
This splits the original text into a list of three items: the " Descriptor Blocks:" substring and one item for each of the two blocks. It works by finding junctures where a line break and optional whitespace are followed by a digit. The line break is removed, but the leading whitespace and the digit are preserved.
Core-Tcl solution:
The substitution
regsub -all -line {^(?=\s*\d)} $a \n
will split the text into three parts (the first part being the " Descriptor Blocks:" substring) by inserting an extra line break before each block. This solution obviously depends on only the first line in each block starting with a digit optionally preceded by whitespace. The -line option makes ^ anchor after a line break.
Note that this results in a text with three parts, not a list of three elements: if you want that you will need to break the text up at every double line break. Another way to deal with this is to have regsub instead insert a character that won't occur in the text, and then split on that character, e.g.
split [regsub -all -line {^(?=\s*\d)} $a #] #
Documentation: package, regsub, split, textutil package

Sort multi-line blocks in a large (~10GB) file by a single token in the block

I have a large file (~10GB) full of memory traces in this format:
INPUT:
Address: 7f2da282c000
Data:
0x7f2da282c000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
Address: 603000
Data:
0x603000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
.
.
.
Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
These are addresses and 64 bytes of data at those addresses at different points in time as the accesses occurred. Each hex value in the data field represents 8 bytes. Suppose each address and its data make up one multi-line block.
Certain addresses are accessed/updated multiple times and I'd like to sort the multi-line blocks so that each address that has multiple updates, has those accesses right below it like this:
OUTPUT:
Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
.
.
.
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
Address: 0xadsf212
Data:
[Updates]
[Updates]
.
.
.
[Updates]
Where each address that is accessed more than once, has its respective updates below it, and addresses that are accessed only once are thrown out.
What I tried:
- Comparing each address to every other address in a simple C++ program, but it's way too slow (it has been running for a couple of days now).
- Used *nix sort to get all the addresses and their counts (sort -k 2,2 bigTextFile.txt | uniq -cd > output file), but it only sorts by the first line of each multi-line block (the deadbeeff part in 'Address: deadbeeff'), and the data blocks are left behind. Is there any way for sort to treat a set of lines as one record and sort by a single value in the top line of the block, i.e. the address value, moving the entire block around? I found some awk scripts, but they didn't look applicable.
- Looked into making a database out of the file, with the address, the access index, and the data as three columns, and then running a query for all the data updates that have the same address, but I've never used databases and I'm not sure this is the best approach.
Any recommendations on what I tried, or new approaches, are appreciated.
This is pretty basic file processing. It sounds like you just need to hash the blocks on address and then print the map values that have more than one block. In a language like Perl this is simple:
use strict;

sub read_block {
    my @data;
    while (<>) {
        s/^Address: //;                 # Remove "Address: ".
        return \@data unless /\S/;      # A blank line ends the block.
        push @data, $_ unless /^Data/;  # Ignore "Data:".
    }
    \@data
}

sub main {
    my %map;
    while (1) {
        my $block = read_block;
        last unless scalar(@$block) > 0;
        my $addr = shift @$block;       # First line is the address.
        push @{$map{$addr}}, $block;    # Add the block to the hash.
    }
    # Just for fun, sort keys by address.
    my @sorted_addr = sort { hex $a <=> hex $b } keys %map;
    # Print blocks that have more than one access.
    foreach my $addr (@sorted_addr) {
        next unless scalar(@{$map{$addr}}) > 1;  # Ignore blocks of 1.
        print "Address: $addr";
        foreach my $block (@{$map{$addr}}) {
            print @$block;
            print "\n";                 # Leave a blank line between blocks.
        }
    }
}

main;
Of course you'll need a machine with enough RAM to hold the data; 32GB ought to do nicely. If you don't have that, a trickier two-pass algorithm will make do with much less.
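That two-pass variant isn't shown above, but here is a rough sketch of the idea (the file name is a placeholder): the first pass only counts addresses, so memory stays proportional to the number of distinct addresses rather than the whole file, and the second pass re-reads the file and keeps only blocks whose address was seen more than once.
use strict;
use warnings;

my $file = 'bigTextFile.txt';    # placeholder input file name

# Pass 1: count how many times each address occurs.
my %count;
open my $in, '<', $file or die "open $file: $!";
while (<$in>) {
    $count{$1}++ if /^Address:\s+(\S+)/;
}
close $in;

# Pass 2: print only the blocks whose address appeared more than once.
open $in, '<', $file or die "open $file: $!";
my $keep = 0;
while (<$in>) {
    # Each "Address:" line decides whether the block that follows is kept.
    $keep = ($count{$1} > 1) if /^Address:\s+(\S+)/;
    print if $keep;
}
close $in;
The surviving blocks would still need to be grouped by address (for example by sorting this much smaller output), which the one-pass hash version above gets for free.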

Parse ASCII Output of a Device-File in C++

I wrote a kernel-space driver for a USB device. If it is connected, it shows up as /dev/myusbdev0, for example.
From the command line I can send commands to the device with echo -en "command" > /dev/myusbdev0 and read the results with cat /dev/myusbdev0.
OK, now I have to write a C++ program. First I would open the device file for read/write with:
int fd = open("/dev/echo", O_RDWR);
After that a command is sent to get the device working:
char cmd[] = { "\x02sEN LMDscandata 1\x03" };
write(fd, cmd, sizeof(cmd));
Now I get to the part I don't know how to handle yet. I need to read from the device, as it keeps sending data continuously. I need to read and parse this data ...
char buf[512];
read(fd, buf, sizeof(buf));
The data looks like the following; each message starts with \x02 and ends with \x03, and they are not always the same size:
sRA LMDscandata 1 1 89A27F 0 0 343 347 27477BA9 2747813B 0 0 7 0 0
1388 168 0 1 DIST1 3F800000 00000000 186A0 1388 15 8A1 8A5 8AB 8AC 8A6
8AC 8B6 8C8 8C2 8C9 8CB 8C4 8E4 8E1 8EB 8E0 8F5 908 8FC 907 906 0 0 0
0 0 0 0
All values are separated with 0x20 {SPC}.
I think I need some kind of while loop to continuously read the data from an \x02 until I read a \x03.
Once I have a complete scan, I need to parse this ASCII message into its separate parts (some variables uint_16, uint_8, enum_16, ...).
Any idea how I can read a complete scan into a buf[] and then parse its components out?
As you say the device is sending continuously, I would recommend adding a queue to hold the chunks coming in, and some dispatching that takes parts out of the queue, i.e. \x02 to \x03, decoupling the work that is done from receiving chunks.
Furthermore, you can then have single objects handling one complete block from \x02 to \x03, perhaps threaded (which makes sense given the information provided).
device => chunk reader => input queue => input reader => data handling
hope this helps

Perl RegEx for Matching 11 column File

I'm trying to write a Perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the /x flag, which ignores literal whitespace in the pattern and, more importantly, allows you to lay it out with whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the separator character (sep_char).
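For instance, a minimal sketch of that approach, assuming the cleaned-up rows are tab-separated (the file name here is a placeholder):
use strict;
use warnings;
use Text::CSV_XS;

# Assumes the cleaned-up file uses a single tab between columns.
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });

open my $fh, '<', 'form13f.txt' or die "open: $!";   # placeholder file name
while (my $row = $csv->getline($fh)) {
    next unless @$row == 11;   # preamble/footer rows won't have 11 columns
    print $row->[4], "\n";     # the 5th column
}
close $fh;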
Like Ether said, another tool would be appropriate for this job.
my @fields = split /\t/, $line;
if (@fields == 11) {   # fewer than 11 fields is probably header/footer
    my $the_5th_column = $fields[4];
    ...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it with substr() or unpack() more easily than with a regex. You can use a regex to parse out the data, but most of us who've been programming Perl for a while have learned that a regex is not always the first tool to reach for. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot with.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing them correctly, because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
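The quick routine itself isn't shown; here is a rough sketch of what an unpack-based pass might look like. The column widths and the file name are guesses for illustration, not measured from real 13F filings, so they would need adjusting per file:
use strict;
use warnings;

# Guessed fixed-width template: issuer, class, CUSIP, value, shares, SH/PRN,
# put/call, discretion, managers, sole, shared, none. Adjust widths per file.
my $template = 'A31 A17 A10 A9 A9 A4 A5 A8 A13 A9 A9 A*';

open my $fh, '<', '13f_table.txt' or die "open: $!";   # placeholder file name
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /^[A-Z0-9]/;           # keep only data rows
    next if $line =~ /^(FORM|NAME OF|VALUE)/;   # skip the column headers
    my @cols = unpack $template, $line;
    s/^\s+//, s/\s+$// for @cols;               # trim padding in each column
    print join(', ', @cols), "\n";
}
close $fh;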
Based on some of the other files, here are some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns (see the sketch after this list).
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others wrap columns onto the next line (like a spreadsheet would in a column that is not wide enough). It's not too hard to figure out how to deal with that, but how to do it is left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
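For the tab-aligned files mentioned above, a minimal sketch of the compress-then-split idea (reading from standard input and printing the 5th column, as in the question):
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    next if $line =~ /^\s*$/;        # skip blank separator lines
    $line =~ s/\t+/\t/g;             # collapse runs of tabs used for alignment
    my @fields = split /\t/, $line;
    next unless @fields == 11;       # anything else is preamble/footer/header
    s/^\s+//, s/\s+$// for @fields;  # trim centred columns
    print $fields[4], "\n";          # the 5th column
}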
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.