Stuck on perl regex expression for string with ending white space - regex

Following is a line from an ftp log:
2013-03-05 18:37:31 543.21.12.22 []sent
/home/mydomain/public_html/court-9746hd/Chairman-confidential-video.mpeg
226 court-9746hd#mydomain.com 256
I am using a program called Simple Event Correlate which pulls values from inside the parenthesis of a regex expression and sets those values to a variable.
So, here is an entry in a SEC config file which is supposed to operate on the previous log file line:
pattern=sent \/home\/mydomain\/public_html\/(.*)\/(.*)
This succeeds in pulling out the logged in user, court-9746hd, and setting it to a variable, but fails to properly extract the file name downloaded, or, Chairman-confidential-video.mpeg
Instead, it pulls out the file downloaded as: Chairman-confidential-video.mpeg 226 court-9746hd#mydomain.com 256
So you see, I'm having difficulty getting the second extraction to stop at the first white space after the file name. I've tried:
pattern=sent \/home\/mydomain\/public_html\/(.*)\/(.*)\s
but I only get the same result. Any help would be greatly appreciated.

If you only want to match non-whitespace, replace .* with \S* or if space is the only character you want to exclude then use [^ ]* instead.
Also, man perlre is a good reference.

Rather than using the .* construct, use something narrower in scope, as a general rule. In this case what you want is something which is not a white space, so say that explicitly:
pattern=sent \/home\/mydomain\/public_html\/([^\s]+)\/([^\s]+)

One option is to first capture the full path from the line, and then use File::Spec to get the user and file info:
use strict;
use warnings;
use File::Spec;
my $line = '2013-03-05 18:37:31 543.21.12.22 []sent /home/mydomain/public_html/court-9746hd/Chairman-confidential-video.mpeg 226 court-9746hd#mydomain.com 256';
my ( $path ) = $line =~ m!\s+(/home\S+)\s+!;
my ( $user, $file ) = ( File::Spec->splitdir($path) )[ -2, -1 ];
print "User: $user\nFile: $file";
Output:
User: court-9746hd
File: Chairman-confidential-video.mpeg
However, if you want to only use a regex, the following will work:
m!/home/.+/.+/([^/]+)/(\S+)!

Related

Is there a Perl regex metacharacter or a way to have specify a default value, if a subpattern capture does not match?

Here's the idea. I am parsing command line options but doing it across the entire command line, not by each #ARGV element separately.
program --format="%H:%M:%S" --timeout 12 --nofail
I want the parsing to work with these cases.
--name=value, easy to parse
--name value, pretty easy
--name no value, default the value to 1
Here is the regex which works, except it cannot do the missing value case
%options = "#ARGV" ~= /--([A-Za-z]+)[= ]([^-]\S*)/g;
i.e. match --name=value or --name value but not --name --name, --name --name is two names, not a --name=value pair.
If a --name has no value following it that matches the second capture in the regex, is there a way, within the regex, to specify a default, in my case a 1, to indicate "true". i.e. if an --name has no argument, like --nofail then set that argument to 1 indicating true.
Actually, in asking this I figured out a workaround using separate match statements which is fine. However, just out of curiosity, the question still stands, is there a Perl regex way to have a default if a submatch fails?
I don't see how to return a list reflecting a changed input from a regex alone. To change the input we need s{}{}er operator, as we need code in its replacement part to analyze captures and decide what to change; and, we get a string, not a list, which need be further processed (split).
Here is then one such take, with a minimal intrusion of code.
Match name and value, with = or space between them, and if value ($2) is undefined give it a value; so we need /e to implement that.† Once we are at it, put a space between all name-value pairs. This goes under /r so that the changed string is returned, and passed through split
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1.' '.($2//'7 ') }ergx;
The split can be done by another regex instead but that's still extra processing.
A complete program (with more flags added to the input)
use warnings;
use strict;
use feature 'say';
my $args = shift // q(--fmt="%H:%M" --f1 --time 12 --f2 --f3);
say $args;
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1 . ' ' . ($2//'1 ') }ergx;
say "$_ => $arg{$_}" for keys %arg;
This prints as expected. But note that there may be edge cases, and in particular having a space inside (a quoted) argument value, like "%H %M", would require a far more complex pattern.
I presume that the regex ask is for play/study. Normally this goes by libraries, like Getopt::Long. If that is somehow not possible then processing #ARGV term by term is nice and easy -- and fast.
† In order to actually do "if value ($2) is undefined give it a value" we need to run code in the replacement part, what is done under the /e modifier

Perl Range command mismatching similar strings with one ending in a carriage return

The range command in Perl
RANGE
/^ identifier cust_pri/ .. /addr-type-none/
matches on strings with cust_pri and cust_pri_sip where a carriage return is immediately after the string cust_pri (and cust_pri_sip). I don't want a match on cust_pri_sip but only on cust_pri.
I tried putting in \r\n and both individually to no avail. Is there a string or metachar I can put into the end of perl range to help differentiate the two strings?
I need to look at data for both types of interfaces but on the first range command it is also collecting the data the second range command is also collecting (cust_pri_sip) causing my first script to error out. The second works find. I cannot change the input data and I need a way to differentiate the two.
This is a sub script of the main Perl program
WIDTH = 65
DIRECTORY = /home/myfiles/
MASTER Config Lines
identifier cust_pri
description *
addr prefix 0.0.0.0
network interfaces M00|1:\d*
tcp media profile
monitoring filters
node functionality
default location string
alt family realm
addr-type-none
RANGE
/^ identifier cust_pri/ .. /addr-type-none/
#
There is another sub script that is similar to above
RANGE
/^ identifier cust_pri_sip/ .. /addr-type-none/
The first script also collects the data of both scripts because it matches.
You can explicitly exclude _sip with /^ identifier cust_pri(?!_sip)/ or you can say cust_pri has to be at the end of the line with nothing after it with /^ identifier cust_pri$/

Match a file name that includes the path and a period within the name

I have a file I need to take just its name:
/var/www/foo/dog.tur-tles.chickens.txt
I want to match just the:
dog.tur-tles.chickens
I have tried this in regexer:
([^\/]*)$
This matches:
dog.tur-tles.chickens.txt
I can't figure out how to only exclude that last period.
You can assume it will always be a .txt, but I wanted to build in the ability that if a file was named dog-turtles.txt.txt it would see that the name is dog-turtles.txt.
You could use something like so: ([^\/]*)(\.).+?$.
An example is available here. Not though that this will fail for extensions such as .tar.gz and so on.
You may use File::Basename.fileparse to get the file name, then use rindex to get the last index of . and then get the required substring using substr:
use File::Basename;
$x = fileparse('/var/www/foo/dog.tur-tles.chickens.txt');
print substr($x, 0, rindex($x, '.')) . "\n";
Output of a sample program:
dog.tur-tles.chickens
$name = ($pathname =~ s{.*/}{}r =~ s{\.[^.]+$}{}r)
substitution 1 : just remove dir
substitution 2 : just remove extension if presente
Just add .txt to your regex and since * is greedy by default it will match everything till last .txt
([^\/]*)\.txt$
Input:
/var/www/foo/dog.tur-tles.chickens.txt.txt
/var/www/foo/dog.tur-tles.chickens.txt
Output:
dog.tur-tles.chickens.txt
dog.tur-tles.chickens
See DEMO

regex maching after new line in perl

i am trying to match with regex in perl different parts of a text which are not in the same line.
I have a file sized 200 mb aprox with all cases similar to the following example:
rewfww
vfresrgt
rter
*** BLOCK 049 Aeee/Ed "ewewew"U 141202 0206
BLAH1
BLAH2
END
and i want to extract all what is in the same line after the "***" in $1, BLAH1 in $2 and BLAH2 in $3.
i have tried the following without success:
open(archive, "C:/Users/g/Desktop/blahs.txt") or die "die\n";
while(< archive>){
if($_ =~ /^\*\*\*(.*)\n(.*)/s){
print $1;
print $2;
}
}
One more complexity: i don´t know how many BLAH´s are in each case. Perhaps one case have only BLAH1, other case with BLAH1, BLAH2 and BLAH3 etc. The only thing thats sure is the final "END" who separates the cases.
Regards
\*\*\*([^\n]*)\n|(?!^)\G\s*(?!\bEND\b)([^\n]+)
Try this.See demo.
https://regex101.com/r/vN3sH3/17
How about:
#!/usr/bin/perl
use strict;
use warnings;
open(my $archive, '<', "C:/Users/g/Desktop/blahs.txt") or die "die: $!";
while(<$archive>){
if (/^\*{3}/ .. /END/) {
s/^\*{3}//;
print unless /END/;
}
}
As far as I understand your question the following works for me. Please update or provide feedback if you are looking for something more or less strict (or spot any mistakes!).
^(\*{3}.*\n{2})(([a-zA-Z])*([0-9]*)\n{2})*(END)$
^(\*{3}\n{2}) - Find line consisting of three *s followed by two newlines - You could repeat this by adding * after the last closing parenthesis if you want/need to check for a "false" start. While it looks like you may have data in the file before this but this is the start of the data you actually care about/want to capture.
(([a-zA-Z])*([0-9]*)\n{2})* -The desired word characters followed by a number (or numbers if your BLAH count >9) and also check for two trailing spaces. The * at the end denotes that this can repeat zero or more times which accounts for the case where you have no data. If you want a fail if there is not data use ? instead of * to denote it must repeat 1 or more times. this segment assumes you wanted to check for data in the format word+number. If that is not the case this part can be easily modified to accept a wider range of data - let me know if you want/need a more or less strict case
(END)$ - The regex ends with sequence "END". If it is permissible for the data to continue and you just want to stop capture at this point do not include the $
I don't have permissions to post pics yet but a great site to check and to see a visual representation of your regex imo is https://www.debuggex.com/

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder, and parses it's values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second line (that are the column names from the .csv) to be ignored;
The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while($my line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}-\d{2}-d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}-\d{2}-d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and is one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions is an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}-\d{2}-d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what do you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a file handle value of zero. The or states that if open doesn't return a non-zero file handle, execute the line that follows which kills your program.
So, look through your program, and see any place where you make an assumption that something works as expected and think what happens if it didn't. Then, add checks in your program to something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. What ever you do, check for possible errors (especially from user input) and handle possible errors.
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.