The following Perl script and test data reproduce a situation where I find only 2 matches instead of the 4 I expect (I want to match every support.tier.1, with or without backslashes in between).
How can I modify this Perl regex? Thanks.
my @TestData = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com"
);
Here is the code to be changed:
my $count = 0;
foreach my $tier (@TestData) {
if($tier =~ m/support.tier.1/){
print "$count: $tier\n";
}
$count++;
}
I only get 2 matches, while I expect 4:
0: support.tier.1
6: support.tier.1@example.com
Update
Since it seems that you may indeed be getting strings containing backslashes, I suggest that you use String::Unescape to remove those backslashes before testing your strings. You will probably have to install it, as it isn't a core module.
Your code would look like this:
use strict;
use warnings;
use String::Unescape;
my @tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com",
);
my $count = 0;
for my $tier ( @tiers ) {
my $plain = String::Unescape->unescape($tier);
if ( $plain =~ /support\.tier\.1/ ) {
printf "%d: %s\n", ++$count, $tier;
}
}
output
1: support.tier.1
2: "support\.tier\.1"
3: support\.tier\.1\@example\.com
4: support.tier.1@example.com
Note that there is a bug in the String::Unescape module that prevents it from exporting the unescape function. It just means you have to use String::Unescape::unescape or String::Unescape->unescape all the time. Or you could import it manually with *unescape = \&String::Unescape::unescape.
The @tiers array contains these exact strings:
support.tier.1
support.tier.2
"support\.tier\.1"
support.tier.2
support\.tier\.1\@example\.com
support.tier.2@example.com
support.tier.1@example.com
support.tier.2@example.com
sales@example.com
Can you see that only items 1 and 7 contain the string support.tier.1? The other two that I imagine you expected to match are 3 and 5, which contain spurious backslashes.
It's not clear, but it seems unlikely that you will be getting data in this format. If you really want to match support.tier.1 where either dot may be preceded by a backslash character, then you need /support\\?\.tier\\?\.1/, but I think you are misunderstanding the way Perl strings work: inside double quotes, "\." is just "." and the backslash never reaches the string.
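To make the point about Perl strings concrete, here is a minimal sketch. The variable names are only for illustration; it compares a double-quoted and a single-quoted literal and tries the optional-backslash pattern on both.

```perl
use strict;
use warnings;

# Double quotes process escapes, so "\." is just "." and "\@" is just "@".
my $double = "support\.tier\.1\@example\.com";

# Single quotes keep the backslashes (only \\ and \' are special there).
my $single = 'support\.tier\.1\@example\.com';

# Allow an optional literal backslash before each (literal) dot.
my $re = qr/support\\?\.tier\\?\.1/;

for my $s ( $double, $single ) {
    printf "%-32s %s\n", $s, ( $s =~ $re ? "matches" : "no match" );
}
```

Both strings match, but only the single-quoted one actually contains backslash characters.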
I may not fully understand, but if I do, I agree with the answer that Matt has already attempted to give you. A regex can definitely handle your request if you are saying that the escape character may or may not appear before each period in support.tier.1.
A single backslash is written \\ in a pattern, and ? means essentially "one or zero":
use strict;
use warnings;
my @tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com",
);
my $count = 0;
foreach my $tier (@tiers) {
if ($tier =~ /support\\?\.tier\\?\.1/) {
print "$count: $tier\n";
}
$count++;
}
On an unrelated note, for the purpose of creating an easy-to-follow example, I included a suggestion on how you might better format your sample data instead of using the $str and pushes.
If this works, I'd recommend you ask Matt to post his comment responses as an answer and accept it.
I've got a function in Perl that reads the most recently modified .csv in a folder and parses its values into variables.
I'm running into some problems with the regular expressions.
My .csv files look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second lines (the title and the column names from the .csv) be ignored;
The file sometimes comes with 2-5 empty lines at the end, as shown in my sample (ignore the dot at the end of it; it doesn't exist in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while (my $line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle the CSV format.
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes comes with 2-5 empty lines at the end, as shown in my sample (ignore the dot at the end of it; it doesn't exist in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
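Putting the two fragments together, a self-contained sketch could look like the following. It reads from an in-memory filehandle instead of a real file so that it runs as-is; the sample text and the @records structure are assumptions made for the example, while parse_line comes from the core module Text::ParseWords, as in the question.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Sample content standing in for the real file, including trailing empty lines.
my $csv_text = <<'END_CSV';
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"


END_CSV

open my $fh, '<', \$csv_text or die "Cannot open in-memory file: $!";

my @records;
while ( my $line = <$fh> ) {
    chomp $line;
    next if $. <= 2;             # title line and column-name line
    next if $line =~ /^\s*$/;    # empty or whitespace-only lines

    my @fields = parse_line( ',', 0, $line );
    my ( $country, $name ) = $fields[4] =~ /(.+) - (.*)/;
    push @records,
        { country => $country, name => $name, total => $fields[5], rate => $fields[8] };
}
close $fh;

print "$_->{country}, $_->{name}, $_->{total}, $_->{rate}\n" for @records;
```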
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}\/\d{2}\/\d{4}/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation mark.
\d{2} - Followed by two digits.
\/ - Followed by a slash (escaped with a backslash because / also delimits the pattern).
\d{2} - Followed by two more digits.
\/ - Followed by another slash.
\d{4} - Followed by four more digits.
This describes the first part of your line, which is the opening quote followed by the date in MM/DD/YYYY format, as in your sample data. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
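As a quick check of the idea, here is a small sketch that filters a handful of made-up lines modelled on the sample file. The dates in the sample data are separated by slashes, so the pattern here uses slashes, written with m{...}-style delimiters so they need no escaping.

```perl
use strict;
use warnings;

# Lines modelled on the sample file: title, header, one data row, blanks.
my @lines = (
    'Title is: "NAME_NAME_NAME"',
    '"Period end","Duration","Sample"',
    '"04/12/2014 11:00:00","3600","1"',
    '',
    '   ',
);

# Opening quote, then a date such as 04/12/2014.
my $starts_with_date = qr{^"\d{2}/\d{2}/\d{4}};

my @data = grep { $_ =~ $starts_with_date } @lines;
print "$_\n" for @data;
```

Only the data row survives; the title, header and blank lines are all rejected by the same test.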
Regular expressions can be difficult to understand, which is one of the reasons why Perl has such a reputation as a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions are an extremely powerful tool and worth the effort to learn. With some experience, you'll be able to decode them easily.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and of what you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a false value (and sets $! to the reason). The or states that if open returns false, Perl executes the statement that follows, which kills your program.
So, look through your program for any place where you assume that something works as expected, and think about what happens if it doesn't. Then add checks to your program to do something when you hit that exception. It could be that you want to report or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. Whatever you do, check for possible errors (especially from user input) and handle them.
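Here is a small sketch of that check in isolation. The path is hypothetical and is assumed not to exist; on failure open returns a false value and the special variable $! carries the operating system's reason, which is worth including in the message.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist, so that open fails.
my $missing = "no-such-dir-for-this-example/data.csv";

my $ok = open( my $fh, '<', $missing );
if ( !$ok ) {
    # $! holds the reason for the failure (e.g. "No such file or directory").
    print qq(Cannot open file "$missing" for reading: $!\n);
}
```

In real code you would usually write open ... or die ... as above; capturing the result in $ok just makes the return value visible.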
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote-like operator. It's another way to put double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.
What's the best way to clear/reset all regex matching variables?
Example of how $1 isn't reset between regex operations and keeps the value from the most recent successful match:
$_="this is the man that made the new year rumble";
/ (is) /;
/ (isnt) /;
say $1; # outputs "is"
Example how this may be problematic when working with loops:
foreach (...){
/($some_value)/;
&doSomething($1) if $1;
}
Update: I didn't think I'd need to say this, but Example 2 is only an example. This question is about resetting matching variables, not the best way to use them.
Regardless, my coding style was originally more in line with being explicit and using if-blocks. Coming back to this (Example 2) now, I find it much more concise, and when reading many lines of code I find this syntax faster to comprehend.
You should use the return from the match, not the state of the group vars.
foreach (...) {
doSomething($1) if /($some_value)/;
}
$1, etc. are only guaranteed to reflect the most recent match if the match succeeds. You shouldn't be looking at them other than right after a successful match.
Regex captures* are reset by a successful match. To reset regex captures, one would use a trivial match operation that's guaranteed to match.
"a" =~ /a/; # Reset captures to undef.
Yeah, it looks weird, but you asked to do something weird.
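A minimal sketch of both behaviours, using the string from the question (the variable names $stale and $after are just for the demonstration):

```perl
use strict;
use warnings;

$_ = "this is the man that made the new year rumble";

/ (is) /;          # succeeds and sets $1 to "is"
/ (isnt) /;        # fails, so $1 is left alone
my $stale = $1;    # still "is"

"a" =~ /a/;        # succeeds and has no capture groups
my $after = $1;    # now undef

printf "stale: %s, after reset: %s\n",
    $stale, defined $after ? $after : 'undef';
```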
If you fix your code, you don't need weird-looking workarounds. Fixing your code even reveals a bug!
Fixes:
$_ = "this is the man that made the new year rumble";
if (/ (is) / || / (isnt) /) {
say $1;
} else{
... # You're currently printing something random.
}
and
for (...) {
if (/($some_pattern)/) {
do_something($1);
}
}
* — Backrefs are regex patterns that match previously captured text. e.g. \1, \k<foo>. You're actually talking about "regex capture buffers".
You should test whether the match succeeded. For example:
foreach (...){
/($some_value)/ or next;
doSomething($1) if $1;
}
foreach (...){
doSomething($1) if /($some_value)/ and $1;
}
foreach (...){
if (/($some_value)/) {
doSomething($1) if $1;
}
}
Depending on what $some_value is, and how you want to handle matching the empty string and/or 0, you may or may not need to test $1 at all.
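To illustrate why the extra test of $1 matters, here is a small sketch with made-up inputs. A capture of "0" is a successful match but a false value, so a plain truth test silently drops it.

```perl
use strict;
use warnings;

my ( @by_match, @by_truth );
for my $input ( "count=0", "count=7", "no digits here" ) {
    next unless $input =~ /count=(\d+)/;    # rely on the match result
    push @by_match, $1;                     # keeps "0"
    push @by_truth, $1 if $1;               # drops "0", because "0" is false
}

print "by match: @by_match\n";
print "by truth: @by_truth\n";
```

If "0" or the empty string is a legitimate capture in your data, test the match result (or use defined/length) rather than the truth of $1.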
To complement the existing, helpful answers (notwithstanding the sensible recommendation to test the result of a matching operation in a Boolean context and act only if the test succeeds):
Depending on your scenario, you can approach the problem differently:
Disclaimer: I'm not an experienced Perl programmer; do let me know if there are problems with this approach.
Enclosing the matching operation in a do { ... } block scopes all regex-related special variables ($&, $1, ...) to that block.
Thus, you can use a do { ... } to prevent these special variables from getting set in the first place (although the ones from a previous regex operation outside the block will obviously remain in effect); for instance:
$_="this is the man that made the new year rumble";
# Match in current scope; -> $&, $1, ... *are* set.
/ (is) /;
# Match inside a `do` block; the *new* $&, $1, ... values
# are set only *inside* the block;
# `&& $1` passes out the block's version of `$1`.
$do1 = do { / (made) / && $1 };
print "\$1 == '$1'; \$do1 == '$do1'\n"; # -> $1 == 'is'; $do1 == 'made'
The advantage of this approach is that none of the current scope's special regex variables are set or altered; the accepted answer, by contrast, alters variables such as $& and $'.
The disadvantage is that you must explicitly pass out variables of interest; you do get the result of the matching operation by default, however, and if you're only interested in the contents of capture buffers, that will suffice.
You should do it this way:
foreach (...) {
someFnc($1) if /.../;
}
But if you want to stick with your style, then check this as an idea:
$_ = "this is the man that made the new year rumble";
$m = /(is)/ ? $1 : undef;
$m = /(isnt)/ ? $1 : undef;
print $m, "\n" if defined $m;
Assigning captures to a list behaves closer to what it sounds like you want.
for ("match", "fail") {
my ($fake_1) = /(m.+)/;
doSomething($fake_1) if $fake_1;
}
I've been whacking on this regex for a while, trying to build something that can pick out multiple ordered property values (DTSTART, DTEND, SUMMARY) from an .ics file. I have other options (like reading one line at a time and scanning), but wanted to build a single regex that can handle the whole thing.
SAMPLE PERL
# There has got to be a better way...
my $x1 = '(?:^DTSTART[^\:]*:(?<dts>.*?)$)';
my $x2 = '(?:^DTEND[^\:]*:(?<dte>.*?)$)';
my $x3 = '(?:^SUMMARY[^\:]*:(?<dtn>.*?)$)';
my $fmt = "$x1.*$x2.*$x3|$x1.*$x3.*$x2|$x2.*$x1.*$x3|$x2.*$x3.*$x1|$x3.*$x1.*$x2|$x3.*$x2.*$x1";
if ($evts[1] =~ /$fmt/smo) {
printf "lines:\n==>\n%s\n==>\n%s\n==>\n%s\n", $+{dts}, $+{dte}, $+{dtn};
} else {
print "Failed.\n";
}
SAMPLE DATA
BEGIN:VEVENT
UID:0A5ECBC3-CAFB-4CCE-91E3-247DF6C6652A
TRANSP:OPAQUE
SUMMARY:Gandalf_flinger1
DTEND:20071127T170005
DTSTART,lang=en_us:20071127T103000
DTSTAMP:20100325T003424Z
X-APPLE-EWS-BUSYSTATUS:BUSY
SEQUENCE:0
END:VEVENT
SAMPLE OUTPUT
lines:
==>
20071127T103000
==>
20071127T170005
==>
Gandalf_flinger1
CPAN is your friend:
vFile
iCal parser
Without a parser for the vFile format you will pull your hair out until you are bald on anything but trivial files. Doing this with a regex is very hard.
Instead of permuting the three regexes into one big pattern with ORs, why not test the three patterns separately, since (given the ^ and $ anchors) they cannot overlap?
my $x1 = qr/(?:^DTSTART[^:]*:(?<dts>.*?)$)/smo;
my $x2 = qr/(?:^DTEND[^:]*:(?<dte>.*?)$)/smo;
my $x3 = qr/(?:^SUMMARY[^:]*:(?<dtn>.*?)$)/smo;
if ($evts[1] =~ $x1 and $evts[1] =~ $x2 and $evts[1] =~ $x3)
{
# ...
}
(I also turned the x variables into patterns themselves, and removed the unneeded escape in the character classes.)
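One thing to watch when matching separately: %+ and $1 only reflect the last successful match, so after three matches in a row only the last pattern's captures are still available. A sketch that avoids this by taking each capture in list context follows. The %pattern and %value hashes and the simplified patterns are assumptions for the example, and the event text is the sample data from the question.

```perl
use strict;
use warnings;

my $event = <<'ICS';
BEGIN:VEVENT
UID:0A5ECBC3-CAFB-4CCE-91E3-247DF6C6652A
TRANSP:OPAQUE
SUMMARY:Gandalf_flinger1
DTEND:20071127T170005
DTSTART,lang=en_us:20071127T103000
DTSTAMP:20100325T003424Z
X-APPLE-EWS-BUSYSTATUS:BUSY
SEQUENCE:0
END:VEVENT
ICS

# One pattern per property; /m lets ^ and $ anchor at each line.
my %pattern = (
    dts => qr/^DTSTART[^:\n]*:(.*)$/m,
    dte => qr/^DTEND[^:\n]*:(.*)$/m,
    dtn => qr/^SUMMARY[^:\n]*:(.*)$/m,
);

my %value;
for my $key ( sort keys %pattern ) {
    # A list-context match returns the capture directly, so nothing
    # depends on %+ or $1 surviving a later match.
    my ($found) = $event =~ $pattern{$key}
        or die "Missing property for '$key'\n";
    $value{$key} = $found;
}

printf "lines:\n==>\n%s\n==>\n%s\n==>\n%s\n", @value{qw(dts dte dtn)};
```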
It's better to use three regexes and some extra logic. This problem isn't a good match for regexes.
That's ugly... I think that the "better way" is to match each property, one at a time.
I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example, lines 1 - 4 work correctly but 5, 6 and 7 don't.
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSVs are more complicated than they look, I'm using a recipe from the Perl Cookbook.
sub parse_csv {
my $text = shift; # record containing CSVs
my @columns = ();
push(@columns, $+) while $text =~ m{
# The first part groups the phrase inside quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@columns, undef) if substr($text, -1, 1) eq ',';
return @columns; # list of vars that was comma separated.
}
Does anyone have a suggestion for improving the regex to handle the above cases?
Please, Try Using CPAN
There's no reason you couldn't download a copy of Text::CSV, or any other non-XS based implementation of a CSV parser, and install it in your local directory, or in a lib/ subdirectory of your project so it's installed along with your project's rollout.
If you can't store text files in your project, then I'm wondering how it is you are coding your project.
http://novosial.org/perl/life-with-cpan/non-root/
Should be a good guide on how to get these into a working state locally.
Not using CPAN is really a recipe for disaster.
Please consider this before trying to write your own CSV implementation.
Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and re-writing this from scratch will just make you learn how awful CSV can be the hard way.
note: I learnt this the hard way. Took me a full day to get a working CSV parser in PHP before I discovered an inbuilt one had been added in a later version. It really is something awful.
You can parse CSV using Text::ParseWords which ships with Perl.
use feature 'say';
use Text::ParseWords;
while (<DATA>) {
chomp;
my @f = quotewords ',', 0, $_;
say join ":" => @f;
}
__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,
which parses your CSV correctly....
# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:
The only issue I've had with Text::ParseWords is when nested quotes in data aren't escaped correctly. However this is badly built CSV data and would cause problems with most CSV parsers ;-)
So you may notice that
# S,"YBELT,"V"",000323030,
came out as (i.e. quotes dropped around "V")
# S:YBELT,V:000323030:
however if it's escaped like so
# S,"YBELT,\"V\"",000323030,
then quotes will be retained
# S:YBELT,"V":000323030:
tested; working:-
$_.=','; # fake an ending delimiter
while($_=~/"((?:""|[^"])*)",|([^,]*),/g) {
$cell=defined($1) ? $1:$2; $cell=~s/""/"/g;
print "$cell\n";
}
# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
# "((?:""|[^"])*)",
# It must start with a quote, and end with a quote followed by a comma, and is allowed to contain either doubled quotes - "" - or anything except a double quote [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.
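The same loop can be wrapped in a small function for reuse. This is only a sketch: the name split_csv_line is made up for the example, and it assumes embedded quotes are written in the doubled "" form described above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the loop above; returns the list of cells.
sub split_csv_line {
    my ($line) = @_;
    $line .= ',';    # fake an ending delimiter
    my @cells;
    while ( $line =~ /"((?:""|[^"])*)",|([^,]*),/g ) {
        my $cell = defined $1 ? $1 : $2;    # quoted cell, else plain cell
        $cell =~ s/""/"/g;                  # un-double embedded quotes
        push @cells, $cell;
    }
    return @cells;
}

print join( '|', split_csv_line('S,"BELT,FAN",003541547') ),       "\n";
print join( '|', split_csv_line('S,"YBELT,""V""",000323030') ), "\n";
```

A trailing comma on the input yields one extra, empty cell at the end, because the faked delimiter then closes an empty field.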
Be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if reading the data in line-at-a-time:
if("$pre$_"=~/,"[^,]*\z/) {
$pre.=$_; next;
}
$_="$pre$_";
This works like a charm.
The line is assumed to be comma-separated, possibly with embedded commas inside quoted fields:
my @columns = Text::ParseWords::parse_line(',', 0, $line);
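For reference, a runnable sketch of that call on one of the lines from the question (without the trailing comma, to keep the result simple). The second argument is the keep flag: 0 strips the quotes from quoted fields, and a true value keeps them.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = q{S,"BELT V,FAN",000324244};

my @columns = parse_line( ',', 0, $line );    # quotes removed
my @kept    = parse_line( ',', 1, $line );    # quotes kept

print join( '|', @columns ), "\n";
print join( '|', @kept ),    "\n";
```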
Finding matching pairs using regexes is a non-trivial and, in general, unsolvable task. There are plenty of examples in Jeffrey Friedl's book Mastering Regular Expressions. I don't have it at hand now, but I remember that he used CSV for some examples, too.
You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can use lib 'dir'; - it's probably better).
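A minimal sketch of the use lib route, assuming the modules have been copied or installed into a lib/ directory next to the script (that layout is an assumption for the example). FindBin is a core module that reports the script's own directory.

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # directory containing the running script
use lib "$Bin/lib";      # hypothetical project-local module directory

# Modules placed under lib/ next to the script, e.g. lib/Text/CSV.pm,
# can now be loaded with a plain `use Text::CSV;`.
print "Searching first in: $INC[0]\n";
```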
Tested:
use Test::More tests => 2;
use strict;
sub splitCommaNotQuote {
my ( $line ) = @_;
my @fields = ();
while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
if ( $2 ) {
push( @fields, $3 );
} else {
push( @fields, $1 );
}
last if ( ! $4 );
}
return( @fields );
}
is_deeply(
+[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
+['S', '"D" REL VALVE ASSY', '000771881', ''],
"Quote in value"
);
is_deeply(
+[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
+['S', 'BELT V,FAN', '000324244', ''],
"Strip quotes from entire value"
);