Why isn't my regex matching my input data? - regex

I have a column of values (strings) that look like this:
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
I am trying to identify those values where the following pattern is met only once and no extra text is present:
three letters-some numbers-three letters
In the previous example, this would match the first value, but not the other three, because they have extra bits of text or there are two instances of the pattern separated by blank spaces or commas.
I tried using something like this in Perl:
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
And actually checked the regular expression in regexplanet.com, where it worked beautifully. However, when I used it in my code it wasn't matching any of the values I listed above, missing even the first one. Any idea why this could be happening? Any advice on an alternative for this?

It works fine. Here it is fixed (you didn't terminate your regex) and incorporated into a working program
use strict;
use warnings;
use v5.10;
while ( my $value = <DATA> ) {
my $qualifier;
if ( $value =~ /^[[:alpha:]]{3}\d{1,9}[[:alpha:]]{3}$/ ) {
$qualifier = "ok";
}
else {
$qualifier = "needs cleaning";
}
say $qualifier;
}
__DATA__
arg123ala
arg345ala_r
thr567por thr789pro
pro1ala,thr2leu
output
ok
needs cleaning
needs cleaning
needs cleaning

Looks like topic starter forgot final / in regexp.
I would use expression like this: /^[a-z]{3}\d+[a-z]{3}$/

Related

Perl switch/case Fails on Literal Regex String Containing Non-Capturing Group '?'

I have text files containing lines like:
2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
I am matching them with /^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/ But I also have some files with lines in a completely different format, which I match with a different regex. When I open a file I determine which format and assign $pat = '<regex-string>'; in a switch/case block:
$pat = '/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/'
But the ? character that introduces the non-capturing group I use to match repeats after the date and before the first currency amount causes the Perl interpreter to fail to compile the script, reporting on abort:
syntax error at ./report-dates-amounts line 28, near "}continue "
If I delete the ? character, or replace ? with \? escaped character, or first assign $q = '?' then replace ? with $q inside a " string assignment (ie. $pat = "/^\s*(\S+)\s+($q:[0-9|\)| ]+)+\s+([0-9|.|,]+)\s+\$/"; ) the script compiles and runs. If I assign the regex string outside the switch/case block that also works OK. Perl v5.26.1 .
My code also doesn't have any }continue in it, which as reported in the compilation failure is probably some kind of transformation of the switch/case code by Switch.pm into something native the compiler chokes on. Is this some kind of bug in Switch.pm? It fails even when I use given/when in exactly the same way.
#!/usr/local/bin/perl
use Switch;
# Edited for demo
switch($format)
{
# Format A eg:
# 2/17/2018 400000098627 =2,000.0 $2.0994 $4,387.75
# 3/7/2018 1)0000006043 2,000.0 $2.0731 $4,332.78
# 3/26/2018 4 )0000034242 2,000.0 $2.1729 $4,541.36
# 4/17/2018 2)0000008516 2,000.0 $2.219 $4,637.71
#
case /^(?:april|snow)$/i
{ # This is where the ? character breaks compilation:
$pat = '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$';
# WORKS:
# $pat = '^\s*(\S+)\s+(' .$q. ':[0-9|\)| ]+)+\s+\D' .$q. '(\S+)\s+\$';
}
# Format B
case /^(?:umberto|petro)$/i
{
$pat = '^(\S+)\s+.*Think 1\s+(\S+)\s+';
}
}
Don't use Switch. As mentionned by #choroba in the comments, Switch uses a source filter, which leads to mysterious and hard to debug errors, as you constated.
The module's documentation itself says:
In general, use given/when instead. It were introduced in perl 5.10.0. Perl 5.10.0 was released in 2007.
However, given/when is not necessarily a good option as it is experimental and likely to change in the future (it seems that this feature was almost removed from Perl v5.28; so you definitely don't want to start using it now if you can avoid it). A good alternative is to use for:
for ($format) {
if (/^(?:april|snow)$/i) {
...
}
elsif (/^(?:umberto|petro)$/i) {
...
}
}
It might look weird a first, but once you get used to it, it's actually reasonable in my opinion. Or, of course, you can use none of this options and just do:
sub pattern_from_format {
my $format = shift;
if ($format =~ /^(?:april|snow)$/i) {
return qr/^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$/;
}
elsif ($format =~ /^(?:umberto|petro)$/i) {
return qr/^(\S+)\s+.*Think 1\s+(\S+)\s+/;
}
# Some error handling here maybe
}
If, for some reason, you still want to use Switch: use m/.../ instead of /.../.
I have no idea why this bug is happening, however, the documentation says:
Also, the presence of regexes specified with raw ?...? delimiters may cause mysterious errors. The workaround is to use m?...? instead.
Which I misread at first, and therefore tried to use m/../ instead of /../, which fixed the issue.
Another option instead of an if/elsif chain would be to loop over a hash which maps your regular expressions to the values which should be assigned to $pat:
#!/usr/local/bin/perl
my %switch = (
'^(?:april|snow)$' => '^\s*(\S+)\s+(?:[0-9|\)| ]+)+\s+\D?(\S+)\s+\$',
'^(?:umberto|petro)$' => '^(\S+)\s+.*Think 1\s+(\S+)\s+',
);
for my $re (keys %switch) {
if ($format =~ /$re/i) {
$pat = $switch{$re};
last;
}
}
For a more general case (i.e., if you're doing more than just assigning a string to a scalar) you could use the same general technique, but use coderefs as the values of your hash, thus allowing it to execute an arbitrary sub based on the match.
This approach can cover a pretty wide range of the functionality usually associated with switch/case constructs, but note that, because the conditions are pulled from the keys of a hash, they'll be evaluated in a random order. If you have data which could match more than one condition, you'll need to take extra precautions to handle that, such as having a parallel array with the conditions in the proper order or using Tie::IxHash instead of a regular hash.

Match string with escape characters or backslashes

The following perl script and TestData simulate the situation where I can only find 2 instead of 4 expected. (to match all support.tier.1 with backslash in between).
How can I modify this perl regex here? thanks
my #TestData(
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com"
);
Here is the code to be changed:
my $count = 0;
foreach my $tier(#TestData){
if($tier =~ m/support.tier.1/){
print "$count: $tier\n";
}
$count++;
}
I only get 2 matches while the expected is 4:
0: support.tier.1
6: support.tier.1#example.com
Update
Since it seems that you may indeed be getting strings containing backslashes, I suggest that you use String::Unescape to remove those backslashes before testing your strings. You will probably have to install it as it isn't a core module
Your code would look like this
use strict;
use warnings;
use String::Unescape;
my #tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com",
);
my $count = 0;
for my $tier ( #tiers ) {
my $plain = String::Unescape->unescape($tier);
if ( $plain =~ /support\.tier\.1/ ) {
printf "%d: %s\n", ++$count, $tier;
}
}
output
1: support.tier.1
2: "support\.tier\.1"
3: support\.tier\.1\#example\.com
4: support.tier.1#example.com
Note that there is a bug in the String::Unescape module that prevents it from exporting the unescape function. It just means you have to use String::Unescape::unescape or String::Unescape->unescape all the time. Or you could import it manually with *unescape = \&String::Unescape::unescape
The #tiers array contains these exact strings
support.tier.1
support.tier.2
"support\.tier\.1"
support.tier.2
support\.tier\.1\#example\.com
support.tier.2#example.com
support.tier.1#example.com
support.tier.2#example.com
sales#example.com
Can you see that only items 1 and 7 contain the string support.tier.1? The other two that I imagine you expected to match are 3 and 5, which contain spurious backslashes
It's not clear, but it seems unlikely that you will be getting data in this format. If you really want to match support.tier.1 where either dot may be preceded by a backslash character then you need /support\\?\.tier\\?\.1/, but I think you are misunderstanding the way Perl strings work
I may not fully understand, but if I do I agree with the answer that Matt has already attempted to give you. Regex definitely can handle your request if you are saying that the escape character may or may not be before each period in support.tier.1.
A single backslash is \\ and ? means essentially "one or zero:"
use strict;
use warnings;
my #tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\#example.com"),
"support.tier.2\#example.com",
"support\.tier\.1\#example\.com",
"support\.tier\.2\#example\.com",
"sales\#example\.com",
);
my $count = 0;
foreach my $tier (#tiers) {
if ($tier =~ /support\\?.tier\\?.1/) {
print "$count: $tier\n";
}
$count++;
}
On an unrelated note, for the purpose of creating an easy-to-follow example, I included a suggestion on how you might better format your sample data instead of using the $str and pushes.
If this works, I'd recommend you ask Matt to post his comment responses as an answer and accept it.

How to assign class based on regexp match (sorting in perl)

I am reading from file. Based on value in one column, I want to assign my own class/tag to it.
These regexps:
'LTR*','MLT*','MST*' ...
belong to the class HERV.
'Charlie*','Looper*' ...
belong to the class DNA
Right now I have two arrays, one with regexps and one with respective classes:
my #array = map { qr{$_} } ('Alu*', 'HERV*', 'Charlie*' ...
my #classes = ('Alu', 'HERV', 'DNA', 'LINE' ...
So that I know that if my line matches Charlie*, it belongs to the class DNA.
To sum it up, for every line of the file I am looping the whole array and looking for match:
for my $i (0 .. $#array) {
if ($type =~ m/$array[$i]/) {
my $class=$classes[$i];
}
}
Of course, this is not too clever. It would be much better to say: "this group of regexps belongs to this class" which suggests use of hash.
However, I consider it quite inconvenient to loop all lines, than all keys of hashmap and then all values of certain keys and, when there is a match, use the key as the resulting class/tag. Is this good solution or not?
Thank you very much.
You can do something like this:
my %re = (
HERV=>qr/LTR|MLT|MST/,
DNA=> qr/Charlie|Looper/
);
my $class;
for (keys %re) {
$class = $_, last if ($type =~ $re{$_});
}
This will save you some regex compilation and one loop.
The CPAN module Text::Prefix::XS appears to do what you want: determine which if any of a list of prefixes match a given text. I have not used the module, but from what I can tell you would do something like:
my %prefix2class = ( LTR => 'HERV',
MLV => 'HERV',
...
Charlie => 'DNA' );
my $search = prefix_search_create( keys %prefix2class );
# ... now, for a given $type, no need to loop ...
my $pfx = prefix_search($search, $type);
my $class = $prefix2class{$pfx};
(Note: Your regexes look to me like shell-style/fnmatch-style patterns dubiously compiled as regexes, and from this I infer that you actually want simple prefix matching. Otherwise, the regex /Charlie*/, for example, would match Charli, Charlieeee, fooCharliebar, and so on — that seems unlikely to be representative of your "value in one column".)

evaluate pattern stored in variable perl regexp

I am trying to find out if basket has apple [simplified version of a big problem]
$check_fruit = "\$fruit =~ \/has\/apple\/";
$fruit="basket/has/mango/";
if ($check_fruit) {
print "apple found\n";
}
check_fruit variable is holding the statement of evaluating the regexp.
However it check_fruit variable always becomes true and shows apple found :(
Can somebody help me here If I am missing something.
Goal to accomplish:
Okay so let me explain:
I have a file with a pattern clause defined on eachline similar to:
Line1: $fruit_origin=~/europe\\/finland/ && $fruit_taste=~/sweet/
Line2: similar stuff that can contain ~10 pattern checks seprated by && or || with metacharacters too
2.I have another a list of fruit attributes from a perl hash containing many such fruits
3 I want to categorize each fruit to see how many fruits fall into category defined by each line of the file seprately.
Sort of fruit count /profile per line Is there an easier way to accomplish this ? Thanks a lot
if ($check_fruit) returns true because $check_fruit is defined, not empty and not zero. If you want to evaluate its content, use eval. But a subroutine would serve better:
sub check_fruit {
my $fruit = shift;
return $fruit =~ m(has/apple);
}
if (check_fruit($fruit)) {
print "Apple found\n";
}
Why is there a need to store the statement in a variable? If you're sure the value isn't set by a user, then you can do
if (eval $check_fruit) {
but this isn't safe if the user can set anything in that expression.
Put the pattern (and only the pattern) into the variable, use the variable inside the regular expression matching delimiters m/.../. If you don't know the pattern in advance then use quotemeta for escaping any meta characters.
It should look like this:
my $check_fruit = '/has/apple/'; # here no quotemeta is needed
my $fruit = 'basket/has/mango/';
if ($fruit =~ m/$check_fruit/) {
# do stuff!
}
$check_fruit is nothing but a variable holding string data. If you want to execute the code it contains, you have to use eval.
There were also some other errors in your code related to string quoting/escaping. This fixes that as well:
use strict;
use warnings;
my $check_fruit = '$apple =~ m|/has/mango|';
my $apple="basket/has/mango/";
if (eval $check_fruit) {
print "apple found\n";
}
However, this is not usually a good design. At the very least, it makes for confusing code. It is also a huge security hole if $check_fruit is coming from the user. You can put a regex into a variable, which is preferable:
Edit: note that a regex that comes from user input can be a security problem as well, but it is more limited in scope.
my $check_fruit = qr|/has/mango|;
my $apple="basket/has/mango/";
if ($apple =~ /$check_fruit/) {
print "apple found\n";
}
There are other things you can do to make your Perl code more dynamic, as well. The best approach would depend on what you are trying to accomplish.

Better way to write this regex to match multi-ordered property list?

I've been whacking on this regex for a while, trying to build something that can pick out multiple ordered property values (DTSTART, DTEND, SUMMARY) from an .ics file. I have other options (like reading one line at a time and scanning), but wanted to build a single regex that can handle the whole thing.
SAMPLE PERL
# There has got to be a better way...
my $x1 = '(?:^DTSTART[^\:]*:(?<dts>.*?)$)';
my $x2 = '(?:^DTEND[^\:]*:(?<dte>.*?)$)';
my $x3 = '(?:^SUMMARY[^\:]*:(?<dtn>.*?)$)';
my $fmt = "$x1.*$x2.*$x3|$x1.*$x3.*$x2|$x2.*$x1.*$x3|$x2.*$x3.*$x1|$x3.*$x1.*$x2|$x3.*$x2.*$x1";
if ($evts[1] =~ /$fmt/smo) {
printf "lines:\n==>\n%s\n==>\n%s\n==>\n%s\n", $+{dts}, $+{dte}, $+{dtn};
} else {
print "Failed.\n";
}
SAMPLE DATA
BEGIN:VEVENT
UID:0A5ECBC3-CAFB-4CCE-91E3-247DF6C6652A
TRANSP:OPAQUE
SUMMARY:Gandalf_flinger1
DTEND:20071127T170005
DTSTART,lang=en_us:20071127T103000
DTSTAMP:20100325T003424Z
X-APPLE-EWS-BUSYSTATUS:BUSY
SEQUENCE:0
END:VEVENT
SAMPLE OUTPUT
lines:
==>
20071127T103000
==>
20071127T170005
==>
Gandalf_flinger1
CPAN is your friend:
vFile
iCal parser
You will pull your hair out until bald without a parser on vFile format (other than trivial files.) Regex for this is very hard.
Instead of permuting the three regexes into one big pattern with ORs, why not test the three patterns separately, since (given the anchoring $s, ) they cannot overlap?
my $x1 = qr/(?:^DTSTART[^:]*:(?<dts>.*?)$)/smo;
my $x2 = qr/(?:^DTEND[^:]*:(?<dte>.*?)$)/smo;
my $x3 = qr/(?:^SUMMARY[^:]*:(?<dtn>.*?)$)/smo;
if ($evts[1] =~ $x1 and $evts[1] =~ $x2 and $evts[1] =~ $x3)
{
# ...
}
(I also turned the x variables into patterns themselves, and removed the unneeded escape in the character classes.)
It's better to use three regexes and some extra logic. This problem isn't a good match for regexes.
That's ugly... I think that the "better way" is to match each property, once at a time.