Partial match of strings, operator ( =~ ) - regex

I have used " =~ " to compare two strings (The length of two strings is the same.) in my script to allow a don't care condition. If a character is "." in a string, that character is ignored to compare. In other words, it is a partial match case.
comp_test.pl :
#!/usr/bin/perl
use strict;
use warnings;
my $a=".0.0..0..1...0..........0...0......010.1..........";
my $b="10.0..0..1...0..........0...0......010.1..........";
my $c=".0.0..0..1...0..........0...0......010.1..........";
if ($a =~ $b) {
print "a and b same\n";
}
if ($a =~ $c) {
print "a and c same\n";
}
Because of don't care condition by ".", the expected result should be both "a and b same" and "a and c same". However, currently, the result is only "a and c same". Please let me know any good operator or changing "." to "x" may help?

This is not a perl version problem. You are doing a regular expression match. The operand on the left of the =~ is the string and the operand on the right is the regex being applied to it.
This can be used for the kind of partial matching you are doing, given that the strings are the same length and each character of the regular expression matches a character of the string, but only where there is a . on the right. Where there is a 1 or a 0 in the regular expression ($b in the case of $a =~ $b), there must be an exactly matching character in the string ($a), not a ..
To do the kind of partial match you seem to want to do, you can use a bitwise exclusive or, like so:
sub partial_match {
my ($string_1, $string_2) = #_;
return 0 if length($string_1) != length($string_2);
# bitwise exclusive or the two strings together; where there is
# a 0 in one string and a 1 in the other, the result will be "\1".
# count \1's to get the number of mismatches
my $mismatches = ( $string_1 ^ $string_2 ) =~ y/\1//;
return $mismatches == 0;
}

While . matches 1 (or any other character), 1 doesn't match . (or any other character other than 1).
The following is a fast solution. It performs best when most strings match (since it always checks the entire string).
sub is_match { ( ( $_[0] ^ $_[1] ) =~ tr/\x00\x1E\x1F//c ) == 0 }
say is_match($a, $b) ? "match" : "no match";
say is_match($b, $c) ? "match" : "no match";
How it works:
Hex of characters
=================
30 30 31 31 2E 2E "0011.."
30 31 30 31 30 31 "010101"
XOR -----------------
00 01 01 00 1E 1F
^^ ^^ 2 mismatches
This solution even works if one of the strings is shorter than the other (since the XOR will result in 30, 31 or 2E for the extra characters).
The following is a fast solution. It performs best when most strings don't match (since it stops checking as soon as a match is impossible).
sub make_matcher {
my $pat =
join '',
map { $_ eq '.' ? $_ : "[.\Q$_\E]" }
split //, $_[0];
return qr/^$pat\z/;
}
sub is_match { $_[0] =~ make_matcher($_[1]) }
say is_match($a, $b) ? "match" : "no match";
say is_match($b, $c) ? "match" : "no match";
Tested.

From my understanding, you are trying to compare the length of 2 strings.Basically, only the length of the sting need to be compared, not the bit-wise characters.
my $a=".0.0..0..1...0..........0...0......010.1..........";
my $b="10.0..0..1...0..........0...0......010.1..........";
my $c=".0.0..0..1...0..........0...0......010.1..........";
So the code could be:
if(length($a) == length($b))
{
print "match found";
}
else
{
print "No match";
}

Related

Why is the Perl "doesn't match" operator not working here?

I have the following Perl code:
my $athCombined = "$athSymbol $athExpiration $athStrike $athType";
if (($instrumentType eq "STOCK" && $cbSymbol ne $athSymbol) ||
($instrumentType eq "OPTION" && $cbSymbol !~ /^$athSymbol.*$athExpiration $athStrike $athType$/) ||
($instrumentType eq "FUTURESOPTION" && $cbSymbol !~ /^$athCombined$/)) {
print "ERROR: Symbols on lines $cbLineNum and $athLineNum don't match. ABORTING.\n";
print "instrumentType =$instrumentType\n";
print "cbSymbol =$cbSymbol\n";
print "athCombined =$athCombined\n";
print "length cbSymbol =" . length($cbSymbol) . "\n";
print "length athCombined=" . length($athCombined) . "\n";
if ($instrumentType eq "FUTURESOPTION") {
print "YES1\n";
}
if ($cbSymbol !~ /^$athCombined$/) {
print "YES2\n";
}
if ($cbSymbol eq $athCombined) {
print "they are the same\n";
}
exit;
}
I am getting the following output:
ERROR: Symbols on lines 434 and 1906 don't match. ABORTING.
instrumentType =FUTURESOPTION
cbSymbol =/ESM19 1/50 JUN 19 (Monday) (Wk1) /E1AM19 2745 CALL
athCombined =/ESM19 1/50 JUN 19 (Monday) (Wk1) /E1AM19 2745 CALL
length cbSymbol =51
length athCombined=51
YES1
YES2
they are the same
The conditional $cbSymbol !~ /^$athCombined$/ is returning true even though $cbSymbol and $athCombined are the same. If I change the conditional to cbSymbol ne $athCombined, then it works correctly, but I'd like to know why it is not working as is.
Interestingly, I am not able to reproduce this in a simple test program:
my $a = "1";
my $b = "2";
my $c = "3";
my $abc = "1 2 3";
my $def = "$a $b $c";
if ($abc !~ /^$def$/) {
print "something is wrong\n";
}
"something is wrong" does not get printed out.
$foo !~ /^$bar$/ is not equivalent to checking for string inequality, as in $foo ne $bar. It checks if $foo does not match the regex ^$bar$, which differs especially if $bar contains regex metacharacters, but also in that $ can match before (allow) a trailing newline and not only at the end of the string. (\z is the equivalent of $ that does not allow for a trailing newline.)
$athCombined in your program is /ESM19 1/50 JUN 19 (Monday) (Wk1) /E1AM19 2745 CALL, which contains the regex metacharacters ( and ) (in this case they form capture groups and thus the literal parentheses in your other string are not matched by anything). You can solve this by using quotemeta modifiers:
$cbSymbol !~ /^\Q$athCombined\E\z/
but better would just be to use the string equality operators eq or ne if that's what you're trying to test.

Matching all characters in a string except one in any position

How to match (preferably in perl) all strings that match the query string except one character?
Query: TLAQLLLDK
Want to match: xLAQLLLDK, TxAQLLLDK, TLxQLLLDK, etc.
Where 'x' is any capital letter '[A-Z]'.
Use alternation operator.
^(?:[A-Z]LAQLLLDK|T[A-Z]AQLLLDK|TL[A-Z]QLLLDK|.....)$
Likewise fill all..
You can do that by writing a terrible regular expression, which will be horribly slow to build and even slower to execute, or you can just don't use regexes for things like these and write a function that just compares both strings character after character, allows for one "mistake" and returns True only if there was exactly one mistake.
How to match (preferably in perl) all strings that match the query string except one character?
Expanding the answer of #Avinash, by generating the required regular expression dynamically at run time:
my $query = 'TLAQLLLDK';
my $re_proto = '(' . join( '|', map { (my$x=$query)=~s/^(.{$_})./$1\[A-Za-z]/; $x; } (0 .. length($query)-1) ) . ')';
my $re = qr/^$re_proto$/;
my #input = qw(xLAQLLLDK TxAQLLLDK TLxQLLLDK);
my #matches = grep { /$re/ } #input;
print "#matches\n";
(I had to include the [a-z] too, since your example input uses the x as the marker.)
If you need to do that very often, I would advise to cache the generated regular expressions.
Is this what you are looking for?
#!/usr/bin/perl
use strict;
my #str = ("ULAQLLLDK","TAAQLLLDK","TLCQLLLDK","TLAQLLLDK");
while(<#str>){
if (/[A-S,U-Z]LAQLLLDK|T[A-K,M-Z]AQLLLDK|TL[B-Z]QLLLDK/ ){
print "$_\n";
}
}
output:
ULAQLLLDK
TAAQLLLDK
TLCQLLLDK
There are only 9 x 25 = 225 such strings, so you may as well generate them all and put them in a hash for comparison
use strict;
use warnings;
use 5.010;
my %matches;
my $s = 'TLAQLLLDK';
for my $i (0 .. length($s) - 1) {
my $c = substr $s, $i, 1;
for my $cc ('A' .. 'Z') {
substr(my $ss = $s, $i, 1) = $cc;
++$matches{$ss} unless $cc eq $c;
}
}
printf "%d matches found\n", scalar keys %matches;
for ( qw/ TLAQLLLDK TLAQLXLDK / ) {
printf "\$matches{%s} = %s\n", $_, $matches{$_} // 'undef';
}
output
225 matches found
$matches{TLAQLLLDK} = undef
$matches{TLAQLXLDK} = 1

Changing time format using regex in perl

I want to read 12h format time from file and replace it with 24 hour
example
this is due at 3:15am -> this is due 15:15
I tried saving variables in regex and manupilate it later but didnt work, I also tried using substitution "/s" but because it is variable I couldnt figure it out
Here is my code:
while (<>) {
my $line = $_;
print ("this is text before: $line \n");
if ($line =~ m/\d:\d{2}pm/g){
print "It is PM! \n";}
elsif ($line =~ m/(\d):(\d\d)am/g){
print "this is try: $line \n";
print "Its AM! \n";}
$line =~ s/($regexp)/<French>$lexicon{$1}<\/French>/g;
print "sample after : $line\n";
}
A simple script can do the work for you
$str="this is due at 3:15pm";
$str=~m/\D+(\d+):\d+(.*)$/;
$hour=($2 eq "am")? ( ($1 == 12 )? 0 : $1 ) : ($1 == 12 ) ? $1 :$1+12;
$min=$2;
$str=~s/at.*/$hour:$min/g;
print "$str\n";
Gives output as
this is due 15:15
What it does??
$str=~m/\D+(\d+):(\d+)(.*)$/; Tries to match the string with the regex
\D+ matches anything other than digits. Here it matches this is due at
(\d+) matches any number of digits. Here it matches 3. Captured in group 1 , $1 which is the hours
: matches :
(\d+) matches any number of digits. Here it matches 15, which is the minutes
(.*) matches anything follwed, here am . Captures in group 2, `$2
$ anchors the regex at end of
$hour=($2 eq "am")? ( ($1 == 12 )? 0 : $1 ) : ($1 == 12 ) ? $1 :$1+12; Converts to 24 hour clock. If $2 is pm adds 12 unless it is 12. Also if the time is am and 12 then the hour is 0
$str=~s/at.*/$hour:$min/g; substitutes anything from at to end of string with $hour:$min, which is the time obtained from the ternary operation performed before
#!/usr/bin/env perl
use strict;
use warnings;
my $values = time_12h_to_24h("11:00 PM");
sub time_12h_to_24h
{
my($t12) = #_;
my($hh,$mm,$ampm) = $t12 =~ m/^(\d\d?):(\d\d?)\s*([AP]M?)/i;
$hh = ($hh % 12) + (($ampm =~ m/AM?/i) ? 0 : 12);
return sprintf("%.2d:%.2d", $hh, $mm);
}
I found this code in the bleow link. Please check:
Is my pseudo code for changing time format correct?
Try this it give what you expect
my #data = <DATA>;
foreach my $sm(#data){
if($sm =~/(12)\.\d+(pm)/g){
print "$&\n";
}
elsif($sm =~m/(\d+(\.)?\d+)(pm)/g )
{
print $1+12,"\n";
}
}
__DATA__
Time 3.15am
Time 3.16pm
Time 5.17pm
Time 1.11am
Time 1.01pm
Time 12.11pm

Regular expressions, matching operator using a string variable in Perl

I am using a regex but am getting some odd, unexpected "matches". "Names" are sent to a subroutine to be compared to an array called #ASlist, which contains multiple rows. The first element of each row is also a name, followed by 0 to several synonyms. The goal is to match the incoming "name" to any row in #ASlist that has a matching cell.
Sample input, from which $names is derived for the comparison against #ASlist:
13 1 13 chr7 7 70606019 74345818 Otud7a Klf13 E030018B13Rik Trpm1 Mir211 Mtmr10 Fan1 Mphosph10 Mcee Apba2 Fam189a1 Ndnl2 Tjp1 Tarsl2 Tm2d3 1810008I18Rik Pcsk6 Snrpa1 H47 Chsy1 Lrrk1 Aldh1a3 Asb7 Lins Lass3 Adamts17
Sample lines from #ASlist:
HSPA5 BIP FLJ26106 GRP78 MIF2
NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13
ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK
The code:
my ($name) = #_; ## this comes in from another loop elsewhere in code I did not include
chomp $name;
my #collectmatches = (); ## container to collect matches
foreach my $ASline ( #ASlist ){
my #synonyms = split("\t", $ASline );
for ( my $i = 0; $i < scalar #synonyms; $i++ ){
chomp $synonyms[ $i ];
#print "COMPARE $name TO $synonyms[ $i ]\n";
if ( $name =~m/$synonyms[$i]/ ){
print "\tname $name from block matches\n\t$synonyms[0]\n\tvia $synonyms[$i] from AS list\n";
push ( #collectmatches, $synonyms[0], $synonyms[$i] );
}
else {
# print "$name does not match $synonyms[$i]\n";
}
}
}
The script is working but also reports weird matches. Such as, when $name is "E030018B13Rik" it matches "NDUFA5" when it occurs in #ASlist. These two should not be matched up.
If I change the regex from ~m/$synonyms[$i]/ to ~m/^$synonyms[$i]$/, the "weird" matches go away, BUT the script misses the vast majority of matches.
The NDUFA5 record contains B13 as a pattern, which will match E030018<B13>Rik.
If you want to be more literal, then add boundary conditions to your regular expression /\b...\b/. Also should probably escape regular expression special characters using quotemeta.
if ( $name =~ m/\b\Q$synonyms[$i]\E\b/ ) {
Or if you want to test straight equality, then just use eq
if ( $name eq $synonyms[$i] ) {
Another, more Perlish way to test for string equality is to use a hash.
You don't show any real test data, but this short Perl program builds a hash from your array #ASlist of lines of match strings. After that, most of the work is done.
The subsequent for loop tests just E030018B13Rik to see if it is one of the keys of the new %ASlist and prints an appropriate message
use strict;
use warnings;
my #ASlist = (
'HSPA5 BIP FLJ26106 GRP78 MIF2',
'NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13',
'ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK',
);
my %ASlist = map { $_ => 1 } map /\S+/g, #ASlist;
for (qw/ E030018B13Rik /) {
printf "%s %s\n", $_, $ASlist{$_} ? 'matches' : 'doesn\'t match';
}
output
E030018B13Rik doesn't match
Since you only need to compare two strings, you can simply use eq:
if ( $name eq $synonyms[$i] ){
You are using B13 as the regular expression. As none of the characters has a special meaning, any string containing the substring B13 matches the expression.
E030018B13Rik
^^^
If you want the expression to match the whole string, use anchors:
if ($name =~m/^$synonyms[$i]$/) {
Or, use index or eq to detect substrings (or identical strings, respectively), as your input doesn't seem to use any features of regular expressions.

Perl pattern matching with pattern arithmetic

I want to understand how can I do arithmetic on matched sub-patterns in perl regex.
This is just a sample code and I want to understand how can I use \1 (already matched sub-pattern. in this case - 7) to match pattern+1 (8)
my $y = 77668;
if($y =~ /(\d)\1(\d)\2\1+1/) #How to increment a previously
#matched sub-pattern and form a pattern?
{
print $y;
}
EDIT
From the answers, I see that pattern arithmetic is not possible.
This is what I want to achieve.
I want to form a regex which will match this pattern:
N-3N-2N-1NNN+1N+2N+3 (N = 3,4,5,6
Its possible via regex code blocks:
my $y = 77668;
if($y =~ /(\d)\1(\d)\2(??{$1+1})/ ) {
print $y;
}
In this snippet (??{ CODE }) returns another regex that must match, so this regex looks like "8" ($1+1). As a result, whole regex will match only if 5th digit is greather and 1st by 1. But drawback with 1st digit is 9, this code block will return "10", so possible its wrong behavior, but you said nothing about what must be done in this case.
Now about N-3N-2N-1NNN+1N+2N+3 question, you can match it with this regex:
my $n = 5;
if( $y =~ /(??{ ($n-3).($n-2).($n-1).$n.($n+1).($n+2).($n+3) })/ ){
Or more "scalable" way:
my $n = 5;
if( $y =~ /(??{ $s=''; $s .= $n+$_ foreach(-3..3); $s; })/ ){
Again, what we must do if $n == 2 ?? $n-3 will be -1. Its not a simply digit cus it have sign, so you should think about this cases.
One another way. Match what we have and then check it.
if( $y =~ /(\d)(\d)(\d)(\d)(\d)(\d)(\d)/ ) {
if( $1 == ($4-3) && $2 == ($4-2) && $3 == ($4-1) && $6 == ($4+1) && $7 == ($4+2) && $7 == ($4+3) ) {
#...
Seems this method litle bit clumsy, but its obivious to everyone (i hope).
Also, you can optimize your regex since 7 ascending digits streak is not so frequent combination, plus get some lulz from co-workers xD:
sub check_number {
my $i;
for($i=1; $i<length($^N); $i++) {
last if substr($^N, $i, 1)<=substr($^N, $i-1, 1);
}
return $i<length($^N) ? "(*FAIL)" : "(*ACCEPT)";
}
if( $y =~ /[0123][1234][2345][3456][4567][5678][6789](??{ check_number() })/ ) {
Or... Maybe most human-friendly method:
if( $y =~ /0123456|1234567|2345678|3456789/ ) {
Seems last variant is bingo xD Its good example about not searching regex when things are so simple)
Of course this is possible. We are talking about Perl regexes after all. But it will be rather ugly:
say "55336"=~m{(\d)\1(\d)\2(\d)(?(?{$1+1==$3})|(*F))}?"match":"fail";
or pretty-printed:
say "55336" =~ m{ (\d)\1 (\d)\2 (\d)
(? (?{$1+1==$3}) # true-branch: nothing
|(*FAIL)
)
}x
? "match" : "fail";
What does this do? We collect the digits in ordinary captures. At the end, we use an if-else pattern:
(? (CONDITION) TRUE | FALSE )
We can embed code into a regex with (?{ code }). The return value of this code can be used as a condition. The (*FAIL) (short: (*F)) verb causes the match to fail. Use (*PRUNE) if you only want a branch, not the whole pattern to fail.
Embedded code is also great for debugging. However, older perls cannot use regexes inside this regex code :-(
So we can match lots of stuff and test it for validity inside the pattern itself. However, it might be a better idea to do that outside of the pattern like:
"string" =~ /regex/ and (conditions)
Now to your main pattern N-3N-2N-1NNN+1N+2N+3 (I hope I parsed it correctly):
my $super_regex = qr{
# N -3 N-2 N-1 N N N+1 N+2 N+3
(\d)-3\1-2\1-1\1\1(\d)(\d)(\d)
(?(?{$1==$2-1 and $1==$3-2 and $1==$4-3})|(*F))
}x;
say "4-34-24-144567" =~ $super_regex ? "match" : "fail";
Or did you mean
my $super_regex = qr{
#N-3 N-2 N-1 N N N+1 N+2 N+3
(\d)(\d)(\d) (\d)\4 (\d)(\d)(\d)
(? (?{$1==$4-3 and $2==$4-2 and $3==$4-1 and
$5==$4+1 and $6==$4+2 and $7==$4+3})|(*F))
}x;
say "123445678" =~ $super_regex ? "match" : "fail";
The scary thing is that these even works (with perl 5.12).
We could also generate parts of the pattern at match-time with the (??{ code }) construct — the return value of this code is used as a pattern:
my $super_regex = qr{(\d)(??{$1+1})(??{$1+2})}x;
say "234"=~$super_regex ? "match":"fail"
et cetera. However, I think readability suffers more this way.
If you need more than nine captures, you can use named captures with the
(?<named>pattern) ... \k<named>
constructs. The contents are also available in the %+ hash, see perlvar for that.
To dive further into the secrets of Perl regexes, I recommend reading perlre a few times.