Perl - how can I match strings that are not exactly the same? - regex

I have a list of strings I want to find within a file. This would be fairly simple to accomplish if the strings in my list and in the file matched exactly. Unfortunately, there are typos and variations on the name. Here's an example of how some of these strings differ
List File
B-Arrestin Beta-Arrestin
Becn-1 BECN 1
CRM-E4 CRME4
Note that each of those pairs should count as a match despite being different strings.
I know that I could categorize every kind of variation and write separate REGEX to identify matches but that is cumbersome enough that I might be better off manually looking for matches. I think the best solution for my problem would be some kind of expression that says:
"Match this string exactly but still count it as a match if there are X characters that do not match"
Does something like this exist? Is there another way to match strings that are not exactly the same but close?

As 200_success pointed out, you can do fuzzy matching with Text::Fuzzy, which computes the Levenshtein distance between bits of text. You will have to play with what maximum Levenshtein distance you want to allow, but if you do a case-insensitive comparison, the maximum distance in your sample data is three:
use strict;
use warnings;
use 5.010;
use Text::Fuzzy;
my $max_dist = 3;
while (<DATA>) {
chomp;
my ($string1, $string2) = split ' ', $_, 2;
my $tf = Text::Fuzzy->new(lc $string1);
say "'$string1' matches '$string2'" if $tf->distance(lc $string2) <= $max_dist;
}
__DATA__
B-Arrestin Beta-Arrestin
Becn-1 BECN 1
CRM-E4 CRME4
Output:
'B-Arrestin' matches 'Beta-Arrestin'
'Becn-1' matches 'BECN 1'
'CRM-E4' matches 'CRME4'

There are CPAN modules for that:
String::Approx
Text::Fuzzy

Related

Perl Regex Find and Return Every Possible Match

Im trying to create a while loop that will find every possible sub-string within a string. But so far all I can match is the largest instance or the shortest. So for example I have the string
EDIT CHANGE STRING FOR DEMO PURPOSES
"A.....B.....B......B......B......B"
And I want to find every possible sequence of "A.......B"
This code will give me the shortest possible return and exit the while loop
while($string =~ m/(A(.*?)B)/gi) {
print "found\n";
my $substr = $1;
print $substr."\n";
}
And this will give me the longest and exit the while loop.
$string =~ m/(A(.*)B)/gi
But I want it to loop through the string returning every possible match. Does anyone know if Perl allows for this?
EDIT ADDED DESIRED OUTPUT BELOW
found
A.....B
found
A.....B.....B
found
A.....B.....B......B
found
A.....B.....B......B......B
found
A.....B.....B......B......B......B
There are various ways to parse the string so to scoop up what you want.
For example, use regex to step through all A...A substrings and process each capture
use warnings;
use strict;
use feature 'say';
my $s = "A.....B.....B......B......B......B";
while ($s =~ m/(A.*)(?=A|$)/gi) {
my #seqs = split /(B)/, $1;
for my $i (0..$#seqs) {
say #seqs[0..$i] if $i % 2 != 0;
}
}
The (?=A|$) is a lookahead, so .* matches everything up to an A (or the end of string) but that A is not consumed and so is there for the next match. The split uses () in the separator pattern so that the separator, too, is returned (so we have all those B's). It only prints for an even number of elements, so only substrings ending with the separator (B here).
The above prints
A.....B
A.....B.....B
A.....B.....B......B
A.....B.....B......B......B
A.....B.....B......B......B......B
There may be bioinformatics modules that do this but I am not familiar with them.

Regex — Character occuring more than x number of times in a string (in different positions)

I need regex to find when "#" has more than 10 occurrences. The string will contain email addresses, comma separated, so #s will never be next to each other.
What I have is #{10,} only works if there are 10 #s in a row. I need for it to work when there is 10+ #s in the string. TIA!
To answer your question, you can count the number of matches in a regex using the "saturn" operator.
use warnings;
use strict;
my $str = 'a#b.com, m#b.ca, x#c###d#xxxxx###, #';
my $count =()= $str =~ /#/g;
if ($count > 10){
...
}
Here's an example without the intermediary $count variable:
if ((() = $str =~ /#/g) > 10){
That said, I'm with the others who commented in that regex may not be the best approach here.
The following pattern matches strings with more than 10 #:
#.*#.*#.*#.*#.*#.*#.*#.*#.*#.*#
After being optimized (. ⇒ [^#]), having redundancies removed, and being embedded into a match operator, it becomes the following:
/\#(?:[^\#]*\#){10}/
At one point, you said you wanted to find "10+" instead of "more than 10". For that, you'd use.
/\#(?:[^\#]*\#){9}/

How to do conditional ("if exist" logic) search & replace in Perl?

in my Perl script I want to do conditional search & replace using regular expression: Find a certain pattern, and if the pattern exists in a hash, then replace it with something else.
For example, I want to search for a combination of "pattern1" and "pattern2", and if the latter exists in a hash, then replace the combination with "pattern1" and "replacement". I tried the following, but it just doesn't do anything at all.
$_ =~ s/(pattern1)(pattern2)/$1replacement/gs if exists $my_hash{$2};
I also tried stuff like:
$_ =~ s/(pattern1)(pattern2) && exists $my_hash{$2}/$1replacement/gs;
Also does nothing at all, as if no match is found.
Can anyone help me with this regex problem? Thx~
I would do it a different way. It looks like you have a 'search this, replace that' hash.
So:
#!/usr/bin/env perl
use strict;
use warnings;
#our 'mappings'.
#note - there can be gotchas here with substrings
#so make sure you anchor patterns or sort, so
#you get the right 'substring' match occuring.
my %replace = (
"this phrase" => "that thing",
"cabbage" => "carrot"
);
#stick the keys together into an alternation regex.
#quotemeta means regex special characters will be escaped.
#you can remove that, if you want to use regex in your replace keys.
my $search = join( "|", map {quotemeta} keys %replace );
#compile it - note \b is a zero width 'word break'
#so it will only match whole words, not substrings.
$search = qr/\b($search)\b/;
#iterate the special DATA filehandle - for illustration and a runnable example.
#you probably want <> instead for 'real world' use.
while (<DATA>) {
#apply regex match and replace
s/(XX) ($search)/$1 $replace{$2}/g;
#print current line.
print;
}
##inlined data filehandle for testing.
__DATA__
XX this phrase cabbage
XX cabbage carrot cabbage this phrase XX this phrase
XX no words here
and this shouldn't cabbage match this phrase at all
By doing this, we turn your hash keys into a regex (you can print it - it looks like: (?^:\b(cabbage|this\ phrase)\b)
Which is inserted into the substitution pattern. This will only match if the key is present, so you can safely do the substitution operation.
Note - I've added quotemeta because then it escapes any special characters in the keys. And the \b is a "word boundary" match so it doesn't do substrings within words. (Obviously, if you do want that, then get rid of them)
The above gives output of:
XX that thing cabbage
XX carrot carrot cabbage this phrase XX that thing
XX no words here
and this shouldn't cabbage match this phrase at all
If you wanted to omit lines that didn't pattern match, you can stick && print; after the regex.
What is wrong (as in not working) with
if (exists($h{$patt1)) { $text =~ s/$patt1$patt2/$patt1$1replacement/g; }
If $patt1 exists as a key in a hash then you go ahead and replace $patt1$patt2 with $patt1$replacement. Of course, if $patt1$patt2 is found in $text, otherwise nothing happens. Your first code snippet is circular, while the second one can't work like that at all.
If you want $patt1$patt2 first, and hash key as well then it seems that you'd have to go slow
if ($str =~ /$patt11$patt2/ && exists $h{$patt2}) {
$str =~ s/$patt1$patt2/$patt1$replacement/gs;
}
If this is what you want then it is really simple: you need two unrelated conditions, whichever way you turn it around. Can't combine them since it would be circular.
From the point of view of the outcome these are the same. If either condition fails nothing happens, regardless of the order in which you check them.
NOTE Or maybe you don't have to go slow, see Sobrique's post.

Find the number of occurences of a pattern based on a condition on success match index

I have a long string, containing alphabetic words and each delimited by one single character ";" .
In perl, how to count the number of occurences of a pattern (delimited by ";") if index of a success match is dividable by 5.
Example:
$String = "the;fox;jumped;over;the;dog"
$Pattern = "the"
OUTPUT: 1
In above case, the $Pattern "the" exists as the first and 5th words in the $String and both matches are delimited by ";" . However; the output result would be 1, since only the index of second match (5) is dividable by 5.
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.
Thanks for any help.
Non-Regex Solution
The simplest solution to this problem is likely to break it up into steps and not use a regex at all.
The following splits the string based off the delimiter and then filters based off every 5th element:
my $string = "the;two;three;four;the;six;seven;eight;nine;ten;eleven;twelve;the;fourteen;the;sixteen";
my $pattern = 'the';
my $i = 0;
my $count = grep {(++$i % 5) == 0 && $_ eq $pattern} split /;/, $string;
print $count, "\n";
Outputs:
2
Partial Regex Solution
It's also possible to create a regex that will pull every 5th word from the string.
The below takes advantage of the fact that a capture group in a repeated pattern will only match the final repetition.
my $count = grep {$_ eq $pattern} $string =~ /(?:([^;]+)\b;?){5}/g;
Full Regex Solution
You asked for a single regex solution though.
The following accomplishes that, but is easily the most complicated and therefore the most fragile solution:
my $count = () = $string =~ /\G(?:(?:[^;]+\b;?){5})+?(?<=(?|;($pattern)|\b($pattern);))/g;
Of the three, I would definitely recommend the first method. It's the easiest to understand at a glance and therefore the easiest to maintain and adapt in the future.
You can do this without a regular expression by just iterating over the words.
my #words = qw( the fox jumped over the dog );
# A long string
my $string = join ';', map { $words[ rand #words ] } ( 1..10000 );
my $pattern = 'the';
my $position = 5;
my $count = 0;
my $matches = 0;
foreach my $token ( split /;/, $string ) {
next if (++$count % $position);
$matches++ if $token eq $pattern;
}
print "There are $matches instances of '$pattern' showing up with an index mod 5 = 0.\n";
#Miller, I was not able to add comment under your answer; so I am adding here.
Sure. Let me elaborate on the context a bit then. First of all, my tendency to use regex is more about convenience rather than interest; as in my home project I am trying to focus and spend my time more on the actual process, not coding.
Please note that my question(s) are all about real problems not just some fancy ones. Being exposed to programming, but not that much experienced in regex in perl, once I raise the questions, I am trying to see how the regex/perl experts would solve them, then I would grasp the idea, and tweak the answer for other problems down the road.
The flat files I am processing are 10s of GB and I need to do thousands of iteration on every run. I have multi-threaded the code to use all my pc's 8 cores. On every run, I need to use a new regex towards the solution to the problems. My needed regexes are mostly around counting, look-arounds, and conditions on match found positions/indexes. Conditions could be as complicated as a degree n polynomial equation on the match found positions/indexes. The ever-changing part of the code would be just the Condition criteria which I would eventually rather treat it as a variable parameter provided as an ARG to the .PL code file.
Now more about the Search and Condition criteria:
Let :
"P" be the match found position; and,
"I" be the match found index.
As an example, the following would be one of the Search and Condition criteria:
Search criteria: the(?=six)
Condition criteria: 3×P^4 - 4×I^3 - 1 is dividable by 7 .
This reads as: Count the number of "the(?=six)" provided that 3×P^4 - 4×I^3 - 1 is dividable by 7
What I need to know is how powerful regex in perl is to solve these types of problems; and if it is not indeed, then I will inevitably go another route.
I hope I have provided a better sense of what I refer as "convenience" and "my tendency to use regex".

Getting equal number of digits on both sides of a character in a string

I have a string
$test = 'xyz45sd2-32d34-sd23-456562.abc.com'
The objective is to obtain $1 = 23 and $2 = 45 i.e equal number of digits on both sides of the last -. Note that the number of digits is variable, and is not necessarily 2.
I have tried the following:
$test1 =~ s/.*(\d+)-(\d+).*//;
But
$1 contains 3
$2 contains 456562
You can try this regex
if($test1 =~ m/(\S+)-(\S+)-([a-z]*)(\d+)-(\d\d)(\d+).*/)
{
print $4,"|",$5;
}
I assume that u need only the first 2 didgits from 456562
perl -e '"xyz45sd2-32d34-sd23-456562.abc.com" =~ /(\d{2})-(\d{2})\d*(?=\.)/; print "$1\n$2\n"'
This other entry confirms that regex does not count:
How to match word where count of characters same
Building upon GreatBigBore's idea, if there's an upper bound to the count, then you could try the or operator |. This only matches your requirement to find a match; depending on the matched count the match will be in different bins. Only one case correctly places them in $1 and $2.
(\d{3})-(\d{3})|(\d{2})-(\d{2})|(\d{1})-(\d{1})
However if you concatenate the result captures as $1$3$5 and $2$4$6, you will effectively get the 2 stings you were looking for.
Another idea is to operate iteratively, you could repeat your search on the string by increasing the number until the match fails. (\d{1})-(\d{1}) , (\d{2})-(\d{2}) ...
A binary search comes to mind making it an O{ln(N)}, N being the upper limit for the capture length.
Theoretical answer
Short answer:
What you're looking for is not possible using regular expressions.
Long Answer:
Regular expressions (as their name suggests) are a compact representation of Regular languages (Type-3 grammars in the Chomsky Heirarchy).
What you're looking for is not possible using regular expressions as you're trying to write out an expression that maintains some kind of count (some contextual information other than beginning and end). This kind of behavior cannot be modelled as a DFA(actually any Finite Automaton). The informal proof of whether a language is regular is that there exists a DFA that accepts that language. As this kind of contextual information cannot be modeled in a DFA, thus by contradiction, you cannot write a regular expression for your problem.
Practical Solution
my ($lhs,$rhs) = $test =~ /^[^-]+-[^-]+-([^-]+)-([^-.]+)\S+/;
# Alernatively and faster
my (undef,undef,$lhs,$rhs) = split /-/, $test;
# Rest is common, no matter how $lhs and $rhs is extracted.
my #left = reverse split //, $lhs;
my #right = split //, $rhs;
my $i;
for($i=0; exists($left[$i]) and exists($right[$i]) and $left[$i] =~ /\d/ and $right[$i] =~ /\d/ ; ++$i){}
--$i;
$lhs= join "", reverse #left[0..$i];
$rhs= join "", #right[0..$i];
print $lhs, "\t", $rhs, "\n";
Edit: It's possible to improve the my solution by using regular expressions to extract the required numeric portions of $lhs and $rhs instead of split, reverse and for.
as #Samveen said it's technically not possible to do in pure regex
And Like #Samveen solution here's another version
#get left and right
my (undef,undef,$left,$right) = split /-/, $test;
#get left numbers
$left =~ s/.*?(\d+)$/$1/;
##get right numbers
$right =~ s/^(\d+).*/$1/;
##get length of both
my $right_length = length $right;
my $left_length = length $left;
if ($right_length > $left_length){
#make right length as same as left length
$right =~ s/(\d{$left_length}).*/$1/;
} else {
#make left length as same as right length
$left =~ s/.*(\d{$right_length})/$1/;
}
print $left, "\t", $right, "\n";