Capture the match content of two different regexps in Perl

I am using a while loop with two separate regular expressions:
while(($string1=~m/(\d+)/igs)==($string2=~m/([^^]*?)\n+/igs)) {}
To store the text matched in $string1 I have used $temp1=$1.
How can I store the text matched in $string2? Please give some suggestions.

my ($m1,$m2);
while (do{
    ($m1,$m2) = ();
    $m1 = $1 if $string1 =~ /(\d+)/igs;
    $m2 = $1 if $string2 =~ /([^^]*?)\n+/igs;
    defined $m1 == defined $m2;
}) {
    # print "$m1-$m2-\n";
}

There might be more clever ways, but I'd just break them up into separate statements:
while (1) {
    $res1 = $string1=~m/(\d+)/igs;
    $temp1 = $1;
    $res2 = $string2=~m/([^^]*?)\n+/igs;
    $temp2 = $1;
    last unless $res1 == $res2;
    ...
}
Just because it's Perl you don't have to find the most terse, cryptic way to write something (that's what APL is for).

If the "g" and "s" options aren't really necessary to your task and you actually only want to compare the first matching substrings, you can make a one-line test as follows (each regex needs a capturing group so that the list-context match returns the captured text; use eq instead of == if the captures are not numbers):
if (($a =~ /regex1/)[0] == ($b =~ /regex2/)[0]) {
...
And if you need to know what the two matched strings were, just add some temporary variables to hold them:
if (($first = ($a =~ /regex1/)[0]) == ($second = ($b =~ /regex2/)[0])) {
...
But if you really want to compare all of the successive matches in each string to see if each pair are equal, there's no single-statement solution I can think of that will do it. In your loop condition, "==" puts each match in scalar context, so it only compares whether each match succeeded, not what was matched. You've got to use the first solution proposed above and write out the comparison code in "long-hand".
The second solution above won't work since it will keep testing only the first match in each string.
It's a bit hard to understand what you're trying to do, but you could at least drop the "i" option on the first test for /(\d+)/. Presumably the "s" option is only needed for the second string since you're looking for embedded newlines.
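The long-hand version could be packaged roughly as below. This is only a sketch: the helper name paired_matches and the sample strings are made up for illustration, and the second pattern uses [^\n] instead of the question's [^^]. The loop stops as soon as either string runs out of matches. A condition that only checks that both matches agree is also true when both fail, and because a failed /g match resets the search position, such a loop would then start over from the beginning.

```perl
use strict;
use warnings;

# Hypothetical helper: walk two strings in lockstep and return the
# pairs of captured substrings, one pair per round.
sub paired_matches {
    my ($s1, $re1, $s2, $re2) = @_;
    my @pairs;
    while (1) {
        # In scalar context, /g resumes where the previous match ended.
        my $m1 = $s1 =~ /$re1/g ? $1 : undef;
        my $m2 = $s2 =~ /$re2/g ? $1 : undef;
        # Stop when either side has no further match.
        last unless defined $m1 && defined $m2;
        push @pairs, [ $m1, $m2 ];
    }
    return @pairs;
}

my @pairs = paired_matches(
    "a1 b22 c333", qr/(\d+)/,
    "x\ny\n",      qr/([^\n]*)\n+/,
);
print "$_->[0] <-> $_->[1]\n" for @pairs;
```

Each capture is copied out of $1 right after its own match, so the second match cannot overwrite the first one's result.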

Related

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
    print "matches with $_ CACTT's\n";
    my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
    my @grep_matches = grep(/$pattern/, @matching);
    print "$_\n" for @grep_matches;
    my @copy = @grep_matches;
    my $squashed = @copy;
    print "number of rows total: $squashed\n";
}
for (2...6) {
    MATCHER($_);
}
Notes:
@matching contains the three strings listed above in an array.
The for loop runs over the integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works, except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any n>=2, lines with n+1 occurrences are also returned by the match. I thought the negative lookahead could prevent this behavior in Perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1);}
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
..tries to match first beginning of string ^, then a minimal match of any character .*?, followed by your sequence CACTT, followed by a minimal match (but slightly different from .*?) .+?. And you want to match these $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a look-ahead assertion that wants to make sure that there is no minimal match of any char .*? followed by your sequence, followed by any char of any length followed by end of line $.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
Is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
    my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
    chomp;
    my $count = () = /cactt/gi;   # count number of matches
    push @{ $data{$count} }, $_;  # store count and original
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
          '2' => [
                   'asfdcacttaskdfjcacttklasdjf'
                 ],
          '3' => [
                   'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
                 ],
          '1' => [
                   'cksjdfcacttlkasdjf'
                 ]
        };
This is just to demonstrate how to create the data structure. You can now access the strings in the order of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
    print "$_\n";
}
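If only the lines with exactly n occurrences are wanted, the counting idiom can be dropped straight into a grep. The following is a minimal sketch; count_matches and lines_with_exactly are hypothetical helper names, not something from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: number of non-overlapping matches of $re in $str.
sub count_matches {
    my ($str, $re) = @_;
    # Assigning the list of matches to () in scalar context yields its size.
    my $count = () = $str =~ /$re/g;
    return $count;
}

# Hypothetical helper: keep only the lines with exactly $n matches.
sub lines_with_exactly {
    my ($n, $re, @lines) = @_;
    return grep { count_matches($_, $re) == $n } @lines;
}

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);
print "$_\n" for lines_with_exactly(2, qr/cactt/i, @lines);
```

Passing the count, the pattern, and the lines as arguments keeps the sub independent of $_ and of any global array.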
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
    my $m_cnt = () = /cactt/g;
    if ($m_cnt>=$n){
        print;
        $match_line_cnt++;
    }
    $line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment and it throws away whatever is assigned to it just like scalar undef = throws away its right hand side. But, the null list assignment still returns the number of things thrown away when its left hand side is in scalar context. It returns an empty list in list context.
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
    my ($string, $pattern, $n) = @_;
    my $c = 0;
    while ($string =~ m/$pattern/g) {
        $c++;
        return if $c > $n;
    }
    return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
https://perldoc.perl.org/perlre
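To make the on-the-fly regex above concrete, here is one way it might be wrapped up and applied to the three strings from the question. Treat it as a sketch under assumptions: exactly_n_re is a hypothetical helper name, and the pattern is assumed to be passed in as a qr// object.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches only strings containing
# $pattern exactly $n times (non-overlapping, counted left to right).
sub exactly_n_re {
    my ($pattern, $n) = @_;
    # A run of characters at none of which $pattern starts.
    my $gap = qr/(?:(?!$pattern).)*/;
    # n times "gap then pattern", then only gap until the end.
    return qr/^(?:$gap$pattern){$n}$gap$/;
}

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

my $re = exactly_n_re(qr/cactt/, 2);   # build once, reuse for every line
print "$_\n" for grep { $_ =~ $re } @lines;
```

Because the gap can never step over the start of another occurrence, a line with a third cactt cannot reach the end anchor, which is the property the original lookahead attempt was missing.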

Matching consecutive characters from a string using regex

I'm not sure how to title this question, so moving along...
I'd like to be able to match a portion of a string that is a subset of a larger string. For example:
MatchPartOfThisString -> Reference string
fThisDiff -> string I'd like to be able to say matches 5 consecutive characters in
I suppose I could loop through the first string, taking the minimum number of consecutive matches from the reference string, and see if the other string matches each of the matches I get from systematically trying to match:
if(fThisDiff =~ /Match/) {
do something...;
}
if(fThisDiff =~ /atchP/) {
do something...;
}
if(fThisDiff =~ /tchPa/) {
do something...;
}
etc.
I'd like to do this more elegantly though, if there is a way to interpret portions of the reference string repeatedly with a singular regex. I do not think this is the case, but I'd like confirmation regardless.
Here is a basic take on it, by hand with builtin tools.
Build a regex pattern with alternation of substrings of desired length from your reference string.
use warnings;
use strict;
use feature 'say';
sub get_alt_re {
    my ($str, $len) = @_;
    $len //= 1; #/
    my @substrings;
    foreach my $beg (0 .. length($str)-$len) {
        push @substrings, substr($str, $beg, $len);
    }
    return '(' . join('|', map quotemeta, @substrings) . ')';
}
my $ref = q(MatchPartOfThisString);
my $target = q(fThisDiff);
my $re = get_alt_re($ref, 5);
my @m = $target =~ /$re/g;
say for @m;
Prints the line fThis.
The code should be made more robust and general. Then, it is fairly easily modified to match for a range of lengths (not only one, 5 above). Further, it can use libraries for subtasks (those repeated calls to substr beg for C code). But this demonstrates that a basic solution can be rather simple.
There's no simple way to do this with regex features, but a couple CPAN modules can help construct such a regex in this case.
use strict;
use warnings;
use String::Substrings 'substrings';
use Data::Munge 'list2re';
my $match_string = 'MatchPartOfThisString';
my $re = list2re substrings $match_string, 5;
my $subject = 'fThisDiff';
if ($subject =~ m/($re)/) {
    print "Matched $1 from $match_string in $subject\n";
}
The best approach would be to use the longest common substring algorithm (not to be confused with the similarly-named longest common subsequence algorithm) then check its length.
use feature 'say';
use String::LCSS_XS qw( lcss );
my $longest = lcss("MatchPartOfThisString", "fThisDiff");
say length($longest);
If you have really long strings and you want to squeeze out every millisecond, a tailored version of the algorithm that quits as soon as the target length is found and that avoids building the string would be faster.
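For illustration, the early-exit idea can be approximated without any module. Note that this is a plain brute-force scan using substr and index, not the longest common substring algorithm itself, and shares_substring is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $len-character piece of $ref that
# also occurs in $target, or nothing if there is none.
sub shares_substring {
    my ($ref, $target, $len) = @_;
    for my $start (0 .. length($ref) - $len) {
        my $piece = substr($ref, $start, $len);
        # index() does a literal search, so no regex escaping is needed.
        return $piece if index($target, $piece) >= 0;
    }
    return;   # no common piece of that length
}

my $hit = shares_substring('MatchPartOfThisString', 'fThisDiff', 5);
print defined $hit ? "shared: $hit\n" : "nothing shared\n";
```

The scan returns on the first hit, so it never looks further than it has to, but in the worst case it still tries every window of the reference string.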

How do I check if a string has exactly one of a certain character

I'm trying to scan strings to see if they have exactly one of a certain character.
For example if I'm looking for a question mark
Hello? I'm here
Will match the regex however
Hello? Are you listening?
Will not
I've tried ?{1} and ?{1}[^?]+ but they both don't work. Can anyone point me in the right direction?
Why not do:
(\?)
and count the number of matches.
Or even more simply, count number of ? in string using tr///
my $c = $string1 =~ tr/?//;
You could do something like
my $cnt = () = $str =~ m/\Q$pat/g;
if ($cnt == 1) {
    # matched
}
else {
    # failed
}
$pat is the pattern (character in this case) you want to match, such as '?'.
If you're looking for a particular character only, you can use the transliteration operator, tr///:
my $count = $string =~ tr/?/?/;
if( $count == 1 ) {
...
}
With the transliteration operator, I can leave off the replacement side and any characters not lined up with a replacement character will use the previous replacement character. If there isn't a previous replacement character, it makes no replacement. I just leave out the second part of the tr///:
my $count = $string =~ tr/?//;
if( $count == 1 ) {
...
}
This won't work for patterns though. This is strictly for character-to-character replacements. For a pattern, use the match-counting approach from Lee Duhem's answer.
You can use this regex:
^[^?]*\?[^?]*$
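A quick way to convince yourself that the anchored regex and the tr/// count agree is to run both over the two sample sentences from the question. The sub names below are invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has exactly one '?' (regex version).
sub one_by_regex {
    my ($s) = @_;
    return $s =~ /^[^?]*\?[^?]*$/ ? 1 : 0;
}

# Hypothetical helper: same check by counting with tr///.
sub one_by_count {
    my ($s) = @_;
    return ($s =~ tr/?//) == 1 ? 1 : 0;
}

for my $s ("Hello? I'm here", "Hello? Are you listening?", "No question here") {
    printf "%-28s regex=%d count=%d\n", $s, one_by_regex($s), one_by_count($s);
}
```

The regex works by allowing only non-question-mark characters on both sides of the single \?, so a second ? anywhere makes the match fail.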

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash, doing this:
### Domains ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; # this regex gets the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with numbers
    my $foo = $1;
    $foo = "OTHER";
    $domain{$foo}++;
}
else {
    $domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently print this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two groups of 1-3 digits separated by a dot
But if you only need to separate out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because TLDs cannot contain numbers, so you would do something like
/[a-zA-Z]$/ -- if it returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-zA-Z]+)$/)
{
    # the domain is assigned to the variable $domain
} else {
    # it is a numbered domain
}
Take what it currently outputs, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is the sub-pattern to /\d+\.\d+/, so both regular expressions will match the same thing.
The
$dom =~ /(\w+\.\w+)$/;
Is matching at the end of the string $dom, where there are two runs of word characters (letters, digits, or underscores) with a period between them. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This is setting $foo to whatever $dom is matching, but then immediately resets $foo to OTHER, and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
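In the meantime, here is a guess at what the counting could look like once the regexes are corrected. It is only a sketch: bucket_for is a hypothetical helper, the sample names are a few of those from the question, and sending names that match neither pattern to OTHER is my assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which hash key a host name is counted under.
sub bucket_for {
    my ($name) = @_;
    return 'OTHER' if $name =~ /\d+\.\d+$/;      # ends in number.number
    return $1      if $name =~ /(\w+\.\w+)$/;    # last two labels
    return 'OTHER';                              # assumption: unmatched names go here too
}

my @names = qw(
    something.something.72.154
    something.something.73.194
    something.something.cox.net
    something.something.com.br
);

my %domain;
$domain{ bucket_for($_) }++ for @names;
print "$_: $domain{$_}\n" for sort keys %domain;
```

Checking for the numeric ending first means the domain regex never sees a numbered name, so no second test on $1 is needed.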

how do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
this is not working
thanks every one i was a little confused
The [] in regex mean character class which match any one of the character listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which returns true as character p is found in $b.
To see if the match happens or not you are printing true. If you print the bareword true (without quotes), nothing is shown, because Perl parses a bareword after print as a filehandle; quote the string instead.
But if you wanted to see if one string is present inside another you have to drop the [..] as:
if ($b=~ /$a/) { print "true";}
If variable $a contained any regex metacharacter, the above matching could fail. To fix that, place the variable between \Q and \E so that any metacharacters in it will be escaped:
if ($b=~ /\Q$a\E/) { print "true";}
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you'll be looking for will contain "reserved characters" like any of -[]{}().
(Apart from the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the matching operator, and specifies the variable to which you are applying the regex ($b) in your example above. If you omit =~, then Perl will automatically use an implied $_ =~.
In list context, a match returns the list of captured substrings, which you usually assign to a list, as in ($match1, $match2) = $b =~ /.../;. If, on the other hand, you assign the result to a scalar, the match is evaluated in scalar context, and the scalar is set to true (1) if the pattern matched and to false (the empty string) if it did not.
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
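The difference between the contexts is easy to see in a few lines. This is just an illustrative sketch; the variable names and the patterns are arbitrary.

```perl
use strict;
use warnings;

my $text = 'programming';

# Scalar context: only success or failure comes back.
my $matched = $text =~ /(pro)(gram)/;
my $missed  = $text =~ /xyz/;

# List context: the captured substrings come back.
my ($first, $second) = $text =~ /(pro)(gram)/;

# List assignment evaluated in scalar context: number of /g matches.
my $m_count = () = $text =~ /m/g;

print "matched=$matched missed='$missed'\n";
print "first=$first second=$second\n";
print "m_count=$m_count\n";
```

The parentheses on the left-hand side are what switch the match into list context; without them you only get the success flag.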
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
    print "target is in string\n";
}