Match a cycle of letters in a string in Perl - regex

Let's say I have a string 'abc'. How do I match all runs of 3 or more occurrences of 'abc' and its cycles ('bca', 'cab') in a large string?
Right now I am using individual entries as regex to match, but a) It is taking too long because the string is very large, and b) I'm getting the same regions in subsequent matches. For example, if my input is:
dabcabcabcabgyklagkbcabcabcahkgljla
^-------^ ^-------^
I want my output to be two matches:
1. abcabcabc position 2
2. bcabcabca position 20
Right now I'm getting 4 lines of output:
1. abcabcabc position 2
2. bcabcabca position 3
3. cabcabcab position 4
4. bcabcabca position 20
I hope I explained my problem. I got the desired output in another complicated way by doing a multi regex matching using all possible combinations in a single regex like this:
while ($str =~ /(?:abc){3,}|(?:bca){3,}|(?:cab){3,}/g) {
print "$&\tposition $-[0]\n";
}
But it was a serious performance hit, and given the size of my input, it is taking forever to run. Please help me with a more efficient algorithm. Really sorry if this was asked earlier, but I couldn't find any page that helped me.
Thanks in advance

I suggest you use just /(abc){2,}/ preceded by nothing, c, or bc and followed by nothing, a, or ab, so
/ ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg
The idea is to break down any sequence, like bcabcabcabcabca, into a number of abcs, possibly preceded by c or (here) bc and possibly followed by (here) a or ab, like this.
bc abcabcabcabc a
so that the regex engine doesn't have to check for three different strings at every point.
Doing it that way may find sequences up to three characters shorter than you require, but it should be faster and you can add an additional filter on length. Like this
use strict;
use warnings;
my $seq = 'dabcabcabcabgyklagkbcabcabcahkgljla';
while ($seq =~ / ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg) {
next unless length $1 >= 9;
my $subseq = $1;
chop $subseq while length($subseq) % 3;
print "$subseq\tposition $-[0]\n";
}
output
abcabcabc position 1
bcabcabca position 19

I've used the data you posted and have found a variation of my original solution that runs about four to five times faster than your original. Unfortunately the sequence you posted is only 225KB and there is only a single occurrence of one of the SSRs in it, so I don't know how representative it is.
Essentially, instead of looking for a sequence of four rotations of the pattern, it looks only for repetitions of the core SSR, with an optional prefix and suffix that let the overall sequence start anywhere within the SSR, like this
/ (?:AAT|AT|T|) (?:AAAT){3,} (?:AAA|AA|A|) /x
All of this regex is built automatically.
use strict;
use warnings;
use autodie;
open my $fh, '<', 'chr1.txt';
my $seq = <$fh>;
close $fh;
my @ssrs = qw( AAAT AAAC AACC AACG );
retrieve_ssr('Sample', $seq, \@ssrs);
sub retrieve_ssr {
    my ($name, $seq, $ssr_list) = @_;
    for my $ssr (@$ssr_list) {
        my $len = length $ssr;
        # minimum number of repeats of the core SSR
        my $n = $len == 5 ? 3 : 12 / $len;
        # partial copies of the SSR allowed before and after the run;
        # the trailing '|' leaves an empty alternative
        my $prefix = join '', map { substr($ssr, -$_) . '|' } 1 .. $len-1;
        my $suffix = join '', map { substr($ssr, 0, $_) . '|' } reverse 1 .. $len-1;
        my $re = qr/ (?:$prefix) (?:$ssr){$n,} (?:$suffix) /x;
        while ($seq =~ /$re/g) {
            my $start = $-[0] + 1;
            my $length = $+[0] - $-[0];
            # trim the match to a whole number of repeats and resume the search there
            my $excess = $length % $len;
            pos($seq) -= $excess;
            $length -= $excess;
            my $match = substr $seq, $-[0], $length;
            print "$start\t$+[0]\t$length\t$match\n";
        }
    }
}
output
23738 23752 12 TAAATAAATAAA

It strikes me that you don't need to have 3 separate regexes, you really just need one regex like this:
perl -ne 'print "$1\tposition $-[0]\n" while /(b?c?(abc){1,}a?b?)/g' mydata.txt
The idea is that the core pattern abc is matched as needed, and then you just need to account for the potential prefix of "b?c?" and a potential suffix of "a?b?" (if the prefix or suffix were longer then it would be matched by the main regex in the center).
As given this expression will find matches of 3 chars or longer, but you can obviously up the minimum length by changing the value inside {1,}
This solution does risk a few false positives in the prefix and suffix however, as it would match "babc", so you could run a 2nd slow search on the results for complete accuracy.
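The second pass mentioned above could look something like this. It is only a sketch: the helper name is_true_cycle and the sample data are made up for illustration, and it assumes the motif is short enough that building a repeated copy of it is cheap. A candidate is accepted only if it occurs inside the motif repeated enough times to cover it, which rejects things like "babc".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): a candidate is a genuine
# run of the motif or one of its rotations only if it occurs inside the
# motif repeated enough times to cover it.
sub is_true_cycle {
    my ($candidate, $motif) = @_;
    my $copies = int(length($candidate) / length($motif)) + 2;
    return index($motif x $copies, $candidate) >= 0;
}

# Made-up sample data; the trailing "xbabcx" contains a false positive.
my $data = 'dabcabcabcabgyklagkbcabcabcahkgljla xbabcx';
while ($data =~ /(b?c?(?:abc){1,}a?b?)/g) {
    my ($match, $pos) = ($1, $-[0]);
    next unless is_true_cycle($match, 'abc');   # drop false positives
    print "$match\tposition $pos\n";
}
```

Because the check is a plain substring search on short candidates, it should add little to the cost of the main regex pass.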

Related

Replace only the second occurrence of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven", where each "|" delimits a column.
I split the string on "|", save all the columns in an array, and read each column into $str.
So I'm trying to do this:
$string =~ s/$str/venky/g if $str =~ /ven/i; # it will do globally.
This does not meet the requirement.
I need to replace the string at a particular occurrence, on demand.
For example, I have a request to change the 2nd occurrence of "ven" to "venky".
How can I meet this requirement simply? Is there something like
$string =~ s/ven/venky/2;
As far as I know, we have 'o' for replacing once and 'g' for replacing globally. I'm struggling to find a way to do the replacement at a particular occurrence. I should not use pos() to get the position, because the string keeps changing and it becomes difficult to track every time.
Please help me on this regard.
There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist to use one regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - skips over text that does not contain ven
\K - drops everything matched so far from the reported match, so you can use it as a kind of variable-length lookbehind (it needs Perl 5.10 or later)
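As a rough illustration of the {N-1} idea above, here is a sketch that wraps it in a sub. The name replace_nth is hypothetical, and the skip pattern is swapped for a lazy .*? rather than the character-class alternation, so treat it as a variation on the answer, not the answer's exact regex.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the $n-th occurrence of a literal $needle.
# A lazy .*? stands in for the (?:[^v]|v[^e]|ve[^n])* skip pattern, and
# \Q...\E makes the needle match as plain text. \K needs Perl 5.10+.
sub replace_nth {
    my ($string, $needle, $replacement, $n) = @_;
    my $skip = $n - 1;    # occurrences to step over before the target
    $string =~ s/^(?:.*?\Q$needle\E){$skip}.*?\K\Q$needle\E/$replacement/s;
    return $string;
}

print replace_nth('ven|ven|vett|vejj|ven|ven', 'ven', 'venky', 2), "\n";
```

If the string has fewer than $n occurrences, the substitution simply does not match and the string comes back unchanged.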
Currently you're replacing every instance of 'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is replace 'ven' with 'venky' within your string only if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my @elements = split(/\|/, $string);
my $count;
foreach (@elements){
    $count++;
    # replace the whole element, but only when it is the second one
    s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', @elements);
print "\n";
Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurrence you want to replace
$needle is the thing you want to replace (think needle in a haystack)
Note that right now the needle may contain regular expression syntax. If you want it treated as literal text, you would have to use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
    my ($string, $n, $needle, $replacement) = @_;
    # take the string apart
    my @elements = split /\|/, $string;
    my $count = 0; # keep track of ...
    foreach my $e (@elements) {
        $count++ if $e =~ m/$needle/; # ... how many matches we've found
        if ($count == $n) {
            $e =~ s/$needle/$replacement/; # replace
            last; # and stop processing
        }
    }
    # put it back into the pipe-separated format
    return join '|', @elements;
}
Output:
ven|venky|vett|vejj|ven|ven
To replace the n'th occurrence of "ven" to "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the ability with the /e flag to specify the substitution part as an expression.
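The same counter idea can be packaged so that the caller's variables are not modified. This is just a sketch; replace_nth_counted is a made-up name, and it counts upward with a private counter instead of decrementing $n.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the /e counter idea: it works on copies, so
# the caller's string and occurrence number are left untouched.
sub replace_nth_counted {
    my ($string, $needle, $replacement, $n) = @_;
    my $seen = 0;
    # Every match is visited; only the $n-th one gets the replacement,
    # all others are put back unchanged via $1.
    $string =~ s/(\Q$needle\E)/++$seen == $n ? $replacement : $1/eg;
    return $string;
}

print replace_nth_counted('seven given ravens', 'ven', 'venky', 3), "\n";
```

Capturing into $1 instead of using $& avoids the performance penalty that $& carried on older Perl versions.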

How do I count regex matches in perl when using multiple possible match targets separated by "|"?

I have a (very) long list of strings of numbers that I need to count the number of occurrences of certain values in order to decide whether to pull the line the string is associated with. Essentially, the file is formatted like this:
,4,8,9,11,12,
,5,6,7,9,11,
etc.
where the strings range in length from 1 to 100 values, the values range from 1 to 100, and the values in each string are always ordered smallest to largest.
I'm trying to find all the lines that have, for example, at least two out of the three values 4, 9, and 11, so here is the test code I wrote to try out my regex:
my $string = ",4,8,9,11,12,";
my $test = ",4,|,9,|,11,";
my @c = $string =~ m/$test/g;
my $count = @c;
print "count: $count\n";
print "\@c:", join(" ", @c), "\n";
The output when I run this is:
count: 2
@c:,4, ,9,
When I expect count to be 3 and @c to be ,4, ,9, ,11,.
I realize this is because the 9 and the 11 share the same comma, but I'm wondering if anyone knows how to get around this. I can't just drop the last comma from the match because if I'm trying to match ,4 in a string that has a ,41, for example, it will then erroneously count the ,41,.
I suppose I could do something like:
my $test = "4|9|11";
$string =~ s/,/ /g;
my @c = $string =~ m/\b($test)\b/g;
which works, but adds another step before the match counting. Is there a way to perform the matches keeping the original string unchanged?
I'm also trying to avoid looping through my match targets individually and summing the individual match counts because I'm trying to maximize efficiency. I'm working with some really massive lists of values requiring millions of permutations and the way I currently have my script written using loops it's taking days to complete. I'm hoping by regex matching it will go faster.
Thanks
The problem is that the trailing , is consumed in the ,9, match, so when it starts looking for the next match it starts at 11,12,. There's no leading , before the 11, so it can't match that. I'd recommend using a lookahead like this:
,(4|9|11)(?=,)
This way, the trailing , will not be consumed as part of the match.
For example:
my $string = ",4,8,9,11,12,";
my $test = ",(4|9|11)(?=,)";
my @c = $string =~ m/$test/g;
my $count = @c;
print "count: $count\n";
print "\@c:", join(" ", @c), "\n";
Outputs:
count: 3
@c:4 9 11
Just ignore the commas. This does what you want:
printf "count: %d\n", scalar( () = $string =~ /\b(?:4|9|11)\b/g );
The list assignment () = ... to the empty list is evaluated in scalar context, provided by scalar(), where it returns the number of elements in the list on the right-hand side.
The (?:...) is just to avoid creating a capture group, which should improve performance.
Edit:
Okay, the OP is asking for performance, so I did some benchmarking, and it turned out that a simple
++$count while ($string =~ /\b(?:4|9|11)\b/g);
is faster than both my list assignment trickery above (approx. 30% speedup on my ancient laptop) and the answer from p.s.w.g with the lookahead pattern (approx. 20% speedup, so his solution was maybe less fancy but actually faster than my first solution).
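To tie the counting loop back to the original goal (keep lines with at least two of the three values), something like the following might do. It is a sketch: count_targets is a hypothetical name and the sample lines are made up, with the last one included to show that 41, 19 and 110 are not counted.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many of the target values occur in a
# comma-delimited line, using the while-loop counter from the benchmark.
sub count_targets {
    my ($line, @targets) = @_;
    my $alternation = join '|', map { quotemeta } @targets;
    my $re = qr/\b(?:$alternation)\b/;    # word boundaries keep 4 from matching 41
    my $count = 0;
    ++$count while $line =~ /$re/g;
    return $count;
}

# Made-up sample lines.
for my $line (',4,8,9,11,12,', ',5,6,7,9,11,', ',1,2,3,4,5,', ',41,19,110,') {
    print "$line\n" if count_targets($line, 4, 9, 11) >= 2;
}
```

For a very large file it would probably be worth building the qr// object once outside the sub and passing it in, since this version rebuilds it on every call.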
I would use the following instead of a regex:
#!/usr/bin/perl
use strict;
use warnings;
my @values = qw(4 9 11);
while (<DATA>) {
my %hash = map { $_ => 1 } split /,/;
my $count = 0;
foreach my $value (@values) {
$count++ if exists $hash{$value};
}
print if $count >= 2;
}
__DATA__
,4,8,9,11,12,
,5,6,7,9,11,
,1,2,3,4,5,
Output:
,4,8,9,11,12,
,5,6,7,9,11,
This one will work for you too. Because the whole pattern sits inside a lookahead, it consumes nothing, so matches that share a comma are all found:
my $str = ',4,8,9,11,12,11,';
my @arr = $str =~ /(?=,(4|9|11),)/g;

Regex greedyness REasking

I have this text $line = "config.txt.1", and I want to match it with regex and extract the number
part of it. I am using two versions:
$line = "config.txt.1";
(my $result) = $line =~ /(\d*).*/; #ver 1, matched, but returns nothing
(my $result) = $line =~ /(\d).*/; #ver 2, matched, returns 1
(my $result) = $line =~ /(\d+).*/; #ver 3, matched, returns 1
I think the * was messing things up. I have been looking at this, but I still don't understand the greedy mechanism in the regex engine. If I start from the left of the regex, there might be no digits in the text, so ver 1 will match too. But ver 3 won't match in that case. Can someone explain why that is, and how I should write what I want? (potentially with a number, not necessarily a single digit)
Edit
Requirement: potentially with a number, not necessarily a single digit; the match may capture nothing, but it should not fail.
The output must be as follows (for the above example):
config.txt 1
The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to: match as many digits at this position as possible (zero or more), then match as many non-newline characters as possible. The match starts looking at the c of config, where it matches zero digits. The .* then consumes the rest of the line, so the match succeeds with an empty capture.
You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
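A small sketch to make the difference concrete. The helper name suffix_number is made up for illustration; the three patterns are the ones discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the trailing ".<digits>" number, or undef
# when the name has no numeric suffix.
sub suffix_number {
    my ($name) = @_;
    my ($number) = $name =~ /\.(\d+)$/;
    return $number;
}

my $line = 'config.txt.1';
my ($star) = $line =~ /(\d*).*/;   # succeeds at offset 0 with zero digits
my ($plus) = $line =~ /(\d+).*/;   # has to find at least one digit first
print "star='$star' plus='$plus' suffix='", suffix_number($line), "'\n";
```

Note that with the anchored pattern a name without a numeric suffix makes the match fail, so the helper returns undef rather than an empty string.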
Use the literal '.' as a reference to match before the number:
#!/usr/bin/perl
use strict;
use warnings;
my @line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);
my (@capture1, @capture2);
foreach (@line){
    # the name.extension part
    my (@filematch) = ($_ =~ /(\w+\.\w+)/);
    # the optional number after a second '.'; empty when there is none
    my (@numbermatch) = ($_ =~ /\w+\.\w+\.?(\d*)/);
    push @capture1, @filematch;
    push @capture2, @numbermatch;
}
print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;
Output:
config.txt
file.txt
config.txt 1
config.foo 2
config.txt 23
differentname.fsdfsdsdfasd 2444
Thanks guys, I think I figured out myself what I want:
my ($match) = $line =~ /\.(\d+)?/; #this will match and capture any digit
#number if there was one, and not fail
#if there wasn't one
To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/
perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched
You do not need .* at all. These two statements assign the exact same number:
my ($match1) = $str =~ /(\d+).*/;
my ($match2) = $str =~ /(\d+)/;
A regex matches anywhere in the string by default, so you do not need to add wildcards.
The reason your first match does not capture a number is because * can match zero times as well. And since it does not have to match your number, it does not. Which is why .* is actually detrimental in that regex. Unless something is truly optional, you should use + instead.

split first from rest of list using regex substitution

I need to split a list between its first item and the rest of its items using regex substitution only.
The lists of items are input as strings using '##' as a separator, e.g.:
''
'one'
'one##two'
'one##two##three'
'one##two words##three'
My Perl attempt doesn't really work:
my $sampleText = 'one##two words##three';
my $first = $sampleText;
my $rest = $sampleText;
$first =~ s/(.+?)(##.*)?/$1/g;
$rest =~ s/(.?+)(##)?(.*)/$3/g;
print "sampleText = '$sampleText', first = '$first', rest = '$rest'\n";
sampleText = 'one##two words##three', first = 'one', rest = 'ne##two words##three'
Please note the constraints:
the separator is a multi-character string
only regex substitutions are allowed (1)
I could "chain" regex substitutions if necessary
The expected end result is two strings: the first element, and the initial string with the first element cut off (2)
the list may have from 0 to n items, each being any string not containing the separator.
(1) I work with this rather large Perl system where at some point lists of items are processed using provided operations. One of them is a regex substitution. None of the others is applicable. Solving the problem using full Perl code is easy, but that would mean modifying the system, which is not an option at this time.
(2) the context is the Unimarc bibliographic format, where authors of a publication are to be split into the standard Unimarc fields 700$a for the first author, and 701$a for any remaining authors.
I assume point (1) means you cannot use the split builtin? It would be easy using split's optional third parameter, which lets you specify the maximum number of items.
my( $first, $rest ) = split( '##', $sampleText, 2 );
But if it has to be a regex replace then yours is almost right, but using .+? won't work when there are no separators (because it will just take the first character). You can fix this by anchoring the end. Instead, try something like:
my $sampleText = 'one##two words##three';
my $first = $sampleText;
my $rest = $sampleText;
$first =~ s/(.+?)(|##(.*))$/$1/g;
$rest =~ s/(.+?)(|##(.*))$/$3/g;
print "sampleText = '$sampleText', first = '$first', rest = '$rest'\n";
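For comparison, here is a sketch of a slightly different substitution-only approach that deletes the unwanted part instead of capturing the wanted part. The sub names first_item and rest_items are hypothetical wrappers added only so the two substitutions can be tried on the sample inputs; in the constrained system only the s/// expressions themselves would be used.

```perl
use strict;
use warnings;

# Hypothetical helpers that use nothing but s///, one substitution each.
sub first_item {
    my ($list) = @_;
    $list =~ s/##.*//s;              # drop the first separator and everything after it
    return $list;
}

sub rest_items {
    my ($list) = @_;
    $list =~ s/^.*?(?:##|\z)//s;     # drop the first item and its separator, if any
    return $list;
}

for my $list ('', 'one', 'one##two', 'one##two words##three') {
    printf "list = '%s', first = '%s', rest = '%s'\n",
        $list, first_item($list), rest_items($list);
}
```

Since both replacements are empty strings, no capture group can be undefined, which sidesteps "uninitialized value" warnings for single-item and empty lists.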
Whatever is the matter with :
my ( $first, $rest ) = split /##/, $sampleText, 2;
?
try
my ($first, $rest) = /(.+?)\#\#(.*)/;
// (or, m//) is matching; you don't need to use s/// for substitution. It returns the matches (here, to $first, $rest), or you can capture them later using $1, $2, &c.
You have reversed the quantifiers ? and + in the second regex, it should be:
$rest =~ s/(.+?)(##)?(.*)/$3/g;
___^^
or more concise:
$rest =~ s/.+?##(.*)/$1/;
I'd just match, not substitute:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
m{([^#]*?)##(.*)} and print "[$1][$2]\n";
}
__DATA__
''
'one'
'one##two'
'one##two##three'
'one##two words##three'

Regex range operator

I have a string '11 15 '. With a regex I then extract the values within that string, in this case 11 and 15 (could be any number of digits but I'll keep it simple with 2 2-digit numbers).
For each of those numbers, I then see if it matches any of the numbers I want; in this case I want to see if the number is '12', '13', or '14'. If it is, then I change the value of '$m':
my $string = '11 15 ';
while ( $string =~ /([0-9]{1,})\s+/ig ) {
my $m = $1;
print $m . ".....";
$m = 'change value' if $m =~ /[12...14]{2,}/g;
print $m . "\n";
}
Produces:
11.....change value
15.....15
'15' stays the same, as it should. But '11' changes. What am I doing wrong?
[12...14] matches against "1", "2", ".", and "4". "11" Matches that; "15" doesn't. If you're just matching against numbers, you shouldn't be using regular expressions. Change your line to the following:
$m = 'change value' if $m ~~ [12..14];
Or, if unable to guarantee perl >= v5.10:
$m = 'change value' if grep { $m == $_ } 12..14;
You've misunderstood the regular expression. Where you've written [12...14]{2,}, this means "match 2 or more of the characters 1 or 2 or dot or dot or dot or 1 or 4".
Try something like:
$m='change value' if $m=~/(\d{2,})/ and $1 >= 12 and $1 <= 14;
In a substitution operation, this could be written as:
$m =~ s/(\d{2,})/ $1 >= 12 && $1 <= 14 ? 'change value' : $1/ge;
That is, capture 2 or more digits and then test what you have captured to see if they're what you want to change by using perl code in the replacement section of the substitution. The e modifier indicates that Perl should evaluate the replacement as Perl code.
Let's rewrite your code a bit:
my $string = '11 15 ';
while ( $string =~ /(\d+)/g ) {
I've changed your while statement's regular expression. You can use \d+ to represent one or more digits, and that's easier to understand than [0-9]{1,}. You also (since a space won't match \d) don't need the last space on the end of your string.
Let's look at the rest of the code:
my $string = '11 15';
while ( $string =~ /(\d+)/g ) {
my $match = $1;
print "$match.....";
if ($match >= 12 and $match <= 14) { # or: if ($match ~~ [12..14]) on Perl 5.10 and later
print "change value\n";
}
else {
print "$match\n";
}
}
You can't use a regular expression the way you are to test for range.
Instead, use the regular range test of
if ($match >= 12 and $match <= 14)
or the newer group test:
if ($match ~~ [12..14]) #Note only two dots and not three!
That last one only works in newer versions of Perl (like the 5.12 I have on my Mac and the 5.14 I have on my Linux box), but not in the Perl 5.8 I have on my Solaris box.
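For a version that behaves the same on every Perl release, the plain comparison can be wrapped in a small sub. This is only a sketch: in_range is a made-up name, and the sample string adds a 13 so that both outcomes show up.

```perl
use strict;
use warnings;

# Hypothetical helper: plain numeric comparison, no regex and no smartmatch.
sub in_range {
    my ($value, $low, $high) = @_;
    return $value >= $low && $value <= $high;
}

my $string = '11 13 15';    # made-up sample; 13 falls inside 12..14
while ($string =~ /(\d+)/g) {
    my $number = $1;
    my $shown  = in_range($number, 12, 14) ? 'change value' : $number;
    print "$number.....$shown\n";
}
```

Avoiding ~~ here is a deliberate choice, since later Perl releases flag the smartmatch operator as experimental.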
A few tips:
Use indents and spaces. It makes your code more readable.
Use descriptive names for variables. Instead of $m, I used $match.
Don't use the appended if statements. The appended if is harder to spot, so you might miss something important, and it makes your code harder to update. It can be used if the statement itself is clear and simple, and it improves readability. The last is a bit subjective, but you'll commonly see appended if statements in things like return if not -f $file;.
Keep variables single purpose. In this case, instead of changing the value of $match, I used an if/else statement. Imagine if your code was a bit more complex, and someone had to add in a new feature. They see the $match variable and think this is what they need. Unfortunately, you changed what $match is. It's now a value to be printed out and not the string match. It might take the person who changed your program quite a while to figure out what happened to the value of $match and why it has been mysteriously set to 'change value'.
In the print statement, you can include variables inside of double quotes. This is very different from almost all other languages. This is because Perl variables use sigils to mark variable names. It usually makes it easier to read if you combine variables and other strings in a single string.
For example:
print "The range of possible values are $low to $high\n";
vs.
print "The range of possible values are " . $low . " to " . $high . "\n";
Notice how in the second example, I had to be careful of spaces inside the quotes while in the first example, the required spaces came rather naturally. Imagine having to change that statement in a later version of the program. Which would be easier to maintain?