Perl split function - use repeating characters as delimiter - regex

I want to split a string using repeating letters as delimiter, for example,
"123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:
@s = split /([[:alpha:]])\1+/, '123aaaa23a3';
But this returns '123', 'a', '23a3', which is not what I wanted. Now I know that this is because the last 'a' in 'aaaa' is captured by the parentheses and thus preserved by split(). But anyway, I can't add something like ?: since [[:alpha:]] must be captured for the back reference.
How can I resolve this situation?

Hmm, it's an interesting one. My first thought would be: with a capturing group, the captured delimiters always land at the odd-numbered indices of the returned list, so you can just discard every odd-numbered array element.
Something like this perhaps?:
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;
This'll give you:
$VAR1 = {
'23a3' => '',
'123' => 'a'
};
So you can extract your pattern via keys.
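A hash loses the original order and collapses duplicate fields, so here is a minimal sketch of the same "discard the odd-numbered elements" idea using an array slice instead. The variable names `@parts` and `@fields` are illustrative only.

```perl
use strict;
use warnings;

# Split with a capturing group: the result alternates field, delimiter, field, ...
my @parts = split /([[:alpha:]])\1+/, '123aaaa23a3';

# Keep only the even indices (0, 2, 4, ...), which hold the fields.
my @fields = @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];

print join(', ', @fields), "\n";    # 123, 23a3
```

This relies on the pattern having exactly one capture group; with more groups the stride would need to change accordingly.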
Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the regex stuff).
But something like this:
my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;
By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.
This is the closest I got:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $str = '123aaaa23a3';
#build a regex out of '2-or-more' characters.
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";
#split on the regex
my @s = split m/$regex/, $str;
print Dumper \@s;
We first process the string to extract "2-or-more" character patterns, to set as our delimiters. Then we assemble a regex out of them, using a non-capturing group, so we can split.

One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:
use List::Util 1.29 qw( pairkeys );
my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';
Gives
Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]
That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:
my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;
Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:
use List::Util 1.29 qw( pairvalues );
my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';

The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:
@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';
(??{ code }) is documented on the 'perlre' manual page.
Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches. I've never found this to be a problem, but YMMV.

Related

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
print "matches with $_ CACTT's\n";
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
my @grep_matches = grep(/$pattern/, @matching);
print "$_\n" for @grep_matches;
my @copy = @grep_matches;
my $squashed = @copy;
print "number of rows total: $squashed\n";
}
for (2...6) {
MATCHER($_);
}
Notes:
@matching contains the strings from 1, 2, and 3 in an array.
the for loop is set from integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any n>=2, lines with n+1 occurrences are also returned by the match. I thought the negative lookahead could prevent this behavior in perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1);}
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
..tries to match first beginning of string ^, then a minimal match of any character .*?, followed by your sequence CACTT, followed by a minimal match (but slightly different from .*?) .+?. And you want to match these $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a look-ahead assumption that wants to make sure that there is no minimal match of any char .*? followed by your sequence, followed by any char of any length followed by end of line $.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
Is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
chomp;
my $count = () = /cactt/gi; # count number of matches
push @{ $data{$count} }, $_; # store count and original
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
'2' => [
'asfdcacttaskdfjcacttklasdjf'
],
'3' => [
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
],
'1' => [
'cksjdfcacttlkasdjf'
]
};
This is just to demonstrate how to create the data structure. You can now access the strings by their number of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
print "$_\n";
}
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
my $m_cnt = () = /cactt/g;
if ($m_cnt>=$n){
print;
$match_line_cnt++;
}
$line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one-liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment and it throws away whatever is assigned to it just like scalar undef = throws away its right hand side. But, the null list assignment still returns the number of things thrown away when its left hand side is in scalar context. It returns an empty list in list context.
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
my ($string, $pattern, $n) = @_;
my $c = 0;
while ($string =~ m/$pattern/g) {
$c++;
return if $c > $n;
}
return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
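As a rough sketch of how that regex could be applied to the sample lines from the question, assuming the pattern is the literal `cactt` and `$n` is 2 (the names `@lines` and `@exact` are made up for this example):

```perl
use strict;
use warnings;

my $pattern = qr/cactt/;
my $n       = 2;

# Each repetition skips characters that do not start a match, then consumes one match.
my $regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

# Keep only the lines with exactly $n occurrences.
my @exact = grep { $_ =~ $regex } @lines;
print "$_\n" for @exact;    # asfdcacttaskdfjcacttklasdjf
```

Because the filler `(?:(?!$pattern).)*` can never step over the start of an occurrence, a line with three occurrences cannot satisfy the `{2}` repetition and still reach the end anchor.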
https://perldoc.perl.org/perlre

Matching consecutive characters from a string using regex

I'm not sure how to title this question, so moving along...
I'd like to be able to match a portion of a string that is a subset of a larger string. For example:
MatchPartOfThisString -> Reference string
fThisDiff -> string I'd like to be able to say matches 5 consecutive characters in
I suppose I could loop through the first string, taking the minimum number of consecutive matches from the reference string, and see if the other string matches each of the matches I get from systematically trying to match:
if(fThisDiff =~ /Match/) {
do something...;
}
if(fThisDiff =~ /atchP/) {
do something...;
}
if(fThisDiff =~ /tchPa/) {
do something...;
}
etc.
I'd like to do this more elegantly though, if there is a way to interpret portions of the reference string repeatedly with a singular regex. I do not think this is the case, but I'd like confirmation regardless.
Here is a basic take on it, by hand with builtin tools.
Build a regex pattern with alternation of substrings of desired length from your reference string.
use warnings;
use strict;
use feature 'say';
sub get_alt_re {
my ($str, $len) = @_;
$len //= 1; #/
my @substrings;
foreach my $beg (0 .. length($str)-$len) {
push @substrings, substr($str, $beg, $len);
}
return '(' . join('|', map quotemeta, @substrings) . ')';
}
my $ref = q(MatchPartOfThisString);
my $target = q(fThisDiff);
my $re = get_alt_re($ref, 5);
my @m = $target =~ /$re/g;
say for @m;
Prints the line fThis.
The code should be made more robust and general. Then, it is fairly easily modified to match for a range of lengths (not only one, 5 above). Further, it can use libraries for subtasks (those repeated calls to substr beg for C code). But this demonstrates that a basic solution can be rather simple.
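One way the "range of lengths" variation might look is sketched below. The function name `get_alt_re_range` and its `$min`/`$max` parameters are hypothetical, not part of the answer above. Longer substrings are listed first so that the alternation prefers them.

```perl
use strict;
use warnings;

# Hypothetical variant: alternation of all substrings with lengths $min .. $max.
sub get_alt_re_range {
    my ($str, $min, $max) = @_;
    my (%seen, @substrings);
    for my $len (reverse $min .. $max) {          # longest candidates first
        for my $beg (0 .. length($str) - $len) {
            my $sub = substr($str, $beg, $len);
            push @substrings, $sub unless $seen{$sub}++;   # skip duplicates
        }
    }
    my $alt = join '|', map { quotemeta($_) } @substrings;
    return qr/($alt)/;
}

my $re = get_alt_re_range('MatchPartOfThisString', 4, 6);
my ($hit) = 'fThisDiff' =~ $re;
print "$hit\n";    # fThis
```

The number of alternatives grows with both the reference length and the width of the range, so for long strings the longest-common-substring approach is likely the better fit.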
There's no simple way to do this with regex features, but a couple CPAN modules can help construct such a regex in this case.
use strict;
use warnings;
use String::Substrings 'substrings';
use Data::Munge 'list2re';
my $match_string = 'MatchPartOfThisString';
my $re = list2re substrings $match_string, 5;
my $subject = 'fThisDiff';
if ($subject =~ m/($re)/) {
print "Matched $1 from $match_string in $subject\n";
}
The best approach would be to use the longest common substring algorithm (not to be confused with the similarly-named longest common subsequence algorithm) then check its length.
use String::LCSS_XS qw( lcss );
my $longest = lcss("MatchPartOfThisString", "fThisDiff");
say length($longest);
If you have really long strings and you want to squeeze out every millisecond, a tailored version of the algorithm that quits as soon as the target length is found and that avoids building the string would be faster.

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple spaces, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here). So it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
pos($text) = $pos;
push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree as to what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
my $word = $1;
push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
For example, a proper tokenizer could be used.
In the absence of more information, I can only suggest the pos function
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string
This program gives an example. The string is searched for the first non-space character after each of a list of offsets, and displays the character found, if any
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the file and leave information about only the very last match
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a  b  c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
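For the "hop from word to word" use case in the question, a sketch along the following lines might work. It combines pos with the \G anchor and the /c modifier, which keeps the current position when a match attempt fails. The sample text and the three token rules (whitespace, double-quoted chunk, run of non-space characters) are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $text = qq{alpha  "beta"\tgamma\n delta};
my @tokens;

pos($text) = 0;                          # could be any starting index
while (pos($text) < length($text)) {
    if ($text =~ /\G\s+/gc) {
        next;                            # skip any run of whitespace
    }
    elsif ($text =~ /\G"([^"]*)"/gc) {
        push @tokens, $1;                # quoted chunk, quotes dropped
    }
    elsif ($text =~ /\G(\S+)/gc) {
        push @tokens, $1;                # plain run of non-space characters
    }
}
print join('|', @tokens), "\n";          # alpha|beta|gamma|delta
```

No substring is copied while scanning; each attempt starts exactly where the previous successful match ended, and pos($text) can be read at any time to report the current index.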

How to do conditional ("if exist" logic) search & replace in Perl?

In my Perl script I want to do a conditional search & replace using a regular expression: find a certain pattern, and if the pattern exists in a hash, then replace it with something else.
For example, I want to search for a combination of "pattern1" and "pattern2", and if the latter exists in a hash, then replace the combination with "pattern1" and "replacement". I tried the following, but it just doesn't do anything at all.
$_ =~ s/(pattern1)(pattern2)/$1replacement/gs if exists $my_hash{$2};
I also tried stuff like:
$_ =~ s/(pattern1)(pattern2) && exists $my_hash{$2}/$1replacement/gs;
Also does nothing at all, as if no match is found.
Can anyone help me with this regex problem? Thx~
I would do it a different way. It looks like you have a 'search this, replace that' hash.
So:
#!/usr/bin/env perl
use strict;
use warnings;
#our 'mappings'.
#note - there can be gotchas here with substrings
#so make sure you anchor patterns or sort, so
#you get the right 'substring' match occurring.
my %replace = (
"this phrase" => "that thing",
"cabbage" => "carrot"
);
#stick the keys together into an alternation regex.
#quotemeta means regex special characters will be escaped.
#you can remove that, if you want to use regex in your replace keys.
my $search = join( "|", map {quotemeta} keys %replace );
#compile it - note \b is a zero width 'word break'
#so it will only match whole words, not substrings.
$search = qr/\b($search)\b/;
#iterate the special DATA filehandle - for illustration and a runnable example.
#you probably want <> instead for 'real world' use.
while (<DATA>) {
#apply regex match and replace
s/(XX) ($search)/$1 $replace{$2}/g;
#print current line.
print;
}
##inlined data filehandle for testing.
__DATA__
XX this phrase cabbage
XX cabbage carrot cabbage this phrase XX this phrase
XX no words here
and this shouldn't cabbage match this phrase at all
By doing this, we turn your hash keys into a regex (you can print it - it looks like: (?^:\b(cabbage|this\ phrase)\b) )
Which is inserted into the substitution pattern. This will only match if the key is present, so you can safely do the substitution operation.
Note - I've added quotemeta because then it escapes any special characters in the keys. And the \b is a "word boundary" match so it doesn't do substrings within words. (Obviously, if you do want that, then get rid of them)
The above gives output of:
XX that thing cabbage
XX carrot carrot cabbage this phrase XX that thing
XX no words here
and this shouldn't cabbage match this phrase at all
If you wanted to omit lines that didn't pattern match, you can stick && print; after the regex.
What is wrong (as in not working) with
if (exists $h{$patt1}) { $text =~ s/$patt1$patt2/$patt1$replacement/g; }
If $patt1 exists as a key in a hash then you go ahead and replace $patt1$patt2 with $patt1$replacement. Of course, if $patt1$patt2 is found in $text, otherwise nothing happens. Your first code snippet is circular, while the second one can't work like that at all.
If you want $patt1$patt2 first, and hash key as well then it seems that you'd have to go slow
if ($str =~ /$patt1$patt2/ && exists $h{$patt2}) {
$str =~ s/$patt1$patt2/$patt1$replacement/gs;
}
If this is what you want then it is really simple: you need two unrelated conditions, whichever way you turn it around. Can't combine them since it would be circular.
From the point of view of the outcome these are the same. If either condition fails nothing happens, regardless of the order in which you check them.
NOTE Or maybe you don't have to go slow, see Sobrique's post.
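Another option, sketched here under the assumption that the hash lookup has to happen per match, is the /e modifier, which evaluates the replacement side as Perl code. The patterns `foo` and `ba[rz]`, the replacement text, and the contents of `%my_hash` are placeholders standing in for the question's pattern1, pattern2 and hash.

```perl
use strict;
use warnings;

my %my_hash = ( bar => 1 );          # placeholder lookup table
my $text    = 'foobar foobaz';

# /e runs the replacement as code, so $2 is already set when the hash is checked.
# When the key is missing, the original text is put back unchanged.
$text =~ s/(foo)(ba[rz])/ exists $my_hash{$2} ? $1 . 'REPLACED' : $1 . $2 /ge;

print "$text\n";    # fooREPLACED foobaz
```

This also shows why the first attempt in the question did nothing: the `if exists $my_hash{$2}` modifier is evaluated before the substitution runs, when $2 has not been set by it yet.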

Split string (or regex match) at position/index of nth character in Perl?

There is a similarly worded question, but I think this is slightly different.
Basically, say I have this string:
"aa{bb{dccd"
Here I would like to split the string at the last brace {; and have the parts returned as an array. I can easily find the position (0-based index) of this character using rindex:
perl -e '
$aa="aa{bb{dccd" ;
$ri = rindex($aa, "{") ;
print "$ri\n"; '
5
... and given that I'm not a Perl coder, first thing I think of is to use something like $str = split($aa, 3). Unfortunately, that is not correct syntax - split takes a regex as first argument (what to match for), and string as second - and it doesn't take an integer position index as argument.
I found posts like Perl Guru Forums: Perl Programming Help: Intermediate: split or splice string on char count?, which recommend using substr in a similar context; however, I'd have to write two substrs to populate the list as per the example above, and so I'd rather hear about alternatives to substr.
Basically, if the problem of matching the position of N-th character can be expressed as a regex match, the split could work just as well - so that would be my primary question. However, I'd also be interested in hearing if there are Perl built-in functions that could accept a list/array of integers specifying character positions, and return an array containing the split sections.
EDIT:
To summarize the above - I'd like to have the character indexes, because I'd like to print them out for debugging; and at the same time, use them for splitting a string into array - but without using substrs.
EDIT2: I just realized that I left something out from the OP -- and that is, that in the problem that I'm working on, I have to first retrieve character indexes (by rindex or otherwise); then I have to do calculations on them (so they may increase, or decrease) - and only then am I supposed to split the string (based on the new index values). It may have been that my original example was too simple, and didn't express this focus on indexes/character positions much (and not to mention that my first thought of split implies character indexes anyways - but I really cannot remember which programming language it came from :))
You wrote:
I'd also be interested in hearing if there are Perl built-in functions that could accept a list/array of integers specifying character positions, and return an array containing the split sections.
To create a function that takes a list of offsets and produces a list of substrings with those split positions, convert the offsets to lengths and pass these as an argument to unpack.
There’s a &cut2fmt function in Chapter 1 of the Perl Cookbook that does this very thing. Here is an excerpt, reproduced here by kind permission of the author:
Sometimes you prefer to think of your data as being cut up at specific columns. For example, you might want to place cuts right before positions 8, 14, 20, 26, and 30. Those are the column numbers where each field begins. Although you could calculate that the proper unpack format is "A7 A6 A6 A6 A4 A*", this is too much mental strain for the virtuously lazy Perl programmer. Let Perl figure it out for you. Use the cut2fmt function below:
sub cut2fmt {
my(@positions) = @_;
my $template = '';
my $lastpos = 1;
foreach $place (@positions) {
$template .= "A" . ($place - $lastpos) . " ";
$lastpos = $place;
}
$template .= "A*";
return $template;
}
$fmt = cut2fmt(8, 14, 20, 26, 30);
print "$fmt\n";
A7 A6 A6 A6 A4 A*
So the way you would use that is like this:
$fmt = cut2fmt(8, 14, 20, 26, 30);
@list = unpack($fmt, $string);
or directly as
@list = unpack(cut2fmt(8, 14, 20, 26, 30), $string);
I believe this is what you were asking for.
my ($pre, $post) = split /\{(?!.*\{)/s, $s;
or
my ($pre, $post) = $s =~ /^(.*)\{(.*)/s;
The second is probably better.
If you need the index of the {, use length($pre). (With the second solution, you could also use $-[2] - 1. See @- and @+ in perlvar.)
Here are some ways:
split /.*\K{/, $str;
split /{(?!.*{)/, $str;
$str =~ /(.*){(.*)/;
Use /regex/s if the string can span multiple lines.
The way to do this using rindex is to employ substr to extract the two parts of the string according to the position of the {.
Note that this includes the { in the suffix part. To exclude it you would use $i + 1 in the second substr call.
my $str = "aa{bb{dccd";
my $i = rindex $str, '{';
my $pref = substr $str, 0, $i;
my $suff = substr $str, $i;
print $pref, "\n";
print $suff, "\n";
output
aa{bb
{dccd
Update
I have just read about your wish to avoid substr and do the split in a single operation. unpack will do that for you, like this
my $str = "aa{bb{dccd";
my $i = rindex $str, '{';
my ($pref, $suff) = unpack "A$i A*", $str;
print $pref, "\n";
print $suff, "\n";
with identical output to the previous code.
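Since the question mentions computing several indexes first and only then cutting the string, the same unpack idea can be generalized to a list of 0-based offsets. The helper below is a hypothetical sketch (the name `split_at` is invented here), and it uses the `a` format rather than `A` so that trailing spaces in a piece are kept. It assumes a plain ASCII string.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $str at each 0-based offset in @offsets.
sub split_at {
    my ($str, @offsets) = @_;
    my ($template, $last) = ('', 0);
    for my $off (sort { $a <=> $b } @offsets) {
        $template .= 'a' . ($off - $last) . ' ';   # width of the piece before this cut
        $last = $off;
    }
    return unpack($template . 'a*', $str);         # a* takes whatever remains
}

my $str   = 'aa{bb{dccd';
my $i     = rindex $str, '{';
my @parts = split_at($str, $i, $i + 1);            # isolate the brace itself
print join('|', @parts), "\n";                     # aa{bb|{|dccd
```

Dropping the middle element of `@parts` gives the two halves without the delimiter, and the offsets can be adjusted arithmetically before the call.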
I still don't see what's so difficult about this. Is it that you don't want to discard the brace (or whatever your delimiter is)? These adaptations of @Qtax's solutions leave the brace in either the first or second substring:
# split before the brace
split /.*\K(?=\{)/, $str;
split /(?=\{(?!.*\{))/, $str;
$str =~ /(.*)(\{.*)/;
# split after the brace
split /.*\{\K/, $str;
split /(?<=\{(?!.*\{))/, $str;
$str =~ /(.*\{)(.*)/;
(I know it isn't necessary to escape the brace, but I think it's a little easier to read this way.)
Right, I'll post this as an answer, this is how far I got.
Thanks to these resources:
Splitting a string in Perl
Regular Expressions - a Simple User Guide and Tutorial
... I learned about the "curly brace" regex operator, {n} which 'Matches the preceding character, or character range, n times exactly'. Thus, I can match for /.{5}(.)/:
perl -e '
$aa="aa{bb{dccd" ;
$aa =~ /.{5}(.)/ && print "--${1}--\n"; '
--{--
This skips over the first 5 "any" characters, then captures and prints the next one. Or:
/ # start regex
. # match "any" character
{5} # repeat previous five times
(.) # select into match group (the $1) next character
/ # end regex
So, finally, I can use the rindex to perform such a split:
perl -e '
$aa="aa{bb{dccd" ;
$ri = rindex($aa, "{") ;
$aa =~ /.{$ri}(.)/ && print "--${1}--\n";
@res = split(/^.{$ri}(.)/, $aa);
print join("; ", @res) . "\n"; '
--{--
; {; dccd
.. but given that also requires some capturing at start, so here are other variants:
@res = split(/^(.{$ri})(.)/, $aa);
--{--
; aa{bb; {; dccd
@res = split(/^(.{$ri})./, $aa);
--{--
; aa{bb; dccd
... which both would work for me - except I have a blank as first item, which I'd like to get rid of in one pass (without calling extra splice), but don't know how to :)
Cheers!
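One possible way to avoid the blank first item, offered only as a sketch: use a plain match in list context instead of split. A match returns just its capture groups, so there is no empty leading field to remove.

```perl
use strict;
use warnings;

my $aa = 'aa{bb{dccd';
my $ri = rindex($aa, '{');

# Captures: the $ri characters before the index, the character at the index, the rest.
my @res = $aa =~ /^(.{$ri})(.)(.*)/s;

print join('; ', @res), "\n";    # aa{bb; {; dccd
```

Removing the parentheses around the middle `.` drops the brace from the result while still consuming it.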