Regex range operator - regex

I have a string '11 15 '. W/ a Regex I then compare the values within that string, in this case 11 and 15 (could be any number of digits but I'll keep it simple with 2 2-digit numbers).
For each of those numbers, I then see if it matches any of the numbers I want; in this case I want to see if the number is '12', '13', or '14'. If it is, then I change the value of '$m':
my $string = '11 15 ';
while ( $string =~ /([0-9]{1,})\s+/ig ) {
my $m = $1;
print $m . ".....";
$m = 'change value' if $m =~ /[12...14]{2,}/g;
print $m . "\n";
}
Produces:
11.....change value
15.....15
'15' stays the same, as it should. But '11' changes. What am I doing wrong?

[12...14] matches against "1", "2", ".", and "4". "11" Matches that; "15" doesn't. If you're just matching against numbers, you shouldn't be using regular expressions. Change your line to the following:
$m = 'change value' if $m ~~ [11..14];
Or, if unable to guarantee perl >= v5.10:
$m = 'change value' if grep { $m == $_ } 11..14;

You've misunderstood the regular expression. Where you've written [12...14]{2,}, this means "match 2 or more of the characters 1 or 2 or dot or dot or dot or dot or 1 or 4".
Try something like:
$m='change value' if $m=~/(\d{2,})/ and $1 >= 12 and $1 <= 14;
In a substitution operation, this could be written as:
$m =~ s/(\d{2,})/ $1 >= 12 && $1 <= 14 ? 'change value' : $1/ge;
That is, capture 2 or more digits and then test what you have captured to see if they're what you want to change by using perl code in the replacement section of the substitution. The e modifier indicates that Perl should evaluate the replacement as Perl code.

Let's rewrite your code a bit:
my $string = '11 15 ';
while ( $string =~ /(\d+)/g ) {
I've changed your while statement's regular expression. You can use \d+ to represent one or more digits, and that's easier to understand than [0-9]{1,}. You also (since a space won't match \d) don't need the last space on the end of your string.
Let's look at the rest of the code:
my $string = '11 15';
while ( $string =~ /(\d+)/g ) {
my $match = $1;
print "$match.....";
if ($match >= 12 and $match <= 14) { #if ($match ~~ [12..14]) for Perl > 5.10
print 'change value\n';
}
else {
print "$match\n";
}
}
You can't use a regular expression the way you are to test for range.
Instead, use the regular range test of
if ($match >= 12 and $match <= 14)
or the newer group test:
if ($match ~~ [12..14]) #Note only two dots and not three!
That last one only works in newer versions of Perl like 5.12 I have on my Mac, and 5.14 I have on my Linux box, but not the Perl 5.8 I have on my Solaris box).
A few tips:
Use indents and spaces. It makes your code more readable.
Use descriptive names for variables. Instead of $m, I used $match.
Don't use the appended if statements. The appended if is harder to spot, so you might miss something important, and it makes your code harder to update. It can be used if the statement itself is clear and simple, and it improves readability. The last is a bit subjective, but you'll commonly see appended if statements in things like return if not -f $file;.
Keep variables single purpose. In this case, instead of changing the value of $match, I used an if/else statement. Imagine if your code was a bit more complex, and someone had to add in a new feature. They see the $match variable and think this is what they need. Unfortunately, you changed what $match is. It's now a value to be printed out and not the string match. It might take the person who changed your program quite a while to figure out what happened to the value of $match and why it has bee mysteriously set to changed value.
In the print statement, you can include variables inside of double quotes. This is very different from almost all other languages. This is because Perl variable use sigils to mark variable names. It usually makes it easier to read if your combine variables and other strings in a single string.
For example:
print "The range of possible values are $low to $high\n";
vs.
print "The range of possible values are " . $low . " to " . $high . "\n";
Notice how in the second example, I had to be careful of spaces inside the quotes while in the first example, the required spaces came rather naturally. Imagine having to change that statement in a later version of the program. Which would be easier to maintain?

Related

extract string between two dots

I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in perl?
I tried the following expression but it assigns 1 to sub:
#perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0#
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
#perleval $vars{sub} = $vars{string} ; 0#
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
Try:
/\.([^\.]+)\./
. has a special meaning and would need to be escaped. Then you would want to capture the values between the dots, so use a negative character class like ([^\.]+) meaning at least one non-dot. if you use (.*) you will get:
word1.stuff1.stuff2.stuff3.word2 to result in:
stuff1.stuff2.stuff3
But maybe you want that?
Here is my little example, I do find the perl one liners a little harder to read at times so I break it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you search a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.

Match a cycle of letters in a string in Perl

Let's say I have a string 'abc'. How do I match all 3 or more occurrences of 'abc' and its cycles ('bca', 'cab') in a large string.
Right now I am using individual entries as regex to match, but a) It is taking too long because the string is very large, and b) I'm getting the same regions in subsequent matches. For example, if my input is:
dabcabcabcabgyklagkbcabcabcahkgljla
^-------^ ^-------^
I want my output to be two matches:
1. abcabcabc position 2
2. bcabcabca position 20
Right now I'm getting 4 lines of output:
1. abcabcabc position 2
2. bcabcabca position 3
3. cabcabcab position 4
4. bcabcabca position 20
I hope I explained my problem. I got the desired output in another complicated way by doing a multi regex matching using all possible combinations in a single regex like this:
while($str =~ /(abc){3,}|(bca){3,}|cab{3,}/g {
print "$1\tposition $-[0]\n";
}
But it was a serious performance hit, and given the size of my input, it is taking forever to run. Please help me with a more efficient algorithm. Really sorry if this was asked earlier, but I couldn't find any page that helped me.
Thanks in advance
I suggest you use just /(abc){2,}/ preceded by nothing, c, or bc and followed by nothing, a, or ab, so
/ ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg
The idea is to break down any sequence, like bcabcabcabcabca into a number of abcs, possibly preceded bt c or (here) bc and possibly followed by (here) a or ab, like this.
bc abcabcabcabc a
so that the regex engine doesn't have to check for three diffrent strings at every point.
Doing it that way may find sequences up to three characters shorter than you require, but it should be faster and you can add an additional filter on length. Like this
use strict;
use warnings;
my $seq = 'dabcabcabcabgyklagkbcabcabcahkgljla';
while ($seq =~ / ( (?:b?c)? (?:abc){2,} (?:ab?)? ) /xg) {
next unless length $1 >= 9;
my $subseq = $1;
chop $subseq while length($subseq) % 3;
print "$subseq\tposition $-[0]\n";
}
output
abcabcabc position 1
bcabcabca position 19
I've used the data you posted and have found a variation of my original solution that runs about four to five times faster than your original. Unfortunately the sequence you posted is only 225KB and there is only a single occurrence of one of the SSRs in it, so I don't know how representative it is.
Essentially, instead looking for a sequence of four rotations of the pattern, it looks only for repetitions of the core SSR, with an optional prefix and suffix that lets the overall sequence start anywhere within the SSR, like this
/ (?:AAT|AT|T|) (?:AAAT){3,} (:?AAA|AA|A|) /x
All of this regex is built automatically.
use strict;
use warnings;
use autodie;
open my $fh, '<', 'chr1.txt';
my $seq = <$fh>;
close $fh;
my #ssrs = qw( AAAT AAAC AACC AACG );
retrieve_ssr('Sample', $seq, \#ssrs);
sub retrieve_ssr {
my ($name, $seq, $ssr_list) = #_;
for my $ssr (#$ssr_list) {
my $len = length $ssr;
my $n = $len == 5 ? 3 : 12 / $len;
$n = 1;
my $prefix = join '', map { substr($ssr, -$_) . '|' } 1 .. $len-1;
my $suffix = join '', map { substr($ssr, 0, $_) . '|' } reverse 1 .. $len-1;
my $re = qr/ (?:$prefix) (?:$ssr){$n,} (?:$suffix) /x;
while ($seq =~ /$re/g) {
my $start = $-[0] + 1;
my $length = $+[0] - $-[0];
my $excess = $length % $len;
pos($seq) -= $excess;
$length -= $excess;
my $seq = substr $seq, $-[0], $length;
print "$start\t$+[0]\t$length\t$seq\n";
}
}
}
output
23738 23752 12 TAAATAAATAAA
It strikes me that you don't need to have 3 separate regexes, you really just need one regex like this:
perl -ne 'print "$1\tposition $-[0]\n" while /(b?c?(abc){1,}a?b?)/g' mydata.txt
The idea is that the core pattern abc is matched as needed, and then you just need to account for the potential prefix of "b?c?" and a potential suffix of "a?b?" (if the prefix or suffix were longer then it would be matched by the main regex in the center).
As given this expression will find matches of 3 chars or longer, but you can obviously up the minimum length by changing the value inside {1,}
This solution does risk a few false positives in the prefix and suffix however, as it would match "babc", so you could run a 2nd slow search on the results for complete accuracy.

Regex greedyness REasking

I have this text $line = "config.txt.1", and I want to match it with regex and extract the number
part of it. I am using two versions:
$line = "config.txt.1";
(my $result) = $line =~ /(\d*).*/; #ver 1, matched, but returns nothing
(my $result) = $line =~ /(\d).*/; #ver 2, matched, returns 1
(my $result) = $line =~ /(\d+).*/; #ver 3, matched, returns 1
I think the * was sort of messing things around, I have been looking at this, but still
don't the greedy mechanism in the regex engine. If I start from left of the regex, and potentially there might be no digits in the text, so for ver 1, it will match too. But for
ver 3, it won't match. Can someone give me an explanation for why it is that and how
I should write for what I want? (potentially with a number, not necessarily single digit)
Edit
Requirement: potentially with a number, not necessarily single digit, and match can not capture anything, but should not fail
The output must be as follows (for the above example):
config.txt 1
The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to match as many digits at this position as possible (zero or more). Then, match as many non-newline characters as possible. Well, the match starts looking at the c of config. Ok, it matches zero digits.
You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
Use the literal '.' as a reference to match before the number:
#!/usr/bin/perl
use strict;
use warnings;
my #line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);
my (#capture1, #capture2);
foreach (#line){
my (#filematch) = ($_ =~ /(\w+\.\w+)/);
my (#numbermatch) = ($_ =~ /\w+\.\w+\.?(\d*)/);
my $numbermatch = $numbermatch[0] // $numbermatch[1];
push #capture1, #filematch;
push #capture2, #numbermatch;
}
print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;
Output:
config.txt
file.txt
config.txt 1
config.foo 2
config.txt 23
differentname.fsdfsdsdfasd 2444
Thanks guys, I think I figured out myself what I want:
my ($match) = $line =~ /\.(\d+)?/; #this will match and capture any digit
#number if there was one, and not fail
#if there wasn't one
To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/
perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched
You do not need .* at all. These two statements assign the exact same number:
my ($match1) = $str =~ /(\d+).*/;
my ($match1) = $str =~ /(\d+)/;
A regex by default matches partially, you do not need to add wildcards.
The reason your first match does not capture a number is because * can match zero times as well. And since it does not have to match your number, it does not. Which is why .* is actually detrimental in that regex. Unless something is truly optional, you should use + instead.

changing several expressions in one line in perl

I want to take a line containing several expressions of the same structure, containing 4 digit hexa numbers, and changing the number in that structure according to a hash table. I tried using this next peace of code:
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h$hash_point->{$1}/g){};
Where $hash_point is a pointer to the hash table.
But it tells me that I try to use an undefined value, when I tried running the fallowing code:
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h----/g){print $1," -> ",$hash_point->{$1},"\n";};
It changed all the wanted numbers to "----" but printed out the values only 2 times (there were much more changes).
Where is the problem?
This is what I used in the end:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".$hash_point->{$1}/ge;
and in order to account for numbers not in the hash I've added:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".((hash_point->{$1}) or ($1))/ge;
I also wanted to know what numbers don't appear at the hash:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".(($hash_point->{$1}) or (print "number $1 didn't change\n") &&($1))/ge;
and finaly, I wanted to be able to control whether the massage from the previous stage would be printed, I've added the use of $flag which in defined only if I want the massages to appear:
$line =~ s/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/"14'h".(($hash_point->{$1}) or (((defined($flag)) && (print "number $1 didn't change\n")) or ($1)))/ge;
Your regexp seems to work well for me except when hexa number is not present in the hash.
I tried:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
use Data::Dumper;
my $line = q!14'hab63xx14'hab88xx14'hab64xx14'hab65xx14'hcdef!;
my $hash_point = {
ab63 => 'ONE',
ab64 => 'TWO',
ab65 => 'THREE',
};
while ($line =~ s/14'h([0-9,a-f][0-9,a-f][0-9,a-f][0-9,a-f])/14'h$hash_point->{$1}/g){};
say $line;
This produces:
Use of uninitialized value in concatenation (.) or string at C:\tests\perl\test5.pl line 15.
Use of uninitialized value in concatenation (.) or string at C:\tests\perl\test5.pl line 15.
14'hONExx14'hxx14'hTWOxx14'hTHREExx14'h
The errors are for numbers ab88 and cdef that are not keys in the hash.
Just a small correction, but both of your regexes don't do what you think it does.
/[a-f,0-9]/
Matches any character from a to f, 0 to 9, and a comma. You are looking for
/[a-z0-9]/
Not that this is what is breaking your program (M42 probably got it right, but we can't be sure unless you show us the hash).
Also, apologies, not enough rep to actually answer to other posts.
EDIT:
Well, you go through a lot of hoops in that answer, so here's how I'd do it instead:
s/14'h\K(\p{AHex}{4})/if (defined($hash_point->{$1})) {
$hash_point->{$1};
} else {
say $1 if $flag;
$1;
}/ge
Mainly because chaining and's and &&'s and sosuch generally makes for fairly hard-to-understand code. All whitespace is optional, so squash it for the one-liner!

How do I remove all hyphens with a Perl regex?

I thought this would have done it...
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
$rowfetch = /[-]/gi;
printline($rowfetch);
But it seems that I'm missing a small yet critical piece of the regex syntax.
$rowfetch is always something along the lines of:
------S
-M-W---
--T-TF-
etc... to represent the days of the week a meeting happens
$rowfetch =~ s/-//gi
That's what you need for your second line there. You're just finding stuff, not actually changing it without the "s" prefix.
You also need to use the regex operator "=~" for this.
Here is what your code presently does:
# Assign 'rowfetch' to the value fetched from:
# The function 'GetCharValue' which is a method of:
# An Value in A Hash Identified by the key "Row" in:
# Either a Hash-Ref or a Blessed Hash-Ref
# Where 'GetCharValue' is given the parameter "meetdays"
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
# Assign $rowfetch to the number of times
# the default variable ( $_ ) matched the expression /[-]/
$rowfetch = /[-]/gi;
# Print the number of times.
printline($rowfetch);
Which is equivalent to having written the following code:
$rowfetch = ( $_ =~ /[-]/ )
printline( $rowfetch );
The magic you are looking for is the
=~
Token instead of
=
The former is a Regex operator, and the latter is an assignment operator.
There are many different regex operators too:
if( $subject =~ m/expression/ ){
}
Will make the given codeblock execute only if $subject matches the given expression, and
$subject =~ s/foo/bar/gi
Replaces ( s/) all instances of "foo" with "bar", case-insentitively (/i), and repeating the replacement more than once(/g), on the variable $subject.
Using the tr operator is faster than using a s/// regex substitution.
$rowfetch =~ tr/-//d;
Benchmark:
use Benchmark qw(cmpthese);
my $s = 'foo-bar-baz-blee-goo-glab-blech';
cmpthese(-5, {
trd => sub { (my $a = $s) =~ tr/-//d },
sub => sub { (my $a = $s) =~ s/-//g },
});
Results on my system:
Rate sub trd
sub 300754/s -- -79%
trd 1429005/s 375% --
Off-topic, but without the hyphens, how will you know whether a "T" is Tuesday or Thursday?