Regular expression Capture and Backreference - regex

Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the X letters that follow each number (X being that number). I also want to capture the complete substring, including the plus sign and the number.
i.e. the captures should return:
+4ACCG
+12CAAGTACTACCG
etc.
and :
ACCG
CAAGTACTACCG
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing ?

You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2.
Another problem in your regex is that the quantifier is outside your third capturing group. That way only the last letter ends up in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also drop the duplicated upper and lower case letters from your class by using the i modifier to match case-insensitively:
/(\+\d+([ATGCN]+))/gi

This loop works because the \G assertion tells the regex engine to resume matching exactly where the previous match (the digits) ended.
use feature 'say';
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
while (/(\d+)/g) {
    my $dig = $1;
    # \G anchors at pos($_), i.e. right after the digits just matched
    /\G([TAGCN]{$dig})/i;
    say $1;
}
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
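A minimal sketch of the same two-step idea, wrapped in a helper that returns both captures the question asks for (the full "+N..." substring and the letters alone). The name extract_runs and the shortened test string are assumptions made for illustration; the inner match uses /gc so that pos() advances on success and is left alone on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [full, bases] pairs,
# e.g. ['+4ACCG', 'ACCG'].
sub extract_runs {
    my ($str) = @_;
    my @runs;
    # Find each "+<count>" marker; /g leaves pos($str) just after the digits.
    while ($str =~ /\+(\d+)/g) {
        my $count = $1;
        # \G anchors at pos($str); $count is interpolated into the quantifier.
        if ($str =~ /\G([ACGTN]{$count})/gci) {
            push @runs, ["+$count$1", $1];
        }
    }
    return @runs;
}

my $dna = 'T+4ACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCG';
print "$_->[0] $_->[1]\n" for extract_runs($dna);
# +4ACCG ACCG
# +12CAAGTACTACCG CAAGTACTACCG
# +4ACCG ACCG
# +6CTACCG CTACCG
```

Interpolating the count into a second pattern sidesteps the limitation that a backreference cannot appear inside a quantifier.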

my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
    my($bases) = $seq =~ /([^\d]+)/;
}

Related

How to do a look behind in perl regex

I am trying to do a look behind using regex.
What I have tried seems to work but nothing is being captured.
my $names="Frank_J_Smith_1980-01-05.doc";
if($names =~ /(?![0-9]{4}-[0-9]{2}-[0-9]{2})/)
{
print("$1");
}
I wrote a match statement using the same code and it matches.
if($names =~ /.*_(?![0-9]{4}-[0-9]{2}-[0-9]{2})/)
{
print("$1");
}
I am expecting to see Frank_J_Smith but I am getting nothing. It does enter the if block, so the date is found, but the output is empty.
What you probably meant here is a lookahead, not a lookbehind:
if($names =~ /(.*)(?=_\d{4}-\d{2}-\d{2})/)
{
print("$1");
}
The expression inside (?=...) is a positive lookahead. It asserts that an underscore followed by a date string lies ahead of the current position, without consuming it.
Also note that (.*) is a capturing group, which you need in order to use $1 later.
Without a capturing group, you can use:
if($names =~ /.*(?=_\d{4}-\d{2}-\d{2})/)
{
print("$&\n");
}
Here $& holds the full string matched by the regex.
Otherwise you can just use a substitution:
$names =~ s/_\d{4}-\d{2}-\d{2}\..*//;
print("$names");
Another way is to use a negative lookahead.
if($names =~ /((?:(?!_\d).)*)/) {
    print $1;
}
This expression captures everything up to the first underscore that is followed by a digit.
The (?!...) construct is the negative lookahead.
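Since the question title asks about a lookbehind, here is a small sketch that puts a lookahead and an actual lookbehind side by side on the same file name. The variable names $before and $date are illustrative only.

```perl
use strict;
use warnings;

my $names = 'Frank_J_Smith_1980-01-05.doc';

# Positive lookahead: capture everything before "_YYYY-MM-DD".
my ($before) = $names =~ /^(.*?)(?=_\d{4}-\d{2}-\d{2})/;

# Positive lookbehind: capture a date that is preceded by an underscore.
my ($date) = $names =~ /(?<=_)(\d{4}-\d{2}-\d{2})/;

print "$before\n";   # Frank_J_Smith
print "$date\n";     # 1980-01-05
```

In both cases the lookaround only asserts what surrounds the match; the text that ends up in the variables comes from the capturing group.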

Non-Capturing and Capturing Groups - The right way

I'm trying to match a list of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using the regex above, I could only match the last word, "bird".
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!
First, let's talk about capturing and non-capturing groups:
(?:...) is the non-capturing version: the pattern must match, but the matched text is not stored.
(...) is the capturing version: the matched text is stored so you can use it afterwards.
So:
(?:pets:) means you are searching for "pets:" but do not want to capture it. After that point you WANT to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); which searches for "pets:" (without capturing it) and stops at the first ";" (which is not captured either).
Result is :
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
Since you want to have each pet in a separate match and you are using PCRE \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st Alternative (?:pets:) to find the start of the pattern
2nd Alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts position at the end of the previous match or the start of the string for the first match
Negative lookahead (?!^) asserts that the match does not begin at the start of the string
(\w+) matches a pet name
Non-capturing group (?:[,;]|$) is used as a delimiter: it matches a single character from the list ,; or, with $, asserts position at the end of the string
Perl Code Sample:
use strict;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
    if ($1 ne '') {
        #print "$1\n";
        push @result, $1;
    }
}
print Dumper(\@result);
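For comparison, a sketch of a simpler route that avoids \G: capture the whole list once, then split it. It relies on the fact that a repeated capturing group such as (\w+[,|;])+ keeps only its last iteration, which is why the original attempt ended up with just the final pet. The array name @pets is an assumption for illustration.

```perl
use strict;
use warnings;

my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';

# Capture everything between "pets:" and the next ";" (or end of string),
# then split the captured list on commas.
my @pets;
if ($str =~ /(?:^|;)pets:([^;]*)/) {
    @pets = split /,/, $1;
}

print join(' ', @pets), "\n";   # cat dog bird
```

This gives one array element per pet instead of one regex match per pet, which is often enough when the list only needs to be processed in Perl afterwards.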

Replace a given substring before the digits

In a string like /p20 (it can be any number) I want to replace /p with /pag- but keep the number, giving /pag-20.
This is what I tried:
preg_replace('/\/p+[0-9]/', '/pag-', $string);
but the result is /pag-0
Use a capturing group and a backreference:
$string = "/p20";
echo preg_replace('~/p([0-9])~', '/pag-$1', $string);
Here, /p matches a literal substring and ([0-9]) matches and captures a single digit into Group 1, which the replacement pattern refers to with the $1 backreference. Any remaining digits are not part of the match, so they stay in place.
Alternatively, you may use a lookahead-based solution:
preg_replace('~/p(?=[0-9])~', '/pag-', $string);
Here, no backreference is necessary as the (?=[0-9]) positive lookahead does not consume text it matches, i.e. it does not add the text to the match value.
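The thread above is PHP, but the same two approaches carry over to Perl's s/// almost unchanged. The following is only a sketch for comparison with the other Perl examples on this page; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = '/p20';

# Capture group: keep the digits via $1 in the replacement.
(my $with_group = $string) =~ s{/p(\d+)}{/pag-$1};

# Lookahead: only "/p" is consumed, the digits stay where they are.
(my $with_lookahead = $string) =~ s{/p(?=\d)}{/pag-};

print "$with_group\n";       # /pag-20
print "$with_lookahead\n";   # /pag-20
```

Using s{...}{...} delimiters avoids having to escape the forward slash in the pattern.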

Perl regular expression to retrieve the first digit

I have a string with the value Validation_File_2_3.45.2017.csv.
How do I extract the first digit which is 2 in this case using a regular expression?
I have tried the expression ($Filedigit) = ($Filename =~ m/^[0-9]/g) but it didn't work.
In a comment, you said you tried this:
($Filedigit)= ($Filename =~ m/^[0-9]/g);
A couple of things. You should always check that a match is successful before continuing on with your script, specifically when trying to capture. Next, ^ anchors the match at the beginning of the string, so the pattern only matches if the very first character is a digit 0-9. This won't match unless you had a filename such as 2_blah.csv. Also, you're not actually capturing anything. In list context, a successful match without capture groups returns 1 when /g is absent (signifying that a match happened), and with /g it returns the text matched by the whole pattern, so put parentheses around the part you want.
Here's an example that does what you want:
use warnings;
use strict;
my $str = 'Validation_File_2_3.45.2017.csv';
# confirm there's a match
if (my ($num) = $str =~ /^.*?(\d+)/){
    print "$num\n";
}
else {
    print "no match\n";
}
Explanation of the regex:
^ - start from beginning of string
.*? - anything, non-greedy
( - begin capture
\d+ - one or more contiguous digit chars
) - end capture
So, it starts from the beginning of the string, throws away anything before the first set of contiguous digits and captures them and puts that into the variable.
See perlretut and perlre.
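A short sketch of the difference a capture group makes when a match is assigned in list context, using the file name from the question. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $file = 'Validation_File_2_3.45.2017.csv';

# No capture group, no /g: list context yields (1) on success.
my ($no_capture) = $file =~ /\d/;

# With a capture group: list context yields the captured text.
my ($captured) = $file =~ /(\d)/;

print "$no_capture\n";   # 1
print "$captured\n";     # 2
```

Because the regex engine scans left to right, an unanchored (\d) already finds the first digit in the string.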

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The regex below seems to match the comma as well.
Can someone explain why?
I would like to match one or more digits or dots, terminated by a comma, for example:
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints the comma too:
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined in perlop.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around the comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just exclude the comma altogether since it isn't part of what you want to capture, it won't make a difference in this case. However, the pattern as it is puts the comma instead of the number into capture group 1, and then doesn't even reference by capture group, returning the entire match instead.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
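To tie the answers above together, here is a sketch that runs the three variants discussed (parens around the comma, parens around the number, and a lookahead with no parens) in list context and stores what each returns. The array names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = '241.000,00';

# Parens around the comma: list context returns the captured comma.
my @groups  = $text =~ /[0-9.]+(,)/g;

# Parens around the number: list context returns the number.
my @numbers = $text =~ /([0-9.]+),/g;

# No parens, lookahead for the comma: list context returns whole matches.
my @whole   = $text =~ /[0-9.]+(?=,)/g;

print "groups:  @groups\n";    # groups:  ,
print "numbers: @numbers\n";   # numbers: 241.000
print "whole:   @whole\n";     # whole:   241.000
```

The trailing "00" never shows up because it is not followed by a comma, so none of the three patterns can match it.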