How to do a look behind in Perl regex

I am trying to do a look behind using regex.
What I have tried seems to work but nothing is being captured.
my $names="Frank_J_Smith_1980-01-05.doc";
if($names =~ /(?![0-9]{4}-[0-9]{2}-[0-9]{2})/)
{
print("$1");
}
I wrote a match statement using the same code and it matches.
if($names =~ /.*_(?![0-9]{4}-[0-9]{2}-[0-9]{2})/)
{
print("$1");
}
I am expecting to see Frank_J_Smith but I am getting nothing. It does enter the if block, so the date is found, but the output is empty.

What you probably meant here is a lookahead not a lookbehind:
if($names =~ /(.*)(?=_\d{4}-\d{2}-\d{2})/)
{
print("$1");
}
The expression inside (?=...) is a positive lookahead: it asserts that an underscore followed by a date string comes right after the current position, without consuming it.
Also note that (.*) is a capturing group, which you need in order to use $1 later.
Without a capturing group, you can use:
if($names =~ /.*(?=_\d{4}-\d{2}-\d{2})/)
{
print("$&\n");
}
Here $& holds the full string matched by the regex.
Otherwise you can just use a substitution:
$names =~ s/_\d{4}-\d{2}-\d{2}\..*//;
print("$names");
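The three approaches above can be compared side by side. This is a minimal sketch, assuming the sample file name from the question; the sub names are hypothetical and only there for illustration, and the /r flag on the substitution assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# 1. Capturing group plus positive lookahead: $1 is the text before "_<date>".
sub prefix_by_capture {
    my ($s) = @_;
    return $s =~ /(.*)(?=_\d{4}-\d{2}-\d{2})/ ? $1 : undef;
}

# 2. No capturing group: $& is the whole match.
sub prefix_by_whole_match {
    my ($s) = @_;
    return $s =~ /.*(?=_\d{4}-\d{2}-\d{2})/ ? $& : undef;
}

# 3. Substitution: /r returns a modified copy instead of changing $s.
sub prefix_by_substitution {
    my ($s) = @_;
    return $s =~ s/_\d{4}-\d{2}-\d{2}\..*//r;
}

my $names = "Frank_J_Smith_1980-01-05.doc";
print prefix_by_capture($names), "\n";        # Frank_J_Smith
print prefix_by_whole_match($names), "\n";    # Frank_J_Smith
print prefix_by_substitution($names), "\n";   # Frank_J_Smith
```

Note that the first two return undef when no date is present, while the substitution returns the input unchanged.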

Another way is to use a negative lookahead.
if($names =~ /((?:(?!_\d).)*)/) {
print $1;
}
This expression captures everything up to the first underscore that is followed by a digit.
The (?!...) construct is the negative lookahead.
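One thing worth checking with this pattern is where it stops. The sketch below wraps it in a hypothetical helper and runs it on the question's file name and on a made-up second name (an assumption, not from the question) in which a digit follows an earlier underscore.

```perl
use strict;
use warnings;

# Consume one character at a time, as long as the next two
# characters are not an underscore followed by a digit.
sub before_underscore_digit {
    my ($s) = @_;
    my ($prefix) = $s =~ /((?:(?!_\d).)*)/;
    return $prefix;
}

print before_underscore_digit("Frank_J_Smith_1980-01-05.doc"), "\n";    # Frank_J_Smith
print before_underscore_digit("Frank_2nd_Smith_1980-01-05.doc"), "\n";  # Frank
```

The second call stops at the first "_2", so this form is only suitable when no name part begins with a digit.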

Related

regex negative lookbehind matching when expected not to

Can someone help me understand why the following regex matches when I would expect it not to?
String to check against
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
regex
(?<!Transfer\/)\w*PINPUK.*(?:csv|txt)$
I was expecting this to not match as the string Transfer/ appears before 0 or more word chars followed by the string PINPUK. If I change the pattern from \w* to \w{6} to explicitly match 6 word chars this correctly returns no match.
Can someone help me understand why using the zero-or-more quantifier on my "word" character makes the regex match?
Your regex pattern (?<!Transfer/)\w*PINPUK.*(?:csv|txt)$ is looking for \w*PINPUK not immediately preceded by Transfer/.
Given the string
/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
the regex engine will first try to match \w*PINPUK as CBD99_PINPUK.
But that is preceded by Transfer/, so the look-behind rejects that starting position; the engine moves the start one character forward and finds BD99_PINPUK.
That is preceded by C, which isn't Transfer/, so the match is successful.
As for a fix, just put the slash outside the look-behind:
(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$
That forces the \w* to begin right after the slash, and the pattern now correctly fails.
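To see this behaviour directly, here is a small sketch that adds a capture group around \w*PINPUK in the original pattern (an addition for illustration only) and then tries the pattern with the slash moved outside. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Returns the text matched by \w*PINPUK under the original pattern, or undef.
sub original_start {
    my ($s) = @_;
    return $s =~ m{((?<!Transfer/)\w*PINPUK).*(?:csv|txt)$} ? $1 : undef;
}

# 1 if the pattern with the slash outside the look-behind matches, else 0.
sub fixed_matches {
    my ($s) = @_;
    return $s =~ m{(?<!Transfer)/\w*PINPUK.*(?:csv|txt)$} ? 1 : 0;
}

my $dir  = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak';
my $file = 'CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';

print original_start("$dir/ROI_Med_Transfer/$file"), "\n";  # BD99_PINPUK
print fixed_matches("$dir/ROI_Med_Transfer/$file"), "\n";   # 0
print fixed_matches("$dir/ROI_Med/$file"), "\n";            # 1
```

The first line of output shows the match starting one character late, at B, which is exactly the position where the look-behind no longer sees Transfer/.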
Borodin has given an excellent explanation of why this doesn't work and a solution for this case (move a /). Sometimes something simple like that isn't possible, though, so here I'll explain an alternative workaround that might be useful.
Things will match as you expect if you move the \w* inside the negative look-behind. Like so:
(?<!Transfer\/\w*)PINPUK.*(?:csv|txt)$
Unfortunately Perl doesn't allow this: a look-behind cannot contain an unbounded quantifier such as \w*. But there is still a way to do it with a single match: match in reverse.
^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)
This uses a variable-length negative look-ahead, something Perl does allow. Putting all this together in a script, we get
use strict;
use warnings;
use feature 'say';
# This path has Transfer/ right before the file name, so it should be rejected.
my $with_transfer = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $with_transfer";
# scalar reverse() reverses the characters of the string.
if ( scalar reverse($with_transfer) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
say '';
# This path has no Transfer/ before the file name, so it should be accepted.
my $without_transfer = '/opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt';
say "Trying $without_transfer";
if ( scalar reverse($without_transfer) =~ /^(?:vsc|txt).*KUPNIP(?!\w*\/refsnarT)/ ) {
say 'It matched';
} else {
say 'No match';
}
Which outputs
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med_Transfer/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
No match
Trying /opt/lnpsite/ni00/flat/tmp/Med_Local_Bak/ROI_Med/CBD99_PINPUK_14934_09_02_2017_12_07_36.txt
It matched

Perl regular expression to retrieve the first digit

I have a string with the value Validation_File_2_3.45.2017.csv.
How do I extract the first digit which is 2 in this case using a regular expression?
I have tried the expression ($Filedigit) = ($Filename =~ m/^[0-9]/g) but it didn't work
In a comment, you said you tried this:
($Filedigit)= ($Filename =~ m/^[0-9]/g);
A couple of things. You should always check that a match is successful before continuing on with your script, especially when trying to capture. Next, ^ anchors the pattern at the beginning of the string, so the regex looks for a single digit 0-9 right at the start. This won't match unless you have a filename such as 2_blah.csv. Also note that the pattern has no capture group: in list context with /g, a successful match returns the matched text itself, whereas without /g it would simply return 1 (signifying that a match happened).
Here's an example that does what you want:
use warnings;
use strict;
my $str = 'Validation_File_2_3.45.2017.csv';
# confirm there's a match
if (my ($num) = $str =~ /^.*?(\d+)/){
print "$num\n";
}
else {
print "no match\n";
}
Explanation of the regex:
^ - start from beginning of string
.*? - anything, non-greedy
( - begin capture
\d+ - one or more contiguous digit chars
) - end capture
So, it starts from the beginning of the string, skips everything before the first run of contiguous digits, captures that run and puts it into the variable.
See perlretut and perlre.
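What a match returns in list context depends on both the /g flag and the presence of capture groups. The following sketch illustrates this with the 2_blah.csv example name used above; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $name = '2_blah.csv';

my ($plain)    = $name =~ /^[0-9]/;     # no /g, no parens: the list (1)
my ($global)   = $name =~ /^[0-9]/g;    # /g, no parens: the matched text
my ($captured) = $name =~ /^([0-9])/;   # parens: the captured text

print "plain=$plain global=$global captured=$captured\n";
# plain=1 global=2 captured=2

# To take only the first single digit anywhere in the string,
# an unanchored one-digit capture is enough.
my ($first) = 'Validation_File_2_3.45.2017.csv' =~ /(\d)/;
print "first=$first\n";   # first=2
```

Using \d+ instead of \d, as in the answer's example, would return the whole first run of digits, which matters for names such as File_23_1.csv.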

Lookbehind does not work as expected

I am trying to understand the lookbehind.
This example I am trying doesn't work as I expected. I wanted to try to form a regex that would match John but not John.
The following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*/) {
print "Matches!\n";
}
'
Matches!
matches up to and including . of course. The problem is the following:
$ perl -e '
my $var = "John.";
if( $var =~ m/J*(?<![.])/) {
print "Matches!\n";
}
'
Matches!
For the latter I expected that the regex would match John. consuming >.< (the period)
Then at the next position it would look behind and realize that it consumed a period (.) and would reject the match.
Is my understanding wrong? What am I messing up here?
Update:
Same result also for my $var = "John. ";
Update 2:
My question is not about how to match only John and not John.
But to understand how lookbehind works and if it is not supposed to work in this case why.
The * is a quantification operator, not a placeholder. So A* means zero or more A characters. Without any further context, this always matches, e.g. "foo" =~ /J*/ is true.
What you intended to write was /J.*/ which does what you've actually described.
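To make the "zero or more" point concrete, the sketch below prints what each pattern actually matched. The helper first_match is hypothetical; it relies on $& (the matched text) and $-[0] (the offset where the match started).

```perl
use strict;
use warnings;

# Describe the first match of $re in $s: the matched text and its start offset.
sub first_match {
    my ($s, $re) = @_;
    return $s =~ $re ? "<$&> at offset " . $-[0] : "no match";
}

print first_match("foo",   qr/J*/),          "\n";  # <> at offset 0
print first_match("John.", qr/J*/),          "\n";  # <J> at offset 0
print first_match("John.", qr/J*(?<![.])/),  "\n";  # <J> at offset 0
print first_match("John.", qr/J.*/),         "\n";  # <John.> at offset 0
```

In the third case the look-behind is tested right after the J, where the preceding character is J rather than a period, so it succeeds without ever reaching the period.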
Now let's look at what happens when we do "John." =~ /(J.*(?<![.]))/:
The regex engine sees J, which matches.
The next pattern is .*, which greedily matches the rest of the string: ohn followed by the period.
Next the assertion (?<![.]) is tested, which fails because the preceding character is a period.
The regex engine therefore backtracks.
We try .* again, but this time only match ohn.
Next the assertion (?<![.]) is tested, which succeeds, so the overall match is John.
In the above regex, I enclosed the pattern in a capture group, which we can now read out:
$ perl -E'"John." =~ /(J.*(?<![.]))/ and say "<$1>" or say "No match"'
<John>
It is often more efficient to use a character class instead of assertions and .* quantifications, so that we can avoid backtracking:
/J[^.]*/
However, this is not strictly equivalent to the above regexes.
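A short sketch of where the two forms differ, assuming a made-up input with a period in the middle (John.Smith. is not from the question); the sub names are hypothetical.

```perl
use strict;
use warnings;

# Greedy .* followed by a look-behind: backs off only from a trailing period.
sub with_lookbehind {
    my ($s) = @_;
    return $s =~ /(J.*(?<![.]))/ ? $1 : undef;
}

# Negated character class: stops at the first period.
sub with_char_class {
    my ($s) = @_;
    return $s =~ /(J[^.]*)/ ? $1 : undef;
}

print with_lookbehind("John."), "\n";        # John
print with_char_class("John."), "\n";        # John
print with_lookbehind("John.Smith."), "\n";  # John.Smith
print with_char_class("John.Smith."), "\n";  # John
```

So the two agree when the only period is at the end, and diverge as soon as a period appears earlier in the string.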
This regexp:
/John(?![.])/
will match John but not John. It uses a negative look-ahead assertion (rather than look-behind).
If you want to match full names other than 'John', you'll need to be a bit more specific about what you do and don't want to allow in the match, as putting J* will match zero or more J's.
Edit: Obviously I misread the * per @amon's post. Look-ahead vs. look-behind still applies.

Regular expression Capture and Backreference

Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the X characters that follow each number (X being that number). I also want to capture the complete token, that is, the +, the number and those characters.
ie the capture should return:
+4ACCG
+12AAGTACTACCGT
etc.
and :
ACCG
AAGTACTACCGT
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing?
You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2
Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also remove the duplicate upper- and lower-case letters from your class by using the i modifier to match case-insensitively:
/(\+\d+([ATGCN]+))/gi
This loop works because the \G assertion anchors the inner match at the position where the previous /g match ended, that is, immediately after the digit(s).
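use feature 'say';
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
# Find each number, then take exactly that many letters right after it.
while (/(\d+)/g) {
my $dig = $1;
# \G anchors this match at pos(), i.e. right after the digits just matched.
say $1 if /\G([TAGCN]{$dig})/i;
}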
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
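The count-based loop and the simpler [ATGCN]+ pattern suggested earlier do not return the same letters, because the letter that follows each counted run also belongs to the character class. Here is a hedged sketch comparing the two on a shortened sample (an assumption: only the first part of the question's string is used, and the sub names are hypothetical).

```perl
use strict;
use warnings;

# Letters after "+<number>", running until the next non-letter.
sub greedy_runs {
    my ($s) = @_;
    my @out;
    while ($s =~ /\+\d+([ATGCN]+)/gi) {
        push @out, $1;
    }
    return @out;
}

# Exactly <number> letters after "+<number>".
sub counted_runs {
    my ($s) = @_;
    my @out;
    while ($s =~ /\+(\d+)/g) {
        my $n = $1;
        # /gc: continue from pos() and keep pos() even if this match fails.
        if ($s =~ /\G([ATGCN]{$n})/gci) {
            push @out, $1;
        }
    }
    return @out;
}

my $string = 'T+4ACCGT+12CAAGTACTACCGT+4ACCGA';
print join(' ', greedy_runs($string)),  "\n";  # ACCGT CAAGTACTACCGT ACCGA
print join(' ', counted_runs($string)), "\n";  # ACCG CAAGTACTACCG ACCG
```

If the number is meant to give the exact length of each run, the counted form is the one that respects it.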
my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
my($bases) = $seq =~ /([^\d]+)/;
}

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match one or more digits or points, terminated by a comma:
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just exclude the comma altogether since it isn't part of what you want to capture, it won't make a difference in this case. However, the pattern as it is puts the comma instead of the number into capture group 1, and then doesn't even reference by capture group, returning the entire match instead.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
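The variants discussed in this thread can be run side by side on the question's input. This is a minimal sketch; the array names are only for illustration.

```perl
use strict;
use warnings;

my $text = "241.000,00";

my @comma     = $text =~ /[0-9.]+(,)/g;    # group around the comma
my @number    = $text =~ /([0-9.]+),/g;    # group around the number
my @lookahead = $text =~ /[0-9.]+(?=,)/g;  # no group, comma not consumed
my @whole     = $text =~ /[0-9.]+,/g;      # no group, comma consumed

print "comma:     @comma\n";      # comma:     ,
print "number:    @number\n";     # number:    241.000
print "lookahead: @lookahead\n";  # lookahead: 241.000
print "whole:     @whole\n";      # whole:     241.000,
```

With a group present, list context yields the group contents; without one, it yields the whole match, which is why the lookahead version needs no parentheses.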