Perl regular expression to retrieve the first digit - regex

I have a string with the value Validation_File_2_3.45.2017.csv.
How do I extract the first digit which is 2 in this case using a regular expression?
I have tried the expression ($Filedigit) = ($Filename =~ m/^[0-9]/g) but it didn't work

In a comment, you said you tried this:
($Filedigit)= ($Filename =~ m/^[0-9]/g);
A couple of things. You should always check that a match succeeded before continuing with your script, especially when trying to capture. Next, ^ anchors the match at the beginning of the string, so the pattern only looks for a single digit 0-9 at the very start. This won't match unless you have a filename such as 2_blah.csv. Finally, you're not actually capturing anything: without parentheses, a successful list-context match returns 1 (or, with the /g flag as here, the matched text itself), so add a capture group around the part you want.
Here's an example that does what you want:
use warnings;
use strict;
my $str = 'Validation_File_2_3.45.2017.csv';
# confirm there's a match
if (my ($num) = $str =~ /^.*?(\d+)/){
print "$num\n";
}
else {
print "no match\n";
}
Explanation of the regex:
^ - start from beginning of string
.*? - anything, non-greedy
( - begin capture
\d+ - one or more contiguous digit chars
) - end capture
So, it starts from the beginning of the string, throws away anything before the first set of contiguous digits and captures them and puts that into the variable.
See perlretut and perlre.
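As a rough illustration of the list-context return values discussed above, the sketch below compares three variants of the match. The helper names are made up for this example. Note that with /g and no capture group, Perl returns the matched text rather than 1.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the first element of the list
# produced by a list-context match, or undef if the match fails.

# No capture group, no /g: a successful match returns the list (1).
sub no_capture {
    my ($name) = @_;
    my ($result) = $name =~ m/^[0-9]/;
    return $result;
}

# No capture group, with /g: returns the matched substrings themselves.
sub no_capture_global {
    my ($name) = @_;
    my ($result) = $name =~ m/^[0-9]/g;
    return $result;
}

# Capture group and no ^ anchor: returns the first digit found anywhere.
sub first_digit {
    my ($name) = @_;
    my ($result) = $name =~ m/([0-9])/;
    return $result;
}

print first_digit('Validation_File_2_3.45.2017.csv'), "\n";   # prints 2
```

The anchored variants only succeed on names such as 2_blah.csv; for the filename in the question they return undef.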

Related

Perl: How to match last n digits of a string with n consecutive digits or more?

I use the following:
if ($content =~ /([0-9]{11})/) {
my $digits = $1;
}
to extract 11 consecutive digits from a string. However, it grabs the first 11 consecutive digits. How can I get it to extract the last 11 consecutive digits so that I would get 24555199361 from a string with hdjf95724555199361?
/([0-9]{11})/
means
/^.*?([0-9]{11})/s # Minimal lead that allows a match.
You get what you want by making the .* greedy.
/^.*([0-9]{11})/s # Maximal lead that allows a match.
If the digits appear at the very end of the string, you can also use the following:
/([0-9]{11})\z/
Whenever you want to match something at the end of a string, use the end-of-string anchor $ (which also matches just before a trailing newline).
$content =~ m/(\d{11})$/;
If that pattern is not at the very end, but you want to match the "last" occurrence of that pattern, you would first match "the entire string" with /.*/ and then backtrack to the final occurrence of the pattern. The /s flag permits the . metacharacter to match a line feed.
$content =~ m/.*(\d{11})/s;
See the Perl regexp tutorial for more information.
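A minimal sketch of the leftmost versus greedy-lead behaviour described in the answers above, using the string from the question. The helper names first_digits and last_digits are hypothetical, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: the leftmost run of $n consecutive digits.
sub first_digits {
    my ($content, $n) = @_;
    return $content =~ /([0-9]{$n})/ ? $1 : undef;
}

# Hypothetical helper: the last run of $n consecutive digits.
# The greedy .* first consumes the whole string, then backtracks
# just far enough for the capture group to match.
sub last_digits {
    my ($content, $n) = @_;
    return $content =~ /^.*([0-9]{$n})/s ? $1 : undef;
}

my $content = 'hdjf95724555199361';
print first_digits($content, 11), "\n";   # 95724555199
print last_digits($content, 11), "\n";    # 24555199361
```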

perl regex substring

$str="!bypass";
I need return string that only start with regex "!"
How can I return bypass ?
To match strings that start with a ! you need this pattern. The ^ is the anchor at the beginning of the string.
/^!/
If you want to capture the stuff after the !, you need this pattern. The parentheses () are a capture group. They tell Perl to grab everything between them and keep it. The . means any character (except a newline), and the + is a quantifier meaning one or more, as many as possible. So .+ means grab everything up to the end of the line.
/^!(.+)/
To apply it, do this.
$str =~ m/^!(.+)/;
And to get the "bypass" out of that pattern, use the $1 match variable that was assigned automatically by Perl with the m// operation.
print $1; # will print bypass
To make that conditional, it would be:
print $1 if $str =~ m/^!(.+)/;
The if here is in postfix notation, which lets you omit the block and the parentheses. It's the same as the following, but shorter and easier to read for single statements.
if ( $str =~ m/^!(.+)/ ) {
print $1;
}
If you want to permanently change $str to not have an exclamation mark at the beginning, you need to use a substitution instead.
$str =~ s/^!//;
The s/// is the substitution operator. It changes $str in place. The original value including the ! will be lost.
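The two approaches above can be put side by side in a small sketch. The sub names are invented for illustration; both work on a copy of the argument, so the caller's string is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the text after a leading "!",
# or return undef when the string does not start with "!".
sub text_after_bang {
    my ($str) = @_;
    return $str =~ /^!(.+)/ ? $1 : undef;
}

# Hypothetical helper: remove a leading "!" if present.
# $str is a copy of the argument, so s/// only changes the copy.
sub strip_bang {
    my ($str) = @_;
    $str =~ s/^!//;
    return $str;
}

print text_after_bang('!bypass'), "\n";   # bypass
print strip_bang('!bypass'), "\n";        # bypass
```

One difference worth noting: the capture version reports a failed match as undef, while the substitution version simply returns the string unchanged.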
Use ^!\K.+.
It works this way:
^! - Match initial ! (but this will soon change, see below).
\K - Keep - "forget" about what you have matched so far and set the starting point of the match here (after the !).
.+ - Match non-empty sequence of chars.
Due to \K, only the last part (.+) is actually matched.
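A short sketch of the \K variant, assuming Perl 5.10 or later (where \K is available). The sub name is hypothetical; it returns the overall match via $&, since there is no capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of the match that follows \K,
# or undef when the string does not start with "!".
sub match_after_bang {
    my ($str) = @_;
    # \K discards the "!" from the reported match, so $& holds only .+
    return $str =~ /^!\K.+/ ? $& : undef;
}

print match_after_bang('!bypass'), "\n";   # bypass
```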

Why is Perl lazy when regex matching with * against a group?

In perl, the * is usually greedy, unless you add a ? after it. When * is used against a group, however, the situation seems different. My question is "why". Consider this example:
my $text = 'f fjfj ff';
my (@matches) = $text =~ m/((?:fj)*)/;
print "@matches\n";
# --> ""
@matches = $text =~ m/((?:fj)+)/;
print "@matches\n";
# --> "fjfj"
In the first match, perl lazily prints out nothing, though it could have matched something, as is demonstrated in the second match. Oddly, the behavior of * is greedy as expected when the contents of the group is just . instead of actual characters:
@matches = $text =~ m/((?:..)*)/;
print "@matches\n";
# --> 'f fjfj f'
Note: The above was tested on perl 5.12.
Note: It doesn't matter whether I use capturing or non-capturing parentheses for inside group.
This isn't a matter of greedy or lazy repetition. (?:fj)* is greedily matching as many repetitions of "fj" as it can, but it will successfully match zero repetitions. When you try to match it against the string "f fjfj ff", it will first attempt to match at position zero (before the first "f"). The maximum number of times you can successfully match "fj" at position zero is zero, so the pattern successfully matches the empty string. Since the pattern successfully matched at position zero, we're done, and the engine has no reason to try a match at a later position.
The moral of the story is: don't write a pattern that can match nothing, unless you want it to match nothing.
Perl will match as early as possible in the string (left-most). It can do that with your first match by matching zero occurrences of fj at the start of the string.
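To make the leftmost-match rule visible, this sketch reports both the matched text and the offset where the match started, using Perl's built-in @- array. The helper match_with_offset is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return (matched text, start offset) for the
# first match of $re in $text, or an empty list if nothing matches.
sub match_with_offset {
    my ($text, $re) = @_;
    return $text =~ $re ? ($&, $-[0]) : ();
}

my $text = 'f fjfj ff';

# * can match zero repetitions, so it succeeds immediately at offset 0.
my ($star, $star_at) = match_with_offset($text, qr/(?:fj)*/);

# + needs at least one "fj", so the engine must move on to offset 2.
my ($plus, $plus_at) = match_with_offset($text, qr/(?:fj)+/);

printf "star: '%s' at %d\n", $star, $star_at;   # star: '' at 0
printf "plus: '%s' at %d\n", $plus, $plus_at;   # plus: 'fjfj' at 2
```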

Regular expression Capture and Backreference

Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the X letters that follow each number (X being that number). I also want to capture the complete string, including the + and the number.
ie the capture should return:
+4ACCG
+12AAGTACTACCGT
etc.
and :
ACCG
AAGTACTACCGT
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing ?
You cannot use a backreference in a quantifier. \2 is an instruction to match what $2 contains, so {\2} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2
Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also remove the upper or lower case letters from your class by using the i modifier to match case independent:
/(\+\d+([ATGCN]+))/gi
This loop works because the \G assertion tells the regex engine to begin the search right after the last match (the digits) in the string.
use feature 'say'; # say needs Perl 5.10 or later
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
while (/(\d+)/g) {
my $dig = $1;
/\G([TAGCN]{$dig})/i;
say $1;
}
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
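The same idea can be wrapped in a sub that returns the runs as a list instead of printing them. This is only a sketch under a few assumptions: the name counted_runs is invented, the count is interpolated into the quantifier as in the loop above, and the inner match uses /gc so that a failed attempt does not reset the search position.

```perl
use strict;
use warnings;

# Hypothetical helper: for every "+N" in $seq, collect the N bases
# that immediately follow it.
sub counted_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /\+(\d+)/g) {
        my $count = $1;
        # \G anchors at the position where the previous /g match ended,
        # i.e. right after the digits. /c keeps that position on failure.
        if ($seq =~ /\G([ACGTN]{$count})/gci) {
            push @runs, $1;
        }
    }
    return @runs;
}

print "$_\n" for counted_runs('T+4ACCGT+12CAAGTACTACCGT+4ACCG');
```

A "+N" that is followed by fewer than N bases is skipped in this sketch rather than reported as an error.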
my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
my($bases) = $seq =~ /([^\d]+)/;
}

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just exclude the comma altogether since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as it is puts the comma instead of the number into capture group 1, and in list context that captured comma is what the match returns.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
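The three patterns discussed in the answers above can be compared directly. In this sketch the sub names are hypothetical; each returns the list produced by a list-context //g match.

```perl
use strict;
use warnings;

# Original pattern: the only capture group is the comma,
# so the comma is what a list-context match returns.
sub captured_commas {
    my ($text) = @_;
    my @found = $text =~ /[0-9.]+(,)/g;
    return @found;
}

# Parentheses moved around the number: returns the numbers.
sub captured_numbers {
    my ($text) = @_;
    my @found = $text =~ /([0-9.]+),/g;
    return @found;
}

# No capture group, comma checked by lookahead: //g returns the
# whole matches, which no longer include the comma.
sub lookahead_numbers {
    my ($text) = @_;
    my @found = $text =~ /[0-9.]+(?=,)/g;
    return @found;
}

my $text = "241.000,00";
print join('|', captured_commas($text)), "\n";     # ,
print join('|', captured_numbers($text)), "\n";    # 241.000
print join('|', lookahead_numbers($text)), "\n";   # 241.000
```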