Why doesn't this Regex match in Perl? - regex

I have a string that can read something like this (although not always, numbers can vary).
Board Length,45,inches,color,board height,8,inches,black,store,wal-mart,Board weight,20,dollars
I am trying to match the 45 that follows Board Length using this regex:
if ($string =~/Board Length,(\d+\.\d+)/){
print $string;
}
Is the formatting wrong? I thought \d+ would match as many digits as needed, \. would match a literal '.', and \d+ would match any digits after the decimal point (if there are any).

As you have written it, the decimal point and the digits after it are mandatory. Wrap them in an optional group, (?:\.\d+)?, to make them optional:
if ($string =~/Board Length,(\d+(?:\.\d+)?)/)

You are absolutely right about what that should match. However, without the '?' character, you are specifying that all of those pieces must be present.
\d+\.\d+
This means "1 or more digits, a period, 1 or more digits".
1.5, 253333.7 and 0.0 would all be matched. However, your example uses 45, which has no "." in it and no digits after it. There are a few solutions to your problem, the most foolproof of which was stated above by mpapac: allow the decimal point and the following digits to be optional.
(\.\d+)?
The catch is that the parentheses make it another capture group, which you may or may not want. Putting ?: just inside the opening parenthesis means "use this as a group, but don't capture it". Hence:
(?:\.\d+)?
The other option is not to do the grouping, and instead make both the decimal itself optional and the digits after the decimal ZERO or more instead of ONE or more. That would look something like this:
\d+\.?\d*
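To compare the three patterns discussed above, here is a minimal sketch. The helper name board_length and the second sample string are made up for illustration; only the patterns come from the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number after "Board Length," or undef.
sub board_length {
    my ($string, $pattern) = @_;
    return $string =~ /Board Length,($pattern)/ ? $1 : undef;
}

my @patterns = (
    [ 'mandatory decimal' => qr/\d+\.\d+/ ],       # the original pattern
    [ 'optional group'    => qr/\d+(?:\.\d+)?/ ],  # non-capturing optional group
    [ 'optional pieces'   => qr/\d+\.?\d*/ ],      # optional dot, zero or more digits
);

for my $input ('Board Length,45,inches', 'Board Length,45.5,inches') {
    for my $p (@patterns) {
        my ($name, $re) = @$p;
        my $found = board_length($input, $re);
        printf "%-18s %-26s %s\n", $name, $input,
            defined $found ? $found : '(no match)';
    }
}
```

One difference worth knowing: the last pattern also accepts a trailing dot, so it would capture "45." from "Board Length,45.,inches", while the optional group captures just "45".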

You are not printing what you capture. You're printing $_, and we don't know what that contains.
if ($string =~/Board Length,(\d+\.\d+)/){
print $_;
}
What I think you want is:
if ($string =~/Board Length,(\d+\.\d+)/){
print $1;
}

You have the following expression:
$string =~ /Board Length,(\d+\.\d+)/
Your string is this:
Board Length,45,inches
The string Board Length, will match the pattern Board Length,. However, the rest of the pattern matches one or more digits followed by a period followed by one or more digits. This doesn't match the string 45: there's no decimal point there.
The question is what are you trying to match. For example, if the number is surrounded by commas, you could do this:
$string =~ /Board Length,([^,]+),/;
my $number = $1;
[^,] means "not a comma", so [^,]+ captures everything after Board Length, up to the next comma. This will allow you to capture 45, 45.32, and even 4.5e+10: anything between the two commas.
Note that you use $1 for your first capture group and not $_.
Another way is to use non-greedy matching:
$string =~ /Board Length,(.+?),/;
my $number = $1;
What happens if what is captured isn't a number? You can test for that using the looks_like_number function from Scalar::Util (which has been included in Perl distributions for a long time):
use Scalar::Util qw(looks_like_number);
my $string = "Board Length,Extra long,feet,...";
...
$string =~ /Board Length,(.+?),/;
my $number = $1;
if ( looks_like_number( $number ) ) {
print "$number is a number\n";
}
else {
print "Nope. $number isn't a number\n";
}

Related

Matching a number a max number of times

I have the following regex in Perl that works for matching strings with 6 or fewer digits in them. However, this also matches strings with more than 6 digits.
$string =~ /[0-9]{1,6}/
Matches:
T12345#1
0897
112355501234
I'd like the regex to match the first 2 but not the last case.
Use a negated look ahead to see if a digit follows, and negated look behind so you don't just match the last six digits:
$string =~ /(?<!\d)\d{1,6}(?!\d)/
Or you could do it this way:
$string =~ /^(?!.*\d{7}.*).*$/
If you just want to reject strings that contain more than six decimal digits then you can use tr/// to count them
if ( $string =~ tr/0-9// <= 6 ) { ... }
But you don't make it clear whether separate numeric substrings should be counted together. You say T12345#1 is valid, but what about T12345#12345?
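The three suggestions above answer slightly different questions, which a small sketch can make visible. The sub names are invented for illustration, and the last two test strings are extra samples that are not in the question.

```perl
use strict;
use warnings;

# Contains at least one run of 1 to 6 digits not touching other digits.
sub has_short_run {
    my ($s) = @_;
    return $s =~ /(?<!\d)\d{1,6}(?!\d)/ ? 1 : 0;
}

# Contains no run of 7 or more consecutive digits.
sub no_long_run {
    my ($s) = @_;
    return $s =~ /^(?!.*\d{7}).*$/ ? 1 : 0;
}

# Contains at most 6 digits in total, wherever they appear.
sub few_digits_total {
    my ($s) = @_;
    return ( $s =~ tr/0-9// ) <= 6 ? 1 : 0;
}

for my $s ('T12345#1', '0897', '112355501234', 'T12345#12345', 'a1b12345678') {
    printf "%-14s short_run=%d no_long_run=%d few_total=%d\n",
        $s, has_short_run($s), no_long_run($s), few_digits_total($s);
}
```

The three checks agree on the samples from the question. They disagree on T12345#12345 (ten digits in total, but no run longer than five) and on a1b12345678 (a short run next to a long one), so the choice depends on which of those cases should count as valid.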

Extract digits and hyphen from a line in file

I have to extract a string with a particular format from a file: a 1 followed by a hyphen and 7 digits.
For example:
#CARES# AR_NUMBER=1-4742637
Here I have to extract only 1-4742637.
How can I extract it?
The following will capture that: /\b(1-\d{7})\b/
As demonstrated:
use strict;
use warnings;
my $text = <<'END_TEXT';
for ex.
#CARES# AR_NUMBER=1-4742637
END_TEXT
if ($text =~ /\b(1-\d{7})\b/) {
print "$1";
}
Outputs:
1-4742637
if ($subject =~ m/(1-[\d]+)/) {
# Successful match
} else {
# Match attempt failed
}
(1-[\d]+)
Match the regular expression below and capture its match into backreference number 1 «(1-[\d]+)»
Match the characters “1-” literally «1-»
Match a single digit 0..9 «[\d]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
1-\d{7}
will select the required part.
Please try
$var =~ /1-\d{7}/
{7} gives the number of repetitions: exactly seven digits.
You can do that as:
([0-9-]+)
or specifically for your case:
(1-\d{7})
and the first captured group \1 or $1 will contain what you want.
Demo: http://regex101.com/r/pY4gB6
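The patterns suggested in this thread differ in how strict they are. The sketch below contrasts the \b-anchored, fixed-length form with the looser 1-\d+ form; the sub names and the second and third sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Strict: exactly "1-" plus seven digits, delimited by word boundaries.
sub strict_ar {
    my ($text) = @_;
    return $text =~ /\b(1-\d{7})\b/ ? $1 : undef;
}

# Loose: "1-" followed by any number of digits, anywhere in the text.
sub loose_ar {
    my ($text) = @_;
    return $text =~ /(1-\d+)/ ? $1 : undef;
}

for my $text (
    '#CARES# AR_NUMBER=1-4742637',   # the sample from the question
    'AR_NUMBER=1-47426379',          # eight digits
    'AR_NUMBER=21-4742637',          # "1-" is part of a longer number
) {
    my ($strict, $loose) = (strict_ar($text), loose_ar($text));
    printf "%-28s strict=%-10s loose=%s\n", $text,
        defined $strict ? $strict : '(none)',
        defined $loose  ? $loose  : '(none)';
}
```

If the input is known to contain only well-formed values, both behave the same; the strict form matters when malformed or neighbouring numbers can appear.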

Using /^\s*$/ to match and replace multiple blank lines gives a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book says that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I practiced, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I got two <p> instead of one between 'c' and 'b'. Did the behavior of $ in Perl change after 5.8?
The lesson here is to be wary of regular expressions that can match a zero-width pattern; you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex matches from position 2 up to position 4, consuming two newline characters. After that replacement it resumes searching at position 4 and matches a zero-width pattern, so it adds a second <p>.
One reason this isn't intuitive is that our regex has already replaced the \n\n at positions 2 and 3 with a <p>. However, lookbehind assertions (of which ^ is a special variant) see the string as it originally was, not as modified by previous passes of a /g substitution. Therefore, when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in our output above), so ^ matches again, and $ matches immediately after it with nothing in between.
The solution is to disallow zero-width matches, in this instance by using + instead of *.
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
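As a quick check of the + fix, here is a minimal sketch applied to both examples from this answer. The helper name collapse_blank_lines is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each run of blank lines with a single <p>.
sub collapse_blank_lines {
    my ($text) = @_;
    $text =~ s/^\s+$/<p>/mg;    # + instead of *: no zero-width matches
    return $text;
}

for my $input ("c\n\n\n\nb", "c\n\n\n\n\nb") {
    (my $shown = collapse_blank_lines($input)) =~ s/\n/\\n/g;
    print "$shown\n";           # c\n<p>\nb in both cases
}

(my $simple = 'caab') =~ s/a+/<p>/g;
print "$simple\n";              # c<p>b
```

One side effect to be aware of: with +, a single empty line such as the one in "c\n\nb" is left alone, because \s+ has to consume at least one character before $ and the only candidate is the newline that ends the line.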
Quantifier * matches 0 or more times,
quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
You can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know i'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on?
Edit: how do I return 2098, or 3213?
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
See http://perldoc.perl.org/perlre.html for discussion of the 'g' modifier, which will cause your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
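For comparison, here is a sketch that gets the $n'th group in two ways: by collecting all matches with /g and indexing the list (a different technique from the quantifier pattern above), and by the quantifier pattern itself. The sub names are invented for illustration.

```perl
use strict;
use warnings;

# Collect every 4-digit group, then pick the $n'th (1-based).
sub nth_by_list {
    my ($s, $n) = @_;
    my @all = $s =~ /\b(\d{4})\b/g;
    return ( $n >= 1 && $n <= @all ) ? $all[ $n - 1 ] : undef;
}

# Repeat the group $n times; $1 keeps the value from the last repetition.
sub nth_by_quantifier {
    my ($s, $n) = @_;
    return $s =~ /^(?:.*?\b(\d{4})\b){$n}/ ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
for my $n (1 .. 4) {
    my ($a, $b) = ( nth_by_list($str, $n), nth_by_quantifier($str, $n) );
    printf "n=%d list=%s quantifier=%s\n", $n,
        defined $a ? $a : '(none)',
        defined $b ? $b : '(none)';
}
```

The list version is easier to read and makes the out-of-range case explicit; the quantifier version stays a single pattern, which may be convenient when the regex has to be passed around.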
If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";

help with perl regex rules

I need some help with a regex issue in Perl. I need to match the non-letter characters "nucleated" around a single letter.
That is to say... I have a string like
CDF((E)TR)FT
and I want to match ALL the following:
C, D, F((, ((E), )T, R), )F, T.
I was trying with something like
/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/
but I'm obtaining:
C, D, F((, E), T, R), F, T.
It is as if, once a non-letter character has been matched, it can NOT be matched again in another match.
How can I do this?
A little late on this. Somebody has probably proposed this already.
I would consume the capture in the assertion to the left (via backref) and not consume the capture in the assertion to the right. All the captures can be seen, but the last one is not consumed, so the next pass continues right after the last atomic letter was found.
Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3
Apply it globally in a while loop; the concatenation $2$3 (in that order) is the answer.
Test:
$samp = 'CDF((E)TR)FT';
while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
print "$2$3, ";
}
output:
C, D, F((, ((E), )T, R), )F, T,
The problem is that you consume the letters and non-letter characters the first time you encounter them, so you can't match everything you want. One solution is to use different regexes for the different patterns and combine the results at the end:
This matches one or more non-letters followed by a single letter that is NOT followed by a non-letter:
[^A-Z]+[A-Z](?![^A-Z])
This matches a letter enclosed by non-letters, including overlapping results:
(?=([^A-Z]+[A-Z][^A-Z]+))
This matches a letter followed by one or more non-letters, only if it is not preceded by a non-letter:
(?<![^A-Z])[A-Z][^A-Z]+
And this matches single letters that are not adjacent to any non-letter:
(?<![^A-Z])[A-Z](?![^A-Z])
By combining the results you will have the correct desired result:
C,D,T, )T, )F, ((E), F((, R)
Also, once you understand the small parts, you could join them into one regex:
#!/usr/local/bin/perl
use strict;
my $subject = "0C0CC(R)CC(L)C0";
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
# matched text = $1, $2, $3, $4
print $1, " " if defined $1;
print $2, " " if defined $2;
print $3, " " if defined $3;
print $4, " " if defined $4;
}
Output :
0C0 0C C( (R) )C C( (L) )C0
You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex
(?<=(\P{L}*))\p{L}(?=(\P{L}*))
where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.
In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:
\P{L}*\p{L}(?=(\P{L}*))
Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
# matched text = $& . $1
}
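To make the loop above concrete, here is a minimal sketch that collects the results into a list. It uses an explicit capture group in place of $&; the sub name nucleated is an assumption, not part of the answer.

```perl
use strict;
use warnings;

# Return each letter together with the non-letters on either side of it.
# Leading non-letters are consumed; trailing ones are only looked at,
# so the next match can consume them again as its own leading part.
sub nucleated {
    my ($subject) = @_;
    my @parts;
    while ( $subject =~ /(\P{L}*\p{L})(?=(\P{L}*))/g ) {
        push @parts, $1 . $2;
    }
    return @parts;
}

print join( ', ', nucleated('CDF((E)TR)FT') ), "\n";
# prints: C, D, F((, ((E), )T, R), )F, T
```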
Or, you could do it the hard way and tokenize first, then process the tokens:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'CDF((E)TR)FT';
my @nucleated = nucleat($str);
print "$_\n" for @nucleated;
sub nucleat {
my($s) = @_;
my @parts; # return list stored here
my @tokens = grep length, split /([a-z])/i, $s;
# bracket the tokens with empty strings to avoid warnings
unshift @tokens, '';
push @tokens, '';
foreach my $i (0..$#tokens) {
next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
my $str = '';
if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
$str .= $tokens[$i-1];
}
$str .= $tokens[$i]; # the letter
if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
$str .= $tokens[$i+1];
}
push @parts, $str;
}
return @parts;
}