Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
edit: How do I return 2098, or 3213?

You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
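To illustrate the effect of \b described above, here is a small sketch that runs the pattern with and without the word boundaries. The sample string is made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample: one longer run of digits and one standalone four-digit number.
my $sample = 'id 1234567 and code 2098';

# Without \b the pattern also picks four digits out of the longer number.
my @loose = $sample =~ /\d{4}/g;

# With \b only whole four-digit numbers qualify.
my @bounded = $sample =~ /\b\d{4}\b/g;

print "without \\b: @loose\n";
print "with \\b:    @bounded\n";
```

Only the second list leaves out the 1234 taken from 1234567.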

See http://perldoc.perl.org/perlre.html for discussion on the use of the 'g' modifier, which will cause your regular expression to match ALL occurrences of its pattern, not just the first.

If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
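For comparison, here is a rough sketch of two ways to fetch the $n'th group: precompiling the quantified pattern with qr//, and simply collecting every match with /g and indexing into the list. The helper names nth_by_quantifier and nth_by_list are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat the group $n times; $1 keeps the last repetition.
sub nth_by_quantifier {
    my ($s, $n) = @_;
    my $re = qr/^(?:.*?\b(\d{4})\b){$n}/;   # $n is interpolated into the quantifier
    return $s =~ $re ? $1 : undef;
}

# Hypothetical helper: grab all matches, then pick the $n'th (1-based).
sub nth_by_list {
    my ($s, $n) = @_;
    return if $n < 1;
    my @all = $s =~ /\b(\d{4})\b/g;         # list context + /g returns every capture
    return $all[$n - 1];                    # undef if there are fewer than $n
}

my $s = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
for my $n (1 .. 4) {
    my $found = nth_by_list($s, $n);
    print defined $found ? "Found $found\n" : "Not found\n";
}
```

The list version does a little more work on long inputs, but it avoids building a pattern per call.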

If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
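In Perl, a single match in list context returns those three groups directly. A minimal sketch, assuming the string from the question:

```perl
use strict;
use warnings;

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

# One match, three capture groups, returned in order in list context.
my ($first, $second, $third) =
    $str =~ /^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$/;

print "$first $second $third\n";
```

Note that this pattern fails outright if the string holds fewer than three four-digit numbers.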

Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";

Related

Matching a number a max number of times

I have the following regex in Perl that works for matching strings with 6 or fewer digits in them. However, this also matches strings with more than 6 digits.
$string =~ /[0-9]{1,6}/
Matches:
T12345#1
0897
112355501234
I'd like the regex to match the first 2 but not the last case.
Use a negated look ahead to see if a digit follows, and negated look behind so you don't just match the last six digits:
$string =~ /(?<!\d)\d{1,6}(?!\d)/
Or you could do it this way:
$string =~ /^(?!.*\d{7}.*).*$/
If you just want to reject strings that contain more than six decimal digits then you can use tr/// to count them
if ( $string =~ tr/0-9// <= 6 ) { ... }
But you don't make it clear whether separate numeric substrings should be counted together. You say T12345#1 is valid, but what about T12345#12345?
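A quick sketch that runs the lookaround pattern and the tr/// count over the three strings from the question may make the difference clearer. The hash names are just for this example.

```perl
use strict;
use warnings;

my @inputs = ('T12345#1', '0897', '112355501234');

my (%lookaround, %digit_count);
for my $string (@inputs) {
    # True if some run of 1 to 6 digits is not part of a longer run.
    $lookaround{$string} = $string =~ /(?<!\d)\d{1,6}(?!\d)/ ? 1 : 0;

    # With an empty replacement list and no /d, tr/// only counts the digits.
    $digit_count{$string} = ($string =~ tr/0-9//);

    print "$string: lookaround=$lookaround{$string} digits=$digit_count{$string}\n";
}
```

T12345#1 passes the lookaround test because its longest run is five digits, while tr/// counts six digits for it in total, which is exactly the ambiguity raised above.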

Why doesn't this Regex match in Perl?

I have a string that can read something like this (although not always, numbers can vary).
Board Length,45,inches,color,board height,8,inches,black,store,wal-mart,Board weight,20,dollars
I am trying to match the 45 that follows Board Length with this regular expression.
if ($string =~/Board Length,(\d+\.\d+)/){
print $string;
}
Is the formatting wrong? I thought \d+ would match as many digits as needed, \. would match a literal '.', and \d+ would match any digits after the decimal point (if there are any).
As you have written it, the decimal point and the digits after it are mandatory. Wrap them in an optional group, (\.\d+)?, written here as a non-capturing group:
if ($string =~/Board Length,(\d+(?:\.\d+)?)/)
You are absolutely right about what that should match. However, without the '?' character, you are specifying that all of those pieces must be present.
\d+\.\d+
This means "1 or more numbers, period, 1 or more numbers"
1.5, 253333.7, 0.0 would all be matched. However, your example uses 45, which has no "." in it, nor numbers afterward. There are a few solutions to your problem, the most foolproof of which was stated above by mpapac: allow the decimal point and following digits to be optional.
(\.\d+)?
The problem with this as such is that putting a () around it makes it another capture group. You may or may not want this. Putting the ?: inside it means "Use this as a group, but don't capture it." Hence:
(?:\.\d+)?
The other option is not to do the grouping, and instead make both the decimal itself optional and the digits after the decimal ZERO or more instead of ONE or more. That would look something like this:
\d+\.?\d*
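Here is a short sketch comparing the three variants on a shortened version of the string from the question. The variable names are made up for the example, and the '45.' input is an extra case added to show where the two optional forms differ.

```perl
use strict;
use warnings;

my $string = 'Board Length,45,inches,color,board height,8,inches,black';

my ($required) = $string =~ /Board Length,(\d+\.\d+)/;       # decimal part required
my ($grouped)  = $string =~ /Board Length,(\d+(?:\.\d+)?)/;  # optional group
my ($loose)    = $string =~ /Board Length,(\d+\.?\d*)/;      # optional pieces

print defined $required ? "required: $required\n" : "required: no match\n";
print "grouped: $grouped\n";
print "loose: $loose\n";

# On a trailing dot the two optional forms disagree: only the loose one keeps it.
my ($g) = '45.' =~ /^(\d+(?:\.\d+)?)/;
my ($l) = '45.' =~ /^(\d+\.?\d*)/;
print "grouped on '45.': $g, loose on '45.': $l\n";
```

If a bare trailing dot should not count as part of the number, the grouped form is probably the safer choice.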
You are not printing what you capture. You're printing $_, and we don't know what it contains.
if ($string =~/Board Length,(\d+\.\d+)/){
print $_;
}
What I think you want is:
if ($string =~/Board Length,(\d+\.\d+)/){
print $1;
}
You have the following expression:
$string =~/Board Length,(\d+\.\d+) /
Your string is this:
Board Length,45,inches
The string Board Length, will match the pattern Board Length,. However, the rest of the pattern matches one or more digits followed by a period followed by one or more digits. This doesn't match the string 45. There's no decimal point there.
The question is what are you trying to match. For example, if the number is surrounded by commas, you could do this:
$string =~ /Board Length,([^,]+),/;
my $number = $1;
The [^,] means Not a comma. You're capturing everything after a comma to the next comma. This will allow you to capture 45, 45.32, and even 4.5e+10. Just anything between the two commas.
Note that you use $1 for your first capture group and not $_.
Another way is to use non-greedy matching:
$string =~ /Board Length,(.+?),/;
my $number = $1;
What happens if what is captured isn't a number? You can test for that using the looks_like_number function from Scalar::Util (which has been included in Perl distributions for a long time):
use Scalar::Util qw(looks_like_number);
my $string = "Board Length,Extra long,feet,...";
...
$string =~ /Board Length,(.+?),/;
my $number = $1;
if ( looks_like_number( $number ) ) {
print "$number is a number\n";
}
else {
print "Nope. $number isn't a number\n";
}

Extract digits and hyphen from a line in file

I have to extract a string with a particular format from a file, i.e. the string format is 1 followed by a hyphen and 7 digits.
for ex.
#CARES# AR_NUMBER=1-4742637
here I have to extract only 1-4742637.
Help me, how to extract?
The following will capture that: /\b(1-\d{7})\b/
As demonstrated:
use strict;
use warnings;
my $text = <<'END_TEXT';
for ex.
#CARES# AR_NUMBER=1-4742637
END_TEXT
if ($text =~ /\b(1-\d{7})\b/) {
print "$1";
}
Outputs:
1-4742637
if ($subject =~ m/(1-[\d]+)/) {
# Successful match
} else {
# Match attempt failed
}
(1-[\d]+)
Match the regular expression below and capture its match into backreference number 1 «(1-[\d]+)»
Match the characters “1-” literally «1-»
Match a single digit 0..9 «[\d]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
1-\d{7}
will select the required part.
Please try
$var =~ /1-\d{7}/
The {7} is a quantifier: it requires exactly seven repetitions of \d.
You can do that as:
([0-9-]+)
or specifically for your case:
(1-\d{7})
and the first captured group \1 or $1 will contain what you want.
Demo: http://regex101.com/r/pY4gB6
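The patterns suggested in this thread are not equivalent. The sketch below runs four of them over the line from the question plus two made-up lines (the ORDER-7 prefix and the eight-digit number are assumptions added only to expose the differences).

```perl
use strict;
use warnings;

# First line is from the question; the other two are made up for contrast.
my @lines = (
    '#CARES# AR_NUMBER=1-4742637',
    'ORDER-7 AR_NUMBER=1-4742637',
    'AR_NUMBER=1-47426379',
);

my %patterns = (
    bounded   => qr/\b(1-\d{7})\b/,   # exactly 7 digits, on word boundaries
    unbounded => qr/(1-\d{7})/,       # 7 digits, may be cut out of a longer run
    any_len   => qr/(1-[\d]+)/,       # any number of digits
    generic   => qr/([0-9-]+)/,       # first run of digits and hyphens
);

my %result;
for my $line (@lines) {
    for my $name (sort keys %patterns) {
        my ($hit) = $line =~ $patterns{$name};
        $result{$line}{$name} = $hit // 'no match';
        print "$name on '$line': $result{$line}{$name}\n";
    }
}
```

On the question's own line all four agree; they diverge once the input has other hyphens or longer numbers.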

Trying to use /^\s*$/ match multiple blank lines and replace them failed and get a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book says that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I practiced, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I got a double <p> instead of a single one between 'c' and 'b'. Did the behavior of Perl's /$/ change after 5.8?
The lesson here is be wary of regular expressions that will match a zero-width pattern, you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex matches from position 2 until 4, capturing 2 return characters. After that replacement it starts searching from position 4 and matches a zero width pattern, so adds a second <p>.
One of the reasons this isn't intuitive is that our regex has replaced the \n\n at positions 2 & 3 with a <p>. However, lookbehind assertions (of which ^ is a special variant) treat the string as it originally was, not as it might have been modified by previous passes of a /g regex. Therefore when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in our output above), and so it matches ^ again with $ immediately after it, with nothing in between.
The solution to this is to not allow zero width patterns by using + in this instance instead of *.
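A small side-by-side sketch of the two quantifiers on the string from the question (the variable names are made up for the example):

```perl
use strict;
use warnings;

my $text = "c\n\n\n\n\nb";

(my $with_star = $text) =~ s/^\s*$/<p>/mg;   # may match an empty string
(my $with_plus = $text) =~ s/^\s+$/<p>/mg;   # must consume at least one character

# Make the newlines visible before printing.
s/\n/\\n/g for $with_star, $with_plus;

print "star: $with_star\n";
print "plus: $with_plus\n";
```

With + there is no zero-width match left over after the first replacement, so only one <p> is inserted.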
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
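The breakdown above can be checked by walking the matches one at a time and recording their offsets. This is only a sketch; it uses the built-in @- and @+ arrays, which hold the start and end offsets of the last successful match.

```perl
use strict;
use warnings;

my $text = "caab";
my @spans;

# In scalar context, /g resumes where the previous match ended.
while ($text =~ /a*/g) {
    # $-[0] is the start offset of the whole match, $+[0] its end offset.
    push @spans, join '-', $-[0], $+[0];
}

print "@spans\n";
```

A span whose start and end are equal is a zero-width match, and each one corresponds to an inserted <p> in the s///g output.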
Quantifier * matches 0 or more times,
quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
You can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Negative regex for Perl string pattern match

I have this regex:
if($string =~ m/^(Clinton|[^Bush]|Reagan)/i)
{print "$string\n"};
I want to match with Clinton and Reagan, but not Bush.
It's not working.
Your regex does not work because [] defines a character class, but what you want is a lookahead:
(?=) - Positive look ahead assertion foo(?=bar) matches foo when followed by bar
(?!) - Negative look ahead assertion foo(?!bar) matches foo when not followed by bar
(?<=) - Positive look behind assertion (?<=foo)bar matches bar when preceded by foo
(?<!) - Negative look behind assertion (?<!foo)bar matches bar when NOT preceded by foo
(?>) - Once-only subpatterns (?>\d+)bar Performance enhancing when bar not present
(?(x)) - Conditional subpatterns
(?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not
(?#) - Comment (?# Pattern does x y or z)
So try: (?!bush)
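To make the distinction concrete, here is a rough sketch contrasting the character class with the negative lookahead at the start of a line. The sample lines are assumptions for the demonstration; 'Summary follows' was picked because it starts with one of the letters in "Bush".

```perl
use strict;
use warnings;

my @lines = ('Clinton said', 'Bush used crayons', 'Reagan forgot', 'Summary follows');

# [^Bush] is a character class: ONE character that is not B, u, s or h.
my @char_class = grep { /^[^Bush]/i } @lines;

# (?!Bush) is a zero-width assertion: the text "Bush" must not start here.
my @lookahead = grep { /^(?!Bush)/i } @lines;

print "character class: ", join(' | ', @char_class), "\n";
print "lookahead:       ", join(' | ', @lookahead), "\n";
```

The character class wrongly drops 'Summary follows' because of its leading S, while the lookahead only rejects the line that actually starts with Bush.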
Sample text:
Clinton said
Bush used crayons
Reagan forgot
Just omitting a Bush match:
$ perl -ne 'print if /^(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot
Or if you really want to specify:
$ perl -ne 'print if /^(?!Bush)(Clinton|Reagan)/' textfile
Clinton said
Reagan forgot
Your regex says the following:
/^ - if the line starts with
( - start a capture group
Clinton - "Clinton"
| - or
[^Bush] - Any single character except "B", "u", "s" or "h"
| - or
Reagan) - "Reagan". End capture group.
/i - Make matches case-insensitive
So, in other words, your middle part of the regex is screwing you up. As it is a "catch-all" kind of group, it will allow any line that does not begin with any of the upper or lower case letters in "Bush". For example, these lines would match your regex:
Our president, George Bush
In the news today, pigs can fly
012-3123 33
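You can see what the original regex actually captures with a short sketch like this one (the %captured hash is just a name used here; the lines are taken from this thread):

```perl
use strict;
use warnings;

my $original = qr/^(Clinton|[^Bush]|Reagan)/i;

my @lines = (
    'Clinton said',
    'Bush used crayons',
    'Reagan forgot',
    'Our president, George Bush',
    '012-3123 33',
);

my %captured;
for my $line (@lines) {
    my ($hit) = $line =~ $original;   # undef when the line does not match
    $captured{$line} = $hit;
    printf "%-28s => %s\n", $line, defined $hit ? "captured '$hit'" : 'no match';
}
```

Notice that 'Reagan forgot' is matched by the middle alternative, not by Reagan: the alternatives are tried left to right, and [^Bush] accepts the single character R first.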
You either make a negative look-ahead, as suggested earlier, or you simply make two regexes:
if( ($string =~ m/^(Clinton|Reagan)/i) and
($string !~ m/^Bush/i) ) {
print "$string\n";
}
As mirod has pointed out in the comments, the second check is quite unnecessary when using the caret (^) to match only beginning of lines, as lines that begin with "Clinton" or "Reagan" could never begin with "Bush".
However, it would be valid without the carets.
What's wrong with using two regexes (or three)? This makes your intentions clearer and may even improve your performance:
if ($string =~ /^(Clinton|Reagan)/i && $string !~ /Bush/i) { ... }
if (($string =~ /^Clinton/i || $string =~ /^Reagan/i)
&& $string !~ /Bush/i) {
print "$string\n"
}
If my understanding is correct then you want to match any line which has Clinton and Reagan, in any order, but not Bush. As suggested by Stuck, here is a version with lookahead assertions:
#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/
(?=.*clinton)
(?!.*bush)
.*reagan
/ix;
while (<DATA>) {
chomp;
next unless (/$regex/);
print $_, "\n";
}
__DATA__
shouldn't match - reagan came first, then clinton, finally bush
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
shouldn't match - last two: clinton and bush
shouldn't match - reverse: bush and clinton
shouldn't match - and then came obama, along comes mary
shouldn't match - to clinton with perl
Results
first match - first two: reagan and clinton
second match - first two reverse: clinton and reagan
as desired it matches any line which has Reagan and Clinton in any order.
You may want to try reading how lookahead assertions work with examples at http://www252.pair.com/comdog/mastering_perl/Chapters/02.advanced_regular_expressions.html
they are very tasty :)