Matching a number a max number of times - regex

I have the following regex in Perl that works for matching strings with 6 or fewer digits in them. However, this also matches strings with more than 6 digits.
$string =~ /[0-9]{1,6}/
Matches:
T12345#1
0897
112355501234
I'd like the regex to match the first 2 but not the last case.

Use a negated look ahead to see if a digit follows, and negated look behind so you don't just match the last six digits:
$string =~ /(?<!\d)\d{1,6}(?!\d)/

Or you could do it this way:
$string =~ /^(?!.*\d{7}.*).*$/
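
The two patterns above do not accept exactly the same strings, which a small test run makes visible. This is only a sketch: the helper names has_short_run and no_long_run and the extra strings 'abc' and '12 12345678' are made up for illustration and are not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains a run of 1 to 6 digits
# that is not part of a longer run (the lookaround answer).
sub has_short_run {
    my ($s) = @_;
    return $s =~ /(?<!\d)\d{1,6}(?!\d)/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains no run of 7 or more
# consecutive digits (the negative-lookahead answer).
sub no_long_run {
    my ($s) = @_;
    return $s =~ /^(?!.*\d{7}).*$/ ? 1 : 0;
}

for my $s ('T12345#1', '0897', '112355501234', 'abc', '12 12345678') {
    printf "%-14s short_run=%d no_long_run=%d\n",
        $s, has_short_run($s), no_long_run($s);
}
```

The first pattern needs at least one short run of digits somewhere, while the second only forbids long runs, so a string with no digits at all passes the second but not the first.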

If you just want to reject strings that contain more than six decimal digits then you can use tr/// to count them
if ( $string =~ tr/0-9// <= 6 ) { ... }
But you don't make it clear whether separate numeric substrings should be counted together. You say T12345#1 is valid, but what about T12345#12345?
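
To see how counting with tr/// treats the ambiguous case, here is a minimal sketch. The helper name digit_count is an assumption for illustration; the 6-digit limit comes from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: count every decimal digit in the string, wherever it
# appears. With an empty replacement list and no /d flag, tr/// only counts
# and leaves the string unchanged.
sub digit_count {
    my ($s) = @_;
    return ($s =~ tr/0-9//);
}

for my $s ('T12345#1', '0897', '112355501234', 'T12345#12345') {
    my $n = digit_count($s);
    printf "%-14s %2d digits: %s\n", $s, $n, $n <= 6 ? 'accepted' : 'rejected';
}
```

Counted this way, T12345#12345 is rejected because it holds ten digits in total, even though neither run of digits is longer than six.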

Related

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any text with at least one non-digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
If no newlines were in your input, you could use slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
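
As a rough check of the lookahead pattern against the question's samples, the sketch below splits the text on whitespace and tests each token separately. Splitting first is an assumption about how the input arrives; the variable names are illustrative.

```perl
use strict;
use warnings;

# The lookahead pattern from the answer above: at least one digit, at least
# one character that is neither a digit nor whitespace, and no whitespace.
my $mixed = qr/^(?=.*\d)(?=.*[^\d\s])\S+$/;

# Sample tokens from the question, separated by whitespace.
my $text = 'v2.1 -98.7 3.141 .6180 9,000 +42 555.123.4567 '
         . '+1-(800)-555-2468 0123456789 12345';

my @tokens  = split ' ', $text;
my @matched = grep { $_ =~ $mixed } @tokens;

print "$_\n" for @matched;
```

Because each token is tested on its own, 555.123.4567 and +1-(800)-555-2468 are reported separately rather than as one string.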
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches - /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<DATA>) {
chomp;
say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input is coming, this takes a list of ready strings. Add word boundaries and/or /s, depending on the input. Or parse the input into a list of strings for this.
If the input comes as a multi-line string, split it into a list first: my @strings = split '\n|\s+', $text;.

Why doesn't this Regex match in Perl?

I have a string that can read something like this (although not always, numbers can vary).
Board Length,45,inches,color,board height,8,inches,black,store,wal-mart,Board weight,20,dollars
I am trying to match the 45 that follows the Board Length with this regex.
if ($string =~/Board Length,(\d+\.\d+)/){
print $string;
}
Is the formatting wrong? I thought \d+ would match as many digits as needed, \. would match a literal '.', and \d+ would match any digits after the decimal point (if there are any).
As you have written it, the decimal point and the digits after it are mandatory. Wrap them in (?:\.\d+)? to make them optional:
if ($string =~/Board Length,(\d+(?:\.\d+)?)/)
You are absolutely right about what that should match. However, without the '?' character, you are specifying that all of those pieces must be present.
\d+\.\d+
This means "1 or more numbers, period, 1 or more numbers"
1.5, 253333.7, 0.0 would all be matched. However, your example uses 45, which has no "." in it, nor numbers afterward. There are a few solutions to your problem, the most foolproof of which was stated above by mpapac: allow the decimal point and the following digits to be optional.
(\.\d+)?
The problem with this as such is that putting a () around it makes it another capture group. You may or may not want this. Putting the ?: inside it means "Use this as a group, but don't capture it." Hence:
(?:\.\d+)?
The other option is not to do the grouping, and instead make both the decimal itself optional and the digits after the decimal ZERO or more instead of ONE or more. That would look something like this:
\d+\.?\d*
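
The difference between the two optional forms can be sketched as follows. The helper name board_length and the inputs with 45.25 and a trailing "45." are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number following "Board Length,",
# or undef if there is none. The decimal part is optional and not captured
# separately because of the (?: ... ) group.
sub board_length {
    my ($s) = @_;
    return $s =~ /Board Length,(\d+(?:\.\d+)?)/ ? $1 : undef;
}

print board_length('Board Length,45,inches'), "\n";       # integer value
print board_length('Board Length,45.25,inches'), "\n";    # decimal value

# The looser alternative \d+\.?\d* also accepts a period with no digits after it.
my ($loose) = 'Board Length,45.,inches' =~ /Board Length,(\d+\.?\d*)/;
print "$loose\n";
```

With the grouped form a stray trailing period is left out of the capture, while the ungrouped form keeps it, which may or may not matter for the data at hand.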
You are not printing what you capture. You're printing $_, and we don't know what that contains.
if ($string =~/Board Length,(\d+\.\d+)/){
print $_;
}
What I think you want is:
if ($string =~/Board Length,(\d+\.\d+)/){
print $1;
}
You have the following expression:
$string =~/Board Length,(\d+\.\d+) /
Your string is this:
Board Length,45,inches
The string Board Length, will match the pattern Board Length,. However, the rest of your pattern matches one or more digits followed by a period followed by one or more digits. This doesn't match the string 45. There's no decimal point there.
The question is what are you trying to match. For example, if the number is surrounded by commas, you could do this:
$string =~ /Board Length,([^,]+),/;
my $number = $1;
The [^,] means Not a comma. You're capturing everything after a comma to the next comma. This will allow you to capture 45, 45.32, and even 4.5e+10. Just anything between the two commas.
Note that you use $1 for your first capture group and not $_.
Another way is to use non-greedy matching:
$string =~ /Board Length,(.+?),/;
my $number = $1;
What happens if what is captured isn't a number? You can test for that using the looks_like_number function from Scalar::Util (which has been included in Perl distributions for a long time):
use Scalar::Util qw(looks_like_number);
my $string = "Board Length,Extra long,feet,...";
...
$string =~ /Board Length,(.+?),/;
my $number = $1;
if ( looks_like_number( $number ) ) {
print "$number is a number\n";
}
else {
print "Nope. $number isn't a number\n";
}

Using /^\s*$/ to match and replace multiple blank lines gives a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book says that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I tried it, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I didn't get a single <p> but double between 'c' and 'b'. Does Perl's /$/ change after 5.8 ?
The lesson here is be wary of regular expressions that will match a zero-width pattern, you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex matches from position 2 until 4, capturing 2 return characters. After that replacement it starts searching from position 4 and matches a zero width pattern, so adds a second <p>.
One of the reasons this isn't intuitive is that our regex has already replaced the \n\n at positions 2 and 3 with a <p>. However, lookbehind assertions (of which ^ is a special variant) treat the string as it originally was, not as it might have been changed by previous passes of a /g regex. Therefore, when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in our output above), so ^ matches again and $ matches immediately after it with nothing in between.
The solution to this is to not allow zero width patterns by using + in this instance instead of *.
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
The quantifier * matches 0 or more times,
the quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
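
A minimal sketch of the + version applied to the string from the question, plus the simpler caab example. The helper name mark_blank_lines is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each run of blank lines with a single <p>.
sub mark_blank_lines {
    my ($text) = @_;
    # + requires at least one whitespace character, so no empty match is possible.
    $text =~ s/^\s+$/<p>/mg;
    return $text;
}

my $out = mark_blank_lines("c\n\n\n\n\nb");
$out =~ s/\n/\\n/g;    # make the newlines visible in the output
print "$out\n";

# The same idea for the simpler example: a+ cannot match the empty string.
(my $simple = 'caab') =~ s/a+/<p>/g;
print "$simple\n";
```

Since neither pattern can match zero characters, each run is replaced exactly once and no extra <p> appears.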
You can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Finding all the ten different digits in a random string

Sorry if this is answered somewhere, but I couldn't find it.
I need to write a regexp that matches strings that contain each of the digits from 0 to 9 exactly once. For example:
e8v5i0l9ny3hw1f24z7q6
You can see that numbers [0-9] are present exactly once and in random order. (Letters are present also exactly once, but that is an advanced quest...) It must not match if a digit is missing or if any digit is present more than one time.
So what would be the best regexp to match on strings like these? I am still learning regex and couldn't find a solution. It is PCRE, running in perl environment, but I cannot use perl, only the regex part of it. Sorry for my english and thank you in advance.
What about this pattern to verify the string:
^\D*(?>(\d)(?!.*\1)\D*){10}$
^\D* Starts with any amount of characters, that are no digit
(?>(\d)(?!.*\1)\D*){10} followed by 10x: a digit (captured in first capturing group), if the captured digit is not ahead, followed by any amount of \D non-digits, using a negative lookahead. So 10x a digit, with itself not ahead consecutive should result in 10 different [0-9].
\d is a shorthand for [0-9], \D is the negation [^0-9]
Test at regex101, Regex FAQ
If you need the digit-string then, just extract the digits, e.g. php (test eval.in)
$str = "e8v5i0l9ny3hw1f24z7q6";
$pattern = '/^\D*(?>(\d)(?!.*\1)\D*){10}$/';
if(preg_match($pattern, $str)) {
echo preg_replace('/\D+/', "", $str);
}
It is easy to create a regular expression that matches one specific permutation of the digits and ignores the other characters. E.g.
^[^\d]*0[^\d]*1[^\d]*2[^\d]*3[^\d]*4[^\d]*5[^\d]*6[^\d]*7[^\d]*8[^\d]*9[^\d]*$
You can combine the 10! expressions for every possible permutation with |
Although this is completely impractical, it shows that such a regular expression (without lookahead) is indeed possible.
However this is something that is much better done without regular expression matching.
$s = "e8v5i0l9ny3hw1f24z7q6";
$s = preg_replace('/[^\d]/i', '', $s); //remove non digits
if(strlen($s) == 10) //do we have 10 digits ?
if (!preg_match('/(\d).*\1/', $s)) //if no digit occurs twice, adjacent or not
echo "String has 10 different digits";
http://ideone.com/eY4eGx
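
Since the question mentions a Perl environment, the lookahead pattern can be cross-checked in Perl against a plain digit count. This is only a test harness sketch: the helper name has_each_digit_once and the two failing sample strings are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the first answer, unchanged.
my $ten_distinct = qr/^\D*(?>(\d)(?!.*\1)\D*){10}$/;

# Hypothetical cross-check without the pattern: count each digit with a hash
# and require that every digit from 0 to 9 was seen exactly once.
sub has_each_digit_once {
    my ($s) = @_;
    my %count;
    $count{$_}++ for $s =~ /[0-9]/g;
    my @once = grep { ($count{$_} // 0) == 1 } 0 .. 9;
    return @once == 10 ? 1 : 0;
}

for my $s ('e8v5i0l9ny3hw1f24z7q6',      # every digit exactly once
           'e8v5i0l9ny3hw1f24z7q',       # digit 6 is missing
           'e8v5i0l9ny3hw1f24z7q66') {   # digit 6 appears twice
    printf "%-24s regex=%d count=%d\n",
        $s, ($s =~ $ten_distinct ? 1 : 0), has_each_digit_once($s);
}
```

Both checks should agree on every input; a disagreement would point to a mistake in the pattern.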

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
edit: How do i return 2098, or 3213
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
See http://perldoc.perl.org/perlre.html for discussion of the 'g' modifier, which causes your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
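
One way to package the n-th match idea is to build the pattern with qr//, so that $n is interpolated into the quantifier when the pattern is compiled. This is a sketch under that assumption; the helper name nth_four_digit is illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th whole four-digit number in $s,
# or undef if there are fewer than $n of them.
sub nth_four_digit {
    my ($s, $n) = @_;
    # $n is interpolated into the {n} quantifier when qr// builds the pattern.
    my $re = qr/^(?:.*?\b(\d{4})\b){$n}/;
    # After a repeated group, $1 holds the capture from the last repetition.
    return $s =~ $re ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';

for my $n (1 .. 4) {
    my $found = nth_four_digit($str, $n);
    print "$n: ", defined $found ? $found : 'not found', "\n";
}
```

For a one-off lookup this works, but when all the numbers are needed the list-context /g match is simpler.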
If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";