Matching incremental digits in numbers - regex

After googling this for many days, I am finally posting the question here in the hope that the experts can solve it. I am looking for a regex pattern that can match incremental back references. Let me explain:
For the number 9422512322, the pattern (\d)\1 will match 22 two times. I want a pattern (something like (\d)\1+1) that matches 12, where the second digit is equal to the first digit + 1.
In short, the pattern should match all occurrences like 12, 23, 34, 45, 56, etc. There is no replacement; only the matches are required.

What about something like this?
/01|12|23|34|45|56|67|78|89/
It isn't sexy but it gets the job done.
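One caveat with the plain alternation: matches cannot overlap, so "123" yields 12 but not 23. A minimal sketch of a variant that also reports overlapping pairs, by wrapping the alternation in a capturing lookahead; the helper name incremental_pairs and the variable $pair_alternation are made up for illustration:

```perl
use strict;
use warnings;

# Build the alternation "01|12|23|...|89" rather than typing it by hand.
my $pair_alternation = join '|', map { sprintf '%d%d', $_, $_ + 1 } 0 .. 8;

# Hypothetical helper: returns every incremental pair, overlapping ones included.
sub incremental_pairs {
    my ($digits) = @_;
    # The lookahead consumes nothing, so the next attempt starts one
    # character later and "123" reports both 12 and 23.
    return $digits =~ /(?=($pair_alternation))/g;
}

print join( ' ', incremental_pairs('9422512322') ), "\n";    # prints "12 23"
print join( ' ', incremental_pairs('1234') ),       "\n";    # prints "12 23 34"
```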

You can use this regex:
(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9))+.
This will match:
Any 0s which are followed by 1s, or
Any 1s which are followed by 2s, or
Any 2s which are followed by 3s, ...
This is repeated one or more times (+), and the final . then matches the last digit of the run.
Tested against the string below, the matches are 12345, 5678 and 78:
12345555567877785
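To check this outside a regex tester, here is a small sketch in Perl that collects every run the pattern finds. The helper name incremental_runs is an assumption for illustration; the pattern itself is the one given above.

```perl
use strict;
use warnings;

# The pattern from the answer: each digit must be followed by its successor,
# repeated one or more times, then "." picks up the final digit of the run.
my $run = qr/(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9))+./;

# Hypothetical helper: with no capturing groups, a list-context /g match
# returns the full text of every match.
sub incremental_runs {
    my ($digits) = @_;
    return $digits =~ /$run/g;
}

print join( ' ', incremental_runs('12345555567877785') ), "\n";    # prints "12345 5678 78"
print join( ' ', incremental_runs('9422512322') ),        "\n";    # prints "123"
```

Unlike the two-digit alternation, this reports a whole ascending run such as 123 as a single match.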

You can run code within Perl regular expressions that can
control the regex execution flow. But, this is not likely
to be implemented anywhere else to this degree.
PCRE has some program variable interaction, but not like Perl.
(Note - to do overlap finds, replace the second ( \d ) with (?=( \d ))
then change the print statement to print "Overlap Found $1$3\n";)
If you use Perl, you can do all kinds of math-character relationships that can't be
done with brute force permutations.
- Good luck!
Perl example:
use strict;
use warnings;
my $dig2;
while ( "9342251232288 6709090156" =~
/
(
( \d )
(?{ $dig2 = $^N + 1 })
( \d )
(?(?{
$dig2 != $^N
})
(?!)
)
)
/xg )
{
print "Found $1\n";
}
Output:
Found 34
Found 12
Found 67
Found 01
Found 56

Here is one way to do it in Perl, using positive lookahead assertions:
#!/usr/bin/env perl
use strict;
use warnings;
my $number = "9422512322";
my @matches = $number =~ /(0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9))/g;
# We have only matched the first digit of each pair of digits.
# Add "1" to the first digit to generate the complete pair.
foreach my $first (@matches) {
my $pair = $first . ($first + 1);
print "Found: $pair\n";
}
Output:
Found: 12
Found: 23

Related

The need of \G anchor, capturing group of IPS on max 255 string [duplicate]

I am not clear on the use/need of the \G operator.
I read in the perldoc:
You use the \G anchor to start the next match on the same string where
the last match left off.
I don't really understand this statement. When we use /g, we usually move to the character after the last match anyway.
As the example shows:
$_ = "1122a44";
my @pairs = m/(\d\d)/g; # qw( 11 22 44 )
Then it says:
If you use the \G anchor, you force the match after 22 to start with
the a:
$_ = "1122a44";
my @pairs = m/\G(\d\d)/g;
The regular expression cannot match there since it does not
find a digit, so the next match fails and the match operator returns
the pairs it already found
I don't understand this either. "If you use the \G anchor, you force the match after 22 to start with a." But without the \G the matching will be attempted at a anyway right? So what is the meaning of this sentence?
I see that in the example the only pairs printed are 11 and 22. So 44 is not tried.
The example also shows that using c option makes it index 44 after the while.
To be honest, from all these I can not understand what is the usefulness of this operator and when it should be applied.
Could someone please help me understand this, perhaps with a meaningful example?
Update
I think I did not understand this key sentence:
If you use the \G anchor, you force the match after 22 to start with
the a . The regular expression cannot match there since it does not
find a digit, so the next match fails and the match operator returns
the pairs it already found.
This seems to mean that when the match fails, the regex does not make any further attempts, which is consistent with the examples in the answers.
Also:
After the match fails at the letter a , perl resets pos() and the next
match on the same string starts at the beginning.
\G is an anchor; it indicates where the match is forced to start. When \G is present, it can't start matching at some arbitrary later point in the string; when \G is absent, it can.
It is most useful in parsing a string into discrete parts, where you don't want to skip past other stuff. For instance:
my $string = " a 1 # ";
while () {
if ( $string =~ /\G\s+/gc ) {
print "whitespace\n";
}
elsif ( $string =~ /\G[0-9]+/gc ) {
print "integer\n";
}
elsif ( $string =~ /\G\w+/gc ) {
print "word\n";
}
else {
print "done\n";
last;
}
}
Output with \G's:
whitespace
word
whitespace
integer
whitespace
done
without:
whitespace
whitespace
whitespace
whitespace
done
Note that I am demonstrating using scalar-context /g matching, but \G applies equally to list context /g matching and in fact the above code is trivially modifiable to use that:
my $string = " a 1 # ";
my @matches = $string =~ /\G(?:(\s+)|([0-9]+)|(\w+))/g;
while ( my ($whitespace, $integer, $word) = splice @matches, 0, 3 ) {
if ( defined $whitespace ) {
print "whitespace\n";
}
elsif ( defined $integer ) {
print "integer\n";
}
elsif ( defined $word ) {
print "word\n";
}
}
But without the \G the matching will be attempted at a anyway right?
Without the \G, it won't be constrained to start matching there. It'll try, but it'll try starting later if required. You can think of every pattern as having an implied \G.*? at the front.
Add the \G, and the meaning becomes obvious.
$_ = "1122a44";
my @pairs = m/\G (\d\d)/xg; # qw( 11 22 )
my @pairs = m/\G .*? (\d\d)/xg; # qw( 11 22 44 )
my @pairs = m/ (\d\d)/xg; # qw( 11 22 44 )
To be honest, from all these I can not understand what is the usefulness of this operator and when it should be applied.
As you can see, you get different results by adding a \G, so the usefulness is getting the result you want.
Interesting answers, and a lot of them are valid I guess, but I can also guess that they still don't explain a lot.
\G 'forces' the next match to occur at the position the last match ended.
Basically:
$str="1122a44";
while($str=~m/\G(\d\d)/g) {
#code
}
First match = "11"
Second match is FORCED TO START at 22 and yes, that's \d\d, so result is "22"
Third 'try' is FORCED to start at "a", but that's not \d\d, so it fails.
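To see the difference side by side, here is a minimal sketch that runs the same loop with and without \G. The helper names pairs_anchored and pairs_unanchored are made up for illustration.

```perl
use strict;
use warnings;

# With \G, every match must begin exactly where the previous one ended
# (or at the start of the string for the first attempt).
sub pairs_anchored {
    my ($str) = @_;
    my @pairs;
    while ( $str =~ /\G(\d\d)/g ) {
        push @pairs, $1;
    }
    return @pairs;
}

# Without \G, the engine may skip ahead past the "a" to find another pair.
sub pairs_unanchored {
    my ($str) = @_;
    my @pairs;
    while ( $str =~ /(\d\d)/g ) {
        push @pairs, $1;
    }
    return @pairs;
}

print join( ' ', pairs_anchored('1122a44') ),   "\n";    # prints "11 22"
print join( ' ', pairs_unanchored('1122a44') ), "\n";    # prints "11 22 44"
```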

Is there a way to evaluate the number of times a Perl regular expression has matched?

I've been poring over perldoc perlre as well as the Regular Expressions Cookbook and related questions on Stack Overflow and I can't seem to find what appears to be a very useful expression: how do I know the number of current match?
There are expressions for the last closed group match ($^N), contents of match 3 (\g{3} if I understood the docs correctly), $', $& and $`. But there doesn't seem to be a variable I can use that simply tells me what the number of the current match is.
Is it really missing? If so, is there any explained technical reason why it is a hard thing to implement, or am I just not reading the perldoc carefully enough?
Please note that I'm interested in a built-in variable, NOT workarounds like using (?{$count++}).
For context, I'm trying to build a regular expression that would match only some instances of a match (e.g. match all occurrences of character "E" but do NOT match occurrences 3, 7 and 10 where 3, 7 and 10 are simply numbers in an array). I ran into this when trying to construct a more idiomatic answer to this SO question.
I want to avoid evaluating regexes as strings to actually insert 3, 7 and 10 into the regex itself.
I'm completely ignoring the actual utility or wisdom of using this for the other question.
I thought @- or @+ might do what you want since they hold the offsets of the numbered matches, but it looks like the regex engine already knows what the last index will be:
use v5.14;
use Data::Printer;
$_ = 'abc123abc345abc765abc987abc123';
my @matches = m/
([0-9]+)
(?{
print 'Matched \$' . $#+ . " group with $^N\n";
say p(@+);
})
.*?
([0-9]+)
(?{
print 'Matched \$' . $#+ . " group with $^N\n";
say p(@+);
})
/x;
say "Matches: @matches";
This gives strings that show the last index as 2 even though it hasn't matched $2 yet.
Matched \$2 group with 123
[
[0] 6,
[1] 6,
[2] undef
]
Matched \$2 group with 345
[
[0] 12,
[1] 6,
[2] 12
]
Matches: 123 345
Notice that the first time around, $+[2] is undef, so that one hasn't been filled in yet. You might be able to do something with that, but I think that's probably getting away from the spirit of your question. If you were really fancy, you could create a tied scalar that has the value of the last defined index in @+, I guess.
I played around with this for a bit. Again, I know that this is not really what you are looking for, but I don't think that exists in the way you want it.
I had two thoughts. First, with a split using separator retention mode, you get the interstitial bits as the odd numbered elements in the output list. With the list from the split, you count which match you are on and put it back together how you like:
use v5.14;
$_ = 'ab1cdef2gh3ij4k5lmn6op7qr8stu9vw10xyz';
my @bits = split /(\d+)/; # separator retention mode
my @skips = qw(3 7 10);
my $s;
while( my( $index, $value ) = each @bits ) {
# shift indices to match number ( index = 2 n - 1 )
if( $index % 2 and ! ( ( $index + 1 )/2 ~~ @skips ) ) {
$s .= '^';
}
else {
$s .= $value;
}
}
say $s;
I get:
ab^cdef^gh3ij^k^lmn^op7qr^stu^vw10xyz
I thought I really liked my split answer until I had the second thought. Does state work inside a substitution? It appears that it does:
use v5.14;
$_ = 'ab1cdef2gh3ij4k5lmn6op7qr8stu9vw10xyz';
my @skips = qw(3 7 10);
s/(\d+)/
state $n = 0;
$n++;
$n ~~ @skips ? $1 : '$'
/eg;
say;
This gives me:
ab$cdef$gh3ij$k$lmn$op7qr$stu$vw10xyz
I don't think you can get much simpler than that, even if that magic variable existed.
I had a third thought which I didn't try. I wonder if state works inside a code assertion. It might, but then I'd have to figure out how to use one of those to make a match fail, which really means it has to skip over the bit that might have matched. That seems really complicated, which is probably what Borodin was pressuring you to show even in pseudocode.
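For what it's worth, the same counting idea also works without state or the smartmatch operator. This is only a sketch under my own assumptions: the sub name replace_except and its argument order are made up, and a hash lookup stands in for ~~.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every run of digits with $replacement,
# except the matches whose ordinal (1-based) is listed in @skips.
sub replace_except {
    my ( $text, $replacement, @skips ) = @_;
    my %skip = map { $_ => 1 } @skips;    # hash lookup instead of smartmatch
    my $n    = 0;                         # number of the current match
    ( my $out = $text ) =~ s/(\d+)/ $skip{ ++$n } ? $1 : $replacement /eg;
    return $out;
}

print replace_except( 'ab1cdef2gh3ij4k5lmn6op7qr8stu9vw10xyz', '$', 3, 7, 10 ), "\n";
# prints ab$cdef$gh3ij$k$lmn$op7qr$stu$vw10xyz
```

The counter is an ordinary lexical that the /e replacement closes over, so each call starts counting from 1 again.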

regex validating a time value or a list of time values

I need a regex, which matches a single time value as well as lists of time values in the format hhmm[, hhmm] like for example:
"1245" or "0056, 1034,2355"
I am not so good with regex.. I thought this would do it:
(([0-1][0-9])|(2[0-3]))[0-5][0-9](,[ \t]*(([0-1][0-9])|(2[0-3]))[0-5][0-9])*
Single time values are validated correctly, but if I try lists of times, every number after the comma is accepted. It also matches "1235, 4711".
Can someone give me a hint as to what I am doing wrong?
Thanks in advance!
$pat = qr/(?:2[0-3]|[01][0-9])[0-5][0-9]/;
while (<DATA>) {
if (/^$pat(,\s*$pat)*$/) {
print;
}
}
__DATA__
1245
0056, 1034,2355
1034,2455
You should add a ^ to instruct the regular expression to match from the beginning of the line.
The following regex should work.
^([01][0-9]|2[0-3])[0-5][0-9](,\s*([01][0-9]|2[0-3])[0-5][0-9])*$
In my opinion this is more readable regexp and it should work.
while( <DATA> ) {
if( /
^(
(
((0|1)\d)|(2[0-3]) #regex for hour (the first number may be 0, 1, or 2
#if 0 or 1, the second number can be from 0 to 9
#if 2, the second number can be from 0 to 3
)
[0-5]\d #regex for minutes (the first number
#can be from 0 to 5, second from 0 to 9)
)
(
,\s* #comma required
#whitespace after the comma is optional
(
((0|1)\d)|(2[0-3])
)
[0-5]\d
)*$
/x ) {
print;
}
}
Your regular expression is basically fine except that it looks for the pattern anywhere inside the target string. That means any string that contains a single valid time will match. You must add beginning and end of string anchors ^ and $ to force the entire string to match the pattern.
You will find it clearer and easier to code regular expressions if you first write a common sub-expression and then use it like a subroutine. It also helps to use the /x modifier so that you can use whitespace to lay out the expression more clearly.
For instance, this matches a single time string
/ ( [0-1][0-9] | 2[0-3] ) [0-5][0-9] /x
and you can go on to substitute that twice in the main expression.
It is also better to use non-capturing parentheses like (?: ... ) unless you really want to capture the substring into $1, $2 etc.
Take a look at this program and see what you think
use strict;
use warnings;
my $time = qr/(?: (?: [0-1][0-9] | 2[0-3] ) [0-5][0-9] ) /x;
while (<DATA>) {
print if /^ $time (?: ,[ \t]* $time )* $/x;
}
__DATA__
1245
0056, 1034,2355
1235, 4711
0000,1111
output
1245
0056, 1034,2355
0000,1111
This regexp must work:
/^(\d+)(, ?\d+)*$/

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
Find a number
Look up the text code for the number using my_hash
Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
$num = $1;
$text_after_num = $2;
$search_for = $my_hash{$num};
if ( $text_after_num =~ /($search_for)/ ) {
print "Found $num, matches $search_for\n";
}
else {
print "Found $num, no matches\n";
}
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.
You have a problem with greediness. The 1,50 will consume as much as it can. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain, the question mark will make the multiple match non-greedy (it will stop as soon as the next pattern is matched - the next pattern gets precedence). The ?= is a lookahead operator to say "check if the next element is a digit. If so, match but do not consume." This allows the first digit to get picked up by the beginning of the regex and be put into the next matched pattern.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.
Just use:
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is:
/(?=9337.{1,50}AA)/
This will fail if AA is further than 50 chars away from the end of 9337. Of course you will have to interpolate your variables to match your hash's keys and values. This was just an example for your first key/value pair.
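Putting those two pieces together, here is a rough sketch of the whole lookup. The sub name find_codes is made up, and I'm assuming that "within 50 characters" means the 50 characters following the number. Because the /g match consumes only the number, the next search resumes right after it and 2214 is no longer swallowed.

```perl
use strict;
use warnings;

# Hypothetical helper: for each number in $text, look up its code in %$codes
# and check whether that code occurs in the 50 characters after the number.
sub find_codes {
    my ( $text, $codes ) = @_;
    my @results;
    while ( $text =~ /(\d+)/g ) {
        my $num    = $1;
        my $window = substr $text, pos($text), 50;    # text right after the number
        my $code   = $codes->{$num};
        if ( defined $code && index( $window, $code ) >= 0 ) {
            push @results, "Found $num, matches '$code'";
        }
        else {
            push @results, "Found $num, no matches";
        }
    }
    return @results;
}

my %my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
my $text = 'here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words';
print "$_\n" for find_codes( $text, \%my_hash );
```

For the sample text this prints the four lines asked for in the question.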

In Perl, how can I detect if a string has multiple occurrences of double digits?

I wanted to match 110110 but not 10110. That means at least twice repeating of two consecutive digits which are the same. Any regex for that?
Should match: 110110, 123445446, 12344544644
Should not match: 10110, 123445
/(\d)\1.*\1\1/
This matches a string with 2 instances of the same doubled digit, i.e. 11011 but not 10011
\d matches any digit
\1 matches the first match effectively doubling the first entry
This will also match 1111. If there must be other characters in between, change .* to .+
ooh, this looks neater
((\d)\2).*\1
If you want to find non-matching values, but there has to be 2 sets of doubles, then you would simply need to add the first part again as in
((\d)\2).*((\d)\4)
The bracketing would mean that $1 and $3 would contain the double digits and $2 and $4 contains the single digits (which are then doubled).
11233
$1=11
$2=1
$3=33
$4=3
If I understand correctly, your regexp will be:
m{
(\d)\1 # First repeated pair
.* # Anything in between
(\d)\2 # Second repeated pair
}x
For example:
for my $x (qw(110110 123445446 12344544644 10110 123445)) {
my $m = $x =~ m{(\d)\1.*(\d)\2} ? "matches" : "does not match";
printf "%-11s : %s\n", $x, $m;
}
110110 : matches
123445446 : matches
12344544644 : matches
10110 : does not match
123445 : does not match
If you're talking about all digits, this will do it:
00.*00|11.*11|22.*22|33.*33|44.*44|55.*55|66.*66|77.*77|88.*88|99.*99
It's just 10 different patterns OR'ed together, each of which checks for at least two occurrences of the desired 2-digit pattern.
Using Perl's more advanced REs, you can use the following for two consecutive digits twice:
(\d)\1.*\1\1
or, as one of your comments states, two consecutive digits followed somewhere by two more consecutive digits which may not be the same:
(\d)\1.*(\d)\2
Depending on what your data looks like, here's a way that needs only a minimal regex.
while(<DATA>){
chomp;
my @s = split /\s+/;
foreach my $i (@s){
if( $i =~ /123445/ && length($i) != 6){
print $_."\n";
}
}
}
__DATA__
This is a line
blah 123445446 blah
blah blah 12344544644 blah
.... 123445 ....
this is last line
There is no reason to do everything in one regex... You can use the rest of Perl as well:
#!/usr/bin/perl -l
use strict;
use warnings;
my @strings = qw( 11233 110110 10110 123445 123445446 12344544644 );
print if is_wanted($_) for @strings;
sub is_wanted {
my ($s) = @_;
my @matches = $s =~ /(?<group>(?<first>[0-9])\k<first>)/g;
return 1 < @matches / 2;
}
__END__
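If I've understood your question correctly, then this, according to RegexBuddy (set to use Perl syntax), will match 110110 but not 10110. It is written with \g{1} because Perl itself reads a bare \10 as an octal escape when the pattern has fewer than ten groups:
(1{2})0\g{1}0
The following is more general and will match any string where a two-digit sequence is repeated later on in the string (\d{2} does not itself require the two digits to be equal).
(\d{2})\d+\1\d*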
The above will match the following examples:
110110
110011
112345611
2200022345
Finally, to find two sets of double digits in a string and you don't care where they are, try this:
\d*?(\d{2})\d+?\1\d*
This will match the examples above plus this one:
12345501355789
It's the two sets of double 5 in the above example that are matched.
[Update]
Having just seen your extra requirement of matching a string with two different double digits, try this:
\d*?(\d)\1\d*?(\d)\2\d*
This will match strings like the following:
12342202345567
12342202342267
Note that the 22 and 55 cause the first string to match and the pair of 22 cause the second string to match.