Codegolf regex match - regex

On Code Golf I found this answer: https://codegolf.stackexchange.com/a/34345/29143, which contains this Perl one-liner:
perl -e '(q x x x 10) =~ /(?{ print "hello\n" })(?!)/;'
Running it through -MO=Deparse gives:
'          ' =~ /(?{ print "hello\n" })(?!)/;
 ^^^^^^^^^^
 10 spaces
The explanation says that (?!) never matches, so the regex tries to match at each character. OK, but why does it print hello 11 times and not 10 times?

Regular expressions are matched at positions, which include the position before each character and also the position after the last character.
The following zero-width regular expression matches before each of the 5 characters of the string, and also after the last one, which demonstrates why you got 11 prints instead of just 10.
use strict;
use warnings;
my $string = 'ABCDE';
# Zero width Regular expression
$string =~ s//x/g;
print $string;
Outputs:
xAxBxCxDxEx
^ ^ ^ ^ ^ ^
1 2 3 4 5 6
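The same count can be checked without a substitution. The following is a minimal sketch; count_positions is a hypothetical helper name, and it relies on /(?=)/ being a pattern that always succeeds without consuming anything.

```perl
use strict;
use warnings;

# Hypothetical helper: count the positions at which a zero-width
# pattern can match in $string.
sub count_positions {
    my ($string) = @_;
    # In list context /g returns one (empty) match per position;
    # the "= () =" idiom turns that list into a count.
    my $count = () = $string =~ /(?=)/g;
    return $count;
}

print count_positions('ABCDE'), "\n";     # 6
print count_positions(' ' x 10), "\n";    # 11
```

A string of n characters gives n + 1, which is the 11 seen for the 10 spaces in the question.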

It's because when you have a string of n characters there are n+1 positions in the string where the pattern is tested.
example with "abc":
a b c
^ ^ ^ ^
| | | |
| | | +--- end of the string
| | +----- position of c
| +------- position of b
+--------- position of a
The position of the end of the string can be a little counter-intuitive, but this position exists. To illustrate this fact, consider the pattern /c$/ that will succeed with the example string. (think of the position in the string when the end anchor is tested). Or this other one /(?<=c)/ that succeeds in the last position.
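To see the end position concretely, the offset of a match can be read from $-[0]. This is only a sketch; match_offset is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: offset at which $regex first matches in $string,
# or undef if it does not match at all.
sub match_offset {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $-[0] : undef;
}

print match_offset('abc', qr/\z/),     "\n";   # 3 (after the last character)
print match_offset('abc', qr/(?<=c)/), "\n";   # 3
print match_offset('abc', qr/c$/),     "\n";   # 2 (the match starts at c)
```

For /c$/ the match starts at offset 2, but the $ anchor itself is tested at offset 3.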

Take a look at the following:
$x = "abc"; $x =~ s/.{0}/x/; print("$x\n"); # xabc
$x = "abc"; $x =~ s/.{1}/x/; print("$x\n"); # xbc
$x = "abc"; $x =~ s/.{2}/x/; print("$x\n"); # xc
$x = "abc"; $x =~ s/.{3}/x/; print("$x\n"); # x
Nothing surprising. You can match anywhere between 0 and 3 of the three characters, and place an x at the position where you left off. That's four positions for three characters.
Also consider 'abc' =~ /^abc\z/.
Starting at position 0, ^ matches zero chars.
Starting at position 0, a matches one char.
Starting at position 1, b matches one char.
Starting at position 2, c matches one char.
Starting at position 3, \z matches zero char.
Again, that's a total of four positions needed for a three character string.
Only zero-width assertions can match at the last position, but there are plenty of those (^, \z, \b, (?=...), (?!...), (?<=...), (?:...)?, etc).
You can think of the positions as the edges of the characters, if that helps.
|a|b|c|
0 1 2 3

Related

exactly once from a set of characters perl using regex

How can I check that exactly one character from a group of characters occurs in a string in Perl using a regexp? Suppose that, from (abcde), I want to check that only one of these 5 characters occurs, though it may occur multiple times. I have tried quantifiers, but they do not work for a set of characters.
You could use the following regex match:
/
^
[^a-e]*+
(?: a [^bcde]*+
| b [^acde]*+
| c [^abde]*+
| d [^abce]*+
| e [^abcd]*+
)
\z
/x
The following is a simpler pattern that might be less efficient:
/ ^ [^a-e]*+ ([a-e]) (?: \1|[^a-e] )*+ \z /x
A non-regex solution might be simpler.
# Count the number of instances of each letter.
my %chars;
++$chars{$_} for split //;
# Count how many of [a-e] are found.
my $count = 0;
++$count for grep $chars{$_}, qw( a b c d e );
$count == 1
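As a rough, runnable comparison of the two approaches, the sketch below wraps the counting idea and the backreference pattern in subs. Both sub names are hypothetical, and the counting version is a slight variation that only records the letters a to e.

```perl
use strict;
use warnings;

# Hypothetical helper: true if exactly one distinct letter from a..e occurs
# (that one letter may occur any number of times).
sub uses_one_letter_count {
    my ($string) = @_;
    my %seen;
    $seen{$_}++ for grep { /[a-e]/ } split //, $string;
    return keys(%seen) == 1 ? 1 : 0;
}

# The same check with the simpler backreference pattern.
sub uses_one_letter_regex {
    my ($string) = @_;
    return $string =~ / ^ [^a-e]*+ ([a-e]) (?: \1 | [^a-e] )*+ \z /x ? 1 : 0;
}

for my $s (qw( xxaxxa ab xyz )) {
    printf "%-7s count=%d regex=%d\n", $s,
        uses_one_letter_count($s), uses_one_letter_regex($s);
}
```

Only xxaxxa passes: ab contains two different letters from the set, and xyz contains none.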
You can use a regex to return a list of matches, and then store the result in an array.
my @arr = "abcdeaa" =~ /a/g; print scalar(@arr) . "\n";
prints 3
my @arr = "bcde" =~ /a/g; print scalar(@arr) . "\n";
prints 0
Using scalar(@arr) returns the length of the array.

Perl regex: Substitution of everything but the pattern

In Perl, I would like to substitute a negated character class (everything but the pattern) with nothing, to keep only the expected string. Normally this approach should work, but in my case it doesn't:
$var =~ s/[^PATTERN]//g;
the original string:
$string = '<iframe src="https://foo.bar/embed/b74ed855-63c9-4795-b5d5-c79dd413d613?autoplay=1&context=cGF0aD0yMSwx</iframe>';
wished pattern to get: b74ed855-63c9-4795-b5d5-c79dd413d613
(5 hex number groups split with 4 dashes)
my code:
$pattern2keep = "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}";
(should match only : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (5 hex number groups split with 4 dashes) , char length : 8-4-4-4-12 )
The following should substitute everything but the pattern by nothing, but in fact it does not.
$string =~ s/[^$pattern2keep]//g;
What am I doing wrong please? Thanks.
A character class matches a single character equal to any one of the characters in the class. If the class begins with a caret then the class is negated, so it matches any one character that isn't any of the characters in the class.
If $pattern2keep is [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} then the first ] in it closes the class, so [^$pattern2keep] is parsed as the negated class [^[0-9a-f] (any character other than [, a digit, or a to f), followed by {8}, a literal -, the rest of your pattern, and a final literal ]. That never matches your string, so nothing is removed.
You need to capture the substring, like this
use strict;
use warnings 'all';
use feature 'say';
my $string = '<iframe src="https://foo.bar/embed/b74ed855-63c9-4795-b5d5-c79dd413d613?autoplay=1&context=cGF0aD0yMSwx</iframe>';
my $pattern_to_keep = qr/ \p{hex}{8} (?: - \p{hex}{4} ){3} - \p{hex}{12} /x;
my $kept;
$kept = $1 if $string =~ /($pattern_to_keep)/;
say $kept // 'undef';
output
b74ed855-63c9-4795-b5d5-c79dd413d613
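If a substitution is still preferred, one possible sketch is to match the whole string and put back only the captured part. The sub name keep_only_uuid is hypothetical, the /r modifier assumes Perl 5.14 or later, and the input is returned unchanged when the pattern is not found.

```perl
use 5.014;
use strict;
use warnings;

my $uuid_re = qr/ [0-9a-f]{8} (?: - [0-9a-f]{4} ){3} - [0-9a-f]{12} /x;

# Hypothetical helper: drop everything around the first match of $uuid_re.
sub keep_only_uuid {
    my ($string) = @_;
    # .*? skips the text before the first match, .* swallows the rest;
    # /s lets . cross newlines and /r returns the result instead of
    # modifying $string.
    return $string =~ s/ \A .*? ($uuid_re) .* \z /$1/xsr;
}

my $string = '<iframe src="https://foo.bar/embed/b74ed855-63c9-4795-b5d5-c79dd413d613?autoplay=1&context=cGF0aD0yMSwx</iframe>';
say keep_only_uuid($string);
```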

Pattern matching in perl (Lookahead and Condition on word Index)

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (starting with ";") when the index of a successful match is divisible by 5?
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.
Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then set up an alternation so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my @matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . @matches . "\n";
Outputs:
Number of Matches = 1
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
The pattern looks for a number that is a single word and is followed by a number beginning with the letter s.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my @matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "@matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions
You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}
Revised Answer
One relatively simple way to achieve what you want is to replace every delimiter in the original text except those that sit on a 5-word-index boundary (the ; in front of words 5, 10, 15, and so on):
my $idx = 0;
$text =~ s/;/++$idx % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
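The delimiter-rewriting idea can be put together as a self-contained sketch. The sub name count_on_fifth is hypothetical, the pattern is hard-coded to the example from the question, and the counter is pre-incremented so that the ; left in place is the one directly in front of words 5, 10, 15, and so on.

```perl
use strict;
use warnings;

# Hypothetical helper: count ";the" followed by a word starting with f,
# but only where "the" is word number 5, 10, 15, ...
sub count_on_fifth {
    my ($text) = @_;            # works on a copy, the caller's string is untouched
    my $idx = 0;
    # The n-th semicolon precedes word n; keep it only when n is a multiple of 5.
    $text =~ s/;/++$idx % 5 ? ',' : ';'/eg;
    my $count = () = $text =~ /;the(?=,f)/g;
    return $count;
}

print count_on_fifth(';the;fox;jumped;over;the;dog;the;duck;and;the;frog;'), "\n";   # 1
```

This still copies and rewrites the whole string once, so for an extremely long input it trades memory for simplicity.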
Original answer after the break. (Thanks to @choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}
First off, pos can also be used as a left-hand-side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for indices divisible by 5 (your example also allowed indices not divisible by 5 to be a solution). Since you only wanted the number of matches, I only used a $count variable and incremented it. If you want something more, use a normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve your problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

Does a regex try to match against the positions between characters in the text?

I had the idea that when matching a regex only the characters of the text are matched. But then I saw this:
$ perl -e '
my $var = "abcde";
$var =~ s/x?/!/g;
print "$var\n";
'
!a!b!c!d!e!
The way I understand this is that the regex is attempted to be matched against the characters and the nothingness in the indexes between the characters. Is this correct? Or else how come we get the exclamations between the characters?
Yes, that is a useful way to think about it. More formally, we can imagine any string containing a zero-length substring at any position:
'' eq '' . ''
'foo' eq '' . 'f' . '' . 'o' . '' . 'o' . ''
The regex /x?/ tries to match x, or the zero-length string. It's equivalent to /x|/. Note that a regex that always succeeds looks like /(?=)/ (look-ahead to see a zero-length string), because // is special-cased to repeat the last match, except when used in split //, ... to split after each character.
The match will nevertheless move forward one character in order to avoid an infinite loop: split //, "foo" produces 'f', 'o', 'o' and not '', '', '', ..., 'f', 'o', 'o'
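A short sketch of these equivalences follows; mark_matches is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every match of $regex in a copy of $string.
sub mark_matches {
    my ($string, $regex) = @_;
    (my $copy = $string) =~ s/$regex/!/g;
    return $copy;
}

print mark_matches('abcde', qr/x?/), "\n";   # !a!b!c!d!e!
print mark_matches('abcde', qr/x|/), "\n";   # !a!b!c!d!e!

# split // is the special case: it yields the characters themselves.
my @chars = split //, 'foo';
print scalar(@chars), "\n";                  # 3
```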
Regex matches are expressed in terms of starting position and end position, which is to say starting position and length.
$ perl -E'say "pos:$-[0] len:".($+[0]-$-[0]) while "abcde" =~ /x?/g;'
pos:0 len:0
pos:1 len:0
pos:2 len:0
pos:3 len:0
pos:4 len:0
pos:5 len:0
It's not so much matching in between as replacing zero characters at each position.
(It would match an infinite number of times at position 0 if there weren't a rule preventing a match of the same number of characters at the same position twice. This forces the engine to look at other positions until all positions are exhausted.)
(There's a virtual position at the end of the string so $ and \z can match.)
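The rule is easiest to see with a lazy quantifier, as in the example given in perlre. In this sketch (bracket_matches is a hypothetical helper name) \w?? prefers the empty match, so at each position the engine first matches nothing and is then forced to take one character at that same position before moving on.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every match of the lazy pattern \w?? in <>.
sub bracket_matches {
    my ($string) = @_;
    (my $copy = $string) =~ s/(\w??)/<$1>/g;
    return $copy;
}

print bracket_matches('bar'), "\n";   # <><b><><a><><r><>
```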
As ikegami points out in his answer, each position before, between, and after the characters is a valid position for the beginning of a match. In your example, the matches start at each possible position and span a length of 0, so each 0-length match gets replaced with a 1-length '!'.
Internally, the $var maintains a position that the regex engine uses to track its progress while doing the substitution. This position represents the index of the "nothingness" before/between/after the characters. To help visualize this position, here is the same code snippet with various print statements (of the position of $var) inserted throughout the regex.
$ perl -e 'my $var = "abcde";
$var =~ s/(?{ print "Before: ", pos, "\n" })
(?:
(?{ print "Inside: ", pos, "\n" })
x
(?{ print "Never: ", pos, "\n" })
)?
(?{ print "After: ", pos, "\n" })
/ print "Done: ", pos, "\n\n"; "!" /gex;
print "$var\n";
'
The output of the above is:
Before: 0
Inside: 0
After: 0
Done: 0
Before: 0
Inside: 0
After: 0
Before: 1
Inside: 1
After: 1
Done: 1
Before: 1
Inside: 1
After: 1
Before: 2
Inside: 2
After: 2
Done: 2
Before: 2
Inside: 2
After: 2
Before: 3
Inside: 3
After: 3
Done: 3
Before: 3
Inside: 3
After: 3
Before: 4
Inside: 4
After: 4
Done: 4
Before: 4
Inside: 4
After: 4
Before: 5
Inside: 5
After: 5
Done: 5
Before: 5
Inside: 5
After: 5
!a!b!c!d!e!
I'm not sure why the Before, Inside, and After sections are executed twice per position. My guess is that the regex engine is able to detect an infinite loop (as ikegami and amon point out), so it avoids matching those positions the second time it encounters them.
I will say that the pattern matches an empty string at each position in the string.

Regular expression that matches the longest repeating sequence

I want to match the longest sequence that is repeating at least once
Having:
T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced
the result should be: pending-cancel-replace_replaced
Try this
(.+)(?=.*\1)
This will match any character sequence with at least one character, that is repeated later on in the string.
You would need to store your matches and decide which one is the longest afterwards.
This solution requires your regex flavour to support backreferences and lookaheads.
It matches any character sequence of at least one character (.+) and stores it in group 1 because of the parentheses around it. The next step is the positive lookahead (?=.*\1), which succeeds if the captured sequence occurs again later in the string.
Here is a Perl script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $s = q/T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced/;
my $max = 0;
my $seq = '';
while($s =~ /(.+)(?=.*\1)/g) {
if(length$1 > $max) {
$max = length $1;
$seq = $1;
}
}
say "longest sequence : $seq, length = $max";
output:
longest sequence : _pending-cancel-replace_replaced, length = 32
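One caveat: a /g loop resumes after the end of each match, so a longer repeat that starts inside an earlier match can be missed. The sketch below is a variation, not the script above: it wraps the whole pattern in a lookahead so that every start position is tried. The sub name longest_repeat is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: longest substring that occurs again later in $string.
sub longest_repeat {
    my ($string) = @_;
    my $longest = '';
    # The outer lookahead consumes nothing, so the next attempt starts
    # one character further on instead of after the captured text.
    while ($string =~ /(?=(.+)(?=.*\1))/g) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

# A plain (.+)(?=.*\1) loop would find "ab" and then "cde" here.
print longest_repeat('abcde_ab_bcde'), "\n";   # bcde
```

For the string in the question both versions give _pending-cancel-replace_replaced.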
I have to admit that this one got me thinking. It was obvious that positive lookahead is absolutely necessary to solve this with regex. Anyhow here is how it would work in Java:
public static String biggestOccurance(String input){
Pattern p = Pattern.compile("(.+)(?=.*\\1)");
Matcher m = p.matcher(input);
String longestOccurence = "";
while(m.find()){
if(longestOccurence.length() < m.group(1).length()) longestOccurence = m.group(1);
}
return longestOccurence;
}
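The thing that got me stuck was the \\1. I knew that you could refer to a captured group in Java with $1, but if you write $1 in place of \\1 inside the pattern it will not work: $1 is only understood in replacement strings, while a backreference inside the pattern itself is \1, written \\1 in a Java string literal.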
Cheers,Eugene.
Using Perl you can do:
s='T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced'
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g'
Which gives:
T_
send_
ac
k-
n
e
w_
a
mend
_pending-cancel-replace_replaced
_
cancel
_
p
e
n
d
in
g-
c
a
nce
l
-replace
_re
placed
You then need to post-process this in any language or script to get the longest text.
One possible way to post-process the repeated strings is with awk:
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g' | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }'
Which prints:
_pending-cancel-replace_replaced
PS: Note longest string here is _pending-cancel-replace_replaced