regular epxressions that matches the longest repeating sequence - regex

I want to match the longest sequence that is repeating at least once
Having:
T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced
the result should be: pending-cancel-replace_replaced

Try this
(.+)(?=.*\1)
See it here on Regexr
This will match any character sequence with at least one character, that is repeated later on in the string.
You would need to store your matches and decide which one is the longest afterwards.
This solution requires your regex flavour to support backreferences and lookaheads.
it will match any character sequence with at least one character .+ and store it in the group 1 because of the brackets around it. The next step is the positive lookahead (?=.*\1), it will be true if the captured sequence occurs at a later point again in the string.

Here a perl script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $s = q/T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced/;
my $max = 0;
my $seq = '';
while($s =~ /(.+)(?=.*\1)/g) {
if(length$1 > $max) {
$max = length $1;
$seq = $1;
}
}
say "longuest sequence : $seq, length = $max"
output:
longuest sequence : _pending-cancel-replace_replaced, length = 32

I have to admit that this one got me thinking. It was obvious that positive lookahead is absolutely necessary to solve this with regex. Anyhow here is how it would work in Java:
public static String biggestOccurance(String input){
Pattern p = Pattern.compile("(.+)(?=.*\\1)");
Matcher m = p.matcher(input);
String longestOccurence = "";
while(m.find()){
if(longestOccurence.length() < m.group(1).length()) longestOccurence = m.group(1);
}
return longestOccurence;
}
The thing that got me stuck was the
\\1
I knew that you could refer to a backreference in Java with
$1
but if you replace $1 with \\1 it will not work.
Will have to dig into that.
Cheers,Eugene.

Using Perl you can do:
s='T_send_ack-new_amend_pending-cancel-replace_replaced_cancel_pending-cancel-replace_replaced'
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g'
Which gives:
T_
send_
ac
k-
n
e
w_
a
mend
_pending-cancel-replace_replaced
_
cancel
_
p
e
n
d
in
g-
c
a
nce
l
-replace
_re
placed
Then you need to post process it in any language or script to get longest text.
One Possible Post Processing of repeated string can be using awk:
echo $s | perl -pe 's/([^\s]+)(?=.*?\1)/\1\n/g' | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }'
Which prints:
_pending-cancel-replace_replaced
PS: Note longest string here is _pending-cancel-replace_replaced

Related

How to verify if a variable value contains a character and ends with a number using Perl

I am trying to check if a variable contains a character "C" and ends with a number, in minor version. I have :
my $str1 = "1.0.99.10C9";
my $str2 = "1.0.99.10C10";
my $str3 = "1.0.999.101C9";
my $str4 = "1.0.995.511";
my $str5 = "1.0.995.AC";
I would like to put a regex to print some message if the variable has C in 4th place and ends with number. so, for str1,str2,str3 -> it should print "matches". I am trying below regexes, but none of them working, can you help correcting it.
my $str1 = "1.0.99.10C9";
if ( $str1 =~ /\D+\d+$/ ) {
print "Candy match1\n";
}
if ( $str1 =~ /\D+C\d+$/ ) {
print "Candy match2\n";
}
if ($str1 =~ /\D+"C"+\d+$/) {
print "candy match3";
}
if ($str1 =~ /\D+[Cc]+\d+$/) {
print "candy match4";
}
if ($str1 =~ /\D+\\C\d+$/) {
print "candy match5";
}
if ($str1 =~ /C[^.]*\d$/)
C matches the letter C.
[^.]* matches any number of characters that aren't .. This ensures that the match won't go across multiple fields of the version number, it will only match the last field.
\d matches a digit.
$ matches the end of the string. So the digit has to be at the end.
I found it really helpful to use https://www.regextester.com/109925 to test and analyse my regex strings.
Let me know if this regex works for you:
((.*\.){3}(.*C\d{1}))
Following your format, this regex assums 3 . with characters between, and then after the third . it checks if the rest of the string contains a C.
EDIT:
If you want to make sure the string ends in a digit, and don't want to use it to check longer strings containing the formula, use:
^((.*\.){3}(.*C\d{1}))$
Lets look what regex should look like:
start{digit}.{digit}.{2-3 digits}.{2-3 digits}C{1-2 digits}end
very very strict qr/^1\.0\.9{2,3}\.101?C\d+\z/ - must start with 1.0.99[9]?.
very strict qr/^1\.\0.\d{2,3}\.\d{2,3}C\d{1,2}\z/ - must start with 1.0.
strict qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d{1,2}\z/
relaxed qr/^\d\.\d\.\d+\.\d+C\d+\z/
very relaxed qr/\.\d+C\d+\z/
use strict;
use warnings;
use feature 'say';
my #data = qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/;
#my $re = qr/^\d\.\d\.\d+\.\d+C\d+\z/;
my $re = qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d+\z/;
say '--- Input Data ---';
say for #data;
say '--- Matching -----';
for( #data ) {
say 'match ' . $_ if /$re/;
}
Output
--- Input Data ---
1.0.99.10C9
1.0.99.10C10
1.0.999.101C9
1.0.995.511
1.0.995.AC
--- Matching -----
match 1.0.99.10C9
match 1.0.99.10C10
match 1.0.999.101C9

exactly once from a set of characters perl using regex

how to check exactly one character from a group of characters in perl using regexp.Suppose from (abcde) i want to check if out of all these 5 characters only one has occured which can occur multiple times.I have tried quantifiers but it does not work for a set of characters.
You could use the following regex match:
/
^
[^a-e]*+
(?: a [^bcde]*+
| b [^acde]*+
| c [^abde]*+
| d [^abce]*+
| e [^abcd]*+
)
\z
/x
The following is a simpler pattern that might be less efficient:
/ ^ [^a-e]*+ ([a-e]) (?: \1|[^a-e] )*+ \z /x
A non-regex solution might be simpler.
# Count the number of instances of each letter.
my %chars;
++$chars{$_} for split //;
# Count how many of [a-e] are found.
my $count = 0;
++$count for grep $chars{$_}, qw( a b c d e );
$count == 1
you can use regex to return a list of matches. then you can store the result in an array.
my #arr = "abcdeaa" =~ /a/g; print scalar #arr ."\n";
prints 3
my #arr = "bcde" =~ /a/g; print scalar #arr ."\n";
prints 0
if you use scalar #arr. it will return the length of the array.

Perl regex \d+ and [0-9] operators show only single digit in a alpha-numeric string

I encountered the following problem: If I use the code in the first example the variable $1 includes only the last digit of each string. However, if I use the third example where each "string" is just a number the $1 variable shows the full number with all digits. To me it appears that the \d+ operator works differently in alpha-numeric context and just numeric context.
Here are my questions: Can you reproduce this? Is this behavior intended? How can I capture the full number in the alpha-numeric context using a regex operation in perl? If the nature of the \d operator is by nature lazy, can I make it more greedy (if true, how would i do it?)?
Example 1:
perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\w+(\d+)\w+/) {$num = $1; print $num,"\n";}'
Output:
9
0
Example 2:
perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\w+([0-9]+)\w+/) {$num = $1; print $num,"\n";}'
Output:
9
0
Example 3:
perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/(\d+)/) {$num = $1; print $num,"\n";}'
Output:
199
200
Thanks in advance. Any help is highly appreciated.
Best,
Chris
The results you get are expected. In /\A\w+(\d+)\w+/, the first \w+ is a greedy pattern and will grab as many chars as it can match, and since \w also matches digits.
Either use lazy quantifier - /\A\w+?(\d+)\w+/, or subtract the digit from \w (e.g. like in /\A[^\W\d]+(\d+)\w+/). The \w+? will match 1 or more word chars (letters/digits/_) as few as possible, and [^\W\d] matches any letters or _ symbols, thus, no need to use a lazy quantifier with this pattern.
the problem is that digits are matched by \w.
You should replace "\w" with "\D" ("not digit").
For example :
perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\D+(\d+)\D+/) {$num = $1; print $num,"\n";}'
Output:
199
200
Of course, if your data can contain more than one occurrence of digits in a single string, you'll need some more precise regexp.

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

How can I match a pipe character followed by whitespace and another pipe?

I am trying to find all matches in a string that begins with | |.
I have tried: if ($line =~ m/^\\\|\s\\\|/) which didn't work.
Any ideas?
You are escaping the pipe one time too many, effectively escaping the backslash instead.
print "YES!" if ($line =~ m/^\|\s\|/);
Pipe character should be escaped with a single backslash in a Perl regex. (Perl regexes are a bit different from POSIX regexes. If you're using this in, say, grep, things would be a bit different.) If you're specifically looking for a space between them, then use an unescaped space. They're perfectly acceptable in a Perl regex. Here's a brief test program:
my #lines = <DATA>;
for (#lines) {
print if /^\| \|/;
}
__DATA__
| | Good - space
|| Bad - no space
| | Bad - tab
| | Bad - beginning space
Bad - no bars
If it's a literal string you're searching for, you don't need a regular expression.
my $search_for = '| |';
my $search_in = whatever();
if ( substr( $search_in, 0, length $search_for ) eq $search_for ) {
print "found '$search_for' at start of string.\n";
}
Or it might be clearer to do this:
my $search_for = '| |';
my $search_in = whatever();
if ( 0 == index( $search_in, $search_for ) ) {
print "found '$search_for' at start of string.\n";
}
You might also want to look at quotemeta when you want to use a literal in a regexp.
Remove the ^ and the double back-slashes. The ^ forces the string to be at the beginning of the string. Since you're looking for all matches in one string, that's probably not what you want.
m/\|\s\|/
What about:
m/^\|\s*\|/