How to match sequence group?

Say the given string is abcwhateverdefwhatever34567whatever012. How can I match the groups of characters that are in sequence, i.e. match abc, def, 34567 and 012?
The regex I have now is (.)\1{2,}, but it matches repeated identical characters, not characters in sequence.

If you're still looking for PHP code, this scans the string once and collects every run of consecutive character codes:
function getSequence($str) {
$prev = 0; $next = 0; $length = strlen($str);
$temp = "";
$result = []; // initialise so an empty array is returned when nothing matches
for($i = 0; $i < $length; $i++) {
$next = ord($str[$i]);
if ($next == $prev + 1) {
$temp .= $str[$i];
} else {
if (strlen($temp) > 1) $result[] = $temp;
$temp = $str[$i];
}
$prev = $next;
}
if (strlen($temp) > 1) $result[] = $temp;
return $result;
}
$str = "abcwhateverdefwhatever34567whatever012";
print_r(getSequence($str));

Here's a solution that solves the problem with regex. It's not very efficient though and I wouldn't recommend it.
from re import findall, X
text = "abcwhateverdefwhatever34567whatever012"
reg = r"""
(?:
(?:0(?=1))|
(?:(?<=0)1)|(?:1(?=2))|
(?:(?<=1)2)|(?:2(?=3))|
(?:(?<=2)3)|(?:3(?=4))|
(?:(?<=3)4)|(?:4(?=5))|
(?:(?<=4)5)|(?:5(?=6))|
(?:(?<=5)6)|(?:6(?=7))|
(?:(?<=6)7)|(?:7(?=8))|
(?:(?<=7)8)|(?:8(?=9))|
(?:(?<=8)9)|
(?:a(?=b))|
(?:(?<=a)b)|(?:b(?=c))|
(?:(?<=b)c)|(?:c(?=d))|
(?:(?<=c)d)|(?:d(?=e))|
(?:(?<=d)e)|(?:e(?=f))|
(?:(?<=e)f)
){1,}
"""
print(findall(reg, text, X))
The result is:
['abc', 'def', '34567', '012']
As you can see, I only added the digits and the first 6 letters of the alphabet. It should be fairly obvious how to continue.
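Rather than typing out the remaining alternatives by hand, the same pattern can be generated. What follows is only a sketch: build_run_pattern, RUNS and find_runs are names made up for this illustration, and the choice of digits plus lowercase ASCII letters is an assumption about what should count as a sequence.

```python
import re
import string

def build_run_pattern(*alphabets):
    """Compile a regex that matches runs of consecutive characters.

    Hypothetical helper: for every adjacent pair (prev, nxt) in each
    alphabet it emits the same two alternatives as the hand-written regex.
    """
    parts = []
    for alphabet in alphabets:
        for prev, nxt in zip(alphabet, alphabet[1:]):
            # prev qualifies when its successor follows it ...
            parts.append(f"{re.escape(prev)}(?={re.escape(nxt)})")
            # ... and nxt qualifies when its predecessor precedes it.
            parts.append(f"(?<={re.escape(prev)}){re.escape(nxt)}")
    return re.compile("(?:" + "|".join(parts) + ")+")

# Assumption: digits and lowercase ASCII letters are the alphabets of interest.
RUNS = build_run_pattern(string.digits, string.ascii_lowercase)

def find_runs(text):
    """Return all runs of consecutive characters found in text."""
    return RUNS.findall(text)

print(find_runs("abcwhateverdefwhatever34567whatever012"))
# ['abc', 'def', '34567', '012']
```

The generated pattern inherits the behaviour of the hand-written one, including the fact that two runs that touch (for example abbc) come back as a single match.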

Related

perl: string matching to find longest substring

$string1 = "peachbananaapplepear";
$string2 = "juicenaapplewatermelonpear";
I want to know what's the longest common substring containing the word "apple".
$string2 =~ m/.+apple.+/;
print $string2;
So I use the match operator, and .+ for matching any character before and after the keyword "apple". When I print $string2 it doesn't return naapple but returns the original $string2 instead.

Here is one approach. First get the locations where 'apple' appears in the strings. And for each of those locations in string1, look at all locations in string2.
Look to the left and right to see how far the commonality extends from the initial location.
$string1 = "peachbananaapplepear12345applegrapeapplebcdefghijk";
$string2 = "juicenaapplewatermelonpearkiwi12345applebcdefghijkberryapple";
my $SearchFor="apple";
my $SearchStrLen = length($SearchFor);
# Get locations in first string where the search term appears
my @FirstPositions = getPositions($string1);
# Get locations in second string where the search term appears
my @SecondPositions = getPositions($string2);
CheckForMaxMatch();
sub getPositions
{
my $GivenString = shift;
my @Positions;
my $j=0;
for (my $i=0; $i < length($GivenString); $i += ($SearchStrLen+1) )
{
$j = index($GivenString, $SearchFor, $i);
if ($j == -1) {
last;
}
push (@Positions, $j);
$i = $j;
}
return @Positions;
}
sub CheckForMaxMatch
{
my $MaxLeft=0;
my $MaxRight=0;
# From the location of 'apple', look to the left and right
# to see how far the characters are same
for my $i (@FirstPositions) {
for my $j (@SecondPositions) {
my $LeftMatchPos = getMaxMatch($i, $j, -1);
my $RightMatchPos = getMaxMatch($i, $j, 1);
if ( ($RightMatchPos - $LeftMatchPos) > ($MaxRight - $MaxLeft) ) {
$MaxLeft = $LeftMatchPos;
$MaxRight = $RightMatchPos;
}
}
}
my $LongestSubString = substr($string1, $MaxLeft, $MaxRight-$MaxLeft);
print "Longest common substring is: $LongestSubString\n";
print "It begins at $MaxLeft and ends at $MaxRight in string1\n";
}
sub getMaxMatch
{
my $i= shift;
my $j= shift;
my $direction= shift;
my $k = ($direction >= 1 ? $SearchStrLen : 0);
my $FirstChar = substr($string1, $i+($k * $direction), 1);
my $SecondChar = substr($string2, $j+($k * $direction), 1);
for ( ; $FirstChar && $SecondChar; $k++ )
{
$FirstChar = substr($string1, $i+($k * $direction), 1);
$SecondChar = substr($string2, $j+($k * $direction), 1);
if ( $FirstChar ne $SecondChar ) {
$direction < 1 ? $k-- : "";
my $pos = ($k ? ($i + $k * $direction) : $i);
return $pos;
}
}
return $i;
}
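For comparison, here is the same expand-left-and-right idea as a compact Python sketch. It is not a translation of the Perl above; find_all and longest_common_with_key are hypothetical names chosen for this example.

```python
def find_all(text, key):
    """Return every start index at which key occurs in text."""
    positions = []
    start = text.find(key)
    while start != -1:
        positions.append(start)
        start = text.find(key, start + 1)
    return positions

def longest_common_with_key(s1, s2, key):
    """Longest substring shared by s1 and s2 that contains key ('' if none)."""
    best = ""
    for i in find_all(s1, key):
        for j in find_all(s2, key):
            # Extend to the left while the preceding characters agree.
            left = 0
            while (i - left > 0 and j - left > 0
                   and s1[i - left - 1] == s2[j - left - 1]):
                left += 1
            # Extend to the right, starting just past the keyword.
            end1, end2 = i + len(key), j + len(key)
            right = 0
            while (end1 + right < len(s1) and end2 + right < len(s2)
                   and s1[end1 + right] == s2[end2 + right]):
                right += 1
            candidate = s1[i - left:end1 + right]
            if len(candidate) > len(best):
                best = candidate
    return best

print(longest_common_with_key("peachbananaapplepear",
                              "juicenaapplewatermelonpear",
                              "apple"))
# naapple
```

When several candidates have the same length, this sketch keeps the first one it finds.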

The =~ operator is not going to reassign the value of $string2. Try this:
$string2 =~ m/(.+apple.+)/;
my $match = $1;
print $match

Based on the general algorithm, but it tracks not only the length of the current run (@l), but also whether the run includes the keyword (@k). Only runs that include the keyword are considered for the longest run.
use strict;
use warnings;
use feature qw( say );
sub find_substrs {
our $s; local *s = \shift;
our $key; local *key = \shift;
my @positions;
my $position = -1;
while (1) {
$position = index($s, $key, $position+1);
last if $position < 0;
push @positions, $position;
}
return @positions;
}
sub lcsubstr_which_include {
our $s1; local *s1 = \shift;
our $s2; local *s2 = \shift;
our $key; local *key = \shift;
my @key_starts1 = find_substrs($s1, $key)
or return;
my @key_starts2 = find_substrs($s2, $key)
or return;
my @is_key_start1; $is_key_start1[$_] = 1 for @key_starts1;
my @is_key_start2; $is_key_start2[$_] = 1 for @key_starts2;
my @s1 = split(//, $s1);
my @s2 = split(//, $s2);
my $length = 0;
my @rv;
my @l = ( 0 ) x ( @s1 + 1 ); # Last ele is read when $i1==0.
my @k = ( 0 ) x ( @s1 + 1 ); # Same.
for my $i2 (0..$#s2) {
for my $i1 (reverse 0..$#s1) {
if ($s1[$i1] eq $s2[$i2]) {
$l[$i1] = $l[$i1-1] + 1;
$k[$i1] = $k[$i1-1] || ( $is_key_start1[$i1] && $is_key_start2[$i2] );
if ($k[$i1]) {
if ($l[$i1] > $length) {
$length = $l[$i1];
@rv = [ $i1, $i2, $length ];
}
elsif ($l[$i1] == $length) {
push @rv, [ $i1, $i2, $length ];
}
}
} else {
$l[$i1] = 0;
$k[$i1] = 0;
}
}
}
for (@rv) {
$_->[0] -= $length - 1; # $i1 and $i2 index the last char of the run; convert to start
$_->[1] -= $length - 1;
}
return @rv;
}
{
my $s1 = "peachbananaapplepear";
my $s2 = "juicenaapplewatermelonpear";
my $key = "apple";
for (lcsubstr_which_include($s1, $s2, $key)) {
my ($s1_pos, $s2_pos, $length) = @$_;
say substr($s1, $s1_pos, $length);
}
}
This solution is O(NM), where N and M are the lengths of the two strings, so it scales well for what it does.

matched count for characters in two variables

I am trying to match two variables and get the count of characters that match.
For example:
$name1 = 'cat';
$name2 = 'dcat';
The result needs to be 3.
I tried this, but it always prints 1:
$count = 0;
$count++ while ($name1 =~ /$name2/g);
print "$count\n";

$count = 0;
$matchstr = "";
foreach (split(//,$name1)){
$matchstr .= $_;
$count++ if $name2 =~ /$matchstr/;
}
print "No of matched characters are: $count \n ";

Can one use regex to wildcard a set of whole words in Perl?

This is nuts, I mean pseudocode, but something like this:
/[January, February, March] \d*/
Should match things like January 13 or February 26, and so on...
WHAT I'M DOING:
my $url0 = 'http://www.registrar.ucla.edu/calendar/acadcal13.htm';
my $url1 = 'http://www.registrar.ucla.edu/calendar/acadcal14.htm';
my $url2 = 'http://www.registrar.ucla.edu/calendar/acadcal15.htm';
my $url3 = 'http://www.registrar.ucla.edu/calendar/acadcal16.htm';
my $url4 = 'http://www.registrar.ucla.edu/calendar/acadcal17.htm';
my $url5 = 'http://www.registrar.ucla.edu/calendar/sumcal.htm';
my $document0 = get($url0);
my $document1 = get($url1);
my $document2 = get($url2);
my $document3 = get($url3);
my $document4 = get($url4);
my $document5 = get($url5);
my @dates0 = ($document0 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates1 = ($document1 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates2 = ($document2 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates3 = ($document3 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates4 = ($document4 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates5 = ($document5 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
foreach(@dates0)
{
    print "$_\r\n";
}
foreach(@dates1)
{
    print "$_\r\n";
}
foreach(@dates2)
{
    print "$_\r\n";
}
foreach(@dates3)
{
    print "$_\r\n";
}
foreach(@dates4)
{
    print "$_\r\n";
}
foreach(@dates5)
{
    print "$_\r\n";
}
These print loops give the following result: http://pastebin.com/7z13gBqt
This is not good:
http://tinypic.com/r/nqpapx/8

Yes. You can use an alternation.
/(January|February|March|April|May|June|July|August|September|October|November|December) \d*/
Would do that.
If you already have them in an array, then you can change the variable $LIST_SEPARATOR to string them into an alternation, and then parenthesize the whole thing:
use English qw<$LIST_SEPARATOR>; # In line-noise: $"
my $date_regex
= do { local $LIST_SEPARATOR = '|';
qr/(?:@months) \d*/ # ?: if you don't want the capture
};
This gives you a compiled expression, which you can reuse like so:
my @dates;
while ( my $url = <DATA> ) {
chomp $url; # drop the trailing newline before fetching
my $document = get( $url );
push @dates, $document =~ /($date_regex)/g;
}
__DATA__
http://www.registrar.ucla.edu/calendar/acadcal13.htm
http://www.registrar.ucla.edu/calendar/acadcal14.htm
http://www.registrar.ucla.edu/calendar/acadcal15.htm
http://www.registrar.ucla.edu/calendar/acadcal16.htm
http://www.registrar.ucla.edu/calendar/acadcal17.htm
http://www.registrar.ucla.edu/calendar/sumcal.htm
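The same build-an-alternation-from-a-list idea, sketched in Python for comparison. MONTHS, DATE_RE and find_dates are names invented for this example, and the sample sentence is made up. The group around the month names is non-capturing on purpose: with a capturing group there, findall would return only the month names, which matches the list-context behaviour seen in the question's Perl.

```python
import re

# Assumption: English month names, spelled out in full.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Join the (escaped) names with '|' inside a non-capturing group,
# then require a space and at least one digit.
DATE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, MONTHS)) + r") \d+")

def find_dates(text):
    """Return every 'Month day' occurrence in text, in order."""
    return DATE_RE.findall(text)

print(find_dates("Instruction begins January 13 and ends March 26."))
# ['January 13', 'March 26']
```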

Character match count between strings in Perl

I have a string (say string1) that needs to be matched against another string (string2). Both strings will have the same length, and the comparison is case-insensitive.
I want to print the number of character matches between both the strings.
E.g.: String 1: stranger
String 2: strangem
Match count = 7
I tried this:
$string1 = "stranger";
$string2 = "strangem";
my $count = $string1 =~ m/string2/ig;
print "$count\n";
How can I fix this?

Exclusive or, then count the null characters (where the strings were the same):
my $string1 = "stranger";
my $string2 = "strangem";
my $count = ( lc $string1 ^ lc $string2 ) =~ tr/\0//;
print "$count\n";
(Edited: I initially missed the case-insensitivity requirement, hence the lc calls.)
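For readers more at home in Python, here is a sketch of the same two ideas: a direct position-by-position comparison, and the XOR trick where equal characters produce a zero byte. Both function names are made up for this example, and the XOR variant assumes ASCII-only input.

```python
def count_matching_positions(s1, s2):
    """Count positions where s1 and s2 hold the same character, ignoring case."""
    return sum(a == b for a, b in zip(s1.lower(), s2.lower()))

def count_matching_positions_xor(s1, s2):
    """Same count via XOR: identical bytes XOR to 0 (assumes ASCII input)."""
    x = s1.lower().encode("ascii")
    y = s2.lower().encode("ascii")
    return bytes(p ^ q for p, q in zip(x, y)).count(0)

print(count_matching_positions("stranger", "strangem"))      # 7
print(count_matching_positions_xor("stranger", "strangem"))  # 7
```

zip stops at the shorter string, so with inputs of unequal length only the overlapping part is compared.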

You can use substr for that:
#!/usr/bin/perl
use warnings;
use strict;
my $string1=lc('stranger');
my $string2=lc('strangem');
my $count=0;
for (0..length($string1)-1) {
$count++ if substr($string1,$_,1) eq substr($string2,$_,1);
}
print $count; #prints 7
Or you can use split to get all characters as an array, and loop:
#!/usr/bin/perl
use warnings;
use strict;
my $string1=lc('stranger');
my $string2=lc('strangem');
my $count=0;
my @chars1=split//,$string1;
my @chars2=split//,$string2;
for (0..$#chars1) {
$count++ if $chars1[$_] eq $chars2[$_];
}
print $count; #prints 7
(fc gives more accurate results than lc, but I went for backwards compatibility.)

Not tested
sub cm
{
    my ($x, $y) = @_;    # references to arrays of characters
    # First match prefix of string:
    my $n = 0;
    while ($n < @$x && $n < @$y && $x->[$n] eq $y->[$n]) {
        ++$n;
    }
    # Then skip one char on either side, and recurse.
    if ($n < @$x && $n < @$y) {
        # Copies of the unmatched tails, with and without their first char:
        my @x0 = @$x[$n .. $#$x];
        my @y0 = @$y[$n .. $#$y];
        my @x1 = @x0[1 .. $#x0];
        my @y1 = @y0[1 .. $#y0];
        # Match rest by skipping one place:
        my $n2best = cm(\@x0, \@y1);
        my $n2b = cm(\@x1, \@y0);
        $n2best = $n2b if $n2b > $n2best;
        my $n2c = cm(\@x1, \@y1);
        $n2best = $n2c if $n2c > $n2best;
        $n += $n2best;
    }
    return $n;
}
sub count_matches
{
    my ($s, $t) = @_;
    my @s_chars = split //, $s;
    my @t_chars = split //, $t;
    # Pass references; plain arrays would be flattened into one list.
    return cm(\@s_chars, \@t_chars);
}
print count_matches('stranger', 'strangem'), "\n";

Regex to Parse Mail Subject with multiple encoding

Hi there!
I want to match all the inline encodings in a mail subject and build the subject string in UTF-8.
Some Examples:
[Listname | Topic123] =?utf-8?Q?encodedtext?=
=?iso-8859-1?q?this=20is=20some=20text?=
Klartext-Betreff
[Listname | Topic123] =?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
I also got a mail with two different encodings (see the last example above).
In e-mails it is also possible for the subject to be split across multiple lines, where each line (except the first) starts with at least one whitespace character.
So I am looking for a regex, that parses:
Part+
Where Part is one of:
Text with spaces
=?charset?encoding?encoded-text?=
I think it will be something like:
ENC = (=\?)([A-Za-z0-9-]*)(\?)([A-Za-z0-9-]*)(\?)([Any Character])(\?=)
Part = ENC, or any text that doesn't match ENC

function decode ($string, $source_enc, $dest_enc)
{
$parts = preg_split (
'/=\?([^?]+)\?([^?]+)\?([^?]+)\?=/',
$string,
-1, PREG_SPLIT_DELIM_CAPTURE);
$result = "";
for ($i = 0; $i < count ($parts); $i++)
{
$part = $parts [$i];
if ($i % 4 == 0)
$result .= iconv ($source_enc, $dest_enc, $part);
else
{
$charset = $parts [$i++];
$encoding = $parts [$i++];
$text = $parts [$i];
if ($encoding == 'Q' || $encoding == 'q')
$text = quoted_printable_decode ($text);
else if ($encoding == 'B' || $encoding == 'b')
$text = base64_decode ($text);
$result .= iconv ($charset, $dest_enc, $text);
}
}
return $result;
}
echo (decode ("=?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=",
"ISO-8859-1", "ISO-8859-1"));
Output for me is:
encodedtext this is some text If you can read this yo u understand the example.
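If Python is an option, its standard library already parses these encoded words, so no hand-written regex is needed. A minimal sketch: decode_subject is a name made up for this example, and how whitespace between adjacent encoded words is treated is left to the library.

```python
from email.header import decode_header, make_header

def decode_subject(raw_subject):
    """Decode RFC 2047 encoded words in a Subject value into one str."""
    # decode_header splits the value into (data, charset) chunks;
    # make_header reassembles them, and str() yields the Unicode text.
    return str(make_header(decode_header(raw_subject)))

print(decode_subject("=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?="))
# If you can read this yo
print(decode_subject("=?iso-8859-1?q?this=20is=20some=20text?="))
# this is some text
print(decode_subject("Klartext-Betreff"))
# Klartext-Betreff
```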
