Parsing input to get specific values - regex

I have input like this:
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone";
It appears as a continuous line, there are no line breaks. I need the
largest value out of the values between [ and the first occurrence of
|. In this case, for example, the largest value is 204. Once
that is obtained, I want to print the contents of that element
between []. In this case, it would be "204|0|{A=9,B=201,C=61,D=11}|Calculator".
I've tried something like this, but it is not going anywhere:
my #array1;
my $data = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=1
+7}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,
+D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C
+=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}
+|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,
+D=14}|phone";
my $high = 0;
my #values = split(/\[([^\]]+)\]/,$data) ;
print "Values is #values \n";
foreach (#values) {
# I want the value that preceeds the first occurence of | in each array
# element, i.e. 0,0,196,204, etc.
my ($conf,$rest)= split(/\|/,$_);
print "Conf is $conf \n";
print "Rest is $rest \n";
push(#array1, $conf);
push (#array2, $rest);
print "Array 1 is #array1 \n";
print "Array 2 is #array2 \n";
}
$conf = highest(#array1);
my $i=0;
# I want the index value of the element that contains the highest conf value,
# in this case 204.
for (#myarray1) { last if $conf eq $_; $i++; };
print "$conf=$i\n";
# I want to print the rest of the string that was split in the same index
# position.
$rest = #array2[$i];
print "Rest is $rest \n";
# To get the highest conf value
sub highest {
my #data = #_;
my $high = 0;
for(#data) {
$high = $_ if $_ > $high;
}
$high;
}
Maybe I should be using a different approach. Could someone help me, please?

One way of doing it:
#!/usr/bin/perl
use strict;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]";
my #parts = split(/\]/, $s);
my $max = 0;
my $data = "";
foreach my $part (#parts) {
if ($part =~ /\[(\d+)/) {
if ($1 > $max) {
$max = $1;
$data = substr($part, 1);
}
}
}
print $data."\n";
A couple of notes:
you can split your original string by \], so you get parts like [0|0|{A=145,B=2,C=12,D=18}|!
then you parse each part to get the integer after the initial [
the rest it's easy: keep track of the biggest integer and of the corresponding part, and output it at the end.

In shell script:
#!/bin/bash
MAXVAL=$(cat /tmp/data | tr [ "\\n" | cut -d"|" -f1 | sort -n | tail -1)
cat /tmp/data | tr [] "\\n" | grep ^$MAXVAL
The first line cuts your big mass of data into lines, extracts just the first field, sorts it and takes the max. The second line cuts the data into lines again and greps for that max val.
If you have a LOT of data, this could be slow, so you could put the "lined" data into a temp file or something.

split() is the Right Tool when you know what you want to throw away. Capturing or m//g is the Right Tool when you know what you want to keep. (paraphrased from a Randal Schwartz quote).
You want to specify what to keep (between square brackets) rather than what to throw away (nothing!).
Luckily, your data is "hash shaped" (ie. alternating keys and values), so load it into a hash, sort the keys, and output the value for the highest key:
my %data = $data =~ /\[
(\d+) # digits are the keys
([^]]+) # rest are the values
\]/gx;
my($highest) = sort {$b <=> $a} keys %data; # inefficent if $data is big
print $highest, $data{$highest}, "\n";

Another way of doing this :
#!/usr/bin/perl
use strict;
my $str = '[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone';
my $maxval = 0;
my $pattern;
while ( $str =~ /(\[(\d+)\|.+?\])/g)
{
if ( $maxval < $2 ) {
$maxval = $2;
$pattern = $1;
}
}
print "Maximum value = $maxval and the associate pattern = $pattern \n";
# In this example $maxvalue = 204
# and $pattern = [204|0|{A=9,B=201,C=61,D=11}|Calculator]

Related

Preventing "foo" from matching "foo-bar" with grep -w

I am using grep inside my Perl script and I am trying to grep the exact keyword that I am giving. The problem is that "-w" doesn't recognize the "-" symbol as a separator.
example:
Let's say that I have these two records:
A1BG 0.0767377011073753
A1BG-AS1 0.233775553296782
if I give
grep -w "A1BG"
it returns both of them but I want only the exact one.
Any suggestions?
Many thanks in advance.
PS.
Here is my whole code.
The input file is a two-columns tab separated. So, I want to keep a unique value for each gene. In cases that I have more than one record, I calculate the average.
#!/usr/bin/perl
use strict;
use warnings;
#Find the average fc between common genes
sub avg {
my $total;
$total += $_ foreach #_;
return $total / #_;
}
my #mykeys = `cat G13_T.txt| awk '{print \$1}'| sort -u`;
foreach (#mykeys)
{
my #TSS = ();
my $op1 = 0;
my $key = $_;
chomp($key);
#print "$key\n";
my $command = "cat G13_T.txt|grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
#my $command = "cat Unique_Genes/G13_T.txt|grep -w $key";
my #belongs= `$command`;
chomp(#belongs);
my $count = scalar(#belongs);
if ($count == 1) {
print "$belongs[0]\n";
}
else {
for (my $i = 0; $i < $count; $i++) {
my #token = split('\t', $belongs[$i]);
my $lfc = $token[1];
push (#TSS, $lfc);
}
$op1 = avg(#TSS);
print $key ."\t". $op1. "\n";
}
}
If I got clarifications in comments right, the objective is to find the average of values (second column) for unique names in the first column. Then there is no need for external tools.
Read the file line by line and add up values for each name. The name uniqueness is granted by using a hash, with names being keys. Along with this also track their counts
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 filename\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my %results;
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
$results{$name}{value} += $value;
++$results{$name}{count};
}
foreach my $name (sort keys %results) {
$results{$name}{value} /= $results{$name}{count}
if $results{$name}{count} > 1;
say "$name => $results{$name}{value}";
}
After the file is processed each accumulated value is divided by its count and overwritten by that, so by its average (/= divides and assigns), if count > 1 (as a small measure of efficiency).
If there is any use in knowing all values that were found for each name, then store them in an arrayref for each key instead of adding them
while (<$fh>) {
#my ($name, $value) = split /\t/;
my ($name, $value) = split /\s+/; # used for easier testing
push #{$results{$name}}, $value;
}
where now we don't need the count as it is given by the number of elements in the array(ref)
use List::Util qw(sum);
foreach my $name (sort keys %results) {
say "$name => ", sum(#{$results{$name}}) / #{$results{$name}};
}
Note that a hash built this way needs memory comparable to the file size (or may even exceed it), since all values are stored.
This was tested using the shown two lines of sample data, repeated and changed in a file. The code does not test the input in any way, but expects the second field to always be a number.
Notice that there is no reason to ever step out of our program and use external commands.
You may use a POSIX ERE regex with grep like this:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
To return matches (not matching lines) only:
grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file
Details
([[:space:]]|^) - Group 1: a whitespace or start of line
A1BG - a substring
([[:space:]]|$) - Group 2: a whitespace or end of line

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Perl Grepping from an Array

I need to grep a value from an array.
For example i have a values
#a=('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl');
#Array = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl','branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
Now, i need to loop #a and find each value matches to #Array. For Example
It works for me with grep. You'd do it the exact same way as in the More::ListUtils example below, except for having grep instead of any. You can also shorten it to
my $got_it = grep { /$str/ } #paths;
my #matches = grep { /$str/ } #paths;
This by default tests with /m against $_, each element of the list in turn. The $str and #paths are the same as below.
You can use the module More::ListUtils as well. Its function any returns true/false depending on whether the condition in the block is satisfied for any element in the list, ie. whether there was a match in this case.
use warnings;
use strict;
use Most::ListUtils;
my $str = 'branches/Soft/a.txt';
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
my $got_match = any { $_ =~ m/$str/ } #paths;
With the list above, containing the $str, the $got_match is 1.
Or you can roll it by hand and catch the match as well
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
This does print out the match.
Note that the strings you show in your example do not contain the one to match. I added it to my list for a test. Without it in the list no match is found in either of the examples.
To test for more than one string, with the added sample
my #strings = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl',
'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
foreach my $str (#strings) {
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
# Or, instead of the foreach loop above use
# my $match = grep { /$str/ } #paths;
# print "Matched for $str\n" if $match;
}
This prints
Found it: branches/Soft/a.txt
Found it: branches/Soft/h.cpp
Found it: branches/Main/utils.pl
When the lines with grep are uncommented and foreach ones commented out I get the corresponding prints for the same strings.
The slashes dot in $a will pose a problem so you either have to escape them it when doing regex match or use a simple eq to find the matches:
Regex match with $a escaped:
my #matches = grep { /\Q$a\E/ } #array;
Simple comparison with "equals":
my #matches = grep { $_ eq $a } #array;
With your sample data both will give an empty array #matches because there is no match.
This Solved My Question. Thanks to all especially #zdim for the valuable time and support
my #SVNFILES = ('branches/Soft/a.txt', 'branches/Soft/b.txt');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
foreach my $svn (#SVNFILES)
{
chomp ($svn);
my $m = grep { /$svn/ } (#paths);
if ( $m eq '0' ) {
print "Files Mismatch\n";
exit 1;
}
}
You should escape characters like '/' and '.' in any regex when you need it as a character.
Likewise :
$a="branches\/Soft\/a\.txt"
Retry whatever you did with either grep or perl with that. If it still doesn't work, tell us precisely what you tried.

Creating a hash with keys and values selected from between points in a string

I get the following result from my code below.
For example, with $seq set to aaaGACGTCaaaGAATTCaaaGACGTCaaa:
RE: AatII
GACGTC found at 4-9
GACGTC found at 22-27
RE: EcoRI
GACGTC found at 4-9
GACGTC found at 22-27
GAATTC found at 13-18
While this is pretty close to what I want to achieve, what I really want to do is use a list of "reference enzymes" - which I store as hash keys (in the example below AatII and EcoRI in %RE) - in order to find the best cut site in the $seq sequence string and the store the results in a data-structure such as a hash.
The cut site selection is done according to an associated "enzyme string" value for each reference enzyme key in the hash. In the code below the key AatII is set to value GACGTC 5; it will cut GACGTC after the fifth position: (GACGT|C) ; whereas EcoRI => GAATTC 1 splits the sequence GAATTC after the first position (G|AATTC) . So, for each enzyme key in my hash I find a site in the sequence $seq based on that key's associated string and a "cut site" from the number associated with that string in the key's hash value. The number refers to the position of the cut after position "1" of the enzyme string.
For the sequence $seq the results would be: (aaaGACGT)(CaaaG)(AATTCaaaGACGT)(Caaa) (here brackets are used to show cut points). The positions of the cut sites in the string would be as follows:
0------8 9---13 14----------26 27--30
This is based on a run of the script using both EcoRI and AatII to select enzyme sequences and cut the full sequence leaving: aaaGACGT CaaaG AATTCaaaGACGT Caaa
I would like my script to store results of each run in a hash with keys/values such as {0=>8, 9=>13, 14=>26, 27=>30}. By using sort on my keys and values after each iteration; then using a binary search to find the closest previous cutsite and adding "1" to be the value of $end in the current iteration there should be as many entries in the hash as there are cutsites.
I do not know if this is possible. If it is, can someone point me in the right direction as to how to Can anyone help me transform my code in order to approach this problem.
#!/usr/bin/perl
use warnings;
use strict;
my %RE =( 'AatII' => 'GACGTC 5', 'EcoRI' => 'GAATTC 1' );
my $input='';
my #matches =();
my #enz = keys %RE;
my #value = values %RE;
print "Seq:";
my $seq = <STDIN>;
chomp $seq;
print "OK \n";
while ($input ne 'quit') {
print "RE:";
$input = <STDIN>;
chomp $input;
foreach (#enz) {
if ($input =~ /$_/) {
#print "Key:", $_," Value:", $RE{$_};
my #seqval = $seq;
my $val = $RE{$_};
my $real = substr($val, 0, -2);
#my $cut = substr($val, 0, (length($val)-3));
my $cut = chop $val;
my $length = length ($real);
my $mew = substr ($real, 0, $cut);
my $two = substr ($real, -1, ($length-$cut));
#my $push = push #valval;
#chomp %RE{$_};
while ($seq =~ /($real)/g) {
my $match = $1;
#print "$match", "\n";
my $length = length($&);
#print "$length", "\n";
my $pos = length($`);
#print "$pos", "\n";
my $start = $pos + 1;
#print "$start", "\n";
my $end = $pos + $length;
#print "$end", "\n";
my $hitpos = "$start-$end";
#print "$hitpos", "\n";
push #matches, "$match found at $hitpos ";
#print "\tfound:", "\n","\n";
#print "\t\t\t$1$mew", "\n";
#print "\t\t\t$two$3", "\n";
#print "next restriction enzyme:","\n";
} print "$_\n" foreach #matches;
}
}
}
Hi Khuram and welcome to Stackoverflow :-)
It seems you may have dropped your question but I'm adding this answer to make it more complete and potentially useful to others who find it. As #mappec suggested, you should consult the Bioperl website where you may find more resources.
While there may be simpler ways of doing this, I like your idea of creating a hash to store the cut sites is a good one because it leverages one of the powers of perl: the ability to create arbitrarily complex data structures on the fly. That said, it can sometimes be complicated to get your data back out! :-)
As #user1937198 notes, hashes are unordered, so if you want your output to preserve the order/positions of the enzyme strings in your sequence you'll not only have to sort your hash by its keys, you'll have create sortable keys to start with. In your question your sample output shows found at 4-9, ... 22-27, and ... 13-18 out of order because you don't have a datastructure you have sorted. Fixing that part is not too hard. To prove it, here's your script with some of the print statements removed and with the $seq sequence string processed into a HoH(hash of hashes) called %cuttings that is sorted by its keys (but remember, the order is not preserved):
#!/usr/bin/perl
use warnings;
use strict;
my %RE =( 'AatII' => 'GACGTC 5', 'EcoRI' => 'GAATTC 1' );
my %cuttings = ();
my $input='';
my #enz = keys %RE;
print "Seq:";
my $seq = "aaaGACGTCaaaGAATTCaaaGACGTCaaa";
chomp $seq;
print "OK \n";
while ($input ne 'quit') {
print "RE:";
$input = <STDIN>;
chomp $input;
foreach (#enz) {
if ($input =~ /$_/) {
my #seqval = $seq;
my $val = $RE{$_};
my $real = substr($val, 0, -2);
my $cut = chop $val;
my $cutsite = 0 ;
my $length = length ($real);
my $mew = substr ($real, 0, $cut);
my $two = substr ($real, -1, ($length-$cut));
while ($seq =~ /($real)/g) {
my $match = $1;
my $length = length($&);
my $pos = length($`); #`fix SO syntax highlighting :)
my $start = $pos + 1;
my $end = $pos + $length;
my $hitpos = "$start..$end";
my $cutsite = $end ;
${$cuttings{ $cutsite }}{ $input } = "$match at $hitpos ";
}
}
}
foreach my $cutsite (sort { $a <=> $b} keys %cuttings) {
print " $cuttings{$cutsite}{$_}\n" for ( keys %{ $cuttings{$cutsite} } );
}
}
The output would be:
$ ~/tmp/ perl biogenetic.pl
Seq:OK
RE:EcoRI
GAATTC found at 13..18
RE:AatII
GACGTC found at 4..9
GAATTC found at 13..18
GACGTC found at 22..27
RE:quit
The AatII enzyme cut sites are sorted correctly "around" the first EcoRI reference enzyme. If you want to see what the has looks like as you go along you could use Data::Dumper or Data::Printer (also known as DDP) to dump the hash when the program exits in an END block:
END {
use DDP;
p %cuttings ;
}
That would show the following:
{
9 {
AatII "GACGTC found at 4..9 "
},
18 {
EcoRI "GAATTC found at 13..18 "
},
27 {
AatII "GACGTC found at 22..27 "
}
}
NB: I've just reused your code to do this so you were most of the way there as it was. I'm not a geneticist so there may still be issues if enzyme strings do things like overlap (do they do that?). There are a lot of variable names to keep track of in your code and there's probably a way to refactor things to be bit simpler or more elegant - which I leave as an exercise for you and other contributors :-) If you use perl frequently you get good at it very quickly.
HTH. Good luck with your project.

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
#str = $string =~ /(\w+)\1+/i;
print "#str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need the String::LCSS_XS which calculates Longest Common SubStrings. Don't try it's Perl-only twin brother String::LCSS because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my #words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(#words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = #words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
my $var = "home test results results-apr-25 results-apr-251.csv";
my #str = split " ", $var;
my %h;
my $last = pop #str;
while (my $curr = pop #str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my #str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself—this is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from #str:
my $longest = (sort { length $b <=> length $a } #str)[0];
(Do note that this is a legible but far from being the most efficient way of finding the longest value, but this is the subject of a different question.)
How about:
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25