Improve Performance of Last Occurrence Match in Perl Regex

I need to find the last occurrence of a match from a list of acceptable values. Below is the Perl source code. The answer is Q because it is the last occurrence among the acceptable values A, Q, I and J.
The challenge is how to change my code to make the regex faster. It is currently a bottleneck because I have to run it millions of times.
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/(A|Q|I|J)/;
my @matches = $input =~ m/\b$regex\b/g;
print $matches[$#matches];
I would like to see new code that improves the speed but still finds the Q match.

You can find the last match by simply adding a .* before the matching pattern.
Like this
my $input = "APPLE B C D E F G H INDIGO JACKAL K L M N O P QUIVER R S T U V W X Y Z";
my $regex = qr/APPLE|QUIVER|INDIGO|JACKAL/;
my ($last) = $input =~ /.*\b($regex)\b/;
print $last, "\n";
output
QUIVER

Use \K to drop the previously matched characters from the final match, so only the last occurrence is printed.
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/.*\K\b[AQIJ]\b/;
if ($input =~ m/$regex/) {
print $&."\n";
}
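Or use a capturing group, which also avoids $& (on Perl versions before 5.20, using $& anywhere in a program slows down every regex match):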
my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";
my $regex = qr/.*\b([AQIJ])\b/;
if ($input =~ m/$regex/) {
print $1."\n";
}
Update:
my $input = "Apple Orange Mango Apple";
my $regex = qr/.*\K\b(?:Apple|Range|Mango)\b/;
if ($input =~ m/$regex/) {
print $&."\n";
}
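Since the question is about speed, it is worth measuring rather than guessing. The following is only a sketch that compares the original "collect every match" approach with the greedy .* approach using the core Benchmark module; the sub names last_by_list and last_by_greedy are made up for this example, and the numbers will depend on your Perl version and your real input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z";

# Original approach: collect every match, then keep the last one.
sub last_by_list {
    my @matches = $input =~ /\b(A|Q|I|J)\b/g;
    return $matches[-1];
}

# Greedy prefix: .* runs to the end of the string, then backtracks
# until the capture group matches, which yields the last occurrence.
sub last_by_greedy {
    my ($last) = $input =~ /.*\b([AQIJ])\b/;
    return $last;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { list => \&last_by_list, greedy => \&last_by_greedy });
```

Note that .* does not cross newlines unless you add the /s modifier, so this assumes single-line input.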

Related

egrep the line ended with

$ cat file
c f t e, u y r s p I y
p A w p d. R i
G e w o a l n o v s.
P G e a o c f s p
k e i c w a p p e.
$ od -c file
0000000 c f t e , u y r s
0000020 p I y \r \n p A w p
0000040 d . R i \r \n G e w o
0000060 a l n o v s . \r \n P
0000100 G e a o c f s p
0000120 \r \n k e i c w a p
0000140 p e . \r \n
0000146
I tried to use egrep to find all lines that end with a dot (.), but I was not able to.
For example:
$ egrep '.*\.' file
p A w p d. R i
G e w o a l n o v s.
k e i c w a p p e.
It did not give me the correct output.
I also tried using $ to anchor the dot, \r, and \n, but none of them worked.
Any suggestions will help.
You should begin by converting your file to Unix format:
dos2unix file
Then you can simply use this command:
egrep "[.]$" file
Your file is in DOS format (with carriage return / line feed endings). Either convert it to unix format first and use
egrep '\.$'
or leave the file unchanged and search for a literal carriage return
egrep $'\\.\r$'
(using bash trickery because grep doesn't understand \r).
egrep '.*\.' just finds all lines that contain a . anywhere.
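If you would rather not depend on the line endings at all, the same check can be written in Perl with an optional carriage return before the end of the line. This is only a sketch; ends_with_dot is a made-up helper name and the sample lines are copied from the file above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line ends with a dot,
# whether the line terminator is CRLF, LF, or missing.
sub ends_with_dot {
    my ($line) = @_;
    return $line =~ /\.\r?\n?\z/ ? 1 : 0;
}

my @lines = (
    "p A w p d. R i\r\n",
    "G e w o a l n o v s.\r\n",
    "k e i c w a p p e.\r\n",
);

# Keep only the lines whose last visible character is a dot.
my @hits = grep { ends_with_dot($_) } @lines;
print @hits;
```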

Best way to reverse the mathematical operator

I have some string which contain
$str = a > b;
$str1 = c < d;
$str2 = e = f;
What is the best way to reverse the operator, for example so that > becomes <, < becomes >, and = becomes !=? The only way I know is to match a and b and then use join to put them back together with the opposite operator. Is there a better way?
Assuming your strings are really like this:
$str = 'a > b';
$str1 = 'c < d';
$str2 = 'e = f';
For one-character operators you can use the transliteration operator tr///, which is cheaper than regular expressions.
$str =~ tr/<>+\-/><\-+/;  # escape the hyphen so it is not read as a range
For turning = into !=, you will need to fall back to a substitution using s///.
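A minimal sketch of that substitution, assuming you also want != to turn back into = and want to leave <= and >= alone. The helper name negate_equality is made up for this example. Both directions are handled in a single pass so that a freshly written != is not flipped back again.

```perl
use strict;
use warnings;

# Hypothetical helper: swap "=" and "!=" in one pass.
sub negate_equality {
    my ($expr) = @_;
    # Try "!=" first. The lookbehind and lookahead make sure a bare "="
    # is not part of "<=", ">=", "!=" or "==".
    $expr =~ s/(!=|(?<![<>=!])=(?!=))/$1 eq '!=' ? '=' : '!='/ge;
    return $expr;
}

print negate_equality('e = f'),  "\n";   # e != f
print negate_equality('g != h'), "\n";   # g = h
```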
You can use a lookup hash for each negation, and turn the lookup keys into a pattern that you can use for the lookup.
use strict;
use warnings;
my %op = (
'<' => '>',
'>' => '<',
'=' => '!=',
'!=' => '=',
);
my $pattern = sprintf '(%s)', join '|', map quotemeta, keys %op;
my @strings = ('a > b', 'a < b', 'e = f', 'g != h');
foreach my $str (@strings) {
print $str;
$str =~ s/$pattern/$op{$1}/e;
print " --> $str\n";
}
__END__
a > b --> a < b
a < b --> a > b
e = f --> e != f
g != h --> g = h
The quotemeta makes sure there are no regular expression meta chars in the pattern. That's not important for the example chars I showed, but it might be in your full use case.
This might be a good time to read perlre.

Solve a puzzle using bash tools such as grep

I need to solve a puzzle using shell script. I tried to combine grep with rev and saved the output into a temporary text file but still don't know how to solve it entirely.
That's the puzzle to solve :
j s e t f l
a l s f e l
g a a n p l
e p f d p k
r e g e l a
f n e t e n
The file that contains the wordlist to use is in http://pastebin.com/DP4mFZAr
I know how to tell grep to take fixed-string patterns from a text file with $ grep -Ff wordlist puzzle, and how to search for mirrored words with $ rev puzzle | grep -Ff wordlist, which deals with the horizontal lines. But how do I deal with vertical words too?
I am covering horizontal and vertical matching. The main idea is to remove the spaces and then use grep -f with the given list of words, stored in words file.
With grep -f, the results are shown within the line. If you just want to see the matched text, use grep -of.
Horizontal matching
$ cat puzzle | tr -d ' ' | grep -f words
alsfel
gaanpl
regela
fneten
$ cat puzzle | tr -d ' ' | grep -of words
als
gaan
regel
eten
Vertical matching
For this, we firstly have to transpose the content of the file. For this, I use what I used for another answer of mine:
transpose () {
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j<NR?OFS:ORS)
}
}'
}
And let's see:
$ cat puzzle | transpose | tr -d ' ' | grep -f words
jagerf
slapen
esafge
tfndet
lllkan
$ cat puzzle | transpose | tr -d ' ' | grep -of words
jager
slapen
af
ge
de
kan
You can then use rev (as you suggest in your question) for mirrored words. Also tac can be interesting for vertically mirrored words.
Diagonal matching
For the diagonal matching, I think that an interesting approach would be to move every single line a little bit to the left/right. This way,
e x x x x
x g x x x
x x g x x
can become
e x x x x
g x x x
g x x
and you can use the vertical/horizontal approaches.
For this, you can use printf as described in Using variables in printf format:
$ cat a
e x x x x
x g x x x
x x g x x
$ awk -v c=20 '{printf "%*s\n", c, $0; c-=2}' a
           e x x x x
         x g x x x
       x x g x x

How do I match across newlines in a perl regex?

I'm trying to work out how to match across newlines with perl (from the shell). With the following:
(echo a b c d e; echo f g h i j; echo l m n o p) | perl -pe 's/(c.*)/[$1]/'
I get this:
a b [c d e]
f g h i j
l m n o p
Which is what I expect. But when I place an /s at the end of my regex, I get this:
a b [c d e
]f g h i j
l m n o p
What I expect and want it to print is this:
a b [c d e
f g h i j
l m n o p
]
Is the problem with my regex somehow, or my perl invocation flags?
-p loops over input line-by-line, where "lines" are separated by $/, the input record separator, which is a newline by default. If you want to slurp all of STDIN into $_ for matching, use -0777.
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -pe 's/(c.*)/[$1]/s'
a b [c d e
]f g h i j
l m n o p
$ echo "a b c d e\nf g h i j\nl m n o p" | perl -0777pe 's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]
See Command Switches in perlrun for information on both those flags. -l (dash-ell) will also be useful.
The problem is that your one-liner works one line at a time; your regex is fine:
use strict;
use warnings;
use 5.014;
my $s = qq|a b c d e
f g h i j
l m n o p|;
$s =~ s/(c.*)/[$1]/s;
say $s;
There's More Than One Way To Do It: since you're reading "the entire file at a time" anyway, I'd personally drop the -p modifier, slurp the entire input explicitly, and go from there:
echo -e "a b c d e\nf g h i j\nl m n o p" | perl -e '$/ = undef; $_ = <>; s/(c.*)/[$1]/s; print;'
This solution does have more characters, but may be a bit more understandable for other readers (which will be you in three months time ;-D )
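Actually your one-liner looks like this:
while (<>) {
$_ =~ s/(c.*)/[$1]/s;
print;  # -p prints $_ after each iteration
}
This means the regex sees only one line at a time, so the match can never extend past the end of the line that contains c.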
You're reading a line at a time, so how do you think it can possibly match something that spans more than one line?
Add -0777 to redefine "line" to "file" (and don't forget to add /s to make . match newlines).
$ (echo a b c d e; echo f g h i j; echo l m n o p) | perl -0777pe's/(c.*)/[$1]/s'
a b [c d e
f g h i j
l m n o p
]

How can a Perl regex re-use part of the previous match for the next match?

I need some Perl regular expression help. The following snippet of code:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results = ();
1 while $str =~ s/(.{2}\b$word\b.{2})/push(@results,"$1\n")/e;
print @results;
Produces the following output:
A plus B
D plus E
2 plus F
H plus I
4 plus J
5 plus K
What I want to see is this, where a character already matched can appear in a new match in a different context:
A plus B
D plus E
E plus F
H plus I
I plus J
J plus K
How do I change the regular expression to get this result? Thanks --- Dan
General advice: Don't use s/// when you want m//. Be specific in what you match.
The answer is pos:
#!/usr/bin/perl -l
use strict;
use warnings;
my $str = 'In this example, ' . 'A plus B equals C, ' .
'D plus E plus F equals G ' .
'and H plus I plus J plus K equals L';
my $word = "plus";
my @results;
while ( $str =~ /([A-Z] $word [A-Z])/g ) {
push @results, $1;
pos($str) -= 1;
}
print "'$_'" for @results;
Output:
C:\Temp> b
'A plus B'
'D plus E'
'E plus F'
'H plus I'
'I plus J'
'J plus K'
You can use a m//g instead of s/// and assign to the pos function to rewind the match location before the second term:
use strict;
use warnings;
my $str = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my @results;
while ($str =~ /(.{2}\b$word\b(.{2}))/g) {
push @results, "$1\n";
pos($str) -= length $2;
}
print @results;
Another option is to use a lookahead:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E "
. "plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my $chars = 2;
my @results = ();
push @results, $1
while $str =~ /(?=((.{0,$chars}?\b$word\b).{0,$chars}))\2/g;
print "'$_'\n" for @results;
Within the lookahead, capturing group 1 matches the word along with a variable number of leading and trailing context characters, up to whatever maximum you've set. When the lookahead finishes, the backreference \2 matches "for real" whatever was captured by group 2, which is the same as group 1 except that it stops at the end of the word. That sets pos where you want it, without requiring you to calculate how many characters you actually matched after the word.
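If the context on both sides is always a single capital letter, as in the sample string, a simpler variant is to put the whole pattern inside the lookahead and consume nothing at all. This is only a sketch under that assumption; the sample string is shortened from the one in the question.

```perl
use strict;
use warnings;

my $str  = 'D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';

# The lookahead is zero-width, so each match consumes no characters.
# After an empty match Perl moves on by one character, which lets the
# right-hand letter of one match become the left-hand letter of the next.
my @results;
while ($str =~ /(?=\b([A-Z] \Q$word\E [A-Z])\b)/g) {
    push @results, $1;
}

print "'$_'\n" for @results;
```

Because nothing is consumed, there is no need to adjust pos by hand.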
Given the "Full Disclosure" comment (but assuming .{0,35}, not .{35}), I'd do
use List::Util qw/max min/;
my $context = 35;
while ( $str =~ /\b$word\b/g ) {
my $pre = substr( $str, max(0, $-[0] - $context), min( $-[0], $context ) );
my $post = substr( $str, $+[0], $context );
my $match = substr( $str, $-[0], $+[0] - $-[0] );
$pre =~ s/.*\n//s;
$post =~ s/\n.*//s;
push @results, "$pre$match$post";
}
print for @results;
You'd skip the substitutions if you really meant (?s:.{0,35}).
Here's one way to do it:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results = ();
my $i = 0;
while (substr($str, $i) =~ /(.{2}\b$word\b.{2})/) {
push @results, "$1\n";
$i += $-[0] + 1;
}
print @results;
It's not terribly Perl-ish, but it works and it doesn't use too many obscure regular expression tricks. However, you might have to look up the function of the special variable @- in perlvar.
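As a quick illustration of what @- and its companion @+ hold after a successful match, here is a small sketch. The variable names $start and $end are only for this example.

```perl
use strict;
use warnings;

my $str = 'A plus B';
my ($start, $end);

if ($str =~ /plus/) {
    # $-[0] is the offset where the whole match begins,
    # $+[0] is the offset just past where it ends.
    ($start, $end) = ($-[0], $+[0]);
    print "start=$start end=$end\n";                   # start=2 end=6
    print substr($str, $start, $end - $start), "\n";   # plus
}
```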
You don't have to use a regex to capture the context. Basically, split the string into words, loop over the items, check for "plus", and then take the words before and after it.
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my @s = split /\s+/, $str;
# Stop one short of each end so $i-1 and $i+1 always exist.
for my $i (1 .. $#s - 1) {
if ($s[$i] eq "plus") {
print "$s[$i-1] plus $s[$i+1]\n";
}
}