Escaping regex pattern in string while using regex on the same - regex

My script is required to insert pattern (?:<\/?[a-z\-\=\"\ ]+>)?in words after each letter which can be used in another regular expression.
Problem is that is some words their may be regex pattern like .*? or (?:<[a-z\-]+>). I tried it but error thows unmatched regex where my pattern adds after ( or space created in regex causing this problem. Any help.
Here is the code I tried:
sub process_info{
my $process_mod = shift;
#print "$process_mod\n";
#b = split('',$process_mod);
my $flag;
for my $i(#b){
#print "######## flag: $flag test: $i\n";
$i = "$i".'(?:<\/?[a-z\-\=\"\ ]>)?' if $flag == 0 and $i !~ /\\|\(|\)|\:|\?|\[|\]/;
#print "$i";
if ($i =~ /\\|\(|\)|\:|\?|\[|\]/){
$flag = 1;
}
else{
$flag = 0;
}
#print "After: $i\n";
}
$process_mod = join('',#b);
#print "$process_mod\n";
return $process_mod;
}

You want to search for a certain plaintext in an XML file. You try to do this by inserting a regex for an XML tag between each character. This is wasteful, but it can be easily done by escaping all metacharacters in the input with the quotemeta function:
sub make_XML_matchable {
my $string = #_;
my $xml_tag = qr{ ... }; # I won't write that regex for you
my $combined = join $xml_tag, map quotemeta, split //, $string;
return qr/$combined/; # return a compiled regex
}
This assumes that you'd want to write a regex that can match XML tags – not impossible, but tedious and difficult to do correctly. Use an XML parser instead to strip all tags from a section:
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => $xml)
my $text_content = $dom->textContent; # all tags are gone
Or if you're actually trying to match HTML, then you might want to use Mojolicious:
use Mojo;
my $dom = Mojo::DOM->new($html);
my $text_content = $dom->all_text; # all tags are replaced by a space

At the begining of the foreach loop, use this:
for my $i(#b){
$i = quotemeta $i;
$i .= '(?:<\/?[a-z\-\=\"\ ]>)?' if $flag == 0 and $i !~ /[\\|():?[\]]/;
# don't escape __^

Related

multi replace in postgresql query using perl

I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl
CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } keys %subs;
$re = qr/($re)/;
$string =~ s/$re/$subs{$1}/g;
return $string;
$BODY$ language plperl strict immutable;
Example query:
select
name as original_name,
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name
from some_table
The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."
How can I make this work as intended?
When the strings in #$orig are to be matched literally, I'd actually use this:
my ($string, $orig, $repl) = #_;
# Argument checks here.
my %subs; #subs{ #$orig } = #$repl;
my $pat =
join "|",
map quotemeta,
sort { length($b) <=> length($a) }
#$orig;
return $string =~ s/$re/$subs{$&}/gr;
In particular, map quotemeta, was missing.
(By the way, the sort line isn't needed if you ensure that xy comes before x in #$orig when you want to replace both x(?!y) and xy.)
But you want the strings in #$orig to be treated as regex patterns. For that, you can use the following:
# IMPORTANT! Only provide strings from trusted sources in
# `#$orig` as it allows execution of arbitrary Perl code.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?:$orig->[$_])(?{ $_ })",
0..$#$orig;
{
use re qw( eval );
$re = qr/$re/;
}
return $string =~ s/$re/$repl->[$^R]/gr;
However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?<_$_>$orig->[$_])",
0..$#$orig;
$re = qr/$re/;
return
$string =~ s{$re}{
my ( $n ) =
map substr( $_, 1 ),
grep { $-{$_} && defined( $-{$_}[0] ) }
grep { /^_\d+\z/aa }
keys( %- );
$repl->[$n]
}egr;
While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.
If the only thing you care about is $, then you could fix it by changing the map-building lines to:
#subs{map {my $t=$_; $t=~s/\$$//; $t} #$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } #$orig;
But it will be very hard to make it work more generally for every possible feature of regex.
The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)
The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.
So how to make this work with regular expressions in the orig parameter?
The solution proposed by #ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.
I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement.
Code:
CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text AS
$BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|", keys %subs;
$re = qr/($re)/;
# on each match, recheck the match individually against each regexp
# to find the corresponding replacement string
$string =~ s/$re/{ my $r; foreach (#$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;
return $string;
$BODY$ language plperl strict immutable;
Test
=> select pg_temp.multi_replace(
'bar foo - bar & LLC',
'{^bar,-,&,LLC$}',
'{baz,_,and,business}'
);
multi_replace
----------------------------
baz foo _ bar and business

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Perl Grepping from an Array

I need to grep a value from an array.
For example i have a values
#a=('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl');
#Array = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', branches/Main/utils.pl','branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
Now, i need to loop #a and find each value matches to #Array. For Example
It works for me with grep. You'd do it the exact same way as in the More::ListUtils example below, except for having grep instead of any. You can also shorten it to
my $got_it = grep { /$str/ } #paths;
my #matches = grep { /$str/ } #paths;
This by default tests with /m against $_, each element of the list in turn. The $str and #paths are the same as below.
You can use the module More::ListUtils as well. Its function any returns true/false depending on whether the condition in the block is satisfied for any element in the list, ie. whether there was a match in this case.
use warnings;
use strict;
use Most::ListUtils;
my $str = 'branches/Soft/a.txt';
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
my $got_match = any { $_ =~ m/$str/ } #paths;
With the list above, containing the $str, the $got_match is 1.
Or you can roll it by hand and catch the match as well
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
This does print out the match.
Note that the strings you show in your example do not contain the one to match. I added it to my list for a test. Without it in the list no match is found in either of the examples.
To test for more than one string, with the added sample
my #strings = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/h.cpp', 'branches/Main/utils.pl',
'branches/Soft/B2/c.tct', 'branches/Docs/A1/b.txt');
foreach my $str (#strings) {
foreach my $p (#paths) {
print "Found it: $1\n" if $p =~ m/($str)/;
}
# Or, instead of the foreach loop above use
# my $match = grep { /$str/ } #paths;
# print "Matched for $str\n" if $match;
}
This prints
Found it: branches/Soft/a.txt
Found it: branches/Soft/h.cpp
Found it: branches/Main/utils.pl
When the lines with grep are uncommented and foreach ones commented out I get the corresponding prints for the same strings.
The slashes dot in $a will pose a problem so you either have to escape them it when doing regex match or use a simple eq to find the matches:
Regex match with $a escaped:
my #matches = grep { /\Q$a\E/ } #array;
Simple comparison with "equals":
my #matches = grep { $_ eq $a } #array;
With your sample data both will give an empty array #matches because there is no match.
This Solved My Question. Thanks to all especially #zdim for the valuable time and support
my #SVNFILES = ('branches/Soft/a.txt', 'branches/Soft/b.txt');
my #paths = ('branches/Soft/a.txt', 'branches/Soft/b.txt',
'branches/Docs/A1/b.txt', 'branches/Soft/B2/c.tct');
foreach my $svn (#SVNFILES)
{
chomp ($svn);
my $m = grep { /$svn/ } (#paths);
if ( $m eq '0' ) {
print "Files Mismatch\n";
exit 1;
}
}
You should escape characters like '/' and '.' in any regex when you need it as a character.
Likewise :
$a="branches\/Soft\/a\.txt"
Retry whatever you did with either grep or perl with that. If it still doesn't work, tell us precisely what you tried.

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
#str = $string =~ /(\w+)\1+/i;
print "#str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need the String::LCSS_XS which calculates Longest Common SubStrings. Don't try it's Perl-only twin brother String::LCSS because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my #words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(#words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = #words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
my $var = "home test results results-apr-25 results-apr-251.csv";
my #str = split " ", $var;
my %h;
my $last = pop #str;
while (my $curr = pop #str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my #str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself—this is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from #str:
my $longest = (sort { length $b <=> length $a } #str)[0];
(Do note that this is a legible but far from being the most efficient way of finding the longest value, but this is the subject of a different question.)
How about:
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25

Regular expression to get the first letter of a word after punctuation symbols in Perl

Can any body tell me a regular expression in Perl for getting the ucfirst letter of the word comming after a dot,question or exclamation sign...
My program reads string character by character.
Requirement :
input string : "abcd[.?!]\s*abcd"
output: "Abcd[.?!]\s*Abcd"
My program is as follows:
#!/usr/bin/perl
use strict;
my $str = <STDIN>;
my $len=length($str);
my $ch;
my $i;
for($i=0;$i<=length($str);$i++)
{
$ch = substr($str,$i,1);
print "$ch";
if($ch =~ 's/([.?!]\s*[a-z])/uc($1)/ge')
{
$i=$i+1;
$ch = substr($str, $i,1);
my $ch = uc($ch);
print "$ch";
}
#elsif($ch eq "?")
#{
# $i=$i+1;
# $ch = substr($str, $i,1);
# my $ch = uc($ch);
# print "$ch";
#}
#elsif($ch eq "!")
#{
# $i=$i+1;
# $ch = substr($str, $i,1);
# my $ch = uc($ch);
# print"$ch";
#}
#elsif($ch eq " ")
#{
# $i=$i+1;
# $ch = substr($str, $i,1);
# my $ch = uc($ch);
# print"$ch";
#}
#else
#{
#print "";
#}
}
print "\n";
Looping over the string, and then looping over the match, is completely redundant. Your entire program can be replaced with this:
perl -pe 's/(^|[.?!]\s*)([a-z])/$1\U\2/g' inputfile >outputfile
I added beginning of line to the first parenthesized expression, although your explanation doesn't include that (but your example does).
Normally,
$s =~ s/(?<=[.?!]|^)\s*[a-z]/\U$1/g;
$s =~ s/(?<![^.?!])\s*[a-z]/\U$1/g;
$s =~ s/(?:^|[.?!])\s*\K[a-z]/\U$1/g;
But if you only read one character at a time,
my $after_punc = 1;
while (my $ch = ...) {
if ($ch =~ /^[.?!]\z/) {
$after_punc = 1;
}
elsif ($ch =~ /^[a-z]\z/) {
$ch = uc($ch) if $after_punc;
$after_punc = 0;
}
elsif ($ch =~ /^\s\z/) {
# Ignore whitespace.
}
else {
$after_punc = 0;
}
...
}
can any body tell me the regex in perl for getting the ucfirst letter of the word comming after a dot,question or exclamation sign...
My program reads string character by character.
Requirement :
input string : "abcd[.?!]\s*abcd"
output: "Abcd[.?!]\s*Abcd"
Your output does not match your explanation. In the input, the initial "a" does not follow a period, question mark, or exclamation mark, but was changed to upper case.
You can and should do this sort of processing with a single substitution. To do exactly as you said:
s/[.?!]\K[[:lower:]]/uc($&)/ge
The \K discards the character matched by [.?!], leaving only the lower-case letter in the matched string. $& is the matched string. The e flag says to evaluate uc($&).
If you also want to make an initial letter uppercase:
s/(?:^|[.?!])\K[[:lower:]]/uc($&)/ge
If you have unicode string, you could use:
$str =~ s/(\pP|^)(\s*\pL)/$1\U$2/g;