Join, split and map using perl for creating new attribs - regex

my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str =~ s#<SampleElement[^>]*oldattribs="([^"]*)"#
my $fulcnt=$&;
my $afids=$1;
my #affs = ();
if($afids =~ m/\s+/) {
#affs = split /\s/, $afids;
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } #affs;
($fulcnt." newattribs=\"$jnafs\"");
}
else {
($fulcnt);
}
#eg;
My Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,1,1">
Expected Output:
<SampleElement oldattribs="sa1 sa2 sa3" newattribs="1,2,3">
Someone could point out me where I am doing wrong. Thanks in advance.

Where you're going wrong is earlier than you think - you're parsing XML using regular expressions. XML is contextual, and regex isn't, so it's NEVER going to be better than a dirty hack.
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
my $sample_elt = $twig -> get_xpath('//SampleElement',0);
my #old_att = split ( ' ', $sample_elt -> att('oldattribs') );
$sample_elt -> set_att('newattribs', join " ", map { /(\d+)/ } #old_att);
$twig -> set_pretty_print ( 'indented_a' );
$twig -> print;
__DATA__
<XML>
<SampleElement oldattribs="sa1 sa2 sa3">
</SampleElement>
</XML>
But to answer the core of your problem - you're misusing map as an iterator here.
map { $_=~s/[a-z]*//i, } #affs;
Because what that is doing is iterating all the elements in #affs and modifying those... but map is just returning the result of the expression - which is 1 because it worked.
If you want to change #affs you'd:
s/[a-z]*//i for #affs;
But if you didn't want to, then the easy answer is to use the r regex flag:
map { s/[a-z]*//ir } #affs;
Or as I've done in my example:
map { /(\d+)/ } #affs;
Which regex matches and captures the numeric part of the string, but as a result the 'captured' text is what's returned.

Here is a simple way to build shown output from the input $str.
Note: The input is in single quotes, not double. Then the \" isn't a problem in the regex.
my $str = '<SampleElement oldattribs=\"sa1 sa2 sa3\">';
# Pull 'sa1 sa2 sa3' string out of it
my ($attrs) = $str =~ /=\\"([^\\]+)/; # " # (turn off bad syntax highlight)
# Build '1,2,3' string from it
my $indices = join ',', map { /(\d+)/ } split ' ', $attrs;
# Extract content between < > so to add to it, put it back together
my ($content) = $str =~ /<(.*)>/;
my $outout = '<' . $content . " newattribs=\"$indices\"" . '>';
This gives the required output.
Some of these can be combined into single statements, if you are into that. For example
my $indices =
join ',', map { /(\d+)/ } split ' ', ($str =~ /"([^\\]+)/)[0]; # "
$str =~ s/<(.*)>/<$1 newattribs=\"$indices\">/;
All of this can be rolled into one regex, but it becomes just unwieldy and hard to maintain.
Above all – this appears to be XML or such ... please don't do it by hand, unless there is literally just a snippet or two. There are excellent parsers.

Found solution on this by searching map function:
my $str = "<SampleElement oldattribs=\"sa1 sa2 sa3\">";
$str=~s#<SampleElement[^>]*oldattribs="([^"]*)"#my $fulcnt=$&; my $afids=$1;
my #affs = ();
if($afids=~m/\s+/)
{
#affs = split /\s/, $afids;
my #newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } #affs ;
($fulcnt." newattribs=\"#newas\"");
}
else
{
($fulcnt);
}
#eg;
I have updated the below line on my code:
my #newas = join ",", map { (my $foo = $_) =~ s/[a-z]*//i; $foo; } #affs ;
Instead of
my $jnafs = join ",", map { $_=~s/[a-z]*//i, } #affs;
Its working thanks for all.

Related

multi replace in postgresql query using perl

I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl
CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text
AS $BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } keys %subs;
$re = qr/($re)/;
$string =~ s/$re/$subs{$1}/g;
return $string;
$BODY$ language plperl strict immutable;
Example query:
select
name as original_name,
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name
from some_table
The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."
How can I make this work as intended?
When the strings in #$orig are to be matched literally, I'd actually use this:
my ($string, $orig, $repl) = #_;
# Argument checks here.
my %subs; #subs{ #$orig } = #$repl;
my $pat =
join "|",
map quotemeta,
sort { length($b) <=> length($a) }
#$orig;
return $string =~ s/$re/$subs{$&}/gr;
In particular, map quotemeta, was missing.
(By the way, the sort line isn't needed if you ensure that xy comes before x in #$orig when you want to replace both x(?!y) and xy.)
But you want the strings in #$orig to be treated as regex patterns. For that, you can use the following:
# IMPORTANT! Only provide strings from trusted sources in
# `#$orig` as it allows execution of arbitrary Perl code.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?:$orig->[$_])(?{ $_ })",
0..$#$orig;
{
use re qw( eval );
$re = qr/$re/;
}
return $string =~ s/$re/$repl->[$^R]/gr;
However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.
my ($string, $orig, $repl) = #_;
# Argument checks here.
my $re =
join "|",
map "(?<_$_>$orig->[$_])",
0..$#$orig;
$re = qr/$re/;
return
$string =~ s{$re}{
my ( $n ) =
map substr( $_, 1 ),
grep { $-{$_} && defined( $-{$_}[0] ) }
grep { /^_\d+\z/aa }
keys( %- );
$repl->[$n]
}egr;
While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.
If the only thing you care about is $, then you could fix it by changing the map-building lines to:
#subs{map {my $t=$_; $t=~s/\$$//; $t} #$orig} = #$repl;
my $re = join "|",
sort { (length($b) <=> length($a)) } #$orig;
But it will be very hard to make it work more generally for every possible feature of regex.
The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)
The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.
So how to make this work with regular expressions in the orig parameter?
The solution proposed by #ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.
I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement.
Code:
CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[])
RETURNS text AS
$BODY$
my ($string, $orig, $repl) = #_;
my %subs;
if (#$orig != #$repl) {
elog(ERROR, "array sizes mismatch");
}
if (ref #$orig[0] eq 'ARRAY' || ref #$repl[0] eq 'ARRAY') {
elog(ERROR, "array dimensions mismatch");
}
#subs{#$orig} = #$repl;
my $re = join "|", keys %subs;
$re = qr/($re)/;
# on each match, recheck the match individually against each regexp
# to find the corresponding replacement string
$string =~ s/$re/{ my $r; foreach (#$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;
return $string;
$BODY$ language plperl strict immutable;
Test
=> select pg_temp.multi_replace(
'bar foo - bar & LLC',
'{^bar,-,&,LLC$}',
'{baz,_,and,business}'
);
multi_replace
----------------------------
baz foo _ bar and business

How do I find what is missing from second string based on first string?

$a = "
fddf\n
dfdf\n
eeee\n"
$b = "
fddf\n
dfdf\n
pppp\n
erww\n"
The output should be "eeee\n" because it is missing in second string. I thought of using perl regular expression but it does not tell me the what is missing from the second string.
First: always use strict and warnings, and don't use $a and $b outside of sort, they're special.
use strict;
use warnings;
my $x = " fddf\n dfdf\n eeee\n";
my $y = " fddf\n dfdf\n pppp\n erww\n";
my #x_chars = split //, $x;
my #y_chars = split //, $y;
my #missing_chars;
while (#x_chars and #y_chars) {
my $next = shift #x_chars;
if ($next eq $y_chars[0]) {
shift #y_chars;
} else {
push #missing_chars, $next;
}
}
push #missing_chars, #x_chars;
my $missing = join '', #missing_chars;
As mentioned CPAN modules like String::Diff (wrapper of Algorithm::Diff) will provide a simpler and more comprehensive solution if your requirements become more complex.

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

perl regex for multiple matches

I have a file with lines similar to following:
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
I want to retain strings up to last ":" and remove all occurrences of abcd1
In the end, I need to have below:
abcd1::101:xyz1,xyz2,xyz3,pqrs1,xyz4
I tried code as below, but for some reason, it is not working. So please help
the account name is "abcd1"
sub UpdateEtcGroup {
my $account = shift;
my $file = "/tmp/group";
#ARGV = ($file);
$^I = ".bak";
while (<>){
s#^($account::\d{1,$}:)$account,?#$1#g;
s/,$//; # to remove the last "," if there
print;
}
}
split is the tool for the job, not a regex.
Because split lets you reliably separate out the field you do want to operate on, from the ones that you don't. Like this:
#!/usr/bin/env perl
use strict;
use warnings;
my $username = 'abcd1';
while ( <DATA> ) {
my #fields = split /:/;
my #users = split ( /,/, pop ( #fields ) );
print join ( ":", #fields,
join ( ",", grep { not m/^$username$/ } #users ) ),"\n";
}
__DATA__
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
Don't use a regular expression for this.
use strict;
use warnings;
while (<DATA>) {
chomp;
my #parts = split(/:/, $_);
$parts[-1] = join(',', grep { !/^abcd/ } split(/,/, $parts[-1]));
print join(':', #parts) . "\n";
}
__DATA__
abcd1::101:xyz1,user,user1,abcd1,pqrs1,userblah,abcd1
abcd2::102:user1,xyz2,otheruser,abcd2,pqrs1,xyz4,abcd2
Output:
abcd1::101:xyz1,user,user1,pqrs1,userblah
abcd2::102:user1,xyz2,otheruser,pqrs1,xyz4

In Perl, how can I remove all spaces that are not inside double quotes " "?

I'm tying to come up with some regex that will remove all space chars from a string as long as it's not inside of double quotes (").
Example string:
some string with "text in quotes"
Result:
somestringwith"text in quotes"
So far I've come up with something like this:
$str =~ /"[^"]+"|/g;
But it doesn't seem to be giving the intended result.
I'm honestly very new at perl and haven't had too much regexp experience. So if anyone willing to answer would also be willing to provide some insight into the why and how that would be great!
Thanks!
EDIT
String will not contain escaped "'s
It should actually always be formatted like this:
Some.String = "Some Value"
Result would be
Some.String="Some Value"
Here is a technique using split to separate the quoted strings. It relies on your data being consistent and will not work with loose quotes.
use strict;
use warnings;
my #line = split /("[^"]*")/;
for (#line) {
unless (/^"/) {
s/[ \t]+//g;
}
}
print #line; # line is altered
Basically, you split up the string in order to isolate the quoted strings. Once that is done, perform the substitution on all other strings. Since the array elements are aliased in the loop, substitutions are performed on the actual array.
You can run this script like so:
perl -n script.pl inputfile
To see the output. Or
perl -n -i.bak script.pl inputfile
To do in-place edit on inputfile, while saving backup in inputfile.bak.
With that said, I'm not sure what your edit means. Do you want to change
Some.String = "Some Value"
to
Some.String="Some Value"
Text::ParseWords is tailor-made for this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
my #strings = (
q{This.string = "Hello World"},
q{That " string " and "another shoutout to my bytes"},
);
for my $s ( #strings ) {
my #words = quotewords '\s+', 1, $s;
print join('', #words), "\n";
}
Output:
This.string="Hello World"
That" string "and"another shoutout to my bytes"
Using Text::ParseWords means if you ever had to deal with quoted strings with escaped quotation marks in them, you'd be ready ;-)
Also, this sounds like you have a configuration file of some sort and you're trying to parse it. If that is the case, there are probably better solutions.
I suggest removing the quoted substrings using split and then recombining them with join after removing whitespace from the intermediate text.
Note that if the regex used for split contains captures then the captured values will also be included in the list returned.
Here's some sample code.
use strict;
use warnings;
my $source = <<END;
Some.String = "Some Value";
Other.String = "Other Value";
Last.String = "Last Value";
END
print join '', map {s/\s+// unless /"/; $_; } split /("[^"]*")/, $source;
output
Some.String= "Some Value";Other.String = "Other Value";Last.String = "Last Value";
I would simply loop through the string char by char. This way you can handle escaped strings too (just add an isEscaped variable).
my $text='lala "some thing with quotes " lala ... ';
my $quoteOpen = 0;
my $out;
foreach $char(split//,$text) {
if ($char eq "\"" && $quoteOpen==0) {
$quoteOpen = 1;
$out .= $char;
} elsif ($char eq "\"" && $quoteOpen==1) {
$quoteOpen = 0;
$out .= $char;
} elsif ($char =~ /\s/ && $quoteOpen==1) {
$out .= $char;
} elsif ($char !~ /\s/) {
$out .= $char;
}
}
print "$out\n";
Splitting on double quotes, removing spaces only from even fields (i.e. those in quotes):
sub remove_spaces {
my $string = shift;
my #fields = split /"/, $string . ' '; # trailing space needed to keep final " in output
my $flag = 1;
return join '"', map { s/ +//g if $flag; $flag = ! $flag; $_} #fields;
}
It can be done with regex:
s/([^ ]*|\"[^\"]*\") */$1/g
Note that this won't handle any kind of escapes inside the quotes.