Perl regex replacement when replacement is numeric

This is probably something really silly, and I apologize if that is the case. I don't know exactly what to search for, and I haven't had any luck with the searches I've run over the past half hour or so. Anyway...
So I want to automate making a simple change to XML with perl as part of a build process. This is the change I'm making; it's part of a config file called mapred-site.xml
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
- <value>1024</value>
+ <value>4096</value>
</property>
I've got a perl regex replacement that does exactly what I need it to do, until I change this FOO to 4096
cat mapred-site.xml | perl -p0e "s/(yarn.app.mapreduce.am.resource.mb<\/name>\s*?<value>)....(<\/value>)/\\1FOO\\2/s"
Guessing that the problem is that there are numbers directly next to the \\1 referring to the first portion, and it's pulling them in and trying to do \\14096 or similar, but I haven't been able to come up with a solution.
I apologize if the command itself is sloppy/inefficient, I'm still just getting started with these commands.

Using \1, \2 etc. on the right side of a regex is about a million years old anyway; the recommended way is to use $1, $2, etc. And if you use those you can use braces to separate the variable name from any neighboring stuff, like ${1}FOO${2} (or, just as well, ${1}4096${2}).
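A minimal sketch of the braces form applied to the snippet from the question. The sub name set_am_memory is made up for illustration, and I have tightened the pattern a little (escaped dots, \d+ instead of ....), which is my own variation and not part of the original command:

```perl
use strict;
use warnings;

# Return a copy of $xml with the property's value replaced by 4096.
# ${1} and ${2} keep the literal digits from running into the
# capture-group number.
sub set_am_memory {
    my ($xml) = @_;
    $xml =~ s!(yarn\.app\.mapreduce\.am\.resource\.mb</name>\s*<value>)\d+(</value>)!${1}4096${2}!;
    return $xml;
}

my $xml = <<'XML';
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
XML

print set_am_memory($xml);
```

Run as a script, this prints the same property block with <value>4096</value> in place of <value>1024</value>.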

Here is a less fragile/more maintainable way to do it, using Mojo::DOM:
cat mapred-site.xml | perl -CSD -0777 -MMojo::DOM -pe '$_ = Mojo::DOM->new->xml(1)->parse($_); $_->find("property > name")->first(sub { $_->text eq "yarn.app.mapreduce.am.resource.mb" })->following("value")->first->content(4096)'
As a more readable script:
use strict;
use warnings;
use Mojo::DOM;
use open ':std', ':encoding(UTF-8)';
my $dom = do { local $/; Mojo::DOM->new->xml(1)->parse(readline \*STDIN) };
my $name = $dom->find('property > name')
->first(sub { $_->text eq 'yarn.app.mapreduce.am.resource.mb' });
$name->following('value')->first->content(4096);
print $dom->to_string;


Regex (or bash), get pipes between quotes (perl)

Update: Please keep in mind that regex is my only option.
Update 2: Actually, I can use a bash based solution as well.
Trying to replace the pipes (there can be more than one) that are between double quotes with commas, using a perl regex
Example
continuer|"First, Name"|123|12412|10/21/2020|"3|7"||Yes|No|No|
Expected output (3 and 7 are separated by a comma)
continuer|"First, Name"|123|12412|10/21/2020|"3,7"||Yes|No|No|
There may be more digits; it may not be just the two \d\|\d. It could be "3|7|2", and the correct output has to be "3,7,2" for that one. I've tried the following
cat <filename> | perl -pi -e 's/"\d+\|[\|\d]+/\d+,[\|\d]+/g'
but it just puts the actual string of d+ etc...
I'd really appreciate your help. ty
If it must be a regex here is a simpler one
perl -wpe's/("[^"]+")/ $1 =~ s{\|}{,}gr /eg' file
Not bullet-proof but it should work for the shown use case.†
Explanation. With /e modifier the replacement side is evaluated as code. There, a regex runs on $1 under /r so that the original ($1) is unchanged; $N are read-only and so we can't change $1 and thus couldn't run a "normal" s/// on it. With this modifier the changed string is returned, or the original if there were no changes. Just as ordered.
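The same idea written out as a small script, as a sketch only. The sub name commas_inside_quotes is made up, and I use "[^"]*" rather than "[^"]+" so that an empty "" field is consumed as a pair instead of throwing off the quote pairing; that is my own variation, not part of the one-liner above. The /r modifier needs perl 5.14 or later.

```perl
use strict;
use warnings;

# Turn pipes into commas, but only inside double-quoted fields.
sub commas_inside_quotes {
    my ($line) = @_;
    # /e evaluates the replacement as code; the inner s///r returns a
    # modified copy of $1 instead of trying to change the read-only $1.
    $line =~ s/("[^"]*")/ $1 =~ s{\|}{,}gr /eg;
    return $line;
}

print commas_inside_quotes(
    'continuer|"First, Name"|123|12412|10/21/2020|"3|7"||Yes|No|No|'
), "\n";
```

For the line from the question this prints continuer|"First, Name"|123|12412|10/21/2020|"3,7"||Yes|No|No|.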
Once it's tested well enough add -i to change the input file "in-place" if wanted.
I must add, I see no reason that at least this part of the job can't be done using a CSV parser...
Thanks to ikegami for an improved version
perl -wpe's/"[^"]+"/ $& =~ tr{|}{,}r /eg' file
It's simpler, with no need to capture, and tr is faster
† Tested with strings like in the question, extended only as far as this
con|"F, N"|12|10/21|"3|7"||Yes|"2||4|12"|"a|b"|No|""|end|
I'd use a CSV parser, not regular expressions:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ binary => 1, sep_char => "|"});
while (my $row = $csv->getline(*ARGV)) {
@$row = map { tr/|/,/r } @$row;
$csv->say(*STDOUT, $row);
}
example:
$ perl demo.pl input.txt
continuer|"First, Name"|123|12412|10/21/2020|3,7||Yes|No|No|
More verbose, but also more robust and a lot easier to understand.
If you cannot install modules, Text::ParseWords is a core module you can try. It can split a string and handle quoted delimiters.
use Text::ParseWords;
my $q = q(continuer|"First, Name"|123|12412|10/21/2020|"3|7"||Yes|No|No|);
print join "|", map { tr/|/,/; $_ } quotewords('\|', 1, $q);
As a one-liner, it would be:
perl -MText::ParseWords -pe'$_ = join "|", map { tr/|/,/; $_ } quotewords(q(\|), 1, $_);' yourfile.txt
You said "Update 2: Actually, I can use a bash based solution as well." This script isn't bash, but you can call it from bash (or any other shell), which I assume is what you really mean by "bash based". It will work using any awk in any shell on every Unix box:
$ awk 'BEGIN{FS=OFS="\""} {for (i=2; i<=NF; i+=2) gsub(/\|/,",",$i)} 1' file
continuer|"First, Name"|123|12412|10/21/2020|"3,7"||Yes|No|No|
Imagine yourself having to debug or enhance the clear, simple loop above vs the regexp incantation you posted in your answer:
's/(?:(?<=")|\G(?!^))(\s*[^"|\s]+(?:\s+[^"|\s]+)*)\s*\|\s*(?=[^"]*")/$1,/g'
Remember - Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
I'm sure you could do what I'm doing with awk above natively in perl instead if you're trying to modify a perl script to add this functionality.
I'd use Text::CSV_XS.
perl -MText::CSV_XS=csv -e'
csv
in => \*ARGV,
sep_char => "|",
on_in => sub { tr/|/,/ for @{ $_[1] } };
'
You can provide the file name as an argument or provide the data via STDIN.
This is working right now
's/(?:(?<=")|\G(?!^))(\s*[^"|\s]+(?:\s+[^"|\s]+)*)\s*\|\s*(?=[^"]*")/$1,/g'
Credit goes to my boss at work
Thanks everyone for looking.
I hope some of you realize that some projects require certain ways and complicating an already very complicated pre existing structure is not always an option at work. I knew there would be a one liner for this, do not hate because you did not like that.

Complex regex - works in Powershell, not in Bash

The below code is a small portion of my code for Solarwinds to parse the output of a Netbackup command. This is fine for our Windows boxes but some of our boxes are RHEL.
I'm trying to convert the below code into something useable on RHEL 4.X but I'm running into a wall with parsing the regex. Obviously the below code has some of the characters escaped for use with Powershell, I have unescaped those characters for use with Shell.
I'm not great with Shell yet, but I will post a portion of my Shell code below the Powershell code.
$output = ./bpdbjobs
$Results = @()
$ColumnName = @()
foreach ($match in $OUTPUT) {
$matches = $null
$match -match "(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+`-\w+`-\w+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+`_\w+)|(\w+`_\w+`_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?\s+(?<FATPipe>\b[^\d\W]+\b)?"
$Results+=$matches
}
The below is a small portion of Shell code I've written (which is clearly very wrong, learning as I go here). I'm just using this to test the Regex and see if it functions in Shell - (Spoiler alert) it does not.
#!/bin/bash
#
backups=bpdbjobs
results=()
for results in $backups; do
[[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]
done
$results
Below are the errors I get.
./netbackupsolarwinds.sh: line 9: syntax error in conditional expression: unexpected token `('
./netbackupsolarwinds.sh: line 9: syntax error near `/(?'
./netbackupsolarwinds.sh: line 9: ` [[ $results =~ /(?<jobID>\d+)?\s+(?<Type>(\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+(?<State>(Done)|(Active)|(\w+\w+\-\w\-+))?\s+(?<Status>\d+)?\s+(?<Policy>(\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+(?<Schedule>(\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+(?<Client>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Dest_Media_Svr>(\w+\.\w+\.\w+)|(\w+))?\s+(?<Active_PID>\d+)?/ ]]'
From man bash:
An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).
Meaning that the expression is parsed as a POSIX extended regular expression, which AFAIK does not support either named capturing groups ((?<name>...)) or character escapes (\d, \w, \s, ...).
If you want to use [[ $var =~ expr ]] you need to rewrite the regular expression. Otherwise use grep (which supports PCRE):
grep -P '(?<jobID>\d+)?\s+...' <<<$results
Updated answer, after comments exchange.
The best way to perform your migration quickly is to use grep's --perl-regexp (-P) Perl-compatibility option, as suggested in another answer.
If you still want to perform this operation with pure Bash, you need to rewrite the regular expression accordingly, following the documentation.
Thanks all for the answers. I swapped to grep -P to no avail; it turns out the named capture groups were the problem for grep -P.
I was also unable to figure out a way to use grep to output the capture group matches to individual variables.
This led me to swap over to using perl, as follows, with alterations to my regex.
bpdbjobs | perl -lne 'print "$1" if /(\d+)?\s+((\b[^\d\W]+\b)|(\b[^\d\W]+\b\s+\b[^\d\W]+\b))?\s+((Done)|(Active)|(\w+\w+\-\w\-+))?\s+(\d+)?\s+((\w+)|(\w+\_\w+)|(\w+\_\w+\_\w+))?\s+((\b[^\d\W]+\b\-\b[^\d\W]+\b)|(\-)|(\b[^\d\W]+\b))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+((\w+\.\w+\.\w+)|(\w+))?\s+(\d+)?/g'
With $<num> referring to the capture group number. I can now list, display and (the important part) count the number of matches within an individual group, corresponding to the data found in each column.
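For what it's worth, Perl itself (5.10 and later) does support named groups: after a successful match they are available through the %+ hash, so the names from the PowerShell version can be kept. A rough sketch follows. The sample line and the simplified per-column patterns are assumptions for illustration only (real bpdbjobs output may look different), and parse_job_line is a made-up name.

```perl
use strict;
use warnings;

# Parse the first four columns of one job line into a hash keyed by
# group name. Returns nothing if the line does not match.
sub parse_job_line {
    my ($line) = @_;
    return unless $line =~ /^\s*(?<jobID>\d+)\s+(?<Type>[A-Za-z]+)\s+(?<State>\S+)\s+(?<Status>\d+)/;
    my %fields = %+;    # copy the named captures before the next match
    return \%fields;
}

# Hypothetical sample line, not taken from real bpdbjobs output.
my $job = parse_job_line('  12345 Backup Done 0 Daily_Full Full client01 media01 4321');
printf "job=%s type=%s state=%s status=%s\n",
    @{$job}{qw(jobID Type State Status)} if $job;
```

Collecting the returned hashes in an array would then let you count or filter by column, much like $Results in the PowerShell loop.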

How to remove the last 6 digits from the filename in Perl using regex

I need your help in creating a regex to delete the hh:mm:ss bits from the file name.
I have the file name in format of:
abcd_efgh_ijkl_mnop_20140720151617.txt
And I want to rename it to:
abcd_efgh_ijkl_mnop_20140720.txt
Before moving it to the server. The Perl code I am using doesn't work. I cannot use substr or the rename function due to script requirements.
$file_name = @file_array;
$file_name =~s/$\s+\d{8}(.*)/$1/;
Please help me in creating the correct regex to do the same.
Instead of focusing on what you don't want, specify a regex that states what you DO want.
In this case, you specifically want to keep the first 8 digits of numbers and truncate the rest:
use strict;
use warnings;
while (<DATA>) {
s/\d{8}\K\d+//;
print;
}
__DATA__
abcd_efgh_ijkl_mnop_20140720151617.txt
Outputs:
abcd_efgh_ijkl_mnop_20140720.txt
Or, if \K is not an option because you're working with a particularly ancient version of perl (it was added in 5.10), a capture group can achieve the same result: s/(\d{8})\d+/$1/;
Try this:
$filename="abcd_efgh_ijkl_mnop_20140720151617.txt";
$filename=~s/\d{6}\.txt$/.txt/;
You could try the below perl command,
$ echo 'abcd_efgh_ijkl_mnop_20140720151617.txt' | perl -pe 's/^(.*).{6}(\..*)$/\1\2/g'
abcd_efgh_ijkl_mnop_20140720.txt
So it would be,
$file_name =~s/^(.*).{6}(\..*)$/$1$2/g;
my $in = 'abcd_efgh_ijkl_mnop_20140720151617.txt';
print "$in\n";
my ($new) = $in =~ /(.*2014\d{4})/;
print "$new\n";
Try a non-greedy match:
$file_name = 'abcd_efgh_ijkl_mnop_20140720151617.txt';
$file_name =~s/(.*?)\d{6}\.txt/$1.txt/;
print $file_name;

how to write a script or shell command to use a regex in perl?

I am a computer enthusiast but never studied programming.
I am trying to learn Perl because I found it interesting after learning a little about Perl-flavored regular expressions; I needed to replace words in certain parts of strings, and that's how I found Perl.
I don't know anything about programming, though, so I would like to see simple examples of how to use a regular expression from the shell (terminal) or in a basic script.
For example, say I have a text document called input.txt in a folder.
How can I perform the following regex?
text to match :
text text text
text text text
What I want: change the second occurrence of the word text to the word: changed
(\A.*?tex.*?)text(.*?)$
replace for : \1changed\3
expected result:
text changed text
text changed text
In a text editor I would do this using the multi-line and global modifiers.
Now, how can I process this from the shell?
cd to the path and then what?
Or a script? What should it contain to make it work?
Please consider that I don't know anything about Perl, only its regexp syntax.
The regular expression part is easy.
s/\btext\b.*?\K\btext\b/changed/;
However, how to apply it if you're learning perl... that's the hard part. One could demonstrate a one liner, but that's not that helpful.
perl -i -pe 's/\btext\b.*?\K\btext\b/changed/;' file.txt
So instead, I'd recommend looking at the perlfaq5 entry "How do I change, delete, or insert a line in a file, or append to the beginning of a file?". Ultimately what you need to learn is how to open a file for reading, and iterate over the lines. And alternatively, how to open a file for writing. With these two tools, you can do a lot.
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
my $newfile = 'new_blah.txt';
open my $infh, '<', $file;
open my $outfh, '>', $newfile;
while (my $line = <$infh>) {
# Any manipulation to $line here, such as that regex:
# $line =~ s/\btext\b.*?\K\btext\b/changed/;
print $outfh $line;
}
close $infh;
close $outfh;
Update to explain regex
s{
\btext\b # Find the first 'text' not embedded in another word
.*? # Non-greedily skip characters
\K # Keep the stuff left of the \K, don't include it in replacement
\btext\b # Match 2nd 'text' not embedded in another word
}{changed}x; # Replace with 'changed' /x modifier to allow whitespace in LHS.

How could I get this regex statement to capture just a select piece

I am trying to get the regex in this loop,
my $vmsn_file = $snapshots{$snapshot_num}{"filename"};
my @current_vmsn_files = $ssh->capture("find -name $vmsn_file");
foreach my $vmsn (@current_vmsn_files) {
$vmsn =~ /(.+\.vmsn)/xm;
print "$1\n";
}
to capture the filename from this line,
./vmfs/volumes/4cbcad5b-b51efa39-c3d8-001517585013/MX01/MX01-Snapshot9.vmsn
The only part I want is the actual filename, not the path.
I tried using an expression that was anchored to the end of the line using $ but that did not seem to make any difference. I also tried using 2 .+ inputs, one before the capture group and the one inside the capture group. Once again no luck, also that felt kinda messy to me so I don't want to do that unless I must.
Any idea how I can get at just the file name after the last / to the end of the line?
More can be added as needed, I am not sure what I needed to post to give enough information.
--Update--
With 5 minutes of tinkering I seemed to have figured it out. (what a surprise)
So now I am left with this, (and it works)
my $vmsn_file = $snapshots{$snapshot_num}{"filename"};
my @current_vmsn_files = $ssh->capture("find -name $vmsn_file");
foreach my $vmsn (@current_vmsn_files) {
$vmsn =~ /.+\/(\w+\-Snapshot\d+\.vmsn)/xm;
print "$1\n";
}
Is there any way to make this better?
Probably the best way is using the core module File::Basename. That will make your code most portable.
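A short sketch of the File::Basename route, using the path from the question. The chomp is there on the assumption that each line returned by capture still carries its trailing newline, as lines of command output usually do:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $vmsn = "./vmfs/volumes/4cbcad5b-b51efa39-c3d8-001517585013/MX01/MX01-Snapshot9.vmsn\n";

chomp $vmsn;                    # drop the trailing newline first
my $file = basename($vmsn);     # everything after the last path separator
print "$file\n";                # prints MX01-Snapshot9.vmsn
```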
If you really want to do it with a regex and you are based on Unix, then you could use:
$vmsn =~ m%.*/([^/]+)$%;
$file = $1;
Well, if you are going to use the find command from the shell, and considering you stated that you only want the file name, why not:
... $ssh->capture("find -name $vmsn_file -printf \"%f\n\" ");
If not, the simplest way is to split() your string on "/" and then get the last element. No need for regular expressions that are too long or complicated.
See perldoc -f split for more information on usage
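Roughly, the split approach could look like this (last_component is just an illustrative name):

```perl
use strict;
use warnings;

# Return the part of $path after the last "/".
sub last_component {
    my ($path) = @_;
    return (split m{/}, $path)[-1];   # list slice: keep only the last field
}

print last_component(
    './vmfs/volumes/4cbcad5b-b51efa39-c3d8-001517585013/MX01/MX01-Snapshot9.vmsn'
), "\n";    # prints MX01-Snapshot9.vmsn
```

Note that a path ending in "/" would give you the last directory name instead, because split drops trailing empty fields.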