Replace quotes within quote encapsulated string using Perl regular expressions - regex

I am trying to replace quotes within a pipe delimited and quote encapsulated file without replacing the quotes providing the encapsulation.
I have tried using the below Perl line to replace the quotes with a back tick ` but am not sure how to replace only the quotes and not the entire group 1.
Sample data (test.txt):
"1"|"Text"|"a"\n
"2"|""Text in quotes""|"ab"\n
"3"|"Text "around" quotes"|"abc"\n
perl -pi.bak -e 's/(?<=\|")(.*)(?="\|)/\1`/' test.txt
Here is what is happening:
"1"|"`"|"a"\n
"2"|"`"|"ab"\n
"3"|"`"|"abc"\n
Here is what I am trying to achieve:
"1"|"Text"|"a"\n
"2"|"`Text in quotes`"|"ab"\n
"3"|"Text `around` quotes"|"abc"\n

With Perl 5.14 and newer, you may use
perl -pi.bak -e 's/(?:^|\|)(")?\K(.*?)(?=\1(?:$|\|))/$2=~s#"|(`)#`$1#gr/ge' test.txt
See the regex demo and an online demo.
The point here is that you match the fields with the first regex, and then you deal with double quotation marks and backticks using the second regex run on the match part.
Details
(?:^|\|) - matches the start of a string or |
(")? - an optional Group 1 matching a "
\K - match reset operator discarding all text in the current match buffer
(.*?) - Group 2: any 0+ chars other than line break chars
(?=\1(?:$|\|)) - a positive lookahead that makes sure there is the same value as in Group 1 and then the end of string or | immediately to the right of the current location.
So, Group 2 is the cell contents, with no enclosing double quotation marks. $2=~s#"|(`)#`$1#gr replaces all " with ` and doubles every literal backtick found in the Group 2 value (see this regex demo). The "|(`) pattern matches a " or a backtick (capturing the latter into Group 1), and the `$1 replacement replaces the match with a backtick followed by the contents of Group 1.
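A minimal sketch of the same two-step idea written as a script, assuming every field is quoted and each line has already been chomped. The names escape_field and fix_line are hypothetical, and the inner substitution is written with /e and `$1 // ''` only so that it stays quiet under use warnings:

```perl
use strict;
use warnings;
use 5.014;    # the /r modifier needs Perl 5.14 or newer

# Inner step: turn each " into a backtick and double each existing backtick.
sub escape_field {
    my ($field) = @_;
    return $field =~ s{"|(`)}{ '`' . ($1 // '') }ger;
}

# Outer step: match each field's contents (Group 2) and rewrite only that part.
sub fix_line {
    my ($line) = @_;
    $line =~ s/(?:^|\|)(")?\K(.*?)(?=\1(?:$|\|))/escape_field($2)/ge;
    return $line;
}

say fix_line($_) for
    '"1"|"Text"|"a"',
    '"2"|""Text in quotes""|"ab"',
    '"3"|"Text "around" quotes"|"abc"';
```

Keeping the inner step in its own sub makes it easy to change what happens to the quotes without touching the field-matching regex.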

Updated for clarification that backticks that are already present should be doubled
One way is to split on | and strip the enclosing quotes to make the remaining regex simple, then assemble the string back. That may lose some efficiency in comparison with a single regex but is much simpler to maintain.
perl -F"\|" -wlanE'
say join "\|",
map { s/^"|"$//g; s/`/``/g; s/"([^"]+)"/`$1`/g; qq("$_") } @F
' data.txt
The -a option makes it "autosplit" each line so in the program the line tokens are available in @F, and the -F specifies the pattern to split on (other than default). The -l handles newlines. See Command switches in perlrun.
In the map the enclosing "s are removed and any existing backticks doubled; then " around patterns are changed, globally. Then the quotes are put back and the returned list join-ed. The | in the join is escaped so as to sneak it through the shell to the Perl program; if this goes into a script (instead of a one-liner), which I'd always recommend, change that \| to |.
I don't know the typical data and possible edge cases regarding quoting, but if there may be loose (single, unpaired) quotes the above will have problems and may produce wrong output, and quietly; just as any procedure that expects paired quotes would, without an extremely detailed analysis.
It may be overall safer to simply replace all "s (other than enclosing ones), with
map { s/^"|"$//g; s/`/``/g; s/"/`/g; qq("$_") }
(or with tr instead of regex s///g). That also adds some measure of efficiency.
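As a rough sketch of how the "replace all inner quotes" variant might look in a script rather than a one-liner, assuming chomped lines and fields that are always enclosed in quotes (the sub name fix_line_split is hypothetical):

```perl
use strict;
use warnings;

# Split on |, strip the enclosing quotes, double existing backticks,
# turn every remaining " into a backtick, then re-quote and re-join.
sub fix_line_split {
    my ($line) = @_;
    my @fields = split /\|/, $line, -1;    # -1 keeps trailing empty fields
    for my $field (@fields) {
        $field =~ s/^"|"$//g;
        $field =~ s/`/``/g;
        $field =~ tr/"/`/;
        $field = qq("$field");
    }
    return join '|', @fields;
}

print fix_line_split($_), "\n" for
    '"2"|""Text in quotes""|"ab"',
    '"3"|"Text "around" quotes"|"abc"';
```

In a script the | in the join needs no escaping, since there is no shell quoting to get through.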
Another way to get to the "meat" of the data is to use Text::CSV, which allows a delimiter other than (the default) comma and absorbs the enclosing quotes. Having quotes inside fields is considered bad CSV but the module can parse that just fine as well, with choices below.
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift || 'data.txt';
my $outfile = 'new_' . $file;
my $csv = Text::CSV->new( {
binary => 1, sep_char => '|',
allow_loose_quotes => 1, escape_char => '', # quotes inside fields
always_quote => 1 # output as desired
} ) or die "Can't do CSV: ", Text::CSV->error_diag;
open my $fh, '<', $file or die "Can't open $file: $!";
open my $out_fh, '>', $outfile or die "Can't open $outfile: $!";
while (my $row = $csv->getline($fh)) {
s/`/``/g for @$row;
tr/"/`/ for @$row;
$csv->say($out_fh, $row);
}
To work with quotes inside fields the escape_char needs to differ from quote_char; I've simply set it to '' here. The output is handled by the module as well, and always_quote attribute is for that (to quote all fields, needed or not). Please see documentation.
If the purpose of the question is precisely to clean up a file format where same quoting is used both for fields and inside the fields, I'd suggest to do it all with the module. This approach allows one to cleanly and consistently set up all kinds of options, both for input and output, and is maintainable.
A few questions
What kind of data is there and is it possible to have a stray quote? Then what? This can affect even the choice of the optimal approach as it may require a detailed analysis.
If the quest here is to straighten CSV-style data, then why not double the quotes inside fields, as common and proper in CSV, instead of replacing them (and potentially hurting their textual meaning)? See module's docs, for instance.

Perl uses $1 as the placeholder for the first capturing group in the replacement part of the regex instead of \1 (used in the matching part of the regex). Your regex wasn't matching the inner quotes and would fail to match the first or last field of your pipe delimited data. Your substitution also failed to include a quote character before the captured group.
Try:
perl -pi.bak -e 's/(?<=(?:^|\|)")"([^"]*)"(?="(?:$|\|))/`$1`/' test.txt
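To illustrate the \1 versus $1 distinction on its own, here is a small sketch with a made-up input word; mark_double is a hypothetical helper, not part of the answer above:

```perl
use strict;
use warnings;

# \1 inside the pattern refers back to what Group 1 just matched
# (so the pattern finds a doubled character); $1 in the replacement
# inserts the captured text.
sub mark_double {
    my ($text) = @_;
    $text =~ s/(\w)\1/[$1$1]/;
    return $text;
}

print mark_double('hello'), "\n";
```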

Another Perl. After splitting into the array @F, check for " that is not at the beginning/end of the elements.
perl -F"\|" -lane ' for(@F) { s/(?<!^)"(?!$)/`/g }; print join("|",@F) '
with the given inputs
$ cat grasshopper.txt
"1"|"Text"|"a"
"2"|""Text in quotes""|"ab"
"3"|"Text "around" quotes"|"abc"
$ perl -F"\|" -lane ' for(@F) { s/(?<!^)"(?!$)/`/g }; print join("|",@F) ' grasshopper.txt
"1"|"Text"|"a"
"2"|"`Text in quotes`"|"ab"
"3"|"Text `around` quotes"|"abc"
$

Related

Wild card matching

I need to match sentences that contain both of the special characters / and . in the same sentence. How can I do that with Perl?
Say suppose my file has following sentences :
ttterfasghti.
ddseghies/affag
hhail/afgsh.
asfsdgagh/
adterhjc/sgsagh.
My expected output should be :
hhail/afgsh.
adterhjc/sgsagh.
Given a clarification from a comment
Any order but the matching line should contain both / and .
an easy way
perl -wne'print if m{/} and m{\.}' filename
This is inefficient in the sense that it starts the regex engine twice and scans each string twice. However, in most cases that is unnoticeable while this code is much clearer than a single regex for the task.
I use {} delimiters so as not to have to escape the /, in which case the m in front is compulsory. Then I use the same m{...} on the other pattern for consistency.
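For comparison, a single-regex version could be sketched with two lookaheads anchored at the start of the line. This is only an illustration of the alternative mentioned above; the sub name has_slash_and_dot is hypothetical and the sample lines are the ones from the question:

```perl
use strict;
use warnings;

# Each lookahead scans from the start of the line without consuming text,
# so the order in which / and . appear does not matter.
sub has_slash_and_dot {
    my ($line) = @_;
    return $line =~ m{^(?=.*/)(?=.*\.)} ? 1 : 0;
}

my @lines = ('ttterfasghti.', 'ddseghies/affag', 'hhail/afgsh.',
             'asfsdgagh/', 'adterhjc/sgsagh.');
print "$_\n" for grep { has_slash_and_dot($_) } @lines;
```

It does the job in one pattern, but the two-match form is arguably easier to read at a glance.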
A most welcome inquiry comes that this be done in a script, not one-liner! Happy to oblige.
use warnings;
use strict;
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
print if m{/} and m{\.};
}
close $fh;
This feels like a duplicate, but I just can't find a good previous question for this.
For / there are two ways:
use m// operator with different separator characters, e.g. m,<regex with />,, m{<regex with />}, or
escape it, i.e. /\//
For . use escaping.
Note that inside a character class ([...]) many special characters no longer need escaping.
Hence we get:
$ perl <dummy.txt -ne 'print "$1: $_" if m,(\w+/\w*\.),'
hhail/afgsh.: hhail/afgsh.
adterhjc/sgsagh.: adterhjc/sgsagh.
i.e. the line is printed if it contains one-or-more word characters, followed by a /, zero-or-more word characters, ending with a ..
Recommended reading perlrequick, perlretut & perlre.
UPDATE after OP clarified the requirement in a comment:
$ perl <dummy.txt -ne 'print if m,/, && m{\.}'
hhail/afgsh.
adterhjc/sgsagh.

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where one or more words are enclosed in tags. In those instances where there is more than one word (the words are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...
This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in the @ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded
my $str = q(...);
with
my $str = shift @ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
which in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from @ARGV and drop the -p switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to @ARGV when outside a subroutine, as here (inside a subroutine its default target is @_).
This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
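A small sketch of why the lookahead matters, comparing it with a variant that consumes the second word (both sub names are hypothetical, and the input is just the inner text from the question):

```perl
use strict;
use warnings;

# Consuming the second word: the next match has to start after it,
# so "bbb" can no longer serve as the first word of "bbb=ccc".
sub split_consuming {
    my ($s) = @_;
    $s =~ s/(\w+)([\-=])(\w+)/$1<\/gl>$2<gl>$3/g;
    return $s;
}

# Lookahead: the second word is checked but not consumed,
# so the next match can begin with it.
sub split_lookahead {
    my ($s) = @_;
    $s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
    return $s;
}

print split_consuming('aaa-bbb=ccc'), "\n";
print split_lookahead('aaa-bbb=ccc'), "\n";
```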
The s/ at the beginning is a substitution, so the command would be something like:
perl -pe 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g' in > out
The -p switch prints $_ after each line is processed, so nothing more is needed after the substitution; the value the global substitution returns (the number of substitutions made) is simply discarded.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.
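One possible shape for that surrounding code, as a sketch only: the loop is wrapped in a hypothetical sub retag_line that builds and returns the result instead of printing piece by piece, and it uses pos() and substr in place of $'. The sample lines are hard-coded here for illustration.

```perl
use strict;
use warnings;

# Rewrite only the text inside <gl>...</gl>; everything else is copied as-is.
sub retag_line {
    my ($in) = @_;
    my $out  = '';
    my $rest = $in;    # text after the last match (the whole line if none)
    while ($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g) {
        my ($prefix, $inner) = ($1, $2);
        $rest = substr($in, pos($in));    # same role as $' in the loop above
        $inner =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
        $out .= $prefix . $inner;
    }
    return $out . $rest;
}

for my $line ('word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>', 'no tags-here') {
    print retag_line($line), "\n";
}
```

Reading lines from STDIN (and chomping them) instead of iterating over the hard-coded list would turn this into a filter for the shell script.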

Perl regexp substitution - multiple matches

Friends,
need some help with substitution regex.
I have a string
;;;;;;;;;;;;;
and I need to replace it by
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
I tried
s/;;/;\\N;/g
but it gives me
;\N;;\N;;\N;;\N;;\N;;\N;;
tried to fiddle with lookahead and lookbehind, but can't get it solved.
I wouldn't use a regex for this, and instead make use of split:
#!/usr/bin/env perl
use strict;
use warnings;
my $str = ';;;;;;;;;;;;;';
print join ( '\N', split ( //, $str ) );
Splitting on nulls, to get each character, and making use of the fact that join puts delimiters between characters. (So not before first, and not after last).
This gives:
;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;\N;
Which I think matches your desired output?
As a oneliner, this would be:
perl -ne 'print join ( q{\N}, split // )'
Note - we need single quotes ' rather than double around the \N so it doesn't get interpolated.
If you need to handle variable content (e.g. not just ; ) you can add grep or map into the mix - I'd need some sample data to give you a useful answer there though.
I use this for infile edit, the regexp suits me better
Following on from that - perl is quite clever. It allows you to do in place editing (if that's what you're referring to) without needing to stick with regular expressions.
Traditionally you might do
perl -i.bak -p -e 's/something/somethingelse/g' somefile
What this is doing is expanding that out into a loop:
LINE: while (defined($_ = <ARGV>)) {
s/something/somethingelse/g;
}
continue {
die "-p destination: $!\n" unless print $_;
}
E.g. what it's actually doing is:
opening the file
iterating it by lines
transforming the line
printing the new line
And with -i that print is redirected to the new file name.
You don't have to restrict yourself to -p though - anything that generates output will work in this way - although bear in mind if it doesn't 'pass through' any lines that it doesn't modify (as a regular expression transform does) it'll lose data.
But you can definitely do:
perl -i.bak -ne 'print join ( q{\N}, split // )'
And inplace edit - but it'll trip over on lines that aren't just ;;;;; as your example.
So to avoid those:
perl -i.bak -ne 'if (m/;;;;/) { print join ( q{\N}, split // ) } else { print }'
Or perhaps more succinctly:
perl -i.bak -pe '$_ = join ( q{\N}, split // ) if m/;;;/'
Since you can't match the same character twice, your approach doesn't work. To solve the problem you can instead only check for the presence of a following ; with a lookahead (the second ; isn't a part of the match):
s/;(?=;)/;\\N/g
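A quick sketch to check the lookahead version against the split/join one on a short string; the sub names fill_nulls and fill_nulls_join are hypothetical:

```perl
use strict;
use warnings;

# Lookahead version: each ; that is followed by another ; gets \N appended.
sub fill_nulls {
    my ($s) = @_;
    $s =~ s/;(?=;)/;\\N/g;
    return $s;
}

# split/join version, which only suits strings made up of ; alone.
sub fill_nulls_join {
    my ($s) = @_;
    return join '\N', split //, $s;
}

print fill_nulls(';;;;'), "\n";
print fill_nulls_join(';;;;'), "\n";
print fill_nulls('a;;b;c'), "\n";
```

The two agree on a run of semicolons, while only the regex leaves other characters on the line alone.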

perl search and replace a substring

I am trying to search for a substring and replace the whole string if the substring is found. In the example below someVal could be any value that is unknown to me.
How can I search for someServer.com and replace the whole string $oldUrl with $newUrl?
I can do it on the whole string just fine:
$directory = "/var/tftpboot";
my $oldUrl = "someVal.someServer.com";
my $newUrl = "someNewVal.someNewServer.com";
opendir( DIR, $directory ) or die $!;
while ( my $files = readdir(DIR) ) {
next unless ( $files =~ m/\.cfg$/ );
open my $in, "<", "$directory/$files";
open my $out, ">", "$directory/temp.txt";
while (<$in>) {
s/.*$oldUrl.*/$newUrl/;
print $out $_;
}
rename "$directory/temp.txt", "$directory/$files";
}
Your script will delete much of your content because you are surrounding the match with .*. This will match any character except newline, as many times as it can, from start to end of each line, and replace it.
The functionality that you are after already exists in Perl, the use of the -pi command line switches, so it would be a good idea to make use of it rather than trying to make your own, which works exactly the same way. You do not need a one-liner to use the in-place edit. You can do this:
perl -pi script.pl *.cfg
The script should contain the name definitions and substitutions, and any error checking you need.
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/\Q$old\E/$new/g;
This is the simplest possible solution, when running with the -pi switches, as I showed above. The \Q ... \E is the quotemeta escape, which escapes meta characters in your string (highly recommended).
You might want to prevent partial matches. If you are matching foo.bar, you may not want to match foo.bar.baz, or snafoo.bar. To prevent partial matching, you can put in anchors of different kinds.
(?<!\S) -- do not allow any non-whitespace before match
\b -- match word boundary
Word boundary would be suitable if you want to replace server1.foo.bar in the above example, but not snafoo.bar. Otherwise use whitespace boundary. The reason we do a double negation with a negative lookaround assertion and negated character class is to allow beginning and end of line matches.
So, to sum up, I would do:
use strict;
use warnings;
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/(?<!\S)\Q$old\E(?!\S)/$new/g;
And run it with
perl -pi script.pl *.cfg
If you want to try it out beforehand (highly recommended!), just remove the -i switch, which will make the script print to standard output (your terminal) instead. You can then run a diff on the files to inspect the difference. E.g.:
$ perl -p script.pl test.cfg > test_replaced.cfg
$ diff test.cfg test_replaced.cfg
You will have to decide whether word boundary is more desirable, in which case you replace the lookaround assertions with \b.
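To make the difference between the two kinds of boundary concrete, here is a small sketch using the foo.bar names from the discussion above; the replacement text NEW and both sub names are made up for illustration:

```perl
use strict;
use warnings;

my $old = 'foo.bar';
my $new = 'NEW';

# Whitespace boundaries: the match must be delimited by whitespace
# or by the start/end of the string.
sub replace_ws_bounded {
    my ($text) = @_;
    $text =~ s/(?<!\S)\Q$old\E(?!\S)/$new/g;
    return $text;
}

# Word boundaries: a neighbouring . is enough to delimit the match.
sub replace_word_bounded {
    my ($text) = @_;
    $text =~ s/\b\Q$old\E\b/$new/g;
    return $text;
}

my $text = 'snafoo.bar foo.bar foo.bar.baz';
print replace_ws_bounded($text),   "\n";
print replace_word_bounded($text), "\n";
```

Neither version touches snafoo.bar, but only the word-boundary one rewrites the foo.bar inside foo.bar.baz.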
Always use
use strict;
use warnings;
Even in small scripts like this. It will save you time and headaches.
If you want to match and replace any subdomain, then you should devise a specific regular expression to match them.
\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com
The following is a rewrite of your script using more Modern Perl techniques, including Path::Class to handle file and directory operations in a cross platform way and $INPLACE_EDIT to automatically handle the editing of a file.
use strict;
use warnings;
use autodie;
use Path::Class;
my $dir = dir("/var/tftpboot");
while (my $file = $dir->next) {
next unless $file =~ m/\.cfg$/;
local @ARGV = "$file";
local $^I = '.bak';
while (<>) {
s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/someNewVal.someNewServer.com/;
print;
}
#unlink "$file$^I"; # Optionally delete backup
}
Watch for the Dot-Star: it matches everything that surrounds the old URL, so the only thing remaining on the line will be the new URL:
s/.*$oldUrl.*/$newUrl/;
Better:
s/$oldUrl/$newUrl/;
Also, you might need to close the output file before you try to rename it.
If the old URL contains special characters (dots, asterisks, dollar signs...) you might need to use \Q$oldUrl to suppress their special meaning in the regex pattern.
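A short sketch of what \Q changes, using the URL from the question and a made-up lookalike string (the sub names are hypothetical):

```perl
use strict;
use warnings;

my $old_url = 'someVal.someServer.com';

# Without \Q each . in the URL is a regex wildcard.
sub matches_raw {
    my ($text) = @_;
    return $text =~ /$old_url/ ? 1 : 0;
}

# With \Q...\E the URL is matched literally.
sub matches_quoted {
    my ($text) = @_;
    return $text =~ /\Q$old_url\E/ ? 1 : 0;
}

my $lookalike = 'someValXsomeServerYcom';
print "raw:    ", matches_raw($lookalike),    "\n";
print "quoted: ", matches_quoted($lookalike), "\n";
```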

Regex question!

I'm not too familiar with regex but I know what I need to find-
I have a long list of data separated by newlines, and I need to delete all the lines of data that contain a string "(V)". The lines are of variable length, so I guess something to do with selecting everything between two newline characters if there's a (V) inside?
Try searching for this regular expression:
^.*\(V\).*$
Explanation:
^ start of line
.* any characters apart from new line
\( open parenthesis (escaped to avoid special behaviour)
V V
\) close parenthesis (escaped to avoid special behaviour)
.* any characters apart from new line
$ end of line (not strictly needed here, included only for clarity)
Depending on your language you may need to add delimiters such as / and/or quotes " around the regular expression and you may need to enable multiline mode.
Here's an online example showing it working: Rubular
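In Perl, for instance, applying that pattern to a whole multi-line string might look like the sketch below. It assumes the data fits in memory, adds an optional \n to the pattern so the emptied line does not remain as a blank line, and uses a hypothetical sub name:

```perl
use strict;
use warnings;

# /m lets ^ match at the start of every line, /g repeats for all lines.
sub drop_v_lines {
    my ($text) = @_;
    $text =~ s/^.*\(V\).*\n?//mg;
    return $text;
}

print drop_v_lines("keep 1\ndrop (V) this\nkeep 2\n");
```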
If the data is indeed rather large, then running a single regex against the whole string would be a bad idea. Instead, a simple solution like this Perl script could work for you:
open my $fh, '<', 'data.txt' or die $!;
while (my $line = <$fh>) {
if ($line =~ m/\(V\)/) {
next;
}
print $line;
}
close $fh;
This script reads the data file one line at a time and prints the lines that do not contain "(V)" to stdout. (You obviously could replace the "print" with a different data processing task)
Use the UNIX command grep, if you have access to such a system.
$ grep -v '(V)' data.txt
Grep matches all lines containing "(V)" in data.txt, and shows only the lines not matching (-v).