I need to write a quick (by tomorrow) filter script to replace line breaks (LF or CRLF) found within double quoted strings by the escaped newline \n. The content is a (broken) javascript program, so I need to allow for escape sequences like "ab\"cd" and "ab\\"cd"ef" within a string.
I understand that sed is not well-suited for the job as it works per line, so I turn to Perl, of which I know nothing :)
I've written this regex: "(((\\.)|[^"\\\n])*\n?)*" and tested it with the http://regex.powertoy.org. It indeed matches quoted strings with line breaks, however, perl -p -e 's/"(((\\.)|[^"\\\n])*(\n)?)*"/TEST/g' does not.
So my questions are:
How do I make perl match line breaks?
How do I write the "replace-by" part so that it keeps the original string and only replaces newlines?
There is this similar question with awk solution, but it is not quite what I need.
NOTE: I usually don't ask "please do this for me" questions, but I really don't feel like learning perl/awk by tomorrow... :)
EDIT: sample data
"abc\"def" - matches as one string
"abc\\"def"xy" - matches "abc\\" and "xy"
"ab
cd
ef" - is replaced by "ab\ncd\nef"
Here is a simple Perl solution:
s§
\G # match from the beginning of the string or the last match
([^"]*+) # till we get to a quote
"((?:[^"\\]++|\\.)*+)" # match the whole quote
§
$a = $1;
$b = $2;
$b =~ s/\r?\n/\\n/g; # replace what you want inside the quote
"$a\"$b\"";
§gex;
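For anyone who wants to try the substitution above as a complete script, here is a minimal sketch. The function name escape_newlines_in_quotes and the sample input are made up for illustration; the pattern is the one from the answer, only written with brace delimiters and lexical variables instead of $a and $b.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): escape raw line
# breaks that occur inside double-quoted strings.
sub escape_newlines_in_quotes {
    my ($text) = @_;
    $text =~ s{
        \G ([^"]*+)                      # text before the next quote
        " ( (?: [^"\\]++ | \\. )*+ ) "   # one complete quoted string
    }{
        # Copy the captures first: the inner substitution can reset them.
        my ($before, $inside) = ($1, $2);
        $inside =~ s/\r?\n/\\n/g;        # LF or CRLF becomes a literal \n
        $before . '"' . $inside . '"';
    }gex;
    return $text;
}

my $js = qq{var s = "ab\ncd\r\nef";\nvar t = 1;\n};
print escape_newlines_in_quotes($js);
```

Text after the last quoted string is left alone, because the \G-anchored match simply stops once no further quote is found.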
Here is another solution in case you wouldn't want to use /e and just do it with one regex:
use strict;
$_=<<'_quote_';
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
_quote_
print "Original:\n", $_, "\n";
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
x # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\x]++|\\.)*+ ) # handle escapes
)
/$1Y$2/xg;
print "Replaced:\n", $_, "\n";
Output:
Original:
hai xtest "aa xx aax" baix "xx"
x "axa\"x\\" xa "x\\\\\"x" ax
xbai!x
Replaced:
hai xtest "aa YY aaY" baix "YY"
x "aYa\"Y\\" xa "Y\\\\\"Y" ax
xbai!x
To work with line breaks instead of x, just replace it in the regex like so:
s/
(
(?:
# at the beginning of the string match till inside the quotes
^(?&outside_quote) "
# or continue from last match which always stops inside quotes
| (?!^)\G
)
(?&inside_quote) # eat things up till we find what we want
)
\r?\n # the thing we want to replace
(
(?&inside_quote) # eat more possibly till end of quote
# if going out of quote make sure the match stops inside them
# or at the end of string
(?: " (?&outside_quote) (?:"|\z) )?
)
(?(DEFINE)
(?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
(?<inside_quote> (?:[^"\\\r\n]++|\\.)*+ ) # handle escapes
)
/$1\\n$2/xg;
Until the OP posts some example content to test by, try adding the "m" (and possibly the "s") flag to the end of your regex; from perldoc perlreref (reference):
m Multiline mode - ^ and $ match internal lines
s match as a Single line - . matches \n
For testing, you might also find it useful to add the command-line argument "-i.bak", so that you keep a backup of the original file (now with the extension ".bak").
Note also that if you want to capture but not store something you can use (?:PATTERN) rather than (PATTERN). Once you have your captured content use $1 through $9 to access stored matches from the matching section.
For more info see the link above as well as perldoc perlretut (tutorial) and perldoc perlre (full-ish documentation)
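A small sketch of what the two flags change, using a made-up two-line string. Note that they only matter when the string being matched actually holds several lines; as far as I know, perl -p feeds the script one line at a time unless the whole file is slurped (for example with -0777).

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";

# Without /m, ^ and $ anchor only at the ends of the whole string.
my @plain = $text =~ /^(\w+)$/g;
# With /m, they also match at every internal line boundary.
my @multi = $text =~ /^(\w+)$/mg;

# Without /s, "." refuses to match a newline; with /s it matches one.
my $dot_plain = ($text =~ /first.second/)  ? 1 : 0;
my $dot_s     = ($text =~ /first.second/s) ? 1 : 0;

print scalar(@plain), " ", scalar(@multi), " $dot_plain $dot_s\n";
```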
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common;
$_ = '"abc\"def"' . '"abc\\\\"def"xy"' . qq("ab\ncd\nef");
print "before: {{$_}}\n";
s{($RE{quoted})}
{ (my $x=$1) =~ s/\n/\\n/g;
$x
}ge;
print "after: {{$_}}\n";
Using Perl 5.14.0 (install with perlbrew) one can do this:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use Regexp::Common qw/delimited/;
my $data = <<'END';
"abc\"def"
"abc\\"def"xy"
"ab
cd
ef"
END
my $output = $data =~ s/$RE{delimited}{-delim=>'"'}{-keep}/$1=~s!\n!\\n!rg/egr;
print $output;
I need 5.14.0 for the /r flag of the internal replace. If someone knows how to avoid this please let me know.
Related
I have the following situation:
^ID[ \t]*=[ \t]*('(.*)'|"(.*)")
The group with content
01
when a file contains:
ID = '01'
is the second.
Instead if:
ID = "01"
is the third.
This causes a problem with perl:
perl -lne "print \$2 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/" test.txt
If the group with single quotes matches, I get the output:
01
Otherwise I get an empty string.
How do I make both the single-quote and the double-quote case end up in group two of the regex?
You can print both the groups, as they can never match at the same time:
perl -lne "print \$2.\$3 if /^ID[ \t]*=[ \t]*('(.*)'|\"(.*)\")/"
or remember the quotes in $2 and use $3 for the quoted string, followed by the remembered quote:
perl -lne "print \$3 if /^ID[ \t]*=[ \t]*((['\"])(.*)\2)/"
This looks like it's a good candidate for the branch reset operator, (?|...). Either capture in that alternation is $1, and the branch-reset construct takes care of the grouping without capturing anything:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01');
foreach ( @strings ) {
say $1 if m/^ID \h* = \h* (?|'(\d+)'|"(\d+)") /x
}
You need v5.10, and that allows you to use the \h to match horizontal whitespace.
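Here is the branch reset idea as a small self-contained sketch. The helper name id_value is hypothetical, and I match any non-quote characters rather than only digits, which is an assumption about what the values may contain.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: return the quoted ID value, or undef if the quotes
# are missing or mismatched. Both alternatives capture into $1.
sub id_value {
    my ($line) = @_;
    return $line =~ /^ID \h* = \h* (?| '([^']*)' | "([^"]*)" )/x ? $1 : undef;
}

for my $line (q{ID = '01'}, q{ID = "01"}, q{ID = "01'}) {
    say "$line => ", id_value($line) // 'no match';
}
```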
But, you don't need to repeat the pattern. You can match the quote and match that same quote later. A relative backreference, \g{N}, can do that:
use v5.10;
my @strings = qw( ID='01' ID="01" ID="01' );
foreach ( @strings ) {
say $2 if m/^ID \h* = \h* (['"])(\d+)\g{-2} /x
}
I prefer that \g{-2} because I usually don't have to update numbering if I change the pattern to include more captures before the thing it refers to.
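To illustrate the renumbering point, here is a rough sketch with two patterns. The second one adds a capture for the key in front, and the \g{-2} part is unchanged. The names $re_value, $re_key_value, value_only and key_value are made up for this example.

```perl
use strict;
use warnings;
use v5.10;

# Same quote-matching tail in both patterns; only the front differs.
my $re_value     = qr/^ID    \h* = \h* (['"]) (.*?) \g{-2} \h* \z/x;
my $re_key_value = qr/^(\w+) \h* = \h* (['"]) (.*?) \g{-2} \h* \z/x;

# Value is in $2 here ...
sub value_only {
    my ($line) = @_;
    return $line =~ $re_value ? $2 : undef;
}

# ... and in $3 here, but the backreference still points at the quote.
sub key_value {
    my ($line) = @_;
    return $line =~ $re_key_value ? ($1, $3) : ();
}

say value_only(q{ID="01"}) // 'no match';
say join ' = ', key_value(q{ID = 'abc'});
```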
And, since this is a one-liner, don't type out the literal quotes (as ikegami has already shown):
say $2 if m/^ID \h* = \h* ([\x22\x27])(\d+)\g{-2} /x
Only one of the two will be defined, so simply use the one that's defined.
perl -nle'print $1//$2 if /^ID\h*=\h*(?:\x27(.*)\x27|"(.*)")/' # \x27 is '
You could also use a backreference.
perl -nle'print $2 if /^ID\h*=\h*(["\x27])(.*)\1/'
Note that all the provided solutions including these two fail (leave the escape sequence in) if you have something like ID="abc\"def" or ID="abc\ndef", assuming those are supported.
Thank you @brian_d_foy:
perl -lne "print \$1 if /^ID\h*=\h*(?|'(.*)'|\"(.*)\")/" test.txt
Or better:
perl -lne "print \$2 if /^ID\h*=\h*(['\"])(.*)\1/" test.txt
I have decided to also accept
ID = 01 #Followed by one or more horizontal spaces.
In addition to:
ID = "01" #Followed by one or more horizontal spaces.
And:
ID = '01' #Followed by one or more horizontal spaces.
Therefore I have adopted a rather complex solution:
perl -lne "print \$2 if /^ID\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/" test.txt
I have fused both of your solutions, @brian_d_foy. The double parentheses bring the second alternative into the second group as well; otherwise it would be the first group, and without the branch reset operator it would be group 4.
I have since wrapped the syntax in a function
function parse-config {
command perl -pe "s/\R/\n/g" "$2" | command perl -lne "print \$2 if /^$1\h*=\h*(?|(['\"])(.*)\1|(([^\h'\"]*)))\h*(?:#.*)?$/"
return $?
}
parse-config "ID" "test.txt"
In this:
"s/\R/\n/g"
I replace all CRLF, CR, or LF line endings with LF. \R is a very powerful escape sequence available since perl v5.10. Apparently this version of perl introduced several features that are fundamental for me. As luck would have it, I needed all of them (\h, \R, and (?|...)). Whoever did the update was brilliant.
I needed this because the dollar "$" at the end of the line did not work, because there was a "\r" before the "Linux end of line" "\n".
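A minimal sketch of that effect in a script rather than a pipeline. The helper name normalize_eol and the sample line are assumptions for illustration; the substitution is the same s/\R/\n/g.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: turn CRLF, lone CR, and other line breaks
# matched by \R into a plain LF.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\R/\n/g;
    return $text;
}

my $crlf_line = "ID = 01\r\n";
my $pattern   = qr/^ID\h*=\h*(\d+)$/;

# The \r sits between the digits and the \n, so $ cannot match here.
say 'raw:        ', ($crlf_line =~ $pattern ? "matched $1" : 'no match');
# After normalizing, $ matches just before the final \n.
say 'normalized: ', (normalize_eol($crlf_line) =~ $pattern ? "matched $1" : 'no match');
```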
I am trying to remove commas between double quotes in a string, while leaving other commas intact. (This is an email address which sometimes contains spare commas.) The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
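The pattern above can be wrapped in a function to try it on a few inputs. This is only a sketch; the name strip_quoted_commas and the test strings are made up.

```perl
use strict;
use warnings;
use v5.10;   # \K and the backtracking control verbs need perl 5.10 or later

# Hypothetical helper: delete commas that sit between double quotes and
# leave all other commas alone.
sub strip_quoted_commas {
    my ($string) = @_;
    $string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
    return $string;
}

say strip_quoted_commas('a,b,"x,y,,z" w,1');
say strip_quoted_commas('"a,b",c,"d,e"');
```

Each successful match consists of a single comma (everything before \K is dropped from the match), which is why the replacement can be empty.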
If you use an older perl version, \K and the backtracking control verbs may not be supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
my @row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", @row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93@gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.
For each line I need to add a semicolon exactly one character before the first match of an alphanumeric sign but only for the alphanumeric sign after the first appearance of a semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone please help me create a Perl regex for that? I'd need it in a program, not as a one-liner.
This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
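A quick way to check this against the sample data, as a sketch with a made-up helper name (add_semicolon):

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper: insert one semicolon before the first
# non-whitespace character that follows the first semicolon.
sub add_semicolon {
    my ($line) = @_;
    $line =~ s/;\s*\K(?=\S)/;/;
    return $line;
}

say add_semicolon($_) for '00000001;Root;;', '00000002; Documents;;';
```

Without the /g flag only the first possible position is touched, which is what the question asks for.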
First of all, here is a program that seems to match your requirements:
#!/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
./program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line, skips over numbers (\d*) followed by the first semicolon and space. Before the following word character, it inserts a semicolon.
Take what fits best to your needs!
First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator=";"; # at the beginning of the file
#...
sub insert {
my ( $seperator, $line, @all_lines, $count, @all_out );
$count = 0;
@all_lines = read_file($filename);
foreach $line (@all_lines) {
$count = sprintf( "%08d", $count );
chomp $line;
$line =~ s/\:/$seperator/; # works
$line =~ s/\ file/file/; # works
#$line=~s/;\s*\K(?=\S)/;/; # doesn't work
$line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
say $count . $seperator . $line . $seperator;
$count++; # btw, is there maybe a hidden index variable in a foreach-loop I could use instead of a new variable??
push( @all_out, $count . $seperator . $line . $seperator . "\n" );
}
write_file( $csvfile, @all_out ); # using File::Slurp
}
In order to get the input which I presented you, I made already some small substitutions, as you can see in the beginning of the foreach-loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
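One possible explanation, assuming the snippet above is verbatim: the my ( $seperator, ... ) list inside the sub declares a new, undefined lexical that hides the file-level our $seperator. The patterns then interpolate an empty string, so no semicolon is ever matched or inserted. A minimal sketch of that effect, with shortened hypothetical names ($sep, insert_shadowed, insert_global):

```perl
use strict;
use warnings;

our $sep = ";";

# The inner "my $sep" hides the package variable, so the pattern
# interpolates an undefined (empty) value.
sub insert_shadowed {
    my ($line) = @_;
    my $sep;
    no warnings 'uninitialized';    # silence the warnings this provokes
    $line =~ s/^(.*?$sep.*?)(\w)/$1$sep$2/;
    return $line;
}

# Without the inner declaration, $sep is the file-level ";".
sub insert_global {
    my ($line) = @_;
    $line =~ s/^(.*?$sep.*?)(\w)/$1$sep$2/;
    return $line;
}

print insert_shadowed('00000002; Documents;;'), "\n";
print insert_global('00000002; Documents;;'), "\n";
```

If that is indeed the cause, dropping $seperator from the my list should be enough.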
I'm trying to find a way to replace spaces and double quotes with pipes (||) while leaving the spaces within the double quotes untouched.
For example, it would make something like 'word "word word" word' into 'word||word word||word' and another like 'word word word' into 'word||word||word'.
Right now I have this to work off of:
[%- MACRO typestrip(value) PERL -%]
my $htmlVal = $stash->get('value');
$htmlVal =~ s/"/||/g;
print $htmlVal
[%- END -%]
Which handles replacing double quotes with pipes just fine.
I don't know how simple or complex this should be or if it can even be done, since I have no actual background in programming and, while I have worked with some Perl, it's never been this kind before, so I apologize if I'm not doing a good job of explaining this.
I think it might be easier to use the core module Text::ParseWords to split on non-quoted whitespace, then rejoin the "words" with pipes.
#!/usr/bin/env perl
use warnings;
use strict;
use Text::ParseWords;
while (my $line = <DATA>) {
print space2pipes($line);
print "\n";
}
sub space2pipes {
my $line = shift;
chomp $line;
my @words = parse_line( qr/\s+/, 0, $line );
return join '||', @words;
}
__DATA__
word "word word" word
word word word
Putting this into your templating engine is left as an exercise for the reader :-)
This is related to a frequently-asked question, answered in section 4 of the Perl FAQ.
How can I split a [character]-delimited string except when inside [character]?
Several modules can handle this sort of parsing—Text::Balanced, Text::CSV, Text::CSV_XS, and Text::ParseWords, among others.
Take the example case of trying to split a string that is comma-separated into its different fields. You can’t use split(/,/) because you shouldn’t split if the comma is inside quotes. For example, take a data line like this:
SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"
Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have Jeffrey Friedl, author of Mastering Regular Expressions, to handle these for us. He suggests (assuming your string is contained in $text):
my @new = ();
push(@new, $+) while $text =~ m{
# groups the phrase inside the quotes
"([^\"\\]*(?:\\.[^\"\\]*)*)",?
| ([^,]+),?
| ,
}gx;
push(@new, undef) if substr($text,-1,1) eq ',';
If you want to represent quotation marks inside a quotation-mark-delimited field, escape them with backslashes (e.g., "like \"this\"").
Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you say:
use Text::ParseWords;
@new = quotewords(",", 0, $text);
For parsing or generating CSV, though, using Text::CSV rather than implementing it yourself is highly recommended; you’ll save yourself odd bugs popping up later by just using code which has already been tried and tested in production for years.
Adapting the technique to your situation gives
my $htmlVal = 'word "word word" word';
my @chunks;
push @chunks, $+ while $htmlVal =~ m{
"([^\"\\]*(?:\\.[^\"\\]*)*)"
| (\S+)
}gx;
$htmlVal = join "||", @chunks;
print $htmlVal, "\n";
Output:
word||word word||word
Looking back, it turns out that this is an application of Randal’s Rule, as dubbed in Regular Expression Mastery by Mark Dominus:
Randal's Rule
Randal Schwartz (author of Learning Perl [and also a Stack Overflow user]) says:
Use capturing or m//g when you know what you want to keep.
Use split when you know what you want to throw away.
In your situation, you know what you want to keep, so use m//g to hang on to the text within quotes or otherwise separated by whitespace.
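The two halves of the rule side by side, as a small sketch with made-up sample strings. The branch reset (?|...) is my own shortcut here (it needs perl 5.10) and is not part of the FAQ technique above.

```perl
use strict;
use warnings;
use v5.10;

# split: we know what to throw away (the commas).
my @fields = split /,/, 'a,b,c';

# m//g: we know what to keep (quoted phrases or runs of non-whitespace).
# The branch reset makes both alternatives capture into $1, so the list
# holds exactly one item per match.
my $text   = 'word "word word" word';
my @chunks = $text =~ /(?| "([^"]*)" | (\S+) )/gx;

say join '|',  @fields;
say join '||', @chunks;
```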
While Joel's answer is fine, things can be simplified a bit by specifically using shellwords to tokenize lines:
#!/usr/bin/env perl
use strict; use warnings;
use Text::ParseWords qw( shellwords );
my @strings = (
'word "word word" word',
'word "word word" "word word"',
);
@strings = map join('||', shellwords($_)), @strings;
use YAML;
print Dump \@strings;
Isn't that more readable than a bunch of regex-gobbledygook?
Seems possible and might be useful if only a regex is applicable:
$htmlVal =~ s/(?:"([^"]+)"(\s*))|(?:(\S+)(\s*))/($1||$3).($2||$4?'||':'')/eg;
(Might be beautified a bit after closer introspection.)
input:
my $htmlVal ='word "word word" word';
output:
word||word word||word
Original code has been modified after failing this case:
my $htmlVal ='word "word word" "word word"';
will now work too:
word||word word||word word
Explanation:
$htmlVal =~ s/
(?: " ([^"]+) " (\s*)) # search "abc abc" ($1), End ($2)
| # OR
(?: (\S+) (\s*)) # abcd ($3), End ($4)
/
($1||$3) . ($2||$4 ? '||' : '') # decide on $1/$2 or $3/$4
/exg;
Regards
rbo
If I have an input with new lines in it like:
[INFO]
xyz
[INFO]
How can I pull out the xyz part using $ anchors? I tried a pattern like /^\[INFO\]$(.*?)$\[INFO\]/ms, but perl gives me:
Use of uninitialized value $\ in regexp compilation at scripts\t.pl line 6.
Is there a way to shut off interpolation so the anchors work as expected?
EDIT: The key is that the end-of-line anchor is a dollar sign but at times it may be necessary to intersperse the end-of-line anchor through the pattern. If the pattern is interpolating then you might get problems such as uninitialized $\. For instance an acceptable solution here is /^\[INFO\]\s*^(.*?)\s*^\[INFO\]/ms but that does not solve the crux of the first problem. I've changed the anchors to be ^ so there is no interpolation going on, and with this input I'm free to do that. But what about when I really do want to reference EOL with $ in my pattern? How do I get the regex to compile?
The question is academic--there's no need for the $ anchors in your regex anyway. You should be using \n to match the newlines, because the $ only matches the gap between the linefeed and the character before it.
EDIT: What I'm trying to say is that you will never need to use $ that way. Any match that spans from one line to the next will have to consume the line separator somehow. Consider your example:
/^\[INFO\]$(.*?)$\[INFO\]/ms
If this did compile, the (.*?) would start out by consuming the first linefeed and keep going until it had matched \nxyz, where the second $ would succeed. But the next character is a linefeed, and the regex is looking for [, so that doesn't work. After backtracking, the (.*?) would reluctantly consume one more character--the second linefeed--but then the $ would fail.
Any time you try to match an EOL with $ and then some more stuff, the first "stuff" you'll have to match will be the linefeed, so why not match that instead? That's why the Perl regex compiler tries to interpret $\ as a variable name in your regex: it makes no sense to have an end-of-line anchor followed by a character that's not a line separator.
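A minimal sketch of that suggestion: consume the line breaks with \n, so there is no $ left in the pattern for Perl to interpolate. The variable names are made up.

```perl
use strict;
use warnings;
use v5.10;

my $text = "[INFO]\nxyz\n[INFO]\n";

# Match the newlines explicitly instead of using $ anchors.
my ($between) = $text =~ /^\[INFO\]\n(.*?)\n\[INFO\]/ms;

say $between // 'no match';
```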
Based on the answer in perlfaq6 - How can I pull out lines between two patterns that are themselves on different lines? , here's what a one-liner would look like:
perl -0777 -ne 'print $1,"\n" while /\[INFO\]\s*(.*?)\s*\[INFO\]/sg' file.txt
The -0777 switch slurps in the whole file at once.
However, if you're after a subroutine that gives you the flexibility to choose what tag you want to extract, the File::Slurp module makes things a little easier:
use strict;
use warnings;
use File::Slurp qw/slurp/;
sub extract {
my ( $tag, $fileName ) = @_;
my $text = slurp $fileName;
my ($info) = $text =~ /$tag\s*(.*?)\s*$tag/sg;
return $info;
}
# Usage:
extract ( qr/\[INFO\]/, 'file.txt' );
When regexes get too tricky, they probably are the wrong tool. I might consider using the flip flop operator here. It's false until its lefthand side is true, then stays true until its righthand side is true. That way, you can choose where to start and end the extraction just by looking at individual lines:
my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> )
{
next if /\[INFO]/ .. /\[INFO]/;
chomp;
print "Extracted <$_>\n";
}
If you are using Perl 5.10, you can use the generalized line ending \R in a regex:
use 5.010;
my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE
my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;
print "Extracted <$extracted>\n";
Don't get hung up on the end-of-line anchor.
Maybe the /x modifier can help:
m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n
^ \[INFO\] # Match another INFO line
/xms
I haven't tested that, so you'd probably have to debug it. But I think this will prevent the $ symbols from interpolating as variables.
Although I've accepted Alan Moore's answer (Ryan Thompson's answer would also have done the trick; too bad I could only accept one), I wanted to make the solution perfectly clear, as it was kind of buried in the comments and discussion. The following Perl script demonstrates that Perl uses the $ to interpolate a variable when certain characters follow the dollar sign, and that turning off interpolation allows the $ to be treated as EOL.
use strict;
use warnings;
my $x = "[INFO]\nxyz\n[INFO]";
if( $x =~ /^\[INFO\]$\n(.*?)$\n\[INFO\]/m ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
if( $x =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
if( $x =~ m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n
^ \[INFO\] # Match another INFO line
/xms ) {
print "'$1' FOUND\n";
} else {
print "NO MATCH FOUND\n";
}
The script produces the following output:
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
NO MATCH FOUND
'xyz' FOUND
'xyz' FOUND