Perl multiline regex in windows - regex

I'm stuck with this scenario, I have this regex
*Input added here for clarity:
181221533;MG;3;1476729;<vars> <vint> <name>mtest</name> <storedPrecedure>f_sc_mtest</SP> <base>M_data</base> <dataType>I</dataType> <timeMS>17</timeMS> <ttidr>abc</ttidr> <base>S</base> <valor>0</valor> </vint> </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars> <vint> <name>mtest</name> <sP>f_sc_mtest</sP> <base>sscy</base> <dataType>I</dataType> <timeMS>16</timeMS> <ttidr>abc</Idtype> <base>S</base> <valor>4</valor> </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars> <vint> <name>tc_p_act</name> <sP>rndom_name</sP> <base>sscyo</base> <dataType>I</dataType> <timeMS>0</timeMS> <Idtype>XYZ</Idtype> <base>O</base> </vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988;ModeloSP;6314429;740484;<vars> <vint> <name>tc_p_act</name> <sP>rndom_name</sP> <base>sscyo</base> <dataType>I</dataType> <timeProcess>1</timeProcess> <Idtype>XYZ</Idtype> <base>O</base> </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36
And I want to implement this regex in perl with multiline support because as you can see in the sample, there are line breaks in records and this regex searchs 'incomplete' lines (and the extra line) and fixes them (one record/line should end with a datetime)
this is what I'm attempting with perl:
perl.exe -0777 -i -pe "s/(?m)^(.*)(>)([\n]+)(<)(.*)([\n]+)(\s*)$/$1$2 $4$5/igs" "sample.txt"
And doesn't seem to work, I keep getting the same text file. I'm using perl inside a portable GIT installation (v5.34.0)
Is there something I'm missing?
edit: This is how the output should look like:
181221533;MG;3;1476729;<vars> <vint> <name>mtest</name> <storedPrecedure>f_sc_mtest</SP> <base>M_data</base> <dataType>I</dataType> <timeMS>17</timeMS> <ttidr>abc</ttidr> <base>S</base> <valor>0</valor> </vint> </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars> <vint> <name>mtest</name> <sP>f_sc_mtest</sP> <base>sscy</base> <dataType>I</dataType> <timeMS>16</timeMS> <ttidr>abc</Idtype> <base>S</base> <valor>4</valor> </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars> <vint> <name>tc_p_act</name> <sP>rndom_name</sP> <base>sscyo</base> <dataType>I</dataType> <timeMS>0</timeMS> <Idtype>XYZ</Idtype> <base>O</base> </vint> </vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988;ModeloSP;6314429;740484;<vars> <vint> <name>tc_p_act</name> <sP>rndom_name</sP> <base>sscyo</base> <dataType>I</dataType> <timeProcess>1</timeProcess> <Idtype>XYZ</Idtype> <base>O</base> </vint> </vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

Capture the whole record and replace all newlines in it by a space, using another regex inside the replacement part (courtesy of /e modifier). Then replace all multiple newlines by a single one
perl.exe -0777 -wpe'
s{ (?:^|\R)\K (\d{9}; .*? \s+\d\d:\d\d:\d\d) }{$1 =~ s/\n+/ /r}segx; s{\n+}{\n}g
' file.txt
I consider a "record" to be: [0-9]{9}; on line/file beginning, then all up to and including a timestamp after spaces. The details for beginning and end of record should protect against accidental matching of possible unexpected patterns inside those tags.
This is cumbersome but it captures the record correctly I hope, even if some details change.
Apparently the above fails on Windows as it stands, while it is confirmed to work on Linux (the only system I can try it on right now).
The issue must be in newlines -- so try replacing \n in matches with \R or \r\n. In particular in the regex embedded in the replacement part. Or, to be safe and perhaps portable, replace \n with (\r?\n) (so the carriage return character is optional, need not be there in order to match).
So either
s{ (?:^|\R)\K (\d{9}; .*? \s+\d\d:\d\d:\d\d) }{$1 =~ s/\R+/ /r}segx; s{\R+}{\r\n}g
or
s{ (?:^|\R)\K(\d{9};.*?\s+\d\d:\d\d:\d\d) }{$1 =~ s/(\r\n)+/ /r}segx; s{(\r\n)+}{\r\n}g
But \R should match it on Windows, so you should be able to use \R for matching and \r\n when needed in replacements. See it under Misc in perlbackslash
Better yet, if it works, is to use PerlO layers. Normally a Windows build of Perl adds the :crlf layer by default but that seems not to be the case here.
In a one-liner try:
perl.exe -0777 -Mopen=:std,IO,:crlf -wpe'...'
Or, use the "one-liner" as a normal program, without file-processing switches, and set this up via open pragma and open a file manually
perl -wE'use open IO => ":crlf"; $_ = do { local $/; <> }; s{...}{...}; say' file
With layers set like this (in either way) use the regex with \n.

If the issue is having newlines in the wrong place, either multiple newlines in a row, or before a <, you may get away with something simple like this:
use strict;
use warnings;
my $str = do { local $/; <DATA> };
$str =~ s/\n(?=[<\n])//g;
print $str;
__DATA__
181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988; </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36
(I shortened the input to make it readable)
Output:
181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint></vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988; </vint></vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

This seems to produce the wanted output:
perl.exe -0777 -pe "s: *\n(?=</): :g;s/\n+/\n/g"
The first substitution replaces whitespace followed by a newline before </ by four spaces.
The second substitution replaces multiple newlines by a single one. You can also replace it by a transliteration: tr/\n//s, the /s "squeezes" the newlines.

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl -p flag read input lines one by one. You can't thus expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
Just trying to help and reminding below your initial proposal for perl regex:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in perl regex, [a-z] will match only one character, NOT including any whitespace. Then as a start please include a repetition specifier and include capability to also 'eat' whitespaces. Also to keep the recognized (but 'eaten') 'to' in the replacement, you must put it again in the replacement string, like finally in the below example perl program:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/\1 to/;
print "after:\n$str\n";
This program produces the below input:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
Then it looks like that if I understood well what you want to do, your regexp should better look like:
s/([ a-z]+)\n\s*to/\1 to/ (please note the leading whitespace before 'a-z').

s/// returns out of place newline

I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently non-necessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
After investigating all the answers and trying to figure them out, I've decided that the base of the problem is that the [^/]+ is greedy. Its greediness causes it to capture the newline; it ignores the $ anchor.
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by #Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically changes the $ anchor so as to behave the way a sed person would expect it to.
The answer by #CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward-slash or a newline. This answer still needs the $ anchor to prevent the following situation
1) .* captures the empty string (0 or more of any character)
2) [^/\n]+ captures . .
The answer by #Borodin takes a quite different approach, but it's a great concept.
#Borodin, in addition, made a great comment that allows a more-precise/more-exact version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (no non-greedy wildcards,) here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.
use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5
Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them
Like this
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

Edit within multi-line sed match

I have a very large file, containing the following blocks of lines throughout:
start :234
modify 123 directory1/directory2/file.txt
delete directory3/file2.txt
modify 899 directory4/file3.txt
Each block starts with the pattern "start : #" and ends with a blank line. Within the block, every line starts with "modify # " or "delete ".
I need to modify the path in each line, specifically appending a directory to the front. I would just use a general regex to cover the entire file for "modify #" or "delete ", but due to the enormous amount of other data in that file, there will likely be other matches to this somewhat vague pattern. So I need to use multi-line matching to find the entire block, and then perform edits within that block. This will likely result in >10,000 modifications in a single pass, so I'm also trying to keep the execution down to less than 30 minutes.
My current attempt is a sed one-liner:
sed '/^start :[0-9]\+$/ { :a /^[modify|delete] .*$/ { N; ba }; s/modify [0-9]\+ /&Appended_DIR\//g; s/delete /&Appended_DIR\//g }' file_to_edit
Which is intended to find the "start" line, loop while the lines either start with a "modify" or a "delete," and then apply the sed replacements.
However, when I execute this command, no changes are made, and the output is the same as the original file.
Is there an issue with the command I have formed? Would this be easier/more efficient to do in perl? Any help would be greatly appreciated, and I will clarify where I can.
I think you would be better off with perl
Specifically because you can work 'per record' by setting $/ - if you're records are delimited by blank lines, setting it to \n\n.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
while (<>) {
#multi-lines of text one at a time here.
if (m/^start :\d+/) {
s/(modify \d+)/$1 Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.
It'll take data from STDIN via a pipe, or myscript.pl somefile.
Output is to STDOUT and you can redirect that in the normal way.
Your limiting factor on processing files in this way are typically:
Data transfer from disk
pattern complexity
The more complex a pattern, and especially if it has variable matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so packaging them doesn't make very much difference, and your limiting factor will be likely disk IO.
(If you want to do an in place edit, you can with this approach)
If - as noted - you can't rely on a record separator, then what you can use instead is perls range operator (other answers already do this, I'm just expanding it out a bit:
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
if ( /^start :/ .. /^$/)
s/(modify \d+)/$1 Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
We don't change $/ any more, and so it remains on it's default of 'each line'. What we add though is a range operator that tests "am I currently within these two regular expressions" that's toggled true when you hit a "start" and false when you hit a blank line (assuming that's where you would want to stop?).
It applies the pattern transformation if this condition is true, and it ... ignores and carries on printing if it is not.
sed's pattern ranges are your friend here:
sed -r '/^start :[0-9]+$/,/^$/ s/^(delete |modify [0-9]+ )/&prepended_dir\//' filename
The core of this trick is /^start :[0-9]+$/,/^$/, which is to be read as a condition under which the s command that follows it is executed. The condition is true if sed currently finds itself in a range of lines of which the first matches the opening pattern ^start:[0-9]+$ and the last matches the closing pattern ^$ (an empty line). -r is for extended regex syntax (-E for old BSD seds), which makes the regex more pleasant to write.
I would also suggest using perl. Although I would try to keep it in one-liner form:
perl -i -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit
Or you can use redirection of stdout:
perl -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit > new_file
with gnu sed (with BRE syntax):
sed '/^start :[0-9][0-9]*$/{:a;n;/./{s/^\(modify [0-9][0-9]* \|delete \)/\1NewDir\//;ba}}' file.txt
The approach here is not to store the whole block and to proceed to the replacements. Here, when the start of the block is found the next line is loaded in pattern space, if the line is not empty, replacements are performed and the next line is loaded, etc. until the end of the block.
Note: gnu sed has the alternation feature | available, it may not be the case for some other sed versions.
a way with awk:
awk '/^start :[0-9]+$/,/^$/{if ($1=="modify"){$3="newdirMod/"$3;} else if ($1=="delete"){$2="newdirDel/"$2};}{print}' file.txt
This is very simple in Perl, and probably much faster than the sed equivalent
This one-line program inserts Appended_DIR/ after any occurrence of modify 999 or delete at the start of a line. It uses the range operator to restrict those changes to blocks of text starting with start :999 and ending with a line containing no printable characters
perl -pe"s<^(?:modify\s+\d+|delete)\s+\K><Appended_DIR/> if /^start\s+:\d+$/ .. not /\S/" file_to_edit
Good grief. sed is for simple substitutions on individual lines, that is all. Once you start using constructs other than s, g, and p (with -n) you are using the wrong tool. Just use awk:
awk '
/^start :[0-9]+$/ { inBlock=1 }
inBlock { sub(/^(modify [0-9]+|delete) /,"&Appended_DIR/") }
/^$/ { inBlock=0 }
{ print }
' file
start :234
modify 123 Appended_DIR/directory1/directory2/file.txt
delete Appended_DIR/directory3/file2.txt
modify 899 Appended_DIR/directory4/file3.txt
There's various ways you can do the above in awk but I wrote it in the above style for clarity over brevity since I assume you aren't familiar with awk but should have no trouble following that since it reuses your own sed scripts regexps and replacement text.

Perl regex substitution not working with global modifier

I have code that looks like the following:
s/(["\'])(?:\\?+.)*?\1/(my $x = $&) =~ s|^(["\'])(.*src=)([\'"])\/|$1$2$3$1.\\$baseUrl.$1\/|g;$x/ge
Ignoring the last bit (and only leaving the part where the problems occur) the code becomes:
s/(["\'])(?:\\?+.)*?\1/replace-text-here/g
I have tried using both, but I still get the same problem, which is that even though I am using the g modifier, this regex only matches and replaces the first occurrence. If this is a Perl bug, I don't know, but I was using a regex that matches everything between two quotes, and also handles escaped quotes, and I was following this blog post. In my eyes, that regex should match everything between the two quotes, then replace it, then try and find another instance of this pattern, because of the g modifier.
For a bit of background information, I am not using and version declarations, and strict and warnings are turned on, yet no warnings have shown up. My script reads an entire file into a scalar (including newlines) then the regex operates directly on that scalar. It does seem to work on each line individually - just not multiple times on one line. Perl version 5.14.2, running on Cygwin 64-bit. It could be that Cygwin (or the Perl port) is messing something up, but I doubt it.
I also tried another example from that blog post, with atomic groups and possessive quantifiers replaced with equivalent code but without those features, but this problem still plagued me.
Examples:
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
Should become (with the shortened regex):
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:replace-text-here?>
Yet it only becomes:
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
<?php echo ($sub->getTarget() != "")?"target=\"".$sub->getTarget()."\"":""; ?>
Should become:
<?php echo ($sub->getTarget() != replace-text-here)?replace-text-here.$sub->getTarget().replace-text-here:replace-text-here; ?>
And as above, only the first occurrence is changed.
(And yes, I do realise that this will spark into some sort of - don't use regex for parsing HTML/PHP. But in this case I think that regex is more appropriate, as I am not looking for context, I am looking for a string (anything within quotes) and performing an operation on that string - which is regex.)
And just a note - these regexes are running in an eval function, and the actual regex is encoded in a single quoted string (which is why the single quotes are escaped). I will try any presented solution directly though to rule out my bad programming.
EDIT: As requested, a short script that presents the problems:
#!/usr/bin/perl -w
use strict;
my $data = "this is the first line, where nothing much happens
but on the second line \"we suddenly have some double quotes\"
and on the third line there are 'single quotes'
but the fourth line has \"double quotes\" AND 'single quotes', but also another \"double quote\"
the fifth line has the interesting one - \"double quoted string 'with embedded singles' AND \\\"escaped doubles\\\"\"
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
";
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
my $regex2 = 's/([\'"]).*?\1/replaced2!/g';
print $data."\n";
$_ = $data; # to make the regex operate on $_, as per the original script
eval($regex);
print $_."\n";
$_ = $data;
eval($regex2);
print $_; # just an example of an eval, but without the fancy possessive quantifiers
This produces the following output for me:
this is the first line, where nothing much happens
but on the second line "we suddenly have some double quotes"
and on the third line there are 'single quotes'
but the fourth line has "double quotes" AND 'single quotes', but also another "double quote"
the fifth line has the interesting one - "double quoted string 'with embedded singles' AND \"escaped doubles\""
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
this is the first line, where nothing much happens
but on the second line "we suddenly have some double quotes"
and on the third line there are 'single quotes'
but the fourth line has "double quotes" AND 'single quotes', but also another "double quote"
the fifth line has the interesting one - "double quoted string 'with embedded singles' AND \"escaped doubles\replaced!
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
this is the first line, where nothing much happens
but on the second line replaced2!
and on the third line there are replaced2!
but the fourth line has replaced2! AND replaced2!, but also another replaced2!
the fifth line has the interesting one - replaced2!escaped doubles\replaced2!
and the sixth is just to say - we need a new line at the end to simulate a properly structured file
Even within single-quotes, \\ gets processed as \, so this:
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
sets $regex to this:
s/(["'])(?:\?+.)*?\1/replaced!/g
which requires each character in the quoted-string to be preceded by one or more literal question-marks (\?+). Since you don't have lots of question-marks, this effectively means that you're requiring the string to be empty, either "" or ''.
The minimal fix is to add more backslashes:
my $regex = 's/(["\'])(?:\\\\?+.)*?\\1/replaced!/g';
but you really might want to rethink your approach. Do you really need to save the whole regex-replacement command as a string and run it via eval?
Update: this:
my $regex = 's/(["\'])(?:\\?+.)*?\1/replaced!/g';
should be:
my $regex = 's/(["\'])(?:\\\\?+.)*?\1/replaced!/g';
since those single quotes there in the assignment turn \\ into \ and you want the regex to end up with \\.
Please boil your problem down to a short script that demonstrates the problem (including input, bad output, eval and all). Taking what you do show and trying it:
use strict;
use warnings;
my $input = <<'END';
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
END
(my $output = $input) =~ s/(["\'])(?:\\?+.)*?\1/replace-text-here/g;
print $input,"becomes\n",$output;
produces for me:
<?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?>
becomes
<?php echo ($watched_dir->getExistsFlag())?replace-text-here:replace-text-here?>
as I would expect. What does it do for you?

Is there a truly universal wildcard in Grep? [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$#"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '#gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBedit supports Perl-Style Pattern Modifiers you can allow the dot to match linebreaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed elsewhere, grep will work for single line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
here's one way to do it with gawk, if you have it
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file