Problem with perl multiline matching

I'm trying to use a perl one-liner to update some code that spans multiple lines and am seeing some strange behavior. Here's a simple text file that shows the problem I'm seeing:
ABCD START
STOP EFGH
I expected the following to work but it doesn't end up replacing anything:
perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt
After doing some experimenting I found that the \s+ in the original regex will match the newline but not any of the whitespace on the 2nd line, and adding a second \s+ doesn't work either. So for now I'm doing the following workaround, which is to add an intermediate regex that only removes the newline:
perl -pi -e 's/START\s+/START /s' input.txt
This creates the following intermediate file:
ABCD START STOP EFGH
Then I can run the original regex (although the /s is no longer needed):
perl -pi -e 's/START\s+STOP/REPLACE/s' input.txt
This creates the final, desired file:
ABCD REPLACE EFGH
It seems like the intermediate step should not be necessary. Am I missing something?

You were close. You need either -00 or -0777:
perl -0777 -pi -e 's/START\s+STOP/REPLACE/' input.txt

perl -p processes the file one line at a time. The regex you have is correct, but it is never matched against the multi-line string.
A simple strategy, assuming the file will fit in memory, is to read the whole thing (do this without -p):
$/ = undef;
$file = <>;
$file =~ s/START\s+STOP/REPLACE/sg;
print $file;
Note, I have added the /g modifier to specify global replacement.
As a shortcut for all that extra boilerplate, you can use your existing script with the -0777 option: perl -0777pi -e 's/START\s+STOP/REPLACE/sg' input.txt. The /g is still required if the file may contain more than one match.
A hiccup that you might run into, although not with this regex: if the regex were START.+STOP, and a file contains multiple START/STOP pairs, greedy matching of .+ will eat everything from the first START to the last STOP. You can use non-greedy matching (match as little as possible) with .+?.
If you want to use the ^ and $ anchors for line boundaries anywhere in the string, then you also need the /m regex modifier.

A relatively simple one-liner (reading the file in memory):
perl -pi -e 'BEGIN{undef $/;} s/START\s+STOP/REPLACE/sg;' input.txt
Another alternative (not so simple) that avoids reading the whole file up front; lines are buffered only until a match is found or the file ends:
perl -ni -e '$a .= $_;
if ( $a =~ s/START\s+STOP/REPLACE/s or eof ) { print $a; $a = ""; }' input.txt

perl -MFile::Slurp -e '$content = read_file(shift); $content =~ s/START\s+STOP/REPLACE/s; print $content' input.txt

Here's a one-liner that doesn't read the entire file into memory at once:
perl -i -ne 'if (($x = $last . $_) =~ s/START\n\s*STOP/REPLACE/)
{ print $x; $last = ""; } else { print $last; $last = $_; }
print $last if eof ARGV' input.txt

Related

Find multi-line text & replace it, using regex, in shell script

I am trying to find a pattern of two consecutive lines, where the first line is a fixed string and the second contains a substring I'd like to replace.
This is to be done in sh or bash on macOS.
If I had a regex tool at hand that would operate on the entire text, this would be easy for me. However, all I find is bash's simple text replacement - which doesn't work with regex, and sed, which is line oriented.
I suspect that I can use sed in a way where it first finds a matching first line, and only then looks to replace the following line if its pattern also matches, but I cannot figure this out.
Or are there other tools present on macOS that would let me do a regex-based search-and-replace over an entire file or a string? Maybe with Python (v2.7 and v3 are installed)?
Here's a sample text and how I'd like it modified:
keyA
value:474
keyB
value:474 <-- only this shall be replaced (follows "keyB")
keyC
value:474
keyB
value:474
Now, I want to find all occurrences where the first line is "keyB" and the following one is "value:474", and then replace that second line with another value, e.g. "value:888".
As a regex that ignores line separators, I'd write this:
Search: (\bkeyB\n\s*value):474
Replace: $1:888
So, basically, I find the pattern before the 474, and then replace it with the same pattern plus the new number 888, thereby preserving the original indentation (which is variable).

You can use
sed -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
# Or, to replace the contents of the file inline in FreeBSD sed:
sed -i '' -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
Details:
/keyB$/ - finds all lines that end with keyB
n - prints the current pattern space (unless -n is given) and then replaces it with the next line of input
s/\(.*\):[0-9]*/\1:888/ - find any text up to the last : + zero or more digits capturing that text into Group 1, and replaces with the contents of the group and :888.
The {...} create a block that is executed only once the /keyB$/ condition is met.
See an online sed demo.
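As a sketch of how the block behaves, the command can be run on a shortened version of the question's sample. The helper names (keys, replace_after_keyB) and the two-space indentation of the value lines are assumptions for the example.

```shell
# Shortened sample: only the value that follows keyB should change.
keys() { printf 'keyA\n  value:474\nkeyB\n  value:474\nkeyC\n  value:474\n'; }

# On a line ending in keyB, n prints it and loads the next line,
# and the substitution then runs on that next line only.
replace_after_keyB() {
  keys | sed -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}'
}

replace_after_keyB
```

The indentation before "value" is preserved because it is part of the captured group.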

Use a perl one-liner with -0777 to scan over multiple lines:
$ # inline edit:
$ perl -0777 -i -pe 's/(\bkeyB\s*value):\d*/$1:888/g' file.txt
$ # to stdout:
$ cat file.txt | perl -0777 -pe 's/(\bkeyB\s*value):\d*/$1:888/g'

In plain bash:
#!/bin/bash
keypattern='^[[:blank:]]*keyB$'
valpattern='(.*):'
replacement=888
while read -r; do
    printf '%s\n' "$REPLY"
    if [[ $REPLY =~ $keypattern ]]; then
        read -r
        if [[ $REPLY =~ $valpattern ]]; then
            printf '%s%s\n' "${BASH_REMATCH[0]}" "$replacement"
        else
            printf '%s\n' "$REPLY"
        fi
    fi
done < file

Delete all characters/words that doesn't match a pattern

I have a text file with no line breaks, and I want to delete all the characters that don't match a pattern:
The pattern runs from the word parameter until the next }}. For example, if I have this entry:
KHJLMNNamespaceparameter:{{"Hello i am here"}}NamespaceHSKFSAFSLLLJparameter:{{H}}...
I would like to delete everything else and leave this in the file: parameter:{{"Hello i am here"}} parameter:{{H}}.
All I have found so far is how to delete lines that don't contain a pattern, but I can't find anything that applies to a huge file without \n (end-of-line characters). Would it be possible to do that using sed, awk or Vi?
Thanks!

$ awk 'BEGIN{RS=ORS="}}"} sub(/.*parameter/,"parameter")' file
parameter:{{"Hello i am here"}}parameter:{{H}}
Note that this is gawk-specific due to the multi-char RS.

You can use this grep with -P (PCRE) regex:
grep -oP '.*?\Kparameter:\{\{.*?\}\}' file
parameter:{{"Hello i am here"}}
parameter:{{H}}

If perl is an option, you can do this:
perl -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
In Perl, *? is a non-greedy quantifier, so each match stops at the first }} it encounters.
I think a Perl expert could do this in one instruction, without a temporary array ...
EDIT: this command only writes the wanted text to stdout. To change the file itself, add the -i switch when calling perl:
perl -i.bak -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
A backup file is created with the extension .bak appended to the name, and the result is written to a file with the same name as the input file. Note that you can skip the backup by using the switch -i alone, but some platforms don't allow this. See the perlrun documentation for more information.

Need to join certain lines based on multiple-line pattern match

I have a file that looks like this:
2014-05-01 00:30:45,511
ZZZ|1|CE|web1||etc|etc
ZZZ|1|CE|web2||etc|etc
ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511
ZZZ|1|CE|web1||etc|etc
ZZZ|1|CE|web2||etc|etc
ZZZ|1|CE|web3|asd|SDAF
I want to convert this into 2 lines by replacing the newlines followed by certain patterns with pipes. I want:
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
I am trying multiline match with perl:
cat file | perl -pe 's/\nZZZ/\|ZZZ/m'
but this does not match.
I can do perl -pe 's/\n//m' but that is too much; I need to match '\nZZZ' so that only lines beginning with ZZZ are joined to the previous line.

You just need to indicate slurp mode using the -0777 switch because you're using a regular expression that's trying to match across multiple lines.
The full solution:
perl -0777 -pe 's/\n(?=ZZZ)/|/g' file
Explanation:
Switches:
-0777: slurp files whole
-p: Creates a while(<>){...; print} loop over the input records (with -0777, each whole file is a single record).
-e: Tells perl to execute the code on command line.
Code:
s/\n(?=ZZZ)/|/g: Replace any newline that is followed by ZZZ with a |

Try this if you want to avoid slurp mode:
perl -pe 'chomp unless eof; /\|/ and s/^/|/ or $.>1 and s/^/\n/' filename.txt
Add a record separator to the beginning of the line if it contains record separators.
Otherwise start a new line if we are past the first line.
Keep the new line at the end of the file.

I would suggest using a lookahead, which does not consume your ZZZ part (slurp mode via -0777 is still needed so that the newline and the ZZZ are in the same string):
cat file | perl -0777 -pe 's/(\n(?=ZZZ))/|/g'

This is a pretty standard pattern. It looks like this. The path to the input file is expected as a parameter on the command line.
use strict;
use warnings;
my $line;
while (<>) {
    chomp;
    if ( /^ZZZ/ ) {
        $line .= '|' . $_;
    }
    else {
        print $line, "\n" if $line;
        $line = $_;
    }
}
print $line, "\n" if $line;
Output:
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF
2014-05-01 00:30:45,511|ZZZ|1|CE|web1||etc|etc|ZZZ|1|CE|web2||etc|etc|ZZZ|1|CE|web3|asd|SDAF

Perl match newline in `-0` mode

Question
Suppose I have a file like this:
I've got a loverly bunch of coconut trees.

Newlines!

Bahahaha
Newlines!
the end.
I'd like to replace an occurrence of "Newlines!" that is surrounded by blank lines with (say) NEWLINES!. So, ideal output is:
I've got a loverly bunch of coconut trees.

NEWLINES!

Bahahaha
Newlines!
the end.
Attempts
Ignoring "surrounded by newlines", I can do:
perl -p -e 's#Newlines!#NEWLINES!#g' input.txt
Which replaces all occurrences of "Newlines!" with "NEWLINES!".
Now I try to pick out only the "Newlines!" surrounded with \n:
perl -p -e 's#\nNewlines!\n#\nNEWLINES!\n#g' input.txt
No luck (note - I don't need the s switch because I'm not using . and I don't need the m switch because I'm not using ^ and $; regardless, adding them doesn't make this work). Lookaheads/behinds don't work either:
perl -p -e 's#(?<=\n)Newlines!(?=\n)#NEWLINES!#g' input.txt
After a bit of searching, I see that perl reads in the file line-by-line (makes sense; sed does too). So, I use the -0 switch:
perl -0p -e 's#(?<=\n)Newlines!(?=\n)#NEWLINES!#g' input.txt
Of course this doesn't work -- -0 replaces new line characters with the null character.
So my question is -- how can I match this pattern (I'd prefer not to write any perl beyond the regex 's#pattern#replacement#flags' construct)?
Is it possible to match this null character? I did try:
perl -0p -e 's#(?<=\0)Newlines!(?=\0)#NEWLINES!#g' input.txt
to no effect.
Can anyone tell me how to match newlines in perl? Whether in -0 mode or not? Or should I use something like awk? (I started with sed but it doesn't seem to have lookahead/behind support even with -r. I went to perl because I'm not at all familiar with awk).
cheers.
(PS: this question is not what I'm after because their problem had to do with a .+ matching newline).

Following should work for you:
perl -0pe 's#(?<=\n\n)Newlines!(?=\n\n)#NEWLINES!#g'

I think the way you went about things caused you to combine possible solutions in a way that didn't work.
If you use the inline editing flag you can do it like this:
perl -0p -i.bk -e 's/\n\nNewlines!\n\n/\n\nNEWLINES!\n\n/g' input.txt
I have doubled the \n's to make sure you only get the ones with empty lines above and below.

If the file is small enough to be slurped into memory all at once:
perl -0777 -pe 's/\n\nNewlines!(?=\n\n)/\n\nNEWLINES!/g'
Otherwise, keep a buffer of the last three lines read:
perl -ne 'push @buffer, $_; $buffer[1] = "NEWLINES!\n" if @buffer == 3 && ' \
-e 'join("", @buffer) eq "\nNewlines!\n\n"; ' \
-e 'print shift @buffer if @buffer == 3; END { print @buffer }'

How do I display data from the beginning of a file until the first occurrence of a regular expression?

For example, if I have a file that contains:
One
Two
Three
Bravo
Four
Five
I want to start displaying the contents of the file starting at line 1 and stopping when I find the string "B*". So the output should look like this:
One
Two
Three

perl -pe 'last if /^B/' source.txt
An explanation: the -p switch adds a loop around the code, turning it into this:
while ( <> ) {
    last if /^B/; # The bit we provide
    print;
}
The last keyword exits the surrounding loop immediately if the condition holds - in this case, /^B/, which indicates that the line begins with a B.

If it's from the start of the file:
awk '/^B/{exit}1' file
If you want to start from a specific line number:
awk '/^B/{exit}NR>=10' file # start from line 10

sed -n '1,/^B/p'
Print from line 1 to /^B/ (inclusive). -n suppresses default echo.
Update: Oops... you didn't want "Bravo", so the reverse action is needed instead ;-)
sed -n '/^B/,$!p'

sed '/^B/,$d'
Read that as follows: delete (d) all lines from the first line that starts with a "B" (/^B/) through the last line ($).

Some of the sed commands given by others will continue to unnecessarily process the input after the regex is found which could be quite slow for large input. This quits when the regex is found:
sed -n '/^Bravo/q;p'
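For comparison, here is a sketch that runs three of the variants from the answers above on the question's sample; on this input they should all print the same three lines. The function names are made up, and the pattern is shortened to /^B/ throughout.

```shell
# The six-line sample from the question.
words() { printf 'One\nTwo\nThree\nBravo\nFour\nFive\n'; }

# awk: stop at the first line starting with B, print everything before it.
via_awk() { words | awk '/^B/{exit}1'; }

# sed: delete from the first B line through the end (still reads all input).
via_sed_delete() { words | sed '/^B/,$d'; }

# sed: quit as soon as the B line is seen, otherwise print the line.
via_sed_quit() { words | sed -n '/^B/q;p'; }

via_awk
via_sed_delete
via_sed_quit
```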

in Perl:
perl -nle '/B.*/ && last; print; ' source.txt

Just sharing some answers I've received:
Print data starting at the first line, and continue until we find a match to the regex, then stop:
<command> | perl -n -e 'print "$_" if 1 ... /<regex>/;'
Print data starting at the first line, and continue until we find a match to the regex, BUT don't display the line that matches the regular expression:
<command> | perl -pe '/<regex>/ && exit;'
Doing it in sed:
<command> | sed -n '1,/<regex>/p'

Your problem is a variation on an answer in perlfaq6: How can I pull out lines between two patterns that are themselves on different lines?.
You can use Perl's somewhat exotic .. operator (documented in perlop):
perl -ne 'print if /START/ .. /END/' file1 file2 ...
If you wanted text and not lines, you would use
perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
But if you want nested occurrences of START through END, you'll run up against the problem described in the question in this section on matching balanced text.
Here's another example of using ..:
while (<>) {
    $in_header = 1 .. /^$/;
    $in_body = /^$/ .. eof;
    # now choose between them
} continue {
    $. = 0 if eof; # fix $.
}

Here is a perl one-liner:
perl -pe 'last if /B/' file

If Perl is a possibility, you could do something like this:
% perl -0ne 'if (/B.*/) { print $`; last }' INPUT_FILE

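One-liner with basic shell commands (the line number of the first match, minus one, is passed to head so that the matching line itself is not printed):
head -n $(( $(grep -n B file | head -1 | cut -d: -f1) - 1 )) file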
