How to find and replace multiple line texts in perl - regex

I have a text file named "data" with the content:
a
b
c
abc
I'd like to find every "abc" sequence (it doesn't need to be on one line) and replace the leading "a" with "A". Here "b" can be any run of one or more characters other than 'c'.
(This is a simplification of my actual use case.)
I thought this perl command would do
perl -pi.bak -e 's/a([^c]+?)c/A\1c/mg' data
With this 'data' was changed to:
a
b
c
Abc
I was expecting:
A
b
c
Abc
I'm not sure why perl missed the first occurrence (on lines 1-3).
Let me know if you spot anything wrong with my perl command or you know a working alternative. Much appreciated.

You're reading a line at a time, applying the code to that one line. It can't possibly match across multiple lines. The simple solution is to tell perl to treat the entire file as one line using -0777.
perl -i.bak -0777pe's/a([^c]+c)/A$1/g' data
Replaced the incorrect \1 with $1.
Removed the useless /m. It only affects ^ and $, but you don't use those.
Removed the useless non-greedy modifier.
Moved the c into the capture to avoid repeating it.

Related

Inserting a potentially missing line with perl

I'm trying to modify a perl filter to insert a line that may be missing.
I might have inputs that are
A
B
C
or
A
C
A and B are fixed and known in advance. C can vary from file to file.
The real data are more complicated - callstacks generated as part of
a regression test. Depending on the compiler used (and hence the
optimization) there may be tail call elimination which can remove
the 'B' frame. After filtering the files are simply diffed.
In the second case I want to insert the 'B' line. In the first case I don't want to insert a duplicate line. I thought that this was a job for negative lookahead, using the following
s/A.(?!B)/A\nB/s;
However this seems to mean "if any part of A.(?!B) matches the input text then substitute it with A\nB" whereas I need "if all of A.(?!B) matches" then substitute.
No matter what I try it either always substitutes or never substitutes.
In a one-liner for a ready test
perl -0777 -wpe's/ ^A.*\n \K (?!B.*\n) /B-line\n/xgm' file
The \K makes it drop all matches before it, so we don't have to capture and copy them back in the replacement side. With the -0777 switch the whole file is slurped into a string, available in $_.
In order to match all such A-B?-C groups of lines we need the /g modifier (match "globally"), and for the anchor ^ to also match on internal newlines we need the /m modifier ("multiline").
The /x modifier makes it disregard literal spaces (and newlines and comments), which allows for spacing things out for readability.
On the other hand, if a line starting with A must be followed by either a line starting with B, or by a line starting with C if the B-line isn't there, then it's simpler, with no need for a lookahead
perl -0777 -wpe's/ ^A.*\n \K (^C.*\n) /B-line\n$1/xgm' file
Both these work correctly in my (basic) tests.
In either case, the rest of the file is printed as is, so you can use the -i switch to change the input file "in-place," if desired, and with -i.bak you get a backup as well. So
perl -i.bak -0777 -wpe'...' file
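Or, if this runs inside a script, you can write the output back to the same file to overwrite it, since the whole file has already been read by then. (A shell redirection to the input file won't do, because the shell truncates the file before perl gets to read it.)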
Reading the file line by line is of course far more flexible. For example
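use warnings;
use strict;
use feature 'say';

my $just_saw_A_line;
while (<>) {
    # The previous line was an A-line and this one is not a B-line: insert it
    if ($just_saw_A_line and not /^B/) {
        say "B-line";
    }
    $just_saw_A_line = /^A/;   # remember for the next iteration
    print;
}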
This also handles multiple A-(B?)-C line groups. It's far more easily adjusted for variations.
The program acts like a filter, taking either STDIN or lines from file(s) given on the command line, and printing lines to STDOUT. The output can then be redirected to a file, but not to the input file itself. (If the input file needs to be changed, then the code needs to be modified for that.)

$ and Perl's global regular expression modifier

I finally figured out how to append text to the end of each line in a file:
perl -pe 's/$/addthis/' myfile.txt
However, as I'm trying to learn Perl for frequent regex use, I can't figure out why is it that the following perl command adds the text 'addthis' to the end and start of each line:
perl -pe 's/$/addthis/g' myfile.txt
I thought that '$' matched the end of a line no matter what modifier was used for the regex match, but I guess this is wrong?
Summary: For what you're doing, drop the /g so it only matches before the newline. The /g is telling it to match before the newline and at the end of the string (after the newline).
Without the /m modifier, $ will match either before a newline (if it occurs at the end of the string) or at the end of the string. For instance, with both "foo" and "foo\n", the $ would match after foo. With "foo\nbar", though, it would match after bar, because the embedded newline isn't at the end of the string.
With the /g modifier, you're getting all the places that $ would match -- so
s/$/X/g;
would take a line like "foo\n" and turn it into "fooX\nX".
Sidebar:
The /m modifier will allow $ to also match before newlines that occur before the end of the string, so that
s/$/X/mg;
would convert "foo\nbar\n" into "fooX\nbarX\nX".
As Jim Davis pointed out, $ matches both at the end of the string and before a \n character at the end of the string (with the /m option, before any \n). (See the Regular Expressions section of the perlre Perldoc page.) Using the g modifier allowed it to continue matching.
Multi-line Perl regular expressions (i.e., Perl regular expressions applied to text with the newline character in it, even if it only occurs once at the end of the line) cause all sorts of complications that most Perl programmers have trouble handling.
If you're reading in a file one line at a time, always use chomp before doing ANYTHING with that line. This would have solved your issue when using the g qualifier.
Further issues can happen if you're reading files on Linux/Mac which came from Windows. In that case, you will have both the \r and \n characters. As I found out recently while debugging a program, the \r character isn't removed by chomp. I now make sure I always open my text files for reading like this:
open my $file_handle, "<:crlf", $file...
This will automatically replace \r\n with just \n if this is in fact a Windows file on a Linux/Mac system. If this is a regular Linux/Mac text file, it will do nothing. The other obvious solution is not to use Windows (rim shot!).
Of course, in your case, using chomp first would have done the following:
$ cat file
line one
line two
line three
line four
$ perl -pe 'chomp;s/$/addthis::/g' file
line oneaddthis::line twoaddthis::line threeaddthis::line fouraddthis::
The chomp removed the \n, so now you don't see it when the lines print out. Hmm...
$ perl -pe 'chomp;s/$/addthis/g;$_ .= "\n"' file
line oneaddthis
line twoaddthis
line threeaddthis
line fouraddthis
That works! And, your one liner is only mildly incomprehensible.
The other thing is to take a more modern approach that Damian Conway recommends in Chapter 12 of his book Perl Best Practices:
Use \A and \z as string boundary anchors.
Even if you don’t adopt the previous practice of always using /m, using ^ and $ with their default meanings is a bad idea. Sure, you know what ^ and $ actually mean in a Perl regex [1]. But will those who read or maintain your code know? Or is it more likely that they will misinterpret those metacharacters in the ways described earlier?
Perl provides markers that always—and unambiguously—mean “start of string” and “end of string”: \A and \z (capital A, but lowercase z). They mean “start/end of string” regardless of whether /m is active. They mean “start/end of string” regardless of what the reader thinks ^ and $ mean.
If you followed Conway's advice, and did this:
perl -pe 's/\z/addthis/mg' myfile.txt
You would see that your phrase addthis got added only to the end of each and every line:
$ cat file
line one
line two
line three
line four
$ perl -pe 's/\z/addthis/mg' file
line one
addthisline two
addthisline three
addthisline four
addthis
See how well that works. That addthis was added to the very end of each line! ...Right after the \n character on that line.
Enough fun and back to work. (Wait, it's President's Day. It's a paid holiday. No work today except of course all that stuff I promised to have done by Tuesday morning).
Hope this helped you understand how much fun regular expressions are and why so many people have decided to learn Python.
1. Know what ^ and $ really mean in Perl? Uh, yes of course I do. I've been programming in Perl for a few decades. Yup, I know all this stuff. (Note to self: $ apparently doesn't mean what I always thought it meant.)
A workaround:
perl -pe 's/\n/addthis\n/'
No need for the g modifier: the regex is applied line by line.

Grep is messing up my understanding

For some time I have been trying to play with grep to retrieve data from files and I noticed something funny.
It might be my ignorance but here is what happens...
Suppose I have a file ABC. The data is:
a
abc
ab
bac
bb
ac
Now I ran this grep command:
grep a* ABC
I found that the output contains not only lines starting with a but also lines starting with b. Why is this happening?
You used 'a*' as your search pattern... the '*' means ZERO or MORE of the previous character, so 'b.c' matches, having ZERO or more 'a's in it.
On a semi-related note, I'd recommend quoting the 'a*' bit, since if you have ANY files in the current directory which start with a, you'll be VERY surprised to see what you're really searching for, since the shell (bash,zsh,csh,sh,dash,wtfsh...) will perform wildcard expansion automatically BEFORE the command is executed.
if you want to search for lines which START with 'a', then you'll need to anchor the search pattern with a leading ^ character, so your pattern becomes '^a*', but again, the * means ZERO or more, so it's not useful in this situation where you only have one letter... use '^a' instead.
As a contrived example, if you wanted to find all the lines containing a 'c' AND those containing the letters 'bc', then you could use 'b*c' as the search pattern... meaning ZERO or more b's, and a c.
The power of the regex search pattern is immense, and takes some time to grok. Peruse the man pages for grep(1), regex(7), pcre(3), pcresyntax(3), pcrepattern(3).
Once you get the hang of them, regexes are useful in sed, grep, perl, vim, (probably emacs too), ... uh, it's late (early?) nothing more comes to mind, but they're VERY powerful.
As some bonus, '*' means ZERO or more, '+' means ONE or more, and '?' means ZERO or ONE.
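So to search for things with two or more a's... 'aa+', which is 1 a, and 1+ a (1 or more). Note that plain grep treats + as a literal character; use grep -E for this (or write \+ in GNU grep).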
I ramble.... (regex(7)!)
grep tries to find that pattern anywhere in the line. Use ^a to get lines starting with a, or ^a*$ to find lines containing only a's (including the empty line).
Also, please quote that shell argument (e.g. '^a*$'). If you use a* and there is a file in the working directory starting with an a, you will get very weird results...
Try this, it works for me. The ^ means beginning of a line - so it has to start with a.
grep ^a ABC
You need to put quotes around your pattern:
grep "a*" ABC
Otherwise the * is interpreted by the shell (which does wild-card filename matching), instead of by grep itself.
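To see the points made in these answers side by side, the sample data can be piped through grep with each quoted pattern. This is just an illustrative sketch; the variable names are arbitrary and the data is inlined rather than read from the file ABC.

```shell
sample='a
abc
ab
bac
bb
ac'

# 'a*' matches zero or more a's, so every line matches (count is 6).
all_count=$(printf '%s\n' "$sample" | grep -c 'a*')

# '^a' keeps only the lines that start with a.
starts_with_a=$(printf '%s\n' "$sample" | grep '^a')

printf 'lines matching a*: %s\nlines matching ^a:\n%s\n' "$all_count" "$starts_with_a"
```

Both patterns are in single quotes, so the shell passes them to grep untouched.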

Regex (grep) for multi-line search needed [duplicate]

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is an attempt to match the same indentation as the start of the method.
As you can imagine, this search prints the main method in a C (*.c) source file.
I am not very good at grep. But your problem can be solved using the awk command.
Just see
awk '/select/,/from/' *.sql
The above command prints each block of lines from a line matching select through the next line matching from. Now you need to check whether the returned statements contain customerName or not. For this you can pipe the result into awk or grep again.
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
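$/ = "\n\n";    # Read a "paragraph" at a time
while (<>)
{
    if (m/SELECT.*customerName.*FROM/si)    # /s lets . match newlines too
    {
        print "$ARGV\n";    # Name of the file currently being read
        close ARGV;         # Skip to the next file
    }
}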
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.

Regular expression with sed

I'm having a hard time selecting from a file using a regular expression. I'm trying to replace a specific text in the file which is full of lines like this.
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp and the filename and the location of the file are also going to be different.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of that word and the \> tells it to match the end of the word.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.
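The effect of the g flag described in these answers can be checked on a single made-up path fragment. The input string and variable names here are illustrative only.

```shell
# Without g, sed replaces only the first "train" on the line.
first_only=$(printf 'train/train38.wav\n' | sed 's/train/FOO/')

# With g, every "train" on the line is replaced.
every=$(printf 'train/train38.wav\n' | sed 's/train/FOO/g')

printf '%s\n%s\n' "$first_only" "$every"
```

The first command leaves the file name alone; the second also rewrites it, which is usually not what is wanted for this question.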