find and replace double newlines with perl? - regex

I'm cleaning up some web pages that for some reason have about 8 line breaks between tags. I wanted to remove most of them, and I tried this
perl -pi -w -e "s/\n\n//g" *.html
But no luck. For good measure, I tried
perl -pi -w -e "s/\n//g" *.html
and it did remove all my line breaks. What am I doing wrong?
edit: I also tried \r\n\r\n, same deal. It works for a single line break, but does nothing for two consecutive ones.

Use -0:
perl -pi -0 -w -e "s/\n\n//g" *.html
The problem is that by default -p reads the file one line at a time. There's no such thing as a line with two newlines, so you didn't find any. The -0 changes the line-ending character to "\0", which probably doesn't exist in your file, so it processes the whole file at once. (Even if the file did contain NULs, you're looking for consecutive newlines, so processing it in NUL-delimited chunks won't be a problem.)
You probably want to adjust your regex as well, but it's hard to be sure exactly what you want. Try s/\n\n+/\n/g, which will replace any number of consecutive newlines with a single newline.
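Combining the two, the full command would be (assuming you want exactly one newline left between tags):
perl -pi -0 -w -e "s/\n\n+/\n/g" *.html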
If the file is very large, you may not have enough memory to load it in a single chunk. A workaround for this is to pick some character that is common enough to split the file into manageable chunks, and tell Perl to use that as the line-ending character. But it also has to be a character that will not appear inside the matches you're trying to replace. For example, -0x2e will split the file on "." (ASCII 0x2E).
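For example, a chunked variant of the same replacement (a sketch; it relies on newline runs never containing ".", which holds by definition since they contain only newlines):
perl -pi -0x2e -w -e "s/\n\n+/\n/g" *.html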

I was trying to replace a double newline with a single one, using the above recommendation, on a large file (2.3 GB). With huge files, Perl can segfault when trying to read the entire file at once. So instead of looking for a double newline, just look for lines whose only character is a newline:
perl -pi -w -e 's/^\n$//' file.txt
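A quick sanity check on synthetic input (illustrative):
printf 'a\n\n\nb\n' | perl -wpe 's/^\n$//'
This prints a and b on adjacent lines, with the blank lines gone.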

Related

regex command line with single-line flag

I need to use a regex in a bash script to substitute text in a file that might span multiple lines.
In the other regex engines I know I would pass s as a flag, but I'm having a hard time doing this from bash.
sed, as far as I know, doesn't support this feature.
perl obviously does, but I can't make it work in a one-liner:
perl -i -pe 's/<match.+match>//s $file
example text:
DONT_MATCH
<match some text here
and here
match>
DONT_MATCH
By default, . doesn't match a line feed. The s flag simply makes . match any character.
You are reading the file a line at a time, so you can't possibly match something that spans multiple lines. Use -0777 to treat the entire input as one line.
perl -i -0777pe's/<match.+match>//s' "$file"
This might work for you (GNU sed):
sed '/^<match/{:a;/match>$/!{N;ba};s/.*//}' file
Gather up a collection of lines from one beginning <match to one ending match> and replace them by nothing.
N.B. This will act on all such collections throughout the file, and the end-of-file condition will not affect the outcome. To only act on the first, use:
sed '/^<match/{:a;/match>$/!{N;ba};s/.*//;:b;n;bb}' file
To only act on the second such collection use:
sed -E '/^<match/{:a;/match>$/!{N;ba};x;s/^/x/;/^(x{2})$/{x;s/.*//;x};x}' file
The regex /^(x{2})$/ can be tailored to do more intricate matching e.g. /^(x|x{3,6})$/ would match the first and third to sixth collections.
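For instance, to act only on the first and the third to sixth collections, following the same counting idea:
sed -E '/^<match/{:a;/match>$/!{N;ba};x;s/^/x/;/^(x|x{3,6})$/{x;s/.*//;x};x}' file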
With GNU sed:
$ sed -z 's/<match.*match>//g' file
DONT_MATCH
DONT_MATCH
With any sed:
$ sed 'H;1h;$!d;x; s/<match.*match>//g' file
DONT_MATCH
DONT_MATCH
Both the above approaches read the whole file into memory. If you have a big file (e.g. gigabytes), you might want a different approach.
Details
With GNU sed, the -z option reads in files with NUL as the record separator. For text files, which never contain NUL, this has the effect of reading the whole file in.
For ordinary sed, the whole file can be read in with the following steps:
H - Append the current line to the hold space
1h - If this is the first line, overwrite the hold space with it
$!d - If this is not the last line, delete the pattern space and jump to the next line
x - Exchange hold and pattern space to put the whole file in the pattern space
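If the file is too big to hold in memory, one streaming possibility is Perl's range (flip-flop) operator; a sketch assuming <match always starts its line and match> always ends one:
perl -ne 'print unless /^<match/ .. /match>$/' file
This keeps only one line in memory at a time and drops each delimited collection as it streams past.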

Inserting a potentially missing line with perl

I'm trying to modify a perl filter to insert a line that may be missing.
I might have inputs that are
A
B
C
or
A
C
A and B are fixed and known in advance. C can vary from file to file.
The real data are more complicated - callstacks generated as part of
a regression test. Depending on the compiler used (and hence the
optimization) there may be tail call elimination which can remove
the 'B' frame. After filtering the files are simply diffed.
In the second case I want to insert the 'B' line. In the first case I don't want to insert a duplicate line. I thought that this was a job for negative lookahead, using the following
s/A.(?!B)/A\nB/s;
However this seems to mean "if any part of A.(?!B) matches the input text then substitute it with A\nB" whereas I need "if all of A.(?!B) matches" then substitute.
No matter what I try it either always substitutes or never substitutes.
As a one-liner, ready to test:
perl -0777 -wpe's/ ^A.*\n \K (?!B.*\n) /B-line\n/xgm' file
The \K makes it drop all matches before it, so we don't have to capture and copy them back in the replacement side. With the -0777 switch the whole file is slurped into a string, available in $_.
In order to match all such A-B?-C groups of lines we need the /g modifier (match "globally"), and for the anchor ^ to also match on internal newlines we need the /m modifier ("multiline").
The /x modifier makes it disregard literal spaces (and newlines and comments), which allows for spacing things out for readability.
On the other hand, if a line starting with A must be followed by either a line starting with B, or by a line starting with C if the B-line isn't there, then it's simpler, with no need for a lookahead
perl -0777 -wpe's/ ^A.*\n \K (^C.*\n) /B-line\n$1/xgm' file
Both these work correctly in my (basic) tests.
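For instance, a quick check of the first one-liner on input that is missing the B line (the line contents here are illustrative):
printf 'A\nC\n' | perl -0777 -wpe's/ ^A.*\n \K (?!B.*\n) /B-line\n/xgm'
prints A, B-line, C, while input that already contains a B line passes through unchanged.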
In either case, the rest of the file is printed as is, so you can use the -i switch to change the input file "in-place," if desired, and with -i.bak you get a backup as well. So
perl -i.bak -0777 -wpe'...' file
Or, since the whole file is first read in, you can redirect the output into the same file to overwrite it, if this is run from a script.
Reading the file line by line is of course far more flexible. For example
use warnings;
use strict;
use feature 'say';

my $just_saw_A_line;
while (<>) {
    if ($just_saw_A_line and not /^B/) {
        say "B-line";            # insert the missing line
    }
    $just_saw_A_line = /^A/;     # remember whether this line starts a group
    print;
}
This also handles multiple A-(B?)-C line groups. It's far more easily adjusted for variations.
The program acts like a filter, taking either STDIN or lines from file(s) given on the command line, and printing lines to STDOUT. The output can then be redirected to a file, but not to the input file itself. (If the input file need be changed then the code need be modified for that.)
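Assuming the script above is saved as insert_b.pl (the name is illustrative), it runs as a filter like so:
perl insert_b.pl input.txt > fixed.txt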

How can I match multi-line patterns in the command line with perl-style regex?

I regularly use regex to transform text.
To transform giant text files from the command line, perl lets me do this:
perl -pe 's/.../.../' < in.txt > out.txt
But this is inherently on a line-by-line basis. Occasionally, I want to match on multi-line things.
How can I do this in the command-line?
To slurp a file instead of doing line by line processing, use the -0777 switch:
perl -0777 -pe 's/.../.../g' in.txt > out.txt
As documented in perlrun #Command Switches:
The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.
Obviously, for large files this may not work well, in which case you'll need to code some type of buffer to do this replacement. We can't advise any better though without real information about your intent.
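As a middle ground, the paragraph mode mentioned in the quote bounds memory use when records are separated by blank lines; a sketch with a purely illustrative pattern that can match across a line break within a paragraph:
perl -00 -pe 's/foo\s+bar/foo bar/g' in.txt > out.txt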
Grepping across line boundaries
So you want to grep across line boundaries...
You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.
To match across multiple lines, you have to turn on multi-line mode with -M, which is not the same as (?m).
Running pcregrep -M "(?s)^b.*\d+" text.txt
On this text file:
a
b
c11
The output will be
b
c11
whereas grep would return empty.
Excerpt from the doc:
-M, --multiline Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.
When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)

Suppress the match itself in grep

Suppose I have lots of files in the form of
First Line Name
Second Line Surname Address
Third Line etc
etc
Now I'm using grep to match the first line. But I'm actually doing this to find the second line. The second line is not a pattern that can be matched (it just depends on the first line). My regex pattern works and the command I'm using is
grep -rHIin pattern . -A 1 -m 1
Now the -A option prints the line after a match. The -m option stops after 1 match (because there are other lines that match my pattern, but I'm interested in just the first match, anyway...)
This actually works but the output is like that:
./example_file:1: First Line Name
./example_file-2- Second Line Surname Address
I've read the manual but couldn't find any clue or info about that. Now here is the question.
How can I suppress the match itself ? The output should be in the form of:
./example_file-2- Second Line Surname Address
sed to the rescue:
sed -n '2,${p;n;}'
The particular sed command here starts with line 2 of its input and prints every other line. Pipe the output of grep into that and you'll only get the even-numbered lines out of the grep output.
An explanation of the sed command itself:
2,$ - the range of lines from line 2 to the last line of the file
{p;n;} - print the current line, then ignore the next line (this then gets repeated)
(In this special case of all even lines, an alternative way of writing this would be sed -n 'n;p;' since we don't actually need to special-case any leading lines. If you wanted to skip the first 5 lines of the file, this wouldn't be possible, you'd have to use the 6,$ syntax.)
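Put together with the original command on a single file, that looks like this (a sketch; with -r across many files, grep's -- group separators would also need filtering):
grep -HIin pattern example_file -A 1 -m 1 | sed -n '2,${p;n;}'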
You can use sed to print the line after each match:
sed -n '/<pattern>/{n;p}' <file>
To get recursion and the file names, you will need something like:
find . -type f -exec sed -n '/<pattern>/{n;s/^/{}:/;p}' {} \;
If you have already read a book on grep, you could also read a manual on awk, another common Unix tool.
In awk, your task will be solved with a nice simple code. (As for me, I always have to refresh my knowledge of awk's syntax by going to the manual (info awk) when I want to use it.)
Or, you could come up with a solution combining find (to iterate over your files) and grep (to select the lines) and head/tail (to discard for each individual file the lines you don't want). The complication with find is to be able to work with each file individually, discarding a line per file.
You could pipe the results through grep -v pattern.
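For example (a sketch; the -i on the second grep keeps it consistent with the original case-insensitive search, and any -- group separators would still need filtering):
grep -rHIin pattern . -A 1 -m 1 | grep -vi pattern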

Regex (grep) for multi-line search needed [duplicate]

This question already has answers here:
How can I search for a multiline pattern in a file?
(11 answers)
I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware that this also adds a trailing NUL char if used with -o.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? matches . in non-greedy mode, that is, it stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This tries to find the same indentation as the method.
As you can imagine, this search prints the main method in a C (*.c) source file.
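The same flags can be pointed at the original SQL question; a sketch with a deliberately loose pattern (the [^;]*? is an assumption that statements end with semicolons):
grep -rlPz --include="*.sql" "(?si)select[^;]*?customerName[^;]*?from" .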
I am not very good at grep, but your problem can be solved using the awk command.
Just see
awk '/select/,/from/' *.sql
The above prints everything from the first occurrence of select to the first occurrence of from. Now you need to verify whether the returned statements contain customerName or not. For this you can pipe the result, and use awk or grep again.
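For example, piping the result into grep again:
awk '/select/,/from/' *.sql | grep -i customername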
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
$/ = "\n\n"; # Paragraphs
while (<>)
{
if ($_ =~ m/SELECT.*customerName.*FROM/mi)
{
printf file name
go to next file
}
}
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.
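A minimal sketch of that wrapping (names are illustrative; assumes case-insensitive matching and *.sql files only):

use strict;
use warnings;
use File::Find;

sub check_file {
    return unless -f && /\.sql$/i;        # only regular *.sql files
    local $/ = "\n\n";                    # paragraph mode
    open my $fh, '<', $_ or return;
    while (<$fh>) {
        if (m/select.*?customerName.*?from/si) {
            print "$File::Find::name\n";  # report the file
            last;                         # go to next file
        }
    }
    close $fh;
}

find(\&check_file, '.');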