regex command line with single-line flag

I need to use a regex in a bash script to substitute text in a file, where the match might span multiple lines.
In other regex engines I know, I would pass the s flag, but I'm having a hard time doing this from bash.
As far as I know, sed doesn't support this feature.
Perl obviously does, but I can't make it work as a one-liner:
perl -i -pe 's/<match.+match>//s' $file
example text:
DONT_MATCH
<match some text here
and here
match>
DONT_MATCH

By default, . doesn't match a line feed. The s flag simply makes . match any character.
You are reading the file a line at a time, so you can't possibly match something that spans multiple lines. Use -0777 to treat the entire input as one line.
perl -i -0777pe's/<match.+match>//s' "$file"

This might work for you (GNU sed):
sed '/^<match/{:a;/match>$/!{N;ba};s/.*//}' file
Gather up a collection of lines from one beginning <match to one ending match> and replace them by nothing.
N.B. This will act on all such collections throughout the file, and the end-of-file condition will not affect the outcome. To only act on the first, use:
sed '/^<match/{:a;/match>$/!{N;ba};s/.*//;:b;n;bb}' file
To only act on the second such collection use:
sed -E '/^<match/{:a;/match>$/!{N;ba};x;s/^/x/;/^(x{2})$/{x;s/.*//;x};x}' file
The regex /^(x{2})$/ can be tailored to do more intricate matching e.g. /^(x|x{3,6})$/ would match the first and third to sixth collections.

With GNU sed:
$ sed -z 's/<match.*match>//g' file
DONT_MATCH

DONT_MATCH
With any sed:
$ sed 'H;1h;$!d;x; s/<match.*match>//g' file
DONT_MATCH

DONT_MATCH
Both the above approaches read the whole file into memory. If you have a big file (e.g. gigabytes), you might want a different approach.
Details
With GNU sed, the -z option reads in files with NUL as the record separator. For text files, which never contain NUL, this has the effect of reading the whole file in.
For ordinary sed, the whole file can be read in with the following steps:
H - Append current line to hold space
1h - If this is the first line, overwrite the hold space with it
$!d - If this is not the last line, delete pattern space and jump to the next line.
x - Exchange hold and pattern space to put whole file in pattern space
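The steps above can be tried on the sample text from the question. This is only a sketch: the temporary file and the variable name slurped are made up for the demonstration.

```shell
# Build the sample text from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'DONT_MATCH' '<match some text here' 'and here' 'match>' 'DONT_MATCH' > "$sample"

# H;1h;$!d;x collects every line in the hold space and, on the last line,
# swaps the whole file into the pattern space so that s/// can span newlines.
slurped=$(sed 'H;1h;$!d;x; s/<match.*match>//g' "$sample")

printf '%s\n' "$slurped"
rm -f "$sample"
```

The newline that followed match> is not part of the match, so an empty line remains between the two DONT_MATCH lines.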

Related

Sed doesn't replace a pattern that is understood by gedit

I need to delete some content that is followed by 5 hyphens (that are in separate line) from 1000 files. Basically it looks like this:
SOME CONTENT
-----
SOME CONTENT TO BE DELETED WITH 5 HYPHENS ABOVE
I've tried to do that with this solution, but it didn't work for me:
this command (sed '/-----/,$ d' *.txt -i) can't be used because some of these texts have lines with more than 5 hyphens;
this command (sed '/^-----$/,$ d' *.txt -i) left all the files unchanged.
So I figured out that it might be something about "^" and "$" characters, but I am both sed and RegEx newbie, to be honest, and I don't know, what's the problem.
I've also found out that this RegEx — ^-{5}$(\s|\S)*$ — is good for capturing only these blocks which start exactly with 5 hyphens, but putting it into sed command gives no effect (both hyphens and text after them stay, where they were).
There's something I don't understand about sed probably, because when I use the above expression with gedit's Find&Replace, it works flawlessly. But I don't want to open, change and save 1000 files manually.
I am asking this question kinda again, because the given solution (the above link) didn't help me.
The first command I've posted (sed /-----/,$ d' *.txt -i) also resulted in deleting full content of some files, for instance a file that had 5 hyphens, new line with a single space (and no more text) at the bottom of it:
SOME CONTENT
-----
single space
EDIT:
Yes, I forgot about ' here, but in the Terminal I used these commands with it.
Yes, these files end with \n or \r. Is there a solution for it?
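I think you want this:
sed '/^-\{5\}/,$ d' *.txt -i
Note that in sed's default basic regular expressions, { and } need escaping to act as a repetition count; with -E they are written unescaped. This pattern is not anchored at the end, so it also matches lines that start with more than five hyphens.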
$ sed -n '/^-----/q;p' file
SOME CONTENT
or
$ sed -n -E '/^-{5}/q;p' file
SOME CONTENT
Are you just trying to delete from ----- on its own line (which may end with \r) to the end of the file? That'd be:
awk '/^-----\r?$/{exit} {print}' file
The above will work using any awk in any shell on every UNIX system.
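Since the question is about changing many files, here is a hedged sketch of how an awk filter of this kind could be applied file by file. The function name truncate_at_hyphens and the demo files are hypothetical. The pattern is anchored so that longer runs of hyphens are ignored, and the five-hyphen line itself is dropped.

```shell
# Hypothetical helper: rewrite one file, keeping only the lines before the
# first line that is exactly five hyphens (an optional trailing \r is tolerated).
truncate_at_hyphens() {
    tmp=$(mktemp) || return 1
    # cat back into the original so its permissions are preserved.
    awk '/^-----\r?$/ {exit} {print}' "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Demo on throwaway files: one with CRLF endings and a longer hyphen run,
# one without any separator line at all.
demo_dir=$(mktemp -d)
printf 'SOME CONTENT\r\n-------\r\nMORE CONTENT\r\n-----\r\nTO BE DELETED\r\n' > "$demo_dir/a.txt"
printf 'ONLY CONTENT\n' > "$demo_dir/b.txt"

for f in "$demo_dir"/*.txt; do
    truncate_at_hyphens "$f"
done

kept_a=$(cat "$demo_dir/a.txt")
kept_b=$(cat "$demo_dir/b.txt")
printf '%s\n---\n%s\n' "$kept_a" "$kept_b"
rm -rf "$demo_dir"
```

Try it on copies first; a file without a five-hyphen line is written back unchanged.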

Using sed with a newline in a regex

I'm bashing my head against the wall with this one. How do I do a regex replacement with sed on text that contains a newline?
I need to replace the value of the "version" XML element shown below. There are multiple version elements so I want to replace the one that comes after the "name" element.
<name>MyName</name>
<version>old</version>
Here's my command:
sed -i -E "s#(\s*<name>$NAME</name>\n\s*<version>)$VERSION_OLD(</version>)#\1$VERSION_NEW\2#g" $myfile.txt
Now as far as I know there is a way to make sed work with a newline character, but I can't figure it out. I've already used sed in my script so ideally I'd prefer to re-use it instead of say perl.
When you see your name element, you will need to use the N command to read the next line:
file:
<bar>MyName</bar>
<version>old</version>
<name>MyName</name>
<version>old</version>
<foo>MyName</foo>
<version>old</version>
With GNU sed:
sed '/<name>/{N;s/old/newer/}' file
Output:
<bar>MyName</bar>
<version>old</version>
<name>MyName</name>
<version>newer</version>
<foo>MyName</foo>
<version>old</version>
If you're using GNU sed, you can use its extended addressing syntax:
sed '/<name>/,+1{/<version>/s/old/newer/}' file
Breaking this down, it says: for a line matching <name> and the following line (+1), then if the line matches <version>, substitute old with newer.
I'm assuming here that your file is generated, and will always have the name and version elements each on a single line, and adjacent. If you need to handle more free-form XML, then you should really consider an XPath-based tool rather than sed.
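The question uses shell variables, so here is a sketch of the N approach with them. It assumes the values of NAME, VERSION_OLD and VERSION_NEW contain no regex metacharacters, no # and no &; the value 1.2.3 and the temporary file are invented for the demonstration.

```shell
NAME='MyName'
VERSION_OLD='old'
VERSION_NEW='1.2.3'   # made-up value for the demo

sample=$(mktemp)
printf '%s\n' '<bar>MyName</bar>' '<version>old</version>' \
              '<name>MyName</name>' '<version>old</version>' \
              '<foo>MyName</foo>' '<version>old</version>' > "$sample"

# Double quotes let the shell expand the variables. On the <name> line, N pulls
# in the next line; s uses # as its delimiter so the closing tags need no escaping.
updated=$(sed "/<name>$NAME<\/name>/{N;s#<version>$VERSION_OLD</version>#<version>$VERSION_NEW</version>#;}" "$sample")

printf '%s\n' "$updated"
rm -f "$sample"
```

Only the version element directly after the name element changes; the ones after bar and foo keep old.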

How can I match multi-line patterns in the command line with perl-style regex?

I regularly use regex to transform text.
To transform giant text files from the command line, perl lets me do this:
perl -pe 's/.../.../g' < in.txt > out.txt
But this is inherently on a line-by-line basis. Occasionally, I want to match on multi-line things.
How can I do this in the command-line?
To slurp a file instead of doing line by line processing, use the -0777 switch:
perl -0777 -pe 's/.../.../g' in.txt > out.txt
As documented in perlrun #Command Switches:
The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.
Obviously, for large files this may not work well, in which case you'll need to code some type of buffer to do this replacement. We can't advise any better though without real information about your intent.
Grepping across line boundaries
So you want to grep across line boundaries...
You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.
To match across multiple lines, you have to turn on the multi-line mode -M, which is not the same as (?m).
Running pcregrep -M "(?s)^b.*\d+" text.txt
On this text file:
a
b
c11
The output will be
b
c11
whereas grep would return empty.
Excerpt from the doc:
-M, --multiline
Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line. When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)

Suppress the match itself in grep

Suppose I have lots of files in the form of
First Line Name
Second Line Surname Adress
Third Line etc
etc
Now I'm using grep to match the first line, but what I actually want is the second line. The second line is not a pattern that can be matched (it just depends on the first line). My regex pattern works and the command I'm using is
grep -rHIin pattern . -A 1 -m 1
The -A option prints the line after a match. The -m option stops after 1 match (there are other lines that match my pattern, but I'm only interested in the first match).
This actually works but the output is like that:
./example_file:1: First Line Name
./example_file-2- Second Line Surname Adress
I've read the manual but couldn't find any clue or info about that. Now here is the question.
How can I suppress the match itself ? The output should be in the form of:
./example_file-2- Second Line Surname Adress
sed to the rescue:
sed -n '2,${p;n;}'
The particular sed command here starts with line 2 of its input and prints every other line. Pipe the output of grep into that and you'll only get the even-numbered lines out of the grep output.
An explanation of the sed command itself:
2,$ - the range of lines from line 2 to the last line of the file
{p;n;} - print the current line, then ignore the next line (this then gets repeated)
(In this special case of all even lines, an alternative way of writing this would be sed -n 'n;p;' since we don't actually need to special-case any leading lines. If you wanted to skip the first 5 lines of the file, this wouldn't be possible, you'd have to use the 6,$ syntax.)
You can use sed to print the line after each match:
sed -n '/<pattern>/{n;p}' <file>
To get recursion and the file names, you will need something like:
find . -type f -exec sed -n '/<pattern>/{n;s|^|{}:|;p}' {} \;
If you have already read a book on grep, you could also read a manual on awk, another common Unix tool.
In awk, your task can be solved with nice, simple code. (As for me, I always have to refresh my knowledge of awk's syntax by going to the manual (info awk) when I want to use it.)
Or, you could come up with a solution combining find (to iterate over your files), grep (to select the lines) and head/tail (to discard, for each individual file, the lines you don't want). The complication with find is working with each file individually, discarding a line per file.
You could pipe the results through grep -v pattern
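For the awk route mentioned above, a minimal sketch could look like the following. It is not the original poster's code: the pattern /Name/, the variable name state and the demo files are assumptions, and grep's -i (ignore case) and -I (skip binary files) behaviour is not reproduced.

```shell
demo_dir=$(mktemp -d)
printf '%s\n' 'First Line Name' 'Second Line Surname Adress' 'Third Line Name' 'etc' > "$demo_dir/example_file"
printf '%s\n' 'nothing to see' 'here' > "$demo_dir/other_file"

# state is reset at the start of every file (FNR == 1):
#   0 = still looking for the pattern
#   1 = the previous line matched, so print this one
#   2 = done with this file (mimics grep -m 1)
after_match=$(cd "$demo_dir" && find . -type f -exec awk '
    FNR == 1             { state = 0 }
    state == 1           { print FILENAME "-" FNR "-" $0; state = 2 }
    state == 0 && /Name/ { state = 1 }
' {} + | sort)

printf '%s\n' "$after_match"
rm -rf "$demo_dir"
```

The rule order matters: the print rule runs before the match rule, so the matching line itself is never printed, only the line after it.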

find and replace double newlines with perl?

I'm cleaning up some web pages that for some reason have about 8 line breaks between tags. I wanted to remove most of them, and I tried this
perl -pi -w -e "s/\n\n//g" *.html
But no luck. For good measure, I tried
perl -pi -w -e "s/\n//g" *.html
and it did remove all my line breaks. What am I doing wrong?
Edit: I also tried \r\n\r\n, same deal. It works for a single line break but doesn't do anything for two consecutive ones.
Use -0:
perl -pi -0 -w -e "s/\n\n//g" *.html
The problem is that by default -p reads the file one line at a time. There's no such thing as a line with two newlines, so you didn't find any. The -0 changes the line-ending character to "\0", which probably doesn't exist in your file, so it processes the whole file at once. (Even if the file did contain NULs, you're looking for consecutive newlines, so processing it in NUL-delimited chunks won't be a problem.)
You probably want to adjust your regex as well, but it's hard to be sure exactly what you want. Try s/\n\n+/\n/g, which will replace any number of consecutive newlines with a single newline.
If the file is very large, you may not have enough memory to load it in a single chunk. A workaround for this is to pick some character that is common enough to split the file into manageable chunks, and tell Perl to use that as the line-ending character. But it also has to be a character that will not appear inside the matches you're trying to replace. For example, -0x2e will split the file on "." (ASCII 0x2E).
I was trying to replace a double newline with a single one using the above recommendation on a large file (2.3G). With huge files, it will segfault when trying to read the entire file at once. So instead of looking for a double newline, just look for lines where the only character is a newline:
perl -pi -w -e 's/^\n$//' file.txt