Is there a truly universal wildcard in Grep? [duplicate] - regex

Really basic question here. I'm told that a dot (.) matches any character except a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is capture all the text in a web page between two specific strings, stripping the header and the footer: something like HEADER TEXT(.+)FOOTER TEXT, and then extract what's in the parentheses. But I can't find a way to include all text and line breaks between header and footer. Does this make sense? Thanks in advance!

When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note that I'm using a non-greedy pattern, so the match stops at the first place where the rest of the regex can match.
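For comparison, here is a minimal Python sketch of the same idea. It assumes the page text is already in a string; the sample `page` value is made up for illustration, and HEADER TEXT / FOOTER TEXT stand in for the real delimiters.

```python
import re

# Hypothetical page text with two FOOTER TEXT occurrences.
page = "HEADER TEXT\nfirst line\nsecond line\nFOOTER TEXT\ntrailing FOOTER TEXT"

# [\s\S] matches any character, newlines included; *? stops at the first FOOTER TEXT.
lazy = re.search(r"HEADER TEXT([\s\S]*?)FOOTER TEXT", page)

# The greedy form runs on to the last FOOTER TEXT instead.
greedy = re.search(r"HEADER TEXT([\s\S]*)FOOTER TEXT", page)

print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

The lazy capture ends just before the first footer, while the greedy one swallows the first footer and stops at the second.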

You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
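The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible. Note that -000 reads the file in paragraph mode (records separated by blank lines), so a match cannot span a blank line; use -0777 instead to read the whole file at once.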
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
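To illustrate the "use a real parser" advice, here is a small sketch using Python's standard-library html.parser. The HeadText class and the sample document are hypothetical names invented for this example; treat it as a starting point under those assumptions, not a complete scraper.

```python
from html.parser import HTMLParser


class HeadText(HTMLParser):
    """Collect the text that appears between <head> and </head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        # Only keep text seen while inside the <head> element.
        if self.in_head:
            self.chunks.append(data)


# Hypothetical sample document.
sample = "<html><head>\n<title>Demo</title>\n</head><body>ignored</body></html>"
parser = HeadText()
parser.feed(sample)
parser.close()
head_text = "".join(parser.chunks).strip()
print(head_text)
```

Unlike the regex, the parser tracks tag boundaries itself, so line breaks inside the element need no special handling.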

By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use Perl (but you could use Python if you prefer). I'd consider slurping the whole file and then using a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by @gbacon is an almost exact transliteration into Perl of the sed script above and is neater than slurping.
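Since Python is mentioned as an option, here is a rough sketch of what the sed range above does, written as a line-by-line state machine. The function name print_range and the sample lines are assumptions made for illustration.

```python
import re


def print_range(lines, start_pat, end_pat):
    """Return lines from each start match through the next end match,
    roughly like: sed -n '/start/,/end/p'."""
    out = []
    inside = False
    for line in lines:
        if inside:
            out.append(line)
            # The end pattern is only tested on lines after the opening line.
            if re.search(end_pat, line):
                inside = False
        elif re.search(start_pat, line):
            inside = True
            out.append(line)
    return out


# Hypothetical input lines.
lines = ["intro", "HEADER TEXT", "body 1", "body 2", "FOOTER TEXT", "outro"]
selected = print_range(lines, r"HEADER TEXT", r"FOOTER TEXT")
print("\n".join(selected))
```

Because the end pattern is not checked on the line that opens the range, a line containing both delimiters starts a range that runs on to a later footer line, which is the limitation noted above.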

The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.

As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s):
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
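The same inline modifier exists in other engines. As a quick check outside BBEdit, this sketch uses Python's re module (a different engine from BBEdit's, so take it as an analogy); the sample text is made up.

```python
import re

# Hypothetical three-line text.
text = "line one\nline two\nline three"

# Without the modifier, the dot stops at the first line break.
default_dot = re.match(r".+", text).group(0)

# With (?s) at the start of the pattern, the dot also matches newlines.
dotall = re.match(r"(?s).+", text).group(0)

print(repr(default_dot))
print(repr(dotall))
```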

As pointed out elsewhere, grep works on single lines.
For multiple lines (in Ruby with Regexp::MULTILINE, or in Python, awk, sed, whatever), \s should also match line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...

Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file
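If gawk is not available, the record-splitting trick can be approximated in Python. This is a sketch under the assumption that the whole input fits in memory; the function name between and the sample text are invented for the example.

```python
import re


def between(text, header="HEADER", footer="FOOTER"):
    """Split on the footer, then strip everything up to the header
    from each record that contains one."""
    results = []
    for record in text.split(footer):
        if header in record:
            # (?s) lets .* cross line breaks inside the record.
            results.append(re.sub(r"(?s).*" + re.escape(header), "", record))
    return results


# Hypothetical input containing two header/footer pairs.
text = "junk HEADER keep\nthis FOOTER skip HEADER and this FOOTER tail"
parts = between(text)
print(parts)
```

As with the awk version, a trailing record that has a header but no footer is still returned.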

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl's -p flag reads input one line at a time, so you can't expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this with the flag -0777 (documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
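The difference between line-at-a-time and whole-input processing is not specific to Perl. The following Python sketch shows the same effect; it assumes the second line of the sample starts with a space, as the \s in the pattern requires.

```python
import re

# Assumed sample: note the leading space on the second line.
text = "load Add 20 percent\n to accommodate\n"
pattern = r"([a-z])\n\s(to)"

# Line by line: each chunk ends at its newline, so nothing follows \n to match.
per_line = "".join(
    re.sub(pattern, r"\1 \2", line) for line in text.splitlines(keepends=True)
)

# Whole input at once: the newline and the next line are visible together.
whole = re.sub(pattern, r"\1 \2", text)

print(repr(per_line))
print(repr(whole))
```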
As a reminder, here is your initial Perl command:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and never whitespace. So, as a start, add a repetition quantifier and allow the class to consume spaces as well. Also, because the match consumes the 'to', you must put it back in the replacement string, as in the example Perl program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/$1 to/;
print "after:\n$str\n";
This program produces the output below:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood correctly what you want to do, your substitution should look more like:
s/([ a-z]+)\n\s*to/$1 to/ (note the leading space before 'a-z').

Issues with regex when searching pattern on two lines

I know this type of search has been addressed in a few other questions here, but for some reason I cannot get it to work in my scenario.
I have a text file that contains something similar to the following pattern:
some text here done
12345678_123456 226-
more text
some more text here done
12345678_234567 226-
I'm trying to find all cases where done is followed by 226- on the next line, with 16 characters preceding it. I tried grep -Pzo and pcregrep -M but both return nothing.
I attempted multiple combinations of regex to take into account the 2 lines and the 16 chars in between. This is one of the examples I tried with grep:
grep -Pzo '(?s)done\n.\{16\}226-' filename
Related posts:
How to find patterns across multiple lines using grep?
Regex (grep) for multi-line search needed [duplicate]
How can I search for a multiline pattern in a file?
Generalize it to this: (?m)done$\s+.*226-$
Requiring a \n after 226- is a bad thing, because it fails at the end of the string.
Not requiring a \n after 226- is also a bad thing.
The paradox could be solved with (\n|$), but why use \n at all?
Both problems are solved with multiline mode and $.
https://regex101.com/r/A33cj5/1
You must not escape { and } while using -P (PCRE) option in grep. That escaping is only for BRE.
You can use:
grep -ozP 'done\R.{16}226-\R' file
done
12345678_123456 226-
done
12345678_234567 226-
\R will match any Unicode newline sequence. If you are only dealing with \n then you may just use:
grep -ozP 'done\n.{16}226-\n' file
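The escaping point carries over to other Perl-style engines. As a sketch, here is the same check in Python's re module, using the sample text from the question; Python is only a stand-in here and is not what grep -P runs.

```python
import re

# Sample text from the question.
text = (
    "some text here done\n"
    "12345678_123456 226-\n"
    "more text\n"
    "some more text here done\n"
    "12345678_234567 226-\n"
)

# Unescaped braces form a quantifier: exactly 16 characters before 226-.
matches = re.findall(r"done\n.{16}226-", text)

# Escaped braces are literal { and } characters, so nothing matches.
escaped = re.findall(r"done\n.\{16\}226-", text)

print(matches)
print(escaped)
```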

How do I reference a shell variable and arbitrary digits inside a grep regex?

I am looking to translate this regular expression into grep flavour:
I am trying to filter all lines that contain refs/changes/\d+/$VAR/
Example of line that should match, assuming that VAR=285900
b3fb1e501749b98c69c623b8345a512b8e01c611 refs/changes/00/285900/9
Current code:
VAR=285900
grep 'refs/changes/\d+/$VAR/' sample.txt
I am trying to filter all lines that contain refs/changes/\d+/$VAR/
That would be
grep "refs/changes/[[:digit:]]\{1,\}/$VAR/"
or
grep -E "refs/changes/[[:digit:]]+/$VAR/"
Note that the \d+ notation is a perl thing. Some overfeatured greps might support it with an option, but I don't recommend it for portability reasons.
inside simple quotes I cannot use variable expansion
You can mix and match quotes:
foo=not; echo 'single quotes '"$foo"' here'
with double quotes it doesn't match anything.
It's not clear what you're doing, so we can't say why it doesn't work. It should work. There is no need to escape forward slashes for grep, they don't have any special meaning.
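For readers doing the same filtering from a script rather than the shell, here is a hedged Python sketch. The second input line is a made-up non-matching entry, and re.escape is used on the assumption that the variable should be matched literally.

```python
import re

var = "285900"

# \d+ is supported by Python's re; the variable is escaped and interpolated.
pattern = re.compile(rf"refs/changes/\d+/{re.escape(var)}/")

lines = [
    "b3fb1e501749b98c69c623b8345a512b8e01c611 refs/changes/00/285900/9",
    # Hypothetical line that should not match.
    "0000000000000000000000000000000000000000 refs/changes/00/285901/1",
]

hits = [line for line in lines if pattern.search(line)]
print(hits)
```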

How can I match all occurrences of a phrase not tagged with desired LaTeX tag?

Purpose
Find all occurrences of "Bernie Sanders" that have not been tagged with \senator{ }. I want to list matches with grep for visual inspection. Then I'd like to fix up all files recursively with a single command (e.g. sed, which does not support non-greedy regex).
Example File
Bernie Sanders
\senator{Bernie Sanders}
The senator of Vermont is \senator{Bernie Sanders}.
A \texttt{senator of Vermont} is Bernie Sanders.
A senator of Vermont is \textit{Bernie Sanders}.
\textit{Bernie Sanders} is a senator of Vermont.
Is this the same Bernie Sanders?
Is Bernie Sanders a good senator?
Will we ever see a \textbf{President Bernie Sanders}?
Problem
The regular expression must not "accidentally" interfere with other commands.
An attempt:
\[^senator]*{Bernie Sanders
I am not sure how to exclude senator, yet include spaces, and other commands before Bernie Sanders.
Start
may be preceded by a space
not preceded by \senator{
may be first thing on a line
End
may end with space, colon, semicolon, question mark, comma, parenthesis, bracket (if within another command), full stop/period
(?<!\\senator{)Bernie Sanders(?!\s*})
You can use this with grep -P. Lookarounds will make sure it is not wrapped in the senator tag. See demo.
https://regex101.com/r/vV1wW6/7
This will only match if the prefix \senator is absent
(?<!\\senator\{)Bernie\sSanders
Another solution would be to use a Perl script. The following one-liner works on your example (it uses one of the previous regexps):
perl -pe 's/(?<!\\senator{)Bernie\sSanders/The New Bernie Sanders/g' input.tex > output.tex
Basically, it replaces every match of the regular expression in the file input.tex with the string "The New Bernie Sanders". The result is printed to output.tex.
Furthermore, this script can be included in a bash script to process several files:
#!/bin/bash
for i in {1..3}
do
perl -pe 's/(?<!\\senator{)Bernie\sSanders/The New Bernie Sanders/g' input$i.tex > output$i.tex
done
This script processes the files input1.tex, input2.tex, input3.tex and writes the files output1.tex, output2.tex, output3.tex.
(the loop is very basic but my point was simply to show that the one-liner could easily be included in a bash script).
A solution with grep (underlying assumptions: \senator{Bernie Sanders} occurs in one line; there aren't good and bad hits in the same line)
grep "Bernie Sanders" input.tex | grep -v -e '\\senator{Bernie Sanders}'
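The negative lookbehind also works in Python's re module, since the lookbehind has a fixed width. The sketch below lists untagged lines and then wraps the untagged occurrences; wrapping them in \senator{...} is an assumption about the intended fix, and the sample lines are a subset of the example file.

```python
import re

# A few lines taken from the example file.
sample = [
    r"Bernie Sanders",
    r"\senator{Bernie Sanders}",
    r"A senator of Vermont is \textit{Bernie Sanders}.",
    r"Is Bernie Sanders a good senator?",
]

# Match the name only when it is not directly preceded by \senator{
untagged = re.compile(r"(?<!\\senator\{)Bernie Sanders")

# Lines that still need attention (like grep -P).
hits = [line for line in sample if untagged.search(line)]

# Assumed fix: wrap each untagged occurrence in \senator{...}.
fixed = [untagged.sub(r"\\senator{Bernie Sanders}", line) for line in sample]

print(hits)
print(fixed)
```

Running the substitution a second time changes nothing, because every occurrence is then preceded by \senator{.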

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
As a result I need to extract this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file using the same command.
Please help me rewrite the command, or suggest another one.
Or, if you insist on using grep, you can use:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
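To see the greedy versus non-greedy difference in isolation, here is a Python sketch. Python's built-in re module has no \K, so a lookbehind stands in for it here; the second "test ... </singleline>" pair in the sample is made up to show multiple matches.

```python
import re

# Sample with a hypothetical second occurrence added.
text = "test This info=\ni need grep too</singleline>\ntest second</singleline>\n"

# Non-greedy: each match stops at the closest </singleline>.
lazy = re.findall(r"(?s)(?<=test ).*?(?=</singleline>)", text)

# Greedy: a single match runs to the furthest </singleline>.
greedy = re.findall(r"(?s)(?<=test ).*(?=</singleline>)", text)

print(lazy)
print(greedy)
```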
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following two-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back to their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with
needing to read the entire file at once, as in Maroun Maroun's helpful answer.
Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
needing to install another utility, pcregrep, as in Wintermute's helpful answer.
additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
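The two-pass idea (join soft line breaks first, then match line by line) can be sketched in Python as follows. The function name join_soft_breaks is hypothetical, and the sample combines both example files from the question into one string.

```python
import re


def join_soft_breaks(text):
    """Pass 1: drop each line-ending '=' together with its newline."""
    return re.sub(r"=\n", "", text)


# Contents of the two example files, concatenated for the demo.
text = (
    "test This info i need grep</singleline>\n"
    "test This info=\n"
    "i need grep too</singleline>\n"
)

joined = join_soft_breaks(text)

# Pass 2: plain single-line matching, non-greedy up to the closing tag.
found = re.findall(r"test (.*?)</singleline>", joined)
print(found)
```

Because a soft line break is removed without inserting a space, the two halves of the second string are joined directly.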