Using grep -P and lookahead/lookbehind to get text between patterns - regex

Assume the following is in file.txt:
---------
foo bar
more foo bar
---------
When I execute grep -P '(?<=-$)(?s:.)*(?=^-)' file.txt, I expect only the middle two lines to be matched, but this expression matches nothing. What's wrong?
I also tried grep -P '(?s)(?<=-$).*(?=^-)' file.txt, but got the same result.

Your pattern does not work because:
The -P option alone only makes grep use the PCRE regex engine; it still matches one line at a time.
Without other options, grep outputs whole matched lines. You need to add the -o option to output only the matched text(s), and -z to slurp the file into a single text.
Your regex has ^ and $ anchors that, by default, match the start/end of the whole string, not of each line. You need the m flag so that they match at line boundaries, together with the s flag (which makes . match any char, including line break chars).
So, you may use your regex with the m flag and -oz:
grep -Poz '(?ms)(?<=-$).*(?=^-)' file.txt
Or,
grep -Poz '(?s)-\R\K.*(?=\R-)' file.txt
where \R matches any line break sequence and \K omits the text matched so far from the overall match.
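As a rough check of the points above, the sketch below pipes the sample text into grep on standard input (instead of reading file.txt, only to keep the example self-contained) and compares the result with and without -z. It assumes GNU grep built with PCRE support; the function and variable names are illustrative.

```shell
# Sample text from the question, piped in instead of read from file.txt.
sample() { printf '%s\n' '---------' 'foo bar' 'more foo bar' '---------'; }

# Without -z, grep tests one line at a time, so -\R has no line break to match.
no_z=$(sample | grep -Po '(?s)-\R\K.*(?=\R-)' || true)

# With -z the input is a single record; tr turns the trailing NUL into a newline.
with_z=$(sample | grep -Poz '(?s)-\R\K.*(?=\R-)' | tr '\0' '\n')

printf 'without -z: [%s]\n' "$no_z"
printf 'with -z:\n%s\n' "$with_z"
```

With -z the output ends in a NUL byte rather than a newline, which is why the tr step is there.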

Related

How to do multiple grep pattern to find value in grepped string

I am trying to do multiple grep pattern to find a number within a grepped string.
I have a text file like this:
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
I am trying to find a way to get just the number of total lines. So the output here should be "3"
Here are the things I've tried:
grep "total lines: [0-9]" myfile.txt
grep "total lines" myfile.txt | grep "[0-9]"
You could use sed:
sed -En 's/^total lines: ([0-9]+).*/\1/p' myfile.txt
-E extended regular expressions
-n suppress automatic printing
Match ^total lines: ([0-9]+).* (capture the number)
\1 replace the whole line with the captured number
p print the result
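A minimal sketch of the sed command above, with the question's sample text held in a shell variable (an assumption, in place of myfile.txt) so it can be run as is:

```shell
# Sample input from the question, kept in a variable instead of myfile.txt.
sample='This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot'

# -n plus the p flag prints only the line where the substitution succeeded.
total=$(printf '%s\n' "$sample" | sed -En 's/^total lines: ([0-9]+).*/\1/p')
echo "$total"
```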
1st solution: Using GNU grep, try the following. The -o option prints only the matched text, and -P enables PCRE. The regex matches total lines: at the start of a line, then \K discards what has been matched so far (so it is left out of the output), then it matches one or more digits, and finally a positive lookahead makes sure the digits are followed by whitespace and tot.
grep -oP '^total lines: \K[0-9]+(?=\s+tot)' Input_file
2nd solution: With your shown samples, please try the following awk. It finds the line that contains the string total lines: and prints the second-to-last field of that line.
awk '/total lines: /{print $(NF-1)}' Input_file
3rd solution: Using awk's match function. It matches total lines: [0-9]+ tot and then removes everything except the digits from the matched text.
awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}' Input_file
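The three solutions can be compared side by side on the one relevant line. This is only a sketch: it assumes GNU grep with PCRE support for the first command, and the variable names are made up for illustration.

```shell
line='total lines: 3 tot'

# 1st solution: \K drops the prefix, the lookahead checks for " tot".
with_grep=$(printf '%s\n' "$line" | grep -oP '^total lines: \K[0-9]+(?=\s+tot)')

# 2nd solution: the number is the second-to-last whitespace-separated field.
with_field=$(printf '%s\n' "$line" | awk '/total lines: /{print $(NF-1)}')

# 3rd solution: cut out the matched text, then strip every non-digit from it.
with_match=$(printf '%s\n' "$line" | awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}')

echo "$with_grep $with_field $with_match"
```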
Do you have to use grep?
$ wc -l < myfile.txt
If you mean that the file has a line in it formatted as
total lines: 3 tot
Then refer to https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match and use something like:
grep -Po 'total lines: \K\d+' myfile.txt
Notes:
Perl regex is not my forte, so the \K\d+ part might not work.
This may be doable without -P, but I cannot test from this Windows computer.
regex101.com helped me test the above line, so it may work.
The problem with relying on the pattern of the last line and using grep/sed to find it is that if any other line in the file contains the same pattern, you will need additional logic to filter it out.
E.g., consider the input file below.
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot
Assuming your file format is constant (i.e., the second-to-last line is blank and the last line contains the total count), instead of using any pattern-matching command you can directly count the number of rows with the awk command below.
awk 'END { print NR - 2 }' myfile.txt
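To illustrate the problem described above, the sketch below runs a pattern-based extraction on the second sample, which then reports the number once per matching line. The awk variant that keeps only the last hit is a hypothetical workaround, not something proposed in the thread; GNU grep with PCRE support is assumed.

```shell
sample='line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot'

# Pattern matching alone reports the number once per matching line.
all_hits=$(printf '%s\n' "$sample" | grep -oP '^total lines: \K\d+')

# Hypothetical workaround: remember the latest hit and print it at end of input.
last_hit=$(printf '%s\n' "$sample" | awk '/^total lines:/ {n = $3} END {print n}')

printf 'all hits:\n%s\nlast hit: %s\n' "$all_hits" "$last_hit"
```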
You can use the following awk to get the third field of the first line that starts with total lines: and stop processing the file further:
awk '/^total lines:/{print $3; exit}' file
You can use the following GNU grep:
# Extract a non-whitespace chunk after a certain pattern
grep -oP '^total lines:\s*\K\S+' file
# Extract a number after a pattern
grep -oP '^total lines:\s*\K\d+(?:\.\d+)?' file
Details:
^ - start of string (here, of each line, since grep matches line by line)
total lines: - a literal string
\s* - any zero or more whitespace chars
\K - match reset operator discarding all text matched so far
\S+ - one or more non-whitespace chars
\d+(?:\.\d+)? - one or more digits and then an optional sequence of . and one or more digits.
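The difference between the two patterns shows up when something is glued to the number. The input line below is hypothetical (it is not from the question), and GNU grep with PCRE support is assumed.

```shell
# Hypothetical line where the value is a decimal with a unit attached.
line='total lines:   42.5units'

# \S+ takes the whole non-whitespace chunk after the prefix.
chunk=$(printf '%s\n' "$line" | grep -oP '^total lines:\s*\K\S+')

# \d+(?:\.\d+)? stops as soon as the text no longer looks like a number.
number=$(printf '%s\n' "$line" | grep -oP '^total lines:\s*\K\d+(?:\.\d+)?')

echo "chunk=$chunk number=$number"
```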

How can I get "grep -zoP" to display every match separately?

I have a file on this form:
X/this is the first match/blabla
X-this is
the second match-
and here we have some fluff.
And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".
So for the given sample file I would like to have this output:
this is the first match
and then
this is
the second match
I managed to get all the content between X followed by a marker by using:
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
That is:
grep -Po '(?<=X(.))(.|\n)+(?=\1)' to match X followed by (something) that gets captured and matched at the end with (?=\1) (I based the code on my answer here).
Note I use (.|\n) to match anything, including a new line, and that I also use -z in grep to match new lines as well.
So this works well, the only problem comes from the display of the output:
$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match
As you can see, all the matches appear together, with "this is the first match" being followed by "this is the second match" with no separator at all. I know this comes from the usage of "-z", that treats all the file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
So: is there a way to get all these results separately?
I tried also in GNU Awk:
awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file
but not even the (\n|.*) worked.
awk doesn't support backreferences within a regexp definition.
Workarounds:
$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match
# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match
You can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use a non-greedy quantifier (.+?) here, to avoid matching match+xyz+foobaz for an input X+match+xyz+foobaz+.
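A small sketch of the greedy versus non-greedy point, followed by the \K form applied to the question's sample with tr as the separator fix. It assumes GNU grep with PCRE support; the sample text is piped in rather than read from ip.txt.

```shell
# Input from the note above: the marker + also occurs later on the line.
line='X+match+xyz+foobaz+'

greedy=$(printf '%s\n' "$line" | grep -oP 'X(.)\K.+(?=\1)')
lazy=$(printf '%s\n' "$line" | grep -oP 'X(.)\K.+?(?=\1)')
echo "greedy=$greedy lazy=$lazy"

# The question's sample with the lazy form, -z, and tr to separate the matches.
separated=$(printf '%s\n' 'X/this is the first match/blabla' 'X-this is' \
  'the second match-' 'and here we have some fluff.' |
  grep -zoP '(?s)X(.)\K.+?(?=\1)' | tr '\0' '\n')
printf '%s\n' "$separated"
```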
With perl
$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match
Here is another gnu-awk solution making use of RS and RT:
awk -v RS='X.' 'ch != "" && n=index($0, ch) {
print substr($0, 1, n-1)
}
RT {
ch = substr(RT, 2, 1)
}' file
this is the first match
this is
the second match
With GNU awk for multi-char RS, RT, and gensub() and without having to read the whole file into memory:
$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>
Obviously I added the "<" and ">" so you could see where each output record starts/ends.
The above assumes that the character after X isn't a non-repetition regexp metachar (e.g. ., ^, [, etc.), so YMMV.
The use case is kind of problematic, because as soon as you print the matches, you lose the information about where exactly the separator was. But if that's acceptable, try piping to xargs -r0.
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0
These options are GNU extensions, but then so is grep -z and (mostly) grep -P, so perhaps that's acceptable.
GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep will not do that:
pcregrep -Mo2 '(?s)X(.)(.+?)\1' file
-o2 (the -onumber form) prints only capture group 2, so no lookarounds are needed. The lazy quantifier ? was added in case \1 occurs again later.

Parsing only first regex match in a line with several matches

Is it possible to have a regex that parses only a1bcdea1 from this line a1bcdea1ABCa1DEFa1 ?
This grep command does not work:
$ cat txtfile
a1bcdea1ABCa1DEFa1
$ grep -oE "[A-Z,a-z]1.*?[A-Z,a-z]1" txtfile
a1bcdea1ABCa1DEFa1
I want the output of grep to be only a1bcdea1.
EDIT:
It is obvious that I can just use grep -o "a1bcdea1" for the above line, but consider if one has several thousands of lines and the goal is to match FIRST [A-Z,a-z]1.*?[A-Z,a-z]1 for each single line.
How about using a ^ start anchor and restricting the character set used:
grep -o '^[A-Za-z]1[A-Za-z]*1'
If you expect more digits or other characters in between, go with this
grep -oP '^[A-Za-z]1.*?[A-Za-z]1'
Lazy matching requires Perl-compatible mode (-P). If the match is not at the start of the line, go with this:
grep -oP '^.*?\K[A-Za-z]1.*?[A-Za-z]1'
\K resets the beginning of the reported match and is a PCRE feature as well.
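The two PCRE commands above can be tried on the question's line as follows. The second input, with two leading dashes, is a made-up variant to exercise the case where the match is not at the start of the line; GNU grep with PCRE support is assumed.

```shell
line='a1bcdea1ABCa1DEFa1'

# Anchored at the start of the line; lazy .*? stops at the first closing [A-Za-z]1.
anchored=$(printf '%s\n' "$line" | grep -oP '^[A-Za-z]1.*?[A-Za-z]1')

# Hypothetical variant with leading characters before the first marker.
shifted=$(printf '%s\n' "--$line" | grep -oP '^.*?\K[A-Za-z]1.*?[A-Za-z]1')

echo "anchored=$anchored shifted=$shifted"
```

Because both patterns start with ^, -o reports at most one match per line, which is what limits the output to the first occurrence.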
Here is a gnu awk solution using split function:
awk '(n = split($0, a, /[a-zA-Z]1/, b)) > 1 {print b[1] a[2] b[2]}' file
a1bcdea1
This awk command splits each line on regex /[a-zA-Z]1/ and stores split tokens in array a and delimiters in array b.

Matching strings with grep and \A regexp

Given the string in some file:
hel string1
hell string2
hello string3
I'd like to capture just hel using cat file | grep 'regexp here'
I tried doing a bunch of regexp but none seem to work. What makes the most sense is: grep -E '\Ahel' but that doesn't seem to work. It works on http://rubular.com/ however. Any ideas why that isn't working with grep?
Also, when pasting the above string with a tab space before each line, the \A does not seem to work on rubular. I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
ERE (-E) does not support \A as a start-of-string anchor. Try ^ instead.
Use -m 1 to stop grepping after the first match in each file.
If you want grep to print only the matched string (not the entire line), use -o.
Use -h if you want to suppress the printing of filenames in the grep output.
Example:
grep -Eohm 1 "^hel" *.log
If you need to enforce only outputting if the search string is on the first line of the file, you could use head:
head -qn 1 *.log | grep -Eoh "^hel"
ERE doesn't support \A, but PCRE does, so grep -P (if available) can be used with the same regex:
grep -P '\Ahel\b' file
hel string1
It is also important to use the word boundary \b to avoid matching hell and hello.
Alternatively, in ERE you can use (\b is a GNU extension there):
grep -E '^hel\b' file
hel string1
I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
\A matches only the very beginning of the text; it doesn't match at the start of each line when your text has more than one line.
Anyway, grep without -P doesn't support \A, so you need to use ^, which, unlike \A, matches the start of each line in multi-line mode.
Using awk
awk '$1=="hel"' file
PS: you do not need to cat the file into grep; use grep 'regexp here' file
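The answers above can be checked against the sample lines with the sketch below. It assumes GNU grep with PCRE support and pipes the sample in rather than reading it from a file; the variable names are illustrative.

```shell
sample='hel string1
hell string2
hello string3'

# grep matches each line on its own, so \A and ^ select the same line here.
with_A=$(printf '%s\n' "$sample" | grep -P '\Ahel\b')
with_caret=$(printf '%s\n' "$sample" | grep -P '^hel\b')

# -o prints just the matched text; awk compares the first field exactly.
only_match=$(printf '%s\n' "$sample" | grep -oP '^hel\b')
with_awk=$(printf '%s\n' "$sample" | awk '$1 == "hel"')

printf '%s\n' "$with_A" "$with_caret" "$only_match" "$with_awk"
```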

Grep. Line of text that does not end with "abcd"?

Hi I'm looking for a regular expression for: line of text that does not end with a certain word, let's say it's "abcd"
At first I tried with
.*[^abcd]$
That one doesn't work of course. It matches a line that doesn't end with any of the letters a,b,c or d.
So, in Advanced Grep Topics, I found this expression, but couldn't get it to work:
^(?>.*)(?<=abcd)
->
grep -e "^(?>.*)(?<=abcd)$"
Any idea for the expression I need?
Have a look at grep's -v option
grep -v 'abcd$'
If you really meant word rather than just "sequence of characters" then use
grep -v '\babcd$'
\b meaning "word-boundary"
Give this a shot:
grep -v "\<abcd\>$"
Proof of Concept
$ printf "%s\n" "foo abcd bar baz" "foo bar baz abcd" "foo bar bazabcd" | grep -v "\<abcd\>$"
foo abcd bar baz
foo bar bazabcd
Note: This matches whole words only, as shown by the fact that the 3rd line was returned even though abcd is its last 4 letters.
grep supports PCRE regular expressions when using the -P flag.
One of the reasons grep -e "^(?>.*)(?<=abcd)$" does not work is that the lookaround you are using is positive, which is the opposite of what is required. (?<= is the syntax for positive lookbehind, which tells the regex engine to search for lines that end with abcd. Another reason is that -e does not enable PCRE; that requires -P.
To search for lines that do not end with a certain string, you need to use negative lookbehind. The syntax for negative lookbehind is (?<!. Because it includes an exclamation mark, which bash will try to interpret as a history expansion inside double quotes, use single quotes to supply the regex to grep.
I used the following regex to search for lines that do not end with log.
grep -P '(?<!log)$' < <inputfile>
Similarly, you can use the above command and replace log with whatever string you want to exclude.
This regex is also useful in other programs that have no inverse-matching option like grep's -v.
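As a rough comparison, the sketch below runs the -v approach and the negative-lookbehind approach on the three test lines from the proof of concept above, for both the plain string and the whole-word case. It assumes GNU grep with PCRE support; the (?<!\babcd)$ form is an illustrative adaptation, not taken from the answers.

```shell
sample='foo abcd bar baz
foo bar baz abcd
foo bar bazabcd'

# Lines that do not end with the character sequence abcd.
inverted=$(printf '%s\n' "$sample" | grep -v 'abcd$')
lookbehind=$(printf '%s\n' "$sample" | grep -P '(?<!abcd)$')

# Lines that do not end with the whole word abcd (illustrative adaptation).
word_inverted=$(printf '%s\n' "$sample" | grep -v '\babcd$')
word_lookbehind=$(printf '%s\n' "$sample" | grep -P '(?<!\babcd)$')

printf '%s\n' "$lookbehind" "--" "$word_lookbehind"
```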