Why GREP can't tolerate multiple \n characters [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 5 years ago.
I am trying to use GREP to select multi-line records from a file.
The records look something like this:
########## Ligand Number : 1
blab bla bla
bla blab bla
########## Ligand Number : 2
blab bla bla
bla blab bla
########## Ligand Number : 3
bla bla bla
<EOF>
I am using Perl RegEx (-P).
To work around GREP's single-line limitation, I use grep -zo. This way, the parser can consume multiple lines and output exactly what I want. Generally, it works fine.
However, the problem is that the delimiter here is two empty lines after the last line of each record (three consecutive '\n' characters: one ending the line and two for the two empty lines).
When I try to use an expression like
grep -Pzo '^########## Ligand Number :\s+\d+.+?\n\n\n' inputFile
it returns nothing. It seems that grep can't tolerate consecutive '\n' characters.
Can anybody give an explanation?
P.S. I have already worked around it by translating the '\n' characters to '\a' first, then translating them back, as in the following example:
cat inputFile | tr '\n' '\a' | grep -Po '########## Ligand Number :\s+\d+\a.+?\a\a\a' | tr '\a' '\n'
But I need to understand why GREP couldn't understand the '\n\n\n' pattern.

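In a PCRE regex, . does not match line break characters by default; the s (DOTALL) modifier makes . match them as well, which is how the dot behaves in POSIX regex. That is why .+? can never reach the \n\n\n in your pattern.
Thus, add (?s) at the start, or replace . with [\s\S]: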
(?s)^########## Ligand Number :\s+\d+.+?\n\n\n
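The same behaviour can be reproduced outside grep. Here is a minimal sketch using Python's re module (an assumption for illustration; the question uses grep -P), with a made-up two-record sample in the question's layout:

```python
import re

# Hypothetical sample: every record ends with two empty lines ("\n\n\n").
text = (
    "########## Ligand Number : 1\nblab bla bla\nbla blab bla\n\n\n"
    "########## Ligand Number : 2\nblab bla bla\nbla blab bla\n\n\n"
)

pattern = r"^########## Ligand Number :\s+\d+.+?\n\n\n"

# Default mode: "." never matches "\n", so ".+?" cannot cross a line break.
without_dotall = re.findall(pattern, text, flags=re.MULTILINE)

# DOTALL via (?s): "." also matches "\n"; the lazy ".+?" stops at the
# first "\n\n\n" it meets, so each record is matched separately.
with_dotall = re.findall("(?s)" + pattern, text, flags=re.MULTILINE)

print(len(without_dotall), len(with_dotall))  # 0 2
```

So the consecutive '\n' characters are not the problem; the .+? in front of them is.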

Related

Unable to match multiple digits in regex

I am simply trying to print the 5- or 6-digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6-digit numbers. It skips the first digit of the 6-digit number for some reason, and it prints the last line, which I don't need as it doesn't contain any 5- or 6-digit number whatsoever.

grep is a better tool for this, with the -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E enables extended regex mode.
If you really want sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match the start of the line, or any text that ends with a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.*: Match the remaining text
\2: The replacement; it puts back only the digits captured in group #2
/p: Print the line only if a substitution was made
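To see why the original pattern drops the first digit, here is a small sketch in Python's re module (an assumption; the thread itself uses sed and grep). The leading greedy .* takes as much as it can and leaves only the minimum 5 digits for the group, while the (^|.*[^0-9]) prefix forces the digit run to start right after a non-digit:

```python
import re

lines = [
    "Random_something xyz ...64763",
    "Random2 Some String abc-778986",
    "Something something 676347",
    "Random string without numbers",
]

# The question's pattern: greedy ".*" swallows the first digit of a 6-digit run.
greedy = re.compile(r"^.*([0-9]{5,6}).*$")
# The answer's pattern: the digits must follow the line start or a non-digit.
anchored = re.compile(r"(?:^|.*[^0-9])([0-9]{5,6}).*")

greedy_hits = [m.group(1) for m in map(greedy.match, lines) if m]
anchored_hits = [m.group(1) for m in map(anchored.match, lines) if m]

print(greedy_hits)    # ['64763', '78986', '76347']
print(anchored_hits)  # ['64763', '778986', '676347']
```

Lines without a match are skipped here, which plays the role of sed's -n together with /p.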

With awk, you could try the following. Simply put, it uses awk's match function with a regex for 5 to 6 digits on each line; if a match is found, it prints the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

How can I use a look after to match either a single or a double quote?

I have a series of strings I want to extract:
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
other stuff
What I need to get (from many files, but this is not important here) is the content between hello.this_is( and ), so my desired output is:
bla bla bla
hello hello
As you see, the text within parentheses can be enclosed with either double or single quotes.
If this was only single quotes I would use a look behind and look ahead just like this:
grep -Po "(?<=hello.this_is\(').*(?=')" file
# ^ ^
# returns ---> hello hello
Similarly, to get strings from double quotes I would say:
grep -Po '(?<=hello.this_is\(").*(?=")' file
# ^ ^
# returns ---> bla bla bla
However, I want to match both cases, so it gets both single and double quotes. I tried with using $'' to escape, but could not make it work:
grep -Po '(?<=hello.this_is\($'["\']').*(?=$'["\']')' file
# ^^^^^^^^ ^^^^^^^^
I can of course use the ASCII number and say:
grep -Po '(?<=hello.this_is\([\047\042]).*' file
but I would like to use the quote characters themselves, since 047 and 042 are not as readable to me as single and double quotes are.

Note: The sed command at the bottom of this answer works only as long as your strings are well-behaved strings like
"foo"
or
'bar'
As soon as your strings start to misbehave :) like:
"hello \"world\""
it won't work any more.
Your input looks like source code. For a stable solution I recommend using a parser for that language to extract the strings.
For trivial use cases:
You can use sed. The solution is supposed to work on any POSIX platform in contrast to grep -oP which only works with GNU grep:
sed -n 's/hello\.this_is(\(["'\'']\)\([^"]*\)\(["'\'']\).*/\2/gp' file
# ^^^^^^^^ ^^
# capture group 2 ^

Use a capturing group and look for its content like the following:
grep -Po 'hello\.this_is\(([\047"])((?!\1).|\\.)*\1\)' file
This cares about escaped characters too e.g. hello.this_is("bla b\"la bla")
If the output should be what comes between parentheses then utilize both \K and a positive lookahead:
grep -Po 'hello\.this_is\(([\047"])\K((?!\1).|\\.)*(?=\1\))' file
Outputs:
bla bla bla
hello hello

Based on revo's and hek2mgl's excellent answers, I ended up using grep like this:
grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file
Which can be explained as:
grep
-Po use the Perl regexp engine and print only the matches
'(?<=hello\.this_is\((["'\''])).*(?=\1)' the expression
(?<=hello\.this_is\((["'\''])) look-behind: search for strings preceded by "hello.this_is(" followed by either ' or ". Also, capture this last character to be used later on.
.* match everything...
(?=\1) until the captured character (that is, either ' or ") appears again.
The key here was to use ["'\''] to indicate either ' or ". With '\'' we close the single-quoted shell string, insert a literal ' (which we have to escape), and open the single-quoted string again.
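Both ideas can be checked in a short Python sketch (an assumption for illustration; the thread uses GNU grep). shlex.split shows what the shell hands over for the '\'' trick, and since Python's re has no \K, the sketch uses a plain capture group with a back-reference instead of look-arounds:

```python
import re
import shlex

# How a POSIX shell reads the quoted argument '(["'\''])':
# the pieces '(["'  \'  '])' are concatenated into a single word.
quote_group = shlex.split(r"""'(["'\''])'""")[0]
print(quote_group)  # (["'])

# Capture the opening quote in group 1 and require the same quote to close.
pattern = re.compile(r"hello\.this_is\(" + quote_group + r"(.*?)\1\)")

text = (
    'hello.this_is("bla bla bla")\n'
    "some random text\n"
    "hello.this_is('hello hello')\n"
    "other stuff\n"
)
contents = [m.group(2) for m in pattern.finditer(text)]
print(contents)  # ['bla bla bla', 'hello hello']
```

Because the closing quote must equal the captured one, a string such as hello.this_is("it's") still yields it's.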

Highlight all keys that look like '&name=' in a text with grep console [duplicate]

I want to grep the shortest match and the pattern should be something like:
<car ... model=BMW ...>
...
...
...
</car>
... means any character and the input is multiple lines.

You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?.
By default grep doesn't support non-greedy modifiers, but you can use grep -P to use the Perl syntax.
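A minimal sketch of the difference, written with Python's re module (an assumption; grep -P behaves the same way for these two patterns) on a made-up one-line input:

```python
import re

# Hypothetical input with two <car> elements on one line.
text = "<car id=1 model=BMW>a</car><car id=2 model=BMW>b</car>"

# Greedy: ".*" runs to the LAST "</car>", giving one long match.
greedy = re.findall(r"<car .*</car>", text)
# Lazy: ".*?" stops at the FIRST "</car>", giving one match per element.
lazy = re.findall(r"<car .*?</car>", text)

print(len(greedy), len(lazy))  # 1 2
```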

Actually, .*? only works with Perl-style syntax. I am not sure what the equivalent grep extended regexp syntax would be. Fortunately you can use Perl syntax with grep, so grep -P would work, but grep -E (which is the same as egrep) would not: it would be greedy.
See also: http://blog.vinceliu.com/2008/02/non-greedy-regular-expression-matching.html

grep
For non-greedy match in grep you could use a negated character class. In other words, try to avoid wildcards.
For example, to fetch all links to jpeg files from the page content, you'd use:
grep -o '"[^" ]\+\.jpg"'
To deal with multiple lines, pipe the input through xargs first. For performance, use ripgrep.
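Here is a small sketch of the negated-class idea in Python's re module (an assumption; the answer uses plain grep), on a made-up HTML fragment. No lazy quantifier is needed because the class itself cannot run past a closing quote:

```python
import re

# Hypothetical page content with two quoted jpg links on one line.
html = '<img src="a.jpg"> <img src="b.jpg">'

# Wildcard: ".*" is greedy and spans from the first quote to the last ".jpg".
wildcard = re.findall(r'".*\.jpg"', html)
# Negated class: [^" ] cannot cross a quote or a space, so each link is separate.
negated = re.findall(r'"[^" ]+\.jpg"', html)

print(wildcard)  # ['"a.jpg"> <img src="b.jpg"']
print(negated)   # ['"a.jpg"', '"b.jpg"']
```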

My grep that works after trying out stuff in this thread:
echo "hi how are you " | grep -shoP ".*? "
Just make sure you append a space to each one of your lines
(Mine was a line by line search to spit out words)

Sorry I am 9 years late, but this might work for the viewers in 2020.
So suppose you have a line like "Hello my name is Jello".
Now you want to find the words that start with 'H' and end with 'o', with any number of characters in between. We don't want whole lines, just the words, so we add -o to print only the matched part:
grep -o "H[^ ]*o" file
This will return all the matching words. It works because any character except the space is allowed in between, so a match can never span multiple words on the same line.
Now you can replace the space character with any other separator you want.
Suppose the initial line was "Hello-my-name-is-Jello"; then you can get the words using the expression:
grep -o "H[^-]*o" file

The short answer is using the next regular expression:
(?s)<car .*? model=BMW .*?>.*?</car>
(?s) - makes . match line breaks too, so the match can span multiple lines
.*? - matches any character, any number of times, in a lazy way (minimal match)
A (little) more complicated answer is:
(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>
This makes it possible to match car1 and car2 in the following text
<car1 ... model=BMW ...>
...
...
...
</car1>
<car2 ... model=BMW ...>
...
...
...
</car2>
(..) represents a capturing group
\1 in this context matches the same text as most recently matched by capturing group number 1
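The back-reference version can be tried with a short Python sketch (an assumption; the answer targets a PCRE engine such as grep -P). The sample text is made up, with attributes placed so that " model=BMW " has the surrounding spaces the pattern requires:

```python
import re

# Hypothetical multi-line input with two differently named elements.
text = (
    "<car1 color=red model=BMW year=2020>\n...\n...\n</car1>\n"
    "<car2 color=blue model=BMW year=2021>\n...\n...\n</car2>\n"
)

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
pattern = re.compile(r"(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>")

tags = [m.group(1) for m in pattern.finditer(text)]
print(tags)  # ['car1', 'car2']
```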

I know that it's a bit of a dead post, but I just noticed that this works. It removed both clean-up and cleanup from my output.
> grep -v -e 'clean\-\?up'
> grep --version
grep (GNU grep) 2.20

egrep command for lines that have one or more instance of 1234 but no other numbers?

So I'm fairly new to regular expressions and I'm wondering how this would be implemented as an egrep command.
I basically want to look for lines in a file that have one or more instances of "1234", but no other numbers. (non-digit characters are allowed).
Examples:
1234 - valid
12341234 - valid
12345 - invalid (since 5 is there)

You can use grep to extract the lines that contain 1234, then replace 1234 with something that doesn't appear in the input, then remove lines that still contain any digits, and replace the special string back by 1234:
< input-file grep 1234 \
| sed 's/1234/\x1/g' \
| grep -v '[0-9]' \
| sed 's/\x1/1234/g'

So, we want to select lines that have 1234 one or more times but no other digits:
grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
How it works
The regex begins with ^ and ends with $. That means that it must match the whole line.
Inside the regex are two parts:
([^[:digit:]]*1234)+ matches one or more 1234 with no other digits.
[^[:digit:]]* matches any non-digits that follows the last 1234.
In olden times, one would use [0-9] to match digits. With Unicode, that is no longer reliable. So, we are using [:digit:], which is Unicode-safe.
Example
Let's use this test file:
$ cat file
this 1234 is valid
12341234 valid
not valid 12345
not 2 valid 1234 line
no numbers so not valid
Here is the result:
$ grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
this 1234 is valid
12341234 valid
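The same check as a minimal Python sketch (an assumption; the answer uses grep -E). Python's re has no [[:digit:]] class, so the sketch substitutes [^0-9] for [^[:digit:]]:

```python
import re

lines = [
    "this 1234 is valid",
    "12341234 valid",
    "not valid 12345",
    "not 2 valid 1234 line",
    "no numbers so not valid",
]

# One or more "1234" blocks, each preceded only by non-digits,
# followed by non-digits up to the end of the line.
pattern = re.compile(r"^([^0-9]*1234)+[^0-9]*$")

valid = [line for line in lines if pattern.search(line)]
print(valid)  # ['this 1234 is valid', '12341234 valid']
```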

If you want no other digit after your 1234 block:
egrep '\<(1234)+(\>|[^0-9])' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
[^0-9] --> non-digit characters
+ --> one or more times
If you want only "words" made up by the "1234" block, then you can egrep this:
egrep '\<(1234)+\>' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
+ --> one or more times

Find numbers after specific text in a string with RegEx

I have a multiline string like the following:
2012-15-08 07:04 Bla bla bla blup
2012-15-08 07:05 *** Error importing row no. 5: The import of this line failed because bla bla
2012-15-08 07:05 Another text that I don't want to search...
2012-15-08 07:06 Another text that I don't want to search...
2012-15-08 07:06 *** Error importing row no. 5: The import of this line failed because bla bla
2012-15-08 07:07 Import has finished bla bla
What I want is to extract all row numbers that have errors with the help of RegularExpression (with PowerShell). So I need to find the number between "*** Error importing row no. " and the following ":" as this will always give me the row number.
I looked at various other RegEx question but to be honest the answers are like chinese to me.
I tried to build a RegEx with the help of http://regexr.com/ but haven't been successful so far, for example with the following pattern:
"Error importing row no. "(.?)":"
Any hints?

Try this expression:
"Error importing row no\. (\d+):"
Here you need to understand the quantifiers and escape sequences:
. any character; as you want only numbers, use \d; if you mean the period character, you must escape it with a backslash (\.)
? Zero or one of the preceding item; this isn't what you want, because with an error on row 10 a single optional character cannot cover both digits
+ One or more; this is what we need here
* Zero or more; take care when using it as .* because it can consume your entire input
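To make the quantifier difference concrete, here is a small sketch in Python's re module (an assumption; the question is about PowerShell, whose .NET regex engine treats these simple patterns the same way). The log lines are made up, with one error on row 5 and one on row 10:

```python
import re

# Hypothetical log excerpt: one single-digit and one two-digit row number.
log = (
    "2012-15-08 07:05 *** Error importing row no. 5: The import failed\n"
    "2012-15-08 07:06 *** Error importing row no. 10: The import failed\n"
)

# (.?) allows at most one character before ":", so row 10 is missed entirely.
with_optional = re.findall(r"Error importing row no\. (.?):", log)
# (\d+) takes one or more digits, so both rows are found.
with_plus = re.findall(r"Error importing row no\. (\d+):", log)

print(with_optional)  # ['5']
print(with_plus)      # ['5', '10']
```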

Pretty straightforward. Right now your quoting is going to cause an error in the regex you wrote. Try this instead:
$LogText = "" # Your log text goes here
[regex]$Regex = "Error importing row no\. ([0-9]+):"
# Avoid the name $Matches: it is an automatic variable set by the -match operator
$RowMatches = $Regex.Matches($LogText)
$RowMatches | ForEach-Object {
    $RowNum = $_.Groups[1].Value # (Waves hand) These are the rows you are looking for
}

There are multiple ways to do this; a few simple ones are shown below:
I took your log in a file called temp.txt.
cat temp.txt | grep " Error importing row no." | awk -F":" '{print $2}' | awk -F"." '{print $2}'
OR
cat temp.txt | grep " Error importing row no." | sed 's/\(.*\)no.\(.*\):\(.*\)/\2/'
