Regex to find all lines containing foo_bar or not foo - regex

I'm stumped writing a regex to put into grep that matches all lines which contain foo_bar, plus all other lines that don't contain foo_.* (except foo_bar, which should still match).
So given:
foo_bar Line 1
foo_baz Line 2
Line 3
foo_biz Line 4
I'd want Lines 1 and 3 returned.
Note I can't just match not foo_baz and not foo_biz as baz and biz can be many, many things. It needs to be foo_.*

Under OS X I executed the following command on your input:
$ grep -P -v '^foo_(?!bar)' test.txt
foo_bar Line 1
Line 3
Please note that:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.

You need to combine a positive match and a negative match.
First, a positive match for foo_bar:
foo_bar
Then a negative match for every line that does not contain foo_, using the technique from the Stack Overflow question
Regular expression to match a line that doesn't contain a word?
^((?!foo_.*).)*$
Combine both in an alternation:
(foo_bar|^((?!foo_.*).)*$)
Now let's try it:
$ cat <<EOF | perl -ane ' /(foo_bar|^((?!foo_.*).)*$)/ && print $_'
> foo_bar 1
> fee_foo
> foo bar
> foo_buz RT
> fo_foo 1
> fo_ foo_bar
> EOF
gives
foo_bar 1
fee_foo
foo bar
fo_foo 1
fo_ foo_bar

With sed:
sed -n '/foo_bar/p;/foo_/!p' input
This tells sed not to print by default (-n), and to print a line when either foo_bar matches or foo_ does not match.
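The same either/or logic can be sketched in awk as a single condition. This is only an illustration run against the sample input from the question; the variable names input and result are placeholders.

```shell
# Sample input from the question, held in a variable.
input='foo_bar Line 1
foo_baz Line 2
Line 3
foo_biz Line 4'

# Keep a line if it contains foo_bar, or if it contains no foo_ at all.
result=$(printf '%s\n' "$input" | awk '/foo_bar/ || !/foo_/')
printf '%s\n' "$result"
```

Because the two tests are joined with ||, a line is printed at most once, and neither a lookahead nor grep -P is needed.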

Related

sed regexp - extra unwanted line in matching output

I have this file
~/ % cat t
---
abc
def DEF
ghi GHI
---
123
456
and I would like to extract the content between the three dashes, so I try
sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
I.e. 3 dashes followed by 5 whitespaces including the newline, followed by one or more word characters and ending with another set of three dashes. This gives me this output
~/ % sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
---
abc
def DEF
ghi GHI
---
123
I don't want the line with "123". Why am I getting that and how do I adjust my expression to get rid of it? [EDIT]: It is important that the four spaces of indentation after the first three dashes are matched in the expression.
This might work for you (GNU sed):
sed -En '/^---/{:a;N;/^ {4}\S/M!D;/\n---/!ba;p}' file
Turn on extended regexp (-E) and off implicit printing (-n).
If a line begins with --- and the following line is indented by 4 spaces, gather up the following lines until another line begins with ---, then print them.
If the following line does not match the above criteria, delete the first line and repeat.
All other lines pass through unprinted.
N.B. The M flag on the second regexp enables multiline matching, so ^ can match at the start of the second line in the pattern space; since the first line already begins with ---, the match has to come from the indented next line.
No need to use the pattern space here - a range pattern will do fine.
$ sed -n '/^---/,/^---/p' t
---
abc
def DEF
ghi GHI
---
Tested in GNU sed 4.7 and OSX sed.
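The plain range pattern does not look at the indentation at all. If the four-space requirement from the question's edit matters, one possible sketch is a small awk state machine that holds the opening --- until the next line confirms the indentation. The names inblock, held and pending are illustrative only, and the sample assumes the file really is indented by four spaces.

```shell
# Sample file content, assuming four spaces of indentation inside the block.
input='---
    abc
    def DEF
    ghi GHI
---
123
456'

result=$(printf '%s\n' "$input" | awk '
  # Inside a block: print every line, and leave the block on the closing ---.
  inblock { print; if (/^---/) inblock = 0; next }
  # A possible opening line: hold it until the next line is checked.
  /^---/  { held = $0; pending = 1; next }
  # Line after a held ---: start the block only if it has a four-space indent.
  pending { pending = 0
            if (/^    [^ ]/) { print held; print; inblock = 1 } }
')
printf '%s\n' "$result"
```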
I believe you can use
perl -0777 -ne '/^---\R(\s{4}\w.*?^---)/gsm && print "$1\n";' t
Details:
-0777 - slurps the file into a single variable
^---\R(\s{4}\w.*?^---) - start of a line (^), ---, a line break, then Group 1: four whitespaces, a word char, then zero or more chars as few as possible, and then --- at the start of a line
gsm - g is the global flag (all occurrences are matched), s means . matches any character including line break characters, and m means ^ matches at the start of any line, not just at the start of the string
&& print "$1\n" - if there is a match, print Group 1 value + a line break.

Regex, select the line that starts with my condition, but take only the characters after space

I have a file with content similar to the following:
ptrn: 435324kjlkj34523453
Note1: rtewqtiojdfgkasdktewitogaidfks
Note2: t4rwe3tewrkterqwotkjrekqtrtlltre
I am trying to get the characters after the space on the line that starts with "ptrn:". I am trying the command below:
>>> cat daily.txt | grep '^p.*$' > dailynew.txt
and I am getting the result in the new file:
ptrn: 435324kjlkj34523453
But I want only the characters after space, which are " 435324kjlkj34523453" to be written in the new file without "ptrn:" at the beginning.
The result should be like:
435324kjlkj34523453
How can I achieve this with an efficient regex?
You can use
grep -oP '^ptrn:\s*\K.*' daily.txt > dailynew.txt
awk '/^ptrn:/{print $2}' daily.txt > dailynew.txt
sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p' daily.txt > dailynew.txt
See the online demo. All output 435324kjlkj34523453.
In the grep PCRE regex (enabled with -P option) the patterns match
^ - the start of the string
ptrn: - a ptrn: substring
\s* - zero or more whitespaces
\K - match reset operator that clears the current match value
.* - the rest of the line.
In the awk command, ^ptrn: regex is used to find the line starting with ptrn: and then {print $2} prints the value after the first whitespace, from the second "column" (since the default field separator in awk is whitespace).
In sed, the command means
-n - suppresses the default line output
s - substitution command is used
^ptrn:[[:space:]]*\(.*\) - start of string, ptrn:, zero or more whitespace, and the rest of the line captured into Group 1
\1 - replaces the match with group 1 value
p - prints the result of the substitution.
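One practical difference between the awk and sed variants shows up when the value itself contains whitespace. The sketch below uses a hypothetical line, ptrn: abc def, which is not in the original file, to illustrate it.

```shell
# Hypothetical input line whose value contains a space.
line='ptrn: abc def'

# awk prints only the second whitespace-separated field.
awk_out=$(printf '%s\n' "$line" | awk '/^ptrn:/{print $2}')
# sed keeps everything after the prefix and the whitespace that follows it.
sed_out=$(printf '%s\n' "$line" | sed -n 's/^ptrn:[[:space:]]*\(.*\)/\1/p')

printf 'awk: %s\nsed: %s\n' "$awk_out" "$sed_out"
```

For the single-token value in the question both commands give the same result; the distinction only matters if values can contain spaces.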
You can use this sed:
sed -nE 's/^ptrn: (.*)/\1/p' file > output_file.txt

Regex - Match a string and not match another string in the same line

I am learning regular expressions. I was trying to print lines in a file that contain a particular string and do not contain another string.
I have a few lines in the file like
k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk
My intention is to match lines with 1 : and not match ab on the same line.
My desired output is 1 : efgh (this line matches 1 : and does not contain ab).
For this I tried the regular expression ^((?!ab).*1 :*)*$, but it does not work. Can someone point out what the issue is in my expression?
As mentioned in the comments, the shell does not support lookahead.
You could pipe your text through another program such as grep to get the regex flavor you need (i.e. Perl):
cat test.txt | grep --perl '1\s:(?!.*ab)'
returns
l 1 : efgh
If you need the whole line, use awk:
awk '!/ab/ && /1[[:space:]]:/' inputfile > outputfile
It outputs lines not containing ab and containing 1 + space + :.
To get a part of a line:
sed -E -n '/ab/!s/.*(1 :.*)/\1/p' inputfile > outputfile
This skips all lines containing ab and, thanks to the -n option combined with the p flag, prints only the captured group value.
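Both filters can be tried on the sample lines from the question. In this sketch the awk program is kept inside a single pair of quotes so the shell passes it through unchanged; the variable names whole and part are placeholders.

```shell
# Sample lines from the question.
input='k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk'

# Whole line: no "ab" anywhere, and "1", one whitespace character, ":" present.
whole=$(printf '%s\n' "$input" | awk '!/ab/ && /1[[:space:]]:/')
# Part of the line: skip lines with "ab", keep the text from "1 :" onward.
part=$(printf '%s\n' "$input" | sed -E -n '/ab/!s/.*(1 :.*)/\1/p')

printf 'whole: %s\npart: %s\n' "$whole" "$part"
```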

Regex substitute multiple lines for a single line

I have a plain text file in which I need to substitute multiple consecutive lines of text with a single replacement line. For example, when I have a date and time, followed by a blank line, followed by a page number,
11/13/2018 08:33:00
Page 1 of 1
I'd like to replace it with a single line (e.g., PAGE BREAK).
I've tried
sed 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
and
perl -pe 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
but it leaves the text unchanged.
Both sed and Perl process the input line by line. You can tell Perl to load the whole file into memory by using -0777 (if it's not too large):
perl -0777 -pe 's=[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+=PAGE BREAK=g'
Note that I used [0-9], because \d can match ٤, ໖, ६, or 𝟡.
I also used s=== instead of s/// so I don't have to backslash the slashes in the date part.
Another Perl variant
$ cat page_break.txt
123 45 jh kljl
11/13/2018 08:33:00
Page 1 of 1
ghjgjh hkjhj
fhfghfghfh
11/13/2018 08:33:00
Page 1 of 2
ghgigkjkj
$ perl -ne '{ if ( (/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}/ and $x++)or ( /^\s*$/ and $x++) or (/Page \d of \d/ and $x++) ){} if($x==0) { print "$_" } if($x==3) { print "PAGE BREAK\n"; $x=0} }' page_break.txt
123 45 jh kljl
PAGE BREAK
ghjgjh hkjhj
fhfghfghfh
PAGE BREAK
ghgigkjkj
$

Replace regex with captured group ONLY

I'm trying to understand why the following does not give me what I think (or want :)) should be returned:
sed -r 's/^(.*?)(Some text)?(.*)$/\2/' list_of_values
or Perl:
perl -lpe 's/^(.*?)(Some text)?(.*)$/$2/' list_of_values
So I want my result to be just the Some text, otherwise (meaning if there was nothing captured in $2) then it should just be EMPTY.
I did notice that with perl it does work if Some text is at the start of the line/string (which baffles me...). (Also noticed that removing ^ and $ has no effect)
Basically, I'm trying to get what grep would return with the --only-matching option as discussed here. Only I want/need to use sub/replace in the regex.
EDITED (added sample data)
Sample input:
$ cat -n list_of_values
1 Black
2 Blue
3 Brown
4 Dial Color
5 Fabric
6 Leather and Some text after that ....
7 Pearl Color
8 Stainless Steel
9 White
10 White Mother-of-Pearl Some text stuff
Desired output:
$ perl -ple '$_ = /(Some text)/ ? $1 : ""' list_of_values | cat -n
1
2
3
4
5
6 Some text
7
8
9
10 Some text
First of all, this shows how to duplicate grep -o using Perl.
You're asking why
foo Some text bar
012345678901234567
results in just an empty string instead of
Some text
Well,
At position 0, ^ matches 0 characters.
At position 0, (.*?) matches 0 characters.
At position 0, (Some text)? matches 0 characters.
At position 0, (.*) matches 17 characters.
At position 17, $ matches 0 characters.
Match succeeds.
You could use
s{^ .*? (?: (Some[ ]text) .* | $ )}{ $1 // "" }exs;
or
s{^ .*? (?: (Some[ ]text) .* | $ )}{$1}xs; # Warns if warnings are on.
Far simpler:
$_ = /(Some text)/ ? $1 : "";
I question your use of -p. Are you sure you want a line of output for each line of input? It seems to me you'd rather have
perl -nle'print $1 if /(Some text)/'