grep between two strings if pattern in the middle linux - regex

I want to grep the text between two strings, but only if a given pattern occurs between them.
The first wanted string is Start, the second is END, and the pattern is 1 2 3, each on its own line.
For example, in this text:
Start
abc
abc
1
2
3
abc
END
bla
bla
Start
abc
abc
1
2
4
abc
END
bla
bla
Start
abc
abc
1
2
3
abc
abc
END
The result should be:
Start
abc
abc
1
2
3
abc
END
Start
abc
abc
1
2
3
abc
abc
END
thanks!

sed -ne '/Start/{:a;N;/END/!b a;/\n1\n2\n3\n/p}'
Line by line:
We only care about text starting with 'Start':
sed -ne '/Start/{
Once 'Start' is found, append everything up to 'END' to the pattern space.
Set a label named 'a':
:a
Append the next line to the pattern space:
N
If 'END' has not been found yet, jump back to 'a':
/END/!b a
Now check whether the pattern space contains 1 2 3 and, if so, print it.
The numbers are separated by '\n' because they were on separate lines:
/\n1\n2\n3\n/p
}'
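The same buffering idea can be sketched in awk, which may be easier to read than the sed loop. This is a minimal sketch, not the answer's own method: the shortened sample input and the variable names (sample, result, inblock, buf) are assumptions made for illustration.

```shell
# Hypothetical sample: two Start..END blocks, only the first contains 1, 2, 3.
sample='Start
abc
1
2
3
abc
END
bla
Start
abc
1
2
4
abc
END'

result=$(printf '%s\n' "$sample" | awk '
    /Start/ { inblock = 1; buf = "" }        # open a new block, reset the buffer
    inblock { buf = buf $0 "\n" }            # collect every line of the block
    inblock && /END/ {                       # the block is complete
        if (index(buf, "\n1\n2\n3\n"))       # are 1, 2, 3 on consecutive lines?
            printf "%s", buf
        inblock = 0
    }
')
printf '%s\n' "$result"
```

Only the first block is printed, because the second one contains 1 2 4 rather than 1 2 3.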

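grep is not well suited to this; use sed instead:
sed -n "/Start/,/END/p" input.txt
This prints every block from Start to END (assuming the input is in a file named input.txt). Note that it does not check whether the 1 2 3 pattern occurs in between.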

Related

sed regexp - extra unwanted line in matching output

I have this file
~/ % cat t
---
abc
def DEF
ghi GHI
---
123
456
and I would like to extract the content between the three dashes, so I try
sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
I.e. 3 dashes followed by 5 whitespace characters (including the newline), followed by one or more word characters, and ending with another set of three dashes. This gives me this output:
~/ % sed -En '{N; /^---\s{5}\w+/,/^---/p}' t
---
abc
def DEF
ghi GHI
---
123
I don't want the line with "123". Why am I getting that and how do I adjust my expression to get rid of it? [EDIT]: It is important that the four spaces of indentation after the first three dashes are matched in the expression.
This might work for you (GNU sed):
sed -En '/^---/{:a;N;/^ {4}\S/M!D;/\n---/!ba;p}' file
Turn on extended regexps (-E) and turn off implicit printing (-n).
If a line begins with --- and the following line is indented by 4 spaces, gather up the following lines until another line begins with ---, then print them.
If the following line does not match the above criteria, delete the first line and repeat.
All other lines pass through unprinted.
N.B. The M flag on the second regexp enables multiline matching: since the first line already begins with ---, the next one must be the indented line.
No need to use the pattern space here - a range pattern will do fine.
$ sed -n '/^---/,/^---/p' t
---
abc
def DEF
ghi GHI
---
Tested in GNU sed 4.7 and OSX sed.
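If the four-space indentation after the opening dashes must be checked as well, one possible awk sketch is shown here. It assumes the body lines in the file are indented by four spaces (as the question states) and that the dashes come in well-formed opening/closing pairs; the variable names are made up for illustration.

```shell
# Hypothetical sample mirroring the file in the question, with the indentation restored.
sample='---
    abc
    def DEF
    ghi GHI
---
123
456'

result=$(printf '%s\n' "$sample" | awk '
    /^---$/ {
        if (inblock) { print buf $0; inblock = 0; buf = "" }   # closing dashes: emit the block
        else         { inblock = 1; buf = $0 "\n"; first = 1 } # opening dashes: start buffering
        next
    }
    inblock {
        # The first body line must start with exactly four spaces, else drop the block.
        if (first && $0 !~ /^    [^ ]/) { inblock = 0; buf = ""; next }
        first = 0
        buf = buf $0 "\n"
    }
')
printf '%s\n' "$result"
```

Lines after the closing dashes (123 and 456 here) are never buffered, so they cannot leak into the output.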
I believe you can use
perl -0777 -ne '/^---\R(\s{4}\w.*?^---)/gsm && print "$1\n";' t
Details:
-0777 - slurps the file into a single variable
^---\R(\s{4}\w.*?^---) - start of a line (^), ---, a line break, then Group 1: four whitespaces, a word char, then zero or more chars as few as possible, and then --- at the start of a line
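gsm - g requests global matching, s makes . match any character including line breaks, and m makes ^ match at the start of any line, not just the start of the string. Note that in this && form the match is evaluated in scalar context, so only the first occurrence is printed.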
&& print "$1\n" - if there is a match, print Group 1 value + a line break.

Regex - Match a string and not match another string in the same line

I am learning regular expressions. I was trying to print lines in a file that contain a particular string and do not contain another string.
I have a few lines in the file like
k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk
My intention is to match lines with 1 : and not match ab on the same line.
My desired output should be 1 : efgh (this line matches 1 : and does not contain ab).
For this I have tried the regular expression ^((?!ab).*1 :*)*$, but it does not work. Can someone point out what the issue is in my expression?
As mentioned in the comments, the shell does not support lookahead.
You could pass your text through another program such as grep to get the regex flavor you need (i.e. Perl):
grep --perl-regexp '1\s:(?!.*ab)' test.txt
returns
l 1 : efgh
If you need the whole line, use awk:
awk '!/ab/ && /1[[:space:]]:/' inputfile > outputfile
It outputs lines that do not contain ab and do contain 1 + whitespace + :.
To get a part of a line:
sed -E -n '/ab/!s/.*(1 :.*)/\1/p' inputfile > outputfile
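Skip all lines containing ab and, on the remaining lines, replace the line with the capturing group value; the -n option together with the p flag prints only the lines where the substitution succeeded.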
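When no lookahead support is available, the two conditions can also be split across two plain grep calls. This is a minimal sketch using the sample lines from the question; the variable names are assumptions.

```shell
# Sample lines taken from the question.
sample='k 1 : abcd
jkjkj
l 1 : efgh
kjkjk
m 1 : abok
lklk'

# First keep lines containing "1 :", then drop those that also contain "ab".
result=$(printf '%s\n' "$sample" | grep '1 :' | grep -v 'ab')
printf '%s\n' "$result"
```

The order of the two filters does not matter for correctness; each grep only has to express one condition.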

splitting bash string by delimiter (last line with delimiter) into array

I'm having a hard time splitting a string like this:
444,555,text with, separator
into this:
444
555
text with, separator
i.e. into a 3-element array (last element may contain comma)
I tried sed but I end up having 4 elements due to the last comma.
Any ideas?
Thanks,
With bash and array:
s='444,555,text with, separator'
IFS=, read -r a b c <<< "$s"
array=("$a" "$b" "$c")
declare -p array
Output:
declare -a array='([0]="444" [1]="555" [2]="text with, separator")'
sed allows replacing the N-th match of a regexp (i.e. the k-th occurrence of the string within a line):
str="444,555,text with, separator"
sed 's/,/\n/1; s/,/\n/1' <<< "$str"
The output:
444
555
text with, separator
s/,/\n/1 - the 1 is a number flag that selects the first occurrence of , to replace with \n
The following gives the same result (the first match is the default for each substitution):
sed 's/,/\n/; s/,/\n/' <<< "$str"
Two consecutive substitutions give 3 lines (chunks).
echo "444,555,text with, separator" | sed "s/\([0-9]*\),\([0-9]*\),\(.*\)/\1\n\2\n\3/"
Output:
444
555
text with, separator
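The same split can also be sketched with plain parameter expansion, without sed or read. The variable names (first, rest, second, third) are made up for illustration, and the sketch assumes the string contains at least two commas.

```shell
s='444,555,text with, separator'

first=${s%%,*}        # everything before the first comma
rest=${s#*,}          # everything after the first comma
second=${rest%%,*}    # everything before the next comma
third=${rest#*,}      # the remainder, which may itself contain commas

printf '%s\n' "$first" "$second" "$third"
```

In bash, the three values can then be collected with array=("$first" "$second" "$third"), as in the read-based answer.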

Remove duplicate words and just print lines in which this occurs

I have a challenge: check whether a sentence in a file contains 2 identical consecutive words. If so, print the sentence with the duplicate removed; otherwise, don't print the sentence.
Example:
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h
After running the program the output will be:
dea 123 zy45
12
xyz%$#! kk
abc h h h
3
This is what I have so far:
sed '/\([^\([^ ]\+\)[ ]\+\1]\)/d' F4 >|tmp
I got this so far but this is only separating between the sentences that have the double word and sentences that don't.
Your sed expression was quite close. However, it needed some adjusting to make it work:
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' file
dea 123 zy45
12
xyz%$#! kk
abc h h h
The idea is the one you already implemented: match a given word with [^ ] and see if you match it again with \1. What I added is replacing all of this with \1\2, so the repeated word disappears while the whitespace that followed it (captured in group 2) is kept.
Instead of [^ ] it is also useful to use \S, and instead of [ ], \s. Note also the usage of \b as a word boundary to prevent false positives like fedorqui qui, and the usage of \1(\s|$) to prevent other false positives like hello helloa (thanks WalterA for the examples!). \s|$ matches either whitespace or the end of the line; a trailing \b would not work here, because it only matches between a word character and a non-word character, which fails for a word ending in punctuation, as in the xyz%$#! kk case.
To prevent every line from being printed, we use sed -n. This way, we just print (with p) the lines on which the substitution succeeded.
Note the usage of -r to avoid escaping the capture groups. Without it, the command would be:
sed -n 's/\b\(\S\+\)\s\+\1\(\s\|$\)/\1\2/p' file
Let's test it with a more comprehensive input:
$ cat a
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
fedorqui qui
hello helloa
abc h h h h
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' a
dea 123 zy45
12
xyz%$#! kk
abc h h h
I was looking for a sed solution, which seemed like it would be easy. Perhaps awk is better in this case (F4 is the input file):
awk '{
    for (i=2; i<=NF; i++) {
        if ($(i-1)==$i) {
            $i="";
            printf("%s\n", $0);
            break;
        }
    }
}' F4
I am not completely happy with this solution, since it leaves a doubled field separator in $0 after deleting the repeated word; but, read literally, the OP did not say that a space or tab should be deleted too.
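One way to avoid the doubled separator is to rebuild the line from its fields and skip the repeated one. This is a minimal sketch under the assumption that fields are separated by whitespace and may be rejoined with single spaces; the variable names are made up for illustration.

```shell
# Sample lines taken from the question.
sample='abc2 1 def2 3 abc2
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h'

result=$(printf '%s\n' "$sample" | awk '{
    for (i = 2; i <= NF; i++) {
        if ($(i-1) == $i) {                    # first pair of identical neighbours
            out = ""
            for (j = 1; j <= NF; j++)          # rebuild the line without field i
                if (j != i) out = out (out == "" ? "" : " ") $j
            print out
            break                              # only the first duplicate is removed
        }
    }
}')
printf '%s\n' "$result"
```

Like the answer above, this removes only the first repeated word on each line and prints nothing for lines without a repeat.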

How to find lines with multiple occurrences of a(ny) word in a file?

I want to find lines that have multiple occurrences of a(ny) word. For example, if the input text is
John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.
The output should be
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
First line has is repeated, second line has abc repeated and last line has 2 repeated.
^(?=.*\b(\w+)\b.*\b\1\b).*$
Try this. See the demo:
https://www.regex101.com/r/rG7gX4/6
Use this with grep -P.
Here is a simple way to do it in awk
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f' file
John is a teacher, who is not highly paid.
abc abc abcde
Paul has 2 dogs and 2 cats.
It loops through every word and counts them in array a.
If any word is found more than once, it sets flag f.
If flag f is true, the default action runs: print the line.
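The steps above can be written out in a longer form that may be easier to follow. This is a sketch of the same logic, not a replacement for the one-liner; the names seen and dup are made up, and split("", seen) is used as a portable way to clear the array.

```shell
# Sample lines taken from the question.
sample='John is a teacher, who is not highly paid.
abc abcde
James lives in Detroit.
abc abc abcde
Paul has 2 dogs and 2 cats.'

result=$(printf '%s\n' "$sample" | awk '{
    split("", seen)                 # clear the per-line word counts
    dup = 0
    for (i = 1; i <= NF; i++)
        if (seen[$i]++) dup = 1     # word was already counted on this line
    if (dup) print                  # print only lines with a repeated word
}')
printf '%s\n' "$result"
```

Words are compared exactly as they appear, so "teacher," and "teacher" would count as different words here.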
To see how many:
awk '{f=0;delete a;for (i=1;i<=NF;i++) if (a[$i]++) f=1} f {for (i in a) if (a[i]>1) printf "%sx\"%s\"-",a[i],i;print $0}' file
2x"is"-John is a teacher, who is not highly paid.
2x"abc"-abc abc abcde
2x"2"-Paul has 2 dogs and 2 cats.
Some improvements: ignore case and remove every . and , from each word.
awk '{f=0;delete a;for (i=1;i<=NF;i++) {w=tolower($i);gsub(/[.,]/,"",w);if (a[w]++) f=1}} f' file