Grep pattern with Multiple lines with AND operation - regex

How can I determine whether a pattern exists over multiple lines with grep? Below is a multiline pattern I need to check is present in the file:
Status: True
Type: Master
I tried the command below, but it only checks for multiple strings on a single line; it fails when the pattern spans multiple lines:
if cat file.txt | grep -P '^(?=.*Status:.*True)(?=.*Type:.*Master)'; then echo "Present"; else echo "NOT FOUND"; fi
file.txt
Interface: vlan
Status: True
Type: Master
ID: 104

Using GNU grep you can do this:
grep -zoP '(?m)^\s*Status:\s+True\s+Type:\s+Master\s*' file
Status: True
Type: Master
Explanation:
-P: enables PCRE regex mode
-z: treats the input as NUL-separated, so the whole file is read as a single record and the regex can match across newlines
-o: prints only the matched data
(?m): enables MULTILINE mode so that ^ can match at the start of each line
^: matches the start of a line
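Plugged back into the OP's if-test, the exit status can be checked directly with -q (quiet mode); here the sample file.txt from the question is recreated first so the snippet is self-contained:

```shell
# Recreate the sample file from the question.
cat > file.txt <<'EOF'
Interface: vlan
Status: True
Type: Master
ID: 104
EOF

# -q suppresses output; the if-test uses grep's exit status only.
if grep -qzP '(?m)^\s*Status:\s+True\s+Type:\s+Master\s*' file.txt; then
  echo "Present"
else
  echo "NOT FOUND"
fi
# Present
```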

With your shown samples, please try the following awk program, written and tested in GNU awk.
awk -v RS='(^|\n)Status:[[:space:]]+True\nType:[[:space:]]+Master' '
RT{
sub(/^\n/,"",RT)
print RT
}
' Input_file
Explanation: set RS (awk's record separator) to the regex (^|\n)Status:[[:space:]]+True\nType:[[:space:]]+Master (explained below); in the main program, check whether RT (the text that matched RS) is non-empty, strip the leading newline from it, and print it to get the expected output shown by the OP.

I did it as follows:
grep -A 1 "^.*Status:.*True" test.txt | grep -B 1 "^Type:.*Master"
The -A x means "also show the x lines After the found one".
The -B y means "also show the y lines Before the found one".
So: show the "Status" line together with the next one (the "Type" one), and then show the "Type" line together with the previous one (the "Status" one).
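Run against the question's sample (recreated here as test.txt), the pipeline keeps just the adjacent pair:

```shell
# Recreate the sample file from the question.
cat > test.txt <<'EOF'
Interface: vlan
Status: True
Type: Master
ID: 104
EOF

grep -A 1 "^.*Status:.*True" test.txt | grep -B 1 "^Type:.*Master"
# Status: True
# Type: Master
```

Note that this relies on the Status line being immediately followed by the Type line.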

You could also keep track of the previous line by setting prev = $0 at the end of every record, and use a pattern that matches the previous and the current line.
awk '
prev ~ /^[[:space:]]*Status:[[:space:]]*True$/ && $0 ~ /^[[:space:]]*Type:[[:space:]]*Master$/{
printf "%s\n%s\n", prev, $0
}
{prev = $0}
' file.txt
Output
Status: True
Type: Master


Filter (or 'cut') out column that begins with 'OS=abc'

My .fasta file consists of this repeating pattern.
>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa
I want to keep only the lines that contain '>', THEN extract the string after 'OS=' and before 'OX=' (for example, line 1 gives Ctenodactylus gundi).
The first part ('>') is easy enough:
grep '>' my.fasta | cut -d " " -f 3 >> species.txt
The problem is that the number of fields is not constant BEFORE 'OS='.
But the number of column/fields between 'OS=' and 'OX=' is 2.
You can use the -P option to enable PCRE-based regex matching, and use lookaround patterns to ensure that the match is enclosed between OS= and OX=:
grep '>' my.fasta | grep -oP '(?<=OS=).*(?=OX=)'
Note that the -P option is available only in GNU grep, which may not be available by default in some environments.
IMHO awk is a better fit here, since it can handle the regex and the conditional printing together. Could you please try the following:
awk '/^>/ && match($0,/OS=.*OX=/){print substr($0,RSTART+3,RLENGTH-6)}' Input_file
Output will be as follows.
Ctenodactylus gundi
Nostoc commune
Gallus gallus
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
/^>/ && match($0,/OS=.*OX=/){ ##Checking condition: if line starts with > AND matches regex OS=.*OX= (i.e. from OS= till OX=); if both conditions are TRUE.
print substr($0,RSTART+3,RLENGTH-6) ##Then print the substring of the current line starting at RSTART+3 with length RLENGTH-6 (stripping the surrounding OS= and OX=).
}
' Input_file ##Mentioning Input_file name here.
Using any awk in any shell on every UNIX box:
$ awk -F' O[SX]=' '/^>/{print $2}' file
Ctenodactylus gundi
Nostoc commune
Gallus gallus
sed solution:
$ sed -nE '/>/ s/^.*OS=(.*) OX=.*$/\1/p' .fasta
Ctenodactylus gundi
Nostoc commune
Gallus gallus
-n so that the pattern space is not printed unless requested; -E (extended regular expressions) so that we can use subexpressions and backreferences. The p flag to the s command means "print the pattern space".
The regular expression is supposed to match the entire line, singling out in a subexpression the fragment we must extract. I assumed OX is preceded by exactly one space, which must not appear in the output; that can be adjusted if/as needed.
This assumes that all lines that begin with > will have an OS= ... fragment immediately followed by an OX= ... fragment; if not, that can be added to the />/ filter before the s command. (By the way - can there be some OT= ... fragment between OS=... and OX= ...?)
Question though - wouldn't you rather include some identifier (perhaps part of the "label" at the beginning of each line) for each line of output? You have the fragments you requested - but do you know where each one of them comes from?
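On that last point, one hedged variant: assuming every header keeps the >sp|ACCESSION|NAME layout of the sample, the same sed substitution can capture the accession alongside the species:

```shell
# Recreate a subset of the sample .fasta from the question.
cat > my.fasta <<'EOF'
>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa
EOF

# Group 1 = accession between the first two pipes, group 2 = species.
sed -nE 's/^>sp\|([^|]*)\|.* OS=(.*) OX=.*$/\1 \2/p' my.fasta
# P20855 Ctenodactylus gundi
# P02197 Gallus gallus
```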

Regex: find elements regardless of order

If I have the string:
geo:FR, host:www.example.com
(In reality the string is more complicated and has more fields.)
And I want to extract the "geo" value and the "host" value, I am facing a problem when the order of the keys change, as in the following:
host:www.example.com, geo:FR
I tried this line:
sed 's/.*geo:\([^ ]*\).*host:\([^ ]*\).*/\1,\2/'
But it only works on the first string.
Is there a way to do it in a single regex, and if not, what's the best approach?
I suggest extracting each text you need with a separate sed command:
s="geo:FR, host:www.example.com"
host="$(sed -n 's/.*host:\([^[:space:],]*\).*/\1/p' <<< "$s")"
geo="$(sed -n 's/.*geo:\([^[:space:],]*\).*/\1/p' <<< "$s")"
Then echo "$host and $geo" prints
www.example.com and FR
for both inputs.
Details
-n suppresses line output and p prints the matches
.* - matches any 0+ chars up to the last...
host: - host: substring and then
\([^[:space:],]*\) - captures into Group 1 any 0 or more chars other than whitespace and a comma
.* - the rest of the line.
The result is just the contents of Group 1 (see \1 in the replacement pattern).
Whenever you have tag/name-to-value pairs in your input, I find it best (clearest, simplest, most robust, easiest to enhance, etc.) to first create an array containing that mapping (f[] below); then you can simply access the values by their tags:
$ cat file
geo:FR, host:www.example.com
host:www.example.com, geo:FR
foo:bar, host:www.example.com, stuff:nonsense, badgeo:uhoh, geo:FR, nastygeo:wahwahwah
$ cat tst.awk
BEGIN { FS=":|, *"; OFS="," }
{
for (i=1; i<=NF; i+=2) {
f[$i] = $(i+1)
}
print f["geo"], f["host"]
}
$ awk -f tst.awk file
FR,www.example.com
FR,www.example.com
FR,www.example.com
The above will work using any awk in any shell on every UNIX box.
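If the two values are then needed in shell variables, the same awk mapping can feed a couple of parameter expansions (a sketch; the variable names out, geo, and host are mine):

```shell
line='host:www.example.com, geo:FR'

# Same FS/array trick as tst.awk above, run on a single input line.
out=$(printf '%s\n' "$line" |
  awk 'BEGIN{FS=":|, *"; OFS=","} {for (i=1; i<=NF; i+=2) f[$i]=$(i+1); print f["geo"], f["host"]}')

geo=${out%%,*}    # everything before the first comma
host=${out#*,}    # everything after the first comma
echo "$geo $host"
# FR www.example.com
```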
Here I've used GNU Awk to convert your delimited key:value pairs to valid shell assignments. With Bash, you can load these assignments into your current shell using process substitution:
# source the file descriptor generated by proc sub
. < <(
# use comma-space as field separator, literal apostrophe as variable q
awk -F', ' -vq=\' '
# change every foo:bar in line to foo='bar' on its own line
{for(f=1;f<=NF;f++) print gensub(/:(.*)/, "=" q "\\1" q, 1, $f)}
# use here-string to load text; remove everything but first quote to use standard input
' <<< 'host:www.example.com, geo:FR'
)

How to use awk or sed to replace specific word under certain profile

A file contains data in the following format. Now I want to change the value of showfirst under the [XYZ] section. How can I achieve that with sed, awk, or grep?
I thought of using the line number or the second appearance, but neither will stay constant; in the future the file can contain hundreds of such profiles, so it has to be profile based.
I know that I can extract the 1st line after the 'XYZ' pattern, but I want it to be field based.
Thanks for help
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3
With sed:
sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' file
How it works:
/^\[XYZ\]/,/^showfirst *=/: an address range matching lines from [XYZ] to the next line starting with showfirst
0,//!: the empty regex // reuses the last matched regex, so this excludes the first line of the range (the [XYZ] line itself) from the substitution
s/.*/showfirst=20/: replaces the remaining line (the showfirst line) with showfirst=20
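Applying it to the sample file from the question (recreated here as profiles.conf, a name of my choosing) shows that only the showfirst under [XYZ] changes; GNU sed is assumed, since 0,/regexp/ addressing is a GNU extension:

```shell
# Recreate the sample file from the question.
cat > profiles.conf <<'EOF'
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3
EOF

sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' profiles.conf
# [ABC]
# showfirst =0
# showlast=10
# [XYZ]
# showfirst=20
# showlast=3
```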
You can use awk like this:
awk -v val='50' 'BEGIN{FS=OFS="="} /^\[.*\]$/{ flag = ($1 == "[XYZ]"?1:0) }
flag && $1 == "showfirst" { $2 = val } 1' file
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=50
showlast=3
In awk. All parameterized:
$ awk -v l='XYZ' -v k='showfirst' -v v='666' ' # parameters: label, key, value
BEGIN { FS=OFS="=" } # delimiters
/\[.*\]/ { f=0 } # flag down on a new label
$1=="[" l "]" { f=1 } # flag up on the target label
f==1 && $1==k { $2=v } # replace value when flag is up and key matches
1' file # print
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=666
showlast=3
Man, even I feel confused looking at that code.
If you set the record separator (RS) to the empty string, awk reads a whole record at a time, treating blocks of lines separated by blank lines as records.
So for example you can do something like this:
awk -v k=XYZ -v v=42 '$1 ~ "\\[" k "\\]" { sub("showfirst *=[^\n]*", "showfirst=" v) } 1' RS= ORS='\n\n' infile
Output:
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=42
showlast=3
With sed, the following command solved my problem for every user:
sed '/.*\[ABC\]/,/^$/ s/.*showfirst.*/showfirst=20/' input.conf
The syntax goes as sed [address] command:
/.*\[ABC\]/,/^$/: generates an address range from the [ABC] line up to the next blank line; the search is done in this range only (note this assumes each section is terminated by a blank line, otherwise the range runs to the end of the file)
s/.*showfirst.*/showfirst=20/: searches for any line containing showfirst in that range and replaces the entire line with showfirst=20

Grepping out a block of text, regex

Given a large log file, what is the best way to grep a block of text?
text to be ignored
more text to be ignored
--- <---- start capture here
lots of
text with separators like "---"
---
spanning
multiple lines
--- <---- end capture here
text to be ignored
more text to be ignored
What is known?
Max number of characters in line (55 but may be less)
Number of lines in a block
Separator (which may repeat itself)
What regular expression would match this block? Desired output: list of blocks of text.
Please assume Linux command line environment
Several years ago I used this to split patches into hunks:
sed -e '$ {x;q}' -e '/##/ !{H;d}' -e '/##/ x' # note - i know sed better now
Replace /##/ with /---/.
To remove everything before the first '---' and after the last '---', add -e '1,/---/d' and drop the -e '$ {x;q}' expression entirely.
Result would be like this:
sed -e '1,/---/d' -e '/---/ !{H;d}' -e x
Just tested it and it works with the given example.
Keep it simple:
$ awk 'NR==FNR {if (/^---/) { if (!start) start=NR; end=NR } next} FNR>=start && FNR<=end' file file
--- <---- start capture here
lots of
text with separators like "---"
---
spanning
multiple lines
--- <---- end capture here
$ awk 'NR==FNR {if (/^---/) { if (!start) start=NR; end=NR } next} FNR>start && FNR<end' file file
lots of
text with separators like "---"
---
spanning
multiple lines
If you have enough memory, you can use the following line. Note, however, that it will read the whole logfile into memory!
perl -0777 -lnE 'm{ ^--- .+ ^--- }xms and say $&' logfile

Grep Regex: List all lines except

I'm trying to automagically remove all lines from a text file that contain a letter "T" that is not immediately followed by an "H". I've been using grep and sending the output to another file, but I can't come up with the magic regex that will help me do this.
I don't mind using awk, sed, or some other linux tool if grep isn't the right tool to be using.
That should do it:
grep -v 'T[^H]'
-v : print lines not matching
[^H]: matches any character but H
You can do:
grep -v 'T[^H]' input
-v is the inverse match option of grep it does not list the lines that match the pattern.
The regex used is T[^H], which matches any line that has a T followed by any character other than an H.
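One edge case worth noting: T[^H] needs some character after the T, so a line that ends in a T is not matched and slips through. Extending the pattern with an end-of-line alternative covers it (using -E for extended regex; the sample lines here are mine):

```shell
# "THE CAT" and "BAT" end in a bare T and should be removed; "THEN" survives.
printf 'THE CAT\nTHEN\nBAT\n' | grep -Ev 'T([^H]|$)'
# THEN
```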
Read lines from a file, excluding empty lines and lines starting with #:
grep -v '^$\|^#' folderlist.txt
folderlist.txt
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
Results will be:
folder1/test
folder2
folder3
folder4/backup
folder5/backup
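Note that \| alternation inside a basic regex is a GNU extension; with -E the same filter is portable:

```shell
# Recreate the sample folderlist.txt from above.
cat > folderlist.txt <<'EOF'
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
EOF

# Drop lines that are empty or start with '#'.
grep -Ev '^($|#)' folderlist.txt
# folder1/test
# folder2
# folder3
# folder4/backup
# folder5/backup
```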
Adding 2 awk solutions to the mix here.
1st solution (simpler, works with any version of awk):
awk '!/T[^H]/ && !/T$/' Input_file
Checking 2 conditions:
the line must not contain a T followed by a character other than H, AND
the line must not end in a T (which has nothing after it at all).
If both conditions are TRUE then print that line. (Note that a test like !/T/ || /TH/ is not enough: a line such as "TX TH" contains TH yet still has a bad T.)
2nd solution (GNU awk specific): using GNU awk's match function, whose third argument fills an array with the capture groups, to inspect every T together with the character that follows it.
awk '
!/T/{
print
next
}
{
s = $0; ok = 1
while (match(s, /T(.|$)/, arr)) {
if (arr[1] != "H") { ok = 0; break }
s = substr(s, RSTART + RLENGTH)
}
if (ok) print
}
' Input_file
Explanation: firstly, if a line has no T, print it as-is. Otherwise walk along the line with match, capturing the character after each T (or the empty string at end of line) into arr[1]; if any T is followed by something other than H, reject the line, else print it.