Extract string embedded in pattern with a regex

I've been using the bash command line with grep -e and sort -nr to filter and analyze some lines coming from a bunch of "data" files. So far I have come up with an output file like this:
25 The X value is: bla bla bla done
19 The X value is: foo done
19 The X value is: bar done
19 The X value is: bbb done
19 The X value is: xxx yyy zzz done
where you can see the frequency and the "data" part I am interested in.
I am not able to find a regex to be used by grep to "clean those lines". I mean: I can intercept those "data" lines with a regex like is:.*done (I know this pattern is unique in the files I am analyzing), but how can I clean those lines extracting exactly the stuff between "is:" and "done"?

Try sed instead:
$ sed -r 's/^.*: (.*) done$/\1/' outputfile.txt
bla bla bla
foo
bar
bbb
xxx yyy zzz

If you wanted to return:
bla bla bla
foo
bar
bbb
xxx yyy zzz
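you can use a regex with lookarounds (this needs a PCRE-capable tool, such as grep -oP):
(?<=:)(.*)(?=done)
Note that this match keeps the space after the colon and the space before done.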
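To also drop the spaces around the value, the lookarounds can include them. A minimal sketch, assuming GNU grep built with -P support and lines shaped like the sample in the question; the variable names are only for illustration:

```shell
# Sample lines copied from the question.
input='25 The X value is: bla bla bla done
19 The X value is: foo done
19 The X value is: xxx yyy zzz done'

# -o prints only the matched part; -P enables Perl-style lookarounds (GNU grep).
# The spaces sit inside the lookarounds, so they are not part of the output.
extracted=$(printf '%s\n' "$input" | grep -oP '(?<=is: ).*(?= done$)')
printf '%s\n' "$extracted"
```

Where -P is not available, the sed command from the other answer gives the same result.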


Why can't grep tolerate multiple \n characters?

I am trying to use GREP to select multiple-line records from a file.
The records look something like this:
########## Ligand Number : 1
blab bla bla
bla blab bla
########## Ligand Number : 2
blab bla bla
bla blab bla
########## Ligand Number : 3
bla bla bla
<EOF>
I am using Perl RegEx (-P).
To bypass the multiple-line limitation in grep, I use grep -zo. This way, the parser can consume multiple lines and output exactly what I want. Generally, it works fine.
However, the problem is that the delimiter here is two empty lines after the end of last record line (three consecutive '\n' characters: one for end line and two for two empty lines).
When I try to use an expression like
grep -Pzo '^########## Ligand Number :\s+\d+.+?\n\n\n' inputFile
it returns nothing. It seems that grep can't tolerate consecutive '\n' characters.
Can anybody give an explanation?
P.S. I already worked around it by translating the '\n' characters to '\a' first, then translating them back, as in the following example:
cat inputFile | tr '\n' '\a' | grep -Po '########## Ligand Number :\s+\d+\a.+?\a\a\a' | tr '\a' '\n'
But I need to understand why couldn't GREP understand the '\n\n\n' pattern.
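In a PCRE regex, . does not match line-break characters by default, so .+? can never cross the newline at the end of the header line. The s modifier (DOTALL) makes . match line breaks too, which is the POSIX-like dot behavior.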
Thus, add (?s) at the start, or replace . with [\s\S].
(?s)^########## Ligand Number :\s+\d+.+?\n\n\n
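A rough way to check the (?s) fix, assuming GNU grep with -P support. The sample records and the make_sample helper are hypothetical, and the ^ anchor is left out so the sketch does not depend on how a given grep version treats anchors under -z:

```shell
# Two records, each followed by two empty lines (hypothetical sample data).
make_sample() {
  printf '########## Ligand Number : 1\nblab bla bla\n\n\n'
  printf '########## Ligand Number : 2\nbla blab bla\n\n\n'
}

# -z reads the whole input as one record; (?s) lets "." match "\n".
# With -z each match ends in a NUL byte, so convert those to newlines
# before counting the header lines that were matched.
matches=$(make_sample \
  | grep -Pzo '(?s)########## Ligand Number :\s+\d+.+?\n\n\n' \
  | tr '\0' '\n' \
  | grep -c '^##########')
printf 'records matched: %s\n' "$matches"
```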

Replace matching pairs of characters with parenthesis

It is easier to explain with an example. I have a LaTeX source file (an ordinary text file) with a lot of $ characters enclosing inline equations, something like this:
bla bla bla $E = mc^2$ bla blah
I would like to replace each occurrence of a matching pair of $ characters in the file with \( ... \), like this:
bla bla bla \(E = mc^2\) bla blah
Any idea how to do this, as simply as possible? I am not sure grep is able to handle this.
Assume that the file has an even number of occurrences of $. In that case, all we have to do is replace the $ at odd positions by \(, and the $ at even positions by \).
Like this?
spacewrench$ cat foo
bla bla bla $E = mc^2$ bla blah
spacewrench$ sed -e 's/\$\(.*\)\$/\\(\1\\)/g' < foo
bla bla bla \(E = mc^2\) bla blah
sed can do it. You may need to play with the number of backslashes, plus line endings if you have expressions that extend over multiple lines.
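The .* expression is greedy, so with several equations on one line it would put a single pair of parentheses around everything from the first $ to the last $. You can fix that by replacing .* with [^$]*. Inside a bracket expression the $ needs no backslash, and writing [^\$]* would also exclude backslashes, which LaTeX math is full of.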
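A small check of the non-greedy variant, using a made-up line with two equations (the sample line and variable names are assumptions for illustration):

```shell
# Hypothetical line with two inline equations, one containing a backslash.
line='a $x$ b $\alpha + y$ c'

# [^$]* stops at the next $, so each pair of $ is wrapped separately.
converted=$(printf '%s\n' "$line" | sed 's/\$\([^$]*\)\$/\\(\1\\)/g')
printf '%s\n' "$converted"
```

This still works line by line only, so an equation that spans several lines would need a different approach.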

How to remove matching pattern?

How do I remove my matching pattern from the file?
Every time the pattern [my_id= occurs, it should be removed without replacement.
For example, the field [my_id=AB_123456789.1] should be AB_123456789.1.
I already tried the following, with no result:
sed '/\[my\_id\=/d'
awk '$(NF-1) /^[protein\_id\=/d'
Also, is it possible to remove the first n characters from the last-but-one field ($(NF-1)) as an alternative?
Thanks for any help
You can use:
sed 's/\[my_id=\([^]]*\)\]/\1/g' file
The pattern \[my_id=\([^]]*\)\] matches [my_id=, then a run of characters other than ] (captured with the \(...\) syntax), then the closing ]. The whole match is replaced with the captured text (\1).
Test
$ cat a
hello [my_id=AB_123456789.1] bye
adf aa [my_id=AB_123456789.1] bbb
$ sed 's/\[my_id=\([^]]*\)\]/\1/g' a
hello AB_123456789.1 bye
adf aa AB_123456789.1 bbb
You can try something like this in awk (it prints only the lines where a substitution was made, and it removes every ] on those lines):
$ cat <<test | awk 'gsub(/\[my_id=|\]/,"")'
hello [my_id=AB_123456789.1] bye
adf aa [my_id=AB_123456789.1] bbb
test
hello AB_123456789.1 bye
adf aa AB_123456789.1 bbb
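For the side question about trimming the first n characters of the last-but-one field, one possible sketch in awk. It assumes every line has at least two fields and that collapsing the field separators to single spaces is acceptable; n=7 is simply the length of "[my_id=":

```shell
input='hello [my_id=AB_123456789.1] bye
adf aa [my_id=AB_123456789.1] bbb'

# n is the number of leading characters to drop from field NF-1.
trimmed=$(printf '%s\n' "$input" | awk -v n=7 '{
  $(NF-1) = substr($(NF-1), n + 1)   # drop the first n characters of the field
  sub(/\]$/, "", $(NF-1))            # drop the closing bracket, if present
  print                              # assigning to a field rebuilds the line with single spaces
}')
printf '%s\n' "$trimmed"
```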

Getting list of commands using regex

I have a list of commands, some of which have parameters that I need to strip before executing them.
show abc(h2) xyz
show abc(h2) xyz opq(h1)
show abc(h2) xyz <32>
show abc(a,l) xyz [<32>] opq
show abc
Ultimately, the list has different combinations of ( ), <>, [] with plain text commands.
I want to separate out all other commands from plain commands like "show abc".
Processing needed on commands :-
(h1), (h2), (a,l) are to be discarded
<32> - is to be replaced with any ip address
[<32>] - is to be replaced with any integer digit
I tried the following, but the resulting file was empty :-
cat show-cmd.txt | grep "<|(|[" > hard-cmd.txt
How can I get the result file which has no plain commands using regex?
Desired output file :-
show abc xyz
show abc xyz opq
show abc xyz 1.1.1.1
show abc xyz 2 opq
Try using grep followed by sed
grep '[(<\[]' file | sed -e 's/\[<32>\]/2/g' -e 's/<32>/1.1.1.1/g' -e 's/([^)]*)//g'
Output:
show abc xyz
show abc xyz opq
show abc xyz 1.1.1.1
show abc xyz 2 opq
Please note that the order of the s///g commands matters in your case: [<32>] has to be replaced before <32>.
Also try to avoid the redundant use of cat; grep can read the file directly.
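To illustrate why the order matters, here is the same sed pipeline run twice on one line from the question, once in the order given above and once with the first two substitutions swapped (variable names are illustrative only):

```shell
cmd='show abc(a,l) xyz [<32>] opq'

# Order from the answer: [<32>] is handled before the bare <32>.
right=$(printf '%s\n' "$cmd" \
  | sed -e 's/\[<32>\]/2/g' -e 's/<32>/1.1.1.1/g' -e 's/([^)]*)//g')

# Swapped order: <32> is rewritten first, so [<32>] no longer exists to match.
wrong=$(printf '%s\n' "$cmd" \
  | sed -e 's/<32>/1.1.1.1/g' -e 's/\[<32>\]/2/g' -e 's/([^)]*)//g')

printf '%s\n' "$right"   # show abc xyz 2 opq
printf '%s\n' "$wrong"   # show abc xyz [1.1.1.1] opq
```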
cat show-cmd.txt | grep "[\[\(\<]" > hard-cmd.txt
This should work. The opening and closing square brackets [] mean that only one of the listed characters needs to be present. The brackets that you want to search for are then listed inside, each escaped by a backslash.
Hope this helps.
Pulkit

Using sed and/or awk, how can I do a sed swap on all text in a file until I see a line matching a regex?

How can I do a sed regex swap on all text that precedes a line matching a regex?
e.g. How can I do a swap like this
s/foo/bar/g
for all text that precedes the first point this regex matches:
m/baz/
I don't want to use positive/negative look ahead/behind in my sed regex, because those are really expensive operations on big files.
If you mean that you want to do the substitution on every line preceding the given match, this is your answer:
The substitution takes an optional address range; you can use both numbers and patterns. In this case, start from line 1, go until your pattern:
sed '1,/baz/s/foo/bar/g'
In awk:
awk '
/baz/ { done = 1 }
{
if (!done) {
gsub(/foo/, "bar")
}
print
}'
(It's really short enough to leave out the line breaks, but they make it readable)
This variation on Jefromi's answer should do the trick of not touching the line that "baz" appears on as mentioned in Jonathan's comment.
sed '1,/baz/{/baz/!s/foo/bar/g}'
$ cat file
123 abc 01
456 foo 02 bar
789 ghi
baz
blah1
blah2
foo bar
$ awk -vRS="baz" 'NR==1{gsub("foo","bar")}1' ORS="baz" file
123 abc 01
456 bar 02 bar
789 ghi
baz
blah1
blah2
foo bar
baz
Use "baz" as the record separator; the first record is then the one in which you want to change "foo" to "bar".
With sed, a variation of Denni's solution that takes care of "baz" on the first line (the 0,/regex/ address form is a GNU sed extension):
sed '0,/baz/{/baz/!s/foo/bar/g}' file
This might work for you:
awk '/baz/{p=1};!p{gsub(/foo/,"bar")};1' file
or this:
sed '/baz/,$!s/foo/bar/g' file
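The sed variants in this thread differ only when "baz" is on the very first line. A small comparison on made-up input, assuming GNU sed for the 0,/baz/ form (variable names are illustrative):

```shell
# Hypothetical input with "baz" on the first line.
input='baz
foo
baz
foo'

# 1,/baz/ only looks for the end of the range from line 2 onwards,
# so the "foo" on line 2 is still rewritten.
from_one=$(printf '%s\n' "$input" | sed '1,/baz/{/baz/!s/foo/bar/g}')

# 0,/baz/ (GNU sed only) lets the range end on line 1.
from_zero=$(printf '%s\n' "$input" | sed '0,/baz/{/baz/!s/foo/bar/g}')

# Negating the range "first baz to end of file" needs no GNU extension.
negated=$(printf '%s\n' "$input" | sed '/baz/,$!s/foo/bar/g')

printf '%s\n---\n%s\n---\n%s\n' "$from_one" "$from_zero" "$negated"
```

On this input only the first command changes anything (line 2 becomes bar); the other two leave the text untouched, which is what the question asks for.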