Replace matching pairs of characters with parentheses

It is easiest to explain with an example. I have a LaTeX source file (an ordinary text file) that contains many $ characters enclosing inline equations, something like this:
bla bla bla $E = mc^2$ bla blah
I would like to replace each matching pair of $ characters in the file with \( ... \), like this:
bla bla bla \(E = mc^2\) bla blah
Any idea how to do this as simply as possible? I am not sure grep can handle this.
Assume that the file has an even number of occurrences of $. In that case, all we have to do is replace the $ at odd positions by \(, and the $ at even positions by \).

Like this?
spacewrench$ cat foo
bla bla bla $E = mc^2$ bla blah
spacewrench$ sed -e 's/\$\(.*\)\$/\\(\1\\)/g' < foo
bla bla bla \(E = mc^2\) bla blah
sed can do it. You may need to play with the number of backslashes, plus line endings if you have expressions that extend over multiple lines.
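The .* expression is greedy, so on a line with several equations it would wrap everything from the first $ to the last $ in a single pair of parentheses. You can fix that by replacing .* with [^$]*. Inside a bracket expression the $ needs no backslash, and POSIX sed would treat a backslash there as one more literal character to exclude, which would break on equations such as $\alpha$.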
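For comparison, here is a minimal Python sketch of the same substitution using the [^$]* idea. It assumes the file contains no $$ display math and no escaped \$ characters; dollars_to_parens is a made-up name used only for illustration.

```python
import re


def dollars_to_parens(text: str) -> str:
    """Rewrite each $...$ pair as \\(...\\)."""
    # [^$]* cannot cross another $, so every pair is matched on its own,
    # even when several equations share a line.
    return re.sub(r"\$([^$]*)\$", r"\\(\1\\)", text)


print(dollars_to_parens("bla bla bla $E = mc^2$ bla blah"))
# bla bla bla \(E = mc^2\) bla blah
print(dollars_to_parens(r"$a$ and $\alpha$"))
# \(a\) and \(\alpha\)
```

Because a negated character class also matches newlines in Python, this version handles equations that span several lines when the whole file is read into one string.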

How can I use a look after to match either a single or a double quote?

I have a series of strings I want to extract:
hello.this_is("bla bla bla")
some random text
hello.this_is('hello hello')
other stuff
What I need to get (from many files, but this is not important here) is the content between hello.this_is( and ), so my desired output is:
bla bla bla
hello hello
As you see, the text within parentheses can be enclosed with either double or single quotes.
If this was only single quotes I would use a look behind and look ahead just like this:
grep -Po "(?<=hello.this_is\(').*(?=')" file
#                            ^      ^
# returns ---> hello hello
Similarly, to get strings from double quotes I would say:
grep -Po '(?<=hello.this_is\(").*(?=")' file
#                            ^      ^
# returns ---> bla bla bla
However, I want to match both cases, so that it handles both single and double quotes. I tried using $'' quoting to escape them, but could not make it work:
grep -Po '(?<=hello.this_is\($'["\']').*(?=$'["\']')' file
#                            ^^^^^^^^      ^^^^^^^^
I can of course use the ASCII number and say:
grep -Po '(?<=hello.this_is\([\047\042]).*' file
but I would prefer to write the quote characters themselves, since 047 and 042 are far less readable to me than ' and ".
Note: The sed command at the bottom of this answer works only as long as your strings are well-behaved, like
"foo"
or
'bar'
As soon as your strings start to misbehave :) like:
"hello \"world\""
it won't work any more.
Your input looks like source code. For a robust solution I recommend using a parser for that language to extract the strings.
For trivial use cases:
You can use sed. This solution should work on any POSIX platform, in contrast to grep -oP, which only works with GNU grep:
sed -n 's/hello\.this_is(\(["'\'']\)\([^"]*\)\(["'\'']\).*/\2/gp' file
#                                   ^^^^^^^^^              ^^
#                                   capture group 2        ^
Use a capturing group and look for its content like the following:
grep -Po 'hello\.this_is\(([\047"])((?!\1).|\\.)*\1\)' file
This also handles escaped characters, e.g. hello.this_is("bla b\"la bla")
If the output should be what comes between parentheses then utilize both \K and a positive lookahead:
grep -Po 'hello\.this_is\(([\047"])\K((?!\1).|\\.)*(?=\1\))' file
Outputs:
bla bla bla
hello hello
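The same backreference idea can be sketched in Python's re module, which has no \K but can return a capture group instead. This is only an illustration under the assumption that each call sits on one line; extract_args is a hypothetical helper name, and for real source code a parser is still the safer choice.

```python
import re

# Group 1 remembers the opening quote. Group 2 collects either an
# escaped character (\\.) or any character that is not that same quote.
CALL = re.compile(r"""hello\.this_is\((["'])((?:\\.|(?!\1).)*)\1\)""")


def extract_args(text: str) -> list[str]:
    """Return the quoted argument of every hello.this_is(...) call."""
    return [m.group(2) for m in CALL.finditer(text)]


sample = "\n".join([
    'hello.this_is("bla bla bla")',
    "some random text",
    "hello.this_is('hello hello')",
    'hello.this_is("bla b\\"la bla")',
])
for arg in extract_args(sample):
    print(arg)
```

Putting the \\. alternative first means an escaped quote is consumed as a unit before the engine considers ending the string.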
Based on the excellent answers by revo and hek2mgl, I ended up using grep like this:
grep -Po '(?<=hello\.this_is\((["'\''])).*(?=\1)' file
Which can be explained as:
grep
-Po use the Perl-compatible regex engine (-P) and print only the matched part of each line (-o)
'(?<=hello\.this_is\((["'\''])).*(?=\1)' the expression
(?<=hello\.this_is\((["'\''])) look-behind: match only text preceded by "hello.this_is(" followed by either ' or ". Also capture that quote character for later use.
.* match everything...
(?=\1) until the captured character (that is, either ' or ") appears again.
The key here was to use ["'\''] to indicate either ' or ". The '\'' sequence closes the single-quoted shell string, adds a literal ' (escaped with a backslash), and then opens a new single-quoted string.
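To see what grep actually receives after the shell has processed the '\'' sequence, one option is Python's shlex module, which follows POSIX shell quoting rules. This is just a sketch for inspecting the quoting; the variable names are made up for illustration.

```python
import shlex

# The pattern argument exactly as it is typed at the shell prompt.
typed = r"""'(?<=hello\.this_is\((["'\''])).*(?=\1)'"""

# shlex.split applies POSIX quoting rules, so the single resulting word
# is what the shell would hand to grep as the pattern.
(received,) = shlex.split(typed)
print(received)
# (?<=hello\.this_is\((["'])).*(?=\1)
```

The three quoted pieces are joined into one word, and the character class ends up as ["'] with both quote characters inside it.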

Why GREP can't tolerate multiple \n characters

I am trying to use GREP to select multiple-line records from a file.
The records look something like this:
########## Ligand Number : 1
blab bla bla
bla blab bla
########## Ligand Number : 2
blab bla bla
bla blab bla
########## Ligand Number : 3
bla bla bla
<EOF>
I am using Perl RegEx (-P).
To bypass the single-line limitation in grep, I use grep -zo. This way, the pattern can consume multiple lines and output exactly what I want. Generally, it works fine.
However, the problem is that the delimiter here is two empty lines after the last line of each record (three consecutive '\n' characters: one ending the line and two for the empty lines).
When I try to use an expression like
grep -Pzo '^########## Ligand Number :\s+\d+.+?\n\n\n' inputFile
it returns nothing. It seems that grep can't tolerate consecutive '\n' characters.
Can anybody give an explanation?
P.S. I have already worked around it by translating the '\n' characters to '\a' first, then translating them back, as in the following example:
cat inputFile | tr '\n' '\a' | grep -Po '########## Ligand Number :\s+\d+\a.+?\a\a\a' | tr '\a' '\n'
But I need to understand why grep could not match the '\n\n\n' pattern.
In a PCRE regex, . does not match line break characters by default. The s modifier makes . match them as well, which is how the dot behaves in POSIX regexes.
Thus, add (?s) at the start, or replace . with [\s\S].
(?s)^########## Ligand Number :\s+\d+.+?\n\n\n
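Python's re module has the same default, so the effect is easy to reproduce there. The sketch below uses made-up sample data with two blank lines after each record (the listing in the question lost them); re.DOTALL plays the role of (?s), and re.MULTILINE is added so that ^ can match at the start of every record.

```python
import re

# Hypothetical input: each record is followed by two empty lines.
text = (
    "########## Ligand Number : 1\n"
    "blab bla bla\n"
    "bla blab bla\n"
    "\n\n"
    "########## Ligand Number : 2\n"
    "blab bla bla\n"
    "\n\n"
)

pattern = r"^########## Ligand Number :\s+\d+.+?\n\n\n"

# Without DOTALL, .+? cannot step over the newline after the number.
without_dotall = re.findall(pattern, text, flags=re.MULTILINE)
# With DOTALL, .+? expands across lines until the first \n\n\n.
with_dotall = re.findall(pattern, text, flags=re.MULTILINE | re.DOTALL)

print(len(without_dotall))  # 0
print(len(with_dotall))     # 2
```

So the three consecutive \n were never the problem; the match failed earlier, at the first line break that .+? had to cross.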

Regex - delete everything but matching groups (using sublime 3)

How can I keep only matching groups and delete the rest of the text?
Using: Sublime 3 - Regex
My text is
1.5.1 Bla bla bla
text text text
text text text
1.5.2 Bla bla bla
text text text
I want to keep only this
1.5.1 Bla bla bla
1.5.2 Bla bla bla
I can manage to select only the groups, but not everything except them.
Link: https://regex101.com/r/pV9xU6/2
Thank you
According to the comments, it can be done in several ways:
Find: (?s)^(1\.5\.\d+[^\n]*\n[^\n]*\n)|. /gm
Replace: $1
or
Find (general way): (*SKIP)(*F)|.*\R*
Find: (1[.]5[.]\d+.*\n.*\n)(*SKIP)(*F)|.*\R*
Replace: nothing
or
Find: (^1\.5\.\d+.*\n.*\n)\K(?>.*\R)*?(?=(?1)|.*\z) /gm
Replace: nothing
Thanks for all your help.
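The idea behind the first Find/Replace pair (capture what you want to keep, let a bare . consume everything else, and replace with the captured group) can be tried outside the editor. Here is a rough Python sketch, assuming each heading occupies a single line as in the sample above. Python's re has no (*SKIP)(*F), \K or \R, so only this first variant translates directly.

```python
import re

text = (
    "1.5.1 Bla bla bla\n"
    "text text text\n"
    "text text text\n"
    "1.5.2 Bla bla bla\n"
    "text text text\n"
)

# Alternative 1 captures a whole heading line; alternative 2 matches any
# single character (including newlines, because of (?s)).
keep_or_drop = re.compile(r"(?s)^(1\.5\.\d+[^\n]*\n)|.", re.MULTILINE)

# Headings are put back via group 1; every other character has no group 1
# and is therefore replaced with nothing.
result = keep_or_drop.sub(lambda m: m.group(1) or "", text)
print(result, end="")
# 1.5.1 Bla bla bla
# 1.5.2 Bla bla bla
```

The order of the alternatives matters: the heading branch must be tried first, otherwise the bare . would swallow the headings one character at a time.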

Regex: replace newline except the ones with a dot

Thanks in advance for any help you can provide
I have a text like that:
Bla bla bla bl[CR][LF]
a bla bla bla[CR][LF]
bla bla.[CR][LF]
Bla bla bla bla bl[CR][LF]
...and so on
I'd like to replace all new lines except the ones having a dot as last character.
This is what I want to end up with:
Bla bla bla bla bla bla bla bla bla.[CR][LF]
Bla bla bla bla bla.[CR][LF]
...and so on
I tried Notepad++, which supports regex, using the Search & Replace tab (Ctrl+H). This is what I used:
Search: [^.\r\n]\r\n
The Replace field contained just a space.
It worked, but it also removed the last character of every joined line:
Bla bla bla bla bla bla bla bla bl.[CR][LF]
Bla bla bla bla bl.[CR][LF]
As I am a regex novice, what is the best way to do that?
Use this regex in the search field: (?<!\.)\r\n. It means: find any \r\n that is not preceded by a period.
Your regex means: find any three characters where the first is not a period, \r, or \n, and the last two are \r\n. The replacement then replaces that first character as well. The regex I posted checks for the non-period with a zero-width lookbehind, so that character is not part of the match and is not replaced.
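The difference between the two patterns is easy to see side by side. A small Python sketch with made-up sample text (Python's re supports the same lookbehind syntax):

```python
import re

text = "one two\r\nthree.\r\nfour\r\nfive.\r\n"

# Zero-width lookbehind: only the \r\n itself is matched and replaced.
joined = re.sub(r"(?<!\.)\r\n", " ", text)

# Character class: the character before \r\n is part of the match,
# so it is replaced along with the line break.
truncated = re.sub(r"[^.\r\n]\r\n", " ", text)

print(repr(joined))     # 'one two three.\r\nfour five.\r\n'
print(repr(truncated))  # 'one tw three.\r\nfou five.\r\n'
```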

Regular expression to find text between strings WITHOUT a predefined string in the middle

I have text written like this:
11 bla gulp bla 22
11 bla bla bla 2211 bla
ble
bli 22
I need a regex to find the text between each pair of "11" and "22", BUT only where it does NOT contain "gulp".
If I search (?s)11.*?22 using TextCrawler, I find all the three strings:
bla gulp bla
bla bla bla
bla ble bli
Wrong! I'd like to obtain only:
bla bla bla
bla ble bli
because "bla gulp bla" contains "gulp", and I don't want it!
Any idea? :-)
Use a negative lookahead assertion:
11(?!.*?gulp.*?)(.*?)22
Word boundaries around gulp might be a good idea, because they distinguish gulp from gulping, gulped, or ungulp(?):
11(?!.*?\bgulp\b.*?)(.*?)22
But putting them around everything:
\b11\b(?!.*?\bgulp\b.*?)(.*?)\b22\b
would exclude your other two results, which is not what you want.
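One caveat: the lookahead in these patterns rejects an 11 whenever gulp appears anywhere later in the text it can see, not only before the closing 22. A stricter variant, which is not what the answer above uses, is a "tempered" dot that refuses to step onto gulp or 22. Here is a sketch in Python, with re.DOTALL standing in for (?s); the constant names are made up for illustration.

```python
import re

# Tempered dot: consume one character at a time, but never step onto
# the start of "gulp" or of the closing "22".
TEMPERED = re.compile(r"11((?:(?!gulp|22).)*)22", re.DOTALL)

# The lookahead pattern quoted above, for comparison.
LOOKAHEAD = re.compile(r"11(?!.*?gulp.*?)(.*?)22", re.DOTALL)

sample = "11 bla gulp bla 22\n11 bla bla bla 2211 bla\nble\nbli 22"
print([s.strip() for s in TEMPERED.findall(sample)])
# ['bla bla bla', 'bla\nble\nbli']

# A gulp that comes after the closing 22 should not disqualify the pair.
later_gulp = "11 ok 22 gulp"
print(LOOKAHEAD.findall(later_gulp))  # []
print(TEMPERED.findall(later_gulp))   # [' ok ']
```

On the sample text from the question both patterns give the same two results; they only differ when gulp shows up after a pair has already closed.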