Replace characters inside a RegEx match - regex

I want to match certain lines in a text and, inside each match, replace a certain character every time it occurs.
Sample Text:
Any text and "much" "more" of it. Don't replace quotes here
CatchThis( no quotes here, "any more text" , "and so on and so forth...")
catchthat("some other text" , "or less")
some text in "between"
CatchAnything ( "even more" , "and more", no quotes there, "wall of text")
more ("text"""") and quotes after...
Now I want to replace every quote inside the round brackets with, let's say, a hash sign.
Desired outcome:
Any text and "much" "more" of it. Don't replace quotes here
CatchThis( no quotes here, #any more text# , #and so on and so forth...#)
catchthat(#some other text# , #or less#)
some text in "between"
CatchAnything ( #even more# , #and more#, no quotes there, #wall of text#)
more ("text"""") and quotes after...
Matching the lines is easy. Here's my pattern for that:
(?i)Catch(?:This|That|Anything)[ \t]*\(.+\)
Unfortunately, I have no idea how to match every quote and replace it...

The common approach to matching every occurrence of a pattern between two different delimiters is a regular expression based on the \G anchor:
(?i)(?:\G(?!\A)|Catch(?:This|That|Anything)\s*\()[^()"]*\K"
See the regex demo.
Explanation:
(?i) - case insensitive modifier
(?: - a non-capturing group matching 2 alternatives
\G(?!\A) - a place in the string right after the previous successful match (as \G also matches the start of the string, the (?!\A) is necessary to exclude that possibility)
| - or
Catch(?:This|That|Anything) - Catch followed with either This or That or Anything
\s* - 0+ whitespaces
\( - a literal ( symbol
) - end of the non-capturing group
[^()"]* - any 0+ chars other than (, ) and "
\K - a match reset operator
" - a double quote.

Do you really need to do the replacement inside the regex? If your regex already finds the lines you want, you can replace the character in each matched string afterwards.
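The idea of replacing inside the matched string could be sketched in Python roughly as follows. This is only an illustration: it assumes Python's standard re module, reuses the asker's line pattern as-is, and the name hash_quotes is made up for the example.

```python
import re

# The asker's line pattern; the (?i) flag is passed as re.IGNORECASE instead.
CATCH_CALL = re.compile(r"Catch(?:This|That|Anything)[ \t]*\(.+\)", re.IGNORECASE)


def hash_quotes(text: str) -> str:
    """Replace every double quote inside a matched Catch...(...) call with '#'."""
    # The callback gets each match and returns its text with the quotes swapped,
    # so quotes outside the matched calls are never touched.
    return CATCH_CALL.sub(lambda m: m.group(0).replace('"', "#"), text)


sample = 'some text in "between"\ncatchthat("some other text" , "or less")'
print(hash_quotes(sample))
```

This needs no \G or \K support, so it also works in engines that lack them.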

Related

Regex to catch strings that are not inside string pattern

I want to find a regex that catch all strings that are not inside name('stringName') pattern.
For example I have this text:
fdlfksj "hello1" dsffsf "hello2\"hi" name("Tod").name('tod') 'hello3'
I want my regex to catch the strings:
"hello1", "hello2\"hi", 'hello3' (it should catch "hello2\"hi" as a whole, because an escaped \" must not end the string).
I also want my regex to ignore "Tod", because it is inside the pattern name("...").
How should I do it?
Here is my regex, which doesn't work:
((?<!(name\())("[^"]*"|'[^']*'))
It doesn't handle the escapes \" and \'
and it also doesn't ignore name("Tod").
How can I fix it?
You can use the following regex:
(?<!name\()(["'])[^\)]+?(?<!\\)\1
It will match anything other than a closing parenthesis ([^\)]+?):
preceded by (["']) - a quote symbol
followed by (?<!\\)\1 - the same quote symbol, which is not preceded by a backslash
To avoid matching the values that come right after name(, the negative lookbehind (?<!name\() checks that the opening quote is not preceded by name(.
Check the demo here.
(["'])((?:\\\1)|[^\1]*?)\1
Regex Explanation
( Capturing group
["'] Match " (double) or ' (single) quote
) Close group
( Capturing group
(?: Non-capturing group
\\\1 Match \ followed by the quote by which it was started
) Close non-capturing group
| OR
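[^\1]*? Non-greedy match, intended as "anything except the quote by which it was started" (note that in most engines \1 inside a character class is not a backreference, so the class may not exclude the quote as intended)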
) Close group
\1 Match the close quote
See the demo
You could get out of the way what you don't want, and use a capture group for what you want to keep.
The matches that you want are in capture group 2.
name\((['"]).*?\1\)|('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*")
Explanation
name\((['"]).*?\1\) Match name and then from the opening parenthesis till closing parenthesis between the same type of quote
| Or
( Capture group 2
'[^'\\]*(?:\\.[^'\\]*)*' Match the single quoted value, including escaped characters
| Or
"[^"\\]*(?:\\.[^"\\]*)*" The same for the double quotes
) Close group 2
Regex demo
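A minimal Python sketch of this "match what you don't want, capture what you do" idea, assuming the standard re module (the function name strings_outside_name is hypothetical):

```python
import re

# Alternative 1 consumes name("...") / name('...') and sets only group 1.
# Alternative 2 captures a quoted string (with backslash escapes) in group 2.
PATTERN = re.compile(
    r"""name\((['"]).*?\1\)|('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*")"""
)


def strings_outside_name(text: str) -> list:
    """Return the quoted strings that are not the argument of name(...)."""
    # Group 2 is None whenever the first alternative matched, so those are skipped.
    return [m.group(2) for m in PATTERN.finditer(text) if m.group(2) is not None]


sample = r'''fdlfksj "hello1" dsffsf "hello2\"hi" name("Tod").name('tod') 'hello3' '''
print(strings_outside_name(sample))
```

Because the name(...) calls are consumed by the first alternative, the quotes inside them can never start a match of the second one.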

Regex: How to get all words, special characters and white spaces between quotation marks?

Currently I have the regex ([^\[\][\[^\[\][\n"]+) to match text between "", but it does not capture whitespace: e.g. if I enter " hello ", it returns hello, without the spaces before and after the word.
Is there some expression I can use to just simply catch anything between two quotation marks?
Thank you.
Maybe this will help:
(?<!\\)(\"|')(.+?)(?:(?<!\\)\1)
And to get the text inside the quotes, get the second capture group.
Proof.
Explanation
(?<!\\) - Negative lookbehind. Asserts that the quote is not preceded by a literal backslash ('\')
(\"|') - matches the opening quote of the "string" (capture group 1)
(.+?) - . will match anything but newlines.
+? is lazy: it matches as few characters as possible, only as many as the rest of the pattern needs in order to match.
(?:(?<!\\)\1) - Non-capturing group.
Used here so that the (?<!\\) described earlier applies only to the closing quote. The
\1 is a backreference to the first capture group ((\"|')), i.e. the same kind of quote that opened the string.
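As a rough check of the pattern above, here is a small Python sketch (standard re module assumed; quoted_contents is a name invented for the example) that returns the second capture group, surrounding spaces included:

```python
import re

# The pattern from the answer, written as a raw triple-quoted string.
QUOTED = re.compile(r"""(?<!\\)(\"|')(.+?)(?:(?<!\\)\1)""")


def quoted_contents(text: str) -> list:
    """Return the text between each pair of unescaped matching quotes."""
    # Group 2 holds everything between the quotes, whitespace included.
    return [m.group(2) for m in QUOTED.finditer(text)]


print(quoted_contents('say " hello " and \'x y\''))
```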
You should use the following regex:
\"\s*([^\"]+?)\s*\"
([^\"]+?) - the text you want to get will be between the spaces and the quotes.
Demo & Explanation

Regex - replace blank spaces in line (Notepad++)

I have a document containing various information. I want to build a Notepad++ regex replace that finds the following lines in the document and replaces the blank spaces between the "" with an underscore (_).
Example:
The line is:
&LOG Part: "NAME TEST.zip"
The result should be:
&LOG Part: "NAME_TEST.zip"
The perfect solution would be that the regex finds the &LOG Part: "NAME TEST.zip" lines and replaces the blank space with an underline.
What I have tried for now is this expression to find the text between the " ":
\"[^"]*\"
It should do it, but I don't know which expression to use to replace the blank spaces with an underline.
Anyone could help with a solution?
Thanks!
The \"[^"]*\" will only match whole substrings from " up to another closest " without matching individual spaces you want to replace.
Since Notepad++ does not support infinite width lookbehind, the only possible solution is using the \G - based regex to set the boundaries and use multiple matching (this one will replace consecutive spaces with 1 _):
(?:"|(?!^)\G)\K([^ "]*) +(?=[^"]*")
Or (if each space should be replaced with an underscore):
(?:"|(?!^)\G)\K([^ "]*) (?=[^"]*")
And replace with $1_. If you need to restrict to replacing inside &LOG Part only, just add it to the beginning:
(?:&LOG Part:\s*"|(?!^)\G)\K([^ "]*) (?=[^"]*")
A human-readable explanation of the regex:
(?:"|(?!^)\G)\K - Find a ", or, on each subsequent match, the position where the previous successful match ended, and drop everything matched so far from the reported match (thanks to \K)
([^ "]*) - (Group 1, accessed with $1 from the replacement pattern) 0+ characters other than a space and "
+ - one or more literal spaces (replace with \h to match all horizontal whitespace, or \s to match any whitespace)
(?=[^"]*") - check if there is a double quote ahead of the current position
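Outside Notepad++, for example in a script, the same result can be reached without \G or \K by matching the whole quoted part and replacing inside it with a callback. This is a different technique from the one above, sketched in Python under the assumption that only &LOG Part lines should change; the name underscore_log_parts is hypothetical.

```python
import re

# Group 1: the "&LOG Part:" prefix; group 2: the quoted file name.
LOG_PART = re.compile(r'(&LOG Part:\s*)("[^"]*")')


def underscore_log_parts(text: str) -> str:
    """Replace each space inside the quoted part of &LOG Part lines with '_'."""
    # Keep the prefix unchanged and rewrite only the quoted substring.
    return LOG_PART.sub(lambda m: m.group(1) + m.group(2).replace(" ", "_"), text)


print(underscore_log_parts('&LOG Part: "NAME TEST.zip"'))
```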

regex for first instance of a specific character that DOESN'T come immediately after another specific character

I have a function, translate(), that takes multiple parameters. The first param is the only required one and is a string, which I always wrap in single quotes, like this:
translate('hello world');
The other params are optional, but could be included like this:
translate('hello world', true, 1, 'foobar', 'etc');
And the string itself could contain escaped single quotes, like this:
translate('hello\'s world');
To the point, I now want to search through all code files for all instances of this function call, and extract just the string. To do so I've come up with the following grep, which returns everything between translate(' and either ') or ',. Almost perfect:
grep -RoPh "(?<=translate\(').*?(?='\)|'\,)" .
The problem with this though, is that if the call is something like this:
translate('hello \'world\', you\'re great!');
My grep would only return this:
hello \'world\
So I'm looking to modify this so that the part that currently looks for ') or ', instead looks for the first occurrence of ' that hasn't been escaped, i.e. doesn't immediately follow a \
Hopefully I'm making sense. Any suggestions please?
You can use this grep with PCRE regex:
grep -RoPh "\btranslate\(\s*\K'(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*'" .
Here is a regex demo
RegEx Breakup:
\b # word boundary
translate # match literal translate
\( # match a (
\s* # match 0 or more whitespace
\K # reset the matched information
' # match starting single quote
(?: # start non-capturing group
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
) # end non-capturing group
(?: # start non-capturing group
\\\\. # match a backslash followed by char that is "escaped"
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
)* # end non-capturing group
' # match ending single quote
Here is a version without \K using look-arounds:
grep -oPhR "(?<=\btranslate\(')(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*(?=')" .
RegEx Demo 2
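The same "unrolled" string pattern can be tried outside grep. A minimal Python sketch, assuming the standard re module and using a capture group instead of \K (which Python's re does not support); first_arguments is a name made up for the example:

```python
import re

# translate( + optional whitespace + a single-quoted string whose body is
# "non-quote, non-backslash chars" interleaved with "backslash + any char".
TRANSLATE_ARG = re.compile(r"\btranslate\(\s*'([^'\\]*(?:\\.[^'\\]*)*)'")


def first_arguments(source: str) -> list:
    """Return the first string argument of every translate('...') call."""
    # With a single capture group, findall returns just the captured bodies.
    return TRANSLATE_ARG.findall(source)


code = r"translate('hello \'world\', you\'re great!');"
print(first_arguments(code))
```

Since a backslash is always consumed together with the character after it, an escaped quote can never be mistaken for the closing one.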
I think the problem is the .*? part: the ? makes it a non-greedy pattern, meaning it'll take the shortest string that matches the pattern. In effect, you're saying, "give me the shortest string that's followed by quote+close-paren or quote+comma". In your example, "world\" is followed by a single quote and a comma, so it matches your pattern.
In these cases, I like to use something like the following reasoning:
A string is a quote, zero or more characters, and a quote: '.*'
A character is anything that isn't a quote (because a quote terminates the string): '[^']*'
Except that you can put a quote in a string by escaping it with a backslash, so a character is either "backslash followed by a quote" or, failing that, "not a quote": '(\\'|[^'])*'
Put it all together (doubling the backslash, because the pattern sits inside a double-quoted shell string) and you get
grep -RoPh "(?<=translate\(')(\\\\'|[^'])*(?='\)|'\,)" .

Multiline selection of blocks with ID at the end of each block with regular expression

I have regular expression:
BEGIN\s+\[([\s\S]*?)END\s+ID=(.*)\]
which selects the multiline text and the ID from the text below. I would like to select only IDs with the prefix X_, but if I change ID=(.*) to ID=(X_.*), the match begins at the second block, not at the third as I need. Could someone help me get the correct expression, please?
text example:
BEGIN [
text a
END ID=X_1]
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
text xxx
BEGIN [
text bbb
END ID=X_3]
It isn't the .* that's gobbling everything up as people keep saying, it's the [\s\S]*?. .* can't do it because (as the OP said) the dot doesn't match newlines.
When the END\s+ID=(X_.*)\] part of your regex fails to match the last line of the second block, you're expecting it to abandon that block and start over with the third one. That's what it would have to do to make the shortest match.
In reality, it backtracks to the beginning of the line and lets [\s\S]*? consume it instead. And it keeps on consuming until it finds a place where END\s+ID=(X_.*)\] can match, which happens to be the last line of the third block.
The following regex avoids that problem by matching line by line, checking each one to see if it starts with END. This effectively confines the match to one block at a time.
(?m)^BEGIN\s+\[[\r\n]+((?:(?!END).*[\r\n]+)*)END\s+ID=(X_.*)\]
Note that I used ^ to anchor each match to the beginning of a line, so I used (?m) to turn on multiline mode. But I did not--and you should not--turn on single-line/DOTALL mode.
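To see the line-by-line regex at work, here is a short Python sketch. It assumes the standard re module, whose syntax accepts this pattern unchanged; the function name x_blocks is invented for the example.

```python
import re

# The regex from above: each body line is matched only if it does not start with END.
BLOCK = re.compile(r"(?m)^BEGIN\s+\[[\r\n]+((?:(?!END).*[\r\n]+)*)END\s+ID=(X_.*)\]")

TEXT = """BEGIN [
text a
END ID=X_1]
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
text xxx
BEGIN [
text bbb
END ID=X_3]"""


def x_blocks(text: str) -> list:
    """Return (body, id) pairs for blocks whose ID starts with X_."""
    return BLOCK.findall(text)


for body, block_id in x_blocks(TEXT):
    print(block_id, repr(body))
```

The Y_1 block is skipped entirely, and the body of the X_2 match starts at the third BEGIN rather than the second.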
Assuming there aren't any empty lines inside a block and the BEGIN/END statements are the first non-space text on their line, I'd write the regex like this (Perl notation; change the delimiters and remove the comments, whitespace and the /x modifier if you use a different engine):
m{
\n \s* BEGIN \s+ \[ # match the beginning
( (?!\n\s*\n) .)*? # match anything that isn't an empty line
# checking with a negative look-ahead (?!PATTERN)
\n \s* END \s+ ID=X_[^\]]* \] # the ID may not contain "]"
}sx # /x: use extended syntax, /s: "." matches newlines
If the content may be anything, it might be best to create a list of all blocks, and then grep through them. This regex matches any block:
m{ (
BEGIN \s+ \[
.*? # non-greedy matching is important here
END \s+ ID=[^\]]* \] # greedy matching is safe here
) }xs
(add newlines if wanted)
Then only keep those matches that match this regex:
/ID = X_[^\]]* \] $/x # anchor at end of line
If we don't do this, backtracking may prevent a correct match ([\s\S]*? can contain END ID=X_). Your regex would put anything inside the blocks until it sees a X_.*.
So using BEGIN\s+\[([\s\S]*?)END\s+ID=(.*?)\] (note the extra question mark), one match would be:
BEGIN [
text b
text c
END ID=Y_1]
text aaa
text bbb
BEGIN [
text d
text e
END ID=X_2]
…instead of failing at the Y_. A greedy match (your unchanged regex) should result in the whole file being matched: Your (.*) eats up all characters (until the end of file) and then goes back until it finds a ].
EDIT:
Should you be using Perl's regex engine, we can use the (*FAIL) verb:
/BEGIN\s+\[(.*?)END\s+ID=(X_[^\]]*|(*FAIL))\]/s
"Either have an ID starting with X_ or the match fails". However, this does not solve the problem with END ID=X_1]-like statements inside your data.
Change your .* to a [^\]]* (i.e. match non-]s), so that your matches can't spill over past an END block, giving you something like BEGIN\s+\[([^\]]*?)END\s+ID=(X_[^\]]*)\]
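The difference between the lazy [\s\S]*? version and the [^\]]* version can be checked with a small Python sketch (standard re module assumed; the shortened sample text and the variable names are made up for the illustration):

```python
import re

TEXT = """BEGIN [
text a
END ID=X_1]
BEGIN [
text b
END ID=Y_1]
BEGIN [
text d
END ID=X_2]
"""

# Lazy version: the body may run across a "]" and swallow the Y_1 block.
LAZY = re.compile(r"BEGIN\s+\[([\s\S]*?)END\s+ID=(X_.*)\]")
# Bounded version: the body cannot contain "]", so it cannot leave its block.
BOUNDED = re.compile(r"BEGIN\s+\[([^\]]*?)END\s+ID=(X_[^\]]*)\]")

lazy_matches = LAZY.findall(TEXT)
bounded_matches = BOUNDED.findall(TEXT)

print("lazy:   ", lazy_matches)
print("bounded:", bounded_matches)
```

Both versions report the IDs X_1 and X_2, but only the bounded one keeps the body of the X_2 match free of the Y_1 block. It relies on the block bodies never containing a ] character.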