Regular expression to replace content parentheses and their contents - regex

I am looking for a regular expression that will replace parentheses and the strings within them if the string anything that is not a digit.
The string can be any combination of characters including numbers, letters, spaces etc.
For example:
(3) will not be replaced
(1234) will not be replaced
(some letters) will be replaced
(some letters, spaces - and numbers 123) will be replaced
So far I have a regex that will replace any parentheses and its content
str = str.replaceAll("\\(.*?\\)","");

I am not good with the syntax of replaceAll, so I am just going to write the way you have written it. But I think I can help you with the regex.
Try this Regex:
\((?=[^)]*[a-zA-Z ])[^)]+?\)
Demo
OR an even better one:
\((?!\d+\))[^)]+?\)
Demo
Explanation(for 1st Regex)
\( - matches opening paranthesis
(?=[^)]*[a-zA-Z ]) - Positive Lookahead - checks for 0 or more of any characters which are not ) followed by a space or a letter
[^)]+? - Captures 1 or more characters which are not )
\) - Finally matches the closing Paranthesis
Explanation(for 2nd Regex)
\( - matches opening paranthesis
(?!\d+\)) - Negative Lookahead - matches only those strings which do not have ALL the characters as digits after the opening paranthesis but before the closing paranthesis appears
[^)]+? - Captures 1 or more characters which are not )
\) - Finally matches the closing Paranthesis
Now, you can try your Replace statement as:
str = str.replaceAll("\((?=[^)]*[a-zA-Z ])[^)]+?\)","");
OR
str = str.replaceAll("\((?!\d+\))[^)]+?\)","");

Related

Visual Studio Code regex for (not (xx))

I am trying to have a regex to capture all strings with this shape:
(not (xx))
I.e. a parenthesis, a "not", two empty spaces, two caracters between parenthesis and a closing parenthesis
I have tried:
(not (*))
But I get:
Invalid regular expression. Nothing to repeat
Any idea ?
The Nothing to repeat error is due to the * quantifier used to quantify an opening bracket of a capturing group construct.
You need to 1) escape special ( and ) chars, and 2) match any text between the closest ( and ) with a negated character class, here, with [^()]*:
\(not \([^()]*\)\)
Details:
\(not \( - a (not ( string
[^()]* - zero or more chars other than ( and )
\)\) - a )) string.
If there can be any zero or more whitespace chars between not and (, replace the literal spaces with \s*. If there must be one or more whitespaces, use \s+ instead:
\(not\s*\([^()]*\)\)
\(not\s+\([^()]*\)\)

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Demo
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string with "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
Demo
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.* # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by`'groups'`,
# which is preceded by a word boundary
.* # match zero of more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\)), where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.

notepad++ remove text between two string using regular expression

I want to remove text between two strings using regular expression in notepad++. Here is my full string
[insertedOn]) VALUES (1, N'1F9ACCD2-3B60-49CF-830B-42B4C99F6072',
I want final string like this
[insertedOn]) VALUES (N'1F9ACCD2-3B60-49CF-830B-42B4C99F6072',
Here I removed 1, from string. 1,2,3 is in incremental order.
I tried lot of expression but not worked. Here is one of them (VALUES ()(?s)(.*)(, N')
How can I remove this?
You may use
(VALUES \().*?,\s*(N')
and replace with $1$2. Note that in case the part of string to be removed can contain line breaks, enable the . matches newline. If the N and VALUES must be matched only when in ALLCAPS, make sure the Match case option is checked.
Pattern details
(VALUES \() - Group 1 (later referred with $1 from the replacement pattern): a literal substring VALUES (
.*? - any 0+ chars, as few as possible, up to the leftmost occurrence of the sunsequent subpatterns
,\s* - a comma and 0+ whitespaces (use \h instead of \s to only match horizontal whitespace chars)
(N') - Group 2 (later referred with $2 from the replacement pattern): a literal substring N'.
You should first escape literal ( before VALUES: \(
By doing so, .* in your regex in addition to s (DOTALL) flag causes engine to greedily match up to end of input string then backtracks to stop at the first occurrence of , N' which means unexpected matches.
To improve your regex you should 1) make .* ungreedy 2) remove (?s) 3) escape (:
(VALUES \().*?, (N')
To be more precise in matching you'd better search for:
VALUES \(\K\d+, *(?=N')
and replace with nothing.
Breakdown:
VALUES \( March VALUES ( literally
\K Reset match
\d+, * Match digits preceding a comma and optional spaces
(?=N') Followed by N'

Regex to match numbers followed by a specific character

I am so sorry, I know this is a simple question, which is not appropriate here, but I am terrible in regex.
I use preg_match with a pattern of (numbers A) to match the following replaces with the substrings
2A -> <i>2A</i>
100 A -> <i>100 A</i>
84.55A -> <i>84.55A</i>
92.1 A -> <i>92.1 A</i>
The numbers can be separated from the character or not
The numbers can be decimal
The letter should not be the begging of a word (not matching 4 All;
in fact, A should be followed by a space or period or linebreak)
My problem is to apply OR conditions to match a character which may exist or not to have a single match to be replaced as
$str = preg_replace($pattern, '<i>$1</i>', $str);
I can suggest
'~\b(?<![\d.])\d*\.?\d+\s*A\b~'
See the regex demo. Replace with '<i>$0</i>' where the $0 is the backreference to the whole match.
Details:
\b - leading word boundary
(?<![\d.]) - a negative lookbehind that fails the match if there is a dot or digit before the current location (NOTE: this is added to avoid matching 33.333.4444 A like strings, just remove if not necessary)
\d*\.?\d+ - a usual simplified float/int value regex (0+ digits, an optional . and 1+ digits) (NOTE: if you need a more sophisticated regex for this, see Matching Floating Point Numbers with a Regular Expression)
\s* - 0+ whitespaces
A\b - a whole word A (here, \b is a trailing word boundary).

Why is this regex selecting this text

I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that using the above regex the mentioned string would be allowed (selected). My understanding is that it should not be allowed because there are two numeric digits 2 and 3 while the above regex expression has only one numeric digit in it i.e \d.It should have been \d+. My current expression reads. Zero of more of any character followed by one numeric digit followed by .txt. My question is why is the above string passing the regex expression ?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matched 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
The pattern you have has one group (.*) which would match using your example:MyFile2
because the . allows any character.
Furthermore the . in the pattern after this group is not escaped which will result in allowing another character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
Here is the explanation, your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of a line/string, non-digit character, any number of repetitions, a digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch, which depends on the input string, whether you have a list of words on separate lines, or just a separate file name).
Here is a working example.