Regex - Match parentheses without matching their contents

I just need to match parentheses around some content that has to match specific criteria. I need to match only the parentheses so that I can then do a quick replacement of only those parentheses and keep their content.
For the moment, what I have matches those specific parentheses, but unfortunately also their contents: \((?:\d{2,7})\)
The criteria for matching parentheses are as follows:
only match parentheses that contain \d{2,7}
I have tried positive lookahead (\((?=\d{2,7})\)), and while it does indeed not consume whatever follows the open parenthesis, it then fails to match the closing parenthesis as it backtracks to before the content...
So yeah, any help would be appreciated :)

Pure RegEx pattern: \((?=\d{2,7}\))|(?<=\()\d{2,7}\K\)
Update: I don't know about Swift, but according to this documentation, Template Matching Format part, $n can also be used similarly, as in
let myString = "(32) 123-323-2323"
let regex = try! NSRegularExpression(pattern: "\\((\\d{2,7})\\)")
let range = NSMakeRange(0, myString.characters.count)
regex.stringByReplacingMatchesInString(myString,
options: [],
range: range,
withTemplate: "$1")
With the assumption that you are using Java, I would suggest something as simple as
str.replaceAll("\\((\\d{2,7})\\)", "$1")
The pattern \((\d{2,7})\) matches the whole expression including the parentheses, captures the number in group 1, and replaces the match with only that number, effectively removing the surrounding parentheses.

The regex can be \((\d{2,7})\). It matches each pair of parentheses together with its content; the content is available as capture group 1 and can be used in the replacement string in place of the whole match.
How to access results of regex is language specific, I think.
EDIT:
Here is code which can work. It's untested and I have to warn you at first:
This is my first experience with Swift and online sandbox which I found couldn't compile it. But it couldn't compile examples from Apple website, either...
import Foundation
let text = "some input 22 with (65498) numbers (8643)) and 63546 (parenthesis)"
let regex = try NSRegularExpression(pattern: "\\((\\d{2,7})\\)", options: [])
let replacedStr = regex.stringByReplacingMatchesInString(text,
options: [],
range: NSRange(location: 0, length: text.characters.count),
withTemplate: "$1")

Are you okay with removing all parenthesis regardless?: [()] done.
Apparently you've said that's not okay, though that wasn't clear at the time of the question first being asked.
So, then try capturing the number part and using it as the substitution of the match. Like: in.replaceAll("\\((\\d{2,7})\\)","$1").
To put this very plainly so that any regular expression system can use it:
Match:\(([0-9]{2,7})\) means a ( a subgroup of 2 to 7 digits and a )
Substitute each match with: a reference to that match's first subgroup capture. The digits.
You can see this operating as you asked on the input:
Given some (unspecified) input that contains (1) (12) (123) etc to (1234567) and (12345678) we munge it to strip (some) parentheses.
If you follow this fiddle.re link, and press the green button.
Or with more automatic-explanation at this regex101.com link.
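For readers without the links at hand, the match-and-substitute recipe above can be sketched in Python. This is only an illustration on the sample sentence from this answer; the variable names are made up for the example, and Python writes the group reference as \1 rather than $1.

```python
import re

# Sample sentence taken from the answer above.
text = ("Given some (unspecified) input that contains (1) (12) (123) etc "
        "to (1234567) and (12345678) we munge it to strip (some) parentheses.")

# Group 1 captures the 2 to 7 digits; the replacement keeps only that group,
# so the surrounding parentheses disappear.
stripped = re.sub(r"\(([0-9]{2,7})\)", r"\1", text)
print(stripped)
```

Only (12), (123) and (1234567) lose their parentheses; (1) has too few digits and (12345678) too many, so both are left alone.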
Or what about replacing each match with a substring of the matched content, so you don't even need a subgroup, since you never want the first or last character of the match. Like:
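// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern p = Pattern.compile("\\(\\d{2,7}\\)");
Matcher m = p.matcher(mysteryInputSource);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // Drop the first and last character of the match, i.e. the parentheses.
    m.appendReplacement(sb, mysteryInputSource.substring(m.start() + 1, m.end() - 1));
}
m.appendTail(sb);
System.out.println(sb.toString());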
Since you're not using Java, try translating any of the above suggestions to your language.
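As one possible translation of the substring idea, here is a small Python sketch. The function name strip_parens is hypothetical; the technique is the same as in the Java loop: match the whole parenthesized number and replace it with the match minus its first and last character.

```python
import re

def strip_parens(text):
    # The whole match is "(" + digits + ")", so slicing off the first and
    # last character leaves just the digits; no capture group is needed.
    return re.sub(r"\(\d{2,7}\)", lambda m: m.group(0)[1:-1], text)

print(strip_parens("(32) 123-323-2323"))  # prints: 32 123-323-2323
```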

Related

Regular expression to find specific text within a string enclosed in two strings, but not the entire string

I have this type of text:
string1_dog_bit_johny_bit_string2
string1_cat_bit_johny_bit_string2
string1_crocodile_bit_johny_bit_string2
string3_crocodile_bit_johny_bit_string4
string4_crocodile_bit_johny_bit_string5
I want to find all occurrences of “bit” that occur only between string1 and string2. How do I do this with regex?
I found the question Regex Match all characters between two strings, but the regex there matches the entire string between string1 and string2, whereas I want to match just parts of that string.
I am doing a global replacement in Notepad++. I just need regex, code will not work.
Thank you in advance.
Roman
If I understand correctly, here is code that does what you want:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var input = new List<string>
{
    "string1_dog_bit_johny_bit_string2",
    "string1_cat_bit_johny_bit_string2",
    "string1_crocodile_bit_johny_bit_string2",
    "string3_crocodile_bit_johny_bit_string4",
    "string4_crocodile_bit_johny_bit_string5"
};
// The named group "bitGroup" captures each literal "bit".
Regex regex = new Regex(@"(?<bitGroup>bit)");
var allMatches = new List<string>();
foreach (var str in input)
{
    // Only search lines enclosed by string1 ... string2.
    if (str.StartsWith("string1") && str.EndsWith("string2"))
    {
        var matchCollection = regex.Matches(str);
        allMatches.AddRange(matchCollection.Cast<Match>().Select(match => match.Groups["bitGroup"].Value));
    }
}
Console.WriteLine("All matches {0}", allMatches.Count);
This regex will do the job:
^string1_(?:.*(bit))+.*_string2$
^ means the start of the text (or line if you use the m option like so: /<regex>/m )
$ means the end of the text
. means any character
* means the previous character/expression is repeated 0 or more times
(?:<stuff>) means a non-capturing group (<stuff> won't be captured as a result of the matching)
You could use ^string1_(.*(bit).*)*_string2$ if you don't care about performance or don't have large/many strings to check. The outer parentheses allow multiple occurrences of "bit".
If you provide us with the language you want to use, we could give more specific solutions.
edit: As you added that you're trying a replacement in Notepad++ I propose the following:
Use (?<=string1_)(.*)bit(.*)(?=_string2) as the regex and $1xyz$2 as the replacement pattern (replace xyz with your string). Then perform a "replace all" operation repeatedly until N++ doesn't find any more matches. The problem here is that this regex only matches one "bit" per line per iteration, and therefore needs to be applied repeatedly.
Btw. even if a regexp matches the whole line, you can still only replace parts of it using capturing groups.
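The "replace all until nothing matches" procedure can be sketched outside the editor as well. The following Python loop is only an illustration of why repeated passes are needed; replace_between is a hypothetical helper, and it assumes the replacement text does not itself contain "bit" (otherwise the loop would never finish).

```python
import re

PATTERN = re.compile(r"(?<=string1_)(.*)bit(.*)(?=_string2)")

def replace_between(text, replacement="xyz"):
    # Each pass rewrites at most one "bit" per line (the last one, because
    # the first .* is greedy), so repeat until the text stops changing.
    while True:
        updated = PATTERN.sub(
            lambda m: m.group(1) + replacement + m.group(2), text)
        if updated == text:
            return text
        text = updated

sample = ("string1_dog_bit_johny_bit_string2\n"
          "string3_crocodile_bit_johny_bit_string4")
print(replace_between(sample))
```

On this sample the first line needs two passes, and the second line is never touched because it is not enclosed by string1_ and _string2.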
You can use the regex:
(?:string1|\G)(?:(?!string2).)*?\Kbit
regex101 demo. Tried it on notepad++ as well and it's working.
There is a description on the demo site, but if you want more explanation, let me know and I'll elaborate!

Regex, continue matching after lookaround

I'm having trouble with lookaround in regex.
Here is the problem: I have a big file to edit, in which I want to replace one function with another, keeping the first parameter but removing the second one.
Let's say we have:
func1(paramIWantToKeep, paramIDontWant)
or
func1(func3(paramIWantToKeep), paramIDontWant)
I want to change these to:
func2(paramIWantToKeep) in both cases.
So I tried using a positive lookahead:
func1\((?=.+), paramIDontWant\)
For now, I am only trying to leave the first parameter out of the match (I will then handle the parentheses the same way).
But it doesn't work: it appears that after evaluating the positive lookahead (?=.+), my regex looks for , paramIDontWant\) at the same position it was at before the lookahead (just after the opening parenthesis).
So my question is: how do I continue a regex after a matching group, here after (.+)?
Thanks.
PS: Sorry for the english and/or the bad construction of my question.
Edit : I use Sublime Text
The first thing you need to understand is that a regex will always match a consecutive string. There will never be gaps.
Therefore, if you want to replace 123abc456 with abc, you can't simply match 123456 and remove it.
Instead, you can use a capturing group. This will allow you to remember a section of the regex for later.
For example, to replace 123abc456 with abc, you could replace this regex:
\d+([a-z]+)\d+
with this string:
$1
What that actually does is replace the match with the contents of the first capturing group. In this case, the capturing group was ([a-z]+), which matches abc. Thus, the entire match is replaced with just abc.
An example you may find more useful:
Given:
func1(foo, bar)
replacing this regex:
\w+\((\w+),\s*\w+\)
with this string:
func2($1)
results in:
func2(foo)
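The two replacements above can be reproduced in a few lines of Python, as a quick check rather than anything specific to Sublime Text. Note that Python's re.sub writes the group reference as \1 where the editor uses $1; the variable names are illustrative only.

```python
import re

# Replace the whole match with the contents of capture group 1.
letters_only = re.sub(r"\d+([a-z]+)\d+", r"\1", "123abc456")
print(letters_only)  # prints: abc

# Same idea for the function call: keep the first argument, drop the second.
rewritten = re.sub(r"\w+\((\w+),\s*\w+\)", r"func2(\1)", "func1(foo, bar)")
print(rewritten)  # prints: func2(foo)
```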
import re
t = "func1(paramKeep,paramLose)"
t1 = "func1(paramKeep,((paramLose(dog,cat))))"
t2 = "func1(func3(paramKeep),paramDont)"
t3 = "func1(func3(paramKeep),paramDont,((i)),don't,want,these)"
reg = r'(\w+\(.*?(?=,))(,.*)(\))'
keep,lose,end = re.match(reg,t).groups()
print(keep+end)
keep,lose,end = re.match(reg,t1).groups()
print(keep+end)
keep,lose,end = re.match(reg,t2).groups()
print(keep+end)
keep,lose,end = re.match(reg,t3).groups()
print(keep+end)
Produces
>>>
func1(paramKeep)
func1(paramKeep)
func1(func3(paramKeep))
func1(func3(paramKeep))
Apply these two regexps in this order:
s/(func1)([^,]*)(, )?(paramIDontWant)(.)/func2$2$5/;
s/(func2\()(func3\()(paramIWantToKeep).*/$1$3)/;
These cope with the two examples you gave. I guess that the real-world code you are editing is slightly more complicated, but the general idea of applying a series of regexps might be helpful.
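For anyone who wants to try the two-step idea outside Perl, here is a rough Python transcription. The function name rewrite_call is made up for this sketch, and the patterns are copied from the two substitutions above with $n rewritten as \n.

```python
import re

def rewrite_call(line):
    # Step 1: rename func1 to func2 and drop ", paramIDontWant".
    line = re.sub(r"(func1)([^,]*)(, )?(paramIDontWant)(.)", r"func2\2\5", line)
    # Step 2: unwrap a nested func3(...) around the kept parameter.
    line = re.sub(r"(func2\()(func3\()(paramIWantToKeep).*", r"\1\3)", line)
    return line

print(rewrite_call("func1(paramIWantToKeep, paramIDontWant)"))
print(rewrite_call("func1(func3(paramIWantToKeep), paramIDontWant)"))
```

Both calls print func2(paramIWantToKeep). As with the Perl version, the patterns are tied to these literal parameter names and would need generalizing for real code.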

Regex AND operator

Based on this answer
Regular Expressions: Is there an AND operator?
I tried the following on http://regexpal.com/ but was unable to get it to work. What am I missing? Does JavaScript not support it?
Regex: (?=foo)(?=baz)
String: foo,bar,baz
It is impossible for both (?=foo) and (?=baz) to match at the same time. It would require the next character to be both f and b simultaneously which is impossible.
Perhaps you want this instead:
(?=.*foo)(?=.*baz)
This says that foo must appear anywhere and baz must appear anywhere, not necessarily in that order and possibly overlapping (although overlapping is not possible in this specific case because the letters themselves don't overlap).
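The difference between the two patterns is easy to check. The snippet below uses Python's re module purely as a convenient test bench (the question is about JavaScript, but lookaheads behave the same way here); the variable names are illustrative.

```python
import re

s = "foo,bar,baz"

# Both lookaheads are tested at the SAME position, so this can never match:
# the next characters cannot be "foo" and "baz" at once.
both_here = re.search(r"(?=foo)(?=baz)", s)

# With .* inside each lookahead, each word may appear anywhere further on.
both_anywhere = re.search(r"(?=.*foo)(?=.*baz)", s)

print(both_here is None)          # prints: True
print(both_anywhere is not None)  # prints: True
```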
Example of a Boolean (AND) plus Wildcard search, which I'm using inside a javascript Autocomplete plugin:
String to match: "my word"
String to search: "I'm searching for my funny words inside this text"
You need the following regex: /^(?=.*my)(?=.*word).*$/im
Explaining:
^ assert position at start of a line
?= Positive Lookahead
.* matches any character (except newline)
() Groups
$ assert position at end of a line
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Test the Regex here: https://regex101.com/r/iS5jJ3/1
So, you can create a javascript function that:
Replace regex reserved characters to avoid errors
Split your string at spaces
Encapsulate your words inside regex groups
Create a regex pattern
Execute the regex match
Example:
function fullTextCompare(myWords, toMatch) {
    // Replace regex reserved characters
    myWords = myWords.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
    // Split your string at spaces
    var arrWords = myWords.split(" ");
    // Encapsulate your words inside regex lookahead groups
    arrWords = arrWords.map(function (n) {
        return "(?=.*" + n + ")";
    });
    // Create a regex pattern
    var sRegex = new RegExp("^" + arrWords.join("") + ".*$", "im");
    // Execute the regex match
    return sRegex.test(toMatch);
}
//Using it:
console.log(
fullTextCompare("my word","I'm searching for my funny words inside this text")
);
//Wildcards:
console.log(
fullTextCompare("y wo","I'm searching for my funny words inside this text")
);
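The same steps carry over to other languages. As a sketch only, here is how they might look in Python; full_text_compare is a hypothetical port of the function above, and it splits before escaping because Python's re.escape also escapes spaces.

```python
import re

def full_text_compare(my_words, to_match):
    # Split at whitespace first, then escape each word so that regex
    # metacharacters in the search terms are treated literally.
    lookaheads = "".join(
        "(?=.*" + re.escape(word) + ")" for word in my_words.split())
    # One lookahead per word: every word must occur somewhere on the line.
    pattern = re.compile("^" + lookaheads + ".*$", re.IGNORECASE | re.MULTILINE)
    return pattern.search(to_match) is not None

haystack = "I'm searching for my funny words inside this text"
print(full_text_compare("my word", haystack))  # prints: True
print(full_text_compare("y wo", haystack))     # prints: True
print(full_text_compare("my cat", haystack))   # prints: False
```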
Maybe you are looking for something like this. If you want to select the complete line when it contains both "foo" and "baz" at the same time, this regex will do that:
.*(foo)+.*(baz)+|.*(baz)+.*(foo)+.*
Maybe just an OR operator | could be enough for your problem:
String: foo,bar,baz
Regex: (foo)|(baz)
Result: ["foo", "baz"]

Regular Expression: how can I impose a perfect string matching?

Currently I am using this one (edit: I forgot to explain that I use it to exclude exactly these words :p):
String REGEXP = "^[^(REG_)?].*";
but it also matches (and so excludes) ERG, EGR, GRE, etc.
P.S.
I removed SUPER because it is another keyword that I must filter. Picture an array list made up of many words that follow one of these three models:
REG_info1, info2, SUPER_info3, etc...
I need three filters, each matching one model at a time; my question focuses only on the second filter, which parses keywords based on the model "info2".
Just type it literally:
REG
This will only match REG.
So:
String REGEXP = "^(REG_|SUPER_)?.*";
Edit   After you clarified that you want to match every word that does not begin with REG_ or SUPER_, you could try this:
\b(?!REG_|SUPER_)\w+
The \b is a word boundary and the expression (?!expr) is a look-ahead assertion.
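A quick way to see what this pattern selects is to run it over the sample words from the question. The check below uses Python, although the question is about Java; the lookahead syntax is the same in both, and the variable name is illustrative.

```python
import re

# \b anchors at the start of a word; (?!REG_|SUPER_) rejects words that
# begin with either prefix; \w+ then consumes the rest of the word.
words = re.findall(r"\b(?!REG_|SUPER_)\w+", "REG_info1, info2, SUPER_info3")
print(words)  # prints: ['info2']
```

Because the underscore counts as a word character, there is no word boundary inside REG_info1, so the pattern cannot sneak in and match just info1.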
As everyone has already replied, if you want to match a line starting with REG, you use the regexp "^REG"; if you want to match any line that starts with REG or SUPER, you use "^(REG|SUPER)". Regular expression negation is, in general, a tricky problem.
To match all lines NOT starting with 'REG' you need to match "^([^R]|R[^E]|RE[^G])", and a regular expression to match all lines not starting with REG or SUPER can be constructed in a similar fashion (start by grouping the "not REG" in parentheses, then construct the "not SUPER" patterns as "[^S]|S[^U]|SU[^P]...", group this and use alternation for both groups).
How about
\mREG\M
// \mREG\M
//
// Options: ^ and $ match at line breaks
//
// Assert position at the beginning of a word «\m»
// Match the characters “REG” literally «REG»
// Assert position at the end of a word «\M»
The [] indicate character classes. This is not what you want. You can just use "REG" to match REG. (You can use REG|SUPER for REG or SUPER)
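To make the character-class pitfall concrete, the sketch below compares the pattern from the question with a negative lookahead of the kind suggested in another answer here. It is written in Python only for brevity; both patterns mean the same thing in Java, and the names are illustrative.

```python
import re

# [^(REG_)?] is a negated SET of single characters: ( R E G _ ) ?
char_class = re.compile(r"^[^(REG_)?].*")
# (?!REG_) really means "the text does not start with REG_".
lookahead = re.compile(r"^(?!REG_).*")

results = {
    word: (char_class.match(word) is not None, lookahead.match(word) is not None)
    for word in ["REG_info1", "ERG_info", "info2"]
}
for word, (by_class, by_lookahead) in results.items():
    print(word, by_class, by_lookahead)
```

ERG_info is rejected by the character class merely because it starts with E, which is the behaviour described in the question, while the lookahead accepts it.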
REGEXP = "^(REG_|SUPER_)"
would match anything that has REG_ or SUPER_ at the beginning of a string. You don't need anything more after the group "(..|..)".

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar questions on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This captures every character after B= that is not a ;, so it matches everything between B= and the first ; that follows.
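The three behaviours discussed so far (greedy, lazy, and negated character class) can be compared side by side. The snippet below is a small Python check on the string from the question, using patterns without the surrounding .+ to keep the comparison simple; the variable names are illustrative.

```python
import re

s = "A=abc;B=def_3%^123+-;C=123;"

greedy = re.search(r"B=(.+);", s).group(1)       # runs to the LAST ;
lazy = re.search(r"B=(.+?);", s).group(1)        # stops at the first ;
negated = re.search(r"B=([^;]+);", s).group(1)   # cannot cross a ; at all

print(greedy)   # prints: def_3%^123+-;C=123
print(lazy)     # prints: def_3%^123+-
print(negated)  # prints: def_3%^123+-
```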
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
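To round this off, here is one way the getInnerString pseudocode from the question could be written with a find rather than a replace. It is a sketch in Python, not a claim about the asker's language; get_inner_string and its None return value for "no match" are choices made for this example.

```python
import re

def get_inner_string(in_str, start_token, end_token):
    # re.escape keeps tokens such as "+" or "." literal; the lazy (.*?)
    # stops at the first end_token that follows start_token.
    match = re.search(
        re.escape(start_token) + "(.*?)" + re.escape(end_token), in_str)
    return match.group(1) if match else None

my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))  # prints: def_3%^123+-
```

Reading the capture group directly avoids having to match, and then throw away, the rest of the string.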