Regular expression to match word pairs joined with colons - regex

I don't know regular expressions at all. Can anybody help me with one very simple regular expression:
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' comes from a list, but the second can be anything, e.g. word1 in [ABC, FGR, HTY].
The situation demands one more
modification:
the matching form can be "word11:word12 word13 ..." up to the next "word21: ...".
Things are becoming complex with sec..... I have to learn regex :(
Thanks in advance.

You can use the regex:
\w+:\w+
Explanation:
\w - a single character that is a letter (uppercase or lowercase), a digit, or an underscore (_).
\w+ - one or more of those characters, i.e. basically a word.
So \w+:\w+
matches a pair of words separated by a colon.
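As a minimal sketch of how that pattern behaves on the sentence from the question, here is a Python version (Python is an assumption here; the question does not name a language):

```python
import re

sentence = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"

# \w+:\w+ matches a run of word characters, a colon, and another run.
pairs = re.findall(r"\w+:\w+", sentence)
print(pairs)  # ['Format:Pdf', 'Location:Tokyo']
```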

Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp

Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}

Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
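The question's last modification (a value that runs until the next known key) is not covered by the function above. The following Python sketch shows one possible way to handle it; the helper name extract_pairs and the key list are hypothetical, chosen only for illustration:

```python
import re

def extract_pairs(text, keys):
    """Return (key, value) tuples; a value runs until the next known key or the end."""
    alternation = "|".join(re.escape(k) for k in keys)
    # Lazy value, stopped by a lookahead for " <key>:" or the end of the text.
    pattern = re.compile(
        r"\b(%s):(.+?)(?=\s+(?:%s):|$)" % (alternation, alternation)
    )
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]

text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript"
result = extract_pairs(text, ["Format", "Location", "Size"])
print(result)
# [('Format', 'Pdf With'), ('Location', 'Tokyo Language:Javascript')]
```

Because "Language" is not in the key list, "Language:Javascript" is treated as part of the Location value rather than as a new pair.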

I am currently solving this problem in my Node.js app and found that the following pattern seems suitable for colon-separated pairs:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted values, like a:"b" c:'d e' f:g
Example code in ES6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example code in PHP:
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can test your regular expressions using this online tool: https://regex101.com
By the way, if regex101.com has not deleted it, you can browse that example code here

Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check each for ":", and print it if found. E.g. in Python:
>>> s = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that there are 2 elements in the resulting list, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a = i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo

([^:]+):(.+)
Meaning: (everything except : one or more times), :, (any character one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...
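To see what the ([^:]+):(.+) pattern above actually captures on the sentence from the question, here is a small Python check (a sketch only; the language is an assumption). Both groups are greedy, so the match spans the whole sentence rather than a single word pair:

```python
import re

sentence = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"

m = re.search(r"([^:]+):(.+)", sentence)
first, second = m.group(1), m.group(2)
print(first)   # Java Tutorial Format
print(second)  # Pdf With Location:Tokyo Javascript
```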

Related

Regular Expression: match and count "A" and stop after found "B" by counted times [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
          ^                                                      ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
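The non-recursive pattern for up to 3 levels of nesting also works in engines without recursion support. As a rough check, here it is in Python's built-in re module (the sample strings are only illustrative):

```python
import re

# Handles an outer pair plus up to three further levels of nesting.
THREE_LEVELS = re.compile(
    r"\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)"
)

sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
found = THREE_LEVELS.search(sample).group(0)
print(found)
# (text here(possible text)text(possible text(more text)))

# One level too deep: the outermost pair can no longer be matched,
# so the search falls back to the inner four pairs.
too_deep = "((((( x )))))"
fallback = THREE_LEVELS.search(too_deep).group(0)
print(fallback)
# (((( x ))))
```

The second case shows why deeper nesting has to be added to the pattern by hand when the input requires it.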
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
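The counter-based scan described above can be sketched in Python as follows. This is a minimal illustration under my own assumptions, not the code from the linked answer; the function name outer_group is hypothetical:

```python
def outer_group(text, open_ch="(", close_ch=")"):
    """Return the first balanced group, delimiters included, or None."""
    depth = 0      # number of openers not yet matched by a closer
    start = None   # index of the opener that started the current group
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # The counter is back at zero: this closes the outer group.
                return text[start:i + 1]
    return None

sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_group(sample))
# (text here(possible text)text(possible text(more text)))
```

Stray closing delimiters that appear before any opener are ignored, and an opener that is never closed yields None.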
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, an FSA can remember only its current state; it has no information about the previous states.
In the above diagram, S1 and S2 are two states, where S1 is both the starting and the final state. So if we try the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at the second S2, i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can solve the above problem because we can 'push' each opening parenthesis onto the stack and 'pop' one whenever we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
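The push/pop idea can be sketched in a few lines of Python. This is an illustration of the stack technique only; the function name is_balanced and the extension to square and curly brackets are my own assumptions:

```python
# Maps each closing bracket to the opener it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)  # push the opener
        elif ch in PAIRS:
            # A closer must match the most recently pushed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack is an opener that was never closed.
    return not stack

print(is_balanced("(a[b]{c})"))  # True
print(is_balanced("(a[b)]"))     # False
```

With a single bracket type the stack degenerates into a plain counter, which is why the counting approach mentioned in other answers is sufficient for parentheses alone.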
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
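To make the "first opening to last closing" behaviour concrete, here is a short Python check of the lookaround pattern above (a sketch; the second sample string is borrowed from another answer on this page):

```python
import re

pattern = re.compile(r"(?<=\().*(?=\))")

balanced = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
unbalanced = "abc ( 123 ( foobar ) def ) xyz ) ghij"

inner = pattern.search(balanced).group(0)
print(inner)
# text here(possible text)text(possible text(more text))

# With a stray closing parenthesis the greedy .* still runs to the last ")".
greedy = pattern.search(unbalanced).group(0)
print(repr(greedy))
# ' 123 ( foobar ) def ) xyz '
```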
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool for solving such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
Note that '(pip' is correctly handled as a string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (formerly Perl 6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                 Character markEnd, Boolean includeMarkers)
{
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenDelimiter = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                // remember where the outermost group starts
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                // the outermost group has just been closed: record it
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions - see @dehmann.
If it's just first open to last close, see @Zach.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re

# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)

# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
             "[" : "]",
             "{" : "}",
             "<" : ">",
             '"' : '"',
             "'" : "'",
             "BEGIN" : "END" }

# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))

# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you the first occurrence
str.lastIndexOf(')'); - the last one
So you need the string between them, including both parentheses:
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1);
Because JS regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with it.
So this is a simple JavaScript for-loop version that turns "method(arg)" strings into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
The result looks like this:
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While many answers mention in some form that regex does not support recursive matching, the primary reason for this lies in the Theory of Computation.
A language of the form {a^nb^n | n>=0} is not regular. Regex can only match languages that belong to the set of regular languages.
Read more here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return a list of (start, end) index pairs, one per balanced {...} block in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # compare with None so that a block starting at index 0 is not missed
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to people coming here in search of a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Regexp word between the braces [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quickreference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by #jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.
However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see #dehmann
If it's just first open to last close see #Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');
because js regex doesn't support recursive match, i can't make balanced parentheses matching work.
so this is a simple javascript for loop version that make "method(arg)" string into array
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more # here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
""" returns an array of code snippets from a string (data)"""
start_pos = None
end_pos = None
count_open = 0
count_close = 0
code_snippets = []
for i,v in enumerate(data):
if v =='{':
count_open+=1
if not start_pos:
start_pos= i
if v=='}':
count_close +=1
if count_open == count_close and not end_pos:
end_pos = i+1
if start_pos and end_pos:
code_snippets.append((start_pos,end_pos))
start_pos = None
end_pos = None
return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORKS even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Regex that will extract the string between two known strings [duplicate]

I want to match a portion of a string using a regular expression and then access that parenthesized substring:
var myString = "something format_abc"; // I want "abc"
var arr = /(?:^|\s)format_(.*?)(?:\s|$)/.exec(myString);
console.log(arr); // Prints: [" format_abc", "abc"] .. so far so good.
console.log(arr[1]); // Prints: undefined (???)
console.log(arr[0]); // Prints: format_undefined (!!!)
What am I doing wrong?
I've discovered that there was nothing wrong with the regular expression code above: the actual string which I was testing against was this:
"date format_%A"
Reporting that "%A" is undefined seems a very strange behaviour, but it is not directly related to this question, so I've opened a new one, Why is a matched substring returning "undefined" in JavaScript?.
The issue was that console.log takes its parameters like a printf statement, and since the string I was logging ("%A") had a special value, it was trying to find the value of the next parameter.
Update: 2019-09-10
The old way to iterate over multiple matches was not very intuitive. This led to the proposal of the String.prototype.matchAll method. This new method is in the ECMAScript 2020 specification. It gives us a clean API and solves multiple problems. It is in major browsers and JS engines since Chrome 73+ / Node 12+ and Firefox 67+.
The method returns an iterator and is used as follows:
const string = "something format_abc";
const regexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
const matches = string.matchAll(regexp);
for (const match of matches) {
console.log(match);
console.log(match.index)
}
As it returns an iterator, we can say it's lazy. This is useful when handling particularly large numbers of capturing groups, or very large strings. But if you need to, the result can easily be transformed into an Array by using the spread syntax or the Array.from method:
function getFirstGroup(regexp, str) {
const array = [...str.matchAll(regexp)];
return array.map(m => m[1]);
}
// or:
function getFirstGroup(regexp, str) {
return Array.from(str.matchAll(regexp), m => m[1]);
}
In the meantime, while this proposal gets wider support, you can use the official shim package.
Also, the internal workings of the method are simple. An equivalent implementation using a generator function would be as follows:
function* matchAll(str, regexp) {
const flags = regexp.global ? regexp.flags : regexp.flags + "g";
const re = new RegExp(regexp, flags);
let match;
while (match = re.exec(str)) {
yield match;
}
}
A copy of the original regexp is created; this is to avoid side-effects due to the mutation of the lastIndex property when going through the multiple matches.
Also, we need to ensure the regexp has the global flag to avoid an infinite loop.
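To make the lastIndex point concrete, here is a minimal sketch (the pattern and sample text are made up for illustration, not taken from the proposal). It shows how a global regex remembers its position between exec() calls, and how iterating over a copy leaves the caller's regex untouched:

```javascript
// A global regex keeps its position in lastIndex between exec() calls.
const re = /format_(\w+)/g;
const text = "format_abc format_def";

const first = re.exec(text);          // finds "format_abc" at index 0
const indexAfterFirst = re.lastIndex; // position just past the first match
const second = re.exec(text);         // continues from there, finds "format_def"
const third = re.exec(text);          // nothing left: null, lastIndex resets to 0

// A copy has its own lastIndex (starting at 0), so looping over the copy
// does not disturb a regex the caller may be in the middle of using.
const original = /format_(\w+)/g;
original.lastIndex = 5;               // pretend the caller was mid-iteration
const copy = new RegExp(original, original.flags);
const found = [];
let m;
while ((m = copy.exec(text)) !== null) {
  found.push(m[1]);
}
console.log(found, original.lastIndex);
```

Without the g flag, exec() would always restart from index 0 and the while loop would never end, which is why the generator above forces the flag.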
I'm also happy to see that even this StackOverflow question was referenced in the discussions of the proposal.
original answer
You can access capturing groups like this:
var myString = "something format_abc";
var myRegexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
// Equivalent, using the constructor (backslashes must be doubled inside a string literal):
var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g");
var matches = myRegexp.exec(myString);
console.log(matches[1]); // abc
And if there are multiple matches you can iterate over them:
var myString = "something format_abc";
var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g");
var match = myRegexp.exec(myString);
while (match != null) {
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
console.log(match[0])
match = myRegexp.exec(myString);
}
Here’s a method you can use to get the nth capturing group for each match:
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
// Example :
var myString = 'something format_abc something format_def something format_ghi';
var myRegEx = /(?:^|\s)format_(.*?)(?:\s|$)/g;
// Get an array containing the first capturing group for every match
var matches = getMatches(myString, myRegEx, 1);
// Log results
document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
console.log(matches);
var myString = "something format_abc";
var arr = myString.match(/\bformat_(.*?)\b/);
console.log(arr[0] + " " + arr[1]);
The \b isn't exactly the same thing. (It works on --format_foo/, but doesn't work on format_a_b) But I wanted to show an alternative to your expression, which is fine. Of course, the match call is the important thing.
Last but not least, I found one line of code that worked fine for me (JS ES6):
let reg = /#([\S]+)/igm; // Get hashtags.
let string = 'mi alegría es total! ✌🙌\n#fiestasdefindeaño #PadreHijo #buenosmomentos #france #paris';
let matches = (string.match(reg) || []).map(e => e.replace(reg, '$1'));
console.log(matches);
This will return:
['fiestasdefindeaño', 'PadreHijo', 'buenosmomentos', 'france', 'paris']
In regards to the multi-match parentheses examples above, I was looking for an answer here after not getting what I wanted from:
var matches = mystring.match(/(?:neededToMatchButNotWantedInResult)(matchWanted)/igm);
After looking at the slightly convoluted function calls with while and .push() above, it dawned on me that the problem can be solved very elegantly with mystring.replace() instead (the replacing is NOT the point, and isn't even done, the CLEAN, built-in recursive function call option for the second parameter is!):
var yourstring = 'something format_abc something format_def something format_ghi';
var matches = [];
yourstring.replace(/format_([^\s]+)/igm, function(m, p1){ matches.push(p1); } );
After this, I don't think I'm ever going to use .match() for hardly anything ever again.
String#matchAll (see the Stage 3 Draft / December 7, 2018 proposal) simplifies access to all groups in the match object (mind that Group 0 is the whole match, while further groups correspond to the capturing groups in the pattern):
With matchAll available, you can avoid the while loop and exec with /g... Instead, by using matchAll, you get back an iterator which you can use with the more convenient for...of, array spread, or Array.from() constructs
This method yields a similar output to Regex.Matches in C#, re.finditer in Python, preg_match_all in PHP.
See a JS demo (tested in Google Chrome 73.0.3683.67 (official build), beta (64-bit)):
var myString = "key1:value1, key2-value2!!#key3=value3";
var matches = myString.matchAll(/(\w+)[:=-](\w+)/g);
console.log([...matches]); // All match with capturing group values
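The console.log([...matches]) call shows an array of three match arrays, each holding the full match followed by its two captured groups (for example ["key1:value1", "key1", "value1"]).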
You may also get match value or specific group values using
let matchData = "key1:value1, key2-value2!!#key3=value3".matchAll(/(\w+)[:=-](\w+)/g)
var matches = [...matchData]; // Note matchAll result is not re-iterable
console.log(Array.from(matches, m => m[0])); // All match (Group 0) values
// => [ "key1:value1", "key2-value2", "key3=value3" ]
console.log(Array.from(matches, m => m[1])); // All match (Group 1) values
// => [ "key1", "key2", "key3" ]
NOTE: See the browser compatibility details.
Terminology used in this answer:
Match indicates the result of running your RegEx pattern against your string like so: someString.match(regexPattern).
Matched patterns indicate all matched portions of the input string, which all reside inside the match array. These are all instances of your pattern inside the input string.
Matched groups indicate all groups to catch, defined in the RegEx pattern. (The patterns inside parentheses, like so: /format_(.*?)/g, where (.*?) would be a matched group.) These reside within matched patterns.
Description
To get access to the matched groups, in each of the matched patterns, you need a function or something similar to iterate over the match. There are a number of ways you can do this, as many of the other answers show. Most other answers use a while loop to iterate over all matched patterns, but I think we all know the potential dangers with that approach. It is necessary to match against a new RegExp() instead of just the pattern itself, which only got mentioned in a comment. This is because the .exec() method behaves similar to a generator function – it stops every time there is a match, but keeps its .lastIndex to continue from there on the next .exec() call.
Code examples
Below is an example of a function searchString which returns an Array of all matched patterns, where each match is an Array with all the containing matched groups. Instead of using a while loop, I have provided examples using both the Array.prototype.map() function as well as a more performant way – using a plain for-loop.
Concise versions (less code, more syntactic sugar)
These are less performant since they basically implement a forEach-loop instead of the faster for-loop.
// Concise ES6/ES2015 syntax
const searchString =
(string, pattern) =>
string
.match(new RegExp(pattern.source, pattern.flags))
.map(match =>
new RegExp(pattern.source, pattern.flags)
.exec(match));
// Or, if you prefer, with an ES5-style function expression instead of an arrow function
function searchString(string, pattern) {
return string
.match(new RegExp(pattern.source, pattern.flags))
.map(function (match) {
return new RegExp(pattern.source, pattern.flags)
.exec(match);
});
}
let string = "something format_abc",
pattern = /(?:^|\s)format_(.*?)(?:\s|$)/;
let result = searchString(string, pattern);
// [[" format_abc", "abc"], null]
// The trailing `null` disappears if you add the `global` flag
Performant versions (more code, less syntactic sugar)
// Performant ES6/ES2015 syntax
const searchString = (string, pattern) => {
let result = [];
const matches = string.match(new RegExp(pattern.source, pattern.flags));
for (let i = 0; i < matches.length; i++) {
result.push(new RegExp(pattern.source, pattern.flags).exec(matches[i]));
}
return result;
};
// Same thing, but with ES5 syntax
function searchString(string, pattern) {
var result = [];
var matches = string.match(new RegExp(pattern.source, pattern.flags));
for (var i = 0; i < matches.length; i++) {
result.push(new RegExp(pattern.source, pattern.flags).exec(matches[i]));
}
return result;
}
let string = "something format_abc",
pattern = /(?:^|\s)format_(.*?)(?:\s|$)/;
let result = searchString(string, pattern);
// [[" format_abc", "abc"], null]
// The trailing `null` disappears if you add the `global` flag
I have yet to compare these alternatives to the ones previously mentioned in the other answers, but I doubt this approach is less performant and less fail-safe than the others.
Your syntax probably isn't the best to keep. FF/Gecko defines RegExp as an extension of Function.
(FF2 went as far as typeof(/pattern/) == 'function')
It seems this is specific to FF -- IE, Opera, and Chrome all throw exceptions for it.
Instead, use either method previously mentioned by others: RegExp#exec or String#match.
They offer the same results:
var regex = /(?:^|\s)format_(.*?)(?:\s|$)/;
var input = "something format_abc";
regex(input); //=> [" format_abc", "abc"]
regex.exec(input); //=> [" format_abc", "abc"]
input.match(regex); //=> [" format_abc", "abc"]
There is no need to invoke the exec method! You can use "match" method directly on the string. Just don't forget the parentheses.
var str = "This is cool";
var matches = str.match(/(This is)( cool)$/);
console.log( JSON.stringify(matches) ); // will print ["This is cool","This is"," cool"] or something like that...
Position 0 has a string with all the results. Position 1 has the first match represented by parentheses, and position 2 has the second match isolated in your parentheses. Nested parentheses are tricky, so beware!
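Since nested parentheses are the tricky part, here is a small sketch of how the positions are assigned (the sample string and pattern are made up for illustration). Groups are numbered by the position of their opening parenthesis, counting from left to right:

```javascript
// Group numbers follow the order of the opening parentheses:
//   group 1: ((\d+)-(\d+))  the outer pair
//   group 2: (\d+)          first inner pair
//   group 3: (\d+)          second inner pair
const m = "released 2024-05 today".match(/((\d+)-(\d+))/);

console.log(m[0]); // whole match
console.log(m[1]); // outer group, same text as the whole match here
console.log(m[2]); // "2024"
console.log(m[3]); // "05"
```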
With ES2018 you can now use String.prototype.match() with named groups, which makes your regex more explicit about what it is trying to do.
const url =
'https://stackoverflow.com/questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression?some=parameter';
const regex = /(?<protocol>https?):\/\/(?<hostname>[\w-\.]*)\/(?<pathname>[\w-\./]+)\??(?<querystring>.*?)?$/;
const { groups: segments } = url.match(regex);
console.log(segments);
and you'll get something like
{protocol: "https", hostname: "stackoverflow.com", pathname: "questions/432493/how-do-you-access-the-matched-groups-in-a-javascript-regular-expression", querystring: "some=parameter"}
A one-liner that is practical only if you have a single pair of parentheses:
while ( ( match = myRegex.exec( myStr ) ) && matches.push( match[1] ) ) {};
Using your code:
console.log(arr[1]); // prints: abc
console.log(arr[0]); // prints: format_abc
Edit: Safari 3, if it matters.
function getMatches(string, regex, index) {
index || (index = 1); // default to the first capturing group
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match[index]);
}
return matches;
}
// Example :
var myString = 'Rs.200 is Debited to A/c ...2031 on 02-12-14 20:05:49 (Clear Bal Rs.66248.77) AT ATM. TollFree 1800223344 18001024455 (6am-10pm)';
var myRegEx = /clear bal.+?(\d+\.?\d{2})/gi;
// Get an array containing the first capturing group for every match
var matches = getMatches(myString, myRegEx, 1);
// Log results
document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
console.log(matches);
Your code works for me (FF3 on Mac) even if I agree with PhiLo that the regex should probably be:
/\bformat_(.*?)\b/
(But, of course, I'm not sure because I don't know the context of the regex.)
As @cms said, in ECMAScript (ECMA-262) you can use matchAll. It returns an iterator, and spreading it with [...] converts it to an array. (This regex extracts the URLs of file names.)
let text = `File1 File2`;
let fileUrls = [...text.matchAll(/href="(http\:\/\/[^"]+\.\w{3})\"/g)].map(r => r[1]);
console.log(fileUrls);
/*Regex function for extracting object from "window.location.search" string.
*/
var search = "?a=3&b=4&c=7"; // Example search string
var getSearchObj = function (searchString) {
var match, key, value, obj = {};
var pattern = /(\w+)=(\w+)/g;
var search = searchString.substr(1); // Remove '?'
while (match = pattern.exec(search)) {
obj[match[0].split('=')[0]] = match[0].split('=')[1];
}
return obj;
};
console.log(getSearchObj(search));
You don't really need an explicit loop to parse multiple matches — pass a replacement function as the second argument as described in: String.prototype.replace(regex, func):
var str = "Our chief weapon is {1}, {0} and {2}!";
var params= ['surprise', 'fear', 'ruthless efficiency'];
var patt = /{([^}]+)}/g;
str=str.replace(patt, function(m0, m1, position){return params[parseInt(m1)];});
document.write(str);
The m0 argument represents the full matched substring {0}, {1}, etc. m1 represents the first matching group, i.e. the part enclosed in brackets in the regex which is 0 for the first match. And position is the starting index within the string where the matching group was found — unused in this case.
We can access a matched group inside a regular expression by using a backslash followed by the number of the group:
/([a-z])\1/
In this pattern, \1 matches the same text that was matched by the first group, ([a-z]).
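A quick sketch of what that backreference does in practice (the test words are just examples): the pattern only matches where the same lowercase letter appears twice in a row.

```javascript
const doubled = /([a-z])\1/;

const hasDoubleInHello = doubled.test("hello"); // "ll" is a repeated letter
const hasDoubleInWorld = doubled.test("world"); // no letter repeats back to back
const match = "hello".match(doubled);           // match[0] is the pair, match[1] the letter

console.log(hasDoubleInHello, hasDoubleInWorld, match[0], match[1]);
```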
If you are like me and wish regex would return an object like this:
{
match: '...',
matchAtIndex: 0,
capturedGroups: [ '...', '...' ]
}
then snip the function from below
/**
* @param {string | number} input
* The input string to match
* @param {regex | string} expression
* Regular expression
* @param {string} flags
* Optional Flags
*
* @returns {array}
* [{
match: '...',
matchAtIndex: 0,
capturedGroups: [ '...', '...' ]
}]
*/
function regexMatch(input, expression, flags = "g") {
let regex = expression instanceof RegExp ? expression : new RegExp(expression, flags)
let matches = input.matchAll(regex)
matches = [...matches]
return matches.map(item => {
return {
match: item[0],
matchAtIndex: item.index,
capturedGroups: item.length > 1 ? item.slice(1) : undefined
}
})
}
let input = "key1:value1, key2:value2 "
let regex = /(\w+):(\w+)/g
let matches = regexMatch(input, regex)
console.log(matches)
One line solution:
const matches = (text,regex) => [...text.matchAll(regex)].map(([match])=>match)
So you can use this way (must use /g):
matches("something format_abc", /(?:^|\s)format_(.*?)(?:\s|$)/g)
result:
[" format_abc"]
Just use RegExp.$1 ... RegExp.$n to read the nth group of the most recent match.
For example:
RegExp.$1 holds the 1st group.
RegExp.$2 holds the 2nd group.
If your regex has 3 groups, read them like this (only after calling string.match(regex)):
RegExp.$1 RegExp.$2 RegExp.$3
var str = "The rain in ${india} stays safe";
var res = str.match(/\${(.*?)\}/ig);
//i used only one group in above example so RegExp.$1
console.log(RegExp.$1)
//easiest way is use RegExp.$1 1st group in regex and 2nd grounp like
//RegExp.$2 if exist use after match
var regex=/\${(.*?)\}/ig;
var str = "The rain in ${SPAIN} stays ${mainly} in the plain";
var res = str.match(regex);
for (const match of res) {
var res = match.match(regex);
console.log(match);
console.log(RegExp.$1)
}
Get all group occurrence
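let m=[], s = "something format_abc format_def format_ghi";
// (?=\s|$) is a lookahead, so the separating space is not consumed and adjacent matches are not skipped
s.replace(/(?:^|\s)format_(.*?)(?=\s|$)/g, (x,y)=> m.push(y));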
console.log(m);
I thought you just want to grab all the words containing the abc substring and store the matched group/entries, so I made this script:
s = 'something format_abc another word abc abc_somestring'
console.log(s.match(/\b\w*abc\w*\b/igm));
\b - a word boundary
\w* - 0+ word chars
abc - your exact match
\w* - 0+ word chars
\b - a word boundary
References: Regex: Match all the words that contains some word
https://javascript.info/regexp-introduction

Regex help need to match an ampersand OR and end of string

I'm trying to create a regex to match part of a URL
The possible URLs might be
www.mysite.com?userid=123xy
www.mysite.com?userid=123x&username=joe
www.mysite.com?tag=xyz&userid=1ww45
www.mysite.com?tag=xyz&userid=1g3x5&username=joe
I'm trying to match the userid=123456
So far I have
Dim r As New Regex("[&?]userID.*[?&]")
Debug.WriteLine(r.Match(strUrl))
But this is only matching lines 2 and 4.
Can anyone help?
(?<=[?&]userid=)[^&#\s]*
Output:
123xy
123x
1ww45
1g3x5
A few points:
This works both if you are matching one URL at a time and if you have a whitespace-separated set.
This matches the userid value only. It uses a positive look-behind assertion, which checks for the "userid=" prefix without including it in the match, since you only care about the value.
The fragment part, if present, will be ignored (e.g. if the URL looked like this: www.mysite.com?tag=xyz&userid=1ww45#top)
If the case of userid doesn't matter, use RegexOptions.IgnoreCase.
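For readers who are not on .NET, here is a rough JavaScript sketch of the same pattern run over the URLs from the question, plus the fragment example. It assumes an engine with look-behind support (ES2018 or later); the i flag stands in for RegexOptions.IgnoreCase, and the helper name extractUserId is just for illustration:

```javascript
// Same pattern as above; "i" makes the "userid" key case-insensitive.
const userIdPattern = /(?<=[?&]userid=)[^&#\s]*/i;

// Hypothetical helper: returns the userid value, or null if there is none.
function extractUserId(url) {
  const m = url.match(userIdPattern);
  return m ? m[0] : null;
}

const urls = [
  "www.mysite.com?userid=123xy",
  "www.mysite.com?userid=123x&username=joe",
  "www.mysite.com?tag=xyz&userid=1ww45",
  "www.mysite.com?tag=xyz&userid=1g3x5&username=joe",
  "www.mysite.com?tag=xyz&userid=1ww45#top", // fragment is ignored
];

const ids = urls.map(extractUserId);
console.log(ids);
```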
I got it:
[&?]userID=[^\s&#]+
PHP solution:
"/[\\?&]userid=([^&]*)/"
Tests:
$tests = [
[
"regex" => "/[\\?&]userid=([^&]*)/",
"expected" => "123xy",
"inputs" => [
"www.mysite.com?userid=123xy",
"www.mysite.com?userid=123xy&username=joe",
"www.mysite.com?tag=xyz&userid=123xy",
"www.mysite.com?tag=xyz&userid=123xy&username=joe"
]
]
];
foreach ($tests as $test) {
$regex = $test['regex'];
$expected = $test['expected'];
foreach ($test['inputs'] as $input) {
if (!preg_match($regex, $input, $match)) {
throw new Exception("Regex '{$regex}' doesn't match for input '{$input}' or error has occurred.");
}
$matched = $match[1];
if ($matched !== $expected) {
throw new Exception("Found '{$matched}' instead of '{$expected}'.");
}
echo "Matched '{$matched}' in '{$input}'." . PHP_EOL;
}
}
Results:
Matched '123xy' in 'www.mysite.com?userid=123xy'.
Matched '123xy' in 'www.mysite.com?userid=123xy&username=joe'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy'.
Matched '123xy' in 'www.mysite.com?tag=xyz&userid=123xy&username=joe'.
You can use the regex: .*?(userid=\d+).*
.*? - is a non-greedy way to express: everything that comes before (userid=\d+)
Python example:
import re
a = 'www.mysite.com?userid=12345'
b = 'www.mysite.com?userid=12345&username=joe'
mat = re.match(r'.*?(userid=\d+).*', a)
print(mat.group(1)) # prints userid=12345
mat = re.match(r'.*?(userid=\d+).*', b)
print(mat.group(1)) # prints userid=12345

Regular expression to match balanced parentheses

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
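As a rough illustration of the fixed-depth approach in JavaScript, the sketch below runs the first (non-unrolled) pattern against the string from the question. It also shows what happens when the input nests deeper than the pattern allows: the regex silently settles for an inner group, so the limit is worth keeping in mind. The second test string is made up for this example.

```javascript
// Outer pair plus up to 3 nested levels, no recursion needed.
const nested3 = /\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)/;

const input = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT";
const result = input.match(nested3)[0];
console.log(result);

// Five levels deep: the outermost "(" cannot be matched, so the first match
// starts one level further in.
const tooDeep = "(((((x)))))";
const partial = tooDeep.match(nested3)[0];
console.log(partial);
```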
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
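A minimal sketch of that counter idea in JavaScript (the function name firstBalancedGroup is mine, not from the linked answer): it returns the text from the first opening parenthesis up to the closing parenthesis that brings the counter back to zero.

```javascript
// Hypothetical helper: scan once, counting unmatched "(".
function firstBalancedGroup(text) {
  let depth = 0;   // open parentheses not yet closed
  let start = -1;  // index of the outermost "("
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "(") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === ")" && depth > 0) {
      depth--;
      // Counter is back to zero: this is the matching close.
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // no complete balanced group found
}

const sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT";
console.log(firstBalancedGroup(sample));
```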
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states, where S1 is both the start state and the final state. So if we try with the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can be used to solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' it once we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they do not.
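The push/pop idea can be sketched in a few lines of JavaScript. This is only an illustration of the stack technique, extended (as an assumption on my part) to square brackets and braces as well; the function name isBalanced is hypothetical.

```javascript
// Hypothetical helper: true if every opener has a matching closer, in order.
function isBalanced(text) {
  const closerFor = new Map([["(", ")"], ["[", "]"], ["{", "}"]]);
  const closers = new Set(closerFor.values());
  const stack = []; // closers we still expect, innermost last

  for (const ch of text) {
    if (closerFor.has(ch)) {
      stack.push(closerFor.get(ch));        // 'push' on an opener
    } else if (closers.has(ch)) {
      if (stack.pop() !== ch) return false; // 'pop' must yield this closer
    }
  }
  return stack.length === 0; // empty stack at the end means balanced
}

console.log(isBalanced("(a[b]{c})"), isBalanced("(a[b)]"), isBalanced("((a)"));
```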
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see @dehmann
If it's just first open to last close see @Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
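To see how the two choices differ on that input, here is a small JavaScript sketch. The helper name scanGroups and its return shape are assumptions made for this illustration; it reports the top-level balanced groups and the positions of any closing parentheses that have no opener.

```javascript
const input = "abc ( 123 ( foobar ) def ) xyz ) ghij";

// Option 1: first open to last close, ignoring balance entirely.
const firstToLast = input.slice(input.indexOf("("), input.lastIndexOf(")") + 1);

// Option 2: hypothetical helper that tracks nesting depth.
function scanGroups(text) {
  const groups = [];
  const strayClosers = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < text.length; i++) {
    if (text[i] === "(") {
      if (depth === 0) start = i;
      depth++;
    } else if (text[i] === ")") {
      if (depth === 0) {
        strayClosers.push(i); // closer with nothing to close
      } else if (--depth === 0) {
        groups.push(text.slice(start, i + 1));
      }
    }
  }
  return { groups, strayClosers };
}

const scanned = scanGroups(input);
console.log(firstToLast);
console.log(scanned.groups, scanned.strayClosers);
```

The first option swallows the unmatched parenthesis into the result, while the second stops at the matching close and lets you decide separately what to do about the stray one.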
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile(r"""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need the string between those two positions:
String searchedString = str.substring(str.indexOf('(') + 1, str.lastIndexOf(')'));
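For comparison, here is a minimal Python sketch of the same first/last-index idea. The function name between_outer_parens is made up for illustration, and note the assumption baked into the approach: it only finds the outermost pair and never checks that what lies in between is balanced.

```python
def between_outer_parens(text):
    """Return the text between the first "(" and the last ")", or None."""
    first = text.find("(")    # index of the first opening parenthesis, -1 if absent
    last = text.rfind(")")    # index of the last closing parenthesis, -1 if absent
    if first == -1 or last < first:
        return None           # no usable pair of parentheses
    return text[first + 1:last]


print(between_outer_parens("call(a(b)c) tail"))  # a(b)c
print(between_outer_parens("f(a) + g(b)"))       # a) + g(b
```

The second call shows the limitation: with two separate groups the result spans both of them, so this only helps when the input contains a single outer group.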
Because JavaScript regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with a regex.
So here is a simple JavaScript for-loop version that turns a string of "method(arg)" calls into an array. Example inputs:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []
  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue
    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        method = (method || '') + char
      } else {
        arg = (arg || '') + char
      }
    }
    if (char === '(') {
      // nested parenthesis should be a part of arg
      // (arg may still be undefined here, e.g. for "f((x))")
      if (!isMethod) arg = (arg || '') + char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg = (arg || '') + char
      }
    }
  }
  return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
The result looks like this (one array per example input):
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
Many answers mention in some form that regex does not support recursive matching, but the underlying reason comes from the theory of computation.
The language {a^n b^n | n >= 0} is not regular, and balanced parentheses have the same shape: every opening symbol must be paired with a closing one, to any depth. A regular expression in the formal sense can only match regular languages.
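To make the point concrete: checking balance needs a counter that can grow without bound, which is exactly what a finite automaton lacks. A minimal Python sketch (is_balanced and its parameters are names I made up for illustration):

```python
def is_balanced(text, open_ch="(", close_ch=")"):
    """Return True if every open_ch in text is closed by a later close_ch."""
    depth = 0                     # number of currently open, unclosed symbols
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:         # a closer appeared before its opener
                return False
    return depth == 0             # nothing may be left open at the end


print(is_balanced("one(two(three)four)five"))                # True
print(is_balanced("abc ( 123 ( foobar ) def ) xyz ) ghij"))  # False
```

This only answers yes or no for one kind of bracket; handling several bracket types at once would need a stack instead of a single counter.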
I didn't use a regex because nested code is difficult to handle that way. This snippet lets you locate sections of code with balanced curly braces:
def extract_code(data):
    """Return a list of (start, end) index pairs, one per balanced {...} block in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # compare with None: index 0 is a valid start position
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            # data[start_pos:end_pos] is the snippet itself
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it might be useful to people who come here searching for a regexp that handles nested structures:
Parse parameters from a function string (with nested structures) in JavaScript
It handles parameters that contain braces, square brackets, parentheses, and single- or double-quoted strings, up to the nesting depth written into the pattern.
/**
 * get param content of function string.
 * only params string should be provided without parentheses
 * WORK even if some/all params are not set
 * @return [param1, param2, param3]
 */
exports.getParamsSAFE = (str, nbParams = 3) => {
  const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
  const params = [];
  while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
    str = str.replace(nextParamReg, (full, p1) => {
      params.push(p1);
      return '';
    });
  }
  return params;
};
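This might help to match a call with parentheses, such as "name(...)":
\s*\w+[(][^+]*[)]\s*
Note that [^+]* accepts any character other than "+", so the pattern does not itself check that the parentheses are balanced.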
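This one also worked for me. It is greedy, so on a single line it captures everything from the first "(" to the last ")":
re.findall(r'\(.+\)', s)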
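To show what the greedy pattern does when a line holds more than one group, here is a small Python comparison against a plain depth-counting scan. This is only a sketch; top_level_groups is a hypothetical helper name, and it silently ignores stray closers and unclosed openers.

```python
import re


def top_level_groups(text):
    """Return every balanced top-level "(...)" group found in text."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i             # remember where the outer group begins
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:            # the outer group just closed
                groups.append(text[start:i + 1])
    return groups


sample = "f(a(b)) + g(c)"
greedy = re.findall(r"\(.+\)", sample)
print(greedy)                    # ['(a(b)) + g(c)']
print(top_level_groups(sample))  # ['(a(b))', '(c)']
```

The regex returns one span running from the first "(" to the last ")", while the scan returns each balanced group separately.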