Match everything between "****" or [****] - regex

I have a regex that looks like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1  Group2
test    value0
test    "value1", "value2"    // or: value1", "value2
Is there any trick to ignore "" and [] and stick with two groups, group1 and group2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.

You may use this regex in PHP (PCRE), which relies on a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This gives you exactly 2 capture groups for both inputs, with the enclosing "..." or [...] left out of Group 2.
RegEx Details:
(?|..) is a branch reset group. Capture groups declared within each alternative of this construct start over from the same index
(?|"([^"]*)"|\[([^]]*)]) is an ordinary alternation inside a branch reset group (not an if-then-else conditional): the engine tries "([^"]*)" first and falls back to \[([^]]*)], and whichever alternative matches fills Group 2
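Branch reset is not available in every engine; java.util.regex, for instance, rejects (?|...). As a rough sketch of one way to still end up with only two groups there, the opening delimiter can be consumed outside the capture and a lookbehind can pick the right character class. The class name TwoGroupSketch and the helper extract are made up for illustration, and, like the \K variant, this pattern does not check the closing delimiter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupSketch {
    // Group 1: the key. Group 2: the value without its opening delimiter.
    // (?:"|\[) consumes the opener outside any capture group; the lookbehinds
    // then select the character class that stops at the matching closer.
    static final Pattern P = Pattern.compile(
        "\"(test)\"\\s*:\\s*(?:\"|\\[)((?<=\")[^\"]*|(?<=\\[)[^\\]]*)");

    // Returns {key, value}, or null when the input does not match.
    static String[] extract(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] inputs = { "\"test\":\"value0\"", "\"test\":[\"value1\", \"value2\"]" };
        for (String s : inputs) {
            String[] kv = extract(s);
            System.out.println(kv[0] + " -> " + kv[1]);
        }
    }
}
```

For the two sample inputs this prints test -> value0 and test -> "value1", "value2".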

You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
In Java, you can't use \K and ?|, use capturing groups:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "\"test\":[\"value1\", \"value2\"]";
        // Group 2 is set for a "..." value, Group 3 for a [...] value
        Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            System.out.println("Key: " + matcher.group(1));
            if (matcher.group(2) != null) {
                System.out.println("Value: " + matcher.group(2));
            } else {
                System.out.println("Value: " + matcher.group(3));
            }
        }
    }
}

Related

Pattern match for (length)%code with before length

I have a pattern like x%c, where x is a single-digit integer and c is an alphanumeric code of length x. The % is just a separator between the length and the code.
For instance, 2%74 is valid since 74 has 2 characters. Similarly, 1%8 and 4%3232 are also valid.
I have tried a regex of the form ^([0-9])(%)([A-Z0-9]){\1}, trying to limit the length by the value of group 1. It does not work, apparently because the group is treated as a string, not a number.
If I change the regex to ^([0-9])(%)([A-Z0-9]){2}, it works for 2%74, but that is of no use since the length has to be controlled by the first group, not by a fixed number.
If this is not possible with regex, is there a better approach in Java?
One way could be to use 2 capture groups, convert the first group to an int, and compare it with the number of characters in the second group.
\b(\d+)%(\d+)\b
\b Word boundary
(\d+) Capture group 1, match 1+ digits
% Match literally
(\d+) Capture group 2, match 1+ digits
\b Word boundary
For example
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String regex = "\\b(\\d+)%(\\d+)\\b";
        Pattern pattern = Pattern.compile(regex);
        String[] strings = { "2%74", "1%8", "4%3232", "5%123456", "6%0" };
        for (String s : strings) {
            Matcher matcher = pattern.matcher(s);
            if (matcher.find()) {
                // Compare the declared length (group 1) with the actual length of group 2
                if (Integer.parseInt(matcher.group(1)) == matcher.group(2).length()) {
                    System.out.println("Match for " + s);
                } else {
                    System.out.println("No match for " + s);
                }
            }
        }
    }
}
Output
Match for 2%74
Match for 1%8
Match for 4%3232
No match for 5%123456
No match for 6%0
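The question describes a single-digit length and an alphanumeric code, while the pattern above accepts digits only. A minimal sketch of the same two-group idea, adapted to that description, might look like this. The class name LengthPrefixedCode and the method isValid are hypothetical, and the character class [A-Z0-9] is taken from the asker's own attempt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthPrefixedCode {
    // Group 1: a single digit giving the length. Group 2: the code itself.
    private static final Pattern P = Pattern.compile("([0-9])%([A-Z0-9]+)");

    // True when the whole string has the form x%c and c is exactly x characters long.
    static boolean isValid(String s) {
        Matcher m = P.matcher(s);
        return m.matches() && Integer.parseInt(m.group(1)) == m.group(2).length();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "2%74", "3%A1B", "5%123456", "6%0" }) {
            System.out.println(s + " valid: " + isValid(s));
        }
    }
}
```

Using matches() instead of find() anchors the check to the whole input, so trailing characters cannot slip through.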

regex match longest substring with equal first and last char

/(\w)(\w*)\1/
For this string:"mgntdygtxrvxjnwksqhxuxtrv" I match "txrvxjnwksqhxuxt" (using Ruby), but not the even longer valid substring "tdygtxrvxjnwksqhxuxt".
For a given string, here are two ways to find the longest substring that begins and ends with the same character.
Suppose
str = "mgntdygtxrvxjnwksqhxuxtrv"
Use a regular expression
r = /(.)(?=(.*\1))/
str.gsub(r).map { $1 + $2 }.max_by(&:length)
#=> "tdygtxrvxjnwksqhxuxt"
When, as here, the regular expression contains capture groups, it may be more convenient to use String#gsub without a second argument or block (in which case it returns an enumerator, which can be chained) than String#scan ("If the pattern contains groups, each individual result is itself an array containing one entry per group."). Here gsub performs no substitutions; it merely generates matches of the regular expression.
The regular expression can be made self-documenting by writing it in free-spacing mode.
r = /
    (.)        # match any char and save to capture group 1
    (?=        # begin a positive lookahead
      (.*\1)   # match >= 0 characters followed by the contents of capture group 1
    )          # end the positive lookahead
    /x         # free-spacing regex definition mode
The following intermediate calculation is performed:
str.gsub(r).map { $1 + $2 }
#=> ["gntdyg", "ntdygtxrvxjn", "tdygtxrvxjnwksqhxuxt", "txrvxjnwksqhxuxt",
# "xrvxjnwksqhxux", "rvxjnwksqhxuxtr", "vxjnwksqhxuxtrv", "xjnwksqhxux",
# "xux"]
Notice that this does not enumerate all substrings beginning and ending with the same character (because .* is greedy). It does not generate, for example, the substring "xrvx".
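For comparison, the same lookahead idea can be sketched with java.util.regex. This is an illustrative translation, not part of the Ruby answer; the class name LongestSameEndsRegex and the method longest are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestSameEndsRegex {
    // (.) consumes one char; the lookahead captures, without consuming,
    // everything up to the last later occurrence of that same char.
    private static final Pattern P = Pattern.compile("(.)(?=(.*\\1))");

    // Returns the longest candidate, or "" when no character repeats.
    static String longest(String str) {
        String best = "";
        Matcher m = P.matcher(str);
        while (m.find()) {
            String candidate = m.group(1) + m.group(2);
            if (candidate.length() > best.length()) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longest("mgntdygtxrvxjnwksqhxuxtrv"));
    }
}
```

Because each match consumes only one character, find() resumes at the next position, which mirrors how gsub enumerates the matches.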
Do not use a regular expression
v = str.each_char.with_index.with_object({}) do |(c,i),h|
  if h.key?(c)
    h[c][:size] = i - h[c][:start] + 1
  else
    h[c] = { start: i, size: 1 }
  end
end.max_by { |_,h| h[:size] }.last
str[v[:start], v[:size]]
  #=> "tdygtxrvxjnwksqhxuxt"
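The non-regex approach records where each character first appears and measures the span up to every later occurrence. A rough Java rendering of that idea, with the hypothetical names LongestSameEndsScan and longest, could be:

```java
import java.util.HashMap;
import java.util.Map;

class LongestSameEndsScan {
    // Single pass: remember the first index of each char and track the
    // widest span from that first index to the current position.
    static String longest(String str) {
        Map<Character, Integer> firstIndex = new HashMap<>();
        int bestStart = 0;
        int bestLen = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            firstIndex.putIfAbsent(c, i);
            int len = i - firstIndex.get(c) + 1;
            if (len > bestLen) {
                bestStart = firstIndex.get(c);
                bestLen = len;
            }
        }
        return str.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(longest("mgntdygtxrvxjnwksqhxuxtrv"));
    }
}
```

Unlike the Ruby snippet, this sketch returns an empty string for empty input instead of raising an error.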

Scala regex : capture between group

With the regex below I need "test" as output, but it returns the complete string that matches the regex. How can I capture the string between two groups?
val pattern = """\{outer.*\}""".r
println(pattern.findAllIn(s"try {outer.test}").matchData.map(step => step.group(0)).toList.mkString)
Input : "try {outer.test}"
expected Output : test
current output : {outer.test}
You may capture that part using:
val pattern = """\{outer\.([^{}]*)\}""".r.unanchored
val s = "try {outer.test}"
val result = s match {
case pattern(i) => i
case _ => ""
}
println(result)
The pattern matches
\{outer\. - a literal {outer. substring
([^{}]*) - Capturing group 1: zero or more (*) chars other than { and } (see [^{}] negated character class)
\} - a } char.
NOTE: if your regex must match the whole string, remove the .unanchored I added to also allow partial matches inside a string.
Or, you may change the pattern so that the first part is no longer a consuming pattern (it matches a string of fixed length, so a lookbehind is possible):
val pattern = """(?<=\{outer\.)[^{}]*""".r
val s = "try {outer.test}"
println(pattern.findFirstIn(s).getOrElse(""))
// => test
Here, (?<=\{outer\.), a positive lookbehind, matches {outer. but does not put it into the match value.
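Scala's Regex is built on java.util.regex, so the capture-group variant carries over almost unchanged. A small sketch in plain Java, where the class name OuterCapture and the method extract are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterCapture {
    // Group 1 holds whatever sits between "{outer." and the closing "}".
    private static final Pattern P = Pattern.compile("\\{outer\\.([^{}]*)\\}");

    // find() allows a partial match inside the string, like .unanchored in Scala.
    static String extract(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(extract("try {outer.test}"));
    }
}
```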

C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.
<identifier> = "\\b\\w+\\b".
As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.
When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
So the problem of this question arises - how to get the ID of the group?
Similar question here, but it does not provide the solution to my specific problem.
Exactly my problem here, but it's in JS, and I need a C/C++ solution.
So let's say I've got a regex, made up of capturing groups separated by an OR:
(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)
which matches whole numbers or alphabetic words.
My problem requires that the index of the capture group the regex submatch matched to could be known, e.g. when matching the string
foo bar 123
3 iterations will be done. The group indexes of the matches of every iteration would be 0 0 1, because the first two matches matched the first capturing group, and the last match matched the second capturing group.
I know that in standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).
I don't have much knowledge about boost::regex or PCRE regex library.
What is the best way to accomplish this task? Which is the library and method to use?
You may use std::sregex_iterator to get all matches. For each match, inspect the std::match_results object and find the group that participated in the match (only one will here, either the first or the second), which m[index].matched reports conveniently. Subtract 1 from that group's index to get a zero-based ID:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
    std::string s = "foo bar 123";
    for (std::sregex_iterator i(s.begin(), s.end(), r); i != std::sregex_iterator(); ++i) {
        std::smatch m = *i;
        std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
        // Group 0 is the whole match, so start at 1 and look for the group that matched
        for (std::size_t index = 1; index < m.size(); ++index) {
            if (m[index].matched) {
                std::cout << "Capture group ID: " << index - 1 << std::endl;
                break;
            }
        }
    }
}
Output:
Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1
Note that R"(...)" is a raw string literal, no need to double backslashes inside it.
Also, index is set to 1 at the start of the for loop because the 0th group is the whole match, but you want group IDs to be zero-based, that is why 1 is subtracted later.
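The same "find the participating group" check exists in other engines. Purely as a comparison, and not as a replacement for the C++ solution the question asks for, here is a sketch with java.util.regex, where Matcher.start(g) returns -1 for a group that did not take part in the match. The names GroupIndexSketch and groupIds are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    private static final Pattern P = Pattern.compile("(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)");

    // For every match, record the zero-based index of the group that matched.
    static List<Integer> groupIds(String input) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.start(g) != -1) { // -1 means this group did not participate
                    ids.add(g - 1);
                    break;
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(groupIds("foo bar 123"));
    }
}
```

For the input foo bar 123 the recorded IDs are 0, 0, 1, matching the output above.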

regex condition: if group = x then else

I want to regex the following
foo: "/bar/baz/index.html";
but " could also be ', and obviously the same character has to be used on both sides, so foo: "...' must not match.
This is what I already have
/templateUrl:[ ]*(['"])[a-z0-9äöü.-_\/\\]*['"][ ]*;/gi
                 ^capture group 1         ^-- here
Is it possible to do at here something like:
if capture group 1 == ' then search for '
else if capture group 1 == " then search for "
Simple: just refer to the first captured group.
/templateUrl: *(['"])[a-z0-9äöü.-_\/\\]*\1 *;/gi
Use a backreference to captured group 1, i.e. \1:
/templateUrl:[ ]*(['"])[a-z0-9äöü.-_\/\\]*\1[ ]*;/gi
                   ↑                      ↑
          (captured group 1)       (back reference)
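To see the backreference reject mismatched quotes, here is a small sketch using java.util.regex. The class name MatchingQuotes and the method matches are hypothetical. One deliberate deviation, which is an assumption about the asker's intent: the hyphen is moved to the end of the character class, because .-_ in the original is read as a range from . to _ and not as three literal characters.

```java
import java.util.regex.Pattern;

class MatchingQuotes {
    // (['"]) captures the opening quote; \1 demands the same character again.
    // \u00e4\u00f6\u00fc are the a-umlaut, o-umlaut and u-umlaut letters.
    private static final Pattern P = Pattern.compile(
        "templateUrl: *(['\"])[a-z0-9\u00e4\u00f6\u00fc._/\\\\-]*\\1 *;",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean matches(String line) {
        return P.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("templateUrl: \"/bar/baz/index.html\";"));
        System.out.println(matches("templateUrl: '/bar/baz/index.html';"));
        System.out.println(matches("templateUrl: \"/bar/baz/index.html';"));
    }
}
```

The first two lines match because the closing quote equals the captured opening quote; the third does not.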