regex condition: if group = x then else

I want to match the following with a regex:
foo: "/bar/baz/index.html";
The " could also be a ', but the same character has to be used both times, so foo: "...' must not match.
This is what I already have:
/templateUrl:[ ]*(['"])[a-z0-9äöü.-_\/\\]*['"][ ]*;/gi
Here (['"]) is capture group 1, and the second ['"] is the spot I am asking about.
Is it possible to do something like this at that spot:
if capture group 1 == ' then search for '
else if capture group 1 == " then search for "

Simple: just refer back to the first captured group with \1.
/templateUrl: *(['"])[a-z0-9äöü.-_\/\\]*\1 *;/gi

Use a backreference to capture group 1, i.e. \1:
/templateUrl:[ ]*(['"])[a-z0-9äöü.-_\/\\]*\1[ ]*;/gi
Here (['"]) is capture group 1 and \1 is the backreference to it.
See DEMO
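
For anyone who wants to try the backreference outside a JavaScript regex literal, here is a minimal sketch in Python. The name find_template_url is made up for illustration, and the hyphen is moved to the end of the character class on the assumption that a literal - was intended (in the patterns above, .-_ forms a range from . to _).

```python
import re

# (['"]) captures the opening quote; \1 requires the same character to close.
# Assumption: the hyphen is meant literally, so it sits at the end of the class.
TEMPLATE_URL = re.compile(
    r"""templateUrl:[ ]*(['"])([a-z0-9äöü._/\\-]*)\1[ ]*;""",
    re.IGNORECASE,
)


def find_template_url(text):
    """Return the quoted path, or None if the quotes do not pair up."""
    m = TEMPLATE_URL.search(text)
    return m.group(2) if m else None


print(find_template_url('templateUrl: "/bar/baz/index.html";'))   # /bar/baz/index.html
print(find_template_url("templateUrl: '/bar/baz/index.html';"))   # /bar/baz/index.html
print(find_template_url("templateUrl: \"/bar/baz/index.html';"))  # None
```

The third call returns None because the opening " is never closed by the same character.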

Related

Match every thing between "" or []

I have a regex that looks like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1 Group2
test value0
test "value1", "value2" // or - value1", "value2
Is there any trick to ignore the "" and [] and stick with two groups, Group 1 and Group 2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.

You may use this regex in PHP with a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This will give you 2 capture groups for both inputs, without the enclosing "..." or [...].
RegEx Details:
(?|..) is a branch reset group. Capture groups declared within each alternative of this construct start numbering from the same index.
(?|"([^"]*)"|\[([^]]*)]) is an alternation inside that branch reset group: it tries "([^"]*)" first and \[([^]]*)] otherwise, and whichever alternative matches, the value is captured into Group 2.

You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
See the regex demo.
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
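Java supports neither \K nor (?|...), so use separate capturing groups and pick whichever one participated in the match:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "\"test\":[\"value1\", \"value2\"]";
        // Group 2 holds a "..." value, Group 3 holds a [...] value; only one of them is set.
        Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            System.out.println("Key: " + matcher.group(1));
            if (matcher.group(2) != null) {
                System.out.println("Value: " + matcher.group(2));
            } else {
                System.out.println("Value: " + matcher.group(3));
            }
        }
    }
}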
See a Java demo.
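
The same workaround applies to Python's built-in re module, which also lacks \K and branch reset groups. Below is a rough sketch of it; extract_pairs is a hypothetical helper name, and the closing " and ] are included in the pattern so that only complete values match.

```python
import re

# Group 2 captures a "..." value, group 3 captures a [...] value.
# Only one of them participates in any given match; the other is None.
PAIR = re.compile(r'"(test)"\s*:\s*(?:"([^"]*)"|\[([^\]]*)\])')


def extract_pairs(text):
    """Return (key, value) tuples, collapsing groups 2 and 3 into one value."""
    pairs = []
    for m in PAIR.finditer(text):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        pairs.append((m.group(1), value))
    return pairs


sample = '"test":"value0"\n"test":["value1", "value2"]'
print(extract_pairs(sample))
# [('test', 'value0'), ('test', '"value1", "value2"')]
```

Collapsing the two groups in code gives the same two-column result that the branch reset group produces directly in PCRE.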

C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.
<identifier> = "\\b\\w+\\b".
As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.
When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
So the problem of this question arises - how to get the ID of the group?
Similar question here, but it does not provide the solution to my specific problem.
Exactly my problem here, but it's in JS, and I need a C/C++ solution.
So let's say I've got a regex, made up of capturing groups separated by an OR:
(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)
which matches whole numbers or alphabetic words.
My problem requires that the index of the capture group the regex submatch matched to could be known, e.g. when matching the string
foo bar 123
3 iterations will be done. The group indexes of the matches of every iteration would be 0 0 1, because the first two matches matched the first capturing group, and the last match matched the second capturing group.
I know that in standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).
I don't have much knowledge about boost::regex or PCRE regex library.
What is the best way to accomplish this task? Which is the library and method to use?

You can use std::sregex_iterator to get all matches. For each match, inspect the std::match_results object and find the group that participated in the match (only one group will match here, either the first or the second), which is conveniently checked with m[index].matched. Subtract 1 from that group's index to get the zero-based ID:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
    std::string s = "foo bar 123";
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
        // Group 0 is the whole match, so start at 1; std::size_t matches the type of m.size()
        for (std::size_t index = 1; index < m.size(); ++index) {
            if (m[index].matched) {
                std::cout << "Capture group ID: " << index - 1 << std::endl;
                break;
            }
        }
    }
}
See the C++ demo. Output:
Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1
Note that R"(...)" is a raw string literal, no need to double backslashes inside it.
Also, index is set to 1 at the start of the for loop because the 0th group is the whole match, but you want group IDs to be zero-based, that is why 1 is subtracted later.
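
For comparison, the same "which group matched" check can be sketched in Python, where a group that did not participate shows up as None in m.groups(). The tokenize name and the tuple layout are illustrative choices, not part of the question.

```python
import re

# Same alternation as in the question: alphabetic words or whole numbers.
TOKEN = re.compile(r"(\b[a-zA-Z]+\b)|(\b\d+\b)")


def tokenize(text):
    """Return (value, position, zero-based group ID) for every match."""
    tokens = []
    for m in TOKEN.finditer(text):
        # m.groups() excludes group 0, so enumerate() is already zero-based.
        group_id = next(i for i, g in enumerate(m.groups()) if g is not None)
        tokens.append((m.group(0), m.start(), group_id))
    return tokens


print(tokenize("foo bar 123"))
# [('foo', 0, 0), ('bar', 4, 0), ('123', 8, 1)]
```

This only works as a token ID when every rule is a top-level group with no capturing groups nested inside it; nested groups would shift the indices.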

Extract groups separated by space

I've got the following string (example):
Loader[data-prop data-attr="value"]
There can be 1 to n attributes. I want to extract every attribute (data-prop, data-attr="value"). I tried many different approaches, for example \[(?:(\S+)\s)*\], but I didn't get it right. The expression should be written in PREG (PCRE) style.

I suggest grabbing all the key-value pairs with a regex:
'~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~'
and then merging the captured groups and filtering out the bare quote characters in PHP:
$re = '~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~';
$str = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";
preg_match_all($re, $str, $matches);
$arr = array();
// Skip $matches[0] (the full matches) and merge the non-empty captures of each group
for ($i = 1; $i < count($matches); $i++) {
    $arr = array_merge(array_filter($matches[$i]), $arr);
}
// Drop the elements that consist of a lone quote character (captured by Group 3)
print_r(preg_grep('~\A(?![\'"]\z)~', $arr));
Output:
Array
(
[3] => data-prop
[4] => data-attr="value"
[5] => more-here='data'
[6] => and-one-more=""
[7] => Loader
)
Notes on the regex (it only looks too complex):
(?:([^][]*)\b\[|(?!^)\G) - a boundary: we only start matching at a [ that is preceded with a word (a-zA-Z0-9_) character (with \b\[), or right after a successful match (with (?!^)\G). Also, ([^][]*) will capture into Group 1 the part before the [.
\s* - matches zero or more whitespace symbols
(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?) - Group 2, capturing the whole attribute, which consists of:
\w+(?:-\w+)* - "words" like "word1" or "word1-word2"..."word1-wordn"
(?:=(["\'])?[^\]]*?\3)? - optional group (due to (?:...)?) matching
= - an equal sign
(["\'])? - Group 3 (auxiliary group to check for the value delimiter) capturing either ", ' or nothing
[^\]]*? - (value) zero or more characters other than ] as few as possible
\3 - the closing ' or " (the same value captured in Group 3).
Since we cannot avoid capturing the ' or " in Group 3, we can filter out those unwanted elements afterwards with preg_grep('~\A(?![\'"]\z)~', $arr), where \A(?![\'"]\z) matches any string that is not equal to ' or ".
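
As a point of comparison, here is a two-step sketch of the same extraction in Python, whose built-in re module has no \G anchor. It swaps the single \G-based regex for "grab the bracket contents, then scan them"; the names extract_attributes, BRACKETS and ATTRIBUTE are made up, and the sketch assumes that quoted values never contain a ] character.

```python
import re

# Step 1: the name before the bracket and everything inside [...].
BRACKETS = re.compile(r"(\w+)\[([^\]]*)\]")
# Step 2: a hyphenated word, optionally followed by ="..." or ='...'.
# There are no capturing groups here, so findall() returns whole matches.
ATTRIBUTE = re.compile(r"""[\w-]+(?:=(?:"[^"]*"|'[^']*'))?""")


def extract_attributes(text):
    """Return (name, [attributes]) or None if there is no name[...] part."""
    m = BRACKETS.search(text)
    if m is None:
        return None
    return m.group(1), ATTRIBUTE.findall(m.group(2))


s = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]"
print(extract_attributes(s))
# ('Loader', ['data-prop', 'data-attr="value"', "more-here='data'", 'and-one-more=""'])
```

Because the quote alternatives are spelled out, no auxiliary quote group is captured and there is nothing to filter out afterwards.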

How about something like [\s\[]([^\s\]]+(="[^"]+)*)+
which gives
MATCH 1: data-prop
MATCH 2: data-attr="value"

Regex with Non-capturing Group

I am trying to understand Non-capturing groups in Regex.
If I have the following input:
He hit the ball. Then he ran. The crowd was cheering! How did he feel? I felt so energized!
If I want to extract the first word in each sentence, I was trying to use the match pattern:
^(\w+\b.*?)|[\.!\?]\s+(\w+)
That puts the desired output in the submatch.
Match $1
He He
. Then Then
. The The
! How How
? I I
But I was thinking that using non-capturing groups, I should be able to get them back in the match.
I tried:
^(?:\w+\b.*?)|(?:[\.!\?]\s+)(\w+)
and that yielded:
Match $1
He
. Then Then
. The The
! How How
? I I
and
^(?:\w+\b.*?)|(?:[.!\?]\s+)\w+
yielded:
Match
He
. Then
. The
! How
? I
What am I missing?
(I am testing my regex using RegExLib.com, but will then transfer it to VBA).

A simple example against string "foo":
(f)(o+)
Will yield $1 = 'f' and $2 = 'oo';
(?:f)(o+)
Here, $1 = 'oo' because you've explicitly said not to capture the first group, so there is no $2.
For your scenario, this feels about right:
(?:(\w+).*?[\.\?!] {2}?)
Note that the outermost group is a non-capturing group, while the inner group (the first word of the sentence) is capturing.
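
The numbering difference in the "foo" example is easy to check in any engine. A small Python sketch (captured_groups is just an illustrative helper name):

```python
import re


def captured_groups(pattern, text):
    """Return the tuple of captured groups for the first match, or None."""
    m = re.search(pattern, text)
    return m.groups() if m else None


print(captured_groups(r"(f)(o+)", "foo"))    # ('f', 'oo')
print(captured_groups(r"(?:f)(o+)", "foo"))  # ('oo',)
```

Both patterns match the same text; only the number of captured groups changes.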

The following constructs a non-capturing group for the boundary condition, and captures the word after it with a capturing group.
(?:^|[.?!]\s*)(\w+)
It's not clear from your question how you are applying the regex to the text, but your regular "pull out another until there are no more matches" loop should work.
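
A quick way to see that loop in action is the Python sketch below, run on the sentence from the question. findall() returns only the capturing group, so the non-capturing boundary is dropped from the results; first_words is a hypothetical name.

```python
import re

# (?:^|[.?!]\s*) is the boundary and is not captured; (\w+) is the first word.
FIRST_WORD = re.compile(r"(?:^|[.?!]\s*)(\w+)")


def first_words(text):
    """Return the first word of every sentence in text."""
    return FIRST_WORD.findall(text)


text = ("He hit the ball. Then he ran. The crowd was cheering! "
        "How did he feel? I felt so energized!")
print(first_words(text))
# ['He', 'Then', 'The', 'How', 'I']
```

Note that the non-capturing group still takes part in the overall match; it is only left out of the captured groups, which is why the punctuation still showed up in the Match column in the question.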

This works and is simple:
([A-Z])\w*
VBA requires these flag settings:
Global = True 'Match all occurrences not just first
IgnoreCase = False 'First word of each sentence starts with a capital letter
Here's some additional hard-earned info: since your regex has at least one set of parentheses, you can use Submatches to pull out only the values inside the parentheses and ignore the rest, which is very useful. Here is the debug output of a function I use to get Submatches, run on your string:
theMatches.Count=5
Match='He'
Submatch Count=1
Submatch='H'
Match='Then'
Submatch Count=1
Submatch='T'
Match='The'
Submatch Count=1
Submatch='T'
Match='How'
Submatch Count=1
Submatch='H'
Match='I'
Submatch Count=1
Submatch='I'
T
Here's the call to my function that returned the above:
sText = "He hit the ball. Then he ran. The crowd was cheering! How did he feel? I felt so energized!"
sRegEx = "([A-Z])\w*"
Debug.Print ExecuteRegexCapture(sText, sRegEx, 2, 0) '3rd match, 1st Submatch
And here's the function:
'Returns the Submatch specified by the passed zero-based indices:
'iMatch is which match you want,
'iSubmatch is the index within the match of the parentheses
'containing the desired results.
'Note: New RegExp requires a reference to "Microsoft VBScript Regular Expressions 5.5".
Function ExecuteRegexCapture(sStringToSearch, sRegEx, iMatch, iSubmatch)
    Dim oRegex As Object
    Set oRegex = New RegExp
    oRegex.Pattern = sRegEx
    oRegex.Global = True 'True = find all matches, not just first
    oRegex.IgnoreCase = False
    oRegex.Multiline = True 'True = ^ and $ also match at the start and end of each line
    bDebug = True
    ExecuteRegexCapture = ""
    Set theMatches = oRegex.Execute(sStringToSearch)
    If bDebug Then Debug.Print "theMatches.Count=" & theMatches.Count
    For i = 0 To theMatches.Count - 1
        If bDebug Then Debug.Print "Match='" & theMatches(i) & "'"
        If bDebug Then Debug.Print " Submatch Count=" & theMatches(i).SubMatches.Count
        For j = 0 To theMatches(i).SubMatches.Count - 1
            If bDebug Then Debug.Print " Submatch='" & theMatches(i).SubMatches(j) & "'"
        Next j
    Next i
    If bDebug Then Debug.Print ""
    If iMatch < theMatches.Count Then
        If iSubmatch < theMatches(iMatch).SubMatches.Count Then
            ExecuteRegexCapture = theMatches(iMatch).SubMatches(iSubmatch)
        End If
    End If
End Function

Named captured substring in pcre++

I want to capture a named substring with the pcre++ library.
I know the pcre library has the functionality for this, but pcre++ has not implemented it.
This is what I have now (just a simple example):
pcrepp::Pcre regex("test (?P<groupName>bla)");
if (regex.search("test bla"))
{
    // Get matched group by name
    int pos = pcre_get_stringnumber(
        regex.get_pcre(),
        "groupName"
    );
    if (pos == PCRE_ERROR_NOSUBSTRING) return;
    // Get match
    std::string temp = regex[pos - 1];
    std::cout << "temp: " << temp << "\n";
}
If I debug, pos returns 1, and that is right: (?P<groupName>bla) is the 1st submatch (0 is the whole match). It should be OK. But regex.matches() returns 0. Why is that?
By the way, I use regex[pos - 1] because pcre++ reindexes the results with 0 pointing to the first submatch, so 1 becomes 0, 2 becomes 1, 3 becomes 2, etc.
Does anybody know how to fix this?

My mistake, unfortunately: I tested the regex in my real program, and there the regex was different. I used something like this:
(?:/(?P<controller>[^/]+)(?:/(?P<action>[^/]+))?)?
So the group name to number conversion goes well, but when I try to access the group I get an index out of range error because of the optional (?: ... )? groups. I just added a check that the group index is in the valid range; if it is, I can use the group.
Sorry for asking it here too early.
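
The underlying pitfall, an optional group that never took part in the match, is not specific to pcre++. As a rough illustration in Python (not pcre++ code; parse_route is a made-up name), a named group that did not participate comes back as None and has to be checked before use:

```python
import re

# Same shape as the pattern above: both named groups sit inside optional groups.
ROUTE = re.compile(r"(?:/(?P<controller>[^/]+)(?:/(?P<action>[^/]+))?)?")


def parse_route(path):
    """Return a dict of named groups; groups that did not match map to None."""
    m = ROUTE.fullmatch(path)
    if m is None:
        return None
    return m.groupdict()


print(parse_route("/users/edit"))  # {'controller': 'users', 'action': 'edit'}
print(parse_route("/users"))       # {'controller': 'users', 'action': None}
print(parse_route(""))             # {'controller': None, 'action': None}
```

The range check described above plays the same role in pcre++ as the None check does here.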
