TR1 regex: capture groups? - c++

I am using TR1 Regular Expressions (for VS2010) and what I'm trying to do is search for a specific pattern for a group called "name", and another pattern for a group called "value". I think what I want is called a capture group, but I'm not sure if that's the right terminology. I want to assign matches of the pattern "([^:\r\n]+):\s" to a list of matches called "name", and matches of the pattern "([^\r\n]+)\r\n" to a list of matches called "value".
The regex pattern I have so far is
string pattern = "((?<name>[^:\r\n]+):\s(?<value>[^\r\n]+)\r\n)+";
But the TR1 regex header keeps throwing an exception when the program runs. What's wrong with the syntax of the pattern I have? Can someone show an example pattern that would do what I'm trying to accomplish?
Also, how would it be possible to include a substring within the pattern to match, but not actually include that substring in the results? For example, I want to match all strings of the pattern
"http://[[:alpha:]]\r\n"
, but I don't want to include the substring "http://" in the returned results of matches.

The C++ TR1 and C++11 regular expression grammars don't support named capture groups. You'll have to use unnamed capture groups instead.
Also, make sure you don't run into escaping issues. You'll have to escape some characters twice: once for being in a C++ string literal, and once for being in a regex. The pattern (([^:\r\n]+):\s([^\r\n]+)\r\n)+ can be written as a C++ string literal like this:
"(([^:\\r\\n]+):\\s([^\\r\\n]+)\\r\\n)+"
// or in C++11
R"xxx((([^:\r\n]+):\s([^\r\n]+)\r\n)+)xxx"
Lookbehinds are not supported either. You'll have to work around this limitation by using capture groups: use the pattern (http://)([[:alpha:]]\r\n) and grab only the second capture group.
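A minimal sketch of both workarounds with std::regex, assuming a C++11 compiler (with TR1 the same classes live in std::tr1, and the raw string literals would have to be written with doubled backslashes). The helper names parse_pairs and text_after_scheme are made up for illustration, and the + after [[:alpha:]] is an assumption about the intent. Because a repeated group such as (...)+ only keeps the captures of its last iteration, the sketch drops the outer repetition and walks the matches with an iterator instead:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (name, value) pairs from "name: value\r\n" lines.
std::vector<std::pair<std::string, std::string>> parse_pairs(const std::string& text) {
    // Group 1 is the name, group 2 is the value (unnamed groups, referenced by index).
    static const std::regex line_re(R"(([^:\r\n]+):\s([^\r\n]+)\r\n)");
    std::vector<std::pair<std::string, std::string>> pairs;
    for (std::sregex_iterator it(text.begin(), text.end(), line_re), end; it != end; ++it) {
        pairs.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return pairs;
}

// Hypothetical helper: matches "http://" plus letters plus CRLF, but returns
// only the captured letters, so the scheme is not part of the result.
std::string text_after_scheme(const std::string& text) {
    static const std::regex url_re(R"(http://([[:alpha:]]+)\r\n)");
    std::smatch m;
    return std::regex_search(text, m, url_re) ? m[1].str() : std::string();
}
```

Calling parse_pairs on a block of header-like lines yields one pair per line, which stands in for the "name" and "value" lists asked about in the question.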

Related

VSCode Regex Find/Replace In Files: can't get a numbered capturing group followed by numbers to work out

I have a need to replace this:
fixed variable 123
with this:
fixed variable 234
In VSCode this matches fine:
fixed(.*)123
I can't find any way to make it put the capture in the output if a number follows:
fixed$1234
fixed${1}234
But the find replace window just looks like this:
I read that VSCode uses Rust-flavoured regexes. The documentation linked here indicates ${1}234 should work, but VSCode just puts it in the output as-is.
I also tried a named capture in the style described here:
fixed(?P<n>.*)123 //"invalid regular expression" error
VSCode doesn't seem to understand ${1}:
ps; I appreciate I could hack it in the contrived example with
FIND: fixed (.*) 123
REPL: fixed $1 234
And this does work in vscode:
but not all my data consistently has the same character before the number
After a lot of investigation by myself and @Wiktor we discovered a workaround for this apparent bug in VS Code's search (aka find across files) and replace functionality, in the specific case where the replacement has a single capture group followed by digits, like
$1234, where the intent is to replace with capture group 1 ($1) followed by 234 or any other digits. Instead, the literal text $1234 is the undesired output.
[This works fine in the find/replace widget for the current file but not in the find/search across files.]
There are (at least) two workarounds. Using two consecutive groups, like $1$2234, works properly, as does $1$`234 (or with the $` placed before $1 instead).
So you could create a sham capture group, as in (.*?)()(\d{3}), where capture group 2 has nothing in it, just to get two consecutive capture groups in the replacement, or
use your initial search regex (.*?)(\d{3}) and then use $` just before or after your "real" capture group $1.
OP has filed an issue https://github.com/microsoft/vscode/issues/102221
Oddly, I just discovered that a replacement with a single digit after the group, like $11, works fine, but as soon as you add two or more digits it fails, so $112 fails.
I'd like to share some more insights and my reasoning when I searched for a workaround.
Main workaround idea is using two consecutive backreferences in the replacement.
I tried all the backreference syntaxes described at Replacement Strings Reference: Matched Text and Backreferences. It appeared that none of \g<1>, \g{1}, ${1}, $<1>, $+{1}, etc. work. There are some other backreferences, though, like $' (inserts the portion of the string that follows the matched substring) or $` (inserts the portion of the string that precedes the matched substring). These two backreferences do not work in the VS Code file search and replace feature either: they do not insert any text when used in the replacement pattern.
So, we may use $` or $' as empty placeholders in the replacement pattern.
Find What:      fix(.*?)123
Replace With:
fix$'$1234
fix$`$1234
Or, as in my preliminary test, already provided in Mark's answer, a "technical" capturing group matching an empty string, (), can be introduced into the pattern so that a backreference to that group can be used as a "guard" before the subsequent "meaningful" backreference:
Find What: fixed()(.*)123 (see () in the pattern that can be referred to using $1)
Replace With: fixed$1$2234
Here, $1 is a "guard" placeholder allowing correct parsing of $2 backreference.
Side note about named capturing groups
Named capturing groups are supported, but you should use the .NET/PCRE/Java named capturing group syntax, (?<name>...). Unfortunately, none of the known named backreference syntaxes work in the replacement pattern. I tried the $+{name} Boost/Perl syntax, $<name>, and ${name}; none of them work.
Conclusion
So, there are several issues here that need to be addressed:
We need an unambiguous numbered backreference syntax (\g<1>, ${1}, or $<1>)
We need to make sure $' and $` either work as expected or are parsed as literal text (as happens with $_, which would include the entire input string in the replacement, and $+, which would insert the text matched by the highest-numbered capturing group that participated in the match; neither is recognized by the Visual Studio Code file search and replace feature). The current behavior, where they insert no text at all, is effectively undefined
We need to introduce named backreference syntax (like \g<name> or ${name}).
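For comparison only, and not as a description of VS Code's behavior: C++ std::regex_replace uses ECMAScript-style replacement strings by default, where the two-digit form $01 refers to group 1 without swallowing the digits that follow it. A minimal sketch, with the hypothetical helper name bump_suffix:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces the trailing "123" with "234" while keeping
// whatever was captured between "fixed" and "123".
std::string bump_suffix(const std::string& line) {
    static const std::regex re(R"(fixed(.*)123)");
    // "$01" is the two-digit reference to group 1; the literal "234" follows it.
    return std::regex_replace(line, re, "fixed$01234");
}
```

This is the kind of unambiguous numbered reference the first point above asks for; whether a given editor accepts a two-digit form is a separate question.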

List of named captures / groups in boost regex

I want to know how I can get the name(s) of capture group(s) in a regular expression in boost.
For example, if a user inputs a string which is expected to be a valid regex with named capture groups, how can one iterate through the list of defined groups in the regex and get the names of those groups. Does boost provide facilities to do so, or I am expected to write my own parser to extract those names?
As an example, if the input string is:
(?<year>[0-9]{4}).*(?<month>[0-9]{2}).*(?<day>[0-9]{2})
I want to be able to extract "year","month", and "day" out of the regex.
You can use the following regex:
"\?<([^<>]+)>"
I don't think regex engines let you query the names of the capture groups before compiling the regex, because that would require traversing the input regex once before parsing (and compiling) it, which is not optimal unless the engine compiles the regex and does all of this in a single pass.
So, with regard to your comment: if you may have unnamed groups, you had better loop over the captured groups and check whether each one has a name or not.
Note that you might be able to parse the cases that have unnamed groups with a regex, but I don't think that is a general solution.
For example, you can use the aforementioned regex within parentheses to capture all the groups that don't contain another capture group ([^()]* ensures that):
`\((\?<([^<>]+)>)[^()]*\)`
For other cases you would have to write another one.
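As a rough sketch of that idea in C++ (using std::regex rather than Boost, which is an assumption made for brevity): scan the pattern text for (?<name> openers. The helper name named_groups is hypothetical, names are assumed to be identifier-like so that lookbehinds such as (?<= and (?<! are skipped, and escaped parentheses or character classes containing the same text are not handled:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: lists the names used in (?<name>...) groups of a pattern.
// It inspects the pattern as plain text; it does not compile or validate it.
std::vector<std::string> named_groups(const std::string& pattern) {
    static const std::regex name_re(R"(\(\?<([A-Za-z_]\w*)>)");
    std::vector<std::string> names;
    for (std::sregex_iterator it(pattern.begin(), pattern.end(), name_re), end; it != end; ++it) {
        names.push_back((*it)[1].str());  // group 1 is the name itself
    }
    return names;
}
```

For the pattern in the question this collects year, month and day in the order they appear.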

Regular Expression: How to write nested search pattern?

I'm struggling to write a RegEx pattern that finds continuous sets of blocks like this:
pseudo code:
any sub-string consisted of any number of characters
finished with DDCC
repeated many times
For example, I'd like strings like this:
2342DDCC3423423DDCCfsfsfsfDDCC2weDDCC1312312qeqeDDCC
to be found.
The first part is easy: [A-Za-z0-9]+DDCC
However, when I tried [[A-Za-z0-9]+DDCC]+ the function returned an empty string.
How to code multiple repetition of the pattern, which internally has the repetition syntax itself?
How about:
([A-Za-z0-9]+DDCC)(?1)+
(?1) means the same pattern as the first capturing group (a subroutine call, as supported by PCRE-style engines).
To capture all groups you can use the following expression.
([A-Za-z0-9]+?DDCC) // use global flag based on your language/tool
It will capture all groups ending in DDCC. The important thing to note here is the ? after [A-Za-z0-9]+, which makes the matching non-greedy.
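The original attempt used square brackets, which form a character class, where parentheses were needed to group. A sketch in C++ with std::regex (its ECMAScript grammar has no (?1) subroutine calls, so the group is simply repeated); the helper names are hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// True if the whole string consists of one or more blocks that each end in DDCC.
bool is_block_sequence(const std::string& s) {
    // (?: ... )+ repeats the inner pattern, which itself contains a repetition.
    static const std::regex re(R"((?:[A-Za-z0-9]+?DDCC)+)");
    return std::regex_match(s, re);
}

// Splits the string into the individual blocks (the "global flag" approach).
std::vector<std::string> split_blocks(const std::string& s) {
    // Non-greedy +? stops at the first DDCC; D and C are also in the class,
    // so a greedy + would run on to the last DDCC in the string.
    static const std::regex re(R"([A-Za-z0-9]+?DDCC)");
    std::vector<std::string> blocks;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
        blocks.push_back(it->str());
    }
    return blocks;
}
```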

Regular expressions middle of string

How can I get part of a SIP URI?
For example, I have the URI sip:username@sip.somedomain.com and I need to get just username. I use the expression [^sip:](.*)[$@]+, but the result is username@. How can I exclude @ from the match?
This should do the job:
(?<=^sip:)(.*)(?=[$@])
Use a lookahead instead of actually matching @:
^sip:(.*?)(?=@|\$)
Either you are using a very strange regex flavor, or your starting character class is a mistake. [^sip:] matches a single character that isn't any of s, i, p or :. I am also not certain what the $ character is for, since that isn't a part of SIP syntax.
If lookaheads are not available in your regex flavour (for instance, POSIX regexes lack them), you can still match parts of the string that you don't want to return, by using capture groups and only grabbing the contents of some of them.
For example:
^sip:(.*?)[$@]+
Then only return the contents of the first capture group.
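A small sketch of the lookahead-plus-capture-group variant in C++, assuming the URI has the usual sip:user@host form. std::regex (ECMAScript grammar) supports lookaheads but not lookbehinds, so the (?<=...) version above would not compile there. The helper name sip_user is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the user part of a SIP URI, or "" if none is found.
std::string sip_user(const std::string& uri) {
    // The lookahead (?=@|\$) checks for the delimiter without consuming it;
    // group 1 holds the text between "sip:" and that delimiter.
    static const std::regex re(R"(^sip:(.*?)(?=@|\$))");
    std::smatch m;
    return std::regex_search(uri, m, re) ? m[1].str() : std::string();
}
```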

Extract and use a part of string with a regex in GVIM

I've got a string:
doCall(valA, val.valB);
Using a regex in GVIM I would like to change this to:
valA = doCall(valA, val.valB);
How would I go about doing this? I use %s for basic regex search and replace in GVIM, but this is a bit different from my normal usage.
Thanks
You can use this:
%s/\vdoCall\(<(\w*)>,/\1 = doCall(\1,/
\v enables “more magic” in regular expressions – not strictly necessary here but I usually use it to make the expressions simpler. <…> matches word boundaries and the in-between part matches the first parameter and puts it in the first capture group. The replacement uses \1 to access that capture group and insert into the right two places.
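The same capture-and-reuse idea, sketched outside Vim in C++ for comparison (this is not Vim syntax; std::regex writes the back-reference as $1 rather than \1, and the helper name assign_first_arg is hypothetical):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: turns "doCall(x, ...)" into "x = doCall(x, ...)".
std::string assign_first_arg(const std::string& line) {
    // Group 1 captures the first argument; it is inserted twice in the replacement.
    static const std::regex re(R"(doCall\((\w+),)");
    return std::regex_replace(line, re, "$1 = doCall($1,");
}
```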