List of named captures / groups in boost regex - c++

I want to know how I can get the name(s) of capture group(s) in a regular expression in Boost.
For example, if a user inputs a string which is expected to be a valid regex with named capture groups, how can one iterate through the list of defined groups in the regex and get the names of those groups? Does Boost provide facilities to do so, or am I expected to write my own parser to extract those names?
As an example, if the input string is:
(?<year>[0-9]{4}).*(?<month>[0-9]{2}).*(?<day>[0-9]{2})
I want to be able to extract "year", "month", and "day" out of the regex.

You can extract the names from the pattern text itself with the following regex:
"\?<([^<>]+)>"
I don't think regex engines expose the names of the capture groups before the regex is compiled, because that would require an extra pass over the input before parsing (and compiling) it, which is not optimal unless the engine collects the names during the same compilation pass.
So, with regard to your comment, if the pattern may contain unnamed groups, you are better off looping over the captured groups and checking whether each one has a name.
Note that you might be able to handle the cases with unnamed groups using a regex, but I don't think that approach is general.
For example, you can wrap the aforementioned regex in parentheses to capture every named group that does not contain another capture group ([^()]* ensures that):
`\((\?<([^<>]+)>)[^()]*\)`
For other cases you have to write another pattern.
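As a rough illustration of the text-scanning idea above, here is a minimal C++ sketch that applies the same `\?<([^<>]+)>` pattern to the regex source string with std::regex. The function name `named_groups` is made up for this example, and the scan is only textual: escaped parentheses, character classes, or lookbehinds such as `(?<=...)` could produce false matches.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the names of (?<name>...) groups by scanning
// the pattern text. It does not parse the regex, so unusual patterns
// (escapes, character classes, lookbehinds) may yield false matches.
std::vector<std::string> named_groups(const std::string& pattern) {
    static const std::regex name_re(R"re(\?<([^<>]+)>)re");
    std::vector<std::string> names;
    for (std::sregex_iterator it(pattern.begin(), pattern.end(), name_re), end;
         it != end; ++it) {
        names.push_back((*it)[1].str());  // group 1 is the text between < and >
    }
    return names;
}
```

Called on the date pattern from the question, this should yield the three names in order of appearance.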

Related

Replace a Tag Name while keeping the rest as it is

I want to preface by saying I am a novice at regex, and I've spent a considerable amount of time trying to solve this myself using tutorials, online docs, etc. I have also gone through the suggested answers here.
Now here is my problem: I have 267 lines like this, and each county is different.
<SimpleData name="NAME">Angelina</SimpleData>
What I need to do is to replace NAME with COUNTY and keep the rest the same including the proper county name:
<SimpleData name="COUNTY">Angelina</SimpleData>
I used the following Find to find all the lines that I wanted to change, and was successful.
<SimpleData name="NAME">[\S\s\n]*?</SimpleData>
It's probably not the best way to do this, but it worked.
I hope I've explained this so it can be understood. Thanks, Paul
You need to use capturing groups with backreferences in the replacement field:
Find What: (<SimpleData name=")NAME(">[\S\s\n]*?</SimpleData>)
Replace With: $1COUNTY$2
As per regular-expressions.info:
Besides grouping part of a regular expression together, parentheses also create a numbered capturing group. It stores the part of the string matched by the part of the regular expression inside the parentheses.
If your regular expression has named or numbered capturing groups, then you can reinsert the text matched by any of those capturing groups in the replacement text. Your replacement text can reference as many groups as you like, and can even reference the same group more than once. This makes it possible to rearrange the text matched by a regular expression in many different ways.
Note that, in VSCode, you can't use named groups.
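For readers doing the same replacement in code rather than in an editor, the capture-and-backreference idea might look like the following C++ sketch. It assumes std::regex with its default ECMAScript grammar, where `$1` and `$2` in the format string refer to the first and second groups; `rename_attribute` is an illustrative name, not something from the thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: swaps the attribute value NAME for COUNTY while
// re-inserting everything around it via the two capture groups.
std::string rename_attribute(const std::string& line) {
    static const std::regex re(
        R"re((<SimpleData name=")NAME(">[\s\S]*?</SimpleData>))re");
    // $1 = text before NAME, $2 = text after NAME
    return std::regex_replace(line, re, "$1COUNTY$2");
}
```

Lines that do not match the pattern are returned unchanged, since std::regex_replace copies non-matching text through.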
You really don't want to be using regex for this job. Learn XSLT.
Any attempt to do this using regular expressions will either match things it shouldn't, or will fail to match things that it should. That's not because you're lacking regex skills, it's because of the computer science theory: XML's grammar is defined recursively, and regular expressions can't handle recursively-defined grammars.

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
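A small C++ sketch of this three-group replacement, assuming std::regex with the default ECMAScript grammar (where the replacement uses `$1` rather than `\1`). The function name `format_phone` is invented for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reformats a 10-digit string as (123) 456-7890.
// Input that is not exactly 10 digits is returned unchanged.
std::string format_phone(const std::string& digits) {
    static const std::regex re(R"re(^(\d{3})(\d{3})(\d{4})$)re");
    return std::regex_replace(digits, re, "($1) $2-$3");
}
```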
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred. If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not invalidate your question, nor does it necessarily place the solution outside the reach of many regular expression implementations. Various regex implementations (i.e. language syntax and regex libraries) work around this limitation by exposing objects for accessing repeated captures. (C# with the .NET regex library is one example: match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
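To make the "only the last capture survives" behaviour concrete, here is a short C++ sketch using std::regex. It is not the .NET Captures API mentioned above; std::regex has no equivalent, so the second function shows one possible workaround (matching one digit at a time). Both function names are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// With a quantified group, only the final iteration is kept in the capture.
std::string last_repeated_capture(const std::string& number) {
    static const std::regex re("([0-9]){10}");
    std::smatch m;
    return std::regex_match(number, m, re) ? m[1].str() : std::string();
}

// Workaround: match one digit at a time and collect every match.
std::vector<std::string> each_digit(const std::string& number) {
    static const std::regex digit("[0-9]");
    std::vector<std::string> digits;
    for (std::sregex_iterator it(number.begin(), number.end(), digit), end;
         it != end; ++it) {
        digits.push_back(it->str());
    }
    return digits;
}
```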

use case for ?: in tcl regexp

I read the documentation of ?: in Tcl regexp, which says that it matches an expression without capturing it.
I tried it and it worked fine.
My question is: what is the proper use case for this option? If we do not want to capture a sequence, we can simply not put parentheses there.
Is it just an alternate way, or is there some special condition where we should use it? Kindly clarify.
Easy: You need to group several elements in your regex, but you don't need them as a capturing group for reference.
a+ (b+|c+) OR (a+ b+)|c+
I need parentheses for grouping. But if I run it like this, the engine will capture all those matches, which costs extra memory and time. If I don't need the capturing groups later for reference, I can use ?: to get grouping without that overhead:
a+ (?:b+|c+) OR (?:a+ b+)|c+
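One way to see the difference is to ask the engine how many capture groups a pattern defines. The thread is about Tcl, so the following C++ sketch is only an analogy: it uses std::regex, whose `mark_count()` member reports the number of marked (capturing) sub-expressions. `capture_count` is a name chosen for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of capturing groups the compiled pattern defines.
unsigned capture_count(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

With the patterns above, the plain parentheses count as one group each, while the `(?:...)` versions contribute none.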
First, have a look at the Tcl regex reference:
(expression)
Parentheses surrounding an expression specify a nested expression. The substring matching expression is captured and can be referred to via the back reference mechanism, and also captured into any corresponding match variable specified as an argument to the command.
(?:expression)
matches expression without capturing it.
The first part, about a capturing group storing a substring that backreferences can refer to, applies to regex engines in general; the second part, about assigning captures to match variables, is specific to Tcl.
Bearing that in mind, Tcl regex usage can be greatly simplified with non-capturing groups in case you have a pattern with a number of capturing groups, and you want to modify it by adding another group in-between existing groups.
Say, you want to match strings like abc 1234 (comment) and use {(\w+)\s+(\d+)\s+\(([^()]+)\)}:
regexp {(\w+)\s+(\d+)\s+\(([^()]+)\)} $a - body num comment
However, you were asked to also match strings with any number of word+space+digits in-between 1234 and comment. If you write
set a1 "abc 1234 more 5678 text 890 here 678 (comment)"
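regexp {(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)} $a1 - body1 num1 comment1
                     ^^^^^^^^^^^^^^^
then $comment1 will hold a value you would not expect: the text matched by the newly added group rather than the comment, because that group now occupies the third capture slot.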
Turning it into a non-capturing group fixes the issue.
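The shift in group numbering can be reproduced outside Tcl as well. The following is a C++ sketch with std::regex (an assumption of this example, since the answer itself uses Tcl's regexp command); `third_group` is a hypothetical helper that returns whatever ends up in capture slot 3.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of capture group 3, or an empty
// string if the pattern does not match or defines fewer than three groups.
std::string third_group(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re) || m.size() < 4) {
        return std::string();
    }
    return m[3].str();
}
```

With the capturing version of the inserted group, slot 3 holds the last repetition of that group; with `(?:...)`, slot 3 is the comment again, so code written against the original numbering keeps working.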
For other common uses of a non-capturing group, please refer to the post "Are optional non-capturing groups redundant".
You can use a (?:...) group when matching one of several alternative words that you then do not want to capture:
(?:one|two|three)

Capture group that captures an entire string minus a section that matches a pattern

I'm not sure if this is possible, but I figured I'd ask anyways. What I need to do is effectively create a search/replace, but without using the regex s/pattern1/pattern2/ syntax as it is not directly exposed to me.
Is it possible to create a capture group that would take an image path with the image size before the extension and remove the image size?
For instance convert http://example.com/path/to/image/filename-200x200.jpg to http://example.com/path/to/image/filename.jpg using only a capture group and no search/replace bits.
I'm asking as the software I'm working in does not currently have a search/replace functionality.
It's somewhat possible. There's no built-in capability for a match to be something other than a continuous segment of the source text, but you can work around that.
One approach you might consider is the use of non-capturing groups and concatenation. In regex, groups beginning with ?: aren't captured as matches.
For example, given the regex (A)(?:B)(C) and the string "ABC", the result would be:
1. "A"
2. "C"
In your case, then, you could capture around the part you want to ignore, then concatenate the parts you want.
Given the string you provided, http://example.com/path/to/image/filename-200x200.jpg, the regex (.+)(?:-200x200)(.+) returns:
1. "http://example.com/path/to/image/filename"
2. ".jpg"
You could then add the first and second capture groups to produce your intended result.
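If the surrounding software lets you combine capture groups, the concatenation step might look like this C++ sketch. It uses std::regex and the fixed `-200x200` pattern from the answer; `strip_size` is an illustrative name, and handling other sizes would need a more general pattern than the one shown here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: captures the text on both sides of "-200x200" and
// joins the two groups. Returns the input unchanged if there is no match.
std::string strip_size(const std::string& url) {
    static const std::regex re(R"re((.+)(?:-200x200)(.+))re");
    std::smatch m;
    if (!std::regex_match(url, m, re)) {
        return url;
    }
    return m[1].str() + m[2].str();  // group 1 + group 2, size part skipped
}
```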

TR1 regex: capture groups?

I am using TR1 Regular Expressions (for VS2010) and what I'm trying to do is search for specific pattern for a group called "name", and another pattern for a group called "value". I think what I want is called a capture group, but I'm not sure if that's the right terminology. I want to assign matches to the pattern "[^:\r\n]+):\s" to a list of matches called "name", and matches of the pattern "[^\r\n]+)\r\n)+" to a list of matches called "value".
The regex pattern I have so far is
string pattern = "((?<name>[^:\r\n]+):\s(?<value>[^\r\n]+)\r\n)+";
But the TR1 regex header keeps throwing an exception when the program runs. What's wrong with the syntax of the pattern I have? Can someone show an example pattern that would do what I'm trying to accomplish?
Also, how would it be possible to include a substring within the pattern to match, but not actually include that substring in the results? For example, I want to match all strings of the pattern
"http://[[:alpha:]]\r\n"
, but I don't want to include the substring "http://" in the returned results of matches.
The C++ TR1 and C++11 regular expression grammars don't support named capture groups. You'll have to use unnamed capture groups.
Also, make sure you don't run into escaping issues. You'll have to escape some characters twice: once for being in a C++ string, and once for being in a regex. The pattern (([^:\r\n]+):\s\s([^\r\n]+)\r\n)+ can be written as a C++ string literal like this:
"(([^:\\r\\n]+):\\s\\s([^\\r\\n]+)\\r\\n)+"
// or in C++11
R"xxx((([^:\r\n]+):\s\s([^\r\n]+)\r\n)+)xxx"
Lookbehinds are not supported either. You'll have to work around this limitation by using capture groups: use the pattern (http://)([[:alpha:]]\r\n) and grab only the second capture group.
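Putting the unnamed-group advice together, a possible C++11 sketch for collecting name/value pairs is shown below. It assumes input lines of the form `name: value\r\n` with a single whitespace character after the colon (as in the question's pattern), and it iterates over one-line matches instead of using an outer `(...)+`, since a repeated group would only retain its last iteration. `parse_fields` is a made-up name.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: group 1 plays the role of "name", group 2 of "value".
std::vector<std::pair<std::string, std::string>>
parse_fields(const std::string& text) {
    static const std::regex line_re(R"re(([^:\r\n]+):\s([^\r\n]+)\r\n)re");
    std::vector<std::pair<std::string, std::string>> fields;
    for (std::sregex_iterator it(text.begin(), text.end(), line_re), end;
         it != end; ++it) {
        fields.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return fields;
}
```

The same "grab only the group you need" approach applies to the http:// case: match the prefix in one group and read the other.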