Regular Expression: How to write nested search pattern? - regex

I'm struggling with writing RegEx pattern to find continuous sets of blocks like that:
pseudo code:
any sub-string consisted of any number of characters
finished with DDCC
repeated many times
For example I'd like to strings like this:
2342DDCC3423423DDCCfsfsfsfDDCC2weDDCC1312312qeqeDDCC
to be found.
The first part is easy: [A-Za-z0-9]+DDCC
However when I did: [[A-Za-z0-9]+DDCC]+ function has returned an empty string.
How to code multiple repetition of the pattern, which internally has the repetition syntax itself?

How about:
([A-Za-z0-9]+DDCC)(?1)+
(?1) means the same pattern as the first capturing group.

To capture all groups you can use following expression.
([A-Za-z0-9]+?DDCC) // use global flag based on your language/tool
It will capture all groups ending at DDCC. The important thing to note here is the use of ? after [A-Za-z0-9] which makes the matching non greedy.

Related

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

More compact RegEx for capturing one or more groups with delimiters - without repeating the group expression [duplicate]

Is it possible to define a part of the pattern once then perhaps name it so that it can be reused multiple times inside the main pattern without having to write it out again?
To paint a picture, my pattern looks similar to this (pseudo regex pattern)
(PAT),(PAT), ... ,(PAT)
Where PAT is some lengthy pattern.
Requirements
Not have to repeat the pattern because it's length becomes a problem (currently, Notepad++ only allows 2047 characters in the search box when using regex and I'm easily going over this limit)
Each capturing group should be able to match independently of its siblings. For example, say that my pattern is ([a-z]),([a-z]),([a-z]) then a,a,a and a,b,c should match
I've looked into naming the first capturing group then referencing it in the subsequent capturing groups but this method breaks the second requirement (i.e., it fails to match a,b,c). Is there a direct or indirect way of fulfilling both requirements using regex only?
My end goal is to be able to get and access the value of each capturing group so I can manipulate each group later in the "replace" part of the search & replace box.
To reuse a pattern, you could use (?n) where n is the number of the group to repeat. For example, your actual pattern :
(PAT),(PAT), ... ,(PAT)
can be replaced by:
(PAT),(?1), ... ,(?1)
(?1) is the same pattern as (PAT)whatever PAT is.
You may have multiple patterns:
(PAT1),(PAT2),(PAT1),(PAT2),(PAT1),(PAT2),(PAT1),(PAT2)
may be reduced to:
(PAT1),(PAT2),(?1),(?2),(?1),(?2),(?1),(?2)
or:
((PAT1),(PAT2)),(?1),(?1),(?1)
or:
((PAT1),(PAT2)),(?1){3}

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, matches the second and third correctly. The problem is the third and fourth only match the Dxxx and the last string. I need each of the strings between the '|' to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will you then give three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)

Regex global search without using the global flag

I'm using software that only allows a single line regular expression for filtering and it doesn't allow the global modifier to capture all patterns in the string. Currently, my expression is only returning the first instance.
Is there another way to capture all instances of the pattern in the string?
Expression: (captures hi-res jpg urls)
\{\"hiRes\"\:\"([A-Za-z0-9%\/_:.-]+)\"\,\"thumb
String:
'colorImages': { 'initial': [{"hiRes":"http://sub.website.com/images/I/81OJ6qwKxyL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/41NQRigTUdL._SS40_.jpg","large":"http://sub.website.com/images/I/41NQRigTUdL.jpg","main":{"http://sub.website.com/images/I/81OJ6qwKxyL._SY355_.jpg":[272,355],"http://sub.website.com/images/I/81OJ6qwKxyL._SY450_.jpg":[345,450],"http://sub.website.com/images/I/81OJ6qwKxyL._SY550_.jpg":[422,550],"http://sub.website.com/images/I/81OJ6qwKxyL._SY606_.jpg":[465,606],"http://sub.website.com/images/I/81OJ6qwKxyL._SY679_.jpg":[521,679]},"variant":"MAIN"},{"hiRes":"http://sub.website.com/images/I/71oHZNvsLbL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/31lHNGD-ZDL._SS40_.jpg","large":"http://sub.website.com/images/I/31lHNGD-ZDL.jpg","main":{"http://sub.website.com/images/I/71oHZNvsLbL._SY355_.jpg":[197,355],"http://sub.website.com/images/I/71oHZNvsLbL._SY450_.jpg":[249,450],"http://sub.website.com/images/I/71oHZNvsLbL._SY550_.jpg":[305,550],"http://sub.website.com/images/I/71oHZNvsLbL._SY606_.jpg":[336,606],"http://sub.website.com/images/I/71oHZNvsLbL._SY679_.jpg":[376,679]},"variant":"PT01"},{"hiRes":"http://sub.website.com/images/I/91VCJAcIPEL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51G1gCkOFzL._SS40_.jpg","large":"http://sub.website.com/images/I/51G1gCkOFzL.jpg","main":{"http://sub.website.com/images/I/91VCJAcIPEL._SX355_.jpg":[355,341],"http://sub.website.com/images/I/91VCJAcIPEL._SX450_.jpg":[450,433],"http://sub.website.com/images/I/91VCJAcIPEL._SX425_.jpg":[425,409],"http://sub.website.com/images/I/91VCJAcIPEL._SX466_.jpg":[466,448],"http://sub.website.com/images/I/91VCJAcIPEL._SX522_.jpg":[522,502]},"variant":"PT02"},{"hiRes":"http://sub.website.com/images/I/912B68GN4aL._SL1500_.jpg","thumb":"http://sub.website.com/images/I/51elravQx6L._SS40_.jpg","large":"http://sub.websi
An interesting question. In my understanding, the global flag cannot be "emulated" with other Regex syntax features.
One could try to emulate the global flag by a Regex repetition. You could expand your Regex so that it would match all appearances of "hiRes":... in a repetition loop. But then, you would see that although several URLs would be matched because of the loop, only the last appearance would be captured.
Switching on the global flag does more than just "continue looking". It switches on collecting more than one capture in an array. Having just a Regex loop does not do the same.
I'd like to show two examples what this means. To test the examples, use e.g. https://regex101.com/.
Here is a simple example, first with the global flag:
Given text: a i b i c i
Regex: /(i)/g
Result: array of three strings, [0]="i" Pos.2, [1]="i" Pos.6, [2]="i Pos.10"
Now without the global flag. To match more, we must add a repetition to the Regex that embraces several "i", and a condition that ignores text between two "i". Like this:
Given text: a i b i c i
Regex: /(?:(i)[^i]*)+/
Result: array of one string, [0]="i" Pos.10
This seems puzzling first, but it is correct. The Regex matches from position 2 until 10. And from that match, it captures the last "i" at position 10. So the repetition in the Regex causes not several captures but a longer matching. This is very different from what the global flag does.
To be precise, this behavior is called "greedy". It tries to match as much as possible. With the "U" flag or with certain quantifiers, you can make the Regex "ungreedy". In that case in the example above, your "ungreedily" captured "i" will be that of position 2.
As a more complex example, just enhance your initial Regex. It must ignore text from the URL until the next "hiRes", and a repetition be put around. Here it is:
/\{(?:"hiRes":"([A-Za-z0-9%\/_:.-]+)"(?:[^"]|"(?!hiRes))*)+/
The second part means: match as many as possible that is not a quota, or that is a quota not followed by hiRes. Like this, this syntax will dig until the begin of the next "hiRes". And then the repetition comes in and it starts over with "hiRes".
Try it out. It will capture only the last URL in your text.
Finally, this tutorial is very comprehensive: http://www.regular-expressions.info/

TR1 regex: capture groups?

I am using TR1 Regular Expressions (for VS2010) and what I'm trying to do is search for specific pattern for a group called "name", and another pattern for a group called "value". I think what I want is called a capture group, but I'm not sure if that's the right terminology. I want to assign matches to the pattern "[^:\r\n]+):\s" to a list of matches called "name", and matches of the pattern "[^\r\n]+)\r\n)+" to a list of matches called "value".
The regex pattern I have so far is
string pattern = "((?<name>[^:\r\n]+):\s(?<value>[^\r\n]+)\r\n)+";
But the regex T4R1 header keeps throwing an exception when the program runs. What's wrong with the syntax of the pattern I have? Can someone show an example pattern that would do what I'm trying to accomplish?
Also, how would it be possible to include a substring within the pattern to match, but not actually include that substring in the results? For example, I want to match all strings of the pattern
"http://[[:alpha:]]\r\n"
, but I don't want to include the substring "http://" in the returned results of matches.
The C++ TR1 and C++11 regular expression grammars don't support named capture groups. You'll have to do unnamed capture groups.
Also, make sure you don't run into escaping issues. You'll have to escape some characters twice: one for being in a C++ string, and another for being in a regex. The pattern (([^:\r\n]+):\s\s([^\r\n]+)\r\n)+ can be written as a C++ string literal like this:
"([^:\\r\\n]+:\\s\\s([^\\r\\n]+)\\r\\n)+"
// or in C++11
R"xxx(([^:\r\n]+:\s\s([^\r\n]+)\r\n)+)xxx"
Lookbehinds are not supported either. You'll have to work around this limitation by using capture groups: use the pattern (http://)([[:alpha:]]\r\n) and grab only the second capture group.