How to retrieve a hierarchy of matches from a regex

std::regex r("((.)(.))(.)");
Running this on a three-letter string simply returns 5 sub-matches: the whole match plus one for each of the four capture groups. (Demo on Coliru.)
Instead, I would like to retrieve two "toplevel" matches, where the first match contains two submatches. I would like to be able to nest them to any depth and retrieve a suitable tree of matches.
It appears that Boost has something like this with "nested matches". Is this correct? And can I do this in C++11 without Boost?
Extra: a slightly less trivial toy example where this might be useful:
((,[0-9]+)+)((,[a-z])+)
This would match a series of numbers, followed by a series of words, all separated by commas. I would like to separate the number-matches from the word-matches, instead of having a flat series of matches.

The thing about regex is that they are not recursive descent parsers. But you can use a combination of regex and C++ (or any other language, really).
Just a note, there are some problems with this regex:
((,[0-9]+)+)((,[a-z])+)
For the first item to be matched, the list must start with ,. The other problem is that it will only catch one-letter lowercase words.
For the sake of simplicity, I'm going to solve the first problem by assuming that you prefix each string with ,. The second problem can be solved by changing the regex:
((,[0-9]+)+)((,[a-zA-Z]+)+)
Note that this will not capture more than one set of numbers followed by a set of words. For that you must search in a loop, as the comments said.
Now that that's fixed, I can explain how you might go about accomplishing what you want.
All of the numeric matches are in matches[1]. All of the alphabetic matches are in matches[3].
You can get each individual item in the numeric list by splitting on ,. The same goes for the alphabetic list.
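As a rough illustration of the approach above, here is a minimal sketch in Python rather than C++ (group(1) and group(3) play the role of matches[1] and matches[3]). The sample input and the helper name split_items are assumptions made for the example, and the input is assumed to be already prefixed with a comma.

```python
import re

# Assumed sample input, already prefixed with "," as the answer suggests.
text = ",1,22,333,foo,bar"

pattern = re.compile(r"((,[0-9]+)+)((,[a-zA-Z]+)+)")
match = pattern.fullmatch(text)


def split_items(group_text):
    """Split a ',a,b,c' run on commas (hypothetical helper).

    The leading comma produces one empty string, which is dropped.
    """
    return [item for item in group_text.split(",") if item]


# Group 1 holds the whole numeric run, group 3 the whole alphabetic run.
tree = {
    "numbers": split_items(match.group(1)),
    "words": split_items(match.group(3)),
}
print(tree)
```

The regex only separates the two runs; the second level of the "tree" comes from ordinary string splitting.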

Related

Regular Expression: Two words in any order but with a string between?

I want to use positive lookaheads so that RegEx will pick up two words from two different sets in any order, but with a string between them of length 1 to 20 that is always in the middle.
It is already case-insensitive and allows any number of characters (including zero) before the first word found and after the second word found. I am unsure whether it is more correct to terminate with $.
Without the any order matching I am so far as:
(?i:.*(new|launch|releas)+.{1,20}(product1|product2)+.*)
I have attempted to add any order matching with the following but it only picks up the first word:
(?i:.*(?=new|launch|releas)+.{1,20}(?=product1|product2)+.*)
I thought perhaps this was because of the +.{1,20} in the middle but I am unsure how it could work if I add this to both sets instead, as for instance this could cause a problem if the first word is the very first part of the source text it is parsing, and so no character before it.
I have seen examples where \b is used with lookaheads, but that also seems like it may cause a problem, as I want it to match both when the first word is at the start of the source text and when it is not.
How should I edit my RegEx here please?

Regex: matching between underscores

For example, I have a string 111352_01_2_SAMPLE_TEXT_SAMPLE. I need to match first, second, third number and remaining text.
Currently I have this:
First number: ^[^_]+(?=_) (Everything up to the first underscore)
Second number: (?<=_)[^_]*(?=_) (Everything between the first and second underscores)
Remaining text: (?:.*?_){3}(.*)\s* (Text after the third underscore)
Is there a more "readable" way of building the expression, since the logic for the first three matches is quite similar?
And what's the best way of writing expression for matching everything
Since you tagged regex-group, I think a more straightforward way of retrieving these three substrings could be:
^(.*?)_(.*?)_.*?_(.*)$
See the demo
Maybe you are looking for a single regex that can be applied to whichever element of the string you want. In that case you could use:
^(?:.*?_){0}([^\n_]+)
This retrieves underscore-delimited elements by zero-based index: change the 0 to 1, 2, 3 and so on to select a later element. However, I do not see the benefit over a regular split() function.
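To make the comparison with split() concrete, here is a small Python sketch. The helper name nth_field is hypothetical, and the sample string is the one from the question; it only checks that the indexed pattern and a plain split agree on the first three fields.

```python
import re

sample = "111352_01_2_SAMPLE_TEXT_SAMPLE"


def nth_field(text, index):
    """Return the field at a zero-based index (hypothetical helper).

    Skips `index` underscore-terminated chunks, then captures the next
    run of characters that contains no underscore or newline.
    """
    pattern = r"^(?:.*?_){%d}([^\n_]+)" % index
    found = re.match(pattern, text)
    return found.group(1) if found else None


by_regex = [nth_field(sample, i) for i in range(3)]
by_split = sample.split("_")[:3]
print(by_regex, by_split)
```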
Just use
^(\d+)_(\d+)_(\d+)_(.+)
See a demo on regex101.com.
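For reference, a minimal Python sketch of how the four groups of this pattern come out for the string from the question. The variable names are chosen for illustration only.

```python
import re

sample = "111352_01_2_SAMPLE_TEXT_SAMPLE"

# Three numeric fields, then everything after the third underscore.
found = re.match(r"^(\d+)_(\d+)_(\d+)_(.+)", sample)
first, second, third, rest = found.groups()
print(first, second, third, rest)
```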

Regex matching all subsequences, repeating characters

For example, let's take the sequence
"aaaaaa".
I want regex to match all subsequences, including repeating characters. Meaning the total count of subsequences should be 5, instead of 3.
Clarification:
Let's number our characters. Our sequence will look something like
"a1a2a3a4a5a6"
All subsequences are:
"a1a2", "a2a3", "a3a4", "a4a5", "a5a6"
Can I do that in regex? I am currently programming in Java and I know it is possible to develop an algorithm there, but I would like to avoid that for now.
You can use the following regex:
(?=((a)\2))
See demo
The technique of capturing the overlapping substrings inside a positive lookahead is described here.
The difference is that you need to use 2 capturing groups: one is a "functional", technical, inner group to make sure we match two identical consecutive symbols, and the outer group (ID#1) that we can use to extract the values we need.
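The question is about Java, but the same pattern can be tried quickly in Python; the sketch below is only an illustration of the lookahead technique, with variable names chosen for the example. Each match is zero-width, so the engine advances one character at a time and the pair is read from group 1.

```python
import re

sequence = "aaaaaa"

# The lookahead consumes nothing, so overlapping pairs are all visited.
# Group 2 is the inner "technical" group, group 1 holds the pair itself.
matches = list(re.finditer(r"(?=((a)\2))", sequence))

pairs = [m.group(1) for m in matches]
starts = [m.start() for m in matches]
print(len(pairs), pairs, starts)
```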

Regex for string that contains characters in unspecified order

When having strings like
helloworld
worldhello
ollehdlrow
Is there a regex that can match all those cases? So, basically a pattern that will match all strings that contain all characters, in unspecified order.
I tried using
/[helloworld]{10}/
but this doesn't work for obvious reasons, as it will also match eeeeeeeeee.
You definitely don't want to use regular expressions for this.
In order to check whether a character exists in the string, in your case, you would have to use a positive lookahead. It would look something like (?=a) to check for the character a. That's fine. If we want to check for a string containing both a and b, we can use /^(?=.*a)(?=.*b)/. Problems arise if we want to check for multiple occurrences of a.
View this example: http://regex101.com/r/iV2jC8
As you can see, the regex has been "told" to look two times for the letter 'a'. However, the first case still matches. This is because the engine does not save the position where it initially found the first 'a', and thus the next assertion finds the very same a. This is the case in all three of the examples. So in reality, none of them are really being validated.
You would have to do something like this: http://regex101.com/r/cR8eR4
Which as you probably can imagine will quickly get out of hand with larger patterns.
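The linked demos are not reproduced here, so the following Python sketch uses patterns that are assumptions written to show the same effect: two independent lookaheads for a are satisfied by a single a, while one lookahead that spans both occurrences is not. It also shows a non-regex alternative (comparing letter counts with collections.Counter), which is a technique swapped in for illustration and not part of the answer; same_letters is a hypothetical helper.

```python
import re
from collections import Counter

single_a = "xax"

# Both lookaheads start from position 0, so they can find the same 'a'.
naive = re.compile(r"^(?=.*a)(?=.*a)")
# A single lookahead that has to see two separate 'a' characters.
strict = re.compile(r"^(?=.*a.*a)")

naive_hit = naive.search(single_a) is not None
strict_hit = strict.search(single_a) is not None


def same_letters(candidate, reference="helloworld"):
    """True if both strings contain exactly the same letters (hypothetical helper)."""
    return Counter(candidate) == Counter(reference)


print(naive_hit, strict_hit, same_letters("ollehdlrow"))
```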
I hope this helps, best of luck.

RegExp Find skip letter in the word

I want to find a word even when it is written with a letter skipped.
For example I want to find
references
I also want to find refrences or refernces, but not refer.
I wrote this regexp:
(\brefe?r?e?n?c?e?s?\b)
I also want to check the length of the matched group: it should be greater than 8.
Can I do this with regexp methods alone?
I don't think regex is a good tool for finding similar words the way you are trying to. What would you do if two letters are swapped, like "refernece"? Your regex will not find it.
But to show the regex way to check for the length, you could do this by using a lookahead like this
(\b(?=.{8,}\b)refe?r?e?n?c?e?s?\b)
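The (?=.{8,}\b) requires at least 8 characters ({8,}) between the first \b and a later word boundary. Note that . also matches spaces, so in running text that boundary is not necessarily the end of the same word; \w{8,} restricts the check to a single word.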
See it here on Regexr
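A quick way to probe this pattern is to run it over a few inputs. The Python sketch below first tries it on isolated words, then on a short sentence. The word_only variant with \w is an assumption added for comparison, not part of the answer, and the sample sentence is made up.

```python
import re

# Pattern from the answer above.
original = re.compile(r"(\b(?=.{8,}\b)refe?r?e?n?c?e?s?\b)")
# Hypothetical variant: \w keeps the length check inside one word.
word_only = re.compile(r"(\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b)")

words = ["references", "refrences", "refernces", "refer"]
single_word_hits = [w for w in words if original.search(w)]

# In running text, "." can also consume the spaces after a short word.
sentence = "refer to the docs"
original_in_sentence = original.search(sentence)
word_only_in_sentence = word_only.search(sentence)

print(single_word_hits)
print(original_in_sentence, word_only_in_sentence)
```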
I think that using regex is not a good idea. You need more powerful functions. For example, if you are programming in PHP, you need a function like similar_text. More details here: http://www.php.net/manual/en/function.similar-text.php
Basically you are asking that (in pseudo code):
input == "references" or (levenshtein("references", input) == 1 and length(input) == (length("references") - 1))
Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Since you want to detect only the strings where a char was skipped, you must add the constraint on the string length.
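The pseudocode can be turned into a small runnable sketch. The version below is written in Python with a textbook dynamic-programming Levenshtein distance; the function names levenshtein and is_word_or_one_skip are chosen for this example and do not refer to any particular library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def is_word_or_one_skip(candidate, target="references"):
    """True for the target itself or the target with exactly one letter missing."""
    if candidate == target:
        return True
    return (len(candidate) == len(target) - 1
            and levenshtein(target, candidate) == 1)


for word in ["references", "refrences", "refernces", "refer"]:
    print(word, is_word_or_one_skip(word))
```

The length constraint is what rules out substitutions and insertions, which also have distance 1 but do not shorten the word.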