Regex matching all subsequences, repeating characters - regex

For example, let's take the sequence
"aaaaaa".
I want the regex to match all two-character subsequences, including overlapping ones, so the total count should be 5 instead of 3.
Clarification:
Let's number the characters. The sequence then looks like
"a1a2a3a4a5a6"
The subsequences I want are:
"a1a2", "a2a3", "a3a4", "a4a5", "a5a6"
Can I do that in regex? I am currently programming in Java and I know it is possible to develop an algorithm there, but I would like to avoid that for now.

You can use the following regex:
(?=((a)\2))
Capturing overlapping substrings inside a positive lookahead works because the lookahead consumes no characters, so the engine retries the pattern at every position.
The difference here is that you need two capturing groups: the inner group (ID#2) is purely functional and makes sure we match two identical consecutive symbols, while the outer group (ID#1) holds the value we want to extract.
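As a rough illustration, here is the same pattern run from Python's re module. The helper name overlapping_pairs is made up for this sketch. The pattern itself uses only lookahead and back-references, which java.util.regex also supports.

```python
import re

# The lookahead consumes nothing, so a match is attempted at every position.
# Group 2 is a single "a"; \2 requires the same character to follow it.
# Group 1 wraps both and is the value we extract.
PAIR = re.compile(r"(?=((a)\2))")


def overlapping_pairs(text):
    """Return (start index, matched pair) for every overlapping pair.

    Hypothetical helper, not part of any library.
    """
    return [(m.start(), m.group(1)) for m in PAIR.finditer(text)]


pairs = overlapping_pairs("aaaaaa")
print(len(pairs))  # 5
print(pairs)
```

Each match itself is empty; the useful text lives in group 1, which is why the outer group is needed.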

Related

Regex - match if at least 2 words out of N words, in any order

I'm trying to create a regex expression that will create a match if a string has at least 2 words out of N. For example, take the words ('one', 'two', 'three', 'four'). This regex should return a match for all these cases:
one two three four
twothreeone
two plus two is four
It should not return a match for:
one
three plus three is three
I have tried something like '/^(?=.*one)(?=.*two)(?=.*three)(?=.*four).+/', but this only matches if all of the words ('one', 'two', 'three', 'four') are contained in the string.
Apologies for stealing someone's comment, but it does appear to work!
In Perl/PCRE you can use a reference to a subpattern in a capture group with (?n), where n is the number of the capture group. So: (one|two|three|four).*(?!\1)(?1). In the worst case, you don't have to type everything twice when you know the shortcuts ctrl+c and ctrl+v. – Casimir et Hippolyte
% pcretest
PCRE version 8.35 2014-04-04
re> #(one|two|three|four).*(?!\1)(?1)#
data> one one one
No match
data> one two one
0: one two
1: one
data> one four
0: one four
1: one
data> four four
No match
data> ^D
%
Indeed, PCRE, a popular library used by nginx (it is the only dependency of the whole nginx port in OpenBSD ports!) and lots of other software, lets you write (?1) (or (?-1)) to reuse an earlier subpattern, so you don't have to copy-paste it several times. The negative lookahead is just standard fare.
Here are the docs on the features in question; you may want to look into the pcrepattern and pcresyntax manual pages, in the sections below:
http://www.pcre.org/original/doc/html/pcresyntax.html#SEC19
http://mdoc.su/f/pcresyntax.3#LOOKAHEAD_AND_LOOKBEHIND_ASSERTIONS
(?!...) negative look ahead
http://www.pcre.org/original/doc/html/pcresyntax.html#SEC21
http://mdoc.su/f/pcresyntax.3#SUBROUTINE_REFERENCES_(POSSIBLY%09RECURSIVE)
(?n) call subpattern by absolute number
http://www.pcre.org/original/doc/html/pcrepattern.html#SEC24
http://mdoc.su/f/pcrepattern.3#SUBPATTERNS_AS_SUBROUTINES
etc.
In general, the http://www.pcre.org/original/pcre.txt and http://www.pcre.org/pcre2.txt pages include the complete documentation, and are helpful for looking up syntax you've seen somewhere.
(one|two|three|four).*(?!\1)(?-1)
Explanation:
Capture one of the words in group 1
Match any number of characters
Fail at this position if what follows is the same word that group 1 captured
Otherwise match any of the words again by re-running the pattern of the preceding group with (?-1) (a relative subpattern call)
This means that when you edit the word list you only have to edit one capture group, assuming you're using a PCRE-based regex (with, say, PHP).
Search for two copies of a target word, but capture the first and apply a negative lookahead on the second word using a back reference to the first group to assert that a different word appeared in the second group - making (at least) 2 in total.
(one|two|three|four).*(?!\1)(one|two|three|four)
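For engines without subpattern calls, the copy-paste variant can be generated instead of typed. The sketch below uses Python's re module, which has no (?1) syntax, and builds the alternation once from a word list. The function name has_two_distinct_words and the WORDS constant are invented for this example.

```python
import re

WORDS = ["one", "two", "three", "four"]


def has_two_distinct_words(text, words=WORDS):
    """True if text contains at least two different words from the list.

    Hypothetical helper; builds the copy-paste pattern programmatically.
    """
    alt = "|".join(re.escape(w) for w in words)
    # Group 1 captures the first word; (?!\1) rejects the same word at the
    # position where the second alternation is about to match.
    pattern = r"(" + alt + r").*(?!\1)(?:" + alt + r")"
    return re.search(pattern, text) is not None


for sample in ["one two three four", "twothreeone", "two plus two is four",
               "one", "three plus three is three"]:
    print(sample, "->", has_two_distinct_words(sample))
```

Editing the word list then only means editing WORDS, which gives the same maintenance benefit as (?1) in PCRE.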

A short way to capture/back-reference every digit of a number individually

So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's to the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
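A minimal sketch of that replacement using Python's re.sub (the function name format_phone is made up here; other languages use the same pattern with their own replacement syntax):

```python
import re


def format_phone(digits):
    """Reformat a 10-digit string as (123) 456-7890.

    Hypothetical helper; input that does not match is returned unchanged.
    """
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", digits)


print(format_phone("1234567890"))  # (123) 456-7890
```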
Although the other answer, with its example regex patterns, hopefully sheds light on the correct use of capture groups, it does not directly answer the question. If you don't understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
"Is there is a more elegant way to reference each character as its own capture group?"
The initial answer is "No": there is no way to reference an individual capture of a repeated capture group using traditional replacement syntax, regardless of whether it is a single digit or anything else. You indicate a precise number of matches with {10}, so it seems perfectly reasonable to expect access to each capture. But what if you had indicated a variable number of matches with + or {0,3}? There would be no well-defined way of knowing how many captures occurred. If the same pattern had more capture groups following the repeated group, there would be no way of correctly referencing the later groups. Example: given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time and 11 letters the next. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not invalidate your question, nor does it necessarily place the solution outside the reach of many regular expression implementations. Various regex implementations (e.g. languages and regex libraries) work around this limitation by exposing objects that give access to repeated captures. (C# with the .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
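To make the "last capture wins" behaviour concrete, here is a small sketch in Python's standard re module, which (unlike .NET) keeps only the final iteration of a repeated group. The findall workaround is just one possible approach, not the only one.

```python
import re

number = "1234567890"

# A repeated group keeps only its last iteration.
m = re.fullmatch(r"([0-9]){10}", number)
print(m.group(1))  # 0

# One workaround: collect every digit separately, then slice the list.
digits = re.findall(r"[0-9]", number)
formatted = "({}) {}-{}".format(
    "".join(digits[0:3]), "".join(digits[3:6]), "".join(digits[6:10])
)
print(formatted)  # (123) 456-7890
```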

Regex to match recurring character groups

I'm trying to write a regex that matches groups of exactly three characters that recur within the text at least once.
What I came up with is this simple regex: (.{3}).*\g1, using the g (global) and s (dot also matches newline) flags. However, it is clearly faulty, as it only finds some of the groups I'm hoping to capture. Any idea how I can improve it? Here is the link to an example input: https://regex101.com/r/Cuiva1/2
Edit: Here's the full list of groups I was hoping to capture, as requested in the comments: GLT, VIW, IWK, KTL, GLT, LTK, LIS, KTX, TXK, XDL, KTL
If your input is always multiple triplets of uppercase characters and you're only looking for ones that repeat, then you need something more complex to avoid backtracking into a previous triplet:
/(?>[^A-Z]*+([A-Z]{3}))(?=(?:[^A-Z]*+[A-Z]{3})*?\1)|(?>[^A-Z]*+[A-Z]{3})/g
The matches at index 1 (capture group 1) will hold what you want. If your strings are not that well formatted (i.e. they may contain a string of any length in between the repeating patterns), then you can use a simpler pattern, but you'll get inconsistent results and miss some matches.
Having re-read your desired output: you're not going to achieve this with a regex like the one above. VIW and IWK overlap, and consuming matches cannot overlap in a single preg_match_all() call. Just use string functions.
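The "string functions" route could look roughly like the sketch below, written in Python rather than PHP. It assumes that "group" means any window of three consecutive characters, overlapping ones included, and that whitespace needs no special treatment; the function name repeated_triplets is invented for this example.

```python
from collections import Counter


def repeated_triplets(text, size=3):
    """Return every window of `size` characters that occurs more than once.

    Hypothetical helper. Windows overlap, and results keep the order of
    first appearance.
    """
    windows = [text[i:i + size] for i in range(len(text) - size + 1)]
    counts = Counter(windows)
    seen = set()
    result = []
    for window in windows:
        if counts[window] > 1 and window not in seen:
            seen.add(window)
            result.append(window)
    return result


print(repeated_triplets("ABCDABCE"))  # ['ABC']
```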

How to retrieve a hierarchy of matches from a regex

std::regex r("((.)(.))(.)");
Running this on a three-letter string simply returns 5 sub-matches (the whole match plus the four groups) as a flat list.
Instead, I would like to retrieve two "toplevel" matches, where the first match contains two submatches. I would like to be able to nest them to any depth and retrieve a suitable tree of matches.
It appears as if Boost has something like this with "nested matches". Is this correct? And can I do this in C++11 without Boost?
Extra: a slightly less trivial toy example where this might be useful:
((,[0-9]+)+)((,[a-z])+)
This would match a series of numbers, followed by a series of words, all separated by commas. I would like to separate the number-matches from the word-matches, instead of having a flat series of matches.
The thing about regexes is that they are not recursive descent parsers. But you can use a combination of regex and C++ (or any other language, really).
Just a note, there are some problems with this regex:
((,[0-9]+)+)((,[a-z])+)
For the first item to be matched, the list must start with a comma. The other problem is that the second half only matches single lowercase letters, not words.
For the sake of simplicity, I'm going to solve the first problem by assuming that you prefix each string with a comma. The second problem can be solved by changing the regex:
((,[0-9]+)+)((,[a-zA-Z]+)+)
Note that this will not capture more than one set of numbers followed by a set of words. For that you must search in a loop, as the comments said.
Now that that's fixed, I can explain how you might go about accomplishing what you want.
All of the numeric matches are in matches[1]. All of the alphabetic matches are in matches[3].
You can get each individual item in the numeric list by splitting on ,. The same goes for the alphabetic list.
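The match-then-split approach might look like this sketch. It is written in Python for brevity, and the group numbers 1 and 3 are the same indices you would use on a std::smatch. The name split_lists is hypothetical.

```python
import re

# Same pattern as above: group 1 is the whole numeric run,
# group 3 is the whole alphabetic run.
LIST_RE = re.compile(r"((,[0-9]+)+)((,[a-zA-Z]+)+)")


def split_lists(text):
    """Return (numbers, words) as two lists of strings, or None if no match.

    Hypothetical helper; assumes the input is prefixed with a comma.
    """
    m = LIST_RE.search(text)
    if m is None:
        return None
    # Each run starts with ",", so the first piece of the split is empty.
    numbers = m.group(1).split(",")[1:]
    words = m.group(3).split(",")[1:]
    return numbers, words


print(split_lists(",1,22,333,foo,bar"))
```

The regex only finds the two top-level runs; the second level of the hierarchy comes from ordinary string splitting.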

Backreferencing something without putting it in the rest of the expression

I am trying to make a regular expression that will match all words that have a letter that repeats at least an arbitrary number of times.
For example, if I want to match words that have a letter that repeats at least 3 times, I would want to match words like
applepie banana insidious
I want to be able to change the number of repeats I'm looking for by just changing one number in my expression, so expressions that only work for a certain number of repeats are not what I'm looking for.
Currently, this is what I'm using
^(?=.*(.))(?=(.*\1){4}).*$
Where 4 is the number of repeats, a number that I can change to whatever number of repeats I'm looking for.
The above regular expression appears to work, but using a lookahead just so I can use a capturing group seems very unwieldy, and so I'm looking for a better way to solve this problem.
This will eliminate one lookahead:
\b(?=\w*(\w)(\w*\1){2})\w*
This reads: at the start of a word, look ahead for any number of word characters, then one particular word character (captured), then that same character again twice, each time preceded by any number of word characters. If the lookahead succeeds, \w* matches the whole word.
For four repetitions, use {3} (for n repetitions, use one less).
Also, feel free to replace \b... with ^...$ as you were doing if you meant to match whole lines and not words in text.
You can use this regex:
\b\w*?(\w)(?=(?:\w*?\1){2})\w*\b
Where 2 is n-1 for the n repetitions you're trying to find in a complete word.
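Since the only thing that changes is the repeat count, the pattern can be generated from n. Below is a small sketch in Python based on the first of the two patterns above; the function name words_with_repeated_letter is made up for illustration.

```python
import re


def words_with_repeated_letter(text, n):
    """Return the words in which some character occurs at least n times.

    Hypothetical helper; the quantifier is n - 1 because the captured
    character itself counts as the first occurrence.
    """
    pattern = r"\b(?=\w*(\w)(?:\w*\1){" + str(n - 1) + r"})\w*"
    # finditer is used so that we get the whole word, not just group 1.
    return [m.group(0) for m in re.finditer(pattern, text)]


print(words_with_repeated_letter("applepie banana insidious cat", 3))
# ['applepie', 'banana', 'insidious']
```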