Regex to match X repeated exactly n times in a row - regex

I am attempting to use regex to match a pattern where we have some X (any character) which occurs exactly n many times in succession. I know a little about regex, but don't know of anything like this.
My previous attempts left me using (.) as a capture group for my X, but I wasn't able to find a way to make sure this happened exactly n times (no more, and no less)
(Edit) For more context, I am trying to separate strings (containing only the letters 'r', 'p', and 's') into either "human" or "machine" generated and I want to assume that any string which has "XrrrrX" (where X is either s or p) or "YssssY" (where Y is either r or p) or "ZppppZ" (where Z is either s or r).
Some sample examples are
psrsrprrsssrrrpsprprsppspsssrsrssrpprppsrpssrp
psrpsprpsrpprpsprpsprpsrpprppsrpsprsprsprppsrp
psrrrrsprsrpsrrsprrrrrprpssssrsprrpspspppprpsr
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
where I want to match only strings that have at most 5 of any character in a row and also at least one occurrence of xxxxx (where x is any character repeated 5 times in a row)

You need to use a back-reference to your capture group. Here is an example regex:
(.)\1{2}
Regex explained:
(.) is a capture group that captures literally anything a single time
\1 is a back-reference to the group you just captured (that single character)
{2} is a quantifier, which matches the previous token (the \1) exactly twice.
Note that, to capture a single character n times, you have to specify {n - 1} as the quantifier because the first match was already captured by (.).
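Note that (.)\1{n - 1} on its own also matches inside a longer run, so it does not enforce "exactly n". A minimal sketch of one way to check that, assuming Python's re module (the helper names run_lengths and has_run_of_exactly are made up for illustration): collect every maximal run with (.)\1* and compare lengths.

```python
import re

def run_lengths(s):
    """Return (character, length) for every maximal run of identical characters."""
    # (.) captures one character; \1* greedily consumes its repeats,
    # so each match is one complete run.
    return [(m.group(1), len(m.group(0))) for m in re.finditer(r"(.)\1*", s)]

def has_run_of_exactly(s, n):
    """True if some character occurs exactly n times in a row in s."""
    return any(length == n for _, length in run_lengths(s))

# (.)\1{2} finds three identical characters in a row, as described above.
print(re.search(r"(.)\1{2}", "abbbc").group(0))
print(run_lengths("aabccc"))
print(has_run_of_exactly("psrrrrsp", 4))
```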

I believe the following should do what you're after:
^(?!([psr]*?([psr])\2{4})\2)(?1)(?2)*$
^ - Start-line anchor;
(?! - Open a negative lookahead;
([psr]*?([psr])\2{4})\2) - A 1st capture group that lazily matches 0+ characters of [psr] up to a nested 2nd capture group holding any one of those characters, followed by a backreference to the 2nd group matched 4 times. After the 1st group closes, the content of the 2nd group is matched once more, so that the lookahead rejects 6+ identical characters in a row; the final ) closes the negative lookahead;
(?1) - Match the same subpattern as the 1st group;
(?2)* - Match the same subpattern as the 2nd group 0+ (greedy) times up to;
$ - End-line anchor.
I suppose this would be short for something like:
^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$
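Python's built-in re module has no (?1)/(?2) subroutine calls, but the expanded form can be tried there directly. A small sketch under that assumption (the function name and the short test strings are mine, not from the question):

```python
import re

# Expanded form of the pattern above, without subroutine calls.
PATTERN = re.compile(r"^(?![psr]*?([psr])\1{5})[psr]*?([psr])\2{4}[psr]*$")

def has_run_of_five_but_not_six(s):
    """True if s consists of p/s/r, has a run of 5, and no run of 6 or more."""
    return PATTERN.match(s) is not None

for candidate in ["prrrrrp", "prrrrrrp", "psrp"]:
    print(candidate, has_run_of_five_but_not_six(candidate))
```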

I have assumed you wish to match any of the following strings:
'psssssp'
'sppppps'
'rsssssr'
'srrrrrs'
'prrrrrp'
'rpppppr'
(but I have a sneaking suspicion that you actually want to do something else, in which case please tell me in a comment).
That could be done by matching a simple alternation:
(?:psssssp|sppppps|rsssssr|srrrrrs|prrrrrp|rpppppr)
Alternatively, you could use a regular expression that employs back-references:
([psr])(?!\1)([psr])\2{4}\1
The second regular expression has the following components.
([psr]) # match 'p', 's' or 'r' and save to capture group 1
(?!\1) # the next character cannot be the content of capture group 1
([psr]) # match 'p', 's' or 'r' and save to capture group 2
\2{4} # match the content of capture group 2 four times
\1 # match the content of capture group 1
(?!\1) is a negative lookahead.
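For what it's worth, here is a quick way to try the back-reference version, sketched in Python (find_bounded_runs is a hypothetical helper name). It returns every substring where five identical letters are enclosed by the same, different letter on both sides.

```python
import re

BOUNDED_RUN = re.compile(r"([psr])(?!\1)([psr])\2{4}\1")

def find_bounded_runs(s):
    """Return all substrings of the form X YYYYY X with X != Y."""
    return [m.group(0) for m in BOUNDED_RUN.finditer(s)]

print(find_bounded_runs("rpsssssprs"))
print(find_bounded_runs("pssssssp"))  # six in a row, so nothing is found
```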

Related

Regex replace using pattern

I need to replace a pattern matched by regex with another pattern using regex, in C++.
Example -
We have the following characters: "a" and "b"
I want to replace like this -
Original text -
aabaaaaaaabaaabab
Replacement -
abbabbbbbbbabbbab
Replace logic -
"aab" must be replaced by "abb",
"aaab" must be replaced by "abbb",
"aaaab" must be replaced by "abbbb",
and so on...
I found the following regex for getting the matches -
aa+b
What regex replace pattern must be applied to get the desired replacement?
Thanks.
If lookarounds are an option for you, here is one solution. Match on the following pattern:
(?<=a)a(?=a*b)
and then replace with just a single b. The pattern says to match:
(?<=a) assert that at least one 'a' precedes (i.e. ignore first 'a')
a a letter 'a'
(?=a*b) which is followed by zero or more 'a' then a 'b'
Here is a working demo.
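The replacement can be checked with a few lines of Python (a sketch only; the question asks about C++, and as far as I know the default ECMAScript grammar of std::regex has no lookbehind, so this exact pattern would need a different engine there). The function name stretch_b is made up.

```python
import re

def stretch_b(text):
    # Replace every 'a' that follows another 'a' and is followed,
    # possibly via more 'a's, by a 'b'.
    return re.sub(r"(?<=a)a(?=a*b)", "b", text)

print(stretch_b("aabaaaaaaabaaabab"))
```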
Here is a solution that is not limited to the letters 'a' and 'b'.
Matches of the following regular expression are to be replaced by the content of capture group 2.
(?<=(.))\1(?=\1*(.))
The regular expression is comprised of the following elements.
(?<= # begin a positive lookbehind
(.) # match any character and save to capture group 1
) # end positive lookbehind
\1 # match the content of capture group 1
(?= # begin a positive lookahead
\1* # match the content of capture group 1 zero or more times
(.) # match any character and save to capture group 2
) # end positive lookahead

Regex - add a zero after second period

I have the following example of numbers, and I need to add a zero after the second period (.).
1.01.1
1.01.2
1.01.3
1.02.1
I would like them to be:
1.01.01
1.01.02
1.01.03
1.02.01
I have the following so far:
Search:
^([^.])(?:[^.]*\.){2}([^.].*)
Substitution:
0\1
but this returns:
01 only.
I need the 1.01. to be captured in a group as well, but now I'm getting confuddled.
Does anyone know what I am missing?
Thanks!!
You may try this regex replacement with 2 capture groups:
Search:
^(\d+\.\d+)\.([1-9])
Replacement:
\1.0\2
RegEx Details:
^: Start
(\d+\.\d+): Match 1+ digits + dot followed by 1+ digits in capture group #1
\.: Match a dot
([1-9]): Match digits 1-9 in capture group #2 (this is to avoid putting 0 before already existing 0)
Replacement: \1.0\2 inserts 0 just before capture group #2
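As a rough illustration, the same search and replacement in Python (pad_third_field is a name I made up):

```python
import re

def pad_third_field(version):
    # Group 1 is "digits.digits"; group 2 is the first digit (1-9) of the third field.
    return re.sub(r"^(\d+\.\d+)\.([1-9])", r"\1.0\2", version)

for v in ["1.01.1", "1.01.2", "1.01.3", "1.02.1"]:
    print(pad_third_field(v))
```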
You could try:
^([^.]*\.){2}\K
Replace with 0. See an online demo
^ - Start line anchor.
([^.]*\.){2} - 0+ (greedy) characters other than a dot, followed by a literal dot, matched twice.
\K - Reset starting point of reported match.
EDIT:
Or, if the \K meta escape isn't supported, then see if the following works:
^((?:[^.]*\.){2})
Replace with ${1}0. See the online demo
^ - Start line anchor.
( - Open 1st capture group;
(?: - Open non-capture group;
[^.]*\. - 0+ (greedy) characters other than a dot, followed by a literal dot.
){2} - Close non-capture group and match twice.
) - Close capture group.
Using your pattern, you can use 2 capture groups and put a zero before the second group in the replacement, for example \g<1>0\g<2>, ${1}0${2} or $10$2, depending on the language.
^((?:[^.]*\.){2})([^.])
^ Start of string
((?:[^.]*\.){2}) Capture group 1: match, 2 times, 0+ chars other than a dot followed by a dot
([^.]) Capture group 2: match a single char other than a dot
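The \g<1>0\g<2> form is the Python spelling; a short sketch (insert_zero is a hypothetical name) shows why the explicit group syntax is needed there:

```python
import re

PATTERN = re.compile(r"^((?:[^.]*\.){2})([^.])")

def insert_zero(s):
    # \g<1>0 means "group 1, then a literal 0"; a plain \10 would be
    # read as a reference to group 10.
    return PATTERN.sub(r"\g<1>0\g<2>", s)

print(insert_zero("1.01.1"))
```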
A more specific pattern could be matching the digits
^(\d+\.\d+\.)(\d)
^ Start of string
(\d+\.\d+\.) Capture group 1, match 2 times 1+ digits and a dot
(\d) Capture group 2, match a digit
For example in JavaScript
const regex = /^(\d+\.\d+\.)(\d)/;
[
"1.01.1",
"1.01.2",
"1.01.3",
"1.02.1",
].forEach(s => console.log(s.replace(regex, "$10$2")));
Obviously, there are plenty of solutions for this, but if the pattern holds (i.e. the trailing group is always a single digit), \.(\d)$ => .0\1 would suffice. To merely insert a 0, you don't need to match the whole thing, only enough context to uniquely identify the targeted places. In this case, finding all lines that end in a . followed by a single digit is enough.
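A sketch of that minimal-context idea in Python, applied to a multi-line string. The replacement is written as .0\1 because a dot needs no escaping in a replacement string; the function name is made up.

```python
import re

def pad_last_digit(text):
    # With re.MULTILINE, $ matches at the end of every line,
    # so each line ending in ".<digit>" gets its zero.
    return re.sub(r"\.(\d)$", r".0\1", text, flags=re.MULTILINE)

print(pad_last_digit("1.01.1\n1.01.2\n1.02.1"))
```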

Regex to validate subtract equations like "abc-b=ac"

I've stumbled upon a regex question.
How do I validate a subtraction equation like this?
One string minus another string equals whatever remains (all the terms are plain strings, not sets, so ab and ba are different strings).
Pass
abc-b=ac
abcde-cd=abe
ab-a=b
abcde-a=bcde
abcde-cde=ab
Fail
abc-a=c
abcde-bd=ace
abc-cd=ab
abcde-a=cde
abc-abc=
abc-=abc
Here's what I tried and you may play around with it
https://regex101.com/r/lTWUCY/1/
Disclaimer: I see that some of the comments were deleted. So let me start by saying that, though short (in terms of code-golf), the following answer is not the most efficient in terms of steps involved. Though, looking at the nature of the question and its "puzzle" aspect, it will probably do fine. For a more efficient answer, I'd like to redirect you to this answer.
Here is my attempt:
^(.*)(.+)(.*)-\2=(?=.)\1\3$
^ - Start line anchor.
(.*) - A 1st capture group with 0+ non-newline characters right up to;
(.+) - A 2nd capture group with 1+ non-newline characters right up to;
(.*) - A 3rd capture group with 0+ non-newline characters right up to;
-\2= - A hyphen followed by a backreference to our 2nd capture group and a literal "=".
(?=.) - A positive lookahead to assert position is followed by at least a single character other than newline.
\1\3 - A backreference to what was captured in both the 1st and 3rd capture group.
$ - End line anchor.
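To check the pattern against the Pass/Fail lists from the question, something like the following Python sketch should do (is_valid is just an illustrative name):

```python
import re

EQUATION = re.compile(r"^(.*)(.+)(.*)-\2=(?=.)\1\3$")

def is_valid(equation):
    """True if 'left-middle=rest' holds with middle removed once from left."""
    return EQUATION.match(equation) is not None

passing = ["abc-b=ac", "abcde-cd=abe", "ab-a=b", "abcde-a=bcde", "abcde-cde=ab"]
failing = ["abc-a=c", "abcde-bd=ace", "abc-cd=ab", "abcde-a=cde", "abc-abc=", "abc-=abc"]

for eq in passing + failing:
    print(eq, is_valid(eq))
```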
EDIT:
I guess a bit more restrictive could be:
^([a-z]*)([a-z]+)((?1))-\2=(?=.)\1\3$
You may use this more efficient regex with a lookahead at the start with a capture group that matches text on the right hand side of - i.e. substring between - and = and captures it in group #1. Then in the main body of regex we just check presence of capture group #1 and capture text before and after \1 in 2 separate groups.
^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$
RegEx Details:
^: Start
(?=[^-]+-([^=]+)=.): Lookahead to make sure we have expression structure of pqr-pq=r and also more importantly capture substring between - and = in capture group #1. . after = is there for a reason to disallow any empty string after =.
([^-]*?): Match 0 or more non-- characters in capture group #2
\1: Back-reference to group #1 to make sure we match same value as in capture group #1
([^-]*): Match 0 or more non-- characters in capture group #3
-: Match a -
[^=]+: Match 1 or more non-= characters
=: Match a =
\2\3: Back-references to groups #2 and #3, whose concatenation is the result of the subtraction
$: End
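To see what ends up in each capture group, here is a small Python sketch (split_equation is a made-up helper); it returns the subtracted part, the prefix before it, and the suffix after it.

```python
import re

EQUATION = re.compile(r"^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$")

def split_equation(equation):
    """Return (subtrahend, prefix, suffix), or None if the equation is invalid."""
    m = EQUATION.match(equation)
    return m.groups() if m else None

print(split_equation("abcde-cd=abe"))
print(split_equation("abc-a=c"))
```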

Regular expression to exclude group with 0 and more occurrence issue

I need to extract 1234567 from below URLs
http://www.test.in/some--wonders-1234567---2
http://www.test.in/some--wonders-1234567
I tried with .*\-([0-9]+)(?:-{2,}2)?.
but for the first URL it returned 2, but this is in non-capturing group.
Please suggest a solution. I have been digging into this for a long time and am out of ideas.
Try this one:
.*?\-([0-9]+)(?:-{2,}2|$)
It makes the first .* pattern lazy; you can also remove it altogether with the same effect:
\-([0-9]+)(?:-{2,}2|$)
If your regex engine supports variable-length negative lookbehinds (many do not), you can do it this way:
(?<!\d+-+)\d+
It gives you any non-empty digit string that is not preceded by digits followed by minuses.
Big advantage is that you don't have to use groups here - regex itself returns what you want.
You could match a - followed by one or more digits which you could capture in a group ([0-9]+). This group will contain the value you want to extract.
Then an optional part (?:-{2,}[0-9]+)? that would match ---2 followed by asserting the end of the line $.
-(\d+)(?:-{2,}\d+)?$
Explanation
- Match a literal hyphen
(\d+) Capture one or more digits in a group
(?: Non capturing group
-{2,} Match 2 or more times -
\d+ Match one or more digits
)? close non capturing group and make it optional
$ Assert position at the end of the line
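A minimal sketch of using this pattern from code, assuming Python (extract_id is a hypothetical name); the value of interest is read from capture group 1.

```python
import re

ID_PATTERN = re.compile(r"-(\d+)(?:-{2,}\d+)?$")

def extract_id(url):
    """Return the digit block before an optional '--<digits>' suffix, or None."""
    m = ID_PATTERN.search(url)
    return m.group(1) if m else None

print(extract_id("http://www.test.in/some--wonders-1234567---2"))
print(extract_id("http://www.test.in/some--wonders-1234567"))
```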

Repeated capturing group PCRE

Can't get why this regex (regex101)
/[\|]?([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures all the input, while this (regex101)
/[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g
captures only |Func
Input string is |Func(param1, param2, param32, param54, param293, par13am, param)|
Also, how can I match a repeated capturing group properly? E.g. I have the regex
/\(\(\s*([a-z\_]+){1}(?:\s+\,\s+(\d+)*)*\s*\)\)/gui
And input string is (( string , 1 , 2 )).
Regex101 says "a repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations...". I've tried to follow this tip, but it didn't help me.
Your /[\|]+([a-z0-9A-Z]+)(?:[\(]?[,][\)]?)?[\|]?/g regex does not match because you did not define a pattern to match the words inside parentheses. You might fix it as \|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?, but all the values inside parentheses would be matched into one single group that you would have to split later.
It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer.
What you may do is get multiple matches with preg_match_all, capturing the initial delimiter.
So, to match the second string, you may use
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\()\K\w+
Details:
(?:\G(?!\A)\s*,\s*|\|+([a-z0-9A-Z]+)\() - either the end of the previous match (\G(?!\A)) and a comma enclosed with 0+ whitespaces (\s*,\s*), or 1+ | symbols (\|+), followed with 1+ alphanumeric chars (captured into Group 1, ([a-z0-9A-Z]+)) and a ( symbol (\()
\K - omit the text matched so far
\w+ - 1+ word chars.
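In engines without \G and \K (Python's built-in re module is one), the other route mentioned above still works: capture the whole parameter list in one group and split it afterwards. A sketch under that assumption, with parse_call as a made-up helper name; the first print also shows that a repeated group keeps only its last iteration.

```python
import re

# A repeated capturing group only remembers its final iteration.
print(re.match(r"(?:(\w+),?\s*)+", "a, b, c").group(1))

CALL = re.compile(r"\|+([a-z0-9A-Z]+)(?:\(?(\w+(?:\s*,\s*\w+)*)\)?)?\|?")

def parse_call(text):
    """Return (name, [params]) for input like |Func(a, b)|, or None."""
    m = CALL.search(text)
    if m is None:
        return None
    name, args = m.group(1), m.group(2)
    # Group 2 holds the whole comma-separated list; split it in a second step.
    params = re.split(r"\s*,\s*", args) if args else []
    return name, params

print(parse_call("|Func(param1, param2, param32, param54, param293, par13am, param)|"))
```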