Regex - math with cycle - regex

How can I count nested occurrences inside a match in order to skip some characters?
Example:
I have:
(a(b(c)))
If I run this regex: \(.+?\)
it returns: (a(b(c)
But what I want is the ) that closes the outermost group, that is, the third one.
I could just remove the ? from the regex, but then there is a problem:
Ex: \(.+\) applied to (a)(a(b(c))) returns (a)(a(b(c)))
What I want is for each match to end at the ) that closes its group, that is, I should get 2 matches:
match 1: (a)
match 2: (a(b(c)))
Why count inside the match? What I thought was: is there any way to count how many ( have been passed, to know how many ) should be skipped? That is:
1   2   3   1 2 3
( a ( b ( c ) ) )
Does anyone have any idea how to do this using only regex?

If you need to use regex, try the recursive pattern (?R).
Support depends on the language and engine, so let me explain it with Python.
#!/usr/bin/python
# Requires the third-party "regex" module; the standard "re" module has no (?R).
import regex
text = '(a)(a(b(c)))'
m = regex.findall(r'\((?:[^()]|(?R))+\)', text)
print(m)
Output:
['(a)', '(a(b(c)))']
Explanation of the regex pattern \((?:[^()]|(?R))+\):
The inner part (?:[^()]|(?R))+ matches:
one or more [^()] or (?R) where
[^()] matches any character other than parentheses.
(?R) represents the entire regex \((?:[^()]|(?R))+\) recursively.
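The counting idea from the question can also be done without a recursive regex. The following is a minimal sketch in plain Python, assuming only round brackets matter and that a stray ) outside any group is simply ignored; the function name balanced_groups is hypothetical and not part of any library.

```python
def balanced_groups(text):
    """Return every top-level balanced (...) group found in text."""
    groups = []
    depth = 0      # how many "(" are currently open
    start = None   # index where the current top-level group began
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # The counter is back to zero: this ")" closes the group.
                groups.append(text[start:i + 1])
    return groups


print(balanced_groups("(a)(a(b(c)))"))
```

The counter plays the role of the "1 2 3" numbering in the question: each ( increments it, each ) decrements it, and a group ends when it returns to zero.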

Related

Get a match when there are duplicate letters in a string

I have a list of inputs in google sheets,
Input       Desired Output  Repeated letters (for demonstration only, not an input)
Outdoors    Match           o
dog         No Match
step        No Match
bee         Match           e
Chessboard  Match           s
Cookbooks   Match           o, k
How do I verify if all letters are unique in a string without splitting it?
In other words, if any letter occurs twice or more in the string, return TRUE.
My process so far
I tried this solution, in addition to splitting the string and dividing the length of the string by the COUNTA of its unique letters: if = 1 "Match", else "No match"
Or using regex
I found a way to match a letter that occurs 2 times in a string, demonstrated here with REGEXEXTRACT. But what I need is to get TRUE when the letters in the string are not all unique.
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string should not need to be split, to avoid heavy calculation (I have long lists).
Avoid using lambda and its helper functions (see why?).
It's OK to return TRUE or FALSE instead of Match or No Match to keep it simple.
More examples
Input           Desired Output
Professionally  Match
Attractiveness  Match
Uncontrollably  Match
disreputably    No Match
Recommendation  Match
Interrogations  Match
Aggressiveness  Match
doublethinks    No Match
You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching (?i) modifier will help to cut down the options to just the 26 letters of the alphabet. I suppose the above can be written a bit neater like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
EDIT 1:
The only other reasonable way to do this with a single regex, other than the above (until I learned about the PREG-supported syntax of the matches clause in QUERY() by #DoubleUnary), is to create your own UDF in GAS (AFAIK). It will be JavaScript-based and thus support backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
  if (s.map) {
    return s.map(REGEXMATCH_JS);
  } else {
    return /([a-z]).*?\1/gi.test(s);
  }
}
The pattern ([a-z]).*?\1 means:
([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.
The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
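For comparison, the same backreference pattern can be tried outside Sheets. This is a minimal sketch using Python's standard re module, which (unlike RE2) supports backreferences; has_repeated_letter is a hypothetical helper name, and the sample words are taken from the question's tables.

```python
import re

# ([a-z]) captures one letter; .*? lazily skips ahead; \1 needs the same letter again.
REPEATED = re.compile(r"([a-z]).*?\1", re.IGNORECASE)


def has_repeated_letter(word):
    """True if any letter a-z occurs at least twice in word (case-insensitive)."""
    return REPEATED.search(word) is not None


for word in ["Outdoors", "dog", "step", "bee", "Chessboard", "Cookbooks"]:
    print(word, "Match" if has_repeated_letter(word) else "No Match")
```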
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
Respectively. Knowing there is a header in 1st row.
Benchmark:
Created a benchmark here.
Methodology:
Use NOW() to create a timestamp, when checkbox is clicked.
Use NOW() to create another timestamp, when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is random data generated with Math.random from the characters [A-Za-z], with 10 characters per word.
Results:
Sample size: 10006
Formula                         Round1   Round2   Avg      % Slower than best
[re2](a.*a|b.*b)JvDv            0:00:19  0:00:19  0:00:19  -15.15%
[re2+recursion]MASTERMATCH_RE2  0:00:27  0:00:24  0:00:26  -54.55%
[Find+recursion]MASTERMATCH     0:00:17  0:00:16  0:00:17  0.00%
[PREG]Doubleunary               0:00:57  0:00:53  0:00:55  -233.33%
Conclusion:
This varies greatly based on browser/device/mobile app and on non-randomized sample data, but I found PREG to be consistently slower than re2.
Use recursion.
This seems much faster than the regex-based approach. Create a named function:
Name:
MASTERMATCH
Arguments(in this order):
word
The word to check
start
Starting at
Function:
=IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    ISERROR(FIND(MID(word,start,1),word,start+1)),
    MASTERMATCH(word,start+1),
    TRUE
  )
)
Usage:
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
Explanation:
It recurses through each character using MID and checks whether the same character is available after this position using FIND. If so, returns true and doesn't check anymore. If not, keeps checking until the last character using recursion.
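The recursion described above can be sketched outside Sheets as well. This is a rough Python analogue, not the named function itself: master_match is a hypothetical name, str.find stands in for FIND, and indexing is 0-based here whereas the Sheets version starts at 1.

```python
def master_match(word, start=0):
    """Recursively check whether any character of word reappears later in it."""
    if start >= len(word):
        # Ran past the last character without finding a repeat.
        return False
    if word.find(word[start], start + 1) != -1:
        # The character at `start` occurs again further right: stop early.
        return True
    return master_match(word, start + 1)


# str.find is case-sensitive, so lower-case the input for a
# case-insensitive check.
print(master_match("Outdoors".lower()))
print(master_match("doublethinks".lower()))
```

As in the named function, the search stops at the first repeated character, so words with an early repeat need very few steps.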
Or with regex,
Create a named function:
Name:
MASTERMATCH_RE2
Arguments(in this order):
word
The word to check
start
Starting at
Function:
IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
    TRUE,
    MASTERMATCH_RE2(word,start+1)
  )
)
Usage:
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
Explanation:
It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b,..., it takes the first character(using MID), eg: o in outdoor and creates a regex o.*o. If regex is positive for that regex (using REGEXMATCH), returns true and doesn't check for other letters or create other regexes.
This uses lambda, but it's efficient. Loop through each row and every character with MAP and REDUCE. Use REGEXREPLACE to remove each character from the word and find the difference in length. If it is more than 1, stop checking lengths and return Match.
=MAP(
  A2:INDEX(A2:A,COUNTA(A2:A)),
  LAMBDA(_,
    REDUCE(
      "No Match",
      SEQUENCE(LEN(_)),
      LAMBDA(a,c,
        IF(a="Match",a,
          IF(
            LEN(_)-LEN(
              REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
            )>1,
            "Match",a
          )
        )
      )
    )
  )
)
If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.
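The remove-and-compare idea behind the REDUCE formula can be illustrated like this. It is only a sketch in Python under the assumption that lower-casing is an acceptable stand-in for the (?i) flag; match_by_removal is a hypothetical name, and str.replace is used instead of a regex so that special characters need no escaping.

```python
def match_by_removal(word):
    """Return "Match" if deleting some character shortens word by more than 1."""
    lowered = word.lower()
    for ch in lowered:
        removed = len(lowered) - len(lowered.replace(ch, ""))
        if removed > 1:
            # More than one copy disappeared, so the character repeats.
            return "Match"
    return "No Match"


print(match_by_removal("Cookbooks"))
print(match_by_removal("step"))
```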
As Daniel Cruz said, Google Sheets functions such as regexmatch(), regexextract() and regexreplace() use RE2 regexes that do not support backreferences. However, the query() function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input           Output
Professionally  TRUE
disreputably    FALSE
Abacus          TRUE
Élysée          TRUE
naïve Ï         TRUE
määräävä        TRUE
121             TRUE
123             FALSE
You can also easily state which specific characters to check by replacing <char>. with something like <char>[\wéäåö] or <char>[^-;,.\s\d].
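To experiment with the named-group pattern outside Sheets, here is a minimal Python sketch. Note the assumptions: Python spells the backreference (?P=char) rather than \k<char>, has_repeat and its char_class parameter are hypothetical names, and the NFC normalization step is my addition so that accented letters compare as single characters.

```python
import re
import unicodedata


def has_repeat(value, char_class="."):
    """True if any character allowed by char_class occurs twice in value."""
    text = unicodedata.normalize("NFC", value).lower()
    # Same shape as the query() pattern: capture one character, then
    # require the same character again somewhere later.
    pattern = rf".*?(?P<char>{char_class}).*?(?P=char).*"
    return re.fullmatch(pattern, text, re.DOTALL) is not None


print(has_repeat("Élysée"))
print(has_repeat("123"))
# Restricting the checked characters, as suggested above:
print(has_repeat("121", r"[^-;,.\s\d]"))
```

Passing a narrower character class mirrors replacing <char>. in the formula: with digits excluded, 121 no longer counts as having a repeat.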
try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
update
create a query heat map, filter it and vlookup back row position
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&REGEXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
case insensitive would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
Just to illustrate another method, not likely to be scalable: try to substitute the second occurrence of each letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)
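The frequency idea translates directly to a counting sketch. The following Python example assumes only the letters A to Z are of interest (as in the 26-bin FREQUENCY call); repeated_letters is a hypothetical name, and it also reports which letters repeat, like the demonstration column in the question.

```python
from collections import Counter


def repeated_letters(word):
    """Return the letters (lower-cased) that occur more than once in word."""
    counts = Counter(ch for ch in word.upper() if "A" <= ch <= "Z")
    return [ch.lower() for ch, n in counts.items() if n > 1]


print(repeated_letters("Cookbooks"))   # letters that repeat
print(bool(repeated_letters("dog")))   # TRUE/FALSE style answer
```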
You can try the regex /(\w).*?\1/g. Note that REGEXMATCH in Google Sheets uses RE2, which rejects the \1 backreference, so this pattern needs an engine that supports backreferences, such as PCRE or JavaScript.
Explanation :
(\w) - matches a word character (a-z, A-Z, 0-9, _). If you are sure that the input will contain only letters, you can use ([a-zA-Z]) instead; then
.*? - zero or more characters, matched lazily (so the repeat may be adjacent or further along); until
\1 - it finds a repeat of the first matched character.
Live Demo : regex101
Coming after the battle ^^ Why not simply compare the number of unique letters in the string and its original length ?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split() provided by this answer)

Complex regex get closest

I would like some help to finish my complex regex.
I spent some time on it and still can't figure out how to achieve what I want.
This is the text I want to parse :
Do [|83]([]?([]?([]?([]?([]?([]?([]?:))))):)([]?([]?:([]?:)):)([]?[]? :):)([]?([]?[]:):)([]?([]?[]:):)
Bo [|18] pz ([]?:)\n la :\n[pl]
Co [|76] pp ([]?:)
For readability, I put each text on its own line, but please consider that they are actually not separated by new lines.
This is my regex so far :
(\[\|(\d*)])+(?!\\\n).*([%\sa-zA-Z]*)(\((\[[^\[\]()?:]*])+\s*\?([^()]*):([^()]*)\))
I'm reading every combination of [|NUMBER] () one by one. The process I apply to "()" depends on the related NUMBER.
When I'm parsing the first time, I'm getting this which is fine :
Then, I replace the whole value after my process :
Now, I do have :
Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)
When I parse them once more, I got :
The number I get is not the right one. My question is: how can I get the number closest to the "()" I'm parsing next?
Thank you for any tips
You might shorten the pattern a bit and exclude both the square brackets and the parentheses in the character class after matching the digits and ]
\[\|\d+][^][()]*\([^()]*\)
The pattern matches:
\[\|\d+] Match [| 1+ digits and ]
[^][()]* Match 0+ times any char other than [ ] ( )
\([^()]*\) Match (, then 0+ times any char other than ( ), then match )
Regex demo
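The effect of excluding brackets between the number and the parentheses can be checked with a short sketch. This Python example makes a few assumptions: the brackets inside the character class are escaped (equivalent to [^][()] but unambiguous), a capturing group around the digits is added for illustration, and the test string is the intermediate text from the question with its \n sequences kept as literal characters.

```python
import re

# [|NUMBER], then anything except brackets/parentheses, then one (...) group.
PAIR = re.compile(r"\[\|(\d+)\][^\]\[()]*\([^()]*\)")

text = r"Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)"

first = PAIR.search(text)
print(first.group(1))      # number closest to the first remaining (...)
print(first.group(0))      # the whole matched span
print(PAIR.findall(text))  # numbers of all remaining pairs
```

Because [|83] is followed by another [ before any (, it can no longer be paired with a later parenthesis, so the first match starts at [|18].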

regex validating a time value or a list of time values

I need a regex, which matches a single time value as well as lists of time values in the format hhmm[, hhmm] like for example:
"1245" or "0056, 1034,2355"
I am not so good with regex.. I thought this would do it:
(([0-1][0-9])|(2[0-3]))[0-5][0-9](,[ \t]*(([0-1][0-9])|(2[0-3]))[0-5][0-9])*
Single time values are validated correctly, but if I try lists of times, every number after the comma is accepted. It also matches "1235, 4711".
Can someone give me a hint about what I am doing wrong?
Thanks in advance!
$pat = qr/(?:2[0-3]|[01][0-9])[0-5][0-9]/;
while (<DATA>) {
    if (/^$pat(,\s*$pat)*$/) {
        print;
    }
}
__DATA__
1245
0056, 1034,2355
1034,2455
You should add a ^ to instruct the regular expression to match from the beginning of the line.
The following regex should work.
^([01][0-9]|2[0-3])[0-5][0-9](,\s*([01][0-9]|2[0-3])[0-5][0-9])*$
Try it yourself
In my opinion this regexp is more readable, and it should work.
while( <DATA> ) {
    if( /
        ^(
            (
                ((0|1)\d)|(2[0-3])  # hour: the first digit may be 0, 1, or 2
                                    # if 0 or 1, the second digit can be from 0 to 9
                                    # if 2, the second digit can be from 0 to 3
            )
            [0-5]\d                 # minutes: the first digit can be
                                    # from 0 to 5, the second from 0 to 9
        )
        (
            ,\s*                    # comma required
                                    # whitespace after it is optional
            (
                ((0|1)\d)|(2[0-3])
            )
            [0-5]\d
        )*$
    /x ) {
        print;
    }
}
Your regular expression is basically fine except that it looks for the pattern anywhere inside the target string. That means any string that contains a single valid time will match. You must add beginning and end of string anchors ^ and $ to force the entire string to match the pattern.
You will find it clearer and easier to code regular expressions if you first write a common sub-expression and then use it like a subroutine. It also helps to use the /x modifier so that you can use whitespace to lay out the expression more clearly.
For instance, this matches a single time string
/ ( [0-1][0-9] | 2[0-3] ) [0-5][0-9] /x
and you can go on to substitute that twice in the main expression.
It is also better to use non-capturing parentheses like (?: ... ) unless you really want to capture the substring into $1, $2 etc.
Take a look at this program and see what you think
use strict;
use warnings;
my $time = qr/(?: (?: [0-1][0-9] | 2[0-3] ) [0-5][0-9] ) /x;
while (<DATA>) {
    print if /^ $time (?: ,[ \t]* $time )* $/x;
}
__DATA__
1245
0056, 1034,2355
1235, 4711
0000,1111
output
1245
0056, 1034,2355
0000,1111
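The same "write the sub-expression once, then anchor the whole thing" approach can be sketched in Python. This is an illustration under the assumption that only spaces and tabs may follow a comma (as in the question's regex); is_time_list is a hypothetical name, and fullmatch takes the place of the ^ and $ anchors.

```python
import re

# One hhmm value: hours 00-23, minutes 00-59.
TIME = r"(?:[01][0-9]|2[0-3])[0-5][0-9]"
# A single time, optionally followed by more comma-separated times.
TIME_LIST = re.compile(rf"{TIME}(?:,[ \t]*{TIME})*")


def is_time_list(value):
    """True only if the whole string is a valid list of hhmm times."""
    return TIME_LIST.fullmatch(value) is not None


for sample in ["1245", "0056, 1034,2355", "1235, 4711", "0000,1111"]:
    print(sample, is_time_list(sample))
```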
This simpler regexp checks only the list format (it does not validate the hour and minute ranges):
/^(\d+)(, ?\d+)*$/

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but to use * without the ? and refine the character class instead. In this case, use [^?]* to match a sequence of non-question-mark characters.
You can also match more complex end markers this way. For instance, in this case your end marker is ?=, so you want to match non-question-marks, and question marks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
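To see the stricter end-marker pattern in action, here is a small Python sketch. The surrounding pieces (the =? prefix, the charset and encoding groups, and the non-capturing form of the inner group) are my assumptions added to make a complete expression; the second sample string with a ? inside the payload is invented for illustration.

```python
import re

ENCODED_WORD = re.compile(
    r"=\?([^?]+)"                 # charset: everything up to the next "?"
    r"\?(.)\?"                    # one-character encoding between "?" marks
    r"((?:[^?]*\?[^=])*[^?]*)"    # payload: "?" allowed unless followed by "="
    r"\?="                        # end marker
)

line = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="
print([m.group(3) for m in ENCODED_WORD.finditer(line)])

tricky = "=?utf-8?Q?why? yes?="
print(ENCODED_WORD.search(tricky).group(3))
```

No lazy quantifier is needed here: each part of the pattern can only consume characters that cannot start the end marker.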
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the next part of the regular expression can match. A '?=' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the next part of the regular expression can match. A '?=' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
    printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?', followed by a '?'. Repeated twice to skip the string 'utf-8?Q?'
([^?]*) # Capture the following characters up to the next '?'
/g # Repeat the match until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
    printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.
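The difference the extra ? makes can be reproduced in a few lines. This is a minimal Python sketch using the two expressions from this thread on the sample subject line; the variable names are arbitrary.

```python
import re

line = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="

# Original: the first group is greedy, so it runs to the LAST "?Q?" it can find.
greedy = re.findall(r"=\?(.+)\?.\?(.*?)\?=", line)
# Fixed: both groups are lazy, so each "=?...?=" pair is matched on its own.
lazy = re.findall(r"=\?(.*?)\?.\?(.*?)\?=", line)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

Making only the second group lazy is not enough, because the greedy first group has already swallowed the first pair before the second group is even tried.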

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really on one row, but I have inserted line breaks after each \r\n so that it's easier to read, so no checks for whitespace characters are needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and match the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Could this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
    \G
    (
        (?:
            [0-9]{2}-[0-9]{2}\r\n
        ){2}
        (?:
            (?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
        )*
    )
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
    print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
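The "two dates start a record" rule can also be written as one pattern with a negative lookahead. Below is a minimal Python sketch, assuming the input contains real CRLF characters (not the literal backslash sequences used in the Perl snippet above); split_records is a hypothetical name.

```python
import re

RECORD = re.compile(
    r"(?:\d{2}-\d{2}\r\n){2}"           # two date lines open a record
    r"(?:(?!\d{2}-\d{2}\r\n).*\r\n)*"   # then any lines that are not dates
)


def split_records(data):
    """Split data into records and each record into its columns."""
    return [m.group(0).splitlines() for m in RECORD.finditer(data)]


data = ("01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
        "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n")
for record in split_records(data):
    print("\t".join(record))
```

The lookahead is what keeps the next record's dates out of the current match: a line is consumed only if it does not itself look like a date line.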