Get a match when there are duplicate letters in a string - regex

I have a list of inputs in google sheets,
Input
Desired Output
"To demonstrate only not an input" The repeated letters
Outdoors
Match
o
dog
No Match
step
No Match
bee
Match
e
Chessboard
Match
s
Cookbooks
Match
o, k
How do I verify if all letters are unique in a string without splitting it?
In other words if the string has one letter or more occurred twice or more, return TRUE
My process so far
I tried this solution in addition to splitting the string and dividing the length of the string on the COUNTA of unique letters of the string, if = 1 "Match", else "No match"
Or using regex
I found a method to match a letter is occure in a string 2 times this demonstration with REGEXEXTRACT But wait what needed is get TRUE when the letters are not unique in the string
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string doesn't need to be splitted to aviod heavy calculation "I have long lists"
Avoid using lambda and its helper functions see why?
Its ok to return TRUE or FALSE instead of Match or No Match to keep it simple.
More examples
Input
Desired Output
Professionally
Match
Attractiveness
Match
Uncontrollably
Match
disreputably
No Match
Recommendation
Match
Interrogations
Match
Aggressiveness
Match
doublethinks
No Match

You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching (?i) modifier will help to cut down the options to just the 26 letters of the alphabet. I suppose the above can be written a bit neater like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
EDIT 1:
The only other reasonable way to do this (untill I learned about the PREG supported syntax of the matches clause in QUERY() by #DoubleUnary) with a single regex other than the above is to create your own UDF in GAS (AFAIK). It's going to be JavaScript based thus supporting a backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
if (s.map) {
return s.map(REGEXMATCH_JS);
} else {
return /([a-z]).*?\1/gi.test(s);
}
}
The pattern ([a-z]).*?\1 means:
([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.
The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
Respectively. Knowing there is a header in 1st row.

Benchmark:
Created a benchmark here.
Methodology:
Use NOW() to create a timestamp, when checkbox is clicked.
Use NOW() to create another timestamp, when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is a random data created from Math.random between [A-Za-z] with 10 characters per word.
Results:
Formula
Round1
Round2
Avg
% Slower than best
Sample size
10006
[re2](a.*a|b.*b)JvDv
0:00:19
0:00:19
0:00:19
-15.15%
[re2+recursion]MASTERMATCH_RE2
0:00:27
0:00:24
0:00:26
-54.55%
[Find+recursion]MASTERMATCH
0:00:17
0:00:16
0:00:17
0.00%
[PREG]Doubleunary
0:00:57
0:00:53
0:00:55
-233.33%
Conclusion:
This varies greatly based on browser/device/mobile app and on non-randomized sample data. But I found PREG to be consistently slower than re2
Use recursion.
This seems extremely faster than the regex based approach. Create a named function:
Name:
MASTERMATCH
Arguments(in this order):
word
The word to check
start
Starting at
Function:
=IF(
MID(word,start,1)="",
FALSE,
IF(
ISERROR(FIND(MID(word,start,1),word,start+1)),
MASTERMATCH(word,start+1),
TRUE
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
Explanation:
It recurses through each character using MID and checks whether the same character is available after this position using FIND. If so, returns true and doesn't check anymore. If not, keeps checking until the last character using recursion.
Or with regex,
Create a named function:
Name:
MASTERMATCH_RE2
Arguments(in this order):
word
The word to check
start
Starting at
Function:
IF(
MID(word,start,1)="",
FALSE,
IF(
REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
TRUE,
MASTERMATCH_RE2(word,start+1)
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
Explanation:
It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b,..., it takes the first character(using MID), eg: o in outdoor and creates a regex o.*o. If regex is positive for that regex (using REGEXMATCH), returns true and doesn't check for other letters or create other regexes.
Uses lambda, but it's efficient. Loop through each row and every character with MAP and REDUCE. REPLACE each character in the word and find the difference in length. If more than 1, don't check length anymore and return Match
=MAP(
A2:INDEX(A2:A,COUNTA(A2:A)),
LAMBDA(_,
REDUCE(
"No Match",
SEQUENCE(LEN(_)),
LAMBDA(a,c,
IF(a="Match",a,
IF(
LEN(_)-LEN(
REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
)>1,
"Match",a
)
)
)
)
)
)
If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.

As Daniel Cruz said, Google Sheets functions such as regexmatch(), regexextract() and regexreplace() use RE2 regexes that do not support backreferences. However, the query() function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input
Output
Professionally
TRUE
disreputably
FALSE
Abacus
TRUE
Élysée
TRUE
naïve Ï
TRUE
määräävä
TRUE
121
TRUE
123
FALSE
You can also easily state which specific characters to check by replacing <char>. with something like <char>[\wéäåö] or <char>[^-;,.\s\d].

try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
update
create a query heat map, filter it and vlookup back row position
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&REGEXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
case insensitive would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))

Just to illustrate another method - not likely to be scaleable - try to substitute the second occurrence of the letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)

You can give a try by using this RegEx : /(\w).*?\1/g in the REGEXMATCH function in google sheets.
Explanation :
(\w) - matches word characters (a-z, A-Z, 0-9, _), If you are sure that input will contain only alphabets then you can also use ([a-zA-Z]); then
.*? - zero or more characters (the ? denotes as optional that means it can match for consecutive as well as non-consecutive); until
\1 - it finds a repeat of the first matched character.
Live Demo : regex101

Coming after the battle ^^ Why not simply compare the number of unique letters in the string and its original length ?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split() provided by this answer)

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Regular expression repetition in Ruby [duplicate]

I'm reading the regular expressions reference and I'm thinking about ? and ?? characters. Could you explain me with some examples their usefulness? I don't understand them enough.
thank you
This is an excellent question, and it took me a while to see the point of the lazy ?? quantifier myself.
? - Optional (greedy) quantifier
The usefulness of ? is easy enough to understand. If you wanted to find both http and https, you could use a pattern like this:
https?
This pattern will match both inputs, because it makes the s optional.
?? - Optional (lazy) quantifier
?? is more subtle. It usually does the same thing ? does. It doesn't change the true/false result when you ask: "Does this input satisfy this regex?" Instead, it's relevant to the question: "Which part of this input matches this regex, and which parts belong in which groups?" If an input could satisfy the pattern in more than one way, the engine will decide how to group it based on ? vs. ?? (or * vs. *?, or + vs. +?).
Say you have a set of inputs that you want to validate and parse. Here's an (admittedly silly) example:
Input:
http123
https456
httpsomething
Expected result:
Pass/Fail Group 1 Group 2
Pass http 123
Pass https 456
Pass http something
You try the first thing that comes to mind, which is this:
^(http)([a-z\d]+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass http s456 No
Pass http something Yes
They all pass, but you can't use the second set of results because you only wanted 456 in Group 2.
Fine, let's try again. Let's say Group 2 can be letters or numbers, but not both:
(https?)([a-z]+|\d+)
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass https omething No
Now the second input is fine, but the third one is grouped wrong because ? is greedy by default (the + is too, but the ? came first). When deciding whether the s is part of https? or [a-z]+|\d+, if the result is a pass either way, the regex engine will always pick the one on the left. So Group 2 loses s because Group 1 sucked it up.
To fix this, you make one tiny change:
(https??)([a-z]+|\d+)$
Pass/Fail Group 1 Group 2 Grouped correctly?
Pass http 123 Yes
Pass https 456 Yes
Pass http something Yes
Essentially, this means: "Match https if you have to, but see if this still passes when Group 1 is just http." The engine realizes that the s could work as part of [a-z]+|\d+, so it prefers to put it into Group 2.
The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.
Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".
Here's an example sentence:
I own three cars.
Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:
cars??
This says, "look for the word car or cars; if you find either, return car and nothing more".
Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:
cars?
This says, "look for the word car or cars, and return either car or cars, whatever you find".
In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".
Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.
See it in Code
Here's the above implemented in Clojure as an example:
(re-find #"cars??" "I own three cars.")
;=> "car"
(re-find #"cars?" "I own three cars.")
;=> "cars"
The item re-find is a function that takes its first argument as a regular expression #"cars??" and returns the first match it finds in the second argument "I own three cars."
Some Other Uses of Question marks in regular expressions
Apart from what's explained in other answers, there are still 3 more uses of Question Marks in regular expressions.
 
 
Negative Lookahead
Negative lookaheads are used if you want to
match something not followed by something else. The negative
lookahead construct is the pair of parentheses, with the opening
parenthesis followed by a question mark and an exclamation point. x(?!x2)
example
Consider a word There
Now, by default, the RegEx e will find the third letter e in word There.
There
^
However if you don't want the e which is immediately followed by r, then you can use RegEx e(?!r). Now the result would be:
There
^
Positive Lookahead
Positive lookahead works just the same. q(?=u) matches a q that
is immediately followed by a u, without making the u part of the
match. The positive lookahead construct is a pair of parentheses,
with the opening parenthesis followed by a question mark and an
equals sign.
example
Consider a word getting
Now, by default, the RegEx t will find the third letter t in word getting.
getting
^
However if you want the t which is immediately followed by i, then you can use RegEx t(?=i). Now the result would be:
getting
^
Non-Capturing Groups
Whenever you place a Regular Expression in parenthesis(), they
create a numbered capturing group. It stores the part of the string
matched by the part of the regular expression inside the
parentheses.
If you do not need the group to capture its match, you can optimize
this regular expression into
(?:Value)
See also this and this.
? simply makes the previous item (character, character class, group) optional:
colou?r
matches "color" and "colour"
(swimming )?pool
matches "a pool" and "the swimming pool"
?? is the same, but it's also lazy, so the item will be excluded if at all possible. As those docs note, ?? is rare in practice. I have never used it.
Running the test harness from Oracle documentation with the reluctant quantifier of the "once or not at all" match X?? shows that it works as a guaranteed always-empty match.
$ java RegexTestHarness
Enter your regex: x?
Enter input string to search: xx
I found the text "x" starting at index 0 and ending at index 1.
I found the text "x" starting at index 1 and ending at index 2.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex: x??
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It seems identical to the empty matcher.
Enter your regex:
Enter input string to search: xx
I found the text "" starting at index 0 and ending at index 0.
I found the text "" starting at index 1 and ending at index 1.
I found the text "" starting at index 2 and ending at index 2.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: x??
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Regex for last n characters of String in PostgreSQL query

Regex checks wouldn't be a strong point of mine. This is trivial but after playing around with it for 15 minutes already I think it would be quicker posting here. Ultimately I want to filter out any results of a table where a certain text column value ends with S(01 -99), i.e. the letter S followed by 2 digits. Consider the following test query
select x.* from (
select
unnest(array['kjkjkj','jhjs01','kjkj11','kjhkjh','uusus','iiosis99']::text[])
as tests ) x
where RIGHT(x.tests,3) !~ 'S[0-9]{1,2}$'
This returns everything in the unnested array, whereas I'm hoping to return everything excluding the second and last values. Any pointers in the right direction would be much appreciated. I'm using PostgreSQL v11.9
You may actually use SIMILAR TO here since your pattern is not that complex:
SELECT * FROM table
WHERE column_name NOT SIMILAR TO '%S[0-9]{2}'
SIMILAR TO patterns require a full string match, so here, % matches any text from the start of the string, then S matches S and [0-9]{2} matches two digits that must be at the end of the string.
If you were to use a regex, you could use
WHERE column_name !~ 'S[0-9]{2}$'
Or, 'S[0-9]{1,2}$' if there can be one or two digits. Since the regex search in PostgreSQL does not require a full string match, it just matches S, two (or one or two with {1,2}) digits at the end of string ($).

Why do I get empty response for regexp_matches function while using positive lookahead (?=...)

Why the following code returns just empty brackets - {''}. How to make it return matching strings?
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=..CAA)','g');
Expected output is:
regexp_matches
----------------
{GCCAA}
{AACAA}
{AACAA}
{GTCAA}
(4 rows)
but instead it returns the following:
regexp_matches
----------------
{""}
{""}
{""}
{""}
(4 rows)
I actually have a bit more complicated query, which requires positive lookahead in order to cover all occurrences of patterns in the string even if they overlap.
Well, it's not pretty, but you can do it without regular expressions or custom functions.
WITH data(d) as (
SELECT * FROM (VALUES ('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT')) v
)
SELECT substr(d, x, 5) AS match
FROM data
JOIN LATERAL (SELECT generate_series(1, length(d))) g(x) ON TRUE
WHERE substr(d, x, 5) LIKE '__CAA'
;
match
-------
GCCAA
AACAA
AACAA
GTCAA
(4 rows)
Basically, get each five letter slice of the string and see if it matches __CAA.
You could change generate_series(1, length(d)) to generate_series(1, length(d)-4) because the last ones will never match, but you would have to remember to update this if the length of your matching string changes.
Using a lookahead has the problem that the lookahead itself is not part of the match but it allows overlapping searches
Without using a lookahead, you lose the ability for overlapping searches.
Using Powershell, you can loop over the indexes returned from the lookaheads and use that as an index into your searchstring to get the matches
$string = 'ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT'
$r = [regex]::new('(?=..CAA)')
$r.Matches($string) | % {$string.Substring($_.Index, 5)}
returns
GCCAA
AACAA
AACAA
GTCAA
I don't know how to translate this to PostgreSQL (or if that's even possible)
update:
Aparently it won't capture inside of an assertion, that's ok because
what you really need is the first 2 characters, which can safely be
consumed. It will only give you the first 2 characters per row, but
since you know the last 3, you can easily join the set elements
with the CAA constant.
Try this
..(?=CAA)
and you're done.
If I knew the bizarre sql language, I could show you how to do the join.
Output should now be
match
-------
GC
AA
AA
GT
(4 rows)
This is the regex you need for overlapped matches.
(?=(..CAA))
https://regex101.com/r/eJ36zb/1
I think you just need this sql statement which captures group 1:
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=(..CAA))','g');
Formatted regex
(?=
( . . CAA ) # (1)
)
The reason you got empty strings in your result is that
you didn't give the expression anything to consume and
nothing to capture.
I.e., it matched at the right places, but nothing was consumed or captured.
So, doing it this way allows the overlap and the capture so it
should show up on the output now.
Lookahead is a zero-width assertion. It doesn't match anything. If you change your regular expression to just a regular match/capture, you'll get a result. For matching any two characters that are followed by CAA in your case, lookahead probably isn't necessary.

How to use a REGEX pattern to remove a specific word "THE" only if at beginning of text string?

I have a text input field for titles of various things and to help minimize false negatives on search results(internal search is not the best), I need to have a REGEX pattern which looks at the first four characters of the input string and removes the word(and space after the word) _the _ if it is there at the beginning only.
For example if we are talking about the names of bands, and someone enters The Rolling Stones , what i need is for the entry to say only Rolling Stones
Can a regex be used to automatically strip these 4characters?
Applying the regex
^(?:\s*the\s*)?(.*)$
will match any string, and capture it in backreference no. 1, unless it starts with the (optionally surrounded by whitespace), in which case backref no. 1 will contain whatever follows.
You need to set the case-insensitive option in your regex engine for this to work.
You can use the ^ identifier to match a pattern at the beginning of a line, however for what you are using this for, it can be considered overkill.
A lot of languages support string manipulations, which is a more suitable choice. I can provide an example to demonstrate in Python,
>>> def func(n):
n = n[4:len(n)] if n[0:4] == "The " else n
return n
>>> func("The Rolling Stones")
'Rolling Stones'
>>> func("They Might Be Giants")
'They Might Be Giants'
As you don't clarify with language, here is a solution in Perl :
my $str = "The Rolling Stones";
$str =~ s/^the //i;
say $str; # Rolling Stones