I am developing a puzzle game for kids in which the player needs to select the correct word from a grid. I use a regex to match the word.
For example, I used ([D|E|C|K]){4} to match DECK, because the player should be able to select the letters in any order, not only D->E->C->K. The player may select KDEC, EDCK, KCED, or any other order.
The problem is that this pattern also matches EEEE, DDDD, DKDK, etc.: any combination of 4 characters from the set.
Any idea how I can modify the regex to get the desired outcome?
Thanks in advance.
Basically, this is not a good job for a regex: to accept any permutation of the given letters, the pattern has to track which letters were already used, and plain regular expressions have no compact way to do that. It is simpler to split the input string into characters, sort them, and rejoin them into a string, do the same with the search string, and then compare the results.
See a JavaScript demo with the word TALL:
const strings = ['TALL','LATL','TLAL','TTAL','AATT','ATL','STL'];
const search = 'TALL';
// Sorted letters of the search word, used as the comparison key
const compare_with = search.split("").sort().join("");
for (let s of strings) {
  console.log(s, ':', s.split("").sort().join("") === compare_with);
}
Can we do it with a regex? In .NET, you may use the balancing group construct, and that is a real solution, not a workaround.
Scenario 1: .NET regex engine specific solution
Assuming your search word is TALL, you may build a regex like
^(?:(T)|(A)|(L)|(L)){4}$(?<-1>)(?<-2>)(?<-3>)(?<-4>)
See the regex demo.
Details
^ - start of string
(?:(T)|(A)|(L)|(L)){4} - a non-capturing group that matches 4 occurrences of
(T) - T pushed on to the Group 1 capture stack
|(A) - or A pushed on to the Group 2 capture stack
|(L) - or L pushed on to the Group 3 capture stack
|(L) - or L pushed on to the Group 4 capture stack
$ - end of string
(?<-1>)(?<-2>)(?<-3>)(?<-4>) - pop a value from each of the four capture stacks. A pop fails when its stack is empty, so there is a match only if every group captured something; since exactly 4 characters were consumed, each group must have captured exactly once.
Scenario 2: Lookahead-based workaround in case all characters are unique
You may match and capture each letter from the range into a separate capturing group and add a negative lookahead before each subsequent capturing group to avoid matching a letter matched before it.
The regex will look like
^([DECK])(?!\1)([DECK])(?!\1|\2)([DECK])(?!\1|\2|\3)([DECK])$
See the regex demo
Details
^ - start of string
([DECK]) - Group 1: a letter, D, E, C or K
(?!\1) - the next char cannot be the one captured into Group 1
([DECK]) - Group 2: a letter, D, E, C or K
(?!\1|\2)([DECK]) - the next letter cannot be equal to the first and second one
(?!\1|\2|\3)([DECK]) - the next letter cannot be equal to the first, second and third one
$ - end of string
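The same pattern can be generated for any word with distinct letters. Below is a minimal Python sketch, assuming Python's re module is acceptable for the check; the helper name unique_permutation_pattern is made up for illustration and is not part of any library.

```python
import re

def unique_permutation_pattern(word: str) -> str:
    """Build the lookahead-based pattern for a word whose letters are all distinct."""
    if len(set(word)) != len(word):
        raise ValueError("this construction assumes all letters are distinct")
    char_class = "[" + re.escape(word) + "]"
    parts = []
    for i in range(len(word)):
        if i > 0:
            # Forbid any letter already captured by groups 1..i
            backrefs = "|".join("\\" + str(n) for n in range(1, i + 1))
            parts.append("(?!" + backrefs + ")")
        parts.append("(" + char_class + ")")
    return "^" + "".join(parts) + "$"

deck_pattern = unique_permutation_pattern("DECK")
print(deck_pattern)

for candidate in ["KDEC", "EDCK", "KCED", "EEEE", "DKDK", "DEC"]:
    print(candidate, ":", bool(re.match(deck_pattern, candidate)))
```

For DECK the helper reproduces the pattern shown above; the first three candidates match and the last three do not.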
I have strings such as 010xxx, 011xxx, 110xxx, 111xxx, Q10xxx, Q11xxx in a field, along with other values that are not similar, for example XyzABC.
I have two regex patterns that separately give results that are good: [1Q]_[0-9]% and 0_[1-9]%
In words, return true if
the first character is 1 or Q and the third character is 0-9
OR
the first character is 0 and the third character is 1-9
How do I create a search pattern that does the OR, using either SIMILAR TO or a regex?
One version that works by itself is:
SELECT field FROM db WHERE field SIMILAR TO '[1Q]_[0-9]%'
I am not wedded to SIMILAR TO or regex. They were just what I could get working until I tried to OR them. I am open to other suggestions.
You can use a SIMILAR TO pattern like
WHERE field SIMILAR TO '([1Q]_[0-9]|0_[1-9])%'
The SIMILAR TO pattern requires a full string match, so the pattern means: start with 1 or Q, then any char, then any digit, or start with 0, any char and a non-zero digit, and then there can be any 0 or more chars afterwards.
You can also use a regex like
WHERE field ~ '^(?:[1Q].[0-9]|0.[1-9])'
See the regex demo
Details:
^ - start of string
(?: - start of a non-capturing group:
[1Q].[0-9] - 1 or Q, any char and a digit
| - or
0.[1-9] - 0, any char and a non-zero digit
) - end of a non-capturing group.
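One way to sanity-check the alternation outside the database is to run the same regex in Python. This is only a sketch: Python's re is not the database's regex engine, although this particular pattern uses only basic syntax. The sample values are the ones from the question.

```python
import re

# Same regex as in the WHERE clause above
pattern = re.compile(r'^(?:[1Q].[0-9]|0.[1-9])')

values = ['010xxx', '011xxx', '110xxx', '111xxx', 'Q10xxx', 'Q11xxx', 'XyzABC']
matched = [v for v in values if pattern.search(v)]
print(matched)
```

Note that 010xxx is rejected, which follows from the rule as stated in the question: when the first character is 0, the third must be 1-9.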
I am trying to use the result of the capture group to perform a look behind for a specific answer.
Sample of Text:
10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B
Current Output:
https://regex101.com/r/bLKmYX/1
It is currently outputting FALSE and B as the answers to these questions.
Expected Output
I would like it to output FALSE and B) reinvest profits to build successively better electric automobiles
Current Regex Expression
'^\d+\)\s*([\s\S]*?)\nAnswer:\s*(.*)'
How can I use the result of the second capture group, (B), to perform a lookbehind and get the whole answer?
What you ask for is not possible due to the fact that a captured value can only be checked after it was obtained.
You may try another logic: capture the answer letter and then match the same letter after Answer: substring using the backreference to the group value.
You may consider a pattern like
(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)\nAnswer:\s*(\3|FALSE)
See the regex demo.
It has 4 capturing groups now, the first one containing the whole question body, then the second one containing the answer line you need, the third one is auxiliary (it is used to check which answer is correct), and the fourth one is the answer value.
Details
(?m) - ^ now matches line start positions and $ matches line end positions
^ - start of a line
\d+ - 1+ digits
\) - a ) char
\s* - 0+ whitespaces
((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?) - Group 1:
(?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)? - an optional non-capturing group matching
(?:(?!^\d+\))[\s\S])*? - any char, 0 or more occurrences but as few as possible, that does not begin the next question number (a line start followed by 1+ digits and a ) char)
\n - a newline
(([A-Z])\).*) - Group 2: an ASCII uppercase letter captured into Group 3, then ) char and then the rest of the line (.*)
$ - end of line
[\s\S]*? - any 0+ chars as few as possible
\nAnswer: - a new line, Answer: string
\s* - 0+ whitespaces
(\3|FALSE) - Group 4: Group 3 value or FALSE.
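Here is a minimal Python sketch of how the groups could be consumed, assuming the input looks like the sample in the question. Taking Group 2 when it is present and falling back to Group 4 otherwise is my own choice for illustration.

```python
import re

pattern = re.compile(
    r'(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)'
    r'\nAnswer:\s*(\3|FALSE)'
)

text = """10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B"""

# Group 2 holds the full answer line for multiple-choice questions;
# it is None for true/false questions, so fall back to Group 4.
answers = [m.group(2) or m.group(4) for m in pattern.finditer(text)]
for answer in answers:
    print(answer)
```

With the sample text this prints FALSE and then the full B) line.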
I am facing a regex problem: I need to remove hyphenated repetitions at the beginning of a composed word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word in a text, I want to remove the hyphenated parts that precede the main word (wonder) when they are a partial or total repetition of it (wo-wo-wo, but also wonder-wonder-wonder).
At the same time, composed words like bi-linear or pre-trained MUST NOT be replaced, because in this case the hyphenated prefix (pre) is not part of the main word (trained).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position ] and apparently it can be a good solution.
But my problem is quite different because I don't want to impose constraints about the length of hyphenation, and at the same time I want to check that hyphen is part of the main word.
This is the regex I am currently using but, as explained, it doesn't fully solve my problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)
Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a whitespace boundary (if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
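A short runnable check against the examples from the question, as a sketch only; the helper name collapse_stutter is made up for illustration.

```python
import re

pattern = re.compile(r'(?<!\S)(\w+)(?:-\1)*-(\1)')

def collapse_stutter(s: str) -> str:
    # Replace the repeated prefix chain with a single copy (Group 2)
    return pattern.sub(r'\2', s)

samples = ['wo-wo-wo-wonder', 'hi-hi-hi-hi', 'wo-wo-wo',
           'f-f-f-fight', 'bi-linear', 'pre-trained']
results = [collapse_stutter(s) for s in samples]
for before, after in zip(samples, results):
    print(before, '->', after)
```

bi-linear and pre-trained come back unchanged because the text after the hyphen does not start with the part before it.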
I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but the result keeps one digit at the end that should not be there.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two digits in the last group, so that the greedy middle group cannot swallow one of them:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
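The same pattern in Python, as a quick sketch, assuming the two trailing characters are always digits as in the samples:

```python
import re

pattern = re.compile(r'([A-Z]+)([0-9a-z]+)([0-9]{2})')

# Group 2 is the part between the leading letters and the last two digits
middles = {s: pattern.search(s).group(2) for s in ['YM10a15b5c27', 'YM1b5c17']}
print(middles)
```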
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^ the regex could also match the aaaa at the end.
(?:([a-z]+)) captures one or more letters from 'a' to 'z' into Group 1 (the outer non-capturing group is redundant)
(?=\1) is a positive lookahead that requires the matched text to be followed by the same code that was captured at the start.
All you have to do is extract the value of Group 2, e.g. with group(2)
An example is shown here.
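A minimal Python sketch of this lookahead approach, using the aaaa1234aaaa string mentioned above:

```python
import re

pattern = re.compile(r'^(?:([a-z]+))([0-9a-z]+)(?=\1)')

m = pattern.search('aaaa1234aaaa')
code = m.group(1)   # the leading code
inner = m.group(2)  # the text between the two copies of the code
print(code, inner)
```

The lookahead only checks that the code follows Group 2; it does not consume it and does not require the string to end there.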
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter (captured into Group 2), exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
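A short Python sketch of the anchored variant; the test strings are made-up examples:

```python
import re

pattern = re.compile(r'^(([a-z])\2)([0-9a-z]+)(\1)$')

def inner_text(s: str):
    # Return Group 3 (the text between the doubled-letter markers), or None
    m = pattern.search(s)
    return m.group(3) if m else None

print(inner_text('aa1234aa'))
print(inner_text('bb9x9bb'))
print(inner_text('aa1234bb'))
```

The last call returns None because the trailing bb does not repeat the leading aa.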
Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two to four digits, then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, it is never followed by a digit. You can write a negative lookahead to express this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know how R will handle it.
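For what it's worth, the pattern also behaves as intended in Python's re module on the string from the question (a sketch only; it says nothing about R):

```python
import re

pattern = re.compile(r'\w\d{2,4}(?:\w(?!\d))?')

codes = pattern.findall('F328AG560F33')
print(codes)
```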
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
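A rough Python equivalent of the R demo, as a sketch: re.findall returns the Group 1 value for each position where the lookahead succeeds.

```python
import re

pattern = re.compile(r'(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))')

# With a single capturing group, findall returns the captured text
codes = pattern.findall('F328AG560F33')
print(codes)
```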