Regular Expression for checking subword between capture groups - regex

I am trying to use a regex to remove stuttered, hyphenated repetitions at the beginning of a compound word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word in a text, I want to replace words in which the main word (wonder) is preceded by a partial or total repetition of itself (wo-wo-wo, but also wonder-wonder-wonder).
At the same time, compound words like bi-linear or pre-trained MUST NOT be replaced, because in these cases the hyphenated part (pre) is not part of the main word (trained).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position ] and it looks like a good starting point.
But my problem is different: I don't want to constrain the length of the hyphenated part, and I also want to check that the hyphenated part is the beginning of the main word.
This is the regex I am currently using, but as explained, it doesn't fully solve my problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)

Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a whitespace boundary (if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
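As a quick check, the pattern above can be run against the examples from the question. This is a minimal sketch; the function name `remove_stutter` and the sample list are illustrative and not part of the original answer.

```python
import re

# Group 1 is a run of word chars; (?:-\1)* allows further repetitions of it;
# -(\1) requires the text after the last hyphen to start with the same run.
STUTTER = re.compile(r'(?<!\S)(\w+)(?:-\1)*-(\1)')

def remove_stutter(text):
    """Collapse a stuttered prefix such as 'wo-wo-wo-' into the word it repeats."""
    return STUTTER.sub(r'\2', text)

samples = ["wo-wo-wo-wonder", "hi-hi-hi-hi", "wo-wo-wo",
           "f-f-f-fight", "pre-trained", "bi-linear"]
for sample in samples:
    print(sample, "->", remove_stutter(sample))
```

Words such as pre-trained stay untouched because the part after the hyphen does not start with the part before it.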

Related

Regex to replace up to 4 digits before a word

I am using a Chrome extension called Word Replacer II, and I'm trying to create a regex find and replace.
Quick backstory: my partner is recovering from an eating disorder and I want to find all mentions of Kilojoules and kJs and replace them with placeholder text.
I am entirely new to regex and after a few hours, I'm not much closer to getting a working expression.
I need it to remove up to 4 digits before the letters "kJs", e.g. 400kJs and 1000kJs. I'd like "400kJs and 1000kJs" to be replaced with "[removed kJs] and [removed kJs]".
The code I have put together so far is:
\s+(a{1,4}<=\d)\s+(?=kJ)
Any help would be much appreciated!
You may use the following approach:
\d{1,4}\s*kJs\b
See the regex demo
If you need to keep kJs, you may wrap the right part of the pattern with a lookahead, \d{1,4}(?=\s*kJs\b).
If you do not want to touch numbers with 5 or more digits, use
\b\d{1,4}\s*kJs\b
or
(?<!\d)\d{1,4}\s*kJs\b
That is, add a word boundary, \b, or a left-hand digit boundary, (?<!\d).
Pattern details
\d{1,4} - one to four digits
\s* - 0+ whitespaces
kJs - the literal string kJs
\b - a word boundary (may not be necessary if there can be no word starting with kJs).
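The extension itself runs in the browser, but the behaviour of the pattern can be checked in any regex engine. Below is a minimal Python sketch using the digit-boundary variant; the function name `redact_kj` is made up for illustration, and the placeholder text is taken from the question.

```python
import re

# (?<!\d) keeps the pattern from matching the tail of a longer number.
KJ = re.compile(r'(?<!\d)\d{1,4}\s*kJs\b')

def redact_kj(text, placeholder="[removed kJs]"):
    """Replace up to four digits followed by 'kJs' with a placeholder."""
    return KJ.sub(placeholder, text)

print(redact_kj("400kJs and 1000kJs"))
print(redact_kj("12345kJs"))  # left alone: five digits
```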

Regex pattern to match letter combination of a word

I am currently developing a puzzle game for kids where the player needs to select the correct word from a grid. I use a regex to match the word.
For example, I used ([D|E|C|K]){4} to match DECK, because the player should be able to select the letters in any order, not only D->E->C->K. The player may select KDEC, EDCK, KCED or any other order.
But here I am facing an issue: this pattern also matches EEEE, DDDD, DKDK, etc., that is, any combination of 4 chars from the set.
Any idea how I can modify the regex to get my desired outcome?
Thanks in advance.
Basically, this is not a good job for a regex: plain regular expressions have no concise way to say "these letters, each used once, in any order". You'd better follow a simple algorithm: split the input string into characters, sort them, and rejoin them into a string, do the same with the search string, then compare the results.
See a JavaScript demo with the word TALL:
const strings = ['TALL','LATL','TLAL','TTAL','AATT','ATL','STL'];
const search = 'TALL';
// Sorting the letters gives every permutation of a word the same canonical form
const compare_with = search.split("").sort().join("");
for (const s of strings) {
  console.log(s, ':', s.split("").sort().join("") === compare_with);
}
Can we do it with a regex? In .NET, you may use a balancing construct, and that is a real solution, not a workaround.
Scenario 1: .NET regex engine specific solution
Assuming your search word is TALL, you may build a regex like
^(?:(T)|(A)|(L)|(L)){4}$(?<-1>)(?<-2>)(?<-3>)(?<-4>)
See the regex demo.
Details
^ - start of string
(?:(T)|(A)|(L)|(L)){4} - a non-capturing group that matches 4 occurrences of
(T) - T pushed on to the Group 1 capture stack
|(A) - or A pushed on to the Group 2 capture stack
|(L) - or L pushed on to the Group 3 capture stack
|(L) - or L pushed on to the Group 4 capture stack
$ - end of string
(?<-1>)(?<-2>)(?<-3>)(?<-4>) - Pop a value from each of the capturing group stacks. A pop fails if that stack is empty, so with four characters and four groups there is a match only when each group captured exactly once.
Scenario 2: Lookahead-based workaround in case all characters are unique
You may match and capture each letter from the set into a separate capturing group and add a negative lookahead before each subsequent capturing group to avoid matching a letter that was already matched.
The regex will look like
^([DECK])(?!\1)([DECK])(?!\1|\2)([DECK])(?!\1|\2|\3)([DECK])$
See the regex demo
Details
^ - start of string
([DECK]) - Group 1: a letter, D, E, C or K
(?!\1) - the next char cannot be the one captured into Group 1
([DECK]) - Group 2: a letter, D, E, C or K
(?!\1|\2)([DECK]) - Group 3: a letter that cannot be equal to the first or the second one
(?!\1|\2|\3)([DECK]) - Group 4: a letter that cannot be equal to the first, second or third one
$ - end of string
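The lookahead pattern from Scenario 2 can be generated for any word with unique letters. The sketch below is one way to do that in Python; the helper name `permutation_regex` is hypothetical, and the function deliberately refuses words with repeated letters, since the lookahead trick does not cover them.

```python
import re

def permutation_regex(word):
    """Build ^(c)(?!\\1)(c)(?!\\1|\\2)(c)...$ for a word with unique letters."""
    if len(set(word)) != len(word):
        raise ValueError("this sketch only handles words without repeated letters")
    char_class = "[" + re.escape(word) + "]"
    parts = []
    for i in range(len(word)):
        if i:
            # Forbid every letter already captured by groups 1..i
            backrefs = "|".join("\\" + str(n) for n in range(1, i + 1))
            parts.append("(?!" + backrefs + ")")
        parts.append("(" + char_class + ")")
    return re.compile("^" + "".join(parts) + "$")

deck = permutation_regex("DECK")
for candidate in ["DECK", "KDEC", "EDCK", "EEEE", "DKDK"]:
    print(candidate, bool(deck.match(candidate)))
```

For words with repeated letters, such as TALL, comparing the sorted letters remains the simpler option.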

How to exclude a word from regex subpattern?

I am using Delphi 7 and TDIPerlRegEx. I am looking for verbs in parts of a sentence that contain a specific sequence identifying the verb.
s1 := '(I|you|he|she|it|we|they|this|that|these|those)';
s2 := '(can|should|would|could|must|want to|have to|had to|might)';
RegEx_Seek_1.MatchPattern := '(*UCP)(?m) \b'+s1+'\b \b'+s2+'\b \K([^ß\W]\w{2,15})\b';
The word that is wrongly included in the result is "not"; it should be excluded:
Sample text:
... that you should not ßeat of every ...
A verb like this should be included in the result:
Sample text:
lest he should put forth his hand ...
Now let me explain the ß sign. It marks that the original text had the word "not" followed by the verb. I changed the text in a previous session, so the source text I am working with now is as shown above. The pattern ([^ß\W]\w{2,15}) should skip a word that is used in a negative sense. This is also why the "negative" verb must not be included.
So the point of the question is how to exclude the word "not" from the captured text, that is, from the text captured by this pattern, which is either ([^ß\W]\w{2,15}) or (\W{3,15}).
I am using this pattern to replace substrings in text.
More sample text needed?
than I can bear. And
so I might have taken her
they might dwell together
they could not ßdwell together
lest you should say,
In group 3 I expect matches for bear, taken (or possibly have instead of taken), dwell and say.
I am trying to exclude the word not, so any verb or word following not must be excluded from the 3rd group or from the match completely. I am interested in group 3 only. Groups 1 and 2 just specify the alternatives preceding the verb.
You may use a branch reset group to match an empty string if "not" appears as a whole word after a modal verb, or a notional verb otherwise:
\b(I|you|he|she|it|we|they|this|that|these|those)\s+(can|should|would|could|must|want to|have to|had to|might)\s+\K(?|(?=not\b)()|([^ß\W]\w{2,15})\b)
See the regex demo
Details
\b - a word boundary
(I|you|he|she|it|we|they|this|that|these|those) - one of the pronouns in the group 1
\s+ - 1+ whitespaces (it is already acting as a word boundary on both sides of the adjacent groups)
(can|should|would|could|must|want to|have to|had to|might) - one of the modal verbs
\s+ - 1+ whitespaces
\K - match reset operator
(?|(?=not\b)()|([^ß\W]\w{2,15})\b) - the branch reset group matching either
(?=not\b)() - if "not" appears as a whole word immediately to the right, capture an empty string into Group 3
| - or (here, else)
([^ß\W]\w{2,15})\b - match and capture into Group 3 any word char other than ß and then 2 to 15 word chars with a word boundary to follow.
Note that (?m) - PCRE_MULTILINE - is only necessary if you want ^ and $ outside of character classes to match the start and end of lines rather than of the whole string. Since your pattern has no such anchors, (?m) is redundant.
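Python's standard `re` module supports neither \K nor branch reset groups, so the sketch below swaps in a different technique: a negative lookahead, (?!not\b), that drops the whole match when "not" follows the modal verb. The question allows for that outcome ("excluded from the 3rd group or from the match completely"). The names `VERB_AFTER_MODAL` and `verbs_after_modals` are made up for illustration.

```python
import re

PRONOUNS = r"(I|you|he|she|it|we|they|this|that|these|those)"
MODALS = r"(can|should|would|could|must|want to|have to|had to|might)"

# (?!not\b) rejects the match when the next word is "not";
# [^\Wß] is any word character other than ß.
VERB_AFTER_MODAL = re.compile(
    r"\b" + PRONOUNS + r"\s+" + MODALS + r"\s+(?!not\b)([^\Wß]\w{2,15})\b"
)

def verbs_after_modals(text):
    """Return the Group 3 values (the word right after pronoun + modal)."""
    return [m.group(3) for m in VERB_AFTER_MODAL.finditer(text)]

samples = [
    "than I can bear. And",
    "so I might have taken her",
    "they might dwell together",
    "they could not ßdwell together",
    "lest you should say,",
]
for line in samples:
    print(repr(line), verbs_after_modals(line))
```

Unlike the PCRE pattern with \K, this version consumes the pronoun and the modal verb as part of the overall match, so a replacement has to work with Group 3 rather than with the whole match.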

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know how R will handle it.
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
See the regex demo
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"
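For readers outside R, here is a minimal Python sketch that applies both patterns discussed above to the example string. The function names are illustrative, and the first pattern uses [A-Z] in place of \w as an assumption about what counts as a letter.

```python
import re

def extract_codes_consuming(text):
    # Letter, 2-4 digits, then an optional letter that is not followed by a digit
    return re.findall(r"[A-Z]\d{2,4}(?:[A-Z](?!\d))?", text)

def extract_codes_lookahead(text):
    # Zero-width lookahead with a capturing group; findall returns Group 1
    return re.findall(r"(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))", text)

mystring = "F328AG560F33"
print(extract_codes_consuming(mystring))
print(extract_codes_lookahead(mystring))
```

Both calls return the three codes F328A, G560 and F33 for this input.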

Regex for finding words with no or only one word between them

I need to find, in multiple strings, two words with no words or only one word between them. I created a regex that checks whether those two words exist in a string:
^(?=[\s\S]*\bFirst\b)(?=[\s\S]*\bSecond\b)[\s\S]+
and it works correctly.
Then I tried to add an optional word to this regex:
^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+
but it didn't work. It also selects text with two or more words between the searched words, which is not what I need.
First Second - must be selected
First word1 Second - must be selected
First word1 word2 Second - must not be selected, but my regex selects it.
Can I get advice on how to solve this problem?
Root cause
You should bear in mind that lookarounds match strings without moving along the string: they "stand their ground". Once you write ^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b), the execution is as follows:
^ - the regex engine checks if the current position is the start of string
(?=[\s\S]*\bFirst\b) - the positive lookahead requires the presence of any 0+ chars followed with a whole word First - note that the regex index is still at the start of the string after the lookahead returns true or false
(\b\w+\b){0,1} - this subpattern is checked only if the above check was true (i.e. there is a whole word First somewhere) and matches (consumes, moves the regex index) 1 or 0 occurrences of a whole word (i.e. there must be 1 or more word chars right at the string start)
(?=[\s\S]*\bSecond\b) - another positive lookahead that makes sure there is a whole word Second somewhere after the first whole word consumed with \b\w+\b, if any. Even if the word Second is the first word in the string, this will return true, since backtracking will give up the word matched with (\b\w+\b){0,1} (see, it is optional), Second will get asserted, and [\s\S]+ will grab the whole string (Group 1 will be empty). See the regex demo with the Second word word2 First string.
So, your approach cannot guarantee the order of First and Second in the string, they are just required to be present but not necessarily in the order you expect.
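The behaviour described above can be reproduced with a short Python sketch. The pattern is the one from the question; the name `ORIGINAL` and the sample strings are illustrative.

```python
import re

# The pattern from the question: two independent lookaheads plus an optional word
ORIGINAL = re.compile(
    r"^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+"
)

samples = [
    "First Second",
    "First word1 word2 Second",   # two words in between, still matches
    "Second word word2 First",    # wrong order, still matches
]
for s in samples:
    m = ORIGINAL.search(s)
    print(repr(s), bool(m), m.group(1) if m else None)
```

In the last sample Group 1 does not participate in the match, which Python reports as None.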
Solution
If you need to check the order of First and Second in the string, you need to combine all the checks into a single lookahead. This approach may turn out to be very inefficient with longer strings and multiple alternatives in the lookaround, so consider either unrolling the patterns or trying multiple regex patterns (like this pseudo-code: if /\bFirst\b/.finds_match().index < /\bSecond\b/.finds_match().index => Good, go on...).
If you plan to go on with the regex approach, you may match a string that contains First....Second only in this order:
^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+
See the regex demo
Details:
^ - start of string
(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b) - there must be:
[\s\S]* - any zero or more chars, as many as possible
\bFirst - whole word First
(?:\W+\w+)? - optional sequence (1 or 0 occurrences) of 1+ non-word chars and 1+ word chars
\W+ - 1+ non-word chars
Second\b - Second as a whole word
[\s\S]+ - any 1 or more characters (empty string won't match).
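A minimal Python sketch of the single-lookahead pattern, checked against the three cases from the question; the function name `has_close_pair` is hypothetical.

```python
import re

# First, then at most one word, then Second, all inside one lookahead
CLOSE_PAIR = re.compile(r"^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+")

def has_close_pair(text):
    """True if First is followed by Second with zero or one word in between."""
    return CLOSE_PAIR.search(text) is not None

for s in ["First Second", "First word1 Second", "First word1 word2 Second"]:
    print(repr(s), has_close_pair(s))
```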