Regex to replace up to 4 digits before a word - regex

I am using this extension for chrome (It's called Word Replacer II) and I'm trying to create a Regex find and replace.
Quick backstory: my partner is recovering from an eating disorder, and I want to find all mentions of Kilojoules and kJs and replace them with placeholder text.
I am entirely new to Regex and after a few hours, I'm not much closer to getting a working expression.
I need it to remove up to 4 digits before the letters "kJs", e.g., 400kJs and 1000kJs. I'd like "400kJs and 1000kJs" to be replaced with "[removed kJs] and [removed kJs]".
The code I have put together so far is:
\s+(a{1,4}<=\d)\s+(?=kJ)
Any help would be much appreciated!

You may use the following approach:
\d{1,4}\s*kJs\b
See the regex demo
If you need to keep kJs, you may wrap the right part of the pattern with a lookahead, \d{1,4}(?=\s*kJs\b).
If you do not want to touch numbers with 5 or more digits, use either of the following:
\b\d{1,4}\s*kJs\b
(?<!\d)\d{1,4}\s*kJs\b
That is, add a word boundary, \b, or a left-hand digit boundary, (?<!\d).
Pattern details
\d{1,4} - one to four digits
\s* - 0+ whitespaces
kJs - the literal string kJs
\b - a word boundary (may not be necessary if there can be no word starting with kJs).
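As a minimal sketch, the digit-boundary variant can be tried outside the extension. Python is used here only for illustration (the extension itself runs in the browser), and the names KJ_PATTERN and remove_kjs are made up for this example:

```python
import re

# 1-4 digits not preceded by another digit, optional whitespace,
# then "kJs" ending at a word boundary.
KJ_PATTERN = re.compile(r"(?<!\d)\d{1,4}\s*kJs\b")


def remove_kjs(text):
    """Replace every 1-4 digit kJs amount with a placeholder."""
    return KJ_PATTERN.sub("[removed kJs]", text)


print(remove_kjs("400kJs and 1000kJs"))       # [removed kJs] and [removed kJs]
print(remove_kjs("about 850 kJs per serve"))  # about [removed kJs] per serve
print(remove_kjs("12345kJs"))                 # 12345kJs (5 digits, left untouched)
```

Without the (?<!\d) part, the last input would become 1[removed kJs], because the pattern would be free to start matching at the second digit.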

Related

Why my Regex is only giving me ONE group back?

I'm currently having issues with a regex that I'm creating. The regex has to extract all the groups of the form number #### between Hello and Regards. At the moment my regex extracts only one group, and I need all the groups inside. In this case I have 2, but there may be more.
I'm using the web page https://regex101.com/
Flavor: PCRE (PHP)
Regex: Hello\s.*(number\s*[\d]*)\s.*Regards
Text:
This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150
Result:
My result is only the group number 5152, but there is another group inside, number 1542.
You may use
(?si)(?:\G(?!\A)|\bHello\b)(?:(?!\bHello\b).)*?\K\bnumber\s*\d+(?=.*?\bRegards\b)
See the regex demo.
Details
(?si) - inline modifiers: s (DOTALL) makes . match any char, including line breaks, and i makes the pattern case insensitive
(?:\G(?!\A)|\bHello\b) - either the end of the previous match (\G(?!\A)) or (|) a whole word Hello (\bHello\b)
(?:(?!\bHello\b).)*? - any char, 0 or more times but as few as possible, that does not start a whole word Hello char sequence
\K - match reset operator that discards all text matched so far
\bnumber - a whole word number
\s* - 0+ whitespaces
\d+ - 1+ digits
(?=.*?\bRegards\b) - there must be a whole word Regards somewhere after any 0+ chars (as few as possible).
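The \G and \K operators used above are PCRE features and are not available in Python's standard re module. As a rough sketch of the same idea in Python, a different, two-step technique can be used: first isolate each Hello ... Regards block, then collect every number inside it. The helper name numbers_between is invented for this example, and the behavior is only an approximation of the single PCRE pattern:

```python
import re

text = """This is my test text number 25120
Hello my name is testing
I'm 20 years old
Please help me with the regex number 1542
I have been trying to create the regex many times this is my number 5152
Regards
I'm still trying my attempt number 5150"""


def numbers_between(text):
    """Collect every 'number ####' found between Hello and Regards."""
    found = []
    # Step 1: grab each block lazily, letting . span line breaks (re.S).
    for block in re.finditer(r"\bHello\b(.*?)\bRegards\b", text, re.S | re.I):
        # Step 2: pull all the numbers out of that block only.
        found.extend(re.findall(r"\bnumber\s*\d+", block.group(1), re.I))
    return found


print(numbers_between(text))  # ['number 1542', 'number 5152']
```

The original pattern returned one group because a capturing group inside a single match keeps only its last capture; splitting the work into two passes sidesteps that.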

.net Regex to look ahead and eliminate strings in advance that dont contain certain characters

I am using the .NET flavor of regex.
Suppose I have a string 123456789AB
and I want to match AB (could be any two capital letters) only if the numeric part of the string (123456789) has 5 and 8 in it.
So what I came up with was
(?=5)(?=8)([A-Z]{2})
But this is not working.
After some trial and error on RegexStorm, I got to
(?=(.*5))(?=(.*8))[A-Z]{2}
What I am expecting is that it will start matching from the start of the string, as a lookahead does not consume any characters.
But the part "[A-Z]{2}" does not move ahead to match AB in the input string.
My question is: why is that so?
I know replacing it with .*[A-Z]{2} will make it move ahead, but then the match contains the entire string.
What is the solution in this case, other than putting the letter part ([A-Z]{2}) in a separate group and then reading only that group?
Lookaheads check for a pattern match immediately to the right of the current position in the string. (?=(.*5))(?=(.*8)) matches a location that is immediately followed by any 0 or more chars other than line break chars, as many as possible, and then 5; then, at the same position, a similar check is performed that requires 8 after any zero or more chars, as many as possible. In 123456789AB, both checks succeed only at locations up to the 5, and [A-Z]{2} cannot match there because the next two chars are digits. At the location right before AB there is no 5 or 8 to the right, so the lookaheads fail.
You may use as many lookbehinds as there are required substrings before the two letters:
(?s)(?<=5.*?)(?<=8.*?)[A-Z]{2}
See the regex demo
Details
(?s) - makes the . match newline characters, too
(?<=5.*?) - a location that is immediately preceded with 5 and then 0 or more chars as few as possible
(?<=8.*?) - a location that is immediately preceded with 8 and then 0 or more chars as few as possible
[A-Z]{2} - two ASCII uppercase letters.
An alternative would be to "unfold" what you expect to match using exclusionary character classes and alternation of match order. Not pretty, but pretty fast:
(?<=\b[^58]*?(?:5[^8]*8|8[^5]*5)[^A-Z]*?)[A-Z]{2}
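To see the lookahead behavior in isolation, here is a small sketch in Python rather than .NET. Python's re module rejects variable-width lookbehinds such as (?<=5.*?), so the lookbehind solutions above cannot be reproduced there; the second pattern below falls back to the capture-group approach the question wanted to avoid. The function name letters_if_5_and_8 is hypothetical:

```python
import re

s = "123456789AB"

# Lookaheads only inspect text to the right of the current position, so no
# single position has both "5 and 8 ahead" and "two capitals right here".
lookahead_only = re.search(r"(?=.*5)(?=.*8)[A-Z]{2}", s)
print(lookahead_only)  # None


def letters_if_5_and_8(value):
    """Return the two capitals after the digits if the digits hold 5 and 8."""
    # Check both digits from the start, consume the digits, capture the letters.
    m = re.search(r"^(?=\d*5)(?=\d*8)\d*([A-Z]{2})", value)
    return m.group(1) if m else None


print(letters_if_5_and_8(s))             # AB
print(letters_if_5_and_8("12346789AB"))  # None (no 5 among the digits)
```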

Regular Expression for checking subword between capture groups

With regex, I am trying to remove the hyphenated repetitions at the beginning of a composed word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word in a text, I want to replace words in which the main word (wonder) is preceded by a partial or total repetition of itself (wo-wo-wo but also wonder-wonder-wonder).
At the same time, composed words like bi-linear or pre-trained MUST NOT be replaced, because in this case the hyphenated prefix (pre) is not part of the main word (train).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position ] and apparently it can be a good solution.
But my problem is quite different because I don't want to impose constraints on the length of the hyphenated part, and at the same time I want to check that the hyphenated part belongs to the main word.
This is the regex I am actually using but, as explained, it doesn't solve my full problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)
Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a left-hand whitespace boundary (or, if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
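A self-contained version of that sample, run against the inputs from the question, might look like the sketch below. The names STUTTER and collapse_stutter are made up for illustration:

```python
import re

# Group 1 is the repeated chunk; the match ends with "-" plus that same chunk,
# so only the stuttered prefix is consumed and the rest of the word is kept.
STUTTER = re.compile(r"(?<!\S)(\w+)(?:-\1)*-(\1)")


def collapse_stutter(text):
    """Collapse hyphenated repetitions at the start of a word."""
    return STUTTER.sub(r"\2", text)


for word in ["wo-wo-wo-wonder", "hi-hi-hi-hi", "f-f-f-fight",
             "wonder-wonder-wonder", "bi-linear", "pre-trained"]:
    print(word, "->", collapse_stutter(word))
```

One caveat worth keeping in mind: the pattern cannot tell a stutter from a genuine prefix that happens to repeat the start of the main word, so re-read is turned into read.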

Regex for words with no doubled characters and within a string length

Every time I need to use a regex I realize I've forgotten everything about them.
I am trying to match all words that have only lowercase alphanumeric characters AND do not have doubled alphanumeric characters AND are also 10 to 12 characters long.
Now, to figure out if a character is followed by the same character, I would do (.)\1. To see if a word is within 10 and 12 characters I do {10,12}. To grab only lowercase letters and the digits, I do [0-9a-z].
But how do I link them together?
Cheers!
PS: this will be running on a fairly large NLP xml (100mb+), so I would appreciate it if the regex wasn't the slowest alternative.
I think this will do what you want: -
/\b(?:([a-z0-9])(?!\1)){10,12}\b/
Explanation: -
\b // Word boundary
(?:
([a-z0-9]) // Match lowercase letters or digit
(?!\1) // Not followed by the same character that was just matched
){10,12} // 10 to 12 times.
\b // Word boundary
Here's one, although I'm not sure there won't be a better way...
/\b(?:([a-z0-9])(?!\1)){10,12}\b/
Here is my attempt:
(\b(?![0-9a-z]*([0-9a-z])\2)[0-9a-z]{10,12}\b)
(We have to use a lookahead, and some kind of boundary is usually very important for it to function properly. Hence \b).
At the time of writing, another answer has a false positive, matching a part of eoeuaoarounn
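The two patterns posted above can be compared side by side with a quick sketch. Python is assumed here (the question does not name a language), and the helper name words and the sample string are invented for the test:

```python
import re

# Per-character check: each character must not be followed by itself.
REPEATED_GROUP = re.compile(r"\b(?:([a-z0-9])(?!\1)){10,12}\b")
# Up-front check: reject the word if any doubled character occurs in it.
LOOKAHEAD_FIRST = re.compile(r"(\b(?![0-9a-z]*([0-9a-z])\2)[0-9a-z]{10,12}\b)")


def words(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "abcdefghij eoeuaoarounn aabbccddeeff short abcdefghijkl"

print(words(REPEATED_GROUP, sample))   # ['abcdefghij', 'abcdefghijkl']
print(words(LOOKAHEAD_FIRST, sample))  # ['abcdefghij', 'abcdefghijkl']
```

With both \b anchors in place, neither pattern matches part of eoeuaoarounn. Which one is faster on a 100 MB file is not something this sketch measures.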

PCRE - perl regex

I am trying to write a regex in PCRE for string detection. The kind of strings I want to detect are abcdef001, zxyabc003: a word whose first 6 characters are a-zA-Z and whose last two or three are digits 0-9. The string could be anywhere in the text.
E.g - "User activity from server1, user id abcdef009, time 10.20am".
How do I go about this?
Try this:
/[a-zA-Z]{6}[0-9]{2,3}/
If you want to limit it to whole words, try:
/\b[a-zA-Z]{6}[0-9]{2,3}\b/
\b - word boundary
[a-zA-Z]{6} - six letters
[0-9]{2,3} - either 2 or 3 digits
\b - word boundary
Use regex pattern
/[a-z]{6}\d{2,3}/i
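The difference between the unanchored and whole-word patterns shows up only when the ID is embedded in a longer token. A short sketch, using Python's re module as a stand-in for PCRE (these particular patterns behave the same way in both); the constant names and the second input are made up for the example:

```python
import re

ID_ANYWHERE = re.compile(r"[a-zA-Z]{6}[0-9]{2,3}")
ID_WHOLE_WORD = re.compile(r"\b[a-zA-Z]{6}[0-9]{2,3}\b")

log_line = "User activity from server1, user id abcdef009, time 10.20am"

# "server1" is skipped by both: six letters, but only one digit follows.
print(ID_ANYWHERE.findall(log_line))    # ['abcdef009']
print(ID_WHOLE_WORD.findall(log_line))  # ['abcdef009']

# Embedded in a longer token, only the unanchored pattern still matches.
print(ID_ANYWHERE.findall("xxabcdef0091"))    # ['abcdef009']
print(ID_WHOLE_WORD.findall("xxabcdef0091"))  # []
```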