Capture a word `+` same word again but with a prefix - regex

To all the Regex gurus
Any idea how to handle this beast
string = 'Position_Name [+|-|/|*] PrevYear Position_Name'
I'm looking for a regex that matches both occurrences of Position_Name. It looks like a duplicate but isn't quite one: the first occurrence is followed by a special character, and the second one carries a prefix (here: 'PrevYear'). Position_Name is dynamic and could be any word (e.g. Profit, Sales, etc.), but PrevYear will stay constant.
So how can I identify the lines where a position is mentioned twice with a math symbol in the middle (for now), and then capture those three elements? The symbol could be a plus +, a minus -, a slash / (divided by) or an asterisk * (multiplied by), which is what [+|-|/|*] in my example is meant to represent.
PS: I do not mind programming this in two steps ... so first matching and then capturing - but still would need the regex to find these little gems (in hundreds of lines).
Elegantly finding dupes is not the problem, e.g. via \b(\w+) \1\b, but I have come to realize my capabilities are not sufficient for this combination.
Thanks for any hints and support.

You can use
\b(\w+)\b\s*[-+/*]\s*PrevYear\s*\1\b
See the regex demo.
Details
\b - a word boundary
(\w+) - Group 1: one or more word chars
\b - a word boundary
\s*[-+/*]\s* - a -, +, / or * enclosed with zero or more whitespaces
PrevYear - a fixed word
\s* - zero or more whitespaces
\1 - same value as captured in Group 1
\b - a word boundary.
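
As a minimal sketch of how the pattern above might be applied to many lines, here is a Python version. The capturing group around the operator is an addition that is not in the answer's pattern (it is there so that all three elements can be read back), and the function name and sample lines are made up for illustration.

```python
import re

# The answer's pattern, with one assumed change: the operator class is wrapped
# in a capturing group (Group 2) so it can be retrieved. \1 still refers to
# the position name captured in Group 1.
PATTERN = re.compile(r'\b(\w+)\b\s*([-+/*])\s*PrevYear\s*\1\b')

def find_self_references(lines):
    """Return (position_name, operator) for every line that matches."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append((m.group(1), m.group(2)))
    return hits

# Hypothetical sample lines.
sample = [
    "Profit + PrevYear Profit",
    "Sales / PrevYear Sales",
    "Sales - PrevYear Profit",  # different names: no match
    "Profit + Profit",          # prefix missing: no match
]
print(find_self_references(sample))
# [('Profit', '+'), ('Sales', '/')]
```

The third element, the prefixed name, is simply 'PrevYear ' plus Group 1, so it does not need a group of its own.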

Related

Regex to replace up to 4 digits before a word

I am using this extension for chrome (It's called Word Replacer II) and I'm trying to create a Regex find and replace.
Quick backstory, my partner is recovering from an eating disorder and I want to find all mentions of Kilojoules and kJs and replace them with a placeholder.
I am entirely new to Regex and after a few hours, I'm not much closer to getting a working expression.
I need it to remove up to 4 digits before the letters "kJs". E.g, 400kJs and 1000kJs. I'd like the "400kJs and 1000kJs" to be replaced with "[removed kJs] and [removed kJs]".
The code I have put together so far is:
\s+(a{1,4}<=\d)\s+(?=kJ)
Any help would be much appreciated!
You may use the following approach:
\d{1,4}\s*kJs\b
See the regex demo
If you need to keep kJs, you may wrap the right part of the pattern with a lookahead, \d{1,4}(?=\s*kJs\b).
If you do not want to touch 5 or more digit numbers, use
\b\d{1,4}\s*kJs\b
(?<!\d)\d{1,4}\s*kJs\b
That is, add a word boundary, \b, or a left-hand digit boundary, (?<!\d).
Pattern details
\d{1,4} - one to four digits
\s* - 0+ whitespaces
kJs - a string of letters
\b - a word boundary (may not be necessary if there can be no word starting with kJs).
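
The extension's own replacement engine is not shown here, so the following is only a sketch of the same pattern in Python, to make the behaviour easy to check. The function name is hypothetical; the placeholder text is the one from the question.

```python
import re

# (?<!\d) is the left-hand digit boundary from the answer: it stops the
# pattern from matching the last four digits of a longer number.
KJ_PATTERN = re.compile(r'(?<!\d)\d{1,4}\s*kJs\b')

def hide_kilojoules(text, placeholder="[removed kJs]"):
    """Replace 1-4 digit kJs amounts with the placeholder."""
    return KJ_PATTERN.sub(placeholder, text)

print(hide_kilojoules("400kJs and 1000kJs"))
# [removed kJs] and [removed kJs]
print(hide_kilojoules("12345kJs"))
# 12345kJs  (five digits, left untouched)
```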

How To Use a Regex Capture Result to Lookbehind

I am trying to use the result of the capture group to perform a look behind for a specific answer.
Sample of Text:
10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B
Current Output:
https://regex101.com/r/bLKmYX/1
It is currently outputting FALSE and B as the answers to these questions.
Expected Output
I would like it to output FALSE and B) reinvest profits to build successively better electric automobiles
Current Regex Expression
'^\d+\)\s*([\s\S]*?)\nAnswer:\s*(.*)'
How can I use the result of the second capture group, (B), to perform a lookbehind and get the whole answer?
What you ask for is not possible due to the fact that a captured value can only be checked after it was obtained.
You may try another logic: capture the answer letter and then match the same letter after Answer: substring using the backreference to the group value.
You may consider a pattern like
(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)\nAnswer:\s*(\3|FALSE)
See the regex demo.
It has 4 capturing groups now, the first one containing the whole question body, then the second one containing the answer line you need, the third one is auxiliary (it is used to check which answer is correct), and the fourth one is the answer value.
Details
(?m) - ^ now matches line start positions and $ matches line end positions
^ - start of a line
\d+ - 1+ digits
\) - a ) char
\s* - 0+ whitespaces
((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?) - Group 1:
(?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)? - an optional non-capturing group matching
(?:(?!^\d+\))[\s\S])*? - any char, 0 or more occurrences but as few as possible, that is not the start of a "line start, 1+ digits, )" sequence (so the match cannot run into the next question)
\n - a newline
(([A-Z])\).*) - Group 2: an ASCII uppercase letter captured into Group 3, then ) char and then the rest of the line (.*)
$ - end of line
[\s\S]*? - any 0+ chars as few as possible
\nAnswer: - a new line, Answer: string
\s* - 0+ whitespaces
(\3|FALSE) - Group 4: Group 3 value or FALSE.
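
A small sketch of the pattern in Python, run on the sample text from the question. It relies on Python's behaviour that a backreference to a group that did not participate in the match (here \3 for a TRUE/FALSE question) simply fails, so the FALSE alternative is tried instead; other regex flavors may treat that case differently. The function name is made up.

```python
import re

QUIZ = re.compile(
    r'(?m)^\d+\)\s*((?:(?:(?!^\d+\))[\s\S])*?\n(([A-Z])\).*)$)?[\s\S]*?)'
    r'\nAnswer:\s*(\3|FALSE)'
)

text = """10) Once a strategy has been formulated and implemented, it is important that the firm sticks to it no matter what happens.
Answer: FALSE
11) Which of the following strategies does Tesla need to implement or achieve to gain a competitive advantage?
A) imitate the features of the most popular SUVs on the market
B) reinvest profits to build successively better electric automobiles
C) sell advertising space on their cars' digital displays
D) substitute less-expensive components to keep costs low
Answer: B"""

def extract_answers(text):
    """Return the full answer line for lettered answers, else the raw value."""
    results = []
    for m in QUIZ.finditer(text):
        # Group 2 is the whole option line; it is None when the answer
        # is not a letter (e.g. FALSE), so fall back to Group 4.
        results.append(m.group(2) or m.group(4))
    return results

print(extract_answers(text))
# ['FALSE', 'B) reinvest profits to build successively better electric automobiles']
```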

Regular Expression for checking subword between capture groups

Talking about Regex, I am facing the problem of removing hyphenated repetitions at the beginning of a compound word.
For example:
wo-wo-wo-wonder -> wonder
hi-hi-hi-hi -> hi
wo-wo-wo -> wo
f-f-f-fight -> fight
So, for every word inside a text, I want to replace words that before the main word (wonder) have a partial or total repetition of the main word (wo-wo-wo but also wonder-wonder-wonder).
At the same time, composed words like bi-linear or pre-trained MUST NOT be replaced, because in this case the hyphenation (pre) is not part of the main word (train).
I've seen this solution [Python find all occurrences of hyphenated word and replace at position ] and apparently it can be a good solution.
But my problem is quite different because I don't want to impose constraints about the length of hyphenation, and at the same time I want to check that hyphen is part of the main word.
This is the Regex I am actually using but as explained, it doesn't solve my full problem.
re.sub(r'(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)', '\\2', s)
Use
r'(?<!\S)(\w+)(?:-\1)*-(\1)'
or
r'\b(\w+)(?:-\1)*-(\1)'
See the regex demo
Details
(?<!\S) - a whitespace boundary (if you use \b, a word boundary)
(\w+) - Group 1: any one or more word chars
(?:-\1)* - 0 or more repetitions of - and Group 1 value
- - a hyphen
(\1) - Group 2: same value as in Group 1.
Python sample re.sub:
s = re.sub(r'(?<!\S)(\w+)(?:-\1)*-(\1)', r'\2', s)
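
To check the behaviour on the examples from the question, the call above can be wrapped in a small helper. This is only a sketch; the helper name is hypothetical.

```python
import re

STUTTER = re.compile(r'(?<!\S)(\w+)(?:-\1)*-(\1)')

def collapse_stutter(s):
    """Drop repeated hyphenated prefixes that are part of the main word."""
    return STUTTER.sub(r'\2', s)

for word in ["wo-wo-wo-wonder", "hi-hi-hi-hi", "wo-wo-wo",
             "f-f-f-fight", "pre-trained", "bi-linear"]:
    print(word, "->", collapse_stutter(word))
# wo-wo-wo-wonder -> wonder
# hi-hi-hi-hi -> hi
# wo-wo-wo -> wo
# f-f-f-fight -> fight
# pre-trained -> pre-trained
# bi-linear -> bi-linear
```

One limitation to be aware of: a genuine compound whose second part starts with its own prefix, such as re-read, is also collapsed (to read), since the pattern cannot tell it apart from a stutter.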

How do you specify multiples in negative character classes in regular expressions?

I am trying to write a regular expression to search for anything but digits or the * or - characters, with one caveat. Where I'm hitting a wall is that I need to allow runs of three or fewer digits to be found but not four or more, though even a single * or - shouldn't be found.
This is what I have so far (for three matches):
.*?([^0-9\*-]+).*?([^0-9\*-]+).*?([^0-9\*-]+).*?
I have no idea where to insert {4,} for the digits (I've tried and it doesn't seem to work anywhere) or how to change it to do as I want.
For instance, in "Jack has* 777 1883874 -sheep-" I'd like it to return "Jack has 777 sheep". Or in "2343klj-3***.net" I'd like it to return "klj 3 .net"
You may use the following regex (replacing with a literal space, " "):
(?:[-*\s]|\d{4,})+
See the regex demo. The pattern has no capturing group, so the replacement is a plain space, not $1.
Details
(?:[-*\s]|\d{4,})+ - a non-capturing group matching one or more consecutive repetitions of
[-*\s] - a single -, * or whitespace char
| - or
\d{4,} - 4+ digits.
Next, to remove all leading and trailing whitespace you may use
^\s+|\s+$
and replace with an empty string. ^\s+ matches 1+ whitespaces at the start of the string and \s+$ matches 1+ whitespaces at the end of the string.
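
The two steps can be sketched in Python as follows, using the two inputs from the question. The function name is made up for illustration.

```python
import re

def clean(text):
    # Step 1: collapse each run of -, *, whitespace and 4+ digit numbers
    # into a single space.
    collapsed = re.sub(r'(?:[-*\s]|\d{4,})+', ' ', text)
    # Step 2: trim whitespace left at the start and end of the string.
    return re.sub(r'^\s+|\s+$', '', collapsed)

print(clean("Jack has* 777 1883874 -sheep-"))  # Jack has 777 sheep
print(clean("2343klj-3***.net"))               # klj 3 .net
```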
With the help here, this is what works. It may be impossible to do it all in one regex because of the conflict of needing no spaces at the beginning and end but spaces in between each remaining grouping.
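First, a find and replace using ([-*\h]|\d{4,})+ and replacing with a space.
Second, a find and replace using ^\s*(.*?)\s*$ and keeping only the first capture group. The capture has to be lazy (.*?): a greedy (.*) would swallow the trailing whitespace itself.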

Parsing and extracting variables from math/algebraic expression

Given a string representing a math expression
(((x + temp) * temp_2) / 2_temp) + x3
I need to extract all the variables, except numbers and operators.
Only round brackets are permitted.
In particular, valid variables that can be extracted are:
x
x23
x_23
2_temp
2temp
2_
temp
temp_2
but not
2
15
So, in general, any variable can start with any character but, if it starts with a number, then it must contain at least one letter.
I've tried this regex expression
matcher=Pattern.compile("([a-zA-Z0-9_]+\\w*?[a-zA-Z0-9_]*\\w*?)").matcher(equation);
but, for example, 15 is extracted.
Can anyone help me?
Thanks in advance
Based on your description, you seem to want to extract entities starting with an alphanumeric and then having 0+ word chars.
You may use
\b(?!\d+\b)[a-zA-Z0-9]\w*\b
See the regex demo
If you allow these variables to start with an underscore, just use \b(?!\d+\b)\w+\b.
The point here is that the (?!\d+\b) negative lookahead does not allow the string between word boundaries to only contain digits.
Details:
\b - word boundary
(?!\d+\b) - restriction: there must be no 1+ digits followed with a word boundary
[a-zA-Z0-9] - an alphanumeric (in Java, you may also use \p{Alnum})
\w* - 0 or more word chars
\b - trailing word boundary.
In Java, do not forget to use double backslashes when defining a pattern:
String pat = "\\b(?!\\d+\\b)[a-zA-Z0-9]\\w*\\b";
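
For a quick check of the pattern outside Java, here is a sketch in Python (chosen only for brevity; the regex itself is unchanged). The function name and the second test string are made up.

```python
import re

# (?!\d+\b) rejects tokens made of digits only, such as 2 or 15.
VARIABLE = re.compile(r'\b(?!\d+\b)[a-zA-Z0-9]\w*\b')

def extract_variables(expression):
    """Return all variable names found in the expression, in order."""
    return VARIABLE.findall(expression)

print(extract_variables("(((x + temp) * temp_2) / 2_temp) + x3"))
# ['x', 'temp', 'temp_2', '2_temp', 'x3']
print(extract_variables("2 + 15 * 2_ - 2temp"))
# ['2_', '2temp']
```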
An alphanumeric string that contains at least one letter:
[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*