Regular expression Positive lookbehind, ignore first 2 words - regex

I have the following sentence: total 10 item(s) 26,50
I want to extract the number 26,50 based on the word "total". I came this far with a Positive Lookbehind but I'm stuck now. (?<=total )(.*)(?=\d)

You don't need lookbehind. Use groups:
https://regex101.com/r/oC0dM3/2
total\s+(?P<COUNT>\d+)\s+item(?:\(s\))?\s+(?P<PRICE>\d+(?:,\d+)?)
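Python's re module accepts the same (?P<name>...) syntax, so the named groups can be read back roughly as follows. This is a minimal sketch: the variable names are illustrative, and the sample sentence is the one from the question.

```python
import re

# Pattern from the answer above: named groups for the item count and the price.
TOTAL_RE = re.compile(
    r"total\s+(?P<COUNT>\d+)\s+item(?:\(s\))?\s+(?P<PRICE>\d+(?:,\d+)?)"
)

sentence = "total 10 item(s) 26,50"
match = TOTAL_RE.search(sentence)

# Guard against a non-matching line before reading the groups.
count = match.group("COUNT") if match else None
price = match.group("PRICE") if match else None
print(count, price)
```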

Many regex engines do not support variable-length lookbehind, and where it is supported a lookbehind-based pattern can be quite inefficient.
Use pattern grouping instead:
^total[^)]+\)\s+(.*)$
The only captured group here is your desired portion.
^total[^)]+\)\s+ matches up to the last whitespace before the desired pattern
(.*)$ gets our desired portion
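The grouping approach could be tried out in Python like this. It is only a sketch under the assumption that each input is a single line shaped like the question's sentence; extract_tail is a hypothetical helper name.

```python
import re

# ^total[^)]+\)\s+ consumes the text up to and including "item(s) ";
# (.*)$ captures the rest of the line in group 1.
TAIL_RE = re.compile(r"^total[^)]+\)\s+(.*)$")

def extract_tail(line):
    """Return the text captured in group 1, or None when the line does not match."""
    m = TAIL_RE.match(line)
    return m.group(1) if m else None

print(extract_tail("total 10 item(s) 26,50"))
```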

Related

Regex: how do I match a character before other capture characters?

I'm trying to match on a list of strings where I want to make sure the character before the match is not an equals sign, but without capturing that character. So, for a list (excerpted from pip freeze) like:
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
I want the captured output to look like this:
==3.10
==4.0.0
==0.5.1
I first thought using a negative lookahead (?![^=]) would work, but with a regular expression of (?![^=])==[0-9]+.* it ends up capturing the line I don't want:
==3.10
==2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
==4.0.0
==0.5.1
I also tried using a non-capturing group (?:[^=]) with a regex of (?:[^=])==[0-9]+.* but that ends up capturing the first character which I also don't want:
y==3.10
l==4.0.0
s==0.5.1
So the question is this: How can one match but not capture a string before the rest of the regex?
A negative lookbehind would be the way to go:
(?<!=)==[0-9.]+
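As a quick check of the negative lookbehind, here is a small Python sketch run against the lines from the question. Python is an assumption here, since the question does not name an engine.

```python
import re

# Negative lookbehind: the "==" must not be preceded by another "=".
VERSION_RE = re.compile(r"(?<!=)==[0-9.]+")

lines = [
    "ply==3.10",
    "powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-",
    "psutil==4.0.0",
    "ptyprocess==0.5.1",
]

# Keep the matched text from every line that has a match.
versions = [m.group(0) for m in map(VERSION_RE.search, lines) if m]
print(versions)
```

The === line is skipped because its first == is followed by = rather than a digit or dot, and its second == is preceded by =.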
Also, here is the site I like to use:
http://www.rubular.com/
It also helps if you say which engine or software you are using, so we know what limitations there might be.
If you want to remove the version numbers from the text, you could capture a character that is not an equals sign, ([^=]), in the first capturing group, followed by a match of == and the version number, \d+(?:\.\d+)+. Then, in the replacement, you would use that capturing group.
Regex
([^=])==\d+(?:\.\d+)+
Replacement
Group 1 $1
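In Python the same replacement might look like the sketch below. Note that Python's re.sub writes the group reference as \1 rather than $1; strip_version is a hypothetical helper name.

```python
import re

# Group 1 holds the single non-"=" character that precedes "==<version>".
PIN_RE = re.compile(r"([^=])==\d+(?:\.\d+)+")

def strip_version(line):
    """Drop '==<version>' and put back the character captured in group 1."""
    return PIN_RE.sub(r"\1", line)

print(strip_version("psutil==4.0.0"))
```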
Note
You could also use ==[0-9]+.* or ==[0-9.]+ to match the double equals signs and version numbers but that would be a very broad match. The first would also match ====1test and the latter would also match ==..
There's another regex operator, the lookbehind assertion (also called positive lookbehind), written (?<=...). Using it in my example above, the expression (?<=[^=])==[0-9]+.* produces the expected output:
==3.10
==4.0.0
==0.5.1
It took me a while to discover this; notably, at the time of writing, lookbehind assertions aren't supported in the popular regex tool regexr.
If there are alternatives to lookbehind for solving this, I'd love to hear them.

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces? That is, extract the text/number after the date until the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
See another demo
NOTE: the \K and capture-group approaches allow one or more spaces after the date and are thus more flexible.
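The capture-group variant can be sketched in Python as follows. Python's standard re module has no \K, so only the Group 1 approach is shown; value_after_date is a hypothetical helper name.

```python
import re

# Date, one or more spaces, then the next run of non-whitespace in group 1.
AFTER_DATE_RE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

def value_after_date(text):
    """Return the first whitespace-delimited token after the date, or None."""
    m = AFTER_DATE_RE.search(text)
    return m.group(1) if m else None

print(value_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(value_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```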
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches anything, followed by a space.
Then \d+\/\d+\/\d+ matches the date, followed by another space.
The capturing group is the number. As you can see, I made the last part optional, so both floating-point values and whole numbers are matched. Hope this helps!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
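A short Python check of the optional fractional part, assuming an engine with fixed-width lookbehind (Python's re qualifies, since the date plus space is always nine characters):

```python
import re

# Lookbehind for "dd/dd/dd ", then digits with an optional ".digits" tail.
NUMBER_AFTER_DATE_RE = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")

samples = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]

# The whole match is the number; no capture group is needed.
numbers = [NUMBER_AFTER_DATE_RE.search(s).group(0) for s in samples]
print(numbers)
```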
Update following clarifications in comments:
To match anything (except spaces) surrounded by spaces and preceded by a date, use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Regex to match a pattern with same number in R

I have a set of strings that look like the ones below. Each string has 3 numbers separated by underscores (_). Each number is a value between 1 and 100.
ma_1_1_1
ma_2_100_59
ma_29_29_29
ma_100_100_100
ma_7_72_78
ma_10_10_100
ma_4_4_49
I want to write a regular expression that returns the strings whose three numbers are all the same. For example, my output would be
ma_1_1_1, ma_29_29_29 and ma_100_100_100
Like this?
^ma_(\d+)_\1_\1$
See a demo on regex101.com.
This uses backreferences with the first captured group as well as anchors.
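The backreference idea can be tried in Python (rather than R) with the sample strings from the question. In this sketch, re.fullmatch stands in for the ^ and $ anchors.

```python
import re

# \1 must repeat exactly what group 1 captured, so all three numbers are equal.
SAME_RE = re.compile(r"ma_(\d+)_\1_\1")

strings = [
    "ma_1_1_1",
    "ma_2_100_59",
    "ma_29_29_29",
    "ma_100_100_100",
    "ma_7_72_78",
    "ma_10_10_100",
    "ma_4_4_49",
]

# fullmatch requires the whole string to match, like ^...$ in the pattern above.
same = [s for s in strings if SAME_RE.fullmatch(s)]
print(same)
```

Without the anchors (or fullmatch), ma_10_10_100 and ma_4_4_49 would match on a prefix, which is why the anchored form matters.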
Use back-references to make a regex match a previous group again:
ma_(100|[1-9][0-9]?)_\1_\1\b
Regex101 Demo
This will also validate that the numbers are within range. If this validation is unnecessary, use (\d+) for the capture group.
This answer is a modification of @4castle's; it extracts only the strings whose numbers are all the same.
grep("ma_(100|[0-9][0-9]|[0-9])(_\\1)(_\\1)\\b", stringList, value = TRUE)

Workaround for the lack of lookbehind?

To answer another user's question I knocked together the below regular expression to match numbers within a string.
\b[+-]?[0-9]+(\.[0-9]+)?\b
After providing my answer I noticed that I was getting unwanted matches in cases where there was a sequence of digits with more than one period among them due to \b matching the period character. For example "2.3.4" would return matches "2.3" and "4".
A negative lookahead and lookbehind could help me here, giving me a regex like this:
\b(?<!\.)[+-]?[0-9]+(\.[0-9]+)?\b(?!\.)
...except that for some unknown reason VBScript Regex (and by extension VBA) doesn't support lookbehind.
Is there some workaround that allows me to affirm that the word boundary at the start of the match is not a period without including it in the match?
Perhaps you don't need a look behind. If you are able to extract specific capture groups instead of the entire match then you can use:
(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)
Will match:
2.5
54.5
+3.45
-0.5
Won't match:
1.2.3
3.6.
.3.5
Capture group 1 will output the whole number and sign
Capture group 2 will output the whole number
Capture group 3 will output the fraction (like capture group 1 in your original expression)
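To illustrate reading the number from a capture group instead of the whole match, here is a sketch in Python. Python does support lookbehind, so this only demonstrates the technique, not the VBScript API; find_number is a hypothetical helper, and only unsigned inputs are exercised.

```python
import re

# (?:[^.]|^) consumes the character before the number (or the start of the
# string), so the whole match may include it; group 1 holds only the number.
NUMBER_RE = re.compile(r"(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)")

def find_number(text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = NUMBER_RE.search(text)
    return m.group(1) if m else None

for sample in ["2.5", "54.5", "1.2.3", "3.6.", ".3.5"]:
    print(sample, "->", find_number(sample))

# The full match can carry the consumed leading character; group 1 does not.
m = NUMBER_RE.search("x 2.5")
print(repr(m.group(0)), repr(m.group(1)))
```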

Regex to match certain word but not a particular combination

I have 15 titles as follows:
fruits-and-flowers-themeA
fruits-and-flowers-themeB
fruits-and-flowers-just-test-themeA
themeAfruitsandflowers
nice-fruits-and-flowers-themeA
botanical-names-themeA
I want a regex to help me get only those titles with "themeA" in them, but it should not include "nice" and not include "just-test" or "just-tests".
I tried
^(?!.*just-test|*just-tests|nice).*?(?:themeA).*,
but I still get fruits-and-flowers-just-test-themeA in the output.
How to fix this?
Thanks
You can use this regex with negative lookahead:
^(?!.*?(?:just-tests?|nice)).*?themeA.*$
Working Demo
Option 1
You can use a single regex with lookaheads (see online demo):
^(?!.*nice)(?!.*just-tests?).*themeA.*
The ^ asserts that the match starts at the beginning of the string (so we don't match a subset of the string).
The (?!.*nice) is a negative lookahead that asserts that at this position in the string, we cannot find any characters followed by nice.
The (?!.*just-tests?) is a negative lookahead that asserts that at this position in the string, we cannot find any characters followed by just-test and an optional s.
As a further tweak, you can compress the lookaheads into one using an | alternation as in anubhava's answer.
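The two forms can be compared in Python on the titles listed in the question. This is a sketch only: Option 1 is written with a plain nice, because an optional e (nice?) would also reject titles that merely contain nic, such as botanical-names-themeA.

```python
import re

titles = [
    "fruits-and-flowers-themeA",
    "fruits-and-flowers-themeB",
    "fruits-and-flowers-just-test-themeA",
    "themeAfruitsandflowers",
    "nice-fruits-and-flowers-themeA",
    "botanical-names-themeA",
]

# Two separate negative lookaheads, both checked at the start of the string.
TWO_LOOKAHEADS = re.compile(r"^(?!.*nice)(?!.*just-tests?).*themeA.*")
# The same exclusions folded into one lookahead with an alternation.
ONE_LOOKAHEAD = re.compile(r"^(?!.*?(?:just-tests?|nice)).*?themeA.*$")

kept_two = [t for t in titles if TWO_LOOKAHEADS.match(t)]
kept_one = [t for t in titles if ONE_LOOKAHEAD.match(t)]
print(kept_two)
print(kept_one)
```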
Option 2 without lookaheads (Perl, PHP/PCRE)
^(?:.*(?:nice|just-tests?).*)(*SKIP)(?!)|.*themeA.*
This one doesn't use lookaheads but just skips the unwanted titles. See demo.
Use two different regular expressions for clarity and simplicity.
Match your string against one regex that matches themeA:
/themeA/
and then check that the string does NOT match the one you don't want:
/nice|just-tests?/
Doing it in two different regexes makes it far easier to understand and maintain.
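A minimal Python sketch of the two-regex approach, assuming the titles are plain strings (keep is a hypothetical helper name):

```python
import re

WANTED_RE = re.compile(r"themeA")
UNWANTED_RE = re.compile(r"nice|just-tests?")

def keep(title):
    """True when the title contains themeA and none of the unwanted words."""
    return bool(WANTED_RE.search(title)) and not UNWANTED_RE.search(title)

titles = [
    "fruits-and-flowers-themeA",
    "fruits-and-flowers-themeB",
    "fruits-and-flowers-just-test-themeA",
    "themeAfruitsandflowers",
    "nice-fruits-and-flowers-themeA",
    "botanical-names-themeA",
]
kept = [t for t in titles if keep(t)]
print(kept)
```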