Workaround for the lack of lookbehind? - regex

To answer another user's question I knocked together the below regular expression to match numbers within a string.
\b[+-]?[0-9]+(\.[0-9]+)?\b
After providing my answer I noticed that I was getting unwanted matches in cases where there was a sequence of digits with more than one period among them due to \b matching the period character. For example "2.3.4" would return matches "2.3" and "4".
A negative lookahead and lookbehind could help me here, giving me a regex like this:
\b(?<!\.)[+-]?[0-9]+(\.[0-9]+)?\b(?!\.)
...except that for some unknown reason VBScript Regex (and by extension VBA) doesn't support lookbehind.
Is there some workaround that allows me to affirm that the word boundary at the start of the match is not a period without including it in the match?

Perhaps you don't need a look behind. If you are able to extract specific capture groups instead of the entire match then you can use:
(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)
Will match:
2.5
54.5
+3.45
-0.5
Won't match:
1.2.3
3.6.
.3.5
Capture group 1 will output the whole number and sign
Capture group 2 will output the whole number
Capture group 3 will output the fraction (like capture group 1 in your original expression)

Related

Regular Expression Replace Time Value between Date-Time Format

I have an XML file with date-time formats looking like this:
<published>2019-01-03T23:54:00.000+10:00</published>
and this
<published>2019-01-07T14:22:00.001+10:00</published>
and so on, where the time value is 23:54:00.000 and 14:22:00.001.
How do I replace just the time value between the <published></published> tags with regular expressions? For example, I want to replace both time values with 03:00:00.000 so the first example becomes
<published>2019-01-03T03:00:00.000+10:00</published>
My aim is to use any existing tools/apps Notepad++ or websites since it is much faster, not any specific programming languages.
First, the obligatory warning to not try to parse xml/html with regex. It's fine if this is a once-off reformatting task and you have control over the data. A regex solution will not be very robust...
That out of the way, you will need a tool that can handle capture groups with regex, so you can match on the whole published tag and avoid false positives. A regex like this might do the trick (adjust the capture grouping as appropriate for your tool):
(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)
Note that the above is a regex in PCRE format - demo on regex101. You may need to adjust to suit the format your tool uses.
In this regex, there are two capture groups, one before and one after the time you want to replace. An example string that you could use in the replace field of your chosen tool would be: \103:00:00.000\2 (using \1 syntax for backreferences).
Try this regex:
(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)
Click for Demo
Replace each match with \103:00:00.000\2 i.e. Group 1 contents followed by 03:00:00.000 followed by Group 2 contents.
Explanation:
(<published>\d{4}(?:-\d{2}){2}T) - matches <published> followed by 4 digits followed by - followed by 2 digits followed by - followed by 2 digits followed by the letter T. This sub-match is captured in Group 1
\d{2}(?::\d{2}){2}\.\d{3} - matches time of the format XX:XX:XX.XXX where X is a digit.
([^<]*<\/published>) - matches 0+ occurrences of any character that is not a < followed by </published>. This sub-match is captured in Group 2.
Before Replace:
After Replace:

Regex: how do I match a character before other capture characters?

I'm trying to match on a list of strings where I want to make sure the first character is not the equals sign, don't capture that match. So, for a list (excerpted from pip freeze) like:
ply==3.10
powerline-status===2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
psutil==4.0.0
ptyprocess==0.5.1
I want the captured output to look like this:
==3.10
==4.0.0
==0.5.1
I first thought using a negative lookahead (?![^=]) would work, but with a regular expression of (?![^=])==[0-9]+.* it ends up capturing the line I don't want:
==3.10
==2.6.dev9999-git.b-e52754d5c5c6a82238b43a5687a5c4c647c9ebc1-
==4.0.0
==0.5.1
I also tried using a non-capturing group (?:[^=]) with a regex of (?:[^=])==[0-9]+.* but that ends up capturing the first character which I also don't want:
y==3.10
l==4.0.0
s==0.5.1
So the question is this: How can one match but not capture a string before the rest of the regex?
Negative look behind would be the go:
(?<!=)==[0-9.]+
Also, here is the site I like to use:
http://www.rubular.com/
Of course it does some times help if you advise which engine/software you are using so we know what limitations there might be.
If you want to remove the version numbers from the text you could capture not an equals sign ([^=]) in the first capturing group followed by matching == and the version numbers\d+(?:\.\d+)+. Then in the replacement you would use your capturing group.
Regex
([^=])==\d+(?:\.\d+)+
Replacement
Group 1 $1
Note
You could also use ==[0-9]+.* or ==[0-9.]+ to match the double equals signs and version numbers but that would be a very broad match. The first would also match ====1test and the latter would also match ==..
There's another regex operator called a 'lookbehind assertion' (also called positive lookbehind) ?<= - and in my above example using it in the expression (?<=[^=])==[0-9]+.* results in the expected output:
==3.10
==4.0.0
==0.5.1
At the time of this writing, it took me a while to discover this - notably the lookbehind assertion currently isn't supported in the popular regex tool regexr.
If there's alternatives to using lookbehind to solve I'd love to hear it.

Detect multiple periods in Regex and kill entire match

I'm trying to detect a price in regex with this:
^\-?[0-9]+(,[0-9]+)?(\.[0-9]+)?
This covers:
12
12.5
12.50
12,500
12,500.00
But if I pass it
12..50 or 12.5.0 or 12.0.
it still returns a match on the 12 . I want it to negate the entire string and return no match at all if there is more than one period in the entire string.
I've been trying to get my head around negative lookaheads for an hour and have searched on Stack Overflow but can't seem to find the right answer. How do I do this?
What you are looking for, is this:
^\d+(,\d{3})*(\.\d{1,2})?$
What it does:
^ Start of Line
\d+ one or more Digits followed by
(,\d{3})* zero, one or more times a , followed by three Digits followed by
(\.\d{1,2})? one or zero . followed by one or two Digits followed by
$ End of Line
This will only match valid Prices. The Comma (,) is not obligatory in this Regex, but it will be matched.
Look here: http://www.regextester.com/?fam=98001
If you work with Prices and want to store them in a Database I recommend saving them as INT. So 1,234,56 becomes 123456 or 1,234 becomes 123400. After you matched the valid price, all you have to do is to remove the ,s, split the Value by the Dot, and fill the Value of [1] with str_pad() (STR_PAD_RIGHT) with Zeros. This makes Calculations easier, in special when you work with Javascript or other different Languages.
Your regex:
^\-?[0-9]+(,[0-9]+)?(\.[0-9]+)?
Note: The regex you provided does not seem to work for 12 (without "."). Since you didn't add a quantifier after \., it tries to match that pattern literally (.).
While there are multiple ways to solve this and the most "correct" answer will depend on your specific requirements, here's a regex that will not match 12..1, but will match 12.1:
(^\-?[0-9]+(?:,[0-9]+)?(?:\.[0-9]+))+
I surrounded the entire regex you provided in a capturing group (...), and added a one or more quantifier + at the end, so that the entire regex will fail if it does not satisfy that pattern.
Also (this may or may not be what you want), I modified the inner groups into non-capturing groups (?: ... ) so that it does not return unnecessary groups.
This site offers a deconstruction of regexes and explains them:
For the regex provided: https://regex101.com/r/EDimzu/2
Unit tests: https://regex101.com/r/EDimzu/2/tests (Note the 12 one's failure for multiple languages).
You can limit it by requiring there is only 0 or 1 periods like this:
^[0-9,]+[\.]{0,1}?[0-9,]+$

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces ? So extract the text/number after the date until the next whitespace ? All I know is that the date will always have the same format and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this ?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
^^
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractal part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can either look like post, or post put or even get put post. All of these must be matched. Preferably deviances like [space]post, or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
^^^^
See the regex demo
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you need to also add a negative lookahead into Group 1 so that the subroutine call was aware that we do not want to match the same value that was captured into Group 1.
^((post|put|delete|get)(?!.*\2))( (?1))*$
See the regex demo
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.