Capture number between two whitespaces (RegEx) - regex

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces? That is, extract the text/number after the date up to the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?

Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
See another demo
NOTE: the \K and capture-group approaches allow one or more spaces after the date and are thus more flexible.
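As a rough illustration of the capture-group approach, here is a minimal Python sketch. Using Python's built-in re module is an assumption (the question does not name an engine), and re does not support \K, so only the Group 1 variant is shown. The name extract_after_date is hypothetical.

```python
import re

# Date-like part, one or more spaces, then the next run of
# non-whitespace characters, captured into Group 1.
DATE_THEN_TOKEN = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

def extract_after_date(line):
    """Return the token that follows the date, or None if nothing matches."""
    m = DATE_THEN_TOKEN.search(line)
    return m.group(1) if m else None

print(extract_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(extract_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```

Because the pattern uses " +" after the date, the same function also copes with several spaces between the date and the value.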

I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches everything up to the space before the date,
then \d+\/\d+\/\d+ matches the date, followed by a space.
The capturing group is the number. As you can see, I made the last part optional, so both floating-point values and whole numbers can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1

Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
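A small Python sketch of the optional-fraction pattern, assuming the built-in re module, which accepts this lookbehind because it has a fixed width. The name number_after_date is hypothetical.

```python
import re

# Fixed-width lookbehind for "dd/dd/dd ", then digits with an
# optional ".digits" fractional part.
NUMBER_AFTER_DATE = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")

def number_after_date(line):
    """Return the number right after the date as a string, or None."""
    m = NUMBER_AFTER_DATE.search(line)
    return m.group(0) if m else None

print(number_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(number_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```

Unlike the \S+ variant, this one only matches when the token after the date starts with digits.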

Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Related

Notepad++: Can I use regex to find some values and remove only one character instead of the whole pattern?

I want to use regex in notepad to find this pattern: "[0-9]+[\.][0-9]+[,][0-9]+" e.g. 1.010,80260
However from these kind of numbers I just want to remove the '.' , so the new value should be 1010,80260 .
So far I can only replace the whole pattern. Is there a way to do it?
Thank you in advance!
You can make use of the \K meta escape since PCRE doesn't support variable width lookbehinds:
regex:
[0-9]+\K[\.](?=[0-9]+[,][0-9]+)
[0-9]+ - match digits
\K - forget what we've matched so far
[\.] - match a period; just \. can be used, no need for the char class brackets
(?=[0-9]+[,][0-9]+) - ahead of me should be digits followed by a comma and digits
replace:
Nothing
\K is bugged in Notepad++ so you could use this regex instead since you only care that at least one digit is behind the period:
(?<=\d)\.(?=[0-9]+[,][0-9]+)
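For trying the lookaround version outside Notepad++, here is a minimal Python sketch. It assumes the built-in re module, and the name drop_thousands_dot is hypothetical.

```python
import re

# A period that has a digit before it and "digits,digits" after it.
# Both lookarounds are zero-width, so only the period itself is matched.
THOUSANDS_DOT = re.compile(r"(?<=\d)\.(?=[0-9]+,[0-9]+)")

def drop_thousands_dot(text):
    """Remove the period in numbers shaped like 1.010,80260."""
    return THOUSANDS_DOT.sub("", text)

print(drop_thousands_dot("1.010,80260"))
```

A period that is not followed by digits, a comma, and more digits is left alone.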
You can use \K, which basically says throw away whatever was matched up until that point, then add a lookahead. Like so
[0-9]+\K\.(?=[0-9]+[,][0-9]+)
Change the regular expression to: ([0-9]+)[\.]([0-9]+[,][0-9]+)
The () pieces are groups which you can refer to in the replace with \1 for the first group, and \2 for the second group.
The docs also explain this here: https://npp-user-manual.org/docs/searching/#substitution-grouping (even better, and in more detail, than my usage in this answer...)
EDIT: I just wanted to share the animated gif showing that 'Replace' in Notepad++ 7.9.5. does not seem to work.
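The group-based replacement can be sketched in Python as well. This assumes the built-in re module, where \1 and \2 in the replacement string refer to the first and second groups, much like in Notepad++. The name drop_dot_with_groups is hypothetical.

```python
import re

# Group 1: digits before the period; Group 2: "digits,digits" after it.
GROUPED = re.compile(r"([0-9]+)\.([0-9]+,[0-9]+)")

def drop_dot_with_groups(text):
    """Rebuild each match from its two groups, leaving out the period."""
    return GROUPED.sub(r"\1\2", text)

print(drop_dot_with_groups("price 1.010,80260 EUR"))
```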

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 ok using (?<=.art\:\s)([0-9A-Z]*). This gave me a one-group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way. I need only one group to appear, because I am parsing that into a column value.
I'm not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) is giving me the entire line after S.
Helping me get it right with an explanation will go a long way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
Demo
First we make a positive lookbehind that makes sure we're after "art: ", then we match two numbers, separated by non-digits.
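To see the difference between a lookahead and a lookbehind in practice, here is a minimal Python sketch using the built-in re module (an assumption, since the question does not name an engine). The patterns are simplified versions of the ones discussed, without the extra group inside the lookaround.

```python
import re

s = "Start: 738392E, 6726376N"

# A lookahead only asserts: the match still starts at "art:", because
# the position has not moved past it.
ahead = re.search(r"(?=art:\s)\S+", s).group(0)

# A lookbehind asserts what precedes the position, so the match
# starts right after "art: ".
behind = re.search(r"(?<=art:\s)\d+", s).group(0)

# Two numbers separated by non-digits, each in its own group.
easting, northing = re.search(r"(?<=art:\s)(\d+)\D+(\d+)", s).groups()

print(ahead, behind, easting, northing)
```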
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
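Both suggestions can be tried with a short Python sketch, assuming the built-in re module. The variable names are illustrative only.

```python
import re

s = "Start: 738392E, 6726376N"

# Whole-line pattern: Group 1 is the easting, Group 2 the northing.
m = re.search(r"Start: (\d+)E, (\d+)N", s)
easting, northing = m.groups()

# Each number on its own: digits followed by E or N at a word boundary.
numbers = re.findall(r"\b\d+(?=[EN]\b)", s)

print(easting, northing, numbers)
```

The second pattern returns each number as a separate match, which may fit better when the tool expects exactly one value per match.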
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that the [0-9]* at the end of the pattern never matches any digits: it is optional, and the preceding .* has already consumed everything up to the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
Regex demo
If you want the matches only, you could make use of the PyPI regex module, which supports variable-width lookbehinds:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])
Regex demo (For example only, using a different engine) | Python demo

Regular Expression Replace Time Value between Date-Time Format

I have an XML file with date-time formats looking like this:
<published>2019-01-03T23:54:00.000+10:00</published>
and this
<published>2019-01-07T14:22:00.001+10:00</published>
and so on, where the time value is 23:54:00.000 and 14:22:00.001.
How do I replace just the time value between the <published></published> tags with regular expressions? For example, I want to replace both time values with 03:00:00.000 so the first example becomes
<published>2019-01-03T03:00:00.000+10:00</published>
My aim is to use any existing tools/apps Notepad++ or websites since it is much faster, not any specific programming languages.
First, the obligatory warning to not try to parse xml/html with regex. It's fine if this is a once-off reformatting task and you have control over the data. A regex solution will not be very robust...
That out of the way, you will need a tool that can handle capture groups with regex, so you can match on the whole published tag and avoid false positives. A regex like this might do the trick (adjust the capture grouping as appropriate for your tool):
(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)
Note that the above is a regex in PCRE format - demo on regex101. You may need to adjust to suit the format your tool uses.
In this regex, there are two capture groups, one before and one after the time you want to replace. An example string that you could use in the replace field of your chosen tool would be: \103:00:00.000\2 (using \1 syntax for backreferences).
Try this regex:
(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)
Replace each match with \103:00:00.000\2 i.e. Group 1 contents followed by 03:00:00.000 followed by Group 2 contents.
Explanation:
(<published>\d{4}(?:-\d{2}){2}T) - matches <published> followed by 4 digits followed by - followed by 2 digits followed by - followed by 2 digits followed by the letter T. This sub-match is captured in Group 1
\d{2}(?::\d{2}){2}\.\d{3} - matches time of the format XX:XX:XX.XXX where X is a digit.
([^<]*<\/published>) - matches 0+ occurrences of any character that is not a < followed by </published>. This sub-match is captured in Group 2.
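Although the question asks for a tool rather than a programming language, the replacement can be checked with a minimal Python sketch, assuming the built-in re module. One caveat specific to Python: \103 in a replacement string is read as an octal escape rather than "Group 1 followed by 03", so the unambiguous \g<1> form is used here. The name set_published_time is hypothetical.

```python
import re

# Group 1: "<published>" plus the date and "T"; Group 2: the offset
# and the closing tag. The time in between is not captured.
PUBLISHED = re.compile(
    r"(<published>\d{4}(?:-\d{2}){2}T)"
    r"\d{2}(?::\d{2}){2}\.\d{3}"
    r"([^<]*</published>)"
)

def set_published_time(xml_text, new_time="03:00:00.000"):
    """Replace the time inside every <published> element."""
    # \g<1> avoids the ambiguity of \1 being followed by digits.
    return PUBLISHED.sub(r"\g<1>" + new_time + r"\g<2>", xml_text)

print(set_published_time("<published>2019-01-03T23:54:00.000+10:00</published>"))
```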

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match the first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Specifically, I want to capture the URL of the first window.location.replace entry in the whole HTML string.
So for capturing the URL I formulated these two rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I end up with this Regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me with translating my rules to Regex? Other ways of doing this (without lookahead(behind)) also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, you may not be able to use a lookbehind: JavaScript engines only gained lookbehind support with ES2018, and older ones do not support it.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
No /g modifier will allow matching just one, first occurrence. Access the value you need inside Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
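The same capturing-group idea, sketched in Python rather than JS (an assumption made only to keep the example self-contained, using the built-in re module). re.search stops at the first match, which plays the role of omitting /g. The name first_redirect_url is hypothetical.

```python
import re

# Literal prefix, then the URL as "anything but a double quote" in
# Group 1, then the closing quote and parenthesis.
REDIRECT_URL = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

def first_redirect_url(html):
    """Return the URL of the first redirect call, or None."""
    m = REDIRECT_URL.search(html)
    return m.group(1) if m else None

html = ('<script>window.location.redirect("http://stackoverflow.com");'
        'window.location.redirect("http://example.com");</script>')
print(first_redirect_url(html))
```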

Fetch one out of two Numbers out of String

I have a list of strings, such as: Ø20X400
I need to extract the first of the numbers - between Ø and X
I've come so far to match the numbers in general with \d+ - as simple as it is...
But I need an expression to get the first value separated, not both of them...
You can use lookarounds (?<=..) and (?=..):
(?<=Ø)\d+(?=X)
or in Java style:
(?<=Ø)\\d+(?=X)
A second way is to use a capture group:
Ø(\d+)X
or
Ø(\\d+)X
Then you can extract the content of the group.
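Both variants can be sketched in Python, assuming the built-in re module and raw strings, so the backslash does not need doubling as it does in Java. The function names are hypothetical.

```python
import re

def first_number(s):
    """Capture-group variant: the digits between Ø and X are Group 1."""
    m = re.search(r"Ø(\d+)X", s)
    return m.group(1) if m else None

def first_number_lookaround(s):
    """Lookaround variant: the whole match is just the digits."""
    m = re.search(r"(?<=Ø)\d+(?=X)", s)
    return m.group(0) if m else None

print(first_number("Ø20X400"), first_number_lookaround("Ø20X400"))
```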
The regex engines I know parse \n as a newline. \d is used for numbers.
The following regex gives you the first number between a Ø and a X in a capture group:
^.*?Ø(\d+)X.*
This Regex will do it for you, (\d+?)X, and here is a Rubular to prove it. See, you want to group digits together, but make it non-greedy, ending the evaluation on X.
Try this one:
\d+(?=\D)
This should find the first number that is followed by a non-digit character.
With normal regular expressions, I would say:
Ø(\d+)X
This finds the Ø character, followed by one or more numbers, followed by an X. Also, the numbers will be stored in the first capture group. Capture groups differ from one regex implementation to another, but this would typically be denoted by \1. Capture group zero, \0, is usually the matched string itself. In this version, \d denotes digits 0-9, but if your regex engine uses \n for that purpose, use:
Ø(\n+)X