Regular Expression Replace Time Value between Date-Time Format - regex

I have an XML file with date-time formats looking like this:
<published>2019-01-03T23:54:00.000+10:00</published>
and this
<published>2019-01-07T14:22:00.001+10:00</published>
and so on, where the time value is 23:54:00.000 and 14:22:00.001.
How do I replace just the time value between the <published></published> tags with regular expressions? For example, I want to replace both time values with 03:00:00.000 so the first example becomes
<published>2019-01-03T03:00:00.000+10:00</published>
I would like to do this with an existing tool or app such as Notepad++, or a website, rather than a specific programming language, since that is much faster.

First, the obligatory warning: do not try to parse XML/HTML with regex. It's fine if this is a one-off reformatting task and you have control over the data, but a regex solution will not be very robust.
That out of the way, you will need a tool that can handle capture groups with regex, so you can match on the whole published tag and avoid false positives. A regex like this might do the trick (adjust the capture grouping as appropriate for your tool):
(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)
Note that the above is a regex in PCRE format - demo on regex101. You may need to adjust to suit the format your tool uses.
In this regex, there are two capture groups, one before and one after the time you want to replace. An example string that you could use in the replace field of your chosen tool would be: \103:00:00.000\2 (using \1 syntax for backreferences).
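For readers who end up scripting this after all, here is a minimal Python sketch of the same replacement. It assumes Python's standard re module, and the name set_time is made up for illustration. The replacement is written with \g<1> rather than \1 because in Python a backreference followed directly by digits (as in \103) is not read as group 1 followed by 03.

```python
import re

# The answer's pattern, unchanged: group 1 is everything up to and including "T",
# group 2 is the UTC offset plus the closing tag.
PUBLISHED_TIME = re.compile(
    r"(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)"
)

def set_time(xml_text):
    """Replace the time inside every matching <published> element with 03:00:00.000."""
    # \g<1> and \g<2> are Python's unambiguous spelling of \1 and \2.
    return PUBLISHED_TIME.sub(r"\g<1>03:00:00.000\g<2>", xml_text)

sample = "<published>2019-01-03T23:54:00.000+10:00</published>"
print(set_time(sample))
```

Other tools have their own rules for a backreference followed by digits, so it is worth testing the replacement string on a copy of the file first.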

Try this regex:
(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)
Replace each match with \103:00:00.000\2 i.e. Group 1 contents followed by 03:00:00.000 followed by Group 2 contents.
Explanation:
(<published>\d{4}(?:-\d{2}){2}T) - matches <published> followed by 4 digits followed by - followed by 2 digits followed by - followed by 2 digits followed by the letter T. This sub-match is captured in Group 1
\d{2}(?::\d{2}){2}\.\d{3} - matches time of the format XX:XX:XX.XXX where X is a digit.
([^<]*<\/published>) - matches 0+ occurrences of any character that is not a < followed by </published>. This sub-match is captured in Group 2.
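As a rough check of the breakdown above, the same pattern can be tried in Python. This is only a sketch: reset_time and its new_time parameter are hypothetical names, and a callable replacement is used so that no backreference syntax is involved. Because Group 2 starts with [^<]*, offsets other than +10:00 are also accepted.

```python
import re

# Pattern from the answer: Group 1 = "<published>" + date + "T",
# Group 2 = whatever follows the time up to and including "</published>".
PUBLISHED_TIME = re.compile(
    r"(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)"
)

def reset_time(xml_text, new_time="03:00:00.000"):
    """Swap the time part for new_time, keeping Group 1 and Group 2 as they are."""
    return PUBLISHED_TIME.sub(lambda m: m.group(1) + new_time + m.group(2), xml_text)

print(reset_time("<published>2019-01-07T14:22:00.001+10:00</published>"))
print(reset_time("<published>2019-01-07T14:22:00.001Z</published>"))
```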

Related

Replace a sequence of 3 to 30 characters by another one with regex

I need to replace sequences (of various length) of characters by another character. I am working in Eclipse on xml files
For example -------- should be replaced by ********.
The replacement should be done only for sequences of at least 3 characters, not 1 or 2.
It is easy to find the matching sequences in regex for example with -{3,30} but I don't understand how to specify the replacement sequence.
I had this regex solution ready when the question was posted, but I didn't submit an answer because I kept testing in Eclipse: even though the regex worked in the Find feature, a * in the replacement field wasn't changing the text in the Eclipse editor.
Here is a shorter and a bit more efficient regex:
(?!^)\G-|(?=-{3})-
Replace with a *
Breakdown:
(?!^)\G: Match from end of the previous match
-: Match a -
|: OR
(?=-{3}): Make sure we have 3 hyphens ahead
-: Match a -
You can use
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))-
See the regex pattern. Details:
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-))) - either
\G(?!\A) - end of the preceding match
| - or
(?<!-)(?=-{3,30}(?!-)) - a position that is not immediately preceded with a - char and is immediately followed with 3 to 30 hyphens (not followed with another hyphen).
- - a hyphen.
The (?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))- regex goes into the "Find What" field, * goes to the "Replace With" field.
Note that regular expressions are only meant to be used in search fields, replacement fields must only contain replacement patterns. Usually, the replacement pattern is a string containing literal string(s) and/or backreferences. Here, we do not need any backreference as the regex does not capture anything.
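Outside an editor, the same result can be reached without \G. Python's standard re module does not support \G, so the sketch below swaps in a different technique: it matches each whole run of 3 to 30 hyphens and builds a replacement of the same length in a callback. The function name mask_hyphen_runs is made up for illustration.

```python
import re

# A run of 3 to 30 hyphens that is not part of a longer run of hyphens.
HYPHEN_RUN = re.compile(r"(?<!-)-{3,30}(?!-)")

def mask_hyphen_runs(text):
    """Replace every qualifying run of hyphens with the same number of asterisks."""
    return HYPHEN_RUN.sub(lambda m: "*" * len(m.group()), text)

print(mask_hyphen_runs("a--b---c--------"))
```

Runs of one or two hyphens, and runs longer than 30, are left untouched, which mirrors the (?<!-) and (?!-) guards in the pattern above.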

Regex: grab the string that begins after a certain string and ends when it hits any other character

I am trying to use Regex to grab a substring of a large string.
The overall string has certain text, 'cow/', then any number of characters or spaces that are not digits. The first digit hit is the start of the desired substring I want.
This desired substring consists of only digits and periods; the first character or space seen that is not a digit or period indicates the end of the desired substring.
For example:
'cow/ a12.34 -123'
The desired substring is '12.34'.
So far I have this regex that partially works (I think the '| .' is not entirely correct):
(?<=([A-z]|[0-9])/\s*).?(?=\s[^0-9 |.])
Thanks in advance.
This should be easy to achieve by relying on capturing groups:
cow/[^0-9]*([0-9.]+)
The group will contain the text that you want to extract: in Java use group(index), in C# use Groups[index]. Other languages provide similar features.
Don't try to solve everything inside the regular expression, but leverage the power of your runtime :)
Edit after comment on the OP:
Azure Kusto has the extract(regex, captureGroup, text [, typeLiteral]) function to extract groups from regular expression matches:
extract("cow/[^0-9]*([0-9.]+)", 1, "cow/ a12.34 -123") == "12.34";
The argument 1 tells Kusto to extract the first capturing group (the expression inside the parentheses).
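For comparison, a minimal Python sketch of the same capture-group extraction, assuming the standard re module (the helper name extract_number is hypothetical):

```python
import re

def extract_number(text):
    """Return the first run of digits and periods after 'cow/', or None if absent."""
    match = re.search(r"cow/[^0-9]*([0-9.]+)", text)
    # group(1) is the first capturing group, i.e. the part inside the parentheses.
    return match.group(1) if match else None

print(extract_number("cow/ a12.34 -123"))
```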

Regex Select groups not found in a pattern

I have been looking at the various topics on regex on SO, and they all say that to invert a match (select everything that doesn't fit the criteria) you simply use the [^] syntax or a negative lookahead.
I have tried both of these methods on my regex, but the results are not adequate. The [^] in particular seems to take all its contents literally (even when escaped).
What I need this for:
I have a massive single line from an SQL dump, and I'm trying to remove all characters that are not the line id or the numerical value of one column.
My regex works in matching exactly what I'm looking for; what I need to do is to invert this match so I can remove all non-matching parts in my IDE.
My regex:
/(\),\(\d{1,4},)|(,\d{10},)/
This matches a "),(<number up to 4 digits>," or ",<number of ten digits>," .
The subject
My subject is a 500Kb line of an SQL dump looking something like this (I have already removed a-z and other unwanted characters in previous simple find/replaces):
),(39,' ',1,'01761472100','#','9 ','20',1237213277,0,1237215419,''),(40,' ',3,'01445731203','#',' ','-','22 2','210410//816',1237225423,0,1484651768,''),(4270,' /
My aim is to use a regex to achieve the following output:
),(39,,1237213277,,1237215419,),(40,,1237225423,,1484651768,),(4270,
Which I can then go over again and easily remove repetitions such as commas.
I have read that negation in regex is tricky. So, what is the syntax to invert the regex I've made, so that all non-matching parts are removed? What can you recommend as a way of solving this without spending hours manually reading the lines?
You may use the really helpful (*SKIP)(?!) construct in PCRE (equivalent to (*SKIP)(*F) or (*SKIP)(*FAIL)) to match the texts you want to keep, skip them, and then match all other text to remove:
/(?:\),\(\d{1,4},|,\d{10},)(*SKIP)(?!)|./s
See the regex demo
Details:
(?:\),\(\d{1,4},|,\d{10},) - match 1 of the 2 alternatives:
\),\(\d{1,4}, - ),(, then 1 to 4 digits and then ,
| - or
,\d{10}, - a comma, 10 digits, a comma
(*SKIP)(?!) - omit the matched text and proceed to the next match
| - or
. - any char (since /s DOTALL modifier is passed to the regex)
The same can be done with
/(\),\(\d{1,4},|,\d{10},)?./s
and replacing with $1 backreference (since we need to put back the text captured with the patterns we need to keep), see another regex demo.
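The second approach carries over to engines without (*SKIP), such as Python's standard re module. The sketch below is an illustration under that assumption: the function name keep_ids_and_timestamps is made up, the sample is a shortened stand-in for the real dump, and a callback is used instead of $1 so that an unmatched optional group simply contributes an empty string.

```python
import re

# Optional "keep" group followed by exactly one arbitrary character to drop.
KEEP_OR_DROP = re.compile(r"(\),\(\d{1,4},|,\d{10},)?.", re.DOTALL)

def keep_ids_and_timestamps(dump):
    """Keep only '),(<id>,' and ',<10 digits>,' fragments, dropping everything else."""
    # group(1) is None when only the single character matched.
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", dump)

sample = "),(39,' ',1,'01761472100','#',1237213277,0,1237215419,''),(40,'x',1237225423,0,1484651768,'')"
print(keep_ids_and_timestamps(sample))
```

Note that each match also consumes the one character that follows a kept fragment, which is why the fragments end up adjacent in the output.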

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces? That is, extract the text/number after the date until the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
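A small Python sketch of the capture-group variant, run against both inputs from the question. It assumes the standard re module (which has no \K), and the helper name value_after_date is hypothetical.

```python
import re

# Date-like part, one or more spaces, then the non-whitespace token in Group 1.
AFTER_DATE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

def value_after_date(line):
    """Return the first non-whitespace token following the date, or None."""
    match = AFTER_DATE.search(line)
    return match.group(1) if match else None

print(value_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(value_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```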
I see some people have helped you already, but if you want an alternative for some reason, here's one that works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches everything before the date, followed by a space.
The \d+\/\d+\/\d+ part matches the date, followed by a space.
The capturing group is the number. As you can see, I made the last part optional, so both floating-point values and plain integers are matched. Hope this helps!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
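Both lookbehind patterns from this answer can be compared side by side in Python, whose standard re module accepts them because the lookbehind has a fixed width. This is a sketch only; the names OPTIONAL_FRACTION, ANY_TOKEN and whole_match are made up, and the N/A line is an invented input to show where the two patterns differ.

```python
import re

# Digits with an optional fractional part, right after the date and a space.
OPTIONAL_FRACTION = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")
# Any run of non-whitespace right after the date and a space.
ANY_TOKEN = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\S+")

def whole_match(pattern, line):
    """Return the entire match (no capture group needed), or None."""
    match = pattern.search(line)
    return match.group() if match else None

print(whole_match(OPTIONAL_FRACTION, "SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(whole_match(OPTIONAL_FRACTION, "SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
print(whole_match(ANY_TOKEN, "SOMEDATA .test 01/45/12 N/A THIS IS DATA"))
```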
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Workaround for the lack of lookbehind?

To answer another user's question I knocked together the below regular expression to match numbers within a string.
\b[+-]?[0-9]+(\.[0-9]+)?\b
After providing my answer I noticed that I was getting unwanted matches in cases where there was a sequence of digits with more than one period among them due to \b matching the period character. For example "2.3.4" would return matches "2.3" and "4".
A negative lookahead and lookbehind could help me here, giving me a regex like this:
\b(?<!\.)[+-]?[0-9]+(\.[0-9]+)?\b(?!\.)
...except that for some unknown reason VBScript Regex (and by extension VBA) doesn't support lookbehind.
Is there some workaround that allows me to affirm that the word boundary at the start of the match is not a period without including it in the match?
Perhaps you don't need a look behind. If you are able to extract specific capture groups instead of the entire match then you can use:
(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)
Will match:
2.5
54.5
+3.45
-0.5
Won't match:
1.2.3
3.6.
.3.5
Capture group 1 will output the whole number and sign
Capture group 2 will output the number without the sign
Capture group 3 will output the fraction (like capture group 1 in your original expression)
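The expression can be tried outside VBScript as well. The following is a hedged sketch using Python's standard re module, not a VBScript test; the helper name find_number is made up. It reads capture group 1 instead of the whole match, since the whole match may include the character consumed by [^.].

```python
import re

# The answer's pattern: a non-period character (or start of input) stands in
# for the negative lookbehind, and the number itself is read from group 1.
NUMBER = re.compile(r"(?:[^.]|^)\b([+-]?([0-9]+(\.[0-9]+)))\b(?!\.)")

def find_number(text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    match = NUMBER.search(text)
    return match.group(1) if match else None

for candidate in ["2.5", "54.5", "1.2.3", "3.6.", ".3.5"]:
    print(candidate, "->", find_number(candidate))
```

One thing worth checking in the target engine: in this Python run, because \b sits before [+-]?, a sign at the very start of the input (as in -0.5) is consumed by [^.] and group 1 comes back as 0.5, so the sign may need separate handling.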