Capture groups in MS Word regex - regex

I am trying to remove the line break before "n=", replace it with a space, and wrap the captured number in parentheses, all in MS Word's advanced Find and Replace with wildcards enabled.
Currently:
some preceding text
n=1,233,023
Desired result:
some preceding text (1,233,023)
I've been struggling with ^13n=(*{1,})
and replace with " (\1)" (without the quotes)
but it can't even match it.
Any help would be appreciated.
Thank you

MS Word has its own peculiar take on regular expressions. The following steps were successful for me (my copy of Word is in Dutch, so please forgive any small translation errors):
Hit Ctrl+H to open Find and Replace.
Tick More, then tick Use wildcards.
Now with this done we can search for:
^13(n=[,0-9]{1,})
^13 - Match the paragraph mark (the line break).
( - Open capture group 1.
n= - Match "n=" literally.
[,0-9]{1,} - Match a digit or comma one or more times.
) - Close capture group 1.
Replace by:
^s\1
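^s\1 - A non-breaking space (^s) followed by capture group 1. Because group 1 here includes n=, the result keeps "n="; to get " (1,233,023)" instead, capturing only the digits with ^13n=([,0-9]{1,}) and replacing with a space followed by (\1) should work.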
As mentioned, I consider the flavor of regular expressions Word offers dodgy, and its flaws are documented elsewhere too. I could not nest a capture group inside another capture group, nor could I create optional blocks of three consecutive digits and commas. Fortunately, as in your own attempt, matching a line break followed by a literal n= seemed to be enough.
Second to last note: because I'm Dutch, my local list separator is the semicolon. This carries over to the occurrence indicators in this search and replace function, meaning I actually used: ^13(n=[,0-9]{1;})
One last note: another pattern that worked for me was ^13(n=*^13), but since it gives no control over what sits between n= and the paragraph end, I would stick with my initial pattern. The * works here because it stands for an actual sequence of any characters between n= and ^13.

The wildcard search term should be
(^13)([a-z])(=)([,0-9]{1,})
and the replacement is
" (\4)" (without the quotes)
Note that the first character of the replacement is a space.
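To check the intended transformation outside Word, here is a minimal sketch in Python. Python's re module is not Word's wildcard engine, so this only mirrors the logic of the answers above (a line break, a literal n=, then digits and commas). The function name join_count is made up for illustration.

```python
import re


def join_count(text: str) -> str:
    """Fold a following 'n=<number>' line into the previous line as ' (<number>)'."""
    # (?:\r\n|\r|\n) stands in for Word's ^13; only the digits and commas are captured.
    return re.sub(r"(?:\r\n|\r|\n)n=([0-9,]+)", r" (\1)", text)


sample = "some preceding text\nn=1,233,023"
print(join_count(sample))  # some preceding text (1,233,023)
```

Capturing only the number, rather than n= plus the number, is what lets the replacement drop the "n=" prefix.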

Related

Regex: grab the string that begins after a certain string and ends when it hits any other character

I am trying to use Regex to grab a substring of a large string.
The overall string has certain text, 'cow/', then any number of characters or spaces that are not digits. The first digit hit is the start of the desired substring I want.
This desired substring consists of only digits and periods, the first character or space seen that is not a digit or period indicates the end of the desired substring.
For example:
'cow/ a12.34 -123'
The desired substring is '12.34'.
So far I have this regex that partially works (I think the '| .' is not entirely correct):
(?<=([A-z]|[0-9])/\s*).?(?=\s[^0-9 |.])
Thanks in advance.
This should be easy to achieve by relying on capturing groups:
cow/[^0-9]*([0-9.]+)
The group will contain the text that you want to extract: in Java, read it with group(index); in C#, with Groups[index]. Other languages provide similar features.
Don't try to solve everything inside the regular expression, but leverage the power of your runtime :)
Edit after comment on the OP:
Azure Kusto has the extract(regex, captureGroup, text [, typeLiteral]) function to extract groups from regular expression matches:
extract("cow/[^0-9]*([0-9.]+)", 1, "cow/ a12.34 -123") == "12.34";
The argument 1 tells Kusto to extract the first capturing group (the expression inside the parentheses).
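For comparison, the same capture-group idea can be sketched in Python. This is only an illustration of the pattern from the answer; the helper name extract_number is an assumption and not part of Kusto or any library.

```python
import re
from typing import Optional


def extract_number(text: str) -> Optional[str]:
    """Return the first run of digits/periods that follows 'cow/', or None."""
    # [^0-9]* skips everything up to the first digit; group 1 is the number itself.
    match = re.search(r"cow/[^0-9]*([0-9.]+)", text)
    return match.group(1) if match else None


print(extract_number("cow/ a12.34 -123"))  # 12.34
```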

Regex - excluding characters

I am trying to create a pattern that skips everything from the word Всего onward and captures only the number 2501,472 at the end of: Всего 191 Короб-шкаф вес БРУТТО 2501,472
Also, I am trying to allow for the word variants [^Короб-шкаф|Коробка], which works fine in another pattern I have created:
([^Всего]?[^\\d]*?[^Короб-шкаф|Коробка]\s*[^вес БРУТТО\\s*] \\d,]*)
Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use this regex, which combines an alternation with a capture group that captures the number appearing after a known pattern:
\\bВсего\\h+\\d+\\h+(?:Короб-шкаф|Коробка)\\h+вес\\h+БРУТТО\\h+([\\d,]+)
Then read capture group #1, which holds the number made up of digits and commas.
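A rough Python equivalent is sketched below. Python's re module has no \h (horizontal whitespace) escape, so [ \t]+ is swapped in as an approximation, and the backslashes are single because a raw string is used. The name gross_weight is hypothetical.

```python
import re
from typing import Optional

# Same structure as the answer's pattern, with [ \t]+ standing in for \h+.
PATTERN = re.compile(
    r"\bВсего[ \t]+\d+[ \t]+(?:Короб-шкаф|Коробка)[ \t]+вес[ \t]+БРУТТО[ \t]+([\d,]+)"
)


def gross_weight(line: str) -> Optional[str]:
    """Return the number after 'вес БРУТТО' as a string, or None if absent."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


print(gross_weight("Всего 191 Короб-шкаф вес БРУТТО 2501,472"))  # 2501,472
```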

How would I match all data between 2 symbols with Regex?

I'm trying to match all data from a dash (-) onward, including the dash itself, but only up to the first delimiter, which is a colon.
Example data:
Input:
bart23-testaccount#test.test:Test:Test:Test
Desired output:
bart23:Test:Test:Test
I've done some research and found this regex, but it's not fit for purpose: -(.*):
I need this for thousands of lines that come in various orders, but the goal is always the same: highlight all text between the - and the first : (which I will then delete). I will be using Notepad++.
I can answer any questions or make my post more specific if need be, it's kind of hard to explain.
In Notepad++ you can use regex find/replace. Look for:
^([^-]+)-[^:]+(:.*)$
which captures everything up to the first - in group 1, and everything after (and including) the first : in group 2, and replace with
\1\2
Using Notepad++, without any capture group:
Ctrl+H
Find what: -[^:]+
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
- # a hyphen (by default, the first one in a line)
[^:]+ # 1 or more not colon
Result for given example:
bart23:Test:Test:Test
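Both Notepad++ answers can be tried out with a short Python sketch. Python's re module is a different engine from the one in Notepad++, so treat this as a check of the patterns' logic only. The function names are made up for illustration.

```python
import re


def drop_with_groups(line: str) -> str:
    """Keep the text before the first '-' and from the first ':' onward."""
    return re.sub(r"^([^-]+)-[^:]+(:.*)$", r"\1\2", line)


def drop_without_groups(line: str) -> str:
    """Delete the first run that starts at '-' and stops before ':'."""
    # count=1 limits the deletion to the first such run in the line.
    return re.sub(r"-[^:]+", "", line, count=1)


line = "bart23-testaccount#test.test:Test:Test:Test"
print(drop_with_groups(line))     # bart23:Test:Test:Test
print(drop_without_groups(line))  # bart23:Test:Test:Test
```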

Regular Expression Replace Time Value between Date-Time Format

I have an XML file with date-time formats looking like this:
<published>2019-01-03T23:54:00.000+10:00</published>
and this
<published>2019-01-07T14:22:00.001+10:00</published>
and so on, where the time value is 23:54:00.000 and 14:22:00.001.
How do I replace just the time value between the <published></published> tags with regular expressions? For example, I want to replace both time values with 03:00:00.000 so the first example becomes
<published>2019-01-03T03:00:00.000+10:00</published>
I would like to use an existing tool or app such as Notepad++, or a website, since that is much faster, rather than any specific programming language.
First, the obligatory warning to not try to parse xml/html with regex. It's fine if this is a once-off reformatting task and you have control over the data. A regex solution will not be very robust...
That out of the way, you will need a tool that can handle capture groups with regex, so you can match on the whole published tag and avoid false positives. A regex like this might do the trick (adjust the capture grouping as appropriate for your tool):
(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)
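Note that the above is a regex in PCRE format (demo on regex101). You may need to adjust it to suit the format your tool uses; for example, in Notepad++ (which uses the Boost engine) \< and \> are word-boundary anchors, so write < and > unescaped there.
In this regex, there are two capture groups, one before and one after the time you want to replace. An example string that you could use in the replace field of your chosen tool would be: \103:00:00.000\2 (using \1 syntax for backreferences). If your tool misreads \10 or \103 as something other than group 1 followed by digits, use an unambiguous form such as ${1}03:00:00.000${2} where that is supported.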
Try this regex:
(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)
Replace each match with \103:00:00.000\2 i.e. Group 1 contents followed by 03:00:00.000 followed by Group 2 contents.
Explanation:
(<published>\d{4}(?:-\d{2}){2}T) - matches <published> followed by 4 digits followed by - followed by 2 digits followed by - followed by 2 digits followed by the letter T. This sub-match is captured in Group 1
\d{2}(?::\d{2}){2}\.\d{3} - matches time of the format XX:XX:XX.XXX where X is a digit.
([^<]*<\/published>) - matches 0+ occurrences of any character that is not a < followed by </published>. This sub-match is captured in Group 2.
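Although the question asks for a tool rather than a programming language, a short Python sketch makes it easy to verify the pattern and shows one pitfall with the replacement string. The function name set_time and its default argument are assumptions for illustration.

```python
import re

# Group 1: opening tag, date and 'T'. Group 2: UTC offset and closing tag.
PUBLISHED_TIME = re.compile(
    r"(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*</published>)"
)


def set_time(xml_text: str, new_time: str = "03:00:00.000") -> str:
    """Replace the time part of every <published> timestamp with new_time."""
    # \g<1> is used because Python would not parse "\103" as "group 1, then 03".
    return PUBLISHED_TIME.sub(r"\g<1>" + new_time + r"\g<2>", xml_text)


print(set_time("<published>2019-01-03T23:54:00.000+10:00</published>"))
# <published>2019-01-03T03:00:00.000+10:00</published>
```

The explicit \g<1> form sidesteps the ambiguity that arises whenever a group reference is immediately followed by digits.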

How to match everything up to the second occurrence of a character?

So my string looks like this:
Basic information, advanced information, super information, no information
I would like to capture everything up to second comma so I get:
Basic information, advanced information
What would be the regex for that?
I tried: (.*,.*), but I get
Basic information, advanced information, super information,
This will capture up to but not including the second comma:
[^,]*,[^,]*
English translation:
[^,]* = as many non-comma characters as possible
, = a comma
[^,]* = as many non-comma characters as possible
[...] is a character class. [abc] means "a or b or c", and [^abc] means anything but a or b or c.
You could try ^(.*?,.*?),
The problem is that .* is greedy and matches as many characters as possible. The ? after the * changes the behaviour to non-greedy.
You could also put parentheses around each .*? segment to capture the strings separately if you want.
I would take a DRY approach, like this:
^([^,]*,){1}[^,]*
This way you can match everything up to the nth occurrence of a character without repeating yourself, except for the final pattern.
In the original poster's case the group and its repetition are unnecessary, but I think this will help others who need to match the pattern more than twice.
Explanation:
^ From the start of the line.
([^,]*,) Create a group matching every non-comma character until it meets a comma, plus that comma.
{1} Repeat the group (the occurrence number you need) - 1 times. So for the 2nd occurrence put 1; for the 20th put 19.
[^,]* Repeat the pattern one last time without the trailing comma.
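The repetition idea can be wrapped in a small helper. The following is a sketch in Python under the assumption that the separator is a single character; the name up_to_nth is hypothetical, and a non-capturing group is used because the captured value is not needed.

```python
import re


def up_to_nth(text: str, sep: str, n: int) -> str:
    """Return text up to, but not including, the n-th occurrence of sep."""
    if n < 1 or len(sep) != 1:
        raise ValueError("n must be >= 1 and sep must be a single character")
    s = re.escape(sep)
    # (n - 1) repetitions of "non-separators + separator", then one last run.
    pattern = rf"^(?:[^{s}]*{s}){{{n - 1}}}[^{s}]*"
    match = re.match(pattern, text)
    # With fewer than n - 1 separators there is no match; return the text as is.
    return match.group(0) if match else text


line = "Basic information, advanced information, super information, no information"
print(up_to_nth(line, ",", 2))  # Basic information, advanced information
```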
Try this approach:
(.*?,.*?),.*