Regex with optional, lazy, greedy group - regex

Let's take this source string from a word document:
A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF
sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK
"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk
I'm looking for a pattern composed of the literal text "SOURCE: " followed by a 1-digit number, a space, and a 2-digit number.
For example, in the first line of the source string, I want to find "SOURCE: 3 55".
Now, some clever boffin has decided to embed a hyperlink for the 1 digit number and another hyperlink for the 2 digit number. Lines 2 and 3 show the two embedded hyperlinks. MATCH1 refers to the first embedded hyperlink, MATCH2 is the second, and so on. I have no way of knowing how many hyperlinks will be placed before these, so one can't assume MATCH9 and MATCH10.
The text I want to extract is the "3 55" portion. I want to put it into a named group I'll call "KeepMe".
I don't mind using two different patterns, one for the hyperlink and one without.
Here's a pattern that works for the non-hyperlinked text:
SOURCE:\s+(?<KeepMe>\d*\s+\d*)
I get "3 55" in the KeepMe group just like I want.
I haven't been able to keep the hyperlink match pattern from being greedy.
Here's a failed regex pattern, (one of many):
SOURCE:\s+(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe1>\d*)\s+
(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe2>\d*)
In the above pattern, I'm trying to say:
Look for the literal SOURCE: followed by one or more spaces.
Then, optionally look for the literal text "HYPERLINK" followed by some characters, followed by the literal text MATCH, followed by some digits and a double quote character in a lazy, non-greedy manner, followed by one or more spaces, followed by some digits I want to keep. Then, do another HYPERLINK pattern match like we just did and keep the digits after that, too.
Remember, in both cases, I want to extract "3 55". It can be extracted in one or two pieces though one would be best.
Any ideas???

This should do the trick:
\bSOURCE:\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe1>\d+)\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe2>\d+)\b
The main difference is that I replaced the .* between HYPERLINK and MATCH with a negated character class, "[^"]*", which cannot run past the closing quote of the link target.
Fiddle: https://regex101.com/r/yE3fP4/1
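As a rough check, the pattern above can be run against the sample text from the question. This is a minimal Python sketch, not the original poster's code: Python's `re` module spells named groups `(?P<name>...)`, and the names `PATTERN`, `TEXT` and `extract_pairs` are made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
PATTERN = re.compile(
    r'\bSOURCE:\s+'
    r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe1>\d+)\s+'
    r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe2>\d+)\b'
)

# Sample text from the question, including the line break inside the
# second hyperlink.
TEXT = (
    'A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF\n'
    'sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK\n'
    '"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk'
)


def extract_pairs(text):
    """Rebuild 'd dd' from the two KeepMe groups of every match."""
    return [
        m.group('KeepMe1') + ' ' + m.group('KeepMe2')
        for m in PATTERN.finditer(text)
    ]


print(extract_pairs(TEXT))
```

Both the plain and the hyperlinked occurrence should come back as "3 55", since the optional non-capturing groups swallow the hyperlink markup when it is present.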

A Regex that works for just the hyperlinked case is:
/(?<SourceToken>SOURCE:)              # Start with a source tag
\s+                                   # Followed by whitespace
(?<HyperlinkMatchGroup>               # Save the hyperlink & match combo.
  (?<Hyperlink>                       # Save the hyperlink (to be discarded)
    (?<HyperlinkToken>HYPERLINK\s+)   # Hyperlinks start with the literal tag "HYPERLINK"
    (?<HyperlinkText>".*?")           # Hyperlink text contained in quotes, non-greedy
    \s*                               # Followed by whitespace
  )*                                  # Repeating any number of times
  (?<MatchToken>"MATCH\d*")           # Followed by a literal tag "MATCH" and a digit string
  \s*                                 # Followed by whitespace
  (?<KeepMe>\d+)                      # Finally, the match, which is just a series of digits
  \s*                                 # Followed by whitespace
)+                                    # The whole hyperlink & match pair must occur at least once
/x
It may or may not cover all your cases; I haven't spent much time digging into it.

Related

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even when there is no . (dot) sign it does not work. I am not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
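Outside Notepad++, the same idea can be sketched in Python. Note the swap: the standard `re` module supports neither `\K` nor `\h`, so this sketch (an assumption, not the answer's exact regex) uses a capture group in place of `\K` and `[ \t]` in place of `\h`. The names `LAST_QUOTED` and `drop_last_quoted` are illustrative only.

```python
import re

# Group 1 plays the role of "everything before \K"; [ \t] stands in for \h.
LAST_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)


def drop_last_quoted(text):
    """Remove the last quoted field (and the blanks before it) on each line."""
    return LAST_QUOTED.sub(r'\1', text)


records = (
    '201910031044 "00059" "11.31AG" "Senior Champion"\n'
    '201910031044 "00999" "10.12G" "ProAM"'
)
print(drop_last_quoted(records))
```

Because `.+` is greedy, group 1 extends as far as possible, so only the final quoted field on each line is dropped.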
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character-count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
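To see the effect of the quantifier fixes, here is a small Python sketch that runs the question's original pattern and the corrected one against a sample record. Python's `re` is only a stand-in for the Notepad++ engine here, and the names `ORIGINAL`, `CORRECTED` and `keep_first_three` are illustrative.

```python
import re

RECORD = '201910031044 "00059" "11.31AG" "Senior Champion"'

# Pattern from the question: wrong digit counts, single-character last field.
ORIGINAL = r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$'
# Pattern from this answer: 12 digits, 5 digits, 1+ letters/digits/dots.
CORRECTED = r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$'


def keep_first_three(pattern, line):
    """Replace the whole line with group 1; return it unchanged if no match."""
    return re.sub(pattern, r'\1', line)


print(keep_first_three(ORIGINAL, RECORD))
print(keep_first_three(CORRECTED, RECORD))
```

The original pattern never matches, so the record is left untouched, while the corrected pattern keeps only the first three fields.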
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match the last quoted value: 1+ characters other than " or a newline, between "
In the replacement use group 1 $1

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
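The same find/replace can be sketched in Python for anyone scripting this instead of using Notepad++. The function name `split_name_version` is illustrative, and `\1`/`\2` is simply Python's spelling of `$1`/`$2`.

```python
import re

NAMES = '\n'.join([
    'andal-4.1.0.jar',
    'besc_2.1.0-beta',
    'prov-3.0.jar',
    'add4lib-1.0.jar',
    'com_lab_2.0.jar',
    'astrix',
    'lis-2_0_1.jar',
])


def split_name_version(text):
    """Replace the first '-' or '_' that is followed by a digit with a tab."""
    return re.sub(r'^(.+?)[-_](\d.*)$', r'\1\t\2', text, flags=re.MULTILINE)


for line in split_name_version(NAMES).split('\n'):
    print(line)
```

Lines without a version (such as astrix) do not match and are left as they are, so the rows stay in their original order and paste cleanly into two spreadsheet columns.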
I'm not quite sure what problem you meant with the large file, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one the version.
Anyway, here are the assumptions I had to make about what would make sense to you:
"Name" may be followed by - or _, followed by the version string.
"Version" string is something preceded by - or _, starting with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^                start of line
(.+?)            "Name" part, using reluctant match of at least 1 character
(?:        )?    Optional group of "Version String", which consists of:
  [-_]           - or _
  (      )       Followed by the "Version", which is
    \d+          at least 1 digit,
    [._]         then 1 dot or underscore,
    \d+          then at least 1 digit,
    .*           then any string
$                end of line
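Because this regex matches every line (the version part is optional), it lends itself to building name/version rows directly. A minimal Python sketch, with `ROW` and `parse_row` as made-up names:

```python
import re

ROW = re.compile(r'^(.+?)(?:[-_](\d+[._]\d+.*))?$')


def parse_row(line):
    """Return (name, version); version is '' when the line has none."""
    m = ROW.match(line)
    if m is None:  # only happens for an empty line
        return (line, '')
    return (m.group(1), m.group(2) or '')


for entry in ['andal-4.1.0.jar', 'com_lab_2.0.jar', 'astrix', 'lis-2_0_1.jar']:
    print(parse_row(entry))
```

Keeping name and version in one tuple per input line avoids the ordering problem described in the question, since both values come from the same match.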

How to exclude a word from regex subpattern?

I am using Delphi 7 and TDIPerlRegEx. I am looking for verbs in parts of sentence which contain some specific part to identify the verb.
s1 := '(I|you|he|she|it|we|they|this|that|these|those)';
s2 := '(can|should|would|could|must|want to|have to|had to|might)';
RegEx_Seek_1.MatchPattern := '(*UCP)(?m) \b'+s1+'\b \b'+s2+'\b \K([^ß\W]\w{2,15})\b';
The key word that is wrongly included in the result is "not"; it should be excluded:
Sample text:
... that you should not ßeat of every ...
Verb like this should be included in result:
Sample text:
lest he should put forth his hand ...
Now let me explain the ß sign. It marks that the original text had the word "not" immediately before the verb. I inserted it in an earlier pass, so the source text I am working with now looks like the samples above. The pattern ([^ß\W]\w{2,15}) is meant to skip a word used in a negative sense, which is why a "negated" verb must not be included.
So the point of the question is how to exclude the word "not" from the captured text, that is, the text captured by this pattern, which is either ([^ß\W]\w{2,15}) or (\w{3,15}).
I am using this pattern to replace substrings in text.
More sample text needed?
than I can bear. And
so I might have taken her
they might dwell together
they could not ßdwell together
lest you should say,
In group 3 I expect matches for: bear, taken (or possibly have instead of taken), dwell and say.
I am trying to exclude the word "not", so any verb or word following "not" must be excluded from the 3rd group or from the match completely. I am interested in group 3 only. Groups 1 and 2 just specify the alternatives preceding the verb.
You may use a branch reset group to match an empty string if "not" follows the modal verb as a whole word, or a notional verb otherwise:
\b(I|you|he|she|it|we|they|this|that|these|those)\s+(can|should|would|could|must|want to|have to|had to|might)\s+\K(?|(?=not\b)()|([^ß\W]\w{2,15})\b)
Details
\b - a word boundary
(I|you|he|she|it|we|they|this|that|these|those) - one of the pronouns in the group 1
\s+ - 1+ whitespaces (it is already acting as a word boundary on both sides of the adjacent groups)
(can|should|would|could|must|want to|have to|had to|might) - one of the modal verbs
\s+ - 1+ whitespaces
\K - match reset operator
(?|(?=not\b)()|([^ß\W]\w{2,15})\b) - the branch reset group matching either
(?=not\b)() - if "not" appears as a whole word immediately to the right, capture an empty string into Group 3
| - or (here, else)
([^ß\W]\w{2,15})\b - match and capture into Group 3 any word char other than ß and then 2 to 15 word chars with a word boundary to follow.
Note that (?m) - PCRE_MULTILINE - is only necessary if you want your ^ and $ outside of character classes match start and end of lines rather than the whole string. Since your pattern has no such anchors, (?m) is redundant.
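For experimenting outside Delphi, here is a Python sketch of a simpler variant. Note the swap: Python's standard `re` module has neither `\K` nor branch reset groups, so this sketch uses a negative lookahead `(?!not\b)` instead, which drops the whole match when "not" follows the modal verb (the question allows that outcome). The names `VERB`, `SAMPLES` and `find_verbs` are illustrative.

```python
import re

PRONOUNS = r'(I|you|he|she|it|we|they|this|that|these|those)'
MODALS = r'(can|should|would|could|must|want to|have to|had to|might)'

# (?!not\b) rejects the match when the next word is "not".
VERB = re.compile(
    r'\b' + PRONOUNS + r'\s+' + MODALS + r'\s+(?!not\b)([^ß\W]\w{2,15})\b'
)

SAMPLES = [
    '... that you should not ßeat of every ...',
    'lest he should put forth his hand ...',
    'than I can bear. And',
    'so I might have taken her',
    'they might dwell together',
    'they could not ßdwell together',
    'lest you should say,',
]


def find_verbs(lines):
    """Collect group 3 (the word after the modal verb) from every match."""
    return [m.group(3) for line in lines for m in VERB.finditer(line)]


print(find_verbs(SAMPLES))
```

With the branch reset version above, the negated lines still match but yield an empty Group 3; with the lookahead they produce no match at all, which may be easier to handle when the pattern drives a replacement.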

Regex capture group that excludes optional substring?

I'm trying to construct a regex to extract Swedish organization numbers from data. These numbers can be of the following formats:
999999999999 // 12 digits, first two should be ignored.
9999999999 // 10 digits, all should be included.
99999999-9999 // 12 digits with a dash, first two digits and the dash should be ignored
999999-9999 // 10 digits with a dash, dash should be ignored.
For the 12 digit cases, the first two digits are always 16, 19 or 20. My current attempt is:
(?:16|19|20)?(\d{6}\-?\d{4})
This will return a ten digit organization number in $1, but it will contain the dash if it's present. I want the dash to be stripped (or possibly added if it's missing), so that $1 has the same format regardless of dash or no dash in the input.
The regex is in a config and will be used in code that simply extracts $1, so I can't solve this in code - I need the regex to do it "by itself".
As a last resort, I could modify the code to allow config to specify a "replace string" in addition to the search regex, and have the code use the result of the replace as the end result of the extraction. In that case I could use this:
Regex: (?:16|19|20)?(\d{6})\-?(\d{4})
Replace string: $1$2
But this causes other problems, because for other config items, the regex will return multiple "data fields", one for each capture group. To get this to work I would need, in that case, to provide a sequence of replace strings, e.g. for a tab separated format with organization number in the middle:
Regex: ([^\t]*)\t(?:16|19|20)?(\d{6})\-?(\d{4})\t([\d]*)
Replace string 1: $1 (free text field)
Replace string 2: $2-$3 (the organization number with dash "enforced")
Replace string 3: $4 (numeric field)
Workable, but rather awkward... So, any way to solve it within the search regex?
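For reference, the "replace string" fallback described above can be sketched in Python as follows. This illustrates the workaround only and is not a regex-only solution. Python writes `$1$2` as `\1\2`, and the names `ORG`, `normalize` and the sample numbers are hypothetical.

```python
import re

# Search regex from the question (the dash needs no escaping here).
ORG = re.compile(r'(?:16|19|20)?(\d{6})-?(\d{4})')


def normalize(raw, template=r'\1\2'):
    """Apply the replace string to a full match; None if the input is not a number."""
    m = ORG.fullmatch(raw)
    return m.expand(template) if m else None


for raw in ['165560123456', '5560123456', '16556012-3456', '556012-3456']:
    print(raw, '->', normalize(raw), '/', normalize(raw, r'\1-\2'))
```

Passing `r'\1-\2'` as the template gives the variant with the dash enforced, which corresponds to "Replace string 2" in the tab-separated example.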

Regular Expression to parse whitespace-delimited data

I have written code to pull some data into a data table and do some data re-formatting. I need some help splitting some text into appropriate columns.
CASE 1
I have data formatted like this that I need to split into 2 columns.
ABCDEFGS 0298 MSD
SDFKLJSDDSFWW 0298 RFD
I need the text before the numbers in column 1 and the numbers and text after the spaces in column 2. The number of spaces between the text and the numbers will vary.
CASE 2
I have data like this that I need split into 3 columns.
00006011731 TAB FC 10MG 30UOU
00006011754 TAB FC 10MG 90UOU
00006027531 TAB CHEW 5MG 30UOU
00006071131 TAB CHEW 4MG 30UOU
00006027554 TAB CHEW 5MG 90UO
00006384130 GRAN PKT 4MG 30UOU
Column 1 is the first 11 characters. That is easy.
Column 2 should contain all the text after the first 11 characters up to but not including the first number.
The last column is all the text after column 2.
I would do it with these expressions:
(?-s)(\S+) +(.+)
and
(?-s)(.{11})(\D+)(.+)
And broken down in regex comment mode, those are:
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
\S+ # any non-space character, greedily matched at least once.
) # end first capturing group
[ ]+ # a space character, greedily matched at least once. (brackets required in comment mode)
( # start second capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end second capturing group
and
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
.{11} # any character (excluding newlines), exactly 11 times.
) # end first capturing group
( # start second capturing group
\D+ # any non-digit character, greedily matched at least once.
) # end second capturing group
( # start third capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end third capturing group
(The 'dotall' mode (flag s) means that . matches all characters, including newlines, so we have to disable it to prevent too much matching in the last group.)
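A quick way to try both expressions is the Python sketch below (the question targets .NET, so this is only a stand-in). Python does not accept a global `(?-s)`, but dotall is off by default there, so the flag is simply dropped. The function names are illustrative, and the spaces captured around column 2 are stripped in code.

```python
import re

CASE1 = re.compile(r'(\S+) +(.+)')
CASE2 = re.compile(r'(.{11})(\D+)(.+)')


def split_case1(line):
    """Text before the spaces, then everything after them."""
    m = CASE1.match(line)
    return m.groups() if m else None


def split_case2(line):
    """First 11 characters, text up to the first digit, and the rest."""
    m = CASE2.match(line)
    return tuple(part.strip() for part in m.groups()) if m else None


print(split_case1('SDFKLJSDDSFWW      0298 RFD'))
print(split_case2('00006027531 TAB CHEW 5MG 30UOU'))
```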
Supposing you know how to handle the VB.NET code to get the groupings (matches), and that you are willing to strip the extra spaces from the groupings yourself:
The Regex for case 1 is
(.*?\s+)(\d+.*)
.*? => grabs everything non greedily, so it will stop at the first space
\s+ => one or more whitespace characters
These two form the first group.
\d+ => one or more digits
.* => rest of the line
These two form the second group.
The Regex for case 2 is
(.{11})(.*?)(\d.*)
.{11} => matches 11 characters (you could restrict it to be just letters
and numbers with [a-zA-Z] or \d instead of .)
That's the first group.
.*? => Match everything non greedily, stop before the first
digit found (because that's the next regex)
That's the second group.
\d.* => a digit (used to stop the previous .*?) and the rest of the line
That's the third group.
I would use Peter Boughton's regexes, but ensure you have . matches newline turned off. If that is on, ensure you add a $ on the end :)
The greedy regexes will perform better.
The simplest way for the kind of data you are presenting is to split the line into fields at the spaces, then reunite what you want to have together. Regex.Split(line, "\s+") should return an array of strings (in C# the pattern is written "\\s+" or @"\s+", because the backslash must be escaped in ordinary string literals; VB.NET strings have no backslash escapes). This is also more robust against changing strings in the fields, for example if in the second case a line reads "00006011731 TAB 3FC 10MG 30UOU".
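The split-and-rejoin idea can be sketched in Python like this. It assumes (and this is only an assumption drawn from the sample rows) that column 2 always consists of exactly two whitespace-separated tokens. The name `split_fields` is illustrative.

```python
import re


def split_fields(line):
    """Split on runs of whitespace, then regroup into three columns.

    Assumption: token 0 is column 1, tokens 1-2 form column 2,
    and everything from token 3 on is column 3.
    """
    fields = re.split(r'\s+', line.strip())
    return fields[0], ' '.join(fields[1:3]), ' '.join(fields[3:])


print(split_fields('00006011731 TAB FC 10MG 30UOU'))
print(split_fields('00006011731 TAB 3FC 10MG 30UOU'))
```

Unlike the digit-based patterns, this keeps "3FC" in column 2, because the columns are decided by token position and not by where the first digit appears.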