Regex capture group that excludes optional substring?

I'm trying to construct a regex to extract Swedish organization numbers from data. These numbers can be of the following formats:
999999999999 // 12 digits, first two should be ignored.
9999999999 // 10 digits, all should be included.
99999999-9999 // 12 digits with a dash, first two digits and the dash should be ignored
999999-9999 // 10 digits with a dash, dash should be ignored.
For the 12 digit cases, the first two digits are always 16, 19 or 20. My current attempt is:
(?:16|19|20)?(\d{6}\-?\d{4})
This will return a ten digit organization number in $1, but it will contain the dash if it's present. I want the dash to be stripped (or possibly added if it's missing), so that $1 has the same format regardless of dash or no dash in the input.
The regex is in a config and will be used in code that simply extracts $1, so I can't solve this in code - I need the regex to do it "by itself".
As a last resort, I could modify the code to allow config to specify a "replace string" in addition to the search regex, and have the code use the result of the replace as the end result of the extraction. In that case I could use this:
Regex: (?:16|19|20)?(\d{6})\-?(\d{4})
Replace string: $1$2
But this causes other problems, because for other config items, the regex will return multiple "data fields", one for each capture group. To get this to work I would need, in that case, to provide a sequence of replace strings, e.g. for a tab separated format with organization number in the middle:
Regex: ([^\t]*)\t(?:16|19|20)?(\d{6})\-?(\d{4})\t([\d]*)
Replace string 1: $1 (free text field)
Replace string 2: $2-$3 (the organization number with dash "enforced")
Replace string 3: $4 (numeric field)
Workable, but rather awkward... So, any way to solve it within the search regex?
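A minimal sketch of the "sequence of replace strings" fallback described above, written in Python purely for illustration. The helper name extract_fields and the sample line are hypothetical, and Python templates use \1 where the question writes $1.

```python
import re

# Pattern from the question: free text, organization number, numeric field.
PATTERN = r"([^\t]*)\t(?:16|19|20)?(\d{6})-?(\d{4})\t(\d*)"

# One template per output field; "\2-\3" enforces the dash.
TEMPLATES = [r"\1", r"\2-\3", r"\4"]


def extract_fields(pattern, templates, line):
    """Return one string per template, or None if the line does not match."""
    m = re.search(pattern, line)
    if m is None:
        return None
    # Match.expand substitutes group references into each template.
    return [m.expand(t) for t in templates]


print(extract_fields(PATTERN, TEMPLATES, "Acme AB\t16556677-8899\t42"))
print(extract_fields(PATTERN, TEMPLATES, "Acme AB\t5566778899\t42"))
```

Both calls yield the organization number in the same dashed form, which is the normalization the question asks for, at the cost of carrying a template list in the config.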

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN: OK THT PHP This is it 06222021
OUT: This is it
IN: NO MTM PYT Get this content 111111
OUT: Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match three runs of word characters (each with its own quantifier), each followed by a whitespace char
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it
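The same single-group replacement can be sketched in Python, as a rough equivalent of the VB.Net call above. The function name keep_middle is made up for this illustration; the pattern is the one from the answer.

```python
import re

# Pattern from the answer: three short words, the text to keep, 6-8 digits.
PATTERN = re.compile(r"^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$")


def keep_middle(line):
    """Replace the whole match with capture group 1."""
    # Lines that do not match the pattern are returned unchanged.
    return PATTERN.sub(r"\1", line)


print(keep_middle("OK THT PHP This is it 06222021"))
print(keep_middle("NO MTM PYT Get this content 111111"))
```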

My approach does not work with groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;
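The variable-width lookbehind above relies on the .NET engine. As a hedged side note for other engines: Python's standard re module only accepts fixed-width lookbehind, so a sketch there has to consume the prefix and capture the middle part, keeping only the lookahead. The function name middle_text is hypothetical.

```python
import re

# The .NET pattern is rejected by Python's re: the lookbehind is not fixed-width.
try:
    re.compile(r"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Portable variant: match the three words normally, capture the middle,
# and keep the lookahead for the trailing digits.
MIDDLE = re.compile(r"^(?:\w+\s){3}(.*?)(?=\s\d+\s?$)")


def middle_text(line):
    """Return the text between the three leading words and the trailing number."""
    m = MIDDLE.match(line)
    return m.group(1) if m else None


print(lookbehind_ok)
print(middle_text("OK THT PHP This is it 06222021"))
```

This gives up the "match itself is the result" property and goes back to reading group 1, which is the trade-off when lookbehind support is limited.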

Exclude a combination of characters with regex or add a letter

I'm trying to adjust KODI's search filter with regex so the scrapers recognize tv shows from their original file names.
They either come in this pattern:
"TV show name S04E01 some extra info" or this "TV show name 01 some extra info"
The first is not recognized, because "S04" scrambles the search in a number of ways, this needs to go.
The second is not recognized, because it needs an 'e' before numbers, otherwise, it won't be recognized as an episode number.
So I see two approaches.
Make the filter ignore s01-99
Prepend an 'e' to any freestanding two-digit number, but I don't know if regex can even do that.
I have no experience in the regex, but I've been playing around coming up with this, which unsurprisingly doesn't do the trick
^(?!s{00,99})\d{2}$

You may either find \b([0-9]{2})\b regex matches and replace with E$1, or match \bs(0[1-9]|[1-9][0-9])\b pattern in an ignore filter.
Details
\b([0-9]{2})\b - matches and captures into Group 1 any two digits that are not enclosed with letters, digits and _. The E$1 replacement means that the matched text (two digits) is replaced with itself (since $1 refers to the Group 1 value) with E prepended to the value.
\bs(0[1-9]|[1-9][0-9])\b - matches an s followed with number between 01 and 99 because (0[1-9]|[1-9][0-9]) is a capturing group matching either 0 and then any digit from 1 to 9 ([1-9]), or (|) any digit from 1 to 9 ([1-9]) and then any digit ([0-9]).
NOTE: If you need to generate a number range regex, you may use this JSFiddle of mine.
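To see how the two patterns behave on the file names from the question, here is a small Python sketch. The function name and the case-insensitive flag are assumptions added for the illustration (the answer writes a lowercase s, the sample names use S); KODI's own filter syntax may differ.

```python
import re

# Patterns from the answer; IGNORECASE is an assumption for uppercase "S04".
EPISODE = re.compile(r"\b([0-9]{2})\b")
SEASON = re.compile(r"\bs(0[1-9]|[1-9][0-9])\b", re.IGNORECASE)


def add_episode_prefix(name):
    """Prepend E to every freestanding two-digit number."""
    return EPISODE.sub(r"E\1", name)


print(add_episode_prefix("TV show name 01 some extra info"))
print(SEASON.search("TV show name S04 E01"))
print(SEASON.search("TV show name S04E01 some extra info"))
```

Note that the season pattern only matches a freestanding token such as S04. In S04E01 the 4 and the E are both word characters, so the trailing \b does not hold and the pattern finds nothing; dropping that trailing \b would be one way to cover the combined form.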

Regex with optional, lazy, greedy group

Let's take this source string from a word document:
A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF
sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK
"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk
I'm looking for a pattern that is composed of the literal text "SOURCE: " followed by a 1 digit number a space and a 2 digit number.
For example, in the first line of the source string, I want to find "SOURCE: 3 55".
Now, some clever boffin has decided to embed a hyperlink for the 1 digit number and another hyperlink for the 2 digit number. Lines 2 and 3 show the two embedded hyperlinks. MATCH1 refers to the first embedded hyperlink, MATCH2 is the second, and so on. I have no way of knowing how many hyperlinks will be placed before these, so one can't assume MATCH9 and MATCH10.
The text I want to extract is the "3 55" portion. I want to put it into a named group I'll call "KeepMe".
I don't mind using two different patterns, one for the hyperlink and one without.
Here's a pattern that works for the non-hyperlinked text:
SOURCE:\s+(?<KeepMe>\d*\s+\d*)
I get "3 55" in the KeepMe group just like I want.
I haven't been able to keep the hyperlink match pattern from being greedy.
Here's a failed regex pattern, (one of many):
SOURCE:\s+(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe1>\d*)\s+
(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe2>\d*)
In the above pattern, I'm trying to say:
Look for the literal SOURCE: followed by one or more spaces.
Then, optionally look for the literal text HYPERLINK followed by some characters, followed by the literal text MATCH, followed by some digits and a double quote character in a lazy, non-greedy manner, followed by one or more spaces, followed by some digits I want to keep. Then, do another HYPERLINK pattern match like we just did and keep the digits after that, too.
Remember, in both cases, I want to extract "3 55". It can be extracted in one or two pieces though one would be best.
Any ideas???

This should do the trick:
\bSOURCE:\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe1>\d+)\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe2>\d+)\b
Main difference is that I replaced the .* between HYPERLINK and MATCH with something less greedy.
Fiddle: https://regex101.com/r/yE3fP4/1
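A quick way to check the pattern against both sample lines is the following Python sketch. Python spells named groups (?P<name>...) rather than (?<name>...); the function name keep_me is invented for the example.

```python
import re

# Pattern from the answer, with Python's named-group syntax.
SOURCE = re.compile(
    r'\bSOURCE:\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe1>\d+)\s+'
    r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe2>\d+)\b'
)


def keep_me(text):
    """Return the two numbers after SOURCE: joined by a space, or None."""
    m = SOURCE.search(text)
    if m is None:
        return None
    return m.group("KeepMe1") + " " + m.group("KeepMe2")


plain = "A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF"
linked = (
    'sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK\n'
    '"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk'
)
print(keep_me(plain))
print(keep_me(linked))
```

Because [^"]* cannot run past a closing quote, each optional hyperlink block stays confined to its own pair of quoted strings, which is what keeps it from swallowing the digits.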

A Regex that works for just the hyperlinked case is:
/(?<SourceToken>SOURCE:) # Start with a source tag
\s+ # Followed by whitespace
(?<HyperlinkMatchGroup> # Save the hyperlink & match combo.
(?<Hyperlink> # Save the hyperlink (to be discarded)
(?<HyperlinkToken>HYPERLINK\s+) # Hyperlinks start with the literal tag "HYPERLINK"
(?<HyperlinkText>".*?") # Hyperlink text contained in quotes, non-greedy
\s*) # Followed by whitespace
* # Repeating any number of times
(?<MatchToken>"MATCH\d*") # Followed by a literal tag "MATCH" and a digit string
\s* # Followed by whitespace
(?<KeepMe>\d+) # Finally, the match, which is just a series of digits
\s* # Followed by whitespace
)+ # The whole hyperlink & match pair must occur at least once
/x
It may or may not cover all your cases; I haven't spent much time digging into it.

Validation of international telephone numbers with REGEXMATCH

I'm trying to apply a data validation formula to a column, checking if the content is a valid international telephone number. The problem is I can't have +1 or +some dial code because it's interpreted as an operator. So I'm looking for a regex that accepts all these, with the dial code in parentheses:
(+1)-234-567-8901
(+61)-234-567-89-01
(+46)-234 5678901
(+1) (234) 56 89 901
(+1) (234) 56-89 901
(+46).234.567.8901
(+1)/234/567/8901
A starting regex can be this one (where I also took the examples).

This regex matches all the examples you gave (tested with https://fr.functions-online.com/preg_match_all.html):
/^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$/m
^ Matches the beginning of the string or, with the m modifier, of a line.
\(\+\d+\) Matches (+1) and (+61): the plus sign and the parentheses have to be escaped because they have special meaning in a regex. \d+ stands for one or more digits (the + quantifier could be replaced by {1,2}).
[\/\. \-] Matches exactly one dot, space, slash or hyphen.
\(?\d{3}\)? Matches three digits; the question marks make the surrounding parentheses optional (? = 0 or 1 time).
[\/\. \-] Same separator class as before.
[\d\- \.\/]{7,11} Matches between 7 and 11 digits, hyphens, spaces, dots or slashes.
$ Matches the end of the line or the end of the string.
The m modifier allows the caret (^) and the dollar sign ($) to match at line breaks. Remove it if you want those symbols to match only the beginning and the end of the string.
Slashes are used as the delimiters for this regex (other characters can be used as well).
I must admit I don't like the last part of the regex, as it does not ensure that you have at least 7 digits.
It would probably be better to remove all the separators (for example with the PHP function str_replace) and deal only with parentheses and digits, using this regex:
/(\(\+\d+\))(\(?\d{3}\)?)(\d{3})(\d{4})/m
Notice that in this last regex I used 4 capturing groups to match the four digit sections of the phone number. This regex keeps the parentheses and the plus sign in the first group and the optional parentheses in the second group. To capture only the digits, you can use this regex:
/\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})/m
Note: The groups are for formatting the phone number after validating it. It is probably better to keep all the phone numbers in your database in the same format.
These are a few of the possibilities you can use.
Note: These regexes should be compatible with most regex engines, but it is good practice to specify which language you are working with, because regex engines do not handle advanced features the same way.
For example, lookbehind was not supported by JavaScript until ES2018, and .NET allows more powerful lookbehind than PHP.
Let me know if you need more information.
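The two approaches from this answer can be compared side by side in a short Python sketch. The function names are hypothetical, str.translate stands in for PHP's str_replace, and using fullmatch on the stripped string is an added assumption (the answer's second regex is unanchored).

```python
import re

# First regex from the answer: separators are part of the pattern.
LOOSE = re.compile(r"^\(\+\d+\)[/. \-]\(?\d{3}\)?[/. \-][\d\- ./]{7,11}$")

# Last regex from the answer: applied after the separators are removed.
STRICT = re.compile(r"\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})")

# Translation table that deletes slash, dot, space and hyphen.
SEPARATORS = str.maketrans("", "", "/. -")


def is_valid_loose(number):
    """Validate with the separator-tolerant regex."""
    return LOOSE.match(number) is not None


def split_number(number):
    """Strip separators, then return the four digit groups or None."""
    m = STRICT.fullmatch(number.translate(SEPARATORS))
    return m.groups() if m else None


print(is_valid_loose("(+1) (234) 56-89 901"))
print(split_number("(+61)-234-567-89-01"))
# The weakness mentioned above: separators alone satisfy the last part.
print(is_valid_loose("(+1)-234-" + "-" * 7))
print(split_number("(+1)-234-" + "-" * 7))
```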

vim match group of numbers and replace

I have a large file with data in this format:
regabc123456_user_domain_application_env_id
regdef789101_user_domain_application_env_id
In vim I want to do a search and replace ("_" with ", ") that matches the machine name (regabc123456).
I am trying this:
:%s/^reg.*\{6}_/^reg.*\{6},\ /g
^ for the beginning of the line, 'reg' because all lines start with this, then '.*' for anything after that but before the six-digit code starts, which I am trying to catch with {6}.
This doesn't seem to be doing what I want. I can match the machine name, but I can't replace it with what I want. Is there an easier way to identify the machine name with regular expressions? Example:
'reg' followed by three lowercase letters followed by six digits followed by an underscore, then replace?
Thanks.

The regexes below would replace regabc123456_ with regabc123456,
:%s/^\(reg.*[0-9]\{6\}\)_/\1,/g
OR
:%s/^\(reg[a-z]\{3\}[0-9]\{6\}\)_/\1,/g
If you want a space after the comma then add space after comma in the replacement part.
:%s/^\(reg[a-z]\{3\}[0-9]\{6\}\)_/\1, /g
To match a 6-digit number, you need to use [0-9]\{6\}. The \{6\} repeats the previous token exactly 6 times.
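For checking the substitution outside vim, the same capture-and-backreference idea can be sketched in Python, where the grouping parentheses and braces are written without backslashes. The function name split_machine_name is made up for the example.

```python
import re

# Same pattern as the vim command; MULTILINE lets ^ match at each line start.
MACHINE = re.compile(r"^(reg[a-z]{3}[0-9]{6})_", re.MULTILINE)


def split_machine_name(text):
    """Replace the underscore after the machine name with ', '."""
    return MACHINE.sub(r"\1, ", text)


data = (
    "regabc123456_user_domain_application_env_id\n"
    "regdef789101_user_domain_application_env_id"
)
print(split_machine_name(data))
```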
