Capturing a repeated group - regex

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need top type them in, but they can be there if the user as done a copy & paste.

BTW, you can replace [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable subtracted character class.
[[A-Z\d]-[IOUW]]
If you just want to match 3 groups like that, why don't you use this pattern 3 times in your regex and just use captured 1, 2, 3 subgroups to form the new string?
([[A-Z\d]-[IOUW]]){8}-([[A-Z\d]-[IOUW]]){8}-([[A-Z\d]-[IOUW]]){8}
In PHP I would return (I don't know .NET)
return "$1 $2 $3";

I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
string pattern = #"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
Regex re = new Regex(pattern);
Match m = re.Match(input);
if (m.Success)
foreach (Capture c in m.Groups["group"].Captures)
Console.WriteLine(c.Value);
}

After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(#"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = #"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you was using the {4} at the end for? This will find the matches based on what you want, then using the MatchCollection you can access each match to rebuild the string.

Why use Regex? If the groups are always split by a -, can't you use Split()?

Sorry if this isn't what you intended, but your string always has the hyphen separating the groups then instead of using regex couldn't you use the String.Split() method?
Dim stringArray As Array = someString.Split("-")

What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion, validate the charset in a first step, then split and parse in a seperate method based on what you expect. If this is in a web site/app then you can use the ASP Regex validation on the front end then break it up on the back end.

If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value

Mike,
You can use character set of your choice inside character group. All you need is to add "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])

You can use this pattern:
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}+)-?")
But you will need to filter out empty strings from resulting array.
Citation from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Access multiple captures of one capture group in substition string

Suppose I have the regex (\d)+.
In .Net I can access all captures of this capture group using the match.Groups[1].Captures.
Can I also access these captures in a substition string?
So for example for the input string 523, I need to use 5, 2 and 3 in my substition string (and not just 3, which is $1).
If you intend to capture the digits each in its separate capturing group then you need to actually make a separate capturing groups for every digits like this:
(\d)(\d)(\d)
NOTE: This does not scale very well and you could not match numbers of any other length than 3 digits. In other words, no math on either 23 or 345667!
An good page with a long and detailed explanation why this cant be done as (\d)+ can be found here:
https://www.regular-expressions.info/captureall.html
So if this is indeed what you want then you need to craft your own loop that searches the string for every digit separately.
If you on the other hand need to capture the number and not the individual digits then you simply put the +sign in the wrong position. I think you should write:
(\d+)
I think the OP wants to get every single digit match separately.
Perhaps this will help you then:
<!-- language: lang-vb -->
' Create a list to put the resulting matches in
Dim ResultList As StringCollection = New StringCollection()
Dim RegexObj As New Regex("(\d)")
Dim MatchResult As Match = RegexObj.Match(strName)
While MatchResult.Success
ResultList.Add(MatchResult.Groups(1).Value)
' Console.WriteLine(MatchResult.Groups(1).Value)
MatchResult = MatchResult.NextMatch()
End While

Regex to create url friendly string

I want to create a url friendly string (one that will only contain letters, numbers and hyphens) from a user input to :
remove all characters which are not a-z, 0-9, space or hyphens
replace all spaces with hyphens
replace multiple hyphens with a single hyphen
Expected outputs :
my project -> my-project
test project -> test-project
this is # long str!ng with spaces and symbo!s -> this-is-long-strng-with-spaces-and-symbos
Currently i'm doing this in 3 steps :
$identifier = preg_replace('/[^a-zA-Z0-9\-\s]+/','',strtolower($project_name)); // remove all characters which are not a-z, 0-9, space or hyphens
$identifier = preg_replace('/(\s)+/','-',strtolower($identifier)); // replace all spaces with hyphens
$identifier = preg_replace('/(\-)+/','-',strtolower($identifier)); // replace all hyphens with single hyphen
Is there a way to do this with one single regex ?
Yeah, #Jerry is correct in saying that you can't do this in one replacement as you are trying to replace a particular string with two different items (a space or dash, depending on context). I think Jerry's answer is the best way to go about this, but something else you can do is use preg_replace_callback. This allows you to evaluate an expression and act on it according to what the match was.
$string = 'my project
test project
this is # long str!ng with spaces and symbo!s';
$string = preg_replace_callback('/([^A-Z0-9]+|\s+|-+)/i', function($m){$a = '';if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';}return $a;}, $string);
print $string;
Here is what this means:
/([^A-Z0-9]+|\s+|-+)/i This looks for any one of your three quantifiers (anything that is not a number or letter, more than one space, more than one hyphen) and if it matches any of them, it passes it along to the function for evaluation.
function($m){ ... } This is the function that will evaluate the matches. $m will hold the matches that it found.
$a = ''; Set a default of an empty string for the replacement
if(preg_match('/(\s+|-+)/i', $m[1])){$a = '-';} If our match (the value stored in $m[1]) contains multiple spaces or hyphens, then set $a to a dash instead of an empty string.
return $a; Since this is a function, we will return the value and that value will be plopped into the string wherever it found a match.
Here is a working demo
I don't think there's one way of doing that, but you could reduce the number of replaces and in an extreme case, use a one liner like that:
$text=preg_replace("/[\s-]+/",'-',preg_replace("/[^a-zA-Z0-9\s-]+/",'',$text));
It first removes all non-alphanumeric/space/dash with nothing, then replaces all spaces and multiple dashes with a single one.
Since you want to replace each thing with something different, you will have to do this in multiple iterations.
Sorry D:

Regex match between two tags or else match everything

I have a list of email addresses which take various forms:
john#smith.com
Angie <angie#aol.com>
"Mark Jones" <mark#jones.com>
I'm trying to cut only the email portion from each. Ex: I only want the angie#aol.com from the second item in the list. In other words, I want to match everything between < and > or match everything if it doesn't exist.
I know this can be done in 2 steps:
Capture on (?<=\<)(.*)(?=\>).
If there is no match, use the entire text.
But now I'm wondering: Can both steps be reduced into one simple regular expression?
What about:
(?<=\<).*(?=\>)|^[^<]*$
^[^>]*$ will match the entire string, but only if it doesn't contain a <. And that's OR'ed (|) with what you had.
Explanation:
^ - start of string
[^<] - not-< character
[^<]* - zero or more not-< characters
$ - end of string
You're after an exclusive or operator. Have a look here.
(\<.+\#.+\..+\>) matches those email addresses in side <> only...
(\<.+\#.+\..+\>)|(.+) matches everything instead of matching the first condition in the OR then skipping the second.
Depending on what language you are using to implement this regex, you might be able to use an inbuilt exclusive or operator. Otherwise, you might need to put a bit of logic in there to use the string if no matches are found. E.g. (pseudo type code):
string = 'your data above';
if( regex_finds_match ( '(\<.+\#.+\..+\>)', string ) ) {
// found match, use the match
str_to_use = regex_match(es);
} else {
// didn't find a match:
str_to_use = string;
}
It is possible, but your current logic is probably simpler. Here is what I came up with, email address will always be in the first capturing group:
^(?:.*<|)(.*?)(?:>|$)
Example: http://rubular.com/r/8tKHaYYY4T

Regex to remove characters up to a certain point in a string

How do I use regex to convert
11111aA$xx1111xxdj$%%`
to
aA$xx1111xxdj$%%
So, in other words, I want to remove (or match) the FIRST grouping of 1's.
Depending on the language, you should have a way to replace a string by regex. In Java, you can do it like this:
String s = "11111aA$xx1111xxdj$%%";
String res = s.replaceAll("^1+", "");
The ^ "anchor" indicates that the beginning of the input must be matched. The 1+ means a sequence of one or more 1 characters.
Here is a link to ideone with this running program.
The same program in C#:
var rx = new Regex("^1+");
var s = "11111aA$xx1111xxdj$%%";
var res = rx.Replace(s, "");
Console.WriteLine(res);
(link to ideone)
In general, if you would like to make a match of anything only at the beginning of a string, add a ^ prefix to your expression; similarly, adding a $ at the end makes the match accept only strings at the end of your input.
If this is the beginning, you can use this:
^[1]*
As far as replacing, it depends on the language. In powershell, I would do this:
[regex]::Replace("11111aA$xx1111xxdj$%%","^[1]*","")
This will return:
aA$xx1111xxdj$%%
If you only want to replace consecutive "1"s at the beginning of the string, replace the following with an empty string:
^1+
If the consecutive "1"s won't necessarily be the first characters in the string (but you still only want to replace one group), replace the following with the contents of the first capture group (usually \1 or $1):
1+(.*)
Note that this is only necessary if you only have a "replace all" capability available to you, but most regex implementations also provide a way to replace only one instance of a match, in which case you could just replace 1+ with an empty string.
I'm not sure but you can try this
[^1](\w*\d*\W)* - match all as a single group except starting "1"(n) symbols
In Javascript
var str = '11111aA$xx1111xxdj$%%';
var patt = /^1+/g;
str = str.replace(patt,"");