a{0,1}|b{0,1} matches only 'a', why? - regex

Why is this happening? I've a complex regexp, but here is what is driving me crazy.
a|b
Matches either single a or single b.
a+|b+
Matches either series of a or series of b.
a{1}|b{1}
Matches both single letter the same.
But I need to do this:
a{0,2}|b{0,2}
And this regexp matches only a and no b at all. What's wrong with that?
What is even funnier is that if I change the 0 to 1, so that it's {1,2}, it starts to match correctly (or better, as expected) again.
Since it seems it's not quite clear, I'm adding my real example:
my $launch_regexp = '(\d*)d{0,1}(\d*)(\+{0,2}|-{0,2})(\d*)';
($dice, $fc, $op, $mod) = ($launch =~ /$launch_regexp/);
Where $launch is the same as $ARGV[1].
I want to match many things. Examples:
3 (numbers)
d10 (d + numbers)
3d10 (numbers + d + numbers)
3d10+/-5 (numbers + d + numbers + (+|-) + numbers)
3d10++/--5 (numbers + d + numbers + (++|--) + numbers)
I know my regexp also matches other strings, but now it works with + and not with -.
If I change the range with {1,2}, it matches strings with both + and - (but I need to match also strings which have not such modifiers).
This is happening on my machine with Perl 5.16.3 and I'm able to reproduce it on this website.

The string "b" can be matched by the regex a{0,2} as it correctly has zero instances of 'a'. It won't capture, but it'll match.
In order to match '','aa' or 'bb', you want (aa|bb)? and to wrap your whole regex in ^ and $
I think what you want for your solution is: (\d*)d?(\d+)(?:(\+{1,2}|\-{1,2})(\d*))?
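As a rough check of the pattern suggested above, here is a Python sketch (the thread itself is about Perl; the helper name parse_launch is hypothetical). It uses fullmatch so the whole string has to fit, which plays the role of wrapping the regex in ^ and $.

```python
import re

# The suggested pattern; fullmatch() stands in for the ^ ... $ anchors.
DICE_RE = re.compile(r"(\d*)d?(\d+)(?:(\+{1,2}|-{1,2})(\d*))?")

def parse_launch(launch):
    """Return (dice, faces, op, mod) or None if the string does not fit."""
    m = DICE_RE.fullmatch(launch)
    return m.groups() if m else None

for text in ["3", "d10", "3d10", "3d10+5", "3d10--5", "3d10+-5"]:
    print(text, parse_launch(text))
```

Note that with this pattern a bare number such as 3 ends up in the second group, because (\d+) must consume at least one digit and (\d*) gives its digits back.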

Perl prefers earliest match in the string over anything else. Next, it prefers the earliest of a series of | alternatives (not the longest, as is the case with some regex engines).
Because your first alternative can match nothing, perl will do so at the beginning of the string, for any string that doesn't start with an a.
You probably want something like:
my ($find) = ($string) =~ /^[^ab]*(a{1,2}|b{1,2}|\z)/;
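The same behaviour can be reproduced in Python, whose re module is also a backtracking engine that takes the leftmost match and then the first alternative that succeeds. This is only a sketch of the effect described above, not Perl code; the variable names are made up, and \Z is Python's spelling of Perl's \z.

```python
import re

# The first alternative may match zero characters, so it succeeds
# immediately at position 0 and the b-branch is never tried.
zero_ok = re.search(r"a{0,2}|b{0,2}", "bb")
print(repr(zero_ok.group(0)), zero_ok.span())

# Requiring at least one character forces the engine to move on.
one_min = re.search(r"a{1,2}|b{1,2}", "bb")
print(repr(one_min.group(0)), one_min.span())

# Python rendering of the pattern proposed above.
skip_first = re.search(r"^[^ab]*(a{1,2}|b{1,2}|\Z)", "xxbb")
print(repr(skip_first.group(1)))
```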

Related

Matlab: How to replace dynamic part of string with regexprep

I have strings like
#(foo) 5 + foo.^2
#(bar) bar(1,:) + bar(4,:)
and want the expression in the first group of parentheses (which could be anything) to be replaced by x in the whole string
#(x) 5 + x.^2
#(x) x(1,:) + x(4,:)
I thought this would be possible with regexprep in one step somehow, but after reading the documentation and fiddling around for quite a while, I have not found a working solution yet.
I know, one could use two commands: First, grab the string to be matched with regexp and then use it with regexprep to replace all occurrences.
However, I have the gut feeling this should be somehow possible with the functionality of dynamic expressions and tokens or the like.
Without the support of an infinite-width lookbehind, you cannot do that in one step with a single call to regexprep.
Use the first idea: extract the first word and then replace it with x when found in between word boundaries:
s = '#(bar) bar(1,:) + bar(4,:)';
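% MATLAB cannot index a function call result directly, so take the token in two steps
word = regexp(s, '^#\((\w+)\)', 'tokens', 'once');
word = word{1};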
s = regexprep(s, strcat('\<',word,'\>'), 'x');
Output: #(x) x(1,:) + x(4,:)
The ^#\((\w+)\) regex matches the #( at the start of the string, then captures alphanumeric or _ chars into Group 1 and then matches a ). tokens option allows accessing the captured substring, and then the strcat('\<',word,'\>') part builds the whole word matching regex for the regexprep command.
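The two-step idea (grab the name, then replace it as a whole word) is not specific to MATLAB. Below is a minimal Python sketch of the same logic; the function name rename_argument is hypothetical, and \b is used where MATLAB uses \< and \>.

```python
import re

def rename_argument(expr, new_name="x"):
    """Replace the name inside the leading #(...) everywhere it occurs as a whole word."""
    m = re.match(r"#\((\w+)\)", expr)
    if m is None:
        return expr  # no leading #(name) group, leave the string alone
    word = m.group(1)
    # re.escape guards against regex metacharacters; \b keeps 'bar' from matching inside 'barn'.
    return re.sub(r"\b" + re.escape(word) + r"\b", lambda _m: new_name, expr)

print(rename_argument("#(bar) bar(1,:) + bar(4,:)"))
print(rename_argument("#(foo) 5 + foo.^2"))
```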

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of a name suffix, for example JR/SR, or another suffix made up of I, V, and X, using a regular expression. To accomplish this I have implemented the following regex:
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combinations, e.g.
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still a couple of combinations remaining that this regex can still match, e.g. 'Last Name IXV' or 'Last Name VXI'.
These two I am not able to debug. Please suggest which part of this regex I should change to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boundary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match either S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match the preceding pattern one or more times
$ - match end of the string
To solve this I worked on the above regex a little more. Here is the final result, which matches the Roman numerals up to thirty that are made up of I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have taken into consideration those inputs that stand alone, for example SR or XXV. I observed the incorrect patterns and prevented them from matching.
Separate input is ensured with \b, the word boundary.
Word boundary: it marks the position where a word starts or ends.
The incorrect patterns are rejected with the negative lookahead (?!(IIX|IIV|IVV|IXX|IXI)).
How I have arrived on this solution is given as follows:
I have observed closely all the pattern first, that from I to X - that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
We have an I, V, or X in the first position, then another I, X, or V in the second position, and after that again an I or V. I have implemented this in the following part of the regex above:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
It starts with I and then looks for any of I, V, or X, from zero to three characters, while rejecting the invalid numerals listed inside (?!(IIX|IIV|IVV|IXX|IXI)). I have handled the other combinations in the same way:
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a name suffix, i.e. JR or SR, one can use the following regex: \b[S|J]R\b (inside a character class the | is a literal character, so [SJ]R is the stricter form)
and the last alternative (^$) matches a blank string, in other words the case when no input has been provided in the input box or textbox.
You may post any question or suggestion, if you have.
Thanks!
PS: This regex is simply a solution to validate Roman numerals from 1 to 30 using I, V, and X. I hope it helps every regex newbie learn a bit.
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically match, but don't capture, endings like JR etc
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.
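For illustration, here is how the explicit-alternatives approach could look in Python. This is a sketch under assumptions: the helper name strip_suffix is hypothetical, and a plain non-capturing group replaces the atomic group because Python's re only supports (?>...) from version 3.11. Since every alternative ends in $ and nothing follows the group, dropping the atomic group does not change the result here.

```python
import re

# Group 1: name followed by a known suffix; group 2: fallback for names without one.
SUFFIX_RE = re.compile(r"(.+) (?:JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)|(.+)")

def strip_suffix(name):
    """Return the name without a trailing suffix such as JR or III."""
    m = SUFFIX_RE.match(name)
    if m is None:
        return name  # empty input: nothing to strip
    return m.group(1) if m.group(1) is not None else m.group(2)

for name in ["John Smith JR", "Henry Ford III", "Mary Ann"]:
    print(repr(strip_suffix(name)))
```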

parse HEX value from a string

I am trying to get a hex value from a string with this condition "VALUE: num,num,num,HEX,num,num"
I have the following
% set STRINGTOPARSE "VALUE: 12,12,13,2,9,5271256369606C00,0,0"
% regexp {(,[0-9A-Z]+,)+} $STRINGTOPARSE result1 result2 result3
1
% puts $result1
,12,
% puts $result2
,12,
% puts $result3
I believe the condition of {(,[0-9A-Z]+,)+} will be sufficient to take the HEX from above string,
but instead I got the first result ",12," and not the HEX that I want. What have I done wrong?
You might want to use split instead:
set result [lindex [split $STRINGTOPARSE ","] 5]
regexp is not giving you the result you are looking for because the first part that matches is ,12, and the match stops there and won't look for more matches.
You could use regexp to do it, but it will be more messy... one possible way would be to match each comma:
regexp {^(?:[^,]*,){5}([0-9A-F]+),} $STRINGTOPARSE -> hex
Where (?:[^,]*,){5} matches the first 5 non-comma parts with their commas, and ([0-9A-F]+) then grabs the hex value you're looking for.
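The same three observations can be checked in Python, whose regex syntax is close enough for these patterns. This is an illustrative sketch rather than Tcl; the variable names are made up.

```python
import re

line = "VALUE: 12,12,13,2,9,5271256369606C00,0,0"

# The original pattern stops at the first place it can match: ",12,".
first = re.search(r"(,[0-9A-Z]+,)+", line)
print(first.group(0))

# Splitting on commas and taking field 5 (zero-based) gives the hex value.
by_split = line.split(",")[5]
print(by_split)

# Skipping five comma-terminated fields with a regex gives the same value.
by_regex = re.match(r"(?:[^,]*,){5}([0-9A-F]+),", line).group(1)
print(by_regex)
```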
I think the problem is that you seem to think [0-9A-Z] has to match at least one letter, which is not the case. It matches any character within the character class, and you get a match as long as one character matches.
If you wanted a regex to match a series of characters with both numbers and letters, then you would have to use some lookaheads (using classes alone might make it more messy):
regexp {\y(?=[0-9]*[A-Z])(?=[A-Z]*[0-9])[0-9A-Z]+\y} $STRINGTOPARSE -> hex
But... this might look even more complex than before, so I would advise sticking to splitting instead :)
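To make the lookahead idea concrete, here is a Python sketch in which one lookahead demands a letter somewhere in the token and the other demands a digit. The exact pattern is my own formulation for illustration (with \b in place of Tcl's \y), and the name find_mixed is hypothetical.

```python
import re

# A token of [0-9A-Z] that contains at least one letter AND at least one digit.
MIXED_RE = re.compile(r"\b(?=[0-9]*[A-Z])(?=[A-Z]*[0-9])[0-9A-Z]+\b")

def find_mixed(text):
    """Return the first token mixing digits and capital letters, or None."""
    m = MIXED_RE.search(text)
    return m.group(0) if m else None

print(find_mixed("VALUE: 12,12,13,2,9,5271256369606C00,0,0"))
```

A hex value that happens to contain only digits would not be found this way, which is another reason to prefer splitting on commas.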

Regular expressions string replacement of individual match within file

I have written a small program to whir through a textfile and find and replace regex where 9 digits \d{9}. It works fine, except what I need is a little more complicated.
I am finding the right data correctly. theFile is just a string with the text file streamread into it. I do this and then create and write it to another file.
But I need to find each string match individually, and replace that match with only the last 5 digits of that individual number (currently this is just replacing with FOUND). Keeping the file otherwise identical.
I am not sure how/what is the best way of doing this? would i have to split into an array of strings rather than one mass string? (it's quite a big file)
Any questions let me know, thanks in advance.
Dim regexString As String = "(\d{9})"
Dim replacement1 As String = "FOUND"
Dim rgx As New Regex(regexString)
Try
theFile = rgx.Replace(theFile, replacement1)
Catch
End try
Instead of using just one pattern, \d{9}, split it into two groups: the first four digits long, the second five digits long. Then, in the replacement, use only the five digits from the last group:
Dim k = "abcd 123456789 abcf"
Dim ptn = "(\d{4})(\d{5})"
Dim result = Regex.Replace(k, ptn, "$2")
This approach leaves sequences with fewer than nine consecutive digits unchanged, but if you have sequences with more than nine digits and don't want to change them, then you need a pattern with
Dim ptn = "(\b\d{4})(\d{5}\b)"
to pin the two groups inside a sequence of exactly nine digits.
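The difference between the two patterns is easy to see on a string that contains both a nine-digit and a ten-digit run. The sketch below uses Python's re.sub instead of .NET's Regex.Replace (group references are written \1 and \2 rather than $1 and $2); the sample text is made up.

```python
import re

text = "abcd 123456789 abcf 1234567890"

# Without boundaries, the first nine digits of the ten-digit run are rewritten too.
loose = re.sub(r"(\d{4})(\d{5})", r"\2", text)
print(loose)

# With \b on both sides, only runs of exactly nine digits are touched.
strict = re.sub(r"\b\d{4}(\d{5})\b", r"\1", text)
print(strict)
```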
The question appears to ask for matches on exactly nine digits and wants the first four removed, i.e. to replace the nine digits with the last five.
Splitting the regular expression in the question into two parts, for the unwanted and the wanted parts gives
regexString = "\d{4}(\d{5})"
which captures the wanted five digits, so then the replacement is
replacement1 ="$1"
Or in some other regular expression implementations it would be replacement1 ="\1". Additionally the replace method in some regular expression system may have additional options (parameters) for replace first versus replace n-th versus replace all occurrences.
Suppose there are more than nine digits and only the final five are wanted. In this case the regular expression can be written as one of the following (as different regular expression languages support different features). The replacement expression is the same as above.
regexString = "\d{4,}(\d{5})"
regexString = "\d\d\d\d+(\d{5})"
regexString = "\d\d\d\d\d*(\d{5})"
Because regular expressions are normally "greedy" the \d{5} should always match the final 5 digits but it may be worth finishing the regular expression with ...(\d{5})([^\d]|$) and changing the replace to be $1$2. That way it looks for a trailing non-digit or end-of-string.
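A short Python sketch of the greedy behaviour described above: \d{4,} grabs as many digits as it can and then gives back just enough for (\d{5}) to match, so the captured group is always the final five digits of the run. The function names are hypothetical.

```python
import re

def last_five(text):
    """Replace every run of nine or more digits with its final five digits."""
    return re.sub(r"\d{4,}(\d{5})", r"\1", text)

def last_five_checked(text):
    """Same, but explicitly require a non-digit or end of string after the run."""
    return re.sub(r"\d{4,}(\d{5})(\D|$)", r"\1\2", text)

print(last_five("id 12345678901234 end"))
print(last_five_checked("x 123456789"))
```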

Python: RE only captures first and last match

I'm trying to make a Regular Expression that captures the following:
- XX or XX:XX, up to 6 repetitions (XX:XX:XX:XX:XX:XX), where X is a hexadecimal number.
In other words, I'm trying to capture MAC addresses that can range from 1 to 6 bytes.
regex = re.compile("^([0-9a-fA-F]{2})(?:(?:\:([0-9a-fA-F]{2})){0,5})$")
The problem is that if I enter for example "11:22:33", it only captures the first match and the last, which results in ["11", "33"].
The question: is there any way to make the {0,5} repetition capture every repetition, and not only the last one?
Thanks!
Not in Python, no. But you can first check the correct format with your regex, and then simply split the string at the colons:
result = s.split(':')
Also note that you should always write regular expressions as raw strings (otherwise you get problems with escaping). And your outer non-capturing group does nothing.
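Putting the two steps together could look like the sketch below. The function name parse_mac_bytes is hypothetical; the pattern is written as a raw string, uses no capturing groups (the split does the capturing), and is applied with fullmatch so a trailing newline is not accepted.

```python
import re

# One to six colon-separated pairs of hex digits.
MAC_RE = re.compile(r"[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){0,5}")

def parse_mac_bytes(s):
    """Return the list of byte strings, or None if the format is wrong."""
    if MAC_RE.fullmatch(s) is None:
        return None
    return s.split(":")

print(parse_mac_bytes("11:22:33"))
print(parse_mac_bytes("11:22:3"))
```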
Technically there is a way to do it with regex only, but the regex is quite horrible:
r"^([0-9a-fA-F]{2})(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?(?::([0-9a-fA-F]{2}))?$"
But here you would always get six groups, just that some might be None (the ones that did not take part in the match).
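For completeness, a sketch of the regex-only variant with one optional group per byte, where each optional group includes its leading colon. The pattern is assembled from pieces only to keep it readable; the names HEX and SIX_GROUPS are made up for this example.

```python
import re

HEX = r"[0-9a-fA-F]{2}"
# One mandatory byte followed by five optional ":byte" groups.
SIX_GROUPS = re.compile(r"^(" + HEX + r")" + (r"(?::(" + HEX + r"))?") * 5 + r"$")

m = SIX_GROUPS.match("11:22:33")
groups = m.groups()
print(groups)

# Drop the groups that did not take part in the match.
present = [g for g in groups if g is not None]
print(present)
```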