Regexp variable alphanumeric matching

Regexp variable alphanumeric matching - regex

I am trying to match the following sample:
ZU2A ZS6D-9 ZT0ER-7 ZR6PJH-12
It is a combination of letters and numbers (alphanumeric).
Here is an explanation:
It will always start with a capital (uppercase) Z
Followed always by only ONE(1) of R,S,T or U "[R|S|T|U]"
Followed always by only ONE(1) number "[0-9]"
Followed always by a minimum of ONE(1) and optionally a maximum of THREE(3) capital (uppercase) letters like this [A-Z]{1,3}
Optionally followed by "-" and a minimum of ONE(1) and a maximum of TWO(2) numbers
At the moment I have this:
Z[R|S|T|U][0-9][A-Z]{1,}(\-)?([0-9]{1,3})
But that does not seem to catch all the samples.
EDIT: Here is a sample of a complete string:
ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!
Any help would be appreciated.
Thank You
Danny

Your main problem is that the whole optional part should be surrounded by one set of parentheses marked with ? (=optional). All in all, you want
Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?
A couple of extra notes:
In a character group, you can simply list the characters. So for 2 you want either [RSTU] or (?:R|S|T|U).
A group in the form of (?:example) instead of (example) prevents the sub-expression from being returned as a match. It has no effect on which inputs are matched.
You don't need to escape - with a backslash outside of a character class.
Here's an example test case script in Python:
import re
s = r'Z[RSTU][0-9][A-Z]{1,3}(?:-[0-9]{1,2})?'
rex = re.compile(s)
for test in ('ZU2A', 'ZS6D-9', 'ZT0ER-7', 'ZR6PJH-12'):
assert rex.match(test), test
long_test = 'ZU0D>APT314,ZT1ER,WIDE1,ZS3PJ-2,ZR5STU-12*/V:/021414z2610.07S/02814.02Ek067/019/A=005475!w%<!'
found = rex.findall(long_test)
assert found == ['ZU0D', 'ZT1ER', 'ZS3PJ-2', 'ZR5STU-12'], found

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?

Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets

What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.

The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))

You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"

Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.

An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False

If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))

To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

Need regex expression with multiple conditions

I need regex with following conditions
It should accept maximum of 5 digits then upto 3 decimal places
it can be negative
it can be zero
it can be only numbers (max. upto 5 digit place)
it can be null
I have tried following but its not, its not fulfilling all conditions
#"^([\-\+]?)\d{0,5}(.[0-9]{1,3})?)$"
E.g. maximum value can hold is from -99999.999 to 99999.999

Use this regex:
^[-+]?\d{0,5}(\.[0-9]{1,3})?$
I only made two changes here. First, you don't need to escape any characters inside a character class normally, except for opening and closing brackets, or possibly backslash itself. Hence, we can use [-+] to capture an initial plus or minus. Second, you need to escape the dot in your regex, to tell the engine that you want to match a literal dot.
However, I would probably phrase this regex as follows:
^[-+]?\d{1,5}(\.[0-9]{1,3})?$
This will match one to five digits, followed by an optional decimal point, followed by one to three digits.
Note that we want to capture things like:
0.123
But not
.123
i.e. we don't want to capture a leading decimal point should it not be prefixed by at least one number.
Demo here:
Regex101

I assume you're doing this in C# given the notation. Here's a little code you can use to test your expression, with two corrections:
You have to escape the dot, otherwise it means "any character". So, \. instead of .
There was an extraneous close parenthesis that prevented the expression from compiling
C#:
var expr = #"^([\-\+]?)\d{0,5}(\.[0-9]{1,3})?$";
var re = new Regex(expr);
string[] samples = {
"",
"0",
"1.1",
"1.12",
"1.123",
"12.3",
"12.34",
"12.345",
"123.4",
"12345.123",
".1",
".1234"
};
foreach(var s in samples) {
Console.WriteLine("Testing [{0}]: {1}", s, re.IsMatch(s) ? "PASS" : "FAIL");
}
Results:
Testing []: PASS
Testing [0]: PASS
Testing [1.1]: PASS
Testing [1.12]: PASS
Testing [1.123]: PASS
Testing [12.3]: PASS
Testing [12.34]: PASS
Testing [12.345]: PASS
Testing [123.4]: PASS
Testing [12345.123]: PASS
Testing [.1]: PASS
Testing [.1234]: FAIL

It should accept maximum of 5 digits
[0-9]{1,5}
then upto 3 decimal places
[0-9]{1,5}(\.[0-9]{1,3})?
it can be negative
[-]?[0-9]{1,5}(\.[0-9]{1,3})?
it can be zero
Already covered.
it can be only numbers (max. upto 5 digit place)
Already covered. 'Up to 5 digit place' contradicts your first rule, which allows 5.3.
it can be null
Not covered. I strongly suggest you remove this requirement. Even if you mean 'empty', as I sincerely hope you do, you should detect that case separately and beforehand, as you will certainly have to handle it differently.
Your regular expression contains ^ and $. I don't know why. There is nothing about start of line or end of line in the rules you specified. It also allows a leading +, which again isn't specified in your rules.

Regular expression which will match if there is no repetition

I would like to construct regular expression which will match password if there is no character repeating 4 or more times.
I have come up with regex which will match if there is character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way how to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads will make it. But I haven't found working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)

That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$

Nota Bene: this solution doesn't answer exaactly to the question, it does too much relatively to the expressed need.
-----
In Python language:
import re
pat = '(?:(.)(?!.*?\\1.*?\\1.*?\\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
':1*1-3=4#5',
':1*1-1=4#5!6',
':1*1-1=1#',
':1*2-a=14#a~7&1{g}1'):
m = regx.match(s)
if m:
print m.group()
else:
print '--No match--'
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
It will give a lot of work to the regex motor because the principle of the pattern is that for each character of the string it runs through, it must verify that the current character isn't found three other times in the remaining sequence of characters that follow the current character.
But it works, apparently.

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?

Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma

Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.

Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.

You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+

Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)

Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.

Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement

Regex to parse international floating-point numbers

I need a regex to get numeric values that can be
111.111,11
111,111.11
111,111
And separate the integer and decimal portions so I can store in a DB with the correct syntax
I tried ([0-9]{1,3}[,.]?)+([,.][0-9]{2})? With no success since it doesn't detect the second part :(
The result should look like:
111.111,11 -> $1 = 111111; $2 = 11

First Answer:
This matches #,###,##0.00:
^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$
And this matches #.###.##0,00:
^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$
Joining the two (there are smarter/shorter ways to write it, but it works):
(?:^[+-]?[0-9]{1,3}(?:\,?[0-9]{3})*(?:\.[0-9]{2})?$)
|(?:^[+-]?[0-9]{1,3}(?:\.?[0-9]{3})*(?:\,[0-9]{2})?$)
You can also, add a capturing group to the last comma (or dot) to check which one was used.
Second Answer:
As pointed by Alan M, my previous solution could fail to reject a value like 11,111111.00 where a comma is missing, but the other isn't. After some tests I reached the following regex that avoids this problem:
^[+-]?[0-9]{1,3}
(?:(?<comma>\,?)[0-9]{3})?
(?:\k<comma>[0-9]{3})*
(?:\.[0-9]{2})?$
This deserves some explanation:
^[+-]?[0-9]{1,3} matches the first (1 to 3) digits;
(?:(?<comma>\,?)[0-9]{3})? matches on optional comma followed by more 3 digits, and captures the comma (or the inexistence of one) in a group called 'comma';
(?:\k<comma>[0-9]{3})* matches zero-to-any repetitions of the comma used before (if any) followed by 3 digits;
(?:\.[0-9]{2})?$ matches optional "cents" at the end of the string.
Of course, that will only cover #,###,##0.00 (not #.###.##0,00), but you can always join the regexes like I did above.
Final Answer:
Now, a complete solution. Indentations and line breaks are there for readability only.
^[+-]?[0-9]{1,3}
(?:
(?:\,[0-9]{3})*
(?:.[0-9]{2})?
|
(?:\.[0-9]{3})*
(?:\,[0-9]{2})?
|
[0-9]*
(?:[\.\,][0-9]{2})?
)$
And this variation captures the separators used:
^[+-]?[0-9]{1,3}
(?:
(?:(?<thousand>\,)[0-9]{3})*
(?:(?<decimal>\.)[0-9]{2})?
|
(?:(?<thousand>\.)[0-9]{3})*
(?:(?<decimal>\,)[0-9]{2})?
|
[0-9]*
(?:(?<decimal>[\.\,])[0-9]{2})?
)$
edit 1: "cents" are now optional;
edit 2: text added;
edit 3: second solution added;
edit 4: complete solution added;
edit 5: headings added;
edit 6: capturing added;
edit 7: last answer broke in two versions;

I would at first use this regex to determine wether a comma or a dot is used as a comma delimiter (It fetches the last of the two):
[0-9,\.]*([,\.])[0-9]*
I would then strip all of the other sign (which the previous didn't match). If there were no matches, you already have an integer and can skip the next steps. The removal of the chosen sign can easily be done with a regex, but there are also many other functions which can do this faster/better.
You are then left with a number in the form of an integer possible followed by a comma or a dot and then the decimals, where the integer- and decimal-part easily can be separated from eachother with the following regex.
([0-9]+)[,\.]?([0-9]*)
Good luck!
Edit:
Here is an example made in python, I assume the code should be self-explaining, if it is not, just ask.
import re
input = str(raw_input())
delimiterRegex = re.compile('[0-9,\.]*([,\.])[0-9]*')
splitRegex = re.compile('([0-9]+)[,\.]?([0-9]*)')
delimiter = re.findall(delimiterRegex, input)
if (delimiter[0] == ','):
input = re.sub('[\.]*','', input)
elif (delimiter[0] == '.'):
input = re.sub('[,]*','', input)
print input
With this code, the following inputs gives this:
111.111,11
111111,11
111,111.11
111111.11
111,111
111,111
After this step, one can now easily modify the string to match your needs.

How about
/(\d{1,3}(?:,\d{3})*)(\.\d{2})?/
if you care about validating that the commas separate every 3 digits exactly,
or
/(\d[\d,]*)(\.\d{2})?/
if you don't.

If I'm interpreting your question correctly so that you are saying the result SHOULD look like what you say is "would" look like, then I think you just need to leave the comma out of the character class, since it is used as a separator and not a part of what is to be matched.
So get rid of the "." first, then match the two parts.
$value = "111,111.11";
$value =~ s/\.//g;
$value =~ m/(\d+)(?:,(\d+))?/;
$1 = leading integers with periods removed
$2 = either undef if it didn't exist, or the post-comma digits if they do exist.

See Perl's Regexp::Common::number.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regexp variable alphanumeric matching - regex

Related

Shorten Regular Expression (\n) [duplicate]

Need regex expression with multiple conditions

Regular expression which will match if there is no repetition

Regular expression help - comma delimited string

Regex to parse international floating-point numbers

Categories

Resources