RegEx for an invoice format - regex

I'm quite new to regular expressions and I'm trying to create a regex for the validation of an invoice format.
The pattern should be:
JjYy (all 4 characters are legit), used 0, 2 or 4 times
e.g. no Y's at all is valid, YY is valid, YYYY is valid, but YYY should fail.
Followed by a series of 0's repeating 3 to 10 times.
The whole should never exceed 10 characters.
examples:
JyjY000000 is valid (albeit quite strange)
YY000 is valid
000000 is valid
jjj000 is invalid
jjjj0 is invalid
I learned some basics from here, but my regex fails when it shouldn't. Can someone assist in improving it?
My regex so far is: [JjYy]{0}|[JjYy]{2}|[JjYy]{4}[0]{3,10}.
The following failed also: [JjYy]{0|2|4}[0]{3,10}

As you need the total length to never exceed 10 characters I think you have to handle the three kinds of prefixes separately:
0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}
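A minimal sketch of how this alternation could be checked against the examples from the question, assuming Python's re module. The names INVOICE_RE and is_valid_invoice are made up for illustration. re.fullmatch is used so the whole string has to match, which stands in for explicit ^ and $ anchors:

```python
import re

# Three-way alternation: the allowed run of zeros shrinks as the prefix grows,
# so the total length never exceeds 10 characters.
INVOICE_RE = re.compile(r"0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}")


def is_valid_invoice(text: str) -> bool:
    """Return True when the whole string matches one of the three shapes."""
    return INVOICE_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["JyjY000000", "YY000", "000000", "jjj000", "jjjj0"]:
        print(sample, is_valid_invoice(sample))
```

Without fullmatch (or anchors), a search would also succeed on the 000 inside jjj000, which is why the whole-string requirement matters here.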

How about:
^([JjYy]{2}){0,2}0{3,10}$
To check the length is ten characters or less, use a string length function rather than a regular expression - don't hammer nails with a screwdriver, and so forth.
Test:
#!perl
use warnings;
use strict;
my $re = qr/^([JjYy]{2}){0,2}0{3,10}$/;
my %tests = qw/JyjY000000 valid
YY000 valid
000000 valid
jjj000 invalid
jjjj0 invalid/;
for my $k (keys %tests) {
    print "$k is ";
    if ($k =~ /$re/) {
        print "valid";
    } else {
        print "invalid";
    }
    print " and it should be $tests{$k}.\n";
}
Produces
jjjj0 is invalid and it should be invalid.
YY000 is valid and it should be valid.
JyjY000000 is valid and it should be valid.
jjj000 is invalid and it should be invalid.
000000 is valid and it should be valid.
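The regex in this answer does not enforce the ten-character limit by itself, so the length check it recommends has to be done separately. A rough sketch of combining the two, assuming Python; the function name and the max_len parameter are hypothetical:

```python
import re

# Same shape as the Perl pattern: zero, one or two pairs of J/j/Y/y, then 3 to 10 zeros.
PREFIX_AND_ZEROS = re.compile(r"(?:[JjYy]{2}){0,2}0{3,10}")


def is_valid_invoice_with_length_check(text: str, max_len: int = 10) -> bool:
    """Check the length with len() first, then the shape with the regex."""
    return len(text) <= max_len and PREFIX_AND_ZEROS.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["JyjY000000", "JyjY0000000", "YY000", "jjj000"]:
        print(sample, is_valid_invoice_with_length_check(sample))
```

Keeping the length test outside the pattern leaves the regex short and lets the limit change without touching it.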

([jJyY]{2}){0,2}0{3,10}
If the total length limit is inclusive of the jJyY part, you can check it with a negative look ahead to make sure there are no more than 10 characters in the string to begin with (?![jJyY0]{11,})
\b(?![jJyY0]{11,})([jJyY]{2}){0,2}0{3,10}\b

It may depend on what you are using to implement the regular expression. For example, I found out the other day that Notepad++ only supports a few basic operators. Things like the pipe are not available in every regex flavor.
I'd suggest something like this:
([JjYy]{2}([JjYy]{2})?)?[0]{3,10}
If you're using a programming language, you'll need to use a string length function to validate the length.
EDIT: actually, you should be able to validate the length by separating the different situations:
([0]{3,10})|([JjYy]{2}[0]{3,8})|([JjYy]{4}[0]{3,6})

You want to limit the string to 10 characters. So in order to do this you have to consider what valid combinations will make up 10 characters.
Valid combinations therefore would be:
0000000000
000
cc00000000
cc000
cccc000000
cccc000
So, an expression to include all of these would be:
/0{3,10}|[JY]{2}0{3,8}|[JY]{4}0{3,6}/i
A case insensitive match would suffice, although you do get additional performance from some regular expression engines by explicitly saying /[JjYy]/ instead of /[JY]/i.

Related

Can you restrict two characters based on their ASCII order in regex?

Let's say I have a string of 2 characters. Using regex (as a thought exercise), I want to accept it only if the first character has an ascii value bigger than that of the second character.
ae should not match because a is before e in the ascii table.
ea, za and aA should match for the opposite reason
f$ should match because $ is before letters in the ascii table.
It doesn't matter if aa or a matches or not, I'm only interested in the base case. Any flavor of regex is allowed.
Can it be done ? What if we restrict the problem to lowercase letters only ? What if we restrict it to [abc] only ? What if we invert the condition (accept when the characters are ordered from smallest to biggest) ? What if I want it to work for N characters instead of 2 ?
I doubt I could have done this myself; however, bobble-bubble impressively solved the problem with:
^~*\}*\|*\{*z*y*x*w*v*u*t*s*r*q*p*o*n*m*l*k*j*i*h*g*f*e*d*c*b*a*`*_*\^*\]*\\*\[*Z*Y*X*W*V*U*T*S*R*Q*P*O*N*M*L*K*J*I*H*G*F*E*D*C*B*A*@*\?*\>*\=*\<*;*\:*9*8*7*6*5*4*3*2*1*0*\/*\.*\-*,*\+*\**\)*\(*'*&*%*\$*\#*"*\!*$(?!^)
Maybe for abc only or some short sequences we would approach solving the problem with some expression similar to,
^(abc|ab|ac|bc|a|b|c)$
^(?:abc|ab|ac|bc|a|b|c)$
that might help you to see how you would go about it.
You can simplify that to:
^(a?b?c?)$
^(?:a?b?c?)$
but I'm not so sure about it.
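One way to settle that doubt is to compare the two forms on every string over a, b, c up to length 3. This is only a sketch, assuming Python's re module and using re.fullmatch in place of the ^ and $ anchors; the variable names are made up:

```python
import re
from itertools import product

enumerated = re.compile(r"abc|ab|ac|bc|a|b|c")
optional = re.compile(r"a?b?c?")

# Every string over the alphabet {a, b, c} with length 0 to 3.
candidates = [""] + ["".join(p) for n in (1, 2, 3) for p in product("abc", repeat=n)]

# Strings on which the two patterns disagree.
differences = [
    s for s in candidates
    if (enumerated.fullmatch(s) is None) != (optional.fullmatch(s) is None)
]

if __name__ == "__main__":
    print(differences)
```

On this set the two patterns disagree only on the empty string, which a?b?c? accepts and the enumerated form rejects. Note that both accept letters in ascending order (a before c), so they correspond to the inverted condition from the question.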
The number of characters you want to allow is a separate concern from the ordering problem, because you can add an independent lookahead for it, such as:
(?!.{n})
where n-1 is the maximum number of characters allowed. With n = 3, this becomes
(?!.{3})^(?:a?b?c?)$
(?!.{3})^(a?b?c?)$
A regex is not the best tool for the job.
But it's doable. A naive approach is to enumerate all the printable ascii characters and their corresponding lower range:
\x21[ -\x20]|\x22[ -\x21]|\x23[ -\x22]|\x24[ -\x23]|\x25[ -\x24]|\x26[ -\x25]|\x27[ -\x26]|\x28[ -\x27]|\x29[ -\x28]|\x2a[ -\x29]|\x2b[ -\x2a]|\x2c[ -\x2b]|\x2d[ -\x2c]|\x2e[ -\x2d]|\x2f[ -\x2e]|\x30[ -\x2f]|\x31[ -\x30]|\x32[ -\x31]|\x33[ -\x32]|\x34[ -\x33]|\x35[ -\x34]|\x36[ -\x35]|\x37[ -\x36]|\x38[ -\x37]|\x39[ -\x38]|\x3a[ -\x39]|\x3b[ -\x3a]|\x3c[ -\x3b]|\x3d[ -\x3c]|\x3e[ -\x3d]|\x3f[ -\x3e]|\x40[ -\x3f]|\x41[ -\x40]|\x42[ -\x41]|\x43[ -\x42]|\x44[ -\x43]|\x45[ -\x44]|\x46[ -\x45]|\x47[ -\x46]|\x48[ -\x47]|\x49[ -\x48]|\x4a[ -\x49]|\x4b[ -\x4a]|\x4c[ -\x4b]|\x4d[ -\x4c]|\x4e[ -\x4d]|\x4f[ -\x4e]|\x50[ -\x4f]|\x51[ -\x50]|\x52[ -\x51]|\x53[ -\x52]|\x54[ -\x53]|\x55[ -\x54]|\x56[ -\x55]|\x57[ -\x56]|\x58[ -\x57]|\x59[ -\x58]|\x5a[ -\x59]|\x5b[ -\x5a]|\x5c[ -\x5b]|\x5d[ -\x5c]|\x5e[ -\x5d]|\x5f[ -\x5e]|\x60[ -\x5f]|\x61[ -\x60]|\x62[ -\x61]|\x63[ -\x62]|\x64[ -\x63]|\x65[ -\x64]|\x66[ -\x65]|\x67[ -\x66]|\x68[ -\x67]|\x69[ -\x68]|\x6a[ -\x69]|\x6b[ -\x6a]|\x6c[ -\x6b]|\x6d[ -\x6c]|\x6e[ -\x6d]|\x6f[ -\x6e]|\x70[ -\x6f]|\x71[ -\x70]|\x72[ -\x71]|\x73[ -\x72]|\x74[ -\x73]|\x75[ -\x74]|\x76[ -\x75]|\x77[ -\x76]|\x78[ -\x77]|\x79[ -\x78]|\x7a[ -\x79]|\x7b[ -\x7a]|\x7c[ -\x7b]|\x7d[ -\x7c]|\x7e[ -\x7d]|\x7f[ -\x7e]
A (better) alternative is to enumerate the ascii characters in reverse order and use the ^ and $ anchors to assert there is nothing else unmatched. This should work for any string length:
^\x7f?\x7e?\x7d?\x7c?\x7b?z?y?x?w?v?u?t?s?r?q?p?o?n?m?l?k?j?i?h?g?f?e?d?c?b?a?`?\x5f?\x5e?\x5d?\x5c?\x5b?Z?Y?X?W?V?U?T?S?R?Q?P?O?N?M?L?K?J?I?H?G?F?E?D?C?B?A?@?\x3f?\x3e?\x3d?\x3c?\x3b?\x3a?9?8?7?6?5?4?3?2?1?0?\x2f?\x2e?\x2d?\x2c?\x2b?\x2a?\x29?\x28?\x27?\x26?\x25?\x24?\x23?\x22?\x21?\x20?$
You may replace ? with * if you want to allow duplicate characters.
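Rather than typing the reverse enumeration by hand, the pattern can be generated. The following is a sketch only, assuming Python; the function name descending_ascii_regex and its allow_repeats parameter are hypothetical. It walks from 0x7f down to 0x20 and makes each character optional (or repeatable):

```python
import re


def descending_ascii_regex(allow_repeats: bool = False) -> re.Pattern:
    """Build a pattern listing ASCII 0x7f down to 0x20, each character optional."""
    quantifier = "*" if allow_repeats else "?"
    body = "".join(
        re.escape(chr(code)) + quantifier for code in range(0x7F, 0x1F, -1)
    )
    return re.compile(body)


strict = descending_ascii_regex()
with_repeats = descending_ascii_regex(allow_repeats=True)

if __name__ == "__main__":
    # fullmatch plays the role of the ^ and $ anchors.
    for sample in ["ae", "ea", "za", "aA", "f$", "aa"]:
        print(sample, strict.fullmatch(sample) is not None)
```

As with the hand-written version, this also accepts the empty string and single characters, which the question says do not matter.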
ps: some people can come up with absurdly long regexes when they aren't the right tool for the job: to parse email, html or the present question.

regex interval with possible characters before and after number VBA

I'm trying to produce a regular expression that can identify a number within an interval in a string in VBA. Sometimes this number has characters around it, other times not (non-consistent notation from a supplier). The expression should identify that 1413 in the three examples below are within the number range 500-2000 (or alternatively that it's not in the number range 0-50 or 51-499).
Example:
Test 12/2014. Tot.flow:1413 m3 or
Test 12/2014. Tot.flow:1413m3 or
Test 12/2014. Tot.flow: 1413
These strings have some identifiers:
there will always be a colon before the number
there may be a white space between the colon and the number
there may be a white space between the number and the m3
m3 is not necessarily always present, and if not, the number is at the end of the string
So far, my attempt at a regex that finds the number range is ([5-9][0-9][0-9]|[1]\d{3}|2000), but this matches all three digit numbers as well (2001 gives a match on 200). However, I understand that I'm missing a couple of concepts needed to achieve the ultimate goal here. I guess my problems are as follows:
How to start the interval at something not being zero (found lots of questions on intervals starting on zero)
How to take into account the variations in notation both for flow: and m3?
I'm only interested in checking that the number lies within the number range. This is driving me bonkers, all help is highly appreciated!
You can just extract the number with regExp.Replace() using the following regex:
^.*:\s*(\d+).*$
The replacement part is $1.
Then, use an ordinary numeric comparison to check whether the value is in the expected range (e.g. If CLng(result) > 499 And CLng(result) < 2001 Then ...).
Test macro:
Dim re As RegExp, tgt As String, src As String
Set re = New RegExp
With re
.pattern = "^.*:\s*(\d+).*$"
.Global = False
End With
src = "Test 12/2014. Tot.flow: 1413"
tgt = re.Replace(src, "$1")
MsgBox (CLng(tgt) > 499 And CLng(tgt) < 2001)
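The same extract-then-compare idea, sketched in Python rather than VBA for illustration. The function name flow_in_range and its low/high parameters are made up, and the sketch assumes the first colon in the string is the one before the number, as in the three example strings:

```python
import re


def flow_in_range(text: str, low: int = 500, high: int = 2000) -> bool:
    """Pull the digits that follow a colon and compare them as a number."""
    match = re.search(r":\s*(\d+)", text)
    if match is None:
        return False  # no "colon followed by digits" found
    return low <= int(match.group(1)) <= high


if __name__ == "__main__":
    samples = [
        "Test 12/2014. Tot.flow:1413 m3",
        "Test 12/2014. Tot.flow:1413m3",
        "Test 12/2014. Tot.flow: 1413",
        "Test 12/2014. Tot.flow: 2001 m3",
    ]
    for sample in samples:
        print(sample, flow_in_range(sample))
```

Doing the range test numerically avoids encoding 500-2000 digit by digit in the pattern, and changing the interval only means changing two numbers.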
You can try with:
:\s?([5-9]\d\d|1\d{3}|2000)\s?(m3|$)
Also, your regex ([5-9][0-9][0-9]|[1]\d{3}|2000) is fine in my opinion; once the number is delimited as above, it should not match numbers below 500 or above 2000.

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
More information is available at regular-expressions.info, which has a comparison of lookaround support across regex flavors.
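As an illustration of the lookbehind form, here is a small sketch assuming Python, whose re module supports fixed-width lookbehind. A plain capturing group is shown next to it, since that works even in flavors without lookbehind. The variable names are made up:

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Lookbehind: the prefix must be present but is not part of the match.
via_lookbehind = re.search(r"(?<=failures = )[0-9]+", output)

# Capturing group: the prefix is matched, and only group 1 is read back.
via_group = re.search(r"failures = ([0-9]+)", output)

if __name__ == "__main__":
    print(via_lookbehind.group(0))
    print(via_group.group(1))
```

Both approaches return only the digits, so no trimming of the "failures = " prefix is needed afterwards.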
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that do not contain only numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
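A quick check of that idea against the strings from the question, sketched in Python; the helper name has_non_digit is hypothetical. re.search is used because the pattern only needs to be found somewhere in the string:

```python
import re


def has_non_digit(text: str) -> bool:
    """True when at least one character in the string is not a digit."""
    return re.search(r"\D", text) is not None


if __name__ == "__main__":
    for sample in ["abc", "a4c", "4bc", "ab4", "123"]:
        print(sample, has_non_digit(sample))
```

One edge case to be aware of: the empty string contains no non-digit either, so it is rejected along with 123.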
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
@Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
That is: any number of digits at the beginning, then at least one letter, then letters or digits.
Try this:
/^.*\D+.*$/
It returns true if there is any symbol that is not a digit. Works fine with all languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
If you are trying to match words that contain at least one letter and consist of letters and digits (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We wanted to use the accepted answer, but it does not work within a YANG model.
And the one I provided here is easy to understand and it's clear:
The start and end can be any characters, but there must be at least one non-numeric character.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
    var res = string.match(/^[0-9]*$/gm);
    if (res == null)
        return string;
    else
        return "fail";
}
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
where fieldname REGEXP '^[a-zA-Z0-9]+$'
and fieldname NOT REGEXP '^[0-9]+$'
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed

regular expression matching all 8 character strings except "00000000"

I am trying to figure out a regular expression which matches any string with 8 symbols, which doesn't equal "00000000".
can any one help me?
thanks
In at least perl regexp using a negative lookahead assertion: ^(?!0{8}).{8}$, but personally i'd rather write it like so:
length $_ == 8 and $_ ne '00000000'
Also note that if you do use the regexp, depending on the language you might need a flag to make the dot match newlines as well, if you want that. In perl, that's the /s flag, for "single-line mode".
Unless you are being forced into it for some reason, this is not a regex problem. Just use len(s) == 8 && s != "00000000" or whatever your language uses to compare strings and lengths.
If you need a regex, ^(?!0{8})[A-Za-z0-9]{8}$ will match a string of exactly 8 alphanumeric characters that is not 00000000. Changing the values inside the [] will allow you to set the accepted characters.
As mentioned in the other answers, regular expressions are not the right tool for this task. I suspect it is a homework, thus I'll only hint a solution, instead of stating it explicitly.
The regexp "any 8 symbols except 00000000" may be broken down as a sum of eight regexps in the form "8 symbols with non-zero symbol on the i-th position". Try to write down such an expression and then combine them into one using alternative ("|").
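For readers who want to see the hinted construction spelled out, here is one possible sketch, assuming Python. The names LENGTH, alternatives and NOT_ALL_ZEROS are made up. Each alternative demands a non-zero character at position i and allows anything in the other seven positions:

```python
import re

LENGTH = 8

# Alternative i: i arbitrary characters, one character that is not '0',
# then the remaining LENGTH - 1 - i arbitrary characters.
alternatives = [f".{{{i}}}[^0].{{{LENGTH - 1 - i}}}" for i in range(LENGTH)]

# DOTALL lets '.' match newlines too, in case those count as symbols.
NOT_ALL_ZEROS = re.compile("|".join(alternatives), re.DOTALL)

if __name__ == "__main__":
    for sample in ["00000000", "00A00000", "10000000", "000000001", "010000"]:
        print(sample, NOT_ALL_ZEROS.fullmatch(sample) is not None)
```

This uses no lookahead at all, so the same construction should carry over to flavors that lack it, at the cost of a much longer pattern.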
Unless you have unspecified requirements, you really don't need a regular expression for this:
if len(myString) == 8 and myString != "00000000":
...
(in the language of your choice, of course!)
If you need to extract all eight character strings not equal to "00000000" from a larger string, you could use
"(?=.{8})(?!0{8})."
to identify the first character of each sequence and extract eight characters starting with its index.
Of course, one would simply check
if stuff != '00000000'
...
but for the record, one could easily employ
heavyweight regex (in Perl) for that ;-)
...
use re 'eval';
my @strings = qw'00000000 00A00000 10000000 000000001 010000';
my $L = 8;
print map "$_ - ok\n",
grep /^(.{$L})$(??{$^Nne'0'x$L?'':'^$'})/,
@strings;
...
prints
00A00000 - ok
10000000 - ok
go figure ;-)
Regards
rbo
Wouldn't ([1-9]**|\D*){8} do it? Or am I missing something here (which is actually just the inverse of ndim's, which seems like it oughta work).
I am assuming the characters were chosen to include more than digits.
Ok so that was wrong, so Professor Bolo did I get a passing grade? (I love reg expressions so I am really curious).
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '00000000'):
... print 'match'
...
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10000000'):
... print 'match'
match
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10011100'):
... print 'match'
match
>>>
That work?