regular expression matching all 8 character strings except "00000000" - regex

I am trying to figure out a regular expression which matches any string of 8 symbols that doesn't equal "00000000".
Can anyone help me?
Thanks

In Perl regexes at least, you can use a negative lookahead assertion: ^(?!0{8}).{8}$, but personally I'd rather write it like so:
length $_ == 8 and $_ ne '00000000'
Also note that if you do use the regexp, depending on the language you might need a flag to make the dot match newlines as well, if you want that. In perl, that's the /s flag, for "single-line mode".
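For readers who want to try the lookahead outside Perl, here is a minimal sketch in Python (my choice of language; the question does not name one). The helper name is_valid is hypothetical, and re.DOTALL plays the role of Perl's /s flag.

```python
import re

# fullmatch anchors both ends, so explicit ^ and $ are not needed.
# re.DOTALL lets "." match newlines as well, like Perl's /s flag.
PATTERN = re.compile(r"(?!0{8}).{8}", re.DOTALL)


def is_valid(s):
    """Return True for any 8-character string other than '00000000'."""
    return PATTERN.fullmatch(s) is not None


for candidate in ["00000000", "00A00000", "10000000", "000000001", "010000"]:
    print(candidate, is_valid(candidate))
```

Without re.DOTALL, an 8-character string containing a newline would be rejected, which may or may not be what you want.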

Unless you are being forced into it for some reason, this is not a regex problem. Just use len(s) == 8 && s != "00000000" or whatever your language uses to compare strings and lengths.

If you need a regex, ^(?!0{8})[A-Za-z0-9]{8}$ will match a string of exactly 8 alphanumeric characters other than "00000000". Changing the values inside the [] will allow you to set the accepted characters.

As mentioned in the other answers, regular expressions are not the right tool for this task. I suspect this is homework, so I'll only hint at a solution instead of stating it explicitly.
The regexp "any 8 symbols except 00000000" may be broken down as a union of eight regexps of the form "8 symbols with a non-zero symbol at the i-th position". Try to write down such an expression and then combine them into one using alternation ("|").
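The decomposition hinted at above can be sketched in Python by generating the eight branches instead of typing them. This is only one possible reading of the hint; the names branches, ALTERNATION and matches_alternation are made up for illustration.

```python
import re

# One branch per position i: any 8 symbols, except that position i is not "0".
branches = ["." * i + "[^0]" + "." * (7 - i) for i in range(8)]
ALTERNATION = re.compile("|".join(branches), re.DOTALL)

# The lookahead version from the other answers, for comparison.
LOOKAHEAD = re.compile(r"(?!0{8}).{8}", re.DOTALL)


def matches_alternation(s):
    """True if s has 8 symbols and at least one of them is not '0'."""
    return ALTERNATION.fullmatch(s) is not None


for s in ["00000000", "00000001", "00A00000", "0000000"]:
    print(s, matches_alternation(s), LOOKAHEAD.fullmatch(s) is not None)
```

Both patterns should agree on every input; the alternation simply needs no lookahead support.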

Unless you have unspecified requirements, you really don't need a regular expression for this:
if len(myString) == 8 and myString != "00000000":
...
(in the language of your choice, of course!)

If you need to extract all eight character strings not equal to "00000000" from a larger string, you could use
"(?=.{8})(?!0{8})."
to identify the first character of each sequence and extract eight characters starting with its index.
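A small Python sketch of that extraction step, assuming overlapping windows are wanted (the pattern consumes only one character per match, so consecutive windows overlap). The function name eight_char_windows is hypothetical.

```python
import re

# Matches the first character of every 8-character window that is not all zeros.
WINDOW_START = re.compile(r"(?=.{8})(?!0{8}).")


def eight_char_windows(text):
    """Return every 8-character window of text that is not '00000000'."""
    return [text[m.start():m.start() + 8] for m in WINDOW_START.finditer(text)]


print(eight_char_windows("000000001"))
print(eight_char_windows("abcdefghi"))
```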

Of course, one would simply check
if stuff != '00000000'
...
but for the record, one could easily employ
heavyweight regex (in Perl) for that ;-)
...
use re 'eval';
my @strings = qw'00000000 00A00000 10000000 000000001 010000';
my $L = 8;
print map "$_ - ok\n",
      grep /^(.{$L})$(??{$^Nne'0'x$L?'':'^$'})/,
      @strings;
...
prints
00A00000 - ok
10000000 - ok
go figure ;-)
Regards
rbo

Wouldn't ([1-9]**|\D*){8} do it? Or am I missing something here (which is actually just the inverse of ndim's, which seems like it oughta work).
I am assuming the character set was chosen to include more than digits.
OK, so that was wrong. So, Professor Bolo, did I get a passing grade? (I love regular expressions, so I am really curious.)
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '00000000'):
...     print('match')
...
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10000000'):
...     print('match')
...
match
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10011100'):
...     print('match')
...
match
>>>
That work?

Related

Can you restrict two characters based on their ASCII order in regex?

Let's say I have a string of 2 characters. Using regex (as a thought exercise), I want to accept it only if the first character has an ascii value bigger than that of the second character.
ae should not match because a is before e in the ascii table.
ea, za and aA should match for the opposite reason
f$ should match because $ is before letters in the ascii table.
It doesn't matter if aa or a matches or not, I'm only interested in the base case. Any flavor of regex is allowed.
Can it be done ? What if we restrict the problem to lowercase letters only ? What if we restrict it to [abc] only ? What if we invert the condition (accept when the characters are ordered from smallest to biggest) ? What if I want it to work for N characters instead of 2 ?
I'd guess that would be almost impossible for me to do; however, bobble-bubble impressively solved the problem with:
^~*\}*\|*\{*z*y*x*w*v*u*t*s*r*q*p*o*n*m*l*k*j*i*h*g*f*e*d*c*b*a*`*_*\^*\]*\\*\[*Z*Y*X*W*V*U*T*S*R*Q*P*O*N*M*L*K*J*I*H*G*F*E*D*C*B*A*@*\?*\>*\=*\<*;*\:*9*8*7*6*5*4*3*2*1*0*\/*\.*\-*,*\+*\**\)*\(*'*&*%*\$*\#*"*\!*$(?!^)
Maybe for abc only or some short sequences we would approach solving the problem with some expression similar to,
^(abc|ab|ac|bc|a|b|c)$
^(?:abc|ab|ac|bc|a|b|c)$
that might help you to see how you would go about it.
You can simplify that to:
^(a?b?c?)$
^(?:a?b?c?)$
but I'm not so sure about it.
The number of characters you want to allow is independent of the ordering problem, because you can add a separate assertion for it, such as:
(?!.{n})
where n-1 is the maximum number of characters allowed, which in this case would be
(?!.{3})^(?:a?b?c?)$
(?!.{3})^(a?b?c?)$
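To see how the length assertion combines with the ordering part, here is a quick Python check (Python is my assumption; the thread allows any flavor). The constant name ORDERED_MAX_TWO is made up for illustration.

```python
import re

# (?!.{3}) rejects inputs of three or more characters;
# a?b?c? only accepts the letters in ascending order, each at most once.
ORDERED_MAX_TWO = re.compile(r"(?!.{3})^(?:a?b?c?)$")

for s in ["ab", "ac", "c", "abc", "ba"]:
    print(s, bool(ORDERED_MAX_TWO.search(s)))
```

Note that the empty string also matches, since every letter is optional.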
A regex is not the best tool for the job.
But it's doable. A naive approach is to enumerate all the printable ascii characters and their corresponding lower range:
\x21[ -\x20]|\x22[ -\x21]|\x23[ -\x22]|\x24[ -\x23]|\x25[ -\x24]|\x26[ -\x25]|\x27[ -\x26]|\x28[ -\x27]|\x29[ -\x28]|\x2a[ -\x29]|\x2b[ -\x2a]|\x2c[ -\x2b]|\x2d[ -\x2c]|\x2e[ -\x2d]|\x2f[ -\x2e]|\x30[ -\x2f]|\x31[ -\x30]|\x32[ -\x31]|\x33[ -\x32]|\x34[ -\x33]|\x35[ -\x34]|\x36[ -\x35]|\x37[ -\x36]|\x38[ -\x37]|\x39[ -\x38]|\x3a[ -\x39]|\x3b[ -\x3a]|\x3c[ -\x3b]|\x3d[ -\x3c]|\x3e[ -\x3d]|\x3f[ -\x3e]|\x40[ -\x3f]|\x41[ -\x40]|\x42[ -\x41]|\x43[ -\x42]|\x44[ -\x43]|\x45[ -\x44]|\x46[ -\x45]|\x47[ -\x46]|\x48[ -\x47]|\x49[ -\x48]|\x4a[ -\x49]|\x4b[ -\x4a]|\x4c[ -\x4b]|\x4d[ -\x4c]|\x4e[ -\x4d]|\x4f[ -\x4e]|\x50[ -\x4f]|\x51[ -\x50]|\x52[ -\x51]|\x53[ -\x52]|\x54[ -\x53]|\x55[ -\x54]|\x56[ -\x55]|\x57[ -\x56]|\x58[ -\x57]|\x59[ -\x58]|\x5a[ -\x59]|\x5b[ -\x5a]|\x5c[ -\x5b]|\x5d[ -\x5c]|\x5e[ -\x5d]|\x5f[ -\x5e]|\x60[ -\x5f]|\x61[ -\x60]|\x62[ -\x61]|\x63[ -\x62]|\x64[ -\x63]|\x65[ -\x64]|\x66[ -\x65]|\x67[ -\x66]|\x68[ -\x67]|\x69[ -\x68]|\x6a[ -\x69]|\x6b[ -\x6a]|\x6c[ -\x6b]|\x6d[ -\x6c]|\x6e[ -\x6d]|\x6f[ -\x6e]|\x70[ -\x6f]|\x71[ -\x70]|\x72[ -\x71]|\x73[ -\x72]|\x74[ -\x73]|\x75[ -\x74]|\x76[ -\x75]|\x77[ -\x76]|\x78[ -\x77]|\x79[ -\x78]|\x7a[ -\x79]|\x7b[ -\x7a]|\x7c[ -\x7b]|\x7d[ -\x7c]|\x7e[ -\x7d]|\x7f[ -\x7e]
A (better) alternative is to enumerate the ascii characters in reverse order and use the ^ and $ anchors to assert there is nothing else unmatched. This should work for any string length:
^\x7f?\x7e?\x7d?\x7c?\x7b?z?y?x?w?v?u?t?s?r?q?p?o?n?m?l?k?j?i?h?g?f?e?d?c?b?a?`?\x5f?\x5e?\x5d?\x5c?\x5b?Z?Y?X?W?V?U?T?S?R?Q?P?O?N?M?L?K?J?I?H?G?F?E?D?C?B?A?@?\x3f?\x3e?\x3d?\x3c?\x3b?\x3a?9?8?7?6?5?4?3?2?1?0?\x2f?\x2e?\x2d?\x2c?\x2b?\x2a?\x29?\x28?\x27?\x26?\x25?\x24?\x23?\x22?\x21?\x20?$
You may replace ? with * if you want to allow duplicate characters.
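Rather than typing the long pattern by hand, it can be generated. The following Python sketch builds an equivalent pattern from the code points 0x7f down to 0x20; the names body, DESCENDING and strictly_descending are hypothetical, and re.escape is used so that metacharacters such as $ or ? are taken literally.

```python
import re

# Every ASCII character from DEL (0x7f) down to space (0x20), each optional,
# listed in descending order so only strictly decreasing strings can match.
body = "".join(re.escape(chr(cp)) + "?" for cp in range(0x7F, 0x1F, -1))
DESCENDING = re.compile("^" + body + "$")


def strictly_descending(s):
    """True if the characters of s appear in strictly decreasing ASCII order."""
    return DESCENDING.search(s) is not None


for s in ["ea", "za", "aA", "f$", "ae", "aa"]:
    print(s, strictly_descending(s))
```

As with the hand-written version, the empty string and single characters also match; swapping "?" for "*" in the generator would allow repeated characters.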
ps: some people can come up with absurdly long regexes when they aren't the right tool for the job: to parse email, html or the present question.

How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:
my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);
I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I'd expected this code to return "חום", however it returns "חם".
I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.
Thanks!
My regex-fu is weak, so I'd go with a less magical solution.
First, you can remove all marks via samemark:
'חוּם'.samemark('a')
Second, you can decompose the graphemes via .NFD and operate on individual codepoints - eg only keeping values with property Grapheme_Base - and then recompose the string:
Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
In case of mixed strings, stripping marks from Hebrew characters only could look like this:
$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
Here is a simple approach:
my $hum = 'חוּם';
my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;
my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max;
}
say join('', @ords.map: { .chr });
Output:
חום
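For comparison only, here is the same codepoint-level idea sketched in Python rather than Perl 6. Python strings iterate by code point, so the range filter needs no decomposition step; the second helper mirrors the NFD approach using the standard unicodedata module. Both function names are made up for this illustration.

```python
import unicodedata


def keep_hebrew_letters(text):
    """Keep only code points between HEBREW LETTER ALEF and HEBREW LETTER TAV."""
    low = ord("\N{HEBREW LETTER ALEF}")
    high = ord("\N{HEBREW LETTER TAV}")
    return "".join(ch for ch in text if low <= ord(ch) <= high)


def strip_marks(text):
    """Decompose, drop combining marks, and recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


hum = "\u05d7\u05d5\u05bc\u05dd"  # the word from the question, with the dagesh
print(keep_hebrew_letters(hum))
print(strip_marks(hum))
```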

Regular expression that accepts only characters with accents

I need a regular expression that accepts only characters having accents. For the moment I'm using this one:
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ]*$
Is there another expression, which is clearer than my expression?
I think this will solve your problem:
[œÀ-ÖØ-öø-ÿ]*$
Since all characters except the œ are between characters 192 À and 255 ÿ, could you do something like looking ahead and checking they don't contain any of the characters in the range that you don't want? I'm not sure it improves anything compared to yours but it's a bit shorter and maybe, just maybe, clearer.
(?![÷×])[À-ÿœ]
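One way to convince yourself that the shorter forms cover exactly the original list is to compare the sets of characters each one accepts. This is a rough Python check under the assumption that only code points below U+0250 matter; the names EXPLICIT, RANGES, LOOKAHEAD and accepted are hypothetical.

```python
import re

# The explicit character list from the question.
EXPLICIT = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ"
RANGES = re.compile(r"[œÀ-ÖØ-öø-ÿ]")
LOOKAHEAD = re.compile(r"(?![÷×])[À-ÿœ]")


def accepted(pattern, limit=0x250):
    """Set of single characters below `limit` that the pattern accepts."""
    return {chr(cp) for cp in range(limit) if pattern.fullmatch(chr(cp))}


print(accepted(RANGES) == set(EXPLICIT))
print(accepted(LOOKAHEAD) == set(EXPLICIT))
```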
Regex isn't always the clearest way to handle text, even if it is the fastest.
You could assign your regular expression to a variable, then insert it via text interpolation:
accent_chars = '[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ]'
my_regex = '^...%s*...$' % accent_chars
You can also use these ranges:
[œÀ-ÖØ-öø-ÿ]
Demonstration using Python 3:
>>> import re
>>> s = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ'
>>> ''.join(re.findall('[œÀ-ÖØ-öø-ÿ]', s))
'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ'
>>> len(''.join(re.findall('[œÀ-ÖØ-öø-ÿ]', s))) == len(s)
True
The downside is that it is not immediately clear to someone unfamiliar with Unicode that this covers every desired case.
You could also try using the POSIX bracket expression [:alpha:].
Then just prune the alphabetic characters from your string.

Regular expression any character with dynamic size

I want to use a regular expression that does the following (I extracted the part I'm having trouble with in order to simplify):
any character for the first 1 to 5 characters, then an underscore, then some digits, then an underscore, then some digits or dots.
With a restriction on "underscore" it should give something like that:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But I want to allow the "_" in the first 1-5 characters as long as the rest of the regular expression still matches, for example if I had something like:
to_to_123_12.56
I think this is linked to greedy matching in the regex engine; nevertheless, I tried some lazy quantifiers as explained here, but without success.
Any ideas?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with a dot (.).
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re

strs = [
    "a_12_1",
    "abc_12_134",
    "abcd_123_1.",
    "abcde_12_1",
    "a_123_123.456.7890.",
    "a_12_1",
    "ab_de_12_1",
]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for s in strs:
    m = re.match(myre, s)
    if m:
        # "ALL" means the match covers the entire string
        label = "Yes: ALL" if m.group(0) == s else "Yes:"
    else:
        label = "No:"
    print(label, s)
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_123_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
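A short Python sketch makes the backtracking visible: the restricted class fails on the sample, while the greedy and lazy dot versions both end up with the same groups. The constant names STRICT, RELAXED and LAZY are just labels chosen for this example.

```python
import re

STRICT = re.compile(r"^([^_]{1,5})_(\d{2,3})_([\d.]*)$")   # no "_" in the prefix
RELAXED = re.compile(r"^(.{1,5})_(\d{2,3})_([\d.]*)$")      # greedy dot
LAZY = re.compile(r"^(.{1,5}?)_(\d{2,3})_([\d.]*)$")        # lazy dot

sample = "to_to_123_12.56"
print(STRICT.match(sample))            # the prefix cannot contain "_"
print(RELAXED.match(sample).groups())
print(LAZY.match(sample).groups())
```

Because the rest of the pattern is anchored with $, the engine backtracks until the prefix is the only split that lets everything else match, whichever quantifier style is used.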
While answering the comment (writing the lazy expression), I saw that I had made a mistake... if I simply use the following classical regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.

RegEx for an invoice format

I'm quite new to regular expressions and I'm trying to create a regex for the validation of an invoice format.
The pattern should be:
JjYy (all 4 characters are legit), used 0, 2 or 4 times
e.g. no Y's at all is valid, YY is valid, YYYY is valid, but YYY should fail.
Followed by a series of 0's repeating 3 to 10 times.
The whole should never exceed 10 characters.
examples:
JyjY000000 is valid (albeit quite strange)
YY000 is valid
000000 is valid
jjj000 is invalid
jjjj0 is invalid
I learned some basics from here, but my regex fails when it shouldn't. Can someone assist in improving it?
My regex so far is: [JjYy]{0}|[JjYy]{2}|[JjYy]{4}[0]{3,10}.
The following failed also: [JjYy]{0|2|4}[0]{3,10}
As you need the total length to never exceed 10 characters I think you have to handle the three kinds of prefixes separately:
0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}
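Note that the alternation has to be anchored to work as a validator. A minimal Python sketch, using fullmatch so that each branch must cover the whole string (the helper name is_valid_invoice is hypothetical):

```python
import re

INVOICE = re.compile(r"0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}")


def is_valid_invoice(s):
    """True if the whole string is a valid invoice number of at most 10 chars."""
    return INVOICE.fullmatch(s) is not None


for s in ["JyjY000000", "YY000", "000000", "jjj000", "jjjj0"]:
    print(s, is_valid_invoice(s))
```

With an unanchored search instead, "jjj000" would be accepted, because the substring "jj000" matches the second branch.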
How about:
^([JjYy]{2}){0,2}0{3,10}$
To check the length is ten characters or less, use a string length function rather than a regular expression - don't hammer nails with a screwdriver, and so forth.
Test:
#!perl
use warnings;
use strict;
my $re = qr/^([JjYy]{2}){0,2}0{3,10}$/;
my %tests = qw/JyjY000000 valid
               YY000 valid
               000000 valid
               jjj000 invalid
               jjjj0 invalid/;
for my $k (keys %tests) {
    print "$k is ";
    if ($k =~ /$re/) {
        print "valid";
    } else {
        print "invalid";
    }
    print " and it should be $tests{$k}.\n";
}
Produces
jjjj0 is invalid and it should be invalid.
YY000 is valid and it should be valid.
JyjY000000 is valid and it should be valid.
jjj000 is invalid and it should be invalid.
000000 is valid and it should be valid.
([jJyY]{2}){0,2}0{3,10}
If the total length limit is inclusive of the jJyY part, you can check it with a negative lookahead, (?![jJyY0]{11,}), which makes sure there are no more than 10 characters in the string to begin with:
\b(?![jJyY0]{11,})([jJyY]{2}){0,2}0{3,10}\b
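As a sanity check, the lookahead version can be compared against the three-way split suggested in other answers. The Python sketch below uses fullmatch in place of the \b anchors, which is an assumption about how the pattern would be used for validation; the names LOOKAHEAD, SPLIT and agree are made up.

```python
import re
from itertools import product

LOOKAHEAD = re.compile(r"(?![jJyY0]{11,})(?:[jJyY]{2}){0,2}0{3,10}")
SPLIT = re.compile(r"0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}")


def agree(candidates):
    """True if both patterns accept and reject exactly the same candidates."""
    return all(
        (LOOKAHEAD.fullmatch(s) is None) == (SPLIT.fullmatch(s) is None)
        for s in candidates
    )


# Prefixes of 0 to 5 letters followed by 0 to 12 zeros.
candidates = [
    "".join(prefix) + "0" * zeros
    for n in range(6)
    for prefix in product("Jy", repeat=n)
    for zeros in range(13)
]
print(agree(candidates))
```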
It may depend on what you are using to implement the regular expression. For example, I found out the other day that Notepad++ only supports a few basic operators. Things like the pipe are not supported by every regex implementation; POSIX basic regular expressions, for instance, do not define alternation.
I'd suggest something like this:
([JjYy]{2}([JjYy]{2})?)?[0]{3,10}
If you're using a programming language, you'll need to use a string length function to validate the length.
EDIT: actually, you should be able to validate the length by separating the different situations:
([0]{3,10})|([JjYy]{2}[0]{3,8})|([JjYy]{4}[0]{3,6})
You want to limit the string to 10 characters. So in order to do this you have to consider what valid combinations will make up 10 characters.
Valid combinations therefore would be:
0000000000
000
cc00000000
cc000
cccc000000
cccc000
So, an expression to include all of these would be:
/0{3,10}|[JY]{2}0{3,8}|[JY]{4}0{3,6}/i
A case insensitive match would suffice, although you do get additional performance from some regular expression engines by explicitly saying /[JjYy]/ instead of /[JY]/i.