Regexes for integer constants and for binary numbers - regex

I have tried 2 questions, could you tell me whether I am right or not?
Regular expression of nonnegative integer constants in C, where numbers beginning with 0 are octal constants and other numbers are decimal constants.
I tried 0([1-7][0-7]*)?|[1-9][0-9]*, is it right? And what string could I match? Do you think 034567 will match and 000083 match?
What is a regular expression for binary numbers x such that h^x + i^x = j^x?
I tried (0|1){32}|1|(10)).. do you think a string like 10 will match and 11 won’t match?
Please tell me whether I am right or not.

You can always use http://www.spaweditor.com/scripts/regex/ for a quick test on whether a particular regex works as you intend it to. This along with google can help you nail the regex you want.

0([1-7][0-7])?|[1-9][0-9] is wrong because there's no repetition - it will only match 1 or 2-character strings. What you need is something like 0[0-7]*|[1-9][0-9]*, though that doesn't take hexadecimal into account (as per spec).
This one is not clear. Could you rephrase that or give some more examples?

Your regex for integer constants will not match base-10 numbers longer than two digits and octal numbers longer than three digits (2 if you don't count the leading zero). Since this is homework, I'll leave it up to you to figure out what's wrong with it.
Hint: Google for "regular expression repetition quantifiers".

Question 1:
Octal numbers:
A string that starts with a 0, can then be followed by any digit 1, 2, ..., 7, i.e. [1-7] (assuming no leading zeroes), and may also contain zeroes after the first actual digit, hence [0-7]* (* means repetition, zero or more times).
So we get the following RegEx for this part: 0 [1-7][0-7]*
Decimal numbers:
Decimal numbers must not have a leading zero, hence start with all digits from 1 to 9 [1-9], but zeroes are allowed in all other positions as well hence we need to concatenate [0-9]*
So we get the following RegEx for this part: [1-9][0-9]*
Since we have two options (octal and decimal numbers) and either one is possible we can use the Alternation property '|' :
L = 0[1-7][0-7]* | [1-9][0-9]*
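A quick way to compare the two candidate patterns from this thread is to run them against a few sample strings. The sketch below uses Python's re module; the constant names, the helper matches() and the sample list are made up for illustration.

```python
import re

# Pattern from the earlier answer: a leading 0 followed by any octal digits.
OCTAL_OR_DECIMAL_LOOSE = re.compile(r"0[0-7]*|[1-9][0-9]*")
# Pattern derived above: the digit right after the leading 0 must be 1-7.
OCTAL_OR_DECIMAL_STRICT = re.compile(r"0[1-7][0-7]*|[1-9][0-9]*")

def matches(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

samples = ["0", "007", "034567", "000083", "123", "089"]
for s in samples:
    print(s, matches(OCTAL_OR_DECIMAL_LOOSE, s), matches(OCTAL_OR_DECIMAL_STRICT, s))
```

Note that the stricter pattern rejects a lone 0, which C treats as a valid (octal) constant, so which of the two is "right" depends on how the exercise defines leading zeroes.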
Question 2:
Quickly looking at Fermat's Last Theorem:
In number theory, Fermat's Last Theorem (sometimes called Fermat's conjecture, especially in older texts) states that no three positive integers a, b, and c can satisfy the equation a^n + b^n = c^n for any integer value of n greater than two.
(http://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem)
Hence the following sets where n<=2 satisfy the equation: {0,1,2}base10 = {0,1,10}base2
If any of those elements satisfy the equation, we use the Alternation | (or)
So the regular expression can be: L = 0 | 1 | 10 but can also be L = 00 | 01 | 10 or even be L = 0 | 1 | 10 | 00 | 01
Or can be generalized into:
{0} we can have infinite number of zeroes: 0*
{1} we can have infinite number of zeroes followed by a 1: 0*1
{10} we can have infinite number of zeroes followed by 10: 0*10
So L = 0* | 0*1 | 0*10

max answered the first question.
The second appears to be the unsolvable Diophantine equation of Fermat's Last Theorem. If h, i, j are non-zero integers, x can only be 1 or 2, so you're looking for
^0*10?$
does that help?
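If that reading of the question is right, the pattern can be checked with a few lines of Python. This is only a sketch; the name ONE_OR_TWO and the sample strings are assumptions for illustration.

```python
import re

# Pattern from the answer above: optional leading zeros, then binary 1 or 10.
ONE_OR_TWO = re.compile(r"^0*10?$")

for text in ["1", "10", "0010", "11", "0", ""]:
    # bool() turns the match object (or None) into True/False
    print(repr(text), bool(ONE_OR_TWO.match(text)))
```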

There are several tools available to test regular expressions, such as The Regulator.
If you search for "regular expression test" you will find numerous links to online testers.

Related

Optimization of Regular Expression to match numbers bigger or equal to 50

I want to check if a number is 50 or more using a regular expression. This in itself is no problem but the number field has another regex checking the format of the entered number.
The number will be in the continental format: 123.456,78 (a dot between groups of three digits and always a comma with 2 digits at the end)
Examples:
100.000,00
50.000,00
50,00
34,34
etc.
I want to capture numbers which are 50 or more. So from the four examples above the first three should be matched.
I've come up with this rather complicated one and am wondering if there is an easier way to do this.
^(\d{1,3}[.]|[5-9][0-9]|\d{3}|[.]\d{1,3})*[,]\d{2}$
EDIT
I want to match continental numbers here. The numbers have this format due to internal regulations and specify a price.
Example: 1000 EUR would be written as 1.000,00 EUR
50000 as 50.000,00 and so on.
It's a matter of taste, obviously, but using a negative lookahead gives a simple solution.
^(?!([1-4]?\d),)[1-9](\d{1,2})?(\.\d{3})*,\d{2}\b
In words: starting from a boundary ignore all numbers that start with 1 digit OR 2 digits (the first being a 1,2,3 or 4), followed by a comma.
Check on regex101.com
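For anyone who prefers to check the lookahead version locally, here is a small Python sketch. The helper name passes_lookahead and the sample values are assumptions, not part of the answer.

```python
import re

# Negative lookahead rejects one digit, or [1-4] plus one digit, before the comma.
AT_LEAST_50 = re.compile(r"^(?!([1-4]?\d),)[1-9](\d{1,2})?(\.\d{3})*,\d{2}\b")

def passes_lookahead(text):
    """Return True if the pattern accepts the string as a number >= 50."""
    return AT_LEAST_50.match(text) is not None

for s in ["100.000,00", "50.000,00", "50,00", "99,99", "49,99", "34,34", "9,99"]:
    print(s, passes_lookahead(s))
```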
Try:
EDIT ^(.{3,}|[5-9]\d),\d{2}$
It checks if:
there are 3 chars or more before the ,
there are 2 numbers before the , and the first is between 5 and 9
and then a , and 2 numbers
I don't know if it answers your question, as it'll return true for:
aa50,00
1sdf,54
But this assumes that your original string is a number in the format you expect (as it was not a requirement in your question).
EDIT 3
The regex below tests if the number is valid referring to the continental format and if it's equal or greater than 50. See tests here.
Regex: ^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$
Explanation (d is a number):
([1-9]\d{0,2}\.): either d., dd. or ddd. one time with the first d between 1 and 9.
(\d{3}\.){0,}: ddd. zero or more times
\d{3}: ddd, exactly 3 digits
These 3 parts combined match any number equal to or greater than 1000, like: 1.000, 22.002 or 100.000.000.
([1-9]\d{2}): any number between 100 and 999.
([5-9]\d): a digit between 5 and 9 followed by another digit. Matches anything between 50 and 99.
So it's either the one of the parts above or this one.
Then ,\d{2}$ matches the comma and the two last digits.
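The breakdown above can be exercised with a short Python sketch. The constant name and the sample values are illustrative assumptions; the pattern itself is copied unchanged from the answer.

```python
import re

# Validates the continental format and requires a value of at least 50.
VALID_AND_AT_LEAST_50 = re.compile(
    r"^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$"
)

def accepted(text):
    """Return True if the string is well formed and at least 50."""
    return VALID_AND_AT_LEAST_50.match(text) is not None

for s in ["1.000,00", "100.000.000,00", "999,99", "50,00",
          "34,34", "aa50,00", "1000,00", "050,00"]:
    print(s, accepted(s))
```

Unlike the shorter pattern earlier in this answer, this one also rejects malformed input such as aa50,00 or a four-digit group without a dot.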
I have named all inner groups, for better understanding what part of number is matched by each group. After you understand how it works, change all ?P<..> to ?:.
This one is for any dec number in the continental format.
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})*|0)(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})*,)|0,|,)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
This one is for the same with the limit number>=50
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})+|(?P<int_short>[1-9]\d{2}|[5-9]\d))(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})+,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
If the integer part is always under 999.999 and the fractional part always has 2 digits, it gets a bit simpler:
^(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})?,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d)(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_end>\d{1,2})))?$
If you can guarantee that the number is correctly formed -- that is, that the regex isn't expected to detect that 5,0.1 is invalid, then there are a limited number of passing cases:
ends with \d{3}
ends with [5-9]\d
contains \d{3},
contains [5-9]\d,
It's not actually necessary to do anything with \.
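The easiest regex is to code for each of these individually:
(\d{3}$|[5-9]\d$|\d{3},|[5-9]\d,)
You could make it more compact by merging some of the cases. Note that $ inside a character class such as [$,] is a literal dollar sign rather than an anchor, so the end-of-string test has to stay outside the class:
(\d{3}|[5-9]\d)(,|$)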
If you need to also validate the format, you will need extra complexity. I would advise against attempting to do both in a single regex.
However unless you have a very good reason for having to do this with a regex, I recommend against it. Parse the string into an integer, and compare it with 50.
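A minimal sketch of the parse-and-compare approach in Python, assuming the input is already known to be in the continental format. The function names are made up for illustration, and Decimal is used here instead of an integer so the cents are kept.

```python
from decimal import Decimal

def parse_continental(text):
    """Convert a string such as '1.234,56' to a Decimal (assumes valid input)."""
    # Drop the thousands separators, then turn the decimal comma into a point.
    return Decimal(text.replace(".", "").replace(",", "."))

def is_at_least_50(text):
    """Return True if the parsed value is 50 or more."""
    return parse_continental(text) >= 50

for s in ["100.000,00", "50.000,00", "50,00", "34,34"]:
    print(s, is_at_least_50(s))
```

Format validation can then stay in its own, simpler regex, and the threshold becomes an ordinary number that is easy to change.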

Matching numbers greater than 40

I'm trying to match numbers greater than 40. The good point is that all of them have 2 decimal places, so all of them are like: 3.25, 5.89, 999.75 and they don't use any leading zeros (except on the decimal part that always have 2 digits)...
At first I tried the following code but then I realized this wouldn't match numbers like 100, 1000... even if they are greater than 40.
[4-9][0-9]\.
I don't have to match the decimal part, so don't worry about matching that, just help me to find how to match numbers greater than 40 (up to 9999 would be fine).
Thanks for your help.
This should do the job:
([4-9][0-9]|\d{3,})\.
Check it here:
http://www.regexr.com/3a5v9
Don't use regular expressions for number comparison. If, for example, you're using Javascript:
var aNumber = parseFloat("50");
if (aNumber > 40) {
    // yay!
}
If your regex flavour can use negative lookbehind to match the numbers from 41 to 9999 without decimal:
\b(?:[1-9][0-9]{2,3}|[5-9][0-9]|4[1-9])(?<!\.\d{1,2})\b
(40\.(?!0[^\d]|00)\d{1,2}|(((4[1-9](?!\d)|[5-9][0-9])(?![\d])|\d*[1-9]\d{2,})(\.\d{1,2})?))
This prevents false positives from leading 0s.
This worked for me.
It tries to match 40 followed by 1 or two decimals that are not 00.
It then tries to match 4 followed by 1-9, decimal optional.
If it can't match that it matches 5-9 followed by 0-9, decimal optional.
It then tries to match any digit, any number of times, followed by 1-9, followed by 1 or 2 digits, decimal optional.
If you want to require the decimal, just remove the last question mark.
This will do it:
([4-9][0-9]+|\d{3,})
This will match all numbers of two or more digits whose first digit is 4 or greater, as well as any number with three or more digits.
As an example http://www.regexr.com/3a5v0
You can use braces to indicate a minimum and, if desired, maximum number of repetitions to match. So,
([4-9][0-9]|[1-9][0-9]{2,})\.
matches either 4-9 followed by exactly one digit, or 1-9 followed by two or more digits. Presumably there's a boundary of some sort at the beginning of this, but it sounds like you have that part worked out. This uses an OR to allow for two possible groups of first digits.
(Most of the other answers are perfect for me -- This is paranoia and a bad idea :)
for use with grep -Po or Perl we could use:
'\b(\d{3,}|[4-9]\d)\.\d\d'
but this would get 40.00 (not greater than 40)
'\b(\d{3,}|[5-9]\d|4[1-9])\.\d\d|\b40\.\d?[1-9]\d?'
Corresponding to:
DDD.DD
| [5-9]D.DD
| 4[1-9].DD
| 40.D[1-9]
| 40.[1-9]D
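The same pattern also works with Python's re module, so the 40.xx edge cases can be checked without grep. This is a sketch; the constant name, the helper and the sample values are assumptions for illustration.

```python
import re

# Same alternation as above: three or more digits, 50-99, 41-49,
# or 40 with a non-zero fractional part.
STRICTLY_OVER_40 = re.compile(
    r"\b(\d{3,}|[5-9]\d|4[1-9])\.\d\d|\b40\.\d?[1-9]\d?"
)

def over_40(text):
    """Return True if the pattern finds a number greater than 40 in the text."""
    return STRICTLY_OVER_40.search(text) is not None

for s in ["40.00", "40.01", "40.10", "41.00", "39.99", "3.25", "999.75"]:
    print(s, over_40(s))
```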
In flex(1) you have this code to parse strings and get numbers greater than 40:
pru.l:
%option noyywrap
%%
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)? { printf("Greater than 40: %s\n", yytext); }
\-?[0-9]*(\.[0-9]*)? { printf("Lesser than 40: %s\n", yytext); }
\n |
. ;
%%
int main()
{ yylex(); }
Install flex and compile this file with
make pru
Then run it as:
pru <filein >fileout
or just
pru
This code constructs a deterministic finite automaton from the regular expressions listed and runs the commands listed on the right when it recognizes a value greater than 40. It allows a leading optional sign and leading zeros, and an optional fractional part composed of any number of digits. And it does this with only one assignment and one decision for each character read. You have access to the automaton state table generated by flex (it writes C code for you).
The regex that recognizes numbers greater than 40 (with decimals, a leading sign and leading zeros) is:
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)?
and can be abbreviated as:
\+?(0*[4-9][0-9]|0*[1-9][0-9]{2,})(\.[0-9]*)?
Explanation:
\+? matches an optional plus sign.
(...|...) two options:
0* optional arbitrary number of leading zeros.
[4-9][0-9] the numbers 40 to 99.
[1-9][0-9]{2,} the numbers 100 and up.
(\.[0-9]*)? optional decimal point followed by an arbitrary number of digits.
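The long form of the expression can also be tried outside flex. Below is a rough Python sketch; the constant name, the helper and the samples are illustrative, and fullmatch is used so the whole token has to match, similar to how the flex rule consumes a token.

```python
import re

# Long form of the flex rule: optional +, leading zeros, then 40-99 or 100 and up,
# followed by an optional fractional part.
FORTY_AND_UP = re.compile(
    r"\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)?"
)

def recognized(text):
    """Return True if the whole string is accepted by the rule."""
    return FORTY_AND_UP.fullmatch(text) is not None

for s in ["40", "99.99", "100", "+0041.5", "39.9", "7", "-50"]:
    print(s, recognized(s))
```

One thing the samples make visible: 40 itself is accepted, so the rule really means "40 or more" unless the boundary is handled separately.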

Reg Ex for even number of 0s and 1s

I am trying to create a regular expression that determines if a string (of any length) matches a regex pattern such that the number of 0s in the string is even, and the number of 1s in the string is even. Can anyone help me determine a regex statement that I could try and use to check the string for this pattern?
So completely reformulated my answer to reflect all the changes:
This regex matches all strings that consist only of zeros and ones and contain an even number of each
^(?=1*(?:01*01*)*$)(?=0*(?:10*10*)*$).*$
See it here on Regexr
I am working here with positive lookahead assertions. The big advantage here of a lookahead assertion is, that it checks the complete string, but without matching it, so both lookaheads start to check the string from the start, but for different assertions.
(?=1*(?:01*01*)*$) checks for an even number of 0s (including none)
(?=0*(?:10*10*)*$) checks for an even number of 1s (including none)
.* then actually matches the string
The first lookahead, broken down:
(?=
    1*        # match 0 or more 1
    (?:       # open a non capturing group
        0     # match one 0
        1*    # match 0 or more 1
        0     # match one 0
        1*    # match 0 or more 1
    )
    *         # repeat this group zero or more times
    $         # till the end of the string
)
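To gain some confidence in the two lookaheads, the regex can be compared against a plain counting function for every bit string up to some length. The sketch below does that in Python; the function names and the length limit are assumptions chosen for illustration.

```python
import re
from itertools import product

# The lookahead-based pattern from above.
EVEN_BOTH = re.compile(r"^(?=1*(?:01*01*)*$)(?=0*(?:10*10*)*$).*$")

def even_counts(bits):
    """Reference check: both digits occur an even number of times."""
    return bits.count("0") % 2 == 0 and bits.count("1") % 2 == 0

def regex_agrees_up_to(max_len):
    """Compare regex and reference on every bit string of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            bits = "".join(chars)
            if bool(EVEN_BOTH.match(bits)) != even_counts(bits):
                return False
    return True

print(regex_agrees_up_to(10))
```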
So, I have come up with a solution to the problem (written in formal notation, where + denotes alternation):
(11+00+(10+01)(11+00)*(10+01))*
For even sets of 0s, you can use the following regex to ensure that the number of 0s is even.
^(1*01*01*)*$
However, I believe that the question is to have both an even number of 0s and also an even number of 1s. Since it is possible to construct a non-deterministic finite automaton (NFA) for this problem, the language is regular and can be represented using a regular expression. The NFA is represented via the machine below; S1 is the start/exit state.
S1 ---1----->S2
|^ <--1----- |^
|| ||
00 00
|| ||
v| v|
S3----1----->S4
<---1------
From there, there's a way to convert NFAs to regular expressions, but it's been a while since my computation course. There are some notes below that seem to be helpful in explaining the steps required to convert an NFA to a regex.
http://www.cs.uiuc.edu/class/sp09/cs373/lectures/lect_08.pdf
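Even without converting the machine to a regex, the automaton can be simulated directly. Here is a small Python sketch; the transition table is my reading of the diagram above (0 moves between the top and bottom rows, 1 moves between the left and right columns), so treat it as an assumption.

```python
# Transition table read off the diagram: (state, symbol) -> next state.
TRANSITIONS = {
    ("S1", "0"): "S3", ("S1", "1"): "S2",
    ("S2", "0"): "S4", ("S2", "1"): "S1",
    ("S3", "0"): "S1", ("S3", "1"): "S4",
    ("S4", "0"): "S2", ("S4", "1"): "S3",
}

def accepts(bits):
    """Run the automaton; accept only if it ends in the start state S1."""
    state = "S1"
    for ch in bits:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # any character other than 0 or 1 is rejected
            return False
    return state == "S1"

for s in ["", "0011", "0101", "01", "000", "2"]:
    print(repr(s), accepts(s))
```

Each state simply records the parity of the 0s and 1s seen so far, which is why four states are enough.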
RE-UPDATED
Try this : [ check out this demo : http://regexr.com?30m7c ]
^(00|11|0011|0110|1100|1001)+$
Hint :
Even numbers are divisible by 2, thus - in binary - they always end in zero (0)
Not a regular expression (which is likely to be impossible, although I can't prove it: the proof by contradiction via the pumping lemma fails), but the "correct" solution is avoiding a complicated and inefficient regular expression altogether and using something like (in Python):
def even01(string):
    return string.count("1") % 2 == 0 and string.count("0") % 2 == 0
Or if the string has to consist only of 1s and 0s:
import re
def even01(string):
    return not re.search("[^01]", string) and \
        string.count("1") % 2 == 0 and string.count("0") % 2 == 0
^(0((1(00)*1)*0|1(11|00)*01)|1((0(11)*0)*1|0(11|00)*10))*$
If I haven't overlooked anything, this matches any bit string where the number of 0s is even and the number of 1s is even, using only rudimentary regex operators (*, |, grouping, ^, $). It's slightly easier to see how it works if written like this:
^(0((1(00)*1)*0
|1(11|00)*01)
|1((0(11)*0)*1
|0(11|00)*10))*$
The following test code should illustrate the correctness - we compare the result of the pattern match against a function that tells us if a string has an even number of 0s and 1s. All bit strings of length 16 are tested.
import re
balanced = lambda s: s.count('0') % 2 == 0 and s.count('1') % 2 == 0
pat = re.compile('^(0((1(00)*1)*0|1(11|00)*01)|1((0(11)*0)*1|0(11|00)*10))*$')
size = 16
num = 2**size
for i in range(num):
    binstr = bin(i)[2:].zfill(size)
    b, m = balanced(binstr), bool(pat.match(binstr))
    if b != m:
        print("balanced('%s') = %d, pat.match('%s') = %d" % (binstr, b, binstr, m))
        break
    elif i != 0 and i % (num // 10) == 0:
        # `//` performs integer division (Python 3)
        print("%d percent done..." % (100 * i // num + 1))
If you try to solve within the same sentence (starting with ^ and ending with $), you are in deep trouble. :-)
You can make sure that you have an even number of 0s (with ^(1*01*01*)*$, as stated by @david-z) OR you can make sure that you have an even number of 1s:
^(1*01*01*)*$|^(0*10*10*)*$
It works for strings with small lengths as well, such as "00" or "101", both valid strings.
I have also been working on lookaheads and lookbehinds in my spare time, and using a lookahead the problem can be solved while also accounting for strings made only of 1s or only of 0s. So, the expression should also work for 11,1111,111111,... and also for 00,0000,000000,....
^(((?=(?:1*01*01*)*$)(?=(?:0*10*10*)*$).*)|([1]{2})*|([0]{2})*)$
Works for all cases.
So, if the string consists of only 1s or only 0s:
([1]{2})*|([0]{2})*
If it contains a mix of 0s and 1s, the positive lookahead will take care of that.
((?=(?:1*01*01*)*$)(?=(?:0*10*10*)*$).*)
Combining both of them, it takes into account all string with even number of 0s and 1s.

Regex - Validation of numeric with up to 4 decimal places

I am having a bit of difficulty with the following:
I need to allow any positive numeric value up to four decimal places. Here are some examples.
Allowed:
123
12345.4
1212.56
8778787.567
123.5678
Not allowed:
-1
12.12345
-12.1234
I have tried the following:
^[0-9]{0,2}(\.[0-9]{1,4})?$|^(100)(\.[0]{1,4})?$
However this doesn't seem to work, e.g. 1000 is not allowed when it should be.
Any ideas would be greatly appreciated.
Thanks
To explain why your attempt is not working for a value of 1000, I'll break down the expression a little:
^[0-9]{0,2} # Match 0, 1, or 2 digits (can start with a zero)...
(\.[0-9]{1,4})?$ # ... optionally followed by (a decimal, then 1-4 digits)
| # -OR-
^(100) # Capture 100...
(\.[0]{1,4})?$ # ... optionally followed by (a decimal, then 1-4 ZEROS)
There is no room for 4 digits of any sort, much less 1000 (there's only room for a 0-2 digit number or the number 100). An expression that allows any number of digits works instead:
^\d* # Match any number of digits (can start with a zero)
(\.\d{1,4})?$ # ...optionally followed by (a decimal and 1-4 digits)
This expression will pass any of the allowed examples and reject all of the Not Allowed examples as well, because you (and I) use the beginning-of-string assertion ^.
It will also pass these numbers:
.2378
1234567890
12374610237856987612364017826350947816290385
000000000000000000000.0
0
... as well as a completely blank line, which might or might not be desired.
To make it reject something that starts with a zero, use this:
^(?!0\d)\d* # Match any number of digits (cannot "START" with a zero)
(\.\d{1,4})?$ # ...optionally followed by (a decimal and 1-4 digits)
This expression (which uses a negative lookahead) has these evaluations:
REJECTED     ALLOWED
---------    -------
0000.1234    0.1234
0000         0
010          0.0
You could also test for a completely blank line in other ways, but if you wanted to reject it with the regex, use this:
^(?!0\d|$)\d*(\.\d{1,4})?$
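The final pattern can be run against the examples from the question with a few lines of Python. This is a sketch only; the constant name, the helper and the sample list are assumptions for illustration.

```python
import re

# No leading zero before another digit, no empty string, at most 4 decimals.
UP_TO_4_DECIMALS = re.compile(r"^(?!0\d|$)\d*(\.\d{1,4})?$")

def is_valid(text):
    """Return True if the string is an allowed positive number."""
    return UP_TO_4_DECIMALS.match(text) is not None

allowed = ["123", "12345.4", "1212.56", "8778787.567", "123.5678", "0.1234", ".2378"]
rejected = ["-1", "12.12345", "-12.1234", "", "0000", "010"]
for s in allowed + rejected:
    print(repr(s), is_valid(s))
```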
Try this:
^[0-9]*(?:\.[0-9]{0,4})?$
Explanation: match only if starting with a digit (excluding negative numbers), optionally followed by (non-capturing group) a dot and 0-4 digits.
Edit: With this pattern .2134 would also be matched. To only allow 0 < x < 1 of format 0.2134, replace the first * with a + above.
This regex would do the trick:
^\d+(?:\.\d{1,4})?$
From the beginning of the string, search for one or more digits. If there's a . it must be followed by at least one digit but a maximum of 4.
^(?<!-)\+?\d+(\.?\d{0,4})?$
This will match something which doesn't start with -, maybe has a +, followed by an integer part with at least one digit and an optional fractional part of at most 4 digits.
Note: Regex does not support scientific notation. If you want that too let me know in a comment.
Well asked!!
You can try this:
^([0-9]+[\.]?[0-9]?[0-9]?[0-9]?[0-9]?|[0-9]+)$
If you have a double value with more decimal places and you want to shorten it to 4, then:
double value = 12.3457652133;
value = Double.parseDouble(new DecimalFormat("##.####").format(value)); // needs import java.text.DecimalFormat

How to detect a floating point number using a regular expression

What is a good regular expression for handling a floating point number (i.e. like Java's Float)
The answer must match against the following targets:
1) 1.
2) .2
3) 3.14
4) 5e6
5) 5e-6
6) 5E+6
7) 7.e8
8) 9.0E-10
9) .11e12
In summary, it should
ignore preceding signs
require the first character to the left of the decimal point to be non-zero
allow 0 or more digits on either side of the decimal point
permit a number without a decimal point
allow scientific notation
allow capital or lowercase 'e'
allow positive or negative exponents
For those who are wondering, yes this is a homework problem. We received this as an assignment in my graduate CS class on compilers. I've already turned in my answer for the class and will post it as an answer to this question.
[Epilogue]
My solution didn't get full credit because it didn't handle more than 1 digit to the left of the decimal. The assignment did mention handling Java floats even though none of the examples had more than 1 digit to the left of the decimal. I'll post the accepted answer in its own post.
Just make both the decimal dot and the E-then-exponent part optional:
[1-9][0-9]*\.?[0-9]*([Ee][+-]?[0-9]+)?
I don't see why you don't want a leading [+-]? to capture a possible sign too, but, whatever!-)
Edit: there might in fact be no digits left of the decimal point (in which case I imagine there must be the decimal point and 1+ digits after it!), so a vertical-bar (alternative) is clearly needed:
(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?
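As a rough check, the second expression can be run against the nine targets from the question in Python. The names below are assumptions for illustration, and fullmatch is used so the entire string has to be a float.

```python
import re

# Either digits starting with 1-9 and an optional fraction, or a bare fraction,
# followed by an optional exponent.
FLOAT_RE = re.compile(r"(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?")

TARGETS = ["1.", ".2", "3.14", "5e6", "5e-6", "5E+6", "7.e8", "9.0E-10", ".11e12"]

def is_float(text):
    """Return True if the whole string matches the pattern."""
    return FLOAT_RE.fullmatch(text) is not None

for t in TARGETS + [".", "e5", "0.5"]:
    print(t, is_float(t))
```

Because the integer part must start with 1-9, an input such as 0.5 is rejected, which follows the "non-zero first character" requirement in the question.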
[This is the answer from the professor]
Define:
N = [1-9]
D = 0 | N
E = [eE] [+-]? D+
L = 0 | ( N D* )
Then floating point numbers can be matched with:
( ( L . D* | . D+ ) E? ) | ( L E )
It was also acceptable to use D+ rather than L, and to prepend [+-]?.
A common mistake was to write D* . D*, but this can match just '.'.
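The definitions above translate fairly directly into a composed regex. The Python sketch below builds it from the named pieces; writing D as [0-9] instead of 0 | N and wrapping the parts in non-capturing groups are my own assumptions about how to spell it.

```python
import re

N = r"[1-9]"
D = r"[0-9]"                      # same set as 0 | N
E = rf"[eE][+-]?{D}+"             # exponent part
L = rf"(?:0|{N}{D}*)"             # integer part without leading zeros

# ( ( L . D* | . D+ ) E? ) | ( L E )
FLOAT_RE = re.compile(rf"(?:(?:{L}\.{D}*|\.{D}+)(?:{E})?|{L}{E})")

TARGETS = ["1.", ".2", "3.14", "5e6", "5e-6", "5E+6", "7.e8", "9.0E-10", ".11e12"]

def is_float(text):
    """Return True if the whole string matches the composed pattern."""
    return FLOAT_RE.fullmatch(text) is not None

for t in TARGETS + [".", "5", "0.5", "00.5"]:
    print(t, is_float(t))
```

A plain integer such as 5 is rejected on purpose here: without a decimal point or an exponent it is not a floating point literal under these definitions.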
[Edit]
Someone asked about a leading sign; I should have asked him why it was excluded but never got the chance. Since this was part of the lecture on grammars, my guess is that either it made the problem easier (not likely) or there is a small detail in parsing where you divide the problem set such that the floating point value, regardless of sign, is the focus (possible).
If you are parsing through an expression, e.g.
-5.04e-10 + 3.14159E10
the sign of the floating point value is part of the operation to be applied to the value and not an attribute of the number itself. In other words,
subtract (5.04e-10)
add (3.14159E10)
to form the result of the expression. While I'm sure mathematicians may argue the point, remember this was from a lecture on parsing.
http://www.regular-expressions.info/floatingpoint.html
Here is what I turned in.
(([1-9]+\.[0-9]*)|([1-9]*\.[0-9]+)|([1-9]+))([eE][-+]?[0-9]+)?
To make it easier to discuss, I'll label the sections
( ([1-9]+ \. [0-9]* ) | ( [1-9]* \. [0-9]+ ) | ([1-9]+)) ( [eE] [-+]? [0-9]+ )?
-------------------------------------------------------- ----------------------
                           A                                       B
A: matches everything up to the 'e/E'
B: matches the scientific notation
Breaking down A we get three parts
( ([1-9]+ \. [0-9]* ) | ( [1-9]* \. [0-9]+ ) | ([1-9]+) )
----------1---------- ---------2---------- ---3----
Part 1: Allows 1 or more digits from 1-9, decimal, 0 or more digits after the decimal (target 1)
Part 2: Allows 0 or more digits from 1-9, decimal, 1 or more digits after the decimal (target 2)
Part 3: Allows 1 or more digits from 1-9 with no decimal (see #4 in target list)
Breaking down B we get 4 basic parts
( [eE] [-+]? [0-9]+ )?
..--1- --2-- --3--- -4- ..
Part 1: requires either upper or lowercase 'e' for scientific notation (e.g. targets 8 & 9)
Part 2: allows an optional positive or negative sign for the exponent (e.g. targets 4, 5, & 6)
Part 3: allows 1 or more digits for the exponent (target 8)
Part 4: allows the scientific notation to be optional as a group (target 3)
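Running the submitted expression through Python shows where it falls short. This is only a sketch with made-up names; it suggests the lost credit comes from [1-9] excluding the digit 0 in the integer part, so 12.5 is fine but 10.5 is not.

```python
import re

# The expression as turned in: integer digits are limited to 1-9.
SUBMITTED = re.compile(
    r"(([1-9]+\.[0-9]*)|([1-9]*\.[0-9]+)|([1-9]+))([eE][-+]?[0-9]+)?"
)

TARGETS = ["1.", ".2", "3.14", "5e6", "5e-6", "5E+6", "7.e8", "9.0E-10", ".11e12"]

def accepted(text):
    """Return True if the whole string matches the submitted expression."""
    return SUBMITTED.fullmatch(text) is not None

for t in TARGETS + ["12.5", "10.5"]:
    print(t, accepted(t))
```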
@Kelly S. French, this regular expression matches all your test cases.
^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$
Source: perldoc perlretut
'([-+])?\d*(\.)?\d+(([eE]([-+])?)?\d+)?'
That's the regular expression I have arrived at when trying to solve this kind of task in Matlab. Actually, it won't correctly detect numbers like (1.) but some additional changes may solve the problem... well, maybe the following would fix that:
'([-+])?(\d+(\.)?\d*|\d*(\.)?\d+)(([eE]([-+])?)?\d+)?'
@Kelly S. French: the sign is missing because in a parser it would get added by the unary minus (negation) expression, therefore it is not necessary to detect it as part of a float.