Build a Regular Expression and Finite Automata - regex

I need some help understanding how to take the following and make a regular expression that will be used to generate an epsilon NFA.
Alphabet is {0,1}
Language is: The set of all strings beginning with 101 and ending with 01010.
Valid strings would be:
101010
10101010
101110101
1011101010
I am more concerned with understanding how to make the regular expression.

The regular expression you need is pretty simple:
101010|101(0|1)*01010 (theoretical)
or
^(101010|101[01]*01010)$ (used in most programming languages; the group makes ^ and $ apply to both alternatives)
which means either:
Match 1, 0, 1, 0, 1, 0
or
Match 1, 0, and 1.
Keep matching 0 or 1, zero or more times.
Match 0, 1, 0, 1, 0.
The following non-deterministic automaton should work:
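The diagram itself is not reproduced here. As a stand-in, here is a minimal sketch of one possible NFA for this language, simulated in Python by tracking the set of states the machine could be in. The state names and the table layout are my own assumptions, not the automaton from the answer's figure, and it uses no epsilon moves (which is still a valid special case of an epsilon NFA).

```python
from itertools import product

# Hypothetical state names: s* spell the prefix 101, "m" is the free middle,
# e* spell the suffix 01010, o* handle the overlapping short string 101010.
NFA = {
    ("s0", "1"): {"s1"},
    ("s1", "0"): {"s2", "o1"},   # nondeterministic: long path or overlap path
    ("s2", "1"): {"m"},
    ("m", "0"): {"m", "e1"},     # stay in the middle, or guess the suffix starts
    ("m", "1"): {"m"},
    ("e1", "1"): {"e2"},
    ("e2", "0"): {"e3"},
    ("e3", "1"): {"e4"},
    ("e4", "0"): {"acc"},
    ("o1", "1"): {"o2"},
    ("o2", "0"): {"o3"},
    ("o3", "1"): {"o4"},
    ("o4", "0"): {"acc"},
}

def accepts(word):
    """Simulate the NFA by tracking every state it could currently be in."""
    states = {"s0"}
    for symbol in word:
        states = {t for q in states for t in NFA.get((q, symbol), ())}
    return "acc" in states

def in_language(word):
    """Direct statement of the language, used only as a reference."""
    return word.startswith("101") and word.endswith("01010")

def agrees_up_to(max_len):
    """Compare the NFA with the reference on every binary string up to max_len."""
    words = ("".join(p) for n in range(max_len + 1) for p in product("01", repeat=n))
    return all(accepts(w) == in_language(w) for w in words)

print(accepts("101010"), accepts("1010"))
```

The "acc" state has no outgoing transitions, so a string is accepted only if the suffix 01010 ends exactly at the end of the input.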

To get an idea of what you are looking for, it is helpful to use the intersection operator (denoted & below). It does not belong to the core set of rational expressions, yet it preserves rationality: in other words, you can use it, and always find a means to express the same language without it.
Using Vcsn, I get this in text mode:
In [1]: import vcsn
In [2]: vcsn.B.expression('(101[01]*)&([01]*01010)').derived_term().expression()
Out[2]: 101010+101(0+1)*01010
and this in graphical mode, showing the intermediate automaton computed using derived_term (which includes details about the "meaning" of each state, so strip is called afterwards to get something simpler to read):
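If Vcsn is not at hand, the same intersection idea can be approximated in Python's re module with a lookahead, which is not a rational operator either but plays the role of & here. This is only a sketch; the names INTERSECTION, EXPANDED and same_language are my own, and the check is a brute-force comparison on short strings rather than a proof.

```python
import re
from itertools import product

# The lookahead plays the role of "&": the string must start with 101 AND,
# taken as a whole, match [01]*01010.
INTERSECTION = re.compile(r"(?=101)[01]*01010")

# The intersection-free expression reported by Vcsn, written in Python syntax.
EXPANDED = re.compile(r"101010|101[01]*01010")

def same_language(max_len=12):
    """Check that both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            word = "".join(bits)
            a = INTERSECTION.fullmatch(word) is not None
            b = EXPANDED.fullmatch(word) is not None
            if a != b:
                return False
    return True

print(same_language())
```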

I'd suggest a pattern that includes both the base case and the general case. You need to cover the base case of 101010, where the two patterns overlap (it starts with "101", ends with "01010", and the last two digits of the first pattern are the first two digits of the second pattern). Then you can cover the general case of "101", any 0s or 1s, "01010", as given by Oscar.
So the full pattern would be:
^(101010|(101[01]*01010))$
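A small Python sketch of why the outer parentheses matter. Without them, ^ attaches only to the first alternative and $ only to the second, so strings outside the language slip through. The variable names are mine, and the two counterexample strings are just ones I picked by hand.

```python
import re

GROUPED = re.compile(r"^(101010|(101[01]*01010))$")
# Same alternatives without the outer group: "^101010" OR "101[01]*01010$".
UNGROUPED = re.compile(r"^101010|101[01]*01010$")

def in_language(word):
    """Reference definition: starts with 101 and ends with 01010."""
    return word.startswith("101") and word.endswith("01010")

for word in ["101010", "10101010", "1010101", "00101001010"]:
    print(word,
          in_language(word),
          GROUPED.search(word) is not None,
          UNGROUPED.search(word) is not None)
```

The last two strings are not in the language, yet the ungrouped pattern finds a match in both of them.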

Related

Integer range and multiple of

I have a number of fields I want to validate on text entry with a regex for both matching a range (0..120) and must be a multiple of 5.
For example, 0, 5, 25, 120 are valid. 1, 16, 123, 130 are not valid.
I think I have the regex for multiple of 5:
^\d*\d?((5)|(0))\.?((0)|(00))?$
and the regex for the range:
120|1[01][0-9]|[2-9][0-9]
However, I don't know how to combine these. Any help much appreciated!
You can't do that with a simple regex. At least not the range-part (especially if the range should be generic/changeable).
And even if you manage to write the regex, it will be very complex and unreadable.
Write the validation on your own, using a parseStringToInt() function of your language and simple < and > checks.
Update: added another regex (see below) to be used when the range of values is not 0..120 (it can even be dynamic).
The second regex in the question does not match numbers smaller than 20. You can change it to match smaller numbers too, and to require a final 0 or 5 so that only multiples of 5 match:
\b(120|(1[01]|[0-9])?[05])\b
How it works (starting from inside):
(1[01]|[0-9])? matches 10, 11 or any one-digit number (0 to 9); these are the hundreds and tens in the final number; the question mark (?) after the sub-expression makes it match 0 or 1 times; this way the regex can also match numbers having only one digit (0..9);
[05] that follows matches 0 or 5 on the last digit (the units); only the numbers that end in 0 or 5 are multiple of 5;
everything is enclosed in parentheses because | has the lowest precedence; without the group, each \b would apply to only one of the alternatives;
the outer \b match word boundaries; they prevent the regex from matching only 1..3 digits of a longer number or numbers that are embedded in strings; for example, they prevent it from matching 15 in 150 or 120 in abc120.
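A quick way to convince yourself is to run the pattern over a range of numbers. The following Python sketch (names are mine) lists every value from 0 to 999 that the regex accepts as a whole string, and also checks the two word-boundary cases mentioned above.

```python
import re

PATTERN = re.compile(r"\b(120|(1[01]|[0-9])?[05])\b")

def is_valid(text):
    """True if the whole text is a multiple of 5 between 0 and 120."""
    return PATTERN.fullmatch(text) is not None

accepted = [n for n in range(1000) if is_valid(str(n))]
print(accepted)

# Word boundaries: no partial match inside a longer number or a word.
print(PATTERN.search("150"), PATTERN.search("abc120"))
```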
Using dynamic range of values
The regex above is not very complex and it can be used to match numbers between 0 and 120 that are multiples of 5. When the range of values is different it cannot be used any more. It can be modified to match, let's say, numbers between 20 and 120 (as the OP asked in a comment below) but it will become harder to read.
Moreover, if the range of allowed values is dynamic then a regex cannot be used at all to match the values inside the range. Divisibility by 5, however, can still be checked with a regex :-)
For dynamic range of values that are multiple of 5 you can use this expression:
\b([1-9][0-9]*)?[05]\b
Parse the matched string as integer (the language you use probably provides such a function or a library that contains it) then use the comparison operators (<, >) of the host language to check if the matched value is inside the desired range.
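As a rough illustration in Python (the function name and parameters are my own), the regex handles "multiple of 5" and plain integer comparison handles the range:

```python
import re

MULTIPLE_OF_5 = re.compile(r"\b([1-9][0-9]*)?[05]\b")

def in_range_multiple_of_5(text, low, high):
    """Regex checks divisibility by 5; int() plus comparisons check the range."""
    match = MULTIPLE_OF_5.fullmatch(text)
    if match is None:
        return False
    return low <= int(match.group(0)) <= high

print(in_range_multiple_of_5("25", 20, 120))   # inside the range
print(in_range_multiple_of_5("15", 20, 120))   # multiple of 5, but below 20
```

Changing the range is then a matter of passing different low and high values; the regex stays the same.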
At the risk of being painfully obvious
120|1[01][05]|[2-9][05]
Also, why the 2?

Validate a string containing 1, 2, 3, or 4 fields?

I need some help building a regular expression for a string which may contain 1, 2, 3, or 4 fields. Each field has a format of: tag=value.
Below is a comprehensive list of all possible strings I can have. code tag is a three-digits number:
type=buy&code=123&time=yes&save=yes
type=buy&code=123&time=yes&save=no
type=buy&code=123&time=no&save=yes
type=buy&code=123&time=no&save=no
type=buy&code=123&time=yes
type=buy&code=123&time=no
type=sell&code=123&time=yes&save=yes
type=sell&code=123&time=yes&save=no
type=sell&code=123&time=no&save=yes
type=sell&code=123&time=no&save=no
type=sell&code=123&time=yes
type=sell&code=123&time=no
type=long&code=123
type=short&code=123
type=fill&code=123
type=confirm&code=123
type=cancelall
type=resendall
So these are the possible values for the four tags:
type={buy|sell|long|short|fill|confirm|cancelall|resendall}
code=[[:digit:]]{3}
time={yes|no}
save={yes|no}
This is what I have right now:
value={buy|sell|long|short|fill|confirm|cancelall|resendall}&code=[[:digit:]]{3}&time={yes|no}&save={yes|no}
It is obviously not correct; I do not know how to make the number of fields variable.
I want to use regular expression to check if the string is in correct format from C++ code. I am already doing it by parsing the string and using multiple "if" statements which makes tens of lines of code and it is also error prone.
Thank you!
This regex will do it:
/^type=(?:(?:buy|sell)&code=\d{3}&time=(?:yes|no)(?:&save=(?:yes|no))?|(?:long|short|fill|confirm)&code=\d{3}|cancelall|resendall)$/
(using two anchors, an optional item and lots of alternations in non-capturing groups)
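The question asks about C++, but the pattern is easy to try out in any engine first. Here is a Python sketch that rebuilds the 18 strings from the question and runs them through the regex (without the / delimiters). The re.ASCII flag is my addition so that \d only means 0-9; the invalid samples are ones I made up.

```python
import re
from itertools import product

FIELDS = re.compile(
    r"^type=(?:(?:buy|sell)&code=\d{3}&time=(?:yes|no)(?:&save=(?:yes|no))?"
    r"|(?:long|short|fill|confirm)&code=\d{3}|cancelall|resendall)$",
    re.ASCII,
)

# Rebuild the comprehensive list from the question.
valid = []
for kind, time in product(("buy", "sell"), ("yes", "no")):
    base = f"type={kind}&code=123&time={time}"
    valid.append(base)
    valid.extend(f"{base}&save={save}" for save in ("yes", "no"))
valid.extend(f"type={kind}&code=123" for kind in ("long", "short", "fill", "confirm"))
valid.extend(["type=cancelall", "type=resendall"])

invalid = [
    "type=buy&code=123",             # buy needs a time field
    "type=long&code=123&time=yes",   # long takes no time field
    "type=buy&code=12&time=yes",     # code must have three digits
    "type=cancelall&code=123",       # cancelall takes no other field
    "code=123&type=buy&time=yes",    # fields out of order
]

print(len(valid), all(FIELDS.match(s) for s in valid))
print(any(FIELDS.match(s) for s in invalid))
```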
You said: "I am already doing it by parsing the string and using multiple 'if' statements"
For checking rules, this might be the better alternative. You still might use regexes for tokenizing your string.
You also might want to have a look at a parser generator, since you already seem to have a grammar available. The generator will yield parser code from that, which can be called to check the validity of your inputs and will return helpful error messages.
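To show what the "tokenize first, then check rules" route can look like without tens of lines of ifs, here is a short Python sketch. The helper names parse_fields and is_valid are hypothetical, and the rules are my reading of the list in the question.

```python
def parse_fields(text):
    """Split 'a=1&b=2' into an ordered list of (tag, value) pairs, or None."""
    pairs = []
    for item in text.split("&"):
        tag, sep, value = item.partition("=")
        if not sep or not tag or not value:
            return None
        pairs.append((tag, value))
    return pairs

def is_valid(text):
    pairs = parse_fields(text)
    if pairs is None:
        return False
    tags = [tag for tag, _ in pairs]
    values = dict(pairs)
    kind = values.get("type")
    code = values.get("code", "")
    code_ok = len(code) == 3 and all(c in "0123456789" for c in code)
    if kind in ("cancelall", "resendall"):
        return tags == ["type"]
    if kind in ("long", "short", "fill", "confirm"):
        return tags == ["type", "code"] and code_ok
    if kind in ("buy", "sell"):
        if tags not in (["type", "code", "time"], ["type", "code", "time", "save"]):
            return False
        # Every field after type and code must be yes or no.
        return code_ok and all(values[tag] in ("yes", "no") for tag in tags[2:])
    return False

print(is_valid("type=buy&code=123&time=yes&save=no"))
print(is_valid("type=buy&code=123"))
```

Because each rule is a separate branch, this version can be extended to report which rule failed, which a single regex cannot do.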

Time complexity of regex and Allowing jitter in pattern finding

To find patterns in string, I have the following code. In it, find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
An example for the above mentioned code: for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab".
NOW I am trying to get patterns allowing jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb" where "j" can be anything. Can anyone suggest a modification of the above mentioned code or any new code for pattern finding, that could allow such jitters?
Also can anyone throw some light on the TIME COMPLEXITY and INTERNAL ALGORITHM used for the regexpr function ?
Thanks! :)
Not very efficient but tada:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  found <- FALSE
  for(sublen in len:1) {
    for(inlen in 0:sublen) {
      pat <- paste0("((.{", sublen-inlen, "})(.)(.{", inlen, "}))", reps("(\\2.\\4)", th-1))
      r <- regexpr(pat, string, perl = TRUE)
      if (attr(r, "capture.length")[1] > 0) {
        found <- TRUE
        break
      }
    }
    if (found) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")[1] - 1) else ""
}
find.string("a0cc0vaaaabaaadbaaabbaa00bvw"); # returns "aaaab"
Without any fuzzy matching tool available, I manually check each possibility. I use an inner loop to try different prefix and suffix lengths on either side of the "jitter" character. The prefix is grouped as \2 and the suffix as \4 (the jitter is \3 but I don't use it). Then, the repeated part tries to match \2.\4, that is, the prefix, any new jitter character, and the suffix.
I say not efficient because it's evaluating O(len^2) different patterns, versus O(len) patterns in your code. For large len this might become a problem.
Note that I have multiple groups, and only look at the [1] position. The full r variable has more useful information, for example [1] will be the first part, [5] will be the 2nd part, [6] will be the 3rd part, etc. Also [3] will be the "jitter" character in the 1st part.
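For readers who want to inspect the groups outside R, here is a rough Python port of the same search (the function name find_jittered is mine, and it returns the match object instead of a substring so that every group can be examined). It is meant as an illustration of the group layout, not as a replacement for the R code.

```python
import re

def find_jittered(string, th=3, max_len=None):
    """Return the first match of a unit repeated th times with one free position."""
    if max_len is None:
        max_len = len(string) // th
    for sublen in range(max_len, 0, -1):
        for inlen in range(sublen + 1):
            # group 1: whole first unit, 2: prefix, 3: jitter char, 4: suffix
            first = "((.{%d})(.)(.{%d}))" % (sublen - inlen, inlen)
            # groups 5, 6, ...: the following units, each prefix + any char + suffix
            pattern = first + r"(\2.\4)" * (th - 1)
            match = re.search(pattern, string)
            if match:
                return match
    return None

m = find_jittered("a0cc0vaaaabaaadbaaabbaa00bvw")
print(m.group(1), m.group(3), m.group(5), m.group(6))
```

For the example string this finds the three units aaaab, aaadb and aaabb, with the free position at the fourth character of each unit.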
Regarding the complexity of the actual regex: it varies a lot. However, often the construction (setup) of a particular regex is vastly more intensive than the actual matching, which is why a single pattern used repeatedly can produce better results than multiple patterns. In truth, this varies a lot based on the pattern and the engine you're using - see links at the end for more info about complexity.
Regarding how regex works: just a note, this is going to be a very theoretical overview; it's not meant to indicate how any particular regex engine works.
For a more practical overview, there are plenty of sites that cover just enough to know how to use a regex, but not how to build your own engine. - for example http://www.regular-expressions.info/engine.html
A regex is what's known as a state machine, specifically a (non-deterministic) finite state automaton (NFA). A very simple, real-world state machine is a lightbulb: it's either on or off, and different inputs can change the state it's in. A regex is much more complex: (generally) each symbol in the pattern forms a state, and different input can send it to different states. So if you have \d\d\d, there are 3 virtual states that each accept any digit, and any other input goes to a 4th "failure" state. The end result is the state reached after all input is 'consumed'.
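To make the \d\d\d example concrete, here is a toy Python sketch of that machine. The state numbering is my own choice: states 0, 1 and 2 each wait for a digit, state 3 means "three digits seen", and FAIL is the failure state. It is not how a real engine is implemented.

```python
DIGITS = set("0123456789")
ACCEPT = 3      # reached after exactly three digits
FAIL = "fail"   # any unexpected input ends up here and stays here

def step(state, ch):
    """One transition of the toy machine for \\d\\d\\d."""
    if state in (0, 1, 2) and ch in DIGITS:
        return state + 1
    return FAIL

def matches_three_digits(text):
    state = 0
    for ch in text:
        state = step(state, ch)
    # The result is simply the state we are in once all input is consumed.
    return state == ACCEPT

print(matches_three_digits("123"), matches_three_digits("12a"))
```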
Perhaps you can imagine: this gets vastly more complicated, with many, many states, when you use any ambiguity, such as wildcards or alternation. So our \d\d\d regex will basically be linear, but more complicated ones will not be. Part of the optimization in a regex engine is converting an NFA to a DFA - a deterministic finite state automaton. Here, the ambiguity is removed, generating many more states, and this is the very computationally complex process referenced above (the construction stage).
This is really just a very theoretical overview of an ideal NFA. In practice, modern regex grammars can do a lot more than this; for example, backreferences are not technically possible in a "proper" regular expression.
This might be a bit too high-level, but that's the basic idea. If you're curious, there are plenty of good articles about regex, different flavors, and their complexity. For example: http://swtch.com/~rsc/regexp/regexp1.html
There are basically two regex algorithm types: Perl-style (with a lot of complex backtracking) and Thompson NFA.
http://swtch.com/~rsc/regexp/regexp1.html
To determine which engine R uses, you can look at R's svn repo:
Root repo:
http://svn.r-project.org/R/
http://svn.r-project.org/R/branches\R-exp-uncmin\src\regex
I poked around in there a bit and found a file called "engine.c" On first glance it doesn't look like a Thompson-NFA but I didn't take long to read it.
At any rate, the first link goes in depth into the complexity question in general and should give you a great idea as to how regex parsing works under the hood to boot.

How to implement regular expression NFA with character ranges?

When you read such posts as Regex: NFA and Thompson's algorithm everything looks rather straightforward until you realize in real life you need not only direct characters like "7" or "b", but also:
[A-Z]
[^_]
.
namely character classes (or ranges). And thus my question -- how to build NFA using character ranges? Using meta-characters like "not A", "anything else" and then computing overlapping ranges? This would lead to using tree-like structure when using final automaton, instead of just a table.
Update: please assume non-trivial in size (>>256) alphabet.
I am asking about NFA, but later I would like to convert NFA to DFA as well.
The simplest approach would be:
Use segments as labels for transitions in both NFA and DFA. For example, the range [a-z] would be represented as the segment [97, 122]; the single character 'a' would become [97,97]; and any character '.' would become [minCode, maxCode].
Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions [minCode, 96] and [123, maxCode] should be created.
When a range is represented by enumerating all possible characters, as in [abcz], either one transition per character should be created, or the code might first group characters into ranges to reduce the number of transitions. So [abcz] would become [a-c]|z: two transitions instead of four.
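The three rules above can be sketched in a few lines of Python. Segments are (lo, hi) tuples of code points; the alphabet bounds MIN_CODE and MAX_CODE and both function names are assumptions of mine (here the full Unicode range).

```python
MIN_CODE, MAX_CODE = 0, 0x10FFFF  # assumed alphabet: all Unicode code points

def chars_to_segments(chars):
    """Group enumerated characters into sorted, merged (lo, hi) segments."""
    segments = []
    for code in sorted(set(map(ord, chars))):
        if segments and segments[-1][1] == code - 1:
            segments[-1] = (segments[-1][0], code)   # extend the current run
        else:
            segments.append((code, code))            # start a new run
    return segments

def negate(segments):
    """Complement of sorted, non-overlapping segments within the alphabet."""
    result, next_free = [], MIN_CODE
    for lo, hi in segments:
        if lo > next_free:
            result.append((next_free, lo - 1))
        next_free = hi + 1
    if next_free <= MAX_CODE:
        result.append((next_free, MAX_CODE))
    return result

print(chars_to_segments("abcz"))            # [abcz] as two segments
print(negate([(ord("a"), ord("z"))]))       # [^a-z] as two segments
print([(MIN_CODE, MAX_CODE)])               # '.' as one segment
```

Each resulting segment becomes the label of one transition between the same pair of states.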
This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges.
To solve this issue only one additional generalization step is required. Once the set of all input symbols is created (in our case, a set of segments), it should be transformed into a set of non-intersecting segments. This can be done in O(n*log(n)) time, where n is the number of segments in the set, using a priority queue (PQ) in which segments are ordered by their left endpoint. Example:
Procedure DISJOIN:
Input <- [97, 99] [97, 100] [98, 108]
Output -> [97, 97] [98, 99], [100, 100], [101, 108]
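One possible way to write DISJOIN in Python is shown below. Note that this sketch swaps the priority queue for a sort of the cut points plus a sweep that counts how many input segments cover each piece; it is also O(n log n), but it is my variant, not necessarily the procedure the answer had in mind.

```python
def disjoin(segments):
    """Cut possibly overlapping (lo, hi) segments into non-overlapping pieces."""
    # +1 where a segment starts, -1 just past where it ends.
    delta = {}
    for lo, hi in segments:
        delta[lo] = delta.get(lo, 0) + 1
        delta[hi + 1] = delta.get(hi + 1, 0) - 1

    points = sorted(delta)
    pieces, depth = [], 0
    for left, right in zip(points, points[1:]):
        depth += delta[left]
        if depth > 0:                      # covered by at least one input segment
            pieces.append((left, right - 1))
    return pieces

print(disjoin([(97, 99), (97, 100), (98, 108)]))
```

Every piece produced this way lies either entirely inside or entirely outside each input segment, which is the property the DFA construction needs.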
Step 2. To create new transitions from a "set state" the algorithm should be modified as follows:
for each symbol in DISJOIN(input symbols)
    S <- empty set of symbols
    T <- empty "set state"
    for each state in "set state"
        for each transition in state.transitions
            I <- intersection(symbol, transition.label)
            if (I is not empty)
            {
                Add I to the set S
                Add transition.To to the T
            }
    for each segment from DISJOIN(S)
        Create transition from "set state" to T
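Here is a compact Python sketch of a power set construction over segment labels. It is a simplified variant of the pseudocode above, under my own assumptions: the NFA has no epsilon moves, it is given as a dict from state to a list of ((lo, hi), target) pairs, and the labels leaving each "set state" are disjoined locally instead of disjoining all input symbols up front.

```python
def disjoin(segments):
    """Cut possibly overlapping (lo, hi) segments into non-overlapping pieces."""
    delta = {}
    for lo, hi in segments:
        delta[lo] = delta.get(lo, 0) + 1
        delta[hi + 1] = delta.get(hi + 1, 0) - 1
    points, pieces, depth = sorted(delta), [], 0
    for left, right in zip(points, points[1:]):
        depth += delta[left]
        if depth > 0:
            pieces.append((left, right - 1))
    return pieces

def determinize(nfa, start):
    """Return a DFA whose states are frozensets of NFA states."""
    dfa, todo = {}, [frozenset([start])]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        edges = [edge for q in current for edge in nfa.get(q, [])]
        dfa[current] = []
        for lo, hi in disjoin([label for label, _ in edges]):
            # Each piece lies entirely inside or entirely outside every label,
            # so a containment test is enough to collect the targets.
            target = frozenset(t for (a, b), t in edges if a <= lo and hi <= b)
            dfa[current].append(((lo, hi), target))
            todo.append(target)
    return dfa

# Toy NFA: state 0 goes to 1 on [a-c] and to 2 on [b-z].
nfa = {0: [((97, 99), 1), ((98, 122), 2)]}
for state, edges in determinize(nfa, 0).items():
    print(sorted(state), edges)
```

Accepting states are omitted to keep the sketch short; a DFA state would be accepting whenever it contains an accepting NFA state.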
To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.
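A minimal Python sketch of that lookup, assuming the transitions of one state are non-overlapping ((lo, hi), target) pairs; make_lookup is a name I made up. The left endpoints are sorted once, and each query is a single binary search with the standard bisect module.

```python
from bisect import bisect_right

def make_lookup(transitions):
    """Build a code-point -> target lookup for one state's transitions."""
    ordered = sorted(transitions, key=lambda item: item[0])
    lows = [lo for (lo, _hi), _target in ordered]

    def lookup(code):
        # Rightmost segment whose left endpoint is <= code, if any.
        i = bisect_right(lows, code) - 1
        if i >= 0 and code <= ordered[i][0][1]:
            return ordered[i][1]
        return None                        # no transition for this input

    return lookup

lookup = make_lookup([((100, 122), "C"), ((97, 97), "A"), ((98, 99), "B")])
print(lookup(ord("a")), lookup(ord("c")), lookup(ord("z")), lookup(ord("A")))
```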

Regex for a valid 32-bit signed integer

I'm pretty sure this hasn't actually been answered yet on this site. Once and for all, what is the smallest regex that matches a numeric string that is in the range of a 32-bit signed integer, that is, -2147483648 to 2147483647?
I must use regex for validation - that is the only option available to me.
I have tried
\d{1,10}
but I can't figure out how to restrict it to the valid number range.
To aid developing in regex, it should match:
-2147483648
-2099999999
-999999999
-1
0
1
999999999
2099999999
2147483647
It should not match:
-2147483649
-2200000000
-11111111111
2147483648
2200000000
11111111111
I have set up an on-line live demo (on rubular) that has my attempt and the test cases above.
Note: The shortest regex that works will be accepted. Efficiency of regex will not be considered (unless there's a tie for shortest length).
I really hope this is just a puzzle and no one will use regex for this problem in the real world. The proper solution would be converting the number from a string to a numeric type like BigInteger. This allows us to check its range using proper methods or operators, like compareTo, >, <.
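The answer mentions Java's BigInteger; the same idea in Python, where int has arbitrary precision, could look like this. The function name is mine. One caveat I am aware of: int() is more lenient than the regexes below, since it also accepts surrounding whitespace, a leading + and underscores between digits.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def is_int32(text):
    """Parse first, then compare against the 32-bit signed bounds."""
    try:
        value = int(text)   # lenient: also accepts " 12 ", "+5", "1_000"
    except ValueError:
        return False
    return INT32_MIN <= value <= INT32_MAX

should_match = ["-2147483648", "-2099999999", "-999999999", "-1", "0", "1",
                "999999999", "2099999999", "2147483647"]
should_not_match = ["-2147483649", "-2200000000", "-11111111111",
                    "2147483648", "2200000000", "11111111111"]

print(all(is_int32(s) for s in should_match))
print(any(is_int32(s) for s in should_not_match))
```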
To make life easier you can use this page (dead link) to generate regex for ranges. So regex for range 0 - 2147483647 can look like
\b([0-9]{1,9}|1[0-9]{9}|2(0[0-9]{8}|1([0-3][0-9]{7}|4([0-6][0-9]{6}|7([0-3][0-9]{5}|4([0-7][0-9]{4}|8([0-2][0-9]{3}|3([0-5][0-9]{2}|6([0-3][0-9]|4[0-7])))))))))\b
(friendlier way)
\b(
[0-9]{1,9}|
1[0-9]{9}|
2(0[0-9]{8}|
1([0-3][0-9]{7}|
4([0-6][0-9]{6}|
7([0-3][0-9]{5}|
4([0-7][0-9]{4}|
8([0-2][0-9]{3}|
3([0-5][0-9]{2}|
6([0-3][0-9]|
4[0-7]
)))))))))\b
and range 0 - 2147483648
\b([0-9]{1,9}|1[0-9]{9}|2(0[0-9]{8}|1([0-3][0-9]{7}|4([0-6][0-9]{6}|7([0-3][0-9]{5}|4([0-7][0-9]{4}|8([0-2][0-9]{3}|3([0-5][0-9]{2}|6([0-3][0-9]|4[0-8])))))))))\b
So we can just combine these ranges and write it as
range of 0-2147483647 OR "-" range of 0-2147483648
which will give us
\b([0-9]{1,9}|1[0-9]{9}|2(0[0-9]{8}|1([0-3][0-9]{7}|4([0-6][0-9]{6}|7([0-3][0-9]{5}|4([0-7][0-9]{4}|8([0-2][0-9]{3}|3([0-5][0-9]{2}|6([0-3][0-9]|4[0-7])))))))))\b|-\b([0-9]{1,9}|1[0-9]{9}|2(0[0-9]{8}|1([0-3][0-9]{7}|4([0-6][0-9]{6}|7([0-3][0-9]{5}|4([0-7][0-9]{4}|8([0-2][0-9]{3}|3([0-5][0-9]{2}|6([0-3][0-9]|4[0-8])))))))))\b
[edit]
As Bohemian noted in his comment, the final regex can take the form -?regex1|-2147483648, so here is a slightly shorter version (also changed [0-9] to \d)
^-?(\d{1,9}|1\d{9}|2(0\d{8}|1([0-3]\d{7}|4([0-6]\d{6}|7([0-3]\d{5}|4([0-7]\d{4}|8([0-2]\d{3}|3([0-5]\d{2}|6([0-3]\d|4[0-7])))))))))$|^-2147483648$
If you use it with Java's String#matches(regex) method on each line, you can also drop the ^ and $ parts, since matches() already requires the entire string to match the regex.
I know this regex is very ugly, but it just shows why regex is not a good tool for range validation.
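To check the shorter version against the test cases from the question, a small Python harness is enough. This is only a sketch: it uses fullmatch, which plays the same role as Java's matches(), and the variable names are mine.

```python
import re

INT32 = re.compile(
    r"^-?(\d{1,9}|1\d{9}|2(0\d{8}|1([0-3]\d{7}|4([0-6]\d{6}|7([0-3]\d{5}"
    r"|4([0-7]\d{4}|8([0-2]\d{3}|3([0-5]\d{2}|6([0-3]\d|4[0-7])))))))))$"
    r"|^-2147483648$"
)

should_match = ["-2147483648", "-2099999999", "-999999999", "-1", "0", "1",
                "999999999", "2099999999", "2147483647"]
should_not_match = ["-2147483649", "-2200000000", "-11111111111",
                    "2147483648", "2200000000", "11111111111"]

for text in should_match + should_not_match:
    print(text, INT32.fullmatch(text) is not None)
```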
Edit:
This is the shortest regex you can get and the best way to do it:
We check every digit starting from the left: if it reaches its limit and all the previous digits did too, we constrain the next one.
For the range -2147483647 to 2147483647 the - sign is optional; for -2147483648 the - sign is required.
So finally we get this:
^-?([0-9]{1,9}|[0-1][0-9]{9}|20[0-9]{8}|21[0-3][0-9]{7}|214[0-6][0-9]{6}|2147[0-3][0-9]{5}|21474[0-7][0-9]{4}|214748[0-2][0-9]{3}|2147483[0-5][0-9]{2}|21474836[0-3][0-9]|214748364[0-7])$|^(-2147483648)$
And this is a Live Demo
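The digit-by-digit idea is mechanical enough to generate. The Python sketch below (function names are mine) builds the alternatives for any upper bound: one branch for shorter numbers, then one branch per digit position where the digit stays below its limit, with the last digit allowed to equal it. Like the regex above, it accepts leading zeros and "-0".

```python
import re

def range_regex(upper):
    """Regex body matching decimal strings for 0..upper (leading zeros allowed)."""
    digits = str(upper)
    n = len(digits)
    alternatives = []
    if n > 1:
        alternatives.append("[0-9]{1,%d}" % (n - 1))   # fewer digits than upper
    for i, ch in enumerate(digits):
        last = (i == n - 1)
        # Earlier digits match the limit exactly; this one stays below it
        # (the last digit may also equal its limit).
        top = int(ch) if last else int(ch) - 1
        if top < 0:
            continue
        digit = "[0-%d]" % top if top > 0 else "0"
        tail = "" if last else "[0-9]{%d}" % (n - 1 - i)
        alternatives.append(digits[:i] + digit + tail)
    return "|".join(alternatives)

def int32_regex():
    """Optional sign for 0..2147483647, plus the single extra value -2147483648."""
    return "-?(?:%s)|-2147483648" % range_regex(2**31 - 1)

print(range_regex(120))
print(int32_regex())
```

Use it with re.fullmatch, for example re.fullmatch(int32_regex(), "2147483647").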
^(429496729[0-6]|42949672[0-8]\d|4294967[01]\d{2}|429496[0-6]\d{3}|42949[0-5]\d{4}|4294[0-8]\d{5}|429[0-3]\d{6}|42[0-8]\d{7}|4[01]\d{8}|[1-3]\d{9}|[1-9]\d{8}|[1-9]\d{7}|[1-9]\d{6}|[1-9]\d{5}|[1-9]\d{4}|[1-9]\d{3}|[1-9]\d{2}|[1-9]\d|\d)$
Kindly try this; I tested it randomly, not thoroughly.
Only for numbers above zero. Add '-' and adjust the last number pattern for negative numbers.
(^\d{1,9}$|^1\d{9}$|^20\d{8}$|^21[0-3]\d{7}$|^214[0-6]\d{6}$|^2147[0-3]\d{5}$|^21474[0-7]\d{4}$|^214748[0-2]\d{3}$|^2147483[0-5]\d{2}$|^21474836[0-3]\d$|^214748364[0-7]$)
One should never use regex for this type of work.