Eiffel regular expression validation - regex

How do you create a regular expression for a certain string? And can you do it in the Assertion (precondition part of the code)?
I've been google-ing around but couldn't get anything convincing.
The question is like this:
Add a precondition to the DEPARTMENT (the class that we're working on) creation procedure that ensures that the phone number is valid. There are three possible valid phone number formats. A valid phone number consists of one of:
eight digits, the first of which is non-zero
a leading zero, a single non-zero digit area code, and then eight digits, the first of
which is non-zero
a leading ‘+’, followed by a two digit country code, then a single non-zero digit
area code, and then eight digits, the first of which is non-zero
Any embedded spaces are to be ignored when validating a phone number.
It is acceptable, but not required, to add a PHONE_NUMBER class to the system as part of
solving this problem.

There are several different questions to be answered:
How to check if a given string matches a specified regular expression in Eiffel? One can use a class RX_PCRE_MATCHER from the Gobo library. The feature compile allows setting the required regular expression and the feature recognizes allows testing if the string matches it.
How to write a regular expression for the given phone number specification? Something like "(|0[1-9]|\+[0-9]{2}[1-9])[1-9][0-8]{7}" should do though I have not checked it. It's possible to take intermediate white spaces into account in the regular expression itself, but it's much easier to get rid of them before passing to the regular expression matcher by applying prune_all (' ') on the input string.
How to add a precondition to a creation procedure to verify that the argument satisfies it? Let's assume that from the previous items we constructed a function is_phone_number that takes a STRING and returns a BOOLEAN that indicates if the specified string represents a valid phone number. A straightforward solution would be to write
make (tel: STRING)
require
is_phone_number (tel)
...
and have a feature is_phone_number in the class DEPARTMENT itself. But this prevents us from checking if the specified string represents a phone number before calling this creation procedure. So it makes sense to move is_phone_number to the class PHONE_NUMBER_VALIDATOR that class DEPARTMENT will inherit. Similarly, if PHONE_NUMBER needs to validate the string against specified rules, it can inherit PHONE_NUMBER_VALIDATOR and reuse the feature is_phone_number.

Halikal actually worked this one out, but dudn't share until now ...
This works in eiffelStudio 6.2 (note - this is gobo)
http://se.inf.ethz.ch/old/people/leitner/gobo_guidelines/naming_conventions.html
A valid phone number consists of one of:
eight digits, the first of which is non-zero
a leading zero, a single non-zero digit area code,
and then eight digits, the first of which is non-zero
a leading + followed by a two digit country code,
then a single non-zero digit area code, and then eight digits,
the first of which is non-zero
Any embedded spaces are to be ignored when validating a phone number.
require -- 040 is ascii hex space
valid_phone:
match(phone, "^\040*[1-9]\040*([0-9]\040*){7}$") = TRUE or
match(phone, "^\040*0\040*([1-9]\040*){2}([0-9]\040*){7}$") = TRUE or
match(phone, "^\040*\+\040*([0-9]\040*){2}([1-9]\040*){2}([0-9]\040*){7}$") = TRUE
feature --Regular Expression check
match(text: STRING; pattern: STRING): BOOLEAN is
-- checks whether 'text' matches a regular expression 'pattern'
require
text /= Void
pattern /= Void
local
dfa: LX_DFA_REGULAR_EXPRESSION --There's the Trick!
do
create dfa.make
dfa.compile(pattern, True) --There's the Trick!
check -- regex must be compiled before we can use it
dfa.is_compiled;
end
Result := dfa.matches(text)
-- debug: make sure of which pattern
if dfa.matches (text) then
io.putstring(text + " matches " + pattern + "%N")
end
end
end

Related

A regex for maximal periodic substrings

This is a follow up to A regex to detect periodic strings .
A period p of a string w is any positive integer p such that w[i]=w[i+p]
whenever both sides of this equation are defined. Let per(w) denote
the size of the smallest period of w . We say that a string w is
periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
#horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally a substring w is a maximal periodic substring if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run"
with the same period.
The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs):
T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are:
T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.
Is there a regex (with backreferences) that will capture all maximal
periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.
This should do it, using Python's re module:
(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))
Fiddle: https://regex101.com/r/aA9uJ0/2
Notes:
You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.
Explanation:
(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - #horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal
As pointed out by #ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:
aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.
This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.
I see two possible solutions:
Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.
I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.
This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match.
\
Personally I would define buckets of length of input and then fill them.
I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.
It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

Regex mix of at least 2 characters (alphabetic, numeric, punctuation, special character)

I am trying to create a regular expression for a password field that checks to see if the input contains a mix of at least two characters sets (alphabetic, numeric, punctuation, special character). In addition, the first and last character cannot be numeric and the length must be at least 8 characters long.
I have never dealt with conditional logic for regular expressions, so it's probably why I'm having such a hard time. So far, this (but it's not working as intended):
(?=.{8,})(\d.*[a-zA-Z])|(?=.{8,})([a-zA-Z].*\d)|(?=.{8,})(\W.*\d)|(?=.{8,})(\d.*\W)|(?=.{8,})(\W.*[a-zA-Z])|(?=.{8,})([a-zA-Z].*\W)|(?=.{8,})([a-z].*[A-Z])|(?=.{8,})([A-Z].*[a-z])
Personally, I wouldn't do this with a single regex. Why not run it through a set of simpler ones, just to save yourself the inevitable maintenance headaches down the line? Something like (in order, in pseudocode):
// First and last are non-numeric and length check
if (!regex_check(pass, /^[^0-9].{6}.*[^0-9]$/)) return false
regexes = {/[a-zA-Z]/, /[0-9]/, /\p{P}|\p{Sc}|\^/} // Different character categories
numCategories = 0
for r in regex
if (regex_check(pass, r)) numCategories += 1
if numCategories >= 2 return true
return false

How to express "with at least one component"

I trying to check string data using regular expression.
form of input data are as bellow.
#1X2Y3Z#4A5B6C (valid)
<--nothing (valid)
#1X2Y3Z (valid)
##4A (valid)
#4A# (invalid)
# must be followed by at least one component matching ([0-9]+)A, ([0-9]+)B or ([0-9]+)C
And # must be the first character if input is not an empty string.
I wrote this regex:
#(([0-9]+)X)?(([0-9]+)Y)?(([0-9]+)Z)?#(([0-9]+)A)?(([0-9]+)B)?([0-9]+)C)?
but it regards #1X2Y3Z# as valid.
# must be represented with at least one component {A,B,C} or more and empty string is also valid.
^(?:#[ABC]+)?$
+ repeats the previous token one or more times, so [ABC]+ matches one or more A or B or C. ^ called starting anchor and $ called end of the line anchor.
Update:
^(?:#(?:#?[0-9]+[ABCXYZ])+)?$
DEMO
Use this one:
^(#(?:[0-9][A-Z])*#?(?:[0-9]|[A-Z])+)?$
I tested it with your requirements as per description:
# must be represented with at least one component [([0-9]+)A([0-9]+)B, ([0-9]+)C].
And # must be exist in front of string if input is not empty string.

How to disallow a period in an MVC TextBox using the RegularExpression attribute

I have a current requirement where I need to make sure a user is not putting a period (".") in a textbox using MVC4.
So, this should be allowed: 30
But this should be disallowed: 30.2
I am trying to use the RegularExpression attribute, but I can't find a combination that works:
public class StudyRandomizationCap
{
[Range(1, 100, ErrorMessage = "Milestone must be between 1 and 100.")]
[RegularExpression(#"\.", ErrorMessage = "Milestone percentage values cannot contain decimals.")]
public short? MilestonePercentage1 { get; set; }
}
This fails validation on the client side for any number I enter, regardless of whether or not there is a decimal in it or not.
Changing it to this:
[RegularExpression(#"[^.]", ErrorMessage = "Milestone percentage values cannot contain decimals.")]
Will allow a single number between 1-9, but the minute the number is 10 and above it fails. Again, this is all on the client side.
Why is it failing on numbers between 1-100 depending on which regular expression I'm using? It there a better/easier way to do this than using regular expressions? Is there some type of limitation on how Javascript executes regular expressions that would be different from the ModelBinder?
This seems like a fairly simple and trivial check, and to me, the regular expression check should only be failing if there is a "." in the textbox.
Your regex is only matching a single character, you need to tell it to be able to match more, something like:
[RegularExpression(#"[^\.]+", ErrorMessage = "Milestone percentage values cannot contain decimals.")]
The + will mean it has to be 1 or more characters long. What you should do if you only want numbers though is to have something like
[RegularExpression(#"[0-9]+", ErrorMessage = "Milestone percentage values cannot contain decimals.")]
Your initial query will prevent them from entering a . but they can still input other characters such as symbols or A-z

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I state that I am not able to use regolar expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This colums is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the strings are composed by couple of numbers (which may begin with zero), and strings (containing all alphanumeric characters, and also dot, ', /, \, and so on):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the number (which may begin with zero, I repeat), and the second element is the string.
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!
I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');