Regex for dates format - regex

I am working under the Web Application based on ASP.NET MVC 5 and I have a great problem in my project with the field which gives the user the ability to choose format for showing Dates in the application.
The goal is to make RegularExpressionAttribute with the regex for validation date formats inputted by user.
Acceptable formats must be:
m/d/y,
m-d-y,
m:d:y,
d/m/y,
d-m-y,
d:m:y,
y/m/d,
y-m-d,
y:m:d
and the length of the date symbols may be as 'y' so far 'yyyy'. And they can be upper case.
So after hard-coding I've made the acceptable one:
((([mM]{1,4})([\/]{1})([dD]{1,4})([\/]{1})([yY]{1,4}))|(([mM]{1,4})([\-]{1})([dD]{1,4})([\-]{1})([yY]{1,4}))|(([mM]{1,4})([\:]{1})([dD]{1,4})([\:]{1})([yY]{1,4})))|((([dD]{1,4})([\/]{1})([mM]{1,4})([\/]{1})([yY]{1,4}))|(([dD]{1,4})([\-]{1})([mM]{1,4})([\-]{1})([yY]{1,4}))|(([dD]{1,4})([\:]{1})([mM]{1,4})([\:]{1})([yY]{1,4})))|((([yY]{1,4})([\/]{1})([mM]{1,4})([\/]{1})([dD]{1,4}))|(([yY]{1,4})([\-]{1})([mM]{1,4})([\-]{1})([dD]{1,4}))|(([yY]{1,4})([\:]{1})([mM]{1,4})([\:]{1})([dD]{1,4})))|((([yY]{1,4})([\/]{1})([dD]{1,4})([\/]{1})([mM]{1,4}))|(([yY]{1,4})([\-]{1})([dD]{1,4})([\-]{1})([mM]{1,4}))|(([yY]{1,4})([\:]{1})([dD]{1,4})([\:]{1})([mM]{1,4})))
This one works... But according to my scarce regex knowledge and experience I hope to get some help and better example for resolving this puzzle.
Thanks.

You have to generalize a bit.
m{1,4}([:/-])d{1,4}\1y{1,4}|d{1,4}([:/-])m{1,4}\2y{1,4}|y{1,4}([:/-])m{1,4}\3d{1,4}
Explanation:
instead of e.g. [mM] use m and set option for case insensitive match
([:/-]) all allowed delimiters as group
\1...\3 back reference to the delimiter group 1...3

Related

Complex Regex is not validating all repetitions of one specific rule

I need help to validate a field using regex. It will run in Postgres 9.5.
The rules are
The string must contain all seven services: Oil, Wiper blades, Air filter, Tires, Battery, Brake, Antifreeze
All services must have the operation hours, and the accepted values are HH[:MM]{am|pm}-HH[:MM]{am|pm}, or the literals ”working hours”, ”after hours”, ”not available” (this is the rule that I couldn't find the solution)
It is case insensitive, and the spaces should be irrelevant.
The services as separated by a pipe, and the service and working hours are separated by a colon
I did the regex:
^(?=.*(Oil))(?=.*(Wiper blades))(?=.*(Air filter))(?=.*(Tires))(?=.*(Battery))(?=.*(Brake))(?=.*(Antifreeze))(?=.*(\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)\s{0,}-\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)|working hours|after hours|not availabl)).+
This part of the regex is validating only one sequence, not all seven sequences.
(?=.*(\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)\s{0,}-\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)|working hours|after hours|not availabl))
Example of good string
Oil:8AM-10PM|Wiper blades:8 AM -10 PM|Air filter:8AM-10pm|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:not available
Example of bad strings
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM-10PM|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:fsdfdsfs
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM-10PM|Tires:8AM-10PM|Battery:|Brake:|Antifreeze:8AM-9PM
Oil:8AM-10PM|Wiper blades:8AM-10PM
Do someone have any idea what is missing to validate the seven occurrences?
I've made another regex that works :
^(((oil|Air\ filter|Wiper\ blades|Tires|Battery|Brake|Antifreeze):((((\d{1,2})((A|P)M)(-?)){2})|(not available))(\|?)){7})$
How ever, this regex does not take counts of repetition. Which mean, you could have Oil two time it will still works.
I've create a regex101 if you wish to tests more cases.

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server and some webservices require an input id, called "CUID" (Cluser Unique ID). for example, there's a webservice getObjectById which reqires a cuid as input.
I'm trying to make my code more robust by checking if the cuid entered by a user makes sense, but I can't find a regular expression that properly describes how a CUID looks like. There is a lot of documentation for GUID, but they're not the same. Below are some examples of CUID's found in our system and it looks like they are well-formatted but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would: start with capital A, then add 22 random characters in range [0-9A-Za-Z_.]. but perhaps it could be the A means something else and after awhile it would be using B...
Is anyone familiar with this type of id's and how they are formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting up to 35 characters and beginning with the letters A,B,C,F,k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regex expression to match these would be [a-zA-Z0-9\._]+.

Is there a way to search terms in order with RegexpQuery in lucene?

I would like to search my indexed documents in order using RegexpQuery.
For example I have 2 Document
text: Oracle unveils better than expected quarterly results.
text: Research In Motion shares gained almost 13 per cent on the Toronto Stock Exchange Friday, a day after the smartphone maker posted better than expected quarterly results.
So far I tried this but I got no luck.
Query regexq = new RegexpQuery(new Term("text", "^.+better.+quarterly.+results"));
Is there another way of implementing this?
Thanks
I believe a PhraseQuery fits what you are looking for better. You can use PhraseQuery.setSlop(int) to allow terms to appear between the terms of the query. This would like like:
Query pq = new PhraseQuery();
pq.add(new Term("text", "better"));
pq.add(new Term("text", "quarterly"));
pq.add(new Term("text", "results"));
pq.setSlop(10); //Or whatever is an appropriate slop value for you.
This sort of query is also supported by the standard QueryParser, as seen here, like:
text:"better quarterly results"~10
I think a PhraseQuery is most definitely the better implementation here, but...
Regarding RegexpQuery:
I believe it is intended to compare terms against the regex, and since the phrase you are searching for (I am assuming) is tokenized, no single Term matches your whole regex. You would need to index the entire field as a single Term to make this work, using StringField, KeywordAnalyzer, or similar.
I believe it works like Matcher.matches(), rather than Matcher.find(), which is to say, it must match the entire input term, rather than a portion of it. So, if you had specified "text" as a StringField, you would need to add a .* to the end to consume the rest of the input.
On a similar note, I'm not sure if it supports the use of the character "^" as the start of input, being that it is redundant in that case. I don't see it specified in Lucene's Regexp, but I have seen reference to it's use, so I'm not sure whether it would be accepted or not.
To summarize, a RegexpQuery could work like:
Query regexq = new RegexpQuery(new Term("text", ".+better.+quarterly.+results.*"));
If you used a StringField, or KeywordAnalyzer index the entire field as a single Term.
With the leading wildcard in your regexp, though, you could expect very poor performance from it (See the warning at the top of the RegexpQuery documentation).

RegEx Date Validation for M/D/YYYY and MM/DD/YYYY (InfoPath)

I'm looking for some RegEx for a custom pattern validation for a date field in InfoPath 2010. The accepted date format is m/d/yyyy or mm/dd/yyyy.
Attempt 1: (\d{1,2})/(\d{1,2})/(\d{4})
Attempt 2: (0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)
Had better luck with attempt 1, and not much at all with attempt 2.
I've been having some date and time validation issues with InfoPath 2010 and regex pattern matching can be a useful approach. A basic regex for validating m/d/yyyy (without catering for the specific days in a month and allowing for '0' prefix to month or day) would be something like the following (untested):
(0?[1-9]|1[012])\/(0?[1-9]|[12][0-9]|3[01])\/\d{4}
For something more sophisticated you could have a look at this SO answer.
However, in InfoPath the format of the date displayed can be completely different to the internal format and it is this internal format that your regex needs to match. If you drop a calculated field on your form and set it to the date field you want to validate you will see something like:
2013-05-08T12:13:14
So a regular expression (again ignoring specific days per month) required to validate the date component of this is:
\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])
But this won't match against the example date because it doesn't account for the time portion following the "T". So the trick is to use an expression to perform the match against the date substring only, e.g. in my case the following works:
not(xdUtil:Match(substring-before(dfs:dataFields/my:SharePointListItem_RW/my:DateCreated, "T"), "\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])"))
I tried the following and it worked:
\d{4}-\d{1,2}-\d{1,2}
As David pointed out the internal format might be different than the one displayed because when I tried \d\d/\d\d/\d\d\d\d it didn't work even though it caters to the displayed format of the date.
I had the same problem.
I used a rule on the date field to set another hidden text field to
string(datefield).
That always came out YYYY-MM-DD which is not too hard to create a regex against. I used this one.
((19|20)\d\d)-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])
Remember that it has to be an XML Regex which has some restrictions.
Then I set another rule on the hidden field to set a Boolean IsDateValid.

SQL Server Regular Expression Workaround in T-SQL?

I have some SQLCLR code for working with Regular Expresions. But now that it is getting migrated into Azure, which does not allow SQLCLR, that's out. I need to find a way to do regex in pure T-SQL.
Master Data Services are not available because the dev edition of MSSQL we have is not R2.
All ideas appreciated, thanks.
Regular expression match samples that need handling
(culled from regexlib and other places over the past few years)
email address
^[\w-]+(\.[\w-]+)*#([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?$
dollars
^(\$)?(([1-9]\d{0,2}(\,\d{3})*)|([1-9]\d*)|(0))(\.\d{2})?$
uri
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*#)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
one numeric digit
^\d$
percentage
^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$
height notation
^\d?\d'(\d|1[01])"$
numbers between 1 1000
^([1-9]|[1-9]\d|1000)$
credit card numbers
^((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$
list of years
^([1-9]{1}[0-9]{3}[,]?)*([1-9]{1}[0-9]{3})$
days of the week
^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
time on 12 hour clock
(?<Time>^(?:0?[1-9]:[0-5]|1(?=[012])\d:[0-5])\d(?:[ap]m)?)
time on 24 hour clock
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
usa phone numbers
^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$
Unfortunately, you will not be able to move your CLR function(s) to SQL Azure. You will need to either use the normal string functions (PATINDEX, CHARINDEX, LIKE, and so on) or perform these operations outside of the database.
EDIT Adding some information for the examples added to the question.
Email address
This one is always controversial because people disagree about which version of the RFC they want to support. The original didn't support apostrophes, for example (or at least people insist that it didn't - I haven't dug it up from the archives and read it myself, admittedly), and it has to be expanded quite often for new TLDs (once for 4-letter TLDs like .info, then again for 6-letter TLDs like .museum). I've often heard quite knowledgeable people state that perfect e-mail validation is impossible, and having previously worked for an e-mail service provider, I can tell you that it was a constantly moving target. But for the simplest approaches, see the question TSQL Email Validation (without regex).
One numeric digit
Probably the easiest one of the bunch:
WHERE #s LIKE '[0-9]';
Credit card numbers
Assuming you strip out dashes and spaces, which you should do in any case. Note that this isn't an actual check of the credit card number algorithm to ensure that the number itself is actually valid, just that it conforms to the general format (AmEx = 15 digits starting with a 3, the rest are 16 digits - Visa starts with a 4, MasterCard starts with a 5, Discover starts with 6 and I think there's one that starts with a 7 (though that may just be gift cards of some kind)):
WHERE #s + ' ' LIKE '[3-7]'+ REPLICATE('[0-9]', 14) + '[0-9 ]';
If you want to be a little more precise at the cost of being long-winded, you can say:
WHERE (LEN(#s) = 15 AND #s LIKE '3' + REPLICATE('[0-9]', 14))
OR (LEN(#s) = 16 AND #s LIKE '[4-7]' + REPLICATE('[0-9]', 15));
USA phone numbers
Again, assuming you're going to strip out parentheses, dashes and spaces first. Pretty sure a US area code can't start with a 1; if there are other rules, I am not aware of them.
WHERE #s LIKE '[2-9]' + REPLICATE('[0-9]', 9);
-----
I'm not going to go further, because a lot of the other expressions you've defined can be extrapolated from the above. Hopefully this gives you a start. You should be able to Google for some of the others to see how other people have replicated the patterns with T-SQL. Some of them (like days of the week) can probably just be checked against a table - seems overkill to do an invasie pattern matching for a set of 7 possible values. Similarly with a list of 1000 numbers or years, these are things that will be much easier (and probably more efficient) to check if the numeric value is in a table rather than convert it to a string and see if it matches some pattern.
I'll state again that a lot of this will be much better if you can cleanse and validate the data before it gets into the database in the first place. You should strive to do this wherever possible, because without CLR, you just can't do powerful RegEx inside SQL Server.
Ken Henderson wrote about ways to replicate RegEx without CLR, but they require sp_OA* procedures, which are even less likely to ever see the light of day in Azure than CLR. Most of the other articles you'll find online use an approach similar to Ken's or use complex use of built-in string functions.
Which portions of RegEx specifically are you trying to replicate? Can you show an example of the input/output of one of your functions? Perhaps it will be easy to convert to get similar results using the built-in string functions like PATINDEX.