RegEx date pattern format validation not working as expected - regex

I'm trying to validate a date format input. The input is not the actual date but the D M Y input. And i don't want to validate the actual Date! just the formatting.
I want to evaluate any input done with double D, double M, double or quadruple Y with - or _ dividers.
My current RegEx looks the following:
^(?=.*[mM]{2})(?=.*[dD]{2})(?=.*[yY]{2,4})(?=.*[-_]{0,2}).*$
However this evaluates true even if more than the expected characters are found. The Limiters {2} seem to have no effect.
For example: mmddyyyymmmmmm will evaluate true even there are multiple m in there. which i don't understand.
The expected result is that only combinations such as the following can test true:
dd-mm-yy
MM-DD_YYYY
yyyy_dd-MM
mmddyy
YYYYddMM
and not something like:
ddyyyyymmmmmmmmm
mmddyymm
Please help me to correct my RegEx.

Usually, it makes sense to match a string that can only match the string containing allowed blocks and then use some programming means to do the rest of the "counting" work (you just check how many mm, dd, or yyyy / yy there are).
If you have to use a regex, there are two approaches.
Solution #1: Enumerating all alternatives
It is the least comfortable, not dynamic/unscalable solution where you just collect all possible pattern inside a single group:
^(?:
[dD]{2}[_-]?[mM]{2}[_-]?[yY]{2}(?:[yY]{2})? |
[mM]{2}[_-]?[dD]{2}[_-]?[yY]{2}(?:[yY]{2})? |
[mM]{2}[_-]?[yY]{2}(?:[yY]{2})?[_-]?[dD]{2} |
[dD]{2}[_-]?[yY]{2}(?:[yY]{2})?[_-]?[mM]{2} |
[yY]{2}(?:[yY]{2})?[_-]?[dD]{2}[_-]?[mM]{2} |
[yY]{2}(?:[yY]{2})?[_-]?[mM]{2}[_-]?[dD]{2}
)$
See the regex demo. ^ asserts the position in the start of the string, (?:...|...) non-capturing group with the alternatives and $ asserts the end of string.
Solution #2: Dynamic approach
This approach means matching a string that only consists of three D, M, or Y blocks and restricting the pattern with positive lookaheads that will require the string to only contain a single occurrence of each block. The bottleneck and the problem is that the blocks are multi-character strings, and thus you need to use a tempered greedy token (or unwrap it, making the regex even more monstrous):
^
(?=(?:(?![mM]{2}).)*[mM]{2}(?:(?![mM]{2}).)*$)
(?=(?:(?![dD]{2}).)*[dD]{2}(?:(?![dD]{2}).)*$)
(?=(?:(?![yY]{2}(?:[yY]{2})?).)*[yY]{2}(?:[yY]{2})?(?:(?![yY]{2}(?:[yY]{2})?).)*$)
(?:
(?:[mM]{2}|[dD]{2}|[yY]{2}(?:[yY]{2})?)
(?:[_-](?!$))?
){3}
$
See the regex demo
So, here, the (?:[mM]{2}|[dD]{2}|[yY]{2}(?:[yY]{2})?)(?:[_-](?!$))? parts repeats 3 times from start to end, so, the string can contain three occurrences of d, y or m, even if they are the same (mmmmmm will match, too). The lookaheads are all in the form of (?=(?:(?!BLOCK).)*BLOCK(?:(?!BLOCK).)*$) - matches only if there is any text but BLOCK, then a BLOCK and then any text but BLOCK till the end of the string.

Related

PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
data TEXT NOT NULL,
CONSTRAINT d_csv_only_ck
CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
('992,1005,1007,992,456,456,1008'), -- want to make this line unnacceptable - repeats!
('44,1005,1110'),
('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
('11,1203,6666'),
('1,11,99,2222'),
('3435'),
(' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unnacceptable
fails (as it should), and also fails (again, as it should)
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unnacceptable
The question:
However, if you notice the first string, it has repeats of 992and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.
Since PostegreSQL regex does not support backreferences, you cannot apply this restriction because you would need a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - no same two numbers as whole word allowed anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.

Why do I get empty response for regexp_matches function while using positive lookahead (?=...)

Why the following code returns just empty brackets - {''}. How to make it return matching strings?
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=..CAA)','g');
Expected output is:
regexp_matches
----------------
{GCCAA}
{AACAA}
{AACAA}
{GTCAA}
(4 rows)
but instead it returns the following:
regexp_matches
----------------
{""}
{""}
{""}
{""}
(4 rows)
I actually have a bit more complicated query, which requires positive lookahead in order to cover all occurrences of patterns in the string even if they overlap.
Well, it's not pretty, but you can do it without regular expressions or custom functions.
WITH data(d) as (
SELECT * FROM (VALUES ('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT')) v
)
SELECT substr(d, x, 5) AS match
FROM data
JOIN LATERAL (SELECT generate_series(1, length(d))) g(x) ON TRUE
WHERE substr(d, x, 5) LIKE '__CAA'
;
match
-------
GCCAA
AACAA
AACAA
GTCAA
(4 rows)
Basically, get each five letter slice of the string and see if it matches __CAA.
You could change generate_series(1, length(d)) to generate_series(1, length(d)-4) because the last ones will never match, but you would have to remember to update this if the length of your matching string changes.
Using a lookahead has the problem that the lookahead itself is not part of the match but it allows overlapping searches
Without using a lookahead, you lose the ability for overlapping searches.
Using Powershell, you can loop over the indexes returned from the lookaheads and use that as an index into your searchstring to get the matches
$string = 'ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT'
$r = [regex]::new('(?=..CAA)')
$r.Matches($string) | % {$string.Substring($_.Index, 5)}
returns
GCCAA
AACAA
AACAA
GTCAA
I don't know how to translate this to PostgreSQL (or if that's even possible)
update:
Aparently it won't capture inside of an assertion, that's ok because
what you really need is the first 2 characters, which can safely be
consumed. It will only give you the first 2 characters per row, but
since you know the last 3, you can easily join the set elements
with the CAA constant.
Try this
..(?=CAA)
and you're done.
If I knew the bizarre sql language, I could show you how to do the join.
Output should now be
match
-------
GC
AA
AA
GT
(4 rows)
This is the regex you need for overlapped matches.
(?=(..CAA))
https://regex101.com/r/eJ36zb/1
I think you just need this sql statement which captures group 1:
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=(..CAA))','g');
Formatted regex
(?=
( . . CAA ) # (1)
)
The reason you got empty strings in your result is that
you didn't give the expression anything to consume and
nothing to capture.
I.e., it matched at the right places, but nothing was consumed or captured.
So, doing it this way allows the overlap and the capture so it
should show up on the output now.
Lookahead is a zero-width assertion. It doesn't match anything. If you change your regular expression to just a regular match/capture, you'll get a result. For matching any two characters that are followed by CAA in your case, lookahead probably isn't necessary.

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I could be on a bad track either. Maybe a function could be easier, I do not know.
I am attaching one that says that I am finding the character 'z'
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string inbetween them?
Is there a more efficient way to do?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
See the regex graph:

Regex cannot prevent a match of suffix name made up using I,V,X and SR/JR

I am trying to prevent the inclusion of suffix name, for example, JR/SR, or other suffix made up of using I,V,X using regular expression way. To accomplish this I have implemented the following regex
((^((?!((\b((I+))\b)|(\b(V+)\b)|(\b(X+)\b)|\b(IV)\b|(\b(V?I){1,2}\b)|(\b(IX)\b)|(\bX[I|IX]{1,2}\b)|(\bX|X+[V|VI]{1,2}\b)|(\b(JR)\b)|(\b(SR)\b))).)*$))
Using this I am able to prevent various possible combination eg.,
'Last Name I',
'Last Name II',
'Last Name IJR',
'Last Name SRX' etc.
However, there are still couple of combinations remaining, which this regex can match. eg., 'Last Name IXV' or 'Last Name VXI'
These two I am not able to debug. Please suggest me in which part of this regex I can make changes to satisfy the requirement.
Thank you!
Try this pattern: .+\b(?:(?>[JS]R)|X|I|J|V)+$
Explanation:
.+ - match one or more of any characters
\b - word boudnary
(?:...) - non-capturing group
(?>...) - atomic group
[JS]R - match whether S or J followed by R
| - alternation: match what is on the left OR what's on the right
+ - quantifier: match one or more times preceeding pattern
$ - match end of the string
Demo
In order to solve this I have worked on the above regex a little bit more. And here is the final result that can successfully match up with the "roman numeral" upto thirty constituted I, V, and X.
"(\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b|\b(V|X)\b|\bV[I]{1,2}\b|\b((?!XVV|XVX)X([IXV]{1,2}))\b|\b[S|J]R\b)|^$"
What I have done here is:
I have taken those input into consideration which are standalone,
that is: SR or XXV I have observed the incorrect pattern and
have restricted them to match as a positive result.
Separate input has been ensured using \b the word boundary.
Word-boundary: It suggests that starting of a word, that means in
simple words it says "yes there is a word" or "no it is not."
it has done in the following way-
using negative lookahead (?!(IIX|IIV|IVV|IXX|IXI))
How I have arrived on this solution is given as follows:
I have observed closely all the pattern first, that from I to X - that is:
I
I I
I I I
I V
V
V I
V I I
V I I I (it is out of the range of 3 characters.)
I X
X
we have an I, V, and X at first position. Then there is another I, X and V
on the second position. After then again same I and V. I have
implemented this in the following regex of the above written code:
\b(?!(IIX|IIV|IVV|IXX|IXI))I[IVX]{0,3}\b
Start it with I and then look for any of I, V, or X in a range of 'zero' to 'three' characters, and do neglect invalid numbers written inside the ?!(IIX|IIV|IVV|IXX|IXI) Similarly, I have done with other combinations given below.
Then for V and X : \b(V|X)\b
Then for the VI, VII: \bV[I]{1,2}\b
Then for the XI - XXX: \b((?!XVV|XVX)X([IXV]{1,2}))\b
To validate a suffix name, i.e. JR, SR, one can use following regex: \b[S|J]R\b
and the last (^$) is for matching a blank string or in other words, when no input has provided to the given input-box or textbox.
You may post any question or suggestion, if you have.
Thanks!
Ps: This regex is simply a solution to validate "roman numbers" from 1 to 30 using I, V, and X. I hope it helps to learn a bit to each and every newbie of regex.
I solved this with a more explicit:
(.+) (?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$))|(.+)
I know I could do something like [JS]R but I like the way this reads:
(.+) match any characters and then a space
(?:(?>JR$|SR$|I$|II$|III$|IV$|MD$|DO$|PHD$)) atomically look for but don't match endings like JR etc
|(.+) if you don't find the endings then match any characters
Feel free to add the endings you'd like to suit your needs.

How to handle redundant cases in regex?

I have to parse a file data into good and bad records the data should be of format
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex expression I have written is
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)
(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it's also considering the records with redundant entries
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How to handle these kind of records and how to write a better expression if the list of diseases is very large.
Note: I am using hadoop maponly job for parsing so give answer in context with java.
What you might do is capture the last part with al the diseases in one group (named capturing group disease) and then use split to get the individual ones and then make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
String[] parts = matcher.group("disease").split("\\|");
Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
System.out.println(uniqueDiseases);
}
Result:
[HIV, Cancer]
Regex demo | Java demo
You need the negative lookahead.
Try using this regex: ^\d*::[^(]+?\s*\(\d{4}\)::(?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1)((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$.
Explanation:
The initial string ^\d*::[^(]+?\s*\(\d{4}\):: is just an optimized one to match Alex.jr example (your version did not respect any non-alphabetic symbols in names)
The negative lookahead block (?!.*(HIV|Cancer|flu|Arthritis|OCD).*\|\1) stands for "look forth for any disease name, encountered twice, and reject the string, if found any. Its distinctive feature is the (?! ... ) signature.
Finally, ((HIV|Cancer|flu|Arthritis|OCD)(\||$))+$ is also an optimized version of your block (HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*, oriented to avoid redundant listing.
Probably the easier to maintain method is that you use a bit changed regex,
like below:
^\d*::[a-zA-Z.]+\s\(\d{4}\)::((?:HIV|Cancer|flu|Arthritis|OCD|\|(?!\|))+)$
It contains:
^ and $ anchors (you want that the entire string is matched,
not its part).
A capturing group, including a repeated non-capturing group (a container
for alternatives). One of these alternatives is |, but with a negative
lookahead for immediately following | (this way you disallow 2 or
more consecutive |).
Then, if this regex matched for a particular row, you should:
Split group No 1 by |.
Check resulting string array for uniqueness (it should not contain
repeating entries).
Only if this check succeeds, you should accept the row in question.