PostgreSQL: .csv regex - test for repeating substrings within a string (digits) - regex

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
  data TEXT NOT NULL,
  CONSTRAINT d_csv_only_ck
    CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
('992,1005,1007,992,456,456,1008'), -- want to make this line unacceptable - repeats!
('44,1005,1110'),
('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
('11,1203,6666'),
('1,11,99,2222'),
('3435'),
(' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unacceptable
fails (as it should), and also fails (again, as it should)
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unacceptable
The question:
However, if you look at the first string, it has repeats of 992 and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.

PostgreSQL regular expressions do support backreferences, but not inside lookahead or lookbehind constraints, so you cannot apply this restriction in the same pattern: you would need a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - a negative lookahead that fails the match if the same number occurs twice as a whole word anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the limitation mentioned at the top of the answer.
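For illustration only, here is how the PCRE-style pattern above behaves in an engine that accepts a backreference inside a lookahead. This is a minimal sketch using Python's re module, not PostgreSQL; the helper names is_valid_csv and has_duplicates are assumptions made for this example and are not part of the original answer.

```python
import re

# The pattern from the answer, unchanged. Python's re module accepts a
# backreference (\1) inside a negative lookahead.
PATTERN = re.compile(r'^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$')


def is_valid_csv(data: str) -> bool:
    """Hypothetical helper: True if data is a comma-separated list of
    unique integers without leading zeros or inner spaces."""
    return PATTERN.match(data) is not None


def has_duplicates(data: str) -> bool:
    """Regex-free cross-check: split on commas and compare counts."""
    numbers = [part.strip() for part in data.split(',')]
    return len(numbers) != len(set(numbers))


if __name__ == '__main__':
    rows = [
        '992,1005,1007,992,456,456,1008',  # repeats
        '44,1005,1110',
        '13, 44 , 1005, 10078 ',
        '23432, 3433 ,00343, 567',         # leading zero
        '12 34',                           # space inside a number
    ]
    for row in rows:
        print(repr(row), is_valid_csv(row))
```

If the validation can happen in application code before the INSERT, the split-and-compare check may be easier to maintain than a single large pattern.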

Related

Converting a normal regular RegEx to one in SAS

Despite reading the SAS documentation and various example pages, I am struggling to convert a slightly more complicated regex to SAS syntax. I am using the prxchange function. This is what I have come up with so far to convert a filename string like pre_31DEC2019_299792458.xls to an integer number (of length 8) 299792458 inside a SAS data step:
tmp=prxchange('s/pre_([a-zA-Z0-9]{8,9})_([0-9]{1,16})\.xls/\2/g',-1,have);
want=input(tmp,8.);
The error message I have points to somewhere else in the code, but I am rather certain that it is those two lines which cause a problem since leaving out the two quoted lines makes the SAS error message vanish.
References
An unofficial SAS how-to on regex suggests that I could use standard regexes.
Why use regex at all?
want = input(scan(have,-2,'._'),32.);
You can use
tmp = prxchange('s/^pre_[A-Za-z0-9]+_([0-9]+)\.xls$/$1/', -1, have);
See the regex demo
Details
s/ - substitution action (we are replacing the match)
^ - start of string
pre_ - a literal prefix
[A-Za-z0-9]+ - one or more alphanumeric ASCII chars (note you may simply use .* here instead if there can be anything)
_ - an underscore
([0-9]+) - Group 1: one or more digits
\.xls$ - .xls at the end of string
$1 - the replacement: the whole match (here, the whole string) is replaced with the contents of Group 1.
As far as the prxchange function is concerned, note that it replaces all occurrences of the pattern once you pass -1 as the times argument, thus, no g flag is necessary.
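To try the anchored pattern outside SAS, the same regex can be exercised in Python. This is only a sketch: the function names are made up for this example, and the split-based variant is a rough analogue of the scan approach above, not an exact port (scan collapses consecutive delimiters, str.split does not).

```python
import re

# Same pattern as in the prxchange call above, written for Python's re module.
FILENAME_RE = re.compile(r'^pre_[A-Za-z0-9]+_([0-9]+)\.xls$')


def extract_number(filename: str) -> int:
    """Return the trailing number of a name like pre_31DEC2019_299792458.xls."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f'unexpected filename: {filename!r}')
    return int(match.group(1))


def extract_number_by_split(filename: str) -> int:
    """Regex-free variant: treat '.' and '_' as delimiters and take the
    second-to-last token."""
    tokens = filename.replace('.', '_').split('_')
    return int(tokens[-2])


if __name__ == '__main__':
    name = 'pre_31DEC2019_299792458.xls'
    print(extract_number(name), extract_number_by_split(name))
```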
Many ways you could try:
data _null_;
a="pre_31DEC2019_299792458.xls";
b=input(prxchange('s/.*\_(.*)\..*/$1/',-1,a),12.);
c=input(prxchange('s/.*(\d{9}).*/$1/',-1,a),12.);
d=input(prxchange('s/.*(?<=\_)(\d+).*/$1/',-1,a),12.);
put _all_;
run;
.* matches any character zero or more times. For b, the number you need lies between the last _ and the .; for c, it is a run of 9 digits; for d, a lookbehind for _ locates the digits.

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to an empty string; substitutions such as $1 are possible too.
I am using regex in 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - a capturing group matching 4 to 7 digits; in the first pattern the trailing ? makes it optional (1 or 0 occurrences), in the second pattern it is obligatory
| - or (second pattern only)
.? - any one char other than line break chars; in the first pattern the trailing ? makes it optional (1 or 0 times).
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex engine is Java-based, since all non-matching groups are replaced with null.
So, the only possible solution is to use a second pass to post-process the results: just replace null with some kind of delimiter, a newline for example.
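The alternation variant can be tried in Python as a quick check of the idea. This is a sketch under the assumption that a general-purpose engine is available; it says nothing about what the Kustom apps actually use, and the function name is invented for this example.

```python
import re

# Either keep a run of 4 to 7 digits (group 1) or consume one other character.
KEEP_DIGITS_RE = re.compile(r'(\d{4,7})|.')


def keep_digit_runs(text: str) -> str:
    """Drop everything except the digit runs captured by group 1."""
    # A callable replacement sidesteps the question of how an engine renders
    # an unmatched group (empty string, "null", ...).
    return KEEP_DIGITS_RE.sub(lambda m: m.group(1) or '', text)


if __name__ == '__main__':
    for sample in ['G-A15239L', 'now200316stillcovid19asdf']:
        print(sample, '->', keep_digit_runs(sample))
```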
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I could be on the wrong track. Maybe a function would be easier, I do not know.
I am attaching one that says that I am finding the character 'z'
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string in between them?
Is there a more efficient way to do this?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
    print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
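A self-contained version of the re.search call, run against a shortened copy of the first sample string, might look like the sketch below. The helper names and the str.split alternative are assumptions added for this example; the split variant does not require the text to start with a digit.

```python
import re

# Shortened copy of the first sample string from the question.
SAMPLE = ('1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\n'
          'Általános kitöltési előírások\nI.\n')


def first_block(text: str):
    """Return the text from the leading digit up to the first blank line,
    or None if the text does not start with a digit."""
    m = re.search(r'^(\d.*?)(?:\n\n|$)', text, re.S)
    return m.group(1) if m else None


def first_block_by_split(text: str) -> str:
    """Regex-free variant: everything before the first '\\n\\n'."""
    return text.split('\n\n', 1)[0]


if __name__ == '__main__':
    print(first_block(SAMPLE))
    print(first_block_by_split(SAMPLE))
```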

Regex with 2 semi colons in notepad++

I have data like this
Giftsbirth;;Basket7;CC
Giftswedding;;Cake4;COD
I am trying to find a regex that will only select the second data (Basket7, Cake4).
From past help I tried something like
^(\w+ [^\v;;]+;;[^\v;]+)?.*
But I know that is not right
Please assist with the regex if you can
You could use a positive lookbehind (?<= to assert that what precedes is ;; and a positive lookahead (?= to assert that what follows is ;
Use a negated character class [^;]+ (one or more characters other than ;) to match your values.
(?<=;;)[^;]+(?=;)
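The lookaround pattern can be checked against the two sample lines with a few lines of Python. This is a sketch in Python's re module rather than Notepad++, and the function name is hypothetical.

```python
import re

DATA = 'Giftsbirth;;Basket7;CC\nGiftswedding;;Cake4;COD'


def values_after_double_semicolon(text: str) -> list:
    """Return every run of non-semicolon characters that is preceded by ;;
    and followed by ;"""
    return re.findall(r'(?<=;;)[^;]+(?=;)', text)


if __name__ == '__main__':
    print(values_after_double_semicolon(DATA))
```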
You may use
(?:.*;)?([^;\n\r]+);[^;\n\r]+$
Or,
.*?;;([^;\r\n]+)(?:;.*)?
and replace with $1.
Details
(?:.*;)? - an optional substring having 0+ chars other than line break chars, as many as possible, up to the ;
([^;\n\r]+) - Group 1: any one or more chars other than CR, LF and ;
; - a semi-colon
[^;\n\r]+ - any one or more chars other than CR, LF and ;
$ - end of line.
The second regex matches
.*?;; - any 0+ chars as few as possible up to (and including) the first ;;
([^;\r\n]+) - Group 1: any one or more chars other than CR, LF and ;
(?:;.*)? - an optional group matching 1 or 0 occurrences of a ; and then any 0+ chars up to the end of line
The $1 in the replacement is the value you need to keep.
You need to specify more precisely what "the second data (Basket7, Cake4)" means. This looks like CSV data with the ; set as separator, but that would place Basket7 and Cake4 in the third column, since the second column is empty. In order to write a regex that solves this problem in the general case, you need to take into account the full domain of possible lines, and you've only given two examples and let everyone guess what the underlying format and total possible variations might be.
For example, is it always reasonable to assume that that which you're looking for is always preceded by ;; and ends with a ;, and that ;; never occurs in other places than immediately before that which you're looking for? In that case, (?<=;;)([^;]*) captures this. But what if you encounter one of the following lines?
Giftsbirth;;;CC # Here, the thing matched is empty
Giftsbirth;1600;Basket7;CC # Here, the second column isn't empty
;;Basket7;CC # Here, the first column is empty
;;;CC # Here, all but the last column are empty
;;; # Here, all columns are empty
You may experience that various suggestions will give you "the right text", but if you test this on a limited subset that does not account for all variations that can reasonably be expected in the input, you will inevitably have to revise your regex.
Assuming this is a CSV where the fields don't contain literal ;s, and that you don't know anything about the length of any of the fields (and consequently that the second column isn't always empty), but that there are at least three columns, you could consider the regex:
^[^;]*;[^;]*;([^;]*)
(See demo at https://regex101.com/r/vhPNEj/1)
These assumptions may not be correct, but my ability to guess is much worse than yours, since you're sitting with a larger sample of the data. In order to succeed at automating your tasks, it is critical that you learn to modify code to fit your assumptions.
For example, you may want to disregard the cases where the third column is empty:
^[^;]*;[^;]*;([^;]+)
Here the difference is [^;]* changed into [^;]+.
Or you may want to take into account that the first column could contain semicolons when they are wrapped in double quotes, e.g. like "Giftsbirth; Holiday";;Basket7;CC:
^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)
Here the difference is [^;]* changed into (?:[^;"]*|"[^"]*") being either [^;"]* (being all but ; and ") or "[^"]*" (being " followed by anything but ", which includes ;, followed by ").
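To see how the two assumptions differ in practice, the plain and quote-aware patterns can be compared on the edge cases listed earlier. This is a sketch in Python; the constant and function names are invented for this example.

```python
import re

# Third column of a ;-separated line, fields assumed to contain no semicolons.
THIRD_COLUMN_RE = re.compile(r'^[^;]*;[^;]*;([^;]*)')
# Variant that also accepts a double-quoted first column containing semicolons.
QUOTED_FIRST_COLUMN_RE = re.compile(r'^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)')


def third_column(line: str, pattern=THIRD_COLUMN_RE):
    """Return the third column, or None if the line has fewer than three."""
    m = pattern.match(line)
    return m.group(1) if m else None


if __name__ == '__main__':
    lines = [
        'Giftsbirth;;Basket7;CC',
        'Giftsbirth;1600;Basket7;CC',
        ';;;CC',
        '"Giftsbirth; Holiday";;Basket7;CC',
    ]
    for line in lines:
        print(line, '|', third_column(line), '|',
              third_column(line, QUOTED_FIRST_COLUMN_RE))
```

On the quoted line the plain pattern splits inside the quotes and yields an empty third column, which is the kind of silent mismatch the answer warns about.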

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the first capturing group (index 1) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an upper-case letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
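Both patterns can be checked against the sample data outside Hive. The sketch below uses Python's re module as a stand-in, so it only illustrates the patterns and not Hive's behaviour; the function names are hypothetical. Note that a Python raw string needs a single backslash (\d), whereas the Hive string literal needs \\d as the manual excerpt explains.

```python
import re

SAMPLES = [
    'Sept Wk 5 Sunny Sailing_R7080075_12345',
    'Holiday_Wk2_Smiles_X1234567_ABC',
]


def extract_id_group(text: str):
    """Capture-group variant: match the underscore, return group 1 only."""
    m = re.search(r'_([A-Z]\d{7})', text)
    return m.group(1) if m else None


def extract_id_lookbehind(text: str):
    """Lookbehind variant: the underscore is asserted but not part of the match."""
    m = re.search(r'(?<=_)[A-Z]\d{7}', text)
    return m.group(0) if m else None


if __name__ == '__main__':
    for sample in SAMPLES:
        print(extract_id_group(sample), extract_id_lookbehind(sample))
```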